Hello everyone, I'm Lao Zhang of "Ai 學(xué)習(xí)的老章".
Today I'd like to recommend a project from the LLM ecosystem.
1. Project Overview
Crawl4AI is an open-source web crawler and data-scraping tool designed specifically for large language models (LLMs) and AI applications. Beyond collecting web data efficiently, it outputs clean, structured Markdown directly, which makes it a great fit for RAG (retrieval-augmented generation), AI fine-tuning, knowledge-base construction, and similar scenarios.
2. Key Highlights
LLM-optimized: outputs smart, concise Markdown that is easy for downstream AI processing.
Fast and efficient: real-time crawling with up to 6x speedups, balancing performance and cost.
Flexible browser control: session management, proxies, and custom hooks make anti-bot measures and complex pages easy to handle (see the sketch after this list).
Heuristic extraction: built-in algorithms reduce reliance on large models and improve extraction efficiency.
Open source and easy to deploy: no API key required, with Docker and cloud deployment support.
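To make the browser-control point concrete, here is a minimal sketch using the same BrowserConfig/CrawlerRunConfig objects as the examples later in this post. The proxy URL and session id are placeholder values of my own, and exact parameter support may vary by crawl4ai version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Browser-level options: headless mode plus an (illustrative) proxy
    browser_config = BrowserConfig(
        headless=True,
        proxy="http://user:pass@myproxy:8080",  # placeholder proxy URL
    )
    # Run-level options: a session id lets several arun() calls reuse one page
    run_config = CrawlerRunConfig(session_id="demo-session")  # placeholder id
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```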
3. Installation
```bash
pip install crawl4ai
crawl4ai-setup  # one-step browser environment setup
```
If you hit browser-related issues, you can install Playwright manually:
```bash
python -m playwright install --with-deps chromium
```
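The project also ships a `crawl4ai-doctor` command (per its README) that runs a quick diagnostic of the browser environment, which is worth trying before debugging further.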
Python Quick Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
Command-Line Usage
```bash
# Basic crawl with Markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with a BFS strategy, up to 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Ask an LLM to extract content for a specific question
crwl https://www.example.com/products -q "Extract all product prices"
```
4. Typical Use Cases
Building AI knowledge bases, FAQs, and enterprise intranet search (see the RAG sketch after this list)
Automated collection of news, forum, and product data
Custom extraction strategies that adapt to all kinds of structured and semi-structured data
Pairing with an LLM for intelligent Q&A and information extraction
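As a toy illustration of the knowledge-base/RAG use case, the sketch below crawls one page and packs the returned Markdown into fixed-size chunks ready for an embedding model. The chunking scheme and the `chunk_markdown` helper are my own illustration, not part of crawl4ai:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Pack paragraphs (split on blank lines) into chunks of roughly max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.crawl4ai.com")
        chunks = chunk_markdown(str(result.markdown))
        print(f"{len(chunks)} chunks ready for an embedding model")

if __name__ == "__main__":
    asyncio.run(main())
```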
5. Advanced Usage Examples
Custom content filtering and Markdown generation
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        # Prune low-value page regions before generating Markdown
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.48, threshold_type="fixed", min_word_threshold=0
            )
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config,
        )
        print(result.markdown.raw_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
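Note that when a content filter is configured, recent crawl4ai versions also expose the filtered text as result.markdown.fit_markdown alongside raw_markdown; if your version behaves differently, check the fields of the returned Markdown result object.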
Custom schema-based structured extraction
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # CSS-selector schema: one entry per field to pull out of each matched block
    schema = {
        "name": "Course Info",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "section_title", "selector": "h3.heading-50", "type": "text"},
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {"name": "course_icon", "selector": ".image-92", "type": "attribute", "attribute": "src"},
        ],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    browser_config = BrowserConfig(headless=False, verbose=True)
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config,
        )
        courses = json.loads(result.extracted_content)
        print(json.dumps(courses, indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```
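The appeal of this approach ties back to the "heuristic extraction" highlight above: once the CSS schema is written, extraction is deterministic and needs no LLM call, so repeated crawls of a stable page layout are fast and cheap. The trade-off is that the schema breaks if the site's markup changes and must then be updated by hand.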
This post took real effort to put together. If you found it useful, please consider following me, and give it the triple combo: a like, a share, and a "Looking". Thanks for reading; see you in the next one!