Hello everyone, I'm Lao Zhang of "Ai 學(xué)習(xí)的老章".
Today I'd like to recommend a project from the LLM ecosystem.
1. Project Overview
Crawl4AI is an open-source web crawler and data-scraping tool designed specifically for large language models (LLMs) and AI applications. Beyond collecting web data efficiently, it outputs clean, structured Markdown directly, which makes it a great fit for RAG (retrieval-augmented generation), AI fine-tuning, knowledge-base construction, and similar scenarios.
2. Key Highlights
LLM-optimized: outputs smart, concise Markdown that is easy for downstream AI processing.
Fast and efficient: real-time crawling with up to 6x speedups, balancing performance and cost.
Flexible browser control: session management, proxies, and custom hooks make anti-bot measures and complex pages easy to handle (see the sketch after this list).
Heuristic extraction: built-in algorithms reduce reliance on large models and improve extraction efficiency.
Open source and easy to deploy: no API key required, with Docker and cloud deployment support.
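To make the browser-control point concrete, here is a minimal sketch using the same BrowserConfig/CrawlerRunConfig objects as the examples later in this post. The proxy URL and session id are placeholder values of my own, and exact parameter support may vary by crawl4ai version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Browser-level options: headless mode plus an (illustrative) proxy
    browser_config = BrowserConfig(
        headless=True,
        proxy="http://user:pass@myproxy:8080",  # placeholder proxy URL
    )
    # Run-level options: a session id lets several arun() calls reuse one page
    run_config = CrawlerRunConfig(session_id="demo-session")  # placeholder id
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```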
3. Installation
```bash
pip install crawl4ai
crawl4ai-setup  # one-step browser environment setup
```
If you hit browser-related issues, you can install Playwright manually:
```bash
python -m playwright install --with-deps chromium
```
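The project also ships a `crawl4ai-doctor` command (per its README) that runs a quick diagnostic of the browser environment, which is worth trying before debugging further.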
Python Quick Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
Command-Line Usage
```bash
# Basic crawl with Markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with a BFS strategy, up to 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Ask an LLM to extract content for a specific question
crwl https://www.example.com/products -q "Extract all product prices"
```
4. Typical Use Cases
Building AI knowledge bases, FAQs, and enterprise intranet search (see the RAG sketch after this list)
Automated collection of news, forum, and product data
Custom extraction strategies that adapt to all kinds of structured and semi-structured data
Pairing with an LLM for intelligent Q&A and information extraction
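As a toy illustration of the knowledge-base/RAG use case, the sketch below crawls one page and packs the returned Markdown into fixed-size chunks ready for an embedding model. The chunking scheme and the `chunk_markdown` helper are my own illustration, not part of crawl4ai:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Pack paragraphs (split on blank lines) into chunks of roughly max_chars."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://docs.crawl4ai.com")
        chunks = chunk_markdown(str(result.markdown))
        print(f"{len(chunks)} chunks ready for an embedding model")

if __name__ == "__main__":
    asyncio.run(main())
```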
5. Advanced Usage Examples
Custom content filtering and Markdown generation
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        # Prune low-value page regions before generating Markdown
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.48, threshold_type="fixed", min_word_threshold=0
            )
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config,
        )
        print(result.markdown.raw_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
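Note that when a content filter is configured, recent crawl4ai versions also expose the filtered text as result.markdown.fit_markdown alongside raw_markdown; if your version behaves differently, check the fields of the returned Markdown result object.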
Custom schema-based structured extraction
```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # CSS-selector schema: one entry per field to pull out of each matched block
    schema = {
        "name": "Course Info",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "section_title", "selector": "h3.heading-50", "type": "text"},
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {"name": "course_icon", "selector": ".image-92", "type": "attribute", "attribute": "src"},
        ],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    browser_config = BrowserConfig(headless=False, verbose=True)
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config,
        )
        courses = json.loads(result.extracted_content)
        print(json.dumps(courses, indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```
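The appeal of this approach ties back to the "heuristic extraction" highlight above: once the CSS schema is written, extraction is deterministic and needs no LLM call, so repeated crawls of a stable page layout are fast and cheap. The trade-off is that the schema breaks if the site's markup changes and must then be updated by hand.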
This post took real effort to put together. If you found it useful, please consider following me, and give it the triple combo: a like, a share, and a "Looking". Thanks for reading; see you in the next one!