Simplifying Web Scraping with ScrapeGraphAI - Harnessing the Power of Large Language Models

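Web scraping, the process of extracting data from websites, has become an essential tool for researchers, data scientists, and businesses. However, the complexity of web page structures and the need for custom scraping logic often pose significant challenges. ScrapeGraphAI is a Python library that aims to change this by combining large language models (LLMs) with direct graph logic. With ScrapeGraphAI, users can create scraping pipelines for websites, documents, and XML files simply by describing the information they want to extract.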

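ScrapeGraphAI has many applications in research, especially when data must be gathered from multiple online sources. Here are a few scenarios where ScrapeGraphAI can play an important role: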

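Literature review: Researchers can use ScrapeGraphAI to extract relevant information from academic papers, journals, and conference proceedings, streamlining the literature review process.
Market analysis: By scraping data from e-commerce sites, social media platforms, and news articles, researchers can gather valuable insights for market analysis and consumer behavior studies.
Sentiment analysis: ScrapeGraphAI can be used to scrape user reviews, comments, and opinions from a variety of websites, enabling researchers to run sentiment analysis on large datasets.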

How to use ScrapeGraphAI: Getting started with ScrapeGraphAI is straightforward and requires minimal setup.

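Install ScrapeGraphAI using pip: pip install scrapegraphai
Install Playwright for JavaScript-based scraping: playwright install
Set your OpenAI API key (if you are using OpenAI models).
Choose one of the three main scraping pipelines provided by ScrapeGraphAI:
A. SmartScraperGraph: a single-page scraper that works from a prompt and an input source (see Case 1).
B. SearchGraph: a scraper that works from a prompt alone and gathers its sources through search (see Case 2).
C. SpeechGraph: a single-page scraper that extracts information from a website and generates an audio file.
Configure the scraping pipeline by specifying the LLM, the embedding model, and other relevant settings.
Run the scraping pipeline with the desired prompt and source, then retrieve the extracted information.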
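
The API key and configuration steps can be sketched as follows. This is a minimal sketch, not part of ScrapeGraphAI: build_graph_config is a hypothetical helper, the environment variable name OPENAI_API_KEY is an assumption, and the dictionary shape simply mirrors the "llm" section used in Case 3. Reading the key from the environment keeps it out of the source file.

```python
import os


def build_graph_config(model: str = "gpt-3.5-turbo",
                       env_var: str = "OPENAI_API_KEY") -> dict:
    """Hypothetical helper: build an "llm" config from an environment variable.

    The env_var name is an assumption; use whatever variable holds your key.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"llm": {"api_key": api_key, "model": model}}
```

The returned dictionary can then be passed as the config argument of a graph, in place of a hard-coded key.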

Case 1: SmartScraper with a local model. Make sure Ollama is installed and that the models have been downloaded with the ollama pull command.

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
The output will be a list of projects with their descriptions like the following:

{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
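
Because the result comes from an LLM, its structure is not guaranteed, so it may be worth checking it before use. The sketch below assumes the result has the shape of the sample output above (a "projects" list of dictionaries with "title" and "description"); extract_projects is a hypothetical helper, not a ScrapeGraphAI function.

```python
def extract_projects(result) -> list:
    """Hypothetical helper: keep only well-formed project entries."""
    if not isinstance(result, dict):
        return []
    projects = result.get("projects", [])
    if not isinstance(projects, list):
        return []
    clean = []
    for item in projects:
        if not isinstance(item, dict):
            continue  # skip anything that is not a {title, description} record
        title = item.get("title")
        description = item.get("description")
        if isinstance(title, str) and title.strip() and isinstance(description, str):
            clean.append({"title": title.strip(),
                          "description": description.strip()})
    return clean
```

Calling extract_projects(result) on the output of smart_scraper_graph.run() would then yield a list that is safe to iterate over, even when the model returns something unexpected.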

Case 2: SearchGraph with mixed models. We use Groq for the LLM and Ollama for the embeddings.



from scrapegraphai.graphs import SearchGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
    },
    "max_results": 5,
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

# Run the graph
result = search_graph.run()
print(result)
The output will be a list of recipes like the following:

{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
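
For downstream analysis it is often convenient to store such a result in tabular form. The following is a minimal sketch using only the Python standard library, assuming the result has the shape of the sample output above (a "recipes" list of dictionaries with a "name" key); recipes_to_csv is a hypothetical helper, not part of ScrapeGraphAI.

```python
import csv
import io


def recipes_to_csv(result: dict) -> str:
    """Hypothetical helper: turn the 'recipes' list into CSV text, dropping duplicates."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name"])  # header row
    seen = set()
    for recipe in result.get("recipes", []):
        name = str(recipe.get("name", "")).strip()
        if name and name not in seen:
            seen.add(name)
            writer.writerow([name])
    return buffer.getvalue()
```

The returned string can be written to a file or loaded into a spreadsheet or data-analysis tool.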

Case 3: SpeechGraph with OpenAI. You only need to pass the OpenAI API key and the model name.



from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "audio_summary.mp3",
}

# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)

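Concerns and considerations: Although ScrapeGraphAI offers a powerful and intuitive approach to web scraping, there are a few points to keep in mind:

Respect website terms of service and robots.txt: Make sure your scraping activities comply with each website's terms of service and do not violate any legal or ethical guidelines.
API usage and costs: When using third-party APIs such as OpenAI or Groq, be aware of the associated costs and usage limits.
Data quality and reliability: The accuracy of the extracted information depends on the quality of the LLM and the clarity of the user's prompt. Scraped data must be validated before it is used in critical applications.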
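
The robots.txt point can be checked programmatically before a scrape is started. Below is a minimal sketch using Python's standard urllib.robotparser module; is_allowed is a hypothetical helper, and the rules and example.com URLs in the usage note are made up for illustration. Passing a robots.txt check does not by itself establish compliance with a site's terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Hypothetical helper: check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with the rules "User-agent: *" and "Disallow: /private/", a URL such as https://example.com/private/data would be rejected while https://example.com/projects would be accepted. To check a live site, RobotFileParser can also fetch the file itself through its set_url() and read() methods.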

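ScrapeGraphAI represents a significant step forward for web scraping, making it easy for users to extract information from websites. By combining large language models with direct graph logic, ScrapeGraphAI simplifies the scraping process and opens up new possibilities for researchers and data enthusiasts. Whether you are conducting a literature review, analyzing market trends, or performing sentiment analysis, ScrapeGraphAI offers an efficient way to meet your data extraction needs.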
