Lille My

Simplify Web Scraping with ScrapeGraphAI - Harnessing the Power of Large Language Models

Web scraping, the process of extracting data from websites, has become an essential tool for researchers, data scientists, and businesses alike. However, complex page structures and the need for custom scraping logic often pose significant challenges. ScrapeGraphAI, a Python library, aims to simplify web scraping by combining large language models (LLMs) with direct graph logic. With ScrapeGraphAI, users can create scraping pipelines for websites, documents, and XML files by describing the information they want to extract.



ScrapeGraphAI finds numerous applications in research, particularly when it comes to gathering data from multiple online sources. Here are a few scenarios where ScrapeGraphAI can be invaluable:

  1. Literature Review: Researchers can use ScrapeGraphAI to extract relevant information from academic papers, journals, and conference proceedings, streamlining the literature review process.

  2. Market Analysis: By scraping data from e-commerce websites, social media platforms, and news articles, researchers can gather valuable insights for market analysis and consumer behavior studies.

  3. Sentiment Analysis: ScrapeGraphAI can be employed to scrape user reviews, comments, and opinions from various websites, enabling researchers to perform sentiment analysis on large datasets.


How to Use ScrapeGraphAI: Using ScrapeGraphAI is straightforward and requires minimal setup. Follow these steps to get started:

  1. Install ScrapeGraphAI using pip: pip install scrapegraphai

  2. Install the Playwright browsers for JavaScript-based scraping: playwright install

  3. Set up your OpenAI API key (if using OpenAI models).

  4. Choose one of the three main scraping pipelines provided by ScrapeGraphAI:
     A. SmartScraperGraph: Single-page scraper that requires a user prompt and an input source.
     B. SearchGraph: Multi-page scraper that extracts information from the top search results of a search engine.
     C. SpeechGraph: Single-page scraper that extracts information from a website and generates an audio file.

  5. Configure the scraping pipeline by specifying the LLM, embeddings model, and other relevant settings.

  6. Run the scraping pipeline with your desired prompt and source, and retrieve the extracted information.
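
The API key setup and configuration steps above can be sketched as a small helper. This is a minimal sketch, assuming the key is stored in an environment variable named OPENAI_API_KEY (a common convention, not something the library is known to require). The function name build_openai_config is hypothetical, and the returned dictionary simply follows the shape of the graph_config examples in this post.

```python
import os


def build_openai_config(model: str = "gpt-3.5-turbo", env_var: str = "OPENAI_API_KEY") -> dict:
    """Build a graph_config dict, reading the API key from the environment.

    Hypothetical helper: the function name and the env var name are assumptions.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of sending an empty key to the API.
        raise RuntimeError(f"Set the {env_var} environment variable before running the scraper.")
    return {
        "llm": {
            "api_key": api_key,
            "model": model,
        },
        "verbose": True,
    }
```

Keeping the key out of the source file means the script can be shared or committed without exposing credentials. The resulting dictionary could then be passed as the config argument of one of the graphs.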


Case 1: SmartScraper using Local Models

Remember to have Ollama installed and to download the models with the ollama pull command (for this example, ollama pull mistral and ollama pull nomic-embed-text).


from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "ollama/mistral",
        "temperature": 0,
        "format": "json",  # Ollama needs the format to be specified explicitly
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "verbose": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the projects with their descriptions",
    # also accepts a string with the already downloaded HTML code
    source="https://perinim.github.io/projects",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

The output will be a list of projects with their descriptions, like the following:

{'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}
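
Because the result is an ordinary Python dictionary, it can be checked before it is used downstream. The following is a minimal sketch, assuming the output has the 'projects', 'title', and 'description' shape shown above. LLM output can vary between runs, so those key names are assumptions, and valid_projects is a hypothetical helper.

```python
def valid_projects(result: dict) -> list:
    """Return only the project entries that have a non-empty title and description."""
    projects = result.get("projects", [])
    if not isinstance(projects, list):
        return []
    cleaned = []
    for item in projects:
        if not isinstance(item, dict):
            continue
        title = str(item.get("title") or "").strip()
        description = str(item.get("description") or "").strip()
        if title and description:
            cleaned.append({"title": title, "description": description})
    return cleaned


# First entry copied from the example output above; the second entry is
# deliberately malformed (empty description) for illustration.
sample = {
    "projects": [
        {
            "title": "Rotary Pendulum RL",
            "description": "Open Source project aimed at controlling a real life rotary pendulum using RL algorithms",
        },
        {"title": "DQN Implementation from scratch", "description": ""},
    ]
}

for project in valid_projects(sample):
    print(f"{project['title']}: {project['description']}")
```

Only the first sample entry is printed, since the second one has no description.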

 

Case 2: SearchGraph using Mixed Models

We use Groq for the LLM and Ollama for the embeddings.



from scrapegraphai.graphs import SearchGraph

# Define the configuration for the graph
graph_config = {
    "llm": {
        "model": "groq/gemma-7b-it",
        "api_key": "GROQ_API_KEY",  # placeholder: replace with your Groq API key
        "temperature": 0
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",
        "base_url": "http://localhost:11434",  # set Ollama URL
    },
    "max_results": 5,
}

# Create the SearchGraph instance
search_graph = SearchGraph(
    prompt="List me all the traditional recipes from Chioggia",
    config=graph_config
)

# Run the graph
result = search_graph.run()
print(result)

The output will be a list of recipes like the following:

{'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}
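
The recipe names contain accented characters such as "ò", so some care is needed when saving the result to disk. Below is a minimal sketch using only the standard library, assuming the result has the 'recipes' and 'name' shape shown above. The file name recipes.json is arbitrary, and save_result is a hypothetical helper.

```python
import json
from pathlib import Path


def save_result(result: dict, path: str) -> Path:
    """Write the scraped result as UTF-8 JSON, keeping accented characters readable."""
    target = Path(path)
    target.write_text(
        json.dumps(result, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return target


# Shortened sample taken from the example output above.
sample = {"recipes": [{"name": "Sarde in Saòre"}, {"name": "Bigoli in salsa"}]}

saved = save_result(sample, "recipes.json")

# Read the file back to confirm the names survived the round trip.
loaded = json.loads(saved.read_text(encoding="utf-8"))
names = [recipe["name"] for recipe in loaded["recipes"]]
print(names)
```

Passing ensure_ascii=False keeps "Saòre" as written instead of an escaped sequence, which makes the saved file easier to inspect by hand.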


 

Case 3: SpeechGraph using OpenAI

You just need to pass the OpenAI API key and the model name.



from scrapegraphai.graphs import SpeechGraph

graph_config = {
    "llm": {
        "api_key": "OPENAI_API_KEY",  # placeholder: replace with your OpenAI API key
        "model": "gpt-3.5-turbo",
    },
    "tts_model": {
        "api_key": "OPENAI_API_KEY",  # placeholder: replace with your OpenAI API key
        "model": "tts-1",
        "voice": "alloy"
    },
    "output_path": "audio_summary.mp3",
}

# ************************************************
# Create the SpeechGraph instance and run it
# ************************************************

speech_graph = SpeechGraph(
    prompt="Make a detailed audio summary of the projects.",
    source="https://perinim.github.io/projects/",
    config=graph_config,
)

result = speech_graph.run()
print(result)
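
After the run, it can be worth confirming that the audio file was actually produced. This is a minimal sketch, assuming the pipeline writes the audio to the output_path given in the configuration above. The helper audio_was_written is hypothetical and only inspects the file system.

```python
from pathlib import Path


def audio_was_written(path: str) -> bool:
    """Return True if a non-empty file exists at the given path."""
    target = Path(path)
    return target.is_file() and target.stat().st_size > 0


# "audio_summary.mp3" matches output_path in the configuration above.
if audio_was_written("audio_summary.mp3"):
    print("Audio summary is ready.")
else:
    print("No audio file found; check the API key and the pipeline output.")
```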

Concerns and Considerations: While ScrapeGraphAI offers a powerful and intuitive approach to web scraping, there are a few concerns to keep in mind:

  1. Respect website terms of service and robots.txt: Ensure that your scraping activities comply with the website's terms of service and its robots.txt rules, and do not violate any legal or ethical guidelines.

  2. API usage and costs: When using third-party APIs like OpenAI or Groq, be mindful of the associated costs and usage limits.

  3. Data quality and reliability: The accuracy of the extracted information depends on the quality of the LLM and the clarity of the user prompts. It's essential to validate the scraped data before using it for critical applications.
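
The robots.txt check from the first point can be partly automated with Python's standard library. This is a minimal sketch, assuming the robots.txt content has already been downloaded as text. The helper is_allowed and the example rules are hypothetical, and passing this check does not replace reading a site's terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules that have already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
example_rules = """
User-agent: *
Disallow: /private/
"""

print(is_allowed(example_rules, "https://example.com/projects"))      # True
print(is_allowed(example_rules, "https://example.com/private/data"))  # False
```

Running a check like this before handing a URL to a scraping pipeline keeps disallowed pages out of the run.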


ScrapeGraphAI makes it considerably easier to extract information from websites. By combining large language models with direct graph logic, it replaces much of the hand-written scraping logic with a prompt and a short configuration, which opens up new possibilities for researchers and data enthusiasts. Whether you're conducting a literature review, analyzing market trends, or performing sentiment analysis, ScrapeGraphAI offers an efficient approach to data extraction, provided the results are validated before use.
