Web scraping can be intimidating and overwhelming if you do not understand the structure of the website. However, LLMs can make the process far more efficient. Here, I will introduce two methods of web scraping.
The first method is for beginners who only need to scrape a page or two. Text-based scraping involves copying the HTML of the target website and then using an LLM to extract the data you need. Simply paste the HTML into an LLM such as ChatGPT, and it will return the data you requested.
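As an illustration, here is a minimal sketch of the first step: fetching the raw HTML so you can paste it into the chat. The URL is a placeholder, and the requests library is assumed to be installed (pip install requests).

# Fetch the raw HTML of a page so it can be pasted into an LLM such as ChatGPT.
import requests

url = "https://example.com"  # replace with the page you want to scrape
html = requests.get(url, timeout=30).text
print(html[:2000])  # preview; copy the full HTML into the chat window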
The second method requires the use of Python. The video below explains how to create a web scraper using Python and the AI chatbot GPT-4. The scraper can summarize information from multiple websites and answer your questions about the content.
First, you need to sign up for an OpenAI account and obtain an API key.
You also need to install a Python library called LangChain.
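To confirm the setup works, here is a minimal sketch that assumes the key is stored in a .env file as OPENAI_API_KEY and that the packages have been installed with pip install langchain openai python-dotenv.

# Quick check that the API key is available to the script.
import os
import dotenv

dotenv.load_dotenv()  # read the .env file in the working directory
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY not found - check your .env file"
print("API key loaded")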
The code consists of a function called web_qa that takes a list of URLs, a query, and an output file name prefix as input.
The function uses LangChain to load the web pages into a vector database and then calls an OpenAI chat model through the API (the script below uses gpt-3.5-turbo) to answer your questions about the content of the web pages.
The video demonstrates the code with an example where the user wants to learn about Ideogram AI. The user pastes four URLs about Ideogram AI into the code and asks the model to summarize what Ideogram AI is, what it does, and how to use it, and to suggest five interesting prompts to try with it.
Here is the video.
Here is the script.
from langchain.document_loaders import WebBaseLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.chat_models.openai import ChatOpenAI
from datetime import datetime
import dotenv

# Load OPENAI_API_KEY from a local .env file
dotenv.load_dotenv()


def web_qa(url_list, query, out_name):
    # Chat model used to answer the query
    openai = ChatOpenAI(
        model_name="gpt-3.5-turbo",
        max_tokens=2048
    )

    # Load each web page with a LangChain WebBaseLoader
    loader_list = []
    for i in url_list:
        print('loading url: %s' % i)
        loader_list.append(WebBaseLoader(i))

    # Build a vector store index from the loaded pages and query it
    index = VectorstoreIndexCreator().from_loaders(loader_list)
    ans = index.query(question=query,
                      llm=openai)

    print("")
    print(ans)

    # Save the answer to a timestamped output file
    outfile_name = out_name + datetime.now().strftime("%m-%d-%y-%H%M%S") + ".out"
    with open(outfile_name, 'w') as f:
        f.write(ans)


url_list = [
    "https://openaimaster.com/how-to-use-ideogram-ai/",
    "https://dataconomy.com/2023/08/28/what-is-ideogram-ai-and-how-to-use-it/",
    "https://ideogram.ai/launch",
    "https://venturebeat.com/ai/watch-out-midjourney-ideogram-launches-ai-image-generator-with-impressive-typography/"
]

prompt = '''
Given the context, please provide the following:
1. summary of what it is
2. summary of what it does
3. summary of how to use it
4. Please provide 5 interesting prompts that could be used with this AI.
'''

web_qa(url_list, prompt, "summary")
Additionally, below is guidance from the Journal of Marketing about web scraping for research.
A Journal of Marketing webinar titled "Web Data Scraping for Marketing Research" discusses the importance of web data in marketing research and the challenges of collecting it. The panel introduces a new methodological framework to help researchers collect web data in a valid and reliable way.
The framework consists of three stages: selecting the source, designing the collection, and extracting the data. Researchers need to carefully consider a number of factors at each stage, such as:
Source selection
Quality: Researchers should assess the quality of the data on the potential source websites. This may involve evaluating the accuracy, completeness, and relevance of the data.
Stability: The chosen websites should be stable and unlikely to undergo significant changes in structure or content. This will help to ensure that the data collected is consistent over time.
Ease of access: Researchers need to consider how easy it is to access the data on the websites they have chosen. Some websites may make it difficult or impossible to scrape data automatically.
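To illustrate the ease-of-access point, here is a small sketch that fetches a candidate page and inspects the response. The URL is one of the Ideogram articles from the script above, and the check itself is my own illustration rather than part of the webinar framework.

# Rough ease-of-access check: can the page be fetched, and how much content comes back?
import requests

url = "https://dataconomy.com/2023/08/28/what-is-ideogram-ai-and-how-to-use-it/"
resp = requests.get(url, timeout=30)
print("status:", resp.status_code)  # 200 means the page is reachable
print("size:", len(resp.text), "characters")  # rough sense of how much content is there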
Collection design
Data extraction methods: Researchers need to decide how to extract the data from the websites they have chosen. This may involve writing scripts to automate the data collection process.
Sampling: Researchers need to decide how to sample the data from the websites they have chosen. This will depend on the research question and the nature of the data.
Legal and ethical considerations: It is important to ensure that the data scraping process is legal and ethical. Researchers should respect the robots.txt files of the websites they are scraping and avoid collecting data that is protected by copyright or privacy laws.
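For the robots.txt point, here is a small sketch using Python's standard-library robotparser; the target URL is the Ideogram launch page used earlier, and the check is illustrative rather than prescribed by the webinar.

# Check whether robots.txt allows automated fetching of a given page.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://ideogram.ai/robots.txt")
rp.read()

target = "https://ideogram.ai/launch"
if rp.can_fetch("*", target):
    print("robots.txt allows fetching", target)
else:
    print("robots.txt disallows fetching", target)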
Data extraction
Data cleaning: Once the data has been extracted, it is important to clean it to remove any errors or inconsistencies. This may involve removing duplicate entries, formatting the data consistently, and checking for missing values (a small sketch follows this list).
Data monitoring: Researchers should monitor the data extraction process to ensure that they are getting the data they expect. This may involve checking the data for errors on a regular basis.
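To make the data-cleaning step concrete, here is a hedged sketch using pandas; the column names and records are invented for illustration.

# Deduplicate scraped records and flag missing values with pandas.
import pandas as pd

scraped = pd.DataFrame([
    {"product": "Widget A", "price": 19.99},
    {"product": "Widget A", "price": 19.99},  # duplicate row
    {"product": "Widget B", "price": None},   # missing price
])

clean = scraped.drop_duplicates()
print(clean.isna().sum())  # count missing values per column
clean = clean.dropna(subset=["price"])  # drop rows with no usable price
print(clean)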
The speaker also discusses the importance of documentation and replicability in web data scraping. She recommends that researchers carefully document their data collection process so that others can replicate their results.
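One simple way to support replicability is to save a small collection log alongside the data; the fields below are my own suggestion, not a schema from the webinar.

# Record when, where, and how the data was collected.
import json
from datetime import datetime, timezone

collection_log = {
    "collected_at": datetime.now(timezone.utc).isoformat(),
    "urls": ["https://ideogram.ai/launch"],
    "tool": "requests 2.x",
    "notes": "single snapshot, no login, robots.txt checked",
}
with open("collection_log.json", "w") as f:
    json.dump(collection_log, f, indent=2)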
Overall, the webinar provides a valuable overview of web data scraping for marketing research. It is a great resource for researchers who are interested in using web data in their work.
Here are some of the key takeaways:
Web data is a valuable resource for marketing research, but it is important to collect it in a valid and reliable way.
The new methodological framework can help researchers to consider the important factors at each stage of the data scraping process.
Careful consideration needs to be given to legal and ethical issues when scraping data.
Documentation is essential for replicability.