
How to identify main topics in a text

Updated: May 8

Large language models (LLMs) make it much easier to identify topics in textual data. These models represent a substantial improvement over older techniques such as manual coding or Latent Dirichlet Allocation (LDA). This guide outlines three practical strategies for using LLMs to identify topics effectively, even in short texts.

Before diving into these methods, I recommend reading a pre-print paper [http://doi.org/10.2196/preprints.53376] that examines these strategies and their respective advantages in detail. You can also practice with the dataset on GitHub. [https://github.com/lanceyuu/LLMforTM.git] It is important to acknowledge certain limitations, such as the constraints on the length of text that LLMs can analyze.
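
If you are unsure whether a text fits within a model's context window, a rough token count can help. Below is a minimal sketch using OpenAI's tiktoken library; it only approximates the count for OpenAI models, and other providers use different tokenizers, so treat the number as an estimate.

!pip install tiktoken

import tiktoken

# Rough token count for an OpenAI model (an estimate only; other
# providers tokenize differently).
encoding = tiktoken.encoding_for_model("gpt-4")

text = "paste your text here"
num_tokens = len(encoding.encode(text))
print("Approximate token count:", num_tokens)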





1. Utilizing LLM Tools

For analyzing text, you may use any accessible LLM tool in the toolkit. Visit our toolkit here [Toolkit]. For long texts, it is advisable to use either GPT-4 or a GPT developed by me [Topic Modeller]. Other models such as Le Chat and Claude have also demonstrated the ability to process sizable texts, though Llama may be more limited. For texts in Chinese, Kimi is the suggested choice. Additionally, Gemini 1.5 Pro can handle up to one million tokens.


Prompt example:

You are a qualitative researcher and your task is to identify the topics in the text. When generating the topics, prioritize correctness and ensure that your response is accurate and grounded in the context of the text.

Your summary should include three components: the first one is the title of the topic; the second one is the definition of the topic; the third one is the number of occurrences of the topic.

insert topic 1: Title, definition, occurrence
insert topic 2: Title, definition, occurrence
insert topic 3: Title, definition, occurrence
…

Here is the text:
[paste your text here]

2. Python Script for Analyzing Structured Data

If you have structured data, such as a CSV file, or need to analyze many texts, a Python script offers a more systematic approach than manual input. The script is below; replace the API key with your own key and point the data-loading path at your dataset.



!pip install openai==0.28
!pip install evaluate
!pip install rouge_score
!pip install pandas

import openai
import evaluate
import pandas as pd

# Initialize OpenAI API with your key
openai_api_key = "your_openai_api_key"
openai.api_key = openai_api_key

# Load your test data
test_data_path = 'path/to/your/test.csv'
test_data = pd.read_csv(test_data_path)

# Define the summarization instruction within a chat interaction context
def generate_summary(review):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an AI trained to summarize product reviews."},
            {"role": "user", "content": review}
        ]
    )
    return response.choices[0].message['content']

# Generate summaries
results = []
actual_summaries = test_data['summary'].tolist()  # Adjust the column name if needed

for review in test_data['reviews']:  # Adjust the column name if needed
    summary = generate_summary(review)
    results.append(summary)

# Evaluate the summaries using ROUGE
rouge_scorer = evaluate.load('rouge')
rouge_scores = rouge_scorer.compute(predictions=results, references=actual_summaries)

# Directly print the ROUGE-1 score, assuming it's a numerical value
print("ROUGE-1 F1 Score:", rouge_scores['rouge1'])

For those with more advanced expertise, GPT-4 can be substituted with open-source alternatives such as Yi or Qwen. Take a look at this leaderboard for open-source LLMs. [https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard]
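
One way to swap in an open-source model without rewriting the script is to serve it behind an OpenAI-compatible endpoint (for example with vLLM) and point the same openai client at it. The sketch below is illustrative rather than part of the original script; the endpoint URL and model name are placeholders for whatever you deploy.

# Sketch: point the existing openai==0.28 client at a self-hosted,
# OpenAI-compatible server (e.g. one started with vLLM).
import openai

openai.api_base = "http://localhost:8000/v1"   # placeholder: your server's address
openai.api_key = "not-needed-for-a-local-server"

response = openai.ChatCompletion.create(
    model="Qwen/Qwen1.5-7B-Chat",  # placeholder: the model your server is hosting
    messages=[{"role": "user", "content": "Identify the topics in the following text: ..."}]
)
print(response.choices[0].message["content"])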


Google Colab is recommended for running the script. To check how well your prompt performs, you can compute the ROUGE-1 score and adjust the prompt as necessary. A later tutorial will cover evaluating the outcomes of topic modeling.
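
As one way to compare prompts, the sketch below reruns the summarization step with two candidate system prompts and reports the ROUGE-1 score for each against the same reference summaries. It assumes test_data, actual_summaries, and the openai client from the script above are already set up; the prompt wordings are illustrative, not prescribed.

# Sketch: compare two candidate system prompts by ROUGE-1 and keep the better one.
candidate_prompts = [
    "You are an AI trained to summarize product reviews.",
    "You are a qualitative researcher. Summarize the key topics of the review.",
]

rouge = evaluate.load("rouge")
for system_prompt in candidate_prompts:
    predictions = []
    for review in test_data["reviews"]:  # adjust the column name if needed
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": review},
            ],
        )
        predictions.append(response.choices[0].message["content"])
    scores = rouge.compute(predictions=predictions, references=actual_summaries)
    print("ROUGE-1:", scores["rouge1"], "for prompt:", system_prompt)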


After topic identification, a second script can be used to assign each text fragment to the relevant topics.


import csv
import openai

# Initialize the OpenAI API client
openai.api_key = 'please paste your API key here'

# Open the tweet.csv file; upload it to Colab or open it in your local environment.
with open("tweet.csv", "r", newline="") as csvfile:
    reader = csv.DictReader(csvfile)

    # Create a new file called tweet_results.csv to store the results
    with open("tweet_results.csv", "w", newline="") as csv_file:  # you can change the name to newname.csv
        writer = csv.DictWriter(csv_file, fieldnames=["tweet", "answer"])  # replace "tweet" and "answer" with the column names you prefer
        writer.writeheader()

        # Iterate over each tweet
        for row in reader:
            # Get the tweet text
            tweet_text = row["tweet"]  # replace "tweet" with the name of the text column in your CSV file
            # Ask GPT-4 which topic the tweet belongs to
            prompt = f"You are a researcher and you will tell me which topic the following reply belongs to. In total there are xxx topics: Topic 1 xxx; Topic 2 xxx. \"{tweet_text}\""
            response = openai.ChatCompletion.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}]
            )

            # Get the response text from the model
            response_text = response["choices"][0]["message"]["content"]
            answer = response_text  # replace "answer" with your preferred column name

            # Write the results to the file
            writer.writerow({"tweet": tweet_text, "answer": answer})  # use the same column names as in fieldnames above
     

An online workshop on this method will be offered in September 2024. Subscribe here to stay updated.




3. IBM Watsonx AI

My recent participation in an IBM hackathon revealed the utility of Watsonx AI in this context. The tool supports prompt fine-tuning and model testing. Detailed information will be covered in an upcoming tutorial.


These methods are intended to support your text analysis work. I hope they prove useful in your research.


