How to Identify the Main Topics in a Text

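Large language models (LLMs) make it straightforward to identify topics in text data. Compared with manual coding or older techniques such as Latent Dirichlet Allocation (LDA), they are a substantial improvement. This guide presents three robust strategies for identifying topics with LLMs, even in short texts.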

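Before diving into the methods, I recommend reading the preprint [http://doi.org/10.2196/preprints.53376], which examines these strategies and their respective strengths in depth. You can also practice with the dataset on GitHub [https://github.com/lanceyuu/LLMforTM.git]. Keep some limitations in mind, such as the limit on the length of text an LLM can analyze.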
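
One common way to work around the length limit is to split a long text into overlapping chunks and send each chunk to the model separately. The sketch below is only an illustration: the function name `split_into_chunks` and the defaults `max_words=500` and `overlap=50` are assumptions, not values from the preprint, and word counts are only a rough proxy for the tokens a model actually counts.

```python
def split_into_chunks(text, max_words=500, overlap=50):
    """Split text into word-based chunks, each sharing `overlap` words with the previous one.

    max_words and overlap are illustrative defaults (assumptions), not recommended values.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    chunks = []
    step = max_words - overlap  # how far the window advances each time

    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk can then be pasted into the prompt in turn. The overlap reduces the chance that a topic straddling a chunk boundary is missed, at the cost of some repeated text.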

1. Using LLM tools

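To analyze text, you can use any LLM tool available in the toolkit; see our toolkit [Toolkit]. For long texts, I recommend GPT-4 or the GPT I developed [Topic Modeler]. Other models, such as Le Chat and Claude, have also shown that they can handle long texts, although Llama may show limitations. For Chinese text, Kimi is a good choice. In addition, Gemini 1.5 Pro can handle up to one million tokens.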

Prompt example:

You are a qualitative researcher and your task is to identify the topics in the text. When generating the topics, prioritize correctness and ensure that your response is accurate and grounded in the context of the text.

Your summary should include three components: the first one is the title of the topic; the second one is the definition of the topic; the third one is the number of occurrences of the topic.

insert topic 1: Title, definition, occurrence
insert topic 2: Title, definition, occurrence
insert topic 3: Title, definition, occurrence
…

Here is the text
[paste your text here]
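
If you reuse this prompt for many texts, it can help to keep it as a template and fill in the text programmatically. The following is a minimal sketch; `PROMPT_TEMPLATE` and `build_topic_prompt` are hypothetical names introduced here, and the wording simply mirrors the prompt above.

```python
# Template mirroring the example prompt; {text} is replaced with the text to analyze.
PROMPT_TEMPLATE = (
    "You are a qualitative researcher and your task is to identify the topics in the text. "
    "When generating the topics, prioritize correctness and ensure that your response is "
    "accurate and grounded in the context of the text.\n\n"
    "Your summary should include three components: the first one is the title of the topic; "
    "the second one is the definition of the topic; the third one is the number of "
    "occurrences of the topic.\n\n"
    "Here is the text\n{text}"
)


def build_topic_prompt(text):
    """Return the full prompt for one text (hypothetical helper)."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("text must not be empty")
    return PROMPT_TEMPLATE.format(text=cleaned)
```

The returned string can be pasted into a chat tool or sent as the user message of an API call.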

2. A Python script for analyzing structured data

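If you have structured data such as a CSV file, or need to analyze a large volume of text, a Python script offers a more systematic approach than manual input. The script is below. Replace the API key with your own, and set the path to your dataset.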


!pip install openai==0.28
!pip install evaluate
!pip install rouge_score
!pip install pandas

import openai
import evaluate
import pandas as pd

# Initialize OpenAI API with your key
openai_api_key = "your_openai_api_key"
openai.api_key = openai_api_key

# Load your test data
test_data_path = 'path/to/your/test.csv'
test_data = pd.read_csv(test_data_path)

# Define the summarization instruction within a chat interaction context
def generate_summary(review):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are an AI trained to summarize product reviews."},
            {"role": "user", "content": review}
        ]
    )
    return response.choices[0].message['content']

# Generate summaries
results = []
actual_summaries = test_data['summary'].tolist()  # Adjust the column name if needed

for review in test_data['reviews']:  # Adjust the column name if needed
    summary = generate_summary(review)
    results.append(summary)

# Evaluate the summaries using ROUGE
rouge_scorer = evaluate.load('rouge')
rouge_scores = rouge_scorer.compute(predictions=results, references=actual_summaries)

# Directly print the ROUGE-1 score, assuming it's a numerical value
print("ROUGE-1 F1 Score:", rouge_scores['rouge1'])

Those with more advanced expertise can replace GPT-4 with other models, such as Yi or Qwen. See this leaderboard for open-source LLMs: [https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard].
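
One way to make the model easy to swap is to pass the model call in as a function, so the loop does not depend on any particular provider. This is a sketch under assumptions: `summarize_all` and `fake_model` are hypothetical names, and `fake_model` is a stand-in for a real API call, not an actual model.

```python
def summarize_all(reviews, generate):
    """Apply `generate` (any function mapping a review string to a summary) to each review.

    Non-string or blank entries, such as the NaN values pandas produces for
    empty CSV cells, get an empty summary instead of being sent to the model.
    """
    summaries = []
    for review in reviews:
        if not isinstance(review, str) or not review.strip():
            summaries.append("")
            continue
        summaries.append(generate(review))
    return summaries


def fake_model(review):
    """Stand-in for a real model call: returns the first sentence of the review."""
    return review.split(".")[0]
```

With this shape, switching from GPT-4 to another model means writing a different `generate` function while the rest of the script stays the same.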

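I recommend running the script in Google Colab. To gauge how well a prompt works, compute the ROUGE-1 score and revise the prompt as needed. A follow-up tutorial will provide guidance on evaluating topic modeling results.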
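
To make the ROUGE-1 score less of a black box, here is a simplified sketch of what it measures: the overlap of single words (unigrams) between the generated text and the reference. The function `rouge1_f1` is a hypothetical helper that uses plain whitespace tokenization, so its values can differ from those of the `evaluate` library, which tokenizes differently.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0

    # Count each shared word at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted words found in the reference
    recall = overlap / len(ref_tokens)      # share of reference words found in the prediction
    return 2 * precision * recall / (precision + recall)
```

A score near 1 means the generated summary uses almost the same words as the reference; a score near 0 means they share few words, which may suggest the prompt needs revising.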

After identifying the topics, you can use a helper script to associate each text segment with the relevant topic.

import csv
import openai

# Initialize the OpenAI API client
openai.api_key = 'please paste your API here'

# Open the tweet.csv file; upload it to Colab or open it in your local environment.
with open("tweet.csv", "r", newline="", encoding="utf-8") as csvfile:
    reader = csv.DictReader(csvfile)

    # Create a new file called tweet_results.csv to store the results
    with open("tweet_results.csv", "w", newline="", encoding="utf-8") as csv_file:  # you can change the name to newname.csv
        writer = csv.DictWriter(csv_file, fieldnames=["tweet", "answer"])  # replace tweet and answer with the column names A and B you like
        writer.writeheader()

        # Iterate over each tweet
        for row in reader:
            # Get the tweet text
            tweet_text = row["tweet"]  # replace tweet with the name of the text column in your CSV file
            # Ask GPT-4 which topic the tweet belongs to
            prompt = f"You are a researcher. Tell me which topic the following reply belongs to. In total there are xxx topics: Topic 1 xxx; Topic 2 xxx. Reply: \"{tweet_text}\""
            response = openai.ChatCompletion.create(model="gpt-4", messages=[{"role": "user", "content": prompt}])

            # Get the text of the model's reply
            response_text = response["choices"][0]["message"]["content"]
            answer = response_text  # replace answer with your new column name B

            # Write the results to the file
            writer.writerow({"tweet": tweet_text, "answer": answer})  # replace tweet with your new column name A, and answer with your new column name B
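
Once the results file exists, you may want to count how many texts were assigned to each topic. The sketch below assumes that each answer mentions its topic as "Topic N" (for example "Topic 2"), which depends on how the model phrases its reply. `extract_topic_number` and `count_topics` are hypothetical helpers, not part of the original script.

```python
import re
from collections import Counter

# Matches "Topic 3", "topic 3" or "Topic3"; assumes the model names the topic this way
TOPIC_PATTERN = re.compile(r"\btopic\s*(\d+)\b", re.IGNORECASE)


def extract_topic_number(answer):
    """Return the first topic number mentioned in the answer, or None if there is none."""
    match = TOPIC_PATTERN.search(answer)
    return int(match.group(1)) if match else None


def count_topics(answers):
    """Count how many answers were assigned to each topic (None = no topic found)."""
    counts = Counter()
    for answer in answers:
        counts[extract_topic_number(answer)] += 1
    return counts
```

To use it, read the "answer" column of tweet_results.csv with `csv.DictReader` and pass the values to `count_topics`. Answers counted under `None` are worth checking by hand, since the model may have replied in an unexpected format.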