
Generate synthetic QA pairs for each document.

Adapted from Unsloth AI's official inference scripts.
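
The generation code further down reads one JSON object per line and looks up `source_id`, `chunk_id`, `source_name`, and `text_content` on each entry. A minimal sketch of writing such an input file follows. The values are made up for illustration, and `write_sample_input` is a hypothetical helper that is not part of the original notebook.

```python
import json

# Hypothetical sample records: the keys match those read by generate_qa_pairs,
# the values (including their types) are invented purely for illustration.
sample_entries = [
    {
        "source_id": "doc-001",
        "chunk_id": 0,
        "source_name": "example_page",
        "text_content": "Pittsburgh is called the City of Bridges.",
    },
    {
        "source_id": "doc-001",
        "chunk_id": 1,
        "source_name": "example_page",
        # text_content may also be a list of strings; the notebook joins it with spaces.
        "text_content": ["UPMC is the largest employer.", "CMU is in Pittsburgh."],
    },
]


def write_sample_input(path, entries):
    """Write one JSON object per line (JSONL). Hypothetical helper."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")
```

Entries with an empty or missing `text_content` are skipped by the generation loop, so they produce no output record.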

%%capture

!pip install unsloth
!pip install --upgrade --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip uninstall transformers -y && pip install --upgrade --no-cache-dir "git+https://github.com/huggingface/transformers.git"
from unsloth import FastLanguageModel
from transformers import TextStreamer
from unsloth.chat_templates import get_chat_template
import json
from tqdm import tqdm
import gc
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",
    max_seq_length = 8192,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)
prompt1 = """
You are a helpful AI agent with the task of generating question-answer pairs accurately. Here are examples of factual question-answer pairs based on a given context:

Example 1:
Context: Pittsburgh, named after William Pitt, 1st Earl of Chatham, is known for its **steelproduction** history. The city is a hub for **education and technology,** with institutions like **Carnegie Mellon University (CMU)** and the University of Pittsburgh. In 1980, it hosted the first **International Conference on Machine Learning (ICML),** marking its role in AI research. $$Techforward?
The **PPG Paints Arena** is home to the **Pittsburgh Penguins (NHL)** and hosts events, like **Billie Eilish’s** concert on **October 13, 2024**. Fans are already booking tickets for the show—don’t miss out! **Check availability: ticketpage_postererror404**. Pittsburgh, with its population of **300k** and over **2.3 million** in the metro area, is growing in innovation and culture. 🎟️.
The city has 446 bridges—**more than Venice**—and is called the **“City of Bridges.”** Take a walk along the rivers (Monongahela, Allegheny, Ohio) to experience it yourself. Error: **riverwalk_mapunavailable.** Pittsburgh’s transformation includes a booming **tech industry,** with startups in the Strip District and innovation at **CMU.** UPMC is the largest employer. Explore more at **pgh_innovation.html**.
For concert tickets or more info on upcoming events at **PPG Arena,** visit [tickets-2024.com/billieeilish]. Not finding answers? Try **404-error/subpage**. See the **AndyWarholMuseum** or **PointStatePark** while you’re in town. Pittsburgh offers everything from sports to tech—stay updated with **eventlist@pittsburgh-events**!

Q: Who is Pittsburgh named after? A: William Pitt
Q: What famous machine learning venue had its first conference in Pittsburgh in 1980? A: ICML
Q: What musical artist is performing at PPG Arena on October 13? A: Billie Eilish

Now, based on the following context, generate 2-3 important factual question-answer pairs that are highly relevant to the facts in the context. Each question should be clear, important, concise, and directly related to the facts provided. Ensure the answers are concise and accurate. Prioritize the most important or unique details from the context. Think deeply step by step to make sure one of the questions is complex from a long-context dependency standpoint.
"""

prompt2 = """
You are a helpful AI agent with the task of generating question-answer pairs accurately. Here are examples of factual question-answer pairs based on a given context:

Example 1:
Context: Pittsburgh, named after William Pitt, 1st Earl of Chatham, is known for its rich industrial history, particularly in steel production. It has transformed into a hub for education and technology, home to prestigious institutions like Carnegie Mellon University (CMU) and the University of Pittsburgh. In 1980, Pittsburgh hosted the first International Conference on Machine Learning (ICML), establishing its connection to AI research.

The city also boasts a vibrant cultural scene, highlighted by the PPG Paints Arena, home to the NHL’s Pittsburgh Penguins. On October 13, 2024, Grammy-winning artist Billie Eilish will perform at the arena, drawing fans from across the region. With a population of over 300,000 and a greater metropolitan area exceeding 2.3 million, Pittsburgh continues to thrive as a center of innovation and culture.
Q: Who is Pittsburgh named after? A: William Pitt
Q: What famous machine learning venue had its first conference in Pittsburgh in 1980? A: ICML
Q: What musical artist is performing at PPG Arena on October 13? A: Billie Eilish

Now, based on the following context, generate exactly 3 factual question-answer pairs that are highly relevant to the facts in the context. Each question should be clear, important, concise, and directly related to the facts provided. Ensure the answers are concise and accurate. Prioritize the most important or unique details from the context.
"""
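def parse_qa_pairs(qa_text):
    """Extract questions and answers from generated text.

    Handles "Q: ... A: ..." on a single line (the format the prompt asks for)
    as well as questions and answers on separate lines.
    """
    questions = []
    answers = []

    for line in qa_text.split('\n'):
        line = line.strip()

        if line.startswith("Q:"):
            # Split off an answer that follows on the same line, e.g. "Q: ..., A: ..."
            question, sep, answer = line[len("Q:"):].partition(" A:")
            questions.append(question.strip().rstrip(","))
            if sep:
                answers.append(answer.strip())
        elif line.startswith("A:"):
            # Strip only the leading "A:" prefix, not later occurrences
            answers.append(line[len("A:"):].strip())

    return questions, answers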

def generate_qa_pairs(input_file, output_file, metadata_file, start_line, end_line):
    question_id = 1

    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile, open(metadata_file, 'w') as meta_outfile:
        for line_num, line in enumerate(tqdm(infile, desc="Processing entries")):
            if line_num < start_line:
                continue
            if line_num >= end_line:
                break

            entry = json.loads(line)

            # Extract metadata
            source_id = entry.get('source_id', '')
            chunk_id = entry.get('chunk_id', '')
            source_name = entry.get('source_name', '')
            
            text_content = entry.get('text_content', '')

            if not text_content:
                continue

            if isinstance(text_content, list):
                text_content = " ".join(text_content)

            # Build prompt for question generation
            prompt = prompt1 + f"""
            Context: {text_content}
            
            Output format:
            
            Q: <question>, A: <answer>."""

            # The tokenizer's default chat template expects "role"/"content" keys
            messages = [{"role": "user", "content": prompt}]
            inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

            returns = model.generate(input_ids=inputs, max_new_tokens=256, use_cache=True)

            # Decode only the newly generated tokens so the prompt (and its few-shot
            # examples) is never parsed as if it were model output
            generated_tokens = returns[0][inputs.shape[1]:]
            relevant_section = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
            print(relevant_section)

            questions, answers = parse_qa_pairs(relevant_section)

            # Add generated questions and answers to the original entry
            entry['questions'] = questions
            entry['answers'] = answers

            # Write the modified entry to the output file
            json.dump(entry, outfile)
            outfile.write('\n')

            # Now write individual Q&A pairs with metadata to another file
            for q, a in zip(questions, answers):
                qa_metadata = {
                    "source_id": source_id,
                    "chunk_id": chunk_id,
                    "source_name": source_name,
                    "question_id": question_id,
                    "question": q,
                    "gt_answer": a
                }
                json.dump(qa_metadata, meta_outfile)
                meta_outfile.write('\n')

                question_id += 1
input_file = '/path/to/input.jsonl'
output_file = '/path/to/output.jsonl'
metadata_file = '/path/to/metadata.jsonl'

gc.collect()
torch.cuda.empty_cache()
generate_qa_pairs(input_file, output_file, metadata_file, start_line=0, end_line=2500)
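
Once the run finishes, the metadata file holds one record per generated pair, with the keys written above (`source_id`, `chunk_id`, `source_name`, `question_id`, `question`, `gt_answer`). A small sketch for loading it back and counting pairs per chunk follows. The helper names `load_qa_metadata` and `pairs_per_chunk` are hypothetical and not part of the original notebook.

```python
import json
from collections import Counter


def load_qa_metadata(path):
    """Load the per-question JSONL records into a list of dicts. Hypothetical helper."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def pairs_per_chunk(records):
    """Count generated QA pairs for each (source_id, chunk_id) combination."""
    return Counter((r["source_id"], r["chunk_id"]) for r in records)
```

For example, `pairs_per_chunk(load_qa_metadata(metadata_file))` gives a quick view of how many pairs each chunk produced. Chunks for which no pair was parsed do not appear in the metadata file at all; they show up only in the output file, with empty `questions` and `answers` lists.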