@Kannav02 Kannav02 commented Jul 9, 2025

This PR introduces a complete dataset generation and evaluation pipeline for creating high-quality question-answer pairs from OpenROAD documentation. The system includes automated QA pair generation using Gemini 2.5 Pro and comprehensive quality-evaluation metrics.

The code and logic are adapted from the following guide:
https://huggingface.co/learn/cookbook/en/rag_evaluation

The following files have been added under the dataset_gen_eval folder:

  • eval_dataset.py: Evaluation script that loads generated QA pairs and applies quality metrics
  • generate_qa_pairs.py: Automated QA pair generation using Gemini 2.5 Pro to create factoid questions from domain-specific document chunks
  • ingest_doc.py: Document processing pipeline that chunks and indexes PDF/Markdown files into FAISS
  • quality_agents.py: Custom DeepEval metrics implementation with three quality assessment classes
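To make the ingestion step concrete, here is a minimal sketch of the kind of chunking that ingest_doc.py performs before indexing into FAISS. The function name, chunk size, and overlap values are illustrative assumptions, not the PR's actual parameters, and the embedding/FAISS indexing step is omitted:

```python
# Hypothetical sketch of the chunking step: split a document into fixed-size,
# overlapping character chunks before embedding and indexing. The real script
# operates on PDF/Markdown files; a plain string stands in here.

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks for embedding/indexing."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

doc = "OpenROAD is an open-source RTL-to-GDSII flow. " * 40
chunks = chunk_text(doc, chunk_size=200, overlap=20)
# Consecutive chunks share a 20-character overlap so no sentence is
# split without context at a chunk boundary.
```

Overlapping chunks are a common choice for RAG ingestion because a question's answer may straddle a chunk boundary; the overlap keeps that text intact in at least one chunk.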
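For a feel of what one of the quality-assessment classes in quality_agents.py might look like, here is a simplified, runnable stand-in. The PR's actual metrics are DeepEval metrics backed by an LLM judge; this sketch swaps the judge for a cheap lexical-overlap "groundedness" score so it runs without API keys. The class name, threshold, and scoring rule are illustrative assumptions:

```python
# Hypothetical, simplified quality metric: scores how much of an answer's
# vocabulary is supported by the source context. A real DeepEval metric would
# subclass deepeval's BaseMetric and call an LLM judge instead.

class GroundednessMetric:
    """Fraction of the answer's terms that appear in the source context."""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.score = 0.0
        self.passed = False

    def measure(self, answer: str, context: str) -> float:
        answer_terms = set(answer.lower().split())
        context_terms = set(context.lower().split())
        if not answer_terms:
            return 0.0
        self.score = len(answer_terms & context_terms) / len(answer_terms)
        self.passed = self.score >= self.threshold
        return self.score

metric = GroundednessMetric(threshold=0.5)
metric.measure(
    answer="OpenROAD performs global placement",
    context="OpenROAD performs global placement and detailed routing",
)
# Every answer term appears in the context, so the score is 1.0 and
# the metric passes its threshold.
```

Filtering generated QA pairs through metrics like this (groundedness, relevance, answerability are typical axes) is the standard recipe from the linked Hugging Face RAG-evaluation cookbook.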
