GPU-Accelerated Q&A Extraction and RAG Evaluation Pipeline for Audio Equipment Documentation
Transform audio equipment manuals into high-quality training datasets through automated Q&A generation, FAISS vector indexing, and RAG-enhanced evaluation using Llama-3-8B-Instruct.
This repository implements a complete Dual RAG AutoRAG Pipeline that:
- Extracts Q&A pairs from audio equipment PDFs using GPU-accelerated LLM processing
- Generates 9 matrix combinations (3 difficulty levels × 3 creativity styles)
- Builds dual vector stores with both Standard and Adaptive RAG approaches (4 FAISS indices)
- Evaluates both RAG approaches with scientific A/B testing and comparison analysis
- Produces domain-specific insights with dual approach performance comparison
- Generates training datasets using the winning RAG approach based on empirical results
graph LR
A[Audio Manual PDF] --> B[Q&A Generation<br/>Matrix 3×3]
B --> C[Top-K Selection]
C --> D[Dual Vector Store Builder]
D --> E1[Standard RAG<br/>Answer Embeddings]
D --> E2[Adaptive RAG<br/>Q+A Embeddings]
E1 --> F1[CPU/GPU Indices<br/>Standard]
E2 --> F2[CPU/GPU Indices<br/>Adaptive]
F1 --> G1[Standard RAG Eval]
F2 --> G2[Adaptive RAG Eval]
G1 --> H[A/B Comparison]
G2 --> H
H --> I[Winner Selection]
I --> J[Training Dataset]
style A fill:#f9d71c
style H fill:#e74c3c
style J fill:#27ae60
| Difficulty | High Creativity (0.9) | Balanced (0.7) | Conservative (0.3) |
|---|---|---|---|
| Basic | Broad, creative questions | Standard questions | Focused, literal |
| Intermediate | Complex scenarios | Technical details | Specific procedures |
| Advanced | Expert-level analysis | Professional insights | Precise specifications |
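
The nine combinations in this table can be enumerated as a Cartesian product of difficulty levels and creativity temperatures. The sketch below is illustrative only: the style labels (`high`, `balanced`, `conservative`) and the output filename pattern are assumptions modeled on the `qa_basic_balanced.jsonl` example used in this README, not names confirmed by the pipeline code.

```python
from itertools import product

# Difficulty levels and temperatures taken from the matrix table.
DIFFICULTIES = ["basic", "intermediate", "advanced"]
# Style labels are hypothetical shorthand for the three table columns.
CREATIVITY = {"high": 0.9, "balanced": 0.7, "conservative": 0.3}


def matrix_combinations():
    """Return one config dict per (difficulty, creativity) cell."""
    return [
        {
            "difficulty": difficulty,
            "style": style,
            "temperature": temperature,
            # Assumed naming scheme, following outputs/qa_basic_balanced.jsonl.
            "output": f"outputs/qa_{difficulty}_{style}.jsonl",
        }
        for difficulty, (style, temperature) in product(DIFFICULTIES, CREATIVITY.items())
    ]


if __name__ == "__main__":
    for combo in matrix_combinations():
        print(combo)
```

Each resulting dict could then drive one `cli_pdf_qa.py` invocation, passing the difficulty to `--difficulty-levels` and the temperature to `--temperature`.
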
- GPU Required: NVIDIA GPU with CUDA support (L40S recommended)
- Python: 3.9+
- Storage: ~10GB for models + datasets
# Clone the repository
git clone <repository-url>
cd autorag
# Install Poetry (if not already installed)
curl -sSL https://install.python-poetry.org | python3 -
# Install dependencies with Poetry
poetry install
# Set up Hugging Face token
export HF_TOKEN="your_hugging_face_token_here"

# Trigger the full AutoRAG pipeline via GitHub Actions
gh workflow run pdf-qa-autorag.yaml \
--field input_file="pdfs/UAFX_Ruby_63_Top_Boost_Amplifier_Manual.pdf" \
--field model_name="meta-llama/Meta-Llama-3-8B-Instruct" \
--field top_k_selection="50"

Or run components individually:
# 1. Generate Q&A pairs (example: basic difficulty, balanced creativity)
python cli_pdf_qa.py \
pdfs/UAFX_Ruby_63_Top_Boost_Amplifier_Manual.pdf \
--output outputs/qa_basic_balanced.jsonl \
--difficulty-levels basic \
--temperature 0.7 --top-p 0.9
# 2. Select best pairs
python qa_pair_selector.py \
--qa-artifacts-dir outputs \
--output-dir rag_input \
--top-k 50
# 3. Build dual vector stores (Standard + Adaptive RAG)
python qa_faiss_builder.py \
--qa-pairs-file rag_input/selected_qa_pairs.json \
--output-dir rag_store
# 4. Run parallel RAG evaluation (both approaches)
# Standard RAG
python qa_autorag_evaluator.py \
--qa-pairs-file rag_input/selected_qa_pairs.json \
--qa-faiss-index rag_store/qa_faiss_index_standard_gpu.bin \
--output-dir autorag_results/standard_rag
# Adaptive RAG
python qa_autorag_evaluator.py \
--qa-pairs-file rag_input/selected_qa_pairs.json \
--qa-faiss-index rag_store/qa_faiss_index_adaptive_gpu.bin \
--output-dir autorag_results/adaptive_rag
# 5. Compare RAG approaches
python rag_comparison_analyzer.py \
--standard-results autorag_results/standard_rag \
--adaptive-results autorag_results/adaptive_rag \
--output-file autorag_results/rag_comparison_report.json
# 6. Domain-specific evaluation
python domain_eval_gpu.py \
--config audio_equipment_domain_questions.json \
--results-dir outputs

autorag/
├── pdfs/                                      # Audio equipment manuals
│   └── UAFX_Ruby_63_Top_Boost_Amplifier_Manual.pdf
├── qa_extraction_lib/                         # Core extraction library
│   ├── pdf_generator.py                       # PDF text processing
│   ├── prompt_manager.py                      # LLM prompt templates
│   └── text_processing.py                     # Text chunking & preprocessing
├── Pipeline Scripts
│   ├── cli_pdf_qa.py                          # Main Q&A generator (9 matrix combinations)
│   ├── qa_pair_selector.py                    # Top-K selection algorithm
│   ├── qa_faiss_builder.py                    # GPU FAISS index builder
│   ├── qa_autorag_evaluator.py                # RAG vs Base model evaluation
│   ├── training_dataset_generator.py          # High-quality dataset generator
│   └── domain_eval_gpu.py                     # Audio equipment domain evaluator
├── Configuration
│   ├── audio_equipment_domain_questions.json  # Domain-specific evaluation config
│   └── pyproject.toml                         # Poetry dependencies and project config
├── .github/workflows/
│   └── pdf-qa-autorag.yaml                    # Complete CI/CD pipeline
└── Output Directories (auto-created)
    ├── outputs/                               # Generated Q&A pairs (9 matrix files)
    ├── rag_input/                             # Selected pairs + metadata
    ├── rag_store/                             # FAISS indices + embeddings
    └── autorag_results/                       # Evaluation reports + training datasets

- 4 FAISS indices (CPU/GPU × Standard/Adaptive) for comprehensive evaluation
- Standard RAG: Traditional answer-only embeddings (speed-optimized)
- Adaptive RAG: Combined Q+A embeddings (quality-optimized)
- Scientific A/B testing with quantitative performance comparison
- Automatic winner selection based on empirical results
- Standard vs Adaptive RAG head-to-head comparison
- RAG vs Base Model performance analysis
- BERT-Score semantic evaluation for both approaches
- Domain relevance scoring with dual approach insights
- Uncertainty detection and confidence calibration
- Performance metrics (speed vs quality trade-offs)
- Winner-based training data using best-performing RAG approach
- High-quality Q&A pairs filtered by semantic similarity and comparison results
- JSONL format compatible with popular training frameworks
- Metadata preservation (difficulty, creativity, source tracking, RAG comparison scores)
- Quality metrics and approach selection rationale for dataset curation
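
The difference between the two stores comes down to which text is embedded for each Q&A pair: the answer alone (Standard) or the question and answer together (Adaptive). The following is a minimal, dependency-free sketch of that idea. The bag-of-words vectors and brute-force cosine search are stand-ins for Sentence-Transformers embeddings and FAISS indices, and the function names, the `question`/`answer` field names, and the example pairs are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a sentence encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_store(qa_pairs, approach):
    """Standard embeds the answer only; Adaptive embeds question + answer."""
    if approach == "standard":
        texts = [p["answer"] for p in qa_pairs]
    elif approach == "adaptive":
        texts = [f'{p["question"]} {p["answer"]}' for p in qa_pairs]
    else:
        raise ValueError(f"unknown approach: {approach}")
    return [embed(t) for t in texts]


def search(store, query, k=1):
    """Return indices of the k most similar entries (brute force, like a flat index)."""
    q = embed(query)
    ranked = sorted(range(len(store)), key=lambda i: cosine(q, store[i]), reverse=True)
    return ranked[:k]


# Made-up example pairs, for illustration only.
qa_pairs = [
    {"question": "What is the input impedance?", "answer": "Refer to the specifications table."},
    {"question": "What does the gain knob control?", "answer": "It sets how hard the preamp is driven."},
]
standard = build_store(qa_pairs, "standard")
adaptive = build_store(qa_pairs, "adaptive")
```

With these toy vectors, a query such as `"gain knob"` shares no terms with either answer in the Standard store, while the Adaptive store matches the second pair through its question text. Real dense embeddings behave less starkly, which is why the pipeline measures the difference empirically.
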
The pipeline provides multi-dimensional evaluation:
| Metric Category | Measures | Good For |
|---|---|---|
| Semantic Quality | BERT-Score F1, Precision, Recall | Answer accuracy (both approaches) |
| Domain Relevance | Audio equipment term frequency | Specialization (Standard vs Adaptive) |
| Response Length | Word count, token count | Completeness comparison |
| Uncertainty | "I don't know" phrase detection | Confidence calibration |
| Retrieval Quality | Dense + sparse score combination | Context relevance (dual comparison) |
| Performance | Retrieval/generation time (ms) | Speed vs quality trade-offs |
| Approach Comparison | Standard vs Adaptive metrics | Winner selection criteria |
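
Two of the simpler metrics in this table, response length and uncertainty, can be approximated with plain string processing. The sketch below works only under stated assumptions: the phrase list and function names are hypothetical, and the word count uses whitespace splitting rather than the model tokenizer.

```python
# Hypothetical phrase list; the pipeline's actual list is not shown in this README.
UNCERTAINTY_PHRASES = (
    "i don't know",
    "i do not know",
    "i'm not sure",
    "not enough information",
)


def detect_uncertainty(response):
    """Return True if the response contains any known uncertainty phrase."""
    lowered = response.lower().replace("’", "'")  # normalize curly apostrophes
    return any(phrase in lowered for phrase in UNCERTAINTY_PHRASES)


def response_length(response):
    """Word count via whitespace splitting (token counts need the model tokenizer)."""
    return len(response.split())


def summarize(responses):
    """Aggregate both metrics over a list of generated answers."""
    total = len(responses)
    uncertain = sum(detect_uncertainty(r) for r in responses)
    return {
        "uncertainty_rate": uncertain / total if total else 0.0,
        "mean_word_count": sum(response_length(r) for r in responses) / total if total else 0.0,
    }
```

Computing the same summary for the Standard and Adaptive runs gives two directly comparable rows for the approach comparison.
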
Specifically tuned for guitar amplifiers and effects:
- Domain Terms: amplifier, guitar, tone, distortion, overdrive, gain, EQ, tube, preamp, etc.
- Question Categories: Technical specifications, setup procedures, troubleshooting, comparisons
- Knowledge Areas: Impedance matching, tube saturation, power handling, signal processing
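
A term-frequency relevance score over the domain terms listed here might look like the sketch below. It is an assumption-laden illustration: the function name is hypothetical, and the normalization (matching words divided by total words) is a guess, since this README does not describe how `domain_eval_gpu.py` scales its scores.

```python
import re

# Terms listed in this README; the pipeline's full list lives in its JSON config.
DOMAIN_TERMS = {
    "amplifier", "guitar", "tone", "distortion", "overdrive",
    "gain", "eq", "tube", "preamp",
}


def domain_term_frequency(text, terms=DOMAIN_TERMS):
    """Fraction of words in `text` that are domain terms (0.0 for empty text)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in terms)
    return hits / len(words)
```

For example, `domain_term_frequency("The tube preamp adds gain")` counts three domain terms among five words.
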
For in-depth technical details on each component:
- Q&A Generation - PDF processing and LLM-based Q&A extraction
- Quality Selection - Multi-metric quality assessment and filtering
- Dual Vector Store - GPU-accelerated FAISS indexing with Standard + Adaptive approaches
- RAG Evaluation - Parallel RAG evaluation and performance comparison
- Training Dataset - Winner-based training data generation
- Domain Evaluation - Domain expertise with dual RAG analysis
- Pipeline Architecture - Complete dual RAG system design and data flow
- Dual RAG Architecture - Standard vs Adaptive RAG comparison framework
Each component document includes technical implementation details, configuration options, performance characteristics, and use cases.
# Process your own audio equipment manual
python cli_pdf_qa.py your_manual.pdf \
--chunk-size 600 \
--batch-size 4 \
--difficulty-levels basic intermediate \
--quantize # Enable for lower GPU memory

# More aggressive filtering
python qa_pair_selector.py \
--qa-artifacts-dir outputs \
--top-k 25 \
--min-quality-threshold 0.7

Edit audio_equipment_domain_questions.json to:
- Add new domain terms
- Create custom evaluation questions
- Modify confidence templates
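
The more aggressive filtering command shown earlier in this section combines a quality floor (`--min-quality-threshold`) with a top-K cut (`--top-k`). A rough sketch of that selection logic follows. The `quality_score` field and the function name are hypothetical, since this README does not describe how `qa_pair_selector.py` scores pairs.

```python
import heapq


def select_top_k(qa_pairs, top_k=25, min_quality_threshold=0.7):
    """Drop pairs below the threshold, then keep the top_k highest-scoring ones."""
    eligible = [p for p in qa_pairs if p["quality_score"] >= min_quality_threshold]
    # nlargest returns results sorted from highest to lowest score.
    return heapq.nlargest(top_k, eligible, key=lambda p: p["quality_score"])
```

Applying the threshold before the top-K cut means the result can contain fewer than `top_k` pairs when the source material is weak, which is usually preferable to padding the set with low-quality pairs.
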
After running the complete dual RAG pipeline, expect:
- ~500-1000 Q&A pairs from a typical amplifier manual
- 50+ high-quality pairs selected for dual RAG evaluation
- 4 GPU/CPU FAISS indices with sub-millisecond query times
- Comparative analysis showing Standard vs Adaptive performance differences
- Domain relevance scores typically 0.6-0.8 for in-domain questions (both approaches)
- BERT-Score improvements of 0.1-0.3 F1 with RAG vs base model
- Winner determination and deployment recommendations
- Training dataset generated from best-performing approach
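
Winner determination can be thought of as comparing summary metrics from the two evaluation runs and then writing the chosen approach's pairs as JSONL. The sketch below assumes each run is summarized as a dict with `bert_f1` and `latency_ms` keys, and it prefers quality, using speed only to break near-ties. The keys, the tie margin, and the decision rule are all assumptions rather than the logic of `rag_comparison_analyzer.py`.

```python
import json


def pick_winner(standard, adaptive, tie_margin=0.01):
    """Prefer higher BERT-Score F1; within tie_margin, prefer lower latency."""
    delta = adaptive["bert_f1"] - standard["bert_f1"]
    if abs(delta) <= tie_margin:
        winner = "adaptive" if adaptive["latency_ms"] < standard["latency_ms"] else "standard"
        reason = "F1 within margin; chose by latency"
    else:
        winner = "adaptive" if delta > 0 else "standard"
        reason = "higher BERT-Score F1"
    return {"winner": winner, "f1_delta": round(delta, 4), "reason": reason}


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Recording the `reason` alongside the winner keeps the approach selection rationale available as dataset metadata.
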
This pipeline is designed for audio equipment domain specialization. To adapt for other domains:
- Replace PDF: Add your domain-specific documentation to pdfs/
- Update domain config: Modify audio_equipment_domain_questions.json
- Adjust prompts: Edit templates in qa_extraction_lib/prompt_manager.py
- Update workflow: Change default paths in .github/workflows/pdf-qa-autorag.yaml
This project demonstrates advanced RAG pipeline techniques for domain-specific knowledge extraction. Built with modern ML tools including PyTorch, Transformers, FAISS, and Llama-3.
Key Technologies: Python 3.9+, PyTorch 2.1+, Transformers 4.42+, FAISS GPU, Sentence-Transformers, BERT-Score
Ready to amplify your audio equipment knowledge with AI? Let's rock!