Readers interested in this problem scope are welcome to contact the authors.
This version uses the training databases from Bird (69 databases) and Spider (166 databases), for a total of 235 databases. Each database targets a specific domain and consists of multiple tables.
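Both benchmarks ship their databases as SQLite files. As a rough illustration of how the 235 databases and their tables could be enumerated, here is a small sketch; the directory layout and the `.sqlite` extension are assumptions, not part of this project's actual loader.

```python
import sqlite3
from pathlib import Path

def list_tables(db_path: str) -> list[str]:
    """Return the user table names in one SQLite database file."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]

def enumerate_databases(root: str) -> dict[str, list[str]]:
    """Map each database name (file stem) to its tables, scanning a tree."""
    return {
        p.stem: list_tables(str(p))
        for p in sorted(Path(root).rglob("*.sqlite"))
    }
```

With the Bird and Spider database roots in place, `enumerate_databases("data/")` would yield one entry per domain database, each listing its tables.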
The pipeline implements a 9-step workflow with forward and backward passes to improve SQL generation quality:
- Setup Sampler: Initialize diversity sampling based on batch_size and num_iteration
- Generate Query (Forward): Natural language query generation from database schema
- Generate Groundtruth (Forward): SQL generation from natural language query
- Verify Format: Execute SQL and materialize results to verdict database
- Verify Groundtruth (Forward): LLM-based verification of query-SQL correctness and adherence
- Generate Unit Test (Backward): Create comprehensive unit tests from SQL result table
- Generate Query (Backward): Generate improved natural language query from SQL + result table
- Verify Again (Backward): Execute unit tests for final verification
- Save to Dataset: Convert to dataset.jsonl if verdict is "correct" and adherence is "adheres" or "partial"
The pipeline supports batch execution and iterations:
- Batch Size: Number of parallel pipeline runs per iteration (default: 5)
- Iterations: Number of batch runs to execute (default: 1)
- Diversity Sampling: Intelligent sampling for specification diversity across batches
- Execution Logging: Full process logs saved to the `execution_log/` directory
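The batch and iteration settings above compose as `batch_size` pipeline runs per iteration, repeated `num_iterations` times. A minimal stand-in for the diversity sampler, assuming specifications are drawn without repetition until the pool is exhausted (the actual sampler's strategy may differ):

```python
import random

def sample_batches(specs: list[str], batch_size: int = 5,
                   num_iterations: int = 1, seed: int = 0) -> list[list[str]]:
    """Draw `batch_size` distinct specs per iteration, avoiding repeats
    across iterations until the pool runs out, then reshuffling."""
    rng = random.Random(seed)
    pool = specs.copy()
    rng.shuffle(pool)
    batches = []
    for _ in range(num_iterations):
        if len(pool) < batch_size:  # pool exhausted: refill and reshuffle
            pool = specs.copy()
            rng.shuffle(pool)
        batches.append([pool.pop() for _ in range(batch_size)])
    return batches
```

With the defaults (`batch_size=5`, `num_iterations=1`), one call yields a single batch of five distinct specifications; larger `num_iterations` values keep drawing fresh specs before any repeat.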