Feature Request: Script to Generate Pretrain Data (parquet files) #4
Description
Hello,
Thank you for sharing this impressive work.
The provided pretrain_pairs_ctx512 parquet files contain pre-computed distances and indices columns, which appear to have been calculated using a single retrieval method.
When training with different retrieval methods, we need pretrain data that reflects the retrieval results of each specific method. Currently, every combination uses the same pretrain data, so the resulting pretrained checkpoints are identical.
Could you provide a script to generate pretrain data parquet files for different retrieval (similarity) methods? The script should:
- Generate `indices` and `distances` using the specified similarity metric
- Create parquet files with the same structure as the provided `pretrain_pairs_ctx512` files
- Support configurable parameters: top_k, lookback_length, etc.
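To make the request concrete, here is a rough sketch of what such a script could look like. It is not the authors' pipeline. Everything other than the `indices` and `distances` column names is an assumption on my side: the function names, the two example metrics, the sliding-window construction, the rule that a window is never retrieved as its own neighbour, and the output schema (the real `pretrain_pairs_ctx512` files may contain more columns). It is written in plain Python for readability; a real script would need a vectorised or approximate nearest-neighbour search.

```python
import math
from typing import Callable, Dict, List, Sequence

Window = Sequence[float]


def euclidean(a: Window, b: Window) -> float:
    """Euclidean distance between two equal-length windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Window, b: Window) -> float:
    """1 - cosine similarity; returns 1.0 if either window is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical registry: further similarity metrics would be added here.
METRICS: Dict[str, Callable[[Window, Window], float]] = {
    "euclidean": euclidean,
    "cosine": cosine_distance,
}


def make_windows(series: Sequence[float], lookback_length: int) -> List[List[float]]:
    """Cut a series into overlapping windows of lookback_length (stride 1)."""
    count = len(series) - lookback_length + 1
    return [list(series[i:i + lookback_length]) for i in range(count)]


def retrieve_top_k(windows: List[List[float]], metric: str, top_k: int) -> List[dict]:
    """For each window, find the top_k closest other windows under `metric`.

    Assumption: every other window is a valid candidate. The real data may
    restrict candidates (for example to other series or to earlier windows).
    """
    dist_fn = METRICS[metric]
    rows = []
    for i, query in enumerate(windows):
        # Sort by distance; ties are broken by the lower candidate index.
        scored = sorted(
            (dist_fn(query, cand), j) for j, cand in enumerate(windows) if j != i
        )
        nearest = scored[:top_k]
        rows.append({
            "indices": [j for _, j in nearest],
            "distances": [d for d, _ in nearest],
        })
    return rows


def write_parquet(rows: List[dict], path: str) -> None:
    """Write the rows to a parquet file (needs pandas plus pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so the retrieval code runs without it

    pd.DataFrame(rows).to_parquet(path, index=False)
```

Usage would then be something like `write_parquet(retrieve_top_k(make_windows(series, lookback_length), "cosine", top_k), "pairs.parquet")`, with one output file per retrieval method.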
This would allow researchers to generate pretrain data for their specific retrieval methods and enable a proper comparison between them.
Thank you!