Feature Request: Script to Generate Pretrain Data (parquet files) #4

@seunghan96

Hello,
Thank you for sharing this impressive work.

The provided pretrain_pairs_ctx512 parquet files contain pre-computed distances and indices columns, which appear to have been calculated with a single retrieval method.

When training with different retrieval methods, we need pretrain data that reflects the retrieval results of each specific method. Currently, every combination uses the same pretrain data, so the pretrained checkpoints are identical regardless of the retrieval method.

Could you provide a script to generate pretrain data parquet files for different retrieval (similarity) methods? The script should:

  1. Generate indices and distances using the specified similarity metric
  2. Create parquet files with the same structure as the provided pretrain_pairs_ctx512 files
  3. Support configurable parameters: top_k, lookback_length, etc.

This would allow researchers to generate pretrain data for their specific retrieval methods and enable proper comparison between different retrieval methods.
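
To make the request concrete, here is a rough sketch of what such a script might look like. It is only an illustration under my own assumptions, not the repository's actual pipeline: the helper names (`sliding_windows`, `top_k_neighbors`, `write_pretrain_pairs`), the `stride` and `metric` parameters, and the brute-force NumPy search are all hypothetical. The real parquet files may also contain columns beyond `indices` and `distances`, which I have not tried to reproduce.

```python
import numpy as np


def sliding_windows(series: np.ndarray, lookback_length: int, stride: int = 1) -> np.ndarray:
    """Cut a 1-D series into overlapping windows of length lookback_length."""
    n = (len(series) - lookback_length) // stride + 1
    return np.stack([series[i * stride : i * stride + lookback_length] for i in range(n)])


def top_k_neighbors(windows: np.ndarray, top_k: int, metric: str = "euclidean"):
    """Return (indices, distances) of the top_k closest windows for every window.

    Brute-force O(n^2) search; a real script would likely use an ANN library.
    """
    x = windows.astype(np.float64)
    if metric == "euclidean":
        sq = (x ** 2).sum(axis=1)
        # Pairwise squared distances via the Gram matrix, clipped at 0 for safety.
        d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0))
    elif metric == "cosine":
        unit = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
        d = 1.0 - unit @ unit.T
    else:
        raise ValueError(f"unknown metric: {metric}")
    np.fill_diagonal(d, np.inf)  # a window must not retrieve itself
    indices = np.argsort(d, axis=1)[:, :top_k]
    distances = np.take_along_axis(d, indices, axis=1)
    return indices, distances


def write_pretrain_pairs(path: str, indices: np.ndarray, distances: np.ndarray) -> None:
    """Write one row per window with list-valued `indices` and `distances` columns.

    Assumption: only these two columns are written; the real schema may differ.
    """
    import pandas as pd  # requires a parquet engine such as pyarrow

    df = pd.DataFrame({"indices": indices.tolist(), "distances": distances.tolist()})
    df.to_parquet(path, index=False)
```

Usage would be along the lines of `idx, dist = top_k_neighbors(sliding_windows(series, lookback_length=512), top_k=10, metric="cosine")` followed by `write_pretrain_pairs("pairs.parquet", idx, dist)`. Note that this sketch only excludes the exact self-match; overlapping neighboring windows would probably need to be masked out as well.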

Thank you!
