Hi, thanks for the great work and for releasing the code.
I noticed a discrepancy between the paper and the implementation regarding the audio encoder.
In the paper, wav2vec is described as the audio feature extractor, while in the released code the audio encoder seems to be Whisper.
Could you please clarify the motivation behind this choice?