|
Yeah, the data is quite noisy. I am leaning towards a "no", unless we can somehow clean it up around a particular subject. |
|
I like this task, but I agree that noise is a concern. |
|
I checked again and I think there's noisy data in short instances too. |
What I meant above was to retain only the longer instances. Longer instances seem to contain less noise. |
|
Oh sorry, I didn't read it carefully! |
|
Sorry for being late. I think it makes sense to keep the longer instances (not sure about the threshold, though). Should I add other languages too?
Yes, if you have time, feel free to add them. It's also fine if you skip this and decide to focus on the other ToDos we have in this project. |
|
I agree with Swaroop. If cleaning up this PR will take more than 1 hr, I would say it's not worth it.
This task is created from the MRS dataset from issue #283.
However, I doubt whether this is a good addition. The data is derived from Reddit replies, which are not good-quality examples to learn from. I've cleaned them as much as I could, but there is still a lot of nonsense in them. I'm submitting the English task to get other people's opinions. If it's good enough, or there's a good way to filter out the nonsense, I will go on and add other languages too.
@swarooprm @danyaljj