The current review relies on human experts, which takes a lot of time, and it has to be repeated for every new set of results. Possible solutions:
1. Evaluation by top models such as Grok 3, Grok 3 Thinking ...
2. Evaluation by thinking models: o3-mini-high, Gemini Flash Thinking
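A minimal sketch of how model-based evaluation could replace the manual pass, assuming each result is a question/answer pair and the judge model is reachable through some function that takes a prompt and returns text. Everything here is hypothetical and not from the note: the names `build_prompt`, `parse_score`, `evaluate`, the 1 to 5 scale, and the `SCORE: <n>` reply format are illustrative choices, and `judge_fn` stands in for whatever client is actually used to call the model. Scores are cached by content hash so that rerunning on new results only sends the new items to the judge.

```python
import hashlib
import json
import re
from typing import Callable, Dict, List, Optional


def build_prompt(question: str, answer: str) -> str:
    """Build the grading prompt (hypothetical rubric: 1 to 5 scale)."""
    return (
        "You are reviewing a model answer.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer from 1 (poor) to 5 (excellent). "
        "Reply with 'SCORE: <n>' on the last line."
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract the first 'SCORE: n' from the judge reply, or None if absent."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def cache_key(judge_name: str, question: str, answer: str) -> str:
    """Stable key, so an unchanged item is never sent to the judge twice."""
    payload = json.dumps([judge_name, question, answer])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def evaluate(
    results: List[Dict[str, str]],
    judge_name: str,
    judge_fn: Callable[[str], str],  # assumption: wraps the real model call
    cache: Dict[str, Optional[int]],
) -> List[Optional[int]]:
    """Score every result, calling the judge only for items not in the cache."""
    scores = []
    for item in results:
        key = cache_key(judge_name, item["question"], item["answer"])
        if key not in cache:
            reply = judge_fn(build_prompt(item["question"], item["answer"]))
            cache[key] = parse_score(reply)
        scores.append(cache[key])
    return scores
```

To compare the two options above, `evaluate` could be run once per judge with a different `judge_name` and `judge_fn`, and a small sample of the scores checked against the existing human expert reviews before relying on either judge.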