[Feature] LLM-as-a-Judge: evaluate review alignment

The current review is done by human experts, which takes a lot of time. It also has to be redone for every new set of results.

Possible solutions:

1. Evaluation by top models like Grok 3, Grok 3 Thinking ...
2. Evaluation by thinking models: o3-mini-high, Gemini Flash Thinking
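
Whichever model is used, one way to check alignment is to run the judge on items that already have a human expert verdict and measure how often the two agree. A minimal sketch, assuming verdicts are simple string labels such as "pass"/"fail"; `agreement_rate`, `evaluate_judge`, and `stub_judge` are hypothetical names, and `stub_judge` only stands in for a real model call:

```python
from typing import Callable, Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of items where the judge verdict equals the human verdict."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("no labels to compare")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def evaluate_judge(
    items: Sequence[str],
    human_labels: Sequence[str],
    judge_fn: Callable[[str], str],
) -> float:
    """Run judge_fn on every item and compare against the human labels."""
    judge_labels = [judge_fn(item) for item in items]
    return agreement_rate(human_labels, judge_labels)


def stub_judge(item: str) -> str:
    # Placeholder (assumption): a real judge would prompt an LLM here
    # and parse its verdict into one of the same labels.
    return "pass" if item.startswith("ok") else "fail"


if __name__ == "__main__":
    items = ["ok: a", "bad: b", "ok: c", "bad: d"]
    human = ["pass", "fail", "fail", "fail"]
    print(evaluate_judge(items, human, stub_judge))
```

Swapping `stub_judge` for a function that calls one of the candidate models would let the same human-labeled set be reused to compare judges. Raw agreement ignores chance agreement, so a chance-corrected measure may be worth adding later.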