Commit 2c8371a (1 parent: 0397391)
Author: unknown
Message: update

File tree: 2 files changed (+1 −1 lines changed)


paper.pdf — binary file not shown (0 bytes changed)

paper/sections/5-discussion.tex

Lines changed: 1 addition & 1 deletion
@@ -186,7 +186,7 @@ \subsection{Fragmentation Across the Four Quadrants of SGI}
 At a finer granularity, Deep Research tasks involving \textbf{Data} and \textbf{Properties} are the weakest: performance on these categories is substantially below that of \textbf{Micro-} and \textbf{Macro-experiment} questions, with \emph{all four categories rarely exceeding 30\%} accuracy (Figure~\ref{fig: deep research on different task}). This aligns with the task design: data/property questions require retrieving dispersed numerical details across heterogeneous papers, while experiment-oriented questions provide more structured evidence. The results thus expose a core SGI bottleneck: \emph{meta-analytic retrieval + numerical aggregation over scattered literature}.

 \paragraph{Conception: Ideas lack implementability.}
-Idea Generation in SGI-Bench is assessed using \textbf{Effectiveness}, \textbf{Detailedness}, and \textbf{Feasibility} (Table~\ref{tab:idea_gen_res}). \textbf{Feasibility is low across models}: many systems score in the 14–20 range, and the best result reaches 22.90 (\texttt{o3}), indicating that feasibility consistently lags behind novelty and detailedness. \textbf{Detailedness remains insufficient for several models}, with implementation steps frequently missing concrete parameters, resource assumptions, or step ordering; \textbf{Effectiveness is moderate for most systems}, with the highest result of 51.36 (GPT-5.2-Pro) and open-source models clustering around 24.95–28.50 (e.g., DeepSeek-V3.2, Llama-4-Scout).
+Idea Generation in SGI-Bench is assessed using \textbf{Effectiveness}, \textbf{Detailedness}, and \textbf{Feasibility} (Table~\ref{tab:idea_gen_res}). \textbf{Feasibility is low across models}: many systems score in the 14–20 range, and the best result reaches 22.90 (\texttt{o3}), indicating that feasibility consistently lags behind novelty and detailedness. \textbf{Detailedness remains insufficient for several models}, with implementation steps frequently missing concrete parameters, resource assumptions, or step ordering; \textbf{Effectiveness is moderate for most systems}, with the highest result of 51.36 (GPT-5.2-Pro) and open-source models clustering around 24.95–28.74 (e.g., DeepSeek-V3.2, Llama-4-Scout).

 Recurring issues include: (i) underspecified implementation steps—absent data acquisition or preprocessing plans, missing hyperparameters or compute assumptions, vague module choices (e.g., solver type, training objective, evaluation protocol), and unclear interfaces, ordering, or data flow; and (ii) infeasible procedures—reliance on unavailable instruments or data, uncoordinated pipelines that cannot be executed, and designs lacking reproducibility.

