This is the official repository for the paper "SurfaceBench: Can Self-Evolving LLMs Find the Equations of 3D Scientific Surfaces?" (TMLR 2026)
In this paper, we introduce SurfaceBench, a benchmark built to push equation discovery beyond simple scalar functions and into the richer setting of symbolic surface recovery. The benchmark contains 183 tasks covering 15 different categories, each capturing a distinct kind of symbolic or geometric complexity. Every task provides a ground-truth surface equation along with variable meanings and a 3D point cloud sampled from it, spanning scientific areas such as fluid dynamics, robotics, electromagnetics, and classical geometry. Our goal is to challenge LLM-based symbolic regression systems in a way that existing datasets do not: the surfaces in SurfaceBench are constructed to avoid easy memorization, include diverse representation forms (explicit, implicit, parametric), and reflect realistic scientific structure.
To run the code, create a conda environment and install the dependencies provided in the requirements.txt or environment.yml:
conda create -n SurfaceBench python=3.11.7
conda activate SurfaceBench
pip install -r requirements.txt
Note: Python ≥ 3.9 is required; our experiments used Python 3.11.7.
You also need to install the additional packages for each search method from their original GitHub repositories.
The benchmark data will be downloaded automatically from Hugging Face.
We provide an implementation of OpenEvolve in the OpenEvolve/ folder, and implementations of LLM-SR, LaSR, SGA, and PySR in the methods/ folder.
To add a new discovery method, refer to the implementation section for detailed instructions. This includes setting up the necessary configuration, implementing the searcher class, and ensuring compatibility with the existing framework.
- Activate the appropriate conda environment.
- Launch a local LLM server. While our implementation uses vllm, you can also opt for other libraries as long as you implement the necessary functionality in the searcher class. For example, to start the server with vllm, use the command:

  vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype auto --api-key token-abc123 --port 10005
- Configure the environment variables in the .env file. Duplicate .env.example to .env and specify the following:
  - VLLM_API_KEY: Your API key for the local vLLM server (e.g., 'token-abc123').
  - OPENAI_API_KEY: Your OpenAI API key, if you are using OpenAI models.
  - SGA_PYTHON_PATH: The path to the Python executable in your SGA conda environment, if you are using the SGA searcher.
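After copying .env.example to .env, the file might look like the following (all values shown are placeholders; substitute your own key, and a path, only for the variables your setup actually needs):

```
VLLM_API_KEY=token-abc123
OPENAI_API_KEY=sk-your-key-here
SGA_PYTHON_PATH=/path/to/conda/envs/sga/bin/python
```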
- Execute the eval.py script with the required arguments:
  - --searcher_config: Path to the YAML configuration file for the searcher (required).
  - --dataset: The name of the dataset to evaluate (required).
  - --resume_from: The path to a previous run directory to continue from (optional).
  - --problem_name: A specific problem name to evaluate (optional).
  - --local_llm_port: The port number of the local LLM server (optional).
Available dataset options include:
- Nonlinear Analytic Composition Surfaces (Non-Nonlinear_Analytic_Composition_Surfaces)
- Piecewise-Defined Surfaces (Piecewise-Defined_Surfaces)
- Mixed Transcendental Analytic Surfaces (Mixed_Transcendental_Analytic_Surfaces)
- Conditional Multi-Regime Surfaces (Conditional_Multi-Regime_Surfaces)
- Oscillatory Composite Surfaces (Oscillatory_Composite_Surfaces)
- Trigonometric–Exponential Composition Surfaces (Trigonometric–Exponential_Composition_Surfaces)
- Multi-Operator Composite Surfaces (Multi-Operator_Composite_Surfaces)
- Elementary Bivariate Surfaces (Elementary_Bivariate_Surfaces)
- Discrete Integer-Grid Surfaces (Discrete_Integer-Grid_Surfaces)
- Nonlinear Coupled Surfaces (Nonlinear_Coupled_Surfaces)
- Exponentially-Modulated Trigonometric Surfaces (Exponentially-Modulated_Trigonometric_Surfaces)
- Localized and Radially-Decaying Surfaces (Localized_and_Radially-Decaying_Surfaces)
- Polynomial–Transcendental Mixtures (Polynomial–Transcendental_Mixtures)
- High-Degree Implicit Surfaces (High-Degree_Implicit_Surfaces)
- Parametric Multi-Output Surfaces (Parametric_Multi-Output_Surfaces)
For example, to run the discovery methods on all Elementary Bivariate Surfaces tasks with the open LLM backbone Qwen-3-8B on a local server, use the following commands:
LLM-SR:
python eval.py --dataset "Elementary_Bivariate_Surfaces" --searcher_config configs/llmsr_qwen3-8B.yaml --local_llm_port 10005
LaSR:
python eval.py --dataset "Elementary_Bivariate_Surfaces" --searcher_config configs/lasr_qwen3-8B.yaml --local_llm_port 10005
SGA:
python eval.py --dataset "Elementary_Bivariate_Surfaces" --searcher_config configs/sga_qwen3-8B.yaml --local_llm_port 10005
More evaluation scripts for running discovery methods with different LLM backbones on different datasets are provided in example_script.sh.
The execution of eval.py will generate log files in the logs/ folder. You can resume your run using the --resume_from <log_dir> option. For instance,
--resume_from logs/Elementary_Bivariate_Surfaces/Elementary_Bivariate_Surfaces/llmsr-Qwen3-8B/09-20-2025_18-09-19-121210 will bypass already completed problems.
The working directory structure will be as follows:
project
│   README.md
│   eval.py
│   .env
└───bench/
└───methods/
│   └───llmsr
│   └───lasr
│   └───PySR
│   └───sga_sr
└───datasets/
└───OpenEvolve/
└───logs/
    └───<dataset-name>
        └───<method-name>
            └───<date>
To implement a new searcher, you must create a class that inherits from the base class BaseSearcher. This base class provides the foundational structure for your LLM-based searcher, including essential methods that need to be overridden.
class BaseSearcher:
    def __init__(self, name) -> None:
        self._name = name

    def discover(self, task: SEDTask) -> List[SearchResult]:
        '''
        Return:
            List of SearchResult
        '''
        raise NotImplementedError

    def __str__(self):
        return self._name

The input task provides a description of the target equation, its input variables, and training data points.
An example of a searcher:

class NewSearcher(BaseSearcher):
    def __init__(self, name, num_samples, api_type, api_model, api_url):
        super().__init__(name)
        self.num_samples = num_samples
        self.llm = LLM(api_type, api_model, api_url)

    def discover(self, task: SEDTask):
        dataset = task.samples
        symbol_descs = task.symbol_descs
        prompt = (
            f"Find the mathematical function skeleton that represents {symbol_descs[0]}, "
            f"given data on {', '.join(symbol_descs[1:-1])}, and {symbol_descs[-1]}"
        )

        best_program, best_aux, best_score = None, None, -np.inf
        for _ in range(self.num_samples):
            program_str, aux = self.llm.sample_program(prompt)
            score = evaluate(program_str, dataset)
            if score > best_score:
                best_program, best_aux, best_score = program_str, aux, score

        best_equation = Equation(
            symbols=task.symbols,
            symbol_descs=task.symbol_descs,
            symbol_properties=task.symbol_properties,
            expression=None,
            program_format=best_program,
            lambda_format=programstr2lambda(best_program),
        )
        return [SearchResult(equation=best_equation, aux=best_aux)]

Once you’ve implemented your searcher, create a corresponding configuration file in the configs/ folder. For example:
name: NewSearcher-Llama31_8b
class_name: NewSearcher
api_type: "vllm"
api_model: "meta-llama/Llama-3.1-8B-Instruct"
api_url: "http://localhost:{}/v1/"
num_samples: 1000

To evaluate with this searcher, run eval.py and provide the path to its configuration file; this will load the settings and initiate the evaluation process on the specified dataset.
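The evaluate helper called inside discover() in the example searcher is not shown in this README. A minimal sketch of what such a scorer could look like, assuming an NMSE-style fit score where higher is better, and an illustrative (X, y) data layout in which the candidate program defines an equation(X) function (the function name, data layout, and use of exec are all assumptions for illustration, not the repository's actual implementation):

```python
import numpy as np

def evaluate(program_str, dataset):
    """Score a candidate program string against (X, y) data.

    Hypothetical sketch: executes the program, which is assumed to define
    `equation(X)`, and returns the negative normalized MSE so that higher
    is better. Programs that fail to run score -inf.
    """
    X, y = dataset  # assumed layout: input matrix and target values
    namespace = {"np": np}
    try:
        exec(program_str, namespace)
        y_pred = namespace["equation"](X)
        nmse = np.mean((y - y_pred) ** 2) / (np.var(y) + 1e-12)
        return -float(nmse)
    except Exception:
        return -np.inf
```

A perfect fit scores 0 (the maximum), and any broken or non-executing candidate is safely ranked last.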
By default, the API key is read from the OPENAI_API_KEY environment variable; see create_config() in data_api.py.
The data_api.py script is crucial for setting up the environment: it prepares tasks from the SurfaceBench dataset (defined by the classes in ./bench) and writes them to ./problems.
For each benchmark task, this script will automatically generate:
- initial_program.py: A starting Python program, typically a simple linear model.
- evaluator.py: A tailored evaluation script for the task.
- config.yaml: An OpenEvolve configuration file specific to the task.
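To make the starting point concrete, a generated initial_program.py is typically just a planar model for the evolutionary search to mutate. A hypothetical sketch (the function name and parameter layout are illustrative, not the exact generated code):

```python
import numpy as np

# Hypothetical sketch of a generated initial_program.py:
# the search starts from a simple linear (planar) surface model.
def surface(x, y, params=(1.0, 1.0, 0.0)):
    """Initial guess z = a*x + b*y + c; OpenEvolve evolves this program."""
    a, b, c = params
    return a * x + b * y + c
```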
Run the script from your terminal:
python data_api.py

This will create subdirectories for each benchmark task, populated with the necessary files.
Use the provided shell script scripts.sh to execute OpenEvolve across the generated benchmark tasks. This script iterates through the task-specific configurations and applies the evolutionary process.
bash scripts.sh

After OpenEvolve has completed its runs, you can evaluate the performance on different subsets of tasks. The eval.py script collates the results and provides a summary.
python eval.py <subset_path>

For example, to evaluate results for the Elementary_Bivariate_Surfaces subset located in ./problems/Elementary_Bivariate_Surfaces/, you would run:

python eval.py ./problems/Elementary_Bivariate_Surfaces

This script also saves a JSON file containing detailed results for your analysis.
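The exact schema of the saved JSON is defined by eval.py; assuming an illustrative per-task layout mapping each task name to a record with a solved flag (the field names here are assumptions, so check the file eval.py actually writes), aggregating a subset accuracy from it could look like:

```python
import json

def subset_accuracy(results_path):
    """Load a detailed-results JSON and return the fraction of solved tasks.

    Assumes an illustrative schema: {task_name: {"solved": bool, ...}, ...}.
    Adapt the field names to the JSON actually produced by eval.py.
    """
    with open(results_path) as f:
        results = json.load(f)
    solved = sum(1 for record in results.values() if record.get("solved"))
    return solved / max(len(results), 1)
```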
PySR:
python eval.py --dataset "Elementary_Bivariate_Surfaces" --searcher_config configs/pysr_explicit.yaml
If you find this work useful, please cite:
@article{
kabra2026surfacebench,
title={{SURFACEBENCH}: A Geometry-Aware Benchmark for Symbolic Surface Discovery},
author={Sanchit Kabra and Shobhnik Kriplani and Parshin Shojaee and Chandan K. Reddy},
journal={Transactions on Machine Learning Research},
year={2026},
url={https://openreview.net/forum?id=sHLTzkczSi}
}

This repository is licensed under the MIT license.
This work builds on other open-source projects, including OpenEvolve, LLM-SR, LaSR, SGA, and PySR, and is inspired by the effort behind LLM-SRBench. We sincerely thank the original contributors of these works for open-sourcing their code.
For any questions or issues, you are welcome to open an issue in this repo, or contact us at sanchit23@vt.edu and shobhnik1@gmail.com.
