OpenAI on December 16 announced FrontierScience, a new benchmark designed to evaluate artificial intelligence systems on expert-level scientific reasoning across physics, chemistry and biology, as AI models increasingly demonstrate their ability to support real scientific research.
The company said reasoning lies at the heart of scientific work, going beyond factual recall to include hypothesis generation, testing, refinement and cross-disciplinary synthesis. As AI systems grow more capable, OpenAI said the key question is how deeply they can reason to meaningfully contribute to scientific discovery.
AI models increasingly used in real research
Over the past year, OpenAI’s models have reached major milestones, including gold-medal-level performance at the International Math Olympiad and the International Olympiad in Informatics. At the same time, advanced systems such as GPT-5 are already being used by researchers to accelerate scientific workflows.
According to OpenAI, scientists are deploying these models for tasks such as cross-disciplinary literature searches, multilingual research reviews and complex mathematical proofs. In many cases, work that once took days or weeks can now be completed in hours.
This progress was detailed in OpenAI’s November 2025 paper, Early science acceleration experiments with GPT-5, which presented early evidence that GPT-5 can measurably speed up scientific workflows.
Why FrontierScience was created
OpenAI said that as models’ reasoning and knowledge capabilities scale, existing scientific benchmarks are no longer sufficient. Many prior benchmarks focus on multiple-choice questions, have become saturated, or are not centered on real scientific reasoning.
For example, when the GPQA “Google-Proof” benchmark was released in November 2023, GPT-4 scored 39%, well below the expert baseline of 70%. Two years later, GPT-5.2 scored 92%, highlighting the need for more challenging evaluations.
FrontierScience was created to fill this gap by measuring expert-level scientific capabilities using difficult, original and meaningful questions written and verified by domain experts.
What FrontierScience measures
The full FrontierScience benchmark includes more than 700 textual questions, with 160 in a gold-standard set, spanning subfields across physics, chemistry and biology.
It is divided into two tracks:
-FrontierScience-Olympiad:
  -100 short-answer questions
  -Designed by international science olympiad medalists
  -Focused on constrained, theoretical scientific reasoning
  -Difficulty at least comparable to international olympiad competitions
-FrontierScience-Research:
  -60 original research subtasks
  -Written by PhD-level scientists
  -Designed to reflect real-world, multi-step research challenges
  -Graded using a detailed 10-point rubric
Each task was authored and verified by subject-matter experts. Olympiad contributors were medalists in at least one international competition, while Research contributors all held relevant PhD degrees.
How model performance is graded
Olympiad questions call for short answers, such as numerical values or symbolic expressions, which can be verified automatically, for example through exact comparison or fuzzy string matching.
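As a rough illustration of what such automated checking can look like (not OpenAI's actual grader), the sketch below compares numeric answers within a relative tolerance and falls back to fuzzy string matching for textual answers; the tolerance and similarity threshold are illustrative assumptions.

```python
import math
from difflib import SequenceMatcher

def check_short_answer(predicted: str, reference: str,
                       rel_tol: float = 1e-3, fuzz_threshold: float = 0.9) -> bool:
    """Illustrative short-answer check: numeric comparison within a relative
    tolerance, with a fuzzy string match as a fallback (thresholds are assumptions)."""
    try:
        # If both answers parse as numbers, compare them within a relative tolerance.
        return math.isclose(float(predicted), float(reference), rel_tol=rel_tol)
    except ValueError:
        # Otherwise, normalize case/whitespace and compare by string similarity.
        a, b = predicted.strip().lower(), reference.strip().lower()
        return SequenceMatcher(None, a, b).ratio() >= fuzz_threshold

# Example usage
print(check_short_answer("3.1416", "3.14159"))            # True: within numeric tolerance
print(check_short_answer("Benzaldehyde ", "benzaldehyde"))  # True: fuzzy match after normalization
```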
For Research tasks, OpenAI introduced a rubric-based grading system. Each question includes multiple objectively assessable criteria totaling 10 points, evaluating both final answers and intermediate reasoning steps. A score of 7 out of 10 or higher is considered correct.
Responses are evaluated using a model-based grader (GPT-5). While human expert grading would be ideal, OpenAI said it is not scalable at this level, so rubrics were designed to be reliably checked by a model-based system, supported by a verification pipeline.
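To make the rubric-and-threshold scheme concrete, here is a minimal sketch of model-based rubric grading, assuming a rubric whose criteria total 10 points and applying the 7-out-of-10 rule described above. This is not OpenAI's verification pipeline; the prompt wording, grader model identifier and JSON response format are assumptions for illustration.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grade_with_rubric(question: str, response: str, rubric: list[dict]) -> dict:
    """Illustrative rubric grading: each rubric item has a 'criterion' and a
    'points' value, and the items are assumed to total 10 points."""
    prompt = (
        "You are grading a scientific research answer against a rubric.\n"
        f"Question:\n{question}\n\nCandidate response:\n{response}\n\n"
        "Rubric (award points only if the criterion is clearly satisfied):\n"
        + "\n".join(f"- ({item['points']} pts) {item['criterion']}" for item in rubric)
        + "\n\nReturn JSON of the form {\"awarded\": [points per criterion, in order]}."
    )
    completion = client.chat.completions.create(
        model="gpt-5",  # assumption: the exact grader model identifier may differ
        messages=[{"role": "user", "content": prompt}],
    )
    awarded = json.loads(completion.choices[0].message.content)["awarded"]
    score = sum(awarded)
    # Per the benchmark's rule, a score of 7/10 or higher counts as correct.
    return {"score": score, "correct": score >= 7}
```

In practice, a pipeline like this would also validate the grader's output (for example, checking that awarded points never exceed each criterion's maximum), which is one reason OpenAI pairs its model-based grader with a verification pipeline.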
How leading AI models performed
OpenAI evaluated several frontier AI models on FrontierScience, including GPT-5.2, Claude Opus 4.5, Gemini 3 Pro, GPT-4o, OpenAI o4-mini and OpenAI o3.
In the initial results:
-GPT-5.2 scored 77% on FrontierScience-Olympiad
-GPT-5.2 scored 25% on FrontierScience-Research
-Gemini 3 Pro closely matched GPT-5.2 on the Olympiad track with a 76% score
OpenAI said the results show substantial progress in expert-level reasoning, while leaving significant headroom for improvement, particularly on open-ended research tasks.
Strengths, limits and next steps
While FrontierScience represents a step forward in evaluating scientific reasoning, OpenAI acknowledged key limitations. The benchmark focuses on constrained, expert-written problems and does not fully capture how science is conducted in practice.
In particular, it does not assess how models generate genuinely novel hypotheses, work with experimental systems, or interact with multimodal data such as video and physical-world experiments.
Looking ahead, OpenAI said progress in scientific reasoning will come from both stronger general-purpose reasoning systems and targeted improvements in scientific capabilities. FrontierScience is one tool among many, and the company plans to expand the benchmark to new domains and pair it with real-world evaluations.
Ultimately, OpenAI said, the most important measure of AI’s scientific value will be the new discoveries it helps generate—and FrontierScience is designed to serve as an early indicator of that potential.
Key takeaways:
-OpenAI launched FrontierScience to test AI on expert-level scientific reasoning across physics, chemistry and biology.
-Focus is on reasoning, not recall, including hypothesis generation, testing and cross-disciplinary thinking.
-AI models like GPT-5 are already accelerating research, cutting tasks from weeks to hours.
-Existing science benchmarks are no longer sufficient, prompting the need for harder, expert-written evaluations.
-FrontierScience has two tracks: Olympiad (theoretical reasoning) and Research (real-world, multi-step tasks).
-GPT-5.2 leads performance, scoring 77% on Olympiad tasks and 25% on Research tasks.