From the Ancient Greek: the act of pure beholding, seeing what is before naming it.


Here I share some of my recent research and the things I am most excited about for the coming year.

For the past year, together with our team at the Center for Decoding the Universe @Stanford, we've been trying to answer a simple but critical question: can AI agents replicate our scientific research? Can they set up the experimental pipeline, code the method, and get the answer we did through many restless days (and nights) of thinking and debugging?

To find out, we built ReplicationBench, a bold project led by the one and only Christine Ye working closely with Sihan Yuan and Suchetha Cooray as well as our team of astrophysicists and computer scientists at Stanford and SLAC. We took 19 peer-reviewed astrophysics papers and decomposed them into 121 expert-written tasks, each co-developed with the original paper authors, each targeting a key scientific result. The tasks span experimental setup, derivations, data analysis, and full codebase implementation. We evaluated frontier language models in agentic environments, measuring both faithfulness (did the agent adhere to the original methods?) and correctness (did it get the right answer?).
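To make the two-axis evaluation concrete, here is a minimal sketch of how one benchmark task might be represented and scored. The field names, tolerance rule, and values are purely illustrative assumptions, not ReplicationBench's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ReplicationTask:
    """One expert-written task derived from a source paper (illustrative schema)."""
    paper_id: str
    description: str       # e.g. "recover the best-fit power-law slope"
    expected_value: float  # the key scientific result the agent must reproduce
    tolerance: float       # acceptable relative deviation from that result

def score(task: ReplicationTask, agent_value: float, followed_method: bool) -> dict:
    """Score an agent run on two axes: faithfulness (did it adhere to the
    original method?) and correctness (did it get the right answer?)."""
    correct = abs(agent_value - task.expected_value) <= task.tolerance * abs(task.expected_value)
    return {"faithfulness": followed_method, "correctness": correct}

task = ReplicationTask("paper-07", "recover best-fit slope",
                       expected_value=1.90, tolerance=0.05)
print(score(task, agent_value=1.85, followed_method=True))
```

The point of the split is that the two axes can disagree: an agent that lands on the right number via a simplified shortcut scores on correctness but not faithfulness, and vice versa.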

The best-performing model at the time scored just under 20%.

When we analyzed the agent trajectories with domain experts, we found a rich set of failure modes: agents that submitted early, incorrectly claiming the task was impossible. Agents that executed plausible-looking but wrong procedures. Agents that defaulted to simplified methodologies when the real work required careful, context-dependent choices. Agents that simply gave up.

Astrophysics is a particularly good testbed for this because the workflows are entirely computational. We have archival data, open-source code, strong reproducibility norms. If agents can't replicate research here, in a domain purpose-built for reproducibility, the implications extend well beyond our field.

You might ask what comes next. Once agents can do our science, can they do new science? Stay tuned for our upcoming project!

Read the paper: ReplicationBench


In parallel, with the wonderful team at UniverseTBD, we've been working on the other side of the problem: can AI systems generate scientific hypotheses that are both novel and feasible? Through Sparks of Science, we built HypoGen, a dataset of structured problem-hypothesis pairs using what we call a Bit-Flip-Spark schema, where the Bit is the conventional assumption, the Flip is the new direction, and the Spark is the key insight. It's an attempt to formalise the creative leap at the heart of research, and to measure whether language models can learn to make it.
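As a rough sketch, one record in such a schema could look like the following. The field names and the example hypothesis are my own illustration, not HypoGen's actual format or contents:

```python
from dataclasses import dataclass

@dataclass
class BitFlipSpark:
    """One structured problem-hypothesis pair (illustrative field names)."""
    bit: str    # the conventional assumption
    flip: str   # the new direction that overturns it
    spark: str  # the key insight connecting the two

example = BitFlipSpark(
    bit="Galaxy classification requires large labelled training sets.",
    flip="Self-supervised pretraining on unlabelled surveys can replace most labels.",
    spark="Morphological structure is learnable from the data itself.",
)
print(example.spark)
```

Structuring hypotheses this way makes the "creative leap" an explicit, comparable object rather than free text, which is what lets you ask whether a model's generated Flips are genuinely novel given the Bit.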

Read the paper: Sparks of Science


Together with my dear friend Pranav Agarwal (Wayve), we've also been exploring what these models reveal about themselves when they reason. With Supernova Event Dataset, we asked LLMs to extract and rank the most critical events from biographies, historical moments, and scientific discoveries — a subjective, long-context task that requires causal reasoning with no single right answer. What we found is that different models exhibit distinct and consistent personality-like patterns: some reason emotionally, others analytically, others through strategic framing. Understanding these patterns matters if we want to build systems we can actually trust with scientific thought.

Read the paper: Supernova Event Dataset


So where do we go from here? What am I excited about?

I am now obsessed with what it would take to build multi-agent cognitive architectures capable of driving the next scientific breakthrough. Not a single model that knows everything, but systems where multiple AI agents, each with different roles, different epistemic tendencies, different failure modes, collaborate with human scientists in ways we are yet to identify. What would it look like to design an architecture where the agents challenge each other, where the human remains a deep thinker in the loop, where the system as a whole sees further than any single mind?

We don't have the answer yet. But I believe the path forward is a mix of understanding where agents fail, how they shift the texture of human thought, and how creative leaps happen. I find it likely that the next breakthrough won't come from either AI or humans alone. It will come from the architecture of the collaboration itself.

More here


Ioana (Jo) Ciucă is a Research Scientist at Stanford's Kavli Institute for Particle Astrophysics and Cosmology, and co-founder of UniverseTBD. Reach her at iciuca@stanford.edu