This is a Plain English Papers summary of a research paper called PaperBanana: Automating Academic Illustration for AI Scientists. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.
The bottleneck nobody talks about
AI researchers can now automate almost everything. Language models can design experiments. Neural networks can optimize hyperparameters. Autonomous systems can run training loops, analyze results, and even discover novel algorithms. But then comes the moment they need to publish: creating the diagrams, the figures, the illustrations that actually communicate what they've built. And suddenly, all that automation grinds to a halt.
This is the core problem PaperBanana addresses. After spending weeks developing a method, running experiments, and validating results, researchers still face hours of manual labor creating publication-ready illustrations. A diagram explaining your approach. A figure comparing your method to baselines. Visualizations that make your findings clear to someone reading the paper for the first time. These should be routine, but they're not. They require decisions that machines haven't learned to make.
The irony cuts deep: we've solved the hard part, the actual research, but not the communication part. That's backwards. And it creates a ceiling on what can be automated. A fully autonomous AI researcher would still need humans in the loop to make it publication-ready.
Why this matters beyond time savings
It's tempting to dismiss illustration creation as a minor chore, a final step in a longer workflow. But that misunderstands how science actually gets communicated. When you open a paper, you probably skip the abstract and jump straight to the figures. Those illustrations are where the core idea lives, where insights become intuitive. A muddy diagram can bury a brilliant finding. A clear one can make a complex method feel obvious.
This is why illustration quality shapes how ideas spread through the research community. Poor visualization means good insights travel slower. They get cited less often. They influence fewer downstream projects. In a world of rapidly accelerating research output, the papers with clear figures are the ones that stick in people's minds.
For domains like computer vision, robotics, and biology, this becomes even more critical. Illustrations aren't supplementary, they're essential. A system that can't generate good figures limits what research can be automated at all. You can't have a truly autonomous researcher if it can't communicate its own findings.
The deeper problem: illustration creation requires exactly the kind of human judgment that seems irreducibly creative. You need to understand what details matter in your field. You need aesthetic sensibility. You need to iterate based on feedback. These are the things we assume machines can't do well.
How human illustrators actually think
Before building a system to automate illustration, you need to understand what decisions human illustrators actually make. The process looks deceptively simple from the outside, but it's structured underneath.
A skilled illustrator doesn't start drawing. They start by reading and questioning. What's the core concept I'm communicating? What do my readers already know about this field? Which details would clarify the idea, and which would only confuse? What visual style matches the publication and the specific field? Should I use photographs, technical diagrams, or abstract shapes? How much text versus pure visuals? This entire decision-making phase happens before a single line gets drawn.
The workflow decomposes into stages:
Retrieval involves understanding context. The illustrator reads the relevant section, looks at similar illustrations in published papers, and builds a mental model of what conventions exist and what styles are appropriate.
Planning means making explicit choices about content (what to include, what to leave out), structure (how to arrange elements spatially), and style (visual language, color schemes, level of abstraction). This is where the illustrator's plan takes shape.
Rendering is execution, using whatever tools and techniques suit the plan.
Critique is comparing the result against the original intent. Does it match what you intended to communicate? Would someone unfamiliar with the work understand it? Does it fit the publication's visual standards? Usually this leads to iteration and refinement.
This is structured decision-making, not a single creative leap. That structure matters enormously because it means an AI system can potentially learn to follow similar logic.
Building an AI that follows the illustrator's workflow
PaperBanana works by creating specialized agents that each handle one stage of the workflow. Instead of a single model trying to magically generate publication-ready illustrations, the system orchestrates multiple components that think sequentially: retrieve context, plan the content and style, render an image, then critique the result and iterate.
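To make that orchestration concrete, here is a minimal sketch of the loop in Python. Everything in it is an assumption for illustration: the function names, the spec format, and the stub bodies are placeholders for the paper's retrieval, planning, rendering, and critique agents, not its actual interface.

```python
# A minimal sketch of the retrieve -> plan -> render -> critique loop. The function
# names below are placeholders, not the paper's actual API; each stub stands in
# for a call to a retrieval index, a vision-language model, or an image generator.

def retrieve_references(section_text):
    """Stub: would return similar published figures to ground the plan."""
    return []

def plan_illustration(section_text, references):
    """Stub: a VLM would turn the section plus references into a natural-language spec."""
    return "4-part process diagram, schematic style, professional palette"

def render_image(spec):
    """Stub: an image-generation model would execute the spec."""
    return b""

def critique_image(image, spec):
    """Stub: a VLM would score faithfulness, conciseness, readability, aesthetics."""
    return {"passes": True, "issues": []}

def generate_figure(section_text, max_rounds=3):
    references = retrieve_references(section_text)       # 1. ground in real examples
    spec = plan_illustration(section_text, references)   # 2. commit to content and style
    image = render_image(spec)                            # 3. execute the brief
    for _ in range(max_rounds):                           # 4. critique, then refine if needed
        review = critique_image(image, spec)
        if review["passes"]:
            break
        spec += " Revise to address: " + "; ".join(review["issues"]) + "."
        image = render_image(spec)
    return image
```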
The reference retrieval agent starts by grounding the system in reality. It takes the paper section and searches for published illustrations from similar papers, especially recent publications from venues like NeurIPS. This is crucial because image generation models are trained on internet images, not academic papers specifically. Without this grounding step, the system would hallucinate what academic illustrations should look like, generating beautiful but inappropriate visuals. Real examples teach the system what actually works in the domain.
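The summary doesn't spell out how that search works, but one plausible implementation is embedding-based similarity over a corpus of published figures and their captions. The sketch below assumes exactly that; the `encode` stub and the corpus schema are invented for illustration and flesh out the `retrieve_references` placeholder from the earlier sketch.

```python
# Hypothetical retrieval over a corpus of published figures; encode() is a
# placeholder for any text encoder, and the corpus schema is assumed.

import numpy as np

def encode(text):
    """Stub encoder: pseudo-embedding keyed to the text so the sketch runs without a real model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

def retrieve_references(section_text, corpus, k=5):
    """Return the k figures whose captions are most similar to the paper section."""
    query = encode(section_text)
    ranked = sorted(corpus, key=lambda item: float(query @ encode(item["caption"])), reverse=True)
    return ranked[:k]

# Usage: a corpus entry might look like
# {"caption": "Overview of our two-stage training pipeline", "image_path": "figs/overview.png"}
```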
The content and style planning agent is where the actual decision-making happens. Using a vision language model, it reads the paper section and decides what the illustration should communicate. Then, using the retrieved references as examples, it selects an appropriate visual style. The output isn't an image yet but a detailed specification in natural language: "Generate a 4-part process diagram. Part 1 shows input data using realistic data visualization. Part 2 shows the novel filtering step using abstract geometric elements. Parts 3 and 4 show standard processing using schematic style. Use a professional color palette consistent with the style patterns found in recent papers."
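As a rough illustration of what such a specification might capture before it is flattened into a prompt, here is one possible structure. The `IllustrationSpec` fields and the `to_prompt` helper are assumptions, not the paper's schema, which stays in natural language.

```python
# Illustrative structure for a planned illustration; the real system emits a
# natural-language brief, so this dataclass and to_prompt() are assumptions.

from dataclasses import dataclass, field

@dataclass
class IllustrationSpec:
    layout: str                                   # e.g. "4-part process diagram"
    panels: list = field(default_factory=list)    # what each panel should show
    style: str = "schematic"                      # visual language chosen from references
    palette: str = "professional, consistent with recent papers"

def to_prompt(spec):
    """Flatten the structured plan into the brief handed to the image generator."""
    panels = " ".join(f"Part {i + 1}: {panel}." for i, panel in enumerate(spec.panels))
    return f"Generate a {spec.layout}. {panels} Style: {spec.style}. Color palette: {spec.palette}."

spec = IllustrationSpec(
    layout="4-part process diagram",
    panels=[
        "input data, shown as a realistic data visualization",
        "the novel filtering step, shown with abstract geometric elements",
        "standard processing, schematic style",
        "standard processing, schematic style",
    ],
)
print(to_prompt(spec))
```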
This planning step is essential for faithfulness. The system isn't trying to reverse-engineer both content and style from scratch. It's explicitly committing to what should be drawn before the rendering phase begins. The specification is written in natural language, which makes it verifiable and debuggable.
The image generation agent takes that detailed plan and creates the actual visual. It uses state-of-the-art image generation models, but the advantage of having a clear plan is that the generator isn't operating in a vacuum. It's executing a well-specified brief. This consistency is what prevents the classic failure mode where generated images miss key details or introduce incorrect elements.
But generation is where most automated systems stop. PaperBanana doesn't.
The self-critique loop that makes it work
The real innovation is the self-critique agent. After generating an image, a vision language model evaluates it against a rubric of publication-ready criteria (a code sketch of this rubric follows the list):
Faithfulness: Does the illustration accurately represent what the paper claims? Does it match the content specification that was planned?
Conciseness: Is anything unnecessary or unclear? Are key elements present or missing? Could it be simplified?
Readability: Is the text legible? Are elements clearly distinguishable? Would someone unfamiliar with the research understand it?
Aesthetics: Does it look professional? Do the colors work well together? Is the layout balanced? Would it fit in a published paper?
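The four criteria above come straight from the system's rubric; everything else in the sketch below, the question wording, the 1-to-5 scale, the pass threshold, and the `vlm_score` stub, is an assumption about how such a critique step could be wired up. It expands the `critique_image` placeholder from the pipeline sketch.

```python
# The four criteria come from the paper; the question wording, 1-5 scale, and
# pass threshold below are assumptions, and vlm_score stands in for a real
# vision-language model call.

RUBRIC = {
    "faithfulness": "Does the image match the planned content and the paper's claims?",
    "conciseness": "Is anything unnecessary, unclear, or missing? Could it be simplified?",
    "readability": "Is all text legible, and are elements distinguishable to a newcomer?",
    "aesthetics": "Are layout, color, and visual hierarchy at publication standard?",
}

def vlm_score(image, spec, question):
    """Stub scorer: a real system would ask a VLM to rate the image 1-5 on the question."""
    return 5

def critique_image(image, spec):
    """Score the image on each criterion and collect the ones that fall short."""
    scores = {name: vlm_score(image, spec, question) for name, question in RUBRIC.items()}
    issues = [name for name, score in scores.items() if score < 4]  # assumed pass threshold
    return {"passes": not issues, "scores": scores, "issues": issues}
```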
When the system detects problems, it doesn't just fail. It generates a refined prompt that specifically addresses the detected issues and tries again. This isn't brute-force retry, it's targeted refinement based on understanding what went wrong.
This is the moment where the system becomes genuinely useful. Generated images are often imperfect in subtle ways, but the self-critique loop creates a feedback mechanism that converges toward publication quality. A methodology diagram might iterate twice. A complex comparison figure might need three cycles. The system typically reaches acceptable quality in 2-3 iterations, which is cheap compared to manual creation.
The critique step forces the system to ask not just whether its output is technically correct, but whether it actually communicates. That distinction is what separates a nice demo from a publication-ready tool.
Measuring effectiveness across real papers
To know whether this actually works, the researchers created PaperBananaBench, a rigorous evaluation framework built from real academic papers. They curated 292 illustration tasks directly from NeurIPS 2025 publications, covering diverse research domains and multiple illustration types: methodology diagrams, experimental comparisons, conceptual illustrations, and statistical plots.
Each task pairs the original paper section with the published illustration and context on what makes that specific illustration good. This grounds evaluation in real publication standards rather than abstract aesthetic principles. A methodology diagram for a vision paper has different requirements than a systems architecture diagram, and the benchmark captures that variation.
Evaluation measured performance across the four core dimensions (a sketch of the task format and scoring follows the list):
Faithfulness captures whether the generated illustration accurately represents the paper's claims. This is non-negotiable for scientific communication.
Conciseness measures whether the illustration is streamlined or unnecessarily cluttered. Precision of expression matters.
Readability tests whether someone unfamiliar with the work can understand the illustration. This is the litmus test for communication effectiveness.
Aesthetics measures whether the result looks publication-ready. Does it match professional standards for layout, color, and visual hierarchy?
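Putting the task format and the four dimensions together, a benchmark record and its scoring might look roughly like the sketch below. The field names and the `judge` stub are illustrative guesses, not the released schema.

```python
# Hypothetical shape of a PaperBananaBench task and its evaluation; field names
# and the judge() interface are assumptions, not the released benchmark schema.

from dataclasses import dataclass

DIMENSIONS = ("faithfulness", "conciseness", "readability", "aesthetics")

@dataclass
class BenchmarkTask:
    paper_id: str           # source NeurIPS 2025 paper
    section_text: str       # the section the figure should explain
    reference_image: str    # path to the published illustration (ground-truth context)
    illustration_type: str  # e.g. "methodology", "comparison", "conceptual", "statistical"

def judge(image, task, dimension):
    """Stub rater: a real evaluation would query a VLM with the task's reference figure."""
    return 5

def evaluate(task, generated_image):
    """Score a generated figure on each of the four benchmark dimensions."""
    return {dim: judge(generated_image, task, dim) for dim in DIMENSIONS}
```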
Comparing against baselines reveals where the multi-stage approach adds value. Using image generation models directly, without planning or critique, produces lower quality across all dimensions. Template-based diagramming tools lack the visual sophistication of generation. Manual creation remains the gold standard, but PaperBanana closes the gap substantially.
The benchmark itself becomes a resource for the research community, providing a standard way to evaluate illustration systems going forward.
What changes when illustration becomes cheap
The immediate impact is straightforward: researchers save hours on every paper. That compounds across thousands of researchers, but the structural implications are larger.
Right now, illustration quality depends heavily on how much time a researcher is willing to invest or how much they can afford to pay a designer. Remove that friction, and visual quality becomes decoupled from design skill. Good ideas stop getting buried in poorly designed figures just because their creators aren't skilled illustrators.
Better illustrations also become the norm because generating multiple variations and selecting the best becomes trivial. You don't pick your first attempt if you can generate five alternatives in seconds and choose the clearest one.
This shifts where the publishing bottleneck actually lives. Illustration stops being a constraint. Now the friction is in writing, in getting feedback from reviewers, in conceptualization itself. You've raised the minimum bar for what counts as a publishable figure.
For autonomous AI researchers, this closes a critical loop. A system that can conduct experiments, analyze results, and communicate findings in publication-ready form is genuinely autonomous. It doesn't need humans for the final mile of communication. This is a prerequisite for the "AI scientist" vision that researchers are actively pursuing.
There are open questions. How does this affect human illustrators? Likely it frees them from routine diagrams to focus on more creative and strategic work, the kind that requires deeper understanding of the research landscape. What happens when every paper has professional illustrations? Does it raise community standards for clarity overall, or does it flatten some of the signal that careful visualization previously provided?
Perhaps most importantly: can the system handle entirely novel visualization types, or is it limited to patterns it's seen before? That will determine whether this scales to truly novel research domains or serves best as an enhancement to established practices.
The broader pattern here is significant. We're watching a skill that seemed irreducibly creative, that required human judgment and aesthetic sensibility, become automatable through the right combination of multi-agent orchestration and self-critique. PaperBanana is one example of this pattern. It won't be the last.
