What Happened

Simon Willison, creator of the popular llm Python CLI and library, has begun laying the groundwork for a major overhaul of the tool's abstraction layer. The project, documented in a new public repository tagged research-llm-apis 2026-04-04, is a preparatory research phase aimed at handling vendor features that the current abstraction cannot support, most notably server-side tool execution.

The llm library currently provides a unified interface over hundreds of models from dozens of vendors through a plugin system. As providers like Anthropic, OpenAI, Google (Gemini), and Mistral have added new capabilities over the past year, the existing abstraction layer has begun to show its limits.

Technical Deep Dive

To understand the raw API surface across providers, Willison employed Claude Code to read through the official Python client libraries for all four vendors and generate curl commands that hit the underlying JSON APIs directly. The goal was to capture both streaming and non-streaming response shapes across a range of scenarios.
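The repo's captures were generated with curl, but the same exercise is easy to reproduce in Python. Below is a minimal sketch, assuming httpx is installed and an ANTHROPIC_API_KEY environment variable is set, that hits Anthropic's Messages API directly and saves the raw text/event-stream lines; the model id and output filename are placeholders:

```python
# Minimal sketch: capture a raw streaming response from Anthropic's
# Messages API, bypassing the official SDK. This mirrors what the
# repo's generated curl commands do.
import os

import httpx

headers = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}
body = {
    "model": "claude-sonnet-4-20250514",  # substitute any current model id
    "max_tokens": 256,
    "stream": True,
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
}

with httpx.stream(
    "POST", "https://api.anthropic.com/v1/messages",
    headers=headers, json=body, timeout=60,
) as r:
    r.raise_for_status()
    with open("anthropic-stream-capture.txt", "w") as f:
        for line in r.iter_lines():
            f.write(line + "\n")  # raw server-sent-event lines, unparsed
```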

The output of this research — including the generated scripts and captured JSON responses — now lives in a dedicated GitHub repository. This approach is methodologically notable: rather than reading documentation (which often lags implementation), Claude Code analyzed the actual client library source code to infer what the APIs do in practice.

Why Server-Side Tool Execution Breaks the Current Model

The current llm abstraction assumes a request-response loop where tool calls are handled client-side. Anthropic's and OpenAI's APIs now support server-side tool execution, where the provider's infrastructure can call tools and return results without the client managing the loop. This fundamentally changes the call signature, the streaming event types, and the state machine a client needs to implement.
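To make the contrast concrete, here is a schematic of the loop today's abstraction assumes. The function names and message shapes below are illustrative stand-ins, not llm's actual API:

```python
# Schematic of the client-side tool loop that today's abstraction assumes.
# call_model and execute_tool are stubs; the dict shapes are loosely
# modeled on Anthropic/OpenAI tool responses, not exact schemas.

def call_model(messages, tools):
    # Stub: a real implementation would make one HTTP round trip.
    return {"stop_reason": "end_turn", "content": "42", "tool_calls": []}

def execute_tool(name, arguments):
    # Stub: a real implementation would dispatch to a local function.
    return f"ran {name} with {arguments}"

def run_with_client_side_tools(messages, tools):
    while True:
        response = call_model(messages, tools)
        if response["stop_reason"] != "tool_use":
            return response  # model produced a final answer
        # The client owns this part of the state machine today:
        for call in response["tool_calls"]:
            result = execute_tool(call["name"], call["input"])
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": result}
            )
```

With server-side execution, the provider runs that loop itself, so the client's job shifts from driving the loop to interpreting a richer stream of intermediate events.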

For example, a streaming response with server-side tool use might emit event types like:

  • tool_use blocks mid-stream in Anthropic's API
  • tool_calls deltas in OpenAI's streaming chunks
  • functionCall parts in Gemini's GenerateContentResponse

Each vendor uses different field names, different chunking strategies, and different conventions for signaling tool completion. The current llm plugin interface doesn't expose enough surface area for plugin authors to handle these differences correctly.
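One plausible shape for the new abstraction is a normalization layer that maps each vendor's events onto a common type. The sketch below is speculative; the event dicts are simplified versions of each provider's streaming payloads, not their exact schemas:

```python
# Speculative sketch of a cross-vendor normalization layer for streaming
# tool-call events. Event shapes are simplified for illustration.
from dataclasses import dataclass


@dataclass
class ToolCallEvent:
    provider: str
    tool_name: str
    arguments_fragment: str  # providers stream tool arguments incrementally


def normalize(provider: str, event: dict) -> ToolCallEvent | None:
    if provider == "anthropic" and event.get("type") == "content_block_delta":
        delta = event["delta"]
        # Anthropic streams tool input as input_json_delta fragments.
        if delta.get("type") == "input_json_delta":
            return ToolCallEvent("anthropic", "", delta["partial_json"])
    if provider == "openai":
        # OpenAI chat chunks carry tool_calls deltas with index offsets.
        for call in event.get("choices", [{}])[0].get("delta", {}).get("tool_calls", []):
            fn = call.get("function", {})
            return ToolCallEvent("openai", fn.get("name", ""), fn.get("arguments", ""))
    if provider == "gemini":
        # Gemini puts functionCall parts inside candidate content parts.
        for part in event.get("candidates", [{}])[0].get("content", {}).get("parts", []):
            if "functionCall" in part:
                fc = part["functionCall"]
                return ToolCallEvent("gemini", fc.get("name", ""), str(fc.get("args", {})))
    return None  # not a tool-related event
```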

The Research Artifact

The repository contains curl commands and raw JSON captures for each provider in both streaming (text/event-stream) and non-streaming modes. This gives the project a concrete, versioned reference point for what each API actually returns today — something that will inform the new abstract base classes and plugin protocol Willison designs next.

A typical non-streaming capture for a tool-use scenario would include the full stop_reason, tool_use content block, and input JSON that the model decided to pass to the tool. The streaming equivalent shows how those same fields arrive as incremental delta events with index offsets.
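For illustration, an abridged Anthropic capture might look like the following; the values are invented, but the field names follow Anthropic's documented tool-use schema:

```python
# Abridged illustration of an Anthropic non-streaming tool-use response.
# Values are invented; field names follow the documented schema.
non_streaming_capture = {
    "stop_reason": "tool_use",
    "content": [
        {"type": "text", "text": "I'll check the weather."},
        {
            "type": "tool_use",
            "id": "toolu_01A...",  # truncated for illustration
            "name": "get_weather",
            "input": {"city": "Half Moon Bay"},
        },
    ],
}

# In streaming mode the same information arrives incrementally: a
# content_block_start announcing the tool_use block at a given index,
# content_block_delta events whose input_json_delta carries partial_json
# fragments of the input, then content_block_stop and a message_delta
# with stop_reason "tool_use".
```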

Who Should Care

Authors of llm plugins should pay close attention. Any plugin currently wrapping a provider that has added tool execution, extended thinking, or other stateful features will likely need to be updated once the new abstraction ships. Willison's research phase signals that a breaking, or at least additive, change to the plugin protocol is coming.

Python developers building on top of the llm CLI for scripting or automation workflows should be aware that the underlying plugin API is in flux. The current plugin interface — centered on Model, Response, and Conversation classes — may gain new optional methods or abstract properties.
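For reference, the current model plugin pattern, per llm's plugin tutorial, looks roughly like this; EchoModel is an illustrative toy, not a real provider wrapper:

```python
# The shape of an llm model plugin today: register a Model subclass
# whose execute() yields text chunks. EchoModel is a toy example.
import llm


@llm.hookimpl
def register_models(register):
    register(EchoModel())


class EchoModel(llm.Model):
    model_id = "echo"  # illustrative plugin id

    def execute(self, prompt, stream, response, conversation):
        # Today execute() yields plain text chunks; server-side tool
        # execution would need richer event types than str.
        yield f"echo: {prompt.prompt}"
```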

Tooling teams at AI vendors may find the research repo useful as an independent, third-party snapshot of how their streaming and non-streaming APIs behave in practice, compared to competitors.

What To Do This Week

  • Star or watch the research-llm-apis repository on GitHub to track when Willison begins translating the research into actual interface proposals.
  • If you maintain an llm plugin, audit whether your provider now supports server-side tool execution and document any streaming event types your current implementation silently drops.
  • Run the existing llm CLI against your provider and compare llm --no-stream output with the default streaming mode to understand what your plugin currently surfaces versus what it discards; see the sketch after this list.
  • Review the Anthropic, OpenAI, and Gemini Python SDK source on GitHub directly — the same exercise Willison used Claude Code to automate — to identify any new response fields added in the last six months that your integration doesn't handle.
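For the capture exercise above, here is one minimal way to script the comparison, assuming llm is installed with a default model and API key configured:

```python
# Run the llm CLI in both streaming and non-streaming modes and save
# the outputs for comparison. Assumes `llm` is on PATH with a default
# model configured.
import subprocess

PROMPT = "List three colors as JSON."

for label, extra_args in [("streaming", []), ("no-stream", ["--no-stream"])]:
    result = subprocess.run(
        ["llm", PROMPT, *extra_args],
        capture_output=True, text=True, check=True,
    )
    with open(f"capture-{label}.txt", "w") as f:
        f.write(result.stdout)

# Diff the two files to check whether your plugin produces the same
# final text in both modes.
```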

The immediate deliverable from this research phase is a versioned JSON reference corpus. The abstraction redesign itself hasn't been proposed yet, but having clean empirical data on what each API returns is the right prerequisite for designing an interface that doesn't paper over important differences.