Prompt construction · AI-era drift

Prompts are code now — even OpenAI says so

On November 30, 2026, OpenAI is shutting down its managed Prompts API. Their migration guide tells customers what to do instead, and it reads like a manifesto.

A year earlier, Humanloop — one of the most engineering-grade prompt platforms — shut down after its team joined Anthropic.

The signal is hard to miss. The first generation of prompt management treated prompts like CMS content: strings in someone else's cloud, edited in a dashboard, fetched at runtime. That model is being abandoned by its biggest exponent, and the replacement guidance — typed inputs, version control, build-time verification, cache-stable rendering — describes something else entirely.

It describes code.

We already learned this lesson once

Twenty years ago, SQL scattered through string concatenation gave us injection bugs and unrefactorable data access — so we moved queries behind typed interfaces. Configuration scattered through environments gave us "works on my machine" — so we moved it into declarative, versioned files.

Prompts are at that same inflection point, with a forcing function the earlier shifts didn't have: AI now writes much of your code. Every regeneration is a chance for your prompts to quietly disagree with the model they describe — a renamed field, a changed enum, a restructured payload. A prompt that drifts doesn't throw an exception. It just gets worse, silently, in production.

So what would it actually take to treat a prompt as code? Four things:

  1. Typed inputs — the prompt's variables come from a declared, typed structure, not ad-hoc string substitution.
  2. Build-time verification — rename a field the prompt uses, and the build fails.
  3. Deterministic rendering — the same inputs produce byte-identical prompt text (your snapshot tests and your prompt cache both depend on this).
  4. Typed output handling — the response contract is declared once, and parsing — including recovering from the malformed output real models actually produce — is generated from it.
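As a rough illustration of the first requirement, here is a minimal Python sketch of a prompt whose variables come only from a declared structure. The names `TicketSummaryPayload`, `TICKET_TEMPLATE`, and `render_ticket_summary` are hypothetical and exist only for this example; they are not part of any tool discussed in this post.

```python
from dataclasses import asdict, dataclass
from string import Template


@dataclass(frozen=True)
class TicketSummaryPayload:
    """Hypothetical typed payload: the only source of prompt variables."""
    customer_name: str
    issue: str


# Hypothetical template text; in a real project it would live in a versioned file.
TICKET_TEMPLATE = Template("Summarize the ticket from $customer_name.\nIssue: $issue\n")


def render_ticket_summary(payload: TicketSummaryPayload) -> str:
    # Template.substitute raises KeyError when the text names a variable the
    # payload does not provide, so a mismatch fails loudly instead of leaking
    # a raw "$placeholder" into the prompt.
    return TICKET_TEMPLATE.substitute(asdict(payload))


prompt = render_ticket_summary(
    TicketSummaryPayload(customer_name="Ada", issue="login loop")
)
print(prompt)
```

This only fails at run time, when the prompt is rendered. The second requirement asks for the same mismatch to be caught earlier, during the build.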

Here's the uncomfortable part: almost nothing in the current tool landscape meets these requirements.

The prompt platforms don't type your variables — none of them

I went through the docs of every major prompt-management platform: Langfuse, LangSmith, PromptLayer, Agenta, Braintrust. They're genuinely good at what they're built for — versioned templates, deployment labels, A/B testing, dashboards, non-engineer editing.

But on the four requirements above, they fall short, starting with the first: none of them types your variables.

And the typed-extraction tools (Instructor, Outlines, Pydantic-AI, LangChain's structured output, TypeChat, BAML) own the output half of the problem, with real sophistication. But the same gaps run across all of them. Only BAML has a genuinely cross-language schema, and even it has no native Java, Kotlin, or C# support. Repairs are either silent (BAML's Schema-Aligned Parsing), bought with another LLM call (everyone's retry loops), or delegated to vendor-locked constrained decoding. No tool in the field produces a recovery report telling you what it fixed, and no tool repairs XML. And none of them connects the prompt's schema to the rest of your application: the database column, the API field, the UI form that hold the same data.

That last gap is the one that matters most, because prompt drift is almost never a prompt-only event. The field got renamed in the domain model. The prompt was just the surface nobody checked.

What prompts-as-code looks like in practice

This is the problem MetaObjects was built around. It's an open-source (Apache 2.0) metadata standard: you declare your data model once, in plain YAML in your repo, and it generates idiomatic code in TypeScript, C#, Java, Kotlin, and Python — entities, schema, API routes, and prompts, all from the same model.

(It grew out of a mess of my own — a few thousand lines of StringBuilder driving the characters in an LLM game. For the origin story, and the data + text + render decomposition that became the fix, the longer essay is The prompt is code — and yours is drifting too.)

A prompt declares its input as a typed payload — a projection of the same entities that drive your database and API:

# prompt = typed payload + external text
object.value:
  name: AuthorBlurbPayload
  children:
    - field.string: { name: authorName }
    - field.string: { name: bio }

template.prompt:
  name: authorBlurb
  payloadRef: AuthorBlurbPayload
  textRef: author/blurb        # template text lives in a versioned file, not a cloud

From that one declaration you get a generated, typed render function per language — deterministic and byte-identical across all five ports (there's a shared conformance corpus that proves it), so snapshot tests pass and your exact-prefix prompt cache stays warm. Nothing reformats your prompt behind your back.
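To make "byte-identical" concrete, here is a small Python sketch of the property a snapshot test would check. The `render_blurb` and `snapshot_digest` functions are hypothetical stand-ins written for this example, not the generated MetaObjects code.

```python
import hashlib


def render_blurb(author_name: str, bio: str) -> bytes:
    """Hypothetical render function: a pure function of its inputs.

    It uses no timestamps, random IDs, or locale-dependent formatting, and
    its line endings are fixed, so the bytes depend only on the arguments.
    """
    text = f"Write a one-line blurb for {author_name}.\nBio: {bio}\n"
    return text.encode("utf-8")


def snapshot_digest(rendered: bytes) -> str:
    # A digest over the exact bytes: any silent reformatting changes it.
    return hashlib.sha256(rendered).hexdigest()


first = render_blurb("Ada Lovelace", "Mathematician and writer.")
second = render_blurb("Ada Lovelace", "Mathematician and writer.")

# Same inputs, same bytes, same digest: the snapshot stays valid.
assert first == second
assert snapshot_digest(first) == snapshot_digest(second)
```

Comparing a stored digest against a fresh render is one simple way to notice when something upstream has started changing whitespace or ordering, which is also what would invalidate an exact-prefix cache.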

And because the payload is declared, the build can check it. Rename authorName in the model — or reference a variable the payload doesn't have — and:

$ meta verify
[authorBlurb] (prompt) ERR_VAR_NOT_ON_PAYLOAD: displayName
meta verify — 1 drift error(s) across 1 template(s).   # exit 1 — the build fails

That's the moment a prompt stops being a string and becomes code: the renamed field is a compile error now, not a quality regression you discover in three weeks.
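The idea behind such a check can be sketched in a few lines of Python. This is an illustration of the technique only, not how `meta verify` is implemented: the `{{name}}` placeholder syntax, the `verify_template` function, and the dataclass standing in for the declared payload are all assumptions made for the example. Only the error line format is copied from the output above.

```python
import re
from dataclasses import dataclass, fields


@dataclass
class AuthorBlurbPayload:
    # Stand-in for the payload declared in the YAML above (authorName, bio).
    authorName: str
    bio: str


# Assumed placeholder syntax for this sketch: {{name}}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def verify_template(template_name: str, text: str, payload_type: type) -> list[str]:
    """Return one error line per template variable missing from the payload."""
    declared = {f.name for f in fields(payload_type)}
    used = sorted(set(PLACEHOLDER.findall(text)))
    return [
        f"[{template_name}] (prompt) ERR_VAR_NOT_ON_PAYLOAD: {name}"
        for name in used
        if name not in declared
    ]


errors = verify_template(
    "authorBlurb",
    "Write a blurb for {{displayName}}. Bio: {{bio}}",
    AuthorBlurbPayload,
)
for line in errors:
    print(line)
# A build script would exit non-zero whenever errors is non-empty.
```

Because the check needs only the template text and the declared field names, it can run in CI without calling a model.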

The output side is declared the same way, and one declaration drives three generated artifacts: the output-format instructions injected into the prompt, a strict typed parser, and — the part I haven't seen anywhere else — a tolerant extractor for the output real models actually produce: prose wrappers, code fences, unclosed XML tags, off-vocabulary enum values. It returns best-effort typed data plus a structured recovery report of what was repaired and what was lost, and it never throws. JSON and XML both — XML matters because a missing close tag is locally repairable, while a missing JSON brace corrupts everything after it.
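A heavily simplified Python sketch of the tolerant-extraction idea follows. It handles only two of the cases named above (text wrapped around a JSON object, and an off-vocabulary enum value) and leaves XML repair out entirely. The `Extraction` type, the `extract_blurb` function, and the `tone` vocabulary are hypothetical names for this example, not the generated extractor.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

ALLOWED_TONES = {"formal", "casual"}  # hypothetical enum vocabulary
DEFAULT_TONE = "formal"               # hypothetical fallback value


@dataclass
class Extraction:
    data: Optional[dict]                         # best-effort result, or None
    repairs: list = field(default_factory=list)  # what was fixed
    losses: list = field(default_factory=list)   # what could not be recovered


def extract_blurb(raw: str) -> Extraction:
    """Tolerant extraction sketch: reports problems instead of raising."""
    result = Extraction(data=None)
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        result.losses.append("no JSON object found")
        return result
    if raw[:start].strip() or raw[end + 1:].strip():
        # Covers prose wrappers and code fences around the object.
        result.repairs.append("stripped text around JSON object")
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        result.losses.append(f"unparseable JSON: {exc.msg}")
        return result
    tone = parsed.get("tone")
    if not isinstance(tone, str) or tone not in ALLOWED_TONES:
        normalized = str(tone).strip().lower()
        if normalized in ALLOWED_TONES:
            parsed["tone"] = normalized
            result.repairs.append(f"normalized tone {tone!r} to {normalized!r}")
        else:
            parsed["tone"] = DEFAULT_TONE
            result.losses.append(f"unknown tone {tone!r}, used default")
    result.data = parsed
    return result


reply = 'Sure, here you go: {"text": "A pioneer.", "tone": "Casual"} Hope that helps!'
outcome = extract_blurb(reply)
print(outcome.data)
print(outcome.repairs)
```

The design point is the return type: the caller always gets a value plus a list of repairs and losses, so a degraded response can be logged and counted rather than surfacing as an exception or passing unnoticed.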

Even the traces are typed: LLM calls persist into your own database with the request and response as typed value objects — the same declared payload — instead of opaque JSON blobs. (And if you like your Langfuse dashboards, keep them: the recorder exports to Langfuse and OpenTelemetry while the typed copy lands in your DB.)

You don't have to replace your stack to do this

To be fair to the incumbents, and clear about boundaries:

The shift isn't about a tool. It's the same shift SQL and config made: the prompt's inputs and outputs join your type system, your build, and your version control — because once AI writes half the code, agreement between the pieces is the scarce thing, and "we'll notice in the dashboard" is not an agreement strategy.

OpenAI just told its customers the same thing. The only question left is whether your prompts find out about a renamed field at build time — or in production.

MetaObjects is Apache-2.0 open source, installable today in five languages.

npm i @metaobjectsdev/cli  ·  pip install metaobjects  ·  dotnet add package MetaObjects  ·  Maven Central

Start at metaobjects.dev or read the spec.