I’ve been building ratchet, a tool for giving AI coding agents a very particular kind of job: improve a measurable result, one iteration at a time.
On the surface, it looks simple. Give the agent some editable files, a benchmark command, and a metric to optimize. Then let it edit code, run the experiment, and keep the improvement or throw it away.
But the part I find interesting is not the headline. It is the machinery around the loop.
Ratchet is really my attempt at answering a more specific question:
What if, instead of asking an AI to solve a coding problem in one shot, we gave it a tight experimental harness with memory, strategy, and consequences?
Table of contents
- The basic shape
- Single metric is the easy case
- Multi-metric optimization is where it gets more real
- Constraints are different from objectives
- How memory works
- Strategy hints are a lightweight kind of learning
- The multi-armed bandit is my favorite weird feature
- Rollback matters more than people think
- Branches, prompts, and keeping the mess contained
- What I think this is actually good for
- Why I built it this way
- Wrapping up
- What I think Ratchet is actually good for
- Where I would be careful
- The part I find most exciting
- Where I want to try it
- Wrapping up
The basic shape
At the center of ratchet is a small config:
```yaml
name: "sort-benchmark"
editable:
  - sort.py
readonly:
  - benchmark.py
run: "python3 benchmark.py"
metric:
  name: throughput
  grep: "^throughput:"
  direction: maximize
constraints:
  - name: correctness
    grep: "^correctness:"
    fail_below: 100.0
```
That config defines a few important boundaries:
- what the agent is allowed to edit
- what command actually measures performance
- how metrics are extracted from output
- what counts as success
- what must not regress
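The grep-based extraction is simple enough to sketch. This is a hypothetical helper, not ratchet's actual code, but it shows the contract: the benchmark prints `name: value` lines, and a regex plucks the number out.

```python
import re

def extract_metric(output: str, pattern: str) -> float:
    """Scan benchmark output for the first line matching the grep
    pattern and parse the trailing number. Hypothetical sketch of
    the extraction step; assumes "name: value" formatting as in
    the config examples."""
    rx = re.compile(pattern)
    for line in output.splitlines():
        if rx.search(line):
            # everything after the first colon is the value
            return float(line.split(":", 1)[1].strip())
    raise ValueError(f"no line matched {pattern!r}")

print(extract_metric("warmup: ok\nthroughput: 85.0\n", r"^throughput:"))  # → 85.0
```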
From there, ratchet can either generate instructions for an external agent, or run the full loop itself with ratchet loop.
I like this shape a lot because it narrows the problem down to something machines are actually decent at:
- make a change
- measure the outcome
- keep the win
- discard the loss
- repeat without getting bored
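That loop is simple enough to sketch end to end. The snippet below is a toy stand-in, not ratchet itself: a random mutation plays the role of the agent, and a scalar function plays the role of the benchmark.

```python
import random

def ratchet_loop(state, mutate, score, iterations=50, seed=0):
    """Minimal greedy skeleton of the loop: propose a change, measure,
    keep strict improvements, discard everything else. Each failed
    attempt restarts from the best known state."""
    rng = random.Random(seed)
    best, best_score = state, score(state)
    for _ in range(iterations):
        candidate = mutate(best, rng)   # the "agent" proposes an edit
        s = score(candidate)
        if s > best_score:              # keep the win
            best, best_score = candidate, s
        # otherwise: discard, next attempt starts from best again
    return best, best_score

# toy objective: maximize -(x - 3)^2 by nudging x around
final, s = ratchet_loop(
    0.0,
    lambda x, r: x + r.uniform(-1, 1),
    lambda x: -(x - 3) ** 2,
)
print(round(final, 2))  # converges near 3
```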
That is not how all software work should be done. But it is a very real category of work.
Single metric is the easy case
The easiest version of ratchet is a single metric plus hard constraints.
That already gets you pretty far.
The sort benchmark is the obvious demo: maximize throughput, but keep correctness at 100%. That framing is intentionally blunt, and I think that is part of why it works. The agent cannot talk its way into a win. The benchmark either prints a better number or it does not.
The KNN example is a better illustration of the same idea. Throughput goes up, accuracy must stay above a floor, and suddenly the search space becomes very productive: skip square roots, change data structures, build trees, flatten storage, reorder dimensions.
That is already useful. But the more interesting part starts when there is no single number that fully captures what you want.
Multi-metric optimization is where it gets more real
I added support for multiple primary metrics because many optimization problems are not honestly expressible as one scalar objective.
The HTTP handler example is the cleanest case. Throughput should go up, but p99 latency should go down:
```yaml
metrics:
  - name: requests_per_sec
    grep: "^requests_per_sec:"
    direction: maximize
  - name: latency_p99
    grep: "^latency_p99:"
    direction: minimize
```
Once you do that, the loop should not ask “is this better than the current best number?” It should ask a more nuanced question: does this result improve the frontier?
Ratchet uses Pareto dominance for that. A new result is interesting if it is not dominated by an existing kept result. In practice, that means the agent can find tradeoffs instead of being forced into a fake single-score ranking.
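A Pareto dominance check is small enough to write out in full. This is an illustrative sketch, not ratchet's internals: one result dominates another if it is at least as good on every metric and strictly better on at least one.

```python
def dominates(a, b, directions):
    """True if result `a` Pareto-dominates `b`. `directions` maps
    each metric name to "maximize" or "minimize"."""
    at_least_as_good = all(
        a[m] >= b[m] if d == "maximize" else a[m] <= b[m]
        for m, d in directions.items()
    )
    strictly_better = any(
        a[m] > b[m] if d == "maximize" else a[m] < b[m]
        for m, d in directions.items()
    )
    return at_least_as_good and strictly_better

dirs = {"requests_per_sec": "maximize", "latency_p99": "minimize"}
best = {"requests_per_sec": 1200.0, "latency_p99": 48.0}
new = {"requests_per_sec": 1150.0, "latency_p99": 31.0}
# `new` trades a little throughput for much better latency:
# neither point dominates the other, so both stay on the frontier.
print(dominates(best, new, dirs), dominates(new, best, dirs))  # False False
```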
I especially like this for things like:
- throughput vs latency
- speed vs quality
- accuracy vs memory
- compression ratio vs encode time
The BPE tokenizer example shows the idea nicely. One metric is throughput, the other is tokens_per_word, which is a rough vocabulary-quality proxy. That creates tension on purpose. A change can be worth keeping even if it does not improve speed, as long as it improves token quality without making the other metric worse.
That feels much closer to real engineering than “maximize one number and hope the rest sort themselves out.”
Constraints are different from objectives
I think it is useful to separate primary metrics from constraints.
Some numbers are things you actively want to optimize. Others are just guardrails.
For example:
```yaml
constraints:
  - name: correctness
    grep: "^correctness:"
    fail_below: 100.0
  - name: memory_mb
    grep: "^memory_mb:"
    warn_above: 512.0
```
That distinction matters.
If correctness drops below the floor, the run is not merely worse; it is invalid. Same for error rate, broken round-trip behavior, or whatever hard boundary your system has. Ratchet treats those as crash-like outcomes and reverts them.
Soft warnings are different. Sometimes you do want the loop to explore a little past a comfort zone without immediately throwing the work away.
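One way to sketch that hard/soft split (a hypothetical helper; the field names follow the YAML above, but ratchet's actual constraint handling may differ):

```python
def check_constraints(values, constraints):
    """Classify a run against the config's constraints: a fail_below
    breach invalidates the run outright, a warn_above breach only
    records a warning so exploration can continue."""
    status, warnings = "ok", []
    for c in constraints:
        v = values[c["name"]]
        if "fail_below" in c and v < c["fail_below"]:
            status = "invalid"          # hard floor: revert this attempt
        if "warn_above" in c and v > c["warn_above"]:
            warnings.append(c["name"])  # soft ceiling: keep, but flag it
    return status, warnings

cons = [
    {"name": "correctness", "fail_below": 100.0},
    {"name": "memory_mb", "warn_above": 512.0},
]
print(check_constraints({"correctness": 100.0, "memory_mb": 600.0}, cons))
# → ('ok', ['memory_mb'])
```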
That split between objectives and constraints was important to me because otherwise optimization loops become too brittle or too naive.
How memory works
This is probably the part I care about most.
When people say an agent has “memory,” they often mean some vague long-context conversation state. I wanted something more explicit and inspectable.
Ratchet’s memory mostly lives in files and git state.
The main pieces are:
- results.tsv, which logs each experiment
- .ratchet-prompt.md, which is regenerated each iteration
- bandit.json, when the strategy selector is enabled
- git commits, which give the loop a durable history of real code states
results.tsv is the scoreboard. Each row stores the commit, metrics, constraints, strategy, status, and a description of what changed. That means the loop does not need to “remember” in a mystical sense; it can read its history back in every iteration.
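Replaying that scoreboard is plain file parsing. Below is a sketch with made-up rows; the real results.tsv has more columns (constraints, strategy) and its exact layout may differ.

```python
import csv, io

# Hypothetical results.tsv contents for illustration only.
history_tsv = (
    "commit\tthroughput\tstatus\tdescription\n"
    "a1b2c3d\t85.0\tkept\tbaseline bubble sort\n"
    "d4e5f6a\t84.1\tdiscarded\ttried insertion sort, slower\n"
)

def best_kept(tsv_text, metric):
    """Replay the experiment log and return the best kept result,
    so each iteration rebuilds its view of history from disk."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [r for r in rows if r["status"] == "kept"]
    return max(kept, key=lambda r: float(r[metric]), default=None)

print(best_kept(history_tsv, "throughput")["commit"])  # → a1b2c3d
```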
Then the next prompt is built from that history. The agent gets:
- the current editable file contents
- the experiment history table
- the current best result
- constraints and domain context
- the last attempt and whether it was kept, discarded, or crashed
- strategy hints derived from recent history
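Assembling that prompt is mostly string formatting over the persisted artifacts. A hypothetical layout, not ratchet's actual template:

```python
def build_prompt(code, history_rows, best, hints):
    """Regenerate the per-iteration prompt from persisted state:
    current code, the experiment history, the best result so far,
    and any strategy hints. Section names are illustrative."""
    sections = [
        "## Editable files\n" + code,
        "## Experiment history\n" + "\n".join(history_rows),
        f"## Current best\n{best}",
        "## Hints\n" + "\n".join(hints or ["(none)"]),
    ]
    return "\n\n".join(sections)

prompt = build_prompt("def sort(xs): ...", ["a1b2c3d 85.0 kept"], "85.0", None)
print(prompt.splitlines()[0])  # → ## Editable files
```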
I like this because the memory is externalized. You can open the files and inspect exactly what the loop knows.
No hidden chain-of-thought mythology. Just state.
Strategy hints are a lightweight kind of learning
Before even getting to the bandit, ratchet already tries to turn history into guidance.
It looks for a few simple patterns:
- plateau detection when recent kept results are all within a narrow range
- failure streaks when several recent attempts were discarded or crashed
- the biggest successful jump so far
- recently failed approaches to avoid repeating
None of this is fancy machine learning. It is just cheap feedback distilled into prompt guidance.
If the loop has been making tiny gains for a while, it nudges the agent toward a more radical move. If one type of change produced the biggest jump, it suggests exploring nearby ideas. If the last few attempts failed, it says so directly.
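Plateau detection in particular is almost trivial. A sketch with illustrative thresholds (not ratchet's actual values):

```python
def plateau_hint(kept_scores, window=5, tol=0.02):
    """If the last `window` kept scores sit within a `tol` relative
    range, return a nudge toward a more radical change; otherwise
    return None. Window and tolerance are assumptions."""
    recent = kept_scores[-window:]
    if len(recent) < window:
        return None
    spread = (max(recent) - min(recent)) / max(recent)
    if spread < tol:
        return "Recent gains are tiny; try a more radical restructuring."
    return None

print(plateau_hint([100, 101, 101.5, 100.8, 101.2]))  # prints the nudge
```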
I think this matters because most agent loops are stateless in the worst way: they repeat mistakes with a lot of confidence. Even small amounts of structured memory help.
The multi-armed bandit is my favorite weird feature
One feature I especially enjoyed adding is the multi-armed bandit mode.
By default, the agent decides what kind of change to try next. That is fine, but it leaves strategy selection completely implicit.
With bandit mode enabled:
```yaml
bandit: true
```
ratchet uses UCB1 to choose a strategy arm for each iteration. The built-in arms are:
- algorithm
- data-structure
- micro-optimization
- parallelism
- memory-layout
- rewrite
The selected arm gets injected into the prompt as a directive. Then the outcome updates the arm’s stats: kept runs count as reward, discarded or crashed runs do not.
This is intentionally simple. The reward is binary, not fancy. But I think that is okay.
The value of the bandit is not that it magically discovers truth. It is that it introduces a disciplined exploration/exploitation policy into the loop.
At the start, all untried arms get explored. Later, the loop starts favoring strategies that have actually produced wins in this benchmark. And because the state is persisted to bandit.json, that preference survives restarts.
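UCB1 with binary rewards fits in a dozen lines. This is a sketch; the arm-stats representation and reward bookkeeping here are assumptions, not ratchet's exact state format.

```python
import math

def ucb1_pick(arms):
    """Choose the next strategy arm by UCB1: untried arms first, then
    the arm maximizing mean reward plus an exploration bonus.
    `arms` maps name -> (wins, pulls); reward is binary (kept = 1)."""
    for name, (_, pulls) in arms.items():
        if pulls == 0:
            return name  # explore every arm at least once
    total = sum(pulls for _, pulls in arms.values())
    def score(item):
        _, (wins, pulls) = item
        return wins / pulls + math.sqrt(2 * math.log(total) / pulls)
    return max(arms.items(), key=score)[0]

def ucb1_update(arms, name, kept):
    """Record the outcome: kept runs count as reward, others do not."""
    wins, pulls = arms[name]
    arms[name] = (wins + (1 if kept else 0), pulls + 1)

arms = {"algorithm": (3, 5), "micro-optimization": (1, 6), "rewrite": (0, 0)}
print(ucb1_pick(arms))  # → rewrite  (never tried, so explored first)
```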
I like this because it gives the loop a crude but useful answer to a question agents are usually bad at: what kind of move should I try next?
The compression benchmark shows this well. Some arms do nothing. Some crash. Some produce the big jumps. Over time, the loop shifts attention toward the strategies that are actually paying off.
Rollback matters more than people think
One subtle design choice in ratchet is rollback-to-best.
If an experiment crashes, violates a hard constraint, or just does not produce an improvement, ratchet does not merely shrug and continue from a messy state. It restores the editable files back to the best known state and keeps going from there.
That sounds small, but it changes the behavior of the whole system.
Without rollback, the loop can drift into a pile of accidental complexity where every future iteration inherits dubious baggage. With rollback, each new attempt starts from a strong footing rather than from the last random mutation.
This is one of the reasons I wanted git involved directly. The loop is not only exploring code changes; it is exploring code changes with durable checkpoints.
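Restoring only the editable files to a known-good commit is essentially one git command. A minimal sketch, assuming the best commit hash is tracked somewhere; ratchet's real git handling may differ.

```python
import subprocess

def rollback_to_best(best_commit, editable_files):
    """Restore the editable files to the best known commit so the
    next attempt starts from the strongest baseline rather than the
    last failed mutation. Readonly files are left untouched."""
    subprocess.run(
        ["git", "checkout", best_commit, "--", *editable_files],
        check=True,
    )
```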
Branches, prompts, and keeping the mess contained
Another pragmatic detail I care about: ratchet loop runs on experiment branches.
If you start it from main or master, ratchet creates a ratchet/... branch automatically. That keeps the optimization history out of the way and makes the whole thing feel more like a lab notebook than a normal feature branch.
I also like that the per-iteration prompt is regenerated every time rather than growing into one giant conversation. The prompt is fresh, but it is grounded in persistent artifacts: current code, benchmark definition, experiment history, hints, and optionally a selected strategy arm.
So the loop gets continuity without turning into a giant chat log.
What I think this is actually good for
The examples in the repo are not random. They point at the kinds of work where I think ratchet makes the most sense:
- hot paths with measurable throughput or latency
- algorithmic code with lots of plausible alternatives
- correctness-constrained transforms like compression, parsing, tokenization, or routing
- model or retrieval pipelines with offline evaluation metrics
The sort benchmark is a dramatic toy. The KNN, HTTP handler, compression, and BPE examples feel closer to the real use case.
Where I would still be careful is anything with weak metrics, shallow benchmarks, or fuzzy product tradeoffs. If the benchmark is fake, the optimization will be fake too. Ratchet does not solve Goodhart’s law; it just makes it easier to run headfirst into it.
Why I built it this way
I did not want to build a giant agent platform.
I wanted something much smaller and stricter: a tool that gives agents a loop, a scoreboard, some bounded memory, and a way to recover from bad ideas.
That is also why I like the fact that ratchet can either generate instructions for any coding agent or orchestrate the loop itself. The important thing is not the specific model. The important thing is the harness around it.
I think that is where a lot of useful AI tooling will end up living: not in bigger promises, but in tighter feedback systems.
Wrapping up
Ratchet started as a simple idea: tell an AI what number to improve.
What made it interesting to me was everything that had to grow around that idea for it to become usable: multi-metric Pareto selection, constraints, persistent experiment history, strategy hints, rollback-to-best, and the bandit-driven strategy selector.
I still think of it as exploratory. I am not claiming this is the final form of AI-assisted engineering.
But I do think loops like this are much more promising than one-shot code generation. They give the model a narrower job, and they give the human much better artifacts to inspect.
If you want to poke around, the code is here: github.com/alephmelo/ratchet.
The toy sorting benchmark starts with bubble sort at roughly 85 arrays per second, and the loop eventually lands on a C extension counting sort at roughly 71k arrays per second. That is intentionally dramatic, but it demonstrates the point very clearly: once the objective is measurable, an agent can search a surprisingly wide space.
The KNN example is more interesting to me because it feels less like a gimmick. It starts with brute-force nearest neighbor search and ends up with a KD-tree plus a bunch of careful implementation details, while preserving accuracy. That is exactly the kind of task where an optimization loop makes sense: there is a clear metric, a hard correctness floor, and many plausible ways to improve.
Then there are the more opinionated examples:
- an HTTP handler optimizing both throughput and p99 latency with Pareto selection
- a compression benchmark using a multi-armed bandit to steer strategy selection
- a BPE tokenizer balancing throughput against token quality instead of pretending there is only one number that matters
That last part is important. Ratchet is not just saying “maximize speed at all costs.” It has room for constraints and multiple metrics, which makes it feel more like a research harness than a benchmark toy.
What I think Ratchet is actually good for
Right now I see it as a strong fit for tasks with four properties:
- The code is local and editable.
- There is a benchmark you can run cheaply and repeatedly.
- Success can be expressed numerically.
- The search space is large enough that trial-and-error is valuable.
So: throughput, latency, memory use, accuracy, compression ratio, tokenization quality, maybe even prompt evaluation scores.
Basically, it seems best when the question is not “can the model build this product?” but “can the model keep finding better moves inside a measurable box?”
That is a narrower claim, but a much more believable one.
Where I would be careful
The weakness is also obvious: ratchet is only as good as the number you give it.
If the benchmark is shallow, the agent will optimize for a shallow benchmark. If the metric is incomplete, it will exploit the incompleteness. If the benchmark corpus is too synthetic, you may end up with a very clever solution to a fake problem.
That is not a ratchet problem so much as a Goodhart’s law problem. Still, the tool makes it very easy to forget that the benchmark is standing in for reality rather than being reality itself.
I also would not confuse this with product development more broadly. A lot of important work still depends on judgment, taste, user empathy, and ambiguous tradeoffs. Ratchet becomes compelling when the task can be constrained enough that a loop can safely explore.
So I would trust it much sooner for:
- hot paths
- data transformations
- parsing and compression
- algorithmic experiments
- inference or training scripts
…than for UX architecture, feature design, or anything where the benchmark is mostly vibes.
The part I find most exciting
The most exciting thing here is not full autonomy. It is interface design.
Ratchet gives AI coding agents a better shape of work.
Instead of:
“Please optimize this somehow.”
It becomes:
“Here are the files you may edit. Here is the benchmark. Here is what success means. Here is how to keep score.”
That is a much better contract between a human and a model.
I suspect we will see more tools like this: not giant agent platforms, just small systems that wrap models in clearer feedback loops. Less magic, more instrumentation.
Where I want to try it
The obvious next step is to point it at a real codebase with a benchmark that actually matters.
I would be curious to use ratchet on:
- an API endpoint with clear throughput and latency constraints
- a tokenizer or parser where correctness is easy to test
- a ranking or retrieval pipeline with offline evaluation
- a data import job where runtime and memory are currently painful
Not because I expect the agent to discover unheard-of computer science, but because I expect it to grind through a lot more iterations than I would do manually on a Tuesday afternoon.
That alone is useful.
Wrapping up
I do not think ratchet is the final form of AI-assisted engineering. But I do think it points at something real.
If you can define a loop tightly enough, AI stops feeling like a novelty autocomplete layer and starts looking more like an optimization worker: still needing supervision, still capable of dumb mistakes, but genuinely able to search for improvements while you evaluate the results.
That feels like a good direction.
Not definitive. Still exploratory. But very much worth playing with.