Moving SAGE to Local Models
Migrating SAGE's YouTube classifier from a larger local model to a faster one, including the thinking-model gotcha that broke responses.
Part 16 of the series: Building The Hub
The first serious job I moved onto Perceptor was not glamorous. It was a classifier.
That was deliberate.
SAGE has a YouTube Inbox workflow that looks at saved videos and decides where they belong. Some videos are worth watching soon. Some are useful for learning. Some belong in a project queue. Some can be ignored. It is exactly the kind of bounded judgment call that shows up everywhere in a personal AI system.
The task is small, structured, repeatable, and easy to evaluate. That makes it a much better local-model migration candidate than something like long-form writing or open-ended planning.
If local inference is going to become infrastructure, this is how I want it to earn trust: one boring workflow at a time.
Why SAGE was the right test
SAGE already had the right shape for migration. The input was a video title, metadata, and sometimes a short description. The output was a category and a short reason. The old path used a larger local model, but the workflow itself was independent enough that I could swap the model without rebuilding the whole system.
That independence matters. Model migrations are much easier when the prompt, tool call, output parser, and workflow boundaries are clean.
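To make those boundaries concrete, here is a minimal sketch of what a swappable classifier can look like. It is not SAGE's actual code: the category labels, the `Video` and `Classification` types, and the function names are all hypothetical, and the JSON output shape is an assumption based on the "category plus short reason" description above.

```python
import json
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical labels for the four outcomes described above.
CATEGORIES = {"watch_soon", "learning", "project_queue", "ignore"}


@dataclass
class Video:
    title: str
    description: str = ""
    metadata: dict = field(default_factory=dict)


@dataclass
class Classification:
    category: str
    reason: str


def build_prompt(video: Video) -> str:
    """Prompt boundary: the only place that knows how the request is worded."""
    return (
        "Classify this saved video. Reply with JSON containing "
        f"'category' (one of {sorted(CATEGORIES)}) and 'reason'.\n"
        f"Title: {video.title}\n"
        f"Metadata: {json.dumps(video.metadata, sort_keys=True)}\n"
        f"Description: {video.description}"
    )


def parse_output(raw: str) -> Classification:
    """Parser boundary: reject anything that is not a known category."""
    data = json.loads(raw)
    category = data["category"]
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return Classification(category=category, reason=str(data.get("reason", "")))


def classify(video: Video, complete: Callable[[str], str]) -> Classification:
    """Model boundary: `complete` is any function mapping prompt text to reply text."""
    return parse_output(complete(build_prompt(video)))
```

Because `classify` only depends on a `complete` callable, swapping the larger model for the faster one means passing a different function, while the prompt and parser stay untouched.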
I wanted to answer three questions:
- Would the faster model preserve classification quality?
- Would latency improve enough to matter?
- Would any model-specific behavior break the workflow?
The third question turned out to be the interesting one.
The thinking-model gotcha
The migration target was a faster MLX-backed local model running on Perceptor. The first tests looked broken in a confusing way: the call would complete, but the response content was empty.
No useful error. No dramatic crash. Just an empty answer.
The cause was internal reasoning. The model was spending the available output budget on thinking, leaving no visible content for the workflow to parse. From the outside, it looked like the classifier had failed. In reality, the model had used the completion budget in a way the workflow did not expect.
The fix was simple once I understood it: disable thinking for this task.
That is not an anti-reasoning stance. Thinking models are useful. But this workflow does not need hidden deliberation. It needs a small, visible, structured answer. A classifier that silently thinks and returns nothing is worse than a less clever classifier that returns the right JSON every time.
This is one of those tiny integration details that matters more than the benchmark chart. The model is not the product. The workflow is the product.
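One way to make this failure loud instead of silent is to check the response before it reaches the parser. The sketch below assumes an OpenAI-style chat completion dictionary (`choices[0].message.content` and `choices[0].finish_reason`); whether a given local server returns that shape is an assumption, and `EmptyCompletionError` and `extract_visible_content` are hypothetical names. How thinking is actually disabled depends on the model and the serving stack, so that part is deliberately left out.

```python
class EmptyCompletionError(RuntimeError):
    """Raised when a call succeeds but carries no visible answer."""


def extract_visible_content(response: dict) -> str:
    """Return the visible reply text, or raise with a hint about the likely cause.

    Assumes an OpenAI-style chat completion dict (an assumption, not a
    guarantee about any particular local server).
    """
    choice = response["choices"][0]
    content = (choice.get("message", {}).get("content") or "").strip()
    if content:
        return content
    if choice.get("finish_reason") == "length":
        # The call hit its output limit without producing visible text.
        raise EmptyCompletionError(
            "empty content and the output budget was exhausted; "
            "the model may have spent it on hidden reasoning"
        )
    raise EmptyCompletionError("empty content returned by the model")
```

With a guard like this in front of the parser, an empty answer shows up as a named error with a probable cause rather than as a confusing parse failure downstream.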
The benchmark
After the fix, I ran the classifier against a small evaluation set. On this initial 10-item set, the faster local model got 10 out of 10 classifications correct.
Latency improved too, at least in my local runs on the same workflow prompts. Average response time dropped from roughly 1.1 seconds to roughly 570 milliseconds. Throughput rose from about 19 tokens per second to about 29. Cold start improved as well, from around 17 seconds to around 6 seconds.
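A small harness is enough to collect numbers like these. The following is a sketch, not the script I actually ran: `evaluate` is a hypothetical helper, and it assumes the classifier can be called as a plain function that takes the input text and returns a category string.

```python
import time
from statistics import mean
from typing import Callable, Iterable, Tuple


def evaluate(
    classify: Callable[[str], str],
    labeled_items: Iterable[Tuple[str, str]],
) -> dict:
    """Run each (input, expected_category) pair and collect accuracy and latency."""
    latencies = []
    misses = []
    total = 0
    for text, expected in labeled_items:
        start = time.perf_counter()
        predicted = classify(text)
        latencies.append(time.perf_counter() - start)
        total += 1
        if predicted != expected:
            # Keep the full miss so it can be inspected by hand later.
            misses.append((text, expected, predicted))
    if total == 0:
        raise ValueError("evaluation set is empty")
    correct = total - len(misses)
    return {
        "total": total,
        "correct": correct,
        "accuracy": correct / total,
        "mean_latency_s": mean(latencies),
        "misses": misses,
    }
```

Running the same labeled set through the old and new model paths gives two reports that can be compared directly. Keeping the misses, not just the score, matters on a set this small, because a single wrong item is worth reading.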

Those are small numbers, but small numbers matter in workflows that run repeatedly. A classifier that is half a second faster feels different when it is processing a queue. A cold start that is roughly eleven seconds shorter feels different when the system wakes up to do one small job and then goes quiet again.
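The arithmetic behind that claim is short enough to write out. The inputs are the approximate figures quoted above; the queue size of 100 is a hypothetical value chosen only to show how the per-item saving adds up.

```python
# Approximate figures from the local runs described above.
old_latency_s, new_latency_s = 1.1, 0.57
old_tps, new_tps = 19, 29
old_cold_s, new_cold_s = 17, 6

latency_speedup = old_latency_s / new_latency_s   # a little under 2x
throughput_gain = new_tps / old_tps               # roughly 1.5x
cold_start_saved_s = old_cold_s - new_cold_s      # 11 seconds

# Hypothetical queue size, used only for illustration.
queue_size = 100
queue_saved_s = (old_latency_s - new_latency_s) * queue_size  # about 53 seconds

print(f"latency speedup:   {latency_speedup:.2f}x")
print(f"throughput gain:   {throughput_gain:.2f}x")
print(f"cold start saved:  {cold_start_saved_s} s")
print(f"saved per {queue_size} items: {queue_saved_s:.0f} s")
```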
The quality result mattered most. Speed is only useful if the output still routes the work correctly.
Text-only is fine until it is not
The MLX model I used for this path is text-only in the way I am using it. That is fine for the YouTube Inbox classifier because the workflow operates on metadata and descriptions. It does not need to watch the video.
But the distinction matters for the broader system. Some local model variants advertise multimodal capability, but the runtime you are using does not always support it. In one experiment, audio input was part of the model family's story on paper, but not meaningfully available through the local serving path I had.
For podcast and video workflows, Whisper-style transcription is still the right tool for audio. The model can reason over the transcript afterward. That division is boring and reliable, which is exactly what I want.
This is another worker-model lesson in miniature. Do not ask one tool to do every job. Use the fast text model for classification. Use transcription for audio. Use a stronger model when the reasoning gets harder.
Rollback anchors
I kept the previous path available while testing the new one. That made the migration much calmer.
When you are swapping models inside a workflow, rollback should be a normal part of the design, not an admission of failure. Prompts are sensitive. Output formats drift. Small model behavior differences can break downstream parsing. Being able to compare old and new outputs side by side turns the migration from a leap of faith into a measured change.
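A side-by-side comparison does not need much machinery. This is a minimal sketch, assuming both model paths are exposed as functions that take the same input text and return a category; `compare_paths` is a hypothetical name.

```python
from typing import Callable, Iterable, List, Tuple


def compare_paths(
    items: Iterable[str],
    old_classify: Callable[[str], str],
    new_classify: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (item, old_category, new_category) for every disagreement."""
    disagreements = []
    for item in items:
        old, new = old_classify(item), new_classify(item)
        if old != new:
            disagreements.append((item, old, new))
    return disagreements
```

An empty result means the two paths agree on every item. A non-empty one is a short reading list: each disagreement is either a regression or a case where the old path was wrong too.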
For SAGE, the test set gave me enough confidence to switch the default. If the workflow starts misclassifying later, I can compare against the older model path and update the prompt or routing logic.
The point is not to make local models feel risk-free. The point is to make the risk observable.
What changed in my head
This migration changed how I think about Perceptor.
Before this, Perceptor was a promising inference host. After this, it became part of a production-ish workflow. SAGE was no longer just experimenting with local models. It was using one to do real work faster.
That is the moment local AI becomes practical: not when it wins a leaderboard, but when it quietly makes an existing workflow better.
The migration also clarified the class of tasks I want to move next. Bounded classifiers. Extraction steps. Routing decisions. First-pass summaries. Anything where the input is small, the output is structured, and evaluation is possible.
That is a lot of The Hub.
The pattern
The pattern is simple:
- Pick a bounded workflow.
- Keep the old model path available.
- Run a small evaluation set.
- Watch for model-specific integration behavior.
- Switch only when quality, speed, and observability all look acceptable.
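The last step in that list can be written down as an explicit gate. This is a sketch under stated assumptions: `MigrationReport`, `ready_to_switch`, the field names, and the default threshold are all hypothetical, and counting empty responses is just one possible stand-in for model-specific integration behavior.

```python
from dataclasses import dataclass


@dataclass
class MigrationReport:
    accuracy: float        # fraction of the evaluation set classified correctly
    mean_latency_s: float  # average response time in seconds
    empty_responses: int   # calls that completed with no visible content


def ready_to_switch(
    old: MigrationReport,
    new: MigrationReport,
    rollback_available: bool,
    min_accuracy: float = 1.0,  # hypothetical bar; a tiny set leaves little room for misses
) -> bool:
    """Switch the default only if quality, speed, and observability all hold."""
    quality_ok = new.accuracy >= min_accuracy and new.accuracy >= old.accuracy
    speed_ok = new.mean_latency_s < old.mean_latency_s
    integration_ok = new.empty_responses == 0
    return quality_ok and speed_ok and integration_ok and rollback_available
```

The value of a gate like this is less the boolean it returns and more that each condition has to be measured before the default changes.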
That is not as dramatic as “replace cloud AI with local AI.” Good. I do not want drama in my automation layer. I want boring wins that compound.
SAGE’s classifier was one of those wins.
And once that was working, the next obvious target was a much larger content workflow: my podcast pipeline.