
Why I'm building a 21M parameter model in the age of 405B


Everyone is scaling up. I’m scaling differently.

While the industry races toward 405B parameter models, I’m training a 21M parameter ConceptEncoder. Not because I can’t access bigger compute — but because I think there’s a better path.

My ConceptEncoder uses 128 learnable concept tokens that cross-attend to the input sequence. Instead of every token attending to every other token (128K × 128K attention scores for a 128K-token context), the concepts attend to the sequence (128 × 128K). That’s a 1000x theoretical reduction in attention memory.
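The 1000x figure is just the ratio of the two attention maps, which a quick back-of-the-envelope check confirms (a sketch using the post's own numbers; the variable names are mine):

```python
seq_len = 128_000    # 128K-token context, as in the post
n_concepts = 128     # learnable concept tokens

# Full self-attention: one score per (token, token) pair.
full_attention_scores = seq_len * seq_len

# Concept cross-attention: one score per (concept, token) pair.
concept_attention_scores = n_concepts * seq_len

# The reduction is simply seq_len / n_concepts.
print(full_attention_scores // concept_attention_scores)  # → 1000
```

Note the reduction is independent of sequence length in ratio terms only if the context grows; the real win is that concept attention memory grows linearly in sequence length rather than quadratically.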

The bet: a small set of concepts, forced to compress meaning, might learn more efficiently than billions of parameters trained to predict the next token.

Is it working? Honestly — not yet, not fully. My concepts keep collapsing (effective rank 5/128). But each failed experiment teaches me something about how representations form and why brute-force scaling avoids these fundamental questions rather than solving them.

Three things I’ve learned so far:
→ Efficiency constraints force architectural innovation. You can’t just throw more parameters at a collapsed representation.
→ Small models reveal failure modes that large models hide behind sheer capacity.
→ The training objective matters more than model size. A bad objective wastes parameters at any scale.

I don’t claim this will work. But I believe exploring alternatives to “just scale it up” is research worth doing — especially when a single large training run costs more energy than I’ll use in a year.

What’s your take — is there value in exploring small, efficient architectures, or is scaling the only game in town?

github.com/ksopyla/MrCogito

#AIResearch #EfficientAI #MachineLearning #OpenScience #ConceptEncoder