Specification based programming

June 14, 2026 (3 weeks ago)

The use of LLMs to write software has focused on integrating them within existing practices, i.e., using LLMs as very fancy auto-completers for chunks of code or functionality. This use is programming by conversation, or less politely, programming by stream of thought. The term vibe-coding creates an illusion of trendiness; after all, software engineering is a hedonistic activity.

With vibe-code on top of vibe-code on top of vibe-code, refactoring becomes a complete rewrite, at least in theory. A rewrite assumes that it’s possible to extract a specification that is complete and accurate enough to recreate the software. A lot of software has a short lifetime, so a major rewrite may never be needed. However, for software that is expected to have a long life, management are going to want a more controlled/structured/repeatable approach.

LLMs’ ability to write software is now good enough to support a more controlled/structured/repeatable approach: Programming by specification. That is, a specification of the desired behavior is given to one or more LLMs, which use it to generate the appropriate software.

The human input to the program creation process is via the specification.
Features are changed/added/removed by updating the specification. Bugs are fixed by updating the specification. If there are mistakes in the generated code, the specification has to work around them, in the same way that compiler bugs have to be worked around.

Business logic can be expressed as a specification, which is how application domain experts, who are not programmers, are able to create minimal viable products using LLMs.

How might a specification be created?

Agile has taught the lesson that software creation is an iterative process. Requiring a complete specification before coding starts is the stuff of armchair project managers.

One possible specification iteration process starts with a basic outline specification of what is required, which is then refined by repeating this cycle:

  1. Using the current specification, developer+LLM produces code. Work stops when, say, particular functionality has been implemented, or some fixed amount of time has elapsed,
  2. the transcript of the LLM conversation is used to create an updated specification of the code that exists when work stopped. Conversations involving code that came and went are not part of the updated specification, although logging them for future reference costs little,
  3. a new version of all the software covered by the updated specification is generated. This can be tested using existing tests and also by differential testing using multiple implementations created from the same specification (a recent paper generated five implementations in different languages),
  4. if more functionality is needed, go to step 1.
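
The differential testing mentioned in step 3 can be sketched as follows. This is a minimal illustration, not the method used in the cited paper: the three `impl_*` functions are hypothetical stand-ins for implementations generated from one specification, and `differential_test` is a name introduced for this example.

```python
from collections import Counter
from typing import Callable, Iterable, Sequence

# Hypothetical stand-ins for implementations generated from one specification:
# "return the median of a non-empty list of integers; for even lengths,
# return the lower of the two middle values".

def impl_a(xs: Sequence[int]) -> int:
    return sorted(xs)[(len(xs) - 1) // 2]

def impl_b(xs: Sequence[int]) -> int:
    ordered = sorted(xs, reverse=True)
    return ordered[len(xs) // 2]

def impl_c(xs: Sequence[int]) -> int:
    # Deliberately faulty: picks the upper middle value for even lengths.
    return sorted(xs)[len(xs) // 2]

def differential_test(impls: dict[str, Callable], inputs: Iterable) -> list[tuple]:
    """Run every implementation on every input; report inputs where they disagree."""
    disagreements = []
    for x in inputs:
        results = {name: f(x) for name, f in impls.items()}
        counts = Counter(results.values())
        if len(counts) > 1:
            majority, _ = counts.most_common(1)[0]
            outliers = sorted(n for n, r in results.items() if r != majority)
            disagreements.append((x, results, outliers))
    return disagreements

impls = {"a": impl_a, "b": impl_b, "c": impl_c}
cases = [[1], [3, 1, 2], [4, 1, 3, 2], [5, 5, 5, 5]]
for case, results, outliers in differential_test(impls, cases):
    print(case, results, "outliers:", outliers)
```

A majority vote does not prove the majority is right. A disagreement only shows that at least one implementation is faulty, or that the specification is ambiguous enough to be read in more than one way.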

Specifications share many characteristics with source code. They can be split up and organized into modules/packages/components/phases, as was done for this LLM generated C compiler.

LLM generated code is more verbose than human generated code, just like the machine code generated by early compilers.

Open source projects could soon just be making the specification available. Why ship the source code generated from a specification? Projects don’t ship the assembler code generated by compilers; they ship the original source code. However, given the current reliability of LLM source code generation, there are benefits to making the generated source of at least one implementation available (as a kind of checksum).

Reduced implementation costs, using LLMs, make it possible to create programs containing more functionality (Jevons paradox in action). This in turn leads to specifications becoming larger, complicated and poorly organized, just like source code.

English usage is full of ambiguities. This ambiguity can be reduced by using a controlled language. If specification programming becomes popular, it’s easy to imagine the invention of controlled languages becoming as popular as the invention of programming languages. In 1957, there were compilers for at least 28 programming languages.

Specification based programming is a continuation of the trend of computers handling more of the details involved in program creation, with the program creation process requiring less and less knowledge about computers. Increasing amounts of computer time are spent to reduce or eliminate developer time.

Programming has evolved from physically connecting subsystems by cables to specify the flow of bits in a punch card computer, to writing a sequence of machine code instructions executed by a stored-program computer, and then to high-level programming languages, which reduce the need to know lots of details about the underlying CPU (details that remain include: number of bits in the integer types and type compatibility rules).

Specification based programming requires discipline, and I don’t expect it to be popular. I expect that multiple LLM-derived project disasters will need to occur before there are any significant changes to the current LLM approaches to software development.

  1. June 17, 2026 (3 weeks ago) 09:49 | #1

    In my limited experiments with free-tier LLMs I found it works better if you start a new conversation, give it the generated code and the specs, and ask it to find a discrepancy and to fix it. Then start a new conversation *again* and repeat to get another discrepancy fixed, and so on until it doesn’t find one or it starts saying things are discrepancies that aren’t. This is of course in addition to running rigorous compilers, linters and tests to catch problems at each iteration.

    Why start a new conversation when you want the LLM to look for a new problem? Because LLMs seem to be trained to “think” better toward the start of a conversation than later on. In a long context window the LLM is more likely to tell you it’s ready to ship when it isn’t. Presumably it’s modelling a reasoning of “well after all that work it must be ready now” and its attention mechanism won’t be able to surface more problems in all that clutter. (Additionally, long-context conversations consume quota faster, as despite key-value caching etc. the repeated use of history still counts. But even with an infinite supply of zero-climate-consequence reasoning, it would still usually work better to start a new conversation when looking for a new problem.)
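
The fresh-conversation loop described above can be sketched as follows. Everything here is an assumption made for illustration: `find_discrepancy` stands for whatever call starts a brand-new LLM conversation containing only the specification and the current code, and `fake_llm` is a toy stand-in so that the sketch runs without any LLM.

```python
from typing import Callable, Optional

# Hypothetical interface: given (spec, code), start a NEW conversation and
# return corrected code, or None when no discrepancy is reported.
FindDiscrepancy = Callable[[str, str], Optional[str]]

def repair_loop(spec: str, code: str, find_discrepancy: FindDiscrepancy,
                max_rounds: int = 12) -> tuple[str, int]:
    """Fix one discrepancy per fresh conversation until none is reported."""
    for rounds in range(max_rounds):
        fixed = find_discrepancy(spec, code)  # no history is carried over
        if fixed is None or fixed == code:
            return code, rounds
        code = fixed
        # A real workflow would run compilers, linters and tests here.
    return code, max_rounds

def fake_llm(spec: str, code: str) -> Optional[str]:
    """Toy stand-in: treats each spec line as text the code must contain."""
    for requirement in spec.splitlines():
        if requirement not in code:
            return code + requirement + "\n"
    return None

final_code, rounds = repair_loop(
    "handles empty input\nrejects negative sizes", "", fake_llm)
print(rounds)
print(final_code)
```

The `max_rounds` cap is a crude substitute for the second stopping condition, the point at which the LLM starts reporting discrepancies that are not real, which still needs a human to judge.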

    Consequently, the work log could be made up of a dozen or more separate “chats” with the LLM. I’ve not tried tools like Cursor but I think they work by managing the context window so the LLM does not see everything at once.

    What many people overlook about LLMs is that these things are probabilistic. It’s not “predicting the next word” (token) as is often claimed. It’s predicting a *distribution* of plausible next tokens, from which one is chosen by calling the random number generator and biasing it to the shape of that distribution. Some platforms expose a “temperature” setting that works by scaling the distribution in favour of more or less likely next tokens. In theory you could make an LLM conversation reproducible by saving the model weights and the RNG seed and ensuring everything is called sequentially but it would be a huge performance hit, and the model weights still seem much less rigorously constructed than a traditional compiler (which is closer to old-fashioned “expert system” AI than it is to LLMs). Therefore it’s still going to be important to keep the resulting source code, because the guarantee of being able to get something functionally equivalent is much less iron-clad than it is with traditional compilers, quite apart from considerations like “the deployment site might not have the resources to re-run that LLM” (but might still be on a CPU you don’t know, so shipping source works better than shipping binary).
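
The sampling step described above can be sketched in a few lines. This is a toy illustration, not how any particular LLM platform implements it: the four-token vocabulary and the `logits` scores are made up, and `softmax_with_temperature`, `sample_token` and `generate` are names introduced here.

```python
import math
import random

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def sample_token(vocab: list[str], logits: list[float],
                 temperature: float, rng: random.Random) -> str:
    """Pick one token at random, biased to the shape of the distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "fish", "bird"]  # made-up vocabulary
logits = [2.0, 1.5, 0.5, 0.1]           # made-up model scores

# Low temperature favours the most likely token; high temperature flattens.
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])

def generate(seed: int, n: int = 8) -> list[str]:
    """Draw n tokens from one seeded generator."""
    rng = random.Random(seed)
    return [sample_token(vocab, logits, 1.0, rng) for _ in range(n)]

# Same seed and same call order give the same token sequence.
print(generate(42) == generate(42))
```

The sketch is single-threaded, which corresponds to the "called sequentially" condition mentioned above; it says nothing about the cost of enforcing that condition on real hardware.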

    (If the AI bubble bursts and we all get jobs again to pick up this mess by hand, I don’t know if it’ll be easier to fix the LLM-generated code or to rewrite it; we might have to look at that on a case-by-case basis. I hope in many cases it will turn out to be reasonably maintainable legacy code but who knows.)

  2. June 17, 2026 (3 weeks ago) 19:45 | #2

    @Silas S. Brown
    Context size is an issue, and I expect that there will be lots of research around optimum window size and when to restart conversations.
    If? You mean when the ‘AI’ bubble bursts, in the sense of company valuations and using VC money to subsidize customer usage. LLM use will continue to be available at a price. The question is what LLM functionality is cost-effective at whatever the commercially viable cost turns out to be.

  3. Pavel Gurov
    June 18, 2026 (3 weeks ago) 06:57 | #3

    >very fancy auto-completers for chunks of code or functionality

    We have actually already moved past this stage in software development. Starting with models like Claude 3.5 Sonnet or the latest reasoning models, you can simply paste a Jira ticket description into the LLM, and the task will be completed and fully covered by tests.

    > management are going to want a more controlled/structured/repeatable approach.

    That might not necessarily be the case. If something goes wrong, you can now rewrite an entire module from scratch and fully cover it with tests in just half a day.

    >Features are changed/added/removed by updating the specification

    This doesn’t fully align with what we are seeing in practice. For instance, in our workflow, we generate specifications once, tweak them manually, and then rarely change them at all. Looking ahead, I expect that specifications might not even need to be updated. Roughly speaking, the LLM will be capable of understanding the overall direction and evolution of the project on its own.

    >LLM generated code is more verbose than human generated code

    I would respectfully disagree with this as a blanket statement, as several key factors come into play here:

    Readability by design: By default, LLMs strive to write clean, clear, and readable code—not just for human developers, but also to ensure that other LLMs can easily parse and work with it later. Furthermore, LLMs tend to include multi-line comments when dealing with non-trivial business logic. I once considered adding a rule to our specifications to restrict multi-line comments (since developers rarely read comments longer than two lines), but a colleague advised against it. It turns out LLMs read the full comment carefully and heavily rely on it during subsequent code generation.

    Customizability: If you explicitly need the code to be highly concise or dense, you just need to add that requirement to the system prompt or specification. The model will then generate compact, heavily condensed code (though it may become much harder for humans to read).

    Thoroughness: LLMs naturally tend to account for edge cases and various auxiliary factors that human developers frequently overlook. This thoroughness can sometimes be mistaken for mere verbosity.

    Commercial incentives: There is also the economic factor to consider—generating more code consumes more tokens, which directly increases revenue for LLM providers.

    >This in turn leads to specifications becoming larger, complicated and poorly organized, just like source code.

    While it’s true that about six months ago specifications were growing exponentially, this trend has significantly slowed down over the last couple of months. With the advent of specialized agent skills and plugins that already embed the necessary instructions, LLMs have become smart enough to infer the context and intent on their own. In my experience, instructions actually need to be short and precise. If a specification becomes too lengthy and bloated, the quality of the generated output tends to decrease significantly.

  4. June 18, 2026 (3 weeks ago) 12:16 | #4

    @Pavel Gurov
    >>very fancy auto-completers for chunks of code or functionality
    > We have actually already moved past this stage in software

    Yes, the functionality is there, but lots of people (based on those I talk to or read about) are not doing the ‘clever’ stuff. In some cases this is because they have been bitten in the past and are now cautious, in other cases they are stuck in a rut.

    > you can now rewrite an entire module from scratch and fully cover it with tests in just half a day.

    Wouldn’t you rather do it in half an hour?

    > instance, in our workflow, we generate specifications once, tweak them manually, and then rarely change them at all.

    In a few years’ time the data will be available for somebody to measure the half-life of lines in a specification 😉

    > Roughly speaking, the LLM will be capable of understanding the overall direction and evolution of the project on its own.

    Once LLMs can understand and react to customer demand, you are out of a job.

    >> LLM generated code is more verbose than human generated code
    > Readability by design: By default, LLMs strive to write clean, clear, and readable code—not just for human developers, but also to

    LLMs write simple code (which does not require lots of effort to read) because they average over lots of code, which smooths out the complicated stuff. Readability is a marketing term.

    > Thoroughness: LLMs naturally tend to account for edge cases and various auxiliary factors that human developers frequently overlook. This thoroughness can sometimes be mistaken for mere verbosity.

    Yes, this is certainly true.

    > While it’s true that about six months ago specifications were growing exponentially, this trend has significantly slowed down over the last couple of months.

    We are still at the start of the process. In ten years’ time we should have a better idea of how to use LLMs.

  5. Martyn Thomas
    June 19, 2026 (3 weeks ago) 15:55 | #5

    What does “fully cover it with tests” mean? Is it every possible path through the code with every possible value for each variable? That feels like a lot of tests, even for a small module.
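
A back-of-the-envelope calculation shows the scale of the problem. The figures are assumptions chosen for illustration: a function taking two independent 32-bit integer arguments, and a test rig that runs one billion tests per second. Paths through the code are ignored; only input values are counted.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def exhaustive_inputs(bits_per_variable: int, variables: int) -> int:
    """Number of distinct input combinations for independent fixed-width variables."""
    return 2 ** (bits_per_variable * variables)

def years_to_run(tests: int, tests_per_second: float) -> float:
    """Wall-clock years needed to run the given number of tests back to back."""
    return tests / tests_per_second / SECONDS_PER_YEAR

# Assumed example: two 32-bit arguments, one billion tests per second.
combos = exhaustive_inputs(bits_per_variable=32, variables=2)
print(combos)
print(years_to_run(combos, 1e9))
```

Under these assumptions the input space is 2**64 combinations, which would take more than 580 years to enumerate, so "fully covered" has to mean something much weaker than every value of every variable.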

  6. Pavel Gurov
    June 22, 2026 (2 weeks ago) 12:46 | #6

    @Derek Jones

    >Readability is a marketing term.
    I’d suggest coming at it from a different angle than tracking eye movement or anything like that. A simple example: a developer comes across a function in the code. If the function reads well, it’s easy to modify and to call. Whereas if you find yourself wanting to rewrite it, that’s often a sign the function isn’t readable. In my own practice I’ve seen code from developers that passes the linters but is unreadable — the kind that makes other developers’ eyes bleed.

    >you are out of a job
    That’s no surprise to me. Back in 2013 I wrote a short story about exactly this: https://habr.com/ru/articles/202460/ — and here’s the English translation: https://habr.com/ru/articles/708560/

  7. June 29, 2026 (6 days ago) 01:28 | #7

    @Pavel Gurov
    I wish I could come at the readability issue from an eye tracking perspective. The application of eye tracking to source code readability is relatively new. Previously researchers have simply asked students to judge the readability of code.

    In general, people are good at following along with things they have practiced. Readability is high if what you are reading follows patterns you are already familiar with.
