Harness engineering: Leveraging Codex in an agent-first world (openai.com)

177 points by pramodbiligiri 2 days ago | 114 comments

yurimo 6 hours ago [-]

What I still can't understand is why a massive amount of generated code is a flex. I don't feel that software has gotten a lot better in the past 3 years, only sloppier. It's surprising to me that people who know about reward hacking choose a simple objective like lines of code generated as a signal for quality. I'd argue you have to optimize for as few lines generated as possible, while the secondary optimization should be readability for humans. I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.

And if I am working on an existing codebase, isn't a good commit often net negative in lines, with more removed than added? I don't want to bloat my codebase but make it more polished and elegant. After reading that, I wonder if what they have done could have been accomplished with a far smaller LoC budget.

YZF 4 hours ago [-]

Lines of code has always been a terrible metric. But all else being equal it is a measure. If all else is not equal, which is usually the case, then it's not.

A lot of the focus has been on AI recently.

Three years ago we didn't have software where a non-software engineer can describe what they want in English and get working(-ish) software generated by other software. Is that not "software has gotten a lot better"?

Other than that I'm not sure how we measure "software has gotten better". New applications? More features? How do we measure sloppier? Is Google Maps suddenly taking you the wrong way more often? I'm not really doubting your subjective experience, but seriously, how do you tell? I mean a doc is a doc and a spreadsheet is a spreadsheet.

We're also only about 10 months into models that are powerful enough to potentially make a bigger difference and we are still figuring out how to use them best.

yurimo 53 minutes ago [-]

I personally don't view coding agents making software as "software gotten better". You are comparing a tool and the end result; these are two different things. The agent you use going down and your product going down mean two different things to your customers. I will not deny that we made incredible progress in coding and, hell, even design over the past 3.5 years; this technology is here to stay.

That being said, while I agree that measuring software quality is vague (part of the reason it is hard for models as well), there are universal things I believe every engineer will agree on: reliability, uptime, customer feedback, legibility of your engineering, performance. These are things we often optimized for. Google Maps is a bit of a strawman because neither of us (unless you work on it) knows how much agent code there is; I think it is likely little, since it was working fine prior to 2023. I could bring up GitHub reliability as an example, given how much Copilot usage they promote at MS, but once again only folks there know for certain. I do, however, see scores of AI-powered SaaS products that look like they are in a perpetual MVP state. I think you are right that even if agents give us "good enough" results and we can swallow failure rates and our ever-shrinking understanding of what we, or more so the model, created, it is still progress overall. But this is progress not toward human-AI collaboration but toward AI-only engineering IMO, which is good or bad depending on how you view the future.

I'm a scientist and most of the code I currently write is somewhere on the intersection of critical software and machine learning. Squaring these two is not easy, and I guess the way I was taught to reason about engineering informs my opinions on this. Maybe it's just a matter of time before Codex can help here in an unconstrained manner as well, but I am skeptical at the moment.

arcticbull 2 hours ago [-]

I also can't help but notice they didn't mention how many tokens were burned, or how much that translates to in terms of cost over the 5 months at enterprise AI prices. I'm going to guess this wasn't a cheap demo.

jh00ker 56 minutes ago [-]

>I suspect it's not seen as a problem by providers because more lines generated means more tokens used and hence more billing put out on customers.

I have also grown suspicious of token usage being inflated to run up my bill! But since I feel like it takes me MORE effort to write FEWER lines of code myself, I'd expect a quick and dirty AI-generated solution to be MORE lines of code and cost LESS to generate than a concise/elegant solution in FEWER lines of code.

trollbridge 4 hours ago [-]

The "lines of code" at this point are basically the same thing as binary code that comes out of a compiler - something you almost never look at and certainly won't try to touch by hand.

The actual "code" is everything driving the harness.

The current problem with this is that the harness is not (yet) deterministic, so it's sort of like having a compiler where your output program works slightly differently every build, and then the compiler tries to just patch the binary programs when you recompile to minimise this problem, or even worse, disassembles the whole thing to figure out what it does, makes the change, and then recompiles it.

yurimo 2 hours ago [-]

I think the telling part is in this line:

> Because the repository is entirely agent-generated, it’s optimized first for Codex’s legibility

I asked the question from the perspective of a human engineer, as in, I will have to read the code, understand it, and fix it once it breaks. OpenAI's approach is the opposite: even if it is breaking, it is the agent that will be doing the fixing, so millions of lines and inelegant designs don't matter because human readability doesn't matter. In any case you use more tokens, so you fork over more money.

I will say, however, that IMHO there is objectively bad and good code in terms of what it can do and how it performs. If I can do the same thing in 50 lines as opposed to 1000 lines, this difference still matters for the model: smaller context usage, and a better approach that informs downstream generation.

ArtRichards 2 hours ago [-]

This is the part I think we will see become more relevant.

I created docs-cli (pypi) to manage the index of specs as source code: the framework that goes with it will first create tests for as much as it can, so reproducibility becomes the goal, not readability.

https://github.com/ArtRichards/docs-cli

https://artrichards.github.io/agent-playbook-suite/blog/

crdrost 5 hours ago [-]

Yeah so all of personal computing (text editing, SVG antialiasing, etc.) fits in 20,000 LOC (VPRI's STEPS project), so a million lines of code is 50 reïnventions of personal computing. BUT: it is unlikely that humans would have solved this problem in 20 kLOC. Sussman said “we really don't know how to compute!” as his talk title, and LLMs had to ossify some pre-existing voice as the forever programming habitus, and it chose a persona that doesn't know how to program, because we don't, and now we are stuck with it. Claude is our tickets, our implementations, our documentation... And if you tell it “hey the node role should not have those permissions, that should be a service account” it will happily do the right thing, but it has no intrinsic sense of taste, and the error message it's trying to clear just says “the node role doesn't have that permission” and the system prompt says “keep it short, stupid” and graybeards might be our last bulwark.

zahlman 3 hours ago [-]

> VPRI's STEPS project

The what now? Search engines failed me here.

linguae 2 hours ago [-]

This is a description of Alan Kay’s STEPS project:

https://worrydream.com/refs/Kay_2007_-_STEPS_2007_Progress_R...

This is the final report:

https://tinlizzie.org/VPRIPapers/tr2012001_steps.pdf

B-Con 2 hours ago [-]

People want to do X, so the metric is how much X can be done.

Everyone is over-complicating the explanation. The answer for "why are we fixating on this bad metric" is almost always the same pattern.

Broad audiences need simple metrics to talk about. If the metric itself requires nuance, it's hard to communicate and hard to reason about. It's easier to push the need for nuance from understanding the metric itself down the road to where the metric is applied, which allows everyone to ignore it in immediate conversation.

bonigv 3 hours ago [-]

It is a very valid question. My intuition (no grounding) is that it comes down to model training. Optimizations have traditionally worked well in human-written software through the experience of the developer, the use of architectural patterns, or a second or third pass of fine-tuning. In the case of model-written code (produced one token at a time), the only possible architectural optimization is either a strict guardrail on the patterns to use for a specific implementation OR a second or third optimization pass. All of which burns more tokens, but can lead to better software.

dotancohen 2 hours ago [-]

I've heard it said that measuring productivity of a software developer by lines of code added, is akin to measuring the productivity of an aerospace engineer by mass added.

It is a metric. It is often not a good metric. But it is easy to measure.

Sparkyte 1 hours ago [-]

Indeed. The more complexity you add, or the more you generate without ensuring it isn't pointless code, the more bloat you get. It is still probably more of an MVP than junk developers will produce, but never anywhere near as good as the work of someone who studies a programming language.

Npovview 2 hours ago [-]

Don't compare human loc with machine loc.

Compare machine to machine (as these headlines come) and discount that by a factor.

yurimo 2 hours ago [-]

You can't really do that here because one of the key arguments for this, which people in the thread focus on, is the "1/10th of the time" estimate. The comparison with humans is already here, albeit it is just an estimate and no actual comparison has been done.

This is a problem of conflicting incentives that exists today, in my opinion. Companies will market greater human-AI collaboration in science and engineering but focus on releasing things like this, where it is clear that the downstream goal is complete agent ownership over the product, from inception to testing to monitoring. Maybe the speculative future agents will use their own very efficient language to code that won't be readable for people at all. They focus on agent code being readable by the agent in the article, as you've said. But in my mind, in the near future at least, there is a case where your prod will break and you won't be able to understand it or the attempted fixes. Maybe the agent will fail to fix it at all and start a massive rewrite. In any case, is this different from kicking technical debt down the road, along with worse interpretability of what you have built?

I do think there is a way for an agent to write great, solid code that we can read, but with the way LLMs are built this requires something new in terms of reward that accounts for "taste" and constant refinement, so it might take more than 1/10th of the time to produce something good.

cpard 4 hours ago [-]

I don’t think the flex here is the amount of code alone. Their goal is to show that AI can improve productivity, the number of lines is just the proxy to that. This article is a marketing piece after all.

Now someone can argue that lines of code are not a good proxy of engineering productivity, but I wouldn’t be surprised if the audience they target with this content is not the HN commenters of this thread.

threethirtytwo 3 hours ago [-]

Because sloppy code doesn't matter. He wrote that he completed it in 1/10th of the time. That's equivalent to 1/10th the engineers.

That is a business win. That is really all that matters in capitalism.

The flex is a direct insult to your face. He is shitting on the faces of all software engineers (me included). It is equivalent to saying we don't need you to code anymore. One man can produce 10x the code.

So why am I voting him up even though he's shitting on my face? Because what he says is true. I value honesty and people who tell it like it is. Yes, my identity as a software engineer is getting dismantled before my very eyes. But the solution to this problem isn't some delusional statement about not understanding what he's flexing about. We're not stupid. Everyone on this thread understands his flex. The difference is some people like you don't want to understand it.

Like seriously. He literally wrote it was completed in 1/10th of the time and you expect me to believe that YOU don't know what HE is flexing about? Be real. You're not stupid.

estetlinus 3 hours ago [-]

The real flex is delivering it in one-tenth of the time. Mentioning lines of code is mostly noise.

I’ve worked with 20-year-old codebases and products that grew organically over decades and still sit well below a million lines of code. Using LOC as some kind of health or success metric makes me more suspicious than impressed.

dumbdumb125 5 hours ago [-]

It's a huge flex if the alternative is no code at all. Reward hacking aside, LOC resonates with me in the sense that I've seen 10+ projects to fruition that wouldn't have even begun without an agentic harness and an LLM.

It's like the difference between doing stock price predictions with binary "up" or "down" histories and trying to figure out how to normalize actual price histories (basically impossible). The binary work gives a well-defined signal.

drivebyhooting 5 hours ago [-]

I wish these breathless blog posts would actually try to be more didactic.

For example, actually doing a walkthrough of how to set up these allegedly super powered workflows and concrete demonstrations.

I’m not an AI skeptic. Rather, I don’t want to miss out on any actual super powers.

nimonian 4 hours ago [-]

I do quite a lot of what this post describes in a reasonably large project. Here's what works for me:

- write gherkin features for new features; update them for enhancements; don't touch them for refactors. Label your PRs with these nouns.

- use pre-push hooks for type checks, linting, unit tests, and other quick, scriptable validations.

- make a vitepress subsite in your repo, have the agents maintain it - document important principles, architecture, etc.

- make a cli command which lists all pages along with the yaml frontmatter description so agents can choose what to read without blowing up the context window.

- use ddd and monorepo - write your logic in headless layers, and compose layers into apps. agents navigate layers very successfully.

- use zod (or your language equivalent) and contract-first API development; this is my favourite bit tbh, I use orpc

- make a single skill called "code" which describes the lifecycle: open a worktree, setup .env to guarantee no conflict with other agents (choose unused ports etc - docker is good here), write or update feature file (this is where you negotiate the spec), implement, validate (e.g. using playwright mcp), pre-push checks, push and wait for review, tear down and fast forward main

- testcontainers is great for ensuring multiple agents can run tests that don't conflict

Seriously I only have one skill that's it. Everything else is in the docs. I'm feeling very productive like this, in a "making good software" sense not a LoC sense.
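A minimal sketch of the page-listing command from the list above, assuming the docs are Markdown files whose leading frontmatter carries a one-line `description:` field. The `docs` default directory and the helper names are hypothetical, and the parser only handles simple `key: value` lines rather than full YAML:

```python
import sys
from pathlib import Path


def read_description(path: Path) -> str:
    """Return the `description` value from a file's leading frontmatter, or ''."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return ""  # no frontmatter block at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the frontmatter block
        key, sep, value = line.partition(":")
        if sep and key.strip() == "description":
            return value.strip().strip("\"'")
    return ""


def list_pages(docs_dir: Path) -> list[tuple[str, str]]:
    """Return (relative path, description) for every Markdown page, sorted by path."""
    if not docs_dir.is_dir():
        return []
    return [
        (page.relative_to(docs_dir).as_posix(), read_description(page))
        for page in sorted(docs_dir.rglob("*.md"))
    ]


if __name__ == "__main__":
    # Hypothetical default: a `docs` directory next to where the command runs.
    root = Path(sys.argv[1] if len(sys.argv) > 1 else "docs")
    for rel_path, description in list_pages(root):
        print(f"{rel_path}\t{description}")
```

The output is one short line per page, so an agent can scan the index and open only the pages whose descriptions look relevant.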

nullbio 3 hours ago [-]

Can you share your skill please?

pramodbiligiri 36 minutes ago [-]

I agree with many of the points made by nimonian above (esp the one starting with 'make a single skill called "code" which describes the lifecycle'), based on my limited experience with these things.

I'm building a skill + CLI tool along those lines (for solo devs not corporates). Here is what my "lifecycle" type skill looks like right now: https://github.com/bitkentech/shipsmooth/blob/releases/dist/... (warning, heavily work in progress). You can see a demo here: https://shipsmooth.net/

I was not happy with the default code quality generated by Claude Code. So I've been adding some skill-file rules to address that, and so far happy with the results: https://github.com/bitkentech/shipsmooth/tree/main/skills/ex.... There was a similar one on HN yesterday called opencodereview: https://news.ycombinator.com/item?id=48406358

There are many such workflows out there! Matt Pocock gave a good talk about how he approaches it: https://www.youtube.com/watch?v=-QFHIoCo-Ko

rednb 2 hours ago [-]

That's a big ask. This kind of harness usually contains plenty of proprietary insights about their business. And also, nowadays, a good harness is a major competitive advantage.

nullbio 20 minutes ago [-]

Good thing I wasn't asking you.

Also, a skill is not a harness.

bze12 16 minutes ago [-]

I agree. I followed this article for a repo I'm working on, and I had a very hard time inferring how, specifically, they implemented "providers" and enforced import layers. A sample repo would've been nice.

simonbarker87 14 minutes ago [-]

I’d be interested to know two things:

1. What’s the job satisfaction like day to day being an engineer on this project? How have they adapted to this way of working?

2. How much did it cost? Work is being done whilst the engineers sleep but if that 6 hours overnight task cost $300 and could have been done by a person in 2 hours is it a real saving?

bko 8 hours ago [-]

> We had weeks to ship what ended up being a million lines of code... Five months later, the repository contains on the order of a million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. Over that period, roughly 1,500 pull requests have been opened and merged with a small team of just three engineers driving Codex. This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.

That's an insane level of throughput. What's a good baseline? Prior to agentic coding, what's the typical number of PRs engineers were expected to push? Maybe 2-10?

Do people feel the software has gotten better in the last 6 months? The number of engs is prob the same so we should expect maybe 5x faster cycle in major software apps, but I don't see it. The AI apps do change very fast, but given it's a very new field, I'd expect as much. But outside of that, I don't see it.

torben-friis 7 hours ago [-]

Here's a fun one: Firefox lists its current count at about 2.5M LOC, from roughly 1M commits over the years.

You end up with about 3 lines added per commit, which is not ridiculous when you consider that most would be edits rather than full additions.

Here, we have 1500 PRs and 1M LOC, which is about 670 added LOC per PR. Remember, not 670 lines total in the PR, but a +670 balance after additions minus removals.
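The back-of-the-envelope comparison above can be reproduced in a few lines. This is only a sketch using the approximate figures quoted in this comment; a Firefox commit and an agent-authored PR are not the same unit of work, so the ratio is illustrative rather than a measurement:

```python
# Approximate figures as quoted in the comment, not independently verified.
FIREFOX_LOC = 2_500_000
FIREFOX_COMMITS = 1_000_000
PROJECT_LOC = 1_000_000
PROJECT_PRS = 1_500


def net_lines_per_change(total_loc: int, changes: int) -> float:
    """Average net lines (additions minus removals) surviving per change."""
    return total_loc / changes


firefox_rate = net_lines_per_change(FIREFOX_LOC, FIREFOX_COMMITS)
project_rate = net_lines_per_change(PROJECT_LOC, PROJECT_PRS)

print(f"Firefox: {firefox_rate:.1f} net lines per commit")
print(f"Agent project: {project_rate:.0f} net lines per PR")
print(f"Ratio: {project_rate / firefox_rate:.0f}x")
```

The per-PR quotient comes out at roughly 667 net lines, a bit over 250 times the per-commit figure for Firefox.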

Fun questions for attentive readers:

- What does a project growing at a rate of one full firefox-codebase worth of LOC per year look like, a decade down the line?

- What does the line count say about the verbosity of the tool, and what does it say about outcomes that the purpose of the project isn't clearly disclosed?

- Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?

- If it was confirmed that LLM usage blows up your line count, what's the implication for codebases that want to return to manual coding after months of usage? (Say, because the tool gets expensive).

stult 3 hours ago [-]

> - Do we have reasons to care about LOC in a world where we don't write code manually? What happens to token usage numbers when the codebase is significantly larger?

Yes, at least to the extent that we care about context windows and tokens consumed by coding agents processing code that is ultimately irrelevant to their assigned task.

Anecdotally, I've found keeping file sizes small has been important for agentic coding not just to maintain human readability, but also for optimizing agent performance, precisely because it limits the amount of incidental context they load while working a problem, because they generally load entire files rather than just parsing the part relevant to their current assignment as a human might. That smaller file size thus reduces input noise and the LLM generates a tighter solution, which in turn reduces input noise for future solutions. Or at least this strategy avoids a death spiral into exploding context length.

I expect (but cannot currently prove) that keeping overall LOC down yields similar benefits even when file sizes are kept small because it spares the LLM from parsing potentially relevant files that prove irrelevant to its current task.
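One way to keep an eye on this is a small check that reports files over a line budget. The sketch below is an assumption-laden illustration: the 400-line budget, the suffix set, and the `oversized_files` helper are all hypothetical and would need tuning per project.

```python
from pathlib import Path

MAX_LINES = 400  # hypothetical budget; tune per project
SOURCE_SUFFIXES = {".py", ".ts", ".tsx", ".go", ".rs"}  # illustrative set


def oversized_files(root: Path, max_lines: int = MAX_LINES) -> list[tuple[str, int]]:
    """Return (relative path, line count) for source files over the budget, largest first."""
    results = []
    for path in root.rglob("*"):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            line_count = sum(1 for _ in handle)
        if line_count > max_lines:
            results.append((path.relative_to(root).as_posix(), line_count))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

A check like this could run alongside linting, so that a file growing past the budget is flagged before an agent has to load all of it into context.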

therealdrag0 5 hours ago [-]

Does the Firefox LOC include ALL forms of text: infrastructure (which Firefox doesn’t have), documentation, developer scripts, tests, etc? How is the test coverage of Firefox?

CleanCoder 6 hours ago [-]

When I got to the 1M LOC I involuntarily paused feeling like this must be satire.

krackers 7 hours ago [-]

They never specified what exactly the product was, without which it's impossible to judge the post.

For some reason most of the uses of "agents" are to build yet other AI products, it's turtles all the way down. Maybe that says more about the field of harnesses than it does about the power of "agents".

theptip 5 hours ago [-]

There is a sense in which it doesn’t matter at all; many of the limitations of agents in large codebases are just the context management challenges. So proving that you can cohere and progress at O(1m) is a useful scale observation. “Can I use agents in my 1m line codebase?”

There is of course another sense in which the output quality is the only thing that matters. “Can I use agents to build a 1m line codebase that I want to maintain going forward.”

I take this as being exclusively a tech demo of the former. Quality (feature velocity, bugs, scalability) is not demonstrated.

becomevocal 7 hours ago [-]

Feels like the active discovery going on is trying to understand what is computer vs what is AI, for every product.

Agents help a ton with the discovery, but the act of building a product needs a deeper level of thought and validation to make it actually better than what came before. So IMO what you see is people still learning what needs to be understood and crafted first hand to make a product better (including economics)

We’ll get there if more of us try

techblueberry 5 hours ago [-]

I’ve been vibe coding a lot over the past year or so, and I think I’m going to stop. In fact, I sort of want to challenge myself to see whether I can go back to the fork in the road with the old Copilot autocomplete workflow and really maximize that. Be in the driver’s seat for most of the code being written, but find ways to use AI to really enhance the flow state / remove blockers. Tools only, with minimal actual code generation.

59nadir 4 minutes ago [-]

I would be very impressed with someone who's been vibecoding "a lot" for about a year who could then go back to being fully in the loop for even 50%. I would even say I'd expect withdrawal symptoms at that point.

The dopamine hits are core to why people even do vibecoding (or vibecoding-in-a-dress/spec-driven development) and why they tend to overestimate its output so much. Hell, it's core to all forms of LLM-assisted development (because it feels like magic), but most of the other forms are more reward, less delusion.

Aperocky 7 hours ago [-]

> ended up being a million lines of code

This almost reeks of "I've never cleaned up our code base because there is too much code, and didn't even bother having agents/LLMs clean it up".

You almost never need a million lines of code - this includes your software, infra, testing and operational tools. You didn't ship the Linux kernel in 3 weeks and you know it. The code is already spaghetti; it achieves the basic functions OK, but it will get harder and harder to simplify, untangle and maintain.

bombcar 7 hours ago [-]

Even the linux kernel doesn't need millions of lines of code; most of the actual LOC is device drivers, and you don't need all of them, you just need the ones for the devices you have.

Chu4eeno 6 hours ago [-]

And Linux maintainers are actively pushing to radically cut down on the LOC by eliminating drivers etc.

zahlman 3 hours ago [-]

As a point of reference, 1MLOC is about the size of the entire Python standard library including tests, as well as stuff like IDLE. (Well, the Python part of the code. There's about half that much again of C in Modules/ .)

girvo 7 hours ago [-]

Yeah I cannot see how "we shipped 1 million lines of code in three weeks" is... something to be proud of haha

faustin 5 hours ago [-]

They directly address routine code cleanup and regularly paying down technical debt near the end of the article.

Aperocky 2 hours ago [-]

I stand corrected, but the LOC being advertised still makes me doubt the efficacy of their process.

aleqs 7 hours ago [-]

> should expect maybe 5x faster cycle in major software apps

To what end and what would that even look like though? Enshittifying everything at maximum speed? The apps/platforms I use regularly - GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.

linsomniac 6 hours ago [-]

>GitHub, Spotify, Google maps (just to name a few), have gotten noticeably shittier in recent times.

What if AI lets you create new versions of those tools, but without the enshittification?

I say that being in the "soaking" stage of using AI to rebuild a shitty software project in 70KLOC over about 2 weeks of spare time, so this may not be as theoretical as you might think.

aleqs 5 hours ago [-]

Oh I definitely agree that AI can and will help create great software.

It's just that creating great software isn't really the SV/VC/big tech business model or main goal.

NoraCodes 5 hours ago [-]

> What if AI lets you create new versions of those tools, but without the enshitification?

I'm not sure I fully understand what you're saying here. Isn't the value of these tools almost entirely independent of their actual software? That is, we have many good open source, self-hostable forges (Forgejo, sr.ht, etc.), lots of great music player software (Jellyfin, Symphonium, etc.), and decent maps software (OsmAnd and Organic Maps). People use GitHub, Spotify, and Google Maps -- perhaps even _put up_ with their often bad/glitchy software -- because of network effects (all three) and content/licensing partnerships (Spotify/GMaps). That proprietary data isn't something AI can help you with, right?

linsomniac 5 hours ago [-]

It really depends on the use-case. For example, my most starred github repo is a tool to convert Spotify playlists to YouTube Music (that was done pre-AI). Github depends on what issues you have with it, what your use case is, and whether you can leverage some of the network effects via API from the github source. Maps, same story.

nostrademons 5 hours ago [-]

AI coders are great for making scrapers, possibly because AI companies use their own tools to make an awful lot of scrapers.

dchftcs 5 hours ago [-]

This is a lot tamer than what Claude Code's team claims tbf.

jakolaptu 6 hours ago [-]

It is likely better because AI agents make access to domain knowledge easier. However, I would wager that the problem is people don’t remember the code well. The problems are going to be long-term as the pace of change increases.

If you think about it, successful products rely on designing well-thought-out experiences and on customer discovery (see all the Forward-Deployed Engineer job listings at OpenAI), so the code velocity somewhat becomes irrelevant.

If you’re solving the right problem and you’ve got a good team then competitive advantage comes from somewhere OUTSIDE of code velocity.

The more important question I think is does faster code yield more value long-term? At the moment, it’s like yeah we do 3.5 pull requests per day.

I’m thinking, great, good for you. You could also combine three pull requests into one and then you’re doing 1 per day. This is quantitative data that doesn’t really mean anything tangible.

varenc 7 hours ago [-]

digression:

It's interesting this was submitted to HN over 15 times since it was published in February: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

But this is the only submission that's had any traction. Since the content is nearly the same for all submissions, it highlights how getting to the front page can be a bit random. (Though this is the only one that capitalized 'Leveraging' so maybe that's the secret)

swyx 1 hours ago [-]

time of day also matters

spacebacon 20 minutes ago [-]

Leveraging a better way. No last mile.

https://github.com/space-bacon/SRT

thelucent 5 hours ago [-]

This might work only if you have “infinite” compute and infinite tokens.

As someone that used the $20 plan, this pure agentic approach is impossible to do because I’d hit the limit fast and I would end up with less outcome.

What I found to work incredibly well was to provide human-written code as a reference, and ask it to extend it. So I scaffold the entire thing, architect it, write a few code samples (controllers, services, models, components, database schema, how auth works, etc) so the LLM can have a head start on their attention (pun intended)

I usually write a stub with a lot of details on how to implement it. Something like a higher abstraction pseudo code. Then ask the LLM to implement it.
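A minimal sketch of what such a stub might look like, assuming a Python codebase. The function name, the coupon rules, and the `COUPONS` table are all hypothetical; the point is that the signature, the docstring, and the numbered notes carry the design, and the body is what the LLM is asked to fill in:

```python
# Hypothetical lookup table the implementation is expected to use.
COUPONS = {"WELCOME10": 10, "VIP25": 25}  # coupon code -> percent off


def apply_coupon(total_cents: int, code: str) -> int:
    """Return the order total in cents after applying a coupon.

    Implementation notes (the "higher abstraction pseudo code"):
    1. Reject negative totals with ValueError.
    2. Normalize `code` with strip() and upper() before the lookup.
    3. Unknown codes are not an error: return total_cents unchanged.
    4. Compute the discount with integer math, rounding down:
       discount = total_cents * percent // 100
    5. Never return less than 0.
    """
    raise NotImplementedError("stub: body to be written by the LLM")
```

When a generated body gets a case wrong, the numbered notes are the natural place to add the missing rule before asking again.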

When it fails, it is often better to undo all the changes, adjust the stub so it catches what failed before, and try again.

Or, commit the changes, and use a new fresh context and only address what went wrong.

Whenever I tried this agentic from scratch approach, I always end up disappointed; both on the outcome and on the limit that I hit before an hour even passed.

zuzululu 2 hours ago [-]

You are not going anywhere with the $20 plan

Upgrade to $200/month and you should see more usage, but even then, for a hardcore user like me, one can never have enough.

I'm still very jealous of those guys that got 200x usage simply by RSVP'ing to openai party

murat124 7 hours ago [-]

The other day I came across a video showing workers in an e-vape factory. They pick up a bunch of e-vapes from the conveyor belt (each bunch has 6 e-vapes, I think), stick them in their mouth and vigorously vape all of them for about 5 seconds, then test the next bunch. Humans reviewing hundreds of lines of change in a PR written by AI is not very different.

prakashn27 5 hours ago [-]

Very true. If a PR has 1000 lines I would check only a handful of them and leave the rest to the test suite.

h4ny 3 hours ago [-]

I'm not an AI skeptic but I'm skeptical of the intent of this article. It makes great claims about agent-first engineering and tries to make a real case based on a real product, with real users, and a real team that's been growing — all without even saying what was built or showing it, just like every other AI hype article.

swyx 1 hours ago [-]

we interviewed Ryan here: https://www.latent.space/p/harness-eng

and he gave a talk version of it in london: https://www.youtube.com/watch?v=am_oeAoUhew

shepherdjerred 6 hours ago [-]

This mirrors exactly what I have been doing.

- Give Claude/Codex a way to verify its own work (browser, smoke tests, e2e tests, high-fidelity local environment)

- Keep all context (issue tracking, docs, ideas, plans, worklogs) in-repo (https://github.com/shepherdjerred/monorepo/tree/main/package...)

- Give Claude/Codex access to observability (Grafana, Prometheus, Tempo, PagerDuty)

- Have Claude/Codex follow good engineering guidelines like fail-fast, type safety, parse at boundaries

I haven't yet been able to achieve full autonomy due to cost and CI load on my homelab.
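To make "parse at boundaries" and "fail-fast" concrete, here is a minimal sketch. The `Alert` type and `parse_alert` helper are hypothetical names made up for illustration, not anything from the linked repo: untrusted input is converted into a typed value once, at the edge, and anything malformed raises immediately instead of leaking further into the code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical typed value used by the rest of the program."""
    name: str
    severity: int


def parse_alert(raw):
    """Turn an untrusted dict into a typed Alert, or raise immediately."""
    if not isinstance(raw, dict):
        raise TypeError("alert payload must be a dict")
    name = raw.get("name")
    severity = raw.get("severity")
    if not isinstance(name, str) or not name:
        raise ValueError("alert.name must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(severity, bool) or not isinstance(severity, int):
        raise ValueError("alert.severity must be an integer")
    return Alert(name=name, severity=severity)
```

Code past this boundary only ever sees `Alert` instances, so it does not need to re-validate fields, and an agent gets a clear error at the exact point where bad data entered.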

para_parolu 5 hours ago [-]

Does it yield good results? I found that instead of docs it’s easier to just ask the AI to read the code. I feel like docs are the same as comments in code: they become outdated fast.

shepherdjerred 5 hours ago [-]

I don't really use "docs" for documentation. I've prompted Claude/Codex to always write a "log" and save it in-repo to track what it did and why.

I've found this to be really helpful, e.g. "you did this last week, and now some other thing is happening" or "you tried this approach before to solve alert X but it didn't work" -- except it can discover this itself.

https://github.com/shepherdjerred/monorepo/tree/main/package...

I've also used it to store TODOs and plans. For example I might want to explore some idea and defer it for later, or some weekend have it execute on some tech debt I've put off. One last use case is asking "what did I work on in the last 2-3 weeks, is it healthy, and what additional quality checks can/should I do; is there any follow-up work?"

bigcat12345678 1 hour ago [-]

This matches my Cursor-based agentic repo almost verbatim.

There isn't anything here that I hadn't already experienced and factored into constructs in the repo.

I also find that all of the bits created for an effective agentic engineering project match perfectly with mainstream engineering best practices. That has been one of my primary reasons to go all in on agentic engineering; prior to this, applying best practices was always too costly and conflicted with the team's daily priorities.

zatkin 6 hours ago [-]

I worry most about blind spots with this kind of approach. Let's say that this repository goes on for years, at which point the docs folder is several MB in size. Would Codex be able to think outside of the box? Or would the aggregate of the Markdown content fundamentally cover enough ground to prevent it from thinking of novel approaches to existing problems?

esikich 6 hours ago [-]

You tell it to update the docs, not append. I've done the same thing with a readme in the root with links to the docs. After every commit, before the push, I have my agent "update all relevant and related docs, add or remove what's needed" or something to that effect. And it works remarkably well. I also have an append-only change log it's supposed to add to. Between that, good commit messages, and comprehensive testing, I've built a homebrew OS and updating it is remarkably smooth. Runs a homebrew FTP and HTTP server and can run Wolfenstein. Working on DOOM right now. Close, but sound has been difficult.

https://github.com/ESikich/smallos

satvikpendem 3 hours ago [-]

Someone else in the comments said to have it make a static website with the info instead with clickable pages and sections so it reads only the content it needs to rather than dumping a long file into context windows. Although I suppose you can have a ToC in the readme too with multiple smaller markdown files as references.

vibcdingenjoyer 5 hours ago [-]

Yep. You’ve got to have it update the docs. After a few sessions, if I forget to request this, Opus starts rehashing the same tasks and finds that they are complete, and sometimes still won’t update those docs unless I ask.

Another tip is to condense the doc files to the minimum required. Sometimes I’ll end up with 5 to 6 floating around in various states of staleness. Condensing to 2-3 and removing completed tasks seems to help a lot.

therealdrag0 5 hours ago [-]

It’s not a self-coding machine. There is a human in the loop; they even added MORE engineers to the team of this project! 7 engineers should be able to collaborate with the AI to find good solutions to problems.

ajpaulson 3 hours ago [-]

> The diagram below shows the rule: within each business domain (e.g. App Settings), code can only depend “forward” through a fixed set of layers (Types → Config → Repo → Service → Runtime → UI). Cross-cutting concerns (auth, connectors, telemetry, feature flags) enter through a single explicit interface: Providers. Anything else is disallowed and enforced mechanically.

Can anyone give me a simplified explanation of what they’re saying here? Having some trouble understanding.

shepherdjerred 3 hours ago [-]

They're describing a layered architecture enforced by some script in CI.

For example, if you had a `backend`, `common`, and `frontend` package, you would be OK having backend/frontend depending on common, but you wouldn't want common depending on backend/frontend or backend/frontend depending on each other.

If you think about JavaScript, there is nothing stopping your dependency graph from becoming spaghetti. It sounds like they built static analysis to enforce rules.

Some languages have this built in like Java (Project Jigsaw), Go, and Rust. JavaScript, Python, etc. have no such feature.

It's really nothing special -- it has existed before. It just becomes a _lot_ more important with agents since they produce a lot of code, and it is good to have lots of static analysis when heavily utilizing agents.

They mention this in the article:

> This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it’s an early prerequisite: the constraints are what allows speed without decay or architectural drift.
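As a rough illustration of what such a check could look like, here is a minimal sketch. It is not the article's actual tooling: the layer names come from the quoted passage, while the `LAYERS` list, the `PROVIDERS` exemption, and the `violations` helper are assumptions of mine. It also assumes the loosest reading of the rule, where a layer may import from itself or from any earlier layer, and that some other tool has already extracted the import pairs.

```python
# Layer order taken from the quoted passage. Assumption: a layer may only
# import from itself or from layers that appear earlier in this list.
LAYERS = ["types", "config", "repo", "service", "runtime", "ui"]

# Assumption: cross-cutting concerns are reachable from any layer, but only
# through this one package.
PROVIDERS = "providers"


def violations(imports):
    """Return the (importer, imported) pairs that break the layering rule.

    `imports` is an iterable of (importer_layer, imported_layer) pairs.
    """
    rank = {name: i for i, name in enumerate(LAYERS)}
    bad = []
    for importer, imported in imports:
        if imported == PROVIDERS:
            continue  # the single explicit interface is always allowed
        if importer not in rank or imported not in rank:
            bad.append((importer, imported))  # unknown layer: fail fast
        elif rank[imported] > rank[importer]:
            bad.append((importer, imported))  # depends on a later layer
    return bad
```

Run in CI, a non-empty result would fail the build, so an agent gets the same mechanical feedback a compiler error would give it.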

iso1337 3 hours ago [-]

IIUC it's just strict separation of concerns

Eg UI cannot reach down and directly read config files

Configs must be only read by (I'm assuming) a storage interface layer called repo

There’s a strict directionality of dependency

Somewhat similar to ports and adaptors but presumably more strictly enforced by deterministic linters

faangguyindia 7 hours ago [-]

Codex updates usually appear every few hours (I'm not saying this is how often they're actually published, but that's my perception as a user). Often I update Codex just to see a new update within an hour or so.

Many times those updates are not properly tested. For example, in one update the model selector got completely changed.

Then the next hotfix was pushed, which restored the original.

dawnerd 7 hours ago [-]

Who needs a QA team when you can just test on users and iterate instantly /s

frictasolver 2 hours ago [-]

[flagged]

jonmoore 6 hours ago [-]

This would be much more convincing if the repos, issue trackers, etc. were accessible.

nullbio 28 minutes ago [-]

Step 1: Be rich.

angrydev 7 hours ago [-]

Published Feb 11, 2026

ukuina 6 hours ago [-]

Might as well be 2025.

Frannky 5 hours ago [-]

I started using ChatGPT for functions and checking, then for single file changes and checking, now for multiple changes and checking. I am at a point where the only changes I correct are architectural. So it may start to become smarter to learn how to see only the architectural directions while multiple agents work, test, and commit both on unit and against live deployment.

charintstr 5 hours ago [-]

I am at a major company that is essentially vibe coding. I’ve shipped about 100k LoC this entire half and am toward the top 10% of my team. I find it likely that either

A. The code is absolute garbage and is speed for speed's sake
B. They’re using an internal model that is a generation beyond GPT 5.5

I say this because we’ve attempted to do something similar using the latest gen Claude models and a significantly larger team. The code is probably along the lines of millions of LoC but is an absolute mess because of vibing. There’s a price you pay for speed.

iso1337 3 hours ago [-]

Q1 - How much effort did you put into deterministic guardrails like AST linters, etc?

I find there’s a ton of slop unless hard guardrails are added, eg step 1 is just around syntax, step 2 is to enforce mental models

You still need someone steering direction and have a logically consistent idea of what you actually want to build

Q2 - I find that vibe coding really accelerates FE projects because it’s possible to run everything locally and check results

For pure distributed infra backend more investments have to be made into the devloop to be able to shift left the feedback loop and decouple it from humans or real deploys

therealdrag0 5 hours ago [-]

I like how they said they were spending 20% of their time addressing slop. Sounds like they’ve tried to automate the slop correction but it’s a good honest reminder.

Additionally it’s an internal tool, which is likely much more amenable to slop.

zuzululu 2 hours ago [-]

I think a lot of people are sleeping on the contents of this article. There are still valuable tidbits I'm going to be applying.

andai 6 hours ago [-]

> To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a Ralph Wiggum Loop).

https://ghuntley.com/loop/
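The control flow described in the quote can be sketched roughly like this. Everything here is hypothetical and only mirrors the quoted description, not the code behind the link or OpenAI's harness: `reviewers` are callables that return a list of feedback strings (an empty list means satisfied), `revise` returns a new version of the change, and `max_rounds` is a safety cap I added so the loop cannot spin forever.

```python
def drive_to_completion(change, reviewers, revise, max_rounds=10):
    """Iterate until every reviewer is satisfied or the round budget runs out.

    Returns the final change and the round in which all reviewers passed.
    """
    for round_number in range(1, max_rounds + 1):
        # Collect feedback from every reviewer on the current version.
        feedback = [note for review in reviewers for note in review(change)]
        if not feedback:
            return change, round_number
        # Hand all feedback back to the author (the agent) for another pass.
        change = revise(change, feedback)
    raise RuntimeError(f"reviewers still unsatisfied after {max_rounds} rounds")
```

In a real setup the reviewers would be agent or human reviews and `revise` would be another agent run; the cap matters because nothing guarantees the reviewers ever converge.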

mgaunard 2 hours ago [-]

Isn't this essentially normal AI usage and what everyone has been doing for 6 months?

Aperocky 6 hours ago [-]

1 million lines of code aside, I feel like anyone who seriously thought about this would eventually run their own harness.

Just like .vimrc and .zshrc, the harness "code" itself can be easy and personal, provided that it's built on a working, existing construct such as tmux.

darepublic 7 hours ago [-]

Codex pushed an update that made my old threads inaccessible. It takes a million lines to put out a half-baked CRUD app?

aulin 1 hour ago [-]

Dear OpenAI, the target audience of your blog or at least of this blog post understands English pretty well. Why won't you give them a simple way to disable the shitty ai translation and read the original content? Why translate it at all in the first place?

EDIT: found the button, all the way down in the bottom of the page... I hate this so much, give me the original content, I will decide if and when I need translation

rfw300 7 hours ago [-]

I understand that they’ve written zero lines of code for this application, but would it kill them to write a few lines of the blog post by hand?

Forcing readers to wade through an unceasing string of LLM clichés demonstrates the opposite of the point you’re trying to make—that the consumers of your work are worse off because you exercised no human judgment in creating it.

shevy-java 2 hours ago [-]

The world is now agent-first already?

drchaim 7 hours ago [-]

But this is almost what we have been doing for the last 3-5 months, isn’t it?

wilsonnb3 7 hours ago [-]

Article is from February so that tracks

fbrncci 7 hours ago [-]

Well to a lot of people this is still a foreign concept.

bronny1989 6 hours ago [-]

why do you have “weeks” to ship what would take “months”?

andai 6 hours ago [-]

> We intentionally chose this constraint so we would build what was necessary to increase engineering velocity by orders of magnitude.

robotresearcher 6 hours ago [-]

I guess orders of magnitude ain’t what they used to be.

IAmGraydon 5 hours ago [-]

Title should probably be marked with (February 2026).

Sarkie 7 hours ago [-]

I would never dare put that in production

Yokohiii 4 hours ago [-]

> in an agent-first world

casual gaslighting

apical_dendrite 6 hours ago [-]

I wonder why we as engineers aren't protesting AI in the same way that artists and people in film and television are. This post should instill the same terror that visual artists feel.

If you're a more senior person in tech, this post is effectively saying that a large portion of your skillset is about to become completely worthless. This goes beyond the skills involved in writing the code. Everything that you've learned over years about how to determine whether code is good or bad, and what practices make an engineering team effective is not just obsolete, it's fundamentally counter-productive because it assumes a slow, human-centric process that requires you to actually review and understand the code. Even your ability to mentor junior engineers is now obsolete, because all that experience you've built up is now worthless to them.

If this is the approach the industry takes, particularly when combined with a lack of interest in quality from the business (and let's face it, consumers have shown us that they're happy to pay for cheap crap), it's hard to see much of a future for software engineers. You don't need thousands of people with deep technical expertise, you need a handful of manager-types, who will focus on defining product and business requirements and configuring how the AI gets enough context to implement the requirements.

Maybe, if we're extremely lucky, there's so much demand for software that total employment doesn't fall off a cliff, but the nature of the work will change so much that many older, more expensive engineers will become unemployable. Those who remain will have to accept that the skills they spent decades developing are now worthless, that younger engineers no longer respect or listen to them, that the business no longer sees them as experts worthy of respect, but old fogies who grew up in a different world.

Joe Biden liked to say that a job is more than just a paycheck, it's part of your identity and your sense of self-worth. We're all very used to a certain level of respect (and commensurate remuneration). If you don't think that's true, compare how a software engineer is treated to how a warehouse worker is treated. What happens when we lose that?

linsomniac 6 hours ago [-]

>a large portion of your skillset is about to become completely worthless

I'm not convinced of that.

I watched a video of an architect using AI to create architectural drawings. It became very clear to me that he has a lot of skills and terminology that helped him produce something very specific, in a few minutes. I've been working on some home improvement stuff including a studio/shed and I've struggled to produce even something simple (currently trying to get a conversation packet on the roof trusses to take to the permit department to get started). Even with my high school architecture class.

After watching that I wonder how much of what I'm doing with AI that looks easy is because I have deep technical knowledge, plus 3 years of heavy work with AI.

apical_dendrite 2 hours ago [-]

This is the case now - I can explain to the AI that I want to re-factor a component to support different implementations using a strategy pattern, and I can get a similar outcome to what I would have written, just implemented a bit faster. My expertise brings value.

But that's not what this specific article is describing. The world this article is describing is one where you describe the business requirements, and you don't think about how it's implemented. You don't write the code, you don't review the code, you don't test the code. You give the AI business requirements and you give it access to sources of context (slack, meeting notes, etc). Every place where the human would act as a gate reduces throughput, so it should be eliminated through building harnesses and providing context.

What they're doing here is the equivalent of taking a factory where you have 2 process engineers and 100 operators, and replacing all the operators with robots. They want to automate the whole process of making the software and just leave the part that figures out how to make the automation work effectively.

In this world, the average software company doesn't need people who know how to write good software, because writing, reviewing, maintaining, and testing the software will be entirely automated. There will be a small number of people at companies like OpenAI that need to know how to write good software in order to supervise training the models, and there will be a small number of people at the software companies who have expertise in setting up the automation.

jplusequalt 4 hours ago [-]

How do you keep your skills if you no longer engage in the activity that keeps them sharp?

YZF 3 hours ago [-]

What is that protest going to get us? We'll convince or force business leaders to not use a cheaper/better tool and protect our jobs? And nobody else in the world is going to pivot either? And our companies will remain competitive?

Software engineers have always adapted to new technologies. New languages, frameworks, native apps, browser apps etc. So far this doesn't seem to be close to completely removing us from the loop.

If you are smart, educated, and can adapt, you'll figure it out. The economy has to find some stable equilibrium and it's not a zero sum game. Everyone in the economy getting a paycheck is also a consumer. With no consumers there is no business. The companies who are using AI and become more productive can do more things that before were not profitable but now are. Some of the people who are getting laid off are going to start new businesses and hire people. These things always cycle, and they basically have to.

I don't have a crystal ball though.

briHass 5 hours ago [-]

It's the other way around, unfortunately. The senior engineers will still be useful for architecture and infrastructure considerations, as well as guiding the agents. It's the junior engineers that get nailed, because there's little incentive to hire one when a LLM does a better job immediately and costs less.

apical_dendrite 2 hours ago [-]

That's true now. But in the world of this article, it's also the senior engineers that get nailed. In the world of this article, all code is like what machine code or bytecode is now - it's designed to be used by the machine, not the human, because the expectation is that humans will rarely, if ever, touch it.

mhitza 2 hours ago [-]

Individual voices aren't strong enough to drown the marketing machine.

Artists and writers are unionized, which is why they have a more powerful collective voice.

Second, there are enough people whose jobs are very well paid and too cozy for them to dare to rock the boat.

The economy and job market aren't so hot either at the moment for people to quickly be able to jump ship.

Can you even be sure that you find a tech company that isn't jumping head first onto the AI hype train? Even politicians can't have enough of AI in their mouth.

zuzululu 2 hours ago [-]

engineers undervalue their own process

artists overvalue their own outputs

usernametaken29 2 hours ago [-]

I for one am not protesting because I know that this is bullshit marketing nonsense. Look at reliability metrics of OpenAI, they’re terrible. Everyone knew a long way ahead that it’s a scam, now they’re cranking up pricing and trying to rug pull. There will be a lot of developers who will come out very well once the stock tanks. That’s my two cents

Waffle2180 4 hours ago [-]

[flagged]

jlintc 7 hours ago [-]

[flagged]

trytodupe 5 hours ago [-]

[dead]

EnPissant 5 hours ago [-]

> Over the past five months, our team has been running an experiment: building and shipping an internal beta of a software product with 0 lines of manually-written code.

This is such a common thing among software engineers nowadays that I was very surprised that OpenAI would open with that line as if it were mind blowing.

But then I saw it was published in February and OP is just reposting it to farm karma.

knicholes 7 hours ago [-]

Everyone is criticizing the number of lines of code and the lack of attention that must certainly have been applied to generate that code and push it into production. What is being ignored is this awesome prompt that is almost certainly better than having no agents.md or plans.md or whatever you've come up with, to add validation steps for committed changes. You're still free to look at your code, the changes, and ask the agent to clean up. Try it. It's really nice.
