I do feel people will end up using this for things where a deterministic rule would be more effective, faster, and cheaper. I'm seeing this start to happen at work: "We need AI to solve X." "No, you don't."
simianwords 40 minutes ago [-]
I remember trying to set something up with ChatGPT's equivalent, something like "notify me only if there are traffic disruptions on my route every morning at 8am", and it would notify me every morning even if there was no disruption.
theredbeard 37 minutes ago [-]
This is because for some reason all agentic systems think that slapping cron on it is enough, but that completely ignores decades of knowledge about prospective memory. Take a look at https://theredbeard.io/blog/the-missing-memory-type/ for a write-up on exactly that.
scottmcdot 23 minutes ago [-]
Me too. It doesn't have the ability to alert only on true positives; it alerts on the negatives as well. So dumb.
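For what it's worth, the rule the prompt was asking for is deterministic and tiny — a sketch, where `fetch_disruptions` and `notify` are hypothetical hooks into a traffic API and a notification channel:

```python
def should_notify(disruptions):
    """Alert only when there is actually something to report."""
    return len(disruptions) > 0

def format_alert(disruptions):
    return "Traffic alert: " + "; ".join(disruptions)

def morning_check(fetch_disruptions, notify):
    # cron supplies the "every morning at 8am"; this supplies the "only if".
    disruptions = fetch_disruptions()
    if should_notify(disruptions):
        notify(format_alert(disruptions))
```

The conditional is the whole point: on a disruption-free morning the function simply stays silent.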
gowthamgts12 2 hours ago [-]
Interesting to see feature launches coming via the official website while usage restrictions come in via a team member's Twitter account - https://x.com/trq212/status/2037254607001559305. Also, someone rightly predicted this rugpull when they announced 2x usage - https://x.com/Pranit/status/2033043924294439147
To me it makes perfect sense for them to encourage people to do this, rather than eg making things more expensive for everyone.
The same as charging a different toll price on the road depending on the time of day.
nickandbro 2 hours ago [-]
I feel like we are inching closer and closer to a world where rapid iteration of software is the default. For example: a trusted user gives feedback -> the feedback gets curated into a ticket by an AI agent, then turned into a PR by an agent, then reviewed by an agent, before being deployed by an agent. We are maybe one or two steps from the flywheel being completed. Or maybe we are already there.
chatmasta 2 hours ago [-]
I love everything about this direction except for the insane inference costs. I don’t mind the training costs, since models are commoditized as soon as they’re released. Although I do worry that if inference costs drop, the companies training the models will have no incentive to publish their weights because inference revenue is where they recuperate the training cost.
Either way… we badly need more innovation in inference price per performance, on both the software and hardware side. It would be great if software innovation unlocked inference on commodity hardware. That’s unlikely to happen, but today’s bleeding edge hardware is tomorrow’s commodity hardware so maybe it will happen in some sense.
If Taalas can pull off burning models into hardware with a two month lead time, that will be huge progress, but still wasteful because then we’ve just shifted the problem to a hardware bottleneck. I expect we’ll see something akin to gameboy cartridges that are cheap to produce and can plug into base models to augment specialization.
But I also wonder if anyone is pursuing some more insanely radical ideas, like reverting back to analog computing and leveraging voltage differentials in clever ways. It’s too big brain for me, but intuitively it feels like wasting entropy to reduce a voltage spike to 0 or 1.
eksu 1 hour ago [-]
This is the wrong way to see it. If a technology gets cheaper, people use more and more of it. If inference costs drop, you can throw far more reasoning tokens at a problem, or combine many agents, to increase accuracy or creativity and such.
mastermage 27 minutes ago [-]
I mean, theoretically, if there are many competitors the cost of the product should drop because of competition.
Sadly, I haven't seen that happen in a long time.
theredbeard 34 minutes ago [-]
We haven’t been inching closer to users writing a half-decent ticket in decades though.
Leptonmaniac 2 hours ago [-]
I think that as a user I'm so far removed from the actual (human) creation of software that if I think about it, I don't really care either way.
Take for example this article on Hacker News: I am reading it in a custom app someone programmed, which pulls articles hosted on Hacker News which themselves are on some server somewhere and everything gets transported across wires according to a specification. For me, this isn't some impressionist painting or heartbreaking poem - the entity that created those things is so far removed from me that it might be artificial already.
And that's coming from a kid of the 90s with some knowledge in cyber security, so potentially I could look up the documentation and maybe even the source code for the things I mentioned; if I were interested.
slopinthebag 40 minutes ago [-]
Art is and has always been about the creator.
jvuygbbkuurx 2 hours ago [-]
Tusted user like Jia Tan.
heavyset_go 20 minutes ago [-]
Feedback loops like that would be an exercise in raising garbage-in->garbage-out to exponential terms.
It's the "robots will just build/repair themselves" trope but the robots are agents
edf13 24 minutes ago [-]
Or perhaps we end up where all software is self evolving via agents… adjusting dynamically to meet the users needs.
eru 43 minutes ago [-]
Instead of having a trusted user, you can also do statistics on many users.
(That's basically what A/B testing is about.)
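The "statistics on many users" part is classically a two-proportion z-test; a minimal sketch of the arithmetic behind an A/B readout:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic comparing conversion counts of variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

With the usual threshold, |z| > 1.96 corresponds to significance at the 5% level; e.g. 100/1000 vs 150/1000 conversions gives z ≈ 3.4.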
tuo-lei 45 minutes ago [-]
The missing piece for me is post-hoc review.
A PR tells me what changed, but not how an AI coding session got there: which prompts changed direction, which files churned repeatedly, where context started bloating, what tools were used, and where the human intervened.
I ended up building a local replay/inspection tool for Claude Code / Cursor sessions mostly because I wanted something more reviewable than screenshots or raw logs.
hyperionultra 40 minutes ago [-]
"Trusted user" also can be an Agent.
slopinthebag 2 hours ago [-]
What kind of software are people building where AI can just one-shot tickets? Opus 4.6 and GPT 5.4 regularly fail on complicated issues for me.
withinboredom 2 hours ago [-]
Not just complicated, but even simple ones if the current software is too “new” of a pattern they’ve never seen before or trained on.
slopinthebag 1 hour ago [-]
I dunno if Rust async or native platform APIs that have existed for years count as new patterns, but if you throw even a small wrench in the works they really struggle. That's expected, really, when you look at what the technology is - it's kind of insane we've even gotten this far with what amounts to fancy autocomplete.
victorbjorklund 24 minutes ago [-]
Of course, not all tickets are complex. Last week I had a ticket to display the updated date on a blog post next to the publish date. A perfect use case for AI to one-shot.
thin_carapace 2 hours ago [-]
I don't see anyone sane trusting AI to this degree any time soon, outside of web dev. The chances of this strategy failing are still well above acceptable margins for most software, and in safety-critical settings it will be decades before standards allow such adoption. Anyway, we are paying pennies on the dollar for compute at the moment - as soon as the gravy train stops rolling, all this intelligence will be out of reach for most humans, unless some more efficient generalizable architecture is identified.
heavyset_go 29 minutes ago [-]
> as soon as the gravy train stops rolling, all this intelligence will be out of access for most humans. unless some more efficient generalizable architecture is identified.
All Chinese labs have to do to tank the US economy is to release open-weight models that can run on relatively cheap hardware before AI companies see returns.
Maybe that's why AI companies are looking to IPO so soon, gotta cash out and leave retail investors and retirement funds holding the bag.
PeterStuer 13 minutes ago [-]
They could still eliminate relatively cheap hardware.
thin_carapace 15 minutes ago [-]
I was under the impression that we were approaching performance bottlenecks, both with consumer GPU architecture and with this application of the transformer architecture. If my impression is incorrect, then I agree it's feasible for China to tank the US economy that way (unless something else does it first).
m00x 1 hour ago [-]
Several fintechs, like Block and Stripe, are boasting thousands of AI-generated PRs with little to no human review.
Of course it's in the areas where it doesn't matter as much, like experiments, internal tooling, etc, but the CTOs will get greedy.
thin_carapace 30 minutes ago [-]
These companies underpin swathes of the West's financial infrastructure - not quite safety-critical, but critical enough. It's insane to involve automation here to this degree.
slopinthebag 1 hour ago [-]
I don't think anybody is doubting its ability to generate thousands of PR's though. And yes, it's usually in the stuff that should have been automated already regardless of AI or not.
slopinthebag 1 hour ago [-]
Even in webdev it rots your codebase unchecked. Although it's incredibly useful for generating UI components, which makes me a very happy webslopper indeed.
thin_carapace 7 minutes ago [-]
I'm grateful to have never bothered learning web dev properly; it was enlightening watching ChatGPT transform my ten-second MS Paint job into a functional user interface.
bredren 2 hours ago [-]
What you're describing is absolutely where we're headed.
But the entire SWE apparatus around it can be handled too: automated A/B testing of the feature, progressive-exposure deployment of changes, you name it.
tossandthrow 2 hours ago [-]
I think the AI agent will make a PR directly - tickets are for humans with limited mental capacity.
At least in my company we are close to that flywheel.
_puk 1 hours ago [-]
Tickets need to exist purely from a governance perspective.
Tickets may well not look like they do now, but some semblance of them will exist. I'm sure someone is building that right now.
No. It's not Jira.
tossandthrow 1 hours ago [-]
Yes, so my point is that PRs act as that governance layer - with preview environments, you can see the complexity and risk of the change etc.
Gigachad 2 hours ago [-]
The agents have even more limited capacity
eru 42 minutes ago [-]
At the moment, maybe. But it's growing.
MattGaiser 2 hours ago [-]
I am already there with a project/startup with a friend. He writes up an issue in GitHub and there is a job that automatically triggers Claude to take a crack at it and throw up a PR. He can see the change in an ephemeral environment. He hasn't merged one yet, but it will get there one day for smaller items.
I am already at the point where because it is just the two of us, the limiting factor is his own needs, not my ability to ship features.
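The trigger side of a job like that can be a simple scheduled poll; a sketch over the GitHub REST API that hands each open issue to `claude -p` (owner, repo, and token are placeholders, and the commit/push/PR step is deliberately elided):

```python
import json
import subprocess
import urllib.request

def build_prompt(issue):
    """Turn a GitHub issue payload into a one-shot coding prompt."""
    return f"Implement this ticket:\n{issue['title']}\n\n{issue['body']}"

def fetch_open_issues(owner, repo, token):
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/issues?state=open",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def attempt(issue):
    # Claude Code edits the working copy; committing, pushing a branch,
    # and opening the PR would be separate follow-up steps.
    subprocess.run(["claude", "-p", build_prompt(issue)], check=False)
```

In practice the same shape is usually wired up as a CI job reacting to an issue-created event rather than a poll.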
jondwillis 2 hours ago [-]
Why doesn’t he merge them?
m00x 1 hour ago [-]
Must be nice working on simple stuff.
yieldcrv 2 hours ago [-]
We do feedback-to-ticket automatically.
We don't have product managers or technical ticket writers of any sort.
But we devs still choose how to tackle each ticket. We don't strictly have to, since I'm solving the tickets with AI anyway - I could automate my job away if I wanted - but I wouldn't trust the result. I give a degree of input and steering, and there are bigger-picture considerations it's not good at juggling, for now.
charcircuit 2 hours ago [-]
Then sets up telemetry and experiments with the change. Then if data looks good an agent ramps it up to more users or removes it.
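That ramp-up/kill decision reduces to a small pure function; a sketch, with made-up exposure steps and error-rate tolerance:

```python
RAMP_STEPS = [0.01, 0.05, 0.25, 1.0]  # fraction of users exposed

def next_exposure(current, error_rate, baseline, tolerance=0.001):
    """Ramp up if telemetry looks healthy, roll back to zero otherwise."""
    if error_rate > baseline + tolerance:
        return 0.0  # kill the experiment
    larger = [s for s in RAMP_STEPS if s > current]
    return larger[0] if larger else current  # already at full exposure
```

An agent (or a dumb cron job) can evaluate this against the metrics store on every tick.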
eranation 2 hours ago [-]
Um, we are already there...
mkagenius 2 hours ago [-]
This is a bit restrictive - it doesn't take screenshots, so you can't say "take screenshots of my homepage and send them to me via email".
It doesn't allow egress via curl, apart from a few hardcoded domains.
I have created Cronbox in the cloud, which has better utility than the above: https://cronbox.sh
Did a "Show HN: Cronbox – Schedule AI Agents" a few days back; here's a pelican-riding-a-bicycle job - https://cronbox.sh/jobs/pelican-rides-a-bicycle?variant=term...
Grok has had this feature (https://grok.com/tasks) for some time now. I was wondering why others hadn't done it yet.
This feature increases user stickiness. They give 10 concurrent tasks for free.
I've used it to extract specific news first thing in the morning across multiple sources.
iBelieve 2 hours ago [-]
Looks like I'm limited to only 3 cloud scheduled tasks. And I'm on the Max 20x plan, too :(
"Your plan gets 3 daily cloud scheduled sessions. Disable or delete an existing schedule to continue."
But otherwise, this looks really cool. I've tried using local scheduled tasks in both Claude Code Desktop and the Codex desktop app, and very quickly got annoyed with permissions prompts, so it'll be nice to be able to run scheduled tasks in the cloud sandbox.
Here are the three tasks I'll be trying:
Every Monday morning: Run `pnpm audit` and research any security issues to see if they might affect our project. Run `pnpm outdated` and research into any packages with minor or major upgrades available. Also research if packages have been abandoned or haven't been updated in a long time, and see if there are new alternatives that are recommended instead. Put together a brief report highlighting your findings and recommendations.
Every weekday morning: Take a look at Sentry errors, logs, and metrics for the past few days. See if any new issues have popped up, and investigate them. Look at the logs and metrics, see if anything seems out of the ordinary, and investigate as appropriate. Put together a report summarizing any findings.
Every weekday morning: Please look at the commits on the `develop` branch from the previous day, look carefully at each commit, and see if there are any newly introduced bugs, sloppy code, missed functionality, poor security, missing documentation, etc. If a commit references GitHub issues, look up the issue, and review the issue to see if the commit correctly implements the ticket (fully or partially). Also do a sweep through the codebase, looking for low-hanging fruit that might be good tasks to recommend delegating to an AI agent: obvious bugs, poor or incorrect documentation, TODO comments, messy code, small improvements, etc.
I ran all of these as one-off tasks just now, and they put together useful reports; it'll be nice getting these on a daily/weekly basis. Claude Code has a Sentry connector that works in their cloud/web environment, which is cool - it accurately identified an issue I've been working on this week.
I might eventually try having these tasks open issues or even automatically address issues and open PRs, but we'll start with just reports for now.
A trivial way to rack up hundreds of dollars in API costs, sure.
But you can set up a `claude -p` call via a cronjob without too much hassle, and that can use your subscription.
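A sketch of that local setup — a small wrapper cron can invoke, archiving each day's output (the prompt and report directory are placeholders; only the `-p` flag mentioned above is assumed):

```python
import subprocess
from datetime import date
from pathlib import Path

def build_cmd(prompt):
    # `claude -p` runs a single non-interactive prompt and prints the result.
    return ["claude", "-p", prompt]

def run_daily_report(prompt, out_dir):
    """Run a headless Claude session and save the report; call from cron."""
    result = subprocess.run(build_cmd(prompt), capture_output=True, text=True)
    out = Path(out_dir) / f"report-{date.today().isoformat()}.md"
    out.write_text(result.stdout)
    return out
```

A crontab line such as `0 8 * * 1-5 python3 report.py` then supplies the weekday-morning schedule.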
zmmmmm 2 hours ago [-]
I'm missing something basic here... what does it actually do? It executes a prompt against a git repository - fine, but then what? Where does the output go? How does it persist whatever the outcome of the prompt is?
Is this assuming you give it git commit permission and it just does that? Or it acts through MCP tools you enable?
jngiam1 2 hours ago [-]
MCP tools. We're doing some MCP bundling and giving it access here - pretty cool stuff.
tossandthrow 2 hours ago [-]
We use it to do automated security audits weekly on the codebase and post the results to Slack.
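The Slack half of that pipeline needs nothing Claude-specific — an incoming webhook just accepts a JSON POST; a stdlib-only sketch (the webhook URL is a placeholder you'd get from Slack):

```python
import json
import urllib.request

def build_payload(title, findings):
    """Format audit findings as a Slack message payload."""
    lines = [f"*{title}*"] + [f"• {f}" for f in findings]
    return {"text": "\n".join(lines)}

def post_to_slack(webhook_url, payload):
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The audit itself writes `findings`; this just ships them.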
zmmmmm 2 hours ago [-]
So is Slack posting an MCP tool it has, or a skill it just knows?
tossandthrow 2 hours ago [-]
In Claude it is a "connector", which is essentially an MCP tool.