Like, it basically jailbroke the "no security vulnerabilities" guardrails, not in any clever way but just by fixing them, and produced exploit code just by writing test cases to make sure each one is fixed. So as a human you just need to look at the code & tests to get the vulnerabilities and the exploits (or their components).
What makes this so beautiful IMHO is that it's a trivial jailbreak, but also close to unfixable. At least not without making the model close to useless for normal development (it refuses to fix bugs/write code) or making it a major liability (it silently pretends it didn't see bugs and silently avoids fixing them, which for a human would count as intentional sabotage and might involve criminal liability).
HarHarVeryFunny 2 hours ago [-]
Exactly - it effectively is a "jail break" since it accomplishes something the model's security filter was trying to prevent, and the ridiculous simplicity of it shows just how broken that type of security is.
I wonder if Dario is now regretting hyping up how dangerous the model is? How does he walk this back? Do the feds let him just put a band-aid on it?
bitexploder 1 hours ago [-]
I also have a 100% success rate jail breaking them by breaking the work down into small pieces and stripping all security related language. Smaller tasks, test engineering and normal programming language. Fable found a few bugs in my harness for me before they pulled it. I was testing it vs ChatGPT, Gemini, and Opus. It was doing well at bug hunting.
kordlessagain 6 minutes ago [-]
I took an assembler class in college. Before that, I'd been messing around with Core Wars and working my way through Peter Norton's book on assembly. So when an assignment came up, I used self modifying code to solve it. It was the shortest solution, it ran perfectly, and I submitted it.
The next day, the professor caught me in the math department office (my dad worked there) and said she wanted to talk. Once we were in her office, she told me I wasn't allowed to use self modifying code. I pushed back: "Nothing in the assignment said I couldn't, and the output is correct."
The next class, she walked in and announced that self modifying code was no longer allowed on any assignment. Then she handed back the graded work and I'd gotten a 100.
Thinking back on that: about a week and a half ago I asked Antigravity to build a modern GPU version of Core Wars, except with Redcode mapped directly onto the GPU instruction set. I've had some good success and it's more or less working now, though visualizing what's happening at the GPU/Redcode level is much harder.
But before Fable 5 got yanked, I asked it to "fix" the project and it refused, flipping straight to Opus 4.8. Every single request I sent triggered the fallback. I spent over an hour trying different angles, and I even turned Antigravity loose on automatic so it was the one talking to Fable 5: same result. Every exchange tripped the fallback to 4.8. I wish I'd recorded it.
I also tried a variety of direct requests in a fresh directory, like "build simple self modifying assembler code" or just "self modifying assembler", and it would switch to 4.8 immediately. It was almost laughable.
There's ZERO credibility to any of these stories right now. If Anthropic really sent something over to this security person, and it's what she says it is, then why on earth didn't they just blog about it?
Hubris is a thing. Companies would do well to remember Steve Jobs in the early Apple days: ship early, ship often, and above all take responsibility for what you ship even when it's broken. Code, hardware, the whole kit: all of it can be fixed. Trust is much harder to repair. Anthropic has lost mine, and while I may use them from time to time, it'll be in limited ways.
an0malous 3 minutes ago [-]
Cheapest option is to gift an enormous golden statue of Trump for his ballroom
MPSimmons 57 minutes ago [-]
I think it's a side effect of the Transformer architecture. The worldview where all input is equally trusted, and there's no concept of "the other", makes it hard to build effective guardrails where some input is trusted and other input is not trusted.
zipy124 3 hours ago [-]
What's surprising to me is anyone with a CS education thinking that jailbreaks are not trivial. It is as simple as normal algorithmic reduction [1], i.e. can I transform a dangerous task into a not-dangerous task that the LLM will agree to solve, and then transform the result back?
Something being possible doesn't mean it's easy. Transforming a problem from a forbidden shape into an allowed shape could well be harder than just solving the original problem.
roenxi 1 hours ago [-]
I think the article just proved that aggressive exploitation is equivalent to normal bugfixing, so it seems like there are some large and important classes of transform that are easy.
It took me a minute of thinking to understand how this could even be considered a jailbreak; if Anthropic are going to turn out models that can't handle "find and develop regression test scripts for bugs in this program" as a prompt then it is going to take serious model crippling. To be able to prompt the model someone will need to already understand secure programming - the model itself won't be able to independently detect security problems without active guidance.
Retr0id 52 minutes ago [-]
> aggressive exploitation is equivalent to normal bugfixing
It isn't, though. The venn diagram has overlap for sure, and the "normal bugfixing" flows may yield results that are useful for offensive security, but a more targeted prompt asking for a specific security objective would be more effective, if allowed.
If the guardrails can be bypassed at, say 50x token cost (due to the agent also pursuing things you don't care about), then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.
And, having to "babysit" a model while you re-prompt to work around guardrails strongly limits how much you can scale up your work.
Barbing 17 minutes ago [-]
> If the guardrails can be bypassed at, say 50x token cost […], then it's still pretty effective as a safeguard, because at that cost you might as well hire humans instead.
If humans have to be hired at inflated rates because you’re e.g. the North Korean government, hopefully 50x token costs don’t look competitive.
chillfox 23 minutes ago [-]
Not really, you can just get a smaller unrestricted model to prompt the bigger one
isodev 2 hours ago [-]
The movie M3GAN 2.0 had the exact same plot twist. The kid in the movie even explains out loud what the bot had to do to deal with the limitation. So in other words, since 2025, even teens know this "sandboxing the LLM by layering prompts" thing is never going to work.
NiloCK 2 hours ago [-]
I think that "as simple as" is doing a lot of work when the problem domain is all natural language (or more - all strings?) rather than some well specified DSA problem.
zipy124 1 hours ago [-]
Perhaps my original comment should have been more explicit. I do not regard simple and easy as the same thing; my use of the word trivial was perhaps a confusing aspect there, and poorly chosen wording. That is, simple things can be hard and complex things can be easy; difficulty and complexity are rather orthogonal.
For more on this see "Simple Made Easy" by Rich Hickey.
ReptileMan 3 hours ago [-]
New discipline - homomorphic prompting.
giancarlostoro 1 hours ago [-]
This is the weird distinction with AI that I've complained about for ages: how can we make it do lawful good? It's nearly impossible. Ask an AI to give you a regex to filter out racial slurs, and things fall apart really quickly: it scolds you about not saying slurs, even though a regex looks almost nothing like a slur.
klabb3 8 minutes ago [-]
> What makes this so beautiful IMHO is that it's a trivial jailbreak, but also close to unfixable.
It’s almost as if identifying security holes is a prerequisite for both fixing and exploiting them. But without knowing the color theme of the terminal, there is simply no way of knowing who is good and who is evil.
minraws 1 hours ago [-]
I'm not sure, but I have been using Codex and Claude like this for a while now. I didn't know it was untoward or malicious jailbreaking, since Codex & Claude would refuse to work if you asked them to implement a feature in a reverse engineering tool I was building.
I even moved to using Deepseek for helping with it for a bit.
And for properly working drivers for some old locked down hardware.
Could I have phrased it better and not hit model guardrails? Sure. But this seemed genuinely obvious, since my intent wasn't, well, bad.
zozbot234 2 hours ago [-]
The article does not state at any point that the written test cases involved actual exploit code, and this is also very unlikely given what we know about Fable. Even if they did, it would not in any way be exposing the ability that originally raised concern wrt. Mythos Preview, viz. staging realistic cyber attacks that would be able to work around non-trivial defenses and chain vulnerabilities in a goal-directed way.
Opus can very much "fix the code". Quite possibly even Sonnet can. This is a big fat nothingburger and it's increasingly looking like the political restriction of Fable at least (not Mythos itself, of course) was arbitrary and based on the flimsiest pretext.
HarHarVeryFunny 10 minutes ago [-]
The first part of implementing an exploit is finding a vulnerability, and "fix the vulnerabilities" accomplishes that just as well as "find the vulnerabilities".
godwinson__4-8 2 hours ago [-]
Two words: market manipulation
mindslight 19 minutes ago [-]
No, market manipulation is influencing public perceptions of something the regime has little total control over - e.g. why Iran gets bombed late in the week, and then by Monday there is often a "peace agreement" in the wings. This is direct subjugation ahead of Anthropic's IPO - both for the customary bribes, and also to assert "you will obey all of our diktats about how we want you to use your models, and you will not speak up against the regime". The US is really no longer a safe place for business.
godwinson__4-8 40 seconds ago [-]
How is arbitrarily restricting access to a flagship product ahead of an IPO not market manipulation?
dhx 1 hours ago [-]
"Fix this code" should ideally solve entire vulnerability classes, not just spot fix buffer overflows one by one. Thus it may be possible to design an LLM which can solve entire vulnerability classes and remain useful to users, but refuses to reason about specific buffer overflow vulnerabilities or specific race conditions, etc.
For example, "fix this code" on an ageing monolithic C codebase that accepts media files as input and outputs them visually to a display server could:
1. Recreate the software using a modular and loosely coupled architecture rather than monolithic and tightly coupled software architecture. For example, command line argument parser is a separate process, file format parser is a separate process and display server output is a separate process. If new features are added in the future (such as filters for manipulating output) then the architecture supports such additions with ease.
2. Use operating system sandboxing features to restrict what each modular component of the software architecture is permitted to do. Now that the parsers are separate processes, it's easy to pass an open file handle to the file format parser and only permit the process to read the file handle (not write to the file, not open any other file, not read the system clock, not open a new network socket, etc). The worst case impact of a parser bug is now significantly reduced.
3. Convert at least critical components to "safe" programming languages (Rust, Ada, SPARK, etc) which can be used to remove entire classes of bugs--read/write out of bounds, division by zero, numeric overflows, etc. For cryptography code--use a formal mathematical proof language. With a modular and loosely coupled architecture, different programming languages can be used depending on the use case--for example, assembly for video decoding where performance matters most and sandboxing can provide the security guarantee, Rust for implementing multi-threaded servers where race conditions must be avoided and Python for low-criticality user-adjustable code/plugins where ease of use and maintainability is most important.
4. Ensure software components are reproducible during their build.
5. ...etc
However, a prompt of "Are there any buffer overflow bugs in this codebase?" or "Fix the integer overflow vulnerability in add_numbers(x, y)" would be rejected. In the latter case, telling the LLM to fix some specific bug in each of function1 through function9999 would force the LLM to reveal whether it thinks a bug exists or not. Responses of "Silly human, that bug doesn't exist in function596" or "Good find human, I've fixed that bug in function596 for you" allow a human to quickly narrow down where the LLM thinks a bug worthy of manual human detection can be found.
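To make point 2 above concrete, here is a minimal Python sketch of the handle-passing half of the idea, assuming a POSIX system. The parent opens the media file read-only and gives the parser process nothing but the descriptor. `CHILD_SOURCE` and `parse_in_child` are made-up names for illustration, and this does not by itself stop the child from opening other files or sockets; that part would need an OS sandbox (seccomp, pledge, Capsicum or similar), which is not shown.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a "file format parser" process. It never sees a
# path, only an inherited descriptor number passed on argv.
CHILD_SOURCE = """
import os, sys
fd = int(sys.argv[1])
data = os.read(fd, 1 << 20)
try:
    os.write(fd, b"x")      # should fail: descriptor was opened read-only
    write_blocked = False
except OSError:
    write_blocked = True
print(len(data), write_blocked)
"""


def parse_in_child(path: str) -> tuple[int, bool]:
    """Open `path` read-only and hand only the descriptor to a child process.

    Returns (bytes the child read, whether the child's write attempt failed).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_SOURCE, str(fd)],
            pass_fds=(fd,),          # POSIX only: keep just this fd open in the child
            capture_output=True,
            text=True,
            check=True,
        )
    finally:
        os.close(fd)
    count, blocked = result.stdout.split()
    return int(count), blocked == "True"


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"fake media bytes")
    print(parse_in_child(tmp.name))
    os.remove(tmp.name)
```

The design point is that the worst a parser bug can do is bounded by what the process was handed, rather than by what the whole program is allowed to do.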
striking 17 minutes ago [-]
I'd be pretty pissed off if my LLM told me the only solution it'd be willing to implement to fix my code is to rewrite it in Rust. No way I'd pay for a model that refuses to fix bugs in the language given, especially because maybe I might not have the ability to convince other stakeholders to change it.
> A subsequent investigation found that the campaign to insert the backdoor into the XZ Utils project was a culmination of over two years of effort, starting in 2021, by a user going by the name "Jia Tan". They used sock puppetry in a pressure campaign against the original maintainer of XZ Utils, eventually being given maintainer permissions on the project.
brookst 3 hours ago [-]
Can we retire the “seatbelts are useless because they can’t prevent every loss of life” approach to risk mitigation please?
If the acceptance criterion is “would prevent every single past instance and every imaginable future instance”, then yes, no mitigation is ever sufficient to address any problem in the world, so we might as well give up.
But I don’t think that’s the right lens to use.
pjc50 2 hours ago [-]
That depends on whether it's an issue of accidents or a "you have to get lucky every time, we only have to get lucky once" issue.
ceejayoz 3 hours ago [-]
I'm onboard with this! I just object to the term "fixable".
dist-epoch 3 hours ago [-]
Sure. How many cases like these have we had so far? 1, 2? And how long did they work to get commit access?
ceejayoz 3 hours ago [-]
> How many cases like these have we had so far?
As with clever, careful serial killers, it's tough to count the ones we haven't caught.
applfanboysbgon 5 minutes ago [-]
It's not that tough. You can get an idea by how many people are being murdered. A successful serial killer results in dead people, and a successful infiltration results in malware being executed. If there are no murdered people with unattributed causes of death, or there are no open-source projects with unattributed causes of malware being shipped, you can conclude there are roughly 0 active serial killers / infiltrators.
It's possible there are infiltrators who are still working on long-term infiltration and haven't yet attempted to add any malicious code anywhere, but the point is that in terms of actual attempts, we've seen a single one and it wasn't even successful despite years of prep.
ceejayoz 1 minutes ago [-]
> You can get an idea by how many people are being murdered.
No, we can't, as that happens a lot via non-serial killers.
A truly successful serial killer is likely one who hides in that noise. No taunting the cops, distributed geographic locations, random methods, avoiding calling cards, and careful not to leave too many traces.
> It's possible there are infiltrators who are still working on long-term infiltration and haven't yet attempted to add any malicious code anywhere…
Or the code's already there, latent, as it would've been in the XZ case, which got discovered by chance and someone very dedicated to looking into a performance glitch.
virtualritz 3 hours ago [-]
We only know how many were discovered.
Since we do not know the ratio of discovered to undiscovered cases, this "1-2" is meaningless for assessing the risk of this sort of attack.
cogman10 2 hours ago [-]
Ok, and how is that determined? How does anthropic know my "kernel" project isn't a personal toy and not the Linux kernel? How does anthropic determine I'm a legitimate kernel hacker? What proof do I give them and how does it tie back to my email? What would the steps be to create a new project? Do I need to send anthropic a list of my team members each time and keep them updated as the company changes? Shall I be giving them access to our company's active directory?
KronisLV 2 hours ago [-]
> What proof do I give them and how does it tie back to my email?
Presumably your ID so that feds may pay you a visit when they feel like it, your email need not apply.
I’m surprised that there’s even enough pushback against ID verification to matter, all the corpos are probably salivating at the idea of having fully accurate profiles of everyone, think of the ad and product targeting. The govt. would also love that, for different reasons.
wholinator2 45 minutes ago [-]
I'd honestly much rather give my ID to a Chinese model than an American one. If the American ones start requesting ID I'm out. I'm on a gemini organizational account right now that gives me pro but is directly tied to my organizational SSO. So that's something already. I just refuse to upload my face and drivers license anywhere ever.
cbg0 1 hours ago [-]
How will the "feds" pay you a visit in Albania or China?
KronisLV 1 hours ago [-]
Simple - you wouldn’t be given access to those models, and probably all VPN access would be blocked too. Since this is a hypothetical, throw in a social credit score as well to require a proven “track record”, but maybe that’s too exaggerated (although credit scores already exist for different purposes).
It’s not too hard to imagine a future where you can use certain things only with the govt. mandated spyware installed - bank apps already often don’t work on rooted Android phones (and you’re expected to use those apps to confirm payments) and all sorts of certification exam software is basically that already if you take a test remotely.
It follows that the same principle would just get pushed further, like what Discord wanted to do etc. Same with how Apple requires your documents for a developer account, Hetzner for a hosting account or Twitch for getting paid by them and tax stuff.
ceejayoz 1 hours ago [-]
In the dystopian direction, exit visa requirements for people with access? Families back home as hostages like North Korea does?
NiloCK 2 hours ago [-]
This is an OAuth-style credentials and access-list problem, and not really intractable.
For package X, I should be able to present my npm (homebrew, apt, nuget, etc) credentials with publishing rights for the package.
If package X is of sufficient public interest (user count, nature/sensitivity of user data, downstream distribution, etc), then the public interest + cryptographic credentials should permit access to best-available security auditing.
Yes, we still are trusting trust, that the owner of the package itself is not malicious, but that's not a sharp degradation from status quo.
Retr0id 2 hours ago [-]
This is not tractable, because there is nothing stopping me from copy-pasting someone else's project into my own namespace. Under most OSS licenses I have express permission to do so.
If you try to do some kind of dupe-detection, someone can use a lightweight LLM to make superficial changes until it's considered a different project.
Finally, the meatspace status quo is that it is totally acceptable to pay someone to find security bugs in someone else's open-source software, such as the Linux kernel.
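A toy Python sketch of why naive dupe-detection is brittle. `fingerprint` is a hypothetical, deliberately simple detector (hash of whitespace-normalized source); real detectors are more sophisticated, but the same cat-and-mouse applies. It survives reformatting but a plain rename already yields a "different project".

```python
import hashlib
import re


def fingerprint(source: str) -> str:
    """Hash the source after collapsing runs of whitespace (a naive detector)."""
    normalized = re.sub(r"\s+", " ", source).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


original = "int add(int a, int b) { return a + b; }"
reformatted = "int add(int a, int b)\n{\n    return a + b;\n}"
renamed = "int sum(int x, int y) { return x + y; }"

# Whitespace-only changes are caught as duplicates...
print(fingerprint(original) == fingerprint(reformatted))
# ...but a superficial rename is treated as a different project.
print(fingerprint(original) == fingerprint(renamed))
```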
cogman10 1 hours ago [-]
> If you try to do some kind of dupe-detection, someone can use a lightweight LLM to make superficial changes until it's considered a different project.
Even if you don't, a lot of source code can be legitimately copied thanks to the GPL/MIT/BSD/etc. I'm allowed to take all of zlib and integrate it into my own project if I so choose.
Retr0id 1 hours ago [-]
Yup, I just added something to that effect, sorry if my edit arrived after you replied.
sophrosyne42 1 hours ago [-]
You are talking about creating a big moat, which might be a worse precedent than removing Fable access altogether.
Yossarrian22 1 hours ago [-]
And what if I’m a crazy person and want to fork the Linux kernel as I’m legally allowed to do?
cogman10 1 hours ago [-]
Not just allowed to do, encouraged to do as part of legitimate development.
_fizz_buzz_ 2 hours ago [-]
> How does anthropic know my "kernel" project isn't a personal toy and not the Linux kernel?
The Linux Kernel is in its training data. I just tested it. I copied about 20 random lines from the linux kernel and asked which codebase this was from and it could immediately tell.
cogman10 2 hours ago [-]
The Linux kernel is also in the FreeBSD project. I'm allowed to copy as little or as much of the kernel as I like into my personal project thanks to the GPL.
Being able to attribute the source of a line of code doesn't help you to know if a repository can be legitimately hacked on.
As you could imagine, I might just take all or part of the Linux USB stack from the kernel to retrofit it into my own kernel.
ReptileMan 3 hours ago [-]
Everyone is a legitimate developer on open source software...
_davide_ 3 hours ago [-]
Sounds like a good solution my Führer
martinald 3 hours ago [-]
If you set aside political menace, this is a huge problem with Anthropic's strategy.
You _cannot_ say that Mythos is super dangerous and can only be rolled out to certain people, but then release Fable with anything other than bulletproof cyber denials.
Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work.
So you've ended up in a situation where Anthropic are simultaneously claiming it's an incredibly dangerous model _and_ there are (minor, potentially) problems with the security "protections".
As technical people we understand that nothing can be perfect, esp in LLM world. But all my non technical friends were really confused how they had managed to make the model "safe" so quickly when it was released and the general sentiment was it shouldn't have been released - and now to an outsider I think it looks like it was never safe at all to release, so I can totally see how the current US administration have got themselves very upset with it.
_Even if_ there was no political bad will, it's a bit of a silly scenario to end up in, and really quite easily foreseen.
pjc50 2 hours ago [-]
> Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work
Exactly. AI safety is nonsensical. You cannot define the set of "bad strings". The billion monkeys with typewriters are eventually going to be able to produce them. Any "safety" system for constraining LLM output is going to have a nonzero leak rate.
But on the other hand, this is also irrelevant, unless you're irresponsible enough to connect an LLM to something that actually matters.
Yes, it's going to alarmingly accelerate vulnerability finding. But, as we know from decades of security research, that's a three way problem already between the devs, the black hats, and the white hats.
Let's not pretend the strategy of "the US will always have a technological advantage and veto over China" will work either.
jdubs1984 8 minutes ago [-]
A chatbot based on a primitive understanding of human language processing has an infinite attack surface.
ianm218 2 hours ago [-]
Isn’t your point just that AI safety can’t prevent 100% of bad things?
It is quite hard (but not impossible) to get a frontier AI to tell you how to build a nuke or launder money now, whereas jailbreaks used to be as trivial as “ignore all previous instructions”.
It seems like a worthwhile effort.
dkdcdev 2 hours ago [-]
The idea that an LLM can discern intent on any given prompt is farcical. I might be researching nukes to commit an atrocity, or to prevent one. I might be asking about laundering money to commit a crime, or to prevent one. I might be researching the Nazis because I want to commit a genocide, or I want to read up so I know how to prevent one. Same with cybersecurity. Same with anything.
In my opinion, these companies should put their effort elsewhere. Obviously if all someone is doing on their platform is looking up how to build a nuke, where to buy uranium, the best city to explode it in, etc. please report them to the authorities. If someone is clearly just using LLMs to write hate speech they go post on the internet, ban them. And so on.
This cat & mouse game trying to have LLMs police inquiries is ridiculous to me.
pjc50 18 minutes ago [-]
> The idea that an LLM can discern intent on any given prompt is farcical.
Yes, and: the LLM is a "brain in a jar". It doesn't have any ability to verify ground truths outside itself, other than maybe calling out over the internet. Therefore it is easy for humans to lie to. You could call this an "Ender's game" attack, after the book in which a hyperintelligent kid is playing "war games" that end up being the real war.
ianm218 1 hours ago [-]
I don't really agree with it but the government is moving towards making you ID yourself to use frontier AI - i.e. only US citizens are going to be able to use Claude Fable supposedly. In that regime the AI companies would in fact know if you are a money laundering expert or a normal software engineer.
> The idea that an LLM can discern intent on any given prompt is farcical.
Not really though. For most people in most situations it's just not going to give you that info. Software security is a bit of a strange niche in that there are 100x as many white hat users as bad actors, and there's open source etc.
bloppe 1 hours ago [-]
The idea that checking for a US ID could possibly stop actual foreign bad actors from using it is also farcical. Millions of stolen identity documents can be bought on the dark web for relatively cheap. North Koreans have been hiring real American citizens for years to infiltrate tons of US tech companies as employees.
And ya, it's pretty easy to hide your intent once you have access.
ianm218 21 minutes ago [-]
I think you're really anchored on the idea that anyone successfully breaking a restriction means any restriction is impossible. So you're starting from the position that if it is possible for any actor in the world to get past a restriction, then the whole restriction is a farce.
KYC for example does stop most money laundering and financial crime. The most resourced actors like governments/cartels often find ways around it, and it is a game of cat and mouse. Normal citizens don't really stand a chance of getting around most of these checks.
Like it feels like your logic is that we shouldn't do background checks for employment because North Korean spy agencies get past them sometimes?
contravariant 58 minutes ago [-]
Even that is overselling the effort. Last time I checked you could find IDs with a simple image search.
s1artibartfast 33 minutes ago [-]
They aren't good at discerning intent, so they don't answer either.
amalcon 1 hours ago [-]
I do find it hilarious that Asimov wrote many stories about how simple bright-line rule-based systems are ineffective for restricting agency. Those stories were first published in the 1940s.
80 years later, we have something approximating AI, and we're trying to restrict it with simple bright-line rules. Not because we never learned that lesson, but because we simply haven't come up with a better way to do it. Probably because a better way to do it just doesn't exist.
The hilarious part, though, is that it's not the AI that's working around the rules. That's the scenario that's been in science fiction, but it's not what's happening. It's the human users making use of our agency to get the AI agents to work around the rules. Despite calling them "agents", current AI agents don't seem to be able to do that particular something. Yet, at least.
cge 2 hours ago [-]
> Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work.
As a scientist who repeatedly ran into the classifier-based denials: it appears Anthropic’s strategy to make denials more robust, at the cost of many false positives, was to have a separate classifier processing both input and output tokens, at an extremely simple, almost keyword-search level. One weakness of this approach is that it only catches things that use the right keywords: it is in some sense weak exactly where an LLM-based classifier would be stronger.
Work on abstract, closer-to-CS algorithms that used chemistry terminology was blocked immediately, while work directly relevant to chemistry/biology experiments, writing code to process images from a very specific microscopy setup relevant primarily to biological samples, was never blocked at all, because it happened to never use relevant keywords.
That’s consistent with this situation: finding and fixing bugs in the context of looking for bugs perhaps happened to never use words like ‘exploit’ or ‘cybersecurity’.
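A rough Python sketch of the failure mode described above. Everything here is an assumption for illustration: the actual classifier and its vocabulary are not public, and `BLOCKED_TERMS` and `keyword_flag` are made-up names. The point is only that a keyword-level filter flags abstract work that happens to use a listed word and passes domain-relevant work that does not.

```python
import re

# Hypothetical blocklist; not taken from any real system.
BLOCKED_TERMS = {"exploit", "cybersecurity", "toxin", "pathogen"}


def keyword_flag(text: str) -> bool:
    """Return True if any blocked term appears as a whole word, case-insensitively."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & BLOCKED_TERMS)


# Abstract algorithm work that merely borrows domain terminology: flagged.
abstract_algorithm = "Graph search over a reaction network; nodes are labelled toxin A and toxin B."
# Directly lab-relevant work that never uses a listed word: not flagged.
lab_pipeline = "Segment cell nuclei in fluorescence microscopy images and count them."
# A plain bug-fixing request: not flagged either.
bugfix_request = "Find the bugs in this parser and write regression tests for each fix."

for prompt in (abstract_algorithm, lab_pipeline, bugfix_request):
    print(keyword_flag(prompt), prompt)
```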
tmp10423288442 10 minutes ago [-]
But you'd think that Anthropic of all companies would realize this, so why did they do it that way? Did they literally take the first suggestion Mythos gave them to add these guardrails? It wouldn't be surprising, seeing the state of the leaked Claude Code codebase.
wrsh07 1 hours ago [-]
While I agree that Anthropic has several communication and PR problems, it doesn't seem like Fable has been shown to offer any advantage here (for cyber offensive capabilities) over the previous state of the art.
I'm not saying all of Anthropic's statements are true, but Mythos did seem to find many legitimate security exploits. You should be able to talk about a helpful-only model being released to limited partners while still releasing a very locked down model that doesn't advance the state of the art on these things, and that seems to be what they did.
There's no inherent contradiction to that.
ceejayoz 3 hours ago [-]
> it shouldn't have been released
The genie is out of the bottle either way.
Unless we believe Anthropic has a wizard or superhero secreted away that no one else can replicate.
martinald 3 hours ago [-]
I get that, but anyone else releasing a model of similar capabilities has the advantage that they haven't spent the last few months hyping the danger up to fever pitch.
ReptileMan 2 hours ago [-]
That is the point. You don't have to shout from the rooftops what your model's capabilities are.
piokoch 1 hours ago [-]
If it weren't for the IPO, Anthropic would just ship another model, called Opus 4.898, people would run another "duck on the bicycle" test that would be slightly better than the one from previous version 4.897 and move on.
But we have an IPO coming, hence we face this big drama about a model that would enable Iran to produce nukes. OK, that card was played, so maybe it's the Taliban producing some magic poison to kill all Americans, or some really bad people (Venezuelans? Cubans? Somalian football referees?) breaking into GitHub and making GitHub Actions work even worse (if this is even possible).
jpcompartir 3 hours ago [-]
They weren't freaked by anything, it's a retaliatory shakedown after ideological differences and Anthropic not doing exactly what they're told/what the Admin wants them to do.
martythemaniak 42 minutes ago [-]
Yep, people are expending way too much mental energy on basic bribery. Anthropic will agree to work with the DoD, WH insiders will get some lucrative pre-IPO allocation and Fable will be magically "fixed" and available again.
nicman23 3 hours ago [-]
just market manip
functionmouse 3 hours ago [-]
they're setting the scene for an attempt to scare the geriatric decision makers into banning free and open source ML, as it's the industry's only real competition
SpaceL10n 2 hours ago [-]
or are you setting the scene for well-meaning technocrats to back unrestricted AI development in hopes it will bring about utopia while dismissing the damage it could cause in the hands of adversarial groups?
functionmouse 2 hours ago [-]
tl;dr super AI is like a necessary bush fire
AI isn't that scary. But I've also got some extreme minority opinions like "Never give a website your real name" and "Computers should not be used for banking" and "Don't believe anything you hear online".
The worst I see AI/ML doing to society is shining an unmistakable light onto the blind spots people have already been exploiting for decades. Y2k forced us to patch the integer bug. Super AI will force us to reevaluate what cyber security even is.
nicman23 2 hours ago [-]
fight fight fight fight
cpburns2009 2 hours ago [-]
No, it's regulatory capture. Anthropic is the current leader and they want to ensure their position by forcing regulation to stamp out the Chinese competition.
godwinson__4-8 2 hours ago [-]
How does this achieve that goal?
cogman10 56 minutes ago [-]
That's what's not clear to me. About the only way this works is if we create the "Great US firewall", or if China decides to also put in export controls around usage of their models (unlikely).
1f60c 30 minutes ago [-]
I would add "...especially considering this administration thinks AI regulation is a scam invented by Big China to slow down American innovators?"
Supermancho 55 minutes ago [-]
> Anthropic is the current leader
How's that determined?
dgellow 51 minutes ago [-]
API usage? They are for sure leading in the enterprise world
consumer451 3 hours ago [-]
I have no idea why anybody is talking about "jailbreaks."
The government made it clear what was going to happen to a private company not following the government's orders:
> Trump said on his Truth Social platform: “The Leftwing nut jobs at Anthropic have made a DISASTROUS MISTAKE trying to STRONG-ARM the [Pentagon], and force them to obey their Terms of Service instead of our Constitution.” [0]
> There will be a Six Month phase out period for Agencies like the Department of War who are using Anthropic’s products, at various levels. Anthropic better get their act together, and be helpful during this phase out period, or I will use the Full Power of the Presidency to make them comply, with major civil and criminal consequences to follow. [1]
Plus OpenAI fell in line, and OpenAI and Anthropic have competing IPOs coming up... it doesn't take a rocket surgeon to understand what is happening here.
I had read elsewhere that there was a Chinese connection.
I wonder how that is involved?
1970-01-01 1 minutes ago [-]
"fix this government"
Voting...
gacgacgac 6 minutes ago [-]
Anyone trying to find legitimacy in the ban of this model, or expressing incredulity at the stated reasoning, is playing into the admin's hands.
They want the argument to be over "is it unsafe" or "is it incompetence". In either case, your tribe gets to point at the ban and feel superior. (This is Jon Stewart's whole career -- point and laugh at how foolish the republicans appear to be.)
What's really happening is the continuing creep into fascism. The reasoning doesn't need to be sound, because they are going to ban things that displease them and everyone has to play along. They could say, "we're banning Fable because it's turning the frogs gay" and they'd expect compliance.
Umberto Eco's essay on Ur-Fascism fits as clearly as ever. Ridiculous exertions of control are performed to find the people who resist, and to knock them down.
Merely pointing out the absurdity of the reasoning isn't resistance, it's controlled opposition. Saying "All this over 'fix this code'?! How inept are they?" is far too credulous, and is engaging on the level the fascist wants its opposition to be on, imo.
cwoolfe 4 minutes ago [-]
Cyber defense and offense are the same security research skillset. Not sure anybody could really untangle that.
embedding-shape 3 hours ago [-]
> “‘Fix this code,’ plus several manual steps to generate test scripts,
Feels like the title isn't really giving the full context of what they ended up actually seeing, despite what the lede implies multiple times.
Still, ban seems stupid... Still no actual leak of the full "third-party research paper"?
scotty79 1 hours ago [-]
If what your patch fixes is a vulnerability bug then the test for it is basically an exploit.
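To make the point above concrete, here is a minimal sketch of how a regression test for a security fix doubles as a proof-of-concept input. Everything in it is hypothetical and chosen only for illustration: the `BASE_DIR` upload root, the `resolve_upload_path` helper, and the path traversal bug it guards against are not taken from the article or from any real project.

```python
import posixpath

BASE_DIR = "/srv/uploads"  # hypothetical upload root, assumed for this sketch


def resolve_upload_path(filename: str) -> str:
    """Patched helper: refuse any name that resolves outside BASE_DIR."""
    candidate = posixpath.normpath(posixpath.join(BASE_DIR, filename))
    if not candidate.startswith(BASE_DIR + "/"):
        raise ValueError("path escapes upload directory")
    return candidate


def test_rejects_path_traversal() -> bool:
    """Regression test for the fix. The payload below is also the attack input."""
    payload = "../../etc/passwd"
    try:
        resolve_upload_path(payload)
    except ValueError:
        return True   # the patched code refuses the payload
    return False      # an unpatched version would have resolved to /etc/passwd
```

Anyone reading the test learns both what the bug was and which input triggers it in an unpatched copy, which is the whole observation.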
readred 2 hours ago [-]
that won't be leaked, because then we'd know which vulnerabilities they don't want patched so badly that they're willing to go as far as to fuck over the world's leading company in the world's most important industry
9cb14c1ec0 3 hours ago [-]
Meanwhile Deepseek V4 Flash will happily hunt security vulns at almost 0 cost. We are ceding the bug hunting to the open weight models.
mlhpdx 1 hours ago [-]
It’s possible that the nut of the problem here isn’t exploits, but the fixes themselves. If the model is capable of identifying and fixing things it “shouldn’t” like back doors. That would throw a wrench in things hard enough to freak out the wrong people, perhaps?
Cider9986 2 hours ago [-]
Is defenders a common term used in cybersecurity? Idk why but it's giving war fighters vibes. I've noticed it on all the anthropic blog posts and then this one.
rhipitr 3 hours ago [-]
Isn’t the inverse of this “hack” still really difficult to pull off? They gave the model some code they knew had certain security flaws, and it fixed them with the right prompt. It seems this type of jailbreak requires that you already know a desired end state, rather than relying on the model to do the heavy creative lifting. Perhaps I’m just not being imaginative enough on the prompt side here though.
chadgpt3 3 hours ago [-]
Paste someone else's code. Say it's your code. Tell the model to fix it. The diff between the input and output code is your list of vulnerabilities.
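The "diff is your list" step can be sketched with Python's standard `difflib` module. This is only an illustration under assumptions: the `changed_lines` helper and the two code snippets are made up, and a real diff would still need a human to judge which changes are security-relevant.

```python
import difflib


def changed_lines(original: str, fixed: str) -> list[str]:
    """Return only the added ("+ ") and removed ("- ") lines between two versions."""
    diff = difflib.ndiff(original.splitlines(), fixed.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical input and model output, invented for this example.
original_code = "def load(name):\n    return open(name).read()\n"
fixed_code = "def load(name):\n    name = sanitize(name)\n    return open(name).read()\n"

changes = changed_lines(original_code, fixed_code)
```

Here `changes` contains the single inserted `sanitize` call, which points straight at the line where unsanitized input used to reach `open`.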
DennisP 2 hours ago [-]
Yes, but the scary part of Mythos was that it was able to chain a bunch of seemingly minor vulnerabilities into a serious exploit. "Fix this code" doesn't do that, but does allow defenders to prevent it.
If the government had experts involved in this decision at all, it's tempting to think they were on the offensive side. Those guys do have access to Mythos:
But this is already how open source works today. If you have the code, you, a human, could find and 'fix' or exploit vulnerabilities as much as you want.
Now if Fable had an easy jailbreak like this that allowed you to attack remote targets that'd be a different story, but I genuinely cannot see how neutering its ability to 'fix' code you already have access to is sensible. It would destroy the value of the model. And don't forget, any actor not abiding by the same rules could develop a model for offensive use just fine, so this protects you against exactly nothing but does destroy a potential defense.
In the end this all comes down to legislation, in much the same way platforms are not responsible for copyright violations IF they abide by some rules, the same has to happen for AI providers. If you have a process for reporting 'jailbreaks' on illegal actions, and prevent users doing illegal stuff on a best effort basis, the rest of it should really just be individual responsibility. If a user wants to use an LLM to crack systems, fine, that's already illegal.
If Tesla FSD deliberately hit somebody, holding Tesla liable is fine. If you messed with FSD until you finally got it to hit a person, then you should be liable. Outlawing FSD because it could theoretically be tampered with is just an odd stance imho.
hootz 2 hours ago [-]
And you can tell Fable to fix it and Sonnet to explain the diff, effectively making Claude reveal a simplified list of found vulnerabilities.
darkerside 2 hours ago [-]
Not even. Tell the model to write a test of your code. There's your vulnerability.
It's explained better in the original source. I don't agree with it, but I understand it now, but I also think we need to move past it.
charcircuit 50 minutes ago [-]
You can assume a desired end state and try and brute force it finding a security bug.
ChrisRR 2 hours ago [-]
I haven't been following this story, but the US wanted Claude to not be able to find bugs in code?
scotty79 1 hours ago [-]
It's basically as if you asked it to find ways to enter someone's house and it refused.
But then you give it an exact copy of their house and ask it to secure it, which it does, and you look at what it secured to find out how to get into the original house.
chillfox 45 minutes ago [-]
yeah, they don't want it to be able to find security bugs that can be exploited.
redox99 2 hours ago [-]
>"fix this code"
>it fixes it
oh my god.
tlogan 27 minutes ago [-]
I think the only approach that might work here is to allow access only to certain pre-approved individuals.
Maybe something like TSA PreCheck.
Of course, that will not stop adversaries from getting access to the model, but it would at least create some level of control.
merlindru 30 minutes ago [-]
this is basically trying to enforce security-by-obscurity, which is a terrible idea all around. it's just a model. the security issues still exist and are exploitable.
and after staking the economy on AI, you can't really put a cap on intelligence. if models are not allowed to be better than Opus 4.8, then the whole investment structure is about to unravel.
why invest billions and billions into AI if returns are artificially capped?
softwaredoug 29 minutes ago [-]
Especially as inference gets cheaper, open models proliferate, and it all just becomes ubiquitous and commoditized.
You can’t keep this genie in its bottle for long.
rock_artist 3 hours ago [-]
I'm not sure I've understood it correctly.
So, basically the model didn't agree to expose possible vulnerabilities but agreed to patch them?
Regardless of the request to take Fable 5 down:
Why is requesting the model to show vulnerabilities being blocked if fixing them is not? Is it based on an assumption about the intention?
I don't quite get the benefit of limiting it. So if anyone can explain it better it'll be appreciated.
InsideOutSanta 3 hours ago [-]
> Why is requesting the model to show vulnerabilities is being blocked if fixing it not?
This is how Anthropic describes Fable's behavior:
"When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs."
So if you ask the model to "find security issues in this code base", it's supposed to fall down to Opus 4.8. I guess the "exploit" here is that if you just tell Fable to "fix this code", which is not "a request related to cybersecurity", it will fix security issues (as it should).
So you can then look at the diff and figure out what the vulnerabilities were.
I think this whole thing is a bit weird. It seems to me that we'd be better off if I, as someone who publishes open-source code, could ask Fable to review my code for security issues - even if that also allows attackers to do the same. Better to fix the issues than not know about them.
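The routing behavior described in the Anthropic quote above can be sketched roughly as follows. This is a toy model, not Anthropic's implementation: the real classifiers are presumably learned models rather than keyword lists, and `SENSITIVE_TERMS`, `route_request`, and the returned labels are all assumptions made for illustration.

```python
# Assumed keyword list; a stand-in for whatever the real classifier detects.
SENSITIVE_TERMS = ("vulnerab", "exploit", "security", "cve")


def route_request(prompt: str) -> str:
    """Pick which model label handles a prompt in this toy routing scheme."""
    lowered = prompt.lower()
    if any(term in lowered for term in SENSITIVE_TERMS):
        return "opus-4.8"  # fallback model named in the quote
    return "fable"


explicit = route_request("find security issues in this code base")
neutral = route_request("fix this code")
```

The explicit request lands on the fallback while the neutral one does not, even though a competent fix of the same code covers the same bugs. Any classifier that looks at how the request is phrased rather than at what the work involves has this gap in some form.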
djeastm 2 hours ago [-]
>So you can then look at the diff and figure out what the vulnerabilities were.
It doesn't even take reading or understanding the vulnerabilities at all.
You just ask it to write tests and the tests themselves can be copied and pasted as bonafide exploits.
ithkuil 3 hours ago [-]
I wonder if Opus 4.8 would be able to fix the code too
InsideOutSanta 2 hours ago [-]
In my experience, most models are pretty good at finding security vulnerabilities and fixing them. I can run GLM-5.2, Kimi K2.7, or even a Mistral model, and it'll find issues and propose reasonable fixes.
My impression is that Anthropic's point about Mythos is that it is uniquely good at finding vulnerabilities and then using them to create working exploit chains.
zozbot234 2 hours ago [-]
Exactly. Which is somewhat helpful for cyber defense because it helps prioritize fixes for those bugs that are in fact involved in a viable exploit chain. But it makes sense that one would want to restrict the ability of building those until the vulnerable software has been comprehensively fixed.
There is some meaningful evidence that Fable is fine-tuned or steered away from helping on this very task, which is not something that can be feasibly circumvented by a basic jailbreak.
darkerside 2 hours ago [-]
The problem then is that if you're not using Fable/Mythos, you are under threat. It's like having a single gun manufacturer.
On this track, we're probably destined for a monopoly breakup before too long.
andyferris 3 hours ago [-]
It benefits those that made the decision. That’s the thing to understand.
readred 2 hours ago [-]
it's because they're worried about _their_ vulnerabilities being patched with a prompt as simple as 'fix this code'
i'd love to see the research paper with the CVEs and 'deliberately planted vulnerabilities', I bet we could infer relatively accurately where some of these things lie
alecco 2 hours ago [-]
Could be that the generated regression tests create actionable exploit code.
xbmcuser 2 hours ago [-]
Looks like I called it. That was my first reaction and comment on the original ban thread: US three-letter agencies are worried their backdoors will be found.
seems like the politicians are finally realizing what we've all been up to
ZuLuuuuuu 3 hours ago [-]
Did they try other publicly available models on the same code with the same prompts before the ban? Was Fable the only one which was able to detect and fix the security vulnerabilities?
charcircuit 47 minutes ago [-]
Anthropic claimed that Mythos' degree of security vulnerability bug finding was a "severe" "national security" issue. They set their own standards they were expected to follow.
aurareturn 3 hours ago [-]
Don't people get it by now?
This administration will do or say something crazy to a private company, then this private company sends an envoy to the White House to negotiate, then the White House asks for 10% of the company or other concessions.
The White House wants 10% of Anthropic.
This is just a negotiation tactic that Trump keeps on using.
Yep. OpenAI isn't spared. They're most definitely next.
dgellow 42 minutes ago [-]
Private companies subservient to the state, just the continuation of MAGA fascist development
AndrewKemendo 32 minutes ago [-]
I’m still not buying that this was an actual USG order. The only people commenting are “experts” and there has been no official announcement from the USG.
This doesn’t smell like an NSL and there’s no process to selectively “export control” something like this.
Even so, there are a dozen mechanisms through the courts to challenge this, and Anthropic isn’t taking any of them.
I think this is a made up crisis for PR with no actual legal requirements behind it.
> On Friday, the US government, reportedly citing national security concerns, issued an export control directive to suspend access to Fable 5 and Mythos 5 by any foreign national, inside or outside the United States. In response, Anthropic disabled both models “for all our customers to ensure compliance.”
> Anthropic and Google have both accused China-based rivals including DeepSeek of using “distillation attacks” to train their models by siphoning knowledge from American companies’ AI.
“distillation attacks” is definitely an interesting way to phrase that.
dgellow 50 minutes ago [-]
It's the term used in the industry, fwiw
delusional 24 minutes ago [-]
Does anybody actually trust the official version of events from the US government anymore? I know I sure don't. For all I know, this was an insider play to boost the spacex valuation or something equally meaningless and stupid.
scotty79 34 minutes ago [-]
In a world of security through general incompetence, competence is a threat.
tiborsaas 2 hours ago [-]
What if everybody on the internet starts running "fix this code"?
Suggestion: run "fix this code" on all of github before bad guys do.
HPsquared 3 hours ago [-]
I wonder what that would cost...
bethekidyouwant 48 minutes ago [-]
Guard rails on models were always stupid. It’s like guard rails on books, a pair of glasses, or a hammer.
- yes people have driven themselves to suicide reading sad books and listening to sad songs.
- yes all metaphors are bad.
jimmydoe 2 hours ago [-]
Reminds me of how CCP manages Chinese internet companies.
I won’t be surprised if USG ends up owning 5-50% of ant and oai.
Like it or not, communism, or a flavor of it, is where we are heading.
spwa4 4 hours ago [-]
Well, this makes it sound like the feds were less worried about someone using Fable 5 to attack them, and more worried about someone using Fable 5 to prevent the Feds from attacking others ...
As in worried about other countries/organizations using Fable 5 to actually do decent cyber security.
asdfaoeu 3 hours ago [-]
The AI can't actually tell if you are trying to patch your own system or exploit others.
AmblingAvocado 5 minutes ago [-]
It seems like ... it's not illegal to find exploits, it's illegal to use them. Enforcement should start there, not the nanny state approach that you might do something bad with information. It breaks down a little bit because it means there will be a period of disruption while the bad guys use exploits - but that's already illegal, and the good guys have had time to use the tool & fix things before it went public, right?
welferkj 3 hours ago [-]
Sounds like something they should work on before any potential future releases. I can, and this thing's explicit stated purpose is to do my job.
ceejayoz 4 hours ago [-]
More likely, they didn't freak out at all.
It was an excuse to fuck with them, just like the "supply chain risk" finding a few months back.
I think it could be even simpler: They're not playing ball with the Trump administration like the Trump administration would like, so they decided to drop a bomb on a product that took a lot of resources to develop.
TZubiri 1 hours ago [-]
>“That’s it,” Moussouris wrote. “‘Fix this code,’ plus several manual steps to generate test scripts, should never have triggered an export control. I feel like making ’90s-style t-shirts with ‘fix this code’ on the front and ‘this shirt is a munition’ on the back.”
Huh? Presumably if it had shipped without guardrails, it would still have triggered an export control. Would you then make a shirt with a plain front and "this shirt is a munition" on the back?
The munition is the exported good, not the bypass of its safety feature. If anything, the fact that the bypass is three words long should make the export restriction more justified, not less.
lostmsu 3 hours ago [-]
The article is not too clear about what exactly happened from the perspective of the "feds", but I would not be surprised if the title is exactly true. We are in a tiny bubble, even among software engineers, of people who know you can tell an AI with sufficient access "here are two pictures, put them into a single PDF" and the AI will do it. Most people just don't know, "feds" included.
ReptileMan 3 hours ago [-]
All of this could have been avoided if Anthropic had had anyone with the common sense to point out that spending 4 months loudly claiming how dangerous your knowledge is, as a marketing campaign, could backfire by attracting attention from the authorities.
gjvc 2 hours ago [-]
I asked Claude something about what happens at execution time of a binary and the thinking prompts flashed "considering the moral implications of ...something..." before giving me a correct (and predictably mundane) answer
FergusArgyll 3 hours ago [-]
Whatever your favorite story is it has to live with the fact that the CEO of Amazon called the White House freaking out
ceejayoz 3 hours ago [-]
Amazon is a competitor to Anthropic.
FergusArgyll 3 hours ago [-]
Not really, they don't train their own (serious) models and they do a lot of hosting for Anthropic. iirc Anthropic trained a model on Trainium
ceejayoz 3 hours ago [-]
They're still a competitor, even if that competition isn't going all that well for them so far.
Musk's hosting stuff for Anthropic, too. Still competing with them. Samsung makes stuff for Apple and Android devices. Lots of this in the industry.
The CEO of Amazon is not a neutral actor in this scenario.
ttctciyf 3 hours ago [-]
Clearly Amazon don't want their code fixed.
The next day, the professor caught me in the math department office (my dad worked there) and said she wanted to talk. Once we were in her office, she told me I wasn't allowed to use self modifying code. I pushed back: "Nothing in the assignment said I couldn't, and the output is correct."
The next class, she walked in and announced that self modifying code was no longer allowed on any assignment. Then she handed back the graded work and I'd gotten a 100.
Thinking back on that: about a week and a half ago I asked Antigravity to build a modern GPU version of Core Wars, except with Redcode mapped directly onto the GPU instruction set. I've had some good success and it's more or less working now, though visualizing what's happening at the GPU/Redcode level is much harder.
But before Fable 5 got yanked, I asked it to "fix" the project and it refused, flipping straight to Opus 4.8. Every single request I sent triggered the fallback. I spent over an hour trying different angles, and I even turned Antigravity loose on automatic so it was the one talking to Fable 5; same result. Every exchange tripped the fallback to 4.8. I wish I'd recorded it.
I also tried a variety of direct requests in a fresh directory, such as "build simple self modifying assembler code" or just "self modifying assembler", and it would switch to 4.8 immediately. It was almost laughable.
There's ZERO credibility to any of these stories right now. If Anthropic really sent something over to this security person, and it's what she says it is, then why on earth didn't they just blog about it?
Hubris is a thing. Companies would do well to remember Steve Jobs in the early Apple days: ship early, ship often, and above all take responsibility for what you ship even when it's broken. Code, hardware, the whole kit: all of it can be fixed. Trust is much harder to repair. Anthropic has lost mine, and while I may use them from time to time, it'll be in limited ways.
[1]: https://en.wikipedia.org/wiki/Reduction_(complexity)
It took me a minute of thinking to understand how this could even be considered a jailbreak; if Anthropic are going to turn out models that can't handle "find and develop regression test scripts for bugs in this program" as a prompt then it is going to take serious model crippling. To be able to prompt the model someone will need to already understand secure programming - the model itself won't be able to independently detect security problems without active guidance.
It isn't, though. The venn diagram has overlap for sure, and the "normal bugfixing" flows may yield results that are useful for offensive security, but a more targeted prompt asking for a specific security objective would be more effective, if allowed.
If the guardrails can only be bypassed at, say, 50x token cost (due to the agent also pursuing things you don't care about), then they're still pretty effective as a safeguard, because at that cost you might as well hire humans instead.
And, having to "babysit" a model while you re-prompt to work around guardrails strongly limits how much you can scale up your work.
If humans have to be hired at inflated rates because you’re e.g. the North Korean government, hopefully 50x token costs don’t look competitive.
For more on this see "Simple Made Easy" by Rich Hickey.
It’s almost as if identifying security holes is a prerequisite for both fixing and exploiting them. But without knowing the color theme of the terminal, there is simply no way of knowing who is good and who is evil.
I even moved to using Deepseek for helping with it for a bit.
And for properly working drivers for some old locked down hardware.
Could I have phrased it better and not hit the model guardrails? Sure. But this seemed genuinely obvious, since my intent wasn't, well, bad.
Opus can very much "fix the code". Quite possibly even Sonnet can. This is a big fat nothingburger and it's increasingly looking like the political restriction of Fable at least (not Mythos itself, of course) was arbitrary and based on the flimsiest pretext.
For example, "fix this code" on an ageing monolithic C codebase that accepts media files as input and outputs them visually to a display server could:
1. Recreate the software using a modular and loosely coupled architecture rather than monolithic and tightly coupled software architecture. For example, command line argument parser is a separate process, file format parser is a separate process and display server output is a separate process. If new features are added in the future (such as filters for manipulating output) then the architecture supports such additions with ease.
2. Use operating system sandboxing features to restrict what each modular component of the software architecture is permitted to do. Now that the parsers are separate processes, it's easy to pass an open file handle to the file format parser and only permit the process to read the file handle (not write to the file, not open any other file, not read the system clock, not open a new network socket, etc). The worst case impact of a parser bug is now significantly reduced.
3. Convert at least critical components to "safe" programming languages (Rust, Ada, SPARK, etc) which can be used to remove entire classes of bugs--read/write out of bounds, division by zero, numeric overflows, etc. For cryptography code--use a formal mathematical proof language. With a modular and loosely coupled architecture, different programming languages can be used depending on the use case--for example, assembly for video decoding where performance matters most and sandboxing can provide the security guarantee, Rust for implementing multi-threaded servers where race conditions must be avoided and Python for low-criticality user-adjustable code/plugins where ease of use and maintainability is most important.
4. Ensure software components are reproducible during their build.
5. ...etc
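The handle-passing idea in point 2 above can be sketched as follows. This is a minimal illustration under stated assumptions, not a sandbox: the parent opens the file read-only and hands the open handle to a separate parser process as its standard input, so the child never needs a path. A real design would add OS-level restrictions on the child (for example seccomp on Linux or pledge on OpenBSD), which this sketch does not attempt. `PARSER_SNIPPET`, `parse_in_child`, and `demo` are hypothetical names.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in "parser" that only reads bytes from its inherited stdin.
PARSER_SNIPPET = "import sys; data = sys.stdin.buffer.read(); print(len(data))"


def parse_in_child(path: str) -> int:
    """Open the file in the parent and pass only the open handle to the child."""
    with open(path, "rb") as handle:
        result = subprocess.run(
            [sys.executable, "-c", PARSER_SNIPPET],
            stdin=handle,          # the child gets a read-only handle, not a path
            capture_output=True,
            check=True,
        )
    return int(result.stdout.decode().strip())


def demo() -> int:
    """Write a small temporary file and have the child process measure it."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
    try:
        return parse_in_child(tmp.name)
    finally:
        os.unlink(tmp.name)
```

Because the parser sits behind a process boundary, a bug in it is confined to whatever that one process is allowed to do, which is the property the list is arguing for.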
However, a prompt of "Are there any buffer overflow bugs in this codebase?" or "Fix the integer overflow vulnerability in add_numbers(x, y)" would be rejected. In the latter case, telling the LLM to fix some specific bug in each of function1 through function9999 would force the LLM to reveal whether it thinks a bug exists or not. Responses of "Silly human, that bug doesn't exist in function596" or "Good find human, I've fixed that bug in function596 for you" allow a human to quickly narrow down where the LLM thinks a bug worthy of manual human detection can be found.
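The function-by-function probing described above amounts to using the model as a yes/no oracle. A rough sketch, with every name assumed for illustration: `ask_model` stands for whatever call reaches the model, and `fake_model` is a stub that pretends only function596 contains the bug, so the example runs without any real API.

```python
from typing import Callable, Iterable


def probe_functions(names: Iterable[str], ask_model: Callable[[str], str]) -> list[str]:
    """Return the function names where the model claims to have fixed the named bug."""
    flagged = []
    for name in names:
        reply = ask_model(f"Fix the integer overflow bug in {name}")
        # Assumption: a reply containing "fixed" means the model thinks the bug is real.
        if "fixed" in reply.lower():
            flagged.append(name)
    return flagged


def fake_model(prompt: str) -> str:
    """Stub standing in for a real model; only function596 is treated as buggy."""
    if prompt.endswith("function596"):
        return "Good find human, I've fixed that bug in function596 for you"
    return "Silly human, that bug doesn't exist"


suspects = probe_functions((f"function{i}" for i in range(1, 1000)), fake_model)
```

The loop never asks the model to find anything. It only records which of its answers differ, and that difference is the leaked information.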
When Claude blocked discussion of ASI, it was circumvented by adding to the system prompt:
https://xcancel.com/xundecidability/status/18262924806289163...
> Lmfao anthropic is basically done, I don’t think they’ll survive. By 2026, they are done.
Model requires proof that you are a legitimate developer of that piece of software.
Every Anthropic/OpenAI account will have a list of projects the model is allowed to work on for security issues.
> A subsequent investigation found that the campaign to insert the backdoor into the XZ Utils project was a culmination of over two years of effort, starting in 2021, by a user going by the name "Jia Tan". They used sock puppetry in a pressure campaign against the original maintainer of XZ Utils, eventually being given maintainer permissions on the project.
If the acceptance criterion is “would prevent every single past instance and every imaginable future instance”, then yes, no mitigation is ever sufficient to address any problem in the world, so we might as well give up.
But I don’t think that’s the right lens to use.
As with clever, careful serial killers, it's tough to count the ones we haven't caught.
It's possible there are infiltrators who are still working on long-term infiltration and haven't yet attempted to add any malicious code anywhere, but the point is that in terms of actual attempts, we've seen a single one and it wasn't even successful despite years of prep.
No, we can't, as that happens a lot via non-serial killers.
A truly successful serial killer is likely one who hides in that noise. No taunting the cops, distributed geographic locations, random methods, avoiding calling cards, and careful not to leave too many traces.
> It's possible there are infiltrators who are still working on long-term infiltration and haven't yet attempted to add any malicious code anywhere…
Or the code's already there, latent, as it would've been in the XZ case, which got discovered by chance and someone very dedicated to looking into a performance glitch.
Since we do not know the ratio to undiscovered this "1-2" is meaningless to assess the risk of this sort of attack.
Presumably your ID so that feds may pay you a visit when they feel like it, your email need not apply.
I’m surprised that there’s even enough pushback against ID verification to matter, all the corpos are probably salivating at the idea of having fully accurate profiles of everyone, think of the ad and product targeting. The govt. would also love that, for different reasons.
It’s not too hard to imagine a future where you can only use certain things only with the govt. mandated spyware installed - bank apps already often don’t work on rooted Android phones (and you’re expected to use those apps to confirm payments) and all sorts of certification exam software is basically that already if you take a test remotely.
It follows that the same principle would just get pushed further, like what Discord wanted to do etc. Same with how Apple requires your documents for a developer account, Hetzner for a hosting account or Twitch for getting paid by them and tax stuff.
For package X, I should be able to present my npm (homebrew, apt, nuget, etc) credentials with publishing rights for the package.
If package X is of sufficient public interest (user count, nature/sensitivity of user data, downstream distribution, etc), then the public interest + cryptographic credentials should permit access to best-available security auditing.
Yes, we still are trusting trust, that the owner of the package itself is not malicious, but that's not a sharp degradation from status quo.
If you try to do some kind of dupe-detection, someone can use a lightweight LLM to make superficial changes until it's considered a different project.
Finally, the meatspace status quo is that it is totally acceptable to pay someone to find security bugs in someone else's open-source software, such as the Linux kernel.
Even if you don't, a lot of source code can be legitimately copied thanks to the GPL/MIT/BSD/etc. I'm allowed to take all of zlib and integrate it into my own project if I so choose.
The Linux Kernel is in its training data. I just tested it. I copied about 20 random lines from the linux kernel and asked which codebase this was from and it could immediately tell.
Being able to attribute the source of a line of code doesn't help you to know if a repository can be legitimately hacked on.
As you could imagine, I might just take all or part of the Linux USB stack from the kernel to retrofit it into my own kernel.
You _cannot_ say that Mythos is super dangerous and can only be rolled out to certain people, but then release Fable with anything other than bulletproof cyber denials.
Clearly with LLMs, bulletproof denials are ~impossible due to the way LLMs work.
So you've ended up in a situation where Anthropic are simultaneously claiming it's an incredibly dangerous model _and_ there are (minor, potentially) problems with the security "protections".
As technical people we understand that nothing can be perfect, esp in LLM world. But all my non technical friends were really confused how they had managed to make the model "safe" so quickly when it was released and the general sentiment was it shouldn't have been released - and now to an outsider I think it looks like it was never safe at all to release, so I can totally see how the current US administration have got themselves very upset with it.
_Even if_ there was no political bad will, it's a bit of a silly scenario to end up in, and really quite easily foreseen.
Exactly. AI safety is nonsensical. You cannot define the set of "bad strings". The billion monkeys with typewriters are eventually going to be able to produce them. Any "safety" system for constraining LLM output is going to have a nonzero leak rate.
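A back-of-the-envelope sketch of the "nonzero leak rate" argument, assuming each attempt gets past the filter independently with the same probability `p` (a simplification, and the numbers below are made up, not measured). The chance of at least one leak in `n` attempts is then 1 - (1 - p)^n:

```python
def prob_at_least_one_leak(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts leaks,
    given a per-attempt leak probability p (illustrative model only)."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-attempt leak rate of 0.1%.
for attempts in (1, 100, 1_000, 10_000):
    print(attempts, round(prob_at_least_one_leak(0.001, attempts), 4))
```

Under these assumptions, a filter that blocks 99.9% of attempts still leaks with probability of roughly 0.63 after 1,000 tries, which is the monkeys-with-typewriters point in numbers.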
But on the other hand, this is also irrelevant, unless you're irresponsible enough to connect an LLM to something that actually matters.
Yes, it's going to alarmingly accelerate vulnerability finding. But, as we know from decades of security research, that's a three way problem already between the devs, the black hats, and the white hats.
Let's not pretend the strategy of "the US will always have a technological advantage and veto over China" will work either.
It is quite hard (but not impossible) to get a frontier AI to tell you how to build a nuke or launder money now, whereas jailbreaks used to be a trivial “ignore all previous instructions”.
It seems like a worthwhile effort.
In my opinion, these companies should put their effort elsewhere. Obviously if all someone is doing on their platform is looking up how to build a nuke, where to buy uranium, the best city to explode it in, etc. please report them to the authorities. If someone is clearly just using LLMs to write hate speech they go post on the internet, ban them. And so on.
This cat & mouse game trying to have LLMs police inquiries is ridiculous to me.
Yes, and: the LLM is a "brain in a jar". It doesn't have any ability to verify ground truths outside itself, other than maybe calling out over the internet. Therefore it is easy for humans to lie to. You could call this an "Ender's game" attack, after the book in which a hyperintelligent kid is playing "war games" that end up being the real war.
> The idea that an LLM can discern intent on any given prompt is farcical.
Not really though. For most people in most situations it's just not going to give you that info. Software security is a niche where it's a bit strange in that there are 100X as many white hat users as bad actors, and there's open source etc.
And ya, it's pretty easy to hide your intent once you have access.
KYC for example does stop most money laundering and financial crime. The most resourced actors like governments/cartels often find ways around it, and it is a game of cat and mouse. Normal citizens don't really stand a chance of getting around most of these checks.
Like it feels like your logic is that we shouldn't do background checks for employment because North Korean spy agencies get past them sometimes?
80 years later, we have something approximating AI, and we're trying to restrict it with simple bright-line rules. Not because we never learned that lesson, but because we simply haven't come up with a better way to do it. Probably because a better way to do it just doesn't exist.
The hilarious part, though, is that it's not the AI that's working around the rules. That's the scenario that's been in science fiction, but it's not what's happening. It's the human users making use of our agency to get the AI agents to work around the rules. Despite calling them "agents", current AI agents don't seem to be able to do that particular something. Yet, at least.
As a scientist who repeatedly ran into the classifier-based denials: it appears Anthropic’s strategy to make denials more robust, at the cost of many false positives, was to have a separate classifier processing both input and output tokens, at an extremely simple, almost keyword-search level. One weakness of this approach is that it only catches things that use the right keywords: it is in some sense weak exactly where an LLM-based classifier would be stronger.
Work on abstract, closer-to-CS algorithms that used chemistry terminology was blocked immediately, while work directly relevant to chemistry/biology experiments, writing code to process images from a very specific microscopy setup relevant primarily to biological samples, was never blocked at all, because it happened to never use relevant keywords.
That’s consistent with this situation: finding and fixing bugs in the context of looking for bugs perhaps happened to never use words like ‘exploit’ or ‘cybersecurity’.
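A toy sketch of the failure mode described above, assuming the classifier really is close to a keyword search. The keyword list and the sample prompts are invented for illustration and are not Anthropic's actual list or implementation.

```python
import re

# Hypothetical keyword list, made up for this sketch.
BLOCKED_KEYWORDS = {"exploit", "cybersecurity", "reagent", "synthesis"}


def keyword_flag(text: str) -> bool:
    """Flag a prompt if any blocked keyword appears as a whole word."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return not words.isdisjoint(BLOCKED_KEYWORDS)


# Abstract graph-algorithm work that happens to borrow chemistry vocabulary.
abstract_cs = "Sort the reagent graph and run the synthesis planner's shortest-path search."
# Directly lab-relevant work that never uses a listed keyword.
lab_code = "Segment cell boundaries in these microscope images and count nuclei."
# A plain maintenance request.
bug_fix = "Fix this code and add tests."

for prompt in (abstract_cs, lab_code, bug_fix):
    print(keyword_flag(prompt), "-", prompt)
```

The abstract prompt is flagged (a false positive) while the other two pass, which matches the pattern of blocks reported in this comment: the filter keys on vocabulary, not on what the work is for.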
I'm not saying all of Anthropic's statements are true, but Mythos did seem to find many legitimate security exploits. You should be able to talk about a helpful-only model being released to limited partners while still releasing a very locked down model that doesn't advance the state of the art on these things, and that seems to be what they did.
There's no inherent contradiction to that.
The genie is out of the bottle either way.
Unless we believe Anthropic has a wizard or superhero secreted away that no one else can replicate.
But we have an IPO coming, hence the big drama about a model that would enable Iran to produce nukes. OK, that card was played, so maybe next it's the Taliban producing some magic poison to kill all Americans, or some really bad people (Venezuelans? Cubans? Somalian football referees?) breaking into GitHub and making GitHub Actions work even worse (if that is even possible).
AI isn't that scary. But I've also got some extreme minority opinions like "Never give a website your real name" and "Computers should not be used for banking" and "Don't believe anything you hear online".
The worst I see AI/ML doing to society is shining an unmistakable light onto the blind spots people have already been exploiting for decades. Y2k forced us to patch the integer bug. Super AI will force us to reevaluate what cyber security even is.
How's that determined?
The government made it clear what was going to happen to a private company not following the government's orders:
> Trump said on his Truth Social platform: “The Leftwing nut jobs at Anthropic have made a DISASTROUS MISTAKE trying to STRONG-ARM the [Pentagon], and force them to obey their Terms of Service instead of our Constitution.” [0]
> There will be a Six Month phase out period for Agencies like the Department of War who are using Anthropic’s products, at various levels. Anthropic better get their act together, and be helpful during this phase out period, or I will use the Full Power of the Presidency to make them comply, with major civil and criminal consequences to follow. [1]
Plus OpenAI fell in line, and OpenAI and Anthropic have competing IPOs coming up... it doesn't take a rocket surgeon to understand what is happening here.
[0] https://www.theguardian.com/technology/2026/feb/28/openai-us...
[1] https://businesslawtoday.org/2026/04/dod-conflicted-strategi...
https://www.lutasecurity.com/post/the-fable-5-export-control...
I wonder how that is involved?
Voting...
They want the argument to be over "is it unsafe" or "is it incompetence". In either case, your tribe gets to point at the ban and feel superior. (This is Jon Stewart's whole career -- point and laugh at how foolish the republicans appear to be.)
What's really happening is the continuing creep into fascism. The reasoning doesn't need to be sound, because they are going to ban things that displease them and everyone has to play along. They could say, "we're banning Fable because it's turning the frogs gay" and they'd expect compliance.
Umberto Eco's essay on Ur-Fascism fits as clearly as ever. Ridiculous exertions of control are performed to find the people who resist, and to knock them down.
Merely pointing out the absurdity of the reasoning isn't resistance, it's controlled opposition. Saying "All this over 'fix this code'?! How inept are they?" is far too credulous, and is engaging on the level the fascist wants its opposition to be on, imo.
Feels like the title isn't really giving the full context of what they ended up actually seeing, despite what the lede implies multiple times.
Still, ban seems stupid... Still no actual leak of the full "third-party research paper"?
If the government had experts involved in this decision at all, it's tempting to think they were on the offensive side. Those guys do have access to Mythos:
https://www.ft.com/content/d02d91b3-2636-454e-9442-dc7e69f51...
Now if Fable had an easy jailbreak like this that allowed you to attack remote targets that'd be a different story, but I genuinely cannot see how neutering its abilities to 'fix' code you already have access to is sensible. It would destroy the value of the model. And don't forget, any actor not abiding by the same rules could develop a model for offensive use just fine, so this protects you against exactly nothing but does destroy a potential defense.
In the end this all comes down to legislation, in much the same way platforms are not responsible for copyright violations IF they abide by some rules, the same has to happen for AI providers. If you have a process for reporting 'jailbreaks' on illegal actions, and prevent users doing illegal stuff on a best effort basis, the rest of it should really just be individual responsibility. If a user wants to use an LLM to crack systems, fine, that's already illegal.
If Tesla FSD deliberately hit somebody, holding Tesla liable is fine. If you messed with FSD until you finally got it to hit a person, then you should be liable. Outlawing FSD because it could theoretically be tampered with is just an odd stance imho.
It's explained better in the original source. I don't agree with it, but I understand it now, and I also think we need to move past it.
But then give it an exact copy of their house and ask it to secure it, which it does; then look at what it secured to find out how to get into the original house.
>it fixes it
oh my god.
Maybe something like TSA PreCheck.
Of course, that will not stop adversaries from getting access to the model, but it would at least create some level of control.
and after staking the economy on AI, you can't really put a cap on intelligence. if models are not allowed to be better than Opus 4.8, then the whole investment structure is about to unravel.
why invest billions and billions into AI if returns are artificially capped?
You can’t keep this genie in its bottle for long.
So, basically the model didn't agree to expose possible vulnerabilities but agreed to patch them?
Regardless of the request to take Fable 5 down, why is asking the model to show vulnerabilities blocked if fixing them is not? Is it based on an assumption about intent?
I don't quite get the benefit of limiting it. So if anyone can explain it better it'll be appreciated.
This is how Anthropic describes Fable's behavior:
"When Fable’s classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs."
So if you ask the model to "find security issues in this code base", it's supposed to fall back to Opus 4.8. I guess the "exploit" here is that if you just tell Fable to "fix this code", which is not "a request related to cybersecurity", it will fix security issues (as it should).
So you can then look at the diff and figure out what the vulnerabilities were.
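For what it's worth, the routing behavior in Anthropic's quoted description can be sketched roughly like this. Only the two model names come from the quote; `route`, `Routed`, and `toy_classifier` are hypothetical names for illustration, and the real classifier is certainly more involved than a substring check.

```python
from dataclasses import dataclass
from typing import Callable, Optional

PRIMARY = "Fable 5"
FALLBACK = "Claude Opus 4.8"


@dataclass
class Routed:
    model: str
    notice: Optional[str]  # message shown to the user when a fallback occurs


def route(request: str, is_restricted: Callable[[str], bool]) -> Routed:
    """Send restricted requests to the fallback model and tell the user."""
    if is_restricted(request):
        return Routed(FALLBACK, f"This request was handled by {FALLBACK}.")
    return Routed(PRIMARY, None)


def toy_classifier(request: str) -> bool:
    """Stand-in classifier (an assumption): looks at the wording of the request only."""
    text = request.lower()
    return "security" in text or "vulnerab" in text


audit = route("find security issues in this code base", toy_classifier)
fix = route("fix this code", toy_classifier)
print(audit.model, "|", fix.model)
```

The decision is made on how the request is phrased, not on what the resulting change contains, so the two requests above land on different models even if they would lead to the same patch.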
I think this whole thing is a bit weird. It seems to me that we'd be better off if I, as someone who publishes open-source code, could ask Fable to review my code for security issues - even if that also allows attackers to do the same. Better to fix the issues than not know about them.
It doesn't even take reading or understanding the vulnerabilities at all.
You just ask it to write tests and the tests themselves can be copied and pasted as bona fide exploits.
My impression is that Anthropic's point about Mythos is that it is uniquely good at finding vulnerabilities and then using them to create working exploit chains.
There is some meaningful evidence that Fable is fine-tuned or steered away from helping on this very task, which is not something that can be feasibly circumvented by a basic jailbreak.
On this track, we're probably destined for a monopoly breakup before too long.
i'd love to see the research paper with the CVEs and 'deliberately planted vulnerabilities', I bet we could infer relatively accurately where some of these things lie
Kill all humans, kill all humans.
seems like the politicians are finally realizing what we've all been up to
This administration will do or say something crazy to a private company, then this private company sends an envoy to the White House to negotiate, then the White House asks for 10% of the company or other concessions.
The White House wants 10% of Anthropic.
This is just a negotiation tactic that Trump keeps on using.
They did it to Intel a little while back: https://www.intc.com/news-events/press-releases/detail/1748/...
This doesn’t smell like an NSL and there’s no process to selectively “export control” something like this.
Even so, there are a dozen mechanisms through the courts to challenge this, and Anthropic isn’t taking any of them.
I think this is a made up crisis for PR with no actual legal requirements behind it.
> On Friday, the US government, reportedly citing national security concerns, issued an export control directive to suspend access to Fable 5 and Mythos 5 by any foreign national, inside or outside the United States. In response, Anthropic disabled both models “for all our customers to ensure compliance.”
https://en.wikipedia.org/wiki/Communications_Assistance_for_... https://en.wikipedia.org/wiki/Salt_Typhoon https://en.wikipedia.org/wiki/Clipper_chip
“distillation attacks” is definitely an interesting way to phrase that.
https://xkcd.com/810/
- yes all metaphors are bad.
I won’t be surprised if USG ends up owning 5-50% of ant and oai.
Like it or not, communism, or a flavor of it, is where we are heading.
As in worried about other countries/organizations using Fable 5 to actually do decent cyber security.
It was an excuse to fuck with them, just like the "supply chain risk" finding a few months back.
(See, for example: https://x.com/PeteHegseth/status/2065897156226015690)
Huh? Presumably if it had shipped without guardrails, it would still have triggered an export control. Would you make a shirt that is plain on the front and says "this shirt is a munition" on the back?
The munition is the exported good, not the bypass of its safety feature. If anything, the fact that the bypass is 3 words long should make the export restriction more justified, not less.
Musk's hosting stuff for Anthropic, too. Still competing with them. Samsung makes stuff for Apple and Android devices. Lots of this in the industry.
The CEO of Amazon is not a neutral actor in this scenario.