Next.js App Router + React Server Components Demo

new
past
show
ask
show
jobs
submit

▲Kimi K2 1T model runs on 2 512GB M3 Ultras (twitter.com)

96 points by jeudesprits 4 hours ago | 49 comments

A_D_E_P_T 3 hours ago [-]

Kimi K2 is a really weird model, just in general.

It's not nearly as smart as Opus 4.5 or 5.2-Pro or whatever, but it has a very distinct writing style and also a much more direct "interpersonal" style. As a writer of very-short-form stuff like emails, it's probably the best model available right now. As a chatbot, it's the only one that seems to really relish calling you out on mistakes or nonsense, and it doesn't hesitate to be blunt with you.

I get the feeling that it was trained very differently from the other models, which makes it situationally useful even if it's not very good for data analysis or working through complex questions. For instance, as it's both a good prose stylist and very direct/blunt, it's an extremely good editor.

I like it enough that I actually pay for a Kimi subscription.

wasting_time 3 hours ago [-]

It's also the only model that consistently nails my favorite AI benchmark: https://clocks.brianmoore.com/

tootie 1 hours ago [-]

I use that one for image gen too. Ask for a picture of a grandfather clock at a specific time. Most are completely unable. Clocks are always 10:20 because that's the most photogenic time used in most stock photos.

amelius 2 hours ago [-]

But how sure are we that it wasn't trained on that specifically?

3abiton 8 minutes ago [-]

> I get the feeling that it was trained very differently from the other models

It's actually based on a deepseek architecture just bigger size experts if I recall correctly.

jug 1 hours ago [-]

And given this, it unsurprisingly scores very well on https://eqbench.com

Kim_Bruning 3 hours ago [-]

Speaking of weird. I feel like Kimi is a shoggoth with its tentacles in a man-bun. If that makes any sense.

stingraycharles 3 hours ago [-]

> As a chatbot, it's the only one that seems to really relish calling you out on mistakes or nonsense, and it doesn't hesitate to be blunt with you.

My experience is that Sonnet 4.5 does this a lot as well, but this is more often than not due to a lack of full context, eg accusing the user of not doing X or Y when it just wasn’t told that was already done, and proceeding to apologize.

How is Kimi K2 in this regard?

Isn’t “instruction following” the most important thing you’d want out of a model in general, and a model pushing back more likely than not being wrong?

Kim_Bruning 3 hours ago [-]

> Isn’t “instruction following” the most important thing you’d want out of a model in general,

No. And for the same reason that pure "instruction following" in humans is considered a form of protest/sabotage.

https://en.wikipedia.org/wiki/Work-to-rule

SkyeCA 1 hours ago [-]

It's still insanity to me that doing your job exactly as defined and not giving away extra work is considered a form of action.

Everyone should be working-to-rule all the time.

stingraycharles 3 hours ago [-]

I don’t understand the point you’re trying to make. LLMs are not humans.

From my perspective, the whole problem with LLMs (at least for writing code) is that it shouldn’t assume anything, follow the instructions faithfully, and ask the user for clarification if there is ambiguity in the request.

I find it extremely annoying when the model pushes back / disagrees, instead of asking for clarification. For this reason, I’m not a big fan of Sonnet 4.5.

IgorPartola 2 hours ago [-]

Full instruction following looks like monkey’s paw/malicious compliance. A good way to eliminate a bug from a codebase is to delete the codebase, that type of thing. You want the model to have enough creative freedom to solve the problem otherwise you are just coding using an imprecise language spec.

I know what you mean: a lot of my prompts include “never use em-dashes” but all models forget this sooner or later. But in other circumstances I do want it to push back on something I am asking. “I can implement what you are asking but I just want to confirm that you are ok with this feature introducing an SQL injection attack into this API endpoint”

stingraycharles 1 hours ago [-]

My point is that it’s better that the model asks questions to better understand what’s going on before pushing back.

Kim_Bruning 2 hours ago [-]

I can't help you then. You can find a close analogue in the OSS/CIA Simple Sabotage Field Manual. [1]

For that reason, I don't trust Agents (human or ai, secret or overt :-P) who don't push back.

[1] https://www.cia.gov/static/5c875f3ec660e092cf893f60b4a288df/... esp. Section 5(11)(b)(14): "Apply all regulations to the last letter." - [as a form of sabotage]

stingraycharles 1 hours ago [-]

How is asking for clarification before pushing back a bad thing?

Kim_Bruning 32 minutes ago [-]

Sounds like we're not too far apart then!

Sometimes pushback is appropriate, sometimes clarification. The key thing is that one doesn't just blindly follow instructions; at least that's the thrust of it.

InsideOutSanta 2 hours ago [-]

I would assume that if the model made no assumptions, it would be unable to complete most requests given in natural language.

stingraycharles 1 hours ago [-]

Well yes, but asking the model to ask questions to resolve ambiguities is critical if you want to have any success in eg a coding assistant.

There are shitloads of ambiguities. Most of the problems people have with LLMs is the implicit assumptions being made.

Phrased differently, telling the model to ask questions before responding to resolve ambiguities is an extremely easy way to get a lot more success.

simlevesque 2 hours ago [-]

I think the opposite. I don't want to write down everything and I like when my agents take some initiative or come up with solutions I didn't think of.

MangoToupe 39 minutes ago [-]

> and ask the user for clarification if there is ambiguity in the request.

You'd just be endlessly talking to the chatbots. Humans are really bad at expressing ourselves precisely, which is why we have formal languages that preclude ambiguity.

wat10000 2 hours ago [-]

If I tell it to fetch the information using HTPP, I want it to ask if I meant HTTP, not go off and try to find a way to fetch the info using an old printing protocol from IBM.

scotty79 2 hours ago [-]

> is that it shouldn’t assume anything, follow the instructions faithfully, and ask the user for clarification if there is ambiguity in the request

We already had those. They are called programming languages. And interacting with them used to be a very well paid job.

logicprog 1 hours ago [-]

How do you feel K2 Thinking compares to Opus 4.5 and 5.2-Pro?

jug 1 hours ago [-]

? The user directly addresses this.

Kim_Bruning 3 hours ago [-]

Kimi K2 is a very impressive model! It's particularly un-obsequious, which makes it useful for actually checking your reasoning on things.

Some especially older ChatGPT models will tell you that everything you say is fantastic and great. Kimi -on the other hand- doesn't mind taking a detour to question your intelligence and likely your entire ancestry if you ask it to be brutal.

diydsp 3 hours ago [-]

Upon request cg roasts. Good for reducing distractions.

mehdibl 58 minutes ago [-]

Claims as always misleading as they don't show the context length or prefill if you use a lot of context. As it will be fun waiting minutes for a reply.

websiteapi 3 hours ago [-]

I get tempted to buy a couple of these, but I just feel like the amortization doesn’t make sense yet. Surely in the next few years this will be orders of magnitude cheaper.

NitpickLawyer 55 minutes ago [-]

Before committing to purchasing two of these, you should look at the true speeds that few people post. Not just the "it works". We're at a point where we can run these very large models "at home", and it is great! But true usage is now with very large contexts, both in prompt processing, and token generations. Whatever speeds these models get at "0" context is very different than what they get at "useful" context, especially in coding and such.

cubefox 7 minutes ago [-]

DeepSeek-v3.2 should be be better for long context because it is using (near linear) sparse attention.

stingraycharles 3 hours ago [-]

I don’t think it will ever make sense; you can buy so much cloud based usage for this type of price.

From my perspective, the biggest problem is that I am just not going to be using it 24/7. Which means I’m not getting nearly as much value out of it as the cloud based vendors do from their hardware.

Last but not least, if I want to run queries against open source models, I prefer to use a provider like Groq or Cerebras as it’s extremely convenient to have the query results nearly instantly.

2 hours ago [-]

lordswork 2 hours ago [-]

As long as you're willing to wait up to an hour for your GPU to get scheduled when you do want to use it.

stingraycharles 1 hours ago [-]

I don’t understand what you’re saying. What’s preventing you from using eg OpenRouter to run a query against Kimi-K2 from whatever provider?

bgwalter 17 minutes ago [-]

Because you have Cloudflare (MITM 1), Openrouter (MITM 2) and finally the "AI" provider who can all read, store, analyze and resell your queries.

hu3 1 hours ago [-]

and you'll get a faster model this way

websiteapi 2 hours ago [-]

my issue is once you have it in your workflow I'd be pretty latency sensitive. imagine those record-it-all apps working well. eventually you'd become pretty reliant on it. I don't want to necessarily be at the whims of the cloud

stingraycharles 16 minutes ago [-]

Aren’t those “record it all” applications implemented as a RAG and injected into the context based on embedding similarity?

Obviously you’re not going to always inject everything into the context window.

givinguflac 2 hours ago [-]

I think you’re missing the whole point, which is not using cloud compute.

stingraycharles 1 hours ago [-]

Because of privacy reasons? Yeah I’m not going to spend a small fortune for that to be able to use these types of models.

chrsw 3 hours ago [-]

The only reason why you run local models is for privacy, never for cost. Or even latency.

websiteapi 3 hours ago [-]

indeed - my main use case is those kind of "record everything" sort of setups. I'm not even super privacy conscious per se but it just feels too weird to send literally everything I'm saying all of the time to the cloud.

luckily for now whisper doesn't require too much compute, bu the kind of interesting analysis I'd want would require at least a 1B parameter model, maybe 100B or 1T.

andy99 2 hours ago [-]

Autonomy generally, not just privacy. You never know what the future will bring, AI will be enshittified and so will hubs like huggingface. It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

Aurornis 1 hours ago [-]

> You never know what the future will bring, AI will be enshittified and so will hubs like huggingface.

If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.

> It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

You can pay cloud providers for access to the same models that you can run locally, though. You don’t need a local setup even for this unlikely future scenario where all of the mainstream LLM providers simultaneously decided to make their LLMs poor quality and none of them sees this as market opportunity to provide good service.

But even if we ignore all of that and assume that all of the cloud inference everywhere becomes bad at the same time at some point in the future, you would still be better off buying your own inference hardware at that point in time. Spending the money to buy two M3 Ultras right now to prepare for an unlikely future event is illogical.

The only reason to run local LLMs is if you have privacy requirements or you want to do it as a hobby.

chrsw 2 hours ago [-]

Yes, I agree. And you can add security to that too.

alwillis 2 hours ago [-]

Hopefully the next time it’s updated, it should ship with some variant of the M5.

amelius 2 hours ago [-]

Maybe wait until RAM prices have normalized again.

Alifatisk 3 hours ago [-]

You should mention that it is 4bit quant. Still very impressive!

geerlingguy 3 hours ago [-]

Kiki K2 was made to be optimized at 4-bit, though.

natrys 2 hours ago [-]

That's the Kimi K2 Thinking, this post seems to be talking about original Kimi K2 Instruct though, I don't think INT4 QAT (quantization aware training) version was released for this.

Rendered at 16:54:56 GMT+0000 (Coordinated Universal Time) with Vercel.