Mistral Small 4 (mistral.ai)
revolvingthrow 11 hours ago [-]
I really wish the benchmarks were even slightly trustworthy for AI models. ~120B are the largest models I can run locally. Naturally I grabbed the 122B Qwen3.5, which had great benchmarks and… frankly, the model is garbage, worse than glm air 4.5 IMO. But then, qwen famously benchmaxxes.

And here we have another release. Its benchmarks are just a tiny bit worse than Qwen3.5's (for far fewer tokens). Am I to take it that the model is worse? Or, given Qwen's benchmaxxing, does a slightly worse score from a non-Qwen model actually mean a better model? I'd rather not spend hours testing things myself for every noteworthy release.

Ah well. Mistral has been fairly decent, so it's worth a look. Obviously they're behind the big 3, but in my experience their small models are probably the best you can get for several months after each release. I'm not sure how that works as a sales funnel for their paid models (same problem as with Chinese models: people likely just go for Google/OpenAI/Anthropic in that case), but I'm thankful for their existence.

kristianp 12 hours ago [-]
Interesting that they target around 120 billion parameters: just enough to fit onto a single H100 with a 4-bit quant, or a 128 GB APU like Apple silicon, AMD's AI CPUs, or the GB Spark.
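
The single-H100 claim is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch (numbers are illustrative: real runtimes add KV cache, activations, and quantization-scale overhead, and the 4.5 bits/weight figure is an assumption to account for scales):

```python
# Rough memory estimate for a ~120B-parameter model at 4-bit quantization.
# Assumes ~4.5 effective bits per weight once quantization scales/zeros
# are included; ignores KV cache and activation memory.

def quantized_weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

weights = quantized_weight_gb(120, 4.5)
print(f"~{weights:.0f} GB of weights")  # ~68 GB, under an 80 GB H100
```

At ~68 GB of weights there is headroom on an 80 GB H100 for context, but not much; the same math explains why 128 GB unified-memory machines are the other natural target.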

Copying GPT-OSS-120b?

Available to try at https://build.nvidia.com/mistralai/mistral-small-4-119b-2603

rurban 5 hours ago [-]
Hopefully better than gpt-oss-120b, because that one sucks big time. Completely unusable. GPT-5.3 and 4 are very fine, though.

Testing it tomorrow

zacksiri 11 hours ago [-]
I tested the model in an agentic workflow. Here is the report:

https://upmaru.com/llm-tests/simple-tama-agentic-workflow-q1...

Reubend 9 hours ago [-]
Seems like it does quite well on that particular benchmark?
zacksiri 8 hours ago [-]
It's OK, but not the best. There are models that do better. I'd use it for some basic tasks, but not for actually complex tasks like query generation and retrieval.
2001zhaozhao 18 hours ago [-]
Which Haiku model are they comparing to? Is it 4.5? In which case it's absolutely wild that Qwen3.5 122B is shredding it in those graphs.
7777777phil 5 hours ago [-]
Been spending a bunch of time lately trying to figure out why these ~120B MoE models keep beating much larger dense ones.

With Mistral it's 128 experts but only 4 active per token, so any given forward pass costs something like 6B params. That's a very different kind of model than a dense transformer scaled bigger. Also wrote a little post on where I think this is going: https://philippdubach.com/posts/the-last-architecture-design...
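
The routing idea behind that claim can be sketched in a few lines. This is a toy top-k MoE layer, not Mistral's actual implementation; the 128 experts and k=4 come from the comment above, and the dimensions and gating details are illustrative:

```python
import numpy as np

# Toy top-k mixture-of-experts layer: a router scores all experts per
# token, only the top k experts actually run, and their outputs are
# combined weighted by a softmax over the selected scores.

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 128, 4  # 128 experts, 4 active (per the comment)

router = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    logits = x @ router                  # one score per expert
    top = np.argsort(logits)[-k:]        # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                         # softmax over the selected experts only
    # Only k of n_experts weight matrices are ever multiplied:
    return sum(wi * (experts[i] @ x) for wi, i in zip(w, top))

x = rng.standard_normal(d_model)
y = moe_layer(x)
```

The point is that per-token FLOPs scale with k, not with n_experts, which is how a 122B-total model can run at dense-~6B cost per forward pass while still storing 122B parameters' worth of capacity.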
