I'll be looking at this in detail. I've started a company to do similar things, https://6k.ai
I'm currently concentrating on better data gathering for low-resource languages.
When you look in detail at datasets like Common Crawl, finepdfs, and fineweb, (1) they are really lacking quality data sources that exist if you know where to look, and (2) the sources they do have are not processed "finely" enough (e.g. finepdfs classifies each page of a PDF as being in a single language, whereas many language-learning sources contain language pairs, etc.).
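For what it's worth, one crude way to surface those language pairs is to classify per line rather than per page. A minimal sketch (the Unicode ranges, function names, and script buckets are illustrative assumptions, not how any of the datasets above actually work; a real pipeline would use a proper language-ID model):

```python
from collections import Counter

# Crude Unicode-range buckets (illustrative, not exhaustive).
SCRIPT_RANGES = [
    ("Latin", 0x0041, 0x024F),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Khmer", 0x1780, 0x17FF),
]

def script_of(ch):
    """Map a single character to a script bucket, or None."""
    cp = ord(ch)
    for name, lo, hi in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None

def line_script(line):
    """Dominant script of a line, ignoring digits/punctuation."""
    counts = Counter(s for ch in line if (s := script_of(ch)))
    return counts.most_common(1)[0][0] if counts else None

def split_by_script(page_lines):
    """Group a page's lines by dominant script, so a bilingual
    lesson page yields per-language segments instead of one label."""
    segments = {}
    for line in page_lines:
        s = line_script(line)
        if s:
            segments.setdefault(s, []).append(line)
    return segments
```

On a page mixing English and Khmer lines, this yields one segment per script instead of a single whole-page language tag, which is exactly the distinction that gets lost with page-level classification.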
intended 16 minutes ago [-]
There are many nation states working on this; have you looked into the availability of those datasets?
What languages are you prioritizing?
ks2048 3 minutes ago [-]
Yes, there are government datasets, language "academies" (or "regulators") - organizations focused on preserving / teaching the language - and often smaller, local publishers that publish material in their local language.
I'm living in Guatemala, so have been focusing on the Mayan languages here (22 languages, millions of speakers).
stingraycharles 4 hours ago [-]
I find that Meta's translations are very poor compared to others, at least for relatively obscure languages, which I figured was relevant considering the article.
Google Translate is a good default, but LLMs are really good at translations, as they're better at understanding context and providing culturally appropriate translations.
(I live in Cambodia where they speak Khmer)
djsamseng 3 hours ago [-]
Hello from Siem Reap, Cambodia! Awesome to see a fellow tech enthusiast from Cambodia.
I actually found Facebook's translations pretty good (better than Google Translate for anything longer than a sentence). From my understanding of Khmer, it's a bit more verbose and context-dependent, hence LLMs would be a big help in handling those nuances.
In the inverse case (LLMs generating Khmer from English), I heard from locals that it sounds formal and "robotic", which I found quite interesting.
pseudocomposer 2 hours ago [-]
Kagi Translate is fantastic. Multilingual support is honestly one of the best things about LLMs, imo.
ks2048 40 minutes ago [-]
So, LLMs are noticeably better at Khmer than Google Translate? I wonder why Google Translate doesn't use Gemini under the hood. Perhaps it's more prone to hallucinations.
I'm interested in finding some thorough testing of translations across different LLMs vs. translation APIs.
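For anyone wanting to roll their own comparison, a rough chrF-style character n-gram F-score can be computed with the stdlib alone. A simplified sketch (this is a stand-in, not the official chrF implementation, and the defaults are assumptions):

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams, with whitespace normalized."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(candidate, reference, max_n=4, beta=2.0):
    """Character n-gram F-score, averaged over n = 1..max_n.
    beta > 1 weights recall more heavily, as chrF does."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        overlap = sum((cand & ref).values())  # multiset intersection
        if sum(cand.values()) == 0 or sum(ref.values()) == 0:
            continue
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

Scoring each system's output against a shared reference set gives at least a crude, automatable ranking, though for low-resource languages human evaluation still matters more.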
pattilupone 34 minutes ago [-]
There's a dropdown on Google Translate that lets you choose "Advanced" mode or "Classic" mode. Advanced mode uses Gemini but it's only available for select languages.
yellow_lead 2 hours ago [-]
It's not even good for Chinese
smallerize 4 hours ago [-]
*they're
(Sorry I had to)
stingraycharles 4 hours ago [-]
I could have sworn I edited it! I did notice myself as well, but thanks for the correction.
tomrod 3 hours ago [-]
*ពួកគេគឺជា ("they are" in Khmer)
ks2048 30 minutes ago [-]
Meta released No Language Left Behind (NLLB) [1], I think in 2022. I wonder why this is not "NLLB 2.0"? These companies love introducing new names that confuse things.
Just spent a long time trying to find where you can download any of these weights.
Is it open weight? If so, why isn't there just a straight link to the models?
ks2048 15 minutes ago [-]
I haven't seen anywhere claiming they are open weight (although their last similar model, NLLB was).
They say their leaderboard and evaluation datasets are freely available. The closest statement I've seen in the paper is: "Our translation models are built on top of freely available models."
garyclarke27 2 hours ago [-]
They can translate 1,600 languages, but they can't do basic text formatting. Where are the paragraphs?
canjobear 11 minutes ago [-]
It's an abstract for a paper, so it's officially supposed to be one paragraph.
ks2048 28 minutes ago [-]
Another interesting thing mentioned here is: BOUQuET: Benchmark and Open-initiative for Universal Quality Evaluation in Translation.
That's a high count, but still a bit away from "Omni". The usual count is between 4k and 8k languages, depending on the source. But the first 1k might be the hardest, certainly.
simultsop 2 hours ago [-]
when you market, you use frontier and edge terms, so it sounds pro max
croes 4 hours ago [-]
Off topic: since the AI craze, MS's documentation translation has had ridiculous errors, like translating the try/catch keywords to "versuchen" and "fangen" on German pages.
Tarq0n 3 hours ago [-]
Yes, their translations offer negative value, which is annoying because at work you usually can't choose your locale settings.
And the errors are really basic, like translating "shortly" as "short" - not the same thing at all!
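A common mitigation for that class of error is masking code-like tokens with placeholders before the translation step and restoring them afterwards. A toy sketch (the regex, placeholder format, and keyword list are assumptions for illustration, not what MS's pipeline actually does):

```python
import re

# Backtick code spans win over bare keywords via alternation order.
CODE_TOKEN = re.compile(r"`[^`]+`|\b(?:try|catch|finally|throw)\b")

def mask_code(text):
    """Replace code-like tokens with numbered placeholders so a
    machine-translation step can't mangle them. Returns (masked, mapping)."""
    mapping = {}
    def repl(m):
        key = f"⟦{len(mapping)}⟧"
        mapping[key] = m.group(0)
        return key
    return CODE_TOKEN.sub(repl, text), mapping

def unmask(translated, mapping):
    """Restore the original tokens in the translated text."""
    for key, original in mapping.items():
        translated = translated.replace(key, original)
    return translated
```

The idea is that the MT system only ever sees the placeholders (which it has no reason to translate), and the surrounding prose gets translated normally; the keywords are reinserted verbatim at the end.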
bikeshaving 2 hours ago [-]
I’m very wary of celebrating Meta’s language work when the company was credibly found to have contributed to the genocide against the Rohingya in Myanmar, and separately, to human rights abuses against Tigrayans during the conflict in northern Ethiopia. Be careful whose sins you’re laundering.
[1] https://ai.meta.com/research/no-language-left-behind/
https://huggingface.co/spaces/facebook/bouquet
https://www.amnesty.org/en/latest/news/2025/02/meta-new-poli...
https://www.amnesty.org/en/latest/news/2023/10/meta-failure-...