Fantastic tool, and I love the delivery: no sign-up required. Interested to hear how you pulled that off.
Also interested to hear whether you plan to eventually support an option to add pitch accent; I've never seen what training material exists for that, or how it is supported in Unicode.
epitrochoid413 3 hours ago [-]
I built a context-aware furigana converter for Japanese text, files, and web pages.
The main problem I wanted to solve was that simple dictionary-based furigana works well for common cases, but breaks on words where the reading depends on context:
* 市場: いちば or しじょう
* 大分: おおいた or だいぶ
* 人気: にんき or ひとけ
* 最中: さいちゅう or さなか or もなか
* 方: かた or ほう
The engine is a hybrid system:
* Sudachi for tokenization, base forms, POS, and candidate readings
* Expanded dictionary coverage for compounds and fixed expressions
* Custom rules for counters, suffixes, rendaku patterns, and phrase overrides
* ModernBERT fallback for 144 especially context-dependent target words
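To make the layering concrete, here is a minimal sketch of one possible resolution order: phrase overrides first, then a context model for the ambiguous target words, then the dictionary default. It is not the actual engine. `Token`, `PHRASE_OVERRIDES`, `CONTEXT_TARGETS`, `toy_context_model`, and `assign_readings` are all hypothetical names, the tokens are built by hand rather than by Sudachi, and the "model" is a one-line heuristic standing in for a learned classifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Token:
    surface: str
    candidates: tuple[str, ...]  # candidate readings, dictionary default first


# Hypothetical phrase override: (surface, next surface) -> forced reading.
PHRASE_OVERRIDES = {("市場", "調査"): "しじょう"}

# Hypothetical subset of the context-dependent target words.
CONTEXT_TARGETS = {"大分", "人気", "最中"}

ContextModel = Callable[[Sequence[Token], int], Optional[str]]


def toy_context_model(tokens: Sequence[Token], i: int) -> Optional[str]:
    """Stand-in for a learned classifier: looks only at the next token."""
    nxt = tokens[i + 1].surface if i + 1 < len(tokens) else ""
    if tokens[i].surface == "大分" and nxt == "県":
        return "おおいた"
    return None  # no opinion; the caller falls back to the dictionary default


def assign_readings(tokens: Sequence[Token], context_model: ContextModel) -> list[str]:
    """Pick one reading per token: override, then context model, then default."""
    readings = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1].surface if i + 1 < len(tokens) else ""
        reading = PHRASE_OVERRIDES.get((tok.surface, nxt))
        if reading is None and tok.surface in CONTEXT_TARGETS:
            reading = context_model(tokens, i)
        if reading is None:
            reading = tok.candidates[0]
        readings.append(reading)
    return readings


tokens = [Token("大分", ("だいぶ", "おおいた")), Token("県", ("けん",))]
print(assign_readings(tokens, toy_context_model))  # ['おおいた', 'けん']
```

Keeping the context model behind a short allowlist means the cheap dictionary path handles most tokens, and the expensive step only runs where the reading is genuinely ambiguous.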
I have been testing it against an LLM-assisted benchmark of 7,500 Japanese lines. On the current benchmark, it gets about 12 wrong readings per 1,000 tokens (roughly a 1.2% token-level error rate). I treat that as a practical regression benchmark rather than a formal academic evaluation, but it has been useful for comparing versions and catching regressions.
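For reference, the "wrong readings per 1,000 tokens" figure can be computed roughly as below. This is a sketch under the assumption that predicted and reference readings are already aligned token by token; `wrong_per_1000` is a hypothetical helper, not part of the tool, and the data in the usage example is synthetic.

```python
from typing import Sequence


def wrong_per_1000(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Wrong readings per 1,000 tokens, for token-aligned reading lists."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be aligned token by token")
    if not gold:
        return 0.0
    wrong = sum(p != g for p, g in zip(predicted, gold))
    return 1000 * wrong / len(gold)


# Synthetic example: 12 mismatches in 1,000 aligned tokens.
gold = ["a"] * 1000
predicted = ["x"] * 12 + ["a"] * 988
print(wrong_per_1000(predicted, gold))  # 12.0
```

Tracking the same number across versions on a fixed line set is what makes it usable as a regression check, even if the reference readings themselves are imperfect.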
The hardest remaining cases are personal names, place names, rendaku, rare vocabulary, and domain-specific terms.
I would especially appreciate examples where it gets the reading wrong, since those are the most useful for improving the system.
fenomas 29 minutes ago [-]
Nice work, just gave a quick pass but seems to work well!
(Also: vouched, your comment was dead FYI)
epitrochoid413 22 minutes ago [-]
Thanks, that’s great to hear. Thanks for the vouch too, I didn’t realize the comment was dead.
altilunium 49 minutes ago [-]
It really works. Very cool. I’ve been looking for this kind of service for a long time since I started learning Japanese, and I’ve rarely been satisfied with the available services.
epitrochoid413 24 minutes ago [-]
[flagged]