I continue to be skeptical about the introduction of the ruby-type attribute. While I appreciate the effort to improve Ruby TTS, I am concerned that the direction of "manual categorization via HTML attributes" might be a technical dead end. It risks over-simplifying the linguistic reality of Ruby and places an unsustainable burden on authors.
The Paradox of "Unclassifiable" Ruby
The proposed model assumes Ruby can be binary-classified into phonetic or complementary. However, real-world usage often defies such categorization.
- The Example: 海上衝突予防規則 (base) with the ruby うみのうえつきあたりようじんのきまり (annotation). Readers who know the kanji will ignore the ruby and read the base as かいじょうしょうとつよぼうきそく. Readers who cannot read the base will read the ruby text instead.
- The Problem: This annotation is phonetic in the sense that it provides the intended audio, but it is also complementary because it provides a plain-language explanation of a complex term.
- The User Context: The ideal behavior depends on the user's literacy and context. A proficient reader might prefer the standard kanji reading, while a novice needs the descriptive ruby explanation. A static ruby-type attribute asks authors to dictate a single behavior for all users, which contradicts the goal of flexible, user-centered accessibility. Forcing authors to perform this complex mental triage for every ruby instance is, in my view, the wrong approach.
Lessons from Existing Practice: The DAISY Model
We should look at how complex cases are already handled in accessible publishing. In Japanese DAISY textbooks, which often feature ruby for mathematical symbols or chemical formulas (e.g., $CO_2$ with ruby しーおーつー):
- The system does not rely on UA heuristics or manual tagging of ruby elements.
- Instead, it uses Media Overlays to map authoritative audio directly to visual elements.

This proves that high-fidelity TTS for complex content is best achieved by decoupling the "Source of Truth" for audio from the visual markup, rather than relying on fragile manual attributes in the HTML.
The "North Star": PLS and Automated NLP
Instead of complicating the HTML layer, we should rely on more robust linguistic data layers.
Proposed Logic for Speech UAs:
- Dictionary Cross-reference: The UA should use its internal morphological dictionary to check the Ruby. If it matches a standard reading, it performs a single, clean reading.
- PLS Priority: If a PLS (Pronunciation Lexicon Specification) is provided, the UA must prioritize it, treating it as the authoritative reading and ignoring the Ruby.
- Safe Fallback: If the Ruby matches neither the dictionary nor a PLS (suggesting a unique name or a descriptive annotation), the UA should read both the base and the annotation. This provides the most information to the user without making an incorrect guess.
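
To make the proposed logic concrete, here is a minimal Python sketch of the three rules. Everything in it is hypothetical: `plan_ruby_speech`, `SpeechPlan`, and the plain-dict stand-ins for the morphological dictionary and the PLS are illustration only, not an existing UA API. It also assumes that "PLS priority" means the PLS is consulted first, and that a dictionary match is spoken using the ruby text once.

```python
from dataclasses import dataclass
from typing import List, Mapping, Optional, Set


@dataclass(frozen=True)
class SpeechPlan:
    utterances: List[str]  # strings to speak, in order
    source: str            # which rule produced the plan


def plan_ruby_speech(
    base: str,
    ruby: str,
    dictionary: Mapping[str, Set[str]],
    pls: Optional[Mapping[str, str]] = None,
) -> SpeechPlan:
    """Decide what to speak for one base/ruby pair (hypothetical helper)."""
    # PLS priority: an author-supplied lexicon entry is authoritative,
    # so the ruby text is ignored.
    if pls is not None and base in pls:
        return SpeechPlan([pls[base]], "pls")

    # Dictionary cross-reference: the ruby is a standard reading of the base,
    # so speak it once (assumption: the ruby text is what gets spoken).
    if ruby in dictionary.get(base, set()):
        return SpeechPlan([ruby], "dictionary")

    # Safe fallback: no match anywhere, so read both base and annotation.
    return SpeechPlan([base, ruby], "fallback")


# Toy data standing in for a real morphological dictionary and a PLS.
DICTIONARY = {
    "東京": {"とうきょう"},
    "海上衝突予防規則": {"かいじょうしょうとつよぼうきそく"},
}
PLS = {"CO2": "しーおーつー"}

if __name__ == "__main__":
    cases = [
        ("東京", "とうきょう"),
        ("海上衝突予防規則", "うみのうえつきあたりようじんのきまり"),
        ("CO2", "しーおーつー"),
    ]
    for base, ruby in cases:
        plan = plan_ruby_speech(base, ruby, DICTIONARY, PLS)
        print(base, plan.source, plan.utterances)
```

Under these assumptions the 海上衝突予防規則 example falls through to the fallback rule, so the listener hears both the base and the plain-language annotation without the author having to classify anything.
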
Strategic Direction
Rather than cluttering HTML with attributes, we should steer the industry toward:
- Improving NLP/Morphological engines for better automatic disambiguation.
- Encouraging PLS provision as a best practice (or eventual requirement) for WCAG compliance in documents with non-standard readings or specialized jargon.
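
As a rough illustration of what PLS provision could look like in practice, the sketch below embeds a tiny PLS 1.0 lexicon and reads it with Python's standard library. The sample entries and the helper name `load_pls_aliases` are my own assumptions, and expressing the kana readings through `<alias>` is just one possible authoring choice; a real lexicon might use `<phoneme>` instead.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Namespace defined by the W3C PLS 1.0 Recommendation.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Hypothetical sample lexicon; the entries are illustrative only.
SAMPLE_PLS = """\
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="ipa" xml:lang="ja">
  <lexeme>
    <grapheme>海上衝突予防規則</grapheme>
    <alias>かいじょうしょうとつよぼうきそく</alias>
  </lexeme>
  <lexeme>
    <grapheme>CO2</grapheme>
    <alias>しーおーつー</alias>
  </lexeme>
</lexicon>
"""


def load_pls_aliases(pls_xml: str) -> Dict[str, str]:
    """Return a grapheme -> alias mapping from a PLS document (hypothetical helper)."""
    ns = {"pls": PLS_NS}
    root = ET.fromstring(pls_xml)
    readings: Dict[str, str] = {}
    for lexeme in root.findall("pls:lexeme", ns):
        alias = lexeme.find("pls:alias", ns)
        if alias is None or not alias.text:
            continue  # this sketch skips entries that only carry <phoneme>
        # A lexeme may list several graphemes sharing one pronunciation.
        for grapheme in lexeme.findall("pls:grapheme", ns):
            if grapheme.text:
                readings[grapheme.text.strip()] = alias.text.strip()
    return readings


if __name__ == "__main__":
    for grapheme, reading in load_pls_aliases(SAMPLE_PLS).items():
        print(grapheme, "->", reading)
```

The point of the sketch is only that the authoritative reading lives in a separate lexicon file that a speech UA can consult, leaving the HTML markup untouched.
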
Conclusion:
The ruby-type attribute seems to be an over-engineered solution for a problem better addressed at the lexicon and policy levels. Let's maintain HTML as a clean structural layer and treat PLS/Media Overlay as our "North Star" for phonetic accuracy.