MINTED
Purpose
Deterministic bridges between phones/phonemes and either Hanzi or Romanization (Pinyin, with tone marks or numbers), including tone-sandhi and erhua handling, tied cleanly into the Master Cross-Script Lattice Index (MCLI).
Identity
pgm::v1.0::ZH_HAN::<profile>
Orthography Profiles
- ZH_HAN.pinyin_marks: Hanyu Pinyin with tone diacritics (ā á ǎ à); neutral tone unmarked.
- ZH_HAN.pinyin_numbers: Hanyu Pinyin with tone numbers (a1 a2 a3 a4 a5).
- ZH_HAN.hanzi_strict: output Hanzi only (requires lexicon/LM for morpheme selection).
- ZH_HAN.hanzi+pinyin_gloss: Hanzi with inline Pinyin for pedagogy.
- ZH_HAN.zhuyin (optional): maps phonemes to Bopomofo (ㄅㄆㄇㄈ …), if desired in Taiwan contexts.
Lossiness
- Pinyin (with tones): none.
- Pinyin (numbers but neutral omitted): controlled.
- Hanzi: high (logographic selection needs lexicon/semantics).
- Zhuyin: none.
Phoneme Inventory (core)
Consonant initials (IPA → Pinyin onset):
- /p pʰ m f/ → b p m f
- /t tʰ n l/ → d t n l
- /k kʰ x/ → g k h
- /t͡s t͡sʰ s/ → z c s
- /t͡ʂ t͡ʂʰ ʂ/ → zh ch sh
- /t͡ɕ t͡ɕʰ ɕ/ → j q x
- /ɻ/ → r
- /ŋ/ (velar nasal onset; rare, dialectal) → handled as ng- (loan/onomatopoeia only).
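The onset list above can be held as a plain lookup table. The following is a minimal sketch rather than the engine's actual data structure: the names `ONSET_TO_PINYIN`, `PINYIN_TO_ONSET`, and `onset_to_pinyin` are hypothetical, keys are stored without the affricate tie bar, and the rare ng- onset is left out.

```python
# Onset table transcribed from the list above. All names here are
# illustrative assumptions, not part of the PGM spec.
TIE_BAR = "\u0361"  # combining tie bar used in affricates such as t͡s

# Keys are stored WITHOUT the tie bar; lookups strip it before matching.
ONSET_TO_PINYIN = {
    "p": "b", "pʰ": "p", "m": "m", "f": "f",
    "t": "d", "tʰ": "t", "n": "n", "l": "l",
    "k": "g", "kʰ": "k", "x": "h",
    "ts": "z", "tsʰ": "c", "s": "s",
    "tʂ": "zh", "tʂʰ": "ch", "ʂ": "sh",
    "tɕ": "j", "tɕʰ": "q", "ɕ": "x",
    "ɻ": "r",
}

# The mapping is one-to-one, so the inverse is a plain dict flip.
PINYIN_TO_ONSET = {pinyin: ipa for ipa, pinyin in ONSET_TO_PINYIN.items()}


def onset_to_pinyin(ipa: str) -> str:
    """Return the Pinyin initial for an IPA onset, ignoring the tie bar."""
    return ONSET_TO_PINYIN[ipa.replace(TIE_BAR, "")]
```

Note that IPA /x/ maps to Pinyin h while Pinyin x maps back to /ɕ/, so the two directions need separate tables.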
Medials/Finals (rimes) with tone layer T ∈ {55,35,214,51,neutral} ↔ {1,2,3,4,5}:
- a, o, e, i, u, y(ü) nuclei with codas {-n, -ŋ, -r(兒化)} and glides {i̯, u̯, y̯}.
- Orthographic carriers y- and w- are written when a syllable has no onset and begins with i, u, or ü. /y/ is spelled yu with zero onset, u after j q x, and ü elsewhere (nü, lü).
Representative rime classes (MCLI class_ids shown conceptually):
- CLS.ZH.R.A → a, ai, an, ang, ao
- CLS.ZH.R.E → e, ei, en, eng, er
- CLS.ZH.R.O → o, ong, ou
- CLS.ZH.R.I (front) → i (ji/qi/xi/yi), ia, iao, ian, iang, ie, in, ing, iong, iu
- CLS.ZH.R.I (apical) → syllabic [ɿ] after z c s and [ʅ] after zh ch sh r, written zi/ci/si and zhi/chi/shi/ri
- CLS.ZH.R.U → u, ua, uo, uai, ui, uan, uang, un
- CLS.ZH.R.Ü → ü, üe, üan, ün (spelled yu, yue, yuan, yun with zero onset; after j/q/x → u graph, /y/ phoneme)
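The y-/w- carrier and ü conventions can be sketched as one spelling function. This is an illustration under stated assumptions, not the engine's rule set: `spell_syllable` is a hypothetical name, the onset is already a Pinyin initial ("" for zero onset), and finals arrive in the written forms used by the rime classes (iu, ui, un), with ü as U+00FC.

```python
U_UMLAUT = "\u00fc"  # ü, precomposed


def spell_syllable(onset: str, final: str) -> str:
    """Spell a toneless Pinyin syllable from a Pinyin onset and final."""
    if onset:
        # After j, q, x the diaeresis is dropped; after n, l it is kept.
        if final.startswith(U_UMLAUT) and onset in ("j", "q", "x"):
            final = "u" + final[1:]
        return onset + final

    # Zero onset: contracted finals recover their full form with a carrier.
    contracted = {"iu": "you", "ui": "wei", "un": "wen"}
    if final in contracted:
        return contracted[final]
    if final.startswith(U_UMLAUT):
        return "yu" + final[1:]  # ü, üe, üan, ün -> yu, yue, yuan, yun
    if final.startswith("i"):
        # i, in, ing keep the vowel (yi, yin, ying); otherwise y replaces it.
        keeps_i = final in ("i", "in", "ing")
        return "y" + (final if keeps_i else final[1:])
    if final.startswith("u"):
        return "wu" if final == "u" else "w" + final[1:]
    return final  # a-, o-, e- finals need no carrier
```

For example, `spell_syllable("", "iang")` gives `yang`, while `spell_syllable("q", "üe")` gives `que`.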
Tone Representation
- Diacritic placement: a takes the mark if present; otherwise o or e; otherwise the last vowel of the rime (so iu → iú, ui → uì).
- 3rd-tone sandhi: 214 → 35 before another 3rd tone.
- 不 bù sandhi: bù → bú before 4th-tone syllables.
- 一 yī sandhi: yī → yí (before 4th tone), yì (before 1/2/3 tones); yī when isolated.
- Neutral tone: unmarked in the diacritic profile; written as 5 in the numeric profile.
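Diacritic placement is easy to get wrong for iu and ui, so a short sketch may help. It assumes the conventional Pinyin rule (a first, then e or o, otherwise the last vowel) and treats tone 5 or 0 as neutral. The function names are hypothetical, and ü is expected as the precomposed U+00FC.

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
COMBINING_MARK = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
U_UMLAUT = "\u00fc"  # ü


def place_tone(syllable: str, tone: int) -> str:
    """Attach a tone diacritic to a toneless Pinyin syllable."""
    if tone in (0, 5):  # neutral tone stays unmarked
        return syllable
    for vowel in "aeo":  # a, e, o always win when present
        idx = syllable.find(vowel)
        if idx != -1:
            break
    else:
        # Only i, u, ü remain: mark the last one (liu -> liú, dui -> duì).
        idx = max(syllable.rfind(v) for v in "iu" + U_UMLAUT)
    if idx == -1:
        raise ValueError(f"no vowel to mark in {syllable!r}")
    marked = syllable[: idx + 1] + COMBINING_MARK[tone] + syllable[idx + 1 :]
    # NFC folds base letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", marked)


def numbered_to_marked(numbered: str) -> str:
    """Convert numeric-profile Pinyin such as 'qiang1' to the marked form."""
    return place_tone(numbered[:-1], int(numbered[-1]))
```

With this sketch `numbered_to_marked("qiang1")` returns `qiāng` and `place_tone("liu", 2)` returns `liú`.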
Erhua (兒化)
- R-coloring suffix merges with rime; Pinyin usually adds -r: 花儿 huār.
- Policy switch: erhua=merge|explicit_er (merge by default).
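A rough sketch of how the erhua switch might act on numeric-profile Pinyin follows. The spec does not define the `explicit_er` output, so rendering it as a separate neutral-tone `er5` syllable is an assumption made here for illustration; `apply_erhua` is a hypothetical name.

```python
def apply_erhua(base: str, mode: str = "merge") -> str:
    """Apply the erhua policy to one numbered syllable, e.g. 'hua1'."""
    stem, tone = base[:-1], base[-1]
    if not tone.isdigit():
        raise ValueError(f"expected a trailing tone number in {base!r}")
    if mode == "merge":
        # The -r joins the rime and the tone number stays syllable-final.
        return f"{stem}r{tone}"
    if mode == "explicit_er":
        # ASSUMPTION: keep 儿 as its own syllable with neutral tone.
        return f"{base} er5"
    raise ValueError(f"unknown erhua mode: {mode!r}")
```

Under the default policy `apply_erhua("hua1")` yields `huar1`, which the diacritic profile would render as huār.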
Disambiguation Policies
- PGM.DFLT: standard Mainland Pinyin conventions; tone sandhi applied.
- PGM.TW_PREF: prefer Zhuyin output or Tâi-lô for Minnan loan paths (only when profile requests).
- PGM.SEMANTIC_REQUIRED: for Hanzi emission, require lexicon + LM; otherwise fall back to Pinyin.
Mapping Logic (phones → phonemes → graphemes)
- Normalize phones (IPA, feature bundles).
- Apply allophony (aspiration, palatalization conditioned by high front glides).
- Compose syllable: Onset + Medial + Nucleus + Coda + Tone.
- Tone-sandhi & y/w/ü carrier rules.
- Emit grapheme per profile (Pinyin/Hanzi/Zhuyin).
- Round-trip: Hanzi → candidate morphemes → phonemes (LM), Pinyin → exact phonemes (tones preserved).
YAML Skeleton (engine spec)
pgm_version: "1.0"
language: "ZH"
script_pref: ["ZH_HAN", "LATN_PINYIN", "BPMF"]
profiles:
  - id: "pinyin_marks"
    orthography_profile: "ZH_PINYIN_MARKS_2025"
    disambiguation_policy: "PGM.DFLT"
  - id: "pinyin_numbers"
    orthography_profile: "ZH_PINYIN_NUM_2025"
    disambiguation_policy: "PGM.DFLT"
  - id: "hanzi_strict"
    orthography_profile: "ZH_HAN_STD"
    disambiguation_policy: "PGM.SEMANTIC_REQUIRED"
inventory:
  phonemes:
    - id: "PH.ZH.tɕʰ"  # q-
      ipa: "t͡ɕʰ"
      features: {place: alveolo-palatal, manner: affricate, voice: VL, aspirated: true}
      grapheme_map:
        class_id: "CLS.ZH.ON.Q"
        scripts:
          - {script: LATN_PINYIN, mapping: "q"}
          - {script: BPMF, glyph: "ㄑ"}
    - id: "PH.ZH.a"  # nucleus a
      ipa: "a"
      features: {vowel: true}
      grapheme_map:
        class_id: "CLS.ZH.R.A"
        scripts:
          - {script: LATN_PINYIN, mapping: "a"}  # tone added later
          - {script: BPMF, glyph: "ㄚ"}
  tones:
    - {id: "T1", contour: "55", mark: "¯", num: 1}
    - {id: "T2", contour: "35", mark: "´", num: 2}
    - {id: "T3", contour: "214", mark: "ˇ", num: 3}
    - {id: "T4", contour: "51", mark: "`", num: 4}
    - {id: "T0", contour: "neutral", mark: "", num: 5}
spelling_policies:
  ZH_PINYIN_MARKS_2025:
    tones: "diacritics"
    y_w_carriers: true
    umlaut_u: "after_jqx_write_u_else_ü"
    erhua: "merge"
    numbers_mode: false
  ZH_PINYIN_NUM_2025:
    tones: "numbers"
    y_w_carriers: true
    umlaut_u: "after_jqx_write_u_else_u_diaeresis_optional"
    erhua: "merge"
    numbers_mode: true
  ZH_HAN_STD:
    require_lexicon: true
    variant: "simplified|traditional|auto"
lossiness:
  to_grapheme: "none"      # for pinyin
  to_grapheme_han: "high"  # for hanzi
Unit Test Fixtures
tests:
  - id: "ZH_001"
    in_phones: ["t͡ɕʰ", "i̯", "a", "ŋ", "T1"]  # qiang1 (medial glide i̯ required)
    profile: "pinyin_marks"
    expect: "qiāng"
    roundtrip_ok: true
  - id: "ZH_002"
    in_phonemes: "/ʂ a 51/"  # sha4
    profile: "pinyin_numbers"
    expect: "sha4"
    roundtrip_ok: true
  - id: "ZH_003_sandhi"
    in_pinyin: "nǐ hǎo"  # 3rd + 3rd
    expect_after_sandhi: "ní hǎo"
  - id: "ZH_004_erhua"
    in_pinyin_base: "hua1 + r"
    profile: "pinyin_marks"
    expect: "huār"
  - id: "ZH_005_hanzi"
    in_phonemes: "/t͡ʂ a ŋ 55/"  # zhāng
    profile: "hanzi_strict"
    expect: ["张", "章"]  # needs lexicon; ambiguity set
    roundtrip_ok: false
Edge Policies
- Apical vowels: zi/ci/si vs zhi/chi/shi/ri handled by rime class I(apical).
- ü disambiguation: after j/q/x/y write “u” but map to /y/; elsewhere write “ü”.
- Syllable segmentation: resolve x + iong → xiong (not xi + ong).
- Loanwords: loan_policy=en_to_pinyin (e.g., “咖啡 kāfēi”); otherwise tag as foreign and keep the Latin-script form.
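One way to read the segmentation policy is to peel off the onset first and treat everything left as the final, which gives x + iong for xiong. The sketch below does that for numeric-profile Pinyin. `split_syllable` and `INITIALS` are hypothetical names, and the y-/w- carriers are deliberately left inside the final.

```python
# Digraph initials come first so that "zh" is matched before "z".
INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
)


def split_syllable(numbered: str) -> tuple[str, str, int]:
    """Split 'xiong2' into (onset, final, tone) = ('x', 'iong', 2)."""
    body, tone = numbered[:-1], int(numbered[-1])
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    # Zero onset: syllables starting with a, o, e, y, or w land here.
    return "", body, tone
```

Because the onset is consumed before any vowel, the i of xiong can only belong to the final, so the xi + ong reading never arises.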
Worked Examples
- /t͡ɕ y ɛ 35/ → jué (q/j/x + ü-class with tone-2 diacritic).
- /t͡ʂ a ŋ 55/ → zhāng → Hanzi candidates: 张/章 (semantic).
- “不 + 对(4)” → bú duì (bù→bú before 4th tone).
- “一(1) + 个(4)” → yí gè (yī→yí before 4th).
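The sandhi cases in these examples can be sketched as a single left-to-right pass over (Hanzi, numbered Pinyin) pairs. This is a simplification, not the engine's algorithm: it looks only at the underlying tone of the next syllable, so longer runs of third tones, which depend on phrasing, are not modelled. `apply_sandhi` is a hypothetical name.

```python
def apply_sandhi(tokens: list[tuple[str, str]]) -> list[str]:
    """Return numbered Pinyin with 不, 一, and 3rd-tone sandhi applied."""
    out = [pinyin for _, pinyin in tokens]
    for i in range(len(tokens) - 1):
        hanzi, cur = tokens[i]
        next_tone = tokens[i + 1][1][-1]  # underlying tone of next syllable
        if hanzi == "不" and cur == "bu4":
            if next_tone == "4":
                out[i] = "bu2"  # bù -> bú before 4th tone
        elif hanzi == "一" and cur == "yi1":
            if next_tone == "4":
                out[i] = "yi2"  # yī -> yí before 4th tone
            elif next_tone in ("1", "2", "3"):
                out[i] = "yi4"  # yī -> yì before 1st/2nd/3rd tone
        elif cur.endswith("3") and next_tone == "3":
            out[i] = cur[:-1] + "2"  # 214 -> 35 before another 3rd tone
    return out
```

Keying the 不 and 一 rules on the character rather than the Pinyin string keeps homophones such as 医 yī untouched, and the last syllable of the input is never changed, which matches the isolated yī case.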
✅ PGM-01 (Mandarin) is MINTED and attached to MCLI.