MINTED

Purpose

Deterministic bridges between phones/phonemes and either Hanzi or Romanization (Pinyin, with tone marks or numbers), including tone-sandhi and erhua handling, tied cleanly into the Master Cross-Script Lattice Index (MCLI).

Identity

pgm::v1.0::ZH_HAN::<profile>

Orthography Profiles

ZH_HAN.pinyin_marks — Hanyu Pinyin with tone diacritics (ā á ǎ à); neutral tone unmarked.
ZH_HAN.pinyin_numbers — Hanyu Pinyin with tone numbers (a1 a2 a3 a4 a5).
ZH_HAN.hanzi_strict — output Hanzi only (requires lexicon/LM for morpheme selection).
ZH_HAN.hanzi+pinyin_gloss — Hanzi with inline Pinyin for pedagogy.
ZH_HAN.zhuyin (optional) — maps phonemes to Bopomofo ( ㄅㄆㄇㄈ … ), if desired in Taiwan contexts.

Lossiness

Pinyin (with tones): none.
Pinyin (tone numbers, neutral tone omitted): controlled; an unnumbered syllable is read as neutral.
Hanzi: high (logographic selection needs lexicon/semantics).
Zhuyin: none.

Phoneme Inventory (core)

Consonant initials (IPA → Pinyin onset):

/p pʰ m f/ → b p m f
/t tʰ n l/ → d t n l
/k kʰ x/ → g k h
/t͡s t͡sʰ s/ → z c s
/ʈ͡ʂ ʈ͡ʂʰ ʂ/ → zh ch sh
/t͡ɕ t͡ɕʰ ɕ/ → j q x
/ɻ/ → r
/ŋ/ (velar nasal onset; rare, dialectal) → handled as ng- (loan/onomatopoeia only).
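
The initials above can be held in a plain lookup table. The following is a minimal sketch, not the engine's implementation: the names ONSET_TO_PINYIN and onset_to_pinyin are hypothetical, and the normalization (dropping the tie bar, accepting t͡ʂ as a spelling of ʈ͡ʂ) is an assumption about how input phones arrive.

```python
# Hypothetical lookup table built from the initials listed above.
ONSET_TO_PINYIN = {
    "p": "b", "pʰ": "p", "m": "m", "f": "f",
    "t": "d", "tʰ": "t", "n": "n", "l": "l",
    "k": "g", "kʰ": "k", "x": "h",
    "ts": "z", "tsʰ": "c", "s": "s",
    "ʈʂ": "zh", "ʈʂʰ": "ch", "ʂ": "sh",
    "tɕ": "j", "tɕʰ": "q", "ɕ": "x",
    "ɻ": "r",
    "ŋ": "ng",  # rare onset, loan/onomatopoeia only
}

TIE_BAR = "\u0361"  # combining tie used in affricates such as t͡s, t͡ɕ


def onset_to_pinyin(ipa: str) -> str:
    """Map an IPA initial to its Pinyin spelling; an empty onset maps to ''."""
    key = ipa.replace(TIE_BAR, "")
    # Assumption: treat tʂ as an alternative spelling of the retroflex ʈʂ.
    if key.startswith("tʂ"):
        key = "ʈ" + key[1:]
    if key == "":
        return ""
    return ONSET_TO_PINYIN[key]  # raises KeyError for unknown initials
```

Raising KeyError on an unknown initial keeps the mapping deterministic: nothing is guessed for phones outside the inventory.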

Medials/Finals (rimes) with tone layer T ∈ {55,35,214,51,neutral} ↔ {1,2,3,4,5}:

a, o, e, i, u, y(ü) nuclei with codas {-n, -ŋ, -r(兒化)} and glides {i̯, u̯, y̯}.
Orthographic carriers y- and w- begin syllables that would otherwise start with i, u, or ü. The vowel /y/ is written yu when there is no initial, u after j/q/x, and ü after n/l; it is pronounced /y/ in every case.
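
The carrier and ü rules can be sketched as a single spelling function. This is an illustration under stated assumptions: apply_carriers is a hypothetical name, the onset is already a Pinyin initial (or empty), and the final is given in its post-consonantal spelling (iu, ui, un) with ü written explicitly.

```python
def apply_carriers(onset: str, final: str) -> str:
    """Spell one toneless syllable from a Pinyin onset and final."""
    if onset:
        # After j/q/x the vowel /y/ is written with a plain u.
        if onset in ("j", "q", "x") and final.startswith("ü"):
            final = "u" + final[1:]
        return onset + final
    # Zero-onset syllables: contracted finals are written out in full.
    expanded = {"iu": "you", "ui": "wei", "un": "wen"}
    if final in expanded:
        return expanded[final]
    if final.startswith("ü"):
        return "yu" + final[1:]            # ü, üe, üan, ün -> yu, yue, yuan, yun
    if final.startswith("i"):
        if final in ("i", "in", "ing"):
            return "y" + final             # y is added: yi, yin, ying
        return "y" + final[1:]             # y replaces i: ya, ye, yao, yong
    if final.startswith("u"):
        if final == "u":
            return "wu"                    # w is added
        return "w" + final[1:]             # w replaces u: wa, wo, wang
    return final                           # a, o, e, er, ... need no carrier
```

For example, apply_carriers("", "iong") gives "yong", apply_carriers("j", "üe") gives "jue", and apply_carriers("n", "ü") keeps "nü".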

Representative rime classes (MCLI class_ids shown conceptually):

CLS.ZH.R.A → a, ai, an, ang, ao
CLS.ZH.R.E → e, ei, en, eng, er
CLS.ZH.R.O → o, ong, ou
CLS.ZH.R.I(front) → i (ji/qi/xi/yi), ia, iao, ian, iang, ie, in, ing, iong, iu
CLS.ZH.R.I(apical) → syllabic [ɿ] after z c s and [ʅ] after zh ch sh r → written zi/ci/si and zhi/chi/shi/ri
CLS.ZH.R.U → u, ua, uo, uai, ui, uan, uang, un
CLS.ZH.R.Ü → ü, üe, üan, ün (spelled yu, yue, yuan, yun with no initial; after j/q/x → u graph, /y/ phoneme)

Tone Representation

Diacritic placement: a or e takes the mark if present; in ou the mark goes on o; otherwise it goes on the last vowel (liú, duì, xióng, nǚ).
3rd-tone sandhi: 214 → 35 before another 3rd tone.
不 bù sandhi: bù → bú before 4th-tone syllables.
一 yī sandhi: yī → yí (before 4th tone), yì (before 1/2/3 tones); yī when isolated.
Neutral tone: unmarked in the diacritic profile; written as 5 in the numeric profile.
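
Converting the numeric profile to the diacritic profile is mostly a question of where the mark goes. The sketch below uses the usual placement rule (a or e if present, o in ou, otherwise the last vowel) and Unicode combining marks composed with NFC. The names mark_index and numbers_to_marks are hypothetical, and input is assumed to be one lowercase syllable with an optional trailing tone digit.

```python
import unicodedata

# Combining marks for tones 1-4; tone 5 (neutral) stays unmarked.
TONE_COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030C", 4: "\u0300"}


def mark_index(base: str) -> int:
    """Return the index of the vowel that carries the tone mark."""
    for vowel in "ae":                      # a or e always wins
        if vowel in base:
            return base.index(vowel)
    if "ou" in base:                        # in ou the mark goes on o
        return base.index("o")
    for i in range(len(base) - 1, -1, -1):  # otherwise the last vowel
        if base[i] in "aeiouü":
            return i
    raise ValueError(f"no vowel in {base!r}")


def numbers_to_marks(syllable: str) -> str:
    """Convert e.g. 'qiang1' to the tone-marked form."""
    if syllable[-1].isdigit():
        base, tone = syllable[:-1], int(syllable[-1])
    else:
        base, tone = syllable, 5            # assumption: no digit means neutral
    if tone not in TONE_COMBINING:
        return base
    i = mark_index(base)
    marked = base[: i + 1] + TONE_COMBINING[tone] + base[i + 1:]
    return unicodedata.normalize("NFC", marked)
```

With this rule numbers_to_marks("liu2") marks the u and numbers_to_marks("dui4") marks the i, which a fixed vowel-priority order alone would not get right for both.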

Erhua (兒化)

R-coloring suffix merges with rime; Pinyin usually adds -r: 花儿 huār.
Policy switch: erhua=merge|explicit_er (merge by default).
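
A minimal sketch of the policy switch on numeric-profile syllables follows. The function name apply_erhua is hypothetical, and the behaviour of explicit_er (emit a separate neutral-tone er syllable) is an assumption, since the text above defines only the merge case.

```python
def apply_erhua(syllable: str, policy: str = "merge") -> str:
    """Attach the erhua suffix to a numeric-profile syllable such as 'hua1'."""
    if syllable[-1].isdigit():
        base, tone = syllable[:-1], syllable[-1]
    else:
        base, tone = syllable, ""
    if policy == "merge":
        # -r joins the rime and the host syllable keeps its tone: hua1 -> huar1
        return base + "r" + tone
    if policy == "explicit_er":
        # Assumption: keep the host intact and add a neutral-tone er syllable.
        return syllable + " er5"
    raise ValueError(f"unknown erhua policy: {policy!r}")
```

Under merge, apply_erhua("hua1") yields "huar1", which the diacritic profile renders as huār.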

Disambiguation Policies

PGM.DFLT — standard Mainland Pinyin conventions; tone sandhi applied.
PGM.TW_PREF — prefer Zhuyin output or Tâi-lô for Minnan loan paths (only when profile requests).
PGM.SEMANTIC_REQUIRED — for Hanzi emission, require lexicon + LM; otherwise fall back to Pinyin.

Mapping Logic (phones → phonemes → graphemes)

Normalize phones (IPA, feature bundles).
Apply allophony (aspiration, palatalization conditioned by high front glides).
Compose syllable: Onset + Medial + Nucleus + Coda + Tone.
Tone-sandhi & y/w/ü carrier rules.
Emit grapheme per profile (Pinyin/Hanzi/Zhuyin).
Round-trip: Hanzi → candidate morphemes → phonemes (LM), Pinyin → exact phonemes (tones preserved).
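
The Pinyin half of the round-trip step can be illustrated by reading the tone back off a marked syllable. This is a sketch under the assumption that tone marks are the only combining marks to strip (the diaeresis of ü is kept); marks_to_numbers is a hypothetical name.

```python
import unicodedata

# Combining marks mapped back to tone numbers.
TONE_BY_COMBINING = {"\u0304": 1, "\u0301": 2, "\u030C": 3, "\u0300": 4}


def marks_to_numbers(syllable: str) -> str:
    """Convert one tone-marked syllable to numeric form; unmarked means tone 5."""
    tone = 5
    kept = []
    # NFD splits each marked vowel into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_BY_COMBINING:
            tone = TONE_BY_COMBINING[ch]
        else:
            kept.append(ch)
    # NFC recomposes u + diaeresis into ü after the tone mark is removed.
    return unicodedata.normalize("NFC", "".join(kept)) + str(tone)
```

Because the tone survives the conversion in both directions, the Pinyin profiles can set roundtrip_ok without consulting a lexicon, unlike the Hanzi path.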

YAML Skeleton (engine spec)

pgm_version: "1.0"
language: "ZH"
script_pref: ["ZH_HAN", "LATN_PINYIN", "BPMF"]
profiles:
  - id: "pinyin_marks"
    orthography_profile: "ZH_PINYIN_MARKS_2025"
    disambiguation_policy: "PGM.DFLT"
  - id: "pinyin_numbers"
    orthography_profile: "ZH_PINYIN_NUM_2025"
    disambiguation_policy: "PGM.DFLT"
  - id: "hanzi_strict"
    orthography_profile: "ZH_HAN_STD"
    disambiguation_policy: "PGM.SEMANTIC_REQUIRED"
inventory:
  phonemes:
    - id: "PH.ZH.tɕʰ"    # q-
      ipa: "t͡ɕʰ"
      features: {place: alveolo-palatal, manner: affricate, voice: VL, aspirated: true}
      grapheme_map:
        class_id: "CLS.ZH.ON.Q"
        scripts:
          - {script: LATN_PINYIN, mapping: "q"}
          - {script: BPMF, glyph: "ㄑ"}
    - id: "PH.ZH.a"      # nucleus a
      ipa: "a"
      features: {vowel: true}
      grapheme_map:
        class_id: "CLS.ZH.R.A"
        scripts:
          - {script: LATN_PINYIN, mapping: "a"}  # tone added later
          - {script: BPMF, glyph: "ㄚ"}
  tones:
    - {id: "T1", contour: "55", mark: "¯", num: 1}
    - {id: "T2", contour: "35", mark: "´", num: 2}
    - {id: "T3", contour: "214", mark: "ˇ", num: 3}
    - {id: "T4", contour: "51", mark: "`", num: 4}
    - {id: "T0", contour: "neutral", mark: "", num: 5}
spelling_policies:
  ZH_PINYIN_MARKS_2025:
    tones: "diacritics"
    y_w_carriers: true
    umlaut_u: "after_jqx_write_u_else_ü"
    erhua: "merge"
    numbers_mode: false
  ZH_PINYIN_NUM_2025:
    tones: "numbers"
    y_w_carriers: true
    umlaut_u: "after_jqx_write_u_else_u_diaeresis_optional"
    erhua: "merge"
    numbers_mode: true
  ZH_HAN_STD:
    require_lexicon: true
    variant: "simplified|traditional|auto"
lossiness:
  to_grapheme: "none"       # for pinyin
  to_grapheme_han: "high"   # for hanzi

Unit Test Fixtures

tests:
  - id: "ZH_001"
    in_phones: ["t͡ɕʰ","i̯","a","ŋ","T1"]   # qiang1
    profile: "pinyin_marks"
    expect: "qiāng"
    roundtrip_ok: true
  - id: "ZH_002"
    in_phonemes: "/ʂ a 51/"            # sha4
    profile: "pinyin_numbers"
    expect: "sha4"
    roundtrip_ok: true
  - id: "ZH_003_sandhi"
    in_pinyin: "nǐ hǎo"                 # 3rd + 3rd
    expect_after_sandhi: "ní hǎo"
  - id: "ZH_004_erhua"
    in_pinyin_base: "hua1 + r"
    profile: "pinyin_marks"
    expect: "huār"
  - id: "ZH_005_hanzi"
    in_phonemes: "/ʈ͡ʂ a ŋ 55/"         # zhāng
    profile: "hanzi_strict"
    expect: ["张","章"]                  # needs lexicon; ambiguity set
    roundtrip_ok: false

Edge Policies

Apical vowels: zi/ci/si vs zhi/chi/shi/ri handled by rime class I(apical).
ü disambiguation: after j/q/x/y write “u” but map to /y/; elsewhere write “ü”.
Syllable segmentation: resolve x + iong → xiong (not xi + ong).
Loanwords: loan_policy=en_to_pinyin (e.g., “咖啡 kāfēi”), else tag as foreign and keep Latn.
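
The segmentation policy can be sketched as greedy longest-match over a syllable inventory, with the apostrophe as an explicit boundary. The SYLLABLES set here is a tiny illustrative subset, not the real inventory, and segment is a hypothetical name; greedy matching is an assumption and will not resolve every ambiguous string.

```python
# Tiny illustrative subset; a real engine would load the full syllable inventory.
SYLLABLES = {"xi", "xiong", "xian", "an", "mao", "ni", "hao"}
MAX_LEN = 6  # longest toneless Pinyin syllables, e.g. zhuang


def segment(text: str) -> list:
    """Split toneless Pinyin into syllables, preferring the longest match."""
    out = []
    for chunk in text.split("'"):           # apostrophe forces a boundary
        i = 0
        while i < len(chunk):
            for j in range(min(len(chunk), i + MAX_LEN), i, -1):
                if chunk[i:j] in SYLLABLES:
                    out.append(chunk[i:j])
                    i = j
                    break
            else:
                raise ValueError(f"cannot segment {chunk[i:]!r}")
    return out
```

Here segment("xiongmao") returns ["xiong", "mao"] rather than starting with "xi", while segment("xi'an") honours the apostrophe and returns ["xi", "an"].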

Worked Examples

/t͡ɕ y ɛ 35/ → jué (q/j/x + ü-class with tone-2 diacritic).
/ʈ͡ʂ a ŋ 55/ → zhāng → Hanzi candidates: 张/章 (semantic).
“不 + 对(4)” → bú duì (bù→bú before 4th tone).
“一(1) + 个(4)” → yí gè (yī→yí before 4th).
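
The sandhi examples above can be reproduced with a small pairwise pass over (Hanzi, numeric Pinyin) pairs. This is a sketch, not the engine's rule set: apply_sandhi is a hypothetical name, the Hanzi is used only to tell 一 and 不 apart from homophones, each decision looks at the underlying tone of the next syllable, and longer third-tone chains (which depend on phrasing) are handled naively.

```python
def apply_sandhi(words):
    """words: list of (hanzi, pinyin_with_tone_digit); returns surface Pinyin."""
    out = [pinyin for _, pinyin in words]
    for i in range(len(words) - 1):
        hanzi, pinyin = words[i]
        tone = pinyin[-1]
        next_tone = words[i + 1][1][-1]     # underlying tone of next syllable
        if hanzi == "不" and tone == "4" and next_tone == "4":
            out[i] = "bu2"                  # bù -> bú before 4th tone
        elif hanzi == "一" and tone == "1":
            if next_tone == "4":
                out[i] = "yi2"              # yī -> yí before 4th tone
            elif next_tone in ("1", "2", "3"):
                out[i] = "yi4"              # yī -> yì before 1st/2nd/3rd tone
        elif tone == "3" and next_tone == "3":
            out[i] = pinyin[:-1] + "2"      # 3rd -> 2nd before another 3rd
    return out
```

For instance, apply_sandhi([("不", "bu4"), ("对", "dui4")]) returns ["bu2", "dui4"], and an isolated 一 is left as yi1 because the loop never reaches a final syllable.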

✅ PGM-01 (Mandarin) is MINTED and attached to MCLI.