PGM-01 — Mandarin Chinese (ZH_HAN)


MINTED

Purpose

Deterministic bridges between phones/phonemes and either Hanzi or Romanization (Pinyin, with tone marks or numbers), including tone-sandhi and erhua handling, tied cleanly into the Master Cross-Script Lattice Index (MCLI).

Identity

pgm::v1.0::ZH_HAN::<profile>

Orthography Profiles

  • ZH_HAN.pinyin_marks — Hanyu Pinyin with tone diacritics (ā á ǎ à); neutral tone unmarked.
  • ZH_HAN.pinyin_numbers — Hanyu Pinyin with tone numbers (a1 a2 a3 a4 a5).
  • ZH_HAN.hanzi_strict — output Hanzi only (requires lexicon/LM for morpheme selection).
  • ZH_HAN.hanzi+pinyin_gloss — Hanzi with inline Pinyin for pedagogy.
  • ZH_HAN.zhuyin (optional) — maps phonemes to Bopomofo ( ㄅㄆㄇㄈ … ), if desired in Taiwan contexts.

Lossiness

  • Pinyin (with tones): none.
  • Pinyin (tone numbers, neutral-tone 5 omitted): controlled.
  • Hanzi: high (logographic selection needs lexicon/semantics).
  • Zhuyin: none.

Phoneme Inventory (core)

Consonant initials (IPA → Pinyin onset):

  • /p pʰ m f/ → b p m f
  • /t tʰ n l/ → d t n l
  • /k kʰ x/ → g k h
  • /t͡s t͡sʰ s/ → z c s
  • /t͡ʂ t͡ʂʰ ʂ/ → zh ch sh
  • /t͡ɕ t͡ɕʰ ɕ/ → j q x
  • /ɻ/ → r
  • /ŋ/ (velar nasal onset; rare, dialectal) → handled as ng- (loan/onomatopoeia only).
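
The initials above form a one-to-one table, so they can be held as a pair of dictionaries. The Python sketch below is illustrative only: the names `IPA_TO_PINYIN_ONSET`, `PINYIN_TO_IPA_ONSET`, and `onset_to_pinyin` are not part of the spec, and stripping the tie bar (U+0361) and accepting ʈ as a variant of t before lookup are assumptions about phone normalization.

```python
TIE_BAR = "\u0361"  # combining tie bar used in affricate symbols

# Keys are stored without tie bars; values are the Pinyin initials listed above.
IPA_TO_PINYIN_ONSET = {
    "p": "b", "pʰ": "p", "m": "m", "f": "f",
    "t": "d", "tʰ": "t", "n": "n", "l": "l",
    "k": "g", "kʰ": "k", "x": "h",
    "ts": "z", "tsʰ": "c", "s": "s",
    "tʂ": "zh", "tʂʰ": "ch", "ʂ": "sh",
    "tɕ": "j", "tɕʰ": "q", "ɕ": "x",
    "ɻ": "r", "ŋ": "ng",
}

# The mapping is bijective, so the reverse table is a plain inversion.
PINYIN_TO_IPA_ONSET = {v: k for k, v in IPA_TO_PINYIN_ONSET.items()}


def onset_to_pinyin(ipa: str) -> str:
    """Return the Pinyin initial for an IPA onset (raises KeyError if unknown)."""
    # Assumption: tie bars are optional and the retroflex ʈ is a notational variant of t.
    key = ipa.replace(TIE_BAR, "").replace("ʈ", "t")
    return IPA_TO_PINYIN_ONSET[key]
```

Because the table is bijective, the same data serves the Pinyin → phoneme direction without a second hand-written table.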

Medials/Finals (rimes) with tone layer T ∈ {55,35,214,51,neutral} ↔ {1,2,3,4,5}:

  • a, o, e, i, u, y(ü) nuclei with codas {-n, -ŋ, -r(兒化)} and glides {i̯, u̯, y̯}.
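  • Orthographic y-/w- carriers mark syllables that begin with i, u, or ü and have no consonant initial (e.g., ia → ya, uo → wo). ü is spelled yu when it starts a syllable and u after j, q, x (still pronounced /y/); elsewhere it is written ü.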

Representative rime classes (MCLI class_ids shown conceptually):

  • CLS.ZH.R.A → a, ai, an, ang, ao
  • CLS.ZH.R.E → e, ei, en, eng, er
  • CLS.ZH.R.O → o, ong, ou
  • CLS.ZH.R.I(front) → i (ji/qi/xi/yi), ia, iao, ian, iang, ie, in, ing, iong, iu
  • CLS.ZH.R.I(apical) → syllabic [ɿ]/[ʅ] after z c s / zh ch sh r → written zi/ci/si / zhi/chi/shi/ri
  • CLS.ZH.R.U → u, ua, uo, uai, ui, uan, uang, un
  • CLS.ZH.R.Ü → ü, üe, üan, ün (spelled yu, yue, yuan, yun; after j/q/x → u graph, /y/ phoneme)
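
The y/w carrier and ü spelling rules can be sketched as one small function. This is a minimal illustration, not the engine's implementation: `spell_syllable` and `ZERO_ONSET_SPECIAL` are hypothetical names, finals are assumed to arrive written with ü, and the expansion of the contracted finals iu, ui, un to you, wei, wen when there is no initial follows standard Pinyin practice rather than anything stated above.

```python
U_UMLAUT = "\u00fc"  # ü as a single precomposed code point

# Finals whose zero-initial spelling is not a simple prefix swap.
ZERO_ONSET_SPECIAL = {
    "i": "yi", "in": "yin", "ing": "ying", "u": "wu",
    "iu": "you", "ui": "wei", "un": "wen",
}


def spell_syllable(onset: str, final: str) -> str:
    """Join a Pinyin initial ('' for none) and a final into a toneless syllable."""
    if onset == "":
        if final in ZERO_ONSET_SPECIAL:
            return ZERO_ONSET_SPECIAL[final]
        if final.startswith(U_UMLAUT):
            return "yu" + final[1:]   # ü, üe, üan, ün -> yu, yue, yuan, yun
        if final.startswith("i"):
            return "y" + final[1:]    # ia -> ya, iong -> yong
        if final.startswith("u"):
            return "w" + final[1:]    # uo -> wo, uang -> wang
        return final                  # a, o, e, er, ... need no carrier
    if onset in ("j", "q", "x") and final.startswith(U_UMLAUT):
        return onset + "u" + final[1:]  # written u, still /y/
    return onset + final
```

Tone marks or numbers are applied after this step, so the function deliberately returns a toneless spelling.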

Tone Representation

  • Diacritic placement: a takes the mark if present, otherwise o or e; in iu and ui the mark goes on the second vowel (liú, duì); otherwise mark the single nucleus vowel (i, u, ü).
  • 3rd-tone sandhi: 214 → 35 before another 3rd tone.
  • 不 bù sandhi: bù → bú before 4th-tone syllables.
  • 一 yī sandhi: yī → yí (before 4th tone), yì (before 1st/2nd/3rd tones); yī when isolated.
  • Neutral tone: unmarked in the diacritic profile; written as 5 in the numeric profile.
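
Tone-mark placement can be sketched with Unicode combining marks. The Python below is a hedged illustration: `add_tone_mark` and `numbered_to_marked` are hypothetical names, and placement assumes the common convention that a takes the mark, then o or e, and otherwise the last vowel (so iu marks u and ui marks i).

```python
import unicodedata

# Combining marks for tones 1-4; tone 5 (neutral) stays unmarked.
TONE_COMBINING = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aoeiu\u00fc"


def add_tone_mark(syllable: str, tone: int) -> str:
    """Place the diacritic for `tone` on a toneless Pinyin syllable."""
    mark = TONE_COMBINING.get(tone)
    if mark is None:
        return syllable  # neutral tone: write unmarked
    idx = -1
    for vowel in "aoe":  # a first, then o, then e
        idx = syllable.find(vowel)
        if idx != -1:
            break
    if idx == -1:  # only i/u/ü present: mark the last vowel
        positions = [i for i, ch in enumerate(syllable) if ch in VOWELS]
        if not positions:
            return syllable
        idx = positions[-1]
    marked = syllable[: idx + 1] + mark + syllable[idx + 1 :]
    return unicodedata.normalize("NFC", marked)  # compose to a precomposed letter


def numbered_to_marked(syllable: str) -> str:
    """Convert e.g. 'qiang1' to the diacritic form."""
    if syllable and syllable[-1] in "12345":
        return add_tone_mark(syllable[:-1], int(syllable[-1]))
    return syllable
```

Composing with NFC keeps the output in precomposed letters (including ǖ ǘ ǚ ǜ), which avoids mismatches when strings are compared byte for byte.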

Erhua (兒化)

  • R-coloring suffix merges with rime; Pinyin usually adds -r: 花儿 huār.
  • Policy switch: erhua=merge|explicit_er (merge by default).
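
A minimal sketch of the policy switch on numbered Pinyin follows. The function name `apply_erhua` is hypothetical, and the reading of `explicit_er` as "keep 儿 as its own neutral-tone syllable er5" is an assumption, since the spec names the option without defining its output.

```python
def apply_erhua(syllable: str, policy: str = "merge") -> str:
    """Attach the erhua suffix to one numbered-Pinyin syllable."""
    if policy not in ("merge", "explicit_er"):
        raise ValueError(f"unknown erhua policy: {policy}")
    has_tone = syllable[-1:].isdigit()
    base = syllable[:-1] if has_tone else syllable
    tone = syllable[-1] if has_tone else ""
    if policy == "merge":
        return base + "r" + tone  # hua1 -> huar1 (huār with marks)
    # Assumption: explicit_er writes the suffix as a separate neutral-tone syllable.
    return syllable + " er5"
```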

Disambiguation Policies

  • PGM.DFLT — standard Mainland Pinyin conventions; tone sandhi applied.
  • PGM.TW_PREF — prefer Zhuyin output or Tâi-lô for Minnan loan paths (only when profile requests).
  • PGM.SEMANTIC_REQUIRED — for Hanzi emission, require lexicon + LM; otherwise fall back to Pinyin.

Mapping Logic (phones → phonemes → graphemes)

  1. Normalize phones (IPA, feature bundles).
  2. Apply allophony (aspiration, palatalization conditioned by high front glides).
  3. Compose syllable: Onset + Medial + Nucleus + Coda + Tone.
  4. Tone-sandhi & y/w/ü carrier rules.
  5. Emit grapheme per profile (Pinyin/Hanzi/Zhuyin).
  6. Round-trip: Hanzi → candidate morphemes → phonemes (LM), Pinyin → exact phonemes (tones preserved).
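
The Pinyin leg of the round-trip in step 6 depends on recovering the tone exactly from the written form. One way to sketch this is to decompose the string and read off the combining mark. The name `marked_to_numbered` is hypothetical, and treating an unmarked syllable as tone 5 follows the neutral-tone convention used in this spec.

```python
import unicodedata

# Combining marks mapped back to tone numbers.
COMBINING_TO_TONE = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def marked_to_numbered(syllable: str) -> str:
    """Convert one diacritic-marked syllable to numbered Pinyin."""
    tone = 5  # no mark found means neutral tone
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in COMBINING_TO_TONE:
            tone = COMBINING_TO_TONE[ch]
        else:
            kept.append(ch)  # the diaeresis of ü is kept here
    base = unicodedata.normalize("NFC", "".join(kept))
    return f"{base}{tone}"
```

Only the four tone marks are removed; the diaeresis survives decomposition and is recomposed, so nǚ comes back as nü3 rather than nu3.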

YAML Skeleton (engine spec)

pgm_version: "1.0"
language: "ZH"
script_pref: ["ZH_HAN", "LATN_PINYIN", "BPMF"]
profiles:
  - id: "pinyin_marks"
    orthography_profile: "ZH_PINYIN_MARKS_2025"
    disambiguation_policy: "PGM.DFLT"
  - id: "pinyin_numbers"
    orthography_profile: "ZH_PINYIN_NUM_2025"
    disambiguation_policy: "PGM.DFLT"
  - id: "hanzi_strict"
    orthography_profile: "ZH_HAN_STD"
    disambiguation_policy: "PGM.SEMANTIC_REQUIRED"
inventory:
  phonemes:
    - id: "PH.ZH.tɕʰ"    # q-
      ipa: "t͡ɕʰ"
      features: {place: alveolo-palatal, manner: affricate, voice: VL, aspirated: true}
      grapheme_map:
        class_id: "CLS.ZH.ON.Q"
        scripts:
          - {script: LATN_PINYIN, mapping: "q"}
          - {script: BPMF, glyph: "ㄑ"}
    - id: "PH.ZH.a"      # nucleus a
      ipa: "a"
      features: {vowel: true}
      grapheme_map:
        class_id: "CLS.ZH.R.A"
        scripts:
          - {script: LATN_PINYIN, mapping: "a"}  # tone added later
          - {script: BPMF, glyph: "ㄚ"}
  tones:
    - {id: "T1", contour: "55", mark: "¯", num: 1}
    - {id: "T2", contour: "35", mark: "´", num: 2}
    - {id: "T3", contour: "214", mark: "ˇ", num: 3}
    - {id: "T4", contour: "51", mark: "`", num: 4}
    - {id: "T0", contour: "neutral", mark: "", num: 5}
spelling_policies:
  ZH_PINYIN_MARKS_2025:
    tones: "diacritics"
    y_w_carriers: true
    umlaut_u: "after_jqx_write_u_else_ü"
    erhua: "merge"
    numbers_mode: false
  ZH_PINYIN_NUM_2025:
    tones: "numbers"
    y_w_carriers: true
    umlaut_u: "after_jqx_write_u_else_u_diaeresis_optional"
    erhua: "merge"
    numbers_mode: true
  ZH_HAN_STD:
    require_lexicon: true
    variant: "simplified|traditional|auto"
lossiness:
  to_grapheme: "none"       # for pinyin
  to_grapheme_han: "high"   # for hanzi

Unit Test Fixtures

tests:
  - id: "ZH_001"
    in_phones: ["t͡ɕʰ","i̯","a","ŋ","T1"]   # qiang1
    profile: "pinyin_marks"
    expect: "qiāng"
    roundtrip_ok: true
  - id: "ZH_002"
    in_phonemes: "/ʂ a 51/"            # sha4
    profile: "pinyin_numbers"
    expect: "sha4"
    roundtrip_ok: true
  - id: "ZH_003_sandhi"
    in_pinyin: "nǐ hǎo"                 # 3rd + 3rd
    expect_after_sandhi: "ní hǎo"
  - id: "ZH_004_erhua"
    in_pinyin_base: "hua1 + r"
    profile: "pinyin_marks"
    expect: "huār"
  - id: "ZH_005_hanzi"
    in_phonemes: "/t͡ʂ a ŋ 55/"          # zhāng
    profile: "hanzi_strict"
    expect: ["张","章"]                  # needs lexicon; ambiguity set
    roundtrip_ok: false

Edge Policies

  • Apical vowels: zi/ci/si vs zhi/chi/shi/ri handled by rime class I(apical).
  • ü disambiguation: after j/q/x/y write “u” but map to /y/; elsewhere write “ü”.
  • Syllable segmentation: resolve x + iong → xiong (not xi + ong).
  • Loanwords: loan_policy=en_to_pinyin (e.g., “咖啡 kāfēi”), else tag as foreign and keep Latn.
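
The segmentation policy can be sketched as a longest-initial split followed by a check that the remainder is a known final. This is an assumption-laden illustration: `split_syllable`, `ONSETS`, and `FINALS` are hypothetical names, the orthographic carriers y and w are treated as initials for splitting purposes, and erhua forms such as huar are out of scope.

```python
# Two-letter initials come first so that zh/ch/sh win over z/c/s.
ONSETS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
          "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Toneless finals as written in Pinyin (ue covers jue/que/xue/yue).
FINALS = set(
    "a ai an ang ao e ei en eng er o ong ou "
    "i ia iao ian iang ie in ing iong iu "
    "u ua uo uai ui uan uang un ue \u00fc \u00fce".split()
)


def split_syllable(syllable: str) -> tuple[str, str]:
    """Split a toneless Pinyin syllable into (initial, final)."""
    onset = ""
    for candidate in ONSETS:
        if syllable.startswith(candidate):
            onset = candidate
            break
    final = syllable[len(onset):]
    if final not in FINALS:
        raise ValueError(f"cannot segment {syllable!r}")
    return onset, final
```

Splitting on the initial first is what yields x + iong for xiong: the initial is consumed as a unit and the whole remainder must be a single final.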

Worked Examples

  • /t͡ɕ y ɛ 35/ → jué (q/j/x + ü-class with tone-2 diacritic).
  • /t͡ʂ a ŋ 55/ → zhāng → Hanzi candidates: 张/章 (semantic).
  • “不 + 对(4)” → bú duì (bù→bú before 4th tone).
  • “一(1) + 个(4)” → yí gè (yī→yí before 4th).
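
The sandhi examples above can be reproduced with a short pass over (Hanzi, numbered Pinyin) pairs. This is a sketch under stated assumptions: `apply_sandhi` is a hypothetical name, the Hanzi is carried along so that only 不 and 一 trigger their special rules, each syllable is judged against the underlying tone of its right neighbor, and cases such as ordinal 一 or prosodic grouping of long third-tone runs are not modeled.

```python
def apply_sandhi(pairs: list[tuple[str, str]]) -> list[str]:
    """Return surface numbered Pinyin for a sequence of (hanzi, pinyin) pairs."""
    out = [pinyin for _, pinyin in pairs]
    for i in range(len(pairs) - 1):
        hanzi, cur = pairs[i]
        next_tone = pairs[i + 1][1][-1]  # underlying tone of the next syllable
        if hanzi == "不" and cur == "bu4":
            if next_tone == "4":
                out[i] = "bu2"           # bù -> bú before 4th tone
        elif hanzi == "一" and cur == "yi1":
            if next_tone == "4":
                out[i] = "yi2"           # yī -> yí before 4th tone
            elif next_tone in ("1", "2", "3"):
                out[i] = "yi4"           # yī -> yì before 1st/2nd/3rd tone
        elif cur.endswith("3") and next_tone == "3":
            out[i] = cur[:-1] + "2"      # 214 -> 35 before another 3rd tone
    return out
```

The last syllable is never changed, which matches the rule that 一 stays yī when isolated.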

PGM-01 (Mandarin) is MINTED and attached to MCLI.