PGM-09 — Cantonese (YUE_HAN)


MINTED

Purpose

Deterministic mapping between phones/phonemes and Cantonese orthography in both Traditional Han characters and phonetic romanization systems (Jyutping, Yale), with support for tone marks/numbers, colloquial characters, and controlled-loss ASCII folding.

Identity

pgm::v1.0::YUE_HAN::<profile>
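
A consumer of this identifier might split it into its parts before dispatching to a profile. The sketch below is illustrative only: `PgmId` and `parse_pgm_id` are hypothetical names, and it assumes the `<profile>` slot carries the bare profile id (for example `jyutping_num`) rather than the `YUE_HAN.`-prefixed form.

```python
from typing import NamedTuple

# Profile ids taken from the Orthography Profiles list (assumed to appear bare).
KNOWN_PROFILES = {"hanzi_strict", "jyutping_num", "jyutping_marks", "yale", "ascii"}


class PgmId(NamedTuple):
    version: str
    module: str
    profile: str


def parse_pgm_id(identifier: str) -> PgmId:
    """Split 'pgm::<version>::<module>::<profile>' into its fields."""
    parts = identifier.split("::")
    if len(parts) != 4 or parts[0] != "pgm":
        raise ValueError(f"not a PGM identifier: {identifier!r}")
    _, version, module, profile = parts
    if profile not in KNOWN_PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return PgmId(version, module, profile)


if __name__ == "__main__":
    print(parse_pgm_id("pgm::v1.0::YUE_HAN::jyutping_num"))
```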

Orthography Profiles

  • YUE_HAN.hanzi_strict — Traditional Hanzi for all morphemes, including colloquial forms (啱, 嘅, 冇, 咩, etc.). Lossiness = high (logographic).
  • YUE_HAN.jyutping_num — Jyutping with tone numbers (1–6).
  • YUE_HAN.jyutping_marks — Jyutping with diacritics for tones (optional; less common).
  • YUE_HAN.yale — Yale romanization (tones via marks or numbers).
  • YUE_HAN.ascii — Tone-stripped Jyutping or Yale (controlled loss).

Lossiness

  • Hanzi: high (logographic; requires lexicon for back-mapping).
  • Jyutping/Yale: none (if tones retained).
  • ASCII: controlled (tone & vowel length lost).

Script Mechanics

  • Syllable structure: (C)(G)V(V)(C) + tone.
    • Onset (C): ~19 initials (p, pʰ, m, f, t, tʰ, n, l, k, kʰ, ŋ, h, ts, tsʰ, s, kw, kwʰ, w, j).
    • Glides (G): /w/ after k/kʰ only, usually analysed as the labialised initials kw, kwʰ listed above; standard analyses posit no separate /j/ medial.
    • Nucleus (V or VV): monophthong or diphthong; combined with the codas below this gives 53 rimes.
    • Coda (C): p, t, k, m, n, ŋ, or ∅.
  • Tone system (HK 6-tone):
    • 1: high level (¯), 2: high rising (´), 3: mid level, 4: low falling (`), 5: low rising, 6: low level.
    • Entering tones (checked syllables w/ p, t, k codas) share pitch with 1, 3, 6 → labeled 7, 8, 9 in traditional schemes, but in modern Jyutping are merged with 1, 3, 6.
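
The tone table and the entering-tone relabeling can be sketched as a pair of lookups. This is a minimal illustration rather than the engine's implementation: `tone_features` and `tone_label` are hypothetical names, and the sketch assumes the `checked` feature is derived from the coda instead of being stored per tone.

```python
# (pitch, contour) for the six HK tones, as listed above.
TONES = {
    1: ("high", "level"),
    2: ("high", "rising"),
    3: ("mid", "level"),
    4: ("low", "falling"),
    5: ("low", "rising"),
    6: ("low", "level"),
}
CHECKED_CODAS = {"p", "t", "k"}
ENTERING_TO_MODERN = {7: 1, 8: 3, 9: 6}
MODERN_TO_ENTERING = {modern: trad for trad, modern in ENTERING_TO_MODERN.items()}


def tone_features(num: int, coda: str = "") -> dict:
    """Return the feature bundle for a tone number (7-9 are folded to 1/3/6)."""
    num = ENTERING_TO_MODERN.get(num, num)
    pitch, contour = TONES[num]
    return {"pitch": pitch, "contour": contour, "checked": coda in CHECKED_CODAS}


def tone_label(num: int, coda: str = "", traditional: bool = False) -> int:
    """Return the tone number to print: 1-6 by default, 7-9 for checked
    syllables when traditional numbering is requested."""
    modern = ENTERING_TO_MODERN.get(num, num)
    if traditional and coda in CHECKED_CODAS and modern in MODERN_TO_ENTERING:
        return MODERN_TO_ENTERING[modern]
    return modern


if __name__ == "__main__":
    print(tone_features(6, "k"))
    print(tone_label(1, "k", traditional=True))
```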

Phoneme Inventory (MCLI-linked)

Vowels/Rimes: /a aː ɐ ɐː ɛ ɛː eː iː ɪ oː ɔ ɔː uː yː œː ɵ/ and diphthongs including /ai au ei ou iu ui œy ɐi ɐu ɔi/.
Consonants: full onset set above; coda set {p t k m n ŋ}.
Tones: stored as numeric 1–6; features = {pitch: high/mid/low, contour: level/rising/falling, checked: true/false}.


Mapping Logic

Phones → Graphemes (Jyutping/Yale)

  1. Onset phoneme → initial table (e.g., /pʰ/ → p in Jyutping, p in Yale).
  2. Nucleus + coda → rime mapping table.
  3. Tone assignment → numeric suffix (Jyutping) or diacritic (Yale marks).
  4. For Hanzi: requires lexicon lookup; select appropriate character(s) for morpheme.
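
The four steps above can be sketched as table lookups followed by concatenation. The tables here are deliberately partial: they hold only entries that appear in this spec's skeleton or fixtures. `to_jyutping`, `to_hanzi`, and the one-entry `LEXICON` are hypothetical, included only to show the shape of the pipeline.

```python
# Step 1: onset phoneme -> Jyutping initial (partial table).
INITIALS = {"": "", "p": "b", "pʰ": "p", "m": "m", "f": "f",
            "ŋ": "ng", "s": "s", "h": "h"}
# Step 2: nucleus + coda -> Jyutping rime (partial table).
RIMES = {"aː": "aa", "ɐ": "a", "ɔː": "o", "ou": "ou", "ɪk": "ik"}
# Step 4: hypothetical lexicon keyed by Jyutping syllable.
LEXICON = {"mou5": ["冇"]}


def to_jyutping(onset: str, rime: str, tone: int) -> str:
    """Encode one syllable as Jyutping with a numeric tone suffix (step 3)."""
    if tone not in range(1, 7):
        raise ValueError(f"tone must be 1-6, got {tone}")
    if onset not in INITIALS or rime not in RIMES:
        raise ValueError(f"no mapping for /{onset} {rime}/")
    return INITIALS[onset] + RIMES[rime] + str(tone)


def to_hanzi(onset: str, rime: str, tone: int) -> list:
    """Look up candidate characters; an empty list means 'not in lexicon'."""
    return LEXICON.get(to_jyutping(onset, rime, tone), [])


if __name__ == "__main__":
    print(to_jyutping("ŋ", "ɔː", 5))
    print(to_hanzi("m", "ou", 5))
```

Keeping the Hanzi step behind a separate lookup reflects the spec's point that romanization is table-driven while character selection depends on a lexicon.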

Graphemes → Phones

  • Jyutping: split initial+rime; assign phoneme features; read tone number/mark.
  • Yale: map spelling conventions to same phoneme set.
  • Hanzi: ambiguous → requires lexicon for reading.
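
For the Jyutping direction, splitting a syllable into initial, rime, and tone can be done with a single pattern. The sketch below is an assumption-laden illustration: `parse_jyutping` is a hypothetical name, the initial spellings follow the standard Jyutping scheme for the 19 initials listed under Script Mechanics, and the rime is returned as its Jyutping spelling because the full rime table is not reproduced in this document.

```python
import re

# Jyutping initial spelling -> IPA, for the 19 initials listed earlier.
INITIAL_TO_IPA = {
    "b": "p", "p": "pʰ", "m": "m", "f": "f", "d": "t", "t": "tʰ",
    "n": "n", "l": "l", "g": "k", "k": "kʰ", "ng": "ŋ", "h": "h",
    "gw": "kw", "kw": "kwʰ", "w": "w", "z": "ts", "c": "tsʰ",
    "s": "s", "j": "j",
}

# Two-letter initials come first; the rime must start with a vowel letter
# or be a syllabic nasal (m, ng), so "ng5" parses as a bare rime.
SYLLABLE = re.compile(
    r"(ng|gw|kw|[bpmfdtnlgkhwzcsj])?((?:[aeiouy][a-z]*)|m|ng)([1-6])"
)


def parse_jyutping(syllable: str) -> tuple:
    """Return (initial as IPA, rime as Jyutping spelling, tone number)."""
    match = SYLLABLE.fullmatch(syllable)
    if match is None:
        raise ValueError(f"not a Jyutping syllable: {syllable!r}")
    initial, rime, tone = match.groups()
    return (INITIAL_TO_IPA.get(initial, ""), rime, int(tone))


if __name__ == "__main__":
    for s in ("ngo5", "sik1", "gwong2", "ng5"):
        print(s, parse_jyutping(s))
```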

Edge Policies

  • Tone sandhi: optional; not marked in writing; may be applied in speech-layer output.
  • Colloquial characters: preserved in hanzi_strict; replaced with standard synonyms if colloquial=off.
  • Entering tones: merge with 1/3/6 unless traditional_tone_numbers=true.
  • ASCII folding: strip tone numbers or marks; vowels kept as plain aeiouy.
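
The ASCII folding policy can be sketched with Unicode decomposition: tone diacritics become combining marks that are dropped, and tone digits are removed where they follow a letter. `ascii_fold` is a hypothetical name, and the sketch assumes input is romanized text only (Hanzi passes through unchanged).

```python
import re
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip tone diacritics and tone digits from romanized Cantonese."""
    # NFD separates base letters from combining tone marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove a tone digit only when it directly follows a letter.
    return re.sub(r"(?<=[A-Za-z])[1-9]", "", no_marks)


if __name__ == "__main__":
    print(ascii_fold("ngo5"))
    print(ascii_fold("hou2 sik6"))
```

Note that a Yale low-register `h` is an ordinary letter and survives folding, so folded Yale and folded Jyutping forms of the same syllable can differ.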

YAML Skeleton (engine spec)

pgm_version: "1.0"
language: "YUE"
script_pref: ["YUE_HAN","YUE_Jyutping","YUE_Yale","YUE_ASCII"]

profiles:
  - id: "hanzi_strict"
    orthography_profile: "YUE_HAN_STD"
    disambiguation_policy: "PGM.SEMANTIC_REQUIRED"
  - id: "jyutping_num"
    orthography_profile: "YUE_Jyutping_Num"
  - id: "jyutping_marks"
    orthography_profile: "YUE_Jyutping_Marks"
  - id: "yale"
    orthography_profile: "YUE_Yale"
  - id: "ascii"
    orthography_profile: "YUE_ASCII"

inventory:
  tones:
    - {id: "T1", pitch: "high", contour: "level", num: 1}
    - {id: "T2", pitch: "high", contour: "rising", num: 2}
    - {id: "T3", pitch: "mid", contour: "level", num: 3}
    - {id: "T4", pitch: "low", contour: "falling", num: 4}
    - {id: "T5", pitch: "low", contour: "rising", num: 5}
    - {id: "T6", pitch: "low", contour: "level", num: 6}
  initials:
    - {ipa: "p", jyutping: "b", yale: "b"}
    - {ipa: "pʰ", jyutping: "p", yale: "p"}
    - {ipa: "m", jyutping: "m", yale: "m"}
    - {ipa: "f", jyutping: "f", yale: "f"}
    # ... (all initials filled in table)
  rimes:
    - {ipa: "aː", jyutping: "aa", yale: "a"}
    - {ipa: "ɐ", jyutping: "a", yale: "a"}
    - {ipa: "ɐi", jyutping: "ai", yale: "ai"}
    # ... (full rime set)
operators:
  - {name: "merge_entering_tones", fn: "7→1, 8→3, 9→6 unless traditional_tone_numbers=true"}
  - {name: "ascii_fold", fn: "strip_tone_numbers_and_marks"}
lossiness:
  hanzi_to_phoneme: "high"
  romanized_to_phoneme: "none"
  ascii_to_phoneme: "controlled"

Unit Test Fixtures

tests:
  - id: "YUE_001_ngoh"
    in_phonemes: "/ŋ ɔː/ + T5"
    profile: "jyutping_num"
    expect: "ngo5"

  - id: "YUE_002_sik"
    in_phonemes: "/s ɪ k/ + T1"
    profile: "jyutping_num"
    expect: "sik1"

  - id: "YUE_003_hou2"
    in_phonemes: "/h ou/ + T2"
    profile: "jyutping_num"
    expect: "hou2"

  - id: "YUE_004_yale"
    in_phonemes: "/j œː/ + T5"
    profile: "yale"
    expect: "yéuh"

  - id: "YUE_005_ascii"
    in_romanized: "ngo5"
    profile: "ascii"
    expect: "ngo"  # tone removed

  - id: "YUE_006_hanzi"
    in_phonemes: "/m ou/ + T5"
    profile: "hanzi_strict"
    expect: ["冇"]  # requires lexicon

Worked Micro-Examples

  • /ŋɔː/ + T5 → Jyutping: ngo5, Yale: ngóh.
  • /sɪk/ + T6 → Jyutping: sik6, Yale: sihk.
  • /hou/ + T2 → Jyutping: hou2, Yale: hóu.
  • Colloquial 冇 (not have) → Jyutping: mou5; Hanzi profile preserves the colloquial form.
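
The Jyutping and Yale pairs above are related by a handful of spelling rules, which the following sketch applies to single syllables. It is a simplified illustration, not a complete converter: `jyutping_to_yale` is a hypothetical name, tone 1 is always rendered as high level (macron), and only the conventions needed for ordinary syllables and syllabic nasals are handled.

```python
import re
import unicodedata

YALE_INITIALS = {"z": "j", "c": "ch", "j": "y"}  # other initials are unchanged
# Combining marks: macron (1), acute (2, 5), grave (4); tones 3 and 6 are bare.
TONE_MARK = {1: "\u0304", 2: "\u0301", 3: "", 4: "\u0300", 5: "\u0301", 6: ""}
LOW_REGISTER = {4, 5, 6}  # marked with "h" after the vowel letters

SYLLABLE = re.compile(
    r"(ng|gw|kw|[bpmfdtnlgkhwzcsj])?([aeiouy]+|m|ng)(ng|[ptkmn])?([1-6])"
)


def jyutping_to_yale(syllable: str) -> str:
    """Convert one Jyutping syllable (tone number) to Yale (tone marks)."""
    match = SYLLABLE.fullmatch(syllable)
    if match is None:
        raise ValueError(f"not a Jyutping syllable: {syllable!r}")
    initial = match.group(1) or ""
    nucleus = match.group(2)
    coda = match.group(3) or ""
    tone = int(match.group(4))

    if nucleus == "aa" and not coda:
        nucleus = "a"  # long /aː/ is written "a" in open syllables
    nucleus = nucleus.replace("oe", "eu").replace("eo", "eu")
    if initial == "j" and nucleus.startswith("yu"):
        initial = ""  # Jyutping "jyu" is Yale "yu", not "yyu"
    initial = YALE_INITIALS.get(initial, initial)

    # The mark sits on the first of a/e/i/o/u, or on the first letter of a
    # syllabic nasal; "y" in "yu" never carries the mark.
    idx = next((i for i, ch in enumerate(nucleus) if ch in "aeiou"), 0)
    marked = nucleus[: idx + 1] + TONE_MARK[tone] + nucleus[idx + 1:]
    h = "h" if tone in LOW_REGISTER else ""
    return unicodedata.normalize("NFC", initial + marked + h + coda)


if __name__ == "__main__":
    for s in ("ngo5", "sik6", "hou2", "mou5"):
        print(s, "->", jyutping_to_yale(s))
```

Under these rules the hypothetical input `joe5` yields the same Yale string as the `YUE_004_yale` fixture.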

Operational Knobs

  • pgm.profile=hanzi_strict|jyutping_num|jyutping_marks|yale|ascii
  • pgm.traditional_tone_numbers=true|false
  • pgm.colloquial=on|off
  • pgm.ascii.mode=strip_tones|keep_tones
  • pgm.lossiness_report=true
  • pgm.audit_trace=true

PGM-09 (Cantonese) is MINTED and connected to the Master Cross-Lattice Index.