PGM-07 — Vietnamese (VI_LATN)


MINTED

Purpose

Round-trip, loss-aware mapping between phones/phonemes (incl. tone features) and Vietnamese Latin orthography (Quốc Ngữ), with dialect toggles (Hanoi/Northern vs. Saigon/Southern), correct tone mark placement, stacked diacritics, and ASCII/IME foldings (TELEX/VNI/“không dấu”).

Identity

pgm::v1.0::VI_LATN::<profile>

Orthography Profiles

  • VI_LATN.strict — Canonical Quốc Ngữ with full diacritics & tones (NFC normalized).
  • VI_LATN.north — Northern phoneme layer (distinct /z/≈“d/gi”, /ʐ/≈“r”, full tone contrasts incl. hỏi vs. ngã).
  • VI_LATN.south — Southern collapses hỏi~ngã (→ one contour), rhotics merge; spelling remains standard.
  • VI_LATN.ascii — “không dấu” (no marks). Loss = tone+quality (controlled).
  • VI_LATN.telex — TELEX transliteration (aa=â, aw=ă, ee=ê, oo=ô, ow=ơ, uw=ư, dd=đ; tone letters s/f/r/x/j).
  • VI_LATN.vni — VNI transliteration (digits 1…5 after the vowel for tones, e.g. a1…a5; a6=â, a8=ă, e6=ê, o6=ô, o7=ơ, u7=ư, d9=đ).

Lossiness

  • strict/north/south: none.
  • telex/vni: none (reversible to strict).
  • ascii: controlled (tones & vowel quality lost).

Script Mechanics (Quốc Ngữ essentials)

  • Special letters: ă â ê ô ơ ư đ.
  • Tones (6): ngang (no mark), sắc (acute, ´), huyền (grave, `), hỏi (hook above, U+0309), ngã (tilde, ˜), nặng (dot below, U+0323); stored as tone={0,1,2,3,4,5}.
  • Tone placement rule: place the tone on the nuclear vowel of the rime; with digraphs/trigraphs follow standard priority (e.g., iê/yê/uô/ươ: mark on ê/ô/ơ; open-syllable ia/ua/ưa: mark on the first vowel i/u/ư; oa/oe/uy: mark on a/e/y; if only one vowel, mark it).
  • Coda set: {p, t, c/k, ch, m, n, ng, nh}.
  • Onset peculiarities: orthographic gi-, d-, r- map to dialect-specific phones; qu = /kw~w/ (strict spelling preserved).
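
A minimal Python sketch of how a tone mark stacks on a quality-marked vowel and composes under NFC. The table `TONE_MARKS` and the helper `add_tone` are illustrative names, not part of the engine spec; the tone numbering follows tone={0..5} above.

```python
import unicodedata

# Combining marks for the five marked tones (tone 0, ngang, carries no mark).
TONE_MARKS = {
    1: "\u0301",  # sắc: combining acute
    2: "\u0300",  # huyền: combining grave
    3: "\u0309",  # hỏi: combining hook above
    4: "\u0303",  # ngã: combining tilde
    5: "\u0323",  # nặng: combining dot below
}

def add_tone(vowel: str, tone: int) -> str:
    """Attach a tone mark to one vowel letter (hypothetical helper).

    The vowel may already carry a quality mark (ă â ê ô ơ ư); NFC handles
    canonical reordering and composes the stack into a single code point.
    """
    return unicodedata.normalize("NFC", vowel + TONE_MARKS.get(tone, ""))

if __name__ == "__main__":
    for vowel, tone in [("ê", 1), ("ơ", 5), ("ă", 5), ("a", 0)]:
        print(vowel, tone, add_tone(vowel, tone))
```

Appending the combining mark and normalizing keeps the table small: the sketch never has to enumerate the precomposed Vietnamese letters itself.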

Phoneme Inventory (core; tied to MCLI)

Vowels/Rimes (heads): /a ă ɐ ə ɤ e ɛ i o ɔ u ɯ ɨ y/ with length/quality encoded via â ê ô ơ ư ă; the diphthongs /iə uə ɯə/ are spelled iê/yê, uô, ươ in closed syllables and reduced to ia/ya, ua, ưa in open syllables.
Consonants: onsets /p t k m n ŋ f v s z ʂ ʐ x h j w l ɲ ʈ ɟ/; affricate allophones handled in dialect rules.
Tones: numeric 0–5 (see above), with south collapsing 3~4 to one contour at the phone level (orthography unchanged).


Mapping Logic

Phones → Graphemes

  1. Assemble syllable: onset + nucleus (monograph/digraph/trigraph) + coda + tone.
  2. Choose grapheme set for nucleus:
    • /a/ → a, /ă/ → ă, /ɐ~ə/ → â (per rime), /ɤ/ → ơ, /ɯ/ → ư; /i/ → i/y, /e/ → ê, /ɛ/ → e; /o/ → ô, /ɔ/ → o, /u/ → u.
    • Rimes: /iə/ → iê (closed) | ia (open); /uə/ → uô | ua; /ɯə/ → ươ | ưa.
  3. Apply tone placement priority:
    • Multi-vowel: mark the head vowel (ê, ơ, ô outrank a/ă/o/u/i/y); for iê/ươ/uô, place tone on ê/ơ/ô (e.g., tiếng, sướng, muỗng).
    • oa/oe/uy clusters: put tone on a/e/y (e.g., hoả, khoẻ, thuỷ).
  4. Onset rules (dialectal phones → spelling):
    • d/gi spell /z/ (north) and /j/ (south); the difference lives at the phone layer, spelling unchanged.
    • qu for /w/ before back/round vowels; gi for palatal approximant contexts where lexicalized.
  5. Coda normalization: /k/ is spelled c in native codas; k appears only in loans. /ŋ/ → ng, /ɲ/ → nh, /t͡ɕ/ coda → ch.
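
The tone placement priority in step 3 can be sketched as follows. This is a simplified illustration, not the engine's `place_tone_on_nuclear_vowel`: the names `tone_index` and `spell` are hypothetical, the nucleus is assumed to arrive already spelled (iê, ươ, oa, ...), and treating â/ă as head vowels alongside ê/ơ/ô is an assumption of this sketch.

```python
import unicodedata

TONE_MARKS = {1: "\u0301", 2: "\u0300", 3: "\u0309", 4: "\u0303", 5: "\u0323"}
HEAD_VOWELS = "êơôâă"               # quality-marked letters that attract the tone
GLIDE_INITIAL = ("oa", "oe", "uy")  # medial glide + vowel: tone on the second letter

def tone_index(nucleus: str) -> int:
    """Index of the letter in a spelled nucleus that carries the tone mark."""
    if len(nucleus) == 1:
        return 0
    # Head vowels outrank the rest: iê -> ê, ươ -> ơ, uô -> ô.
    heads = [i for i, ch in enumerate(nucleus) if ch in HEAD_VOWELS]
    if heads:
        return heads[-1]
    if nucleus.startswith(GLIDE_INITIAL):
        return 1
    return 0  # open-syllable ia/ua/ưa and falling clusters such as ai/ao

def spell(onset: str, nucleus: str, coda: str, tone: int) -> str:
    """Assemble onset + nucleus + coda and place the tone (tone 0 adds no mark)."""
    nucleus = unicodedata.normalize("NFC", nucleus)
    i = tone_index(nucleus)
    marked = nucleus[i] + TONE_MARKS.get(tone, "")
    syllable = onset + nucleus[:i] + marked + nucleus[i + 1:] + coda
    return unicodedata.normalize("NFC", syllable)

if __name__ == "__main__":
    for args in [("t", "iê", "ng", 1), ("s", "ươ", "ng", 1), ("m", "uô", "ng", 4),
                 ("h", "oa", "", 3), ("th", "uy", "", 3), ("m", "ia", "", 1)]:
        print(args, spell(*args))
```

Passing the onset separately keeps the u of qu and the i of gi out of the nucleus, so they never compete for the tone mark.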

Graphemes → Phones

  • Decode vowel quality from diacritic base; read coda; attach tone feature from diacritic.
  • Apply dialect map for d/gi/r at phone layer; preserve orthography.
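
A hedged sketch of the decode direction: NFD decomposition separates the tone mark from the quality-bearing base, and a dialect step then adjusts the tone at the phone layer. `split_tone` and `dialect_tone` are hypothetical helpers, and representing the southern hỏi~ngã merger by folding tone 4 into tone 3 is an assumption of this sketch; the spec only states that the two collapse to one contour.

```python
import unicodedata

TONE_BY_MARK = {"\u0301": 1, "\u0300": 2, "\u0309": 3, "\u0303": 4, "\u0323": 5}

def split_tone(syllable: str):
    """Return (toneless syllable in NFC, tone number 0-5)."""
    tone = 0
    kept = []
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_BY_MARK:
            tone = TONE_BY_MARK[ch]   # tone marks are removed from the spelling
        else:
            kept.append(ch)           # quality marks (circumflex, horn, breve) stay
    return unicodedata.normalize("NFC", "".join(kept)), tone

def dialect_tone(tone: int, dialect: str) -> int:
    """Phone-layer tone; HCM folds ngã (4) into hỏi (3), orthography untouched."""
    if dialect == "HCM" and tone == 4:
        return 3
    return tone

if __name__ == "__main__":
    base, tone = split_tone("rỗi")
    print(base, tone, dialect_tone(tone, "HN"), dialect_tone(tone, "HCM"))
```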

Edge Policies

  • i/y alternation: word-initial y vs medial i preserved; iê/yê contextual (after consonant → iê, word-initial → yê unless lexically fixed).
  • Open syllable digraph reduction: iê → ia, uô → ua, ươ → ưa when no coda (spelled that way).
  • ASCII folding:
    • ascii.smart: ă→a, â→a, ê→e, ô→o, ơ→o, ư→u, đ→d; then strip all tone marks.
    • ascii.flat: same as smart, plus optional w hints (aw, ow, uw); hints are off by default because the TELEX/VNI profiles already cover that need.
  • Normalization: output NFC; combine tone + quality marks per canonical order.
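
The ascii.smart folding can be approximated in a few lines of Python. This is a sketch under the assumption that dropping every combining mark after NFD decomposition is acceptable for the input; `fold_ascii` is an illustrative name, and đ needs an explicit mapping because it has no canonical decomposition.

```python
import unicodedata

def fold_ascii(text: str) -> str:
    """Fold Quốc Ngữ to "không dấu": strip tone and quality marks, map đ/Đ."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks (tones, circumflex,
    # breve, horn), so this filter keeps only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.replace("\u0111", "d").replace("\u0110", "D")  # đ, Đ

if __name__ == "__main__":
    for word in ["thủy", "Quốc Ngữ", "đường"]:
        print(word, "->", fold_ascii(word))
```

The fold is one-way: tone and vowel quality cannot be recovered from the output, which is the controlled loss recorded for the ascii profile.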

YAML Skeleton (engine spec)

pgm_version: "1.0"
language: "VI"
script_pref: ["VI_LATN","VI_Telex","VI_VNI","VI_ASCII"]

profiles:
  - id: "strict"
    orthography_profile: "VI_STD_2025"
  - id: "north"
    orthography_profile: "VI_STD_2025"
    dialect: "HN"
  - id: "south"
    orthography_profile: "VI_STD_2025"
    dialect: "HCM"
  - id: "telex"
    orthography_profile: "VI_TELEX_2025"
  - id: "vni"
    orthography_profile: "VI_VNI_2025"
  - id: "ascii"
    orthography_profile: "VI_ASCII_2025"

inventory:
  tones:
    - {id: "T0", name: "ngang", diacritic: null, telex: "", vni: "0"}
    - {id: "T1", name: "sắc",   diacritic: "´", telex: "s", vni: "1"}
    - {id: "T2", name: "huyền", diacritic: "`", telex: "f", vni: "2"}
    - {id: "T3", name: "hỏi",   diacritic: "\u0309", telex: "r", vni: "3"}   # combining hook above
    - {id: "T4", name: "ngã",   diacritic: "˜", telex: "x", vni: "4"}
    - {id: "T5", name: "nặng",  diacritic: ".", telex: "j", vni: "5"}
  vowels:
    - {base: "a", quality: "a"}
    - {base: "ă", quality: "ă", telex: "aw", vni: "a8"}
    - {base: "â", quality: "â", telex: "aa", vni: "a6"}
    - {base: "e", quality: "e"}
    - {base: "ê", quality: "ê", telex: "ee", vni: "e6"}
    - {base: "i", quality: "i"}
    - {base: "o", quality: "o"}
    - {base: "ô", quality: "ô", telex: "oo", vni: "o6"}
    - {base: "ơ", quality: "ơ", telex: "ow", vni: "o7"}
    - {base: "u", quality: "u"}
    - {base: "ư", quality: "ư", telex: "uw", vni: "u7"}
    - {base: "y", quality: "iY"}
  specials:
    - {letter: "đ", telex: "dd", vni: "d9", ascii: "d"}
operators:
  - {name: "tone_place", fn: "place_tone_on_nuclear_vowel"}
  - {name: "trigraph_rules", fn: "iê/ươ/uô selection & open-syllable reduction"}
  - {name: "dialect_map", fn: "onset d/gi/r → phones(HN|HCM)"}
  - {name: "normalize_nfc", fn: "compose_diacritics"}
policies:
  ascii:
    mode: "smart"
  dialect:
    default: "HN"
lossiness:
  strict_to_ascii: "controlled"
  telex_vni: "none"

Unit Test Fixtures

tests:
  - id: "VI_001_tieng"
    in_phonemes: "/t iə ŋ/ + T1"   # sắc
    profile: "strict"
    expect_graphemes: "tiếng"      # tone on ê
    roundtrip_ok: true

  - id: "VI_002_suong"
    in_phonemes: "/s ɯə ŋ/ + T1"
    profile: "strict"
    expect_graphemes: "sướng"      # ươ + sắc on ơ

  - id: "VI_003_quoc"
    in_phonemes: "/kw ok/ + T1"    # /o/ → ô, T1 = sắc
    profile: "strict"
    expect_graphemes: "quốc"       # ô + sắc

  - id: "VI_004_hoa"
    in_phonemes: "/h w a/ + T3"
    profile: "strict"
    expect_graphemes: "hoả"        # hỏi on a in oa

  - id: "VI_005_telex_roundtrip"
    in_telex: "tieesng"            # ee → ê, s → sắc
    profile: "telex"
    expect: "tiếng"

  - id: "VI_006_vni_roundtrip"
    in_vni: "tie61ng"              # e6 → ê, 1 → sắc
    profile: "vni"
    expect: "tiếng"

  - id: "VI_007_ascii_fold"
    in_graphemes: "thủy"
    profile: "ascii"
    expect: "thuy"

  - id: "VI_008_south_collapse"
    in_graphemes: "rỗi"
    profile: "south"
    expect_phonemes: "/ɹ~j oi/ + T(merged_3_4)"   # orthography unchanged

Worked Micro-Examples

  • /t iə ŋ/ + sắc → tiếng (tone sits on ê).
  • /s ɯə ŋ/ + huyền → sường (didactic form); with sắc → sướng.
  • /ɯə/ (open) + nặng → ựa (e.g., dựa, tone on ư); closed syllable vượng keeps ươ + tone on ơ.
  • ASCII/IME: tiếng → TELEX tieesng or VNI tie61ng → back to tiếng losslessly.
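
The strict → TELEX/VNI direction of the last example can be sketched as below. This is an illustrative encoder, not the engine's implementation: `encode` and `encode_letter` are hypothetical names, and the choice to emit quality and tone keys directly after the vowel they modify is an assumption (input methods also accept the tone key at the end of the syllable).

```python
import re
import unicodedata

VNI_QUALITY = {"\u0302": "6", "\u031b": "7", "\u0306": "8"}  # circumflex, horn, breve
VNI_TONE = {"\u0301": "1", "\u0300": "2", "\u0309": "3", "\u0303": "4", "\u0323": "5"}
TELEX_TONE = {"\u0301": "s", "\u0300": "f", "\u0309": "r", "\u0303": "x", "\u0323": "j"}

# One base character followed by its combining marks (after NFD decomposition).
CLUSTER = re.compile(r"([^\u0300-\u036f])([\u0300-\u036f]*)")

def encode_letter(base: str, marks: str, scheme: str) -> str:
    """Encode one letter plus its marks as VNI digits or TELEX letters."""
    if base in "\u0111\u0110":                      # đ / Đ: d9 in VNI, dd in TELEX
        plain = "d" if base == "\u0111" else "D"
        return plain + ("9" if scheme == "vni" else plain)
    quality = next((m for m in marks if m in VNI_QUALITY), "")
    tone = next((m for m in marks if m in VNI_TONE), "")
    if scheme == "vni":
        return base + VNI_QUALITY.get(quality, "") + VNI_TONE.get(tone, "")
    if quality == "\u0302":
        suffix = base.lower()                       # circumflex doubles the letter
    elif quality:
        suffix = "w"                                # horn and breve both use w
    else:
        suffix = ""
    return base + suffix + TELEX_TONE.get(tone, "")

def encode(text: str, scheme: str) -> str:
    """Transliterate strict Quốc Ngữ into scheme "vni" or "telex"."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(encode_letter(b, m, scheme) for b, m in CLUSTER.findall(nfd))

if __name__ == "__main__":
    print(encode("tiếng", "telex"), encode("tiếng", "vni"))
```

Decoding back to strict form would additionally need the tone placement rule, since a typist may put the tone key anywhere after the vowel cluster.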

Operational Knobs

  • pgm.profile=strict|north|south|telex|vni|ascii
  • pgm.dialect=HN|HCM (overrides north/south defaults)
  • pgm.ascii.mode=smart|flat
  • pgm.normalize=NFC|NFD
  • pgm.lossiness_report=true, pgm.audit_trace=true

PGM-07 (Vietnamese) is MINTED and registered in the Master Cross-Lattice Index.