0) Purpose (where it lives in the stack)
- Goal: deterministic, auditable transformations between speech units (phones/phonemes) and writing units (graphemes/graphotactics) with dialect, register, and orthography controls.
- Bridges:
- Upstream: ASR/phonetics → phones → phonemes (language model)
- Core: PGM (phoneme ↔ grapheme using MCLI class nodes + operators)
- Downstream: orthographic renderers, TTS, transliterators, search/IR.
1) Naming & identity
- Canonical name: Phonemic–Graphemic Module
- Short handle: PGM (stable), alias PhonoGraph (human-friendly)
- Identity string in the lattice:
pgm::v1.0::<lang_or_script>::<profile>
2) Data model (minimal, composable)
2.1 Node types
- phoneme_node – language-specific phoneme (e.g., /t̪/, /ʈ/, /ɲ/, /aː/).
- allophone_rule – phone→phoneme conditioning (contextual, prosodic).
- grapheme_map – pointers into MCLI class_id + script entries.
- orthography_profile – spelling policy (etymological vs phonemic, digraph policy, diacritics, schwa rules, joining rules).
- disambiguation_policy – how to choose among multiple valid spellings or readings.
- lossiness_flag – none|controlled|high on each mapping step.
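The node types above can be sketched as plain dataclasses. A minimal illustration; the field names mirror the schema in 2.2, and anything beyond that (defaults, exact types) is an assumption:

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Minimal sketches of three PGM node types; details are illustrative.

@dataclass
class GraphemeMap:
    class_id: str                                # pointer into MCLI, never a local redefinition
    scripts: list = field(default_factory=list)  # per-script glyphs/mappings

@dataclass
class PhonemeNode:
    id: str                                      # e.g. "PH.HI.t_dental_voiceless"
    ipa: str                                     # e.g. "t̪"
    features: dict = field(default_factory=dict)
    grapheme_map: GraphemeMap | None = None

@dataclass
class AllophoneRule:
    phones: list                                 # surface phones this rule collapses
    phoneme: str                                 # target phoneme_node id
    context: str = ""                            # conditioning environment

t_dental = PhonemeNode(
    id="PH.HI.t_dental_voiceless",
    ipa="t̪",
    features={"manner": "stop", "place": "dental", "voice": "VL"},
    grapheme_map=GraphemeMap("CLS.P.STOP.DENTAL.VL",
                             [{"script": "HI_DEVA", "glyph": "त"}]),
)
```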
2.2 Canonical schema (YAML)
pgm_version: "1.0"
language: "HI" # ISO 639-1 or custom (multi-language allowed)
script_pref: ["HI_DEVA","UR_ARAB","LATN"] # fallback order
profiles:
- id: "std"
orthography_profile: "HI_STD_2025"
disambiguation_policy: "PGM.DFLT"
inventory:
phonemes:
- id: "PH.HI.t_dental_voiceless" # key
ipa: "t̪"
features: {manner: stop, place: dental, voice: VL, aspirated: false}
grapheme_map:
class_id: "CLS.P.STOP.DENTAL.VL" # from MCLI
scripts:
- {script: HI_DEVA, glyph: "त"}
- {script: LATN, mapping: "t"}
- id: "PH.HI.a_long"
ipa: "aː"
features: {vowel: true, length: long}
grapheme_map:
class_id: "CLS.V.VOWEL.A_LONG"
scripts:
- {script: HI_DEVA, glyph: ["आ","ा"], select: "independent|matra"}
- {script: LATN, mapping: "ā"}
allophony:
- phones: ["t","t̪"] -> phoneme: "PH.HI.t_dental_voiceless" when {context: "dental_env"}
spelling_policies:
HI_STD_2025:
schwa_deletion: "std" # std|off|aggressive
nukta: true
conjuncts: "preferred" # preferred|minimal|none
latin_translit: "IAST"
lossiness:
to_grapheme: "none"
to_phoneme: "controlled" # due to orthographic underspecification
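Once parsed (with any YAML library), a profile document is just nested dicts; a minimal validation pass might look like the sketch below. The required-key set and the `CLS.` prefix check are assumptions drawn from the schema above, not a normative validator:

```python
# Validate a parsed PGM document (shown as a dict literal, equivalent
# to what yaml.safe_load would return for the schema above).
REQUIRED_TOP = {"pgm_version", "language", "script_pref", "inventory"}

def validate_pgm(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the doc passes."""
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_TOP - doc.keys())]
    for ph in doc.get("inventory", {}).get("phonemes", []):
        gm = ph.get("grapheme_map", {})
        if not gm.get("class_id", "").startswith("CLS."):
            errors.append(f"{ph.get('id')}: grapheme_map must point at an MCLI class_id")
    return errors

doc = {
    "pgm_version": "1.0",
    "language": "HI",
    "script_pref": ["HI_DEVA", "LATN"],
    "inventory": {"phonemes": [
        {"id": "PH.HI.a_long", "ipa": "aː",
         "grapheme_map": {"class_id": "CLS.V.VOWEL.A_LONG"}},
    ]},
}
```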
3) Core algorithm (round-trip deterministic where possible)
- Phones → Phonemes
- Normalize phones (IPA or feature bundles).
- Apply allophone_rule tables (dialect/register aware).
- Emit phoneme chain with features.
- Phonemes → Graphemes
- For each phoneme, read grapheme_map.class_id → pull script options from MCLI.
- Apply orthography_profile:
- choose digraphs vs diacritics, conjunct preference, schwa behavior, abjad vowelization mode, Kana dakuten/handakuten, Geʽez order.
- Emit grapheme clusters (with operator nodes from MCLI).
- Graphemes → Phonemes
- Decompose grapheme clusters using MCLI operators (virāma, niqqud, ḥarakāt, finals, joining).
- Map to class_id → pick phoneme from language profile.
- If underspecified (e.g., abjad without vowels), set ambiguity_set and request policy (diacritic guess, lexicon, LM).
- Loss test
- Mark lossiness_flag on each token: none (Hungarian multigraphs), controlled (Hindi schwa), high (unvowelized Arabic).
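The phonemes → graphemes step reduces to a table lookup plus policy application. A minimal sketch, assuming a toy lookup table standing in for MCLI and a positional independent-vs-matra rule standing in for the orthography profile:

```python
# Phonemes -> graphemes: look up each phoneme's script entry via its
# grapheme_map, picking independent vs. matra vowel forms positionally.
GRAPHEME_TABLE = {  # phoneme id -> HI_DEVA entry (toy subset)
    "PH.HI.k": {"glyph": "क"},
    "PH.HI.t_dental_voiceless": {"glyph": "त"},
    "PH.HI.a_long": {"glyph": ["आ", "ा"], "select": "independent|matra"},
}

def to_graphemes(phoneme_ids: list[str]) -> str:
    out = []
    for i, pid in enumerate(phoneme_ids):
        glyph = GRAPHEME_TABLE[pid]["glyph"]
        if isinstance(glyph, list):
            # matra after a consonant, independent form word-initially
            glyph = glyph[1] if i > 0 else glyph[0]
        out.append(glyph)
    return "".join(out)
```

The same loop generalizes once the table is replaced by real MCLI class lookups and the positional rule by the active orthography_profile.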
4) Profiles (so it behaves like native speakers expect)
- EN_LATN.phonemic – near-phonemic spelling (for pedagogy).
- EN_LATN.etymological – keep historical spellings.
- HU_LATN.atomic – treat Cs, Dzs, Gy, Ly, Ny, Sz, Ty, Zs as atomic letters (collation unit = atomic).
- HI_DEVA.std – standard schwa deletion; nukta on; conjuncts preferred.
- UR_ARAB.nastaliq – dual-joining; vowelization absent by default; ezāfe strategy for Persian loans if included.
- HEBR.niqqud_off / on – optional vowel points.
- GE_EZ.order7 – mandatory vowel order selection (ä,u,i,a,e,ï,o).
- JA_KANA.strict – dakuten/handakuten rules, sokuon (っ/ッ), chōon (ー) in katakana.
5) Disambiguation policies (recursion with clarity)
- PGM.DFLT — prefer majority orthographic convention; if multiple, pick shortest legal form.
- PGM.DIALECT_FIRST() — choose dialect’s graphemic habit.
- PGM.SEMANTIC_HINTS — allow lexicon/LM to resolve homographs (e.g., Arabic roots).
- PGM.PEDAGOGIC — add vowel marks/niqqud/mātrās even if optional.
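PGM.DFLT ("majority convention, then shortest legal form") can be sketched as a single sort key; the frequency table and candidates here are invented for illustration:

```python
# PGM.DFLT: prefer the majority orthographic convention; among ties,
# pick the shortest legal form (lexical order as a final tiebreak).
def pgm_dflt(candidates: list[str], freq: dict[str, int]) -> str:
    return min(candidates, key=lambda c: (-freq.get(c, 0), len(c), c))
```

Other policies (PGM.DIALECT_FIRST, PGM.SEMANTIC_HINTS) would swap in a different key function rather than a different loop.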
6) Worked micro-examples (multi-script)
6.1 Hindi (phones → Devanāgarī)
Input phones: [k, a, t̪, a, b, a]
→ phonemes: /k a t̪ a b a/
Policy: HI_DEVA.std
→ graphemes: क + त + ब, with word-final schwa deletion; the lexicon decides between the composed form कतबा and the loanword spelling किताब.
Lossiness: controlled.
6.2 Urdu (phonemes → Abjad, unvowelized)
Phonemes: /z ɪ n d̪ ə ɡ iː/
Policy: UR_ARAB.nastaliq (no ḥarakāt)
→ زندگی
Back to phonemes: emits an ambiguity set; the likely reading /z ɪ n d̪ ə ɡ iː/ is resolved by the LM.
Lossiness: high (vowels absent).
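The graphemes → phonemes direction over an unvowelized abjad hands an ambiguity set to an external scorer (lexicon/LM); a sketch in which the candidate vowelizations and scores are invented:

```python
# Graphemes -> phonemes over an abjad: enumerate vowelizations, let an
# external scorer pick; the lossiness_flag stays "high" either way.
def resolve_ambiguity(ambiguity_set: list[str], score) -> tuple[str, str]:
    best = max(ambiguity_set, key=score)
    return best, "high"   # lossiness flag travels with the token

candidates = ["z ɪ n d̪ ə ɡ iː", "z a n d̪ a ɡ iː"]  # toy readings of زندگی
lm_score = {"z ɪ n d̪ ə ɡ iː": 0.92, "z a n d̪ a ɡ iː": 0.03}.get
best, loss = resolve_ambiguity(candidates, lambda c: lm_score(c, 0.0))
```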
6.3 Hungarian (phonemes → atomic multigraphs)
Phonemes: /d͡ʒ/
Policy: HU_LATN.atomic
→ Dzs (one collation unit).
Lossiness: none.
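Treating Dzs, Sz, Zs, etc. as atomic letters amounts to greedy longest-match segmentation; a sketch using the multigraph inventory from the HU_LATN.atomic profile (geminate spellings like ssz would need extra rules not shown here):

```python
# Segment Hungarian text into atomic collation units, longest match first.
HU_MULTIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]

def hu_segment(word: str) -> list[str]:
    units, i, w = [], 0, word.lower()
    while i < len(w):
        for m in HU_MULTIGRAPHS:          # list is ordered longest-first
            if w.startswith(m, i):
                units.append(m)
                i += len(m)
                break
        else:                             # no multigraph matched here
            units.append(w[i])
            i += 1
    return units
```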
6.4 Japanese (phonemes → Kana)
Phonemes: /p a/ → ぱ/パ (handakuten); long vowel /oː/ in katakana → オー.
Lossiness: none.
6.5 Chinese (phonemes → Latin pinyin or semantic Han)
Phonemes: /ʈ͡ʂ aŋ/
Policy: ZH_HAN.pinyin_out → zhang;
If ZH_HAN.han_out: choose Han via lexicon (章/张) → requires semantics; mark {phoneme_bridge: heuristic}.
Lossiness: high (logographic).
7) Interop with MCLI (no duplication)
- PGM never redefines graphemes. It references MCLI class_id and selects a script entry + operator chain.
- Any new sound: add a PGM phoneme → map to existing MCLI class_id (or add a new phoneme class in MCLI, then point to it).
- Any new script habit (digraph vs diacritic): add/choose an orthography_profile; no class changes needed.
8) Testing harness (golden pairs)
Provide a tiny, ruthless suite per language:
tests:
- id: "HI_001"
in_phones: ["k","ɪ","t̪","aː","b"]
profile: "HI_DEVA.std"
expect_graphemes: "किताब"
roundtrip_ok: true
- id: "UR_001"
in_phonemes: "/z ɪ n d̪ ə ɡ iː/"
profile: "UR_ARAB.nastaliq"
expect_graphemes: "زندگی"
roundtrip_ok: false
reason: "abjad_vowels_absent"
- id: "HU_001"
in_phonemes: "/d͡ʒ/"
profile: "HU_LATN.atomic"
expect_graphemes: "Dzs"
roundtrip_ok: true
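A harness over these golden pairs only needs the forward mapping plus an optional round trip; a sketch in which the engine callables and the toy lookup table are assumptions:

```python
# Run golden-pair tests: the forward map must match exactly; the round
# trip is checked only when the pair claims roundtrip_ok.
def run_golden(tests, to_graphemes, to_phonemes):
    failures = []
    for t in tests:
        got = to_graphemes(t["in"], t["profile"])
        if got != t["expect_graphemes"]:
            failures.append((t["id"], "forward", got))
        elif t.get("roundtrip_ok") and to_phonemes(got, t["profile"]) != t["in"]:
            failures.append((t["id"], "roundtrip", got))
    return failures

# Toy engine over the Hungarian pair above
table = {("d͡ʒ", "HU_LATN.atomic"): "Dzs"}
fwd = lambda ph, prof: table[(ph, prof)]
bwd = lambda g, prof: {v: k[0] for k, v in table.items()}[g]
tests = [{"id": "HU_001", "in": "d͡ʒ", "profile": "HU_LATN.atomic",
          "expect_graphemes": "Dzs", "roundtrip_ok": True}]
```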
9) Operational knobs (for engines)
- pgm.profile=<name> – orthography & disambiguation bundle.
- pgm.force_diacritics=true|false – e.g., add Arabic ḥarakāt or Hebrew niqqud.
- pgm.etymology_weight=0..1 – steer English between phonemic and historical spellings.
- pgm.dialect=<id> – Braj/Haryanvi/Awadhi, etc.
- pgm.lossiness_report=true – surface token-level loss flags.
- pgm.audit_trace=true – emit stepwise decisions (great for stewardship and proofs).
10) Minting
- Phonemic–Graphemic Module (PGM v1.0) — MINTED
- Language profiles preloaded: EN, HU, HI, UR, AR, HE, SYR, SA, GEZ, JA, ZH (extensible).
- Bound to MCLI v1.0; no schema duplication; forward-compatible with your SDM/ILM/GLM/LoGM stacks.