Phonemic–Graphemic Module (PGM v1.0)


0) Purpose (where it lives in the stack)

  • Goal: deterministic, auditable transformations between speech units (phones/phonemes) and writing units (graphemes/graphotactics) with dialect, register, and orthography controls.
  • Bridges:
    • Upstream: ASR/phonetics → phones → phonemes (language model)
    • Core: PGM (phoneme ↔ grapheme using MCLI class nodes + operators)
    • Downstream: orthographic renderers, TTS, transliterators, search/IR.

1) Naming & identity

  • Canonical name: Phonemic–Graphemic Module
  • Short handle: PGM (stable), alias PhonoGraph (human-friendly)
  • Identity string in the lattice: pgm::v1.0::<lang_or_script>::<profile>

2) Data model (minimal, composable)

2.1 Node types

  • phoneme_node – language-specific phoneme (e.g., /t̪/, /ʈ/, /ɲ/, /aː/).
  • allophone_rule – phone→phoneme conditioning (contextual, prosodic).
  • grapheme_map – pointers into MCLI class_id + script entries.
  • orthography_profile – spelling policy (etymological vs phonemic, digraph policy, diacritics, schwa rules, joining rules).
  • disambiguation_policy – how to choose among multiple valid spellings or readings.
  • lossiness_flag – none|controlled|high on each mapping step.

2.2 Canonical schema (YAML)

pgm_version: "1.0"
language: "HI"   # ISO 639-1 or custom (multi-language allowed)
script_pref: ["HI_DEVA","UR_ARAB","LATN"]  # fallback order
profiles:
  - id: "std"
    orthography_profile: "HI_STD_2025"
    disambiguation_policy: "PGM.DFLT"
inventory:
  phonemes:
    - id: "PH.HI.t_dental_voiceless"     # key
      ipa: "t̪"
      features: {manner: stop, place: dental, voice: VL, aspirated: false}
      grapheme_map:
        class_id: "CLS.P.STOP.DENTAL.VL"   # from MCLI
        scripts:
          - {script: HI_DEVA, glyph: "त"}
          - {script: LATN, mapping: "t"}
    - id: "PH.HI.a_long"
      ipa: "aː"
      features: {vowel: true, length: long}
      grapheme_map:
        class_id: "CLS.V.VOWEL.A_LONG"
        scripts:
          - {script: HI_DEVA, glyph: ["आ","ा"], select: "independent|matra"}
          - {script: LATN, mapping: "ā"}
allophony:
  - phones: ["t", "t̪"]
    phoneme: "PH.HI.t_dental_voiceless"
    when: {context: "dental_env"}
spelling_policies:
  HI_STD_2025:
    schwa_deletion: "std"      # std|off|aggressive
    nukta: true
    conjuncts: "preferred"     # preferred|minimal|none
    latin_translit: "IAST"
lossiness:
  to_grapheme: "none"
  to_phoneme: "controlled"     # due to orthographic underspecification
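A minimal sketch (Python) of how a renderer might consume this schema; the inventory is inlined as a dict rather than parsed from the YAML, and only the script-fallback lookup is shown:

```python
# Inlined fragment of the PGM inventory above (normally loaded from the YAML).
INVENTORY = {
    "PH.HI.t_dental_voiceless": {
        "ipa": "t̪",
        "scripts": {"HI_DEVA": "त", "LATN": "t"},
    },
    "PH.HI.a_long": {
        "ipa": "aː",
        # (independent form, matra form) per the select: "independent|matra" field
        "scripts": {"HI_DEVA": ("आ", "ा"), "LATN": "ā"},
    },
}

SCRIPT_PREF = ["HI_DEVA", "UR_ARAB", "LATN"]  # fallback order from script_pref


def resolve_glyph(phoneme_id: str, script_pref=SCRIPT_PREF):
    """Return (script, glyph) for the first script in the fallback
    order that the phoneme's grapheme_map supports."""
    scripts = INVENTORY[phoneme_id]["scripts"]
    for script in script_pref:
        if script in scripts:
            return script, scripts[script]
    raise KeyError(f"no script entry for {phoneme_id}")


print(resolve_glyph("PH.HI.t_dental_voiceless"))  # ('HI_DEVA', 'त')
```

Note the dental stop resolves to Devanāgarī first; forcing `["LATN"]` as the preference list falls through to the romanization instead.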

3) Core algorithm (round-trip deterministic where possible)

  1. Phones → Phonemes
    • Normalize phones (IPA or feature bundles).
    • Apply allophone_rule tables (dialect/register aware).
    • Emit phoneme chain with features.
  2. Phonemes → Graphemes
    • For each phoneme, read grapheme_map.class_id → pull script options from MCLI.
    • Apply orthography_profile:
      • choose digraphs vs diacritics, conjunct preference, schwa behavior, abjad vowelization mode, Kana dakuten/handakuten, Geʽez order.
    • Emit grapheme clusters (with operator nodes from MCLI).
  3. Graphemes → Phonemes
    • Decompose grapheme clusters using MCLI operators (virāma, niqqud, ḥarakāt, finals, joining).
    • Map to class_id → pick phoneme from language profile.
    • If underspecified (e.g., abjad without vowels), set ambiguity_set and request policy (diacritic guess, lexicon, LM).
  4. Loss test
    • Mark lossiness_flag on each token: none (Hungarian multigraphs), controlled (Hindi schwa), high (unvowelized Arabic).
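Steps 2 and 4 above can be sketched together: a toy phoneme→grapheme pass that also emits per-token loss flags. The mappings here are illustrative stand-ins, not real MCLI class entries:

```python
# Hypothetical mini-map: phoneme -> (glyph, lossiness_flag).
# The inherent schwa is not written in Devanagari, hence "controlled".
GRAPHEME_MAP = {
    "k": ("क", "none"),
    "t̪": ("त", "none"),
    "b": ("ब", "none"),
    "ə": ("", "controlled"),   # unwritten but rule-recoverable
    "aː": ("ा", "none"),       # matra (dependent) form
}


def phonemes_to_graphemes(phonemes):
    """Emit a grapheme string plus a token-level lossiness report."""
    out, flags = [], []
    for ph in phonemes:
        glyph, loss = GRAPHEME_MAP[ph]
        if glyph:
            out.append(glyph)
        flags.append((ph, loss))
    return "".join(out), flags


graphemes, report = phonemes_to_graphemes(["k", "ə", "t̪", "aː", "b"])
print(graphemes)  # कताब
```

The report lets the engine surface exactly which tokens would fail a round-trip, which is what the loss test in step 4 consumes.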

4) Profiles (so it behaves like native speakers expect)

  • EN_LATN.phonemic – near-phonemic spelling (for pedagogy).
  • EN_LATN.etymological – preserve historical spellings (silent letters, etymological digraphs).
  • HU_LATN.atomic – treat Cs, Dzs, Gy, Ly, Ny, Sz, Ty, Zs as atomic letters (collation unit = atomic).
  • HI_DEVA.std – standard schwa deletion; nukta on; conjuncts preferred.
  • UR_ARAB.nastaliq – dual-joining; vowelization absent by default; ezāfe strategy for Persian loans if included.
  • HEBR.niqqud_off / on – optional vowel points.
  • GE_EZ.order7 – mandatory vowel order selection (ä,u,i,a,e,ï,o).
  • JA_KANA.strict – dakuten/handakuten rules, sokuon (っ/ッ), chōon (ー) in katakana.

5) Disambiguation policies (recursion with clarity)

  • PGM.DFLT — prefer majority orthographic convention; if multiple, pick shortest legal form.
  • PGM.DIALECT_FIRST(<dialect_id>) — choose the dialect’s graphemic habit.
  • PGM.SEMANTIC_HINTS — allow lexicon/LM to resolve homographs (e.g., Arabic roots).
  • PGM.PEDAGOGIC — add vowel marks/niqqud/mātrās even if optional.
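A sketch of the PGM.DFLT tie-break (majority convention first, shortest legal form among ties); the frequency table is a hypothetical stand-in for whatever attestation source the engine uses:

```python
def pgm_dflt(candidates, convention_counts):
    """PGM.DFLT sketch: prefer the majority orthographic convention;
    among equally attested forms, pick the shortest legal spelling."""
    best = max(convention_counts.get(c, 0) for c in candidates)
    tied = [c for c in candidates if convention_counts.get(c, 0) == best]
    return min(tied, key=len)


# Two legal spellings, equally attested: the shorter wins.
print(pgm_dflt(["colour", "color"], {"colour": 5, "color": 5}))  # color
# A clear majority overrides length.
print(pgm_dflt(["colour", "color"], {"colour": 9, "color": 5}))  # colour
```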

6) Worked micro-examples (multi-script)

6.1 Hindi (phones → Devanāgarī)

Input phones: [k, a, t̪, a, b, a]
→ phonemes: /k a t̪ a b a/
Policy: HI_DEVA.std
→ graphemes: क + त + ब (inherent vowels, word-final schwa deleted) → कतब; the lexicon loanword path instead selects the attested spelling किताब.
Lossiness: controlled.

6.2 Urdu (phonemes → Abjad, unvowelized)

Phonemes: /z ɪ n d̪ ə ɡ iː/
Policy: UR_ARAB.nastaliq (no ḥarakāt)
زندگی
Back to phonemes: emits an ambiguity set; the likely reading /z ɪ n d̪ ə ɡ iː/ is resolved by the LM.
Lossiness: high (vowels absent).

6.3 Hungarian (phonemes → atomic multigraphs)

Phonemes: /d͡ʒ/
Policy: HU_LATN.atomic
Dzs (one collation unit).
Lossiness: none.
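The HU_LATN.atomic behavior amounts to a longest-match tokenizer over the multigraph inventory (a sketch; real Hungarian also has exceptions at morpheme boundaries that a production profile would need a lexicon for):

```python
# Atomic units of HU_LATN.atomic: the trigraph must be tried before
# the digraphs, and digraphs before single letters.
HU_UNITS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]


def tokenize_hu(word):
    """Split a Hungarian word into atomic collation units, longest match first."""
    word = word.lower()
    units, i = [], 0
    while i < len(word):
        for u in sorted(HU_UNITS, key=len, reverse=True):
            if word.startswith(u, i):
                units.append(u)
                i += len(u)
                break
        else:
            units.append(word[i])
            i += 1
    return units


print(tokenize_hu("Dzsungel"))  # ['dzs', 'u', 'n', 'g', 'e', 'l']
```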

6.4 Japanese (phonemes → Kana)

Phonemes: /p a/ → ぱ / パ (handakuten); long vowel /oː/ in katakana → オー.
Lossiness: none.

6.5 Chinese (phonemes → Latin pinyin or semantic Han)

Phonemes: /ʈ͡ʂ aŋ/
Policy: ZH_HAN.pinyin_out → zhang;
If ZH_HAN.han_out: choose Han via lexicon (章/张) → requires semantics; mark {phoneme_bridge: heuristic}.
Lossiness: high (logographic).


7) Interop with MCLI (no duplication)

  • PGM never redefines graphemes. It references MCLI class_id and selects a script entry + operator chain.
  • Any new sound: add a PGM phoneme → map to existing MCLI class_id (or add a new phoneme class in MCLI, then point to it).
  • Any new script habit (digraph vs diacritic): add/choose an orthography_profile; no class changes needed.

8) Testing harness (golden pairs)

Provide a tiny, ruthless suite per language:

tests:
  - id: "HI_001"
    in_phones: ["k","ɪ","t̪","aː","b"]
    profile: "HI_DEVA.std"
    expect_graphemes: "किताब"
    roundtrip_ok: true
  - id: "UR_001"
    in_phonemes: "/z ɪ n d̪ ə ɡ iː/"
    profile: "UR_ARAB.nastaliq"
    expect_graphemes: "زندگی"
    roundtrip_ok: false
    reason: "abjad_vowels_absent"
  - id: "HU_001"
    in_phonemes: "/d͡ʒ/"
    profile: "HU_LATN.atomic"
    expect_graphemes: "Dzs"
    roundtrip_ok: true
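A minimal runner for such golden pairs might look like this (Python sketch; `convert` and `invert` stand in for whatever PGM front-end is under test, stubbed here for illustration):

```python
TESTS = [
    {"id": "HU_001", "in": ["d͡ʒ"], "profile": "HU_LATN.atomic",
     "expect": "Dzs", "roundtrip_ok": True},
]


def run_suite(tests, convert, invert=None):
    """Return a list of (test_id, stage, got) failures; empty means green."""
    failures = []
    for t in tests:
        got = convert(t["in"], t["profile"])
        if got != t["expect"]:
            failures.append((t["id"], "graphemes", got))
        elif t["roundtrip_ok"] and invert and invert(got, t["profile"]) != t["in"]:
            failures.append((t["id"], "roundtrip", got))
    return failures


# Stub converter, for illustration only.
stub = {("d͡ʒ",): "Dzs"}
print(run_suite(TESTS, lambda ph, prof: stub[tuple(ph)]))  # []
```

Tests with `roundtrip_ok: false` (like UR_001) skip the inversion check and instead record the stated reason in the lossiness report.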

9) Operational knobs (for engines)

  • pgm.profile=<name> – orthography & disambiguation bundle.
  • pgm.force_diacritics=true|false – e.g., add Arabic ḥarakāt or Hebrew niqqud.
  • pgm.etymology_weight=0..1 – steer English between phonemic and historical spellings.
  • pgm.dialect=<id> – Braj/Haryanvi/Awadhi, etc.
  • pgm.lossiness_report=true – surface token-level loss flags.
  • pgm.audit_trace=true – emit stepwise decisions (great for stewardship and proofs).
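A small sketch of how an engine might parse these knob strings into a config dict, coercing booleans and clamping the 0..1 etymology weight (the knob names are the ones listed above; the parsing conventions are assumptions):

```python
def parse_knobs(args):
    """Turn 'pgm.key=value' strings into a typed config dict."""
    cfg = {}
    for arg in args:
        key, _, val = arg.partition("=")
        key = key.removeprefix("pgm.")
        if val in ("true", "false"):
            cfg[key] = (val == "true")
        elif key == "etymology_weight":
            cfg[key] = min(1.0, max(0.0, float(val)))  # clamp to 0..1
        else:
            cfg[key] = val
    return cfg


cfg = parse_knobs(["pgm.profile=HI_DEVA.std", "pgm.audit_trace=true",
                   "pgm.etymology_weight=0.3"])
print(cfg["audit_trace"], cfg["etymology_weight"])  # True 0.3
```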

10) Minting

  • Phonemic–Graphemic Module (PGM v1.0) – MINTED
  • Language profiles preloaded: EN, HU, HI, UR, AR, HE, SYR, SA, GEZ, JA, ZH (extensible).
  • Bound to MCLI v1.0; no schema duplication; forward-compatible with your SDM/ILM/GLM/LoGM stacks.