PGM-02 — Bengali / Bangla (BN_BENG)


MINTED

Purpose

Deterministic, auditable mapping between phones/phonemes and Bengali script (Eastern Nagari), with clean round-trips to Latin transliteration. Tied to the MCLI so cross-script projection (e.g., Latin/Devanagari) is consistent.

Identity

pgm::v1.0::BN_BENG::<profile>
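
A minimal sketch of parsing such an identity string, assuming the four profile ids listed under Orthography Profiles. The names `parse_pgm_id` and `PgmId` are hypothetical and not part of the spec.

```python
import re
from typing import NamedTuple

# Profile ids taken from the Orthography Profiles section.
PROFILES = {"std", "phonemic", "iso15919", "bgn"}

class PgmId(NamedTuple):
    version: str
    script: str
    profile: str

# Assumed shape: pgm::v<major.minor>::<SCRIPT_TAG>::<profile>
_ID_RE = re.compile(r"pgm::v(\d+\.\d+)::([A-Z]+_[A-Z0-9]+)::([a-z0-9]+)")

def parse_pgm_id(text: str) -> PgmId:
    """Split an identity string into its parts, rejecting unknown profiles."""
    m = _ID_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"malformed PGM identity: {text!r}")
    version, script, profile = m.groups()
    if profile not in PROFILES:
        raise ValueError(f"unknown profile: {profile!r}")
    return PgmId(version, script, profile)
```

For example, `parse_pgm_id("pgm::v1.0::BN_BENG::std")` yields version `1.0`, script `BN_BENG`, profile `std`.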

Orthography Profiles

  • BN_BENG.std — Modern standard Bangla (Kolkata/Dhaka compromise), inherent vowel /ɔ/ (“ô”) behavior typical; conjuncts allowed; nukta letters where used.
  • BN_BENG.phonemic — Closer to surface phonology (neutralize historical spellings where safe; optional nasalization marks).
  • BN_LATN.iso15919 — ISO 15919 transliteration (diacritics).
  • BN_LATN.bgn — BGN/PCGN-style Latin (diacritics minimized; pedagogy-friendly).

Lossiness

  • Bengali script: controlled (inherent vowel, optional nasalization, sandhi, conjunct suppression).
  • Latin (ISO 15919): none.
  • Latin (BGN): controlled (diacritics dropped).

Script Mechanics (Eastern Nagari essentials)

  • Inherent vowel: অ “a/ô” ≈ /ɔ/ after bare consonants unless suppressed.
  • Dependent vowel signs (mātrā): ি i, ী ī, া ā, ু u, ূ ū, ে e, ৈ ai, ো o, ৌ au, ৃ r̥, ৄ r̥̄ (rare).
  • Virāma / Hasanta: ◌্ (kills inherent vowel; builds conjuncts).
  • Anusvāra: ং (ṃ) — homorganic nasal or vowel nasalization (context/policy).
  • Candrabindu: ঁ (˜) — explicit vowel nasalization (optional in modern print).
  • Visarga: ঃ (ḥ) — rare/learned/Sanskritic.
  • Khanda ta: ৎ (t̪̚) — final unreleased dental stop (special orthographic case).
  • Nukta letters: ড় (ṛ), ঢ় (ṛh), য় (ẏ) for the retroflex flaps and the glide.
  • Phala forms: য-ফলা (y-phala, ্য) and র-ফলা (r-phala, ্র) write post-consonantal /j/ and /r/ in clusters.
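
The roles above can be read off Unicode code points. The following is a rough sketch, not the engine's classifier: the role names and the `role`/`analyze` helpers are hypothetical, and the ranges are taken from the Bengali Unicode block (U+0980 to U+09FF).

```python
# Hypothetical role table for Bengali code points; role names are illustrative.
MARKS = {
    "\u09CD": "hasanta",      # virama
    "\u0982": "anusvara",
    "\u0981": "candrabindu",
    "\u0983": "visarga",
    "\u09BC": "nukta",        # combining dot, seen when nukta letters are decomposed
}
NUKTA_LETTERS = {"\u09DC", "\u09DD", "\u09DF"}  # precomposed rra, rha, yya
KHANDA_TA = "\u09CE"

def role(ch: str) -> str:
    """Classify one code point by its function in the script."""
    cp = ord(ch)
    if ch in MARKS:
        return MARKS[ch]
    if ch == KHANDA_TA:
        return "khanda_ta"
    if ch in NUKTA_LETTERS:
        return "nukta_letter"
    if 0x0985 <= cp <= 0x0994:
        return "independent_vowel"
    if 0x0995 <= cp <= 0x09B9:
        return "consonant"
    if 0x09BE <= cp <= 0x09CC:
        return "matra"
    return "other"

def analyze(word: str) -> list[str]:
    """Return the role of every code point in the word, in order."""
    return [role(ch) for ch in word]
```

Applied to শিক্ষা this gives consonant, matra, consonant, hasanta, consonant, matra, which makes the ক্ষ conjunct visible as a hasanta between two consonants.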

Phoneme Inventory (core; pointers to MCLI class_ids)

Vowels (phonemic targets):
/a ~ ɔ/, /o/, /e/, /i iː/, /u uː/, (loan/learned) /æ, ə, ɯ/; vocalic /r̩/ (ঋ), long /r̩ː/ (ৠ, rare).

Consonants (selected):
Stops: /p pʰ b b̤/, /t̪ t̪ʰ d̪ d̪̤/, /ʈ ʈʰ ɖ ɖ̤/, /k kʰ ɡ ɡ̤/
Affricates: /t͡ʃ t͡ʃʰ d͡ʒ d͡ʒ̤/
Fricatives: /ʃ ʂ s h/ (orthography favors শ ষ স distinctions)
Nasals: /m n ŋ ɲ/
Liquids/Glides: /l r j w/
(“̤” denotes breathy-voice legacy contrasts; modern Bangla tends toward aspiration/voicing contrasts; map via profile.)


Key Mapping Logic

  1. Phones → Phonemes: normalize aspiration/voicing; handle alveolar~dental and retroflex contrasts per profile.
  2. Phonemes → Graphemes:
    • If vowel-initial syllable → independent vowel letter (অ আ ই …).
    • Else use consonant + mātrā or hasanta conjuncts.
    • Apply r-phala (্র) and y-phala (্য) per cluster rules.
    • Optional candrabindu for nasalized vowels (policy-driven).
  3. Inherent vowel policy (inherent=auto|strict_off|strict_on):
    • auto (std): realize /ɔ/ after bare consonants unless: word-final C#, before virāma conjunct, or in prescribed lexical exceptions.
    • strict_off (phonemic): always write a vowel if present (per BN_PHONEMIC_2025).
  4. Latin output:
    • iso15919 preserves all contrasts (e/ē, o/ō, ṛ, ṃ, ḥ).
    • bgn uses plain letters (sh/ch/jh/ph/kh; no macrons).
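
Step 2 and the auto inherent-vowel policy can be sketched as a small table-driven converter. This is an illustration under stated assumptions, not the engine: the tables cover only a handful of phonemes, `to_bengali` is a hypothetical name, /ŋ/ after a vowel is written as anusvāra, and a word-final consonant is left bare, so a final /ɔ/ cannot be told apart from a final consonant (the controlled lossiness noted earlier).

```python
HASANTA = "\u09CD"
ANUSVARA = "\u0982"

# Tiny illustrative subset; a real inventory would come from the MCLI classes.
CONSONANTS = {
    "k": "\u0995", "b": "\u09AC", "l": "\u09B2", "m": "\u09AE", "s": "\u09B8",
}
# phoneme -> (independent letter, dependent sign); /ɔ/ is inherent, so no sign.
VOWELS = {
    "ɔ": ("\u0985", ""),
    "a": ("\u0986", "\u09BE"),
    "i": ("\u0987", "\u09BF"),
    "u": ("\u0989", "\u09C1"),
}

def to_bengali(phonemes: list[str]) -> str:
    """Map a phoneme sequence to Bengali script under inherent=auto."""
    out = []
    prev = "start"  # one of: start, C (consonant), V (vowel), M (mark)
    for ph in phonemes:
        if ph in VOWELS:
            independent, sign = VOWELS[ph]
            # After a consonant use the matra; otherwise the independent letter.
            out.append(sign if prev == "C" else independent)
            prev = "V"
        elif ph == "ŋ" and prev == "V":
            out.append(ANUSVARA)
            prev = "M"
        elif ph in CONSONANTS:
            if prev == "C":
                out.append(HASANTA)  # adjacent consonants form a conjunct
            out.append(CONSONANTS[ph])
            prev = "C"
        else:
            raise KeyError(f"phoneme not covered by this sketch: {ph!r}")
    return "".join(out)
```

With these tables, `to_bengali(["b", "a", "ŋ", "l", "a"])` produces বাংলা and `to_bengali(["s", "k", "u", "l"])` produces স্কুল.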

Edge Policies & Disambiguation

  • Homorganic nasal: ং before stops → map nasal place to following stop in phoneme layer.
  • Candrabindu vs anusvāra: prefer anusvāra; promote to candrabindu in phonemic profile when vowel nasalization is phonemic.
  • Sibilants and retroflexes: preserve orthographic শ/ষ/স; phoneme layer collapses them to /ʃ s/ where appropriate; keep retroflex stops /ʈ ɖ/ distinct from dentals.
  • Khanda ta (ৎ): emit on final_t_dental_unreleased=true.
  • Loan alignment: English loans may keep their consonant clusters (স্কুল → /skul/ “school”); a profile can insert epenthetic ə to break the cluster if etymology_weight<0.5.
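
The homorganic-nasal policy can be sketched as a lookup on the following stop. The placeholder token `"N"` for an unresolved anusvāra, the function name, the fallback to /ŋ/ when no stop follows, and the mapping of retroflex stops to /n/ (the inventory lists no retroflex nasal) are all assumptions of this sketch.

```python
# ʰ and the combining breathy-voice mark are ignored when finding the place.
ASPIRATION_MARKS = ("\u02B0", "\u0324")

# Base stop or affricate -> nasal at the same place, using the inventory /m n ŋ ɲ/.
NASAL_FOR_STOP = {
    "p": "m", "b": "m",
    "t\u032A": "n", "d\u032A": "n",                          # dentals
    "\u0288": "n", "\u0256": "n",                            # retroflexes (assumed /n/)
    "t\u0361\u0283": "\u0272", "d\u0361\u0292": "\u0272",    # affricates -> palatal nasal
    "k": "\u014B", "g": "\u014B", "\u0261": "\u014B",        # velars -> velar nasal
}

def resolve_anusvara(tokens: list[str], placeholder: str = "N",
                     default: str = "\u014B") -> list[str]:
    """Replace each placeholder nasal with the nasal matching the next stop."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != placeholder:
            out.append(tok)
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        for mark in ASPIRATION_MARKS:
            nxt = nxt.replace(mark, "")
        out.append(NASAL_FOR_STOP.get(nxt, default))
    return out
```

Resolving in the phoneme layer keeps the grapheme layer free to choose between anusvāra and a nasal conjunct per profile.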

YAML Skeleton (engine spec)

pgm_version: "1.0"
language: "BN"
script_pref: ["BN_BENG", "LATN_ISO15919", "LATN_BGN"]
profiles:
  - id: "std"
    orthography_profile: "BN_STD_2025"
    disambiguation_policy: "PGM.DFLT"
  - id: "phonemic"
    orthography_profile: "BN_PHONEMIC_2025"
    disambiguation_policy: "PGM.PEDAGOGIC"
  - id: "iso15919"
    orthography_profile: "BN_LATN_ISO_2025"
    disambiguation_policy: "PGM.DFLT"
  - id: "bgn"
    orthography_profile: "BN_LATN_BGN_2025"
    disambiguation_policy: "PGM.DFLT"

inventory:
  phonemes:
    - id: "PH.BN.t_dental_VL"
      ipa: "t̪"
      features: {place: dental, manner: stop, voice: VL}
      grapheme_map:
        class_id: "CLS.BN.C.T_DENTAL"
        scripts:
          - {script: BN_BENG, glyph: "ত"}
          - {script: LATN_ISO15919, mapping: "t"}
          - {script: LATN_BGN, mapping: "t"}
    - id: "PH.BN.ɔ"
      ipa: "ɔ"
      features: {vowel: true}
      grapheme_map:
        class_id: "CLS.BN.V.O_OPEN"
        scripts:
          - {script: BN_BENG, glyph: "অ", matra: null, inherent: true}
          - {script: LATN_ISO15919, mapping: "a"}   # ISO 15919 writes inherent অ as "a"; "ô" is only a pronunciation gloss
          - {script: LATN_BGN, mapping: "o"}
  diacritics:
    - {name: "anusvara", glyph: "ং", function: "nasal_place_assim"}
    - {name: "candrabindu", glyph: "ঁ", function: "vowel_nasalization"}
    - {name: "visarga", glyph: "ঃ", function: "voiceless_breath"}
operators:
  - {name: "hasanta", glyph: "্", fn: "suppress_inherent; build_conjunct"}
spelling_policies:
  BN_STD_2025:
    inherent: "auto"
    conjuncts: "preferred"
    nasalization: "anusvara_default"
    retroflex_preserve: true
    khanda_t_final: "auto"
  BN_PHONEMIC_2025:
    inherent: "strict_off"      # always write a vowel if present
    conjuncts: "minimal"
    nasalization: "candrabindu_on_phonemic"
  BN_LATN_ISO_2025:
    scheme: "ISO15919"
  BN_LATN_BGN_2025:
    scheme: "BGN"
lossiness:
  to_grapheme: "controlled"
  to_latin_iso: "none"
  to_latin_bgn: "controlled"
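
One consistency check the skeleton invites is that every profile's orthography_profile resolves to an entry under spelling_policies. A small sketch follows, with the relevant parts of the skeleton mirrored as a Python dict so that no YAML library is assumed; `unresolved_profiles` is a hypothetical helper name.

```python
# Relevant parts of the skeleton, mirrored by hand for illustration.
SPEC = {
    "profiles": [
        {"id": "std", "orthography_profile": "BN_STD_2025"},
        {"id": "phonemic", "orthography_profile": "BN_PHONEMIC_2025"},
        {"id": "iso15919", "orthography_profile": "BN_LATN_ISO_2025"},
        {"id": "bgn", "orthography_profile": "BN_LATN_BGN_2025"},
    ],
    "spelling_policies": {
        "BN_STD_2025": {"inherent": "auto", "conjuncts": "preferred"},
        "BN_PHONEMIC_2025": {"inherent": "strict_off", "conjuncts": "minimal"},
        "BN_LATN_ISO_2025": {"scheme": "ISO15919"},
        "BN_LATN_BGN_2025": {"scheme": "BGN"},
    },
}

def unresolved_profiles(spec: dict) -> list[str]:
    """Return ids of profiles whose orthography_profile has no policy entry."""
    policies = spec["spelling_policies"]
    return [
        p["id"] for p in spec["profiles"]
        if p["orthography_profile"] not in policies
    ]
```

For the skeleton as written the result is an empty list; removing a policy block makes the corresponding profile id show up.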

Unit Test Fixtures

tests:
  - id: "BN_001_word_bangla"
    in_phonemes: "/b a ŋ l a/"
    profile: "std"
    expect_graphemes: "বাংলা"
    roundtrip_ok: true

  - id: "BN_002_inherent_off"
    in_phonemes: "/b a ŋ g l a d e ʃ/"
    profile: "phonemic"
    expect_graphemes: "বাংলাদেশ"   # conventional spelling; a doubled vowel sign (াা) is not valid output
    notes: "phonemic profile writes vowels explicitly; here the result coincides with the conventional form"
    roundtrip_ok: true

  - id: "BN_003_iso_out"
    in_graphemes: "শিক্ষা"
    profile: "iso15919"
    expect: "śikṣā"

  - id: "BN_004_bgn_out"
    in_graphemes: "বাংলা"
    profile: "bgn"
    expect: "Bangla"

  - id: "BN_005_nasalization"
    in_phonemes: "/d͡ʒ õ n/"
    profile: "phonemic"
    expect_graphemes: "জোঁন"    # example shows candrabindu on vowel; exact lexeme may vary
    roundtrip_ok: true

  - id: "BN_006_khanda_t"
    in_phonemes: "/b ɔ t̪̚/"    # inherent vowel + final unreleased dental stop
    profile: "std"
    expect_graphemes: "বৎ"

Worked Micro-Examples

  • /ʃ i k ʃ aː/ → শিক্ষা → iso15919: śikṣā, bgn: shiksha.
  • /b a ŋ l a/ → বাংলা → iso15919: bāṃlā, bgn: Bangla.
  • /r̩/ (ঋ) in learned forms → ঋষি → iso15919: r̥ṣi, bgn: rishi (ṛ is reserved for ড়).
  • Nasal before /k ɡ/ → লঙ্কা (lôṅkā) “Lanka”: the nasal is homorganic /ŋ/, spelled here with the ঙ্ক conjunct rather than anusvāra (ং).
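
The relation between the two Latin outputs in these examples can be sketched as a one-way reduction from ISO-style text to BGN-style text. This is an assumption-laden illustration: the digraph table covers only the letters used in the examples, `iso_to_bgn` is a hypothetical name, and casing is left untouched.

```python
import unicodedata

# Multi-code-point sources handled first: r + combining ring below (vocalic r).
MULTI = {"r\u0325": "ri"}
# Single precomposed letters that become digraphs (assumed subset).
SINGLE = {
    "\u015B": "sh",  # s with acute
    "\u1E63": "sh",  # s with dot below
    "\u1E43": "ng",  # m with dot below (anusvara)
    "\u1E45": "ng",  # n with dot above (velar nasal)
}

def iso_to_bgn(text: str) -> str:
    """Reduce ISO-style transliteration to plain letters; information is lost."""
    text = unicodedata.normalize("NFC", text)
    for src, dst in MULTI.items():
        text = text.replace(src, dst)
    text = "".join(SINGLE.get(ch, ch) for ch in text)
    # Drop any remaining diacritics (macrons, dots) by removing combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Because ś and ṣ both reduce to "sh" and vowel length is dropped, the reduction cannot be inverted, which is why the BGN output is rated controlled rather than lossless.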

Operational Knobs

  • pgm.profile=std|phonemic|iso15919|bgn
  • pgm.inherent=auto|strict_off|strict_on
  • pgm.conjuncts=preferred|minimal|none
  • pgm.nasalization=candrabindu|anusvara|auto
  • pgm.dialect=dhaka|kolkata (fine-tunes /e~æ/, /o~ɔ/ realizations in Latin output)
  • pgm.lossiness_report=true, pgm.audit_trace=true
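
A minimal sketch of validating these knobs when they arrive as key=value strings. The allowed values are copied from the list above; the `parse_knobs` name, the string-based input format, and accepting "false" for the two boolean knobs are assumptions.

```python
# Allowed values per knob, as listed under Operational Knobs.
ALLOWED = {
    "pgm.profile": {"std", "phonemic", "iso15919", "bgn"},
    "pgm.inherent": {"auto", "strict_off", "strict_on"},
    "pgm.conjuncts": {"preferred", "minimal", "none"},
    "pgm.nasalization": {"candrabindu", "anusvara", "auto"},
    "pgm.dialect": {"dhaka", "kolkata"},
    "pgm.lossiness_report": {"true", "false"},
    "pgm.audit_trace": {"true", "false"},
}

def parse_knobs(assignments: list[str]) -> dict[str, str]:
    """Parse 'key=value' strings, rejecting unknown keys and values."""
    knobs = {}
    for item in assignments:
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        if key not in ALLOWED:
            raise ValueError(f"unknown knob: {key!r}")
        if value not in ALLOWED[key]:
            raise ValueError(f"invalid value for {key}: {value!r}")
        knobs[key] = value
    return knobs
```

Rejecting unknown values early keeps a mistyped knob from silently falling back to a default.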

PGM-02 (Bengali) is MINTED and linked to MCLI classes.