Sanskrit Graphemic Module (SGM v1.0)


(aka: Literal–Graphemic Module — Sanskrit / Devanāgarī)

0) Orientation

  • Script type: Abugida (consonant+inherent vowel /a/; dependent vowel signs = mātrās)
  • Direction: Left-to-right; conjunct (ligature) shaping for clusters
  • Phonology target: Classical Sanskrit (IAST transliteration)
  • Numbers: ० १ २ ३ ४ ५ ६ ७ ८ ९
  • Core signs: Anusvāra ◌ं (ṃ), Visarga ◌ः (ḥ), Candrabindu ◌ँ (̃), Virāma ◌् (halant), Avagraha ऽ (’)

1) Vowels (Independent Letters & Dependent Signs)

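IAST | Independent | Mātrā (dependent) | IPA | Notes
a | अ | (inherent) | /ə ~ a/ | inherent in every consonant
ā | आ | ◌ा | /aː/ |
i | इ | ◌ि | /i/ | pre-base mātrā (renders before the consonant)
ī | ई | ◌ी | /iː/ |
u | उ | ◌ु | /u/ | subjoined mātrā
ū | ऊ | ◌ू | /uː/ |
ṛ | ऋ | ◌ृ | /r̩/ | vocalic r
ṝ | ॠ | ◌ॄ | /r̩ː/ | rare, classical
ḷ | ऌ | ◌ॢ | /l̩/ | rare
ḹ | ॡ | ◌ॣ | /l̩ː/ | very rare
e | ए | ◌े | /eː/ | historically long
ai | ऐ | ◌ै | /ai̯/ |
o | ओ | ◌ो | /oː/ | historically long
au | औ | ◌ौ | /au̯/ |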

Lattice flags: {vowel: true, length: short|long, syllabic: r|l, diphthong: true|false, position: independent|matra}
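
The flags above can be derived mechanically from the IAST label. The following is a minimal sketch, not part of the module itself: `vowel_flags` is a hypothetical helper, and treating e, o, ai and au all as `length: long` is an assumption based on the "historically long" note in the table.

```python
import unicodedata

def _nfc(s):
    # Normalize so precomposed and decomposed IAST letters compare equal.
    return unicodedata.normalize("NFC", s)

# IAST vowel inventory in traditional order (from the table above).
VOWELS = [_nfc(v) for v in "a ā i ī u ū ṛ ṝ ḷ ḹ e ai o au".split()]
# Assumption: e, o, ai, au are all flagged long.
LONG = {_nfc(v) for v in "ā ī ū ṝ ḹ e ai o au".split()}
DIPHTHONGS = {"ai", "au"}
SYLLABIC = {_nfc("ṛ"): "r", _nfc("ṝ"): "r", _nfc("ḷ"): "l", _nfc("ḹ"): "l"}

def vowel_flags(iast, position="independent"):
    """Return the lattice flags for one IAST vowel (hypothetical helper)."""
    v = _nfc(iast)
    if v not in VOWELS:
        raise ValueError(f"not a Sanskrit vowel: {iast!r}")
    if position not in ("independent", "matra"):
        raise ValueError(f"unknown position: {position!r}")
    flags = {
        "vowel": True,
        "length": "long" if v in LONG else "short",
        "diphthong": v in DIPHTHONGS,
        "position": position,
    }
    if v in SYLLABIC:
        # The syllabic flag is only present for vocalic r and l.
        flags["syllabic"] = SYLLABIC[v]
    return flags
```

For example, `vowel_flags("ṝ", "matra")` yields a long, non-diphthong vowel with `syllabic: "r"` and `position: "matra"`.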


2) Consonants (by Varga / place & phonation)

Velars (ka-varga): क ka /k/, ख kha /kʰ/, ग ga /ɡ/, घ gha /ɡʱ/, ङ ṅa /ŋ/
Palatals (ca-varga): च ca /t͡ɕ ~ t͡ʃ/, छ cha /t͡ɕʰ ~ t͡ʃʰ/, ज ja /d͡ʑ ~ d͡ʒ/, झ jha /d͡ʑʱ ~ d͡ʒʱ/, ञ ña /ɲ/
Retroflex (ṭa-varga): ट ṭa /ʈ/, ठ ṭha /ʈʰ/, ड ḍa /ɖ/, ढ ḍha /ɖʱ/, ण ṇa /ɳ/
Dentals (ta-varga): त ta /t̪/, थ tha /t̪ʰ/, द da /d̪/, ध dha /d̪ʱ/, न na /n/
Labials (pa-varga): प pa /p/, फ pha /pʰ/, ब ba /b/, भ bha /bʱ/, म ma /m/
Semivowels: य ya /j/, र ra /r/ (tap), ल la /l/, व va /ʋ ~ v/
Sibilants: श śa /ɕ ~ ʃ/, ष ṣa /ʂ/, स sa /s/
Glottal: ह ha /ɦ/

Phonation flags: {aspirated: true|false, voiced: true|false, retroflex: true|false, nasal: true|false}
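
Because the 25 stops form a regular 5 x 5 grid (five vargas, each ordered voiceless, voiceless aspirated, voiced, voiced aspirated, nasal), the phonation flags can be read off a letter's grid position. A minimal sketch under that assumption; `stop_flags` and the extra `place` key are illustrative names, not part of the module.

```python
# The 25 varga stops in traditional order, five per place of articulation.
STOPS = "कखगघङचछजझञटठडढणतथदधनपफबभम"
PLACES = ["velar", "palatal", "retroflex", "dental", "labial"]

def stop_flags(ch):
    """Derive phonation flags for one varga stop from its grid position."""
    if len(ch) != 1 or ch not in STOPS:
        raise ValueError(f"not a varga stop: {ch!r}")
    row, col = divmod(STOPS.index(ch), 5)
    return {
        "place": PLACES[row],
        "aspirated": col in (1, 3),   # columns 2 and 4 of each varga
        "voiced": col >= 2,           # columns 3, 4 and the nasal
        "nasal": col == 4,
        "retroflex": PLACES[row] == "retroflex",
    }
```

Semivowels, sibilants and ह fall outside the grid and would need an explicit table.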


3) Core Diacritics & Operators

  • Virāma (◌्): cancels the inherent a → forms conjuncts (e.g., क् + ष → क्ष)
  • Anusvāra (◌ं): nasal written ṃ in IAST; homorganic with a following stop, sandhi-sensitive
  • Visarga (◌ः): voiceless post-vocalic aspiration, written ḥ in IAST, IPA /h/
  • Candrabindu (◌ँ): vowel nasalization /◌̃/
  • Avagraha (ऽ): shows elision/aphaeresis in sandhi (e.g., ’stu for astu)

4) Conjuncts (Ligature Logic)

  • Rule: C + VIRĀMA + C (+ VIRĀMA + C …) → conjunct cluster with script-specific ligature or stacked form.
  • Examples:
    • क्ष = क् + ष → kṣa
    • ज्ञ = ज् + ञ → jña
    • त्र = त् + र → tra
    • श्र = श् + र → śra

Lattice: {cluster: [C1,C2,...], conjunct:true} with a canonical Latin chain (IAST) and grapheme ID.
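
The C + VIRĀMA + C rule can be sketched as a single pass over the codepoints. This is an illustrative sketch only: `split_clusters` is a hypothetical name, and the consonant test uses the Unicode range U+0915 to U+0939, which also contains a few letters outside the Sanskrit inventory.

```python
VIRAMA = "\u094D"

def is_consonant(ch):
    # Assumption: the Devanagari consonant range U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"

def split_clusters(text):
    """Group consonants joined by virama into clusters.

    Every other codepoint (matras, signs, vowels) is passed through
    as a single {"sign": ch} unit.
    """
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        if is_consonant(ch):
            cluster = [ch]
            # Extend while the pattern VIRAMA + consonant continues.
            while (i + 2 < len(text) and text[i + 1] == VIRAMA
                   and is_consonant(text[i + 2])):
                cluster.append(text[i + 2])
                i += 2
            units.append({"cluster": cluster, "conjunct": len(cluster) > 1})
        else:
            units.append({"sign": ch})
        i += 1
    return units
```

For क्षेत्र this gives the conjunct [क, ष], the sign े, and the conjunct [त, र]. A word-final virāma is not followed by a consonant, so it stays a separate sign unit.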


5) Sandhi Aware Layer (minimal operational set)

  • Visarga-sandhi: changes conditioned by the following sound (e.g., -aḥ before a- becomes -o and the a- is elided: namaḥ + astu → namo ’stu)
  • Anusvāra-sandhi: nasal place assimilation to following stop (ṃ → [ŋ/ɲ/ɳ/n/m]).
  • Vowel-sandhi: coalescence into e/ai/o/au (classical Pāṇinian set), e.g., a/ā + i/ī → e, a/ā + u/ū → o, a/ā + e/ai → ai, a/ā + o/au → au.

Toggle: {sandhi_mode: classical|disabled}; when disabled, spell strictly by orthography (no phonological rewrite).


6) Latin-Chain (IAST) Mapping — Exemplars

# Independent vowel
glyph: "ऋ"
name: "vocalic ṛ"
latin_chain: ["ṛ"]
ipa: "r̩"
features: {vowel: true, syllabic: "r", position: "independent"}

# Dependent vowel sign (matra)
glyph: "◌ि"
name: "i-mātrā"
latin_chain: ["i"]
ipa: "i"
features: {vowel: true, position: "matra", prebase: true}

# Consonant with features
glyph: "ढ"
name: "ḍha"
latin_chain: ["ḍh"]
ipa: "ɖʱ"
features: {retroflex: true, voiced: true, aspirated: true}

# Conjunct example
glyph: "क्ष"
name: "kṣa"
latin_chain: ["k","ṣ","a"]
ipa: "kʂɐ"
features: {conjunct: true, cluster: ["क्","ष"]}
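
One possible in-memory shape for the exemplar records above is a small dataclass that serializes to JSON. The class name `GraphemeRecord` and its method are assumptions made for illustration; the field names follow the exemplars.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class GraphemeRecord:
    glyph: str
    name: str
    latin_chain: list
    ipa: str
    features: dict = field(default_factory=dict)

    def to_json(self):
        # ensure_ascii=False keeps Devanagari and IAST readable in the output.
        return json.dumps(asdict(self), ensure_ascii=False, sort_keys=True)

# The "Consonant with features" exemplar as a record.
DHA = GraphemeRecord(
    glyph="ढ",
    name="ḍha",
    latin_chain=["ḍh"],
    ipa="ɖʱ",
    features={"retroflex": True, "voiced": True, "aspirated": True},
)
```

A record survives a round trip: `GraphemeRecord(**json.loads(DHA.to_json()))` compares equal to `DHA`.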

7) Orthographic Rules (Sanskrit-specific vs Hindi)

  • Inherent vowel /a/ is generally pronounced in Sanskrit unless cancelled by virāma or altered by sandhi; unlike Hindi, Sanskrit has no systematic schwa deletion.
  • Long vowels (ā ī ū e o) are phonemic; e and o are historically long—treat as {length: long}.
  • Vocalic r/l behave as syllabic nuclei; their long counterparts appear in classical texts but are rare in modern usage.
  • Diacritic placement: ◌ि renders before the base consonant; ◌ु below; ◌े/◌ै/◌ो/◌ौ above/right.

8) Example Decompositions

  • धर्मः (dharmaḥ) → ध /d̪ʱ/ + अ /a/ + र्म (र् + म) /rm/ + अ /a/ + ◌ः /h/ → /d̪ʱarmah/
  • योग (yoga) → य /j/ + ◌ो /oː/ + ग /ɡ/ + अ /a/ → /joːɡa/
  • विद्या (vidyā) → वि (व + ◌ि) /ʋi/ + द्य (द् + य) /d̪j/ + ◌ा /aː/ → /ʋid̪jaː/
  • क्षेत्र (kṣetra) → क्ष (क् + ष) /kʂ/ + ◌े /eː/ + त्र (त् + र) /t̪r/ + अ /a/ → /kʂeːt̪ra/
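
To check a decomposition against what is actually stored, it helps to list the codepoints in logical order. The sketch below uses only Python's standard `unicodedata` module; `codepoint_breakdown` is a hypothetical helper name.

```python
import unicodedata

def codepoint_breakdown(word):
    """List (codepoint, short Unicode name) pairs in stored order."""
    rows = []
    for ch in word:
        name = unicodedata.name(ch, "UNKNOWN").replace("DEVANAGARI ", "")
        rows.append((f"U+{ord(ch):04X}", name))
    return rows

for row in codepoint_breakdown("विद्या"):
    print(row)
```

The listing shows that the inherent /a/ never appears as a codepoint, and that ◌ि is stored after its consonant even though it renders before it.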

9) Lattice Integration Features (ELM/LLM-ready)

  • {direction: LTR, type: abugida}
  • {consonant: true|false, vowel: true|false}
  • {inherent_a: true|false} (false when a virāma ends the cluster or a mātrā overrides the inherent vowel)
  • {matra: none|a|ā|i|ī|u|ū|ṛ|ṝ|ḷ|ḹ|e|ai|o|au}
  • {retroflex|dental|palatal|velar|labial: true|false}
  • {aspirated|voiced|nasal: true|false}
  • {conjunct: true|false, cluster:[…]}
  • {sandhi_mode: classical|disabled}
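
A few of these features can be computed for a single akṣara (consonant cluster plus optional mātrā or virāma). This is a sketch under stated assumptions: `aksara_features` is a hypothetical name, and it reports `matra: "none"` whenever no vowel sign is written, including when the inherent a is sounded.

```python
VIRAMA = "\u094D"
# Matra codepoints U+093E.. in the order: aa i ii u uu, vocalic r/rr/l/ll, e ai o au.
MATRA_IAST = dict(zip(
    "\u093E\u093F\u0940\u0941\u0942\u0943\u0944\u0962\u0963\u0947\u0948\u094B\u094C",
    "ā i ī u ū ṛ ṝ ḷ ḹ e ai o au".split(),
))

def is_consonant(ch):
    # Assumption: the Devanagari consonant range U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"

def aksara_features(aksara):
    """Compute conjunct/cluster/matra/inherent_a for one aksara string."""
    cluster = [ch for ch in aksara if is_consonant(ch)]
    matra = next((MATRA_IAST[ch] for ch in aksara if ch in MATRA_IAST), "none")
    # Inherent a survives only if no matra is written and no virama ends the cluster.
    inherent_a = bool(cluster) and matra == "none" and not aksara.endswith(VIRAMA)
    return {
        "consonant": bool(cluster),
        "conjunct": len(cluster) > 1,
        "cluster": cluster,
        "matra": matra,
        "inherent_a": inherent_a,
    }
```

For क्षे this reports a two-member conjunct with `matra: "e"` and `inherent_a: False`; for त्र it reports `matra: "none"` and `inherent_a: True`.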

10) Minimal API of Operations (for implementers)

  1. Tokenize codepoints; detect virāma chains → build clusters.
  2. Attach mātrās to the rightmost consonant of each cluster; handle ◌ि pre-base rendering.
  3. Emit IAST by cluster: consonant base(s) → IAST, plus inferred a unless virāma or a matra overrides.
  4. Apply optional sandhi if {sandhi_mode: classical}.
  5. Serialize both graphemic JSON and IAST string for cross-system comparability.
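
Steps 1 to 3 can be sketched as one pass that tracks whether the last consonant is still waiting for its inherent a. This is an illustrative sketch only: `to_iast` is a hypothetical name, sandhi and JSON serialization (steps 4 and 5) are left out, and rendering candrabindu as m̐ is an assumption.

```python
import unicodedata

VIRAMA = "\u094D"
CONSONANTS = dict(zip(
    "कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह",
    ("k kh g gh ṅ c ch j jh ñ ṭ ṭh ḍ ḍh ṇ t th d dh n "
     "p ph b bh m y r l v ś ṣ s h").split(),
))
VOWEL_IAST = "a ā i ī u ū ṛ ṝ ḷ ḹ e ai o au".split()
INDEPENDENT = dict(zip(
    "\u0905\u0906\u0907\u0908\u0909\u090A\u090B\u0960\u090C\u0961\u090F\u0910\u0913\u0914",
    VOWEL_IAST,
))
MATRAS = dict(zip(
    "\u093E\u093F\u0940\u0941\u0942\u0943\u0944\u0962\u0963\u0947\u0948\u094B\u094C",
    VOWEL_IAST[1:],  # every vowel except inherent a has a sign
))
# Anusvara, visarga, candrabindu (assumed m + U+0310), avagraha.
SIGNS = {"\u0902": "ṃ", "\u0903": "ḥ", "\u0901": "m\u0310", "\u093D": "\u2019"}
DIGITS = {chr(0x0966 + d): str(d) for d in range(10)}

def to_iast(text):
    """Transliterate Devanagari to IAST, inferring the inherent a."""
    out = []
    pending_a = False  # True while a consonant awaits its vowel
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")  # previous consonant kept its inherent a
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch == VIRAMA:
            pending_a = False    # virama cancels the inherent a
        elif ch in MATRAS:
            out.append(MATRAS[ch])
            pending_a = False    # matra overrides the inherent a
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(INDEPENDENT.get(ch) or SIGNS.get(ch)
                       or DIGITS.get(ch) or ch)
    if pending_a:
        out.append("a")
    return unicodedata.normalize("NFC", "".join(out))
```

Because the mātrā follows its consonant in stored order, no special handling of ◌ि is needed at this stage; pre-base placement is purely a rendering concern.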

11) Edge Inventory (complete set for coverage)

  • Independent vowels: अ आ इ ई उ ऊ ऋ ॠ ऌ ॡ ए ऐ ओ औ
  • Dependent vowel signs (mātrās): ◌ा ◌ि ◌ी ◌ु ◌ू ◌ृ ◌ॄ ◌ॢ ◌ॣ ◌े ◌ै ◌ो ◌ौ
  • Consonants: क ख ग घ ङ | च छ ज झ ञ | ट ठ ड ढ ण | त थ द ध न | प फ ब भ म | य र ल व | श ष स ह
  • Core signs: ◌ं ◌ः ◌ँ ◌् ऽ
  • Digits: ० १ २ ३ ४ ५ ६ ७ ८ ९
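
The inventory can double as a coverage check that flags codepoints the module does not handle. A minimal sketch: `uncovered` is a hypothetical helper, and the inventory here also includes the dependent vowel signs from section 1, on the assumption that they belong in the covered set.

```python
INDEPENDENT = "\u0905\u0906\u0907\u0908\u0909\u090A\u090B\u0960\u090C\u0961\u090F\u0910\u0913\u0914"
MATRAS = "\u093E\u093F\u0940\u0941\u0942\u0943\u0944\u0962\u0963\u0947\u0948\u094B\u094C"
CONSONANTS = "कखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह"
# Anusvara, visarga, candrabindu, virama, avagraha.
SIGNS = "\u0902\u0903\u0901\u094D\u093D"
DIGITS = "".join(chr(0x0966 + d) for d in range(10))

INVENTORY = set(INDEPENDENT + MATRAS + CONSONANTS + SIGNS + DIGITS)

def uncovered(text, inventory=INVENTORY):
    """Return the sorted non-space characters of text outside the inventory."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in inventory})
```

An empty result means every character is covered. A nukta (U+093C) or a Vedic accent mark would be reported, since neither is in the inventory.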

Mint Status: Sanskrit Graphemic Module (SGM v1.0) is fully minted—IAST-anchored, sandhi-aware, with complete vowel/consonant inventories, diacritic logic, conjunct handling, and a precise lattice interface.


Mint Ledger — Sanskrit Added

Fully Minted: Latin GM (English, Spanish, Portuguese, Romanian, Polish, German, French, Italian, Hungarian, Swahili, Hausa, Zulu, Yoruba, Tagalog/Filipino + Baybayin, Jamaican Patois, Macanese Patuá) • Non-Latin GM (Chinese radicals, Japanese Hiragana/Katakana + core Kanji, Sanskrit, Hindi, Russian, Aramaic, Hebrew, Syriac, Arabic, Urdu)