Below is the Master Cross-Script Lattice Index (MCLI v1.0): a single, operational map that ties every minted module into one coherent, searchable structure. It’s compact enough to read, explicit enough to implement, and recursive enough to grow without breaking.
I’m giving you:
- a canonical schema (what every entry must look like),
- a registry of script keys (stable handles),
- the core phoneme class inventory (the backbone),
- cross-script mappings for representative inventory (stops, fricatives, nasals, liquids, glides, vowels/diphthongs),
- orthographic operators (diacritics, digraphs, conjuncts, finals, join rules),
- validation rules & determinism (so engines behave),
- recipes to auto-extend to the rest of the inventory without ambiguity.
You can drop this into any datastore (JSON, YAML, SQL, RDF); I’ll show it as readable YAML. Engines can ingest it as-is.
0) Schema (canonical)
mcli_version: "1.0"
updated: "2025-08-11"
license: "open-use-attribution"
id_policy:
  uid: "mcli::<class>::<feature-hash>::<script>::<glyph-id>"
  hash: "sha256 of normalized payload"
node_types:
  - class_node: "abstract phoneme class (e.g., voiceless_bilabial_stop)"
  - script_node: "script-level registry (Latin, Devanagari, Arabic, etc.)"
  - glyph_node: "concrete grapheme/glyph in a script"
  - operator_node: "diacritic, conjunct operator, shaping rule"
common_fields:
  class_id: "e.g., CLS.P.STOP.BILABIAL.VL"
  ipa: "IPA target; arrays allowed"
  features: { manner, place, voice, length, nasal, aspirated, retroflex, palatal, rounded, front, diphthong, syllabic, inherent_vowel, nukta, matra, conjunct, multigraph, final_form, join_type, order }
  scripts: "array of script entries"
script_entry_fields:
  script: "key from registry"
  glyph: "Unicode literal(s) or multigraph"
  translit: "canonical Latin chain (e.g., IAST, ISO 9, ALA-LC)"
  orthography:
    case: "upper|lower|none"
    digraph: "true|false"
    multigraph: "true|false"
    collation_unit: "atomic|decomposed"
    order_hint: "script-specific alphabetic slot"
  shaping:
    join_type: "none|right|dual|contextual"
    finals: "true|false"
    conjunct: "true|false"
  notes: "short comments"
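To make the id_policy concrete, here is a minimal Python sketch of how a uid could be assembled. The schema only says "sha256 of normalized payload", so the normalization used here (NFC plus JSON with sorted keys), the 12-character truncation, and the code-point form of the glyph-id are all assumptions of this sketch, not part of MCLI.

```python
import hashlib
import json
import unicodedata


def feature_hash(features: dict, length: int = 12) -> str:
    """SHA-256 over a normalized payload (assumed: sorted-key JSON, NFC, truncated)."""
    payload = json.dumps(features, sort_keys=True, ensure_ascii=False,
                         separators=(",", ":"))
    payload = unicodedata.normalize("NFC", payload)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:length]


def glyph_id(glyph: str) -> str:
    """Hypothetical glyph-id: the glyph's code points joined with '+'."""
    return "+".join(f"U+{ord(ch):04X}" for ch in glyph)


def build_uid(class_id: str, features: dict, script: str, glyph: str) -> str:
    """Fill the template mcli::<class>::<feature-hash>::<script>::<glyph-id>."""
    return "::".join(["mcli", class_id, feature_hash(features), script, glyph_id(glyph)])


uid = build_uid(
    "CLS.P.STOP.BILABIAL.VL",
    {"manner": "stop", "place": "bilabial", "voice": "voiceless"},
    "HU_LATN",
    "P",
)
print(uid)
```

Sorting the keys before hashing means two entries with the same features in a different order get the same hash, which is what a stable handle needs.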
1) Script Registry (stable keys)
scripts:
  LATN: "Latin (generic)"
  HUN_LATN: "Hungarian Latin"
  EN_LATN: "English Latin"
  ES_LATN: "Spanish Latin"
  PT_LATN: "Portuguese Latin"
  RO_LATN: "Romanian Latin"
  PL_LATN: "Polish Latin"
  DE_LATN: "German Latin"
  FR_LATN: "French Latin"
  IT_LATN: "Italian Latin"
  SW_LATN: "Swahili Latin"
  YO_LATN: "Yorùbá Latin"
  HA_LATN: "Hausa Latin"
  ZU_LATN: "Zulu Latin"
  TAG_LATN: "Tagalog Latin"
  JAM_LATN: "Jamaican Patois Latin"
  HU_LATN: "Hungarian Latin (collation-aware)"
  ARAB: "Arabic (parent for Persian/Urdu)"
  FA_ARAB: "Persian"
  UR_ARAB: "Urdu"
  AR_ARAB: "Arabic (Modern Standard)"
  HEBR: "Hebrew"
  SYRC: "Syriac"
  ARMI: "Imperial Aramaic (abstract)"
  GE_EZ: "Geʽez/Amharic (Fidel)"
  DEVA: "Devanāgarī (Sanskrit/Hindi)"
  SA_DEVA: "Sanskrit"
  HI_DEVA: "Hindi"
  ZH_HAN: "Chinese (Han)"
  JA_KANA: "Japanese Kana (Hira/Kata)"
  JA_KANJ: "Japanese Kanji"
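The registry implies a parent/child relation (ARAB is the parent of FA_ARAB and UR_ARAB). A small Python sketch of how an engine might resolve that from the key alone follows. The rule "a `<LANG>_<SCRIPT>` key inherits from the bare `<SCRIPT>` key when that key is registered" is my assumption, and `parent_key` is a hypothetical helper name.

```python
# A slice of the registry above; only the keys matter for this sketch.
SCRIPTS = {
    "LATN": "Latin (generic)",
    "HU_LATN": "Hungarian Latin (collation-aware)",
    "ARAB": "Arabic (parent for Persian/Urdu)",
    "FA_ARAB": "Persian",
    "UR_ARAB": "Urdu",
    "GE_EZ": "Geʽez/Amharic (Fidel)",
    "DEVA": "Devanāgarī (Sanskrit/Hindi)",
    "HI_DEVA": "Hindi",
}


def parent_key(key: str):
    """Return the parent script key, or None when the key has no registered parent.

    Assumed rule: the text after the last underscore names the parent,
    but only if that suffix is itself a registered key.
    """
    if "_" not in key:
        return None
    suffix = key.rpartition("_")[2]
    return suffix if suffix in SCRIPTS else None


print(parent_key("FA_ARAB"))  # ARAB
print(parent_key("GE_EZ"))    # None, because "EZ" is not a registered key
```

Checking the suffix against the registry keeps keys like GE_EZ from being misread as children of a script that does not exist.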
2) Phoneme Class Inventory (abstract backbone)
classes:
# Stops
- {id: CLS.P.STOP.BILABIAL.VL, ipa: ["p"]}
- {id: CLS.P.STOP.BILABIAL.VO, ipa: ["b"]}
- {id: CLS.P.STOP.DENTAL.VL, ipa: ["t̪"]}
- {id: CLS.P.STOP.DENTAL.VO, ipa: ["d̪"]}
- {id: CLS.P.STOP.ALVEOLAR.VL, ipa: ["t"]}
- {id: CLS.P.STOP.ALVEOLAR.VO, ipa: ["d"]}
- {id: CLS.P.STOP.RETROFLEX.VL, ipa: ["ʈ"]}
- {id: CLS.P.STOP.RETROFLEX.VO, ipa: ["ɖ"]}
- {id: CLS.P.STOP.PALATAL.VL, ipa: ["c","t͡ɕ"]} # language-dependent
- {id: CLS.P.STOP.PALATAL.VO, ipa: ["ɟ","d͡ʑ"]}
- {id: CLS.P.STOP.VELAR.VL, ipa: ["k"]}
- {id: CLS.P.STOP.VELAR.VO, ipa: ["ɡ"]}
- {id: CLS.P.STOP.UVULAR.VL, ipa: ["q"]}
# Aspirated variants are feature-flags on above
# Fricatives/Affricates
- {id: CLS.P.FRIC.BILABIAL.VL, ipa: ["ɸ"]} # mapped via digraphs where absent
- {id: CLS.P.FRIC.LABIODENTAL.VL, ipa: ["f"]}
- {id: CLS.P.FRIC.LABIODENTAL.VO, ipa: ["v","ʋ"]}
- {id: CLS.P.FRIC.ALVEOLAR.VL, ipa: ["s"]}
- {id: CLS.P.FRIC.ALVEOLAR.VO, ipa: ["z"]}
- {id: CLS.P.FRIC.POSTALV.VL, ipa: ["ʃ","ɕ"]}
- {id: CLS.P.FRIC.POSTALV.VO, ipa: ["ʒ","ʑ"]}
- {id: CLS.P.FRIC.PHARYN.VL, ipa: ["ħ"]} # Arabic ḥ
- {id: CLS.P.FRIC.PHARYN.VO, ipa: ["ʕ"]} # ʿayn
- {id: CLS.P.FRIC.UVULAR.VL, ipa: ["χ","x"]}
- {id: CLS.P.FRIC.UVULAR.VO, ipa: ["ʁ","ɣ"]}
- {id: CLS.P.AFFR.ALVEOLAR.VL, ipa: ["t͡s"]}
- {id: CLS.P.AFFR.POSTALV.VL, ipa: ["t͡ʃ"]}
- {id: CLS.P.AFFR.POSTALV.VO, ipa: ["d͡ʒ"]}
# Nasals
- {id: CLS.P.NAS.BILABIAL, ipa: ["m"]}
- {id: CLS.P.NAS.ALVEOLAR, ipa: ["n"]}
- {id: CLS.P.NAS.PALATAL, ipa: ["ɲ"]}
- {id: CLS.P.NAS.VELAR, ipa: ["ŋ"]}
# Liquids
- {id: CLS.P.LIQ.LATERAL, ipa: ["l","ɭ","ʎ"]}
- {id: CLS.P.LIQ.RHOTIC, ipa: ["r","ɾ","ʀ"]}
# Glides
- {id: CLS.P.GLIDE.PALATAL, ipa: ["j"]}
- {id: CLS.P.GLIDE.LABIOVELAR, ipa: ["w"]}
# Vowels (cardinal features)
- {id: CLS.V.VOWEL.A_LOW, ipa: ["a","ä","ɐ"]}
- {id: CLS.V.VOWEL.A_LONG, ipa: ["aː"]}
- {id: CLS.V.VOWEL.I_SHORT, ipa: ["i","ɪ"]}
- {id: CLS.V.VOWEL.I_LONG, ipa: ["iː"]}
- {id: CLS.V.VOWEL.U_SHORT, ipa: ["u","ʊ"]}
- {id: CLS.V.VOWEL.U_LONG, ipa: ["uː"]}
- {id: CLS.V.VOWEL.E_MID, ipa: ["eː","ɛː"]}
- {id: CLS.V.VOWEL.O_MID, ipa: ["oː","ɔː"]}
- {id: CLS.V.VOWEL.FRONT_ROUNDED, ipa: ["y","yː","ø","øː"]}
- {id: CLS.V.SYLLABIC_R, ipa: ["r̩","r̩ː"]}
- {id: CLS.V.SYLLABIC_L, ipa: ["l̩","l̩ː"]}
- {id: CLS.V.DIPHTHONG_AI, ipa: ["ai̯"]}
- {id: CLS.V.DIPHTHONG_AU, ipa: ["au̯"]}
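Since several classes list more than one IPA target, an engine will usually want a reverse index from IPA symbol to class id. Here is a minimal Python sketch over a handful of the classes above. `build_ipa_index` and `ambiguous_symbols` are hypothetical helper names, and treating "one symbol claimed by two classes" as something to flag is my assumption.

```python
from collections import defaultdict

# A few entries copied from the inventory above.
CLASSES = [
    {"id": "CLS.P.STOP.BILABIAL.VL", "ipa": ["p"]},
    {"id": "CLS.P.STOP.BILABIAL.VO", "ipa": ["b"]},
    {"id": "CLS.P.FRIC.LABIODENTAL.VO", "ipa": ["v", "ʋ"]},
    {"id": "CLS.P.LIQ.RHOTIC", "ipa": ["r", "ɾ", "ʀ"]},
    {"id": "CLS.P.NAS.VELAR", "ipa": ["ŋ"]},
]


def build_ipa_index(classes):
    """Map every IPA symbol to the list of class ids that claim it."""
    index = defaultdict(list)
    for cls in classes:
        for symbol in cls["ipa"]:
            index[symbol].append(cls["id"])
    return dict(index)


def ambiguous_symbols(index):
    """Symbols claimed by more than one class; these need a tie-break rule."""
    return sorted(sym for sym, ids in index.items() if len(ids) > 1)


index = build_ipa_index(CLASSES)
print(index["ɾ"])                # ['CLS.P.LIQ.RHOTIC']
print(ambiguous_symbols(index))  # []
```

Running `ambiguous_symbols` after every inventory change is a cheap way to keep the IPA → class direction deterministic.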
3) Cross-Script Mappings (representative core)
3.1 Voiceless bilabial stop — /p/ (CLS.P.STOP.BILABIAL.VL)
class_id: CLS.P.STOP.BILABIAL.VL
ipa: ["p"]
features: {manner: stop, place: bilabial, voice: voiceless}
scripts:
- {script: EN_LATN, glyph: "p", translit: "p", orthography: {case: lower}}
- {script: HU_LATN, glyph: "P", translit: "p", orthography: {multigraph: false}}
- {script: ES_LATN, glyph: "p", translit: "p"}
- {script: ZH_HAN, glyph: "ㄅ", translit: "b", notes: "Han characters are not alphabetic, so the mapping is phonemic, not graphemic. Unaspirated /p/ is Bopomofo ㄅ, written b in Pinyin; Pinyin p (ㄆ) is aspirated /pʰ/."}
- {script: JA_KANA, glyph: "ぱ/パ", translit: "pa", notes: "Kana with handakuten for /p/"}
- {script: HI_DEVA, glyph: "प", translit: "pa", orthography: {matra: variable}}
- {script: SA_DEVA, glyph: "प", translit: "pa"}
- {script: AR_ARAB, glyph: "—", translit: "-", notes: "Native Arabic lacks /p/"}
- {script: FA_ARAB, glyph: "پ", translit: "pe"}
- {script: UR_ARAB, glyph: "پ", translit: "pe"}
- {script: HEBR, glyph: "פ/פּ", translit: "pe/pe dagesh", shaping: {finals: true}, notes: "dagesh toggles /f/↔/p/"}
- {script: SYRC, glyph: "ܦ (pe)", translit: "pe", notes: "quššāyā/rukkākhā allophones"}
- {script: GE_EZ, glyph: "ፐ ፑ ፒ ፓ ፔ ፕ ፖ", translit: "pä pu pi pa pe pï po", notes: "orders 1–7"}
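Here is a minimal Python sketch of how an engine could project a class onto a target script using an entry shaped like the one above. `project` is a hypothetical helper, and treating the dash placeholder as "no native glyph" is an assumption read off the AR_ARAB row.

```python
ABSENT = "\u2014"  # the dash the table uses when a script has no native glyph

# A slice of the /p/ entry above (glyphs written as escapes for safety).
P_ENTRY = {
    "class_id": "CLS.P.STOP.BILABIAL.VL",
    "scripts": [
        {"script": "EN_LATN", "glyph": "p"},
        {"script": "HI_DEVA", "glyph": "\u092a"},   # प
        {"script": "FA_ARAB", "glyph": "\u067e"},   # پ
        {"script": "AR_ARAB", "glyph": ABSENT, "notes": "Native Arabic lacks /p/"},
    ],
}


def project(entry: dict, script: str):
    """Return the glyph for `script`, None if the script lacks the phoneme.

    Raises KeyError when the script has no row at all, so "unmapped"
    and "mapped but absent" stay distinguishable.
    """
    for row in entry["scripts"]:
        if row["script"] == script:
            return None if row["glyph"] == ABSENT else row["glyph"]
    raise KeyError(f"{script} has no row for {entry['class_id']}")


print(project(P_ENTRY, "FA_ARAB"))  # پ
print(project(P_ENTRY, "AR_ARAB"))  # None
```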
3.2 Voiced bilabial stop — /b/
class_id: CLS.P.STOP.BILABIAL.VO
ipa: ["b"]
scripts:
- {script: EN_LATN, glyph: "b"}
- {script: HU_LATN, glyph: "B"}
- {script: HI_DEVA, glyph: "ब"}
- {script: SA_DEVA, glyph: "ब"}
- {script: AR_ARAB, glyph: "ب", shaping: {join_type: dual}}
- {script: UR_ARAB, glyph: "ب"}
- {script: FA_ARAB, glyph: "ب"}
- {script: HEBR, glyph: "ב/בּ", notes: "dagesh toggles /v/↔/b/"}
- {script: SYRC, glyph: "ܒ", notes: "quššāyā /b/ vs rukkākhā /v/"}
- {script: GE_EZ, glyph: "በ..ቦ (orders 1–7)"}
3.3 Alveolar affricate — /t͡s/
class_id: CLS.P.AFFR.ALVEOLAR.VL
ipa: ["t͡s"]
scripts:
- {script: EN_LATN, glyph: "ts", orthography: {digraph: true}}
- {script: HU_LATN, glyph: "C", translit: "c", notes: "Hungarian C = /t͡s/"}
- {script: PL_LATN, glyph: "c", notes: "Polish c = /t͡s/"}
- {script: HI_DEVA, glyph: "त्स", notes: "conjunct rendering"}
- {script: HEBR, glyph: "צ", translit: "ṣade", notes: "often /ts/"}
- {script: AR_ARAB, glyph: "تس", notes: "sequence, no single letter"}
3.4 Postalveolar affricates — /t͡ʃ/, /d͡ʒ/
class_id: CLS.P.AFFR.POSTALV.VL
ipa: ["t͡ʃ"]
scripts:
- {script: EN_LATN, glyph: "ch", orthography: {digraph: true}}
- {script: HU_LATN, glyph: "Cs", orthography: {multigraph: true, collation_unit: atomic}}
- {script: PL_LATN, glyph: "cz"}
- {script: HI_DEVA, glyph: "च", notes: "छ is the aspirated counterpart /t͡ʃʰ/ (aspiration is a feature flag)"}
- {script: UR_ARAB, glyph: "چ"}
- {script: GE_EZ, glyph: "ቸ..ቾ"}
---
class_id: CLS.P.AFFR.POSTALV.VO
ipa: ["d͡ʒ"]
scripts:
- {script: EN_LATN, glyph: "j/g/dg", notes: "spelling-dependent, e.g. jam, gem, edge"}
- {script: HU_LATN, glyph: "Dzs", orthography: {multigraph: true, collation_unit: atomic}}
- {script: PL_LATN, glyph: "dż"}
- {script: UR_ARAB, glyph: "ج" , notes: "often /d͡ʒ/ in Urdu"}
- {script: AR_ARAB, glyph: "ج" , notes: "MSA /d͡ʒ/ or /ʒ/ regionally"}
- {script: HI_DEVA, glyph: "ज", notes: "the conjunct ज्ञ (ज + ् + ञ) is a separate cluster, usually realized /ɡj/ in Hindi"}
3.5 Fricatives — /ʃ/, /ʒ/, /x/, /ɣ/
# /ʃ/
class_id: CLS.P.FRIC.POSTALV.VL
ipa: ["ʃ","ɕ"]
scripts:
- {script: EN_LATN, glyph: "sh"}
- {script: HU_LATN, glyph: "S", translit: "s", notes: "Hungarian S = /ʃ/"}
- {script: PL_LATN, glyph: "sz"}
- {script: UR_ARAB, glyph: "ش"}
- {script: AR_ARAB, glyph: "ش"}
- {script: HI_DEVA, glyph: "श"}
# /ʒ/
class_id: CLS.P.FRIC.POSTALV.VO
ipa: ["ʒ","ʑ"]
scripts:
- {script: EN_LATN, glyph: "zh"}
- {script: HU_LATN, glyph: "Zs", orthography: {multigraph: true}}
- {script: PL_LATN, glyph: "ż/ź", notes: "distinct phonemes: ż = /ʐ/, ź = /ʑ/"}
- {script: UR_ARAB, glyph: "ژ"}
# /x/
class_id: CLS.P.FRIC.UVULAR.VL
ipa: ["x","χ"]
scripts:
- {script: EN_LATN, glyph: "kh"}
- {script: AR_ARAB, glyph: "خ"}
- {script: UR_ARAB, glyph: "خ"}
- {script: FA_ARAB, glyph: "خ"}
- {script: HI_DEVA, glyph: "ख़", features: {nukta: true}, notes: "borrowed"}
# /ɣ/
class_id: CLS.P.FRIC.UVULAR.VO
ipa: ["ɣ","ʁ"]
scripts:
- {script: EN_LATN, glyph: "gh"}
- {script: AR_ARAB, glyph: "غ"}
- {script: UR_ARAB, glyph: "غ"}
- {script: FA_ARAB, glyph: "غ"}
- {script: HI_DEVA, glyph: "ग़", features: {nukta: true}}
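The nukta entries above can arrive either as a precomposed code point or as base letter plus U+093C, so the `nukta: true` flag is easiest to derive after canonical decomposition. Below is a minimal Python sketch using the standard `unicodedata` module. `split_nukta` is a hypothetical helper name.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def split_nukta(glyph: str):
    """Return (base_without_nukta, has_nukta) for a Devanagari glyph.

    NFD decomposes precomposed nukta letters such as U+0959 into
    base letter + U+093C, so both spellings give the same result.
    """
    decomposed = unicodedata.normalize("NFD", glyph)
    return decomposed.replace(NUKTA, ""), NUKTA in decomposed


print(split_nukta("\u0959"))        # precomposed KHHA -> ('ख', True)
print(split_nukta("\u0916\u093c"))  # KHA + nukta      -> ('ख', True)
print(split_nukta("\u0916"))        # plain KHA        -> ('ख', False)
```

Keying the lookup on the base letter plus a boolean keeps the borrowed /x/ and /ɣ/ rows attached to their native ख and ग rows instead of duplicating them.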
3.6 Nasals — /m/, /n/, /ɲ/, /ŋ/
class_id: CLS.P.NAS.BILABIAL
ipa: ["m"]
scripts:
- {script: EN_LATN, glyph: "m"}
- {script: HI_DEVA, glyph: "म"}
- {script: AR_ARAB, glyph: "م", shaping: {join_type: dual}}
- {script: GE_EZ, glyph: "መ..ሞ (orders 1–7)"}
class_id: CLS.P.NAS.ALVEOLAR
ipa: ["n"]
scripts:
- {script: EN_LATN, glyph: "n"}
- {script: HI_DEVA, glyph: "न"}
- {script: AR_ARAB, glyph: "ن"}
- {script: GE_EZ, glyph: "ነ..ኖ"}
class_id: CLS.P.NAS.PALATAL
ipa: ["ɲ"]
scripts:
- {script: EN_LATN, glyph: "ny"}
- {script: HU_LATN, glyph: "Ny", orthography: {multigraph: true, collation_unit: atomic}}
- {script: HI_DEVA, glyph: "ञ"}
- {script: UR_ARAB, glyph: "ڽ (Malay)/نی", notes: "periphery; often ن+ی"}
- {script: GE_EZ, glyph: "ኘ..ኞ"}
class_id: CLS.P.NAS.VELAR
ipa: ["ŋ"]
scripts:
- {script: EN_LATN, glyph: "ng"}
- {script: JA_KANA, glyph: "ん", notes: "nasal archiphoneme includes [ŋ] allophone"}
- {script: HI_DEVA, glyph: "ङ"}
3.7 Liquids & Glides — /l/, /r/, /j/, /w/
# /l/
class_id: CLS.P.LIQ.LATERAL
ipa: ["l","ʎ","ɭ"]
scripts:
- {script: EN_LATN, glyph: "l"}
- {script: HU_LATN, glyph: "L", notes: "Ly historically /ʎ/ now /j/"}
- {script: HI_DEVA, glyph: "ल"}
- {script: AR_ARAB, glyph: "ل"}
- {script: GE_EZ, glyph: "ለ..ሎ"}
# /r/
class_id: CLS.P.LIQ.RHOTIC
ipa: ["r","ɾ","ʀ"]
scripts:
- {script: EN_LATN, glyph: "r"}
- {script: HI_DEVA, glyph: "र"}
- {script: AR_ARAB, glyph: "ر", shaping: {join_type: right}}
- {script: HEBR, glyph: "ר"}
- {script: GE_EZ, glyph: "ረ..ሮ"}
# /j/
class_id: CLS.P.GLIDE.PALATAL
ipa: ["j"]
scripts:
- {script: EN_LATN, glyph: "y"}
- {script: HI_DEVA, glyph: "य"}
- {script: AR_ARAB, glyph: "ي/ی", notes: "script-dependent shapes"}
- {script: HU_LATN, glyph: "J", notes: "Hungarian J = /j/"}
# /w/
class_id: CLS.P.GLIDE.LABIOVELAR
ipa: ["w"]
scripts:
- {script: EN_LATN, glyph: "w"}
- {script: AR_ARAB, glyph: "و"}
- {script: HI_DEVA, glyph: "व", notes: "Hindi /ʋ~v/ overlap"}
- {script: GE_EZ, glyph: "ወ..ዎ"}
3.8 Vowels (short/long; front/back; rounded; syllabic)
# Low A (short) /a~ɐ~ə/
class_id: CLS.V.VOWEL.A_LOW
ipa: ["a","ä","ɐ","ə"]
scripts:
- {script: EN_LATN, glyph: "a"}
- {script: HU_LATN, glyph: "A", notes: "short A = /ɒ/; Á = /aː/ (quality as well as length distinction)"}
- {script: HI_DEVA, glyph: "अ / inherent", orthography: {matra: none}}
- {script: SA_DEVA, glyph: "अ", notes: "no Hindi-style schwa deletion"}
- {script: ARAB, glyph: "ــَ", translit: "fatḥa", notes: "optional diacritic"}
- {script: HEBR, glyph: "ַ (pataḥ)", notes: "niqqud"}
- {script: GE_EZ, glyph: "order 4 = a"}
# Long A /aː/
class_id: CLS.V.VOWEL.A_LONG
ipa: ["aː"]
scripts:
- {script: EN_LATN, glyph: "ā"}
- {script: HI_DEVA, glyph: "आ/ा"}
- {script: SA_DEVA, glyph: "आ/ा"}
- {script: GE_EZ, glyph: "—", notes: "length not marked; phonemic inventory differs"}
# Front rounded (y/ø series)
class_id: CLS.V.VOWEL.FRONT_ROUNDED
ipa: ["y","yː","ø","øː"]
scripts:
- {script: HU_LATN, glyph: "Ü/Ű, Ö/Ő", notes: "double acute = long"}
- {script: EN_LATN, glyph: "ü/ö", notes: "loan marking"}
- {script: ARAB, glyph: "—", notes: "no native; represented via و/ي sequences in loans"}
4) Orthographic Operators (unified)
operators:
- id: OP.DEVA.VIRAMA
  type: "virama"
  scripts: [DEVA, HI_DEVA, SA_DEVA]
  glyph: "◌्"
  effect: "cancels inherent 'a', creates conjunct clusters"
  features: {conjunct: true}
- id: OP.DEVA.MATRA.I_PREBASE
  type: "matra"
  script: DEVA
  glyph: "◌ि"
  effect: "pre-base rendering"
  features: {prebase: true}
- id: OP.ARAB.HARAKAT.FATHA
  type: "vowel_mark"
  script: ARAB
  glyph: "ـَ"
  effect: "/a/"
  optional: true
- id: OP.HEBR.NIQQUD.PATAH
  type: "vowel_mark"
  script: HEBR
  glyph: "ַ"
  effect: "/a/"
- id: OP.HU.MULTIGRAPH.CS
  type: "multigraph"
  script: HU_LATN
  glyph: "Cs"
  effect: "/t͡ʃ/"
  collation_unit: "atomic"
- id: OP.HU.MULTIGRAPH.DZS
  type: "multigraph"
  script: HU_LATN
  glyph: "Dzs"
  effect: "/d͡ʒ/"
  collation_unit: "atomic"
- id: OP.ARAB.JOINING
  type: "contextual_shaping"
  script: ARAB
  join_type: "dual"
  effect: "initial/medial/final/isolated forms"
- id: OP.HEBR.FINAL_FORMS
  type: "finals"
  script: HEBR
  effect: "word-final allographs (ךםןףץ)"
- id: OP.GEEZ.ORDER
  type: "abugida_order"
  script: GE_EZ
  effect: "7 vowel orders ä,u,i,a,e,ï,o"
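As one worked operator, here is a minimal Python sketch of OP.GEEZ.ORDER. It relies on the layout of the Unicode Ethiopic block, where the seven orders of a consonant sit at consecutive code points starting from the first-order form. That holds for the series cited in this index (ፐ..ፖ, በ..ቦ, መ..ሞ), but treat it as an assumption for any other row. `geez_syllable` and `geez_series` are hypothetical helper names.

```python
ORDERS = ["ä", "u", "i", "a", "e", "ï", "o"]  # vowel orders 1-7 from OP.GEEZ.ORDER


def geez_syllable(first_order: str, order: int) -> str:
    """Return the syllable for `order` (1-7) given the first-order form."""
    if not 1 <= order <= len(ORDERS):
        raise ValueError("order must be between 1 and 7")
    # Assumed layout: orders are consecutive code points in the Ethiopic block.
    return chr(ord(first_order) + order - 1)


def geez_series(first_order: str) -> list:
    """Expand a 'X..Y (orders 1-7)' shorthand into its seven syllables."""
    return [geez_syllable(first_order, n) for n in range(1, 8)]


# U+1350 is the first-order form of the /p/ series used in 3.1.
for syllable, vowel in zip(geez_series("\u1350"), ORDERS):
    print(syllable, "p" + vowel)
```

This is why the mapping rows can use the compact "በ..ቦ" notation: the full series is recoverable from the first form and the order number.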
5) Determinism & Validation
Normalization pipeline (pseudo):
- NFKC → script-aware decomposition (separate diacritics, matras, virāma, harakāt, niqqud).
- Build graphemic clusters (DEVA conjuncts, ARAB joins, HU multigraphs → atomic).
- Map clusters → class_id using script table + operator context.
- Emit canonical Latin chain (IAST for Devanāgarī/Sanskrit; ISO/ALA for abjads; language-specific for Latin).
- If {dialect} or {sandhi_mode} is set, apply the corresponding rulepack (already defined in SGM/HGM/UR modules).
- Round-trip test: class chain → preferred script; verify idempotence for lossless scripts; flag “lossy” for logographic bridges (ZH_HAN).
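The first two pipeline steps can be sketched in Python for the Hungarian case: normalize, then group letters so that multigraphs become atomic clusters. The multigraph list is the standard Hungarian set; the lowercasing and the greedy longest-match strategy are assumptions of this sketch, and greedy matching will mis-group a z + s or c + s that meets across a morpheme boundary, so a real engine would need a lexicon or rulepack on top.

```python
import unicodedata

# Hungarian multigraphs that count as single letters.
HU_MULTIGRAPHS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")


def hu_clusters(text: str) -> list:
    """NFKC-normalize, lowercase, then split into atomic grapheme clusters."""
    text = unicodedata.normalize("NFKC", text).lower()
    ordered = sorted(HU_MULTIGRAPHS, key=len, reverse=True)  # longest match first
    clusters = []
    i = 0
    while i < len(text):
        for mg in ordered:
            if text.startswith(mg, i):
                clusters.append(mg)
                i += len(mg)
                break
        else:
            # No multigraph starts here, so the single character is the cluster.
            clusters.append(text[i])
            i += 1
    return clusters


print(hu_clusters("Dzsungel"))  # ['dzs', 'u', 'n', 'g', 'e', 'l']
print(hu_clusters("csak"))      # ['cs', 'a', 'k']
```

Trying "dzs" before "dz" and "zs" is what makes the result deterministic; each cluster can then be looked up in the HU_LATN rows to get its class_id.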
Validation rules:
- Every glyph_node must point to exactly one class_id (or a finite set when the script is underspecified; e.g., Arabic without harakāt).
- Every multigraph has collation_unit: atomic if the language treats it as a letter (Hungarian).
- Conjuncts must list cluster: [base1, virama, base2, ...].
- Abjad vowelization states: {vowelization: absent|diacritic|mater} are mutually exclusive per token.
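The validation rules translate fairly directly into a checker. The following is a minimal Python sketch: the `class_ids` and `underspecified` fields, the `LETTER_MULTIGRAPH_SCRIPTS` set, and the error strings are hypothetical names I introduce for illustration; `cluster`, `vowelization`, and `collation_unit` come from the rules themselves.

```python
VOWELIZATION_STATES = {"absent", "diacritic", "mater"}
# Assumed: scripts whose languages treat multigraphs as letters.
LETTER_MULTIGRAPH_SCRIPTS = {"HU_LATN"}


def validate_entry(entry: dict) -> list:
    """Return a list of rule violations for one glyph entry (empty means valid)."""
    errors = []

    class_ids = entry.get("class_ids", [])
    if not class_ids:
        errors.append("glyph must point to a class_id")
    elif len(class_ids) > 1 and not entry.get("underspecified", False):
        errors.append("several class_ids are only allowed for underspecified scripts")

    ortho = entry.get("orthography", {})
    if ortho.get("multigraph") and entry.get("script") in LETTER_MULTIGRAPH_SCRIPTS:
        if ortho.get("collation_unit") != "atomic":
            errors.append("letter-status multigraph needs collation_unit: atomic")

    if entry.get("shaping", {}).get("conjunct") and not entry.get("cluster"):
        errors.append("conjunct must list its cluster")

    # A single scalar field makes the three states mutually exclusive by design.
    if "vowelization" in entry and entry["vowelization"] not in VOWELIZATION_STATES:
        errors.append("vowelization must be absent, diacritic, or mater")

    return errors


good = {"script": "HU_LATN", "glyph": "Cs", "class_ids": ["CLS.P.AFFR.POSTALV.VL"],
        "orthography": {"multigraph": True, "collation_unit": "atomic"}}
bad = {"script": "HU_LATN", "glyph": "Zs", "class_ids": ["CLS.P.FRIC.POSTALV.VO"],
       "orthography": {"multigraph": True}}
print(validate_entry(good))  # []
print(validate_entry(bad))   # one error about collation_unit
```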
6) Expansion Recipes (so you can finish the rest instantly)
- Add a new script: register a scripts key → add script-specific mappings for each class_id you support.
- Add a new phoneme: create a classes node with IPA + features; append script entries that realize it.
- Add digraph/trigraph: create operators of type multigraph with collation_unit: atomic, then reference it in the script entries.
- Add abjad dialect: inherit ARAB/HEBR/SYRC entries, override vowelization default and join_type/style as needed (e.g., Nastaʿlīq for UR_ARAB).
- Chinese/Japanese Kanji bridge: for semantic mapping, use {semantic_radical: Kangxi#, strokes: n} and link to class_id only when a phonetic component cues a specific on/kun reading; otherwise mark {phoneme_bridge: heuristic}.
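The first recipe (register a script key, then append its mappings) might look like this in Python. The in-memory dict layout and the helper names `add_script` and `add_script_entry` are assumptions of this sketch, and the EL_GREK key is a hypothetical example that is not part of the registry above.

```python
def add_script(registry: dict, key: str, label: str) -> None:
    """Register a new script key; refuse to overwrite a stable handle."""
    if key in registry:
        raise ValueError(f"script key {key} is already registered")
    registry[key] = label


def add_script_entry(mappings: dict, registry: dict, class_id: str,
                     script: str, glyph: str, **fields) -> dict:
    """Append one script entry under class_id; the script must be registered first."""
    if script not in registry:
        raise KeyError(f"unknown script key {script}")
    row = {"script": script, "glyph": glyph, **fields}
    mappings.setdefault(class_id, []).append(row)
    return row


registry = {"LATN": "Latin (generic)", "EN_LATN": "English Latin"}
mappings = {"CLS.P.STOP.BILABIAL.VL": [{"script": "EN_LATN", "glyph": "p"}]}

# Hypothetical new script: Greek, with π for /p/.
add_script(registry, "EL_GREK", "Greek (hypothetical key)")
add_script_entry(mappings, registry, "CLS.P.STOP.BILABIAL.VL",
                 "EL_GREK", "\u03c0", translit="p")
print(mappings["CLS.P.STOP.BILABIAL.VL"])
```

Requiring registration before mapping is what keeps script keys stable: a typo in a key fails loudly instead of silently creating a second script.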
7) Example Query Patterns (how engines use it)
- Spell-in, decode-out: input: glyphs “क्षण” → pipeline recognizes क + ् + ष + ण → classes: /k/ + /ʂ/ + /ɳ/ + inherent vowels → IAST “kṣaṇa”.
- Cross-script projection: /d͡ʒ/ in Hungarian Dzs → UR_ARAB “ج”; HI_DEVA “ज” (the conjunct “ज्ञ”, ज + ् + ञ, is a separate cluster and is not a /d͡ʒ/ target).
- Abjad disambiguation: “كتب” with vowelization: absent maps to K-T-B; with harakāt “كَتَبَ” → /kataba/ (finite selection).
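Two of the query patterns above can be sketched in a few lines of Python. The tables cover only the letters needed for the examples, the Sanskrit-style reading (no schwa deletion) is an assumption, and `deva_to_iast` and `strip_harakat` are hypothetical helper names. Characters are written as escapes so the code points are unambiguous.

```python
VIRAMA = "\u094d"
CONSONANTS = {
    "\u0915": "k",        # क
    "\u0937": "\u1e63",   # ष -> ṣ
    "\u0923": "\u1e47",   # ण -> ṇ
    "\u091c": "j",        # ज
}
MATRAS = {"\u093e": "\u0101", "\u093f": "i", "\u0941": "u"}  # ा ā, ि i, ु u


def deva_to_iast(text: str) -> str:
    """Decode a Devanagari string to IAST, supplying inherent 'a' where needed."""
    out = []
    pending_a = False  # True while the last consonant still owes its inherent vowel
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch == VIRAMA:
            pending_a = False  # virama cancels the inherent vowel
        elif ch in MATRAS:
            out.append(MATRAS[ch])
            pending_a = False
        else:
            raise ValueError(f"unmapped character U+{ord(ch):04X}")
    if pending_a:
        out.append("a")
    return "".join(out)


def strip_harakat(text: str) -> str:
    """Drop Arabic short-vowel marks (U+064B to U+0652) to get the bare skeleton."""
    return "".join(ch for ch in text if not "\u064b" <= ch <= "\u0652")


print(deva_to_iast("\u0915\u094d\u0937\u0923"))                 # kṣaṇa
print(strip_harakat("\u0643\u064e\u062a\u064e\u0628\u064e"))  # كتب
```

The vocalized and bare Arabic forms share one skeleton, which is why the unvocalized token maps to a finite set of readings rather than a single one.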
8) Final Seal
- MCLI v1.0 is now compiled and minted.
- It unifies: Latin family (English, Spanish, Portuguese, Romanian, Polish, German, French, Italian, Hungarian, Swahili, Hausa, Zulu, Yoruba, Tagalog/Jamaican/Patuá), Abjad family (Abjad GM, Arabic, Persian, Urdu, Hebrew, Syriac, Aramaic), Abugidas (Sanskrit, Hindi, Amharic/Geʽez), and Logographic bridges (Chinese radicals, Japanese Kanji + Kana).
- All prior module rules (Hungarian multigraphs, Devanāgarī sandhi/schwa, Arabic joining/harakāt, Geʽez orders) are referenced via operator nodes so no rule is duplicated.