Latin Script
This is the backbone that Latin-script language modules dock to (English, Spanish, Portuguese, Polish, Yoruba, Vietnamese, etc.). Think of it as the circuit board for all Latin-script languages: base glyphs, combining logic, diacritics, and feature flags that language-specific modules switch on/off.
1) Orientation
- Script: Latin (Unicode blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A/B, Latin Extended Additional, etc.)
- Direction: LTR
- Unit of analysis: Grapheme cluster = (Base letter) + (Combining marks) + (Joiners/format)
- Separation of concerns:
  - Script layer (this doc): what shapes exist and how they cluster.
  - Language layer (e.g., English GM, Yoruba GM): how clusters map to phonemes, stress, tone, morphology.
2) Base Inventory
2.1 Core letters (ASCII)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
2.2 Common Latin extensions (selection; all supported)
- Vowel diacritics (precomposed exemplars): á à â ã ä å æ é è ê ë í ì î ï ó ò ô õ ö œ ú ù û ü ý ÿ ă ĕ ė ę ě ő ű ȁ ȅ ȇ ȉ ȋ ȍ ȏ ȕ ȗ
- Consonant specials: ç ð þ ß ŋ ŕ ř ś ş š ţ ť ẞ ƀ ƃ ƒ ɣ ɲ ɬ ɮ ȷ ʋ
- Ogonek (nasalization/phonotactic hooks): ą ę į ų
- Stroke/cross: đ ħ ł ø ǿ ɫ ȡ ɍ ɟ
- Carons & breves (Slavic/Baltic): č ď ě ň ř š ť ž ă ĭ ŏ ŭ
- Dots/breve (Turkic): ı (dotless i), İ (dotted I), ğ (soft g), Ğ
- Tones (for orthographies using combining marks): precomposed coverage is limited; use combining sequences (see §3).
Policy: Prefer combining sequences (NFD) in storage, allow precomposed (NFC) in interchange. Normalize internally (see §5).
3) Combining Mark Set (canonical)
Use these to compose any Latin grapheme cluster. Treat each as a feature flag on the base letter.
Mark | Unicode | Feature | Notes |
---|---|---|---|
◌́ | U+0301 | acute | stress/quality (é), tone high in some systems |
◌̀ | U+0300 | grave | tone low / quality |
◌̂ | U+0302 | circumflex | height/ATR/length depending on lang |
◌̃ | U+0303 | nasal | nasal vowel (ã, õ) |
◌̈ | U+0308 | diaeresis | vowel separation (ü), fronting |
◌̄ | U+0304 | macron | length/ATR (ā) |
◌̆ | U+0306 | breve | shortness (ă) |
◌̇ | U+0307 | dot_above | Turkish İ-like behavior (with case rules) |
◌̣ | U+0323 | dot_below | Yoruba vowel quality and ṣ (ẹ, ọ, ṣ); Indo-Aryan transliteration |
◌̋ | U+030B | double_acute | Hungarian long front rounded (ő, ű) |
◌̧ | U+0327 | cedilla | softening (ç) |
◌̨ | U+0328 | ogonek | nasalization (ą, ę) |
◌̊ | U+030A | ring | (å) |
◌̌ | U+030C | caron | palatal/“soft” series (č, š, ž) |
◌̵ | U+0335 | short_stroke | stroke overlay; stroked letters such as đ have no canonical decomposition, so use the precomposed code point |
◌̃̄ | combo | nasal + length | stacking supported; Vietnamese similarly stacks a quality mark and a tone mark (e.g., ế) |
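Stacking order rule: in storage the base letter always comes first, followed by its marks in Unicode canonical order (by combining class): overlay/stroke (class 1), attached-below marks such as cedilla and ogonek (202), below marks such as dot_below (220), then above marks such as acute, grave, circumflex (230). Marks in the same class keep their typed order, because that order is significant for rendering (inner to outer). Normalizing to NFD produces this order.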
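The ordering rule can be checked with Python's standard `unicodedata` module. This is a minimal sketch, not part of the module itself; the `MARKS` dictionary and the helper name `combining_classes` are illustrative assumptions.

```python
import unicodedata

# A few marks from the table above, keyed by feature name (illustrative subset).
MARKS = {
    "short_stroke": "\u0335",
    "ogonek": "\u0328",
    "dot_below": "\u0323",
    "acute": "\u0301",
}

def combining_classes(text: str) -> list[tuple[str, int]]:
    """Return (code point label, canonical combining class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]

# The same cluster typed in two different mark orders.
typed_a = "o" + MARKS["acute"] + MARKS["dot_below"]   # above mark typed first
typed_b = "o" + MARKS["dot_below"] + MARKS["acute"]   # below mark typed first

# NFD applies canonical ordering, so both inputs collapse to one sequence.
assert unicodedata.normalize("NFD", typed_a) == unicodedata.normalize("NFD", typed_b)

print(combining_classes(unicodedata.normalize("NFD", typed_a)))
# [('U+006F', 0), ('U+0323', 220), ('U+0301', 230)]
```

Because the reordering happens inside normalization, the engine does not need its own sort step for marks of different classes.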
4) Multi-letter units (orthographic digraphs)
Script layer treats these as clusters; language modules may collapse them to single phonemes.
- Common sets: ch, sh, th, ph, gh, kh, ng, ny, dz, dzs, cz, sz, rz, lj, nj, dj, tj, qu, gu, rr, ll, lh, nh, gn.
- Language bindings (examples):
  - Portuguese: lh /ʎ/, nh /ɲ/, rr /ʁ/
  - Spanish: ch /t͡ʃ/, ll /ʎ/~/ʝ/, rr trill
  - Polish: cz /t͡ʂ/, sz /ʂ/, rz /ʐ/
  - Hungarian: gy /ɟ/, ty /c/, ny /ɲ/, dzs /d͡ʒ/, cs /t͡ʃ/, sz /s/, zs /ʒ/
  - Yoruba: “gb” is a single labial-velar stop /ɡ͡b/ (language rule)
5) Normalization, Casing, Collation
5.1 Normalization
- Accept: NFC/NFD input.
- Internal storage: NFD (base + combining) for uniform feature handling.
- Round-trip: guarantee NFC output if target requires precomposed forms (e.g., legacy stacks).
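A minimal sketch of the accept/store/emit policy, using Python's standard `unicodedata` module. The function names `to_internal` and `to_output` are assumptions chosen for illustration.

```python
import unicodedata

def to_internal(text: str) -> str:
    """Accept NFC or NFD input and store it as NFD (base + combining marks)."""
    return unicodedata.normalize("NFD", text)

def to_output(text: str) -> str:
    """Recompose to NFC for targets that expect precomposed forms."""
    return unicodedata.normalize("NFC", text)

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + combining acute

assert to_internal(precomposed) == decomposed
assert to_output(decomposed) == precomposed

# Not every cluster has a single precomposed form: o + dot below + grave
# recomposes only partially, to U+1ECD (o with dot below) + U+0300.
yoruba = to_output("o\u0323\u0300")
print([f"U+{ord(ch):04X}" for ch in yoruba])  # ['U+1ECD', 'U+0300']

# Stroked letters such as đ (U+0111) have no canonical decomposition,
# so NFD leaves them untouched.
assert to_internal("\u0111") == "\u0111"
```

The last two cases are the reason the round-trip guarantee is stated in terms of NFC output rather than "one code point per cluster".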
5.2 Casing rules (highlights)
- Turkish & Azeri dotted I: case map i ↔ İ, ı ↔ I (locale flag TR).
- ß (German): uppercase ẞ (modern) or SS (compatibility flag).
- Greek-in-Latin translits unaffected (out of scope here).
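Python's built-in `str.upper()` and `str.lower()` are not locale-aware, so the Turkish/Azeri and ß rules need an explicit layer. The following is a sketch under stated assumptions: the function names, the `locale` parameter values, and the `capital_eszett` flag are hypothetical, and input is recomposed to NFC first so that İ is a single code point.

```python
import unicodedata

# Locale-specific pairs applied before the default Unicode mapping.
TR_UPPER = {"i": "\u0130", "\u0131": "I"}   # i -> İ, ı -> I
TR_LOWER = {"\u0130": "i", "I": "\u0131"}   # İ -> i, I -> ı

def upper_locale(text: str, locale: str = "und") -> str:
    """Uppercase with Turkish/Azeri dotted-I handling when requested."""
    text = unicodedata.normalize("NFC", text)
    if locale in ("tr", "az"):
        text = "".join(TR_UPPER.get(ch, ch) for ch in text)
    return text.upper()

def lower_locale(text: str, locale: str = "und") -> str:
    """Lowercase with Turkish/Azeri dotted-I handling when requested."""
    text = unicodedata.normalize("NFC", text)
    if locale in ("tr", "az"):
        text = "".join(TR_LOWER.get(ch, ch) for ch in text)
    return text.lower()

def upper_german(text: str, capital_eszett: bool = True) -> str:
    """Uppercase ß as ẞ (modern) or let the default mapping produce SS."""
    if capital_eszett:
        text = text.replace("\u00df", "\u1e9e")
    return text.upper()

print(upper_locale("istanbul", "tr"))                   # İSTANBUL
print(upper_locale("istanbul"))                         # ISTANBUL
print(lower_locale("ISPARTA", "tr"))                    # ısparta
print(upper_german("stra\u00dfe"))                      # STRAẞE
print(upper_german("stra\u00dfe", capital_eszett=False))  # STRASSE
```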
5.3 Collation (sorting)
- Provide language-tailored collation profiles (CLDR-backed), not global. Script layer exposes hooks; language module selects profile.
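The hook idea can be sketched as a registry that maps a profile name to a sort-key function. This is only a stand-in: a real implementation would presumably delegate to a CLDR-backed collator (for example via ICU), and the names `PROFILES`, `root_key`, `es_key`, and `sort_words` are assumptions. The "root" key here is a crude approximation that compares base letters before diacritics.

```python
import unicodedata

def root_key(word: str) -> tuple[str, str]:
    """Rough root-like key: base letters first, full NFD string as tie-breaker."""
    nfd = unicodedata.normalize("NFD", word.casefold())
    base = "".join(ch for ch in nfd if not unicodedata.combining(ch))
    return (base, nfd)

ES_ALPHABET = "abcdefghijklmn\u00f1opqrstuvwxyz"  # ñ is its own letter after n

def es_key(word: str) -> tuple[int, ...]:
    """Spanish-like tailoring: ñ sorts after n; other accents are ignored."""
    ranks = []
    for ch in unicodedata.normalize("NFC", word.casefold()):
        base = ch if ch == "\u00f1" else unicodedata.normalize("NFD", ch)[0]
        pos = ES_ALPHABET.find(base)
        ranks.append(pos if pos >= 0 else len(ES_ALPHABET) + ord(base))
    return tuple(ranks)

# The script layer exposes the registry; a language module picks a profile.
PROFILES = {"root": root_key, "es": es_key}

def sort_words(words: list[str], profile: str = "root") -> list[str]:
    return sorted(words, key=PROFILES[profile])

words = ["nube", "\u00f1u", "nota", "oso"]
print(sort_words(words, "root"))  # ['nota', 'ñu', 'nube', 'oso']
print(sort_words(words, "es"))    # ['nota', 'nube', 'ñu', 'oso']
```

The two outputs differ only in where "ñu" lands, which is exactly the kind of difference a global (untailored) order cannot express.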
6) Feature Flags (script-level, language toggles)
Flag | Meaning | Typical languages |
---|---|---|
tone_marks | diacritics represent lexical tone | Yoruba, Vietnamese |
nasal_vowels | tilde or ogonek active | Portuguese, Polish |
vowel_length | macron/breve length | Latvian, Māori |
palatal_series | caron letters active | Czech, Slovak, Croatian |
front_rounded | ő, ű active | Hungarian |
digraph_as_letter | orthographic digraph treated as alpha unit | Spanish (historic Ch/Ll), Hungarian (dzs) |
dotless_i_locale | Turkish casing rules | TR, AZ |
click_support | extended symbols (if Latin used for clicks) | Zulu/Xhosa (language layer) |
7) Data Model (what every Latin-based language module inherits)
module: LatinGraphemicModule
aka: ["Latin Script Graphemic Module", "Literal–Graphemic Module — Latin Variant"]
version: "1.0"
units:
  base_letters: ["A".."Z","a".."z"] # full list
  extended_letters: ["Á","Ä","Ç","Đ","Ě","Ğ","İ","Ł","Ñ","Ő","Œ","Ø","Ś","Ş","Š","Ť","Ų","Ů","Ű","Ÿ","Ž", ...]
  combining_marks:
    - {mark: "◌́", feature: "acute"}
    - {mark: "◌̀", feature: "grave"}
    - {mark: "◌̂", feature: "circumflex"}
    - {mark: "◌̃", feature: "nasal"}
    - {mark: "◌̈", feature: "diaeresis"}
    - {mark: "◌̄", feature: "macron"}
    - {mark: "◌̆", feature: "breve"}
    - {mark: "◌̧", feature: "cedilla"}
    - {mark: "◌̨", feature: "ogonek"}
    - {mark: "◌̌", feature: "caron"}
    - {mark: "◌̋", feature: "double_acute"}
  digraphs: ["ch","sh","th","ph","gh","ng","ny","dz","dzs","cz","sz","rz","lj","nj","qu","gu","rr","ll","lh","nh","gn"]
normalization:
  input: ["NFC","NFD"]
  internal: "NFD"
  output_default: "NFC"
casing:
  default: "Unicode Simple/Full Case"
  locale_overrides:
    - {locale: "tr", rules: "dotted_I_mode"}
collation:
  default: "root"
  profiles: ["es", "pt", "pl", "hu", "ro", "yo", "vi", "de-modern", ...]
8) Language Binding Template (how English, Yoruba, Polish, etc. connect)
Each language module imports LatinGraphemicModule and declares its switches + mappings.
module: EnglishGraphemicModule
extends: LatinGraphemicModule
flags: {tone_marks: false, nasal_vowels: false, digraph_as_letter: false, dotless_i_locale: false}
grapheme_rules:
  - pattern: ["o","u"] # "ou"
    outputs:
      - {phoneme: "aʊ", conditions: {lexical_set: ["MOUTH"]}}
      - {phoneme: "ʌ", conditions: {lexical_set: ["STRUT_EXCEPTIONS"]}}
  # ...
module: YorubaGraphemicModule
extends: LatinGraphemicModule
flags: {tone_marks: true, nasal_vowels: true}
inventory_overrides:
  letters_add: ["Ẹ ẹ","Ọ ọ","Ṣ ṣ"]
tone_marks:
  H: "◌́"
  L: "◌̀"
  M: none
nasalization:
  strategy: "orthographic_vowel_set" # ã, ẹ̃, ọ̃ via combining tilde
module: PolishGraphemicModule
extends: LatinGraphemicModule
flags: {nasal_vowels: true, palatal_series: true}
inventory_overrides:
  letters_add: ["Ą ą","Ę ę","Ł ł","Ń ń","Ó ó","Ś ś","Ź ź","Ż ż","Ć ć"]
module: HungarianGraphemicModule
extends: LatinGraphemicModule
flags: {front_rounded: true, digraph_as_letter: true}
inventory_overrides:
  letters_add: ["Á É Í Ó Ö Ő Ú Ü Ű á é í ó ö ő ú ü ű"]
alphabet_order:
  treat_as_single: ["cs","dz","dzs","gy","ly","ny","sz","ty","zs"]
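One way the `extends` relationship might be realized in code is shown below. It is a hypothetical sketch: the class `GraphemicModule`, its field names, and the `extend` method are assumptions, with flag names taken from the table in §6. A child module inherits every flag from its parent and overrides only what it declares.

```python
from dataclasses import dataclass, field

# All script-level flags default to off; language modules switch them on.
DEFAULT_FLAGS = {name: False for name in (
    "tone_marks", "nasal_vowels", "vowel_length", "palatal_series",
    "front_rounded", "digraph_as_letter", "dotless_i_locale", "click_support",
)}

@dataclass(frozen=True)
class GraphemicModule:
    name: str
    flags: dict = field(default_factory=lambda: dict(DEFAULT_FLAGS))
    letters_add: tuple = ()
    treat_as_single: tuple = ()

    def extend(self, name, flags=None, letters_add=(), treat_as_single=()):
        """Return a child module: inherited values plus declared overrides."""
        flags = flags or {}
        unknown = set(flags) - set(self.flags)
        if unknown:
            raise KeyError(f"unknown flags: {sorted(unknown)}")
        return GraphemicModule(
            name=name,
            flags={**self.flags, **flags},
            letters_add=self.letters_add + tuple(letters_add),
            treat_as_single=self.treat_as_single + tuple(treat_as_single),
        )

LATIN = GraphemicModule("LatinGraphemicModule")
HUNGARIAN = LATIN.extend(
    "HungarianGraphemicModule",
    flags={"front_rounded": True, "digraph_as_letter": True},
    treat_as_single=["cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"],
)

print(HUNGARIAN.flags["front_rounded"])  # True
print(HUNGARIAN.flags["tone_marks"])     # False (inherited default)
```

Rejecting unknown flag names at extension time catches typos in a language module before they silently leave a feature switched off.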
9) Minimal Implementation Rules (engine)
- Normalize input to NFD, which also puts combining marks in canonical order.
- Tokenize into grapheme clusters (base + combining).
- Identify digraphs (language-selected list; greedy longest-first).
- Apply language mapping (phonemes/features); this module supplies only script mechanics.
- Recompose to NFC for rendering if needed; apply casing/locale; apply language collation for sort or lookup.
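The script-level steps above (normalize, cluster, merge digraphs, recompose) can be sketched with the standard library alone. This is a simplified approximation: clusters are formed as one base character plus any following characters with a non-zero combining class, which ignores joiners and format characters, and the function names are assumptions.

```python
import unicodedata

def clusters(text: str) -> list[str]:
    """NFD-normalize, then group each base character with its combining marks."""
    out: list[str] = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and out:
            out[-1] += ch          # attach mark to the preceding base
        else:
            out.append(ch)         # start a new cluster
    return out

def merge_digraphs(units: list[str], digraphs: list[str]) -> list[str]:
    """Greedy longest-first merge of bare (unmarked) clusters into digraphs."""
    ordered = sorted(digraphs, key=len, reverse=True)
    out: list[str] = []
    i = 0
    while i < len(units):
        for d in ordered:
            window = units[i:i + len(d)]
            # A cluster carrying marks makes the joined string longer, so it never matches.
            if len(window) == len(d) and "".join(window).lower() == d:
                out.append("".join(window))
                i += len(d)
                break
        else:
            out.append(units[i])
            i += 1
    return out

def process(text: str, digraphs: list[str]) -> list[str]:
    """Run the script-level steps and recompose each unit to NFC for display."""
    merged = merge_digraphs(clusters(text), digraphs)
    return [unicodedata.normalize("NFC", u) for u in merged]

HUNGARIAN_UNITS = ["cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"]
print(process("sz\u0151r", HUNGARIAN_UNITS))    # ['sz', 'ő', 'r']
print(process("dzsem", HUNGARIAN_UNITS))        # ['dzs', 'e', 'm']
print(process("cora\u00e7\u00e3o", []))         # ['c', 'o', 'r', 'a', 'ç', 'ã', 'o']
```

Language mapping, casing, and collation are deliberately left out; they would consume the list this function returns.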
10) Smoke Tests (cross-language)
- Portuguese: “coração” = c|o|r|a|ç|ã|o → [k][o][ɾ][a][s][ɐ̃][w] (language layer)
- Polish: “ręką” = r|ę|k|ą → nasal vowels preserved via ogonek + context rules (language layer)
- Yoruba: “ọ̀rọ̀” = ọ + grave | r | ọ + grave → low tone (grave) + vowel quality (dot below) features (language layer)
- Hungarian: “szőr” = sz + ő + r → sz=/s/, ő=front rounded long (language layer)
- Vietnamese: stacked marks handled by ordering: a + circumflex + acute = ấ (language layer decides tones/phonation)
11) Minting Checklist (Latin Script Module)
- ✅ Base ASCII set
- ✅ Combining mark table + ordering
- ✅ Extended letters registry (expandable list)
- ✅ Digraph registry (language-selectable)
- ✅ Normalization/casing/collation hooks
- ✅ Data model with extends interface
- ✅ Smoke tests across Romance, Slavic, Niger-Congo, Uralic, and Austroasiatic (Vietnamese) orthographies