Latin Script

This module is the base layer that every Latin-script language module builds on (English, Spanish, Portuguese, Polish, Yoruba, Vietnamese, etc.). Think of it as the circuit board for all Latin-script languages: it supplies the base glyphs, combining logic, diacritics, and feature flags that language-specific modules switch on or off.

1) Orientation

Script: Latin (Unicode blocks: Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, Latin Extended Additional, etc.)
Direction: LTR
Unit of analysis: Grapheme cluster = (Base letter) + (Combining marks) + (Joiners/format)
Separation of concerns:
- Script layer (this doc): what shapes exist and how they cluster.
- Language layer (e.g., English GM, Yoruba GM): how clusters map to phonemes, stress, tone, morphology.
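
The unit of analysis above can be sketched in a few lines of Python. This is a simplified illustration, not full Unicode text segmentation (UAX #29): it assumes that every character in a Mark category, plus the two joiners ZWNJ/ZWJ, attaches to the preceding base. The function name `grapheme_clusters` and the `JOINERS` set are hypothetical names chosen for this example.

```python
import unicodedata

# Assumption: only ZWNJ and ZWJ are treated as "joiners/format" here.
JOINERS = {"\u200c", "\u200d"}

def grapheme_clusters(text: str) -> list[str]:
    """Split text into (base letter + combining marks + joiners) clusters."""
    clusters: list[str] = []
    for ch in unicodedata.normalize("NFD", text):
        attaches = unicodedata.category(ch).startswith("M") or ch in JOINERS
        if clusters and attaches:
            clusters[-1] += ch      # extend the current cluster
        else:
            clusters.append(ch)     # start a new cluster at a base character
    return clusters

# Polish "ręką": four clusters, two of which carry an ogonek.
print(grapheme_clusters("r\u0119k\u0105"))
```

Because the input is decomposed first, precomposed and combining spellings of the same word yield identical clusters.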

2) Base Inventory

2.1 Core letters (ASCII)

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z

2.2 Common Latin extensions (selection; all supported)

Vowel diacritics (precomposed exemplars): á à â ã ä å æ é è ê ë í ì î ï ó ò ô õ ö œ ú ù û ü ý ÿ ă ĕ ė ę ě ő ű ȁ ȅ ȇ ȉ ȋ ȍ ȏ ȕ ȗ
Consonant specials: ç ð þ ß ŋ ŕ ř ś ş š ţ ť ẞ ƀ ƃ ƒ ɣ ɲ ɬ ɮ ȷ ʋ
Ogonek (nasalization/phonotactic hooks): ą ę į ų
Stroke/cross: đ ħ ł ø ǿ ɫ ȡ ɍ ɟ
Carons & breves (Slavic/Baltic): č ď ě ň ř š ť ž ă ĭ ŏ ŭ
Dotted/dotless i and breve (Turkic): ı (dotless i), İ (dotted I), ğ (soft g), Ğ
Tones (for orthographies using combining marks): precomposed coverage is limited; combining sequences are recommended (see §3).

Policy: Prefer combining sequences (NFD) in storage, allow precomposed (NFC) in interchange. Normalize internally (see §5).

3) Combining Mark Set (canonical)

Use these to compose any Latin grapheme cluster. Treat each as a feature flag on the base letter.

Mark	Unicode	Feature	Notes
◌́	U+0301	`acute`	stress/quality (é), tone high in some systems
◌̀	U+0300	`grave`	tone low / quality
◌̂	U+0302	`circumflex`	height/ATR/length depending on lang
◌̃	U+0303	`nasal`	nasal vowel (ã, õ)
◌̈	U+0308	`diaeresis`	vowel separation (ü), fronting
◌̄	U+0304	`macron`	length/ATR (ā)
◌̆	U+0306	`breve`	shortness (ă)
◌̇	U+0307	`dot_above`	ė, ż; Turkish İ decomposes to I + dot above (see casing rules, §5.2)
◌̣	U+0323	`dot_below`	Yoruba open vowels and ṣ (ẹ, ọ, ṣ); Indo-Aryan translits
◌̋	U+030B	`double_acute`	Hungarian long front rounded (ő, ű)
◌̧	U+0327	`cedilla`	softening (ç)
◌̨	U+0328	`ogonek`	nasalization (ą, ę)
◌̊	U+030A	`ring`	(å)
◌̌	U+030C	`caron`	palatal/“soft” series (č, š, ž)
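◌̵	U+0335	`short_stroke`	phonemic stroke (đ); stroked letters have no canonical decomposition, so use the precomposed letter
◌̃̄	combo	`nasal` + `length`	supported stacking (e.g., Livonian ȭ); Vietnamese stacks marks in the same way (ấ)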

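Stacking order rule (storage and rendering): the base letter always comes first, followed by its marks in Unicode canonical order: overlay/stroke (combining class 1) → attached-below marks (cedilla, ogonek; class 202) → below marks (dot_below; class 220) → above marks (acute, grave, circumflex; class 230). Marks of the same class keep their typed order, which renders from the base outward. NFD normalization enforces this order.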
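
As a quick check of the ordering rule, the sketch below reads each mark's canonical combining class from Python's standard `unicodedata` module and shows NFD reordering a sequence typed in the "wrong" order. The `MARKS` dictionary is an illustrative subset of the table above, not part of the module.

```python
import unicodedata

# Illustrative subset of the combining mark table.
MARKS = {
    "short_stroke": "\u0335",
    "cedilla": "\u0327",
    "ogonek": "\u0328",
    "dot_below": "\u0323",
    "acute": "\u0301",
}

for name, mark in MARKS.items():
    ccc = unicodedata.combining(mark)   # canonical combining class
    print(f"{name:12} U+{ord(mark):04X} ccc={ccc}")

# Acute typed before dot_below: NFD moves the lower class (below) first.
typed = "o\u0301\u0323"
assert unicodedata.normalize("NFD", typed) == "o\u0323\u0301"

# Two above marks share class 230, so their typed order is preserved.
stacked = "a\u0302\u0301"               # a + circumflex + acute
assert unicodedata.normalize("NFD", stacked) == stacked
```

Since same-class marks are never reordered, a + circumflex + acute and a + acute + circumflex stay distinct sequences; the language layer has to pick one spelling.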

4) Multi-letter units (orthographic digraphs)

The script layer treats each of these as a sequence of clusters that can be grouped into one unit; language modules may collapse them to single phonemes.

Common sets (mostly digraphs; dzs is a trigraph): ch, sh, th, ph, gh, kh, ng, ny, dz, dzs, cz, sz, rz, lj, nj, dj, tj, qu, gu, rr, ll, lh, nh, gn.
Language bindings (examples):
- Portuguese: lh /ʎ/, nh /ɲ/, rr /ʁ/
- Spanish: ch /t͡ʃ/, ll /ʎ/~/ʝ/, rr trill
- Polish: cz /t͡ʂ/, sz /ʂ/, rz /ʐ/
- Hungarian: gy /ɟ/, ty /c/, ny /ɲ/, dzs /d͡ʒ/, cs /t͡ʃ/, sz /s/, zs /ʒ/
- Yoruba: gb /ɡ͡b/, a voiced labial-velar plosive written as a digraph (language rule)
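
A minimal sketch of how a language-selected multigraph list might be applied, assuming greedy longest-first matching on NFC text. The function name `split_units` and the set names are hypothetical. Real orthographies have exceptions (for instance, a letter pair that straddles a morpheme boundary), which this sketch leaves to the language layer.

```python
def split_units(word: str, multigraphs: set[str]) -> list[str]:
    """Greedy longest-first split of an NFC word into letters and multigraphs."""
    by_length = sorted(multigraphs, key=len, reverse=True)
    units: list[str] = []
    i = 0
    while i < len(word):
        for mg in by_length:
            # Compare case-insensitively but keep the original spelling.
            if word[i:i + len(mg)].lower() == mg:
                units.append(word[i:i + len(mg)])
                i += len(mg)
                break
        else:
            units.append(word[i])   # no multigraph starts here
            i += 1
    return units

HUNGARIAN = {"cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"}
PORTUGUESE = {"lh", "nh", "rr"}

print(split_units("dzsem", HUNGARIAN))   # "dzs" wins over "dz"
print(split_units("filho", PORTUGUESE))
```

Longest-first matters for Hungarian: without it, "dzsem" would split as dz + s instead of the trigraph dzs.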

5) Normalization, Casing, Collation

5.1 Normalization

Accept: NFC/NFD input.
Internal storage: NFD (base + combining) for uniform feature handling.
Round-trip: guarantee NFC output if target requires precomposed forms (e.g., legacy stacks).
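
The round trip can be illustrated with Python's standard `unicodedata.normalize`. The helper names `to_internal` and `to_output` are assumptions made for this sketch. It also shows a caveat: NFC output can still contain combining marks when no precomposed code point exists for the full stack.

```python
import unicodedata

def to_internal(text: str) -> str:
    """Storage form: base + combining marks (NFD)."""
    return unicodedata.normalize("NFD", text)

def to_output(text: str) -> str:
    """Interchange form: precomposed where Unicode provides it (NFC)."""
    return unicodedata.normalize("NFC", text)

e_acute = "\u00e9"                                   # é, precomposed
assert to_internal(e_acute) == "e\u0301"             # e + combining acute
assert to_output(to_internal(e_acute)) == e_acute    # lossless round trip

# Yoruba e + dot below + grave: NFC composes e + dot below (U+1EB9),
# but the grave stays a combining mark because no single code point exists.
assert to_output("e\u0323\u0300") == "\u1eb9\u0300"
```

A legacy target that accepts only precomposed characters therefore needs an explicit fallback for such stacks; normalization alone does not remove them.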

5.2 Casing rules (highlights)

Turkish & Azeri dotted I: case map i ↔ İ, ı ↔ I (locale flag TR/AZ).
ß (German): uppercase SS (Unicode default full case mapping) or ẞ (modern capital form, opt-in flag).
Greek-in-Latin translits unaffected (out of scope here).
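
A hedged sketch of the two casing highlights, built on Python's built-in `str.upper` and `str.lower`, which follow the Unicode default mappings and know nothing about locales. The function names, the `TURKIC_LOCALES` set, and the `capital_sharp_s` parameter are hypothetical. The sketch assumes NFC input, because in NFD the letter İ is stored as I + combining dot above and would need separate handling.

```python
TURKIC_LOCALES = {"tr", "az"}   # assumption: locale codes that enable dotted-I rules

def upper_for_locale(text: str, locale: str = "") -> str:
    if locale in TURKIC_LOCALES:
        # i -> İ (U+0130) and ı (U+0131) -> I before the default mapping runs
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()

def lower_for_locale(text: str, locale: str = "") -> str:
    if locale in TURKIC_LOCALES:
        # İ -> i and I -> ı before the default mapping runs
        text = text.replace("\u0130", "i").replace("I", "\u0131")
    return text.lower()

def upper_german(text: str, capital_sharp_s: bool = False) -> str:
    if capital_sharp_s:
        text = text.replace("\u00df", "\u1e9e")   # ß -> ẞ
    return text.upper()                            # default maps ß -> SS

print(upper_for_locale("izmir", "tr"))
print(upper_german("stra\u00dfe"))
```

Doing the locale-specific replacements first keeps the default mapping from ever seeing the four affected letters.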

5.3 Collation (sorting)

Provide language-tailored collation profiles (CLDR-backed), not global. Script layer exposes hooks; language module selects profile.
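
To show what a tailored profile changes, here is a small hand-written sort key. It is only a stand-in: a production system would take its rules from CLDR through a collation library, which this sketch does not use. The alphabet list, `make_sort_key`, and the fallback for letters outside the alphabet are all assumptions for illustration.

```python
import unicodedata

# Traditional Spanish order, with ch and ll as separate letters (historic).
SPANISH_TRADITIONAL = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "\u00f1", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]

def make_sort_key(alphabet: list[str]):
    rank = {unit: i for i, unit in enumerate(alphabet)}
    longest = max(len(unit) for unit in alphabet)

    def key(word: str) -> list[int]:
        word = unicodedata.normalize("NFC", word).lower()
        ranks: list[int] = []
        i = 0
        while i < len(word):
            for size in range(longest, 0, -1):       # longest unit first
                unit = word[i:i + size]
                if unit in rank:
                    ranks.append(rank[unit])
                    i += len(unit)
                    break
            else:
                # Unknown letter: fall back to its base letter (á -> a).
                base = unicodedata.normalize("NFD", word[i])[0]
                ranks.append(rank.get(base, len(alphabet) + ord(base)))
                i += 1
        return ranks

    return key

words = ["dama", "chico", "llama", "cosa", "luz"]
print(sorted(words))                                         # code point order
print(sorted(words, key=make_sort_key(SPANISH_TRADITIONAL)))  # tailored order
```

Under the tailored key, "cosa" sorts before "chico" and "luz" before "llama", because ch and ll rank after c and l as whole letters.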

6) Feature Flags (script-level, language toggles)

Flag	Meaning	Typical languages
`tone_marks`	diacritics represent lexical tone	Yoruba, Vietnamese
`nasal_vowels`	tilde (◌̃) or ogonek active	Portuguese, Polish
`vowel_length`	macron/breve length	Latvian, Māori
`palatal_series`	caron letters active	Czech, Slovak, Croatian
`front_rounded`	ő, ű active	Hungarian
`digraph_as_letter`	orthographic digraph treated as alpha unit	Spanish (historic Ch/Ll), Hungarian (dzs)
`dotless_i_locale`	Turkish casing rules	TR, AZ
`click_support`	click letters; extended symbols (ǀ ǁ ǃ ǂ) only where the orthography uses them	Zulu/Xhosa (plain c, q, x; language layer)

7) Data Model (what every Latin-based language module inherits)

module: LatinGraphemicModule
aka: ["Latin Script Graphemic Module", "Literal–Graphemic Module — Latin Variant"]
version: "1.0"
units:
  base_letters: ["A-Z", "a-z"]             # ranges; expand to the full list
  extended_letters: ["Á", "Ä", "Ç", "Đ", "Ě", "Ğ", "İ", "Ł", "Ñ", "Ő", "Œ", "Ø", "Ś", "Ş", "Š", "Ť", "Ų", "Ů", "Ű", "Ÿ", "Ž"]   # ...
  combining_marks:
    - {mark: "◌́", feature: "acute"}
    - {mark: "◌̀", feature: "grave"}
    - {mark: "◌̂", feature: "circumflex"}
    - {mark: "◌̃", feature: "nasal"}
    - {mark: "◌̈", feature: "diaeresis"}
    - {mark: "◌̄", feature: "macron"}
    - {mark: "◌̆", feature: "breve"}
    - {mark: "◌̧", feature: "cedilla"}
    - {mark: "◌̨", feature: "ogonek"}
    - {mark: "◌̌", feature: "caron"}
    - {mark: "◌̋", feature: "double_acute"}
    # remaining marks from the §3 table
    - {mark: "◌̇", feature: "dot_above"}
    - {mark: "◌̣", feature: "dot_below"}
    - {mark: "◌̊", feature: "ring"}
    - {mark: "◌̵", feature: "short_stroke"}
  digraphs: ["ch", "sh", "th", "ph", "gh", "kh", "ng", "ny", "dz", "dzs", "cz", "sz", "rz", "lj", "nj", "dj", "tj", "qu", "gu", "rr", "ll", "lh", "nh", "gn"]
normalization:
  input: ["NFC","NFD"]
  internal: "NFD"
  output_default: "NFC"
casing:
  default: "Unicode Simple/Full Case"
  locale_overrides:
    - {locale: "tr", rules: "dotted_I_mode"}
    - {locale: "az", rules: "dotted_I_mode"}
collation:
  default: "root"
  profiles: ["es", "pt", "pl", "hu", "ro", "yo", "vi", "de-modern"]   # ...

8) Language Binding Template (how English, Yoruba, Polish, etc. connect)

Each language module imports LatinGraphemicModule and declares its switches + mappings.

module: EnglishGraphemicModule
extends: LatinGraphemicModule
flags: {tone_marks: false, nasal_vowels: false, digraph_as_letter: false, dotless_i_locale: false}
grapheme_rules:
  - pattern: ["o", "u"]         # "ou"
    outputs:
      - {phoneme: "aʊ", conditions: {lexical_set: ["MOUTH"]}}
      - {phoneme: "ʌ",  conditions: {lexical_set: ["STRUT_EXCEPTIONS"]}}
# ...

module: YorubaGraphemicModule
extends: LatinGraphemicModule
flags: {tone_marks: true, nasal_vowels: true}
inventory_overrides:
  letters_add: ["Ẹ ẹ", "Ọ ọ", "Ṣ ṣ"]
tone_marks:
  H: "◌́"
  L: "◌̀"
  M: null                              # mid tone is left unmarked
nasalization:
  strategy: "orthographic_vowel_set"   # standard orthography writes nasal vowels as vowel + n (an, ẹn, in, ọn, un)

module: PolishGraphemicModule
extends: LatinGraphemicModule
flags: {nasal_vowels: true, palatal_series: true}   # Polish marks its palatal series with acute (ć, ń, ś, ź), not caron
inventory_overrides:
  letters_add: ["Ą ą", "Ę ę", "Ł ł", "Ń ń", "Ó ó", "Ś ś", "Ź ź", "Ż ż", "Ć ć"]

module: HungarianGraphemicModule
extends: LatinGraphemicModule
flags: {front_rounded: true, digraph_as_letter: true}
inventory_overrides:
  letters_add: ["Á á", "É é", "Í í", "Ó ó", "Ö ö", "Ő ő", "Ú ú", "Ü ü", "Ű ű"]
alphabet_order:
  treat_as_single: ["cs","dz","dzs","gy","ly","ny","sz","ty","zs"]
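
One way to read the `extends` relationship is sketched below: a child module overrides only the flags it names and appends to the inherited inventory. The class `GraphemicModule`, its fields, and its two methods are hypothetical names for this illustration and are not defined by the template itself.

```python
import string
from dataclasses import dataclass, field

FLAG_NAMES = [
    "tone_marks", "nasal_vowels", "vowel_length", "palatal_series",
    "front_rounded", "digraph_as_letter", "dotless_i_locale", "click_support",
]

@dataclass
class GraphemicModule:
    name: str
    parent: "GraphemicModule | None" = None
    flags: dict[str, bool] = field(default_factory=dict)
    letters_add: list[str] = field(default_factory=list)

    def resolved_flags(self) -> dict[str, bool]:
        """Parent flags first, then this module's overrides."""
        inherited = self.parent.resolved_flags() if self.parent else {}
        return {**inherited, **self.flags}

    def inventory(self) -> list[str]:
        """Inherited letters followed by this module's additions."""
        inherited = self.parent.inventory() if self.parent else []
        return inherited + self.letters_add

latin = GraphemicModule(
    name="LatinGraphemicModule",
    flags={flag: False for flag in FLAG_NAMES},   # assumption: all off by default
    letters_add=list(string.ascii_uppercase + string.ascii_lowercase),
)
hungarian = GraphemicModule(
    name="HungarianGraphemicModule",
    parent=latin,
    flags={"front_rounded": True, "digraph_as_letter": True},
    letters_add=["\u0150", "\u0151", "\u0170", "\u0171"],   # Ő ő Ű ű
)

print(hungarian.resolved_flags())
```

Resolving flags on demand, rather than copying them at construction, means a later change to the base module is picked up by every language module that extends it.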

9) Minimal Implementation Rules (engine)

Normalize input to NFD, which also puts combining marks in canonical order.
Tokenize into grapheme clusters (base + combining).
Identify digraphs (language-selected list; greedy longest-first).
Apply language mapping (phonemes/features); this module supplies only script mechanics.
Recompose to NFC for rendering if needed; apply casing/locale; apply language collation for sort or lookup.
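
The rules above can be strung together roughly as follows. This is a sketch under stated assumptions, not the engine itself: `script_pipeline` is a hypothetical name, the language mapping step is left out because it belongs to the language module, and casing and collation are not applied. Multigraphs are matched over whole clusters, so a letter that carries a mark is never absorbed into a multigraph.

```python
import unicodedata

def script_pipeline(text: str, multigraphs: frozenset[str] = frozenset()) -> list[str]:
    # Normalize to NFD, then attach each combining mark to its base letter.
    clusters: list[str] = []
    for ch in unicodedata.normalize("NFD", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)

    # Group adjacent clusters into multigraphs, greedy longest-first.
    sizes = sorted({len(m) for m in multigraphs}, reverse=True)
    units: list[str] = []
    i = 0
    while i < len(clusters):
        for size in sizes:
            chunk = clusters[i:i + size]
            if "".join(chunk).lower() in multigraphs:
                units.append("".join(chunk))
                i += len(chunk)
                break
        else:
            units.append(clusters[i])
            i += 1

    # Language mapping would happen here (language module, not shown).
    # Recompose each unit to NFC for rendering.
    return [unicodedata.normalize("NFC", unit) for unit in units]

HUNGARIAN = frozenset({"cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"})
print(script_pipeline("sz\u0151r", HUNGARIAN))       # Hungarian "szőr"
print(script_pipeline("\u1ecd\u0300r\u1ecd\u0300"))  # Yoruba "ọ̀rọ̀"
```

For the Yoruba word, each returned vowel is still two code points after NFC (ọ plus a combining grave), since Unicode has no precomposed letter for that stack.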

10) Smoke Tests (cross-language)

Portuguese: “coração” = c|o|r|a|ç|ã|o → [k][o][ɾ][a][s][ɐ̃][w] (language layer)
Polish: “ręką” = r|ę|k|ą → nasal vowels preserved via ogonek + context rules (language layer)
Yoruba: “ọ̀rọ̀” = ọ + grave | r | ọ + grave → low tone + open-vowel (dot below) features (language layer)
Hungarian: “szőr” = sz + ő + r → sz=/s/, ő=front rounded long (language layer)
Vietnamese: stacked marks handled by ordering: a + circumflex + acute = ấ (language layer decides tones/phonation)

11) Minting Checklist (Latin Script Module)

✅ Base ASCII set
✅ Combining mark table + ordering
✅ Extended letters registry (expandable list)
✅ Digraph registry (language-selectable)
✅ Normalization/casing/collation hooks
✅ Data model with extends interface
✅ Smoke tests across Romance/Slavic/Niger-Congo/Uralic/Austroasiatic Latin-script systems
