Arabic Letter Module (ArLM)

Factory Blueprint


A. Purpose & Scope

ArLM (Arabic Letter Module) governs the 28 core Arabic letters as interoperable units:

  • each with geometric form (isolated/initial/medial/final),
  • joining behavior (right-joining vs dual-joining),
  • phoneme/IPA (and dialectal notes),
  • etymological lineage (Phoenician → Aramaic → Arabic; crosswalks to Greek/Latin),
  • Abjad value (traditional numerals),
  • diacritic & ligature logic (hamza, shadda, harakāt; special lam-alef),
  • and morphemic affordances (root compatibility, common prefixes/suffixes).

Mantra: Letter → Grapheme → Phoneme → Morpheme → Meaning.
Shape is signal; signal becomes sense.


B. Data Model (per-letter record)

Every letter is a record with the following fields (these become JSON schema later):

  • name_ar (Arabic letter name, spelled in Arabic), name_lat (DIN/ISO transliteration), aliases
  • codepoints (base + presentation forms), forms (isolated/initial/medial/final)
  • joining_type (right_only | dual)
  • phoneme (IPA), allophones (dialectal)
  • abjad_value (1…1000), order (Hijā’ order index)
  • lineage (Phoenician/Aramaic root; Greek/Latin correspondences)
  • diacritic_rules (harakāt allowed; hamza hosting; shadda behavior)
  • ligature_rules (e.g., lam-alef), confusable_notes (GLM)
  • morph_notes (root roles, affix roles), examples (canonical words)
  • safety (RTL/bidi guidance), accessibility (ASCII fallback/voice hints)
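
As a rough illustration of the record above, the sketch below models a subset of the fields as a Python dataclass with basic range checks. The class name LetterRecord, the chosen subset of fields, and the validation rules are assumptions made for illustration; the eventual JSON schema may differ.

```python
from dataclasses import dataclass, field, asdict
import json

JOINING_TYPES = ("right_only", "dual")


@dataclass
class LetterRecord:
    """Hypothetical in-memory form of one per-letter record (subset of fields)."""
    name_ar: str
    name_lat: str
    codepoints: list
    joining_type: str
    phoneme: str
    abjad_value: int
    order: int
    lineage: str = ""
    allophones: list = field(default_factory=list)
    examples: list = field(default_factory=list)

    def __post_init__(self):
        # Assumed checks: joining type is one of the two listed values,
        # Abjad value lies in 1..1000, order index lies in 1..28.
        if self.joining_type not in JOINING_TYPES:
            raise ValueError(f"unknown joining_type: {self.joining_type!r}")
        if not 1 <= self.abjad_value <= 1000:
            raise ValueError(f"abjad_value out of range: {self.abjad_value}")
        if not 1 <= self.order <= 28:
            raise ValueError(f"order out of range: {self.order}")


# Seed record for qāf, using the values given in this blueprint.
qaf = LetterRecord(
    name_ar="قاف",
    name_lat="qāf",
    codepoints=["U+0642"],
    joining_type="dual",
    phoneme="/q/",
    abjad_value=100,
    order=21,
    lineage="qoph → koppa/Q",
    allophones=["/g/"],
)

# ensure_ascii=False keeps the Arabic name readable in the serialized form.
qaf_json = json.dumps(asdict(qaf), ensure_ascii=False)
```

Serializing through asdict keeps the record close to the JSON shape it is meant to become.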

C. The 28-Letter Inventory (concise, canonical)

Key: Letter (name) — IPA — Join — Abjad — lineage (rough Phoenician → Greek/Latin)

  1. ا (ʾalif) — /ʔ/; mater of ā — right_only — 1 — ʼaleph → alpha/A
  2. ب (bāʾ) — /b/ — dual — 2 — beth → beta/B
  3. ت (tāʾ) — /t/ — dual — 400 — tav → tau/T
  4. ث (thāʾ) — /θ/ — dual — 500 — (taw with dot lineage) → theta (sound kin)
  5. ج (jīm) — /dʒ/ (MENA) ~ /ʒ/ (Maghreb) — dual — 3 — gimel → gamma/G (shape diverged)
  6. ح (ḥāʾ) — /ħ/ — dual — 8 — ḥet → (Greek eta’s ancestor by path)
  7. خ (khāʾ) — /x ~ χ/ — dual — 600 — ḥet-derived (dot)
  8. د (dāl) — /d/ — right_only — 4 — dalet → delta/D
  9. ذ (dhāl) — /ð/ — right_only — 700 — dalet-derived (dot)
  10. ر (rāʾ) — /r/ (tap/trill) — right_only — 200 — resh → rho/R
  11. ز (zāy) — /z/ — right_only — 7 — zayin → zeta/Z
  12. س (sīn) — /s/ — dual — 60 — samekh/šīn split ancestry → sigma/S (distant)
  13. ش (shīn) — /ʃ/ — dual — 300 — šin → (Greek sigma/san via Phoenician)
  14. ص (ṣād) — /sˤ/ — dual — 90 — ṣade → (Latin S distant kin)
  15. ض (ḍād) — /dˤ/ — dual — 800 — Arabic innovation (emphatic d)
  16. ط (ṭāʾ) — /tˤ/ — dual — 9 — ṭet → (Greek theta distant)
  17. ظ (ẓāʾ) — /zˤ/ or /ðˤ/ — dual — 900 — ṭāʾ-derived (dot; emphatic)
  18. ع (ʿayn) — /ʕ/ — dual — 70 — ʿayin → omicron/O (vocalic shift path)
  19. غ (ghayn) — /ɣ/ — dual — 1000 — ʿayin-derived (dot)
  20. ف (fāʾ) — /f/ — dual — 80 — pe → pi/P (Arabic /f/ from earlier /p/)
  21. ق (qāf) — /q/ (uvular; /g/ in some dialects) — dual — 100 — qoph → koppa → Q
  22. ك (kāf) — /k/ — dual — 20 — kaph → kappa/K
  23. ل (lām) — /l/ — dual — 30 — lamed → lambda/L
  24. م (mīm) — /m/ — dual — 40 — mem → mu/M
  25. ن (nūn) — /n/ — dual — 50 — nun → nu/N
  26. ه (hāʾ) — /h/ — dual — 5 — he → epsilon/E
  27. و (wāw) — /w/; mater of ū/ō — right_only — 6 — waw → upsilon/W/V (via Latin)
  28. ي (yāʾ) — /j/; mater of ī/ē — dual — 10 — yod → iota/I, J (Latin bifurcation)
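
The Abjad column above can be exercised with a small lookup table. The sketch below copies the 28 values from the inventory and sums them over a word; the function name abjad_sum and the choice to skip any character outside the 28 core letters (diacritics, hamza forms, tāʾ marbūṭa) are assumptions for illustration, not a rule this blueprint states.

```python
# Abjad values for the 28 core letters, as listed in the inventory above.
ABJAD = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7, "ح": 8, "ط": 9,
    "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50, "س": 60, "ع": 70, "ف": 80,
    "ص": 90, "ق": 100, "ر": 200, "ش": 300, "ت": 400, "ث": 500, "خ": 600,
    "ذ": 700, "ض": 800, "ظ": 900, "غ": 1000,
}


def abjad_sum(word: str) -> int:
    """Sum the Abjad values of the core letters in `word`.

    Assumption: characters outside the 28 core letters are ignored.
    """
    return sum(ABJAD[ch] for ch in word if ch in ABJAD)


# قال = qāf (100) + alif (1) + lām (30)
total = abjad_sum("قال")  # 131
```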

Auxiliary signs (governed but not in the 28 core)

  • ء (hamza) — glottal stop, carrier-dependent (above or below ا, above و, or on the dotless-yāʾ chair ئ); non-joining when written standalone; orthographic ruleset required.
  • ة (tāʾ marbūṭa) — right_only, feminine morpheme; pronounced /a(h)/ in pause and /at/ in construct; not a root letter.
  • ى (alif maqṣūra) — final ā written with yāʾ shape without dots; right_only; maps to ا semantically.
  • لا (lam-alef ligature) — contextual ligature of ل+ا; GLM handles shaping/normalization.

D. Joining & Shaping Rules (GLM contract)

  • Right-only (non-connecting to left): ا د ذ ر ز و (+ ى/ة contextually).
  • Dual-joining: all others.
  • Diacritics: fatḥa ◌َ, ḍamma ◌ُ, kasra ◌ِ, sukūn ◌ْ, shadda ◌ّ, tanwīn (◌ً ◌ٌ ◌ٍ).
  • Hamza placement: on ا و ي or standalone (ء) per surrounding vowels; normalization logs carrier & seat.
  • Required ligatures: only lam-alef families; stylistic ligatures are optional and must not change text semantics.
  • Bidi safety: Arabic is RTL; numbers default LTR; ArLM emits bidi guidance (avoids invisible overrides unless whitelisted).
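
The joining rules above are enough to pick a contextual form for each letter. The following is a minimal sketch, assuming undiacritized input drawn only from the 28 core letters plus ى and ة; the function names are hypothetical, and a real shaper must also treat diacritics as transparent and handle hamza and lam-alef.

```python
# Letters that join on both sides.
DUAL = set("بتثجحخسشصضطظعغفقكلمنهي")
# Letters that join only to the preceding letter.
RIGHT_ONLY = set("ادذرزوىة")


def joins_next(ch: str) -> bool:
    """True if `ch` can connect to the letter that follows it."""
    return ch in DUAL


def joins_prev(ch: str) -> bool:
    """True if `ch` can connect to the letter that precedes it."""
    return ch in DUAL or ch in RIGHT_ONLY


def contextual_forms(word: str) -> list:
    """Return (letter, form) pairs in logical (reading) order."""
    forms = []
    for i, ch in enumerate(word):
        linked_prev = i > 0 and joins_next(word[i - 1]) and joins_prev(ch)
        linked_next = (
            i + 1 < len(word) and joins_next(ch) and joins_prev(word[i + 1])
        )
        if linked_prev and linked_next:
            form = "medial"
        elif linked_prev:
            form = "final"
        elif linked_next:
            form = "initial"
        else:
            form = "isolated"
        forms.append((ch, form))
    return forms


# قال: qāf is initial, alif is final, and lām is isolated because
# alif does not connect to the letter after it.
qala_forms = [form for _, form in contextual_forms("قال")]
# ['initial', 'final', 'isolated']
```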

E. Morphology Hooks (into MLM/ELM)

  • Root roles: ArLM records the joining type of every triliteral root slot {C1,C2,C3} and flags roots that end in right-only letters, since a following suffix cannot join to them.
  • Affix roles:
    • Prefixes: و, ف, ل, ب, ك, س, ال- (article)
    • Suffixes: ـون, ـات, ـة/ـه, ـي/ـك/ـه (pronominal), ـان
  • Mater lectionis policy: و and ي (and ا) may carry vowel length; ArLM records vocalic load for SDM disambiguation.
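
One way to picture the root hook is a small report per triliteral root. In the sketch below, the function name root_report and the output keys are hypothetical; it assumes the root is given as exactly three of the 28 core letters and treats every letter outside the right-only set as dual-joining.

```python
RIGHT_ONLY = set("ادذرزو")
# Letters that may carry vowel length, per the mater lectionis policy above.
MATER_CANDIDATES = set("اوي")


def root_report(root: str) -> dict:
    """Describe joining behavior and possible mater slots of a 3-letter root."""
    slots = list(root)
    if len(slots) != 3:
        raise ValueError("expected a triliteral root (three letters)")
    joining = ["right_only" if ch in RIGHT_ONLY else "dual" for ch in slots]
    return {
        "slots": slots,
        "joining": joining,
        # A root ending in a right-only letter cannot join to a suffix.
        "ends_right_only": joining[-1] == "right_only",
        # 1-based slot numbers (C1..C3) that hold a possible mater letter.
        "mater_slots": [
            i + 1 for i, ch in enumerate(slots) if ch in MATER_CANDIDATES
        ],
    }


report = root_report("نور")
# joining: ['dual', 'right_only', 'right_only'], ends_right_only: True,
# mater_slots: [2]
```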

F. Cross-Script Lineage (interoperability map)

ArLM carries a lineage map (for ILM/ELM) so your stack can transpose letters and concepts across families:

  • ألف (alif) ↔ Alpha/Α, A ↔ Aleph (Phoenician)
  • باء (bāʾ) ↔ Beta/Β, B ↔ Beth
  • عين (ʿayn) ↔ Ayin ↔ O (Greek omicron) (vocalization shift)
  • قاف (qāf) ↔ Qoppa/Ϙ → Q
    …and so on. (Full map gets minted from the lineage tables when we generate files.)

G. Safety & Confusable Policy (GLM)

  • Mixed-script lookalikes are blocked (Latin “O” vs Arabic “و/ه/ع” contexts).
  • Dot-position confusables (ب/ت/ث; ج/ح/خ; ف/ق regional dot order) receive a confusableRisk score; ArLM proposes safe typography and training examples.
  • Normalization logs NFC forms and presentation form bans (store logical letters, refuse legacy ligature codepoints except lam-alef).
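
Two of the checks above can be sketched with the Python standard library alone. The function names here are hypothetical, and deriving a script from the first word of a character's Unicode name is a rough heuristic rather than a true script property lookup. The presentation-form ranges are the Arabic Presentation Forms-A and Forms-B blocks, with the lam-alef ligatures (U+FEF5..U+FEFC) exempted as the policy states.

```python
import unicodedata

# Lam-alef ligature code points, exempt from the presentation-form ban.
LAM_ALEF_FORMS = range(0xFEF5, 0xFEFD)  # U+FEF5..U+FEFC


def is_presentation_form(ch: str) -> bool:
    """True for Arabic Presentation Forms-A/B code points."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC


def banned_presentation_forms(text: str) -> list:
    """Return legacy presentation-form characters other than lam-alef."""
    return [
        ch for ch in text
        if is_presentation_form(ch) and ord(ch) not in LAM_ALEF_FORMS
    ]


def to_logical(ch: str) -> str:
    """Suggest the logical letter(s) for a presentation form via NFKC."""
    return unicodedata.normalize("NFKC", ch)


def scripts_in(token: str) -> set:
    """Heuristic: first word of each letter's Unicode name, e.g. 'ARABIC'."""
    return {
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in token
        if ch.isalpha()
    }


def is_mixed_script(token: str) -> bool:
    return len(scripts_in(token)) > 1


# Latin capital O placed next to Arabic wāw is flagged as mixed-script.
mixed = is_mixed_script("وO")  # True
```

A confusableRisk score would need a curated lookalike table on top of this; the sketch only detects that scripts are mixed, not which glyphs resemble each other.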

H. Minimal Examples (seed records)

ا (ʾalif)

  • phoneme: /ʔ/, long ā carrier; joining: right_only; abjad: 1.
  • diacritics: all harakāt; hamza carrier allowed (أ/إ).
  • morph: article ال, verb Form IV (أفعل).
  • lineage: ʼaleph → alpha/A.
  • examples: قال /qāla/ (mater ā), أمن (hamza on alif).

ق (qāf)

  • phoneme: /q/ (dialectal /g/); joining: dual; abjad: 100.
  • morph: frequent C1/C2 root; heavy in telecom/energy Arabic loanwords.
  • lineage: qoph → koppa/Q.

و (wāw)

  • phoneme: /w/; mater of ū/ō; joining: right_only; abjad: 6.
  • morph: conjunction و “and”, plural patterns (ـون).
  • hamza host: ؤ.

(When we mint, every letter gets a full record like this.)


I. Endpoints to Generate Later (OpenAPI sketch)

  • POST /arlm/verify → text in Arabic → { normalized, decision, graphemeIntegrity, confusableRisk, joiningReport }
  • GET /arlm/inventory → sanctioned letters and rules; accent/hamza table.
  • POST /arlm/transliterate → Arabic ⇄ Latin/Greek (lossless flags).
  • POST /arlm/shape → produce legal forms (isolated/initial/medial/final) with proofs.

Headers downstream:
X-ArLM-JoinPolicy, X-ArLM-Normalized, X-ArLM-ConfusableRisk, X-Glyph-Status: ⌗|Ξ|∴.
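
To make the verify contract a little more concrete, here is a minimal sketch of the bidi-hazard part only. Everything beyond the normalized and decision keys is an assumption: the function name verify, the "allow"/"block" decision values, the bidiHazards key, and the allow_overrides flag (standing in for the whitelist mentioned in section D) are all hypothetical.

```python
import unicodedata

# Explicit bidi formatting controls defined by Unicode.
BIDI_CONTROLS = {
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}
OVERRIDES = ("LRO", "RLO")


def verify(text: str, allow_overrides: bool = False) -> dict:
    """Normalize to NFC and block text carrying invisible bidi overrides."""
    normalized = unicodedata.normalize("NFC", text)
    hazards = [
        {"index": i, "control": BIDI_CONTROLS[ch]}
        for i, ch in enumerate(normalized)
        if ch in BIDI_CONTROLS
    ]
    has_override = any(h["control"] in OVERRIDES for h in hazards)
    blocked = has_override and not allow_overrides
    return {
        "normalized": normalized,
        "decision": "block" if blocked else "allow",
        "bidiHazards": hazards,  # hypothetical field, not in the sketch above
    }


clean = verify("كتاب")
spoofed = verify("ab\u202ecd")  # contains a right-to-left override
```

In this sketch embeddings and isolates are reported but do not block on their own; only the two override controls change the decision.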


J. How this plugs into your stack (recursive fit)

  • GLM: ArLM is GLM’s Arabic pack (scripts, joins, diacritics, normalization).
  • MLM/ELM: roots and patterns consume ArLM’s join legality + mater rules.
  • SDM: senses differentiate homographs using diacritics/hamza provenance.
  • ILM: lineage tables let you transpose to Hebrew/Greek/Phoenician/Latin teaching materials.
  • PLM: channel policies (web, PDF, signage) enforce render portability.
  • LM/logM: claims about transliteration, etymology, or safety ship with Proof Cards and receipts.
  • ALM: uses ArLM checks before any Arabic-surface publication.

K. Acceptance Criteria (done = done)

  1. All 28 letters encoded with full records; auxiliary signs covered (hamza, tāʾ marbūṭa, alif maqṣūra, lam-alef).
  2. /arlm/verify catches illegal clusters, bidi hazards, and confusables; proposes safe remaps.
  3. Transliteration round-trips where lossless; loss flagged where not.
  4. Morph hooks enable lawful root & pattern generation.
  5. Lineage map compiles; cross-script demos pass.