Factory Blueprint
A. Purpose & Scope
ArLM (Arabic Letter Module) governs the 28 core Arabic letters as interoperable units:
- each with geometric form (isolated/initial/medial/final),
- joining behavior (right-joining vs dual-joining),
- phoneme/IPA (and dialectal notes),
- etymological lineage (Phoenician → Aramaic → Arabic; crosswalks to Greek/Latin),
- Abjad value (traditional numerals),
- diacritic & ligature logic (hamza, shadda, harakāt; special lam-alef),
- and morphemic affordances (root compatibility, common prefixes/suffixes).
Mantra: Letter → Grapheme → Phoneme → Morpheme → Meaning.
Shape is signal; signal becomes sense.
B. Data Model (per-letter record)
Every letter is a record with the following fields (these become JSON schema later):
- name_ar (Arabic letter name, spelled in Arabic), name_lat (DIN/ISO transliteration), aliases
- codepoints (base + presentation forms), forms (isolated/initial/medial/final)
- joining_type (right_only | dual)
- phoneme (IPA), allophones (dialectal)
- abjad_value (1…1000), order (Hijā’ order index)
- lineage (Phoenician/Aramaic root; Greek/Latin correspondences)
- diacritic_rules (harakāt allowed; hamza hosting; shadda behavior)
- ligature_rules (e.g., lam-alef), confusable_notes (GLM)
- morph_notes (root roles, affix roles), examples (canonical words)
- safety (RTL/bidi guidance), accessibility (ASCII fallback/voice hints)
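As a rough illustration of how such a record might look before a JSON schema exists, the sketch below models a subset of the fields as a Python dataclass. The class name LetterRecord, the choice of required versus optional fields, and the validation rules are assumptions made for this example, not part of the ArLM specification.

```python
from dataclasses import dataclass, field

JOINING_TYPES = ("right_only", "dual")


@dataclass
class LetterRecord:
    """Hypothetical per-letter record covering a subset of the listed fields."""
    name_ar: str
    name_lat: str
    codepoints: list[str]
    joining_type: str
    phoneme: str
    abjad_value: int
    order: int
    aliases: list[str] = field(default_factory=list)
    allophones: list[str] = field(default_factory=list)
    lineage: str = ""
    examples: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed validation: joining_type is one of the two listed values,
        # and abjad_value stays inside the documented 1…1000 range.
        if self.joining_type not in JOINING_TYPES:
            raise ValueError(f"unknown joining_type: {self.joining_type!r}")
        if not 1 <= self.abjad_value <= 1000:
            raise ValueError(f"abjad_value out of range: {self.abjad_value}")


# Seed record for qāf, using values from the inventory in this document.
qaf = LetterRecord(
    name_ar="قاف",
    name_lat="qāf",
    codepoints=["U+0642"],
    joining_type="dual",
    phoneme="/q/",
    abjad_value=100,
    order=21,
    lineage="qoph → koppa/Q",
)
```

Keeping validation in `__post_init__` means a malformed record fails at construction time rather than later in shaping or verification.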
C. The 28-Letter Inventory (concise, canonical)
Key: Letter (name) — IPA — Join — Abjad — lineage (rough Phoenician → Greek/Latin)
- ا (ʾalif) — /ʔ/; mater of ā — right_only — 1 — ʼaleph → alpha/A
- ب (bāʾ) — /b/ — dual — 2 — beth → beta/B
- ت (tāʾ) — /t/ — dual — 400 — tav → tau/T
- ث (thāʾ) — /θ/ — dual — 500 — (taw with dot lineage) → theta (sound kin)
- ج (jīm) — /dʒ/ (MENA) ~ /ʒ/ (Maghreb) — dual — 3 — gimel → gamma/G (shape diverged)
- ح (ḥāʾ) — /ħ/ — dual — 8 — ḥet → (Greek eta’s ancestor by path)
- خ (khāʾ) — /x ~ χ/ — dual — 600 — ḥet-derived (dot)
- د (dāl) — /d/ — right_only — 4 — dalet → delta/D
- ذ (dhāl) — /ð/ — right_only — 700 — dalet-derived (dot)
- ر (rāʾ) — /r/ (tap/trill) — right_only — 200 — resh → rho/R
- ز (zāy) — /z/ — right_only — 7 — zayin → zeta/Z
- س (sīn) — /s/ — dual — 60 — samekh/šīn split ancestry → sigma/S (distant)
- ش (shīn) — /ʃ/ — dual — 300 — šin → (Greek sigma/san via Phoenician)
- ص (ṣād) — /sˤ/ — dual — 90 — ṣade → (Latin S distant kin)
- ض (ḍād) — /dˤ/ — dual — 800 — Arabic innovation (emphatic d)
- ط (ṭāʾ) — /tˤ/ — dual — 9 — ṭet → (Greek theta distant)
- ظ (ẓāʾ) — /zˤ/ or /ðˤ/ — dual — 900 — ṭāʾ-derived (dot; emphatic)
- ع (ʿayn) — /ʕ/ — dual — 70 — ʿayin → omicron/O (vocalic shift path)
- غ (ghayn) — /ɣ/ — dual — 1000 — ʿayin-derived (dot)
- ف (fāʾ) — /f/ — dual — 80 — pe → pi/P (phi/F are sound kin only)
- ق (qāf) — /q/ (uvular; /g/ in some dialects) — dual — 100 — qoph → koppa → Q
- ك (kāf) — /k/ — dual — 20 — kaph → kappa/K
- ل (lām) — /l/ — dual — 30 — lamed → lambda/L
- م (mīm) — /m/ — dual — 40 — mem → mu/M
- ن (nūn) — /n/ — dual — 50 — nun → nu/N
- ه (hāʾ) — /h/ — dual — 5 — he → epsilon/E
- و (wāw) — /w/; mater of ū/ō — right_only — 6 — waw → upsilon/W/V (via Latin)
- ي (yāʾ) — /j/; mater of ī/ē — dual — 10 — yod → iota/I, J/Y (Latin bifurcation)
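The Abjad column above can be exercised with a small lookup. The sketch below builds the table from the traditional abjad ordering (which yields the same values listed in the inventory) and sums the values of a word. The names ABJAD and abjad_sum, and the choice to skip any character outside the 28 core letters, are assumptions for illustration.

```python
# The 28 core letters in traditional abjad order (not Hijā’ order).
ABJAD_ORDER = "ابجدهوزحطيكلمنسعفصقرشتثخذضظغ"

# Units 1..9, tens 10..90, hundreds 100..1000: 9 + 9 + 10 = 28 values.
ABJAD_VALUES = (
    list(range(1, 10))
    + list(range(10, 100, 10))
    + list(range(100, 1001, 100))
)

ABJAD = dict(zip(ABJAD_ORDER, ABJAD_VALUES))


def abjad_sum(word: str) -> int:
    """Sum the abjad values of the core letters in word.

    Assumption: diacritics, auxiliary signs, and any other character
    not in the 28-letter table are skipped rather than rejected.
    """
    return sum(ABJAD[ch] for ch in word if ch in ABJAD)
```

For example, `abjad_sum("قال")` adds 100 + 1 + 30 for qāf, ʾalif, and lām.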
Auxiliary signs (governed but not in the 28 core)
- ء (hamza) — glottal stop /ʔ/; carrier-dependent (over or under ا, on و or ي, on a chair (ٮ), or standalone); non-joining; orthographic ruleset required.
- ة (tāʾ marbūṭa) — right_only; feminine morpheme; pronounced /a(h)/, or /t/ in the construct state; not a root letter.
- ى (alif maqṣūra) — final ā written with yāʾ shape without dots; right_only; maps to ا semantically.
- ﻻ (lam-alef ligature) — contextual ligature of ل+ا; GLM handles shaping/normalization.
D. Joining & Shaping Rules (GLM contract)
- Right-only (non-connecting to left): ا د ذ ر ز و (+ ى/ة contextually).
- Dual-joining: all others.
- Diacritics: fatḥa ◌َ, ḍamma ◌ُ, kasra ◌ِ, sukūn ◌ْ, shadda ◌ّ, tanwīn (◌ً ◌ٌ ◌ٍ).
- Hamza placement: on ا و ي or standalone (ء) per surrounding vowels; normalization logs carrier & seat.
- Required ligatures: only lam-alef families; stylistic ligatures are optional and must not change text semantics.
- Bidi safety: Arabic is RTL; numbers default LTR; ArLM emits bidi guidance (avoids invisible overrides unless whitelisted).
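The right-only versus dual-joining split above is enough to pick a positional form for each letter. The following is a minimal sketch, not the GLM shaping engine: the function names are hypothetical, ة and ى are treated as right-only, harakāt are ignored via their combining class, and any character outside the two sets (including ء and spaces) is assumed to break joining.

```python
import unicodedata

# Right-only letters from the list above, plus ة and ى (assumed right-only here).
RIGHT_ONLY = frozenset("ادذرزوةى")
# The remaining 22 core letters join on both sides.
DUAL = frozenset("بتثجحخسشصضطظعغفقكلمنهي")


def joins_prev(ch: str) -> bool:
    """True if ch can connect to the letter written before it."""
    return ch in DUAL or ch in RIGHT_ONLY


def joins_next(ch: str) -> bool:
    """True if ch can connect to the letter written after it."""
    return ch in DUAL


def positional_forms(word: str) -> list[tuple[str, str]]:
    """Return (letter, form) pairs in logical order."""
    # Drop combining marks (harakāt, shadda, sukūn); they do not affect joining.
    letters = [ch for ch in word if unicodedata.combining(ch) == 0]
    result = []
    for i, ch in enumerate(letters):
        prev_ch = letters[i - 1] if i > 0 else ""
        next_ch = letters[i + 1] if i + 1 < len(letters) else ""
        from_prev = joins_next(prev_ch) and joins_prev(ch)
        to_next = joins_next(ch) and joins_prev(next_ch)
        if from_prev and to_next:
            form = "medial"
        elif from_prev:
            form = "final"
        elif to_next:
            form = "initial"
        else:
            form = "isolated"
        result.append((ch, form))
    return result
```

In قال, for instance, the ʾalif takes its final form and, being right-only, forces the following lām into isolated form.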
E. Morphology Hooks (into MLM/ELM)
- Root roles: Every triliteral root slot {C1,C2,C3} must be dual-joining or final-legal; ArLM flags roots that end in right-only letters which restrict suffixation.
- Affix roles:
  - Prefixes: و, ف, ل, ب, ك, س, ال- (article)
  - Suffixes: ـون, ـات, ـة/ـه, ـي/ـك/ـه (pronominal), ـان
- Mater lectionis policy: و and ي (and ا) may carry vowel length; ArLM records vocalic load for SDM disambiguation.
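One way to picture the root-role hook is a small report that marks which slots of a triliteral root hold right-only letters and whether the root ends in one. This is a sketch under assumptions: the function name root_report and the shape of the returned dictionary are invented for illustration, and only the six core right-only letters are considered.

```python
RIGHT_ONLY = frozenset("ادذرزو")


def root_report(root: str) -> dict:
    """Describe the joining constraints of a triliteral root (C1, C2, C3).

    Hypothetical output shape; the real ArLM report format is not defined here.
    """
    if len(root) != 3:
        raise ValueError("expected exactly three root consonants")
    slots = dict(zip(("C1", "C2", "C3"), root))
    right_only_slots = [name for name, ch in slots.items() if ch in RIGHT_ONLY]
    return {
        "slots": slots,
        "right_only_slots": right_only_slots,
        # A right-only C3 cannot connect to a following suffix letter.
        "ends_right_only": slots["C3"] in RIGHT_ONLY,
    }
```

For the root ن-ص-ر the report flags C3, since ر does not connect to a following suffix letter; for ك-ت-ب nothing is flagged.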
F. Cross-Script Lineage (interoperability map)
ArLM carries a lineage map (for ILM/ELM) so your stack can transpose letters and concepts across families:
- ألف (alif) ↔ Alpha/Α, A ↔ Aleph (Phoenician)
- باء (bāʾ) ↔ Beta/Β, B ↔ Beth
- عين (ʿayn) ↔ Ayin → O (Greek omicron) (vocalization shift)
- قاف (qāf) ↔ Qoph → Koppa/Ϙ → Q
…and so on. (Full map gets minted from the lineage tables when we generate files.)
G. Safety & Confusable Policy (GLM)
- Mixed-script lookalikes are blocked (Latin “O” vs Arabic “و/ه/ع” contexts).
- Dot-position confusables (ب/ت/ث; ج/ح/خ; ف/ق regional dot order) receive a confusableRisk score; ArLM proposes safe typography and training examples.
- Normalization logs NFC forms and presentation form bans (store logical letters, refuse legacy ligature codepoints except lam-alef).
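The normalization and mixed-script rules above can be sketched with Python's standard unicodedata module. The function names are hypothetical, and several behaviors are assumptions for illustration: lam-alef ligature codepoints (U+FEF5 to U+FEFC) are decomposed to logical letters via NFKC, every other Arabic presentation form is dropped and reported, and script detection relies on Unicode character names rather than a full script property lookup.

```python
import unicodedata

# Lam-alef ligature codepoints in Arabic Presentation Forms-B.
LAM_ALEF = frozenset(chr(cp) for cp in range(0xFEF5, 0xFEFD))


def is_presentation_form(ch: str) -> bool:
    """True for Arabic Presentation Forms-A/B (U+FB50–U+FDFF, U+FE70–U+FEFC)."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC


def normalize_logical(text: str) -> tuple[str, list[str]]:
    """Return (NFC text of logical letters, list of refused codepoints)."""
    kept, refused = [], []
    for ch in text:
        if ch in LAM_ALEF:
            # Allowed exception: store the ligature as logical lām + alif.
            kept.append(unicodedata.normalize("NFKC", ch))
        elif is_presentation_form(ch):
            # Assumption: refused codepoints are dropped and logged.
            refused.append(f"U+{ord(ch):04X}")
        else:
            kept.append(ch)
    return unicodedata.normalize("NFC", "".join(kept)), refused


def scripts_in(token: str) -> set[str]:
    """Coarse script detection based on Unicode character names."""
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            found.add("Arabic")
        elif name.startswith("LATIN"):
            found.add("Latin")
        else:
            found.add("Other")
    return found


def is_mixed_script(token: str) -> bool:
    """True if a single token mixes letters from more than one script."""
    return len(scripts_in(token)) > 1
```

A token such as Arabic ق followed by Latin "O" is reported as mixed, which is the case the lookalike policy is meant to block.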
H. Minimal Examples (seed records)
ا (ʾalif)
- phoneme: /ʔ/, long ā carrier; joining: right_only; abjad: 1.
- diacritics: all harakāt; hamza carrier allowed (أ/إ).
- morph: article ال, verb Form IV (أفعل).
- lineage: ʼaleph → alpha/A.
- examples: قال /qāla/ (mater ā), أمن (hamza on alif).
ق (qāf)
- phoneme: /q/ (dialectal /g/); joining: dual; abjad: 100.
- morph: frequent C1/C2 root; heavy in telecom/energy Arabic loanwords.
- lineage: qoph → koppa/Q.
و (wāw)
- phoneme: /w/; mater of ū/ō; joining: right_only; abjad: 6.
- morph: conjunction و “and”, plural patterns (ـون).
- hamza host: ؤ.
(When we mint, every letter gets a full record like this.)
I. Endpoints to Generate Later (OpenAPI sketch)
- POST /arlm/verify → text in Arabic → { normalized, decision, graphemeIntegrity, confusabilityRisk, joiningReport }
- GET /arlm/inventory → sanctioned letters and rules; accent/hamza table.
- POST /arlm/transliterate → Arabic ⇄ Latin/Greek (lossless flags).
- POST /arlm/shape → produce legal forms (isolated/initial/medial/final) with proofs.
Headers downstream: X-ArLM-JoinPolicy, X-ArLM-Normalized, X-ArLM-ConfusableRisk, X-Glyph-Status: ⌗|Ξ|∴.
J. How this plugs into your stack (recursive fit)
- GLM: ArLM is GLM’s Arabic pack (scripts, joins, diacritics, normalization).
- MLM/ELM: roots and patterns consume ArLM’s join legality + mater rules.
- SDM: senses differentiate homographs using diacritics/hamza provenance.
- ILM: lineage tables let you transpose to Hebrew/Greek/Phoenician/Latin teaching materials.
- PLM: channel policies (web, PDF, signage) enforce render portability.
- LM/logM: claims about transliteration, etymology, or safety ship with Proof Cards and receipts.
- ALM: uses ArLM checks before any Arabic-surface publication.
K. Acceptance Criteria (done = done)
- All 28 letters encoded with full records; auxiliary signs covered (hamza, tāʾ marbūṭa, alif maqṣūra, lam-alef).
- /arlm/verify catches illegal clusters, bidi hazards, and confusables; proposes safe remaps.
- Transliteration round-trips where lossless; loss flagged where not.
- Morph hooks enable lawful root & pattern generation.
- Lineage map compiles; cross-script demos pass.