Graphemic Language Module (GLM)


A. Purpose & Scope

GLM (Graphemic Language Module) governs the form of writing: which glyphs exist, how they combine into grapheme clusters, how text is normalized, rendered, transliterated, and secured (confusables, mixed-script traps), and how these decisions propagate to the rest of the stack.

Mantra: Shape is signal.
If MLM is the word foundry and SDM the zoning board of sense, GLM is the building code for letters.

Primary jobs

  1. Define and validate glyph inventories and script policies (Latin, Greek, Cyrillic… plus domain packs).
  2. Govern diacritic logic, ligatures, joiners, variation selectors, and cluster boundaries.
  3. Enforce normalization, confusable control, and render portability across platforms.
  4. Provide transliteration and orthography adapters for cross-script/cross-market deployment.
  5. Produce explanations and receipts: what codepoints were used, why, and how they were made safe and readable.

B. Factory Overview (same machine, new blueprint)

  1. Blueprints — declare inventories, policies, transforms, and risk rules.
  2. Templates — shape artifacts: schema, JSON Schemas, OpenAPI, rulebooks, mapping tables, seeds, tests.
  3. Generators — render final files from blueprints.
  4. Validators — compile grapheme rules, check confusables, simulate renders.
  5. Signers — hash & record provenance.
  6. Publishers — ship to the ledger + SolveForce/Logos clients.

C. GLM Blueprints (source of truth)

C1. Module Blueprint (GLM)

  • name: “Graphemic Language Module”
  • intent: “Governed writing-form and safety”
  • units: Glyph, Codepoint, Cluster, Script, Diacritic, Ligature, Joiner, Variant, Transliteration Map
  • script_policies: allowed scripts, mixed-script rules, forbidden joins, casing rules, digit sets
  • normalization_policy: canonical (e.g., NFC) with allowed exceptions by channel
  • confusable_policy: detection thresholds, safelist/banlist, remediation strategies (substitution, annotate, reject)
  • diacritic_rules: attachment legality, stacking limits, lossless fallback rules
  • cluster_rules: grapheme cluster segmentation, boundary legality, shaping constraints
  • transliteration_maps: Latin ⇄ Greek ⇄ Cyrillic (and others), reversible where possible; lossy flags
  • render_profiles: target OS/browser/editor/font stacks; required render proofs
  • scores: graphemeIntegrity, confusabilityRisk, renderPortability, readability, accessibility, codepointSafety, typographicHarmony
  • thresholds: τ_integrity, τ_confusable, τ_portability, τ_readability, τ_accessibility, τ_safety
  • decisions: ACCEPT | REVIEW | REJECT (per string/term or per policy change)
  • io-contracts: text-in + channel/domain → normalized text + decision + scores + explain[]
  • glyphs: ⌗ (grapheme-checked), Ξ (validated), ∴ (settled), ✠ (ethics)
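
The transliteration_maps entry asks for reversibility where possible and an explicit lossy flag. A minimal Python sketch of that check follows. The Greek-to-Latin pairs, the function name transliterate, and the choice to treat passed-through characters as lossy are all illustrative assumptions, not a sanctioned GLM map.

```python
# Illustrative Greek -> Latin pairs (assumed for the demo, not a sanctioned map).
GREEK_TO_LATIN = {
    "\u03b1": "a",  # alpha
    "\u03b2": "b",  # beta
    "\u03b3": "g",  # gamma
    "\u03b7": "e",  # eta
    "\u03b5": "e",  # epsilon (collides with eta on purpose)
}


def transliterate(text, mapping):
    """Map each character and report whether the result can be reversed."""
    notes = []
    out = []
    for ch in text:
        if ch in mapping:
            out.append(mapping[ch])
        else:
            # Conservative assumption: anything without a mapping is flagged.
            out.append(ch)
            notes.append(f"no mapping for U+{ord(ch):04X}; passed through")
    # Two distinct source letters landing on one target cannot be undone.
    used = {ch: mapping[ch] for ch in text if ch in mapping}
    if len(set(used.values())) < len(used):
        notes.append("distinct source letters share a target")
    fidelity = "lossy" if notes else "lossless"
    return {"mapped": "".join(out), "fidelity": fidelity, "notes": notes}
```

The returned keys mirror the { mapped, fidelity, notes[] } shape used elsewhere in this document; a real map would be loaded from the transliteration/*.yaml tables rather than hard-coded.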

C2. Inventory Blueprints

  • Base Latin set for SolveForce; extension packs for Greek, Cyrillic, Arabic, etc.
  • Domain glyph packs (telecom symbols, energy units, math/logic marks) with use permissions.

C3. Seeds Blueprint

  • Positive: safe ASCII + sanctioned diacritics;
  • Edge: mixed-script lookalikes (“a” vs “а”), ZWJ/ZWNJ misuse, stacked diacritics, ligature-only forms;
  • Negative: homoglyph spoofs, forbidden clusters, unsafe controls.
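
To make the three seed classes concrete, the sketch below holds one hypothetical seed per class and prints each string's codepoints with their Unicode names, which is the quickest way to see that two lookalikes differ. The record fields (text, kind, note) are assumptions and not the minted glm_seeds.jsonl format.

```python
import unicodedata

# Hypothetical seed records; the field names are assumptions.
SEEDS = [
    {"text": "cafe\u0301", "kind": "positive", "note": "ASCII base + combining acute"},
    {"text": "l\u0430nomics", "kind": "edge", "note": "Cyrillic a inside a Latin word"},
    {"text": "ab\u202ecd", "kind": "negative", "note": "bidi override control"},
]


def describe(text):
    """List (codepoint, Unicode name) pairs so lookalikes become visible."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


for seed in SEEDS:
    print(seed["kind"], describe(seed["text"]))
```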

D. Templates to Mint Later (content requirements)

  1. DB Schema (templates/db/schema.sql.tmpl)
    • Tables: glyphs, codepoints, scripts, clusters, policies, transforms, decisions, audits
    • Views: v_grapheme_inventory, v_text_proof (input→normalized→flags→scores).
  2. JSON Schemas (templates/schemas/*.json.tmpl)
    • text_proof_request.json: { text, channel, domain, target_scripts?, render_profiles? }
    • text_proof_response.json: { normalized, decision, scores{}, warnings[], explain[], receipts{} }
    • inventory_record.json: glyph/script definitions and status.
  3. OpenAPI
    • /glm/verify (POST text) → graphemic decision + normalized output + reasons.
    • /glm/inventory (GET/POST) to list/update sanctioned glyphs.
    • /glm/transliterate (POST) → mapped string + fidelity notes.
    • /glm/confusables (POST) → report of risky spans.
  4. Rulebook (templates/rules/glm_rulebook.md.tmpl)
    • R0 Script Legality, R1 Normalization, R2 Cluster Boundaries, R3 Diacritics, R4 Confusables, R5 Controls & Joiners, R6 Render Portability, R7 Accessibility, R8 Overrides.
  5. Transforms
    • normalization_profiles.yaml.tmpl (e.g., NFC default, NFKC for legacy),
    • confusable_map.txt.tmpl (homoglyph sets + weights),
    • transliteration/*.yaml.tmpl (pairwise mappings, reversible flags),
    • render_profiles.yaml.tmpl (font stacks & test matrices).
  6. Seeds/Tests
    • glm_seeds.jsonl.tmpl, glm_cases.json.tmpl: ACCEPT/REVIEW/REJECT cases with explicit rule triggers.
  7. Generator/Validator Stubs
    • Cluster segmenter, normalization applier, confusable finder, render simulator (font fallback), accessibility checks (screen reader hints).

E. Processing Pipeline (runtime contract to implement later)

Input → Inspect → Normalize → Analyze → Score → Decide → Explain

  1. Inspect
    • Detect scripts, controls, joiners, diacritics; mark mixed-script spans and suspicious clusters.
  2. Normalize
    • Apply canonical policy (e.g., NFC) with channel-specific overrides; record deltas and any lossy steps (never silent).
  3. Analyze
    • Cluster legality (grapheme boundaries), diacritic stacking limits, confusable sets (lookalikes), controls (ZWJ/ZWNJ, bidi marks), render proofs (font coverage/fallback), accessibility (pronounceability, alt mappings).
  4. Score
    • graphemeIntegrity (legal clusters + policy match)
    • confusabilityRisk (weighted homoglyph proximity & mixing)
    • renderPortability (coverage across profiles)
    • readability (cluster simplicity; diacritic burden; casing)
    • accessibility (screen-reader fidelity; ASCII fallback quality)
    • codepointSafety (controls/privates/forbidden blocks)
    • typographicHarmony (spacing/kerning risk, ligature reliance)
  5. Decide
    • ACCEPT / REVIEW / REJECT under thresholds; auto-remediation suggestions if safe (e.g., prefer base+diacritic to ambiguous precomposed form).
  6. Explain
    • Emit text proof: before/after, rules fired, confusable spans highlighted, render matrix, accessibility notes, and headers for downstream modules.
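
The Inspect and Normalize steps above can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions: script detection here reads the first word of each letter's Unicode name, which is a heuristic and not the Unicode Script property, and the function and field names are hypothetical.

```python
import unicodedata


def detect_scripts(text):
    """Approximate each letter's script by the first word of its Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        for ch in text
        if ch.isalpha()
    }


def normalize_with_delta(text, form="NFC"):
    """Normalize and record what changed, so no step is silent."""
    normalized = unicodedata.normalize(form, text)
    return {
        "form": form,
        "normalized": normalized,
        "changed": normalized != text,
        "before": [f"U+{ord(c):04X}" for c in text],
        "after": [f"U+{ord(c):04X}" for c in normalized],
    }


def inspect_and_normalize(text):
    """Run Inspect then Normalize and merge both findings into one report."""
    scripts = detect_scripts(text)
    report = normalize_with_delta(text)
    report["scripts"] = sorted(scripts)
    report["mixed_script"] = len(scripts) > 1
    return report
```

Keeping the before/after codepoint lists in the report is one way to honor the "never silent" requirement: any delta introduced by normalization is visible to the Explain step.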

F. Scoring (deterministic skeleton)

  • graphemeIntegrity = 1 − (illegal_cluster_penalties + policy_violations)
  • confusabilityRisk = max(homoglyph_weighted_score, mixed_script_factor)
  • renderPortability = min(coverage across target profiles)
  • readability = f(diacritics_count, cluster_complexity, case_consistency)
  • accessibility = min(screen_reader_similarity, fallback_fidelity)
  • codepointSafety = 1 − unsafe_codepoint_ratio

Default pass (tunable):
graphemeIntegrity ≥ 0.90 ∧ confusabilityRisk ≤ 0.20 ∧ renderPortability ≥ 0.85 ∧ readability ≥ 0.75 ∧ accessibility ≥ 0.80 ∧ codepointSafety ≥ 0.95 ∧ ethicsPass = true.
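
A minimal sketch of the skeleton and the default pass, assuming scores arrive as a plain dict keyed by the score names above. Clamping graphemeIntegrity to [0, 1] is an assumption (the skeleton does not bound the penalties), readability is taken as an input because its function is not specified, and all function names are hypothetical.

```python
def clamp01(x):
    return max(0.0, min(1.0, x))


def grapheme_integrity(illegal_cluster_penalties, policy_violations):
    # Clamping is an assumption; the skeleton leaves the range open.
    return clamp01(1.0 - (illegal_cluster_penalties + policy_violations))


def confusability_risk(homoglyph_weighted_score, mixed_script_factor):
    return max(homoglyph_weighted_score, mixed_script_factor)


def render_portability(coverage_by_profile):
    # Worst-case coverage across the target render profiles.
    return min(coverage_by_profile.values())


def codepoint_safety(unsafe_count, total_count):
    if total_count == 0:
        return 1.0
    return 1.0 - unsafe_count / total_count


# Default pass thresholds from the text: (direction, bound).
DEFAULT_PASS = {
    "graphemeIntegrity": (">=", 0.90),
    "confusabilityRisk": ("<=", 0.20),
    "renderPortability": (">=", 0.85),
    "readability": (">=", 0.75),
    "accessibility": (">=", 0.80),
    "codepointSafety": (">=", 0.95),
}


def passes_default(scores, ethics_pass):
    """True only when every threshold holds and the ethics gate is set."""
    if not ethics_pass:
        return False
    for name, (direction, bound) in DEFAULT_PASS.items():
        value = scores[name]
        ok = value >= bound if direction == ">=" else value <= bound
        if not ok:
            return False
    return True
```

Because the thresholds are tunable, keeping them in one table rather than inline comparisons makes a channel-specific override a data change instead of a code change.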


G. Validators (what “good” means)

  • JSON Schemas validate with examples.
  • OpenAPI typed; examples provided.
  • Inventory: no orphan glyphs; scripts labeled; status (allowed/review/banned).
  • Transforms parse; transliteration maps are acyclic where required.
  • Seeds/Tests pass; each test cites which rule IDs fired.
  • Render simulation: no ACCEPT where coverage < policy threshold.
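
The render-simulation rule above is an invariant that can be checked mechanically over test results. The sketch below assumes each case carries an id, a decision, and a per-profile coverage dict; those field names and the profile labels are hypothetical.

```python
def render_violations(cases, coverage_threshold):
    """Return ids of cases marked ACCEPT despite sub-threshold coverage."""
    return [
        case["id"]
        for case in cases
        if case["decision"] == "ACCEPT"
        and min(case["coverage"].values()) < coverage_threshold
    ]


cases = [
    {"id": "t1", "decision": "ACCEPT", "coverage": {"profile_a": 0.98, "profile_b": 0.91}},
    {"id": "t2", "decision": "ACCEPT", "coverage": {"profile_a": 0.99, "profile_b": 0.60}},
    {"id": "t3", "decision": "REJECT", "coverage": {"profile_a": 0.40, "profile_b": 0.35}},
]
print(render_violations(cases, 0.85))  # → ['t2']
```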

H. Policies & Overrides

  • Mixed-script policy: default deny, with explicit allowlists per domain/channel.
  • Confusable remediation: prefer safe lookalike replacements or annotate with combining marks; log all remaps.
  • Controls & Joiners: ZWJ/ZWNJ and bidi marks allowed only in allowlisted contexts.
  • Overrides: curator-required with rationale; audit stored immutably.
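
The default-deny mixed-script rule and the control/joiner rule can be sketched as two small checks. The allowlist contents, the (domain, channel) key, and the function names are assumptions; the governed set below covers ZWNJ, ZWJ, and the common bidi marks and may not match the final policy.

```python
# Hypothetical allowlist: (domain, channel) -> permitted script combinations.
MIXED_SCRIPT_ALLOWLIST = {
    ("math", "docs"): {frozenset({"LATIN", "GREEK"})},
}

# ZWNJ, ZWJ, LRM, RLM, ALM, plus the bidi embedding/override/isolate ranges.
GOVERNED_CONTROLS = (
    {0x200C, 0x200D, 0x200E, 0x200F, 0x061C}
    | set(range(0x202A, 0x202F))   # U+202A..U+202E
    | set(range(0x2066, 0x206A))   # U+2066..U+2069
)


def mixed_script_allowed(scripts, domain, channel):
    """Default deny: more than one script passes only via the allowlist."""
    if len(scripts) <= 1:
        return True
    allowed = MIXED_SCRIPT_ALLOWLIST.get((domain, channel), set())
    return frozenset(scripts) in allowed


def find_controls(text):
    """Return (index, codepoint) for every governed control or joiner."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in GOVERNED_CONTROLS
    ]
```

The caller is expected to supply the detected script set; whether a found control is legitimate would then depend on the context allowlist, which this sketch does not model.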

I. Playbooks (ops steps)

  1. Author GLM Blueprint (/blueprints/glm.yaml) with script policies, normalization, confusables, transliteration.
  2. Dry run: validate inventories; compile confusable tables; run seed cases.
  3. Mint: render DB schema, JSON Schemas, OpenAPI, rulebook, transforms, seeds, tests into /build/GLM/....
  4. Prove: run tests + render simulations; confirm accessibility checks.
  5. Publish: enable endpoints; wire to ledger + editors.

J. Content Requirements (when minted)

  • schema.sql: glyphs, codepoints, scripts, clusters, policies, transforms, decisions, audits; views v_grapheme_inventory, v_text_proof.
  • text_proof_request/response.json: as above.
  • OpenAPI: /glm/verify, /glm/inventory, /glm/transliterate, /glm/confusables.
  • rulebook: R0–R8 with examples and remediation patterns.
  • transforms: normalization profiles, confusable maps, transliteration tables, render profiles.
  • seeds/tests: representative examples (safe, edgy, malicious).

K. Runtime Endpoints (to implement after mint)

  • POST /glm/verify { text, channel, domain, target_scripts?, render_profiles? }
    → { normalized, decision, scores, warnings[], explain[], receipts{} }
  • POST /glm/transliterate { text, source_script, target_script }
    → { mapped, fidelity: lossless|lossy, notes[] }
  • POST /glm/confusables { text }
    → { spans: [{i,j,type,neighbors}], risk }
  • GET|PATCH /glm/inventory → manage glyph packs.
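
As an illustration of the /glm/confusables response shape, the sketch below scans a string against a tiny hand-written lookalike table. Everything beyond the response keys is an assumption: the table entries are a small sample, spans are half-open [i, j), the type label "homoglyph" is hypothetical, and risk is a placeholder ratio rather than the weighted score described in Scoring.

```python
# Small illustrative table: confusable codepoint -> Latin lookalike.
# A real table would be built from confusable_map.txt, not hard-coded.
CONFUSABLE_TO_LATIN = {
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def confusables_report(text):
    """Report one span per risky character, plus a placeholder risk ratio."""
    spans = [
        {"i": i, "j": i + 1, "type": "homoglyph", "neighbors": [CONFUSABLE_TO_LATIN[ch]]}
        for i, ch in enumerate(text)
        if ch in CONFUSABLE_TO_LATIN
    ]
    # Placeholder: share of risky characters, not the weighted policy score.
    risk = len(spans) / len(text) if text else 0.0
    return {"spans": spans, "risk": risk}
```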

L. Interlocks (binding tissue)

  • Feeds MLM: validates that candidate terms use legal clusters; provides safe remaps before morphology scoring.
  • Feeds SDM: ensures the surface form uniquely signals the intended sense (confusable risk included in ambiguity).
  • Feeds ILM: supplies transliteration and orthography adapters per domain/geography.
  • Feeds PLM: channel-specific render/accessibility guidance (e.g., voice UIs ignore silent diacritics).
  • Used by ALM: preflight step for any publication; receipts include codepoint lists and remaps.

Downstream Headers
X-GLM-Normalized: NFC|…
X-GLM-ConfusableRisk: <score>
X-GLM-RenderProfiles: ok/<list>
X-Glyph-Status: ⌗|Ξ|∴
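
A small sketch of how a verifier might assemble these headers. The value formats are assumptions read off the header list: a two-decimal risk score and "ok/" followed by a comma-separated profile list. The function name and the profile labels are hypothetical.

```python
def glm_headers(normalization_form, confusable_risk, render_profiles, glyph_status):
    """Build the downstream header dict; value formats are assumptions."""
    return {
        "X-GLM-Normalized": normalization_form,
        "X-GLM-ConfusableRisk": f"{confusable_risk:.2f}",
        "X-GLM-RenderProfiles": "ok/" + ",".join(render_profiles),
        "X-Glyph-Status": glyph_status,
    }


headers = glm_headers("NFC", 0.02, ["profile_a", "profile_b"], "\u2317")
for name, value in headers.items():
    print(f"{name}: {value}")
```

One design point to settle before implementation: many HTTP stacks expect ASCII header values, so the non-ASCII status glyphs (⌗, Ξ, ∴) may need an agreed encoding or an ASCII alias when these headers travel over HTTP.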


M. Acceptance Criteria (done = done)

  1. Factory mints GLM artifacts from blueprint with zero manual edits.
  2. Inventories & transforms load; seeds/tests pass; render sims green.
  3. /glm/verify returns normalized forms + decisions + rationales.
  4. Confusable & mixed-script traps are caught; safe remaps proposed.
  5. Headers consumed by other modules; logs/receipts hash to ledger.

N. Roadmap

  • Font-agnostic shaping tests (cover complex scripts: Arabic, Indic, SE Asian).
  • Dynamic confusable lists updated from telemetry.
  • Perceptual readability model (human-in-the-loop judgments).
  • Accessibility exporters (phonemic/ASCII fallbacks for low-vision and TTS).
  • Right-to-left + bidirectional policy packs with strict joiner rules.

O. Micro-Examples (seed calibrators)

  1. ACCEPT: LANOMICS (pure Latin, NFC, no confusables)
    • graphemeIntegrity .99, confusabilityRisk .02, renderPortability .98 → ACCEPT.
  2. REVIEW: LАNOMICS, where the “А” is Cyrillic U+0410
    • Mixed script; visually indistinguishable from the Latin form → suggest replacing with Latin A; otherwise REJECT for public branding.
  3. REJECT: term with stacked diacritics + ZWJ misuse
    • Illegal cluster; screen-reader corruption risk; renderPortability low → REJECT + remediation plan.
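
The third calibrator can be approximated with a stack counter over combining marks. This is a sketch under stated assumptions: the stack limit of 2 is hypothetical, and treating any ZWJ as misuse is a simplification that only suits single-script Latin terms, since ZWJ is legitimate in some scripts and sequences.

```python
import unicodedata

ZWJ = "\u200d"


def max_diacritic_stack(text):
    """Longest run of consecutive nonspacing/enclosing marks."""
    longest = run = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def calibrate(text, stack_limit=2):
    """Return (decision, reasons) using the two assumed rules above."""
    reasons = []
    if max_diacritic_stack(text) > stack_limit:
        reasons.append("diacritic stack exceeds limit")
    if ZWJ in text:
        reasons.append("ZWJ present")
    return ("REJECT" if reasons else "ACCEPT"), reasons
```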

Notes for SolveForce & Ron’s corpus (Logos Codex / Linomics)

  • Brand safety: GLM enforces that SolveForce marks are single-script, non-confusable, and accessible.
  • Linomics/LANOMICS family: register sanctioned diacritics (if any), preferred transliterations, and ASCII fallbacks for low-fidelity channels.
  • Mutation Ledger tie-in: every minted term stores the codepoint recipe and normalization path as part of the audit trail.
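
One way to picture the codepoint recipe and normalization path is a small receipt that is serialized deterministically and hashed. The receipt fields, the function name mint_receipt, and the choice of SHA-256 over sorted-key JSON are assumptions for illustration; the ledger's actual format is not specified here.

```python
import hashlib
import json
import unicodedata


def codepoints(text):
    return [f"U+{ord(ch):04X}" for ch in text]


def mint_receipt(term, form="NFC"):
    """Build a receipt and a stable hash over its canonical JSON form."""
    normalized = unicodedata.normalize(form, term)
    receipt = {
        "input_codepoints": codepoints(term),
        "normalization": form,
        "normalized_codepoints": codepoints(normalized),
    }
    # sort_keys gives a deterministic byte string, so equal receipts hash equally.
    payload = json.dumps(receipt, sort_keys=True).encode("utf-8")
    return receipt, hashlib.sha256(payload).hexdigest()


receipt, digest = mint_receipt("cafe\u0301")
print(receipt["normalized_codepoints"], digest[:12])
```

Because the input codepoints are part of the hashed payload, two visually identical terms that were typed differently (decomposed versus precomposed) produce different receipts, which preserves the normalization path in the audit trail.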