1) Problem Definition & Success Criteria
- Goal: Given a root (e.g., JOHN), generate and curate ≥100,000 linked variants (diminutives, cognates, orthographic/phonetic/transliteration variants, compounds, surnames-from-given, etc.) with traceable provenance and confidence scoring.
- Outputs: Graph + API + export (JSONL/CSV/Parquet), reproducible pipeline, and a browsable UI.
- KPIs: Precision/recall vs. gold sets, % edges with provenance, average confidence ≥ threshold, latency for top-k retrieval, and human-curation velocity.
2) Scope & Taxonomy (Name Types)
- Given names: canonical forms, cognates, hypocoristics (nicknames), pet forms.
- Morphological derivatives: diminutive/augmentative suffixes, gendered forms.
- Orthographic variants: diacritics, historical spellings, spacing/hyphens.
- Phonetic variants: dialectal shifts, merger/split phenomena.
- Cross-script forms: transliterations (round-trip when possible).
- Compounds & theophoric: e.g., John-Paul, Giancarlo.
- Patronymics & surnames: Johnson, Ivanov, Seán Ó…
- Inflected forms (where relevant): case/number/grammatical gender in certain languages.
3) Data Model (Graph-First + Relational Mirror)
- Graph (primary):
  - NameNode(id, lemma, lang, script, era, gender, freq, sources[])
  - Edges (typed): EDGE(type, confidence, rule_id, source_ids[], notes)
  - Edge types: COGNATE, DIMINUTIVE, ORTHO_VARIANT, PHONETIC_VARIANT, TRANSLIT, COMPOUND_PART, DERIVED_SURNAME, ALIASED, HISTORICAL
- Relational mirror (for analytics & versioning): names, edges, sources, rules, evaluations.
- Provenance: every node/edge must carry source_ids + timestamps + license.
4) Canonicalization & Unicode Discipline
- Normalize to NFC; store raw + normalized.
- Track casefolded and diacritic-stripped keys for matching.
- Persist script (ISO 15924), language (BCP 47), region.
- Treat punctuation/hyphenation as first-class features (not noise).
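The normalization discipline above can be sketched with the standard library alone (the function name `canonical_keys` is illustrative):

```python
import unicodedata

def canonical_keys(raw: str) -> dict:
    """Derive the matching keys described above from one raw form."""
    nfc = unicodedata.normalize("NFC", raw)       # store raw + normalized
    folded = nfc.casefold()                        # casefolded key
    # Diacritic-stripped key: decompose, drop combining marks.
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFD", folded)
        if not unicodedata.combining(ch)
    )
    return {"raw": raw, "nfc": nfc, "folded": folded, "stripped": stripped}
```

All four keys are persisted per node so that matching can choose the strictest key that still finds candidates.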
5) Seed Ingestion (Foundations)
- Open lexicographic/onomastic sources (structured + scraped).
- Civil registries & frequency lists (SSA-style, electoral rolls where legally allowed).
- Wikidata-style knowledge (for multilingual bridges).
- Clerical & historical corpora (transliterations; archaic forms).
- Phonology resources (IPA mappings, grapheme-to-phoneme models per language).
- Store each seed with license, version, URL/hash, and coverage notes.
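A sketch of that seed registration record, assuming the snapshot bytes are hashed at ingest time (names and the example URL are hypothetical):

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SeedSource:
    """Registration record for one ingested seed, per the checklist above."""
    name: str
    url: str
    version: str
    license: str
    content_hash: str        # checksum of the snapshot actually ingested
    coverage_notes: str = ""

def register_seed(name, url, version, license, payload: bytes, notes=""):
    digest = hashlib.sha256(payload).hexdigest()
    return SeedSource(name, url, version, license, digest, notes)
```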
6) Variant-Generation Engines (Rule Packs)
Create modular rule packs keyed by language family and script:
- Etymological cognates: curated tables mapping Johannes → Giovanni → Ioan → Ivan → Yahya, etc., with historical pathways.
- Morphology: suffix/prefix transforms (e.g., Slavic diminutives: -ek, -ka, -sha; Romance: -ito/-ita, -inho/-inha).
- Nicknames: culture-bound substitutions (e.g., Mary → Molly, and Molly → Polly via the M→P shift).
- Orthographic alternation: y/i/j, h-insertion, doubled consonants, legacy spellings.
- Phonology-to-orthography: G2P/P2G per language for plausible spellings.
- Compounding: language-specific compounding rules (Jean-Luc, João Pedro).
- Patronymics/surnames-from-given: John → Johnson, MacIan, Ivanov (with region constraints).
Each rule emits: candidate_form, edge_type, rule_id, confidence_base, explanatory_note.
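A toy emission in that shape, using the nickname example above (the rule id, edge-type choice, and confidence are illustrative; real packs load per-language data files):

```python
def nickname_rule_en_mary(base: str):
    """Culture-bound nickname rule emitting the five fields listed above."""
    table = {"Mary": ["Molly", "Polly"], "Margaret": ["Meg", "Peg"]}  # toy table
    for cand in table.get(base, []):
        yield {
            "candidate_form": cand,
            "edge_type": "ALIASED",
            "rule_id": "EN_NICK_001",          # hypothetical id
            "confidence_base": 0.7,
            "explanatory_note": f"attested English nickname for {base}",
        }
```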
7) Transliteration & Transcription (Cross-Script)
- Use deterministic transliteration tables per direction (ICU/CLDR-class rules).
- For languages with multiple standards (e.g., Russian), support several systems (ISO 9, BGN/PCGN, GOST) and score each accordingly.
- Round-trip fidelity test: A→B→A; penalize lossiness in confidence.
- Maintain phonemic vs orthographic tracks (Arabic names → multiple Latinizations).
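The round-trip test can be sketched with toy per-direction tables (a few Cyrillic letters only; production systems would use full ICU/CLDR-class rule sets):

```python
# Toy transliteration tables for illustration.
CYR2LAT = {"И": "I", "в": "v", "а": "a", "н": "n"}
LAT2CYR = {v: k for k, v in CYR2LAT.items()}

def translit(text: str, table: dict) -> str:
    return "".join(table.get(ch, ch) for ch in text)

def roundtrip_fidelity(src: str, fwd: dict, back: dict) -> float:
    """A -> B -> A; fraction of characters preserved (1.0 = lossless)."""
    recovered = translit(translit(src, fwd), back)
    hits = sum(a == b for a, b in zip(src, recovered))
    return hits / max(len(src), len(recovered))
```

Under this table "Иван" survives the round trip, so fidelity is 1.0; lossy mappings score lower and the loss is penalized in edge confidence.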
8) Statistical & ML Expansion
- String similarity: Levenshtein, Damerau-Levenshtein, Jaro-Winkler (as features, never as sole proof).
- Phonetic hashing: Double Metaphone, NYSIIS, Soundex variants per language (customize!).
- Embeddings: Character/phoneme-level models to score closeness across scripts.
- Candidate ranking model: Gradient boosting or small transformer re-ranker using:
- features: rule_id, path length, freq priors, script distance, round-trip loss, source authority.
- Blocking keys to scale dedup (e.g., fold+metaphone buckets).
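Blocking can be sketched with plain American Soundex (per the caveat above, real deployments customize the phonetic hash per language):

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter + three digits."""
    name = name.upper()
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    out, prev = [name[0]], codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out.append(digit)
        if ch not in "HW":          # H/W do not separate a run; vowels do
            prev = digit
    return (out[0] + "".join(out[1:]) + "000")[:4]

def blocking_key(name: str) -> str:
    # Fold + phonetic bucket: only names sharing a key are pairwise compared.
    return name.casefold()[0] + ":" + soundex(name)
```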
9) Confidence Scoring (Composable)
Define edge confidence C as a convex combination:
C = α(rule_prior) + β(source_authority) + γ(freq_evidence)
+ δ(phonetic_match) + ε(roundtrip_fidelity) + ζ(human_validation)
- Calibrate α…ζ via held-out gold sets; expose per-edge explainability (feature attributions).
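A sketch of the blend with explainability, using the illustrative weights from the worked example in section 20 (real values come from calibration):

```python
# Illustrative weights; α…ζ are calibrated on held-out gold sets in practice.
WEIGHTS = {"rule_prior": 0.25, "source_authority": 0.20, "freq_evidence": 0.15,
           "phonetic_match": 0.20, "roundtrip_fidelity": 0.10,
           "human_validation": 0.10}

def edge_confidence(features: dict) -> tuple[float, dict]:
    """Convex blend C plus per-feature attributions for explainability."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # convexity: weights sum to 1
    contrib = {k: WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS}
    return sum(contrib.values()), contrib
```

Missing evidence (e.g., no human validation yet) contributes zero rather than being imputed, which keeps scores conservative.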
10) Human-in-the-Loop Curation
- Inbox queues: high-potential/low-confidence edges.
- Batch review UI: diff views, IPA overlays, source snippets.
- Adjudication states: accepted/rejected/needs-data; curator notes propagate to rule tuning.
- Active learning: surface patterns where rules underperform; auto-suggest new micro-rules.
11) Graph Growth Strategy (to 100k+)
- Start from root node (e.g., JOHN).
- Expand layered BFS by edge type priority: COGNATE → DIMINUTIVE → TRANSLIT → ORTHO_VARIANT → PHONETIC_VARIANT → COMPOUND_PART → DERIVED_SURNAME.
- Branch control: cap branching factor per layer by confidence & novelty.
- Periodically merge duplicates via canonical keys and curator-confirmed merges.
- Thematic expansions: e.g., enumerate all country/dialect diminutives before moving to compounds.
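The layered expansion can be sketched as follows; `generate_edges(node, etype)` is an assumed hook that invokes the matching rule pack and yields `(candidate, confidence)` pairs:

```python
from collections import deque

# Edge-type priority from the outline.
PRIORITY = ["COGNATE", "DIMINUTIVE", "TRANSLIT", "ORTHO_VARIANT",
            "PHONETIC_VARIANT", "COMPOUND_PART", "DERIVED_SURNAME"]

def layered_bfs(root, generate_edges, max_depth=4, branch_cap=25, min_conf=0.5):
    seen, frontier, graph = {root}, deque([(root, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for etype in PRIORITY:
            # Branch control: keep only the most confident novel forms.
            cands = sorted(generate_edges(node, etype), key=lambda c: -c[1])
            kept = [(f, c) for f, c in cands
                    if c >= min_conf and f not in seen][:branch_cap]
            for form, conf in kept:
                seen.add(form)
                graph.append((node, form, etype, conf))
                frontier.append((form, depth + 1))
    return graph
```

The `seen` set gives cheap novelty control during growth; canonical-key merging and curator-confirmed merges still run as a separate periodic pass.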
12) Deduplication & Disambiguation
- Multi-key match: (folded form, script, lang, metaphone, embedding centroid).
- Contextual disambig: tie to meaning root (e.g., Hebrew Yehochanan “YHWH is gracious”) to separate false friends/homographs across languages with different roots.
- Keep homograph flags with distinct root_etymon_id to avoid over-collapse.
13) Storage, Indexing, and Infra
- Graph DB: Neo4j / JanusGraph (typed edges, path queries).
- Warehouse/Lake: DuckDB/Parquet for bulk ops & reproducibility.
- Search: OpenSearch/Elasticsearch with analyzers per language/script.
- Job orchestration: Airflow/Prefect; every run fingerprinted (Git-tagged rules, source versions).
- Releases: Semantic versioning of the corpus and API.
14) APIs & SDKs
- Resolve: /resolve?name=Giovanni&lang=it → canonical root + neighbors.
- Expand: /expand?root=JOHN&max_depth=4&type=DIMINUTIVE,COGNATE
- Explain: /edge/{id} → rule, sources, confidence breakdown.
- Search: /search?q=Yahya&f=lang:ar with highlighting.
- Bulk: async export endpoints; webhooks for snapshot availability.
- SDKs: Python/JS clients with convenience helpers (transliterate, score, visualize).
15) Evaluation & Gold Sets
- Build language-specific gold lists (curated by onomastics experts).
- Metrics: edge-type accuracy, path precision@k, cultural appropriateness checks, error taxonomy (false merges vs missed links).
- A/B rule tuning with offline evaluation + spot human audits.
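Precision@k against a gold list is the simplest of these metrics; a sketch (the gold set and ranking shown are hypothetical):

```python
def path_precision_at_k(ranked_variants: list, gold: set, k: int) -> float:
    """Fraction of the top-k retrieved variants that appear in the gold list."""
    return sum(v in gold for v in ranked_variants[:k]) / k

# Hypothetical expert gold set for JOHN and a ranked /expand result.
gold = {"Johannes", "Giovanni", "Ivan", "Sean"}
ranked = ["Giovanni", "Ivan", "Jon", "Johannes", "Yahya"]
```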
16) Ethics, Culture, and Safety
- Respect endonym/exonym sensitivities and naming taboos.
- Mark sacred/theophoric names; avoid careless conflation.
- Support opt-out (if tying to living individuals), comply with local data laws.
- Transparent provenance & license display.
17) Performance & Scale
- Precompute nearest-neighbor indices (FAISS/Annoy) for embeddings.
- Use language sharding and script partitions.
- Batch all costly transliteration/phonology steps; cache aggressively.
- Streaming dedup with probabilistic structures (MinHash/LSH) to avoid O(n²).
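A minimal MinHash sketch over character n-grams, assuming seeded MD5 as the hash family (production code would use proper universal hashing and LSH banding):

```python
import hashlib

def char_ngrams(name: str, n: int = 2) -> set:
    padded = f"^{name.casefold()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def minhash(tokens: set, num_hashes: int = 64) -> list:
    """One minimum per seeded hash; signatures estimate Jaccard similarity."""
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(num_hashes)]

def est_jaccard(sig_a: list, sig_b: list) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Bucketing bands of the signature (LSH) lets near-duplicate spellings collide in the same bucket, so only colliding pairs are compared, avoiding the O(n²) all-pairs scan.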
18) Minimal Working Prototype (MVP Path)
- Pick a root (e.g., JOHN).
- Ingest 3–4 trusted seed sources for EN/DE/FR/IT/RU/AR/HE/EL.
- Implement rule packs for:
- Latin↔Cyrillic, Latin↔Greek, Latin↔Arabic/Arabic↔Latin
- Diminutive rules for Slavic/Romance/Gaelic
- Build a confidence scorer v0 (hand-tuned α…ε).
- Stand up Neo4j + a small FastAPI with /expand and /explain.
- Curate 1,000 edges; tune; then scale to 100k via BFS + batching.
19) Example Rule (Pseudocode)
def slavic_diminutive(base: str, lang: str):
    # phon_safe() (assumed helper) applies language-specific stem adjustments
    # before suffixation; Edge is the typed-edge constructor from the data model.
    rules = [
        {"suf": "ek", "w": 0.85}, {"suf": "ka", "w": 0.75}, {"suf": "sha", "w": 0.70},
        {"suf": "enka", "w": 0.65}, {"suf": "ochka", "w": 0.60},
    ]
    for r in rules:
        cand = phon_safe(base, lang) + r["suf"]
        # Base confidence 0.4, scaled toward 1.0 by the per-suffix weight.
        yield Edge(base, cand, type="DIMINUTIVE",
                   rule_id="SLAVIC_DIM_001", confidence=0.4 + 0.6 * r["w"])
20) Example Confidence Blend (Explainable)
Edge: JOHN —(COGNATE)→ IOANNIS
Features: rule_prior=0.92, source_authority=0.90 (lexicon A + corpus B),
freq_evidence=0.48, phonetic_match=0.81, roundtrip=0.76, human_validation=—
C = 0.25*0.92 + 0.20*0.90 + 0.15*0.48 + 0.20*0.81 + 0.10*0.76 + 0.10*0.00
  ≈ 0.72
21) UI for Exploration (Analyst & Public)
- Root page with etymon, gloss (“God is gracious”), timelines, heatmaps.
- Variant facets: by language, script, type, confidence bins.
- Path viewer: shortest/strongest path from root to variant with edge explanations.
- IPA & audio (TTS per language) to compare phonology.
22) Versioning, Reproducibility, and Audit
- Every corpus build pinned to:
- Rule pack commit hashes
- Source versions + checksums
- Model artifacts (embedding, transliteration tables)
- Change logs: added/removed edges, confidence shifts, new languages.
23) Roadmap (Going from Solid to Sublime)
- v0.1: 8 languages, 20k variants, public read API.
- v0.3: 20+ languages, 100k+ variants per major root (John/Mary/Maria).
- v0.5: Curator marketplace; guided corrections; multilingual UI.
- v1.0: Full orthography–phonology–etymology triad with export-grade QA.