The Universal Name Graph

1) Problem Definition & Success Criteria

  • Goal: Given a root (e.g., JOHN), generate and curate ≥100,000 linked variants (diminutives, cognates, orthographic/phonetic/transliteration variants, compounds, surnames-from-given, etc.) with traceable provenance and confidence scoring.
  • Outputs: Graph + API + export (JSONL/CSV/Parquet), reproducible pipeline, and a browsable UI.
  • KPIs: Precision/recall vs. gold sets, % edges with provenance, average confidence ≥ threshold, latency for top-k retrieval, and human-curation velocity.

2) Scope & Taxonomy (Name Types)

  • Given names: canonical forms, cognates, hypocoristics (nicknames), pet forms.
  • Morphological derivatives: diminutive/augmentative suffixes, gendered forms.
  • Orthographic variants: diacritics, historical spellings, spacing/hyphens.
  • Phonetic variants: dialectal shifts, merger/split phenomena.
  • Cross-script forms: transliterations (round-trip when possible).
  • Compounds & theophoric: e.g., John-Paul, Giancarlo.
  • Patronymics & surnames: Johnson, Ivanov, Seán Ó…
  • Inflected forms (where relevant): case/number/grammatical gender in certain languages.

3) Data Model (Graph-First + Relational Mirror)

  • Graph (primary): NameNode(id, lemma, lang, script, era, gender, freq, sources[])
  • Edges (typed): EDGE(type, confidence, rule_id, source_ids[], notes)
    • Types: COGNATE, DIMINUTIVE, ORTHO_VARIANT, PHONETIC_VARIANT, TRANSLIT, COMPOUND_PART, DERIVED_SURNAME, ALIASED, HISTORICAL
  • Relational mirror: For analytics & versioning: names, edges, sources, rules, evaluations.
  • Provenance: Every node/edge must carry source_ids + timestamps + license.
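The node and edge records above can be sketched as Python dataclasses. This is illustrative, not a final schema: field names follow this section, and the `EdgeType` enum simply enumerates the typed edges listed above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class EdgeType(Enum):
    COGNATE = "COGNATE"
    DIMINUTIVE = "DIMINUTIVE"
    ORTHO_VARIANT = "ORTHO_VARIANT"
    PHONETIC_VARIANT = "PHONETIC_VARIANT"
    TRANSLIT = "TRANSLIT"
    COMPOUND_PART = "COMPOUND_PART"
    DERIVED_SURNAME = "DERIVED_SURNAME"
    ALIASED = "ALIASED"
    HISTORICAL = "HISTORICAL"

@dataclass
class NameNode:
    id: str
    lemma: str
    lang: str                     # BCP 47 tag, e.g. "ru"
    script: str                   # ISO 15924 code, e.g. "Cyrl"
    era: Optional[str] = None
    gender: Optional[str] = None
    freq: float = 0.0
    sources: list = field(default_factory=list)

@dataclass
class Edge:
    src_id: str
    dst_id: str
    type: EdgeType
    confidence: float
    rule_id: str
    source_ids: list = field(default_factory=list)
    notes: str = ""
```

The relational mirror can be generated from these records mechanically (one table per dataclass), which keeps graph and warehouse schemas in lockstep.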

4) Canonicalization & Unicode Discipline

  • Normalize to NFC; store raw + normalized.
  • Track casefolded and diacritic-stripped keys for matching.
  • Persist script (ISO 15924), language (BCP 47), region.
  • Treat punctuation/hyphenation as first-class features (not noise).
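The normalization discipline above maps directly onto the standard library; a minimal sketch (the helper name `canonical_keys` is illustrative):

```python
import unicodedata

def canonical_keys(raw: str) -> dict:
    """Return the raw form plus the matching keys described above."""
    nfc = unicodedata.normalize("NFC", raw)
    folded = nfc.casefold()
    # Diacritic-stripped key: decompose (NFD), drop combining marks, recompose.
    stripped = unicodedata.normalize(
        "NFC",
        "".join(ch for ch in unicodedata.normalize("NFD", folded)
                if not unicodedata.combining(ch)),
    )
    return {"raw": raw, "nfc": nfc, "folded": folded, "stripped": stripped}
```

Note that the stripped key is for blocking/matching only; the NFC form (with diacritics intact) remains the stored lemma.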

5) Seed Ingestion (Foundations)

  • Open lexicographic/onomastic sources (structured + scraped).
  • Civil registries & frequency lists (SSA-style, electoral rolls where legally allowed).
  • Wikidata-style knowledge (for multilingual bridges).
  • Clerical & historical corpora (transliterations; archaic forms).
  • Phonology resources (IPA mappings, grapheme-to-phoneme models per language).
  • Store each seed with license, version, URL/hash, and coverage notes.
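A seed record can be pinned the same way the bullets describe; a sketch assuming SHA-256 as the content hash (the field names are illustrative):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class SeedSource:
    name: str
    url: str
    version: str
    license: str
    sha256: str
    coverage_notes: str = ""

def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest used to pin a seed file's exact contents."""
    return hashlib.sha256(data).hexdigest()
```

Storing the digest alongside URL and version lets a later build verify that a re-downloaded seed is byte-identical to the one originally ingested.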

6) Variant-Generation Engines (Rule Packs)

Create modular rule packs keyed by language family and script:

  • Etymological cognates: curated tables mapping Johannes → Giovanni → Ioan → Ivan → Yahya, etc., with historical pathways.
  • Morphology: suffix/prefix transforms (e.g., Slavic diminutives: -ek, -ka, -sha; Romance: -ito/-ita, -inho/-inha).
  • Nicknames: culture-bound substitutions (e.g., Mary → Molly, then Molly → Polly via the M→P shift).
  • Orthographic alternation: y/i/j, h-insertion, doubled consonants, legacy spellings.
  • Phonology-to-orthography: G2P/P2G per language for plausible spellings.
  • Compounding: language-specific compounding rules (Jean-Luc, João Pedro).
  • Patronymics/surnames-from-given: John → Johnson, MacIan, Ivanov (with region constraints).

Each rule emits: candidate_form, edge_type, rule_id, confidence_base, explanatory_note.

7) Transliteration & Transcription (Cross-Script)

  • Use deterministic transliteration tables per direction (ICU/CLDR-class rules).
  • For languages with multiple standards (e.g., Russian), support systems (ISO 9, BGN/PCGN, GOST) and score accordingly.
  • Round-trip fidelity test: A→B→A; penalize lossiness in confidence.
  • Maintain phonemic vs orthographic tracks (Arabic names → multiple Latinizations).
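The round-trip fidelity test is easy to score directly: transliterate A→B→A and penalize by normalized edit distance. A sketch, with `fwd`/`back` standing in for real ICU/CLDR-class transform functions:

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein dynamic program (single-row variant).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def roundtrip_fidelity(src: str, fwd, back) -> float:
    """Score A→B→A fidelity in [0, 1]; lossy mappings are penalized by
    normalized edit distance between the original and the round trip."""
    recovered = back(fwd(src))
    denom = max(len(src), len(recovered), 1)
    return 1.0 - edit_distance(src.casefold(), recovered.casefold()) / denom
```

The resulting score feeds the roundtrip term of the confidence blend in section 9.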

8) Statistical & ML Expansion

  • String similarity: Levenshtein, Damerau-Levenshtein, Jaro-Winkler (as features, never as sole proof).
  • Phonetic hashing: Double Metaphone, NYSIIS, Soundex variants per language (customize!).
  • Embeddings: Character/phoneme-level models to score closeness across scripts.
  • Candidate ranking model: Gradient boosting or small transformer re-ranker using:
    • features: rule_id, path length, freq priors, script distance, round-trip loss, source authority.
  • Blocking keys to scale dedup (e.g., fold+metaphone buckets).
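As the simplest member of the phonetic-hashing family named above, here is classic (English-oriented) Soundex plus a blocking key built on it. This is a sketch: per the text, real deployments should customize per language or swap in Double Metaphone/NYSIIS.

```python
def soundex(name: str) -> str:
    """Classic Soundex: first letter + 3 digits; h/w do not break a code run."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    out, prev = name[0].upper(), codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":   # vowels reset the run; h/w are transparent
            prev = code
    return (out + "000")[:4]

def blocking_key(folded: str) -> str:
    # Cheap dedup bucket in the spirit of "fold+metaphone buckets".
    return f"{folded[:1]}|{soundex(folded)}"
```

Only candidates sharing a blocking key are compared pairwise, which is what keeps dedup sub-quadratic at 100k+ nodes.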

9) Confidence Scoring (Composable)

Define edge confidence C as a convex combination:

C = α(rule_prior) + β(source_authority) + γ(freq_evidence) 
    + δ(phonetic_match) + ε(roundtrip_fidelity) + ζ(human_validation)
  • Calibrate α…ζ via held-out gold sets; expose per-edge explainability (feature attributions).
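The blend is straightforward to implement with the attributions attached for explainability. A sketch; the weights here are illustrative stand-ins for the calibrated α…ζ (they match the worked example in section 20 and sum to 1.0, keeping the combination convex):

```python
WEIGHTS = {  # α…ζ; must sum to 1.0 so the blend stays convex
    "rule_prior": 0.25, "source_authority": 0.20, "freq_evidence": 0.15,
    "phonetic_match": 0.20, "roundtrip_fidelity": 0.10, "human_validation": 0.10,
}

def edge_confidence(features: dict) -> tuple:
    """Return (C, per-feature attributions) for one edge.
    Missing features contribute 0, e.g. an edge with no human validation yet."""
    attributions = {k: w * features.get(k, 0.0) for k, w in WEIGHTS.items()}
    return sum(attributions.values()), attributions
```

Exposing the `attributions` dict per edge is what makes the /explain endpoint in section 14 cheap to serve.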

10) Human-in-the-Loop Curation

  • Inbox queues: high-potential/low-confidence edges.
  • Batch review UI: diff views, IPA overlays, source snippets.
  • Adjudication states: accepted/rejected/needs-data; curator notes propagate to rule tuning.
  • Active learning: surface patterns where rules underperform; auto-suggest new micro-rules.

11) Graph Growth Strategy (to 100k+)

  • Start from root node (e.g., JOHN).
  • Expand layered BFS by edge type priority: COGNATE → DIMINUTIVE → TRANSLIT → ORTHO → PHONETIC → COMPOUND → DERIVED_SURNAME.
  • Branch control: cap branching factor per layer by confidence & novelty.
  • Periodically merge duplicates via canonical keys and curator-confirmed merges.
  • Thematic expansions: e.g., enumerate all country/dialect diminutives before moving to compounds.
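The layered expansion above can be sketched as follows. `neighbors(node, edge_type) -> [(candidate, confidence), ...]` is an assumed interface onto the rule engines; per-layer branching is capped by confidence rank, and growth stops at the node budget.

```python
PRIORITY = ["COGNATE", "DIMINUTIVE", "TRANSLIT", "ORTHO_VARIANT",
            "PHONETIC_VARIANT", "COMPOUND_PART", "DERIVED_SURNAME"]

def expand(root, neighbors, max_nodes=100_000, branch_cap=25):
    """Layered expansion: one pass per edge type in priority order,
    keeping only the top-`branch_cap` candidates per node."""
    seen = {root}
    for edge_type in PRIORITY:
        # Expand every node discovered so far with this edge type.
        for node in list(seen):
            ranked = sorted(neighbors(node, edge_type), key=lambda c: -c[1])
            for cand, _conf in ranked[:branch_cap]:
                if len(seen) >= max_nodes:
                    return seen
                seen.add(cand)
    return seen
```

Novelty filtering and curator-confirmed merges would run between layers in a production pipeline; they are omitted here for brevity.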

12) Deduplication & Disambiguation

  • Multi-key match: (folded form, script, lang, metaphone, embedding centroid).
  • Contextual disambig: tie to meaning root (e.g., Hebrew Yehochanan “YHWH is gracious”) to separate false friends/homographs across languages with different roots.
  • Keep homograph flags with distinct root_etymon_id to avoid over-collapse.

13) Storage, Indexing, and Infra

  • Graph DB: Neo4j / JanusGraph (typed edges, path queries).
  • Warehouse/Lake: DuckDB/Parquet for bulk ops & reproducibility.
  • Search: OpenSearch/Elasticsearch with analyzers per language/script.
  • Job orchestration: Airflow/Prefect; every run fingerprinted (Git-tagged rules, source versions).
  • Releases: Semantic versioning of the corpus and API.

14) APIs & SDKs

  • Resolve: /resolve?name=Giovanni&lang=it → canonical root + neighbors.
  • Expand: /expand?root=JOHN&max_depth=4&type=DIMINUTIVE,COGNATE
  • Explain: /edge/{id} → rule, sources, confidence breakdown.
  • Search: /search?q=Yahya&f=lang:ar with highlighting.
  • Bulk: async export endpoints; webhooks for snapshot availability.
  • SDKs: Python/JS clients with convenience helpers (transliterate, score, visualize).

15) Evaluation & Gold Sets

  • Build language-specific gold lists (curated by onomastics experts).
  • Metrics: edge-type accuracy, path precision@k, cultural appropriateness checks, error taxonomy (false merges vs missed links).
  • A/B rule tuning with offline evaluation + spot human audits.

16) Ethics, Culture, and Safety

  • Respect endonym/exonym sensitivities and naming taboos.
  • Mark sacred/theophoric names; avoid careless conflation.
  • Support opt-out (if tying to living individuals), comply with local data laws.
  • Transparent provenance & license display.

17) Performance & Scale

  • Precompute nearest-neighbor indices (FAISS/Annoy) for embeddings.
  • Use language sharding and script partitions.
  • Batch all costly transliteration/phonology steps; cache aggressively.
  • Streaming dedup with probabilistic structures (MinHash/LSH) to avoid O(n²).
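A minimal MinHash sketch for the streaming-dedup bullet, using character bigram shingles and CRC32 as a cheap deterministic hash family (a real deployment would band the signatures into LSH buckets):

```python
import zlib

def shingles(name: str, k: int = 2) -> set:
    """Character k-grams of a boundary-padded, casefolded name."""
    padded = f"^{name.casefold()}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def minhash_signature(name: str, num_hashes: int = 64) -> tuple:
    """One min-hash per seed; the fraction of matching signature slots
    estimates the Jaccard similarity of the two shingle sets."""
    grams = shingles(name)
    return tuple(
        min(zlib.crc32(f"{seed}:{g}".encode()) for g in grams)
        for seed in range(num_hashes)
    )

def est_jaccard(sig_a, sig_b) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Because only names landing in the same LSH bucket are compared, pairwise similarity checks stay far below O(n²) as the graph grows.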

18) Minimal Working Prototype (MVP Path)

  1. Pick a root (e.g., JOHN).
  2. Ingest 3–4 trusted seed sources for EN/DE/FR/IT/RU/AR/HE/EL.
  3. Implement rule packs for:
    • Latin↔Cyrillic, Latin↔Greek, Latin↔Arabic/Arabic↔Latin
    • Diminutive rules for Slavic/Romance/Gaelic
  4. Build a confidence scorer v0 (hand-tuned α…ε).
  5. Stand up Neo4j + a small FastAPI with /expand and /explain.
  6. Curate 1,000 edges; tune; then scale to 100k via BFS + batching.

19) Example Rule (Python)

from dataclasses import dataclass

@dataclass
class Edge:
    # Minimal stand-in for the edge record in section 3.
    src: str
    dst: str
    type: str
    rule_id: str
    confidence: float

def phon_safe(base: str, lang: str) -> str:
    # Stub: drop a trailing vowel so the suffix attaches cleanly;
    # real rule packs apply language-specific stem adjustments.
    return base[:-1] if base and base[-1].lower() in "aeiouy" else base

def slavic_diminutive(base: str, lang: str):
    # (suffix, weight): weight reflects the suffix's productivity.
    rules = [("ek", 0.85), ("ka", 0.75), ("sha", 0.70),
             ("enka", 0.65), ("ochka", 0.60)]
    for suf, w in rules:
        cand = phon_safe(base, lang) + suf
        yield Edge(base, cand, type="DIMINUTIVE",
                   rule_id="SLAVIC_DIM_001", confidence=0.4 + 0.6 * w)

20) Example Confidence Blend (Explainable)

Edge: JOHN —(COGNATE)→ IOANNIS
Features: rule_prior=0.92, source_authority=0.90 (lexicon A + corpus B),
freq_evidence=0.48, phonetic_match=0.81, roundtrip=0.76, human_validation=—
C = 0.25*0.92 + 0.20*0.90 + 0.15*0.48 + 0.20*0.81 + 0.10*0.76 + 0.10*0.00
  = 0.72

21) UI for Exploration (Analyst & Public)

  • Root page with etymon, gloss (“God is gracious”), timelines, heatmaps.
  • Variant facets: by language, script, type, confidence bins.
  • Path viewer: shortest/strongest path from root to variant with edge explanations.
  • IPA & audio (TTS per language) to compare phonology.

22) Versioning, Reproducibility, and Audit

  • Every corpus build pinned to:
    • Rule pack commit hashes
    • Source versions + checksums
    • Model artifacts (embedding, transliteration tables)
  • Change logs: added/removed edges, confidence shifts, new languages.

23) Roadmap (Going from Solid to Sublime)

  • v0.1: 8 languages, 20k variants, public read API.
  • v0.3: 20+ languages, 100k+ variants per major root (John/Mary/Maria).
  • v0.5: Curator marketplace; guided corrections; multilingual UI.
  • v1.0: Full orthography–phonology–etymology triad with export-grade QA.