1) Problem Definition & Success Criteria
- Goal: Given a root (e.g., JOHN), generate and curate ≥100,000 linked variants (diminutives, cognates, orthographic/phonetic/transliteration variants, compounds, surnames-from-given, etc.) with traceable provenance and confidence scoring.
- Outputs: Graph + API + export (JSONL/CSV/Parquet), reproducible pipeline, and a browsable UI.
- KPIs: Precision/recall vs. gold sets, % edges with provenance, average confidence ≥ threshold, latency for top-k retrieval, and human-curation velocity.
2) Scope & Taxonomy (Name Types)
- Given names: canonical forms, cognates, hypocoristics (nicknames), pet forms.
- Morphological derivatives: diminutive/augmentative suffixes, gendered forms.
- Orthographic variants: diacritics, historical spellings, spacing/hyphens.
- Phonetic variants: dialectal shifts, merger/split phenomena.
- Cross-script forms: transliterations (round-trip when possible).
- Compounds & theophoric: e.g., John-Paul, Giancarlo.
- Patronymics & surnames: Johnson, Ivanov, Seán Ó…
- Inflected forms (where relevant): case/number/grammatical gender in certain languages.
3) Data Model (Graph-First + Relational Mirror)
- Graph (primary):
  - NameNode(id, lemma, lang, script, era, gender, freq, sources[])
  - Edges (typed): EDGE(type, confidence, rule_id, source_ids[], notes)
  - Edge types: COGNATE, DIMINUTIVE, ORTHO_VARIANT, PHONETIC_VARIANT, TRANSLIT, COMPOUND_PART, DERIVED_SURNAME, ALIASED, HISTORICAL
- Relational mirror (for analytics & versioning): names, edges, sources, rules, evaluations.
- Provenance: every node/edge must carry source_ids + timestamps + license.
4) Canonicalization & Unicode Discipline
- Normalize to NFC; store raw + normalized.
- Track casefolded and diacritic-stripped keys for matching.
- Persist script (ISO 15924), language (BCP 47), region.
- Treat punctuation/hyphenation as first-class features (not noise).
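The normalization discipline above can be sketched with the standard library alone (the function name `canonical_keys` is illustrative):

```python
import unicodedata

def canonical_keys(raw: str) -> dict:
    """Derive the matching keys described above from one raw form."""
    nfc = unicodedata.normalize("NFC", raw)       # store raw + normalized
    folded = nfc.casefold()                        # casefolded key
    # Diacritic-stripped key: decompose, drop combining marks.
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFD", folded)
        if not unicodedata.combining(ch)
    )
    return {"raw": raw, "nfc": nfc, "folded": folded, "stripped": stripped}
```

All four keys are persisted per node so that matching can choose the strictest key that still finds candidates.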
5) Seed Ingestion (Foundations)
- Open lexicographic/onomastic sources (structured + scraped).
- Civil registries & frequency lists (SSA-style, electoral rolls where legally allowed).
- Wikidata-style knowledge (for multilingual bridges).
- Clerical & historical corpora (transliterations; archaic forms).
- Phonology resources (IPA mappings, grapheme-to-phoneme models per language).
- Store each seed with license, version, URL/hash, and coverage notes.
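A sketch of that seed registration record, assuming the snapshot bytes are hashed at ingest time (names and the example URL are hypothetical):

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class SeedSource:
    """Registration record for one ingested seed, per the checklist above."""
    name: str
    url: str
    version: str
    license: str
    content_hash: str        # checksum of the snapshot actually ingested
    coverage_notes: str = ""

def register_seed(name, url, version, license, payload: bytes, notes=""):
    digest = hashlib.sha256(payload).hexdigest()
    return SeedSource(name, url, version, license, digest, notes)
```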
6) Variant-Generation Engines (Rule Packs)
Create modular rule packs keyed by language family and script:
- Etymological cognates: curated tables mapping Johannes → Giovanni → Ioan → Ivan → Yahya, etc., with historical pathways.
- Morphology: suffix/prefix transforms (e.g., Slavic diminutives: -ek, -ka, -sha; Romance: -ito/-ita, -inho/-inha).
- Nicknames: culture-bound substitutions (e.g., Mary → Molly, and Molly → Polly via the M→P shift).
- Orthographic alternation: y/i/j, h-insertion, doubled consonants, legacy spellings.
- Phonology-to-orthography: G2P/P2G per language for plausible spellings.
- Compounding: language-specific compounding rules (Jean-Luc, João Pedro).
- Patronymics/surnames-from-given: John → Johnson, MacIan, Ivanov (with region constraints).
Each rule emits: candidate_form, edge_type, rule_id, confidence_base, explanatory_note.
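A toy emission in that shape, using the nickname example above (the rule id, edge-type choice, and confidence are illustrative; real packs load per-language data files):

```python
def nickname_rule_en_mary(base: str):
    """Culture-bound nickname rule emitting the five fields listed above."""
    table = {"Mary": ["Molly", "Polly"], "Margaret": ["Meg", "Peg"]}  # toy table
    for cand in table.get(base, []):
        yield {
            "candidate_form": cand,
            "edge_type": "ALIASED",
            "rule_id": "EN_NICK_001",          # hypothetical id
            "confidence_base": 0.7,
            "explanatory_note": f"attested English nickname for {base}",
        }
```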
7) Transliteration & Transcription (Cross-Script)
- Use deterministic transliteration tables per direction (ICU/CLDR-class rules).
- For languages with multiple standards (e.g., Russian), support several systems (ISO 9, BGN/PCGN, GOST) and score each accordingly.
- Round-trip fidelity test: A→B→A; penalize lossiness in confidence.
- Maintain phonemic vs orthographic tracks (Arabic names → multiple Latinizations).
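The round-trip test can be sketched with toy per-direction tables (a few Cyrillic letters only; production systems would use full ICU/CLDR-class rule sets):

```python
# Toy transliteration tables for illustration.
CYR2LAT = {"И": "I", "в": "v", "а": "a", "н": "n"}
LAT2CYR = {v: k for k, v in CYR2LAT.items()}

def translit(text: str, table: dict) -> str:
    return "".join(table.get(ch, ch) for ch in text)

def roundtrip_fidelity(src: str, fwd: dict, back: dict) -> float:
    """A -> B -> A; fraction of characters preserved (1.0 = lossless)."""
    recovered = translit(translit(src, fwd), back)
    hits = sum(a == b for a, b in zip(src, recovered))
    return hits / max(len(src), len(recovered))
```

Under this table "Иван" survives the round trip, so fidelity is 1.0; lossy mappings score lower and the loss is penalized in edge confidence.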
8) Statistical & ML Expansion
- String similarity: Levenshtein, Damerau-Levenshtein, Jaro-Winkler (as features, never as sole proof).
- Phonetic hashing: Double Metaphone, NYSIIS, Soundex variants per language (customize!).
- Embeddings: Character/phoneme-level models to score closeness across scripts.
- Candidate ranking model: Gradient boosting or small transformer re-ranker using:
- features: rule_id, path length, freq priors, script distance, round-trip loss, source authority.
- Blocking keys to scale dedup (e.g., fold+metaphone buckets).
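Blocking can be sketched with plain American Soundex (per the caveat above, real deployments customize the phonetic hash per language):

```python
def soundex(name: str) -> str:
    """Classic American Soundex: first letter + three digits."""
    name = name.upper()
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    out, prev = [name[0]], codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out.append(digit)
        if ch not in "HW":          # H/W do not separate a run; vowels do
            prev = digit
    return (out[0] + "".join(out[1:]) + "000")[:4]

def blocking_key(name: str) -> str:
    # Fold + phonetic bucket: only names sharing a key are pairwise compared.
    return name.casefold()[0] + ":" + soundex(name)
```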
9) Confidence Scoring (Composable)
Define edge confidence C as a convex combination:
C = α(rule_prior) + β(source_authority) + γ(freq_evidence)
+ δ(phonetic_match) + ε(roundtrip_fidelity) + ζ(human_validation)
- Calibrate α…ζ via held-out gold sets; expose per-edge explainability (feature attributions).
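A sketch of the blend with explainability, using the illustrative weights from the worked example in section 20 (real values come from calibration):

```python
# Illustrative weights; α…ζ are calibrated on held-out gold sets in practice.
WEIGHTS = {"rule_prior": 0.25, "source_authority": 0.20, "freq_evidence": 0.15,
           "phonetic_match": 0.20, "roundtrip_fidelity": 0.10,
           "human_validation": 0.10}

def edge_confidence(features: dict) -> tuple[float, dict]:
    """Convex blend C plus per-feature attributions for explainability."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # convexity: weights sum to 1
    contrib = {k: WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS}
    return sum(contrib.values()), contrib
```

Missing evidence (e.g., no human validation yet) contributes zero rather than being imputed, which keeps scores conservative.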
10) Human-in-the-Loop Curation
- Inbox queues: high-potential/low-confidence edges.
- Batch review UI: diff views, IPA overlays, source snippets.
- Adjudication states: accepted/rejected/needs-data; curator notes propagate to rule tuning.
- Active learning: surface patterns where rules underperform; auto-suggest new micro-rules.
11) Graph Growth Strategy (to 100k+)
- Start from root node (e.g., JOHN).
- Expand layered BFS by edge type priority: COGNATE → DIMINUTIVE → TRANSLIT → ORTHO_VARIANT → PHONETIC_VARIANT → COMPOUND_PART → DERIVED_SURNAME.
- Branch control: cap branching factor per layer by confidence & novelty.
- Periodically merge duplicates via canonical keys and curator-confirmed merges.
- Thematic expansions: e.g., enumerate all country/dialect diminutives before moving to compounds.
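The layered expansion can be sketched as follows; `generate_edges(node, etype)` is an assumed hook that invokes the matching rule pack and yields `(candidate, confidence)` pairs:

```python
from collections import deque

# Edge-type priority from the outline.
PRIORITY = ["COGNATE", "DIMINUTIVE", "TRANSLIT", "ORTHO_VARIANT",
            "PHONETIC_VARIANT", "COMPOUND_PART", "DERIVED_SURNAME"]

def layered_bfs(root, generate_edges, max_depth=4, branch_cap=25, min_conf=0.5):
    seen, frontier, graph = {root}, deque([(root, 0)]), []
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for etype in PRIORITY:
            # Branch control: keep only the most confident novel forms.
            cands = sorted(generate_edges(node, etype), key=lambda c: -c[1])
            kept = [(f, c) for f, c in cands
                    if c >= min_conf and f not in seen][:branch_cap]
            for form, conf in kept:
                seen.add(form)
                graph.append((node, form, etype, conf))
                frontier.append((form, depth + 1))
    return graph
```

The `seen` set gives cheap novelty control during growth; canonical-key merging and curator-confirmed merges still run as a separate periodic pass.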
12) Deduplication & Disambiguation
- Multi-key match: (folded form, script, lang, metaphone, embedding centroid).
- Contextual disambig: tie to meaning root (e.g., Hebrew Yehochanan “YHWH is gracious”) to separate false friends/homographs across languages with different roots.
- Keep homograph flags with distinct root_etymon_id to avoid over-collapse.
13) Storage, Indexing, and Infra
- Graph DB: Neo4j / JanusGraph (typed edges, path queries).
- Warehouse/Lake: DuckDB/Parquet for bulk ops & reproducibility.
- Search: OpenSearch/Elasticsearch with analyzers per language/script.
- Job orchestration: Airflow/Prefect; every run fingerprinted (Git-tagged rules, source versions).
- Releases: Semantic versioning of the corpus and API.
14) APIs & SDKs
- Resolve: /resolve?name=Giovanni&lang=it → canonical root + neighbors.
- Expand: /expand?root=JOHN&max_depth=4&type=DIMINUTIVE,COGNATE
- Explain: /edge/{id} → rule, sources, confidence breakdown.
- Search: /search?q=Yahya&f=lang:ar with highlighting.
- Bulk: async export endpoints; webhooks for snapshot availability.
- SDKs: Python/JS clients with convenience helpers (transliterate, score, visualize).
15) Evaluation & Gold Sets
- Build language-specific gold lists (curated by onomastics experts).
- Metrics: edge-type accuracy, path precision@k, cultural appropriateness checks, error taxonomy (false merges vs missed links).
- A/B rule tuning with offline evaluation + spot human audits.
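Precision@k against a gold list is the simplest of these metrics; a sketch (the gold set and ranking shown are hypothetical):

```python
def path_precision_at_k(ranked_variants: list, gold: set, k: int) -> float:
    """Fraction of the top-k retrieved variants that appear in the gold list."""
    return sum(v in gold for v in ranked_variants[:k]) / k

# Hypothetical expert gold set for JOHN and a ranked /expand result.
gold = {"Johannes", "Giovanni", "Ivan", "Sean"}
ranked = ["Giovanni", "Ivan", "Jon", "Johannes", "Yahya"]
```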
16) Ethics, Culture, and Safety
- Respect endonym/exonym sensitivities and naming taboos.
- Mark sacred/theophoric names; avoid careless conflation.
- Support opt-out (if tying to living individuals), comply with local data laws.
- Transparent provenance & license display.
17) Performance & Scale
- Precompute nearest-neighbor indices (FAISS/Annoy) for embeddings.
- Use language sharding and script partitions.
- Batch all costly transliteration/phonology steps; cache aggressively.
- Streaming dedup with probabilistic structures (MinHash/LSH) to avoid O(n²).
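A minimal MinHash sketch over character n-grams, assuming seeded MD5 as the hash family (production code would use proper universal hashing and LSH banding):

```python
import hashlib

def char_ngrams(name: str, n: int = 2) -> set:
    padded = f"^{name.casefold()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def minhash(tokens: set, num_hashes: int = 64) -> list:
    """One minimum per seeded hash; signatures estimate Jaccard similarity."""
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(num_hashes)]

def est_jaccard(sig_a: list, sig_b: list) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Bucketing bands of the signature (LSH) lets near-duplicate spellings collide in the same bucket, so only colliding pairs are compared, avoiding the O(n²) all-pairs scan.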
18) Minimal Working Prototype (MVP Path)
- Pick a root (e.g., JOHN).
- Ingest 3–4 trusted seed sources for EN/DE/FR/IT/RU/AR/HE/EL.
- Implement rule packs for:
- Latin↔Cyrillic, Latin↔Greek, Latin↔Arabic/Arabic↔Latin
- Diminutive rules for Slavic/Romance/Gaelic
- Build a confidence scorer v0 (hand-tuned α…ε).
- Stand up Neo4j + a small FastAPI with /expand and /explain.
- Curate 1,000 edges; tune; then scale to 100k via BFS + batching.
19) Example Rule (Pseudocode)
def slavic_diminutive(base: str, lang: str):
    # phon_safe() (assumed helper) applies language-specific stem adjustments
    # before suffixation; Edge is the typed-edge constructor from the data model.
    rules = [
        {"suf": "ek", "w": 0.85}, {"suf": "ka", "w": 0.75}, {"suf": "sha", "w": 0.70},
        {"suf": "enka", "w": 0.65}, {"suf": "ochka", "w": 0.60},
    ]
    for r in rules:
        cand = phon_safe(base, lang) + r["suf"]
        # Base confidence 0.4, scaled toward 1.0 by the per-suffix weight.
        yield Edge(base, cand, type="DIMINUTIVE",
                   rule_id="SLAVIC_DIM_001", confidence=0.4 + 0.6 * r["w"])
20) Example Confidence Blend (Explainable)
Edge: JOHN —(COGNATE)→ IOANNIS
Features: rule_prior=0.92, source_authority=0.90 (lexicon A + corpus B),
freq_evidence=0.48, phonetic_match=0.81, roundtrip=0.76, human_validation=—
C = 0.25*0.92 + 0.20*0.90 + 0.15*0.48 + 0.20*0.81 + 0.10*0.76 + 0.10*0.00
  ≈ 0.72
21) UI for Exploration (Analyst & Public)
- Root page with etymon, gloss (“God is gracious”), timelines, heatmaps.
- Variant facets: by language, script, type, confidence bins.
- Path viewer: shortest/strongest path from root to variant with edge explanations.
- IPA & audio (TTS per language) to compare phonology.
22) Versioning, Reproducibility, and Audit
- Every corpus build pinned to:
- Rule pack commit hashes
- Source versions + checksums
- Model artifacts (embedding, transliteration tables)
- Change logs: added/removed edges, confidence shifts, new languages.
23) Roadmap (Going from Solid to Sublime)
- v0.1: 8 languages, 20k variants, public read API.
- v0.3: 20+ languages, 100k+ variants per major root (John/Mary/Maria).
- v0.5: Curator marketplace; guided corrections; multilingual UI.
- v1.0: Full orthography–phonology–etymology triad with export-grade QA.