One Golden Record—Governed, Shareable, and Auditable
Master Data Management (MDM) creates a single, trusted view of customers, products, providers, locations, assets—so analytics are consistent, apps agree, and AI learns from clean truth.
SolveForce implements MDM as a system: governed models → match/merge rules → survivorship → versioned history → synchronized downstream consumers—wired to lineage, DQ tests, and SIEM/SOAR evidence.
Connective tissue:
🏛️ Data Platform → /data-warehouse • 🔄 Pipelines → /etl-elt • 📚 Governance → /data-governance
🔐 Privacy/Egress → /dlp • 👤 Lifecycle (identity links) → /identity-lifecycle
🧠 AI/RAG → /vector-databases • 📊 Evidence/Automation → /siem-soar
🎯 Outcomes (Why SolveForce MDM)
- One golden record per domain with explainable survivorship and audit trail.
- Consistent analytics & AI — metrics and features reference the same IDs across apps and regions.
- Fewer data defects — standardization, validation, and DQ gates catch issues before downstream apps do.
- Faster change — governed models and APIs make adds/updates predictable.
- Audit-ready — lineage, versions, match decisions, approvals, and deltas export to SIEM.
🧭 Scope (What We Build & Operate)
- Domains — Customer/Patient/Provider • Product/Catalog • Location/Site • Supplier/Vendor • Asset/Device.
- Core services — standardization (names/addresses/phones/codes), match/merge, survivorship, versioning (SCD2), crosswalk/ID mapping.
- Integration — CDC/ELT ingest, hub & registry patterns, publish/subscribe to downstream apps (CRM/ERP/EHR/Commerce), event APIs.
- Stewardship — UI & workflows for review/exceptions; task queues; approvals; comments with history.
- Governance — business glossary, reference data (codes/values), policy catalog, data contracts, lineage.
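The standardization core service can be illustrated with a minimal sketch. It assumes only the Python standard library; the function names, the default country code, and the length limits are hypothetical choices for illustration, not SolveForce's actual rules. Production phone and address handling would normally rely on a dedicated validation service.

```python
import re
import unicodedata
from typing import Optional


def normalize_name(raw: str) -> str:
    """Strip accents, collapse whitespace, and uppercase a name for matching."""
    decomposed = unicodedata.normalize("NFKD", raw)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return re.sub(r"\s+", " ", without_accents).strip().upper()


def normalize_email(raw: str) -> str:
    """Trim and lowercase an email address (simplified; no syntax validation)."""
    return raw.strip().lower()


def normalize_phone(raw: str, default_country_code: str = "1") -> Optional[str]:
    """Return '+<digits>' or None when the digit count is implausible.

    Assumption: a bare 10-digit number belongs to the default country code.
    """
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = default_country_code + digits
    if not 11 <= len(digits) <= 15:
        return None
    return "+" + digits
```

Returning `None` instead of a best guess keeps bad values visible, so a DQ gate can quarantine the row rather than pass a malformed phone number downstream.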
🧱 Building Blocks (Spelled Out)
- Modeling
  - Canonical schemas per domain (core + extensions); type systems, code sets, and valid state machines.
  - Reference data management (taxonomy, hierarchies) with versioning.
- Standardization & Validation
  - Address/postal standardization & geocoding, phone/email normalization, product identifiers (GTIN/SKU/UDI), legal entity formats, healthcare/provider IDs (NPI), etc.
  - Rule sets with regex/lookup & ML-assisted cleansers; DQ tests (nulls/range/PK-FK/uniqueness).
- Match & Merge
  - Deterministic keys (exact/phonetic/ID-based) + probabilistic/ML scores; threshold bands (auto-match / suspect, routed to steward review / no-match).
  - Survivorship policy (source trust, recency, completeness) with explainable decisions and versioned history.
- Golden Record & Crosswalk
  - Persist the golden record as SCD2 history; maintain crosswalks across source IDs; emit an event on change for subscribers.
- Distribution
  - Pub/Sub topics & REST/GraphQL APIs; CDC to marts and apps; caching & SLAs by domain.
- Security & Privacy
  - Row- and column-level security (RLS/CLS) & label-based masking; DLP for PII/PHI/PAN; tokenization where required; immutable logs. → /dlp
- Evidence
  - Lineage (column-level), DQ scores, match rules & overrides, steward actions, and publish events → SIEM/SOAR. → /siem-soar
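The Match & Merge block above combines deterministic keys with scored matching and threshold bands. The sketch below shows one way that could look. The field names (`tax_id`, `name`, `email`), the 0.6/0.4 weights, and both thresholds are assumptions for illustration, and `difflib.SequenceMatcher` stands in for whatever probabilistic or ML scorer a real deployment would use.

```python
from difflib import SequenceMatcher

# Hypothetical band boundaries; real values are tuned per domain.
AUTO_MATCH_THRESHOLD = 0.90
SUSPECT_THRESHOLD = 0.70


def _same(a, b) -> bool:
    """True when both values are present and equal ignoring case."""
    return bool(a) and bool(b) and str(a).lower() == str(b).lower()


def match_score(left: dict, right: dict) -> float:
    """Score a candidate pair between 0.0 and 1.0."""
    # Deterministic rule: a shared identifier decides the match outright.
    if _same(left.get("tax_id"), right.get("tax_id")):
        return 1.0
    # Fuzzy rule: weighted blend of name similarity and email equality.
    name_similarity = SequenceMatcher(
        None,
        (left.get("name") or "").lower(),
        (right.get("name") or "").lower(),
    ).ratio()
    email_equal = 1.0 if _same(left.get("email"), right.get("email")) else 0.0
    return 0.6 * name_similarity + 0.4 * email_equal


def classify(score: float) -> str:
    """Map a score to a band: auto-match, suspect (steward review), or no-match."""
    if score >= AUTO_MATCH_THRESHOLD:
        return "auto-match"
    if score >= SUSPECT_THRESHOLD:
        return "suspect"
    return "no-match"
```

With these weights, a pair that differs on email and has no shared identifier can score at most 0.6, so it never reaches the review queue on name similarity alone. That is a deliberate bias toward precision in this sketch.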
🧰 Reference Architectures (Choose Your Fit)
A) Registry + Publish (Lightweight, Fast)
- Golden stored in warehouse/lake; registry tables + crosswalks; Pub/Sub events to apps; stewardship in BI/micro-UI.
B) Hub (Operational MDM)
- Dedicated hub with APIs, match/merge engine, workflow UI; bidirectional sync to CRM/ERP/EHR; near-real-time events.
C) Analytical MDM (Lakehouse)
- ELT/dbt standardization → match/merge in SQL/Spark; golden to curated zone; features exported to AI with provenance.
D) Privacy-First MDM
- Domain labels (PII/PHI/PAN), tokenization, RLS/CLS, regional perimeters; VPC Service Controls or private endpoints for cloud services.
E) Multi-Region / Sovereign
- Region-bound masters + periodic reconciliation; conflict rules; regional caches; lawful processing & residency controls.
📐 SLO Guardrails (You Can Measure)
| KPI / SLO | Target (Recommended) |
|---|---|
| DQ pass rate (golden tables) | ≥ 99% |
| Match precision / recall (golden IDs) | ≥ 98% / ≥ 95% (domain-tuned) |
| Golden availability | ≥ 99.95% |
| Publish latency (source→golden event) | ≤ 1–5 min, set per domain (streaming: ≤ 30–60 s) |
| Stewardship SLA (critical queue) | ≤ 24 h resolution |
| Lineage coverage (column-level) | ≥ 95% |
| Evidence completeness (decisions/overrides) | = 100% |
SLO breaches open tickets and trigger SOAR playbooks (quarantine feed, re-run matching, roll back rule, notify owners). → /siem-soar
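The match precision/recall guardrail can be checked against a steward-labelled sample of record pairs. The sketch below is one way to compute it. The targets come from the table above, while the function names and the pairwise evaluation method are assumptions. The breach names it returns are hypothetical labels that a ticketing or SOAR hook could consume.

```python
# Targets taken from the SLO table; tune per domain.
PRECISION_TARGET = 0.98
RECALL_TARGET = 0.95


def pair(a, b) -> frozenset:
    """Order-insensitive key for a pair of source record IDs."""
    return frozenset((a, b))


def precision_recall(predicted: set, truth: set) -> tuple:
    """Pairwise precision and recall of predicted matches vs. labelled truth.

    Convention used here: an empty denominator yields 1.0.
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


def slo_breaches(precision: float, recall: float) -> list:
    """Return the names of the matching SLOs that are currently breached."""
    breaches = []
    if precision < PRECISION_TARGET:
        breaches.append("match_precision")
    if recall < RECALL_TARGET:
        breaches.append("match_recall")
    return breaches
```

For example, if three of four predicted pairs are correct and three of four true pairs are found, both values are 0.75 and both SLOs are reported as breached.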
🔒 Compliance Mapping (Examples)
- HIPAA / 42 CFR Part 2 — PHI labeling, minimum necessary, auditability; BAAs for tooling.
- PCI DSS — tokenization of PAN in customer records; CDE segmentation.
- GDPR/CCPA — consent flags, data minimization, residency, DSR workflows (access/erasure/rectification).
- SOX / ISO 27001 / SOC 2 — change control, access logs, evidence packs.
📊 Observability & Evidence
- Pipelines — freshness, row counts, schema drift; DQ checks.
- Matching — candidate counts, precision/recall, threshold hits, override rates.
- Golden health — version churn, publish lag, subscriber delivery success.
- Stewardship — queue depth, aging, SLA breaches, action audit.
All streams export to SIEM; dashboards track SLOs, lineage, and cost ($/golden record).
💸 FinOps for MDM (Cost That Behaves)
- Per-domain budgets; $/golden record and $/1k events KPIs.
- Partitioning/clustering; small-object compaction; cache tiers; selective recompute.
- Auto-suspend/slot reservations where applicable; anomaly alerts.
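The two unit-cost KPIs above are simple ratios. The sketch below assumes a single monthly cost figure per domain and views it two ways, per golden record and per thousand published events. The function names and the simplification of not splitting storage cost from publish cost are assumptions.

```python
def unit_costs(monthly_cost_usd: float, golden_records: int,
               events_published: int) -> dict:
    """Compute $/golden record and $/1k events for one domain and month."""
    if golden_records <= 0 or events_published <= 0:
        raise ValueError("record and event counts must be positive")
    return {
        "usd_per_golden_record": monthly_cost_usd / golden_records,
        "usd_per_1k_events": monthly_cost_usd / (events_published / 1000),
    }


def over_budget(monthly_cost_usd: float, budget_usd: float) -> bool:
    """True when the domain spend exceeds its per-domain budget."""
    return monthly_cost_usd > budget_usd
```

As an illustration with made-up numbers, a domain costing $4,200 a month with 2,000,000 golden records and 12,000,000 events works out to $0.0021 per golden record and $0.35 per 1k events.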
🛠️ Implementation Blueprint (No-Surprise Rollout)
1) Domaining & glossary — define entities/attributes, owners, SLAs; map to sources. → /data-governance
2) Standardize & contracts — dbt/ELT rules; schema contracts; DQ tests & quarantine. → /etl-elt
3) Match/merge design — deterministic + probabilistic/ML rules; thresholds; survivorship; versioning.
4) Stewardship — workflows, UI, approvals; exception queues; notifications.
5) Golden & crosswalk — SCD2, ID maps; publish (events/APIs); subscribe patterns for apps.
6) Security & privacy — labels, RLS/CLS, tokenization, regional perimeters; keys/secrets posture.
7) Observability — lineage, DQ, precision/recall dashboards; SIEM/SOAR wiring.
8) Pilot & rings — one domain (e.g., Customer) → Product/Location → Supplier/Asset; success gates per SLO.
9) Operate — quarterly rule tuning; certification cycles; cost/SLO reviews; publish wins & RCAs.
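Steps 3 and 5 of the blueprint (survivorship, then SCD2 versioning of the golden record) can be sketched in a few lines. The source ranking, the record layout (`source`, `updated`), and the in-memory history list are hypothetical. A real implementation would persist versions in the warehouse or hub, and would also maintain the crosswalk, which is omitted here.

```python
from datetime import date

# Hypothetical source-trust ranking; higher wins.
SOURCE_TRUST = {"erp": 3, "crm": 2, "web": 1}


def survive(records: list, attributes: list) -> tuple:
    """Pick each attribute from the most trusted, then most recent, source.

    Empty values never win (completeness). Returns (golden, reasons), where
    reasons records the winning source per attribute for explainability.
    """
    golden, reasons = {}, {}
    for attr in attributes:
        candidates = [r for r in records if r.get(attr) not in (None, "")]
        if not candidates:
            continue
        winner = max(
            candidates,
            key=lambda r: (SOURCE_TRUST.get(r["source"], 0), r["updated"]),
        )
        golden[attr] = winner[attr]
        reasons[attr] = winner["source"]
    return golden, reasons


def apply_scd2(history: list, golden_id: str, attrs: dict, as_of: date) -> bool:
    """Append a new version if attrs changed; return True when a change occurred."""
    current = next(
        (v for v in history
         if v["golden_id"] == golden_id and v["valid_to"] is None),
        None,
    )
    if current is not None and current["attrs"] == attrs:
        return False  # no change, so no new version and no event
    if current is not None:
        current["valid_to"] = as_of  # close the previous version
    history.append({"golden_id": golden_id, "attrs": dict(attrs),
                    "valid_from": as_of, "valid_to": None})
    return True
```

The boolean returned by `apply_scd2` is a natural trigger for the change event published to subscribers, and the `reasons` mapping is the kind of decision evidence that can be exported alongside it.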
✅ Pre-Engagement Checklist
- 🗂️ Target domains & priority (Customer/Product/Provider/Location/Supplier/Asset).
- 🧾 Source systems (CRM/ERP/EHR/Commerce), volumes, freshness SLAs.
- 📚 Business rules (standardization, match/merge, survivorship); steward org & SLAs.
- 🔐 Privacy labels (PII/PHI/PAN), residency, tokenization needs.
- 🧰 Tooling preferences (SQL/Spark, hub, ML assist), event bus.
- 📊 Lineage/DQ stack; SIEM destination; reporting cadence.
- 💸 Budget guardrails; $/golden record target; performance constraints.
🔄 Where MDM Fits (Recursive View)
1) Grammar — master data travels over /connectivity & lives on /networks-and-data-centers.
2) Syntax — curated truth in /data-warehouse with pipelines from /etl-elt.
3) Semantics — /data-governance + /dlp preserve integrity & privacy.
4) Pragmatics — /solveforce-ai retrieves masters with guardrails and cites or refuses.