Turn Text, Code, and Signals into Measurable, Safe, AI-Ready Units
Tokenization is how we cut language and signals into stable, meaningful pieces—so search works, AI “understands,” privacy labels stick, and costs are predictable.
SolveForce treats tokenization as a system: Unicode-safe normalization → domain-aware segmentation → subword modeling (BPE/WordPiece/SentencePiece) → chunking for retrieval → label propagation for privacy — wired to governance and evidence.
Why this matters
- Accuracy: Better tokens → better embeddings, search, and generations.
- Safety: Labels for PII/PHI/PAN must travel with tokens/chunks.
- Cost: Tokens define LLM pricing & performance; chunking shapes RAG quality.
See: /language-of-code-ontology • /ai-knowledge-standardization • /vector-databases • /data-governance • /dlp
🎯 Outcomes (Why SolveForce Tokenization)
- Higher precision & recall in search and RAG (fewer hallucinations; more citations).
- Consistent semantics across human text, code, and logs (AST-aware where possible).
- Privacy by design — labels attached at entity/attribute → token → chunk → index.
- Predictable spend & speed — tokens-per-query budgets, chunk SLOs, and cache policy.
- Audit-ready — lineage of normalization, segmentation, and chunk assembly exported to SIEM.
🧭 Scope (What We Tokenize & Chunk)
- Human text — multilingual documents, email, chats, knowledge bases, PDFs (layout-aware).
- Code & config — source (C/Java/Python/JS), IaC (Terraform/K8s YAML), SQL, logs; AST-driven when feasible.
- Tables & semi-structured — CSV/JSON/Parquet; column- and cell-aware tokenization.
- Docs with structure — headings, lists, tables, figures; semantic HTML/PDF trees.
- Identifiers & secrets — PAN, emails, keys, UUIDs; tokenize for detection without leaking. → /dlp
🧱 Building Blocks (Spelled Out)
1) Normalization
- Unicode: NFC/NFKC; canonicalize quotes/dashes/whitespace; standardize numerals and diacritics.
- Case & punctuation: case-folding with domain exceptions (e.g., gene/stock tickers).
- Locale: language detection → tokenizer choice.
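The normalization step above can be sketched in a few lines of stdlib Python; the replacement table is an illustrative subset, not a complete canonicalization map:

```python
import unicodedata

# Illustrative subset of canonical replacements for typographic
# quotes, dashes, and spacing (a real map would be far larger).
CANONICAL = {
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
    "\u2013": "-", "\u2014": "-",   # en/em dashes -> hyphen
    "\u00a0": " ",                  # non-breaking space -> space
}

def normalize(text: str, form: str = "NFKC") -> str:
    """Unicode-normalize, canonicalize punctuation, collapse whitespace."""
    text = unicodedata.normalize(form, text)
    for src, dst in CANONICAL.items():
        text = text.replace(src, dst)
    return " ".join(text.split())   # collapse tabs/runs of spaces

print(normalize("\u201cSmart\u201d quotes\u00a0\u2014 and\tspacing"))
# '"Smart" quotes - and spacing'
```

Domain exceptions (tickers, gene names) would be applied as a skip-list before case-folding; that layer is omitted here.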
2) Segmentation
- Sentence/utterance: trained boundary models with abbreviations list.
- Word: language-specific rules; CJK/Thai segmentation; clitics & contractions.
- Subword: BPE / WordPiece / SentencePiece (unigram LM) to cover OOVs & morphology.
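To make the subword step concrete, here is a toy greedy BPE learner (frequency-counted pair merges, as in the original algorithm); real deployments would use a trained SentencePiece/WordPiece model, not this sketch:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from {word: frequency}; returns ordered merge pairs."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent adjacent pair
        merges.append(best)
        vocab = {tuple(_merge(s, best)): f for s, f in vocab.items()}
    return merges

def _merge(symbols, pair):
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1]); i += 2
        else:
            out.append(symbols[i]); i += 1
    return out

def encode(word, merges):
    """Segment an unseen (OOV) word by replaying learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols

merges = learn_bpe({"lower": 5, "lowest": 2, "newer": 6}, num_merges=3)
print(encode("lower", merges))   # ['lo', 'wer']
```

Note how `encode` covers out-of-vocabulary words by falling back to smaller subwords, which is exactly why subword models handle morphology and rare terms gracefully.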
3) Domain-Aware Tokenizers
- Code: language-specific lexers → tokens by AST node (function/class/import), not raw whitespace.
- Logs/JSON/YAML: key/value & path-aware tokens (e.g., spec.template.spec.containers[0].image).
- Docs: section/heading/table cell as boundaries; preserve captions/alt text.
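For code, "tokens by AST node" means chunk boundaries fall on functions and classes rather than arbitrary line counts. A minimal sketch using Python's own `ast` module (a production pipeline would use per-language parsers such as tree-sitter):

```python
import ast

SOURCE = '''
import math

def area(r):
    """Circle area."""
    return math.pi * r * r

class Greeter:
    def hello(self, name):
        return f"hi {name}"
'''

def ast_chunks(source: str):
    """Yield one chunk per top-level function/class, keyed by AST node type."""
    tree = ast.parse(source)
    lines = source.splitlines()
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno give the node's exact source span
            text = "\n".join(lines[node.lineno - 1: node.end_lineno])
            yield {"kind": type(node).__name__, "name": node.name, "text": text}

for chunk in ast_chunks(SOURCE):
    print(chunk["kind"], chunk["name"])   # FunctionDef area / ClassDef Greeter
```

Each chunk stays a syntactically complete unit, so embeddings never see half a function.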
4) Chunking (for RAG & IR)
- Semantic chunks: section/heading/AST function/event windows ± controlled overlap (5–15%).
- Length bounds: fit model context (e.g., 512–2k tokens); keep whole paragraphs when possible.
- Metadata: source, page, section, locale, labels, time, author, hash, citations requirement.
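The chunking rules above (whole paragraphs, a token budget, controlled overlap, side-car metadata) can be sketched as a greedy packer; the token count here is crude whitespace splitting, standing in for the real tokenizer:

```python
import hashlib

def chunk_paragraphs(paragraphs, max_tokens=120, overlap=1, source="doc.md"):
    """Pack whole paragraphs into chunks under a token budget, carrying
    `overlap` trailing paragraphs into the next chunk, with metadata."""
    def flush(paras):
        text = "\n\n".join(paras)
        return {"text": text, "source": source,
                "hash": hashlib.sha256(text.encode()).hexdigest()[:12]}

    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())               # crude stand-in token count
        if current and count + n > max_tokens:
            chunks.append(flush(current))
            current = current[-overlap:]    # controlled overlap
            count = sum(len(p.split()) for p in current)
        current.append(para)
        count += n
    if current:
        chunks.append(flush(current))
    return chunks

cs = chunk_paragraphs(["a b c", "d e f", "g h i"], max_tokens=5, overlap=1)
print([c["text"] for c in cs])
```

The content hash supports dedupe at index time; page, section, labels, and author fields would be attached the same way.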
5) Label Propagation (Safety)
- Labels from governance (PII/PHI/PAN/CUI, sensitivity) attach at entity/attribute → flow to tokens → roll up to chunks and indices.
- Redaction/tokenization transforms preserve verifiability (e.g., PAN Luhn check replaced with format-preserving token). → /data-governance • /dlp
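The roll-up is simple in principle: a chunk inherits the union of labels on the entities it contains, so index filters can enforce the strictest applicable policy. A sketch (entity labels are assumed to arrive from the governance catalog; field names are hypothetical):

```python
def roll_up_labels(chunks):
    """Attach to each chunk the union of labels carried by its entities."""
    for chunk in chunks:
        labels = set()
        for entity in chunk["entities"]:
            labels |= entity["labels"]
        chunk["labels"] = labels
    return chunks

docs = roll_up_labels([
    {"id": "c1", "entities": [
        {"span": "4111 1111 1111 1111", "labels": {"PAN"}},
        {"span": "jane@example.com",    "labels": {"PII"}},
    ]},
    {"id": "c2", "entities": []},   # unlabeled chunk stays unrestricted
])
print(docs[0]["labels"])            # {'PAN', 'PII'} (set order may vary)
```

The same union step repeats at the index level, so a shard's label set is always at least as strict as any chunk inside it.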
6) Subword Model Ops
- Train BPE/Unigram on domain corpora; freeze versioned vocabularies; keep per-domain vocab shards when needed (healthcare, telecom, finance).
- Maintain a refusal list for dangerous merges (e.g., secrets/keys patterns).
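A refusal list can be as simple as regex screening of candidate vocabulary entries before a merge is committed; the patterns below are hypothetical examples (AWS-style key IDs, PEM headers, long hex strings), not a complete secrets taxonomy:

```python
import re

# Patterns that should never survive as single vocabulary entries.
BLOCKLIST = [
    re.compile(r"AKIA[0-9A-Z]{4,}"),   # AWS-style access key ID shape
    re.compile(r"BEGIN[ _]?PRIVATE"),  # PEM private-key header fragment
    re.compile(r"[0-9a-f]{16,}"),      # long hex runs (token/secret-like)
]

def filter_vocab(candidates):
    """Split candidate merge tokens into kept vs refused."""
    kept, refused = [], []
    for token in candidates:
        (refused if any(p.search(token) for p in BLOCKLIST) else kept).append(token)
    return kept, refused

kept, refused = filter_vocab(["network", "AKIAABCD1234", "deadbeefdeadbeef01"])
print(kept, refused)   # ['network'] ['AKIAABCD1234', 'deadbeefdeadbeef01']
```

Running this check at vocabulary-freeze time, and diffing the refused list between versions, gives an auditable trail for the merge policy.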
🧰 Reference Patterns (Choose Your Fit)
A) Knowledge Hub → Guarded RAG
Normalization → sentence/section segmentation → BPE subwords → section-level chunks with labels → vector index with label/ACL pre-filters → “cite-or-refuse”.
→ /vector-databases • /ai-knowledge-standardization
B) Code & IaC Copilot
Language-specific lexers → AST node chunks (function/class) → embeddings + symbol table cross-links → DLP scans for secrets → guarded completions.
→ /language-of-code-ontology • /dlp
C) Email & Ticket Intelligence
Robust MIME/PDF extraction → sentence/quoted-text separation → intent + entity tagging → PII label propagation → safe summarization with redaction.
D) Tabular & JSON Analytics
Column-wise tokenization; special handling for IDs/dates/money; chunk by row groups; keep schema & constraints as side-car context.
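Pattern D's "special handling for IDs/dates/money" amounts to typing cells before tokenizing them; this sketch uses three hypothetical cell formats, with free text falling back to word splits:

```python
import csv, io, re

CSV_TEXT = """id,created,amount,note
A-1001,2024-05-01,19.99,renewal
A-1002,2024-05-03,250.00,upgrade
"""

def tokenize_row(row):
    """Type-aware cell tokens: IDs/dates/money stay atomic typed tokens."""
    out = {}
    for col, val in row.items():
        if re.fullmatch(r"[A-Z]-\d+", val):
            out[col] = [("ID", val)]
        elif re.fullmatch(r"\d{4}-\d{2}-\d{2}", val):
            out[col] = [("DATE", val)]
        elif re.fullmatch(r"\d+\.\d{2}", val):
            out[col] = [("MONEY", val)]
        else:
            out[col] = [("WORD", w) for w in val.split()]
    return out

rows = [tokenize_row(r) for r in csv.DictReader(io.StringIO(CSV_TEXT))]
print(rows[0]["id"])   # [('ID', 'A-1001')]
```

Keeping IDs and money atomic prevents a subword model from splitting `A-1001` into meaningless fragments, and the type tags travel as side-car context with each chunk.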
E) Multilingual + CJK
Language ID per segment; switch tokenizers (CJK/Indic/Thai) and trained BPEs; keep transliteration maps; dual-index by native & romanized forms.
📐 SLO Guardrails (Quality, Cost, Safety)
| Domain | KPI / SLO | Target (Recommended) |
|---|---|---|
| Segmentation | Sentence boundary F1 (gold set) | ≥ 98–99% |
| RAG | Citation coverage | = 100% |
| RAG | Answer refusal correctness | ≥ 98% |
| Chunking | Overlap/overflow rate | ≤ 1–3% |
| Privacy | Label propagation coverage | = 100% |
| Cost | Tokens per doc (p95) vs budget | Within 10% |
| Perf | Tokenization throughput | Publish per language/code; alert at 80% of SLO |
SLO breaches open tickets and trigger SOAR (re-segment, re-index, tighten labels, cache hot chunks). → /siem-soar
🔍 Observability & Evidence
- Pipelines: normalization diffs, tokenizer version, chunk maps, label traces.
- Indices: vector density, recall metrics, drift of vocab distributions.
- Safety: DLP hits, secrets detections, redaction stats, refusal ledger.
- Cost: tokens per page/repo, $/question, cache hit-rate.
All streams export to SIEM; monthly reports attach to governance packs. → /siem-soar
⚠️ Common Pitfalls & Fixes
- Fixed-size chunking only → use section/AST/event boundaries first, then size.
- One tokenizer for all domains → maintain per-language/domain tokenizers/vocabs.
- Ignoring Unicode & locale → normalize; use locale-aware rules and exceptions lists.
- Losing privacy labels → propagate labels from entity to token to chunk; verify with DLP tests.
- Embedding secrets by accident → run secret scanners pre-index; mask at source; blocklist risky merges.
- Over-aggressive BPE merges → lock vocab versions; diff merges; maintain domain stop-lists.
🛠️ Implementation Blueprint (No-Surprise Rollout)
1) Inventory corpora & domains (languages, codebases, logs, PDFs).
2) Normalize (Unicode, case, punctuation, numerals, locale).
3) Segment (sentence/phrase/AST/event); evaluate with gold sets.
4) Subword modeling (BPE/Unigram) per domain; version & publish vocabularies.
5) Chunk by semantic units; attach metadata & labels; hash for dedupe.
6) Label propagation (entity → token → chunk); verify with DLP.
7) Index & cache; enforce pre-filters (labels/ACL) before ANN search.
8) Wire evidence to SIEM; SLO dashboards; cost budgets.
9) Iterate with steward feedback; publish rule diffs in the Codex. → /solveforce-codex
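Step 7's pre-filter is the part most often skipped, so here is a minimal sketch: restricted chunks are removed before similarity search ever sees them, so they can never enter the candidate set (index shape and clearance model are illustrative assumptions):

```python
def prefilter(index, user_clearances):
    """Return only chunks every label of which the user is cleared for,
    BEFORE any ANN/similarity scoring runs."""
    return [c for c in index if c["labels"] <= user_clearances]

INDEX = [
    {"id": "c1", "labels": set(),          "text": "public runbook"},
    {"id": "c2", "labels": {"PII"},        "text": "customer roster"},
    {"id": "c3", "labels": {"PII", "PHI"}, "text": "patient notes"},
]

visible = prefilter(INDEX, user_clearances={"PII"})
print([c["id"] for c in visible])   # ['c1', 'c2'] — c3 also requires PHI
```

Filtering first, ranking second means a retrieval bug can degrade answer quality but cannot leak a restricted chunk.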
✅ Pre-Engagement Checklist
- 📚 Languages/domains (text, code, tables, logs) & volumes.
- 🗂️ Structure mix (headings, tables, forms, PDFs) & quality (OCR needs).
- 🔏 Privacy labels (PII/PHI/PAN/CUI) & DLP policies; redaction rules.
- 🧪 Gold sets for boundary tests; target precision/recall.
- 🧠 Model contexts & providers; token pricing; prompt/token budgets.
- 🧰 Tooling preferences (SpaCy/ICU/MeCab/Jieba, Pygments/tree-sitter, SentencePiece).
- 📊 SIEM/SOAR endpoints; reporting cadence; success metrics.
🔄 Where Tokenization Fits (Recursive View)
1) Grammar — language units feed /vector-databases & RAG.
2) Syntax — chunking aligns with documents, AST, and events for retrieval & governance.
3) Semantics — /data-governance & /dlp enforce labels and privacy.
4) Pragmatics — /solveforce-ai uses tokens/chunks to answer with citations or refusals.