🔤 Tokenization

Turn Text, Code, and Signals into Measurable, Safe, AI-Ready Units

Tokenization is how we cut language and signals into stable, meaningful pieces—so search works, AI “understands,” privacy labels stick, and costs are predictable.
SolveForce treats tokenization as a system: Unicode-safe normalization → domain-aware segmentation → subword modeling (BPE/WordPiece/SentencePiece) → chunking for retrieval → label propagation for privacy — wired to governance and evidence.

🎯 Outcomes (Why SolveForce Tokenization)

  • Higher precision & recall in search and RAG (fewer hallucinations; more citations).
  • Consistent semantics across human text, code, and logs (AST-aware where possible).
  • Privacy by design — labels attached at entity/attribute → token → chunk → index.
  • Predictable spend & speed — tokens-per-query budgets, chunk SLOs, and cache policy.
  • Audit-ready — lineage of normalization, segmentation, and chunk assembly exported to SIEM.

🧭 Scope (What We Tokenize & Chunk)

  • Human text — multilingual documents, email, chats, knowledge bases, PDFs (layout-aware).
  • Code & config — source (C/Java/Python/JS), IaC (Terraform/K8s YAML), SQL, logs; AST-driven when feasible.
  • Tables & semi-structured — CSV/JSON/Parquet; column- and cell-aware tokenization.
  • Docs with structure — headings, lists, tables, figures; semantic HTML/PDF trees.
  • Identifiers & secrets — PAN, emails, keys, UUIDs; tokenize for detection without leaking. → /dlp

🧱 Building Blocks (Spelled Out)

1) Normalization

  • Unicode: NFC/NFKC; canonicalize quotes/dashes/whitespace; standardize numerals and diacritics.
  • Case & punctuation: case-folding with domain exceptions (e.g., gene/stock tickers).
  • Locale: language detection → tokenizer choice.
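
A minimal sketch of this normalization step in Python, using only the standard library; the punctuation map and whitespace policy below are illustrative choices, not a full production ruleset:

```python
import unicodedata

# Illustrative canonicalization map for "smart" punctuation and NBSP.
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes -> straight
    "\u2018": "'", "\u2019": "'",   # curly single quotes -> straight
    "\u2013": "-", "\u2014": "-",   # en/em dash -> hyphen
    "\u00a0": " ",                  # non-breaking space -> space
})

def normalize(text: str, form: str = "NFC") -> str:
    """Unicode-normalize, canonicalize punctuation, collapse whitespace runs."""
    text = unicodedata.normalize(form, text)
    text = text.translate(PUNCT_MAP)
    return " ".join(text.split())
```

Case-folding with domain exceptions (tickers, gene names) would wrap this with an allow-list check before calling `str.casefold()`.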

2) Segmentation

  • Sentence/utterance: trained boundary models with abbreviations list.
  • Word: language-specific rules; CJK/Thai segmentation; clitics & contractions.
  • Subword: BPE / WordPiece / SentencePiece (unigram LM) to cover OOVs & morphology.
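
The subword step can be illustrated with a from-scratch BPE merge learner (Sennrich-style); real deployments would use SentencePiece or an equivalent trained library, so this is a sketch of the algorithm only:

```python
from collections import Counter

def learn_bpe(words: dict, num_merges: int) -> list:
    """Learn BPE merge rules from a {word: frequency} corpus (sketch)."""
    vocab = {tuple(w): f for w, f in words.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        vocab = {tuple(apply_merge(s, best, merged)): f for s, f in vocab.items()}
    return merges

def apply_merge(symbols, pair, merged):
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged); i += 2
        else:
            out.append(symbols[i]); i += 1
    return out
```

On a toy corpus `{"low": 5, "lower": 2, "lowest": 1}` the first two merges learned are `("l", "o")` then `("lo", "w")`, capturing the shared stem that covers OOV forms like "lowish".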

3) Domain-Aware Tokenizers

  • Code: language-specific lexers → tokens by AST node (function/class/import), not raw whitespace.
  • Logs/JSON/YAML: key/value & path-aware tokens (e.g., spec.template.spec.containers[0].image).
  • Docs: section/heading/table cell as boundaries; preserve captions/alt text.
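
Path-aware tokenization of JSON/YAML can be sketched as a recursive flattener that yields one token per leaf path; the function name and output shape are illustrative:

```python
def path_tokens(obj, prefix=""):
    """Flatten nested JSON/YAML-like data into (path, value) tokens."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from path_tokens(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from path_tokens(val, f"{prefix}[{i}]")
    else:
        yield (prefix, obj)
```

A Kubernetes-style fragment `{"spec": {"containers": [{"image": "nginx:1.25"}]}}` yields the single token `("spec.containers[0].image", "nginx:1.25")`, matching the path style shown above.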

4) Chunking (for RAG & IR)

  • Semantic chunks: section/heading/AST function/event windows ± controlled overlap (5–15%).
  • Length bounds: fit model context (e.g., 512–2k tokens); keep whole paragraphs when possible.
  • Metadata: source, page, section, locale, labels, time, author, hash, citations requirement.
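
A minimal sketch of boundary-first chunking: pack whole paragraphs under a token budget and carry a small paragraph overlap into the next chunk. Here overlap is counted in paragraphs and tokens are approximated by whitespace words; both are simplifying assumptions:

```python
def make_chunks(paragraphs, max_tokens, overlap_paras=1):
    """Pack whole paragraphs into chunks under max_tokens, with paragraph overlap."""
    chunks, current, size = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # crude token count
        if current and size + n > max_tokens:
            chunks.append(current)
            # carry trailing paragraphs forward for continuity
            current = current[-overlap_paras:] if overlap_paras else []
            size = sum(len(p.split()) for p in current)
        current.append(para)
        size += n
    if current:
        chunks.append(current)
    return chunks
```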

5) Label Propagation (Safety)

  • Labels from governance (PII/PHI/PAN/CUI, sensitivity) attach at entity/attribute → flow to tokens → roll up to chunks and indices.
  • Redaction/tokenization transforms preserve verifiability (e.g., PAN Luhn check replaced with format-preserving token). → /data-governance/dlp
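
The entity → token → chunk flow can be sketched with character-span overlap; the data shapes here are assumptions for illustration:

```python
def propagate_labels(entities, tokens):
    """entities: (start, end, label) char spans; tokens: (start, end, text).
    Returns per-token label sets and the chunk-level roll-up (their union)."""
    token_labels = []
    for t_start, t_end, text in tokens:
        labels = {lab for (e_start, e_end, lab) in entities
                  if t_start < e_end and e_start < t_end}  # spans overlap
        token_labels.append((text, labels))
    chunk_labels = set().union(*(l for _, l in token_labels)) if token_labels else set()
    return token_labels, chunk_labels
```

Any chunk touching a labeled entity inherits that label, so index-level pre-filters can refuse to retrieve it for uncleared users.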

6) Subword Model Ops

  • Train BPE/Unigram on domain corpora; freeze versioned vocabularies; keep per-domain vocab shards when needed (healthcare, telecom, finance).
  • Maintain a refusal list for dangerous merges (e.g., secrets/keys patterns).
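
A refusal list can be enforced as a post-filter over learned merges; the two patterns below are illustrative examples of secret-like shapes, not a complete blocklist:

```python
import re

# Illustrative secret-like shapes a merged token must never match.
REFUSAL_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{4,}"),       # AWS-style access-key prefix
    re.compile(r"-----BEGIN[ A-Z]*KEY"),   # PEM key header fragment
]

def filter_merges(merges):
    """Drop any merge whose concatenated token matches a refusal pattern."""
    return [(a, b) for a, b in merges
            if not any(p.search(a + b) for p in REFUSAL_PATTERNS)]
```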

🧰 Reference Patterns (Choose Your Fit)

A) Knowledge Hub → Guarded RAG

Normalization → sentence/section segmentation → BPE subwords → section-level chunks with labels → vector index with label/ACL pre-filters → “cite-or-refuse”.
/vector-databases/ai-knowledge-standardization

B) Code & IaC Copilot

Language-specific lexers → AST node chunks (function/class) → embeddings + symbol table cross-links → DLP scans for secrets → guarded completions.
/language-of-code-ontology/dlp

C) Email & Ticket Intelligence

Robust MIME/PDF extraction → sentence/quoted-text separation → intent + entity tagging → PII label propagation → safe summarization with redaction.

D) Tabular & JSON Analytics

Column-wise tokenization; special handling for IDs/dates/money; chunk by row groups; keep schema & constraints as side-car context.

E) Multilingual + CJK

Language ID per segment; switch tokenizers (CJK/Indic/Thai) and trained BPEs; keep transliteration maps; dual-index by native & romanized forms.
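
Per-segment routing can be sketched with a crude Unicode-range script check; production systems would use a trained language-ID model, and the tokenizer names returned here are placeholders:

```python
def script_of(text: str) -> str:
    """Very crude script detector by Unicode block (illustrative only)."""
    for ch in text:
        cp = ord(ch)
        if 0x4E00 <= cp <= 0x9FFF:
            return "cjk"        # CJK Unified Ideographs
        if 0x0E00 <= cp <= 0x0E7F:
            return "thai"
        if 0x0900 <= cp <= 0x097F:
            return "indic"      # Devanagari
    return "latin"

# Hypothetical tokenizer registry keyed by script.
TOKENIZERS = {"cjk": "mecab/jieba", "thai": "thai-segmenter",
              "indic": "indic-bpe", "latin": "default-bpe"}

def pick_tokenizer(segment: str) -> str:
    return TOKENIZERS[script_of(segment)]
```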


📐 SLO Guardrails (Quality, Cost, Safety)

| Domain | KPI / SLO | Target (Recommended) |
| --- | --- | --- |
| Segmentation | Sentence boundary F1 (gold set) | ≥ 98–99% |
| RAG | Citation coverage | = 100% |
| RAG | Answer refusal correctness | ≥ 98% |
| Chunking | Overlap/overflow rate | ≤ 1–3% |
| Privacy | Label propagation coverage | = 100% |
| Cost | Tokens per doc (p95) vs budget | Within 10% |
| Perf | Tokenization throughput | Publish per language/code; alert at 80% of SLO |

SLO breaches open tickets and trigger SOAR (re-segment, re-index, tighten labels, cache hot chunks). → /siem-soar
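
The breach check behind that automation can be sketched as a threshold sweep; the KPI names and bounds mirror the table above but are illustrative:

```python
# Illustrative SLO registry: KPI -> (bound kind, threshold).
SLOS = {
    "sentence_boundary_f1": ("min", 0.98),
    "citation_coverage":    ("min", 1.00),
    "chunk_overflow_rate":  ("max", 0.03),
    "label_coverage":       ("min", 1.00),
}

def slo_breaches(metrics):
    """Return KPI names violating their bound; callers open tickets / fire SOAR."""
    breaches = []
    for name, (kind, bound) in SLOS.items():
        value = metrics.get(name)
        if value is None:
            continue
        if (kind == "min" and value < bound) or (kind == "max" and value > bound):
            breaches.append(name)
    return breaches
```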


🔍 Observability & Evidence

  • Pipelines: normalization diffs, tokenizer version, chunk maps, label traces.
  • Indices: vector density, recall metrics, drift of vocab distributions.
  • Safety: DLP hits, secrets detections, redaction stats, refusal ledger.
  • Cost: tokens per page/repo, $/question, cache hit-rate.

All streams export to SIEM; monthly reports attach to governance packs. → /siem-soar

⚠️ Common Pitfalls & Fixes

  • Fixed-size chunking only → use section/AST/event boundaries first, then size.
  • One tokenizer for all domains → maintain per-language/domain tokenizers/vocabs.
  • Ignoring Unicode & locale → normalize; use locale-aware rules and exceptions lists.
  • Losing privacy labels → propagate labels from entity to token to chunk; verify with DLP tests.
  • Embedding secrets by accident → run secret scanners pre-index; mask at source; blocklist risky merges.
  • Over-aggressive BPE merges → lock vocab versions; diff merges; maintain domain stop-lists.

🛠️ Implementation Blueprint (No-Surprise Rollout)

1) Inventory corpora & domains (languages, codebases, logs, PDFs).
2) Normalize (Unicode, case, punctuation, numerals, locale).
3) Segment (sentence/phrase/AST/event); evaluate with gold sets.
4) Subword modeling (BPE/Unigram) per domain; version & publish vocabularies.
5) Chunk by semantic units; attach metadata & labels; hash for dedupe.
6) Label propagation (entity → token → chunk); verify with DLP.
7) Index & cache; enforce pre-filters (labels/ACL) before ANN search.
8) Wire evidence to SIEM; SLO dashboards; cost budgets.
9) Iterate with steward feedback; publish rule diffs in the Codex. → /solveforce-codex
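
Step 7's label/ACL pre-filter can be sketched as a set check applied before any ANN scoring; the chunk record shape is an assumption:

```python
def prefilter(candidates, user_clearances, user_groups):
    """Keep only chunks whose labels the user is cleared for and whose ACL
    group the user belongs to; run this BEFORE vector similarity search."""
    return [c for c in candidates
            if c["labels"] <= user_clearances and c["acl"] in user_groups]
```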


✅ Pre-Engagement Checklist

  • 📚 Languages/domains (text, code, tables, logs) & volumes.
  • 🗂️ Structure mix (headings, tables, forms, PDFs) & quality (OCR needs).
  • 🔏 Privacy labels (PII/PHI/PAN/CUI) & DLP policies; redaction rules.
  • 🧪 Gold sets for boundary tests; target precision/recall.
  • 🧠 Model contexts & providers; token pricing; prompt/token budgets.
  • 🧰 Tooling preferences (SpaCy/ICU/MeCab/Jieba, Pygments/tree-sitter, SentencePiece).
  • 📊 SIEM/SOAR endpoints; reporting cadence; success metrics.

🔄 Where Tokenization Fits (Recursive View)

1) Grammar — language units feed /vector-databases & RAG.
2) Syntax — chunking aligns with documents, AST, and events for retrieval & governance.
3) Semantics — /data-governance & /dlp enforce labels and privacy.
4) Pragmatics — /solveforce-ai uses tokens/chunks to answer with citations or refusals.


📞 Make Tokenization Accurate, Safe & Cost-Predictable