🔀 Tokenization

Turn Text, Code, and Signals into Measurable, Safe, AI-Ready Units

Tokenization is how we cut language and signals into stable, meaningful pieces—so search works, AI “understands,” privacy labels stick, and costs are predictable.
SolveForce treats tokenization as a system: Unicode-safe normalization → domain-aware segmentation → subword modeling (BPE/WordPiece/SentencePiece) → chunking for retrieval → label propagation for privacy — wired to governance and evidence.

🎯 Outcomes (Why SolveForce Tokenization)

  • Higher precision & recall in search and RAG (fewer hallucinations; more citations).
  • Consistent semantics across human text, code, and logs (AST-aware where possible).
  • Privacy by design — labels attached at entity/attribute → token → chunk → index.
  • Predictable spend & speed — tokens-per-query budgets, chunk SLOs, and cache policy.
  • Audit-ready — lineage of normalization, segmentation, and chunk assembly exported to SIEM.

🧭 Scope (What We Tokenize & Chunk)

  • Human text — multilingual documents, email, chats, knowledge bases, PDFs (layout-aware).
  • Code & config — source (C/Java/Python/JS), IaC (Terraform/K8s YAML), SQL, logs; AST-driven when feasible.
  • Tables & semi-structured — CSV/JSON/Parquet; column- and cell-aware tokenization.
  • Docs with structure — headings, lists, tables, figures; semantic HTML/PDF trees.
  • Identifiers & secrets — PAN, emails, keys, UUIDs; tokenize for detection without leaking. → /dlp

🧱 Building Blocks (Spelled Out)

1) Normalization

  • Unicode: NFC/NFKC; canonicalize quotes/dashes/whitespace; standardize numerals and diacritics.
  • Case & punctuation: case-folding with domain exceptions (e.g., gene/stock tickers).
  • Locale: language detection β†’ tokenizer choice.

2) Segmentation

  • Sentence/utterance: trained boundary models with abbreviations list.
  • Word: language-specific rules; CJK/Thai segmentation; clitics & contractions.
  • Subword: BPE / WordPiece / SentencePiece (unigram LM) to cover OOVs & morphology.

3) Domain-Aware Tokenizers

  • Code: language-specific lexers β†’ tokens by AST node (function/class/import), not raw whitespace.
  • Logs/JSON/YAML: key/value & path-aware tokens (e.g., spec.template.spec.containers[0].image).
  • Docs: section/heading/table cell as boundaries; preserve captions/alt text.

4) Chunking (for RAG & IR)

  • Semantic chunks: section/heading/AST function/event windows Β± controlled overlap (5–15%).
  • Length bounds: fit model context (e.g., 512–2k tokens); keep whole paragraphs when possible.
  • Metadata: source, page, section, locale, labels, time, author, hash, citations requirement.

5) Label Propagation (Safety)

  • Labels from governance (PII/PHI/PAN/CUI, sensitivity) attach at entity/attribute β†’ flow to tokens β†’ roll up to chunks and indices.
  • Redaction/tokenization transforms preserve verifiability (e.g., PAN Luhn check replaced with format-preserving token). β†’ /data-governance β€’ /dlp

6) Subword Model Ops

  • Train BPE/Unigram on domain corpora; freeze versioned vocabularies; keep per-domain vocab shards when needed (healthcare, telecom, finance).
  • Maintain a refusal list for dangerous merges (e.g., secrets/keys patterns).

🧰 Reference Patterns (Choose Your Fit)

A) Knowledge Hub → Guarded RAG

Normalization → sentence/section segmentation → BPE subwords → section-level chunks with labels → vector index with label/ACL pre-filters → “cite-or-refuse”.
→ /vector-databases • /ai-knowledge-standardization
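
The final “cite-or-refuse” gate can be sketched in a few lines; the chunk field name citation here is hypothetical, for illustration only:

```python
def answer(chunks, min_citations=1):
    """Cite-or-refuse sketch: only answer when retrieved chunks carry citations."""
    cited = [c for c in chunks if c.get("citation")]
    if len(cited) < min_citations:
        return {"refused": True, "reason": "no citable sources"}
    return {"refused": False, "citations": [c["citation"] for c in cited]}
```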

B) Code & IaC Copilot

Language-specific lexers → AST node chunks (function/class) → embeddings + symbol table cross-links → DLP scans for secrets → guarded completions.
→ /language-of-code-ontology • /dlp

C) Email & Ticket Intelligence

Robust MIME/PDF extraction → sentence/quoted-text separation → intent + entity tagging → PII label propagation → safe summarization with redaction.

D) Tabular & JSON Analytics

Column-wise tokenization; special handling for IDs/dates/money; chunk by row groups; keep schema & constraints as side-car context.

E) Multilingual + CJK

Language ID per segment; switch tokenizers (CJK/Indic/Thai) and trained BPEs; keep transliteration maps; dual-index by native & romanized forms.


πŸ“ SLO Guardrails (Quality, Cost, Safety)

Targets (recommended) by domain:

  • Segmentation — sentence boundary F1 (gold set): ≥ 98–99%
  • RAG — citation coverage: 100%
  • RAG — answer refusal correctness: ≥ 98%
  • Chunking — overlap/overflow rate: ≤ 1–3%
  • Privacy — label propagation coverage: 100%
  • Cost — tokens per doc (p95) vs. budget: within 10%
  • Perf — tokenization throughput: publish per language/code; alert at 80% of SLO

SLO breaches open tickets and trigger SOAR (re-segment, re-index, tighten labels, cache hot chunks). → /siem-soar


πŸ” Observability & Evidence

  • Pipelines: normalization diffs, tokenizer version, chunk maps, label traces.
  • Indices: vector density, recall metrics, drift of vocab distributions.
  • Safety: DLP hits, secrets detections, redaction stats, refusal ledger.
  • Cost: tokens per page/repo, $/question, cache hit-rate.
    All streams export to SIEM; monthly reports attach to governance packs. β†’ /siem-soar

⚠️ Common Pitfalls & Fixes

  • Fixed-size chunking only β†’ use section/AST/event boundaries first, then size.
  • One tokenizer for all domains β†’ maintain per-language/domain tokenizers/vocabs.
  • Ignoring Unicode & locale β†’ normalize; use locale-aware rules and exceptions lists.
  • Losing privacy labels β†’ propagate labels from entity to token to chunk; verify with DLP tests.
  • Embedding secrets by accident β†’ run secret scanners pre-index; mask at source; blocklist risky merges.
  • Over-aggressive BPE merges β†’ lock vocab versions; diff merges; maintain domain stop-lists.

πŸ› οΈ Implementation Blueprint (No-Surprise Rollout)

1) Inventory corpora & domains (languages, codebases, logs, PDFs).
2) Normalize (Unicode, case, punctuation, numerals, locale).
3) Segment (sentence/phrase/AST/event); evaluate with gold sets.
4) Subword modeling (BPE/Unigram) per domain; version & publish vocabularies.
5) Chunk by semantic units; attach metadata & labels; hash for dedupe.
6) Label propagation (entity → token → chunk); verify with DLP.
7) Index & cache; enforce pre-filters (labels/ACL) before ANN search.
8) Wire evidence to SIEM; SLO dashboards; cost budgets.
9) Iterate with steward feedback; publish rule diffs in the Codex. → /solveforce-codex
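
Step 5's hash-for-dedupe can be sketched as a normalized content hash, so the same text ingested twice maps to the same chunk id:

```python
import hashlib

def chunk_id(text: str, source: str) -> str:
    """Stable chunk id: whitespace/case-normalized text hashed with its source."""
    normalized = " ".join(text.split()).lower()
    # NUL separator keeps (source, text) pairs from colliding.
    return hashlib.sha256(f"{source}\x00{normalized}".encode()).hexdigest()[:16]
```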


✅ Pre-Engagement Checklist

  • πŸ“š Languages/domains (text, code, tables, logs) & volumes.
  • πŸ—‚οΈ Structure mix (headings, tables, forms, PDFs) & quality (OCR needs).
  • πŸ” Privacy labels (PII/PHI/PAN/CUI) & DLP policies; redaction rules.
  • πŸ§ͺ Gold sets for boundary tests; target precision/recall.
  • 🧠 Model contexts & providers; token pricing; prompt/token budgets.
  • 🧰 Tooling preferences (SpaCy/ICU/MeCab/Jieba, Pygments/tree-sitter, SentencePiece).
  • πŸ“Š SIEM/SOAR endpoints; reporting cadence; success metrics.

🔄 Where Tokenization Fits (Recursive View)

1) Grammar — language units feed /vector-databases & RAG.
2) Syntax β€” chunking aligns with documents, AST, and events for retrieval & governance.
3) Semantics β€” /data-governance & /dlp enforce labels and privacy.
4) Pragmatics β€” /solveforce-ai uses tokens/chunks to answer with citations or refusals.


📞 Make Tokenization Accurate, Safe & Cost-Predictable