1) Pipeline at a glance (speech → code → letters → back again)
[Voice]
⤏ ASR (audio → text)
⤏ Normalizer (case, punctuation, unicode)
⤏ Tokenizer (words/subwords)
⤏ Characters/bytes
⤏ Features (n-grams, counts, positions)
⤏ Encodings (BPE/WordPiece/byte-level ids)
⤏ Vectors (embeddings)
⤏ Model compute (graphs, attention)
⤏ Decoding (ids → subwords → words → text)
⤏ TTS (text → audio) ←─────── round-trip
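A minimal sketch of the same round trip as composable stages. Every stage body below is a toy placeholder (not a real ASR/tokenizer/TTS API); the point is just the shape of the dataflow:

def asr(audio):      return "Transparency transduces; induction conducts."  # pretend audio -> text
def normalize(text): return text.strip()
def tokenize(text):  return text.lower().replace(";", "").replace(".", "").split()
def to_ids(tokens):  return [abs(hash(t)) % 50000 for t in tokens]          # stand-in for a vocab lookup
def detokenize(ids): return "<decoded text>"                                # stand-in for id -> text decoding
def tts(text):       return b"<audio bytes>"                                # pretend text -> audio

x = b"<raw audio>"
for stage in (asr, normalize, tokenize, to_ids, detokenize, tts):
    x = stage(x)
    print(stage.__name__, "->", repr(x)[:60])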
2) “From words back to letters,” then redistribute
Take a sentence:
S = "Transparency transduces; induction conducts."
Words → letters
- words:
["Transparency","transduces;","induction","conducts."]
- strip punctuation →
["Transparency","transduces","induction","conducts"]
- letters (lowercased): concatenate or keep per-word:
- transparency →
t r a n s p a r e n c y
- transduces →
t r a n s d u c e s
- induction →
i n d u c t i o n
- conducts →
c o n d u c t s
Redistribute letters (one simple scheme)
- pool the multiset of letters and reallocate it by rule sets (e.g., maximize English bigram likelihood or enforce semantic families).
- for example, greedily form new word shapes that keep high-probability bigrams {tr, an, on, ct}:
- possible recompositions: transduction, conducts, transparency (originals), or playful neologisms like transducticon (still respecting letter supply).
As code (readable, minimal)
import re, collections

S = "Transparency transduces; induction conducts."
# Strip punctuation, lowercase, and keep only non-empty words.
W = [re.sub(r"[^\w]", "", w).lower() for w in S.split() if re.sub(r"[^\w]", "", w)]
pool = collections.Counter("".join(W))   # the letter multiset to redistribute

# Bigram scores (toy): favor common English bigrams
GOOD = {"th","tr","an","on","ct","io","ti","re","en","ra","du","co","nd","uc","ce","ns"}

def score(word):
    return sum(2 if a+b in GOOD else 0 for a, b in zip(word, word[1:]))

# Generate candidate recompositions from the letter pool (very pruned toy demo)
def can_make(word, pool):
    c = collections.Counter(word)
    return all(pool[ch] >= n for ch, n in c.items())

cands = ["transduction","conduction","transparency","induction","conducts","transduces"]
cands = [w for w in cands if can_make(w, pool)]
cands.sort(key=score, reverse=True)
print("reallocations:", cands[:3])
This shows “letters → new words” under explicit constraints you can swap (phonotactics, domain vocab, etc.).
3) Subwords & byte levels (what many models actually use)
Most modern tokenizers operate between words and letters:
- BPE/WordPiece learns merges like ["tran","s","duc","tion"].
- Byte-level tokenizers work directly on UTF-8 bytes (100% reversible) and avoid OOV.
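To see the byte-level claim concretely, here is a tiny round trip (a sketch; real byte-level tokenizers like GPT-2's add a byte→unicode remapping on top, but the reversibility argument is the same):

s = "Transparency transduces; induction conducts."
ids = list(s.encode("utf-8"))             # each character becomes one or more ids in 0-255
assert bytes(ids).decode("utf-8") == s    # lossless round trip, so no OOV is possible
print(len(s), "chars ->", len(ids), "byte ids")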
Tiny BPE sketch (conceptual)
Start: t r a n s d u c t i o n
Merge high-freq pairs: tr | an | s | du | ct | ion
Vectors attach at: ["tr","an","s","du","ct","ion"]
Redistribution at the subword level becomes: swap "ion" tails, keep "tr" heads, etc., giving grammatically plausible recombinations while staying in-vocabulary for the model.
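A sketch of that tail-swapping at the subword level; the segmentations below are hand-written stand-ins for a learned merge table:

segs = {
    "transduction": ["tr", "an", "s", "du", "ct", "ion"],
    "induction":    ["in", "du", "ct", "ion"],
    "conduction":   ["con", "du", "ct", "ion"],
}
heads = [p[0] for p in segs.values()]
tails = ["".join(p[1:]) for p in segs.values()]
# Cross heads with tails; every piece stays in the toy subword vocabulary.
# (A real pipeline would then filter the results against a lexicon or phonotactic rules.)
print(sorted({h + t for h in heads for t in tails}))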
4) Process graph you can follow (multi-perspective)
Dataflow (deterministic steps)
Text → Normalize → Tokenize → Map-to-IDs → Embed → Model-Forward → IDs → Detokenize → Text
Structural view (what carries what)
- Surface: strings, punctuation, whitespace
- Orthography: letters/bytes, casefolding, Unicode normalization (NFC/NFKC)
- Morphemes/subwords: BPE pieces (stable, model-friendly)
- Distributional: vectors (semantics, syntax)
- Computation: attention graph (who attends to whom; positions; heads)
- Decoding: beam sampling with constraints (grammar, style, safety)
Observer slices you can log simultaneously
- Counts: character/word n-grams, entropy per position
- Structure: token boundaries, merge ops, part-of-speech
- Semantics: cosine neighborhoods of tokens/sentences
- Attribution: attention maps, gradient-based saliency
- Constraint checks: does output consume only allowed letters? only allowed merges?
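A sketch of logging two of those slices side by side (character bigram counts and per-position token entropy); the helper names are illustrative, not a fixed API:

import collections, math

def char_bigrams(text):
    s = "".join(ch.lower() for ch in text if ch.isalpha())
    return collections.Counter(a + b for a, b in zip(s, s[1:]))

def position_entropy(token_rows):
    # Entropy of the token distribution observed at each position across examples.
    ents = []
    for col in zip(*token_rows):
        counts = collections.Counter(col)
        total = sum(counts.values())
        ents.append(-sum(c/total * math.log2(c/total) for c in counts.values()))
    return ents

print(char_bigrams("Transparency transduces").most_common(5))
print(position_entropy([["tr","an","s"], ["tr","an","ce"], ["co","nd","uc"]]))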
5) A clean, minimal “instrumented” pipeline (pseudo-code)
import re, unicodedata
from collections import Counter

def normalize(text):
    return unicodedata.normalize("NFKC", text).strip()

def tokenize_words(text):
    return re.findall(r"\w+|\S", text)   # words + punctuation

def to_chars(words):
    return [list(w.lower()) for w in words if w.isalnum()]

def bpe_encode(words, merges):
    # merges: learned pair -> merge table (placeholder here)
    encoded = []
    for w in words:
        pieces = list(w)
        improved = True
        while improved:
            improved = False
            for i in range(len(pieces) - 1):
                pair = pieces[i] + pieces[i+1]
                if pair in merges:
                    pieces[i:i+2] = [pair]
                    improved = True
                    break
        encoded += pieces
    return encoded

def redistribute_letters(words, strategy):
    # strategy can be: maximize bigram prob, keep stems, preserve syllable count, etc.
    pool = Counter("".join(words).lower())
    return construct_words_from_pool(pool, strategy)   # placeholder strategy hook

# instrument:
log = {}
text = "Transparency transduces; induction conducts."
log["normalized"]  = normalize(text)
log["tokens"]      = tokenize_words(log["normalized"])
chars              = to_chars(log["tokens"])
log["char_counts"] = Counter("".join("".join(cs) for cs in chars))
log["bpe"]         = bpe_encode([w for w in log["tokens"] if w.isalnum()], merges=LEARNED_MERGES)  # LEARNED_MERGES: learned merge table (placeholder)
log["redistrib"]   = redistribute_letters([w for w in log["tokens"] if w.isalnum()], strategy="bigram")
Swap in different strategies (phoneme constraints, domain lexicons, meter/rhythm) and you get different “harmonizations” of the same substrate.
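For concreteness, one possible shape for the construct_words_from_pool hook under the "bigram" strategy; the candidate lexicon and scoring set here are assumptions you would swap for your own:

from collections import Counter

GOOD    = {"th","tr","an","on","ct","io","ti","re","en","ra","du","co","nd","uc","ce","ns"}
LEXICON = ["transduction", "conduction", "induction", "transparency", "conducts"]  # assumed domain vocab

def construct_words_from_pool(pool, strategy):
    def bigram_score(w):
        return sum(2 for a, b in zip(w, w[1:]) if a + b in GOOD)
    def spellable(w):
        return all(pool[ch] >= n for ch, n in Counter(w).items())
    if strategy == "bigram":
        # Note: each word is checked against the full pool independently;
        # a stricter variant would deduct letters as words are emitted.
        return sorted((w for w in LEXICON if spellable(w)), key=bigram_score, reverse=True)
    raise ValueError(f"unknown strategy: {strategy}")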
6) Quick diagram tying “transduction, induction, conduction” to flows
letters ──(pool/constraints)──► subwords ──(merges)──► words
▲ │ │
└────────── (detokenize/ids) ◄─────┴──────(decode) ◄──┘
Energy metaphors:
- transduction: signal format shift (audio→text, text→ids, ids→audio)
- induction: pattern transfer (n-gram priors, embeddings, prompts)
- conduction: pathway/graph through which compute & gradients flow
7) How to “watch” all perspectives at once (practical)
- Character lens: histogram, bigram heatmap, editable letter pool.
- Subword lens: token sequence with merge highlights.
- Word lens: POS tags, dependency arcs.
- Vector lens: nearest-neighbor terms for each token.
- Graph lens: attention matrices (heads × layers), saliency overlays.
- Constraint lens: show which constraints gated which recompositions.
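As one worked lens, a minimal cosine-neighborhood check over toy vectors (the embeddings are made up for illustration; in practice they come from the model's embedding table):

import math

emb = {
    "transduction": [0.9, 0.1, 0.2],
    "conduction":   [0.8, 0.2, 0.1],
    "transparency": [0.2, 0.9, 0.3],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def neighbors(token, k=2):
    return sorted(((cosine(emb[token], v), t) for t, v in emb.items() if t != token), reverse=True)[:k]

print(neighbors("transduction"))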
If you’d like, I can drop in a small notebook that:
- logs each lens for your own sentence(s),
- tries a few redistribution strategies (pure bigram vs. subword-aware),
- prints side-by-side “before/after” with the decisions that led there.