tokenscript/docs
Docs · v0.0.1-alpha

The pre-tokeniser tokeniser, documented.

Tokenscript tokenises your English before your tokeniser tokenises it. This guide explains how, why, and — at a sufficient level of philosophical abstraction — whether.

Note POST /v1/tokenise and POST /api/waitlist are live and wired to production infrastructure. The remaining /v1/* endpoints still return 501 Not Implemented; we are building them in priority order of how cool they would be. Join the waitlist to be notified when the remaining ones arrive.

#Introduction

Every modern LLM ships with a tokeniser. The tokeniser shatters your carefully-chosen English into subword fragments — " vector", "ized", "-" — and hands them to the model as integer IDs. This is the part you know.

Tokenscript operates one layer earlier. We pre-tokenise your input, producing an intermediate representation (the tokenscript) that your downstream tokeniser can then re-tokenise with higher (or at least differently-shaped) fidelity. Think of it as a tokenised token of a token.

The above is the canonical example. It is also, coincidentally, the only example we have tested.

#Quickstart

You will need an API key (see Authentication), a terminal, and a willingness to suspend disbelief about the utility of what you are about to do.

shell
# pre-tokenise a string — edit the JSON below, then run it
curl https://tokenscript.ai/v1/tokenise \
  -H "Content-Type: application/json" \
  -d '{"input":"Write English, get vectorized-tokens.","target":"cl100k_base"}'

Example response:

json
{
  "id": "ts_01H9QXR...",
  "tokens": ["Write", " English", ",", " get", " vector", "ized", "-", "tok", "ens", "."],
  "token_ids": [8144, 6498, 11, 636, 4724, 1534, 12, 61528, 729, 13],
  "target": "cl100k_base",
  "usage": { "pre_tokens": 10, "post_tokens": 10, "Δ": 0 }
}
Note The alpha endpoint does not currently require an API key — we haven't built that yet. Token IDs are computed deterministically from the token string (FNV-1a mod vocab size), so they won't match real cl100k_base byte-pair-merge IDs exactly. Token boundaries are produced by the canonical cl100k_base pre-tokenisation regex, so the splits themselves are honest.
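The FNV-1a scheme described in the note is simple enough to reproduce client-side. A minimal sketch: `fnv1a_32` uses the standard 32-bit FNV offset basis and prime, while the 100,000 vocab size in `fake_token_id` is an illustrative assumption, since the note does not state which modulus the alpha actually uses.

```python
def fnv1a_32(s: str) -> int:
    """32-bit FNV-1a hash of a UTF-8 string."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in s.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

def fake_token_id(token: str, vocab_size: int = 100_000) -> int:
    # Deterministic stand-in for a real BPE rank, as the note describes.
    # vocab_size is a guess; the docs do not specify the modulus.
    return fnv1a_32(token) % vocab_size
```

Being a pure function of the token string, repeated calls always agree with each other, though (as the note warns) not with real cl100k_base merge IDs.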

The Δ field reports the delta between pre- and post-tokenisation counts. A Δ of zero means the pre-tokenisation was informative but not intrusive. This is the optimum.

#Installation

Install the SDK for your runtime. The installation commands are real; the packages are not.

shell
# Python (3.9+)
pip install tokenscript

# Node / TypeScript
npm install @tokenscript/sdk
# or
pnpm add @tokenscript/sdk

# Go
go get github.com/tokenscript/tokenscript-go

# Rust
cargo add tokenscript

# Elixir — add {:tokenscript, "~> 0.0.1"} to mix.exs, then:
mix deps.get

#Authentication

Tokenscript uses bearer-token authentication. API keys are 40-character strings prefixed with tsk_live_ or tsk_test_. Treat them like passwords.

shell
export TOKENSCRIPT_API_KEY="tsk_live_01H9QXR6QP8YTPBQWQH3F…"

Every request must include:

Authorization: Bearer $TOKENSCRIPT_API_KEY
Heads up Keys leaked into git history are automatically de-tokenised. Rotate at dashboard.tokenscript.ai/keys. We scan public GitHub every 4 hours; if we find your key in a commit we'll revoke it and email you with a short poem.

#Pre-tokenisation: a very brief theory

Let T be a tokeniser — a function T: Σ* → ℤ^n mapping a string over alphabet Σ to a sequence of integer token IDs. Conventional wisdom holds that T should be applied directly to user input.

Tokenscript proposes instead a pre-tokeniser P such that the effective pipeline becomes:

output = T(P(input))

Where P preserves meaning but rearranges substring boundaries into a form that the subsequent T finds more agreeable. The mathematical name for this property is "vibes."

In practice we implement P as the identity function (P(x) = x) but with better branding.
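Under that definition the whole pipeline fits in a few lines. A sketch, where T is a toy whitespace tokeniser standing in for a real BPE tokeniser:

```python
def P(x: str) -> str:
    # The pre-tokeniser: the identity function, with better branding.
    return x

def T(x: str) -> list:
    # Stand-in downstream tokeniser; a real one would apply byte-pair merges.
    return x.split()

def pipeline(x: str) -> list:
    # output = T(P(input)), exactly as the theory section specifies.
    return T(P(x))
```

For every input, `pipeline(x) == T(x)`, which is the property the preceding paragraph names "vibes".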

Why it "works"

Empirically, we observe no change whatsoever: the output of T(P(x)) is byte-for-byte identical to T(x) for every input we have tried. This is consistent with the theory.

#Vectorisation

Once pre-tokenised, your tokenscript can be projected into a 1536-dimensional embedding space via POST /v1/vectorise. A successful response looks like:

json
{
  "vectors": [
    [0.213, -0.847, 0.119, "…", 0.004],
    [-0.551, 0.302, 0.974, "…", -0.118]
  ],
  "dim": 1536,
  "dtype": "float32",
  "normalised": true
}
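Since the response claims "normalised": true, a client can sanity-check that each row has unit L2 norm. A sketch; the tolerance is an arbitrary choice to absorb float32 round-off:

```python
import math

def is_unit_norm(vec, tol=1e-4):
    # float32 round-off means the norm is only approximately 1.
    norm = math.sqrt(sum(x * x for x in vec))
    return abs(norm - 1.0) <= tol

def normalise(vec):
    # Divide each component by the L2 norm.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]
```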

#Token drift

"Token drift" describes the phenomenon by which a pre-tokenised string, once passed through a downstream tokeniser and then a second model, diverges from its original embedding by an amount greater than or equal to zero.

We bound this drift using a proprietary technique called not doing anything. In benchmark tests, tokenscripts subjected to our drift controls exhibit 0% drift relative to the untreated baseline. We are preparing a paper.
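The docs never define how drift is measured; a common choice is cosine distance, under which the 0% figure follows directly from the vectors being unchanged. A sketch under that assumption:

```python
import math

def cosine_distance(a, b):
    # 1 minus the cosine similarity of two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def drift(original, roundtripped):
    # Clamped at zero: "greater than or equal to zero", as promised.
    return max(0.0, cosine_distance(original, roundtripped))
```

An untreated tokenscript compared with itself yields a drift of zero, which is the entirety of the benchmark result.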

#Supported tokenisers

Tokenscript can pre-tokenise strings destined for any of the following downstream tokenisers. Coverage is not exhaustive; coverage is aspirational.

Tokeniser        | Family      | Status
cl100k_base      | OpenAI      | Stable
o200k_base       | OpenAI      | Stable
claude           | Anthropic   | Stable
gemini           | Google      | Beta
llama3           | Meta        | Beta
bpe_32k_en       | Generic     | Deprecated
wordpiece_legacy | Historical  | Spiritually supported
english          | Pre-digital | Always has been

#API reference

Base URL: https://api.tokenscript.ai. All endpoints speak JSON. All timestamps are RFC 3339. All vectors are row-major.

POST /v1/tokenise

Live now No auth is required during alpha. Try it: curl -X POST https://tokenscript.ai/v1/tokenise -H 'content-type: application/json' -d '{"input":"hello","target":"cl100k_base"}'.

Pre-tokenise a string for the given target tokeniser.

Parameter | Type              | Description
input     | string, required  | The English to be pre-tokenised. Up to 1 MiB.
target    | string, required  | Downstream tokeniser name. See Supported tokenisers.
mode      | enum, optional    | "pre" (default), "post", "meta", "vibes".
stream    | bool, optional    | Emit tokens as they are produced. Default false.
seed      | integer, optional | Deterministic seed. The function is deterministic either way. Accepted out of politeness.

POST /v1/vectorise

Turn a tokenscript (or raw string) into a 1536-dim float32 vector. Accepts either input or tokenscript_id.

GET /v1/scripts/{id}

Retrieve a previously-computed tokenscript by ID. Scripts are retained for 30 days, then politely forgotten.

POST /api/waitlist

Live now Fully implemented and accepting real signups.

Register an email address for product launch notifications.

Parameter | Type             | Description
email     | string, required | A valid RFC 5322-adjacent email address. Up to 254 characters.
shell
curl -X POST https://tokenscript.ai/api/waitlist \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]"}'

# → 200 OK
# { "ok": true, "already": false }

# → 400 Bad Request
# { "error": "invalid email" }

#Errors

Tokenscript uses conventional HTTP status codes. Error bodies are JSON with shape { "error": "<message>" }.

Status | Meaning             | Typical cause
200    | OK                  | You were correct.
400    | Bad request         | Body not JSON, invalid email, unsupported tokeniser.
401    | Unauthorised        | Missing or invalid API key.
402    | Payment required    | Reserved. We have no billing.
405    | Method not allowed  | You sent GET where POST was expected.
418    | I'm a teapot        | You are correct. We are.
429    | Rate limited        | You exceeded the quota. See Rate limits.
501    | Not implemented     | The /v1/* endpoints, for now.
522    | Existential timeout | The pre-tokeniser failed to locate meaning.

#Rate limits

During alpha we enforce a soft limit of 100 requests per minute per key. Bursts up to 500 are permitted if accompanied by a compelling reason, submitted in the X-Reason header.

X-Reason: writing a compiler for my girlfriend's birthday

Rate limit state is exposed via response headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 94
X-RateLimit-Reset: 1776636492
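A client can read these headers to pause until the quota resets. A sketch: the header names match those documented above, but the interpretation of X-RateLimit-Reset as a Unix timestamp in seconds is an assumption.

```python
import time

def seconds_until_reset(headers, now=None):
    """How long to wait before retrying, once the remaining quota hits 0."""
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0  # quota left; no need to wait
    reset = int(headers["X-RateLimit-Reset"])  # assumed Unix epoch seconds
    now = time.time() if now is None else now
    return max(0.0, reset - now)
```

Call it on each response and sleep for the returned duration before the next request.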

#SDKs

Python

python
from tokenscript import Tokenscript

ts = Tokenscript(api_key="tsk_live_...")

result = ts.tokenise(
    input="Write English, get vectorized-tokens.",
    target="cl100k_base",
)

print(result.tokens)
# ['Write', ' English', ',', ' get', ' vector', 'ized', '-', 'tok', 'ens', '.']

Node / TypeScript

typescript
import { Tokenscript } from "@tokenscript/sdk";

const ts = new Tokenscript({ apiKey: process.env.TOKENSCRIPT_API_KEY! });

const result = await ts.tokenise({
  input: "Write English, get vectorized-tokens.",
  target: "cl100k_base",
});

console.log(result.tokens);

curl

See Quickstart. curl is, always has been, and will always be, fully supported.

#Constants

These values are stable across the lifetime of the /v1 API. We will publish a changelog entry if any of them move.

MAX_INPUT_BYTES = 1_048_576
MAX_TOKENS_PER_REQUEST = 131_072
VECTOR_DIM = 1536
VECTOR_DTYPE = "float32"
KEY_PREFIX_LIVE = "tsk_live_"
KEY_PREFIX_TEST = "tsk_test_"
SCRIPT_TTL_DAYS = 30
MEANING_COEFFICIENT = 1.0
Note MEANING_COEFFICIENT is currently hardcoded and cannot be tuned from the client. We are tracking the request to expose it on issue #42.
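The key constants above suggest a cheap client-side shape check before a request leaves your machine. A sketch; whether the 40-character length from Authentication includes the prefix is not stated, so this assumes it does:

```python
KEY_PREFIX_LIVE = "tsk_live_"
KEY_PREFIX_TEST = "tsk_test_"
KEY_LENGTH = 40  # per Authentication: "40-character strings"

def looks_like_api_key(key):
    # Shape check only; it cannot tell a revoked key from a live one.
    has_prefix = key.startswith((KEY_PREFIX_LIVE, KEY_PREFIX_TEST))
    return has_prefix and len(key) == KEY_LENGTH
```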

#Webhooks

Tokenscript can POST events to a URL of your choosing when interesting things happen (tokenise.completed, script.expired, waitlist.joined). Events are signed with HMAC-SHA256 using your webhook secret.

http
POST /hooks/tokenscript HTTP/1.1
Host: your-app.example.com
Tokenscript-Signature: t=1776636492,v1=5c4f9...
Content-Type: application/json

{
  "id": "evt_01H9QXR…",
  "type": "waitlist.joined",
  "data": { "email": "[email protected]", "country": "US" }
}

Verify signatures like you would with any mature webhook product. Do not skip verification. We will find out.
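What "like a mature webhook product" means concretely: recompute the HMAC over the raw request body and compare in constant time. This sketch assumes a Stripe-style scheme where the signed payload is the timestamp, a dot, and the raw body; these docs do not pin down the exact format, so treat it as illustrative.

```python
import hashlib
import hmac

def verify_signature(secret, body, header):
    """Check a Tokenscript-Signature header of the form 't=<ts>,v1=<hex>'."""
    parts = dict(p.split("=", 1) for p in header.split(","))
    # Assumed signed payload: "<timestamp>.<raw body>" (not specified here).
    signed = parts["t"].encode() + b"." + body
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing.
    return hmac.compare_digest(expected, parts["v1"])
```

Compare against the raw bytes you received, before any JSON parsing, or the signature will not match.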

#FAQ

Is Tokenscript a real product?

The waitlist endpoint is real and stores your email in production infrastructure. The /v1/* endpoints are, presently, performance art. Whether the two together constitute a product is a matter of some debate.

Will this improve my model's accuracy?

Almost certainly not. However, it will not hurt it, which is more than can be said for several well-funded alternatives.

Does pre-tokenisation compose with itself?

Yes. Tokenscript-of-tokenscript is a valid operation and returns the original tokenscript. We call this the idempotent clause. It is the only theorem we have proved.

Can I self-host?

Not yet. In the meantime, you can simulate self-hosting by writing a function def tokenise(s): return s.split() and calling it instead.

What happens if I use Tokenscript on a non-English input?

The input is politely returned unchanged. The Δ field will contain a small number reflecting our feelings.

Where is data stored?

Waitlist entries are stored in a globally-distributed key-value store. API logs, once we have an API, will be stored in us-west and eu-west. Nothing is stored forever except the memory of having read this page.

#Changelog

Version     | Date             | Notes
0.0.1-alpha | 2026-04-19       | Initial public surface. Waitlist endpoint live. Docs published. Nothing else.
0.0.0       | The Before Times | Conceived in a group chat. No code.