tokenscript/docs
Docs · v0.0.1-alpha

The pre-tokeniser tokeniser, documented.

Tokenscript tokenises your English before your tokeniser tokenises it. This guide explains how, why, and — at a sufficient level of philosophical abstraction — whether.

Note POST /v1/tokenise and POST /api/waitlist are live and wired to production infrastructure. The remaining /v1/* endpoints still return 501 Not Implemented; we are building them in priority order of how cool they would be. Join the waitlist to be notified when the remaining ones arrive.

#Introduction

Every modern LLM ships with a tokeniser. The tokeniser shatters your carefully-chosen English into subword fragments — " vector", "ized", "-" — and hands them to the model as integer IDs. This is the part you know.

Tokenscript operates one layer earlier. We pre-tokenise your input, producing an intermediate representation (the tokenscript) that your downstream tokeniser can then re-tokenise with higher (or at least differently-shaped) fidelity. Think of it as a tokenised token of a token.

The above is the canonical example. It is also, coincidentally, the only example we have tested.

#Quickstart

You will need an API key (see Authentication), a terminal, and a willingness to suspend disbelief about the utility of what you are about to do.

shell
# pre-tokenise a string — edit the JSON below, then run it
curl https://tokenscript.ai/v1/tokenise \
  -H "Content-Type: application/json" \
  -d '{"input":"Write English, get vectorized-tokens.","target":"cl100k_base"}'

Example response:

json
{
  "id": "ts_01H9QXR...",
  "tokens": ["Write", " English", ",", " get", " vector", "ized", "-", "tok", "ens", "."],
  "token_ids": [8144, 6498, 11, 636, 4724, 1534, 12, 61528, 729, 13],
  "target": "cl100k_base",
  "usage": { "pre_tokens": 10, "post_tokens": 10, "Δ": 0 }
}
Note The alpha endpoint does not currently require an API key — we haven't built that yet. Token IDs are computed deterministically from the token string (FNV-1a mod vocab size), so they won't match real cl100k_base byte-pair-merge IDs exactly. Token boundaries are produced by the canonical cl100k_base pre-tokenisation regex, so the splits themselves are honest.
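The FNV-1a scheme described in the note is simple enough to reproduce client-side. A minimal sketch: `fnv1a_32` uses the standard 32-bit FNV offset basis and prime, while the 100,000 vocab size in `fake_token_id` is an illustrative assumption, since the note does not state which modulus the alpha actually uses.

```python
def fnv1a_32(s: str) -> int:
    """32-bit FNV-1a hash of a UTF-8 string."""
    h = 0x811C9DC5  # FNV offset basis
    for byte in s.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

def fake_token_id(token: str, vocab_size: int = 100_000) -> int:
    # Deterministic stand-in for a real BPE rank, as the note describes.
    # vocab_size is a guess; the docs do not specify the modulus.
    return fnv1a_32(token) % vocab_size
```

Being a pure function of the token string, repeated calls always agree with each other, though (as the note warns) not with real cl100k_base merge IDs.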

The Δ field reports the delta between pre- and post-tokenisation counts. A Δ of zero means the pre-tokenisation was informative but not intrusive. This is the optimum.

#Installation

Install the SDK for your runtime. The installation commands are real; the packages are not.

shell
# Python (3.9+)
pip install tokenscript

# Node / TypeScript
npm install @tokenscript/sdk
# or
pnpm add @tokenscript/sdk

# Go
go get github.com/tokenscript/tokenscript-go

# Rust
cargo add tokenscript

# Elixir — add {:tokenscript, "~> 0.0.1"} to mix.exs, then:
mix deps.get

#Authentication

Tokenscript uses bearer-token authentication. API keys are 40-character strings prefixed with tsk_live_ or tsk_test_. Treat them like passwords.

shell
export TOKENSCRIPT_API_KEY="tsk_live_01H9QXR6QP8YTPBQWQH3F…"

Every request must include:

Authorization: Bearer $TOKENSCRIPT_API_KEY
Heads up Keys leaked into git history are automatically de-tokenised. Rotate at dashboard.tokenscript.ai/keys. We scan public GitHub every 4 hours; if we find your key in a commit we'll revoke it and email you with a short poem.

#Pre-tokenisation: a very brief theory

Let T be a tokeniser — a function T: Σ* → ℤ^n mapping a string over alphabet Σ to a sequence of integer token IDs. Conventional wisdom holds that T should be applied directly to user input.

Tokenscript proposes instead a pre-tokeniser P such that the effective pipeline becomes:

output = T(P(input))

Where P preserves meaning but rearranges substring boundaries into a form that the subsequent T finds more agreeable. The mathematical name for this property is "vibes."

In practice we implement P as the identity function (P(x) = x) but with better branding.
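Under that definition the whole pipeline fits in a few lines. A sketch, where T is a toy whitespace tokeniser standing in for a real BPE tokeniser:

```python
def P(x: str) -> str:
    # The pre-tokeniser: the identity function, with better branding.
    return x

def T(x: str) -> list:
    # Stand-in downstream tokeniser; a real one would apply byte-pair merges.
    return x.split()

def pipeline(x: str) -> list:
    # output = T(P(input)), exactly as the theory section specifies.
    return T(P(x))
```

For every input, `pipeline(x) == T(x)`, which is the property the preceding paragraph names "vibes".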

Why it "works"

Empirically, we observe no change whatsoever: the output of T(P(x)) is byte-for-byte identical to T(x) for every input we have tried. This is consistent with the theory.

#Vectorisation

Once pre-tokenised, your tokenscript can be projected into a 1536-dimensional embedding space via POST /v1/vectorise. A successful response looks like:

json
{
  "vectors": [
    [0.213, -0.847, 0.119, "…", 0.004],
    [-0.551, 0.302, 0.974, "…", -0.118]
  ],
  "dim": 1536,
  "dtype": "float32",
  "normalised": true
}
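Since the response claims "normalised": true, a client can sanity-check that each row has unit L2 norm. A sketch; the tolerance is an arbitrary choice to absorb float32 round-off:

```python
import math

def is_unit_norm(vec, tol=1e-4):
    # float32 round-off means the norm is only approximately 1.
    norm = math.sqrt(sum(x * x for x in vec))
    return abs(norm - 1.0) <= tol

def normalise(vec):
    # Divide each component by the L2 norm.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]
```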

#Token drift

"Token drift" describes the phenomenon by which a pre-tokenised string, once passed through a downstream tokeniser and then a second model, diverges from its original embedding by an amount greater than or equal to zero.

We bound this drift using a proprietary technique called not doing anything. In benchmark tests, tokenscripts subjected to our drift controls exhibit 0% drift relative to the untreated baseline. We are preparing a paper.
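The docs never define how drift is measured; a common choice is cosine distance, under which the 0% figure follows directly from the vectors being unchanged. A sketch under that assumption:

```python
import math

def cosine_distance(a, b):
    # 1 minus the cosine similarity of two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def drift(original, roundtripped):
    # Clamped at zero: "greater than or equal to zero", as promised.
    return max(0.0, cosine_distance(original, roundtripped))
```

An untreated tokenscript compared with itself yields a drift of zero, which is the entirety of the benchmark result.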

#Supported tokenisers

Tokenscript can pre-tokenise strings destined for any of the following downstream tokenisers. Coverage is not exhaustive; coverage is aspirational.

Tokeniser        | Family      | Status
cl100k_base      | OpenAI      | Stable
o200k_base       | OpenAI      | Stable
claude           | Anthropic   | Stable
gemini           | Google      | Beta
llama3           | Meta        | Beta
bpe_32k_en       | Generic     | Deprecated
wordpiece_legacy | Historical  | Spiritually supported
english          | Pre-digital | Always has been

#API reference

Base URL: https://api.tokenscript.ai. All endpoints speak JSON. All timestamps are RFC 3339. All vectors are row-major.

POST /v1/tokenise

Live now No auth is required during alpha. Try it: curl -X POST https://tokenscript.ai/v1/tokenise -H 'content-type: application/json' -d '{"input":"hello","target":"cl100k_base"}'.

Pre-tokenise a string for the given target tokeniser.

Parameter | Type              | Description
input     | string, required  | The English to be pre-tokenised. Up to 1 MiB.
target    | string, required  | Downstream tokeniser name. See Supported tokenisers.
mode      | enum, optional    | "pre" (default), "post", "meta", "vibes".
stream    | bool, optional    | Emit tokens as they are produced. Default false.
seed      | integer, optional | Deterministic seed. The function is deterministic either way. Accepted out of politeness.

POST /v1/vectorise

Turn a tokenscript (or raw string) into a 1536-dim float32 vector. Accepts either input or tokenscript_id.

GET /v1/scripts/{id}

Retrieve a previously-computed tokenscript by ID. Scripts are retained for 30 days, then politely forgotten.

POST /api/waitlist

Live now Fully implemented and accepting real signups.

Register an email address for product launch notifications.

Parameter | Type             | Description
email     | string, required | A valid RFC 5322-adjacent email address. Up to 254 characters.
shell
curl -X POST https://tokenscript.ai/api/waitlist \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]"}'

# → 200 OK
# { "ok": true, "already": false }

# → 400 Bad Request
# { "error": "invalid email" }

#Errors

Tokenscript uses conventional HTTP status codes. Error bodies are JSON with shape { "error": "<message>" }.

Status | Meaning             | Typical cause
200    | OK                  | You were correct.
400    | Bad request         | Body not JSON, invalid email, unsupported tokeniser.
401    | Unauthorised        | Missing or invalid API key.
402    | Payment required    | Reserved. We have no billing.
405    | Method not allowed  | You sent GET where POST was expected.
418    | I'm a teapot        | You are correct. We are.
429    | Rate limited        | You exceeded the quota. See Rate limits.
501    | Not implemented     | The /v1/* endpoints, for now.
522    | Existential timeout | The pre-tokeniser failed to locate meaning.

#Rate limits

During alpha we enforce a soft limit of 100 requests per minute per key. Bursts up to 500 are permitted if accompanied by a compelling reason, submitted in the X-Reason header.

X-Reason: writing a compiler for my girlfriend's birthday

Rate limit state is exposed via response headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 94
X-RateLimit-Reset: 1776636492
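A client can read these headers to pause until the quota resets. A sketch: the header names match those documented above, but the interpretation of X-RateLimit-Reset as a Unix timestamp in seconds is an assumption.

```python
import time

def seconds_until_reset(headers, now=None):
    """How long to wait before retrying, once the remaining quota hits 0."""
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0  # quota left; no need to wait
    reset = int(headers["X-RateLimit-Reset"])  # assumed Unix epoch seconds
    now = time.time() if now is None else now
    return max(0.0, reset - now)
```

Call it on each response and sleep for the returned duration before the next request.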

#SDKs

Python

python
from tokenscript import Tokenscript

ts = Tokenscript(api_key="tsk_live_...")

result = ts.tokenise(
    input="Write English, get vectorized-tokens.",
    target="cl100k_base",
)

print(result.tokens)
# ['Write', ' English', ',', ' get', ' vector', 'ized', '-', 'tok', 'ens', '.']

Node / TypeScript

typescript
import { Tokenscript } from "@tokenscript/sdk";

const ts = new Tokenscript({ apiKey: process.env.TOKENSCRIPT_API_KEY! });

const result = await ts.tokenise({
  input: "Write English, get vectorized-tokens.",
  target: "cl100k_base",
});

console.log(result.tokens);

curl

See Quickstart. curl is, always has been, and will always be, fully supported.

#Constants

These values are stable across the lifetime of the /v1 API. We will publish a changelog entry if any of them move.

MAX_INPUT_BYTES = 1_048_576
MAX_TOKENS_PER_REQUEST = 131_072
VECTOR_DIM = 1536
VECTOR_DTYPE = "float32"
KEY_PREFIX_LIVE = "tsk_live_"
KEY_PREFIX_TEST = "tsk_test_"
SCRIPT_TTL_DAYS = 30
MEANING_COEFFICIENT = 1.0
Note MEANING_COEFFICIENT is currently hardcoded and cannot be tuned from the client. We are tracking the request to expose it on issue #42.
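The key constants above suggest a cheap client-side shape check before a request leaves your machine. A sketch; whether the 40-character length from Authentication includes the prefix is not stated, so this assumes it does:

```python
KEY_PREFIX_LIVE = "tsk_live_"
KEY_PREFIX_TEST = "tsk_test_"
KEY_LENGTH = 40  # per Authentication: "40-character strings"

def looks_like_api_key(key):
    # Shape check only; it cannot tell a revoked key from a live one.
    has_prefix = key.startswith((KEY_PREFIX_LIVE, KEY_PREFIX_TEST))
    return has_prefix and len(key) == KEY_LENGTH
```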

#Webhooks

Tokenscript can POST events to a URL of your choosing when interesting things happen (tokenise.completed, script.expired, waitlist.joined). Events are signed with HMAC-SHA256 using your webhook secret.

http
POST /hooks/tokenscript HTTP/1.1
Host: your-app.example.com
Tokenscript-Signature: t=1776636492,v1=5c4f9...
Content-Type: application/json

{
  "id": "evt_01H9QXR…",
  "type": "waitlist.joined",
  "data": { "email": "[email protected]", "country": "US" }
}

Verify signatures like you would with any mature webhook product. Do not skip verification. We will find out.
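What "like a mature webhook product" means concretely: recompute the HMAC over the raw request body and compare in constant time. This sketch assumes a Stripe-style scheme where the signed payload is the timestamp, a dot, and the raw body; these docs do not pin down the exact format, so treat it as illustrative.

```python
import hashlib
import hmac

def verify_signature(secret, body, header):
    """Check a Tokenscript-Signature header of the form 't=<ts>,v1=<hex>'."""
    parts = dict(p.split("=", 1) for p in header.split(","))
    # Assumed signed payload: "<timestamp>.<raw body>" (not specified here).
    signed = parts["t"].encode() + b"." + body
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing.
    return hmac.compare_digest(expected, parts["v1"])
```

Compare against the raw bytes you received, before any JSON parsing, or the signature will not match.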

#FAQ

Is Tokenscript a real product?

The waitlist endpoint is real and stores your email in production infrastructure. The /v1/* endpoints are, presently, performance art. Whether the two together constitute a product is a matter of some debate.

Will this improve my model's accuracy?

Almost certainly not. However, it will not hurt it, which is more than can be said for several well-funded alternatives.

Does pre-tokenisation compose with itself?

Yes. Tokenscript-of-tokenscript is a valid operation and returns the original tokenscript. We call this the idempotent clause. It is the only theorem we have proved.

Can I self-host?

Not yet. In the meantime, you can simulate self-hosting by writing a function def tokenise(s): return s.split() and calling it instead.

What happens if I use Tokenscript on a non-English input?

The input is politely returned unchanged. The Δ field will contain a small number reflecting our feelings.

Where is data stored?

Waitlist entries are stored in a globally-distributed key-value store. API logs, once we have an API, will be stored in us-west and eu-west. Nothing is stored forever except the memory of having read this page.

#Changelog

Version     | Date             | Notes
0.0.1-alpha | 2026-04-19       | Initial public surface. Waitlist endpoint live. Docs published. Nothing else.
0.0.0       | The Before Times | Conceived in a group chat. No code.