ai
Intro to Tokenization
Apr 2026
To create an LLM, pretraining and finetuning is involved. Pretraining is done through self supervised learning on raw data to create a base or a foundation model. In self supervised learning, an LLM uses the labelled data generated by itself to train itself. Fine tuning is more specific to the use case. Instruction and classification are two types of fine tuning. InstructGPT is a fine tuned GPT-3 model using instruction finetuning. Transformer architecture is used at the core of an LLM which was originally used for machine translations. It has an encoder and decoder. Encoder takes the input text and converts it into numerical values. Decoder takes the numerical values and generates the output text. BERT is one of the variants of transformer architecture. It’s mainly used for masked word prediction, which means the input data can consist of what comes before or after the token to be predicted. GPT is used for next word prediction. Autoregressive models use previous outputs as the next input. Zero shot learning means the model doesn’t require any examples to convey the requirement and expectation of the task. Pretraining involves data preparation which includes converting the text to tokens and then to vector representations. Then a sampling strategy is used to generate input output pairs. The process of converting raw text to vector representation is called embeddings. LLMs are not compatible with categorical raw data. They need vectors for mathematical computations. Embedding models like text embedding models are used for this. Word2Vec is one of the landmark algorithms. Dimensionality of embeddings determines the expressive capacity of the model. The first step is tokenization. Take the text from the public domain and split the characters. Whether to include whitespaces is a business or technical choice. The tokens generated from this raw text are now ready to be converted into token ids. From these tokens we create a set of unique tokens and assign a numeric ID to each. When a new text is used as input, each token of that text is converted to these numeric IDs from our vocabulary.
read more