...
AI
Intro to Tokenization
4 min read
TocreateanLLM,pretrainingandfinetuningisinvolved.Pretrainingisdonethroughselfsupervisedlearningonrawdatatocreateabaseorafoundationmodel.Inselfsupervisedlearning,anLLMusesthelabelleddatageneratedbyitselftotrainitself.
Finetuningismorespecifictotheusecase.Instructionandclassificationaretwotypesoffinetuning.InstructGPTisafinetunedGPT-3modelusinginstructionfinetuning.

TransformerarchitectureisusedatthecoreofanLLMwhichwasoriginallyusedformachinetranslations.Ithasanencoderanddecoder.Encodertakestheinputtextandconvertsitintonumericalvalues.Decodertakesthenumericalvaluesandgeneratestheoutputtext.
BERTisoneofthevariantsoftransformerarchitecture.It’smainlyusedformaskedwordprediction,whichmeanstheinputdatacanconsistofwhatcomesbeforeorafterthetokentobepredicted.GPTisusedfornextwordprediction.Autoregressivemodelsusepreviousoutputsasthenextinput.
Zeroshotlearningmeansthemodeldoesn’trequireanyexamplestoconveytherequirementandexpectationofthetask.
Pretraininginvolvesdatapreparationwhichincludesconvertingthetexttotokensandthentovectorrepresentations.Thenasamplingstrategyisusedtogenerateinputoutputpairs.
Theprocessofconvertingrawtexttovectorrepresentationiscalledembeddings.LLMsarenotcompatiblewithcategoricalrawdata.Theyneedvectorsformathematicalcomputations.
Embeddingmodelsliketextembeddingmodelsareusedforthis.Word2Vecisoneofthelandmarkalgorithms.Dimensionalityofembeddingsdeterminestheexpressivecapacityofthemodel.
Thefirststepistokenization.Takethetextfromthepublicdomainandsplitthecharacters.Whethertoincludewhitespacesisabusinessortechnicalchoice.Thetokensgeneratedfromthisrawtextarenowreadytobeconvertedintotokenids.
FromthesetokenswecreateasetofuniquetokensandassignanumericIDtoeach.Whenanewtextisusedasinput,eachtokenofthattextisconvertedtothesenumericIDsfromourvocabulary.
Aclasswhoseencodemethodtakestherawtext,convertsittotokensandtokenids,andinitializesitasvocabulary.Decodemethodtakesthenumerictokenidsandgivesusbackthetokensassociatedwithit.
Whilecreatingthevocabusingmultipleindependenttexts,it’sgoodtoaddanendoftexttokeninthevocab.Whileencoding,ifanewtokenismet,thenanunktokencanalsobeaddedtothevocabanditstokenidcanbeusedfornewwordswhichwerenotpresentinthevocabdata.
BytepairtokenizationisusedbyGPTmodels.Ithasavocabularyofaround50ktokens.Itsolvestheproblemofwordleveltokenization.Ifawordisnotpresentinthevocabulary,itdoesn’tgettokenizedcorrectly.Also,ifmultiplewordscomefromthesamerootword,memorygetswastedbyassigningseparatetokenstoallofthem.
Itstartsfromthealphabetsofthevocabulary,eachbyte,andthenfindsthemostcommonadjacentpairinthisarrayofcharacters.Itappendsthatpairintothevocabularyandcontinuestodosountilthetargetvocabularysizeismetornohighfrequencypairsareleft.

Thefirststepistocreateyourbasevocabularyoftheexistingcorpus.Herewecreateadictionaryofwordsandtheirfrequenciesandeachwordisstoredasatupleofcharacters.
Thenextstepistocountthefrequencyofeachadjacentpairineachofthesetuplesofwords.Whicheverpairhasthehighestfrequencygetsmergedandappendedtotheexistingvocabulary.Forthis,weiterateoverthevocabularydictionaryandcounteachneighboringpairinaseparatedictionary.
Then,formerging,wetakeourvocabularyandthepairthatistobemerged.Wecheckeachtupleandifatuplehasthispair,thenwecreateanewtuplewherethispairappearsmergedinsteadofseparatecharactersandthistuplegetsappendedtothevocab.
Thenfromthisvocabulary,wecanassignidstoeachtokenofthetuple,whichcanjustbeanindexofanarrayfornow.Wethenhaveourtokensreadytobeencodedanddecoded.