LLM Tokens

Intro to Tokenization


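Creating an LLM involves pretraining and fine-tuning. Pretraining is done through self-supervised learning on raw data to create a base or foundation model. In self-supervised learning, the labels are derived from the data itself (for example, the next word in a sentence), so the model effectively generates its own labelled data to train on.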

Fine-tuning is more specific to the use case. Instruction and classification are two types of fine-tuning. InstructGPT is a GPT-3 model fine-tuned using instruction fine-tuning.

The transformer architecture, which was originally used for machine translation, is at the core of an LLM. It has an encoder and a decoder. The encoder takes the input text and converts it into numerical values. The decoder takes those numerical values and generates the output text.

BERT is one of the variants of the transformer architecture. It’s mainly used for masked word prediction, which means the input can include what comes both before and after the token to be predicted. GPT is used for next word prediction. Autoregressive models like GPT feed their previous outputs back in as the next input.

Zero-shot learning means the model doesn’t require any examples to convey the requirements and expectations of the task.

Pretraining involves data preparation, which includes converting the text to tokens and then to vector representations. Then a sampling strategy is used to generate input-output pairs.
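One common sampling strategy for next word prediction is a sliding window, where the target is simply the input shifted by one token. The sketch below assumes that strategy; the function name `make_pairs`, its parameters, and the token IDs are made up for illustration.

```python
def make_pairs(token_ids, context_size, stride=1):
    """Slide a fixed-size window over token_ids.

    Each target sequence is the input sequence shifted right by one token.
    """
    pairs = []
    for start in range(0, len(token_ids) - context_size, stride):
        inputs = token_ids[start:start + context_size]
        targets = token_ids[start + 1:start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = [10, 11, 12, 13, 14, 15]  # made-up IDs, not from a real vocabulary
pairs = make_pairs(token_ids, context_size=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A larger `stride` reduces the overlap between consecutive windows, which is a trade-off between the number of training samples and how often the same tokens are seen.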

The process of converting raw text to a vector representation is called embedding. LLMs cannot work with categorical raw data directly. They need vectors for mathematical computations.

Embedding models, such as text embedding models, are used for this. Word2Vec is one of the landmark algorithms. The dimensionality of the embeddings affects the expressive capacity of the model.
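Conceptually, an embedding layer is a lookup table with one vector per token ID. The sketch below uses small, randomly initialized vectors purely for illustration; in a real model these values are learned during training, and `vocab_size`, `embedding_dim`, and `embed` are hypothetical names.

```python
import random

random.seed(0)       # fixed seed so the illustration is repeatable
vocab_size = 6       # number of distinct tokens (assumed)
embedding_dim = 3    # length of each token's vector (assumed)

# One row of embedding_dim floats per token ID
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([2, 5, 2])
print(len(vectors), "vectors of length", len(vectors[0]))
```

The same token ID always maps to the same row, so the first and third vectors above are identical.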

The first step is tokenization. Take the text from the public domain and split it into tokens. Whether to include whitespace is a business or technical choice. The tokens generated from this raw text are now ready to be converted into token IDs.
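A minimal sketch of this splitting step, assuming tokens are words and punctuation marks separated with a regular expression. The function name `tokenize`, the `keep_whitespace` flag, and the exact punctuation set are illustrative choices.

```python
import re


def tokenize(text, keep_whitespace=False):
    """Split text on whitespace and common punctuation.

    The capturing group makes re.split keep the delimiters as tokens.
    """
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    if keep_whitespace:
        return [p for p in pieces if p != ""]
    return [p for p in pieces if p.strip()]


tokens = tokenize("Hello, world. Is this a test?")
print(tokens)
```

Dropping whitespace shrinks the token stream, while keeping it preserves formatting that matters for inputs such as source code.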

From these tokens we create a set of unique tokens and assign a numeric ID to each. When a new text is used as input, each token of that text is converted to these numeric IDs from our vocabulary.
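As a small illustration, the vocabulary can be a dictionary from token to ID. Sorting the unique tokens first is an assumption made here so that the IDs are reproducible; the token list is made up.

```python
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# Unique tokens, sorted so the same corpus always yields the same IDs
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Convert a new piece of (already tokenized) text to IDs
ids = [vocab[t] for t in ["the", "mat", "."]]
print(vocab)
print(ids)
```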

This can be wrapped in a class that builds the vocabulary from the raw text and whose encode method converts text to tokens and then to token IDs. The decode method takes the numeric token IDs and gives us back the tokens associated with them.

When creating the vocab from multiple independent texts, it’s good to add an end-of-text token to the vocab to mark where one text ends. An unk token can also be added to the vocab, so that while encoding, its token ID can be used for any new word that was not present in the vocab data.

Here is a simple text tokenizer in Python.
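The following is a minimal sketch of such a tokenizer rather than a definitive implementation. It assumes the vocabulary is built once in the constructor, and the class name `SimpleTokenizer` and the special token strings `<|unk|>` and `<|endoftext|>` are illustrative choices.

```python
import re


class SimpleTokenizer:
    UNK = "<|unk|>"          # stands in for tokens missing from the vocab
    EOT = "<|endoftext|>"    # can be placed between independent texts
    _splitter = re.compile(r'([,.:;?_!"()\']|--|\s)')

    def __init__(self, texts):
        # Build the vocabulary from every text, then add the special tokens
        tokens = set()
        for text in texts:
            tokens.update(self._tokenize(text))
        all_tokens = sorted(tokens) + [self.EOT, self.UNK]
        self.str_to_id = {tok: i for i, tok in enumerate(all_tokens)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    def _tokenize(self, text):
        return [p for p in self._splitter.split(text) if p.strip()]

    def encode(self, text):
        unk_id = self.str_to_id[self.UNK]
        return [self.str_to_id.get(tok, unk_id) for tok in self._tokenize(text)]

    def decode(self, ids):
        text = " ".join(self.id_to_str[i] for i in ids)
        # Remove the space that join() placed before punctuation
        return re.sub(r"\s+([,.:;?!])", r"\1", text)


tokenizer = SimpleTokenizer(["The cat sat.", "A dog ran."])
ids = tokenizer.encode("The bird sat.")
print(ids)
print(tokenizer.decode(ids))
```

Because "bird" never appeared in the texts used to build the vocabulary, it is encoded with the unk token's ID and decodes back to `<|unk|>`, which shows the main weakness of word-level tokenization.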

Byte pair encoding (BPE) tokenization is used by GPT models. The GPT-2 tokenizer, for example, has a vocabulary of around 50k tokens. BPE solves two problems of word-level tokenization. First, if a word is not present in the vocabulary, it doesn’t get tokenized correctly. Second, if multiple words come from the same root word, memory gets wasted by assigning separate tokens to all of them.

BPE starts from the individual characters of the text, each byte, and then finds the most common adjacent pair in this array of characters. It merges that pair into a new token, appends it to the vocabulary, and continues to do so until the target vocabulary size is met or no high-frequency pairs are left.
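The loop described above can be sketched on a single flat sequence. This is a simplified illustration: it works on characters rather than raw bytes, stops after a fixed number of merges instead of a target vocabulary size, and treats "no high-frequency pairs left" as "no pair occurs at least twice". The function name `bpe_merges` is hypothetical.

```python
from collections import Counter


def bpe_merges(text, num_merges):
    symbols = list(text)              # start from single characters
    vocab = sorted(set(symbols))      # base vocabulary
    for _ in range(num_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:                 # no pair repeats, nothing worth merging
            break
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)   # replace the pair with one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
        vocab.append(left + right)    # the merged pair becomes a new token
    return symbols, vocab


symbols, vocab = bpe_merges("aaabdaaabac", num_merges=3)
print(symbols)
print(vocab)
```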

The first step is to create your base vocabulary from the existing corpus. Here we create a dictionary of words and their frequencies, and each word is stored as a tuple of characters.

This is how the initial vocabulary is created.
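A minimal sketch of that step, assuming the corpus is a plain string and words are separated by whitespace. The function name `build_initial_vocab` and the sample corpus are made up for illustration.

```python
from collections import Counter


def build_initial_vocab(corpus):
    """Map each word, stored as a tuple of characters, to its frequency."""
    word_freqs = Counter(corpus.split())
    return {tuple(word): freq for word, freq in word_freqs.items()}


vocab = build_initial_vocab("low low lower newest newest newest")
for symbols, freq in vocab.items():
    print(symbols, freq)
```

Tuples are used as keys because they are hashable, and each element of the tuple is one symbol that later merges can combine.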

The next step is to count the frequency of each adjacent pair in each of these word tuples. Whichever pair has the highest frequency gets merged and appended to the existing vocabulary. For this, we iterate over the vocabulary dictionary and count each neighboring pair in a separate dictionary, weighting each pair by the frequency of the word it appears in.

This is how the pair counts are computed.
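A minimal sketch of the counting step, assuming the vocabulary is a dictionary from character tuples to word frequencies. The function name `get_pair_counts` and the sample vocabulary are illustrative.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pair_counts = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return pair_counts


vocab = {
    ("l", "o", "w"): 2,
    ("l", "o", "w", "e", "r"): 1,
    ("n", "e", "w", "e", "s", "t"): 3,
}
pair_counts = get_pair_counts(vocab)
best_pair = max(pair_counts, key=pair_counts.get)
print(best_pair, pair_counts[best_pair])
```

Here the pair `("w", "e")` appears once in "lower" and three times across the occurrences of "newest", so it is the most frequent pair.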

Then, for merging, we take our vocabulary and the pair that is to be merged. We check each tuple, and if a tuple has this pair, we create a new tuple where the pair appears as a single merged symbol instead of separate characters. This new tuple takes the place of the original one in the vocab.

This is how the merge step is done.
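A minimal sketch of the merge, assuming the same tuple-keyed vocabulary as before. The function name `merge_pair` is illustrative, and this version returns a new dictionary instead of modifying the vocabulary in place.

```python
def merge_pair(vocab, pair):
    """Return a new vocab where every occurrence of pair is one merged symbol."""
    merged_vocab = {}
    for symbols, freq in vocab.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2   # skip both halves of the merged pair
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged_vocab[tuple(new_symbols)] = freq
    return merged_vocab


vocab = {
    ("l", "o", "w"): 2,
    ("l", "o", "w", "e", "r"): 1,
    ("n", "e", "w", "e", "s", "t"): 3,
}
vocab = merge_pair(vocab, ("w", "e"))
for symbols, freq in vocab.items():
    print(symbols, freq)
```

Repeating the count and merge steps in a loop grows the set of symbols by one merged token per iteration.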

Then, from this vocabulary, we can assign an ID to each token in the tuples, which can just be its index in an array for now. We then have our tokens ready to be encoded and decoded.
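As a small illustration of this last step, the IDs below are simply positions in a sorted list of the unique symbols. The sample vocabulary is made up, and the names `token_to_id` and `id_to_token` are hypothetical.

```python
# Vocabulary after one merge of ("w", "e"), used here as sample input
vocab = {
    ("l", "o", "w"): 2,
    ("l", "o", "we", "r"): 1,
    ("n", "e", "we", "s", "t"): 3,
}

# Collect every unique symbol and use its index in the sorted list as its ID
tokens = sorted({symbol for symbols in vocab for symbol in symbols})
token_to_id = {tok: i for i, tok in enumerate(tokens)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

# Encode the symbols of one word, then decode them back
ids = [token_to_id[s] for s in ("l", "o", "we", "r")]
decoded = "".join(id_to_token[i] for i in ids)
print(ids, decoded)
```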