
Tokens and Embeddings in LLMs

Have you ever heard of Large Language Models (LLMs) and wondered what "X amount of tokens" means? What is the difference between Gemini 1.5, which has a context window of 1M tokens, and the upcoming version with 2M? Let me explain it in simple words and in under a minute.

Imagine an LLM as a giant bookshelf. This bookshelf holds all the knowledge the LLM has learned from reading massive amounts of text online, like books, articles, and even social media posts. Now, each piece of information on that shelf is like a building block. These blocks can be whole words, parts of words, or even punctuation marks. We call these building blocks tokens. The more tokens an LLM has access to, the bigger its bookshelf!

Here's how the size of the bookshelf (number of tokens) affects the LLM:

  • Small Bookshelf (Few Tokens): This LLM might be like a young child who's just learning to read. It can understand and generate simple sentences but might struggle with complex topics or miss connections between ideas.

  • Medium Bookshelf (Moderate Tokens): This LLM is like a well-read teenager. It can handle most conversations, write different kinds of creative text formats, and answer your questions in a comprehensive way.

  • Giant Bookshelf (Massive Tokens): This LLM is like a super-intelligent professor! It has a vast knowledge base and can understand nuances, analyze complex topics, and even generate different writing styles with near-human fluency.

While the number of tokens is one way to evaluate LLMs, it is just one metric. You should also take into consideration the data the model was trained on (quality and relevance), the model architecture, and whether the model is properly fine-tuned for the task you want it to handle. So, the next time you hear about LLMs with X amount of tokens, you'll have a better idea of what that means! :) It's all about the size and content of their virtual bookshelf, shaping their ability to understand and generate language.


Why do we need to tokenize strings?

  • To break down complex text into manageable units.

  • To present text in a format that is easier to analyze or perform operations on.

  • Useful for specific linguistic tasks like part-of-speech tagging, syntactic parsing, and named entity recognition.

  • To uniformly preprocess text in NLP applications and create structured training data.

Most NLP systems apply operations to these tokens to accomplish a specific task. For example, we can design a system to take a sequence of tokens and predict the next token. We can also convert the tokens into their phonetic representation as part of a text-to-speech system. Many other NLP tasks can be done, like keyword extraction, translation, etc.
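To make this concrete, here is a minimal sketch of naive tokenization in plain Python (the sentence and the regex are just illustrative choices; real systems use more sophisticated tokenizers):

```python
import re

text = "LLMs don't read raw text; they read tokens!"

# Naive approach: split on whitespace only.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['LLMs', "don't", 'read', 'raw', 'text;', 'they', 'read', 'tokens!']

# Slightly better: separate word characters from punctuation with a regex.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
# ['LLMs', 'don', "'", 't', 'read', 'raw', 'text', ';', 'they', 'read', 'tokens', '!']
```

Notice how the choice of tokenizer changes what counts as a token, which in turn changes everything downstream.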


How do we actually use these tokens to build these systems in the first place?

  • Feature Extraction: Tokens are used to extract features that are fed into machine learning models. Features might include the tokens themselves, their frequency, their part-of-speech tags, their position in a sentence, etc. For instance, in sentiment analysis, the presence of certain tokens might be strongly indicative of positive or negative sentiment.

  • Vectorization: In many NLP tasks, tokens are converted into numerical vectors using techniques like Bag of Words (BoW), TF-IDF (Term Frequency-Inverse Document Frequency), or word embeddings (like Word2Vec, GloVe). This process turns text data into numbers that machine learning models can understand and work with (a small Bag of Words sketch follows this list).

  • Sequence Modeling: For tasks like language modeling, machine translation, and text generation, tokens are used in sequence models like Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), or Transformers. These models learn to predict sequences of tokens, understanding the context and the likelihood of token occurrence.

  • Training the Model: In the training phase, models are fed tokenized text and corresponding labels or targets (like categories for classification tasks or next tokens for language models). The models learn patterns and associations between the tokens and the desired output.

  • Context Understanding: Advanced models like BERT and GPT use tokens to understand context and generate embeddings that capture the meaning of a word in a specific context. This is crucial for tasks where the same word can have different meanings based on its usage.
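To ground the Vectorization point above, here is a tiny Bag of Words sketch in pure Python (the three example "documents" are invented for illustration; libraries such as scikit-learn's CountVectorizer do the same thing more robustly):

```python
from collections import Counter

docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting great story",
]

# Build a vocabulary: one index per unique token across all documents.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# Turn each document into a vector of token counts over that vocabulary.
def bow_vector(doc):
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocab]

print(vocab)
# ['acting', 'great', 'movie', 'story', 'terrible', 'the', 'was']
for doc in docs:
    print(doc, "->", bow_vector(doc))
# the movie was great -> [0, 1, 1, 0, 0, 1, 1]
# the movie was terrible -> [0, 0, 1, 0, 1, 1, 1]
# great acting great story -> [1, 2, 0, 1, 0, 0, 0]
```

Each text is now a plain list of numbers, which is exactly the form a machine learning model can consume.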

If you are new to all this, don’t worry about the keywords you just read. In very simple terms, we have text strings that we convert to independent units called tokens. This makes it easier to convert them to “numbers” later, which the computer understands.
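To peek one level deeper at those “numbers”: here is a rough sketch of pulling contextual embeddings out of BERT, as mentioned in the Context Understanding point above. It assumes the Hugging Face transformers and torch packages are installed and the public bert-base-uncased checkpoint is available; the two sentences are invented examples.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual vector BERT produces for `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Locate the position of the word's token id in the input sequence.
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0].item()
    return outputs.last_hidden_state[0, position]

# The same word "bank" gets a different vector in each context.
v_money = embedding_of("I deposited money at the bank.", "bank")
v_river = embedding_of("We picnicked on the bank of the river.", "bank")
print(torch.nn.functional.cosine_similarity(v_money, v_river, dim=0))
```

The similarity printed at the end is noticeably below 1.0, which is the whole point: the embedding reflects the context, not just the word.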

ChatGPT and Tokens

What do tokens look like in the context of LLMs like ChatGPT? The tokenization methods used for LLMs differ from those used in general NLP. Broadly speaking, we can call it “subword tokenization,” where the tokens need not be complete words, as they are in whitespace tokenization. This is precisely why one word is not equal to one token. When they say GPT-4 Turbo has 128K tokens as its context length, it is not exactly 128K words but a number close to it.
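If you want to see this for yourself, here is a brief sketch using OpenAI’s open-source tiktoken library (the example sentence is arbitrary, and the exact split depends on the encoding you pick):

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits 'unbelievable' into subword pieces."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")
# Inspect the string each token id maps back to.
print([enc.decode([t]) for t in token_ids])
```

Common words usually map to a single token, while rare or made-up words split into several subword pieces, which is why token counts typically come out somewhat higher than word counts.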

