Meta’s new BLT architecture upgrades LLMs by replacing tokens

Source: Venture Beat



The AI research community continues to find new ways to improve large language models (LLMs), the latest being a new architecture introduced by scientists at Meta and the University of Washington.

Their technique, the Byte Latent Transformer (BLT), could be the next important paradigm for making LLMs more versatile and scalable.

BLT solves one of the longstanding problems of LLMs by operating at the byte level instead of on tokens. It could open the way for models that process raw data, are robust to noisy inputs and don’t rely on fixed vocabularies.

Tokens vs bytes

Most LLMs are trained on a static set of tokens, which are predefined byte sequences.

During inference, a tokenizer breaks the input sequence down into tokens before passing it to the LLM.

Tokenization makes models more efficient at using compute resources, but it also creates biases that can degrade performance when the input is poorly covered by the vocabulary.

For example, many leading language models become slower and more costly when processing languages that are underrepresented on the web, because the words of those languages were not included in the model’s token vocabulary. Misspelled words can also cause the model to tokenize the input incorrectly. And tokenized models can struggle with character-level tasks, such as manipulating the characters within a sequence.
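
For intuition, here is a minimal Python sketch of the difference between what a tokenizer-based model and a byte-level model actually see. The subword split shown is made up for illustration, not taken from any real tokenizer:

```python
# Contrast token IDs from a hypothetical fixed subword vocabulary with the
# raw UTF-8 bytes a byte-level model would receive for the same text.
text = "unhappiness"

# A hypothetical BPE tokenizer might split the word into known subwords...
hypothetical_bpe_tokens = ["un", "happi", "ness"]   # 3 tokens

# ...while a byte-level model sees every UTF-8 byte individually.
raw_bytes = list(text.encode("utf-8"))              # 11 integers in 0..255

print(hypothetical_bpe_tokens)  # ['un', 'happi', 'ness']
print(raw_bytes)                # [117, 110, 104, 97, 112, 112, 105, 110, 101, 115, 115]
```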

Moreover, modifying the vocabulary requires the model to be retrained. And expanding the token vocabulary can require architectural changes to the model to accommodate the added complexity.

Alternatively, LLMs can be trained directly on single bytes, which can solve many of the abovementioned problems. However, byte-level LLMs are prohibitively costly to train at scale and can’t handle very long sequences, which is why tokenization remains an essential part of current LLMs.
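
As a rough back-of-the-envelope illustration of why this is costly (the four-bytes-per-token average below is an assumption for English text, not a figure from the paper), treating every byte as its own position makes the quadratic self-attention cost balloon:

```python
# Attention cost grows quadratically with sequence length, and a byte
# sequence is several times longer than its tokenized equivalent.
avg_bytes_per_token = 4        # assumed average for English subword tokens
num_tokens = 2048              # context length measured in tokens
num_bytes = num_tokens * avg_bytes_per_token

# Relative quadratic attention cost of bytes vs. tokens, ignoring constants.
relative_attention_cost = (num_bytes ** 2) / (num_tokens ** 2)
print(relative_attention_cost)  # 16.0 -> roughly 16x more attention compute
```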

Byte Latent Transformer (BLT)

The Byte Latent Transformer (BLT) is a tokenizer-free architecture that learns directly from raw bytes and matches the performance of tokenization-based models. To solve the inefficiencies of other byte-level LLMs, BLT uses a dynamic method that groups bytes based on the level of information they contain.

“Central to our architecture is the idea that models should dynamically allocate compute where it is needed,” the researchers write. 

Unlike tokenized models, BLT has no fixed vocabulary. Instead, it maps arbitrary groups of bytes into patches using entropy measures. BLT does this dynamic patching through a novel architecture with three transformer blocks: two small byte-level encoder/decoder models and a large “latent global transformer.”

BLT architecture (source: arXiv)
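
A minimal sketch of the entropy-based patching idea, assuming a caller-supplied next_byte_probs function that stands in for the small byte-level language model used to estimate how predictable the next byte is (the threshold value is likewise an illustrative assumption):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def patch_boundaries(byte_seq, next_byte_probs, threshold=2.0):
    """Group bytes into patches, opening a new patch whenever the predicted
    next-byte entropy exceeds the threshold, i.e. whenever the data becomes
    hard to predict and deserves more of the global model's compute."""
    patches, current = [], []
    for i, b in enumerate(byte_seq):
        if current and entropy(next_byte_probs(byte_seq[:i])) > threshold:
            patches.append(bytes(current))
            current = []
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```

Predictable stretches of text stay grouped in one long patch, while surprising bytes open new, shorter patches that receive more compute per byte.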

The encoder and decoder are lightweight models. The encoder takes in raw input bytes and creates the patch representations that are fed to the global transformer. At the other end, the local decoder takes the patch representations processed by the global transformer and decodes them into raw bytes.

The latent global transformer is the model’s main workhorse. It takes in the patch representations generated by the encoder and predicts the next patch in the sequence. When processed by the decoder, this patch is unpacked into one or several bytes.
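
The data flow can be sketched in a few lines of PyTorch. This is an illustrative stand-in rather than the paper’s implementation: the real local encoder and decoder are small transformers with cross-attention, whereas mean pooling and linear layers mark their place here.

```python
import torch
import torch.nn as nn

class BLTSketch(nn.Module):
    """Illustrative-only sketch of BLT's three-block layout. The real local
    encoder/decoder are small transformers with cross-attention; mean pooling
    and linear layers are used here purely as stand-ins."""

    def __init__(self, byte_emb_dim=128, patch_dim=512):
        super().__init__()
        self.byte_embed = nn.Embedding(256, byte_emb_dim)         # one entry per possible byte value
        self.local_encoder = nn.Linear(byte_emb_dim, patch_dim)   # stand-in for the small byte-level encoder
        self.global_transformer = nn.TransformerEncoder(          # the large latent global transformer
            nn.TransformerEncoderLayer(patch_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.local_decoder = nn.Linear(patch_dim, 256)            # stand-in for the small byte-level decoder

    def forward(self, byte_ids, patch_ids):
        x = self.byte_embed(byte_ids)                             # (batch, n_bytes, byte_emb_dim)
        # Pool the byte embeddings belonging to each patch into one vector per patch.
        n_patches = int(patch_ids.max()) + 1
        patches = torch.stack(
            [x[:, patch_ids == p].mean(dim=1) for p in range(n_patches)], dim=1
        )                                                         # (batch, n_patches, byte_emb_dim)
        latent = self.global_transformer(self.local_encoder(patches))
        return self.local_decoder(latent)                         # next-byte logits per patch


# Example: 12 input bytes grouped into 3 patches of varying length.
model = BLTSketch()
logits = model(
    torch.randint(0, 256, (1, 12)),
    torch.tensor([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]),
)
print(logits.shape)  # torch.Size([1, 3, 256])
```

The shape of the flow is the point: bytes are pooled into patch vectors, the large latent transformer only ever operates on patches, and bytes reappear only at the decoder.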

The global transformer accounts for the largest share of compute resources during training and inference. Therefore, the patching mechanism determines how the global transformer is used and can help control the amount of compute used for different portions of the input and output.

BLT redefines the tradeoff between vocabulary size and compute requirements. In standard LLMs, increasing the size of the vocabulary means longer tokens on average, which reduces the number of steps required to process a sequence. However, a larger vocabulary also requires larger embedding and projection layers, which themselves consume more resources.

In contrast, BLT can balance compute resources based on the complexity of the data instead of the vocabulary size. For example, the ending of most words is easy to predict and requires fewer resources. On the other hand, predicting the first byte of a new word or the first word of a sentence requires more compute cycles.

“BLT unlocks a new dimension for scaling, allowing simultaneous increases in model and patch size within a fixed inference budget,” the researchers write. “This new paradigm becomes advantageous for compute regimes commonly encountered in practical settings.”
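
A back-of-the-envelope calculation shows why (all numbers below are illustrative assumptions, not values from the paper): the latent transformer’s inference cost scales roughly with its parameter count times the number of patches it has to process.

```python
# Back-of-the-envelope arithmetic for the "fixed inference budget" claim.
# All numbers here are illustrative assumptions, not values from the paper.
total_bytes = 40_000             # length of the document in bytes
global_params = 8e9              # assumed size of the latent global transformer

for avg_patch_size in (4, 8):    # average bytes per patch
    steps = total_bytes / avg_patch_size
    flops = 2 * global_params * steps        # ~2 FLOPs per parameter per step
    print(f"avg patch {avg_patch_size} bytes: {flops:.2e} FLOPs")

# Doubling the average patch size halves the number of global-transformer
# steps, leaving room to roughly double its parameter count at the same cost.
```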

BLT in action

The researchers conducted experiments with BLT and classic transformers on models of different scales, ranging from 400 million to 8 billion parameters.

According to the authors, this is “the first flop-controlled scaling study of byte-level models up to 8B parameters and 4T training bytes, showing that we can train a model end-to-end at scale from bytes without fixed-vocabulary tokenization.”

Their findings show that when controlled for the amount of compute resources allocated to training, BLT matches the performance of Llama 3 while using up to 50% fewer FLOPs at inference. This efficiency comes from the model’s dynamic patching, which results in longer groups of bytes, saving compute that can be reallocated to grow the size of the global latent transformer.

“To the best of our knowledge, BLT is the first byte-level Transformer architecture to achieve matching scaling trends with BPE-based models at compute optimal regimes,” the researchers write.

Beyond efficiency, BLT models proved to be more robust to noisy inputs compared to tokenizer-based models. They had enhanced character-level understanding abilities and also showed improved performance on tasks such as character manipulation and low-resource machine translation. According to the researchers, the ability of BLT to directly process raw bytes as opposed to tokens “provides significant improvements in modeling the long tail of the data,” which means the models are better at working with patterns that don’t appear often in the training corpus.

This is still the beginning of what could be a new standard for creating language models. The researchers note that existing transformer libraries and codebases are designed to be highly efficient for tokenizer-based transformer architectures. This means that BLT still has room to benefit from software and hardware optimizations.


