Allen Institute for AI launches Bolmo 7B and Bolmo 1B, claiming they are “the first fully open byte-level language models”, built on its Olmo 3 models
and every token gets the same compute, regardless of complexity.

Benjamin Minixhofer / @bminixhofer: There are also some things Bolmo lets us do which we just can't do using subword-level LMs. For example, we can increase the compression in bytes per patch to achieve an arbitrary speedup⚡ In contrast, subword-level LMs eventually get yote back by the Softmax Bottleneck. [image]

@allen_ai: Introducing Bolmo, a new family of byte-level language models built by "byteifying" our open Olmo 3, and to our knowledge the first fully open byte-level LM to match or surpass SOTA subword models across a wide range of tasks. 🧵 [image]

Edoardo Ponti / @pontiedoardo: Finally, you can count the r's in strawberry and check if 3.11 is higher than 3.9 without tokenisation interfering. Here's Bolmo, a fully open byte-level LLM with latent tokenisation, derived from a SOTA LLM (Olmo 3). Promising on coding and char-level understanding!

Benjamin Minixhofer / @bminixhofer: We are releasing Bolmo today! Bolmo is the best byte-level model so far. It comes close to and sometimes surpasses Olmo 3. Bolmo also performs competitively in terms of speed & is fully open. I was skeptical of byte-level models for a long time, but I finally switched camps🧵 [image]

@allen_ai: On our eval suite & character-focused benchmarks like CUTE & EXECUTE, Bolmo matches/surpasses subword models while excelling at character-level reasoning. Once you byteify a base model, you can import capabilities from post-trained checkpoints via weight arithmetic. [image]

Luca Soldaini / @soldni: I've been grumbling about tokenizers, but it took a @bminixhofer to do something about it! Really neat approach: rather than committing to tokenizer-free methods from the get-go, we show how to switch from BPE at any point during a training run 👾

@teortaxestex: Incredibly based. Now that we have fast long contexts and an abundance of compute, it's about damn time to explore byte-level models again. Meta disappointed me with MegaByte, but Meta is generally bad at execution. This path is not yet closed... [image]
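For readers puzzling over Minixhofer's "bytes per patch" claim: a minimal back-of-the-envelope sketch, assuming a patch-based byte architecture in which a lightweight module groups raw bytes into latent patches before the large backbone runs. The function and numbers are illustrative, not Bolmo's actual mechanism:

```python
import math

def backbone_positions(num_bytes: int, bytes_per_patch: float) -> int:
    """Positions the expensive transformer backbone processes when raw
    bytes are grouped into latent patches before it runs (illustrative;
    real patching schemes are typically learned and dynamic)."""
    return math.ceil(num_bytes / bytes_per_patch)

doc_bytes = 8_000  # an ~8 KB document
for bpp in (4.0, 8.0, 16.0):
    print(f"{bpp:>4} bytes/patch -> {backbone_positions(doc_bytes, bpp)} positions")
# 4.0 -> 2000, 8.0 -> 1000, 16.0 -> 500: raising the compression rate
# shrinks the expensive sequence length, which is the speedup knob.
# A subword LM's compression is fixed by its vocabulary, so it has no
# comparable dial to turn.
```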
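Ponti's two examples are easy to reproduce. The point is that a byte-level model sees the characters and digits directly, while a subword model sees opaque tokens like "straw"+"berry" and has to infer what is inside them:

```python
# The two character-level probes from the thread.
word = "strawberry"
print(word.count("r"))     # 3 -- trivial when the model sees bytes directly

# "Is 3.11 higher than 3.9?" depends on which reading the model latches onto:
print(3.11 > 3.9)          # False as decimal numbers
print((3, 11) > (3, 9))    # True as version-style components
```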
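On "weight arithmetic": the @allen_ai tweet doesn't spell out the recipe, but a common reading is task-vector-style addition, i.e. apply the (post-trained minus base) parameter delta of the original subword model onto the byteified base wherever shapes still line up. A hedged sketch under that assumption; the function name, `alpha`, and the shape-matching rule are hypothetical:

```python
import torch

def import_capabilities(byte_base: dict, subword_base: dict,
                        subword_post: dict, alpha: float = 1.0) -> dict:
    """Add the post-training delta of the subword model to the byteified
    base model, parameter by parameter (all args are state_dicts mapping
    parameter names to torch.Tensor)."""
    merged = {}
    for name, weight in byte_base.items():
        if (name in subword_base and name in subword_post
                and subword_base[name].shape == weight.shape):
            delta = subword_post[name] - subword_base[name]  # "task vector"
            merged[name] = weight + alpha * delta
        else:
            # Byteification changes e.g. the embeddings and LM head, so
            # mismatched parameters keep their byte-level weights untouched.
            merged[name] = weight.clone()
    return merged
```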