With the advent of any new technology, humanity's first attempt is typically achieved through brute force. As the technology evolves, we optimize and arrive at solutions more elegant than that first brute-force breakthrough. With the latest advancements in Artificial Intelligence (AI), in particular the development of Large Language Models (LLMs), we've made significant strides in recent years, demonstrating impressive capabilities. But these strides are still very much in the brute-force stage of this technology's evolution. We've seen a Cambrian explosion of transformer-based models, bringing forth large models that range all the way up to trillions of parameters. This is quite analogous to the transition from the combustion engine to its more efficient electric successor, a transition observed in everyday sedans and in my favorite hobby: racing cars. It started in the 1960s with the likes of the Pontiac GTO, the Shelby Cobra 427 and the Dodge Charger R/T showcasing Detroit muscle: big-block, gas-guzzling street Hemi engines that did 0-to-60 MPH in about 10 seconds while getting 7–14 miles per gallon (MPG). Today, with the latest electric cars, like Rimac's Nevera, you can go 0-to-60 MPH in 1.74 seconds while achieving 54 MPGe. The early brute force was a necessary step to catalyze the efficiency that followed.

It's become evident to me that history needs to repeat itself with Large Language Models. We are on the cusp of shifting from brute-force attempts towards more elegant solutions in AI, in particular moving away from larger, more complex language models (our modern equivalent of the GTO, the Cobra and the Hemi engine) towards smaller, much more efficient ones. To be frank, driving such efficiency has been a key focus of mine for the past several years. Working with an incredible team of colleagues, I've been fortunate to work at the intersection of AI and compute in recent roles, designing accelerated machines and co-designing Meta's AI infrastructure. When Babak Pahlavan and I set out to build our current venture, NinjaTech AI, we inscribed a key fundamental into the company's technical DNA and culture: efficient execution and operation of our intelligence platform from Day 1. NinjaTech is building an AI Executive Assistant to make professionals more productive by taking on administrative tasks like scheduling, expenses and travel booking, which consume considerable time.

While studying autoregressive, generative language models exceeding hundreds of billions of parameters, it became clear to me that there must be a more efficient and simpler way to accomplish these administrative tasks. It's one thing if you're trying to answer "what's the meaning of life" questions, or asking your model to write the Python code for an automated music producer. For many administrative tasks, simpler, smaller models suffice. We have put this to the test by leveraging an assortment of model sizes for various administrative tasks, some so small and efficient that they can run on a CPU! This not only keeps us from breaking the bank on high-cost, large-scale training jobs, it also saves inference cost by not requiring expensive GPU instances with large memory footprints to serve our models. Much like the combustion-to-electric example above, we're becoming more efficient, and very quickly.
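To make that concrete, here is a minimal sketch of the kind of tiny, CPU-friendly model that can handle a narrow step such as routing a request to the right administrative workflow before any larger model is invoked. The labels, example utterances and library choice (scikit-learn) are purely illustrative assumptions on my part, not a description of NinjaTech's actual data or pipeline.

```python
# A toy intent router: a linear model over TF-IDF features is tiny, trains in
# milliseconds, and serves on a single CPU core. All labels and utterances
# below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

examples = [
    ("Find 30 minutes with Priya next Tuesday", "scheduling"),
    ("Move my 2pm sync to Thursday morning", "scheduling"),
    ("Submit my client dinner receipt for reimbursement", "expenses"),
    ("How much of my travel budget is left this quarter?", "expenses"),
    ("Book a flight to Austin arriving before noon", "travel"),
    ("I need a hotel near the conference center for two nights", "travel"),
]
texts, labels = zip(*examples)

# TF-IDF + logistic regression: a few thousand parameters, no GPU required.
router = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
router.fit(texts, labels)

print(router.predict(["Grab me a flight and a hotel for the Denver offsite"]))
```

A router like this answers in microseconds on commodity CPU; only the requests that genuinely need open-ended reasoning get escalated to a larger model.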

We're excited to see a shift towards more efficient operation by the industry and the research community. One example is Meta's Llama release, which showcased a 13B-parameter model outperforming GPT-3 (175B) on most benchmarks by training an order-of-magnitude smaller model on more data. Meta research then outdid themselves again with LIMA (Less Is More for Alignment), which showed that fine-tuning on just 1,000 carefully curated, diverse prompts can achieve high-quality results. This is truly remarkable, and imperative if we are to curb our compute demand for AI, which continues to soar exponentially and can have detrimental effects on our planet through AI's carbon footprint. To put things in perspective, an MIT study demonstrated that a small transformer model with only 65M parameters can consume up to 27 kWh and emit 26 lbs of CO2e to train. These numbers grow dramatically for large models such as GPT-3, with training emissions estimated at roughly 502 tonnes of CO2-equivalent. Furthermore, while a single inference call is far less compute-intensive than training, once a model is published and serving traffic, its cumulative inference emissions over its lifetime can exceed its training emissions by 10–100x.
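To see how serving can overtake training, here is a back-of-envelope sketch. It reuses the 27 kWh training figure cited above for the small model; the per-query energy, serving volume and deployment lifetime are assumptions I picked purely for illustration, not measurements.

```python
# Back-of-envelope: lifetime inference energy vs. one-off training energy.
# The training figure is the 27 kWh number cited above for a 65M-parameter
# model; every other number is an assumption chosen for illustration.
train_kwh = 27.0                 # one-off training cost (cited above)
kwh_per_1k_queries = 0.005       # assumed energy per 1,000 inference requests
queries_per_day = 500_000        # assumed serving volume
lifetime_days = 2 * 365          # assumed deployment lifetime

inference_kwh = kwh_per_1k_queries * (queries_per_day / 1_000) * lifetime_days
print(f"training:  {train_kwh:,.0f} kWh (once)")
print(f"inference: {inference_kwh:,.0f} kWh over the model's lifetime "
      f"(~{inference_kwh / train_kwh:.0f}x training)")
```

Under these assumptions the serving side lands squarely in that 10–100x band, which is exactly why shrinking the per-request cost of a model matters at least as much as trimming its training run.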

We're only at the tip of the iceberg with the vast possibilities of AI. However, to do more within a given footprint, cluster size and budget, it's imperative that we consider the efficiency of our operations. We need to curb the gas-guzzling Hemi and employ smaller, more efficient models: this will improve operations, lower costs and meaningfully reduce AI's carbon footprint.