Today we're announcing a state-of-the-art breakthrough that's pushing the boundaries of AI code generation. Our science team at NinjaTech AI developed a novel approach for code generation named “Multi-Programming Language Ensemble” (MPLE). 

Using this approach with Llama 3.1 405B Instruct, we were able to show statistically better performance than OpenAI’s GPT-4o and Anthropic’s Sonnet 3.5, as measured by the HumanEval benchmark. You can find these results, along with details of our approach, in our technical paper published on arxiv.org.

In this blog post, we will delve into three pivotal findings related to the Multi-Programming Language Ensemble (MPLE), which has the potential to significantly enhance code generation capabilities across all AI assistants:

  1. An agentic workflow approach (MPLE) to code generation produces 2-9% better accuracy in a cost-effective manner.
  2. Integrating MPLE with advanced inference techniques like Reflection and Monte Carlo Tree Search (MCTS), applied to existing LLMs, yields additional performance gains.
  3. To our pleasant surprise, we observed that MPLE+MCTS techniques powered by llama3.1-405b-instruct reached an accuracy of 96.25% on the HumanEval benchmark.

The Challenge of Code Generation

Writing code can be challenging because it requires a deep understanding of programming languages and the ability to translate complex ideas into executable code. Developers must also contend with issues like debugging, optimization, and compatibility - which can make the coding process even more intricate and time-consuming.

This is where AI helps; Large Language Models (LLMs) can generate large amounts of code quickly when prompted by a user. Since we launched MyNinja.ai in late May 2024, it has processed over 734K Coder tasks from approximately 173K developers.

But AI code generation does not eliminate all of the challenges of writing code. LLMs are not 100% reliable - they can make errors when generating code. 

The reason LLMs make errors is that traditional approaches to AI code generation rely on a single programming language, which can lead to language-specific errors and biases. For example, an LLM may perform well in Python code generation because it was trained heavily on Python. But when prompted to create C++ or Java code, it may not produce an accurate result due to differences in error handling or library usage. Another reason an LLM can struggle with generating accurate code is the difficulty of translating human-readable text (i.e. the user’s prompt) into code.

One solution to this problem is to invest in more model training. Companies can spend more time - and more money - training LLMs to answer coding questions on the first try. This is called zero-shot learning: given no examples, just a prompt, the model generates an appropriate response based only on its prior knowledge. However, a zero-shot approach requires an enormous investment of time and money ($10M-$100M+) to improve model accuracy.

But inference costs have dropped by more than 100X in the last two years, and they continue to fall, giving rise to a new way: Agentic Workflows for CodeGen.

Our Solution: Multi-Programming Language Ensemble (MPLE)

Our team has proposed a novel approach to AI code generation called Multi-Programming Language Ensemble (MPLE). This approach leverages the strengths of multiple programming languages to generate more accurate and robust code using a multi-shot approach.

Rather than relying on the model to create a correct answer on the first try (single shot), we interface with the LLM multiple times to generate the correct response. Each time we interface with the model, we ask it to solve the user’s problem using a different programming language. This allows us to leverage a natural strength of an LLM - the ability to translate code from one programming language to another.
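
As a rough illustration of that translation step - not the exact prompts or client our system uses - the request to the model might look something like this, where `generate` stands in for any chat-completion call:

```python
# Sketch of a cross-language translation prompt (illustrative only; the prompt
# wording and the `generate` callable are assumptions, not our production code).
from typing import Callable

def translate_code(generate: Callable[[str], str],
                   source_code: str,
                   source_lang: str,
                   target_lang: str) -> str:
    """Ask the model to re-express a working solution in another language."""
    prompt = (
        f"The following {source_lang} program solves the user's task:\n\n"
        f"{source_code}\n\n"
        f"Rewrite it in {target_lang}, preserving the exact behavior. "
        f"Return only the {target_lang} code."
    )
    return generate(prompt)
```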

Think of MPLE as a team of experts, each well versed in a different programming language, working together to create the most accurate code on the first try. Each expert (or programming language) brings their unique strengths to the table, and by combining their strengths, they can produce a more accurate and reliable solution. MPLE asks all of these experts to contribute their “expertise” while our system generates the correct user-requested code. 

Here's how MPLE works:

  1. Initial Code Generation: When a user prompts our system with a code-related question, our AI model generates an initial response in the user-requested programming language.
  2. Multi-Language Sampling and Translation: Before the response is returned to the user, the code is tested to ensure its quality and accuracy. If the code fails to pass all tests, our model generates new code in a different programming language. The new code is then translated back to the user-requested programming language. This alternative version differs from the original because it leverages the strengths of the alternative language, potentially eliminating the errors that caused the test failures.
  3. Iterative Refinement: The refined code version is tested again against the test cases. If it fails to pass all tests, the process continues by iterating through different coding languages until a version successfully passes all tests. When the process is complete, a response (i.e. the final version of the code) is returned to the user.
  4. Ensemble Integration: Throughout the iterations, the ensemble framework integrates the strengths of multiple languages to progressively refine the program. By treating each language-specific code generation as an individual “weak expert,” the framework combines their outputs to mitigate language-specific errors and biases (a simplified sketch of this loop follows).
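
Put together, the four steps above amount to a short orchestration loop. The sketch below is a minimal illustration, not our production code: `generate_solution`, `translate`, and `run_tests` are hypothetical stand-ins for the model calls and the sandboxed test runner.

```python
# Simplified sketch of the MPLE loop (illustrative; the helper callables are
# hypothetical stand-ins, not our production APIs).
from typing import Callable, Sequence

def mple_generate(task: str,
                  target_lang: str,
                  candidate_langs: Sequence[str],
                  generate_solution: Callable[[str, str], str],  # (task, language) -> code
                  translate: Callable[[str, str, str], str],     # (code, from_lang, to_lang) -> code
                  run_tests: Callable[[str], bool]) -> str:
    """Return code in target_lang, falling back to other languages on test failure."""
    # Step 1: initial attempt directly in the user-requested language.
    best = generate_solution(task, target_lang)
    if run_tests(best):
        return best

    # Steps 2-3: on failure, re-solve in an alternative language, translate back, re-test.
    for lang in candidate_langs:
        if lang == target_lang:
            continue
        alternative = generate_solution(task, lang)
        candidate = translate(alternative, lang, target_lang)
        if run_tests(candidate):
            return candidate   # first version that passes all tests is returned
        best = candidate       # otherwise keep the latest refinement

    # Step 4: if no version passes every test, return the most recent refinement.
    return best
```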

It’s important to note that this approach is not a new model, nor does it require retraining a model. It is a system that orchestrates multiple interactions with your existing LLM to create the best possible answer using a multi-shot approach. We believe that, as inference costs continue to come down, the future of economically viable AI-powered results will be led by agentic workflows that leverage inference-heavy techniques - where a lot of tokens are generated to answer complicated questions.

Validating MPLE Through Testing

We tested the MPLE framework on two widely recognized code generation benchmarks: HumanEval and HumanEval-plus. These benchmarks measure the ability of LLMs to generate functional code based on a user prompt.

HumanEval is designed for text-to-code generation tasks where the input is a user prompt describing the intended functionality of the program. The LLM output is then evaluated based on its ability to pass unit tests with specified requirements. HumanEval-plus extends HumanEval by incorporating a large number of additional test cases to rigorously evaluate the code’s robustness and correctness.
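
For readers unfamiliar with the benchmark, each HumanEval task pairs a prompt (a function signature plus docstring) with hidden unit tests that the generated completion must pass. The snippet below is an illustrative task written in that style - not an exact entry from the dataset:

```python
# Illustrative HumanEval-style task (written in the benchmark's format,
# but not an actual dataset entry).

# The prompt given to the model - it must complete the function body:
def running_maximum(numbers: list) -> list:
    """Return a list where the i-th element is the maximum of numbers[:i+1].
    >>> running_maximum([1, 3, 2, 5])
    [1, 3, 3, 5]
    """

# The hidden unit tests used to score the completion:
def check(candidate):
    assert candidate([1, 3, 2, 5]) == [1, 3, 3, 5]
    assert candidate([2, 2, 2]) == [2, 2, 2]
    assert candidate([7]) == [7]
```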

We measured the effectiveness of MPLE using Pass@1. This method of code evaluation measures the percentage of tasks that are successfully completed by the generated code on the first attempt.
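
With a single completion per task, Pass@1 is simply the share of benchmark tasks whose generated solution passes every unit test. A minimal way to compute it (the result format below is our own assumption):

```python
# Pass@1 with one completion per task: the percentage of tasks whose single
# generated solution passed all unit tests. `results` maps task id -> passed.
def pass_at_1(results: dict) -> float:
    return 100.0 * sum(results.values()) / len(results)

# Example: 3 of 4 tasks solved on the first attempt -> 75.0
print(pass_at_1({"t1": True, "t2": True, "t3": False, "t4": True}))
```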

We conducted experiments using both our proprietary LLMs and other well-known models like GPT-4o, Claude-Sonnet-3.5, and Llama3.1-405b.

Our testing showed that the proposed MPLE framework consistently improves Pass@1 accuracy across all tested LLMs when compared to a baseline. For example, GPT-3.5-turbo’s accuracy increased from 65.83% in the baseline to 74.17% with MPLE, highlighting the effectiveness of leveraging multiple programming languages to reduce language-specific biases and errors (see Table 1).

Table 1: HumanEval Test Results for MPLE applied to various LLMs

Integrating MPLE with Reflection & Monte Carlo Yields Better Results

We pushed MPLE further and designed a system that integrates MPLE with advanced inference techniques like Reflection and MCTS. These techniques are used to enhance the problem-solving and reasoning abilities of LLMs, especially for tasks that require strategic or multi-step decision-making. Our hypothesis was that the combination of these systems would produce even better results than using MPLE alone. Through a series of tests, we were able to confirm this hypothesis.
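
The technical paper spells out the exact integration; as a rough, simplified illustration (the helper names and prompt wording here are our own assumptions, not the paper’s code), a Reflection step can be slotted into the MPLE loop by feeding the failing test output back to the model and asking it to critique and revise its code before MPLE moves on to another language:

```python
# Illustrative Reflection step (hypothetical helpers; not the exact algorithm
# from the paper). The model is shown its failing tests and asked to diagnose
# and fix its own code before MPLE switches to a different language.
from typing import Callable, Tuple

def reflect_and_revise(generate: Callable[[str], str],
                       run_tests: Callable[[str], Tuple[bool, str]],  # -> (passed, report)
                       code: str,
                       max_rounds: int = 2) -> str:
    """Critique-and-revise loop: show the model its failing tests, ask for a fix."""
    for _ in range(max_rounds):
        passed, report = run_tests(code)
        if passed:
            break
        prompt = (
            "The following program failed its unit tests.\n\n"
            f"Program:\n{code}\n\nTest output:\n{report}\n\n"
            "Briefly explain the likely bug, then return only the corrected program."
        )
        code = generate(prompt)
    return code
```

Broadly speaking, the MCTS variant explores and scores multiple candidate revisions as a search tree rather than committing to a single rewrite; we leave those details to the paper.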

We also tested MPLE using the HumanEval-plus benchmark, and the results further validate the benefits of our multi-language ensemble approach. Notably, llama3.1-8b-instruct’s performance improved from 60.00% in the baseline to 71.88% with MPLE+Reflection (see Table 2).

Table 2: HumanEval-plus test results for MPLE combined with Reflection or MCTS, applied to various LLMs

Additionally, MPLE+Reflection and MPLE+MCTS deliver competitive results, with multiple models (GPT-4o-mini, GPT-4o, Claude-Sonnet-3.5, and llama3.1-405b-instruct) achieving 87.50% (see Table 3).

Table 3: HumanEval-plus test results for MPLE combined with Reflection or MCTS, applied to various LLMs

Our results clearly demonstrate that the MPLE framework, especially when used in conjunction with additional inference algorithms, offers a powerful and flexible approach to enhance code generation across multiple LLMs. MPLE’s consistent performance improvements underscore its potential for practical applications in AI-driven software development.

Llama 3.1-405b Achieves State-of-the-Art Results with MPLE & MCTS: 96.25% Accuracy!

During our analysis, we were especially impressed with the performance of MPLE when combined with MCTS. MPLE+MCTS achieved the highest accuracy for several models, such as llama3.1-405b-instruct, which reached a SOTA Pass@1 accuracy of 96.25% (see Table 4).

Table 4: HumanEval test results for MPLE and MCTS, applied to various LLMs

The Road Ahead

As we continue to push the boundaries of what's possible with MPLE, we're excited to explore new areas of research and development. Potential areas of focus include:

  1. Developing More Robust Evaluation Metrics: Creating effective evaluation metrics is crucial for accurately measuring MPLE's performance. By focusing on more comprehensive metrics, we can ensure the framework’s accuracy, reliability, and practical applicability across various domains.
  2. Applying MPLE to Real-World Problems: MPLE has the potential to address complex real-world challenges, such as automating business processes, optimizing software development pipelines, and improving cybersecurity measures. Evaluating MPLE across a broader range of datasets and real-world applications will be critical in assessing its generalizability and impact.
  3. Enhancing NinjaLLM for coding tasks: Our NinjaLLM 3.0 (a fine-tuned and quantized version of llama3.1-405b), available at MyNinja.ai, has achieved promising scores on HumanEval (93.85%) and HumanEval-plus (86.67%), and we are on the path to further improving its performance. We will be launching an MPLE-based Ninja Coder in a month.

We're excited to share our research on MPLE with the world. We believe this approach has the potential to make a significant impact on LLM code generation. MPLE reduces language-specific errors and biases, resulting in more accurate and robust code generation - which means more time for developers to tackle bigger problems. We're committed to continuing our research and development in this area, and we look forward to seeing the innovative applications of MPLE in the future. 

P.S. The future is a lot less buggy :-)

P.P.S. We used MyNinja.ai to assist in the creation of this blog post.