Ninja AI's SuperAgent is setting a new benchmark for what an AI system can achieve. By combining cutting-edge inference-level optimization with multi-model orchestration and critique-based refinement, SuperAgent delivers results that outperform even the most popular foundational models, including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet.

Ninja achieved state-of-the-art results on the Arena-Hard benchmark, which we discuss in this blog post along with SuperAgent's performance on several other benchmarks.

What is SuperAgent?

We previously introduced our SuperAgent, a powerful AI system designed to generate better answers than any single model alone. SuperAgent uses inference-level optimization, which involves combining responses from multiple AI models. Instead of relying on a single perspective, SuperAgent draws on a mixture of models and then refines the output using a critiquing model to deliver more comprehensive, accurate, and helpful answers. The result is a level of quality that stands above traditional single-model approaches.
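To make the idea concrete, here is a minimal sketch of a mixture-of-models pipeline with critique-based refinement. This is an illustration of the general technique only; the model wrappers, prompts, and function names are hypothetical and do not reflect Ninja's internal implementation or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Each "model" is treated as a callable that maps a prompt to a text response.
# In a real system these would wrap API calls to the underlying LLMs.
ModelFn = Callable[[str], str]

@dataclass
class CandidateAnswer:
    model_name: str
    text: str

def gather_candidates(prompt: str, models: Dict[str, ModelFn]) -> List[CandidateAnswer]:
    """Query every model in the mixture and collect their draft answers."""
    return [CandidateAnswer(name, fn(prompt)) for name, fn in models.items()]

def refine_with_critique(prompt: str,
                         candidates: List[CandidateAnswer],
                         critique_model: ModelFn) -> str:
    """Ask the critique model to compare the drafts and synthesize one final answer."""
    drafts = "\n\n".join(f"[{c.model_name}]\n{c.text}" for c in candidates)
    critique_prompt = (
        f"User question:\n{prompt}\n\n"
        f"Candidate answers from several models:\n{drafts}\n\n"
        "Critique the candidates, keep what is correct, fix what is not, "
        "and write a single improved final answer."
    )
    return critique_model(critique_prompt)

def super_agent(prompt: str,
                models: Dict[str, ModelFn],
                critique_model: ModelFn) -> str:
    """End-to-end pass: fan out to the mixture, then refine with the critic."""
    return refine_with_critique(prompt, gather_candidates(prompt, models), critique_model)
```

In this sketch, every draft is handed to the critique model in a single pass; a production system could iterate that loop or weight models differently per task.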

The SuperAgent is a natural extension of our multi-model feature and our belief that you should have a choice in which model you use. Building on the foundation we created for our Pro and Ultra subscribers, SuperAgent takes things further by orchestrating these models together seamlessly. Instead of just choosing a model, SuperAgent brings them together to deliver the most comprehensive, nuanced, and optimized responses possible.

We built three versions of the SuperAgent to balance speed, depth, and cost; a rough configuration sketch follows the three descriptions below.

Turbo: Designed for speed, Turbo is the fastest version of SuperAgent. It consults with our custom Ninja-405B and Ninja-70B Nemotron (Critique model) to generate rapid and accurate responses. 

Nexus: Striking a balance between speed and depth, Nexus provides richer insights by consulting with GPT-4o-mini, Claude 3.5 Haiku, Gemini 1.5 Flash, and Ninja-405B (Critique model). This version is perfect for users who need detailed yet timely answers.

Apex: The most robust version of SuperAgent, it delivers thoroughly researched and comprehensive responses. It consults with Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and uses Claude 3.5 Sonnet as the Critique model to ensure the highest level of detail and accuracy.
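Summarizing the three tiers as data, a configuration might look roughly like the following. The identifiers are informal labels taken from the descriptions above, not official API model names, and the actual routing logic inside SuperAgent is not public, so treat this strictly as a sketch.

```python
# Illustrative tier configurations based on the descriptions above.
SUPERAGENT_TIERS = {
    "turbo": {
        "mixture": ["ninja-405b"],
        "critique": "ninja-70b-nemotron",   # fastest: in-house models only
    },
    "nexus": {
        "mixture": ["gpt-4o-mini", "claude-3.5-haiku", "gemini-1.5-flash"],
        "critique": "ninja-405b",           # balance of speed and depth
    },
    "apex": {
        "mixture": ["claude-3.5-sonnet", "gpt-4o", "gemini-1.5-pro"],
        "critique": "claude-3.5-sonnet",    # most thorough, most expensive
    },
}

def models_for(tier: str) -> tuple:
    """Return (mixture model ids, critique model id) for a tier name."""
    cfg = SUPERAGENT_TIERS[tier]
    return cfg["mixture"], cfg["critique"]
```

A configuration like this could be plugged into the orchestration sketch shown earlier, with one client wrapper per model identifier.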

Why We Tested SuperAgent Against Industry Benchmarks

To evaluate SuperAgent's performance, we conducted benchmark testing against multiple foundational models, including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Benchmark tests like these are common practice in computer science and help us evaluate how our approach to AI compares to the single-model approach.

Here are the benchmarks we used:

Arena-Hard-Auto (Chat): A benchmark designed to test complex conversational abilities, focusing on the ability to handle intricate dialogue scenarios that require nuanced understanding and contextual awareness.

MATH-500: A benchmark aimed at evaluating an AI’s mathematical reasoning and problem-solving capabilities, specifically focusing on complex problems that involve higher-level mathematics.

LiveCodeBench (Coding): A coding test that measures an AI's ability to understand and generate code. This benchmark assesses the model's capacity to write accurate code in response to a variety of prompts, including basic and intermediate programming challenges.

LiveCodeBench Hard (Coding): An extension of LiveCodeBench focusing on advanced coding tasks that involve complex problem-solving and algorithmic challenges. It's designed to push the limits of an AI's coding skills and evaluate its ability to manage more difficult programming scenarios.

GPQA (Graduate-Level Google-Proof Q&A): A benchmark that tests an AI's general reasoning abilities by requiring it to answer questions involving complex, multi-step logic, factual recall, and inference.

AIME 2024 (American Invitational Mathematics Examination): A benchmark built from the 2024 AIME competition, focused on advanced mathematical reasoning. It assesses the model's ability to handle problems that require both logic and numerical computation.

These benchmarks represent a comprehensive, industry-standard way to evaluate various aspects of AI performance, allowing us to compare SuperAgent's capabilities against those of standalone models.
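As a rough illustration of how pass-rate benchmarks such as MATH-500 or LiveCodeBench are typically scored, here is a minimal grading loop. The exact-match grader and the `answer_fn` callable are simplifications of our own choosing; real benchmark harnesses use symbolic math checkers or unit tests and differ in many details.

```python
from typing import Callable, List, Tuple

def accuracy(problems: List[Tuple[str, str]],
             answer_fn: Callable[[str], str]) -> float:
    """Score a system on (prompt, reference_answer) pairs with exact-match grading."""
    correct = 0
    for prompt, reference in problems:
        prediction = answer_fn(prompt)
        if prediction.strip() == reference.strip():   # naive grader; real benchmarks
            correct += 1                              # use math checkers or unit tests
    return correct / len(problems)

# Hypothetical usage: run SuperAgent and a single baseline model over the same
# problem set and compare the resulting pass rates.
# superagent_score = accuracy(problems, super_agent_answer)
# baseline_score = accuracy(problems, baseline_answer)
```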

SuperAgent Outperforms Foundational Models on Arena-Hard

As we've mentioned, SuperAgent delivered outstanding results compared to all foundational models across multiple benchmarks. Let's take a closer look at Arena-Hard with no style control, one of the most important benchmarks for assessing how well an AI system handles complex, real-world conversational tasks. This benchmark is essential for understanding practical AI performance, and SuperAgent excelled, demonstrating capabilities far beyond those of other leading models.

The results: SuperAgent beat every foundational model we tested, as measured by Arena-Hard

Ninja's SuperAgent scored highest on the Arena-Hard No Style Control Benchmark when compared to other foundational models. Last updated: 12/03/2024

We want to highlight that Ninja's SuperAgent outperformed OpenAI's o1-mini and o1-preview, two reasoning models. This is very exciting because o1-mini and o1-preview are not just AI models; they are advanced reasoning systems that, in general, are not compared to foundational models like Gemini 1.5 Pro or Claude 3.5 Sonnet. Ninja outperforming two reasoning models demonstrates that the SuperAgent approach - combining the results from multiple models using a critiquing model - can produce results superior to those of a single AI system.
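For readers curious how judge-based benchmarks in the Arena-Hard style produce these rankings, here is a simplified sketch: a judge model compares each candidate answer against a fixed baseline answer, and a win rate is aggregated over all prompts. Details such as the judge prompt, position swapping, tie handling, and style control are omitted, and the function names here are our own.

```python
from typing import Callable, List

# The judge receives (question, answer_a, answer_b) and returns "A", "B", or "tie".
JudgeFn = Callable[[str, str, str], str]

def win_rate(questions: List[str],
             candidate_answers: List[str],
             baseline_answers: List[str],
             judge: JudgeFn) -> float:
    """Fraction of prompts where the judge prefers the candidate; ties count as half."""
    score = 0.0
    for question, candidate, baseline in zip(questions, candidate_answers, baseline_answers):
        verdict = judge(question, candidate, baseline)
        if verdict == "A":        # judge prefers the candidate's answer
            score += 1.0
        elif verdict == "tie":
            score += 0.5
    return score / len(questions)
```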

SuperAgent Excels On Other Benchmarks

Beyond Arena-Hard, the Apex version of Ninja's SuperAgent demonstrated exceptional performance in math, coding, and general problem-solving. These results highlight SuperAgent's outstanding capability to tackle complex problems with advanced logic and precision compared to other models. It also consistently outperformed the other models we tested at generating accurate, functional code.

The Apex version of SuperAgent scored highest on the LiveCodeBench Coding benchmark test when compared to foundational models. Last updated: 12/03/2024

The Apex version of SuperAgent scored highest on the LiveCodeBench - Coding Hard benchmark, beating many foundational models. Last updated: 12/03/2024

All Ninja SuperAgent versions excelled at the AIME2024-Reasoning benchmark. Last updated: 12/03/2024

The Apex version of SuperAgent scored highest in the GPQA-Reasoning benchmark. Last updated: 12/03/2024

The Apex and Nexus versions of the Ninja SuperAgent scored highest on the MATH-500 benchmark. Last updated: 12/03/2024

Across all benchmarks, SuperAgent showed a level of performance that surpassed many well-known foundational models - sometimes beating the most advanced reasoning models on the market. Because we love data, here's the same benchmark information in a table.

Last updated: 12/03/2024

Final Thoughts

The results speak for themselves: SuperAgent is a leap forward in how we think about AI-powered solutions. By leveraging multiple models, a refined critique system, and advanced inference-level optimization, SuperAgent delivers answers that are deeper, more accurate, and more relevant to your needs. Whether you need a complex coding solution, advanced reasoning, or simply the best possible conversational support, SuperAgent has proven it can outperform traditional single-model approaches.

As we continue to innovate, our commitment remains the same: delivering the most intelligent, efficient, and powerful AI system possible, because better answers mean a better experience for you.