A Breakthrough In AI Reasoning

Cost-efficient reasoning is key to Agentic workflows

At Ninja AI, we believe that cutting-edge AI should be both powerful and accessible, helping users boost productivity without breaking the bank. For the past two years we’ve been focused on building an agentic productivity system, continuously adding the latest AI advancements into Ninja AI to make it smarter, faster, and more capable.

Along the way we’ve introduced features that require sophisticated agentic workflows, such as Deep Research and Multi-Turn File Analysis. We also launched a beta version of a scheduling workflow, allowing Ninja to negotiate meeting times with multiple participants via email.

As we continuously refine these skills, we recognize a critical need—to enhance Ninja’s intelligence and decision-making. Reducing errors in high-risk tasks (e.g., modifying calendar events) and enabling more autonomous workflows (e.g., executing composite tasks that interact with APIs and people) require our agents to make more accurate decisions and predictions in many different types of situations.

We've discovered that incorporating step-by-step thinking into our workflows significantly boosts their accuracy and ability to generalize. Step-by-step thinking is a process that involves planning, breaking down tasks, backtracking, verifying, and reflecting before executing tasks through intelligent function calling. Recent reasoning models have successfully applied step-by-step thinking to solve complex math, science, and coding problems. However, due to the following limitations, these models aren't suitable for our Ninja agentic workflows:

First, most current reasoning models are very expensive. For example, a single complex agentic task using OpenAI’s O1 API could cost anywhere between $0.75 and $2.25¹. That is a per-task cost, which is economically unsustainable for us as a business and unviable for customers if we were to pass it on to them.

¹Assuming each agentic task requires an estimated 5,000 to 10,000 input tokens and 10,000 to 30,000 output tokens
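To make the arithmetic behind this kind of estimate concrete, here is a minimal sketch of a per-task cost calculation. The token ranges come from the footnote; the per-million-token prices are placeholder assumptions, not figures from this post, so the output depends entirely on the rates you plug in and need not reproduce the range quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrices:
    """Per-million-token prices in USD (assumed values, not from this post)."""
    input_per_million: float
    output_per_million: float


def task_cost(input_tokens: int, output_tokens: int, prices: TokenPrices) -> float:
    """Return the USD cost of one task for the given token counts."""
    total = (
        input_tokens * prices.input_per_million
        + output_tokens * prices.output_per_million
    )
    return total / 1_000_000


# Placeholder prices (assumption); substitute the provider's current rates.
ASSUMED_PRICES = TokenPrices(input_per_million=15.0, output_per_million=60.0)

# Token ranges taken from the footnote above.
low = task_cost(5_000, 10_000, ASSUMED_PRICES)
high = task_cost(10_000, 30_000, ASSUMED_PRICES)
print(f"Estimated cost per task: ${low:.2f} to ${high:.2f}")
```

Because output tokens are typically priced several times higher than input tokens, the long reasoning traces these models emit dominate the bill.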

Second, the more affordable reasoning models do not have the necessary features to power agentic workflows. For example, DeepSeek R1 is a free reasoning model, but it is limited. Due to its size, R1 requires Nvidia H200 GPUs (or better) to achieve low latency and high throughput, which makes it difficult to use in a real-time, task-oriented chat system. Using H200s also makes it expensive to run. Additionally, R1 has difficulty with general capability and software engineering tasks; these limitations are confirmed by the last section of the R1 paper.

Furthermore, existing reasoning models lack customizability. At Ninja, we aspire to build the most advanced agentic system for productivity. As such, we need the ability to fine-tune the models to better suit our needs. This is not possible when accessing current reasoning models via API or when using existing large open-source reasoning models (such as the 671B-parameter R1).

Given these drawbacks, we decided to design our own reasoning system, SuperAgent-R 2.0, to enable a sustainable agentic system that’s fast, affordable, and fine-tunable for customers.

Ninja’s Reasoning Model - SuperAgent-R 2.0

SuperAgent-R 2.0 is a compound AI system. It leverages Ninja’s own fine-tuned model with reasoning capability, which is based on DeepSeek R1 distilled on Llama 70B. SuperAgent-R 2.0 also uses other models to support reasoning via advanced inference-level optimizations. The whole system runs end-to-end on AWS infrastructure, which makes it affordable and scalable. The end result delivers near state-of-the-art performance at a fraction of the cost of proprietary models like OpenAI’s O1, O3-mini (high), or Anthropic’s Sonnet 3.7 (thinking mode).

SuperAgent-R 2.0 brings together several industry-first innovations to create a system that can complete complex reasoning tasks at low cost. A key component of the system is a new Multi-Gear Reasoning approach. Unlike other models that force users into a fixed level of computation, our system dynamically adjusts reasoning effort based on task complexity. SuperAgent-R 2.0’s levels of computation are:

No Thinking – For straightforward lookups and rapid responses.
Light Thinking – For medium-complexity tasks like structured reasoning.
High Thinking – For deep, multi-step reasoning tasks requiring advanced logic.

SuperAgent-R 2.0 can determine the reasoning effort on its own and automatically adjust to a user request. Admittedly, this is difficult to get right every time, because the system can still overthink. We are constantly reviewing customer feedback and will continue to make improvements.
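As a rough illustration of what gear selection means in practice, the sketch below maps a request to one of the three gears using simple substring cues. This is not how SuperAgent-R 2.0 decides; the cue lists, the word-count threshold, and the per-gear token budgets are all hypothetical, and a production system would more plausibly rely on a learned classifier or on the model itself.

```python
from enum import Enum


class Gear(Enum):
    NO_THINKING = "no_thinking"
    LIGHT_THINKING = "light_thinking"
    HIGH_THINKING = "high_thinking"


# Hypothetical reasoning-token budgets per gear (illustrative numbers only).
REASONING_BUDGET = {
    Gear.NO_THINKING: 0,
    Gear.LIGHT_THINKING: 2_000,
    Gear.HIGH_THINKING: 16_000,
}

# Hypothetical substring cues; naive matching means "improve" also hits "prove".
HARD_CUES = ("prove", "derive", "debug", "optimize", "step by step")
MEDIUM_CUES = ("compare", "summarize", "explain why")


def choose_gear(request: str) -> Gear:
    """Pick a gear from surface cues in the request text."""
    text = request.lower()
    if any(cue in text for cue in HARD_CUES):
        return Gear.HIGH_THINKING
    if any(cue in text for cue in MEDIUM_CUES) or len(text.split()) > 40:
        return Gear.LIGHT_THINKING
    return Gear.NO_THINKING


for request in (
    "What is the capital of France?",
    "Compare these two pricing plans for a five-person team.",
    "Prove that the sum of two even numbers is even.",
):
    gear = choose_gear(request)
    print(f"{gear.value:>14} (budget {REASONING_BUDGET[gear]:>6}): {request}")
```

A heuristic like this also shows why overthinking happens: any cue that fires on an easy request pushes it into a more expensive gear than it needs.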

SuperAgent-R 2.0 has undergone rigorous testing against leading AI benchmarks across multiple domains. In these tests, SuperAgent-R 2.0 is consistently competitive with leading AI models, demonstrating strong reasoning and problem-solving abilities.

Advantages of SuperAgent-R 2.0 compared to DeepSeek R1

DeepSeek-R1 has rightly received a lot of attention recently as a high-quality, free reasoning model. However, it comes with some notable drawbacks. One major limitation is its hardware requirement, mentioned above: it must run on Nvidia H200 GPUs (or better), which can increase operational costs, and even then it is not fast enough for real-time inference.

Additionally, as we evaluated DeepSeek-R1 and reviewed its documentation, we identified other drawbacks that could impact our customers:

General Capability: DeepSeek-R1 falls short of DeepSeek-V3 in key areas such as function calling, multi-turn interactions, and complex role-playing.
Language Capabilities: DeepSeek-R1 is optimized for Chinese and English, which can lead to issues when handling queries in other languages. Since we support users in multiple languages, broader language support is essential.
Prompting Sensitivity: DeepSeek-R1 is highly sensitive to prompt variations. Few-shot prompting - which is common amongst customers - degrades overall performance, making it less reliable for our needs.
Software Engineering Tasks: Benchmark results indicate that DeepSeek-R1 has limited software engineering capabilities. Given that many of our customers rely on Ninja for software-related tasks, this limitation would significantly impact their experience.

DeepSeek R1 is a fantastic model, but these factors make it less suitable for our needs and drove our decision to develop SuperAgent-R 2.0.

Competition Math (AIME 2024)

For competition math, an indicator of reasoning capability, our testing has shown that SuperAgent-R 2.0 exceeds the performance of OpenAI O1, Sonnet 3.7 (64k extended thinking), and DeepSeek R1, and is on par with the OpenAI O3-high reasoning model. OpenAI has published data indicating that a model that is good at competition math, such as AIME 2024, will be good at autonomous agentic workflows.

PhD-level Science Questions (GPQA Diamond)

This test measures how well a system can solve PhD-level science questions. This test is important to our users who work in many different industries and have various job functions. SuperAgent-R 2.0 exceeded human PhD-level accuracy on this benchmark of physics, biology, and chemistry problems.

Competition Code (Codeforces)

On Codeforces competitive programming, SuperAgent-R 2.0 achieves progressively higher Elo scores than DeepSeek V3 and scores competitive with many OpenAI models.

LiveBench - Coding

This benchmark is used to test real-world coding performance.

SuperAgent-R 2.0 is available at myninja.ai

Unlike various products on the market, we will not charge additional subscription fees for unlimited access to the SuperAgent-R 2.0 model. The model is available to all our Ultra users ($15/mo) and Business plan users ($20/mo/seat). Please note that we reserve the right to limit excessive use.

Try it out at myninja.ai

What’s Next: New Skills and API Access

As we look ahead, we will continue to deliver agentic workflows powered by SuperAgent-R 2.0 to help our users be more productive. One of the first ways we plan to use SuperAgent-R 2.0 is to enhance our Deep Research feature.

We also plan to provide API access to SuperAgent-R 2.0 soon, helping developers and businesses build their own custom systems.
