feat(blog): add Kimi K2.6 Arena leaderboard refresh post #2988
Merged · +97 −0 · 3 commits
src/routes/blog/post/kimi-k2-6-arena-leaderboard-refresh/+page.markdoc
| --- | ||
| layout: post | ||
| title: "Kimi K2.6 lands on Appwrite Arena, alongside a full leaderboard refresh" | ||
| description: "Kimi K2.6 from MoonshotAI ranks #3 without skills and #4 with skills on Appwrite Arena, in a refresh that swaps in eleven current frontier models and hardens the benchmark runner." | ||
| date: 2026-05-08 | ||
| cover: /images/blog/kimi-k2-6-arena-leaderboard-refresh/cover.avif | ||
| timeToRead: 6 | ||
| author: atharva | ||
| category: ai | ||
| featured: false | ||
| --- | ||
|
|
||
[Appwrite Arena](https://arena.appwrite.io) measures how well AI models understand Appwrite, and the leaderboard has just had its biggest change since launch. Kimi K2.6 from MoonshotAI is the new headline addition, the model roster has been refreshed to current frontier versions across the board, and the benchmark runner has picked up retries, deterministic output, and configurable concurrency.

This post walks through what changed, where Kimi K2.6 lands, and how to read the new numbers.
# Kimi K2.6, the headline addition

Kimi K2.6 is the latest open-weight model from MoonshotAI. Pricing on OpenRouter sits around $0.75 per million input tokens and $3.50 per million output tokens, which puts it between Mistral Large 3 and GLM 5.1 in the Arena cost order.
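Those per-token rates translate into a run cost in the obvious way. A minimal sketch, where the $0.75 and $3.50 rates come from the figures above but the token counts are hypothetical (the post does not publish the real volumes):

```python
# OpenRouter rates for Kimi K2.6 quoted above, in USD per million tokens.
INPUT_RATE_USD = 0.75
OUTPUT_RATE_USD = 3.50

def run_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a benchmark run from total token counts."""
    return (input_tokens * INPUT_RATE_USD + output_tokens * OUTPUT_RATE_USD) / 1_000_000

# Hypothetical volumes for a 191-question run.
print(f"${run_cost_usd(400_000, 60_000):.2f}")
```

Output tokens dominate the bill at these rates, which is why models that produce long reasoning traces tend to drift up the cost column.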
The interesting result is what happens when you take skills away.

| Mode | Rank | Overall | MCQ | Free-form | Cost | Correct |
| --- | --- | --- | --- | --- | --- | --- |
| With skills | 4 of 11 | 96.3% | 97.0% | 91.9% | $1.64 | 185 / 191 |
| Without skills | 3 of 11 | 93.6% | 95.2% | 83.5% | $0.48 | 179 / 191 |

Without skills, the only models ahead of Kimi K2.6 are Claude Opus 4.7 and GPT 5.5, both of which cost roughly four times more on this run. With skills, Kimi K2.6 lands within one point of Qwen 3.6 Plus and DeepSeek V4 Flash, two of the most cost-efficient models on the board.

The free-form jump is also worth a mention. Kimi K2.6 goes from 83.5% on free-form questions without skills to 91.9% with skills, an 8.4 point gain. That gap tells you the model can use Appwrite documentation effectively when it is in the prompt, rather than relying on memorized patterns alone.

The trade-off is speed. Kimi K2.6 averages 17 tokens per second and finishes the with-skills run in roughly 134 minutes, slower than every other model on the board except DeepSeek V4 Flash. If you are picking a model for an interactive coding loop, that matters. If you are picking one for batch generation or scheduled jobs, it matters less.
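Those two figures are roughly consistent with each other. A quick back-of-the-envelope check, under the assumption (not stated in the post) that generation dominates the wall time:

```python
TOKENS_PER_SECOND = 17  # Kimi K2.6 average throughput on the with-skills run
RUN_MINUTES = 134       # with-skills wall time
QUESTIONS = 191         # questions per run

# Output volume implied if the whole run were spent generating tokens.
implied_output_tokens = TOKENS_PER_SECOND * RUN_MINUTES * 60
per_question = implied_output_tokens / QUESTIONS

print(implied_output_tokens, round(per_question))
```

On the order of 700 output tokens per question under that assumption, which is plausible for free-form answers with tool calls.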
![Kimi K2.6 detail view on Appwrite Arena](/images/blog/kimi-k2-6-arena-leaderboard-refresh/arena-kimi-detail.avif)

# The leaderboard you remember is gone

The version of Arena most readers saw at launch ran an older roster. That roster has been retired. Eleven models were swapped or upgraded in a single roster overhaul before the Kimi K2.6 addition.
| Model out (old roster) | Model in (current roster) |
| --- | --- |
| Grok 4.1 Fast | Grok 4.3 |
| MiniMax M2.5 | MiniMax M2.7 |
| DeepSeek V3.2 | DeepSeek V4 Flash |
| Qwen 3.5 397B A17B | Qwen 3.6 Plus |
| Kimi K2.5 | Kimi K2.6 (added in a follow-up run) |
| GLM 5 | GLM 5.1 |
| GPT 5.3 Codex | (removed) |
| GPT 5.4 | GPT 5.5 |
| Claude Opus 4.6 | Claude Opus 4.7 |
| (not present) | Gemini 3.1 Pro (Preview) |
| (not present) | Gemini 3.1 Flash Lite (Preview) |
| (not present) | Mistral Large 3 2512 |

The roster is now ordered roughly by price, from DeepSeek V4 Flash at around $0.10 per million tokens up through Claude Opus 4.7 and GPT 5.5 at roughly $5. Both Gemini 3.1 variants and Mistral Large 3 2512 are new to the board. Every other slot is a current-generation upgrade of the model that was there before.
# Without skills tells a sharper story

The without-skills view is where Kimi K2.6's rank stands out, and where the new roster reshuffles harder than in the with-skills view.

![Arena leaderboard without skills](/images/blog/kimi-k2-6-arena-leaderboard-refresh/arena-leaderboard-without-skills.avif)

The top of the without-skills board now reads:

| # | Model | Overall | MCQ | Free-form | Cost |
| --- | --- | --- | --- | --- | --- |
| 1 | Claude Opus 4.7 | 96.2% | 96.4% | 94.8% | $1.89 |
| 2 | GPT 5.5 | 94.2% | 94.5% | 90.0% | $2.19 |
| 3 | **Kimi K2.6** | **93.6%** | **95.2%** | **83.5%** | **$0.48** |
| 4 | Gemini 3.1 Pro | 92.4% | 95.2% | 76.9% | $1.31 |
| 5 | GLM 5.1 | 90.2% | 91.5% | 81.9% | $0.30 |

Two things stand out here. First, Kimi K2.6 is the cheapest model in the top three by a wide margin. Second, the gap between MCQ and free-form is large for every model in this view, which lines up with the original Arena thesis: pulling Appwrite documentation into the prompt closes a knowledge gap that shows up most clearly on open-ended questions.

The with-skills view, by contrast, compresses everyone toward the top. Six models score above 95% once skills are added, and the practical question shifts from *which model knows Appwrite* to *which model gives me the right answer cheapest and fastest*.
# A more credible benchmark runner

The numbers are only as good as the runner that produced them. The latest changes to the benchmark scripts make the runs more reliable and the output more reproducible:

- **Retries with backoff.** Each question is now attempted up to three times. Empty MCQ tool calls are treated as errors and trigger a retry, instead of being recorded as a wrong answer. Transient OpenRouter errors no longer poison a model's score for an entire category.
- **Deterministic output ordering.** Per-model results are sorted by question order before being written to disk, so two runs that score the same produce diff-clean JSON. Easier to review, easier to diff.
- **Atomic writes.** Result files are written to a temporary path and renamed into place. A crashed run can no longer leave a half-written JSON file behind.
- **Configurable concurrency.** The runner reads `BENCHMARK_CONCURRENCY` from the environment, defaulting to 1. Useful for re-running a single model quickly without serializing all 191 questions over a single connection.
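The post does not show the runner's source, but the behaviours above can be sketched in a few lines. All names here, including the `ask` callable, are illustrative rather than the repository's actual API:

```python
import json
import os
import random
import tempfile
import time

MAX_ATTEMPTS = 3
# Concurrency is read from the environment, defaulting to 1, as described above.
CONCURRENCY = int(os.environ.get("BENCHMARK_CONCURRENCY", "1"))

def ask_with_retries(ask, question):
    """Attempt a question up to MAX_ATTEMPTS times with exponential backoff.

    An empty answer (e.g. an empty MCQ tool call) is treated as an error and
    retried, instead of being recorded as a wrong answer.
    """
    for attempt in range(MAX_ATTEMPTS):
        try:
            answer = ask(question)
            if not answer:
                raise ValueError("empty answer")
            return answer
        except Exception:
            if attempt == MAX_ATTEMPTS - 1:
                raise
            time.sleep(2 ** attempt + random.random())  # backoff with jitter

def write_results(path, results):
    """Write per-model results deterministically and atomically.

    Sorting by question index makes identical runs produce diff-clean JSON;
    writing to a temp file and renaming into place means a crashed run never
    leaves a half-written file behind.
    """
    ordered = sorted(results, key=lambda r: r["question_index"])
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(ordered, f, indent=2, sort_keys=True)
    os.replace(tmp_path, path)  # atomic rename on POSIX
```

The rename in `write_results` is the key move: readers of the results file only ever see the old complete file or the new complete file, never a partial write.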
These are the kind of changes you make when you want a benchmark to be quoted, not just shipped. They also make community contributions safer: an external pull request that adds a model or a question can re-run the benchmark and produce a clean diff against the previous results.
# Where to go next

If you want to dig in further, the Arena UI lets you filter by category, switch between with and without skills, sort by any column, and click through to a per-model breakdown that includes per-question reasoning and tool call counts. The repo is open source, so you can also re-run the benchmark locally against your own OpenRouter key.

- [Appwrite Arena leaderboard](https://arena.appwrite.io)
- [Kimi K2.6 on Arena](https://arena.appwrite.io/model/kimi-k2-6)
- [Arena on GitHub](https://github.com/appwrite/arena)
- [Arena documentation](/docs/tooling/arena)
- [Appwrite Skills](/docs/tooling/skills)
- [Discord community](https://appwrite.io/discord)

---

**Review comment on lines +94 to +97**

> **Contributor:** Do we need all of these?
>
> **Author:** I think these are fine, it helps with SEO.
Binary files added:

- `static/images/blog/kimi-k2-6-arena-leaderboard-refresh/arena-kimi-detail.avif` (+25.5 KB)
- `static/images/blog/kimi-k2-6-arena-leaderboard-refresh/arena-leaderboard-without-skills.avif` (+22 KB)
- (a third binary file, not shown)