Kimi K2.7 Code — AI Glossary

Released by Moonshot AI in June 2026, Kimi K2.7 Code is a frontier-scale, coding-focused agentic model built on its predecessor, Kimi K2.6. Designed for long-horizon coding tasks, multi-step agentic workflows, and complex software engineering, it improves on K2.6's performance while cutting inference costs.

Available through Moonshot’s official API and as downloadable open weights (released under a Modified MIT License), Kimi K2.7 Code can be deployed locally using high-performance inference servers such as vLLM or SGLang.

Architecture & Core Features

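Kimi K2.7 Code uses a 1 trillion parameter Mixture of Experts (MoE) architecture but activates only 32 billion parameters per token. This sparsity keeps per-token compute close to that of a much smaller dense model while retaining the capacity of a very large one, especially when natively INT4 quantized. All 1 trillion parameters must still be held in memory, so local deployment generally calls for multi-GPU server hardware rather than a typical consumer machine.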
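
To make the sparsity figures concrete, the arithmetic can be sketched as follows. The parameter counts come from the text above; the bytes-per-parameter values are the standard sizes for the named precisions. The result covers weights only (no KV cache, activations, or quantization metadata), so treat it as a rough lower bound rather than a hardware requirement.

```python
# Back-of-the-envelope sizing from the figures quoted above.
TOTAL_PARAMS = 1_000_000_000_000   # 1 trillion total parameters
ACTIVE_PARAMS = 32_000_000_000     # 32 billion activated per token

# Bytes per parameter for common weight precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def active_fraction(total: int = TOTAL_PARAMS, active: int = ACTIVE_PARAMS) -> float:
    """Share of the parameters that participate in each token's forward pass."""
    return active / total


def weight_memory_gb(params: int, precision: str) -> float:
    """Weight storage in decimal gigabytes (weights only)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    print(f"active fraction: {active_fraction():.1%}")
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(TOTAL_PARAMS, precision):,.0f} GB")
```

Only about 3% of the weights are used for any one token, but the full set has to be resident, which is why quantization matters more for memory than for compute here.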

Key capabilities include:

Massive Context Window: It supports up to 262,144 tokens of context, making it ideal for digesting entire repositories, large log files, and extensive documentation.
Native Multimodality: It processes both text and visual inputs, including direct video and image understanding, for tasks like frontend UI development from mockups or analyzing visual bug reports.
Preserve Thinking Mode: Kimi K2.7 Code forces preserve_thinking mode, retaining its full internal reasoning content across multi-turn interactions. This gives the model a persistent record of why it made past decisions, which improves its reliability in long agentic loops.
Interleaved Tool Calling: Designed for multi-step agentic workflows, the model handles complex tool invocation reliably. For example, it can execute Python, browse files, and call external APIs.
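
The client side of a tool-calling loop can be sketched locally without a server. The snippet assumes the OpenAI-style `tool_calls` message shape that OpenAI-compatible servers return; the `list_files` and `run_python` tools and the `dispatch_tool_calls` helper are hypothetical names introduced only for illustration, not part of any Kimi tooling.

```python
import json
import os
import subprocess
import sys
from typing import Callable


def list_files(path: str) -> str:
    """Hypothetical tool: return the sorted file names in a directory as JSON."""
    return json.dumps(sorted(os.listdir(path)))


def run_python(code: str) -> str:
    """Hypothetical tool: run a snippet in a subprocess and return its stdout.

    Model-written code should run in a sandbox in real use; this sketch has none.
    """
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, timeout=30
    )
    return result.stdout


TOOLS: dict[str, Callable[..., str]] = {"list_files": list_files, "run_python": run_python}


def dispatch_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested tool and build the `tool` messages to send back."""
    replies = []
    for call in assistant_message.get("tool_calls", []):
        name = call["function"]["name"]
        try:
            # Arguments arrive as a JSON-encoded string, not a dict.
            args = json.loads(call["function"]["arguments"])
            output = TOOLS[name](**args)
        except Exception as exc:
            # Report failures to the model instead of crashing the loop.
            output = f"error: {exc}"
        replies.append({"role": "tool", "tool_call_id": call["id"], "content": output})
    return replies


if __name__ == "__main__":
    message = {
        "role": "assistant",
        "content": "",
        "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {"name": "run_python", "arguments": json.dumps({"code": "print(2 + 2)"})},
        }],
    }
    print(dispatch_tool_calls(message))
```

In an interleaved loop, the returned `tool` messages are appended to the conversation and sent back, and the cycle repeats until the model replies without requesting further tools.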

Coding and Agentic Performance

Kimi K2.7 Code improves on K2.6 across realistic software engineering benchmarks. It also reduces thinking-token overhead by approximately 30% compared to K2.6, which makes agentic loops faster and cheaper.
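
As a rough illustration of what a 30% cut in thinking tokens means for a whole agentic loop, the sketch below compares generated-token totals. The step count and per-step token figures are made-up inputs, not measurements; only the 30% ratio comes from the text.

```python
THINKING_REDUCTION = 0.30  # approximate reduction vs. K2.6, from the text above


def loop_tokens(steps: int, thinking_per_step: int, answer_per_step: int) -> int:
    """Total generated tokens for an agentic loop with fixed per-step costs."""
    return steps * (thinking_per_step + answer_per_step)


def reduced_thinking(thinking_per_step: int, reduction: float = THINKING_REDUCTION) -> int:
    """Thinking tokens per step after applying the reduction."""
    return round(thinking_per_step * (1.0 - reduction))


if __name__ == "__main__":
    # Hypothetical workload: 50 steps, 2,000 thinking + 500 answer tokens per step.
    before = loop_tokens(50, 2000, 500)
    after = loop_tokens(50, reduced_thinking(2000), 500)
    print(before, after, f"{1 - after / before:.0%} fewer generated tokens")
```

Because answer tokens are unaffected, the overall saving is smaller than 30% and depends on how much of each step is spent thinking.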

[Figure: Coding benchmarks comparison, Kimi K2.7 Code vs. K2.6. Performance on proprietary coding benchmarks (%).]

Kimi Code Bench V2: This evaluates models on over 10 languages across a full production stack, including infrastructure, performance engineering, and system programming.
Program Bench: This tests whether the model can recreate a compiled binary’s exact behavior purely from documentation, with no source code provided.
MLS-Bench-Lite: This evaluates the agent’s ability to invent and scale ML methods across domains like reinforcement learning and computer vision.
Agentic Endurance: On Kimi Claw 24/7 Bench, an internal test of multi-day coworking tasks, K2.7 proves highly resilient to context degradation.

Local Deployment & Usage

Kimi K2.7 Code weights are available on Hugging Face (moonshotai/Kimi-K2.7-Code) and can be served with local hosting tools such as vLLM and SGLang. The recommended sampling settings in Thinking mode are a temperature of 1.0 and a top-p of 0.95.
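
One way to avoid forgetting the recommended sampling settings is to centralize them in a small helper. The sketch below only builds a request body and sends nothing; `build_chat_request` is a hypothetical helper name, while `temperature`, `top_p`, and `max_tokens` are the standard OpenAI-compatible chat completion parameters.

```python
import json

MODEL_ID = "moonshotai/Kimi-K2.7-Code"

# Recommended sampling settings for Thinking mode, per the text above.
THINKING_SAMPLING = {"temperature": 1.0, "top_p": 0.95}


def build_chat_request(prompt: str, max_tokens: int = 4096, **overrides) -> dict:
    """Return a chat completion body with the recommended defaults applied."""
    body = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **THINKING_SAMPLING,
    }
    body.update(overrides)  # explicit keyword arguments win over the defaults
    return body


if __name__ == "__main__":
    print(json.dumps(build_chat_request("Refactor this function."), indent=2))
```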

Running with vLLM

You can spin up an OpenAI-compatible API server using vLLM:

# Install vLLM
pip install vllm

# Start the server (requires 4.57.1 <= transformers < 5.0.0)
vllm serve "moonshotai/Kimi-K2.7-Code"

Once running, you can hit the local server with standard chat completion requests, complete with multimodality:

curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-K2.7-Code",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this UI mockup and generate React code for it."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://example.com/mockup.png"
						}
					}
				]
			}
		]
	}'
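
When the mockup is a local file rather than a public URL, a common approach with OpenAI-compatible servers is to inline the image as a base64 `data:` URL in the same `image_url` field. The sketch below builds such a message; `image_message` is a hypothetical helper, and whether a given deployment accepts data URLs is an assumption to check against the server's documentation.

```python
import base64
import mimetypes
from pathlib import Path


def image_message(prompt: str, image_path: str) -> dict:
    """Build a user message with a text part and an inlined image part."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {image_path}")
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }
```

The returned dictionary can be placed directly in the `messages` array of the request shown above.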

Kimi K2.7 Code is heavily optimized for use within developer ecosystems. According to Moonshot AI, it performs best when paired with the Kimi Code CLI, where it can act as a drop-in agent for autonomous repo management, incident debugging, and extensive refactoring.

Using Kimi K2.7 Code's Preserve Thinking Feature

import openai

# Kimi K2.7 Code forces `preserve_thinking` mode, meaning it retains
# full reasoning context across multi-turn interactions.
def chat_with_preserve_thinking(client: openai.OpenAI, model_name: str):
    messages = [
        {
            "role": "user",
            "content": "Tell me three random numbers."
        },
        {
            "role": "assistant",
            "reasoning_content": "I'll start by listing five numbers: 473, 921, 235, 215, 222, and I'll tell you the first three.",
            "content": "473, 921, 235"
        },
        {
            "role": "user",
            "content": "What are the other two numbers you have in mind?"
        }
    ]
    
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        max_tokens=4096,
    )
    
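    message = response.choices[0].message
    # The reasoning field name varies by server version, so check
    # `reasoning_content` first and fall back to `reasoning`.
    reasoning = getattr(message, "reasoning_content", None) or getattr(message, "reasoning", None)
    # The assistant can recall "215" and "222" from its own prior reasoning content
    print(f"Reasoning: {reasoning}")
    print(f"Response: {message.content}")
    return message.content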
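
For reasoning to be available on later turns, each assistant reply has to be written back into the message history together with its reasoning, as the hand-built `reasoning_content` entry above does. A minimal sketch of that bookkeeping follows; `append_assistant_turn` is a hypothetical helper, and the fallback from `reasoning_content` to `reasoning` is an assumption meant to cover servers that expose the field under either name.

```python
from types import SimpleNamespace


def append_assistant_turn(messages: list, message) -> list:
    """Return a new history with the assistant reply and its reasoning attached."""
    reasoning = getattr(message, "reasoning_content", None) or getattr(message, "reasoning", None)
    turn = {"role": "assistant", "content": message.content}
    if reasoning:
        # Request messages carry the reasoning under `reasoning_content`.
        turn["reasoning_content"] = reasoning
    return messages + [turn]


if __name__ == "__main__":
    # Stand-in for response.choices[0].message, so the sketch runs without a server.
    fake = SimpleNamespace(
        content="473, 921, 235",
        reasoning="Five numbers: 473, 921, 235, 215, 222. Share the first three.",
    )
    history = [{"role": "user", "content": "Tell me three random numbers."}]
    history = append_assistant_turn(history, fake)
    print(history[-1])
```

Returning a new list rather than mutating the input keeps earlier snapshots of the conversation intact, which can help when replaying or branching an agentic run.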
