GPT-OSS Series — AI Glossary

In August 2025, OpenAI did something surprising: it released the weights of two capable reasoning models, gpt-oss-120b and gpt-oss-20b, free of charge under the Apache 2.0 open-source licence. These were OpenAI’s first open-weight language models since GPT-2, and its first significant step into a space that had been dominated by Meta (Llama) and Alibaba (Qwen).

The GPT-OSS series is not a watered-down “open-source edition.” Both models use a Mixture-of-Experts (MoE) architecture, feature configurable chain-of-thought reasoning, and can use tools natively: browsing the web, running Python, or calling APIs.

120B vs. 20B: Choosing the Right Model

Think of it like choosing between a professional workstation and a high-end laptop. Both can do serious work; the right one depends on what you have available.

Feature	gpt-oss-120b	gpt-oss-20b
Total Parameters	~117 Billion	~21 Billion
Active Parameters per token	~5.1 Billion	~3.6 Billion
Transformer Layers	36	24
Context Window	128,000 tokens	128,000 tokens
Quantization	MXFP4	MXFP4
Min. GPU VRAM needed	~80 GB (e.g., 1× H100)	~16 GB (consumer GPU / Mac)
Best for	Enterprise, complex reasoning	Developers, local experimentation
Licence	Apache 2.0	Apache 2.0

Active parameters is the key number here. Even though the 120B model knows a lot (117B total parameters), it only “thinks with” about 5.1B of them for any single token. This is the MoE trick, explained in more detail below.
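To make the active-versus-total distinction concrete, the short Python sketch below works through the arithmetic using the parameter counts from the table. The 4.25 bits-per-weight figure is an assumption: MXFP4 stores 4-bit values plus one 8-bit scale shared by each block of 32 values, and the sketch treats every weight as MXFP4, which is a simplification. Read the memory numbers as rough, weights-only estimates rather than measured VRAM requirements.

```python
# Parameter counts (in billions) taken from the comparison table above.
MODELS = {
    "gpt-oss-120b": {"total_b": 117.0, "active_b": 5.1},
    "gpt-oss-20b": {"total_b": 21.0, "active_b": 3.6},
}

# Assumption: MXFP4 costs about 4.25 bits per weight
# (4-bit value + one 8-bit scale shared by a block of 32 values).
MXFP4_BITS = 4 + 8 / 32
BF16_BITS = 16


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the parameters that take part in processing one token."""
    return active_b / total_b


def weight_memory_gb(total_b: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return total_b * bits_per_param / 8


for name, p in MODELS.items():
    frac = active_fraction(p["total_b"], p["active_b"])
    print(
        f"{name}: {frac:.1%} active per token, "
        f"~{weight_memory_gb(p['total_b'], MXFP4_BITS):.0f} GB at MXFP4, "
        f"~{weight_memory_gb(p['total_b'], BF16_BITS):.0f} GB at bf16"
    )
```

Under these assumptions the 120B model uses about 4.4% of its parameters per token and needs roughly 62 GB for its weights, while the 20B model uses about 17.1% and needs roughly 11 GB. The estimates ignore the KV cache and activations, which is one reason the VRAM figures in the table are higher.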

How Does It Actually Perform?

The chart below shows a head-to-head comparison of gpt-oss-120b and gpt-oss-20b across five major AI benchmarks, from competition maths (AIME) to graduate science (GPQA Diamond).

[Figure: Full Benchmark Comparison, 120B vs 20B. Head-to-head across five major benchmarks (%).]

The 120B model outperforms the 20B on every benchmark, but the gap is smallest on general knowledge (MMLU) and largest on expert-level science (GPQA Diamond). Notably, gpt-oss-20b beats Llama 3.1 405B on AIME by a wide margin despite being a fraction of the size, which suggests that small MoE models trained for reasoning can punch above their weight.

What is Mixture-of-Experts (MoE)?

Most AI models are “dense”: every single parameter is used for every single token (roughly, a word or word piece) they process. This is like asking every employee in a company to review every document, which is thorough but very slow.

MoE is different. The model is split into many “expert” sub-networks. For each token, a small “router” network decides which few experts are most relevant (typically 2-4; OpenAI’s model card describes four per token for gpt-oss) and activates only those. The result:

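The model has the knowledge of a 120B model (because all those experts exist).
But it runs with roughly the per-token compute cost of a ~5B model (because most experts are idle for any given token).

Note that MoE saves compute, not memory: every expert must still be loaded, because any of them may be picked for the next token. What lets gpt-oss-120b fit on a single 80 GB H100 is the MXFP4 quantization listed in the table above, which stores most weights in roughly 4 bits each. At 16-bit precision, 117B parameters would need about 234 GB for the weights alone.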
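The routing idea described in this section can be sketched in a few lines of plain Python. This is a toy illustration, not OpenAI’s implementation: the expert count, the value of k, the scaling-function “experts”, and the random router scores are all hypothetical stand-ins, and the softmax over only the selected experts is one common weighting scheme.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k):
    """Pick the k highest-scoring experts and normalise their weights to sum to 1."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, experts, router_scores, k):
    """Weighted sum of the outputs of only the k selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


# Toy setup (all values hypothetical): 8 "experts", each a simple scaling function.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, s=i + 1: s * x for i in range(NUM_EXPERTS)]

random.seed(0)
# Stand-in for the scores a learned router would produce for one token.
router_scores = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_top_k(router_scores, TOP_K)
print("experts used:", [i for i, _ in chosen], "of", NUM_EXPERTS)
print("layer output:", moe_layer(1.0, experts, router_scores, TOP_K))
```

Only the selected experts are ever called, which is where the compute saving comes from. All eight expert functions still have to exist in memory, because a different token may route to any of them.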

Safety: The “Worst-Case Fine-Tuning” Test

Releasing a powerful model’s weights publicly is a serious responsibility. Anyone can download the weights and attempt to modify (“fine-tune”) the model for malicious purposes, for example by training it to help build weapons or bypass safety restrictions.

OpenAI ran a rigorous “worst-case fine-tuning” evaluation before publishing the weights:

They simulated an attacker. Safety researchers deliberately fine-tuned the model on specialised, high-risk datasets (cybersecurity exploits, dangerous chemistry) in the most aggressive way a bad actor plausibly could.
They measured the ceiling of harm. The question was: “Even if someone tries their hardest to make this dangerous, how far can they get?”
The result was reassuring. Even under worst-case conditions, the fine-tuned models did not reach “high-risk” capability thresholds in the biological, chemical, or cyber domains; they remained comparable to existing open models already available on the internet.

This evaluation was reviewed by independent biosafety and cybersecurity experts before release, providing the safety justification for an open-weight release.

How to Run Locally (Quick Start)

Ollama (easiest, one command):

# Pull and run in your terminal (works on macOS, Windows, and Linux)
ollama run gpt-oss:20b
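Once the model is running under Ollama, it can also be called from code through Ollama’s local HTTP API. The sketch below uses only the Python standard library and assumes a default Ollama install listening on localhost port 11434, with the model available under the tag gpt-oss:20b; the helper names build_chat_request and ask are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address


def build_chat_request(prompt: str, model: str = "gpt-oss:20b") -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat request for a local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one complete JSON reply instead of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "gpt-oss:20b") -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as response:
        body = json.load(response)
    return body["message"]["content"]


# Example (requires the Ollama server to be running and the model already pulled):
# print(ask("Explain Mixture-of-Experts in two sentences."))
```

Splitting request construction from sending keeps the payload easy to inspect or log before anything touches the network.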

The Python snippet using Hugging Face Transformers is in the next section.

Run GPT-OSS 20B via Hugging Face Transformers

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "openai/gpt-oss-20b"

# Load tokenizer and model. The published checkpoint is already quantized
# (MXFP4), so the bitsandbytes 4-bit flag is not needed here.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",           # place weights on the available GPU(s)/CPU
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful AI capable of deep reasoning."},
    {"role": "user", "content": "Explain the Mixture-of-Experts architecture in simple terms."},
]

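inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,    # return input_ids and attention_mask together
).to(model.device)       # follow wherever device_map placed the model

# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, not the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))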
