Living Resource

Ultimate Guide to Open Source AI Models

A practical, no-nonsense guide for founders, engineers, and AI teams deciding which open source or open-weight models are actually worth testing — by workload, benchmark profile, license fit, and hardware reality.

Last reviewed: March 6, 2026
Best for: model selection, evaluation, deployment planning
Maintained as: evergreen reference

Executive summary

If you only need the short version, it is this: most teams evaluating open models in 2026 should begin with a strong 7B–32B text model, an explicit evaluation harness, and a clear hardware budget before they touch giant MoE systems, open video stacks, or "frontier" model marketing.

MoE architectures now dominate the frontier

Qwen 3.5, Llama 4, DeepSeek-V3.2, Mistral Large 3, and gpt-oss all use mixture-of-experts, delivering frontier quality with a fraction of the active parameters. Teams should plan for MoE-friendly serving.

Open models now match proprietary frontier performance

DeepSeek-V3.2 matches GPT-5 on reasoning benchmarks. Qwen 3.5 397B competes with Claude 4.5 Opus on multimodal tasks. OpenAI's gpt-oss models run on consumer hardware. The gap between open and closed has effectively closed for many workloads.

"Open source" and "open weight" are not the same

A model can publish weights without publishing the full training data or reproducible training recipe. That matters for auditability, governance, and legal clarity.

Native multimodal is now the default

Qwen 3.5, Llama 4, and Gemma 3 all handle text and images natively. Separate vision adapters are no longer the primary path for multimodal work.

Bottom line

The right model is not the one with the biggest benchmark headline. It is the one that clears your real task evaluations, fits your hardware envelope, survives structured-output tests, and carries a license your company can live with.

Decision framework: if you want X, start with Y

This matrix is designed to help teams choose a sensible starting point instead of trying everything at once.

Use case | Start here | Why | Hardware reality | Watch-out
General writing, chat, summaries, RAG | Qwen 3.5 9B/27B or Llama 4 Scout | Strong general capability with efficient MoE inference and native multimodal support. | 12–24GB VRAM for smaller Qwen 3.5 variants; Scout fits on a single H100 with on-the-fly int4 quantization. | Check license and language fit before standardizing.
Coding assistant for real development work | Qwen3-Coder-30B-A3B or Qwen3-Coder-Next | Purpose-built for agentic coding with strong tool-use and environment interaction. | Qwen3-Coder-Next runs comfortably on 24GB VRAM thanks to hybrid MoE. | You still need tests, linting, and security review.
Tool-using agents and workflow automation | Qwen 3.5 Instruct or Llama 4 Maverick | Native tool-calling support, strong structured output, and broad ecosystem adoption. | 9B–35B is the practical starting band for Qwen 3.5. | Agent quality depends as much on orchestration and evals as on the base model.
Top-end open reasoning and large-scale inference | DeepSeek-V3.2 | GPT-5-level reasoning with integrated thinking and tool-use. V3.2-Speciale variant exceeds GPT-5 on math and reasoning but does not support tool calling. | Datacenter-class serving is the realistic target (685B total params, 37B active). | Too large for most local teams, but API access is widely available.
Vision-language understanding | Qwen 3.5 (natively multimodal) or Gemma 3 12B/27B | Native multimodal architectures handle text, images, and video without separate adapters. | Single prosumer GPU for Gemma 3 27B; Qwen 3.5 small models fit consumer hardware. | Use dedicated evaluation for grounding, not just chat quality.
New image generation | SDXL or FLUX.1 schnell | Best mix of maturity, workflow support, and local deployment options. | SDXL is happiest around 12GB VRAM; more if using refiners heavily. | Commercial terms vary by checkpoint.
Image editing, control, inpainting | Diffusers + ControlNet + SAM 2 + LaMa | Editing is a stack problem, not a single-model problem. | Consumer GPUs work well for many workflows. | Workflow quality depends on masks, conditioning, and operator skill.
Speech and audio | Whisper large-v3 | Best open speech-to-text model with broad language coverage and strong accuracy. | Runs comfortably on consumer GPUs; large-v3 needs ~10GB VRAM. | AudioCraft weights are CC-BY-NC 4.0, limiting commercial music/audio generation use.
Open video experiments | Wan2.1 or FramePack-style local workflows | Practical on better consumer hardware and improving rapidly. | Expect 16–24GB VRAM to be comfortable; smaller variants can go lower. | Video quality, latency, and consistency remain uneven.
Rule of thumb

Choose the smallest model that reliably completes your real task with the right output shape. Then move up only if the gains are measurable.

What counts as "open" here

This page separates fully open releases from open-weight releases and license-restricted releases because the market still collapses those categories into the same marketing label.

Best for research and auditability

Fully open

Weights, code, and meaningful training information are available. These releases are the closest match to the OSI-style vision of Open Source AI.

OLMo 2
Best for practical deployment

Open weight

You can download and run the weights, but the full training data and recipe are not completely reproducible. This is where most high-performing "open" models sit today.

Qwen 3.5, Llama 4, DeepSeek-V3.2, Gemma 3, Mistral Large 3, gpt-oss
Read the license carefully

Source-available or restricted

Some code and weights are available, but there are revenue thresholds, non-commercial clauses, or behavioral restrictions that matter in production.

Some Stability releases, some FLUX variants, AudioCraft weights
Open model selection works best when you think in capability families, not only in leaderboard rows.

The practical model landscape

Open AI is no longer one category. The ecosystem now includes general-purpose LLMs, code-specialized models, multimodal models, image generators, image-editing stacks, and increasingly capable video systems.

Category | Best for | Start here | Move up to | Sweet spot | Watch-outs
Writing / general LLM | Chat, drafting, summarization, RAG, internal copilots | Qwen 3.5 9B or Llama 4 Scout | Qwen 3.5 397B-A17B, Llama 4 Maverick, Mistral Large 3, DeepSeek-V3.2 | Qwen 3.5 27B–35B-A3B or gpt-oss-20b covers most team needs. | License terms and quantization quality matter more than leaderboard hype.
Coding | PR assistance, code generation, refactors, test writing | Qwen3-Coder-Next or Qwen3-Coder-30B-A3B | Qwen3-Coder-480B-A35B, Mistral Large 3, or DeepSeek-V3.2 | The 30B-A3B MoE variant is the most practical for real developer use. | Do not deploy without tests, sandboxing, and dependency/security review.
Agents | Tool use, workflow automation, multi-step task execution | Qwen 3.5 27B or Llama 4 Scout | Llama 4 Maverick, Mistral Large 3, or DeepSeek-V3.2 | Smaller models plus strong orchestration often beat oversized models with weak tooling. | JSON breakage, tool misuse, and cascading failures are the real bottlenecks.
Multimodal / VLM | Document understanding, image Q&A, visual agents, OCR-heavy workflows | Gemma 3 12B/27B or Qwen 3.5 9B | Qwen 3.5 397B-A17B or Llama 4 Maverick | Gemma 3 27B is the most practical local starting point with 128K context. | Grounding mistakes and OCR hallucinations still require checks.
Image generation | Concept art, marketing assets, ideation, product visuals | SDXL | FLUX.1 dev or SD3.5 when licensing permits | SDXL remains the safest default for broad ecosystem compatibility. | Typography and exact prompt fidelity still need workflow iteration.
Image editing | Inpainting, control, masking, pose/depth guidance, product edits | ControlNet + SAM 2 + LaMa + Diffusers | Project-specific editing stacks with custom masks and pipelines | Editing quality comes from stack design, not one magic checkpoint. | Commercial rights differ across base checkpoints and extensions.
Speech / audio | Transcription, translation, voice interfaces, audio understanding | Whisper large-v3 | Qwen2.5-Omni for unified multimodal audio + text | Whisper large-v3 covers most transcription and translation needs. | AudioCraft code is MIT but model weights are CC-BY-NC 4.0; verify license for audio generation.
Video generation | Short exploratory clips, motion concepts, early creative prototyping | Wan2.1 small variants | Open-Sora 2.0-style research stacks | Today, open video is a prototyping tool more than a production default. | Temporal flicker, identity drift, and long render times remain common.
Video editing | Interpolation, inpainting, retiming, experimental edit pipelines | RIFE, ProPainter, Wan2.1 VACE-style workflows | Custom pipelines for domain-specific video tasks | Use specialized tools rather than expecting one general model to handle everything. | Workflow complexity is high and results are sensitive to clip quality and masking.

The biggest change from 2024 to 2026 is not just raw model quality. It is the breadth of credible open options across text, coding, multimodal, and media generation.

Benchmark snapshot: what the top open families report

These numbers are useful as a map, not as a verdict. Benchmark settings vary. Prompt formatting moves scores. Preference benchmarks can overstate real operational reliability. Use this as the first filter, then test on your own workload.

Model | General | Reasoning | Coding | Notes
Qwen 3.5 397B-A17B | MMLU-Pro 87.8, SuperGPQA 70.4 | AIME26 91.3, GPQA Diamond 88.4 | LiveCodeBench v6 83.6 | Frontier MoE with only 17B active params. Native multimodal, 201 languages. Competes with Claude 4.5 Opus.
DeepSeek-V3.2 | Comparable to GPT-5 | IMO and IOI gold-medal level | SWE-bench competitive with GPT-5 | 685B total / 37B active MoE. Integrated thinking + tool-use. Speciale variant exceeds GPT-5 on reasoning but drops tool calling.
Mistral Large 3 | MMLU-Pro ~73–78 | Strong mid-to-high tier | HumanEval ~92 | 675B total / 41B active MoE. Apache 2.0. Multimodal. 256K context.
Llama 4 Maverick | MMLU 85.5, MMLU-Pro 80.5 | GPQA Diamond 69.8 | HumanEval 82.4 | 400B total / 17B active, 128 experts. Natively multimodal, 1M context. FP8 fits on a single H100 DGX host.
Llama 4 Scout | MMLU 79.6, MMLU-Pro 74.3 | GPQA Diamond 57.2 | HumanEval 74.1 | 109B total / 17B active, 16 experts. 10M context. Fits on a single H100 with on-the-fly int4 quantization.
gpt-oss-120b | MMLU-Pro 90.0 | AIME 2025 97.9 (with tools) | Near o4-mini on competition coding | 117B total / 5.1B active MoE. Apache 2.0. Fits on a single 80GB GPU. 128K context.
gpt-oss-20b | Matches o3-mini on common benchmarks | Strong for its size class | Competitive with o3-mini | 21B total / 3.6B active MoE. Apache 2.0. Runs on 16GB devices. 128K context.
Gemma 3 27B IT | MMLU-Pro 67.5, MMMU 64.9 | GPQA Diamond 42.4, MATH 69.0 | LiveCodeBench 29.7 | Natively multimodal. Comparable to Gemini 1.5 Pro (mixed results, not a blanket win). 128K context, 140+ languages. Chatbot Arena Elo 1338.
OLMo 2 13B | MMLU 81.5 | Competitive with equivalently-sized open models | Less emphasized than code-specialized families | Fully open (weights, code, data, training recipe). Apache 2.0 for base models; some instruct checkpoints have additional terms.
How to use benchmarks correctly

Use one academic snapshot table, one real-work evaluation table, and one reliability table. If a model only looks good in one of those three, it is not production-ready for your team.
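One way to make the three-table rule concrete is a simple gate: a model stays on the shortlist only if it clears a threshold in every table. Here is a minimal Python sketch, where the model names, scores, and thresholds are all illustrative placeholders rather than real measurements:

```python
# Gate models on three score tables: academic snapshot, real-work evals,
# and reliability (e.g. structured-output pass rate). All scores are
# illustrative placeholders on a 0-100 scale, not real benchmark numbers.
academic = {"model-a": 82, "model-b": 74, "model-c": 88}
real_work = {"model-a": 71, "model-b": 77, "model-c": 58}
reliability = {"model-a": 90, "model-b": 85, "model-c": 62}

def shortlist(tables, thresholds):
    """Keep only models that clear the threshold in every table."""
    names = set.intersection(*(set(t) for t in tables))
    return sorted(
        m for m in names
        if all(t[m] >= th for t, th in zip(tables, thresholds))
    )

picks = shortlist([academic, real_work, reliability], [70, 65, 80])
print(picks)  # model-c leads the academic table but fails the other two
```

A model that tops only the academic table, like model-c here, is exactly the case the rule above is meant to catch.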

Open vs. closed models: where each wins

The real tradeoff is not "open is better" or "closed is better." It is whether you want control, customization, and privacy enough to take on the systems burden yourself.

Dimension | Open / open-weight | Closed ecosystem
Control | Self-host, fine-tune, inspect, and route however you want. | Fastest path to strong capability with less systems work.
Cost model | Infrastructure, ops, and engineering replace per-token API pricing. | Usage-based pricing is simple but can become expensive at scale.
Privacy and data boundary | Best option when prompts, outputs, and logs must stay inside your environment. | Provider policy and retention controls matter more.
Customization | Adapters, quantization, routing, and domain tuning are the major advantages. | Prompting is easy; deep model customization is limited.
Operational burden | You own serving, evals, security, and reliability. | You inherit better managed infrastructure and usually better SLAs.
Best fit | Teams with repeatable workloads, privacy needs, or platform ambitions. | Teams optimizing for speed, simplicity, and managed frontier access.
Hardware fit is one of the fastest ways to narrow the field before you benchmark anything.

Hardware tiers: what you actually need

The fastest way to waste time in open AI is to choose models before you define the serving envelope. Pick the hardware tier first, then shortlist models that fit.

Consumer / hobbyist

Single GPU, 12–16GB VRAM, 32–64GB RAM

What fits: Qwen 3.5 4B/9B, Gemma 3 12B, gpt-oss-20b, lightweight coding models, SDXL

Best for: Local testing, lightweight RAG, first agents, image generation

Watch-outs: Do not expect comfortable 70B+ serving or serious open video production.

Prosumer / advanced local

24–48GB VRAM, 64–128GB RAM, fast NVMe, optional multi-GPU

What fits: Qwen 3.5 27B/35B-A3B, Gemma 3 27B, gpt-oss-120b, Qwen3-Coder-Next, small video stacks

Best for: Serious private assistants, agentic coding, local experimentation with MoE models

Watch-outs: Open video is still slow and multi-step agent stacks need careful tuning.

Enterprise / datacenter

Multi-GPU clusters, high-bandwidth networking, optimized serving

What fits: Qwen 3.5 397B-A17B, Llama 4 Maverick, Mistral Large 3, DeepSeek-V3.2, concurrency-heavy inference

Best for: Internal copilots, agent platforms, multimodal services, governed deployment

Watch-outs: This is where reliability, governance, and evaluation become more important than raw model choice.

Practical serving reality

For most real teams, the 14B–32B band is the easiest place to get strong quality without crossing into difficult multi-GPU operations. Giant MoE systems make sense later, not first.
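As a planning aid, weight memory can be estimated from parameter count and quantization level. The sketch below is a rough heuristic, not a guarantee: it ignores KV cache growth with context length, and the 20% overhead factor is an assumption rather than a measured value.

```python
def est_vram_gb(total_params_b: float, bits_per_weight: float,
                overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights only, padded ~20% for KV cache,
    activations, and runtime buffers. A planning heuristic, not a spec."""
    weight_gb = total_params_b * bits_per_weight / 8  # billions of params -> GB
    return round(weight_gb * overhead, 1)

# A 27B dense model at different quantization levels:
print(est_vram_gb(27, 16))  # fp16: ~64.8 GB, multi-GPU or heavy quantization
print(est_vram_gb(27, 4))   # int4: ~16.2 GB, fits a 24GB prosumer card
```

Note that for MoE models it is total parameters, not active parameters, that determine weight memory, which is why a 397B-A17B model still needs datacenter-class serving even though only 17B parameters are active per token.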

Hallucinations, reliability, and the failure modes that matter

Hallucinations are only one part of the reliability story. Open models also fail through prompt sensitivity, poor tool arguments, visual grounding errors, license misunderstandings, and brittle long-context behavior.

Text and coding models

The most common failures are fabricated facts, false confidence, stale knowledge, malformed JSON, and plausible-but-wrong code. Code models can also generate insecure or license-sensitive output.

Multimodal models

Expect OCR misses, object misidentification, incorrect grounding, and overconfident descriptions of partially visible content.

Image models

The main problems are prompt drift, poor typography, inconsistent identity, and weak fine-grained control unless you add editing and conditioning tools.

Video models

The biggest issues remain temporal flicker, identity drift, motion incoherence, and long runtimes for short clips.

Reliability checklist
  • Treat hallucinations as a systems problem, not only a model problem.
  • Require citations or retrieval for factual workflows.
  • Schema-validate every tool call and structured output.
  • Use test suites and eval harnesses before swapping models.
  • Separate "good at chat" from "good at operations."
  • Expect prompt sensitivity, especially around formatting and long contexts.
  • Add human review for regulated, financial, legal, medical, or externally visible outputs.
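The schema-validation item above can start very small: a whitelist of tools plus required-argument checks, applied before any call reaches an executor. A standard-library-only sketch, where the search_docs tool and its argument spec are hypothetical:

```python
import json

# Hypothetical tool registry: tool name -> required argument names and types.
TOOLS = {
    "search_docs": {"query": str, "top_k": int},
}

def validate_tool_call(raw: str):
    """Reject malformed JSON, unknown tools, and bad arguments
    before anything reaches an executor."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"malformed JSON: {e.msg}"
    spec = TOOLS.get(call.get("tool"))
    if spec is None:
        return False, f"unknown tool: {call.get('tool')!r}"
    args = call.get("args", {})
    for name, typ in spec.items():
        if name not in args:
            return False, f"missing argument: {name}"
        if not isinstance(args[name], typ):
            return False, f"bad type for {name}: expected {typ.__name__}"
    return True, "ok"

ok, msg = validate_tool_call(
    '{"tool": "search_docs", "args": {"query": "llama 4 license", "top_k": 3}}')
print(ok, msg)  # True ok
print(validate_tool_call('{"tool": "rm_rf", "args": {}}')[0])  # False
```

In production this layer usually grows into full JSON Schema validation, but even this much catches the malformed-JSON and tool-misuse failures listed above before they cascade.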

Licensing: the most overlooked part of model selection

License fit is not cleanup work after the benchmark review. It is one of the first filters. Many teams waste time evaluating models they cannot legally or economically ship.

License pattern | Best for | Examples | Watch-out
Apache 2.0 / MIT | Commercial deployment and broad integration | OLMo 2, Whisper, Qwen 3.5, Qwen3-Coder, Mistral Large 3, gpt-oss, FLUX.1 schnell | Still verify each model card; not every family uses the same license for every checkpoint.
Llama 4 Community License | Commercial use with strong ecosystem momentum | Llama 4 Scout, Llama 4 Maverick | Permissive for many uses, but it is not OSI-style open source.
Gemma terms / custom terms | Practical use when the model fits your needs | Gemma 3 | Do not assume "Google open model" means Apache-style freedom.
OpenRAIL / Responsible AI licenses | Creative or research use where restrictions are acceptable | Some Stable Diffusion family releases, BigCode OpenRAIL-M | Behavioral restrictions and downstream obligations can affect productization.
Community / revenue-threshold licenses | Early testing before full commercialization | Some Stability releases | Revenue thresholds and enterprise terms can change the total cost of ownership.
Non-commercial weight licenses | Research, experimentation, internal evaluation | Some FLUX variants, AudioCraft weights | This is a hard stop for many production uses.
A good licensing rule

Treat every checkpoint as its own legal object. Do not assume the family name tells you the full commercial story.
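In practice this means gating the evaluation shortlist on a per-checkpoint license table rather than on family names. A minimal sketch; the license strings below are illustrative and must be confirmed against each model card:

```python
# Treat every checkpoint as its own legal object: gate the shortlist on a
# per-checkpoint license table, not on family names. License identifiers
# here are illustrative; always confirm on the actual model card.
ALLOWED = {"apache-2.0", "mit"}

CHECKPOINTS = {
    "whisper-large-v3": "mit",
    "gpt-oss-20b": "apache-2.0",
    "flux.1-dev": "non-commercial",   # same family as schnell, different terms
    "flux.1-schnell": "apache-2.0",
}

def deployable(name: str) -> bool:
    """Unknown checkpoints fail closed: no license entry, no deployment."""
    return CHECKPOINTS.get(name, "unknown") in ALLOWED

print([m for m in sorted(CHECKPOINTS) if deployable(m)])
# ['flux.1-schnell', 'gpt-oss-20b', 'whisper-large-v3']
```

The FLUX pair is the point of the example: two checkpoints from one family land on opposite sides of the gate, which is exactly why the family name tells you nothing.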

Recommended deployment stacks

Choosing a model without choosing a serving and evaluation stack is incomplete. The stack determines latency, batching, observability, and how painful future model swaps will be.

Ollama + llama.cpp

Best for: Fastest path to local testing

Strengths: Great for laptops, desktops, and quick internal prototypes.

Limits: Not the best fit for serious multi-user production serving.

vLLM

Best for: High-throughput production inference

Strengths: Paged attention, strong batching behavior, and a mature serving ecosystem.

Limits: More ops-heavy than local tools.

TensorRT-LLM

Best for: NVIDIA-centric optimized serving

Strengths: Best when you want GPU-specific performance tuning at scale.

Limits: More specialized setup and infra assumptions.

Transformers + Diffusers

Best for: Custom workflows and research flexibility

Strengths: Best ecosystem for model experimentation, adapters, and editing pipelines.

Limits: Requires more assembly than end-user desktop tools.

ComfyUI

Best for: Creative image and video workflows

Strengths: Visual pipeline building, strong community extensions, easy iteration.

Limits: Operational governance is weaker than code-first stacks.

LangGraph / LlamaIndex / AutoGen

Best for: Agents, tool use, and workflow orchestration

Strengths: Useful abstractions for state, retrieval, and multi-step execution.

Limits: They do not fix weak evals or poor model choices for you.

Recommended starting stacks by team profile

Use these as default launch points, not as permanent architecture decisions.

Founder or operator testing AI internally

Start with Qwen 3.5 9B/27B or Llama 4 Scout, run it through a small RAG layer, and measure task completion before chasing bigger models.

Developer building a local coding copilot

Start with Qwen3-Coder-Next or Qwen3-Coder-30B-A3B, then step up only if your evals show clear gains on your real repos.

Creative team evaluating image and video

Use SDXL or FLUX for image work first. Treat open video as an R&D lane, not your default production pipeline.

Enterprise team with privacy and governance requirements

Prioritize license clarity, eval discipline, and serving fit over raw leaderboard rank. vLLM-class serving plus a Qwen 3.5 27B–35B or Llama 4 Scout is usually the right first step.

Best first experiment

Pick one workflow, one evaluation harness, one hardware target, and three candidate models. Anything broader becomes expensive research theater.
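The whole first experiment can fit in a few dozen lines. In this sketch, run_model is a stub standing in for a real call to your serving endpoint, and the task prompts, checks, and candidate names are all placeholders:

```python
# Minimal first-experiment harness: three candidate models, one task set,
# one pass/fail check per task. run_model is a stub; replace it with a real
# call to your serving endpoint (Ollama, vLLM, etc.).
TASKS = [
    {"prompt": 'Return exactly this JSON: {"status": "ok"}',
     "check": lambda out: '"status"' in out},
    {"prompt": "Summarize the following paragraph: ...",
     "check": lambda out: len(out) > 0},
]

def run_model(model: str, prompt: str) -> str:
    # Stub inference call; swap in your client code here.
    return '{"status": "ok"}' if "JSON" in prompt else "a short summary"

def pass_rate(model: str) -> float:
    results = [t["check"](run_model(model, t["prompt"])) for t in TASKS]
    return sum(results) / len(results)

for model in ["candidate-a", "candidate-b", "candidate-c"]:
    print(model, pass_rate(model))  # identical here because run_model is a stub
```

Once the stub is replaced with real inference calls, the same loop produces the real-work evaluation table this guide keeps asking for, and swapping a candidate is a one-line change.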

Sources and methodology

This page is built from model cards, technical reports, official repositories, standards bodies, and tooling documentation. The goal is practical decision support, not hype-driven ranking.

  • Open Source Initiative — Open Source AI Definition: https://opensource.org/ai/open-source-ai-definition
  • Qwen 3.5 announcement: https://qwen.ai/blog?id=qwen3.5
  • Qwen 3.5 GitHub repository: https://github.com/QwenLM/Qwen3.5
  • Qwen3-Coder GitHub repository: https://github.com/QwenLM/Qwen3-Coder
  • Meta Llama 4 announcement: https://ai.meta.com/blog/llama-4-multimodal-intelligence/
  • Llama 4 model page: https://www.llama.com/models/llama-4/
  • DeepSeek-V3.2 Technical Report: https://arxiv.org/html/2512.02556v1
  • DeepSeek-V3.2 Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V3.2
  • DeepSeek-V3.2-Speciale Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale
  • Gemma 3 model overview: https://ai.google.dev/gemma/docs/core
  • Gemma 3 — Google DeepMind: https://deepmind.google/models/gemma/gemma-3/
  • Gemma 3 Technical Report: https://arxiv.org/html/2503.19786v1
  • OLMo 2 13B model card: https://huggingface.co/allenai/OLMo-2-1124-13B
  • SDXL paper: https://arxiv.org/abs/2307.01952
  • FLUX.1 schnell model page: https://huggingface.co/black-forest-labs/FLUX.1-schnell
  • Open-Sora repository: https://github.com/hpcaitech/Open-Sora
  • Wan2.1 repository: https://github.com/Wan-Video/Wan2.1
  • FramePack repository: https://github.com/lllyasviel/FramePack
  • Whisper repository: https://github.com/openai/whisper
  • Mistral Large 3 announcement: https://mistral.ai/news/mistral-3
  • Mistral Large 3 Hugging Face: https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512
  • OpenAI gpt-oss announcement: https://openai.com/index/introducing-gpt-oss/
  • gpt-oss model card: https://openai.com/index/gpt-oss-model-card/
  • Qwen3.5-397B-A17B Hugging Face: https://huggingface.co/Qwen/Qwen3.5-397B-A17B
  • llama.cpp quantization memory reference: https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md
  • vLLM — high-throughput LLM serving: https://github.com/vllm-project/vllm
  • Ollama — local model runner: https://ollama.com/
  • ComfyUI — visual workflow builder: https://github.com/comfyanonymous/ComfyUI
  • ControlNet repository: https://github.com/lllyasviel/ControlNet
  • SAM 2 repository: https://github.com/facebookresearch/sam2