Taxonomy of AI: From ML to World Models

A map of AI systems — machine learning, deep learning, LLMs, multimodal models, and world models — with clear definitions and comparisons.

S5 Labs Team · February 3, 2026

The term “artificial intelligence” has become so overloaded that it often obscures more than it reveals. A rules-based chatbot, a spam filter, GPT-4.5, and a self-driving car are all labeled “AI,” but they represent fundamentally different approaches with distinct capabilities, limitations, and appropriate use cases.

This guide provides a structured taxonomy of AI systems. Understanding these distinctions is essential for anyone evaluating AI technologies, building AI-powered products, or simply trying to separate substance from hype. We’ll cover the major paradigms, their relationships, and when each approach makes sense.

The AI Landscape: A Visual Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                        ARTIFICIAL INTELLIGENCE                               │
│   Any system that performs tasks typically requiring human intelligence      │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌──────────────────────────┐    ┌──────────────────────────────────────┐  │
│  │    SYMBOLIC AI           │    │         MACHINE LEARNING              │  │
│  │    (Rule-based)          │    │         (Data-driven)                 │  │
│  │                          │    │                                       │  │
│  │  • Expert systems        │    │  ┌─────────────────────────────────┐ │  │
│  │  • Knowledge graphs      │    │  │        DEEP LEARNING             │ │  │
│  │  • Logic programming     │    │  │        (Neural networks)         │ │  │
│  │  • Search algorithms     │    │  │                                  │ │  │
│  │                          │    │  │  ┌──────────────────────────┐   │ │  │
│  │                          │    │  │  │   FOUNDATION MODELS      │   │ │  │
│  │                          │    │  │  │                          │   │ │  │
│  │                          │    │  │  │  • LLMs (GPT, Claude)    │   │ │  │
│  │                          │    │  │  │  • Vision (CLIP, SAM)    │   │ │  │
│  │                          │    │  │  │  • Multimodal (GPT-4V)   │   │ │  │
│  │                          │    │  │  │  • World Models          │   │ │  │
│  │                          │    │  │  └──────────────────────────┘   │ │  │
│  │                          │    │  └─────────────────────────────────┘ │  │
│  └──────────────────────────┘    └──────────────────────────────────────┘  │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Part 1: The Foundational Paradigms

Symbolic AI: Intelligence Through Rules

Symbolic AI—also called “Good Old-Fashioned AI” (GOFAI)—represents knowledge explicitly using symbols and rules that humans can read and understand. This was the dominant paradigm from the 1950s through the 1980s.

How it works: Human experts encode their knowledge into logical rules. The system applies these rules to new situations through logical inference.

Rule: IF temperature > 100°F AND has_cough = true THEN possible_fever = true
Rule: IF possible_fever = true AND duration > 3_days THEN recommend_doctor = true

Input: {temperature: 102, has_cough: true, duration: 5}
Inference: possible_fever = true → recommend_doctor = true
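
The inference chain above can be sketched as a tiny forward-chaining engine in Python. This is an illustrative toy, not a production rules engine; the fact names simply mirror the example:

```python
def infer(facts):
    """Apply the example rules via simple forward chaining:
    keep firing rules until no new facts are derived."""
    derived = dict(facts)
    changed = True
    while changed:
        changed = False
        # Rule 1: high temperature + cough => possible fever
        if (derived.get("temperature", 0) > 100 and derived.get("has_cough")
                and not derived.get("possible_fever")):
            derived["possible_fever"] = True
            changed = True
        # Rule 2: possible fever lasting > 3 days => recommend a doctor
        if (derived.get("possible_fever") and derived.get("duration", 0) > 3
                and not derived.get("recommend_doctor")):
            derived["recommend_doctor"] = True
            changed = True
    return derived

result = infer({"temperature": 102, "has_cough": True, "duration": 5})
```

Real rules engines add conflict resolution, rule priorities, and audit trails, but the core loop is the same: apply rules until no new facts can be derived.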

Strengths:

  • Fully explainable—you can trace exactly why any decision was made
  • Guaranteed behavior—the system does exactly what the rules specify
  • Works with small amounts of data—rules come from experts, not datasets
  • Easy to modify—change a rule, change the behavior

Limitations:

  • Doesn’t scale—encoding enough rules for complex domains becomes impossible
  • Brittle—fails on edge cases not anticipated by rule authors
  • No learning—can’t improve from experience without human intervention
  • Knowledge bottleneck—extracting expertise from humans is slow and expensive

Modern applications:

  • Business rules engines (insurance underwriting, loan approval)
  • Configuration management and validation
  • Legal document analysis (structured compliance checking)
  • Game AI (board games, strategy games with explicit rules)

Notable systems: IBM’s Deep Blue (chess), early medical diagnosis systems like MYCIN, expert systems in manufacturing.

Machine Learning: Intelligence Through Data

Machine learning flips the paradigm: instead of humans writing rules, algorithms discover patterns from data. The system learns to map inputs to outputs by being shown examples.

The fundamental formulation: Given a dataset of (input, output) pairs, find a function f that maps inputs to outputs and generalizes to new inputs not in the training set.

Training data:
  Input: "Great product!" → Output: positive
  Input: "Terrible, broke immediately" → Output: negative
  Input: "Works as expected" → Output: neutral
  ... (thousands more examples)

Learned function f:
  f("Love it, will buy again!") → positive  (new input, correct prediction)

Machine learning encompasses several distinct approaches:

Supervised Learning

The algorithm learns from labeled examples—data where the correct answer is provided.

Classification: Predict discrete categories

  • Spam detection (spam vs. not spam)
  • Medical diagnosis (disease present vs. absent)
  • Image recognition (cat, dog, bird, etc.)

Regression: Predict continuous values

  • House price prediction
  • Demand forecasting
  • Credit scoring

Common algorithms:

  • Linear/logistic regression
  • Decision trees and random forests
  • Support vector machines (SVMs)
  • Gradient boosting (XGBoost, LightGBM)
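
To make the regression setting concrete, here is ordinary least squares written from scratch for a single feature. The numbers are invented toy data in which price is an exact linear function of size:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (one feature)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var          # slope
    b = my - a * mx        # intercept
    return a, b

# Toy "house price" data: price = 100 * size + 50, no noise
sizes  = [10, 15, 20, 25, 30]
prices = [1050, 1550, 2050, 2550, 3050]
a, b = fit_line(sizes, prices)
predicted = a * 22 + b     # generalize to a size not in the training set
```

The learned function generalizes to unseen inputs, which is the whole point: the model was never shown size 22, yet predicts its price from the pattern in the data.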

Unsupervised Learning

The algorithm finds patterns in data without labeled examples.

Clustering: Group similar items together

  • Customer segmentation
  • Document organization
  • Anomaly detection (anything that doesn’t fit clusters)

Dimensionality reduction: Find compact representations

  • Visualization of high-dimensional data
  • Feature compression
  • Noise reduction

Common algorithms:

  • K-means clustering
  • Hierarchical clustering
  • Principal Component Analysis (PCA)
  • Autoencoders
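
A minimal k-means illustrates the clustering idea: alternate between assigning points to the nearest centroid and recomputing centroids as cluster means. This sketch uses a deterministic initialization for clarity; real implementations use random restarts or k-means++:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its cluster."""
    centroids = points[:k]  # deterministic init, just for the example
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups in 2-D, no labels required
data = [(1, 1), (1.5, 2), (2, 1.5), (9, 9), (8.5, 10), (10, 9.5)]
centroids, clusters = kmeans(data, k=2)
```

Note that no labels were provided anywhere: the structure was discovered from the data alone.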

Reinforcement Learning

The algorithm learns by taking actions in an environment and receiving rewards or penalties.

┌──────────────────────────────────────────────────────────────┐
│                   Reinforcement Learning Loop                 │
│                                                               │
│    ┌─────────┐        action         ┌─────────────────┐    │
│    │  Agent  │ ────────────────────► │   Environment   │    │
│    │         │ ◄──────────────────── │                 │    │
│    └─────────┘   state + reward      └─────────────────┘    │
│                                                               │
│    Agent learns policy π(state) → action to maximize         │
│    cumulative reward over time                                │
└──────────────────────────────────────────────────────────────┘

Applications:

  • Game playing (AlphaGo, Atari games, chess)
  • Robotics control (walking, manipulation)
  • Resource management (data center cooling, network routing)
  • Recommendation systems
  • Autonomous vehicle decision-making

Key algorithms:

  • Q-learning and Deep Q-Networks (DQN)
  • Policy gradient methods (PPO, A3C)
  • Model-based reinforcement learning
  • Multi-agent reinforcement learning
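
The loop in the diagram can be demonstrated with tabular Q-learning on a toy chain environment: states 0 to 4, actions left/right, reward 1 for reaching the last state. The environment and hyperparameters are invented for illustration:

```python
import random

def train_q(n_states=5, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning on a chain: move left/right, reward 1 at the goal."""
    random.seed(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]; 0=left, 1=right
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = max((0, 1), key=lambda x: q[s][x])
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Q-learning update: move q[s][a] toward reward + discounted future value
            q[s][a] += alpha * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
    return q

q = train_q()
policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(4)]
```

After training, the greedy policy moves right in every state: the agent has learned to maximize cumulative reward purely from trial and error.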

Semi-Supervised and Self-Supervised Learning

These approaches bridge labeled and unlabeled data:

Semi-supervised: Use a small amount of labeled data with a large amount of unlabeled data. Particularly valuable when labeling is expensive (medical imaging, specialized domains).

Self-supervised: Create labels from the data itself. The model predicts part of the input from other parts:

  • Predicting the next word in a sentence (how LLMs are trained)
  • Predicting masked patches in an image
  • Predicting future video frames from past frames

Self-supervised learning has become the dominant paradigm for training foundation models because it can leverage essentially unlimited unlabeled data.
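
The next-word objective can be shown in miniature with a bigram model, a drastically simplified stand-in for an LLM. Note that no human labeling happens: every adjacent word pair in the corpus is a free (input, label) example:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Self-supervised training: the next word IS the label."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Predict the most frequent continuation seen in training."""
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)
```

An LLM replaces the count table with a transformer conditioned on the entire preceding context, but the training signal is constructed the same way: from the data itself.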

Part 2: Deep Learning and Neural Networks

Deep learning is a subset of machine learning that uses neural networks with many layers (hence “deep”) to learn hierarchical representations of data.

Neural Network Architecture

A neural network transforms inputs through successive layers of computation:

┌───────────────────────────────────────────────────────────────────────────┐
│                        Neural Network Structure                            │
│                                                                            │
│   Input Layer        Hidden Layers (×N)           Output Layer            │
│                                                                            │
│      ○               ○──────○──────○                   ○                  │
│      │              ╱│╲    ╱│╲    ╱│╲                 ╱│╲                 │
│      ○             ○─┼─○──○─┼─○──○─┼─○               ○─┼─○                │
│      │              ╲│╱    ╲│╱    ╲│╱                 ╲│╱                 │
│      ○               ○──────○──────○                   ○                  │
│      │                                                                     │
│      ○             Each layer learns increasingly                          │
│                    abstract representations                                │
│  [raw input]       [features] [patterns] [concepts]   [prediction]        │
└───────────────────────────────────────────────────────────────────────────┘

Each neuron:

  1. Receives inputs from the previous layer
  2. Multiplies each input by a learned weight
  3. Sums the weighted inputs
  4. Applies a non-linear activation function
  5. Passes the result to the next layer

The weights are learned through backpropagation: compute the error between prediction and ground truth, then adjust weights throughout the network to reduce this error.
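
The error-driven weight update can be shown with a single sigmoid neuron learning the OR function by gradient descent. A real network applies the same idea across many layers via the chain rule; this sketch collapses it to one neuron:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Single neuron learning OR from labeled examples
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = [0.0, 0.0]
b = 0.0
lr = 1.0
for _ in range(2000):
    for (x1, x2), target in data:
        pred = sigmoid(w[0] * x1 + w[1] * x2 + b)
        err = pred - target            # error between prediction and ground truth
        # gradient step: adjust each weight in proportion to error x input
        w[0] -= lr * err * x1
        w[1] -= lr * err * x2
        b    -= lr * err

predictions = [round(sigmoid(w[0] * x1 + w[1] * x2 + b)) for (x1, x2), _ in data]
```

Each pass nudges the weights to reduce the error; after enough passes the neuron classifies all four inputs correctly.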

Deep Learning Architectures

Different architectures are optimized for different data types:

Convolutional Neural Networks (CNNs)

Designed for spatial data like images, where local patterns matter.

Image → [Convolution] → [Pooling] → [Convolution] → [Pooling] → [Dense] → Class

Convolution: Detect local patterns (edges, textures, shapes)
Pooling: Reduce spatial dimensions, increase invariance
Dense: Combine patterns into final classification

Applications:

  • Image classification and object detection
  • Medical image analysis
  • Satellite imagery analysis
  • Quality inspection in manufacturing
  • Facial recognition

Key models: ResNet, VGG, EfficientNet, YOLO (object detection), U-Net (segmentation)

Recurrent Neural Networks (RNNs)

Designed for sequential data where order matters, with connections that loop back on themselves to maintain memory.

┌─────────────────────────────────────────────────────────┐
│               RNN Processing Sequence                    │
│                                                          │
│   x₁ ──► [RNN] ──► h₁ ──► [RNN] ──► h₂ ──► [RNN] ──► h₃│
│            ▲               ▲               ▲            │
│            │               │               │            │
│           x₁              x₂              x₃            │
│                                                          │
│   Each step receives current input + previous hidden     │
│   state, maintaining memory of earlier inputs            │
└─────────────────────────────────────────────────────────┘

Variants:

  • LSTM (Long Short-Term Memory): Adds gates to control information flow
  • GRU (Gated Recurrent Unit): Simplified version of LSTM

Applications:

  • Time series forecasting
  • Speech recognition
  • Language modeling (before transformers)
  • Music generation

Limitation: Difficult to train on very long sequences due to vanishing gradients. Largely superseded by transformers for language tasks.

Transformers

The architecture behind modern LLMs, based on self-attention—the ability to weigh the importance of different parts of the input when processing each position.

┌───────────────────────────────────────────────────────────────────────┐
│                    Transformer Architecture                            │
│                                                                        │
│   Input: "The capital of France is ___"                               │
│                  ↓                                                      │
│         ┌─────────────┐                                                │
│         │ Embedding + │    Convert tokens to vectors                   │
│         │ Positional  │    Add position information                    │
│         └─────────────┘                                                │
│                  ↓                                                      │
│         ┌─────────────┐                                                │
│         │   Self-     │    Each token attends to all                   │
│         │  Attention  │    other tokens, learning relevance            │
│         └─────────────┘                                                │
│                  ↓                                                      │
│         ┌─────────────┐                                                │
│         │ Feed-Forward│    Transform representations                    │
│         │   Network   │                                                │
│         └─────────────┘                                                │
│                  ↓                                                      │
│           (repeat N×)      Stack many transformer layers               │
│                  ↓                                                      │
│         ┌─────────────┐                                                │
│         │   Output    │    Predict next token: "Paris"                 │
│         │ Projection  │                                                │
│         └─────────────┘                                                │
└───────────────────────────────────────────────────────────────────────┘

Why transformers dominate:

  • Process entire sequences in parallel (faster training than RNNs)
  • Self-attention captures long-range dependencies
  • Scale extremely well with more parameters and data
  • Transfer learning works exceptionally well

For a deeper dive into how transformers process text, see our guide on tokens and LLM inference.

Graph Neural Networks (GNNs)

Designed for data with graph structure—nodes connected by edges.

Applications:

  • Social network analysis
  • Molecular property prediction (atoms = nodes, bonds = edges)
  • Recommendation systems
  • Fraud detection in transaction networks
  • Traffic prediction on road networks

Key approaches: Graph Convolutional Networks (GCN), GraphSAGE, Graph Attention Networks (GAT)

Part 3: Large Language Models (LLMs)

LLMs represent a qualitative shift in AI capabilities. They’re transformer models trained on massive text corpora (hundreds of billions to trillions of tokens) to predict the next token in a sequence. Through this simple objective, they develop sophisticated language understanding and generation capabilities.

How LLMs Work

The training process:

  1. Collect vast amounts of text (web pages, books, code, etc.)
  2. Train the model to predict the next token given all previous tokens
  3. Scale up: more parameters, more data, more compute
  4. Fine-tune for specific tasks (instruction following, safety, etc.)

At inference, the model generates text one token at a time, sampling from its predicted probability distribution:

Input: "Write a haiku about programming"

Token 1: "Code" (sampled from distribution)
Token 2: " flows" (conditioned on "Code")
Token 3: " like" (conditioned on "Code flows")
...
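
One decoding step can be sketched as sampling from a softmax over logits. The logits below are made up; in a real LLM they come from the output projection over the entire vocabulary:

```python
import math
import random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample one token id from raw logits, like a single LLM decoding step."""
    scaled = [l / temperature for l in logits]       # temperature reshapes the distribution
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]                # softmax -> probabilities
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):                    # inverse-CDF sampling
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Low temperature sharpens the distribution toward the top logit
random.seed(0)
picks = [sample_token([2.0, 0.5, 0.1], temperature=0.1) for _ in range(100)]
```

At low temperature nearly every draw picks the highest-logit token; raising the temperature spreads probability across alternatives, which is the knob behind "more creative" generation settings.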

Major LLM Families

Model Family                  Developer         Key Characteristics
GPT-4.5 / GPT-4o              OpenAI            Strong reasoning, multimodal, 128K context
Claude (Opus 4.5, Sonnet 4)   Anthropic         Long context (200K), Constitutional AI training
Gemini 2.0                    Google DeepMind   Multimodal native, up to 2M context
Llama 4                       Meta              Open weights, strong performance/efficiency
Mistral / Mixtral             Mistral AI        Efficient, competitive open-source
Command R+                    Cohere            Optimized for RAG and enterprise use

Capabilities and Limitations

Emergent capabilities at scale:

  • Few-shot learning (learn from examples in the prompt)
  • Chain-of-thought reasoning (solving multi-step problems)
  • Instruction following
  • Code generation and debugging
  • Multilingual translation
  • Creative writing and summarization

Inherent limitations:

  • Hallucination: LLMs generate plausible-sounding but false information. They don’t “know” what they know—they predict likely tokens.
  • Knowledge cutoff: Training data has a fixed date; models don’t know recent events.
  • No persistent memory: Each conversation starts fresh unless external memory or retrieval tools are added.
  • Reasoning limits: Despite chain-of-thought, they struggle with complex logic, math, and multi-step planning.
  • Context window: Limited to processing a fixed amount of text at once.

LLM Training Phases

Modern LLMs go through multiple training stages:

1. Pretraining: Self-supervised learning on massive text corpora. The model learns language patterns, factual knowledge, and reasoning abilities.

2. Supervised Fine-Tuning (SFT): Training on curated examples of good behavior—following instructions, refusing harmful requests, providing helpful responses.

3. Reinforcement Learning from Human Feedback (RLHF): Humans rank model outputs. A reward model learns these preferences. The LLM is then trained to maximize this reward.

Pretraining          SFT              RLHF
    │                 │                 │
    ▼                 ▼                 ▼
[Raw language] → [Helpful/harmless] → [Aligned with preferences]
  capability        behavior           quality

Alternative approaches like Constitutional AI (Anthropic) and Direct Preference Optimization (DPO) achieve similar goals through different mechanisms.

Part 4: Multimodal AI

Multimodal models process and generate multiple types of data—text, images, audio, video—in an integrated way.

Vision-Language Models

These models understand and generate both text and images:

Architecture approaches:

┌─────────────────────────────────────────────────────────────────┐
│               Vision-Language Model Architectures                │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Dual Encoder (CLIP-style)         Unified Transformer          │
│  ┌───────┐    ┌───────┐           ┌───────────────────────┐    │
│  │ Image │    │ Text  │           │  Image    Text         │    │
│  │Encoder│    │Encoder│           │  tokens + tokens       │    │
│  └───┬───┘    └───┬───┘           │         ↓              │    │
│      │            │               │    [Transformer]       │    │
│      └────────────┘               │         ↓              │    │
│            ↓                      │  Joint understanding   │    │
│    Compare embeddings             └───────────────────────┘    │
│    (similarity search)                                          │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

CLIP (Contrastive Language-Image Pretraining): Learns to align image and text representations. Given an image, find the matching text description, and vice versa. Powers image search, zero-shot classification, and many creative applications.
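
Retrieval with CLIP-style embeddings reduces to cosine similarity in the shared space. The 3-D vectors below are invented stand-ins for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def best_match(image_emb, text_embs):
    """Return the caption whose embedding is most similar to the image's."""
    return max(text_embs, key=lambda name: cosine(image_emb, text_embs[name]))

# Invented embeddings standing in for real CLIP encoder outputs
captions = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a city skyline":   [0.0, 0.1, 0.9],
}
match = best_match([0.8, 0.2, 0.1], captions)
```

Zero-shot classification works the same way: embed a caption per candidate class and pick the nearest one, no task-specific training required.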

Vision transformers (ViT): Apply transformer architecture directly to images by splitting them into patches and treating patches as tokens.

Multimodal LLMs (GPT-4V, Claude 3, Gemini): Full language models that can also see images. Capabilities include:

  • Image understanding and description
  • Visual question answering
  • Document and chart analysis
  • Reasoning about visual scenes
  • Generating text grounded in images

Text-to-Image Generation

Models that create images from text descriptions:

Diffusion models (Stable Diffusion, DALL-E 3, Midjourney): Learn to reverse a process that gradually adds noise to images. Generation works by starting from pure noise and iteratively denoising it, guided by the text prompt.

Pure noise → [Denoise step 1] → [Denoise step 2] → ... → Final image
                    ↑                  ↑
              Text embedding guides each step

Key capabilities:

  • Generate images from detailed text descriptions
  • Style transfer and artistic rendering
  • Inpainting (fill in missing parts of images)
  • Image editing guided by text

Audio and Speech Models

Speech recognition (ASR): Convert audio to text. Modern systems like OpenAI’s Whisper achieve near-human accuracy across many languages.

Text-to-speech (TTS): Generate natural-sounding speech from text. Models like ElevenLabs and OpenAI’s TTS produce remarkably human-like voices.

Music generation: Models like Google’s MusicLM and Suno generate music from text descriptions.

Audio understanding: Models that can analyze and describe audio content, transcribe speech, and answer questions about sounds.

Video Models

Video presents unique challenges—temporal consistency, motion understanding, and massive data requirements.

Video understanding: Analyze video content, answer questions, generate descriptions. Models extend image understanding with temporal processing.

Video generation: Generate video from text (Sora, Runway Gen-3, Kling). Current models produce impressive short clips but struggle with long-form consistency.

Key challenges:

  • Maintaining consistency across frames
  • Physically plausible motion
  • Computational cost (video = many images)
  • Training data quality and quantity

Part 5: World Models and Embodied AI

World models represent an emerging paradigm aimed at systems that understand how the physical world works—not just patterns in data, but causal relationships, physics, and the consequences of actions.

What Are World Models?

A world model learns an internal representation of the environment that can:

  • Predict future states given current state and actions
  • Simulate hypothetical scenarios (“what if?”)
  • Plan by reasoning about consequences before acting

┌─────────────────────────────────────────────────────────────────────┐
│                     World Model Architecture                         │
│                                                                      │
│   ┌─────────┐    ┌─────────────┐    ┌─────────────┐                │
│   │ Current │    │   World     │    │  Predicted  │                │
│   │  State  │ →  │   Model     │  → │   Future    │                │
│   │ + Action│    │ (Simulator) │    │   States    │                │
│   └─────────┘    └─────────────┘    └─────────────┘                │
│                                            │                        │
│                                            ▼                        │
│                                     ┌─────────────┐                │
│                                     │   Planner   │                │
│                                     │ (Choose best│                │
│                                     │   action)   │                │
│                                     └─────────────┘                │
└─────────────────────────────────────────────────────────────────────┘
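
The predict-then-plan loop can be sketched with a hand-written dynamics function standing in for the learned model. In a real world model, `predict` is learned from data; the toy actions and goal here are invented:

```python
def predict(state, action):
    """Stand-in for a learned dynamics model: velocity changes, position follows."""
    pos, vel = state
    vel = vel + (1 if action == "accelerate" else -1)
    return (pos + vel, vel)

def plan(state, goal, actions=("accelerate", "brake"), depth=3):
    """Choose the first action of the imagined rollout ending closest to the goal."""
    best = None

    def rollout(s, seq):
        nonlocal best
        if len(seq) == depth:
            score = -abs(s[0] - goal)          # closer to goal = better
            if best is None or score > best[0]:
                best = (score, seq[0])
            return
        for a in actions:                      # imagine each action, recurse
            rollout(predict(s, a), seq + [a])

    rollout(state, [])
    return best[1]

first_action = plan(state=(0, 0), goal=10)
```

The key property is that the agent evaluates consequences entirely in simulation before acting, which is what separates world-model planning from reactive policies.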

Current Research Directions

Video prediction as world modeling: Models like DeepMind’s Genie and OpenAI’s work on video models learn to predict future video frames, implicitly learning physics and dynamics.

Interactive simulation: World models for games (Dreamer, IRIS) learn to simulate game environments, enabling agents to plan by imagining future scenarios.

Robotics foundation models: Systems that learn generalizable world knowledge applicable across different robots and tasks.

Embodied AI

Embodied AI concerns systems that interact with the physical world through robotic bodies or simulated environments.

Key challenges:

  • Grounding language in physical action
  • Generalizing across different environments
  • Real-time decision making with partial information
  • Safety in physical interaction

Notable systems:

  • Google’s RT-2 and RT-X: Vision-language-action models for robotics
  • Tesla’s Optimus: Humanoid robot leveraging self-driving AI
  • Figure and 1X: Humanoid robots with LLM integration

The intersection of LLMs and robotics—models that can reason in language and act in the physical world—represents a major frontier in AI research.

Part 6: Specialized AI Systems

Beyond the major paradigms, several specialized approaches deserve attention:

Recommender Systems

Systems that predict user preferences and suggest relevant content.

Approaches:

  • Collaborative filtering: Find similar users, recommend what they liked
  • Content-based filtering: Recommend items similar to past preferences
  • Hybrid systems: Combine multiple approaches
  • Deep learning methods: Neural networks that learn complex patterns

Powers Netflix, Spotify, Amazon, YouTube, and most personalized experiences online.
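
User-based collaborative filtering in miniature: find the most similar user by cosine similarity over their ratings, then suggest what that user rated that the target hasn't seen. The users and ratings are invented:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(target, ratings):
    """Recommend unseen items from the most similar other user."""
    others = {u: r for u, r in ratings.items() if u != target}
    nearest = max(others, key=lambda u: cosine(ratings[target], others[u]))
    seen = set(ratings[target])
    return [item for item in ratings[nearest] if item not in seen]

ratings = {
    "ana":   {"matrix": 5, "inception": 4, "alien": 5},
    "ben":   {"matrix": 5, "inception": 5, "blade runner": 4},
    "carol": {"notebook": 5, "titanic": 4},
}
suggestions = recommend("ana", ratings)
```

Production systems replace the explicit similarity search with learned embeddings and approximate nearest-neighbor indexes, but the intuition is the same: people with similar histories tend to like similar things.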

Anomaly Detection

Systems that identify unusual patterns—fraud, defects, security threats.

Methods:

  • Statistical approaches (deviations from normal distributions)
  • Machine learning classifiers
  • Autoencoders (learn normal patterns, flag reconstruction errors)
  • Isolation forests

Critical for fraud detection, cybersecurity, quality control, and predictive maintenance.
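
The statistical approach in its simplest form: flag values more than a few standard deviations from the mean. The transaction amounts are invented; note that real systems often prefer robust statistics (median/MAD) because, as happens here, the outlier itself inflates the mean and standard deviation:

```python
import math

def zscore_anomalies(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [v for v in values if abs(v - mean) > threshold * std]

# Transaction amounts with one obvious outlier
amounts = [21.0, 19.5, 20.3, 20.1, 19.8, 20.6, 500.0]
flagged = zscore_anomalies(amounts, threshold=2.0)
```

The same pattern generalizes: learn what "normal" looks like, then flag whatever deviates too far, whether the model of normal is a mean, a distribution, or an autoencoder's reconstruction.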

Autonomous Systems

Systems that operate independently in complex environments:

Self-driving vehicles: Combine perception (cameras, LiDAR, radar), prediction (what will other agents do?), planning (choose safe path), and control (execute maneuvers).

Drone systems: Navigation, obstacle avoidance, mission planning.

Industrial automation: Robotic arms, warehouse robots, manufacturing systems.

Generative AI

Beyond text and images, generative AI creates:

  • 3D models: Neural radiance fields (NeRF), Gaussian splatting
  • Code: GitHub Copilot, Claude, specialized code models
  • Scientific data: Protein structures (AlphaFold), molecule design
  • Synthetic data: Training data for other AI systems

Part 7: Choosing the Right Approach

With this taxonomy in mind, how do you choose the right AI approach for a given problem?

Decision Framework

┌─────────────────────────────────────────────────────────────────────────┐
│                     Choosing an AI Approach                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  1. Do you need explainability and guaranteed behavior?                  │
│     YES → Consider symbolic AI / rules-based systems                     │
│                                                                          │
│  2. Do you have labeled training data?                                   │
│     YES → Supervised learning (classification, regression)               │
│     NO  → Unsupervised learning or self-supervised pretrained models     │
│                                                                          │
│  3. Is the problem sequential decision-making?                           │
│     YES → Reinforcement learning                                         │
│                                                                          │
│  4. What type of data?                                                   │
│     • Tabular → Gradient boosting (XGBoost, LightGBM)                   │
│     • Images → CNNs or vision transformers                               │
│     • Text → Transformers / LLMs                                         │
│     • Sequences → Transformers (preferred) or RNNs                       │
│     • Graphs → Graph neural networks                                     │
│     • Multiple modalities → Multimodal models                            │
│                                                                          │
│  5. Do you need generation (text, images, etc.)?                        │
│     YES → Generative models (LLMs, diffusion models, etc.)               │
│                                                                          │
│  6. Is there a capable pretrained model?                                 │
│     YES → Fine-tune or use via API (usually faster and better)           │
│     NO  → Train from scratch (requires significant resources)            │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Practical Considerations

Start simple: Begin with the simplest approach that might work. Logistic regression before neural networks. GPT-4o before custom training. You can always add complexity.

Consider the data: The amount and quality of data often matters more than algorithm sophistication. Data quality is the make-or-break factor in most AI projects.

Account for costs: Different approaches have vastly different compute requirements. A fine-tuned smaller model might be more practical than a larger one, even if slightly less capable. Understand the hidden costs of AI projects before committing.

Evaluate build vs. buy: For many applications, using existing models via APIs is faster and cheaper than training custom systems. See our analysis on build vs. buy decisions for AI.

Plan for integration: The AI model is rarely the hard part—integrating it into production systems is. Consider deployment, monitoring, and maintenance from the start.

The State of AI in 2026

The field is moving remarkably fast. A few observations on the current state:

LLMs have crossed a capability threshold. They’re now genuinely useful for a wide range of tasks—writing, coding, analysis, research assistance. The practical challenge is often prompt engineering and system design rather than fundamental capability. Our guide on prompt engineering patterns covers production-ready techniques.

Multimodal is becoming standard. The distinction between “vision models” and “language models” is blurring. Leading models natively handle text, images, and increasingly audio and video.

Open-source is competitive. Models like Llama 4 and Mistral’s offerings provide capabilities that rival proprietary systems for many applications, enabling self-hosting and customization.

Reasoning remains a frontier. Despite impressive demonstrations, reliable multi-step reasoning, planning, and factual accuracy remain challenging. Systems that combine LLMs with retrieval (RAG), tools, and structured reasoning are bridging some gaps. Our RAG pipeline design guide covers these hybrid approaches.

World models are nascent. The vision of AI systems that truly understand the physical world and can plan effectively within it remains largely aspirational. Progress is happening but practical applications are limited.

Conclusion

Artificial intelligence encompasses a diverse range of approaches, from rule-based expert systems to trillion-parameter language models. Understanding this taxonomy helps you:

  • Evaluate claims: When someone says “AI,” you can ask what kind and whether it’s appropriate for the problem
  • Choose approaches: Match the right technique to your problem’s characteristics
  • Set expectations: Different AI types have different capabilities and limitations
  • Stay oriented: As the field evolves, you have a framework for understanding where new developments fit

The most effective practitioners understand not just the latest models but the full landscape of techniques—knowing when a simple decision tree beats a neural network, when to fine-tune versus prompt, and when AI might not be the answer at all.

If you’re ready to apply these concepts, our guide on when AI makes sense for your business can help you identify high-value opportunities. For hands-on implementation, building your first AI proof of concept provides a practical starting point.

