Want to run modern LLMs locally — with an OpenAI-compatible API, multimodal support, and strong performance on Apple Silicon? This beginner-friendly guide walks you through everything from installation to your first inference request.
No prior ML experience required.
What is vllm-mlx?
vllm-mlx is a community-driven inference server built specifically for Apple Silicon Macs. It uses MLX, Apple’s machine learning framework designed for M-series chips, and exposes an OpenAI-compatible HTTP API so you can drop it in wherever you’d use the OpenAI SDK.
Think of it as a full, self-contained AI server stack that runs entirely on your Mac.
How does it differ from official vLLM?
| Feature | vLLM (official) | vllm-mlx |
|---|---|---|
| Backend | CUDA (NVIDIA GPUs) | MLX (Apple Silicon) |
| Platform | Linux + NVIDIA | macOS + Apple Silicon |
| Multimodal support | Limited | Built-in (vision, audio, embeddings) |
| API compatibility | OpenAI | OpenAI + Anthropic |
| Architecture | Plugin-based | Standalone framework |
| Built on | vLLM engine internals | mlx-lm, mlx-vlm |
Important: vllm-mlx is not a plugin or fork of official vLLM. It’s a separate framework built from the ground up for Macs.
Why use vllm-mlx?
It’s the right tool if you want:
- A full-featured local AI server on Apple Silicon
- Text and multimodal inference in a single server
- OpenAI-compatible APIs out of the box
- Fully offline inference — no cloud, no data leaving your machine
System requirements
- macOS with Apple Silicon (M1/M2/M3/M4)
- Python 3.10+
- 16 GB RAM minimum recommended (larger models require more)
Step 1 — Create a clean Python environment
Never install ML tooling into your global Python. Use an isolated virtual environment:
python3 -m venv ~/.venv-vllm-mlx
source ~/.venv-vllm-mlx/bin/activate
Once activated, your shell prompt should change to something like:
(.venv-vllm-mlx) yourname@macbook %
Alternatively, with virtualenv:
virtualenv venv-vllm-mlx
source venv-vllm-mlx/bin/activate
Step 2 — Install vllm-mlx
pip install vllm-mlx
Verify the installation:
pip list | grep vllm
You should see vllm-mlx in the output.
Step 3 — Start your first model server
We’ll use a 4-bit quantized Llama 3.2 model — small, fast, and a good starting point.
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8010
This command will:
- Download the model from HuggingFace (first run only)
- Load it into the MLX backend
- Start an HTTP API server on port 8010
You’ll see log output showing the model loading and the server starting on 0.0.0.0:8010.
Step 4 — Verify the server
Health check
curl -s http://localhost:8010/health | jq .
Expected output:
{
"status": "healthy",
"model_loaded": true,
"model_name": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"model_type": "llm",
"engine_type": "simple",
"mcp": null
}
List available models
curl -s http://localhost:8010/v1/models | jq .
Expected output:
{
"object": "list",
"data": [
{
"id": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"object": "model",
"created": 1772701579,
"owned_by": "vllm-mlx"
}
]
}
Step 5 — Send a chat request
Use the OpenAI-compatible /v1/chat/completions endpoint:
curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
"model": "mlx-community/Llama-3.2-3B-Instruct-4bit",
"messages": [
{"role": "user", "content": "Hello! What is the capital of Greece?"}
],
"max_tokens": 100
}' | jq .
Expected response:
{
"id": "...",
"object": "chat.completion",
"choices": [
{
"message": {
"role": "assistant",
"content": "The capital of Greece is Athens."
}
}
]
}
You’re now running a local LLM server on your Mac.
Running larger models (advanced)
For high-memory Macs (64 GB+ recommended), you can run much larger models with additional flags:
vllm-mlx serve Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 \
  --port 8010 \
  --max-tokens 262144 \
  --reasoning-parser qwen3
| Flag | Purpose |
|---|---|
| --max-tokens 262144 | Sets a large context window (256k tokens) |
| --reasoning-parser qwen3 | Enables Qwen-specific reasoning output format |
What you can do next
With your local server running, you can connect it to the broader AI tooling ecosystem by pointing any OpenAI-compatible client at http://localhost:8010/v1:
- Open WebUI — browser-based chat UI
- LangChain or LlamaIndex — agent and RAG pipelines
- OpenAI Python SDK — just set base_url="http://localhost:8010/v1"
- Embeddings and multimodal models — swap in a different model and the same API applies
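For example, here is a minimal client sketch using only the Python standard library — the OpenAI SDK version is the same idea with base_url set on the client. The model name and port assume the Llama 3.2 server from Step 3 is running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8010/v1"  # the vllm-mlx server from Step 3

def chat(prompt: str, model: str = "mlx-community/Llama-3.2-3B-Instruct-4bit") -> str:
    """Send one chat turn to the local server and return the reply text."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 100,
    }
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    # The reply lives at choices[0].message.content, as in the curl example above.
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat("What is the capital of Greece?"))
```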
Architecture overview
When you run vllm-mlx serve, you get a layered system:

Your App (curl / SDK / WebUI)
↓
OpenAI-Compatible API Layer
/v1/chat/completions, /v1/models, /health, ...
↓
vllm-mlx Core Server
Request validation, tokenization, generation loop,
streaming, multimodal routing, embeddings
↓
MLX Model Layer
Quantized model weights, forward passes,
Apple GPU acceleration, unified memory management
↓
Apple Silicon Hardware
M-series GPU + CPU sharing the same memory pool
Why Apple Silicon works so well here
On a discrete GPU setup (NVIDIA), model weights must be copied over PCIe from system RAM to VRAM before inference can begin. Apple Silicon eliminates this bottleneck entirely — the CPU and GPU share the same unified memory pool. Combined with Apple’s high memory bandwidth, this makes MLX extremely efficient for inference on models that fit in RAM.
Multimodal routing
When using a vision or audio model, the server adds an extra routing step:
Image / Audio input
↓
Multimodal Router (mlx-vlm / audio pipeline)
↓
LLM reasoning
↓
Text output
No additional services are required — it’s built into the same server process.
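As a sketch of what a vision request could look like — this assumes vllm-mlx accepts the standard OpenAI image_url message format with an inline base64 data URI, which this guide has not verified, and the model name is a placeholder:

```python
import base64
import json

def vision_payload(model: str, question: str, image_bytes: bytes,
                   mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat payload with an inline base64-encoded image."""
    data_uri = f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            # Multimodal messages use a list of typed content parts.
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_uri}},
            ],
        }],
    }

# POST json.dumps(vision_payload(...)) to /v1/chat/completions exactly as in Step 5.
```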
How vllm-mlx differs from official vLLM under the hood
Official vLLM: App → vLLM Engine → CUDA kernels → NVIDIA GPU
vllm-mlx: App → vllm-mlx Server → MLX tensors → Apple GPU
These are entirely different acceleration stacks. vllm-mlx doesn’t use or depend on any CUDA code.
That’s it. A local, fully offline, OpenAI-compatible LLM server running natively on your Mac.
🚀 Curious about trying out a Large Language Model (LLM) like Mistral directly on your own MacBook?
Here’s a simple step-by-step guide I used on my MacBook M1 Pro. No advanced skills required beyond basic command-line familiarity. Just follow the commands and you’ll be chatting with an AI model in no time.
🧰 What We’ll Need
- LLM → a command-line tool and Python library that makes it easy to install and run Large Language Models.
- Mistral → a modern open-source language model you can run locally.
- Python virtual environment → a safe “sandbox” where we install the tools without messing with the rest of the system.
- MacBook → All Apple Silicon MacBooks (M1, M2, M3, M4 chips) feature an integrated GPU on the same chip as the CPU.
🧑‍🔬 About Mistral 7B
Mistral 7B is a 7-billion parameter large language model, trained to be fast, efficient, and good at following instructions.
Technical requirements (approximate):
- Full precision model (FP16) → ~13–14 GB of RAM (fits best on a server or high-end GPU).
- Quantized model (4-bit, like the one we use here) → ~4 GB of RAM, which makes it practical for a MacBook or laptop.
- Disk storage → the 4-bit model download is around 4–5 GB.
- CPU/GPU → runs on Apple Silicon (M1/M2/M3/M4) CPUs and GPUs thanks to the MLX library. Note that MLX requires Apple Silicon, so this setup won’t work on Intel Macs.
👉 In short:
With the 4-bit quantized version, you can run Mistral smoothly on a modern MacBook with 8 GB RAM or more. The more memory and cores you have, the faster it runs.
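Those figures fall out of simple arithmetic: parameters times bytes per parameter. A quick sanity check (weights only — the running process adds tokenizer, cache, and runtime overhead on top):

```python
# Back-of-the-envelope weight memory for a 7-billion-parameter model.
PARAMS = 7_000_000_000

def weight_gb(bits_per_param: float) -> float:
    """Approximate weight size in GiB for a given numeric precision."""
    return PARAMS * bits_per_param / 8 / 2**30

fp16_gb = weight_gb(16)     # ≈ 13.0 — matches the ~13–14 GB figure above
four_bit_gb = weight_gb(4)  # ≈ 3.3 — matches the ~4 GB figure above
```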
⚙️ Step 1: Create a Virtual Environment
We’ll create a clean workspace just for this project.
python3 -m venv ~/.venvs/llm
source ~/.venvs/llm/bin/activate
👉 What happens here:
- python3 -m venv creates a new isolated environment named llm.
- source .../activate switches you into that environment, so all installs stay inside it.
📦 Step 2: Install the LLM Tool
Now, let’s install LLM.
pip install -U llm
👉 This gives us the llm command we’ll use to talk to models.
🛠️ Step 3: Install Extra Dependencies
Mistral needs a few extra packages:
pip install mlx
pip install sentencepiece
👉 mlx is Apple’s library that helps models run efficiently on Mac.
👉 sentencepiece helps the model break down text into tokens (words/pieces).
🔌 Step 4: Install the Mistral Plugin
We now connect LLM with Mistral:
llm install llm-mlx
👉 This installs the llm-mlx plugin, which allows LLM to use Mistral models via Apple’s MLX framework.
Verify the plugin is installed:
llm plugins
The output should look like this:
[
{
"name": "llm-mlx",
"hooks": [
"register_commands",
"register_models"
],
"version": "0.4"
}
]
⬇️ Step 5: Download the Model
Now for the fun part — downloading Mistral 7B.
llm mlx download-model mlx-community/Mistral-7B-Instruct-v0.3-4bit
👉 This pulls down the model from the community in a compressed, 4-bit version (smaller and faster to run on laptops).
Verify the model is on your system:
llm models | grep -i mistral
The output should be something similar to this:
MlxModel: mlx-community/Mistral-7B-Instruct-v0.3-4bit (aliases: m7)
🏷️ Step 6: Set a Shortcut (Alias)
Typing the full model name is long and annoying. Let’s create a shortcut:
llm aliases set m7 mlx-community/Mistral-7B-Instruct-v0.3-4bit
👉 From now on, we can just use -m m7 instead of the full model name.
💡 Step 7: One last thing
macOS ships with LibreSSL rather than OpenSSL, and newer urllib3 releases (v2+) require OpenSSL. If you are on the system LibreSSL, pin an older urllib3:
pip install "urllib3<2"
If you use Homebrew, you can install OpenSSL instead and skip the pin:
brew install openssl@3
💬 Step 8: Ask Your First Question
Time to chat with Mistral!
llm -m m7 'Capital of Greece ?'
👉 Expected result:
The model should respond with:
Athens
🎉 Congratulations — you’ve just run a powerful AI model locally on your Mac!
👨‍💻 A More Technical Example
Mistral isn’t only for trivia — it can help with real command-line tasks too.
For example, let’s ask it something more advanced:
llm -m m7 'On Arch Linux, give only the bash command using find
that lists files in the current directory larger than 1 GB,
do not cross filesystem boundaries. Output file sizes in
human-readable format with GB units along with the file paths.
Return only the command.'
👉 Mistral responds with:
find . -type f -size +1G -exec du -sh {} +
💡 What this does:
- find . -type f -size +1G → finds regular files bigger than 1 GB under the current directory.
- -exec du -sh {} + → runs du on batches of matched files, printing each size in human-readable form next to its path.
This is the kind of real-world productivity boost you get by running models locally.
Full text example output:
This command will find all files (-type f) larger than 1 GB (-size +1G) in the current directory (.) and execute the du -sh command on each file to display the file size in a human-readable format (-h). The + after -exec tells find to execute the command once for each batch of found files, instead of once per file.
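One nitpick: the prompt asked find not to cross filesystem boundaries, but the generated command omits that constraint. Adding -xdev (supported by both GNU and BSD find) covers it:

```shell
# Same search, restricted to the starting filesystem with -xdev,
# as the original prompt requested.
find . -xdev -type f -size +1G -exec du -sh {} +
```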
🌟 Why This Is Cool
- 🔒 No internet needed once the model is downloaded.
- 🕵️ Privacy: your text never leaves your laptop.
- 🧪 Flexible: you can try different open-source models, not just Mistral.
- ⚡ Tradeoff: it won’t be as fast as running in the cloud.
That’s it!
PS. These are my personal notes from my home lab; AI was used to structure and format the final version of this blog post.