ASTGL Definitive Answers

How Do I Set Up Ollama on Mac, Windows, and Linux?

James Cruce

Running AI locally means no API bills, no data leaving your machine, and no rate limits. Ollama makes that possible in under 10 minutes on any platform.

Here’s how to install, configure, and optimize Ollama on Mac, Windows, and Linux — plus the configuration I use to run 26 automated AI tasks daily on a Mac Studio.

The Short Answer

Ollama is a free tool that runs large language models locally. Install it, pull a model, and you have a fully functional AI running on your hardware with zero cloud dependencies.

Platform | Install Method   | GPU Support                    | Best For
macOS    | .dmg or Homebrew | Apple Silicon (unified memory) | Best overall experience — Metal acceleration, huge memory ceilings
Windows  | .exe installer   | NVIDIA CUDA                    | Good for gaming rigs with big GPUs
Linux    | curl one-liner   | NVIDIA CUDA, AMD ROCm          | Best for servers, headless setups, Docker

Installation: Platform by Platform

Ollama installation path by platform

macOS

Option A: Direct download

  1. Go to ollama.com and download the macOS installer
  2. Open the .dmg and drag Ollama to Applications
  3. Launch Ollama — a menu bar icon appears

Option B: Homebrew

brew install ollama
ollama serve   # Start the server

Pull your first model:

ollama pull gemma3
ollama run gemma3

That’s it. You’re running a local AI.

Apple Silicon note: If you have an M1/M2/M3/M4 Mac, Ollama automatically uses Metal acceleration and unified memory. A MacBook Air with 16 GB can comfortably run 7-8B parameter models. A Mac Studio with 192-512 GB can run the largest open models available.

Windows

  1. Download the Windows installer from ollama.com
  2. Run the installer — Ollama installs as a Windows service
  3. Open PowerShell or Command Prompt:
ollama pull gemma3
ollama run gemma3

NVIDIA GPU: Detected automatically if CUDA drivers are installed. Check with nvidia-smi in a terminal. If your GPU shows up there, Ollama will use it.

No GPU? Ollama falls back to CPU. It works, but responses will be slower. Stick to smaller models (3-7B parameters) on CPU-only systems.

Linux

One-line install:

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama and creates a systemd service that starts automatically.

ollama pull gemma3
ollama run gemma3

GPU setup:

  • NVIDIA: Install CUDA drivers first (nvidia-driver-xxx package), then install Ollama. It detects the GPU automatically.
  • AMD: ROCm support is available. Install ROCm drivers, then Ollama picks them up.

Headless/server: Ollama runs perfectly without a desktop environment. The API listens on localhost:11434 by default.

Choosing Your First Models

Don’t overthink this. Start with one general-purpose model and add specialized ones later.

Model           | Size   | Good For                        | Memory Needed
gemma3:4b       | 2.8 GB | Quick tasks, low-memory systems | 8 GB RAM
gemma3:12b      | 8 GB   | General purpose, good quality   | 16 GB RAM
gemma4:26b      | 17 GB  | High quality, complex reasoning | 32 GB RAM
qwen3:8b        | 5 GB   | Coding, tool calling            | 16 GB RAM
qwen3-coder:32b | 20 GB  | Serious coding tasks            | 48 GB RAM
llama3.3:70b    | 43 GB  | Maximum quality, research       | 96+ GB RAM

Rule of thumb: You need roughly 1.2x the model file size in available memory. If a model is 17 GB, you need at least 20 GB free.
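That rule of thumb is simple enough to script. A quick sketch (the 1.2x overhead factor is this article's estimate, not an exact figure):

```shell
# Estimate memory needed for a model: file size x 1.2 overhead
required_gb() {
  awk -v size="$1" 'BEGIN { printf "%.1f\n", size * 1.2 }'
}

required_gb 17    # the 17 GB model from the table
required_gb 43    # llama3.3:70b
```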

# Pull a model
ollama pull gemma4:26b

# List installed models
ollama list

# Remove a model
ollama rm gemma3:4b

Configuration for Real Use

The defaults work fine for casual use. For daily automation or development, tune these settings.

Environment Variables

Set these in your shell profile (.zshrc, .bashrc) or systemd service file:

# Where models are stored (default: ~/.ollama/models)
export OLLAMA_MODELS="/path/to/models"

# Listen on all interfaces (for network access)
export OLLAMA_HOST="0.0.0.0:11434"

# Keep models in memory longer (default: 5m)
export OLLAMA_KEEP_ALIVE="24h"

# Max models loaded simultaneously
export OLLAMA_MAX_LOADED_MODELS=3

# GPU layers (0 = CPU only, -1 = all layers on GPU).
# Note: num_gpu is a per-model option, set via a Modelfile
# PARAMETER or the API "options" field, not an environment variable.

The API

Ollama serves a REST API on localhost:11434. Every tool that supports Ollama — Claude Desktop, Open WebUI, Continue, LangChain — connects through this API.

# Quick test
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:26b",
  "prompt": "What is MCP?",
  "stream": false
}'

# Chat format (most common)
curl http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "messages": [{"role": "user", "content": "What is MCP?"}],
  "stream": false
}'

How a client app communicates with Ollama's REST API
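Scripts usually want just the reply text out of that JSON. One way to extract it, with a trimmed sample response inlined for illustration (the field names match the /api/chat response shape; the content string here is made up):

```shell
# A trimmed /api/chat response, inlined so the parsing step is visible
response='{"model":"gemma4:26b","message":{"role":"assistant","content":"MCP is the Model Context Protocol."},"done":true}'

# Pull out the assistant text; python3 avoids a jq dependency
echo "$response" | python3 -c 'import sys, json; print(json.load(sys.stdin)["message"]["content"])'
```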

Modelfiles (Custom Configurations)

Create a Modelfile to customize model behavior:

FROM gemma4:26b

PARAMETER temperature 0.3
PARAMETER num_ctx 8192

SYSTEM """You are a technical writing assistant. Write clearly and concisely. Use active voice. Avoid jargon unless the audience is technical."""

Build and run it:

ollama create my-writer -f Modelfile
ollama run my-writer

This lets you create task-specific variants without downloading the same model multiple times.
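For example, a second variant for code review can share the same base weights (the name my-reviewer and its system prompt are just illustrations):

```shell
# Write a second Modelfile; FROM reuses the already-downloaded weights
cat > Modelfile.reviewer <<'EOF'
FROM gemma4:26b

PARAMETER temperature 0.1
PARAMETER num_ctx 16384

SYSTEM """You are a strict code reviewer. Point out bugs, risky patterns, and missing tests."""
EOF

# Then register and run it:
#   ollama create my-reviewer -f Modelfile.reviewer
#   ollama run my-reviewer
```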

How I Actually Do This

I run Ollama on a Mac Studio M3 Ultra with 256 GB of unified memory. Here’s what my production setup looks like:

Models Pinned in VRAM

I keep multiple models loaded permanently — no cold starts, instant responses:

# Keep models alive indefinitely
export OLLAMA_KEEP_ALIVE=-1
export OLLAMA_MAX_LOADED_MODELS=4

Model            | Role            | Why This One
gemma4:e4b       | Triage/routing  | Fast, cheap, handles notification sorting and simple classification
gemma4:26b       | Daily workhorse | Balanced quality — handles 80% of tasks including research, drafting, analysis
qwen3-coder:fast | Code tasks      | Optimized for code generation, tool calling, structured output
gemma4:31b       | Heavy reasoning | Complex analysis, long documents, multi-step reasoning

Automated Task Runner

OpenClaw (my local AI gateway) connects to Ollama’s API and runs 26 scheduled tasks:

  • 6:00 AM — Research pipeline hits web sources, summarizes findings
  • 6:15 AM — Log review scans overnight system logs for anomalies
  • 6:30 AM — Morning briefing synthesizes calendar + priorities + news
  • Every hour — Notification batching across all channels

Total cloud API cost: $0/month. Everything runs locally.
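Any scheduler works for this. A plain crontab covering the schedule above would look roughly like this (the script names are placeholders for jobs that call the local API):

```
# m  h   dom mon dow   command
0    6   *   *   *     /usr/local/bin/research-pipeline.sh
15   6   *   *   *     /usr/local/bin/log-review.sh
30   6   *   *   *     /usr/local/bin/morning-briefing.sh
0    *   *   *   *     /usr/local/bin/notification-batch.sh
```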

Performance Tuning

For Apple Silicon specifically:

  • Set num_gpu to -1 (all layers on GPU) — Apple’s unified memory means the GPU can access all 256 GB
  • Set num_ctx based on task — 4096 for quick tasks, 16384 for long documents, 32768 for code analysis
  • Monitor with ollama ps to see which models are loaded and how much memory they use

# Check what's running
ollama ps

# Example output:
# NAME              SIZE    PROCESSOR  UNTIL
# gemma4:26b        17 GB   100% GPU   Forever
# gemma4:e4b        3 GB    100% GPU   Forever
# qwen3-coder:fast  5 GB    100% GPU   Forever
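num_ctx doesn't have to be baked into a Modelfile; the API accepts it per request through the options field. A sketch (the prompt is a placeholder, and the payload is validated locally here rather than sent, since sending needs a running server):

```shell
# Per-request options override the model's defaults
payload='{"model": "gemma4:26b", "prompt": "Summarize the attached report.", "options": {"num_ctx": 16384}, "stream": false}'

# With the server up, send it:
#   curl -s http://localhost:11434/api/generate -d "$payload"

# Validate the JSON before sending
echo "$payload" | python3 -m json.tool
```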

Troubleshooting Common Issues

Problem                  | Cause                                | Fix
Model runs slowly        | Not enough GPU memory, layers on CPU | Check ollama ps — if PROCESSOR shows “CPU”, you need a smaller model or more memory
“Out of memory” error    | Model too large for your system      | Try a smaller quantization (q4_0 instead of q8_0) or a smaller model
API connection refused   | Ollama not running                   | Run ollama serve or start the desktop app
Model download stuck     | Network issue                        | Ctrl+C and re-run ollama pull — it resumes from where it stopped
GPU not detected (Linux) | Missing CUDA/ROCm drivers            | Install drivers first, then reinstall Ollama

Frequently Asked Questions

Is Ollama free?

Yes, completely free and open source. The models are also free. There are no subscriptions, API fees, or usage limits.

Can I use Ollama with Claude Desktop?

Not directly — Claude Desktop uses cloud Claude. But you can connect Ollama to tools like Open WebUI, Continue (VS Code), or any application that supports the OpenAI-compatible API format. Many MCP servers can also route to local Ollama models.

How much disk space do models need?

Models range from 2 GB (small 3-4B models) to 45+ GB (large 70B models). Budget 5-20 GB for a typical setup with 2-3 models. Ollama stores them in ~/.ollama/models by default.

Can I run Ollama on a Raspberry Pi?

Technically yes, but practically no for anything useful. Even a Raspberry Pi 5 with 8 GB RAM can only run tiny models (1-3B) very slowly. A mini PC with 32 GB RAM is a much better entry point for local AI.

Does Ollama support tool calling / function calling?

Yes. Models like Qwen 3, Gemma 4, and Llama 3.3 support structured tool calling through Ollama’s API. This is essential for MCP server integration and agent automation.
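The tools field on /api/chat is how a request advertises callable functions. A sketch of the payload shape (get_weather is a made-up example tool, not part of Ollama; the payload is validated locally here since sending it needs a running server):

```shell
payload='{
  "model": "qwen3:8b",
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }],
  "stream": false
}'

# With the server up:
#   curl -s http://localhost:11434/api/chat -d "$payload"
# When the model decides to call the tool, the response carries
# message.tool_calls instead of plain text.
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload OK"
```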


This is part of the ASTGL Definitive Answers series — structured, practical answers to the questions people actually ask about AI automation, MCP servers, and local AI infrastructure.
