Run AI Models Locally with Ollama: The Complete 2026 Laptop Guide

Rohit June 30, 2026

10 min read

Local Ollama is a free, open-source tool that runs AI models on your own laptop instead of someone else’s server. The software itself costs nothing and has no usage cap — install it, pull a model, and run it as many times as you want, offline, with no account required. There’s also an optional Ollama Cloud tier (Free / Pro around $20 a month / Max around $100 a month) for the rare case where you want a model too large for your hardware — same commands, same API, just routed to Ollama’s servers instead of your machine. You don’t need it to follow this guide; it only matters once local hardware becomes the limit, not before.

Three commands, and you’re chatting with a local model:

curl -fsSL https://ollama.com/install.sh | sh   # Linux only — Windows/Mac use the installer below
ollama pull llama3.2:3b
ollama run llama3.2:3b

That’s the whole setup. Everything after this is the part people actually get stuck on: which model fits your laptop, what speed to expect, what to do when it’s slow, and how to use it from code instead of a terminal.

What You Need

8 GB of RAM → stick to 1B–3B models. llama3.2:3b or gemma3:4b. A 7B model will load. It just won’t enjoy it.

16 GB → this is the real sweet spot for a laptop in 2026. qwen2.5:7b or llama3.1:8b run comfortably, especially with a GPU.

32 GB+ or Apple Silicon → 14B–32B opens up. But ask yourself if you need it. Most people are better off not starting with a 32B model just because the laptop can technically run it — a 7B model that answers in two seconds beats a 32B model that takes fifteen, for almost everything you’ll actually use it for day to day.

RAM sets the floor. VRAM decides how it feels. If a model fits entirely on the GPU, generation is fast. If it spills into system memory, speed drops hard often by 10x or more. Quantization makes large models practical on consumer hardware: a 4-bit 7B model uses around 4–5 GB of memory, compared to roughly 14–16 GB in FP16.

Install Ollama

For Windows

Step 1: Download the installer

Go to ollama.com/download and click the Windows download button. A file shows up — usually called OllamaSetup.exe. Takes a few seconds depending on your connection.

Step 2: Run the installer

Double-click the downloaded file. On most Windows PCs, Ollama installs without administrator privileges, although some systems may still display a UAC prompt.

Step 3: Installer finishes

The installer closes automatically — there’s no “Finish” button to click. On Windows, it then automatically opens the Ollama app, a simple chat window with a model selector and message box. This is expected behavior, not an error.

Step 4: Open a terminal and confirm the install

Open Command Prompt or PowerShell (the desktop app and the terminal are independent — closing the app window doesn’t stop the Ollama service running in the background). Then run:

ollama --version

Step 5 (GPU-specific): Check GPU pickup

NVIDIA: nothing to do — drivers are picked up automatically.
AMD: needs the ROCm v7 driver stack installed separately, with Vulkan as the fallback for cards ROCm doesn’t cover

For macOS

Step 1: Check your macOS version

You’ll need macOS 14 (Sonoma) or newer — Apple has bumped this minimum in past releases, so don’t assume you’re covered. Check via Apple menu → About This Mac.

Step 2: Drag Ollama.app into Applications

From the downloaded zip, drag Ollama.app into the Applications folder.
Screenshot: the drag-and-drop, or the Applications folder showing Ollama.app installed.

Step 3: Note on acceleration

If you’re on Apple Silicon (M1/M2/M3/M4), you get Metal acceleration automatically — no extra setup needed.

For Linux

Step 1: Run the install script

curl -fsSL https://ollama.com/install.sh | sh

This sets Ollama up as a background service (systemd) on most distros.
Screenshot: the terminal output after the script finishes.

Step 2: GPU acceleration — check your hardware

NVIDIA drivers get picked up automatically — no extra steps. AMD GPUs need ROCm v7 installed separately before Ollama can use them.
Screenshot: optional — skip if you don’t have a GPU to demo, or screenshot nvidia-smi / rocm-smi output if you do.

Step 3: Confirm the install
Run:

ollama list

On a fresh install this returns an empty table — that’s success, not a bug (it means Ollama is running but no models are downloaded yet).

Run Your First Model

If you haven’t pulled the model yet, just run it directly — ollama run will pull it automatically before loading it:

ollama run llama3.2:3b

This does two things in sequence: it pulls (downloads) the model first, then loads it and drops you into a chat prompt. You’ll see a progress bar like the one below while it downloads — for llama3.2:3b that’s about 2 GB, so the wait depends on your connection speed.

Once the pull finishes, it opens a chat prompt right in the same terminal window. Already pulled it before? It skips straight to this chat prompt. To exit, press Ctrl + D.

Here’s the catch almost everyone hits once: the first response after a model loads is always slower than the rest, even on strong hardware. That’s loading time, not a performance problem. If speed never picks up after that, run ollama ps and check whether it’s actually using your GPU.

Which Model Should You Pick?

How much RAM do you have?

8 GB
 ├─ Need code help?                  → qwen2.5-coder:3b
 └─ Just chatting / writing?         → llama3.2:3b or gemma3:4b

16 GB
 ├─ Need code help?                  → qwen2.5-coder:7b
 ├─ Need math / multi-step reasoning?→ deepseek-r1:8b
 └─ Just chatting / writing?         → llama3.1:8b or qwen2.5:7b

32 GB+ / Apple Silicon
 ├─ Need code help?                  → qwen2.5-coder:32b
 ├─ Need math / multi-step reasoning?→ deepseek-r1:32b
 └─ Just chatting / writing?         → qwen3:30b or gemma3:27b

Coding and general chat are different jobs. qwen2.5-coder beats a general-purpose model at code even at a smaller size — don’t reach for the bigger generalist out of habit. And reasoning models like deepseek-r1 “think” before answering, which makes them slower per response and a poor pick for quick back-and-forth chat, but noticeably better at problems with several steps.

What Speed Actually Looks Like

Figures below are from public benchmarks, not tested for this guide — quantization, context length, and Ollama version all shift results. Treat as a sense of scale, not a guarantee.

Hardware	Model (quant)	Speed	Why
CPU only	Llama 3 8B (Q4)	~5 tok/s	No GPU offload
RTX 3060 12GB	Llama 3.1 8B (Q4)	~45 tok/s	Fits in VRAM
RTX 3080	Llama 3 8B (Q4)	40+ tok/s	Full GPU offload
Dual GPU (4060 Ti + 3060)	30B MoE (3B active)	~54 tok/s	Only active params compute
16GB+ VRAM / Apple Silicon	7B–14B class	30–60 tok/s	General range, several sources
RTX 5090 vs 4090	Same model	5090 ~28–67% faster	Memory bandwidth is the bottleneck once VRAM fits

Running the same model on a compatible GPU is typically 5–10× faster than CPU-only inference. Once a model fits entirely in VRAM, upgrading to a newer GPU usually improves throughput by tens of percent rather than several-fold.

Real-World Benchmarks

To see how this plays out on real hardware, I benchmarked the models below using the following test system:

GPU: NVIDIA GeForce RTX 3060 (12 GB VRAM)
RAM: 16 GB DDR5
Operating System: Windows 11
Ollama Version: 0.30.10

Results will vary depending on your CPU, GPU, RAM, model quantization, context length, prompt complexity, and Ollama version.

Model	Size	Eval rate (tok/s)	Total duration	Outcome
llama3.2:3b	2.6 GB	114.08	2.83s	Clean, complete answer (100% GPU)
gemma4:e4b	9.6 GB	68.63	9.8s	Clean, complete answer
qwen3.5:latest	6.6 GB	48.83	84.0s	Never finished — burned 4,072 tokens stuck in its thinking step, cut off mid-sentence
qwythos-9b-abliterated (Q4_K_M)	5.6 GB	48.03	21.5s	Clean, complete answer

qwen3.5 wasn’t slower per-token than qwythos (48.8 vs 48.0) — it just got stuck re-drafting inside its own reasoning step and never produced an answer. “Tokens/sec” alone hides this kind of failure.

Fix for looping reasoning models: cap output length. API: options.num_predict to 600–800. Interactive: /set parameter num_predict 600. If a model keeps looping, avoid strict word-count prompts with it, or pick a model without an exposed thinking step.

How It Actually Works

Ollama wraps llama.cpp, the engine that made it possible to perform quantization inference of LLMs on ordinary hardware. To download a model, you should use ollama pull command, and then to start an inference server with ollama run, which at the same time provides a local REST API (usually port 11434 by default) which can be invoked by applications running on your machine. Neither accounts nor internet connections are needed if the models are already downloaded. In case, you will find a model which will be too big for your hardware, Ollama also provides a cloud service option for performing inference.

A couple of settings are worth understanding rather than just copying:

OLLAMA_KEEP_ALIVE=-1 — by default, Ollama unloads a model from memory five minutes after the last request, so the next request pays a reload cost, often a full second or more before the first token appears. Setting this to -1 keeps the model resident so every call is instant. That matters if you’re repeatedly calling the same model from a local app or script. The trade-off: the model sits in RAM or VRAM the whole time, even when idle — fine on a 32 GB desktop, a bad idea on an 8 GB laptop running three other things.

OLLAMA_MAX_LOADED_MODELS — raises how many models Ollama keeps loaded simultaneously, above the default of one. Useful if you’re running a chat model and an embedding model together for a RAG setup, or comparing two models without eating a reload delay every time you switch. Each loaded model claims its own slice of memory, so this setting is really a memory budget decision, not a free upgrade.

Use It with the API

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}],
  "stream": false
}'

import ollama
response = ollama.chat(model="llama3.2", messages=[
    {"role": "user", "content": "Write a palindrome checker in Python"}
])
print(response["message"]["content"])

There’s also an OpenAI-compatible endpoint at /v1/chat/completions, so tools built for the OpenAI SDK can point here with barely any code change. The API only listens on localhost by default — leave it that way unless you specifically need other devices on your network to reach it.

Want a browser interface instead of a terminal? One Docker command gets you Open WebUI:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open localhost:3000 and every model you’ve already pulled shows up in the dropdown.

Ollama vs LM Studio vs Jan vs GPT4All

	Ollama	LM Studio	Jan	GPT4All
Best for	Developers, scripting	Browsing/trying models	Privacy purists	Fastest first chat
Interface	Terminal + API	Native GUI	Native GUI	Native GUI
Open source	Yes (MIT)	No	Yes (MIT)	Yes
Built-in document Q&A	No	No	Basic	Yes (LocalDocs)

Use Ollama if you’re writing code against it. But if you want the best model browser choose LM Studio. Pick Jan if you need to prove nothing leaves the machine. Pick GPT4All if you just want to chat with a PDF in five minutes.

Common Problems

It’s slow. Run ollama ps. CPU instead of GPU is the usual answer. Already on GPU and still slow? The model’s too big for VRAM and spilling into RAM — try a smaller or more quantized version.

Ollama isn’t detecting your GPU. nvidia-smi for NVIDIA, rocminfo for AMD. If neither shows your card, the driver — not Ollama — is the problem.

Want to store models on another drive? Set OLLAMA_MODELS to a new folder under Windows environment variables, quit Ollama from the tray, relaunch. External SSDs work fine here, USB-C included, as long as the path stays stable.

A pull keeps failing. Usually disk space, a flaky connection, or a firewall blocking outbound HTTPS. Re-running the same ollama pull resumes instead of starting over.

Mistakes Beginners Make

Picking the biggest model the laptop can technically load. A model that loads is not the same as a model that’s pleasant to use. If every response takes 15+ seconds, the “smarter” model isn’t winning anything.
Expecting GPU speeds on a CPU-only laptop. Most numbers people quote assume GPU acceleration. Without one, divide expectations by 5–10x, per the benchmark table above.
Ignoring free disk space until a pull fails halfway through. Quantized models range from under a gigabyte to well over 100 GB. Check available space before queuing several pulls back to back.
Running an outdated GPU driver. GPU detection depends entirely on the driver, not on Ollama. An old NVIDIA or AMD driver is the single most common reason nvidia-smi or rocminfo comes back empty.
Running two inference tools at once. Ollama, LM Studio, and similar tools all compete for the same VRAM. Running two simultaneously usually means neither performs well, even on capable hardware.

FAQ

Rohit

Rohit Kumar is an experienced tech expert and content creator who simplifies technology. Through his website, he provides insightful articles, practical tips, and expert analysis on mobile specs, PC/laptop news, and how-to guides, empowering users to make informed tech decisions.

View all posts →

Run AI Models Locally with Ollama: The Complete 2026 Laptop Guide

In this article

What You Need

Install Ollama

For Windows

Step 1: Download the installer

Step 2: Run the installer

Step 3: Installer finishes

Step 4: Open a terminal and confirm the install

For macOS

Step 1: Check your macOS version

Step 2: Drag Ollama.app into Applications

Step 3: Note on acceleration

For Linux

Step 1: Run the install script

Step 2: GPU acceleration — check your hardware

Run Your First Model

Which Model Should You Pick?

What Speed Actually Looks Like

Real-World Benchmarks

How It Actually Works

Use It with the API

Ollama vs LM Studio vs Jan vs GPT4All

Common Problems

Mistakes Beginners Make

FAQ

Rohit

Leave a Comment Cancel Reply

In this article

What You Need

Install Ollama

For Windows

Step 1: Download the installer

Step 2: Run the installer

Step 3: Installer finishes

Step 4: Open a terminal and confirm the install

For macOS

Step 1: Check your macOS version

Step 2: Drag Ollama.app into Applications

Step 3: Note on acceleration

For Linux

Step 1: Run the install script

Step 2: GPU acceleration — check your hardware

Run Your First Model

Which Model Should You Pick?

What Speed Actually Looks Like

Real-World Benchmarks

How It Actually Works

Use It with the API

Ollama vs LM Studio vs Jan vs GPT4All

Common Problems

Mistakes Beginners Make

FAQ

How much disk space does Ollama need?

Can Ollama run multiple models at once?

How do I uninstall Ollama?

Can Ollama use an external SSD?

How much VRAM is enough?

Is Ollama free?

Does it need internet?

Rohit

Leave a Comment Cancel Reply