Run AI Models Locally with Ollama: The Complete 2026 Laptop Guide
Open Source Ai

Run AI Models Locally with Ollama: The Complete 2026 Laptop Guide

10 min read

In this article

    Local Ollama is a free, open-source tool that runs AI models on your own laptop instead of someone else’s server. The software itself costs nothing and has no usage cap — install it, pull a model, and run it as many times as you want, offline, with no account required. There’s also an optional Ollama Cloud tier (Free / Pro around $20 a month / Max around $100 a month) for the rare case where you want a model too large for your hardware — same commands, same API, just routed to Ollama’s servers instead of your machine. You don’t need it to follow this guide; it only matters once local hardware becomes the limit, not before.

    Three commands, and you’re chatting with a local model:

    curl -fsSL https://ollama.com/install.sh | sh   # Linux only — Windows/Mac use the installer below
    ollama pull llama3.2:3b
    ollama run llama3.2:3b

    That’s the whole setup. Everything after this is the part people actually get stuck on: which model fits your laptop, what speed to expect, what to do when it’s slow, and how to use it from code instead of a terminal.

    What You Need

    8 GB of RAM → stick to 1B–3B models. llama3.2:3b or gemma3:4b. A 7B model will load. It just won’t enjoy it.

    16 GB → this is the real sweet spot for a laptop in 2026. qwen2.5:7b or llama3.1:8b run comfortably, especially with a GPU.

    32 GB+ or Apple Silicon → 14B–32B opens up. But ask yourself if you need it. Most people are better off not starting with a 32B model just because the laptop can technically run it — a 7B model that answers in two seconds beats a 32B model that takes fifteen, for almost everything you’ll actually use it for day to day.

    RAM sets the floor. VRAM decides how it feels. If a model fits entirely on the GPU, generation is fast. If it spills into system memory, speed drops hard often by 10x or more. Quantization makes large models practical on consumer hardware: a 4-bit 7B model uses around 4–5 GB of memory, compared to roughly 14–16 GB in FP16.

    Install Ollama

    For Windows

    Step 1: Download the installer

    Go to ollama.com/download and click the Windows download button. A file shows up — usually called OllamaSetup.exe. Takes a few seconds depending on your connection.

    Download the installer

    Step 2: Run the installer

    Double-click the downloaded file. On most Windows PCs, Ollama installs without administrator privileges, although some systems may still display a UAC prompt.

    Run the installer

    Step 3: Installer finishes

    The installer closes automatically — there’s no “Finish” button to click. On Windows, it then automatically opens the Ollama app, a simple chat window with a model selector and message box. This is expected behavior, not an error.

    Installer finishes

    Step 4: Open a terminal and confirm the install

    Open Command Prompt or PowerShell (the desktop app and the terminal are independent — closing the app window doesn’t stop the Ollama service running in the background). Then run:

    ollama --version
    Open a terminal and confirm the install

    Step 5 (GPU-specific): Check GPU pickup

    • NVIDIA: nothing to do — drivers are picked up automatically.
    • AMD: needs the ROCm v7 driver stack installed separately, with Vulkan as the fallback for cards ROCm doesn’t cover

    For macOS

    Step 1: Check your macOS version

    You’ll need macOS 14 (Sonoma) or newer — Apple has bumped this minimum in past releases, so don’t assume you’re covered. Check via Apple menu → About This Mac.

    Step 2: Drag Ollama.app into Applications

    From the downloaded zip, drag Ollama.app into the Applications folder.
    Screenshot: the drag-and-drop, or the Applications folder showing Ollama.app installed.

    Drag Ollama.app into Applications

    Step 3: Note on acceleration

    If you’re on Apple Silicon (M1/M2/M3/M4), you get Metal acceleration automatically — no extra setup needed.

    For Linux

    Step 1: Run the install script

    curl -fsSL https://ollama.com/install.sh | sh

    This sets Ollama up as a background service (systemd) on most distros.
    Screenshot: the terminal output after the script finishes.

    Step 2: GPU acceleration — check your hardware

    NVIDIA drivers get picked up automatically — no extra steps. AMD GPUs need ROCm v7 installed separately before Ollama can use them.
    Screenshot: optional — skip if you don’t have a GPU to demo, or screenshot nvidia-smi / rocm-smi output if you do.

    Step 3: Confirm the install
    Run:

    ollama list

    On a fresh install this returns an empty table — that’s success, not a bug (it means Ollama is running but no models are downloaded yet).

    Run Your First Model

    If you haven’t pulled the model yet, just run it directly — ollama run will pull it automatically before loading it:

    ollama run llama3.2:3b
    Run Your First Model

    This does two things in sequence: it pulls (downloads) the model first, then loads it and drops you into a chat prompt. You’ll see a progress bar like the one below while it downloads — for llama3.2:3b that’s about 2 GB, so the wait depends on your connection speed.

    Once the pull finishes, it opens a chat prompt right in the same terminal window. Already pulled it before? It skips straight to this chat prompt. To exit, press Ctrl + D.

     chat prompt right

    Here’s the catch almost everyone hits once: the first response after a model loads is always slower than the rest, even on strong hardware. That’s loading time, not a performance problem. If speed never picks up after that, run ollama ps and check whether it’s actually using your GPU.

    Which Model Should You Pick?

    How much RAM do you have?
    
    8 GB
     ├─ Need code help?                  → qwen2.5-coder:3b
     └─ Just chatting / writing?         → llama3.2:3b or gemma3:4b
    
    16 GB
     ├─ Need code help?                  → qwen2.5-coder:7b
     ├─ Need math / multi-step reasoning?→ deepseek-r1:8b
     └─ Just chatting / writing?         → llama3.1:8b or qwen2.5:7b
    
    32 GB+ / Apple Silicon
     ├─ Need code help?                  → qwen2.5-coder:32b
     ├─ Need math / multi-step reasoning?→ deepseek-r1:32b
     └─ Just chatting / writing?         → qwen3:30b or gemma3:27b

    Coding and general chat are different jobs. qwen2.5-coder beats a general-purpose model at code even at a smaller size — don’t reach for the bigger generalist out of habit. And reasoning models like deepseek-r1 “think” before answering, which makes them slower per response and a poor pick for quick back-and-forth chat, but noticeably better at problems with several steps.

    What Speed Actually Looks Like

    Figures below are from public benchmarks, not tested for this guide — quantization, context length, and Ollama version all shift results. Treat as a sense of scale, not a guarantee.

    HardwareModel (quant)SpeedWhy
    CPU onlyLlama 3 8B (Q4)~5 tok/sNo GPU offload
    RTX 3060 12GBLlama 3.1 8B (Q4)~45 tok/sFits in VRAM
    RTX 3080Llama 3 8B (Q4)40+ tok/sFull GPU offload
    Dual GPU (4060 Ti + 3060)30B MoE (3B active)~54 tok/sOnly active params compute
    16GB+ VRAM / Apple Silicon7B–14B class30–60 tok/sGeneral range, several sources
    RTX 5090 vs 4090Same model5090 ~28–67% fasterMemory bandwidth is the bottleneck once VRAM fits

    Running the same model on a compatible GPU is typically 5–10× faster than CPU-only inference. Once a model fits entirely in VRAM, upgrading to a newer GPU usually improves throughput by tens of percent rather than several-fold.

    Real-World Benchmarks

    To see how this plays out on real hardware, I benchmarked the models below using the following test system:

    • GPU: NVIDIA GeForce RTX 3060 (12 GB VRAM)
    • RAM: 16 GB DDR5
    • Operating System: Windows 11
    • Ollama Version: 0.30.10

    Results will vary depending on your CPU, GPU, RAM, model quantization, context length, prompt complexity, and Ollama version.

    ModelSizeEval rate (tok/s)Total durationOutcome
    llama3.2:3b2.6 GB114.082.83sClean, complete answer (100% GPU)
    gemma4:e4b9.6 GB68.639.8sClean, complete answer
    qwen3.5:latest6.6 GB48.8384.0sNever finished — burned 4,072 tokens stuck in its thinking step, cut off mid-sentence
    qwythos-9b-abliterated (Q4_K_M)5.6 GB48.0321.5sClean, complete answer

    qwen3.5 wasn’t slower per-token than qwythos (48.8 vs 48.0) — it just got stuck re-drafting inside its own reasoning step and never produced an answer. “Tokens/sec” alone hides this kind of failure.

    Fix for looping reasoning models: cap output length. API: options.num_predict to 600–800. Interactive: /set parameter num_predict 600. If a model keeps looping, avoid strict word-count prompts with it, or pick a model without an exposed thinking step.

    How It Actually Works

    Ollama wraps llama.cpp, the engine that made it possible to perform quantization inference of LLMs on ordinary hardware. To download a model, you should use ollama pull command, and then to start an inference server with ollama run, which at the same time provides a local REST API (usually port 11434 by default) which can be invoked by applications running on your machine. Neither accounts nor internet connections are needed if the models are already downloaded. In case, you will find a model which will be too big for your hardware, Ollama also provides a cloud service option for performing inference.

    A couple of settings are worth understanding rather than just copying:

    OLLAMA_KEEP_ALIVE=-1 — by default, Ollama unloads a model from memory five minutes after the last request, so the next request pays a reload cost, often a full second or more before the first token appears. Setting this to -1 keeps the model resident so every call is instant. That matters if you’re repeatedly calling the same model from a local app or script. The trade-off: the model sits in RAM or VRAM the whole time, even when idle — fine on a 32 GB desktop, a bad idea on an 8 GB laptop running three other things.

    OLLAMA_MAX_LOADED_MODELS — raises how many models Ollama keeps loaded simultaneously, above the default of one. Useful if you’re running a chat model and an embedding model together for a RAG setup, or comparing two models without eating a reload delay every time you switch. Each loaded model claims its own slice of memory, so this setting is really a memory budget decision, not a free upgrade.

    Use It with the API

    curl http://localhost:11434/api/chat -d '{
      "model": "llama3.2",
      "messages": [{"role": "user", "content": "Why is the sky blue?"}],
      "stream": false
    }'
    import ollama
    response = ollama.chat(model="llama3.2", messages=[
        {"role": "user", "content": "Write a palindrome checker in Python"}
    ])
    print(response["message"]["content"])

    There’s also an OpenAI-compatible endpoint at /v1/chat/completions, so tools built for the OpenAI SDK can point here with barely any code change. The API only listens on localhost by default — leave it that way unless you specifically need other devices on your network to reach it.

    Want a browser interface instead of a terminal? One Docker command gets you Open WebUI:

    docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data --name open-webui \
      ghcr.io/open-webui/open-webui:main

    Open localhost:3000 and every model you’ve already pulled shows up in the dropdown.

    Ollama vs LM Studio vs Jan vs GPT4All

    OllamaLM StudioJanGPT4All
    Best forDevelopers, scriptingBrowsing/trying modelsPrivacy puristsFastest first chat
    InterfaceTerminal + APINative GUINative GUINative GUI
    Open sourceYes (MIT)NoYes (MIT)Yes
    Built-in document Q&ANoNoBasicYes (LocalDocs)

    Use Ollama if you’re writing code against it. But if you want the best model browser choose LM Studio. Pick Jan if you need to prove nothing leaves the machine. Pick GPT4All if you just want to chat with a PDF in five minutes.

    Common Problems

    It’s slow. Run ollama ps. CPU instead of GPU is the usual answer. Already on GPU and still slow? The model’s too big for VRAM and spilling into RAM — try a smaller or more quantized version.

    Ollama isn’t detecting your GPU. nvidia-smi for NVIDIA, rocminfo for AMD. If neither shows your card, the driver — not Ollama — is the problem.

    Want to store models on another drive? Set OLLAMA_MODELS to a new folder under Windows environment variables, quit Ollama from the tray, relaunch. External SSDs work fine here, USB-C included, as long as the path stays stable.

    A pull keeps failing. Usually disk space, a flaky connection, or a firewall blocking outbound HTTPS. Re-running the same ollama pull resumes instead of starting over.

    Mistakes Beginners Make

    • Picking the biggest model the laptop can technically load. A model that loads is not the same as a model that’s pleasant to use. If every response takes 15+ seconds, the “smarter” model isn’t winning anything.
    • Expecting GPU speeds on a CPU-only laptop. Most numbers people quote assume GPU acceleration. Without one, divide expectations by 5–10x, per the benchmark table above.
    • Ignoring free disk space until a pull fails halfway through. Quantized models range from under a gigabyte to well over 100 GB. Check available space before queuing several pulls back to back.
    • Running an outdated GPU driver. GPU detection depends entirely on the driver, not on Ollama. An old NVIDIA or AMD driver is the single most common reason nvidia-smi or rocminfo comes back empty.
    • Running two inference tools at once. Ollama, LM Studio, and similar tools all compete for the same VRAM. Running two simultaneously usually means neither performs well, even on capable hardware.

    FAQ

    Rohit

    Rohit Kumar is an experienced tech expert and content creator who simplifies technology. Through his website, he provides insightful articles, practical tips, and expert analysis on mobile specs, PC/laptop news, and how-to guides, empowering users to make informed tech decisions.

    View all posts →

    Leave a Comment

    Your comment will be held for moderation if it's your first submission.

    No comments yet. Be the first to share your thoughts!