Beginner’s Guide to Local LLMs: Complete Setup Tutorial with Ollama and LM Studio

TOC

What Is a Local LLM? How It Differs from Cloud AI

Have you ever worried that what you type into ChatGPT might be used as training data? Or maybe you’ve had that sinking feeling at the end of the month when your API bill is higher than expected. That’s exactly why local LLMs — AI models that run entirely on your own PC — are getting so much attention lately.

How Local LLMs Work

A local LLM is a large language model that handles all inference (text generation) directly on your machine, with no internet connection required. You download the model weights (pre-trained parameters) and run the computations locally using your CPU or GPU.

Well-known examples include Meta’s Llama series and Mistral AI’s Mistral series. These are released as open-weight models, meaning they’re free to use — both personally and commercially — under the conditions of their licenses.

Key point: Cloud AI sends your request to a remote server and returns the result. With a local LLM, everything is processed on your own hardware — nothing leaves your machine.

Jan.ai is another desktop app for running local LLMs — you can check its full feature list and supported models on the official website, and it’s worth a look if you’re curious.

Pros and Cons Compared to Cloud AI

  • Pro #1: Privacy — Your input data never leaves your machine, making it ideal for tasks involving confidential or personal information
  • Pro #2: Zero ongoing cost — Once you download the model, there are no API fees going forward
  • Pro #3: Offline capability — Works without an internet connection, so you can use it while traveling or on air-gapped networks
  • Con #1: Performance ceiling — At this point, local models still lag behind top-tier cloud models like GPT-4o or Claude 3.7 Sonnet in reasoning accuracy
  • Con #2: Hardware requirements — You need sufficient RAM and/or VRAM for the model size you’re running; underpowered machines will be noticeably slow

Use Cases Where Local LLMs Shine

Local LLMs aren’t the best fit for every situation. Understanding where they excel — and where they fall short — is key to using them effectively.

Great fit

Summarizing and formatting internal documents, code completion and review assistance, organizing personal notes and ideas — essentially any task involving data you don’t want leaving your system

Not the best fit

Answering questions about current events, complex multi-step reasoning, or integrating with image generation — for anything requiring cutting-edge accuracy or up-to-date knowledge, cloud models have the edge

[Image: checking the PC specs — RAM, GPU, and so on — needed to run a local LLM]

GPT4All is a great tool for trying out local LLMs through a simple GUI — no command line needed. Check the official website to see which models are supported and what hardware is required before you dive in.

Before You Install: Recommended Specs and Supported Operating Systems

Before you end up stuck at model loading because your machine doesn’t quite cut it, take a moment to verify that your system meets the requirements for running a local LLM. Going in underpowered often means extremely slow responses — or the model simply won’t run at all.

RAM and GPU VRAM Requirements by Model Size

The single biggest factor in how well a local LLM runs is the amount of RAM (or GPU VRAM) available. Model “size” is measured in parameter count (expressed in billions, or “B”), and that directly determines how much memory you’ll need.

Memory requirements by model size

  • 3B–7B models: 8GB RAM minimum (16GB recommended)
  • 13B models: 16GB RAM minimum (32GB recommended)
  • 30B–70B models: 32GB RAM minimum, or a GPU with 24GB+ VRAM

If you have a GPU, the golden rule is to choose a model that fits entirely within your VRAM. If it doesn’t fit, the workload gets split between the GPU and CPU (offloading), which significantly tanks performance. Using quantized models (4-bit or 8-bit) can dramatically reduce memory usage, so if you’re not sure your specs are up to the task, start with a quantized version.
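
You can estimate the footprint yourself: the weights take roughly parameter count × bytes per weight, plus runtime overhead for the KV cache and buffers. A quick sketch of that rule of thumb — the 20% overhead factor is an illustrative assumption, not a fixed rule:

```python
def estimate_memory_gb(params_billion: float, bits: int = 4,
                       overhead: float = 1.2) -> float:
    """Rough memory needed to run a model: weights (params x bytes per
    weight) plus ~20% overhead for the KV cache and runtime buffers."""
    bytes_per_param = bits / 8
    return round(params_billion * bytes_per_param * overhead, 1)

# A 7B model: ~4.2 GB at 4-bit vs ~16.8 GB at 16-bit -- which is why
# a quantized 7B fits on an 8GB machine while the full-precision one doesn't.
print(estimate_memory_gb(7, bits=4), estimate_memory_gb(7, bits=16))
```

Plugging in a 13B model at 4-bit gives about 7.8 GB, which lines up with the 16GB-RAM recommendation above once you leave room for the OS.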

OS-Specific Considerations

Windows

Runs most smoothly with an NVIDIA GPU (CUDA-compatible). AMD GPU support via ROCm is limited, so check compatibility before proceeding.

macOS (Apple Silicon)

M1 and later chips use a unified memory architecture, meaning RAM is shared with the GPU. Metal-based GPU acceleration works out of the box, and machines with 16GB or more run local LLMs quite comfortably.

Linux

With a properly configured CUDA environment, Linux is the most stable of the three options. If you’re planning server deployments or scripting integrations, Linux is the top choice.

Software You’ll Need to Install Beforehand

Both Ollama and LM Studio can be installed as standalone apps, but getting GPU acceleration working requires a bit of setup.

  • NVIDIA GPU users: Latest NVIDIA drivers + CUDA Toolkit
  • macOS: No additional installs needed (Metal support is automatic)
  • Linux (Ollama): The official install script handles CUDA dependencies automatically

Outdated GPU drivers can cause your GPU to go unrecognized, forcing everything to run on the CPU instead. On Windows especially, update your GPU drivers first before doing anything else.

[Image: running Ollama commands in a terminal to launch a local LLM]

For detailed specs and download instructions for Gemma 3, check Google DeepMind’s official page. It covers available sizes (1B–27B) and quantization options to help you pick the right version for your setup.

How to Set Up a Local LLM with Ollama

Once you’ve confirmed your system specs, it’s time to get started. Ollama lets you download and launch a model with a single command, making it the lowest-barrier entry point for running a local LLM.

For detailed specs on Llama 3 — including available model sizes and multilingual performance — check the official Meta page. Reviewing the model requirements before setup will help the installation go smoothly.

How to Install Ollama (Windows & Mac)

Ollama supports all three major platforms: Windows, Mac, and Linux. The installation process varies slightly depending on your OS.

1

Visit the Official Website

Go to ollama.com/download and select the installer for your operating system.

2

Run the Installer

Mac: Open the downloaded .dmg file and drag the app to your Applications folder. Once launched, an icon will appear in the menu bar.
Windows: Simply run OllamaSetup.exe — no additional configuration needed.
Linux: Run curl -fsSL https://ollama.com/install.sh | sh in your terminal.

3

Verify the Installation

Run ollama --version in your terminal (or Command Prompt). If a version number is returned, you’re good to go.

The official Ollama website has full documentation on usage and a complete list of supported models — definitely worth a look. Everything from installation to command usage is covered in detail, so even first-timers should have no trouble getting started.

Downloading Models and Running Commands

All Ollama operations are handled through the ollama command. Here’s a quick reference for the most commonly used commands.

Essential Commands

  • ollama run llama3.2 — Automatically downloads the model and launches an interactive chat session
  • ollama pull mistral — Downloads the model without launching it
  • ollama list — Lists all installed models
  • ollama rm llama3.2 — Removes the specified model
  • ollama serve — Starts the API server on port 11434

The first time you run the run command, it will download the model file, which can be several gigabytes depending on the model. It’s best to do this on a stable internet connection. Once downloaded, subsequent launches pull straight from the local cache.

You can browse all available models at ollama.com/library. Popular options include llama3.2, mistral, and gemma3. You can also append a parameter size tag like :7b or :13b to the model name to choose a size that fits your hardware.
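
These commands also have a programmatic counterpart: once `ollama serve` is running (the desktop app starts it automatically), any pulled model is reachable over HTTP on port 11434. A minimal sketch using only the Python standard library — the model name and prompt here are just examples:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint.
    stream=False asks for the whole answer in a single JSON response."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    """Send one prompt to the local Ollama server and return its reply."""
    req = urllib.request.Request(OLLAMA_URL, data=build_request(model, prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running server and a pulled model, e.g.:
# generate("llama3.2", "Explain quantization in one sentence.")
```

This is the same API that tools like Open WebUI talk to under the hood, so anything you script here works alongside the GUI options covered next.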

For full Mistral model specs and download instructions, check out the official website.

Setting Up a Web UI to Use Ollama in Your Browser

While the command line can feel perfectly natural once you get used to it, if you prefer a graphical interface, Open WebUI is the go-to option. It gives you a ChatGPT-like experience that connects directly to Ollama.

If you already have Docker installed, a single command is all it takes to get up and running.

docker run -d -p 3000:80 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui --restart always \
  ghcr.io/open-webui/open-webui:main

Once it’s running, just open http://localhost:3000 in your browser. As long as Ollama is running via ollama serve on the same machine, your installed models will appear automatically.
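
If no models show up, it helps to confirm that Ollama itself is responding. The `/api/tags` endpoint returns the same list as `ollama list`; a small sketch that queries it with the standard library:

```python
import json
import urllib.request

def parse_tags(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response payload."""
    return [m["name"] for m in payload.get("models", [])]

def installed_models(host: str = "http://localhost:11434") -> list[str]:
    """Ask the local Ollama server which models are installed."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return parse_tags(json.load(resp))

# With Ollama running: installed_models() returns names like "llama3.2:latest"
```

An empty list (or a connection error) tells you the problem is on the Ollama side rather than in Open WebUI’s configuration.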

Keep the Downsides of Open WebUI in Mind

Docker is a hard requirement, so if you don’t already have it set up, there’s some prep work involved. It also runs a local web server in the background at all times, which may be noticeable on machines with limited resources. Sticking with the CLI alone is a perfectly valid choice if you want to keep things lean.

[Image: chatting with a local LLM by mouse in the LM Studio GUI]

How to Set Up LM Studio: A GUI-Based Local LLM Tool

If the command line feels intimidating, LM Studio is your answer. While Ollama requires terminal commands, LM Studio is entirely GUI-based — from searching models to chatting, everything is done with your mouse. It’s a practical option for anyone who isn’t comfortable with the CLI.

Downloading and Setting Up LM Studio

STEP 1

Download the Installer from the Official Site

Visit the LM Studio official website (lmstudio.ai) and download the installer for your OS (Windows, Mac, or Linux). On Mac, separate builds are available for Apple Silicon and Intel — make sure to check which chip your Mac has before downloading.

STEP 2

Install and Launch

Run the downloaded file and follow the on-screen instructions to complete the installation. On first launch, you’ll be asked whether to share usage data — turning this off has no effect on functionality.

System Requirements at a Glance
Macs with Apple Silicon (M1 or later) offer the most stable performance. On Windows, an NVIDIA GPU makes a noticeable difference, though CPU-only inference is also supported. Check the official documentation for detailed spec requirements.

For a full breakdown of LM Studio’s interface and a list of supported models, visit the official site. You’ll find everything from installation steps to configuring models for non-English languages all in one place.

How to Search for and Download Models

Click the magnifying glass icon (Discover) in the left sidebar to search for models directly from Hugging Face. If you’re just getting started, picking from the “Recommended Models” shown at the top of the screen is the safest approach.

3 Key Tips for Choosing a Model

  • Keep the file size under half your available RAM (e.g., choose a model under 4GB if you have 8GB of memory)
  • For quantization, “Q4_K_M” hits a good balance between speed and accuracy
  • If you need a language other than English, look for models tagged with the appropriate language identifier

Click the “Download” button next to a model name to start downloading it in the background. You can track progress via the icon at the bottom of the sidebar. Since model files can easily be several gigabytes, downloading on a stable Wi-Fi connection is recommended.

The NVIDIA GeForce RTX 4060, with 8GB of VRAM, is a well-rounded choice capable of running 7B–13B class models comfortably.

Using the Chat Interface and Key Settings

Click the chat icon in the left sidebar, then select a downloaded model from the dropdown at the top of the screen — you’re ready to start chatting right away.

STEP 1

Configure the System Prompt

Enter instructions in the “System Prompt” field in the right panel to set the model’s behavior upfront. For example, adding “Please respond in English” can significantly improve response consistency, even with models primarily trained in other languages.

STEP 2

Adjust Parameters

The “Temperature” setting in the right panel controls output randomness. Lower values (closer to 0) produce more consistent, predictable responses, while higher values (closer to 1) encourage more varied, creative output. Around 0.2 works well for code generation; 0.7 is a good starting point for casual conversation or creative writing.
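
These same settings can also be driven from code: LM Studio includes a local server (started from its server/developer view) that speaks an OpenAI-compatible API, by default on port 1234. A sketch under those assumptions — the model name is a placeholder for whichever model you have loaded:

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_chat_request(model: str, user_msg: str,
                       temperature: float = 0.2) -> bytes:
    """OpenAI-style chat payload; the system prompt and temperature
    mirror the fields in LM Studio's right-hand settings panel."""
    return json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": "Please respond in English."},
            {"role": "user", "content": user_msg},
        ],
        "temperature": temperature,
    }).encode("utf-8")

def chat(model: str, user_msg: str, temperature: float = 0.2) -> str:
    """Send one chat turn to LM Studio's local server and return the reply."""
    req = urllib.request.Request(
        LMSTUDIO_URL, data=build_chat_request(model, user_msg, temperature),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# With the server running and a model loaded:
# chat("your-loaded-model", "Review this draft email...", temperature=0.7)
```

Because the API shape matches OpenAI’s, scripts written against it can later be pointed at other backends with little more than a URL change.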

The Honest Downsides of LM Studio
The trade-off for its intuitive GUI is higher memory usage compared to Ollama. It’s also not open source, which can be limiting for power users who want fine-grained control over how things work under the hood. Think of LM Studio as a tool designed for beginners to intermediate users who prioritize ease of use — understanding that helps you decide when to use it versus other options.

How to Choose the Right Local LLM Model

Once you’ve set up LM Studio or Ollama, the next challenge is figuring out which model to use. With hundreds of models available, the names alone rarely tell you much. Here are three key criteria to help you narrow down your options.

The Trade-off Between Model Size (Parameter Count), Performance, and Speed

Think of the parameter count as the “brain size” of a model. Larger models are more capable, but they require more memory. Here are some general guidelines:

Parameter Count and System Requirements
  • Around 7B: Runs on 8GB+ RAM. Best for speed-focused use cases.
  • Around 13B: Requires 16GB+ RAM. Offers a good balance of quality and speed.
  • 70B and above: Realistically needs 64GB+ RAM (a 4-bit 70B model’s weights alone are roughly 40GB). Approaches GPT-4 quality, but too slow for most consumer PCs.

Models also come in quantized (compressed) formats. Q4 is smaller and faster, while Q8 is more accurate but uses more memory. Starting with Q4 or Q5 is the most practical approach for most users.

What to Look for When Choosing a Japanese-Compatible Model

If you ask an English-only model questions in Japanese, you may get English responses or inaccurate answers. When using a model in Japanese, check for the following:

Japanese-Compatible Model Checklist
  • Does the model name or Hugging Face description mention “Japanese,” “multilingual,” or “ja”?
  • Does the tokenizer handle Japanese text with dedicated tokens?
  • Has the model been fine-tuned on Japanese data?

For example, the Qwen2 series, developed by Alibaba, is a multilingual model known for relatively efficient Japanese tokenization. On the other hand, asking an English-only model questions in Japanese can significantly degrade response quality — even at the same parameter count — so it’s something to keep in mind.

Recommended Models by Use Case: Coding, Japanese Chat, and General Purpose

The more specific your use case, the easier it is to pick a model. Here’s a breakdown of go-to options by purpose:

For Coding

DeepSeek Coder and CodeLlama are the standard choices. If your primary tasks are code completion, debugging, or refactoring, a code-specialized model will outperform a general-purpose one. Just keep in mind that code comments will primarily be in English.

For Japanese Conversation and Text Generation

ELYZA is a Llama-based model fine-tuned on Japanese data and produces natural Japanese output. Qwen2 is also strong for Japanese Q&A and is a reliable choice in the 7B class.

General Purpose (Coding + Japanese)

Meta’s Llama 3.1 series offers a well-rounded balance for multitasking and is often recommended as a solid first model to try. It’s a great starting point if you haven’t settled on a specific use case yet.

Important Note

Model performance varies significantly depending on your hardware, quantization format, and how you write your prompts. Use benchmark scores as a reference only — the most reliable way to evaluate a model is to test it yourself for your specific use case.

For detailed specifications and the latest version information on Qwen2.5, check the official page. You’ll find a full overview of specs relevant to local deployment, including Japanese language accuracy and supported context length.

Common Issues and How to Fix Them

You’ve picked a model, downloaded it, and hit the launch button — but nothing happens. If that sounds familiar, you’re not alone. Most local LLM problems fall into a handful of recognizable patterns, and working through them in order will resolve the issue in most cases.

Steps to Take When a Model Won’t Start or Keeps Crashing

If the model crashes immediately on startup, the cause is almost always either a corrupted model file or insufficient memory. Don’t panic — just work through the following steps:

1

Re-download the model file
An interrupted download can leave a corrupted file behind. With Ollama, run ollama rm [model name] to delete it, then use ollama pull to download it again.

2

Check your RAM and VRAM
Even a 4-bit quantized 7B model requires at least 8GB of RAM. Use Task Manager or htop to check your available memory.

3

Check the logs
Ollama prints error codes directly to the terminal log. In LM Studio, open the console tab at the bottom of the screen for detailed output.

The NVIDIA GeForce RTX 4070 offers the 12GB of VRAM needed for smooth local LLM performance at a strong price-to-performance ratio.

Settings to Review When Generation Is Slow

If the model is running but taking several seconds to output a single token, GPU acceleration is likely not enabled.

Settings to Check

  • Enable GPU acceleration: Ollama automatically uses CUDA if NVIDIA drivers are properly installed. Check the ollama serve log output for GPU/CUDA detection messages to confirm it’s active.
  • Reduce context length: Lowering the default context length (num_ctx) to around 2048 can reduce VRAM usage and improve speed.
  • Switch to a lower quantization level: Q4 models run faster than Q8. Keep the quality-vs-speed trade-off in mind when choosing.

CPU Offloading When You Don’t Have Enough VRAM

If you’re running low on VRAM, you can offload part of the model to RAM using “CPU offloading.” It’s slower, but far more practical than not being able to run the model at all.

In Ollama, the num_gpu option controls how many layers are loaded onto the GPU. Set it with /set parameter num_gpu 20 in an interactive session, or add PARAMETER num_gpu 20 to a Modelfile, and the remaining layers will be processed on the CPU. In LM Studio, you can achieve the same effect by lowering the “GPU Layers” slider on the model loading screen.

As a rough guideline: If you’re trying to run a 13B model on 8GB of VRAM, setting GPU layers to around 20–30 often works. The optimal value varies by model, so experiment with different numbers to find what works best.
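
That trial and error can at least start from a rough calculation. A sketch under simplifying assumptions — all layers roughly equal in size, and ~1GB of VRAM reserved for the KV cache and buffers (both illustrative numbers):

```python
def gpu_layers_that_fit(vram_gb: float, model_file_gb: float,
                        total_layers: int, reserve_gb: float = 1.0) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized
    layers and reserving some VRAM for context and runtime buffers."""
    per_layer_gb = model_file_gb / total_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable_gb / per_layer_gb))

# A 13B model quantized to ~8GB with 40 layers, on an 8GB GPU:
print(gpu_layers_that_fit(vram_gb=8, model_file_gb=8, total_layers=40))  # 35
```

The result lands squarely in the 20–30+ range suggested above; treat it as a starting value and nudge it down if you hit out-of-memory errors.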

Summary | Choose the Setup Method That Fits Your PC

Getting started with a local LLM is smoother than you’d expect once you take that first step. Based on everything covered so far, let’s find the fastest path forward based on your hardware specs and use case.

How to Choose the Right Tool

  • Choose Ollama if you: are comfortable with the command line, want to integrate it with scripts or APIs, or just want a lightweight setup
  • Choose LM Studio if you: prefer a visual, intuitive interface, want to compare multiple models side by side, or are trying a local LLM for the first time

Model Selection Guide

  • VRAM 4GB or less / RAM 8GB: Start with a quantized model in the 1B–3B range
  • VRAM 8GB / RAM 16GB: The 7B–8B class runs comfortably — a realistic sweet spot for most users
  • RAM 32GB or more (no GPU): Even CPU-only setups can handle 13B–14B class models
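
The same tiers expressed as a tiny lookup function — the thresholds are copied straight from the list above, so treat them as starting points rather than hard limits:

```python
def recommend_model_size(ram_gb: int, vram_gb: int = 0) -> str:
    """Suggest a starting model size from available RAM and VRAM,
    following the tiers in the guide above."""
    if vram_gb == 0 and ram_gb >= 32:
        return "13B-14B (CPU-only)"
    if vram_gb >= 8 or ram_gb >= 16:
        return "7B-8B"
    return "1B-3B (quantized)"

print(recommend_model_size(ram_gb=16, vram_gb=8))  # 7B-8B
```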

You don’t need a high-end machine to get real value out of local LLMs. For focused tasks like proofreading emails or debugging code, a 3B–7B model will cover the vast majority of use cases.

STEP 1

Check your RAM and VRAM to determine what model sizes your system can handle

STEP 2

Install Ollama or LM Studio and verify everything works with a smaller model first

STEP 3

Once it’s running smoothly, scale up the model size and fine-tune the settings for your specific use case

Unlike cloud-based AI, every experiment you run with a local LLM builds real, hands-on skill. Focus on getting it up and running first — you can always optimize the details later.
