How to Run DeepSeek-R1 Locally, a Free Alternative to OpenAI’s o1 Model: Installation and Hardware Requirements
Introduction
Large Language Models (LLMs) have transformed the world of natural language processing by enabling a range of advanced use cases, from question-answering to code generation. DeepSeek-R1 is one of the latest specialized language models designed to deliver powerful performance in tasks like semantic search, text summarization, classification, and more. Thanks to open-source tooling, you can deploy DeepSeek-R1 locally with frameworks such as Ollama, vLLM, and Transformers to harness its capabilities on-premises or in your own private environment.
This article will walk you through everything you need to know to get DeepSeek-R1 up and running, including:
- A step-by-step guide to installing DeepSeek-R1 with Ollama, vLLM, and Transformers
- DeepSeek-R1’s GPU requirements and considerations
- A detailed chart matching model types to NVIDIA GPUs and VRAM requirements
- Recommended AI NVIDIA workstations from BIZON for powering your DeepSeek-R1 deployments
Prerequisites
Before installing DeepSeek-R1, ensure you have the following in place:
- Python Environment: Python 3.8 or higher is recommended, plus Pip or Conda.
- GPU (optional, but recommended): An NVIDIA GPU with sufficient VRAM. CPU-only mode works but is slower.
- CUDA and Drivers: For GPU usage, install the appropriate NVIDIA drivers and CUDA toolkit.
- Git (optional): Useful for cloning repositories.
- Disk Space: LLMs can occupy multiple gigabytes, so ensure you have enough free space.
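If you want to verify your environment before proceeding, the short check below (a minimal sketch that assumes PyTorch is already installed, as covered in the Transformers section) confirms your Python version, whether CUDA can see your GPU, and roughly how much disk space is free:
import sys
import shutil
import torch  # assumes PyTorch is installed (see the Transformers section below)

# Python 3.8+ is recommended for the tooling in this guide
print(f"Python: {sys.version.split()[0]}")

# True means PyTorch can see at least one NVIDIA GPU through your driver/CUDA setup
print(f"CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB VRAM")

# LLM weights can occupy tens to hundreds of gigabytes
_, _, free = shutil.disk_usage(".")
print(f"Free disk space: {free / 1024**3:.0f} GB")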
Step-by-Step Installation with Ollama
Ollama is a command-line application that simplifies running LLMs locally. Below are the general steps to install and use DeepSeek-R1 with Ollama.
1. Install Ollama
Mac (Apple Silicon / Intel):
brew install ollama
Linux: Use the official install script, or download a release package manually from Ollama’s GitHub:
curl -fsSL https://ollama.com/install.sh | sh
Windows (via WSL):
First, install WSL by opening PowerShell as an administrator and running:
wsl --install
Follow any on-screen instructions and reboot if prompted. Once WSL is installed, open your WSL distribution (e.g., Ubuntu) and follow the Linux steps above inside the WSL environment.
2. Download DeepSeek-R1 for Ollama
ollama pull deepseek-r1
This retrieves the DeepSeek-R1 model weights in GGUF, a quantized format optimized for local inference.
3. Run DeepSeek-R1 with Ollama
ollama run deepseek-r1:<TAG>
Replace <TAG> with the variant you want to run; the deepseek-r1 library on Ollama includes tags such as 1.5b, 7b, 8b, 14b, 32b, 70b, and 671b. An interactive session will start in your terminal, allowing you to input prompts and receive responses.
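Beyond the interactive CLI, Ollama also exposes a local REST API (on port 11434 by default), which makes it easy to script prompts. A minimal sketch in Python using the requests library, assuming a deepseek-r1 tag such as 7b has already been pulled:
import requests

# Ollama's local REST API listens on http://localhost:11434 by default
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",  # assumes this tag has been pulled with `ollama pull`
        "prompt": "Summarize what DeepSeek-R1 is in two sentences.",
        "stream": False,            # return the full completion as a single JSON object
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])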
4. GPU Acceleration (Optional)
If you have a supported GPU and the correct CUDA setup, Ollama can offload model parts to the GPU for faster inference. Refer to Ollama’s documentation to confirm GPU backend support and configurations.
Step-by-Step Installation with vLLM
vLLM is a high-throughput, optimized inference engine suitable for production scenarios. Below is how you can install and run DeepSeek-R1 with vLLM.
1. Install vLLM
conda create -n vllm_env python=3.9
conda activate vllm_env
pip install vllm
2. Download DeepSeek-R1
Clone the model repository from the Hugging Face Hub (vLLM can also download it automatically if you pass the Hub model ID directly):
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1
3. Run DeepSeek-R1 with vLLM
vllm serve deepseek-ai/DeepSeek-R1 \
--port 8000
Replace deepseek-ai/DeepSeek-R1 with a local path or another model identifier as needed; for a single-GPU test, a distilled checkpoint such as deepseek-ai/DeepSeek-R1-Distill-Qwen-7B is far more practical. The server exposes an OpenAI-compatible API at http://127.0.0.1:8000.
4. Example Usage
curl -X POST http://127.0.0.1:8000/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"prompt": "Explain the significance of natural language processing:",
"max_tokens": 100
}'
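Because vllm serve exposes an OpenAI-compatible API, you can also call it from Python with the openai client library. A minimal sketch, assuming the server from the previous step is running locally on port 8000:
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is required by the client but unused
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # must match the model name the server was launched with
    prompt="Explain the significance of natural language processing:",
    max_tokens=100,
)
print(completion.choices[0].text)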
5. GPU Utilization
vLLM uses the GPU by default when one is available. To control which devices it sees, set CUDA_VISIBLE_DEVICES before launching the server; for multi-GPU setups, pass --tensor-parallel-size to shard the model across GPUs for higher throughput.
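For batch or offline workloads you can also skip the HTTP server and drive vLLM directly from Python. The sketch below is an illustration only: it assumes two GPUs and uses the distilled 7B checkpoint, since the full 671B model needs a multi-node setup.
from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs with tensor parallelism
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # swap in your own path or model ID
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["What are the key use cases for semantic search?"], params)
for out in outputs:
    print(out.outputs[0].text)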
Step-by-Step Installation with Transformers (Hugging Face)
Transformers by Hugging Face is one of the most popular libraries for working with LLMs. Below is how to install and use DeepSeek-R1 via Transformers.
1. Install Transformers
pip install transformers accelerate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Adjust the CUDA version in the torch installation URL (cu118 here) to match your system.
2. Load DeepSeek-R1
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # or a local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
The full deepseek-ai/DeepSeek-R1 checkpoint is far too large for a single GPU, so for single-node use load one of the distilled variants as shown above. If you have a local copy, point model_id at that path instead.
3. Inference Example
prompt = "What are the key use cases for semantic search?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_tokens = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
4. Performance Tuning
Use mixed precision (FP16 or BF16) or 8-bit/4-bit quantization to reduce VRAM usage. Hugging Face’s accelerate library can also distribute the model across multiple GPUs for better performance.
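As a concrete example of the quantization route, here is a minimal sketch of 4-bit loading via the bitsandbytes integration in Transformers (it assumes the bitsandbytes package is installed and again uses the distilled 7B checkpoint):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# 4-bit NF4 quantization: weights are stored in 4 bits, matrix multiplies run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on the available GPU(s)
)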
DeepSeek-R1 and GPU Requirements
The following table outlines approximate VRAM requirements for DeepSeek-R1 variants at standard half precision (FP16), along with recommended GPUs for inference. For the largest models (hundreds of billions of parameters), multi-GPU setups with tensor or pipeline parallelism are necessary.
Model | Parameters | VRAM Requirement (FP16) | Recommended GPU |
---|---|---|---|
DeepSeek-R1-Zero | 671B | ~1,342 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x16) |
DeepSeek-R1 | 671B | ~1,342 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x16) |
DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ~3.5 GB | NVIDIA RTX 3060 12GB or higher |
DeepSeek-R1-Distill-Qwen-7B | 7B | ~16 GB | NVIDIA RTX 4080 16GB or higher |
DeepSeek-R1-Distill-Llama-8B | 8B | ~18 GB | NVIDIA RTX 4080 16GB or higher |
DeepSeek-R1-Distill-Qwen-14B | 14B | ~32 GB | Multi-GPU setup (e.g., NVIDIA RTX 4090 x2) |
DeepSeek-R1-Distill-Qwen-32B | 32B | ~74 GB | Multi-GPU setup (e.g., NVIDIA RTX 4090 x4) |
DeepSeek-R1-Distill-Llama-70B | 70B | ~161 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x2) |
The table below covers approximate VRAM requirements when using 4-bit quantization. This significantly reduces memory usage compared to standard precision, but still may require multi-GPU solutions for the largest model variants:
Model | Parameters | VRAM Requirement (4-bit) | Recommended GPU |
---|---|---|---|
DeepSeek-R1-Zero | 671B | ~336 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x6) |
DeepSeek-R1 | 671B | ~336 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x6) |
DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ~1 GB | NVIDIA RTX 3050 8GB or higher |
DeepSeek-R1-Distill-Qwen-7B | 7B | ~4 GB | NVIDIA RTX 3060 12GB or higher |
DeepSeek-R1-Distill-Llama-8B | 8B | ~4.5 GB | NVIDIA RTX 3060 12GB or higher |
DeepSeek-R1-Distill-Qwen-14B | 14B | ~8 GB | NVIDIA RTX 4080 16GB or higher |
DeepSeek-R1-Distill-Qwen-32B | 32B | ~18 GB | NVIDIA RTX 4090 24GB or higher |
DeepSeek-R1-Distill-Llama-70B | 70B | ~40 GB | Multi-GPU setup (e.g., NVIDIA RTX 4090 24GB x2) |
*Note: The VRAM figures above are approximate inference estimates. Techniques such as CPU or disk offloading, sharding the model across devices, or lower-bit quantization can reduce per-GPU memory usage further if your hardware falls short; gradient checkpointing and 8-bit optimizers help mainly during fine-tuning rather than inference.
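The figures in both tables follow roughly from parameter count multiplied by bytes per parameter; real usage is somewhat higher because of activations, the KV cache, and framework buffers. A back-of-the-envelope sketch:
def estimate_weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Weights-only VRAM in GB: parameters x bytes per parameter.
    Actual usage is higher due to activations, KV cache, and framework overhead."""
    return params_billion * (bits_per_param / 8)

print(estimate_weight_vram_gb(671, 16))  # ~1342 GB: the full model in FP16
print(estimate_weight_vram_gb(671, 4))   # ~336 GB: the full model at 4 bits
print(estimate_weight_vram_gb(7, 16))    # ~14 GB: the 7B distill in FP16
print(estimate_weight_vram_gb(7, 4))     # ~3.5 GB: the 7B distill at 4 bits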
Recommended AI NVIDIA Workstations from BIZON
If you’re deploying models like DeepSeek-R1 in a professional or research setting, having a powerful workstation is critical. BIZON offers custom-built AI workstations optimized for deep learning and heavy compute loads. Below are three recommended configurations:
- Entry-Level to Mid-Range AI
- CPU: Threadripper PRO CPUs up to 96 cores
- GPU: Up to 2× NVIDIA RTX 4090/5090, or up to 4× NVIDIA A800/A6000
- RAM: Up to 1 TB ECC DDR4/DDR5 (CPU dependent)
Excellent for small to mid-scale projects, data science workloads, and running or fine-tuning the smaller and mid-sized DeepSeek-R1 distilled models.
- Mid-Range to Advanced AI
- CPU: Threadripper PRO CPUs up to 96 cores
- GPU: Up to 7× water-cooled NVIDIA H200 / H100 / RTX 4090 / RTX 5090 GPUs
- RAM: Up to 1 TB ECC DDR4/DDR5 (CPU dependent)
- Storage: Multiple NVMe SSDs (RAID configurations available)
Perfect for research labs or enterprises needing robust multi-GPU performance for larger DeepSeek-R1 deployments and data-intensive workflows.
- High-End AI & HPC
- CPU: Dual AMD EPYC CPUs up to 384 cores
- GPU: Up to 8× NVIDIA RTX 4090 / RTX 5090 / A100 / H100 / H200
- RAM: Up to 3 TB ECC DDR4/DDR5
- Storage: Multiple NVMe SSDs with optional RAID setups
Designed for mission-critical, large-scale DeepSeek-R1 training or inference, capable of handling the largest model variants and complex HPC workloads.
Why BIZON?
- Customization: Tailor CPU, GPU, memory, and storage to your specific workloads.
- Thermal Design: BIZON’s high airflow and liquid cooling solutions ensure optimal GPU temperatures.
- Support & Warranty: Specialized deep learning support and flexible upgrade paths.
Conclusion
DeepSeek-R1 is a powerful language model that you can run locally for maximum control and privacy. Whether you choose Ollama, vLLM, or Hugging Face’s Transformers, you have flexible deployment options to integrate DeepSeek-R1 into your workflow. GPU acceleration is highly recommended for any medium or large-scale project, with VRAM requirements scaling alongside model size.
Here’s a quick recap of the key points:
- Ollama: User-friendly CLI for macOS, Linux, and Windows (via WSL), offering quick setups.
- vLLM: Optimized for production with high throughput.
- Transformers: A robust, feature-rich library for custom NLP workflows.
- GPU Requirements: Match the model size to an appropriate GPU; from RTX 3060 for smaller models to A100/H100 for the largest models.
- BIZON Workstations: A turnkey solution for AI, built for heavy workloads and backed by expert support.
With a properly configured system—be it a single-GPU workstation or a multi-GPU powerhouse—you can unlock the full potential of DeepSeek-R1 for applications like semantic search, question-answering, text summarization, and beyond. Refer to this guide whenever you need a refresher on installing and running DeepSeek-R1 locally, and don’t hesitate to explore BIZON workstations for a scalable, high-performance solution.