How to Run DeepSeek-R1 Locally, a Free Alternative to OpenAI’s o1 Model: Installation and Hardware Requirements
Introduction
Large Language Models (LLMs) have transformed the world of natural language processing by enabling a range of advanced use cases, from question-answering to code generation. DeepSeek-R1 is one of the latest specialized language models designed to deliver powerful performance in tasks like semantic search, text summarization, classification, and more. Thanks to open-source tooling, you can deploy DeepSeek-R1 locally with frameworks such as Ollama, vLLM, and Transformers to harness its capabilities on-premises or in your own private environment.
This article will walk you through everything you need to know to get DeepSeek-R1 up and running, including:
- A step-by-step guide to installing DeepSeek-R1 with Ollama, vLLM, and Transformers
- DeepSeek-R1’s GPU requirements and considerations
- A detailed chart matching model types to NVIDIA GPUs and VRAM requirements
- Recommended AI NVIDIA workstations from BIZON for powering your DeepSeek-R1 deployments
Prerequisites
Before installing DeepSeek-R1, ensure you have the following in place:
- Python Environment: Python 3.8 or higher is recommended, plus Pip or Conda.
- GPU (optional, but recommended): An NVIDIA GPU with sufficient VRAM. CPU-only mode works but is slower.
- CUDA and Drivers: For GPU usage, install the appropriate NVIDIA drivers and CUDA toolkit.
- Git (optional): Useful for cloning repositories.
- Disk Space: LLMs can occupy multiple gigabytes, so ensure you have enough free space.
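If you want to verify your environment before proceeding, the short check below (a minimal sketch that assumes PyTorch is already installed, as covered in the Transformers section) confirms your Python version, whether CUDA can see your GPU, and roughly how much disk space is free:
import sys
import shutil
import torch  # assumes PyTorch is installed (see the Transformers section below)

# Python 3.8+ is recommended for the tooling in this guide
print(f"Python: {sys.version.split()[0]}")

# True means PyTorch can see at least one NVIDIA GPU through your driver/CUDA setup
print(f"CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB VRAM")

# LLM weights can occupy tens to hundreds of gigabytes
_, _, free = shutil.disk_usage(".")
print(f"Free disk space: {free / 1024**3:.0f} GB")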
Step-by-Step Installation with Ollama
Ollama is a command-line application that simplifies running LLMs locally. Below are the general steps to install and use DeepSeek-R1 with Ollama.
1. Install Ollama
Mac (Apple Silicon / Intel):
brew install ollama
Linux: Use the official install script, or download a release package manually from Ollama’s GitHub:
curl -fsSL https://ollama.com/install.sh | sh
Windows (via WSL):
First, install WSL by opening PowerShell as an administrator and running:
wsl --install
Follow any on-screen instructions and reboot if prompted. Once WSL is installed, open your WSL distribution (e.g., Ubuntu) and follow the Linux steps above inside the WSL environment.
2. Download DeepSeek-R1 for Ollama
ollama pull deepseek-r1
This retrieves the DeepSeek-R1 model weights in GGUF, a quantized format optimized for local inference.
3. Run DeepSeek-R1 with Ollama
ollama run deepseek-r1:<TAG>
Replace <TAG> with the variant you want to run; the deepseek-r1 library on Ollama includes tags such as 1.5b, 7b, 8b, 14b, 32b, 70b, and 671b. An interactive session will start in your terminal, allowing you to input prompts and receive responses.
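Beyond the interactive CLI, Ollama also exposes a local REST API (on port 11434 by default), which makes it easy to script prompts. A minimal sketch in Python using the requests library, assuming a deepseek-r1 tag such as 7b has already been pulled:
import requests

# Ollama's local REST API listens on http://localhost:11434 by default
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:7b",  # assumes this tag has been pulled with `ollama pull`
        "prompt": "Summarize what DeepSeek-R1 is in two sentences.",
        "stream": False,            # return the full completion as a single JSON object
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])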
4. GPU Acceleration (Optional)
If you have a supported GPU and the correct CUDA setup, Ollama can offload model parts to the GPU for faster inference. Refer to Ollama’s documentation to confirm GPU backend support and configurations.
Step-by-Step Installation with vLLM
vLLM is a high-throughput, optimized inference engine suitable for production scenarios. Below is how you can install and run DeepSeek-R1 with vLLM.
1. Install vLLM
conda create -n vllm_env python=3.9
conda activate vllm_env
pip install vllm
2. Download DeepSeek-R1
Clone the model repository from the Hugging Face Hub (vLLM can also download it automatically if you pass the Hub model ID directly):
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-R1
3. Run DeepSeek-R1 with vLLM
vllm serve deepseek-ai/DeepSeek-R1 \
--port 8000
Replace deepseek-ai/DeepSeek-R1 with a local path or another model identifier as needed; for a single-GPU test, a distilled checkpoint such as deepseek-ai/DeepSeek-R1-Distill-Qwen-7B is far more practical. The server exposes an OpenAI-compatible API at http://127.0.0.1:8000.
4. Example Usage
curl -X POST http://127.0.0.1:8000/v1/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"prompt": "Explain the significance of natural language processing:",
"max_tokens": 100
}'
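Because vllm serve exposes an OpenAI-compatible API, you can also call it from Python with the openai client library. A minimal sketch, assuming the server from the previous step is running locally on port 8000:
from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is required by the client but unused
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # must match the model name the server was launched with
    prompt="Explain the significance of natural language processing:",
    max_tokens=100,
)
print(completion.choices[0].text)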
5. GPU Utilization
vLLM uses the GPU by default when one is available. To control which devices it sees, set CUDA_VISIBLE_DEVICES before launching the server; for multi-GPU setups, pass --tensor-parallel-size to shard the model across GPUs for higher throughput.
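For batch or offline workloads you can also skip the HTTP server and drive vLLM directly from Python. The sketch below is an illustration only: it assumes two GPUs and uses the distilled 7B checkpoint, since the full 671B model needs a multi-node setup.
from vllm import LLM, SamplingParams

# Shard the model across 2 GPUs with tensor parallelism
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # swap in your own path or model ID
    tensor_parallel_size=2,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["What are the key use cases for semantic search?"], params)
for out in outputs:
    print(out.outputs[0].text)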
Step-by-Step Installation with Transformers (Hugging Face)
Transformers by Hugging Face is one of the most popular libraries for working with LLMs. Below is how to install and use DeepSeek-R1 via Transformers.
1. Install Transformers
pip install transformers accelerate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Adjust the CUDA version in the torch installation URL (cu118 here) to match your system.
2. Load DeepSeek-R1
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # or a local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
The full deepseek-ai/DeepSeek-R1 checkpoint is far too large for a single GPU, so for single-node use load one of the distilled variants as shown above. If you have a local copy, point model_id at that path instead.
3. Inference Example
prompt = "What are the key use cases for semantic search?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_tokens = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True))
4. Performance Tuning
Use mixed precision (FP16 or BF16) or 8-bit/4-bit quantization to reduce VRAM usage. Hugging Face’s accelerate library can also distribute the model across multiple GPUs for better performance.
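As a concrete example of the quantization route, here is a minimal sketch of 4-bit loading via the bitsandbytes integration in Transformers (it assumes the bitsandbytes package is installed and again uses the distilled 7B checkpoint):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# 4-bit NF4 quantization: weights are stored in 4 bits, matrix multiplies run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # let accelerate place layers on the available GPU(s)
)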
DeepSeek-R1 and GPU Requirements
The following table outlines approximate VRAM requirements for DeepSeek-R1 variants at standard half precision (FP16), along with recommended GPUs for inference. For the largest models (hundreds of billions of parameters), multi-GPU setups with tensor or pipeline parallelism are necessary.
Model | Parameters | VRAM Requirement (FP16) | Recommended GPU |
---|---|---|---|
DeepSeek-R1-Zero | 671B | ~1,342 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x16) |
DeepSeek-R1 | 671B | ~1,342 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x16) |
DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ~3.5 GB | NVIDIA RTX 3060 12GB or higher |
DeepSeek-R1-Distill-Qwen-7B | 7B | ~16 GB | NVIDIA RTX 4080 16GB or higher |
DeepSeek-R1-Distill-Llama-8B | 8B | ~18 GB | NVIDIA RTX 4080 16GB or higher |
DeepSeek-R1-Distill-Qwen-14B | 14B | ~32 GB | Multi-GPU setup (e.g., NVIDIA RTX 4090 x2) |
DeepSeek-R1-Distill-Qwen-32B | 32B | ~74 GB | Multi-GPU setup (e.g., NVIDIA RTX 4090 x4) |
DeepSeek-R1-Distill-Llama-70B | 70B | ~161 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x2) |
The table below covers approximate VRAM requirements when using 4-bit quantization. This significantly reduces memory usage compared to standard precision, but still may require multi-GPU solutions for the largest model variants:
Model | Parameters | VRAM Requirement (4-bit) | Recommended GPU |
---|---|---|---|
DeepSeek-R1-Zero | 671B | ~336 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x6) |
DeepSeek-R1 | 671B | ~336 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x6) |
DeepSeek-R1-Distill-Qwen-1.5B | 1.5B | ~1 GB | NVIDIA RTX 3050 8GB or higher |
DeepSeek-R1-Distill-Qwen-7B | 7B | ~4 GB | NVIDIA RTX 3060 12GB or higher |
DeepSeek-R1-Distill-Llama-8B | 8B | ~4.5 GB | NVIDIA RTX 3060 12GB or higher |
DeepSeek-R1-Distill-Qwen-14B | 14B | ~8 GB | NVIDIA RTX 4080 16GB or higher |
DeepSeek-R1-Distill-Qwen-32B | 32B | ~18 GB | NVIDIA RTX 4090 24GB or higher |
DeepSeek-R1-Distill-Llama-70B | 70B | ~40 GB | Multi-GPU setup (e.g., NVIDIA RTX 4090 24GB x2) |
*Note: The VRAM figures above are approximate inference estimates. Techniques such as CPU or disk offloading, sharding the model across devices, or lower-bit quantization can reduce per-GPU memory usage further if your hardware falls short; gradient checkpointing and 8-bit optimizers help mainly during fine-tuning rather than inference.
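The figures in both tables follow roughly from parameter count multiplied by bytes per parameter; real usage is somewhat higher because of activations, the KV cache, and framework buffers. A back-of-the-envelope sketch:
def estimate_weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Weights-only VRAM in GB: parameters x bytes per parameter.
    Actual usage is higher due to activations, KV cache, and framework overhead."""
    return params_billion * (bits_per_param / 8)

print(estimate_weight_vram_gb(671, 16))  # ~1342 GB: the full model in FP16
print(estimate_weight_vram_gb(671, 4))   # ~336 GB: the full model at 4 bits
print(estimate_weight_vram_gb(7, 16))    # ~14 GB: the 7B distill in FP16
print(estimate_weight_vram_gb(7, 4))     # ~3.5 GB: the 7B distill at 4 bits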
Recommended AI NVIDIA Workstations from BIZON
If you’re deploying models like DeepSeek-R1 in a professional or research setting, having a powerful workstation is critical. BIZON offers custom-built AI workstations optimized for deep learning and heavy compute loads. Below are three recommended configurations:
- Entry-Level to Mid-Range AI
- CPU: Threadripper PRO CPUs up to 96 cores
- GPU: Up to 2× NVIDIA RTX 4090/5090, or up to 4× NVIDIA A800/A6000
- RAM: Up to 1 TB ECC DDR4/DDR5 (CPU dependent)
Excellent for small to mid-scale projects, data science workloads, and running or fine-tuning the smaller and mid-sized DeepSeek-R1 distilled models.
- Mid-Range to Advanced AI
- CPU: Threadripper PRO CPUs up to 96 cores
- GPU: Up to 7× water-cooled NVIDIA H200 / H100 / RTX 4090 / RTX 5090 GPUs
- RAM: Up to 1 TB ECC DDR4/DDR5 (CPU dependent)
- Storage: Multiple NVMe SSDs (RAID configurations available)
Perfect for research labs or enterprises needing robust multi-GPU performance for larger DeepSeek-R1 deployments and data-intensive workflows.
- High-End AI & HPC
- CPU: Dual AMD EPYC CPUs up to 384 cores
- GPU: Up to 8× NVIDIA RTX 4090 / RTX 5090 / A100 / H100 / H200
- RAM: Up to 3 TB ECC DDR4/DDR5
- Storage: Multiple NVMe SSDs with optional RAID setups
Designed for mission-critical, large-scale DeepSeek-R1 training or inference, capable of handling the largest model variants and complex HPC workloads.
Why BIZON?
- Customization: Tailor CPU, GPU, memory, and storage to your specific workloads.
- Thermal Design: BIZON’s high airflow and liquid cooling solutions ensure optimal GPU temperatures.
- Support & Warranty: Specialized deep learning support and flexible upgrade paths.
Conclusion
DeepSeek-R1 is a powerful language model that you can run locally for maximum control and privacy. Whether you choose Ollama, vLLM, or Hugging Face’s Transformers, you have flexible deployment options to integrate DeepSeek-R1 into your workflow. GPU acceleration is highly recommended for any medium or large-scale project, with VRAM requirements scaling alongside model size.
Here’s a quick recap of the key points:
- Ollama: User-friendly CLI for macOS, Linux, and Windows (via WSL), offering quick setups.
- vLLM: Optimized for production with high throughput.
- Transformers: A robust, feature-rich library for custom NLP workflows.
- GPU Requirements: Match the model size to an appropriate GPU; from RTX 3060 for smaller models to A100/H100 for the largest models.
- BIZON Workstations: A turnkey solution for AI, built for heavy workloads and backed by expert support.
With a properly configured system—be it a single-GPU workstation or a multi-GPU powerhouse—you can unlock the full potential of DeepSeek-R1 for applications like semantic search, question-answering, text summarization, and beyond. Refer to this guide whenever you need a refresher on installing and running DeepSeek-R1 locally, and don’t hesitate to explore BIZON workstations for a scalable, high-performance solution.