
Deepseek R1 Offline Deployment Guide

Author: Adrian · October 30, 2025

Overview

DeepSeek-R1 is a reasoning model developed by DeepSeek. It is fine-tuned with reinforcement learning to improve reasoning performance and is designed for complex tasks such as mathematics, coding, and natural-language reasoning. This article describes how to run DeepSeek-R1 offline on the EASY-EAI-Orin-Nano (RK3576), whose edge-AI efficiency and low cost make it well suited to on-device inference.

Development environment setup

RKLLM-Toolkit installation

This section explains how to install the RKLLM-Toolkit toolchain with pip. Follow the steps below.

Install miniforge3

To avoid conflicts between multiple Python versions, use miniforge3 to manage the Python environment. First check whether miniforge3 and conda are already installed; if so, skip this step.
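For example, the following prints the installed version if Conda is already present:

conda -V

If it is not installed, download the miniforge3 installer: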

wget -c https://mirrors.bfsu.edu.cn/github-release/conda-forge/miniforge/LatestRelease/Miniforge3-Linux-x86_64.sh

Install miniforge3:

chmod 777 Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh

Create RKLLM-Toolkit Conda environment

Enter the Conda base environment:

source ~/miniforge3/bin/activate

Create a Python 3.8 Conda environment named RKLLM-Toolkit (Python 3.8 is the recommended version):

conda create -n RKLLM-Toolkit python=3.8

Activate the RKLLM-Toolkit Conda environment:

conda activate RKLLM-Toolkit

Install RKLLM-Toolkit

Within the RKLLM-Toolkit Conda environment, use pip to install the provided wheel packages. The installer downloads the remaining dependencies automatically:

pip3 install nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl
pip3 install torch-2.1.0-cp38-cp38-manylinux1_x86_64.whl
pip3 install rkllm_toolkit-1.1.4-cp38-cp38-linux_x86_64.whl

If some dependencies download slowly during installation, you can download the packages manually from the Python package index:

https://pypi.org/
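Alternatively, pip accepts a different package index through its standard -i option, so the whole install can be pointed at a nearby mirror. The mirror URL below is an assumption; substitute whichever index is fastest for you:

pip3 install -i https://mirrors.bfsu.edu.cn/pypi/web/simple torch-2.1.0-cp38-cp38-manylinux1_x86_64.whl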

If the commands above run without errors, the installation is successful.
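A quick sanity check is to import the toolkit from Python inside the Conda environment. Assuming the wheel installs the rkllm.api module (as in the RKLLM examples), this one-liner should exit silently:

python -c "from rkllm.api import RKLLM"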

DeepSeek-R1 model conversion

This chapter explains how to convert the DeepSeek-R1 large language model to the RKLLM format.

Model and script download

Two types of model files are provided: the original Hugging Face model and the converted NPU model.

Model conversion

After downloading, place the model and conversion scripts in the same directory.

In the RKLLM-Toolkit Conda environment, run the conversion script provided with the toolkit to perform the model conversion.

After conversion, a file named deepseek_w4a16.rkllm will be generated as the NPU-optimized model file.
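For reference, RKLLM-Toolkit conversion scripts follow a load/build/export pattern. The sketch below is a minimal example under stated assumptions, not the shipped script: the model directory name is hypothetical, and the build parameters are inferred from the w4a16 output name and the RK3576 target. Check the script bundled with the download for the exact values.

# Minimal conversion sketch (RKLLM-Toolkit 1.1.4 API; parameter names may differ across versions).
from rkllm.api import RKLLM

MODEL_DIR = './deepseek-r1-hf'  # hypothetical path to the downloaded Hugging Face model

llm = RKLLM()

# Load the original Hugging Face checkpoint.
ret = llm.load_huggingface(model=MODEL_DIR)
assert ret == 0, 'load failed'

# Quantize to w4a16 for the RK3576 NPU (matches the deepseek_w4a16.rkllm output name).
ret = llm.build(do_quantization=True, quantized_dtype='w4a16', target_platform='rk3576')
assert ret == 0, 'build failed'

# Export the NPU model file used by the deployment step.
ret = llm.export_rkllm('./deepseek_w4a16.rkllm')
assert ret == 0, 'export failed'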

DeepSeek-R1 model deployment

This chapter explains how to run the RKLLM-format NPU model on EASY-EAI-Orin-Nano hardware.

Copy the example project files into an NFS-mounted directory. If you are unsure how to set up the mount, refer to the Getting Started/Development Environment/NFS setup and mount documentation.

Important: Copy the source directory and model locally on the board (for example, to /userdata). Running a large model directly from an NFS mount can cause slow initialization.
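For example, assuming the demo and model were unpacked on the NFS mount at /mnt/nfs (a hypothetical path; substitute your own mount point):

cp -r /mnt/nfs/deepseek-demo /userdata/
cp /mnt/nfs/deepseek_w4a16.rkllm /userdata/deepseek-demo/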

Enter the example directory on the development board and run the build script as shown:

cd /userdata/deepseek-demo/
./build.sh

Run the example and test

Enter the deepseek-demo_release directory and run the sample program:

cd deepseek-demo_release/
ulimit -HSn 102400
./deepseek-demo deepseek_w4a16.rkllm 256 512
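Here ulimit -HSn 102400 raises both the hard and soft open-file limits before the model is loaded, and the two numeric arguments are max_new_tokens (256) and max_context_len (512), matching the usage string in the example source below.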

You can now test a conversation. For example, input: "The two legs of a right triangle are 3 and 4, what is the hypotenuse?" The model will respond accordingly.

RKLLM algorithm example

The example source is located at rkllm-demo/src/main.cpp; the operation flow is shown in the example code below.

#include <string.h>
#include <unistd.h>
#include <string>
#include "rkllm.h"
#include <fstream>
#include <iostream>
#include <csignal>
#include <cstdlib>
#include <vector>

#define PROMPT_TEXT_PREFIX "<|im_start|>system You are a helpful assistant. <|im_end|> <|im_start|>user"
#define PROMPT_TEXT_POSTFIX "<|im_end|><|im_start|>assistant"

using namespace std;

LLMHandle llmHandle = nullptr;

void exit_handler(int signal)
{
    if (llmHandle != nullptr)
    {
        cout << "The program is about to exit" << endl;
        LLMHandle _tmp = llmHandle;
        llmHandle = nullptr;
        rkllm_destroy(_tmp);
    }
    exit(signal);
}

void callback(RKLLMResult *result, void *userdata, LLMCallState state)
{
    if (state == RKLLM_RUN_FINISH)
    {
        printf("\n");
    } else if (state == RKLLM_RUN_ERROR) {
        printf("\nrun error\n");
    } else if (state == RKLLM_RUN_GET_LAST_HIDDEN_LAYER) {
        /* ================================================================================
        If using GET_LAST_HIDDEN_LAYER, the callback interface returns a memory pointer:
        last_hidden_layer, the token count: num_tokens, and the hidden layer size: embd_size.
        These three parameters can be used to retrieve data from last_hidden_layer.
        Note: retrieve the data within the current callback; if not retrieved in time,
        the pointer will be freed on the next callback.
        ================================================================================ */
        if (result->last_hidden_layer.embd_size != 0 && result->last_hidden_layer.num_tokens != 0) {
            int data_size = result->last_hidden_layer.embd_size * result->last_hidden_layer.num_tokens * sizeof(float);
            printf("\ndata_size:%d", data_size);
            std::ofstream outFile("last_hidden_layer.bin", std::ios::binary);
            if (outFile.is_open()) {
                outFile.write(reinterpret_cast<const char *>(result->last_hidden_layer.hidden_states), data_size);
                outFile.close();
                std::cout << "Data saved to last_hidden_layer.bin successfully!" << std::endl;
            } else {
                std::cerr << "Failed to open the file for writing!" << std::endl;
            }
        }
    } else if (state == RKLLM_RUN_NORMAL) {
        printf("%s", result->text);
    }
}

int main(int argc, char **argv)
{
    if (argc < 4) {
        std::cerr << "Usage: " << argv[0] << " model_path max_new_tokens max_context_len\n";
        return 1;
    }

    signal(SIGINT, exit_handler);
    printf("rkllm init start\n");

    // Set parameters and initialize
    RKLLMParam param = rkllm_createDefaultParam();
    param.model_path = argv[1];

    // Set sampling parameters
    param.top_k = 1;
    param.top_p = 0.95;
    param.temperature = 0.8;
    param.repeat_penalty = 1.1;
    param.frequency_penalty = 0.0;
    param.presence_penalty = 0.0;

    param.max_new_tokens = std::atoi(argv[2]);
    param.max_context_len = std::atoi(argv[3]);
    param.skip_special_token = true;
    param.extend_param.base_domain_id = 0;

    int ret = rkllm_init(&llmHandle, &param, callback);
    if (ret == 0) {
        printf("rkllm init success\n");
    } else {
        printf("rkllm init failed\n");
        exit_handler(-1);
    }

    string text;
    RKLLMInput rkllm_input;

    // Initialize the inference parameter structure
    RKLLMInferParam rkllm_infer_params;
    memset(&rkllm_infer_params, 0, sizeof(RKLLMInferParam));  // Initialize all contents to 0
    rkllm_infer_params.mode = RKLLM_INFER_GENERATE;

    while (true)
    {
        std::string input_str;
        printf("\n");
        printf("user: ");
        std::getline(std::cin, input_str);
        if (input_str == "exit")
        {
            break;
        }
        text = input_str;
        rkllm_input.input_type = RKLLM_INPUT_PROMPT;
        rkllm_input.prompt_input = (char *)text.c_str();
        printf("robot: ");
        rkllm_run(llmHandle, &rkllm_input, &rkllm_infer_params, NULL);
    }

    rkllm_destroy(llmHandle);
    return 0;
}
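Note that generation is streamed: each RKLLM_RUN_NORMAL callback prints one decoded chunk as it arrives, RKLLM_RUN_FINISH terminates the line, and rkllm_run blocks inside the interactive loop until the reply is complete. The SIGINT handler passes the handle to rkllm_destroy so that Ctrl+C releases NPU resources cleanly.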