Summary

HiddenLayer’s previous blog post on DeepSeek-R1 highlighted security concerns identified during analysis and urged caution on its deployment. This blog builds on that work, combining it with the principles of ShadowGenes as a means of identifying possible unsanctioned deployment of the model within an organization’s environment. Join us as we delve into the model’s architecture and genealogy to better understand its building blocks and execution flow, comparing and contrasting it with other models.

Introduction

DeepSeek was the talk of the AI town following the release of their R1 model. Multiple blogs from various sources were released, including one from our own team, delving into the security implications of the rapid adoption of the new model. With the general recommendation being to await a full and proper security assessment of the model prior to deployment, the talk cooled somewhat over the following weeks.

During this time, our team analyzed the architecture of the model to assess its genealogy. Since our signature-based method of model genealogy is relatively new, we believe this is a perfect time to walk through a ‘live’ example and demonstrate our process, especially given that this underpins our ShadowLogic detection technique as well.

We will show how we leverage the computational graphs of different model families to identify particular subgraphs that appear across multiple models designed for the same task. This can be used to build a signature to find subsections within model layers that are indicative of specific functionality. We also show how we can identify subgraphs that are unique to a single model family and how we validate this through code analysis and review of technical documentation. If a uniquely identifiable subgraph is found, this can be used to build a more specific signature for a particular model family.

In this blog, we will walk through DeepSeek-R1’s evolution based on its architecture and that of its base model (DeepSeekV3). On top of giving a general outline of the model’s structure and progression over time, we will dive into the use and visualization of:

  • Identifiable attributes within a given layer’s attention mechanism;
  • The use of Rotary Embeddings (RoPE) in each layer;
  • Mixture-of-Experts (MoE) utilization.

As part of this, we will compare and contrast the architecture with DeepSeek’s prior MoE model and Meta’s Llama 3.

This research has led to a better understanding of the building blocks of the R1 model and the identification of its distinguishing features, which can be useful for identifying instances deployed within your environment without prior approval.

R1’s Evolution

For the purposes of our analysis, our team converted the DeepSeek-R1 model hosted on Hugging Face to the ONNX file format, enabling us to examine its computational graph. We used this, along with a review of associated technical papers and code, to identify shared characteristics and subgraphs observed within other models and to piece together the defining features of its architecture.
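For readers who want to try something similar, a general conversion flow can be sketched with Hugging Face Optimum. The snippet below is illustrative only and uses a small stand-in checkpoint; it is not our exact pipeline, and exporting the full R1 weights requires far more resources and custom handling.

```python
# Illustrative sketch: export a Hugging Face causal-LM checkpoint to ONNX so that
# its computational graph can be inspected. "gpt2" is a small stand-in checkpoint,
# not the model analyzed in this blog.
from optimum.onnxruntime import ORTModelForCausalLM
import onnx

onnx_model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
onnx_model.save_pretrained("./onnx_export")

# The exported graph can then be walked programmatically: every node exposes its
# operator type, inputs/outputs, and name metadata.
graph = onnx.load("./onnx_export/model.onnx").graph
print(f"{len(graph.node)} operators in the graph")
```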

Architectural Overview

Before delving into the technical details of the genealogy of DeepSeek-R1, a high-level overview of its structure is set out below.

The model was found to have sixty-one hidden layers, each following the same basic pattern (sketched in code after the list):

  • Input layer normalization;
  • Self-attention;
  • Post-attention layer normalization;
  • Feed Forward Network (FFN).
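As a rough mental model, that repeating pattern can be expressed as the simplified PyTorch module below. This is illustrative only: the normalization shown is a stand-in for the RMSNorm the model actually uses, and the attention and FFN arguments stand in for the MLA and MoE modules discussed later.

```python
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Simplified sketch of the repeating hidden-layer pattern described above."""
    def __init__(self, hidden_size, attention, ffn):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)           # stand-in for RMSNorm
        self.self_attn = attention                                 # self-attention (MLA in V3/R1)
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)  # post-attention normalization
        self.ffn = ffn                                             # feed forward network (MoE MLP in V3/R1)

    def forward(self, hidden_states):
        residual = hidden_states
        hidden_states = residual + self.self_attn(self.input_layernorm(hidden_states))
        residual = hidden_states
        hidden_states = residual + self.ffn(self.post_attention_layernorm(hidden_states))
        return hidden_states
```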

The key subsections we focus on in this blog are the self-attention mechanism and the FFN. For the self-attention mechanism, we outline how the model leverages Multi-Head Latent Attention (MLA), its use of Rotary Embeddings (RoPE) throughout, and the interesting subgraphs we observed with patterns indicative of masking. For the FFN, we discuss how the model utilizes the Mixture-of-Experts concept within a Multi-Layer Perceptron (MLP).

DeepSeek-R1 and DeepSeekV3

Initial analysis showed that the architecture of these two models is essentially the same, with some minor exceptions. This is to be expected: reading through DeepSeek-R1’s accompanying paper, we see that DeepSeekV3 was used as the base, with Reinforcement Learning (RL) post-training used to develop its reasoning and Chain-of-Thought output, something unlikely to have much impact on the overall structure of the model’s computational graph.

The DeepSeekV3 technical report can therefore be used as a reference guide for how the model’s architecture comes together. For this reason, we will refer to the V3/R1 model architecture in this section of the blog as if referring to one model because – unless stated otherwise – the structural patterns exist in both.

In order to better understand the model’s computational flow, we also cross-referenced our observations from within the computational graph with the code in the modeling_deepseek.py file.

Before any manual examination, we ran our current ShadowGenes signatures against the model. The results indicated the presence of an attention mechanism structure and an MLP structure, both of which are discussed in more detail below. We also had a hit for a broader LLM-related signature that was developed based on previous analysis of Llama, which was not overly surprising and actually somewhat validating, given the references to Llama in the DeepSeek code—providing a nice sanity check for our method! 

Attention Mechanism

A key feature of the V3 architecture is Multi-Head Latent Attention (MLA). Prior to diving into the technicalities of this, let’s go through a high-level overview of how an attention mechanism extracts contextual information from an input sequence: 

Each token in the sequence has a key, query, and value:

  • The key represents the information the token has to offer, so it can answer queries from other tokens;
  • The query represents the information the token wants to retrieve from the keys of other tokens;
  • The value is the actual data that will be passed forward based on the responses.

Put simply, a given token queries the keys of other tokens to determine how much attention it should pay to each. The values from the other tokens are then weighted by those attention scores and combined, enabling the model to understand the context between tokens.

The Key-Value (KV) Cache is part of a mechanism used to store precomputed keys and values to speed up inference.
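To make this concrete, here is a minimal sketch of a single decoding step using standard scaled dot-product attention with a naive KV cache. This is a generic illustration only, not DeepSeek’s MLA implementation (which compresses the cache, as described next).

```python
import torch

def attention_step(q_t, k_cache, v_cache, k_t, v_t):
    """One decoding step: append the new token's key/value to the cache, then let
    its query attend over everything cached so far."""
    k_cache = torch.cat([k_cache, k_t], dim=1)   # keys:   (batch, seq + 1, d)
    v_cache = torch.cat([v_cache, v_t], dim=1)   # values: (batch, seq + 1, d)
    scores = q_t @ k_cache.transpose(-2, -1) / k_cache.shape[-1] ** 0.5  # query vs. every key
    weights = scores.softmax(dim=-1)             # how much attention to pay to each token
    out = weights @ v_cache                      # blend the cached values accordingly
    return out, k_cache, v_cache
```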

MLA

The purpose of MLA is to further increase computational efficiency by compressing the KV Cache into a latent vector.

This concept was carried over from the DeepSeekV2 model and is discussed in V2’s technical report. Reading the paper and reviewing the code, we can piece together where this is visible within the computational graph of the V3 architecture.

After the input layer normalization, a given layer’s graph splits into multiple branches, two of which begin with ‘MatMul’ operators. One of these is associated with the Key (K) and Value (V) projections, and the other is associated with the Query (Q) projections. The metadata label of the former is set as kv_a_proj_with_mqa, while the latter is set as q_a_proj. We can map this back to the following snippet of the DeepseekV3Attention class within the code:

The key parts of this code related to MLA are highlighted in red. Per the technical paper:

The core of MLA is the low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference.

We can, therefore, infer that kv_lora_rank represents the dimension of the compressed KV cache. Then, we see kv_a_proj_with_mqa, which lines up with the metadata label of the ‘MatMul’ operator we see in the graph.
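To illustrate the idea, the sketch below shows how such low-rank compression might look. The attribute names mirror those in the public modeling_deepseek.py, but this is a simplified paraphrase of the concept, not the original snippet; the real DeepseekV3Attention class also handles the RoPE split, multi-head reshaping, and the subsequent up-projections.

```python
import torch.nn as nn

class MLACompressionSketch(nn.Module):
    """Simplified illustration of MLA's low-rank joint compression of keys and values."""
    def __init__(self, hidden_size, q_lora_rank, kv_lora_rank):
        super().__init__()
        # 'MatMul' branch 1 in the graph: down-project hidden states into a small
        # latent vector from which keys and values are later reconstructed.
        self.kv_a_proj_with_mqa = nn.Linear(hidden_size, kv_lora_rank, bias=False)
        self.kv_a_layernorm = nn.LayerNorm(kv_lora_rank)  # stand-in for RMSNorm
        # 'MatMul' branch 2 in the graph: the query-side low-rank projection.
        self.q_a_proj = nn.Linear(hidden_size, q_lora_rank, bias=False)

    def forward(self, hidden_states):
        compressed_kv = self.kv_a_layernorm(self.kv_a_proj_with_mqa(hidden_states))
        compressed_q = self.q_a_proj(hidden_states)
        return compressed_q, compressed_kv  # later up-projected into the full Q, K, and V
```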

The “kv” branch passes through layer normalization into the Rotary Embeddings subsection of the layer, which is our next point of interest!

RoPE

When looking at the visualization of the Rotary Embeddings subsection of the self-attention mechanism, something immediately stood out: it was present in every hidden layer. This is quite rare in other model architectures our team has examined in the past, but it does happen. However, something distinctive about this particular computational graph in comparison to the others was the presence of the ‘Einsum’ operator and the flow of the ‘Sine’ and ‘Cosine’ branches that split from it. 

Figure 1: One key differentiating pattern observed in the DeepSeekV3 model architecture was in the RoPE embeddings section within each hidden layer.

This subgraph was found to be a key differentiator of this specific model architecture and was, therefore, a pattern we used to build a targeted signature for it. The operators highlighted in green in Figure 1 represent subgraphs we had observed in a small number of other models when performing signature testing; one example is this export of codeLlama. Whilst a very similar trait exists within the DeepSeekMoE model we examined as part of this blog, the ‘Mul’ operators highlighted in purple appear to be a new part of the V3 architecture.

Digging into why this pattern within the graph differed from other architectures, we referred back to the code. We found that this pattern appears to be represented within the _set_cos_sin_cache function of the DeepseekV3YarnRotaryEmbedding class; the colors map to the graph as seen in Figure 1:
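In paraphrased form, that function does roughly the following. This is a simplified rendering based on the public code, with the YaRN-specific scaling collapsed into a single mscale factor; the comments note where each colored region of Figure 1 surfaces in the graph.

```python
import torch

def set_cos_sin_cache_sketch(seq_len, inv_freq, mscale, dtype=torch.float32):
    """Paraphrased sketch of DeepseekV3YarnRotaryEmbedding._set_cos_sin_cache."""
    t = torch.arange(seq_len, dtype=torch.float32)
    freqs = torch.einsum("i,j->ij", t, inv_freq)   # outer product: the 'Einsum' operator (green)
    emb = torch.cat((freqs, freqs), dim=-1)
    cos_cached = (emb.cos() * mscale).to(dtype)    # 'Cos' branch, then the YaRN 'Mul' (purple)
    sin_cached = (emb.sin() * mscale).to(dtype)    # 'Sin' branch, then the YaRN 'Mul' (purple)
    return cos_cached, sin_cached
```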

Masking

An interesting repeated subgraph indicative of a causal masking pattern was observed within the self-attention subsection of each layer. The pattern itself is observed in other LLMs, such as Phi3 and Mistral, as well as the Llama model analyzed as part of this blog. This provides another good sanity check for our method because we know these models use similar attention mechanisms, so we should see similar masking patterns.

Something unique to the DeepSeekV3 architecture is how these are structured within each layer. There are sixteen of these masking patterns in each hidden layer, arranged in groups of four, visualized like so:

Figure 2: One key differentiating pattern observed within the DeepSeekV3 model architecture was a masking pattern within each hidden layer. This can be seen in green.

The operators highlighted in green are (individually) a common masking pattern; the structure as a whole was observed as a key differentiator for this architecture. This containing pattern was observed four times in each hidden layer.
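For reference, the kind of causal masking this subgraph corresponds to can be written in a few lines; this is a generic illustration rather than the exact operator sequence seen in the graph.

```python
import torch

def causal_mask(seq_len, dtype=torch.float32):
    """Upper-triangular mask: position i may only attend to positions <= i.
    Added to the attention scores before the softmax, the -inf entries zero out
    attention to future tokens."""
    mask = torch.full((seq_len, seq_len), float("-inf"), dtype=dtype)
    return torch.triu(mask, diagonal=1)

scores = torch.randn(1, 4, 4) + causal_mask(4)  # broadcast over the batch dimension
weights = scores.softmax(dim=-1)                # future positions receive zero weight
```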

MLP

The distinguishing MLP features were observed from the fourth layer onwards; they were not present in layers zero, one, or two. This section of the layer followed from the post-attention layer normalization and split into two major branches: one containing the MLP Gate and Mixture of Experts subsections and the other containing the Shared Experts subsection of the layer. As previously mentioned, our pre-existing signatures for MLP patterns were triggered when scanning this model, but the use of experts was a key differentiator.

Gate

The flow for the MLP gate can be tracked through the MoEGate class of the code. One key difference between the V3 architecture and the MoE model architecture is the use of the ‘Sigmoid’ operator versus the ‘Softmax’ operator, respectively (more on that shortly).

This subsection of the model flows into the MoE section, as it is used to identify which experts are to be used for routing. This is described in section 2.1.2 of the DeepSeekV3 whitepaper, which explicitly states that the use of the sigmoid function is a change from previous iterations of DeepSeek:

Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
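Concretely, the V3-style gating described in that quote can be sketched roughly as follows. This is a simplified illustration (the real MoEGate also handles expert grouping, bias terms, and routed scaling), and the function name is ours.

```python
import torch

def v3_style_gate_sketch(hidden_states, gate_weight, top_k):
    """Score every expert per token with a sigmoid, keep the top-k, and normalize
    the selected scores so they sum to one (the gating values)."""
    logits = hidden_states @ gate_weight.t()             # (tokens, n_experts)
    scores = logits.sigmoid()                            # V3/R1: sigmoid affinity scores
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)   # pick each token's experts
    gating = topk_scores / topk_scores.sum(dim=-1, keepdim=True)  # normalize selected scores
    return topk_idx, gating

# Note: DeepSeekMoE/V2 instead compute the affinity scores with a softmax
# (the 'Softmax' operator discussed later in this blog).
```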

MoE

One of the key points DeepSeek highlights in its technical literature is its novel use of MoE. This is, of course, something that is used in the DeepSeekMoE model, and whilst the theory is retained and the architecture is similar, there are differences in the graphical representation. An MoE comprises multiple ‘experts’ within an FFN, each of which follows the MLP structure. Each token is assigned to a number of experts for processing after self-attention, before passing through to the following layer. DeepSeek applies fine-grained expert segmentation for this, meaning there are a large number of these experts within a given hidden layer.

Figure 3 shows a sample of the experts within a V3 hidden layer. Whilst this pattern is observed within the V3 and R1 models, the R1 model actually has more experts within each layer, which accounts for the differences observed in Figure 4.

Figure 3: One key differentiating pattern observed within the DeepSeekV3 model architecture was the Mixture-of-Experts repeating subgraph.

The operators highlighted in green are (individually) part of the pre-existing MLP signature; the structure as a whole was observed as a key differentiator for this particular architecture. This visualization shows four experts. We observed fifty-five per layer in V3 and sixty-seven per layer in R1.

Shared Experts

The second path the MLP splits into is another subgraph pattern that matches our pre-existing MLP signature but is separated from the MoE section of the model. It has the same pattern as a single expert shown in Figure 3 above. Figure 2 of the DeepSeekMoE paper illustrates why this particular separation exists, referring to it as the shared expert isolation strategy: a strategy that captures and consolidates knowledge across varying contexts.
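Putting the two branches together, a simplified view of the split looks like the sketch below: routed experts selected by the gate on one branch, an always-on shared expert on the other, and the two outputs summed. This is an illustration of the concept rather than DeepSeek’s implementation, which routes tokens far more efficiently than the naive loop shown here.

```python
import torch
import torch.nn as nn

class MoEWithSharedExpertSketch(nn.Module):
    """Simplified view of the two MLP branches in the graph: routed experts chosen
    by the gate, plus a shared expert applied to every token and added back in."""
    def __init__(self, routed_experts, shared_expert, gate):
        super().__init__()
        self.routed_experts = nn.ModuleList(routed_experts)  # many small expert MLPs
        self.shared_expert = shared_expert                   # shared expert isolation branch
        self.gate = gate                                     # returns (expert indices, gating values)

    def forward(self, tokens):
        topk_idx, gating = self.gate(tokens)
        routed_out = torch.zeros_like(tokens)
        for t in range(tokens.shape[0]):                     # naive per-token routing loop
            for slot in range(topk_idx.shape[1]):
                expert = self.routed_experts[int(topk_idx[t, slot])]
                routed_out[t] += gating[t, slot] * expert(tokens[t])
        # The shared branch bypasses routing entirely and is summed with the routed output.
        return routed_out + self.shared_expert(tokens)
```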

Graph Operator Count

As part of our analysis, we generated histograms showing the operator count within each model for comparison. As can be seen in Figure 4, the histograms for the V3 and R1 models are almost identical and share the same visual pattern. The key difference is the number of operators, in particular those associated with MoE, such as ‘MatMul’ and ‘Sigmoid’—which is to be expected based on the above.

Figure 4: Side-by-side comparison of operator counts for DeepSeek-R1 vs. DeepSeekV3. This excludes constant operators.
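Producing counts like these is straightforward once a model is in ONNX form; for example (an illustrative snippet assuming a local model.onnx file, such as the export shown earlier):

```python
from collections import Counter
import onnx

model = onnx.load("model.onnx")                        # any ONNX export
op_counts = Counter(node.op_type for node in model.graph.node)
op_counts.pop("Constant", None)                        # exclude constant operators, as in Figure 4

for op, count in op_counts.most_common():
    print(f"{op}: {count}")
```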

DeepSeek MoE

The DeepSeekMoE technical report can be used as a reference guide for how the model’s architecture comes together. Reading this, analyzing the computational graph, and cross-referencing it with its own version of modeling_deepseek.py, it is clear that the basic structure of the architecture can be mapped to the V3 architecture outlined above, albeit with the MoE model having twenty-eight hidden layers. This is a useful part of understanding the genealogy of the R1 model. There are some key differences, however, which are outlined below.

Attention Mechanism

RoPE

The cosine and sine flows shown earlier still exist within the rotary embeddings section of each layer, with one small difference in the graph: the ‘Cos’ and ‘Sin’ operators flow straight into their respective ‘Cast’ operators, without the ‘Mul’ operator observed in V3. This tracks with the _set_cos_sin_cache function of the code, which ends like so:

Given that this difference primarily stems from V3’s rotary embeddings class associated with YaRN (Yet Another RoPE extensioN), which is not used in the DeepSeekMoE code, nor in the other models analyzed as part of this blog, we believe this difference in the graph is caused by YaRN’s introduction in V3.

MLP

Despite following a similar flow, the MLP section of the model took a different shape, with slightly different operators used.

Auxiliary Loss

A major difference observed in the DeepSeekMoE model architecture when compared with the V3 architecture was the presence of multiple branches containing ‘If’ operators within the MLP. These are visualized in Figure 5. This pattern does not carry over to the V3 architecture and explains one major difference observed in the operator count histogram.

According to the corresponding paper, Expert-Level Balance Loss was introduced to “mitigate the risk of routing collapse”. 

Figure 5: ‘If’ operator branches that appear to be related to Expert-Level Balance Loss.

This can be found in the forward function of the MoEGate class within the code: 
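In paraphrased form, the expert-level term computes something along these lines. This is a simplified sketch of the published formulation, not the original snippet; the real forward also covers sequence-level variants and conditional training-only branches.

```python
import torch

def expert_level_balance_loss_sketch(scores, topk_idx, n_experts, top_k, alpha):
    """f_i: fraction of token-to-expert assignments landing on expert i (scaled);
    P_i: mean gate probability for expert i; loss = alpha * sum(f_i * P_i)."""
    counts = torch.bincount(topk_idx.reshape(-1), minlength=n_experts).float()
    f = counts * n_experts / (top_k * topk_idx.shape[0])   # load fraction per expert
    p = scores.mean(dim=0)                                  # mean gate probability per expert
    return alpha * (f * p).sum()
```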

Gate

As has already been stated, the V3 architecture uses the ‘Sigmoid’ operator in the MLP gate, as opposed to the ‘Softmax’ operator used in V2. As shown in Figure 6, the DeepSeekMoE model likewise uses the ‘Softmax’ operator here, aligning it with V2 rather than V3:

Figure 6: The use of the ‘Softmax’ operator in the MLP gate subsection of the layer. This flows through the ‘TopK’ operator into the MoE subsection of the layer for post-attention processing.

MoE

The output of the MLP gate goes through the ‘experts’ in the layer. Of note, the MoE subgraphs started occurring in layer one of this model, as opposed to layer three in the V3 architecture. The operators are also slightly different, as shown below:

Figure 7: The MoE repeated pattern observed within the DeepSeekMoE model.

A shared expert was also visible in this computational graph, one per layer, with the same visualization as that seen in V3.

Graph Operator Count

Figure 8 shows the operator counts for the DeepSeekMoE model. Key differences compared with the V3 architecture discussed above include the large number of ‘If’ operators, as well as a larger relative share of ‘Softmax’ operators.

Figure 8: The operator counts for the DeepSeek MoE model. This excludes constant operators. 

Llama 3.1

Technical papers associated with DeepSeek reference Meta’s Llama models. For the purposes of this blog, we have focused on Llama-3.1-8B. The similarities become apparent when comparing the DeepSeek code we have been looking at to modeling_llama.py. There are, of course, major differences between the models, as well as some striking similarities, both of which are discussed below.

Attention Mechanism

A subgraph pattern we originally used to build a signature, based on analysis of Llama models’ attention mechanisms, was also found within the DeepSeekMoE architecture and carried through to the V3 architecture, albeit with a slight change in the computational graph. This section of the layer is near the beginning of the attention mechanism and is shown in Figure 9 below:

Figure 9: Repeating subgraph observed within the self-attention section of Llama 3.1 that is also observed in the DeepSeekMoE architecture and carries over to the V3 architecture.

RoPE

In the Llama model, rather than the Rotary Embeddings being applied at each layer, they were applied once, at the start of the model. The operators involved in the computational graph were similar, but not the same: the RoPE section of the model still splits into ‘Sine’ and ‘Cosine’ operators from a ‘Concat’ operator, but there was no ‘Einsum’ operation prior to this.

Masking

As mentioned earlier, the masking pattern observed in the V3 architecture shared similarities with many other transformer-based models. Notably, this Llama model has the same masking pattern shown in Figure 2. There are differences in the surrounding operators and how this is structured within the model, but the core of the repeated subgraph remains the same. This is particularly interesting because it was not seen in the DeepSeekMoE model.

MLP

Within the Llama model, there is an MLP that follows the post-attention normalization and passes into the next layer, but this pattern appears only once per layer. There are not multiple ‘experts’, but the MLP pattern itself is clearly visible, as shown in Figure 10.

Figure 10: The MLP subsection of a layer in the Llama model that sits between the post-attention normalization and the start of the following layer.
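For comparison, that per-layer MLP follows the familiar gated pattern from modeling_llama.py; a simplified sketch (omitting configuration and parallelism details) is shown below. SiLU is typically exported to ONNX as a ‘Sigmoid’ followed by a ‘Mul’, which likely accounts for the per-layer ‘Sigmoid’ count discussed in the next subsection.

```python
import torch.nn as nn

class LlamaStyleMLPSketch(nn.Module):
    """Single per-layer MLP: no experts, just gate/up/down projections with a SiLU gate."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()  # typically exported to ONNX as Sigmoid + Mul

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
```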

Graph Operator Count

Despite similarities between the Llama and DeepSeek architectures, there are still substantial differences in the operator count histogram, as shown in Figure 11. One clear difference is that the counts of ‘Softmax’ and ‘Sigmoid’ operators are similar to one another: in this Llama model, there is one ‘Softmax’ in each layer, used for the output of the self-attention mechanism, and one ‘Sigmoid’ operator in each layer’s MLP, of which there is only one prior to the passage into the following layer.

Figure 11: The operator counts for the Llama 3.1 model. This excludes constant operators.

Conclusions: a matter of intrigue and importance

Understanding the architecture of a model like DeepSeek is interesting because it allows us to see how new models are being built on top of pre-existing models with novel tweaks and ideas. Of course, this is not specific to DeepSeek-R1, but seeing as it made such a splash in recent weeks, it was natural for the team to focus on it.

Our analysis of the model’s computational graph has shown the building blocks and evolutionary steps visible within the DeepSeek-R1 architecture, which can be applied to understanding its genealogy. It is clear that the model was built from the DeepSeekV3 base. We can also see how ideas were incorporated from the previous DeepSeekMoE and V2 models and upgraded. On top of this, we also see striking similarities with repeated subgraphs and patterns within a Llama 3.1 model.

That being said, more generally speaking, we can apply this technique of analysis to understand how a model was put together based on a review of its computational graph, similarities it has with pre-existing models, and unique identifiers for the specific architecture.

Understanding the distinguishing features of a model’s computational graph is important because, within an organization, when a new model becomes publicly available and people want to start using it, key security questions must be answered prior to deployment. Being able to scan a given model for identifying characteristics can determine whether a team member has jumped the gun and deployed it without proper authorization. Not only that, this methodology underpins HiddenLayer’s ShadowLogic technique for detecting backdoors within models. So not only is model genealogy extremely interesting in understanding the building blocks of models (at least to us!), but it is also a handy capability in the constant battle to maintain secure and trustworthy AI systems.