Summary

HiddenLayer’s previous blog post on DeepSeek-R1 highlighted security concerns identified during analysis and urged caution on its deployment. This blog builds on that work, combining it with the principles of ShadowGenes as a means of identifying possible unsanctioned deployment of the model within an organization’s environment. Join us as we delve into the model’s architecture and genealogy to better understand its building blocks and execution flow, comparing and contrasting it with other models.

Introduction

DeepSeek was the talk of the AI town following the release of their R1 model. Multiple blogs from various sources were released, including one from our own team, delving into the security implications of the rapid adoption of the new model. With the general recommendation being to await a full and proper security assessment of the model prior to deployment, the talk cooled somewhat over the following weeks.

During this time, our team analyzed the architecture of the model to assess its genealogy. Since our signature-based method of model genealogy is relatively new, we believe this is a perfect time to walk through a ‘live’ example and demonstrate our process, especially given that this underpins our ShadowLogic detection technique as well.

We will show how we leverage the computational graphs of different model families to identify particular subgraphs that appear across multiple models designed for the same task. This can be used to build a signature to find subsections within model layers that are indicative of specific functionality. We also show how we can identify subgraphs that are unique to a single model family and how we validate this through code analysis and review of technical documentation. If a uniquely identifiable subgraph is found, this can be used to build a more specific signature for a particular model family.

In this blog, we will walk through DeepSeek-R1’s evolution based on its architecture and that of its base model (DeepSeekV3). On top of giving a general outline of the model’s structure and progression over time, we will dive into the use and visualization of:

  • Identifiable attributes within a given layer’s attention mechanism;
  • The use of Rotary Embeddings (RoPE) in each layer;
  • Mixture-of-Experts (MoE) utilization.

As part of this, we will compare and contrast the architecture with DeepSeek’s prior MoE model and Meta’s Llama 3.

This research has led to a better understanding of the building blocks of the R1 model and the identification of its distinguishing features, which can be useful for identifying instances deployed within your environment without prior approval.

R1’s Evolution

For the purposes of our analysis, our team converted the DeepSeek-R1 model hosted on Hugging Face to the ONNX file format, enabling us to examine its computational graph. We used this, along with a review of associated technical papers and code, to identify shared characteristics and subgraphs observed within other models and to piece together the defining features of its architecture.
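For readers who want to try something similar, a general conversion flow can be sketched with Hugging Face Optimum. The snippet below is illustrative only and uses a small stand-in checkpoint; it is not our exact pipeline, and exporting the full R1 weights requires far more resources and custom handling.

```python
# Illustrative sketch: export a Hugging Face causal-LM checkpoint to ONNX so that
# its computational graph can be inspected. "gpt2" is a small stand-in checkpoint,
# not the model analyzed in this blog.
from optimum.onnxruntime import ORTModelForCausalLM
import onnx

onnx_model = ORTModelForCausalLM.from_pretrained("gpt2", export=True)
onnx_model.save_pretrained("./onnx_export")

# The exported graph can then be walked programmatically: every node exposes its
# operator type, inputs/outputs, and name metadata.
graph = onnx.load("./onnx_export/model.onnx").graph
print(f"{len(graph.node)} operators in the graph")
```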

Architectural Overview

Before delving into the technical details of the genealogy of DeepSeek-R1, a high-level overview of its structure is set out below.

The model was found to have sixty-one hidden layers, each following the same basic pattern (sketched in code after the list):

  • Input layer normalization;
  • Self-attention;
  • Post-attention layer normalization;
  • Feed Forward Network (FFN).
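As a rough mental model, that repeating pattern can be expressed as the simplified PyTorch module below. This is illustrative only: the normalization shown is a stand-in for the RMSNorm the model actually uses, and the attention and FFN arguments stand in for the MLA and MoE modules discussed later.

```python
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    """Simplified sketch of the repeating hidden-layer pattern described above."""
    def __init__(self, hidden_size, attention, ffn):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)           # stand-in for RMSNorm
        self.self_attn = attention                                 # self-attention (MLA in V3/R1)
        self.post_attention_layernorm = nn.LayerNorm(hidden_size)  # post-attention normalization
        self.ffn = ffn                                             # feed forward network (MoE MLP in V3/R1)

    def forward(self, hidden_states):
        residual = hidden_states
        hidden_states = residual + self.self_attn(self.input_layernorm(hidden_states))
        residual = hidden_states
        hidden_states = residual + self.ffn(self.post_attention_layernorm(hidden_states))
        return hidden_states
```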

The key subsections we focus on in this blog are the self-attention mechanism and the FFN. For the self-attention mechanism, we outline how the model leverages Multi-Head Latent Attention (MLA), its use of Rotary Embeddings (RoPE) throughout, and the interesting subgraphs we observed with patterns indicative of masking. For the FFN, we discuss how the model utilizes the Mixture-of-Experts concept within a Multi-Layer Perceptron (MLP).

DeepSeek-R1 and DeepSeekV3

Initial analysis showed that the architecture of these two models is essentially the same, with some minor exceptions. This is to be expected: reading through DeepSeek-R1’s accompanying paper, we see that DeepSeekV3 was used as the base, with Reinforcement Learning (RL) post-training used to develop its reasoning and Chain-of-Thought output, something unlikely to have much impact on the overall structure of the model’s computational graph.

The DeepSeekV3 technical report can therefore be used as a reference guide for how the model’s architecture comes together. For this reason, we will refer to the V3/R1 model architecture in this section of the blog as if referring to one model because – unless stated otherwise – the structural patterns exist in both.

In order to better understand the model’s computational flow, we also cross-referenced our observations from within the computational graph with the code in the modeling_deepseek.py file.

Before any manual examination, we ran our current ShadowGenes signatures against the model. The results indicated the presence of an attention mechanism structure and an MLP structure, both of which are discussed in more detail below. We also had a hit for a broader LLM-related signature that was developed based on previous analysis of Llama, which was not overly surprising and actually somewhat validating, given the references to Llama in the DeepSeek code—providing a nice sanity check for our method! 

Attention Mechanism

A key feature of the V3 architecture is Multi-Head Latent Attention (MLA). Prior to diving into the technicalities of this, let’s go through a high-level overview of how an attention mechanism extracts contextual information from an input sequence: 

Each token in the sequence has a key, query, and value:

  • The key represents the information the token has to offer, so it can answer queries from other tokens;
  • The query represents the information the token wants to retrieve from the keys of other tokens;
  • The value is the actual data that will be passed forward based on the responses.

Put simply, a given token queries the keys of other tokens to determine how much attention it should pay to each. The values from the other tokens are then weighted by those attention scores and combined, enabling the model to understand the context between tokens.

The Key-Value (KV) Cache is part of a mechanism used to store precomputed keys and values to speed up inference.
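To make this concrete, here is a minimal sketch of a single decoding step using standard scaled dot-product attention with a naive KV cache. This is a generic illustration only, not DeepSeek’s MLA implementation (which compresses the cache, as described next).

```python
import torch

def attention_step(q_t, k_cache, v_cache, k_t, v_t):
    """One decoding step: append the new token's key/value to the cache, then let
    its query attend over everything cached so far."""
    k_cache = torch.cat([k_cache, k_t], dim=1)   # keys:   (batch, seq + 1, d)
    v_cache = torch.cat([v_cache, v_t], dim=1)   # values: (batch, seq + 1, d)
    scores = q_t @ k_cache.transpose(-2, -1) / k_cache.shape[-1] ** 0.5  # query vs. every key
    weights = scores.softmax(dim=-1)             # how much attention to pay to each token
    out = weights @ v_cache                      # blend the cached values accordingly
    return out, k_cache, v_cache
```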

MLA

The purpose of MLA is to further increase computational efficiency by compressing the KV Cache into a latent vector.

This concept was carried over from the DeepSeekV2 model and is discussed in V2’s technical report. Reading the paper and reviewing the code, we can piece together where this is visible within the computational graph of the V3 architecture.

After the input layer normalization, a given layer’s graph splits into multiple branches, two of which begin with ‘MatMul’ operators. One of these is associated with the Key (K) and Value (V) projections, and the other is associated with the Query (Q) projections. The metadata label of the former is set as kv_a_proj_with_mqa, while the latter is set as q_a_proj. We can map this back to the following snippet of the DeepseekV3Attention class within the code:

The key parts of this code related to MLA are highlighted in red. Per the technical paper:

The core of MLA is the low-rank joint compression for attention keys and values to reduce Key-Value (KV) cache during inference.

We can, therefore, infer that kv_lora_rank represents the dimension of the compressed KV cache. Then, we see kv_a_proj_with_mqa, which lines up with the metadata label of the ‘MatMul’ operator we see in the graph.
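To illustrate the idea, the sketch below shows how such low-rank compression might look. The attribute names mirror those in the public modeling_deepseek.py, but this is a simplified paraphrase of the concept, not the original snippet; the real DeepseekV3Attention class also handles the RoPE split, multi-head reshaping, and the subsequent up-projections.

```python
import torch.nn as nn

class MLACompressionSketch(nn.Module):
    """Simplified illustration of MLA's low-rank joint compression of keys and values."""
    def __init__(self, hidden_size, q_lora_rank, kv_lora_rank):
        super().__init__()
        # 'MatMul' branch 1 in the graph: down-project hidden states into a small
        # latent vector from which keys and values are later reconstructed.
        self.kv_a_proj_with_mqa = nn.Linear(hidden_size, kv_lora_rank, bias=False)
        self.kv_a_layernorm = nn.LayerNorm(kv_lora_rank)  # stand-in for RMSNorm
        # 'MatMul' branch 2 in the graph: the query-side low-rank projection.
        self.q_a_proj = nn.Linear(hidden_size, q_lora_rank, bias=False)

    def forward(self, hidden_states):
        compressed_kv = self.kv_a_layernorm(self.kv_a_proj_with_mqa(hidden_states))
        compressed_q = self.q_a_proj(hidden_states)
        return compressed_q, compressed_kv  # later up-projected into the full Q, K, and V
```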

The “kv” branch passes through layer normalization into the Rotary Embeddings subsection of the layer, which is our next point of interest!

RoPE

When looking at the visualization of the Rotary Embeddings subsection of the self-attention mechanism, something immediately stood out: it was present in every hidden layer. This is quite rare in other model architectures our team has examined in the past, but it does happen. However, something distinctive about this particular computational graph in comparison to the others was the presence of the ‘Einsum’ operator and the flow of the ‘Sine’ and ‘Cosine’ branches that split from it. 

Figure 1: One key differentiating pattern observed in the DeepSeekV3 model architecture was in the RoPE embeddings section within each hidden layer.

This subgraph was found to be a key differentiator of this specific model architecture and was, therefore, a pattern we used to build a targeted signature for it. The operators highlighted in green in Figure 1 represent subgraphs we had observed in a small number of other models when performing signature testing; one example is this export of codeLlama. Whilst a very similar trait exists within the DeepSeekMoE model we examined as part of this blog, the ‘Mul’ operators highlighted in purple appear to be a new part of the V3 architecture.

Digging into why this pattern within the graph differed from other architectures, we referred back to the code. We found that this pattern appears to be represented within the _set_cos_sin_cache function of the DeepseekV3YarnRotaryEmbedding class; the colors map to the graph as seen in Figure 1:
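In paraphrased form, that function does roughly the following. This is a simplified rendering based on the public code, with the YaRN-specific scaling collapsed into a single mscale factor; the comments note where each colored region of Figure 1 surfaces in the graph.

```python
import torch

def set_cos_sin_cache_sketch(seq_len, inv_freq, mscale, dtype=torch.float32):
    """Paraphrased sketch of DeepseekV3YarnRotaryEmbedding._set_cos_sin_cache."""
    t = torch.arange(seq_len, dtype=torch.float32)
    freqs = torch.einsum("i,j->ij", t, inv_freq)   # outer product: the 'Einsum' operator (green)
    emb = torch.cat((freqs, freqs), dim=-1)
    cos_cached = (emb.cos() * mscale).to(dtype)    # 'Cos' branch, then the YaRN 'Mul' (purple)
    sin_cached = (emb.sin() * mscale).to(dtype)    # 'Sin' branch, then the YaRN 'Mul' (purple)
    return cos_cached, sin_cached
```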

Masking

An interesting repeated subgraph indicative of a causal masking pattern was observed within the self-attention subsection of each layer. The pattern itself is observed in other LLMs, such as Phi3 and Mistral, as well as the Llama model analyzed as part of this blog. This provides another good sanity check for our method because we know these models use similar attention mechanisms, so we should see similar masking patterns.

Something unique to the DeepSeekV3 architecture is how these are structured within each layer. There are sixteen of these masking patterns in each hidden layer, arranged in groups of four, visualized like so:

Figure 2: One key differentiating pattern observed within the DeepSeekV3 model architecture was a masking pattern within each hidden layer. This can be seen in green.

The operators highlighted in green are (individually) a common masking pattern; the structure as a whole was observed as a key differentiator for this architecture. This containing pattern was observed four times in each hidden layer.
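For reference, the kind of causal masking this subgraph corresponds to can be written in a few lines; this is a generic illustration rather than the exact operator sequence seen in the graph.

```python
import torch

def causal_mask(seq_len, dtype=torch.float32):
    """Upper-triangular mask: position i may only attend to positions <= i.
    Added to the attention scores before the softmax, the -inf entries zero out
    attention to future tokens."""
    mask = torch.full((seq_len, seq_len), float("-inf"), dtype=dtype)
    return torch.triu(mask, diagonal=1)

scores = torch.randn(1, 4, 4) + causal_mask(4)  # broadcast over the batch dimension
weights = scores.softmax(dim=-1)                # future positions receive zero weight
```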

MLP

The distinguishing MLP features were observed from the fourth layer onwards; they were not present in layers zero, one, or two. This section of the layer followed from the post-attention layer normalization and split into two major branches: one containing the MLP Gate and Mixture of Experts subsections and the other containing the Shared Experts subsection of the layer. As previously mentioned, our pre-existing signatures for MLP patterns were triggered when scanning this model, but the use of experts was a key differentiator.

Gate

The flow for the MLP gate can be tracked through the MoEGate class of the code. One key difference between the V3 architecture and the MoE model architecture is the use of the ‘Sigmoid’ operator versus the ‘Softmax’ operator, respectively (more on that shortly).

This subsection of the model flows into the MoE section, as it is used to identify which experts are to be used for routing. This is described in section 2.1.2 of the DeepSeekV3 whitepaper, which explicitly states that the use of the sigmoid function is a change from previous iterations of DeepSeek:

Slightly different from DeepSeek-V2, DeepSeek-V3 uses the sigmoid function to compute the affinity scores, and applies a normalization among all selected affinity scores to produce the gating values.
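Concretely, the V3-style gating described in that quote can be sketched roughly as follows. This is a simplified illustration (the real MoEGate also handles expert grouping, bias terms, and routed scaling), and the function name is ours.

```python
import torch

def v3_style_gate_sketch(hidden_states, gate_weight, top_k):
    """Score every expert per token with a sigmoid, keep the top-k, and normalize
    the selected scores so they sum to one (the gating values)."""
    logits = hidden_states @ gate_weight.t()             # (tokens, n_experts)
    scores = logits.sigmoid()                            # V3/R1: sigmoid affinity scores
    topk_scores, topk_idx = scores.topk(top_k, dim=-1)   # pick each token's experts
    gating = topk_scores / topk_scores.sum(dim=-1, keepdim=True)  # normalize selected scores
    return topk_idx, gating

# Note: DeepSeekMoE/V2 instead compute the affinity scores with a softmax
# (the 'Softmax' operator discussed later in this blog).
```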

MoE

One of the key points DeepSeek highlights in its technical literature is its novel use of MoE. This is, of course, something that is used in the DeepSeekMoE model, and whilst the theory is retained and the architecture is similar, there are differences in the graphical representation. An MoE comprises multiple ‘experts’ within an FFN, each of which follows the MLP structure. Each token is assigned to a number of experts for processing after self-attention, before passing through to the following layer. DeepSeek applies fine-grained expert segmentation for this, meaning there are a large number of these experts within a given hidden layer.

Figure 3 shows a sample of the experts within a V3 hidden layer. Whilst this pattern is observed within the V3 and R1 models, the R1 model actually has more experts within each layer, which accounts for the differences observed in Figure 4.

Figure 3: One key differentiating pattern observed within the DeepSeekV3 model architecture was the Mixture-of-Experts repeating subgraph.

The operators highlighted in green are (individually) part of the pre-existing MLP signature; the structure as a whole was observed as a key differentiator for this particular architecture. This visualization shows four experts. We observed fifty-five per layer in V3 and sixty-seven per layer in R1.

Shared Experts

The second path the MLP splits into is another subgraph pattern that matches our pre-existing MLP signature but is separated from the MoE section of the model. It has the same pattern as a single expert shown in Figure 3 above. Figure 2 of the DeepSeekMoE paper illustrates why this particular separation exists, referring to it as the shared expert isolation strategy: a strategy that captures and consolidates knowledge across varying contexts.
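Putting the two branches together, a simplified view of the split looks like the sketch below: routed experts selected by the gate on one branch, an always-on shared expert on the other, and the two outputs summed. This is an illustration of the concept rather than DeepSeek’s implementation, which routes tokens far more efficiently than the naive loop shown here.

```python
import torch
import torch.nn as nn

class MoEWithSharedExpertSketch(nn.Module):
    """Simplified view of the two MLP branches in the graph: routed experts chosen
    by the gate, plus a shared expert applied to every token and added back in."""
    def __init__(self, routed_experts, shared_expert, gate):
        super().__init__()
        self.routed_experts = nn.ModuleList(routed_experts)  # many small expert MLPs
        self.shared_expert = shared_expert                   # shared expert isolation branch
        self.gate = gate                                     # returns (expert indices, gating values)

    def forward(self, tokens):
        topk_idx, gating = self.gate(tokens)
        routed_out = torch.zeros_like(tokens)
        for t in range(tokens.shape[0]):                     # naive per-token routing loop
            for slot in range(topk_idx.shape[1]):
                expert = self.routed_experts[int(topk_idx[t, slot])]
                routed_out[t] += gating[t, slot] * expert(tokens[t])
        # The shared branch bypasses routing entirely and is summed with the routed output.
        return routed_out + self.shared_expert(tokens)
```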

Graph Operator Count

As part of our analysis, we generated histograms showing the operator count within each model for comparison. As can be seen in Figure 4, the histograms for the V3 and R1 models are almost identical and share the same visual pattern. The key difference is the number of operators, in particular those associated with MoE, such as ‘MatMul’ and ‘Sigmoid’—which is to be expected based on the above.

Figure 4: Side-by-side comparison of operator counts for DeepSeek-R1 vs. DeepSeekV3. This excludes constant operators.
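Producing counts like these is straightforward once a model is in ONNX form; for example (an illustrative snippet assuming a local model.onnx file, such as the export shown earlier):

```python
from collections import Counter
import onnx

model = onnx.load("model.onnx")                        # any ONNX export
op_counts = Counter(node.op_type for node in model.graph.node)
op_counts.pop("Constant", None)                        # exclude constant operators, as in Figure 4

for op, count in op_counts.most_common():
    print(f"{op}: {count}")
```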

DeepSeek MoE

The DeepSeekMoE technical report can be used as a reference guide for how the model’s architecture comes together. Reading this, analyzing the computational graph, and cross-referencing it with its own version of modeling_deepseek.py, it is clear that the basic structure of the architecture can be mapped to the V3 architecture outlined above, albeit with the MoE model having twenty-eight hidden layers. This is a useful part of understanding the genealogy of the R1 model. There are some key differences, however, which are outlined below.

Attention Mechanism

RoPE

The cosine and sine flows shown earlier still exist within the rotary embeddings section of each layer, with one small difference in the graph: the ‘Cos’ and ‘Sin’ operators flow straight into their respective ‘Cast’ operators, without the ‘Mul’ operator observed in V3. This tracks with the _set_cos_sin_cache function of the code, which ends like so:

Given that this difference primarily stems from V3’s rotary embeddings class associated with YaRN (Yet Another RoPE extensioN), which is not used in the DeepSeekMoE code, nor in the other models analyzed as part of this blog, we believe this difference in the graph is caused by YaRN’s introduction in V3.

MLP

Despite following a similar flow, the MLP section of the model took a different shape, with slightly different operators used.

Auxiliary Loss

A major difference observed in the DeepSeekMoE model architecture when compared with the V3 architecture was the presence of multiple branches containing ‘If’ operators within the MLP. These are visualized in Figure 5. This pattern does not carry over to the V3 architecture and explains one major difference observed in the operator count histogram.

According to the corresponding paper, Expert-Level Balance Loss was introduced to “mitigate the risk of routing collapse”. 

Figure 5: ‘If’ operator branches that appear to be related to Expert-Level Balance Loss.

This can be found in the forward function of the MoEGate class within the code: 
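In paraphrased form, the expert-level term computes something along these lines. This is a simplified sketch of the published formulation, not the original snippet; the real forward also covers sequence-level variants and conditional training-only branches.

```python
import torch

def expert_level_balance_loss_sketch(scores, topk_idx, n_experts, top_k, alpha):
    """f_i: fraction of token-to-expert assignments landing on expert i (scaled);
    P_i: mean gate probability for expert i; loss = alpha * sum(f_i * P_i)."""
    counts = torch.bincount(topk_idx.reshape(-1), minlength=n_experts).float()
    f = counts * n_experts / (top_k * topk_idx.shape[0])   # load fraction per expert
    p = scores.mean(dim=0)                                  # mean gate probability per expert
    return alpha * (f * p).sum()
```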

Gate

As has already been stated, the V3 architecture uses the ‘Sigmoid’ operator in the MLP gate, as opposed to the ‘Softmax’ operator used in V2. As shown in Figure 6, the DeepSeekMoE model likewise uses the ‘Softmax’ operator here, aligning it with V2 rather than V3:

Figure 6: The use of the ‘Softmax’ operator in the MLP gate subsection of the layer. This flows through the ‘TopK’ operator into the MoE subsection of the layer for post-attention processing.

MoE

The output of the MLP gate goes through the ‘experts’ in the layer. Of note, the MoE subgraphs started occurring in layer one of this model, as opposed to layer three in the V3 architecture. The operators are also slightly different, as shown below:

Figure 7: The MoE repeated pattern observed within the DeepSeekMoE model.

A shared expert was also visible in this computational graph, one per layer, with the same visualization as that seen in V3.

Graph Operator Count

Figure 8 shows the operator counts for the DeepSeekMoE model. Key differences compared with the V3 architecture discussed above include the large number of ‘If’ operators, as well as a larger relative share of ‘Softmax’ operators.

Figure 8: The operator counts for the DeepSeek MoE model. This excludes constant operators. 

Llama 3.1

Technical papers associated with DeepSeek reference Meta’s Llama models. For the purposes of this blog, we have focused on Llama-3.1-8B. The similarities become apparent when comparing the DeepSeek code we have been looking at to modeling_llama.py. There are, of course, major differences between the models, as well as some striking similarities, both of which are discussed below.

Attention Mechanism

A subgraph pattern we originally used to build a signature, based on analysis of Llama models’ attention mechanisms, was also found within the DeepSeekMoE architecture and carried through to the V3 architecture, albeit with a slight change in the computational graph. This section of the layer is near the beginning of the attention mechanism and is shown in Figure 9 below:

Figure 9: Repeating subgraph observed within the self-attention section of Llama 3.1 that is also observed in the DeepSeekMoE architecture and carries over to the V3 architecture.

RoPE

In the Llama model, rather than the Rotary Embeddings being applied at each layer, they were applied once, at the start of the model. The operators involved in the computational graph were similar, but not the same: the RoPE section of the model still splits into ‘Sine’ and ‘Cosine’ operators from a ‘Concat’ operator, but there was no ‘Einsum’ operation prior to this.

Masking

As mentioned earlier, the masking pattern observed in the V3 architecture shared similarities with many other transformer-based models. Notably, this Llama model has the same masking pattern shown in Figure 2. There are differences in the surrounding operators and how this is structured within the model, but the core of the repeated subgraph remains the same. This is particularly interesting because it was not seen in the DeepSeekMoE model.

MLP

Within the Llama model, there is an MLP that follows the post-attention normalization and passes into the next layer, but this pattern appears only once per layer. There are not multiple ‘experts’, but the MLP pattern itself is clearly visible, as shown in Figure 10.

Figure 10: The MLP subsection of a layer in the Llama model that sits between the post-attention normalization and the start of the following layer.
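For comparison, that per-layer MLP follows the familiar gated pattern from modeling_llama.py; a simplified sketch (omitting configuration and parallelism details) is shown below. SiLU is typically exported to ONNX as a ‘Sigmoid’ followed by a ‘Mul’, which likely accounts for the per-layer ‘Sigmoid’ count discussed in the next subsection.

```python
import torch.nn as nn

class LlamaStyleMLPSketch(nn.Module):
    """Single per-layer MLP: no experts, just gate/up/down projections with a SiLU gate."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
        self.act_fn = nn.SiLU()  # typically exported to ONNX as Sigmoid + Mul

    def forward(self, x):
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
```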

Graph Operator Count

Despite similarities between the Llama and DeepSeek architectures, there are still substantial differences in the operator count histogram, as shown in Figure 11. One clear difference is that the counts of ‘Softmax’ and ‘Sigmoid’ operators are similar to one another: in this Llama model, there is one ‘Softmax’ in each layer, used for the output of the self-attention mechanism, and one ‘Sigmoid’ operator in each layer’s MLP, of which there is only one prior to the passage into the following layer.

Figure 11: The operator counts for the Llama 3.1 model. This excludes constant operators.

Conclusions: a matter of intrigue and importance

Understanding the architecture of a model like DeepSeek is interesting because it allows us to see how new models are being built on top of pre-existing models with novel tweaks and ideas. Of course, this is not specific to DeepSeek-R1, but seeing as it made such a splash in recent weeks, it was natural for the team to focus on it.

Our analysis of the model’s computational graph has shown the building blocks and evolutionary steps visible within the DeepSeek-R1 architecture, which can be applied to understanding its genealogy. It is clear that the model was built from the DeepSeekV3 base. We can also see how ideas were incorporated from the previous DeepSeekMoE and V2 models and upgraded. On top of this, we also see striking similarities with repeated subgraphs and patterns within a Llama 3.1 model.

That being said, more generally speaking, we can apply this technique of analysis to understand how a model was put together based on a review of its computational graph, similarities it has with pre-existing models, and unique identifiers for the specific architecture.

Understanding the distinguishing features of a model’s computational graph is important because, within an organization, when a new model becomes publicly available and people want to start using it, key security questions must be answered prior to deployment. Being able to scan a given model for identifying characteristics can determine whether a team member has jumped the gun and deployed it without proper authorization. Not only that, this methodology underpins HiddenLayer’s ShadowLogic technique for detecting backdoors within models. So not only is model genealogy extremely interesting in understanding the building blocks of models (at least to us!), but it is also a handy capability in the constant battle to maintain secure and trustworthy AI systems.