DeepSeek

DeepSeek was founded in July 2023 by High-Flyer co-founder Liang Wenfeng, who also serves as CEO of both companies.

Released under the MIT License, DeepSeek-R1 provides responses comparable to other contemporary large language models, such as OpenAI's GPT-4o and o1.[10][11]

DeepSeek's models are "open weight", which provides less freedom for modification than true open-source software.[12][13]

The company reportedly recruits AI researchers from top Chinese universities[10] and hires from outside the computer science field to diversify its models' knowledge and abilities.[7]

The low cost of training and running the language model was attributed to Chinese firms' lack of access to high-end Nvidia chips, which the US had restricted as part of the ongoing trade war between the two countries.

This breakthrough in reducing expenses while increasing efficiency and maintaining model performance sent "shockwaves" through the AI industry and the market.[14][15]

In February 2016, High-Flyer was co-founded by AI enthusiast Liang Wenfeng, who had been trading since the 2007–2008 financial crisis while attending Zhejiang University.[17]

In 2019, Liang established High-Flyer as a hedge fund focused on developing and using AI trading algorithms.[19]

The initial computing cluster, Fire-Flyer, began construction in 2019 and was completed in 2020, at a cost of 200 million yuan.[19]

According to 36Kr, Liang acquired 10,000 Nvidia A100 GPUs[20] before the United States restricted chip sales to China.

At the time, they exclusively used the PCIe version of the A100 rather than the DGX version, since the models they trained could fit within the 40 GB of VRAM on a single GPU; they needed only data parallelism, not model parallelism, so the higher interconnect bandwidth of DGX was unnecessary.[25][26]
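As a rough illustration of that distinction, the sketch below shows plain data parallelism in PyTorch: every GPU holds a complete replica of the model and trains on its own slice of each batch, with gradients averaged across replicas. The model, sizes, and training loop are placeholders for illustration, not DeepSeek's training code.

```python
# Illustrative data-parallel training loop (placeholder model and sizes, not
# DeepSeek's code).  Launch with `torchrun --nproc_per_node=<num_gpus> ...`;
# torchrun sets LOCAL_RANK for each process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_replica(local_rank: int):
    dist.init_process_group("nccl")                      # one process per GPU
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in model
    model = DDP(model, device_ids=[local_rank])            # full replica per GPU
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        # each rank draws its own shard of the global batch
        x = torch.randn(8, 4096, device=local_rank)
        loss = model(x).pow(2).mean()                    # dummy objective
        loss.backward()                                  # DDP all-reduces gradients
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    train_replica(int(os.environ["LOCAL_RANK"]))
```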

Incorporated on 17 July 2023,[1] with High-Flyer as the investor and backer, the lab became its own company, DeepSeek.

It was later taken under 100% control of Hangzhou DeepSeek Artificial Intelligence Basic Technology Research Co., Ltd., which was incorporated two months later.[26][7]

Likewise, the company recruits individuals without any computer science background to help its technology understand more knowledge areas,[10] such as poetry and China's notoriously difficult college admissions exams (Gaokao).

This update introduced compressed latent vectors to boost performance and reduce memory usage during inference.
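A minimal sketch of the compressed-latent-vector idea (the idea behind multi-head latent attention, MLA, shown in the V2 architecture figure): instead of caching full per-head keys and values, each token's hidden state is down-projected to a small latent vector, which is what gets cached, and keys and values are reconstructed from it when attention is computed. The dimensions and layer names below are illustrative, not DeepSeek's implementation.

```python
# Sketch of KV compression via a shared latent vector (illustrative sizes).
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64

down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # reconstruct values

h = torch.randn(2, 16, d_model)        # (batch, seq, hidden)
latent_cache = down(h)                 # (batch, seq, d_latent)  <- what gets cached

k = up_k(latent_cache).view(2, 16, n_heads, d_head)
v = up_v(latent_cache).view(2, 16, n_heads, d_head)

# The cache stores d_latent floats per token instead of 2 * n_heads * d_head.
print(latent_cache.shape, k.shape, v.shape)
```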

The model was made source-available under the DeepSeek License, which includes "open and responsible downstream usage" restrictions.

DeepSeek's accompanying paper claimed benchmark results higher than Llama 2 and most open-source LLMs at the time.

They used the pre-norm decoder-only Transformer with RMSNorm as the normalization, SwiGLU in the feedforward layers, rotary positional embedding (RoPE), and grouped-query attention (GQA).
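For reference, the following is a minimal sketch of two of the named components, with illustrative dimensions rather than DeepSeek's actual configuration: RMSNorm rescales activations by their root-mean-square, and SwiGLU gates one linear projection of the input with the SiLU of another.

```python
# Illustrative RMSNorm and SwiGLU modules (sizes are placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # scale by the inverse root-mean-square of the last dimension
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SiLU-gated feedforward: silu(gate(x)) elementwise-times up(x)
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 512)
y = SwiGLU(512, 1376)(RMSNorm(512)(x))   # pre-norm: normalize before the FFN
```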

They trained on 2 trillion tokens of English and Chinese text obtained by deduplicating the Common Crawl.[28]
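As a toy illustration of deduplication (not DeepSeek's actual data pipeline), one simple approach hashes each document's normalized text and keeps only the first occurrence of each hash:

```python
# Exact document-level deduplication by content hash (illustrative only).
import hashlib

def dedup(docs):
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:          # keep only the first copy of each document
            seen.add(key)
            unique.append(doc)
    return unique

print(dedup(["Hello  world", "hello world", "another page"]))  # 2 unique docs
```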

The DeepSeek-MoE models (Base and Chat) each have 16B parameters (2.7B activated per token, 4K context length).[29]
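The gap between total and activated parameters comes from mixture-of-experts routing: a router scores the experts for each token and only the top-scoring experts' feedforward networks are evaluated. The sketch below shows this in generic form; the dimensions, expert count, and top-k value are illustrative, not DeepSeek-MoE's configuration.

```python
# Illustrative top-k mixture-of-experts routing (placeholder sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_experts, top_k = 512, 8, 2
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
router = nn.Linear(d_model, n_experts, bias=False)

x = torch.randn(4, d_model)                       # 4 tokens
scores = F.softmax(router(x), dim=-1)             # (4, n_experts)
weights, idx = scores.topk(top_k, dim=-1)         # keep 2 experts per token

out = torch.zeros_like(x)
for t in range(x.size(0)):                        # dispatch token by token
    for w, e in zip(weights[t], idx[t]):
        out[t] += w * experts[int(e)](x[t])       # only chosen experts run
```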

The Financial Times reported that it was cheaper than its peers, with a price of 2 RMB for every million output tokens.[32]

DeepSeek-V3-Base and DeepSeek-V3 (a chat model) use essentially the same architecture as V2 with the addition of multi-token prediction, which (optionally) decodes extra tokens faster but less accurately.
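A rough sketch of the multi-token-prediction idea (illustrative, not DeepSeek-V3's actual module): besides the usual next-token head, an extra head predicts the token one step further ahead, and its output can serve as a cheap draft during decoding, trading some accuracy for speed.

```python
# Illustrative multi-token prediction heads (placeholder sizes).
import torch
import torch.nn as nn

d_model, vocab = 512, 32000
backbone_out = torch.randn(1, 16, d_model)   # hidden states from the trunk

next_head = nn.Linear(d_model, vocab)        # predicts token t+1 (standard)
extra_head = nn.Linear(d_model, vocab)       # predicts token t+2 (extra head)

logits_t1 = next_head(backbone_out)          # (1, 16, vocab)
logits_t2 = extra_head(backbone_out)         # (1, 16, vocab)

# Speculative guess for the token after next; a decoder can emit it early and
# fall back to the standard head if the guess turns out to be wrong.
draft = logits_t2[:, -1].argmax(-1)
```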

Training: The DeepSeek team performed extensive low-level engineering to improve efficiency.[22]

Much of the forward pass was performed in 8-bit floating point numbers (E5M2: 5-bit exponent and 2-bit mantissa) rather than the standard 32-bit, requiring special GEMM routines to accumulate accurately.[59]
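The sketch below illustrates the trade-off in simplified form, assuming a PyTorch build that exposes the torch.float8_e5m2 dtype: inputs are quantized to 8-bit floats, but the matrix product is accumulated in a wider format so rounding error does not compound. It emulates the behavior rather than reproducing DeepSeek's GEMM routines.

```python
# Emulated FP8 (E5M2) GEMM with wide accumulation (illustrative only).
# Requires a PyTorch version that provides torch.float8_e5m2.
import torch

a = torch.randn(128, 256)
b = torch.randn(256, 64)

a8 = a.to(torch.float8_e5m2)     # quantize operands to 8-bit floats
b8 = b.to(torch.float8_e5m2)

# Upcast before the matmul to emulate higher-precision accumulation; a real
# kernel would keep FP8 operands and accumulate partial sums in a wide format.
out = a8.to(torch.float32) @ b8.to(torch.float32)

err = (out - a @ b).abs().mean()
print(f"mean absolute error from FP8 rounding: {err:.4f}")
```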

Benchmark tests show that V3 outperformed Llama 3.1 and Qwen 2.5 while matching GPT-4o and Claude 3.5 Sonnet.

DeepSeek claimed that it exceeded the performance of OpenAI's o1 on benchmarks such as the American Invitational Mathematics Examination (AIME) and MATH.[64]

However, The Wall Street Journal reported that on 15 problems from the 2024 edition of AIME, the o1 model reached a solution faster.

The accuracy reward checked whether a boxed answer is correct (for math) or whether code passes its test cases (for programming).
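A minimal sketch of such an accuracy reward for the math case (a hypothetical helper, not DeepSeek-R1's reward code): extract the \boxed{...} answer from the model output and compare it with the reference answer.

```python
# Illustrative rule-based accuracy reward for math answers.
import re

def accuracy_reward(model_output: str, reference_answer: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0                           # no boxed answer -> no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

print(accuracy_reward(r"The answer is \boxed{42}.", "42"))   # 1.0
print(accuracy_reward(r"I think it's \boxed{41}.", "42"))    # 0.0
```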

Image captions:
- The DeepSeek login page shortly after a cyberattack that occurred following its January 20 launch
- The architecture of V2, showing both shared-routed MoE and MLA[51]: Figure 2
- Multi-Token Prediction
- Mixed-precision framework for V3[22]: Figure 6