AI Model Series | DeepSeek's World-Shaking Technical Innovations (Part 1)

Published On: 2025/04/02 | Categories: Technology

Author: Mr. Lin Weizhi, Executive Vice President, Ji-Pu Industrial Trend Research Institute

To curb China's development of AI chips, the Trump administration will continue the "three-tier control of AI chips" introduced under the previous administration, which is scheduled to take effect on May 15, 2025. Foreign media including Bloomberg have reported extensively that many technology giants and countries hope the Trump administration will reconsider the three-tier rules. Taiwan, because of its weight in semiconductors, is placed in Tier 1 (T1) and is relatively free from U.S. export restrictions. Most countries, however, fall into Tier 2 (T2): their purchases of AI chips are subject to a cap on computing power, with a ceiling of 7%. The strictest tier, Tier 3 (T3), covers China, Russia, North Korea, Iran, and other countries under the U.S. arms embargo. Once the ban takes effect it will amount to a blockade, which will further affect the revenue and profit of major chip makers. Trump also specifically said that the launch of the R1/R1-Zero models by the Chinese company DeepSeek "was a wake-up call for our industry, reminding us that we need to focus on competing and winning." By extension, the emergence of DeepSeek may be the ostensible reason for the escalation of the U.S. chip ban. So let us look, from a technical point of view, at how DeepSeek has shaken the world.

DeepSeek (深度求索) is an AI start-up founded in July 2023 and headquartered in Hangzhou, China. It is led by Liang Wenfeng, founder of the hedge fund company High-Flyer Quant. By insisting on building powerful models, taking an open-source approach, pursuing technological innovation, and targeting AGI, the company has rapidly risen to global prominence in AI. DeepSeek's development spans several versions: DeepSeek-V2, launched in June 2024 and nicknamed the "price butcher"; DeepSeek-V3, launched around Christmas 2024; and DeepSeek-R1-Zero/R1, launched before the Chinese New Year this year. These releases have demonstrated strong performance together with efficient cost control. They have enabled DeepSeek to play an important role in education, finance, customer service, content creation, and other fields, while its open-source strategy gives developers more flexibility and better data security. The company has also gradually challenged the dominance of well-known open-source models and even of other large language models. Below we examine DeepSeek's technical innovations step by step from a technical perspective.

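A Brief Review of the Transformer Architecture

As mentioned in the previous article, "The Operation, Limits, and Breakthroughs of Large Language Models", most large language models (LLMs) on the market today are based on the Transformer architecture. When it appeared, it not only replaced the traditional RNN/CNN architectures but also greatly improved computational efficiency and scalability, laying the foundation for the rapid progress of later AI models. Its two core parts are outlined below: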

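1. Encoder

The encoder converts an input sequence (for example, a sentence) into high-dimensional vector representations through multiple layers of self-attention and feed-forward networks (FFN). Its components work roughly as follows: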

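● Self-Attention

Self-attention is the core concept of the Transformer: it assigns a weight to every word in a sentence and so computes how important each one is. Concretely, it involves three steps.

a) Compute the Query, Key, and Value vectors

Q, K, and V are three vectors obtained by passing each input word through three different weight matrices.

b) Compute attention scores

Scaled dot-product attention is used to compute how relevant the words are to one another, which makes the relationships easier for the model to interpret.

c) Multi-head attention

Several self-attention heads can attend to different kinds of features, so the model learns different relationships (such as grammatical structure and semantic association), and their results are then merged.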
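Steps a) and b) can be sketched in a few lines of plain Python. This is only a minimal single-head illustration, not DeepSeek's implementation: the function names, the toy embeddings, and the identity projection matrices are all assumptions chosen to keep the arithmetic easy to follow.

```python
import math

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Returns (outputs, weights): one output vector and one row of
    attention weights per input token.
    """
    # Step a): project every token into Query, Key, and Value vectors.
    queries = [matvec(w_q, x) for x in embeddings]
    keys = [matvec(w_k, x) for x in embeddings]
    values = [matvec(w_v, x) for x in embeddings]
    d_k = len(keys[0])

    outputs, weights = [], []
    for q in queries:
        # Step b): dot product with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        row = softmax(scores)
        weights.append(row)
        # Each output is the attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(row, values))
                        for i in range(len(values[0]))])
    return outputs, weights

# Toy input: three tokens with 2-dimensional embeddings (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection matrices
outputs, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to 1 and says how much one token attends to every other token. Multi-head attention (step c) would run several such heads with different projection matrices and concatenate their outputs.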

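● Positional Encoding

Because the Transformer lacks the built-in sequential order of a traditional RNN, it needs positional encoding to tell the model where each word sits in the sentence. It is like page numbers in a textbook, which make the order of the information easier to follow.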
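The article does not say which positional encoding scheme it has in mind. As one possible illustration, the sketch below assumes the sinusoidal scheme from the original Transformer paper; the function name and the table sizes are hypothetical.

```python
import math

def sinusoidal_position_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of position vectors.

    Even dimensions use sine and odd dimensions use cosine, with
    wavelengths that grow geometrically across dimensions.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2j and 2j+1 share the same frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

encoding = sinusoidal_position_encoding(num_positions=4, d_model=6)
```

Row `pos` of the table is added to the embedding of the token at that position, so two identical words at different positions end up with different input vectors.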

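● Feed-Forward Network (FFN)

Each Transformer layer also contains a feed-forward network (FFN), which summarizes and consolidates information. Just as you need time to organize and internalize what you have learned from a book, the Transformer uses the FFN to extract the core information and make the most important content stand out.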
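A position-wise FFN can be sketched as two linear layers with a non-linearity between them, applied to each token vector independently. The ReLU activation and all weights below are assumptions for illustration; real models use much wider layers and often other activations.

```python
def linear(weights, bias, vector):
    """Apply y = W x + b for one input vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def relu(vector):
    """Zero out negative entries."""
    return [max(0.0, x) for x in vector]

def feed_forward(vector, w1, b1, w2, b2):
    """Expand to a hidden size, apply ReLU, then project back."""
    hidden = relu(linear(w1, b1, vector))
    return linear(w2, b2, hidden)

# Hypothetical weights: model width 2, hidden width 3.
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b2 = [0.5, -0.5]

# The same FFN is applied to every token vector separately.
token_vectors = [[1.0, 2.0], [-1.0, 3.0]]
ffn_outputs = [feed_forward(v, w1, b1, w2, b2) for v in token_vectors]
```

Unlike attention, the FFN never mixes information between tokens; it only transforms each token's vector on its own.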

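2. Decoder

The decoder receives the encoder's output and generates the target sequence. It also contains self-attention and an FFN. The difference is that it adds masked self-attention, which hides later tokens from each position and so prevents information leakage.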
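The masking idea can be illustrated with a small sketch: before the softmax, every attention score that points at a later position is set to negative infinity, so its weight becomes zero. The function names and the all-zero toy scores are assumptions for illustration.

```python
import math

def causal_mask(scores):
    """Set every score where the key position is after the query to -inf."""
    return [[s if key <= query else -math.inf
             for key, s in enumerate(row)]
            for query, row in enumerate(scores)]

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 3x3 score matrix in which every token is equally relevant.
raw_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
masked_weights = [softmax(row) for row in causal_mask(raw_scores)]
```

With these toy scores the first token can attend only to itself, the second splits its attention over the first two tokens, and only the last token sees the whole sequence.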

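Understanding Mixture of Experts (MoE)

Because DeepSeek uses the Transformer architecture and optimizes it with MoE, the basic concepts of MoE are worth covering here. MoE can be traced back as far as 1988. In January 2017, Google developed it further in the paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", applying it to long short-term memory (LSTM) models and successfully training a model with up to 137 billion (137B) parameters and as many as 128K experts. After 2021, Google replaced the feed-forward network (FFN) in the traditional Transformer architecture with MoE layers, which increases model capacity and reduces computation cost. Challenges have persisted all along, such as unbalanced expert load and unstable training of the gating network. An MoE layer usually consists of two key parts: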

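1. Expert Networks (Experts)

Each expert is an independent sub-network (usually an FFN). During actual computation, only some of the experts are activated and take part in processing. For example, in natural language processing tasks, expert A may focus on language and grammar, while expert B may focus more on semantic understanding.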

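2. Gating Network (Gating/Router)

The gating network dynamically selects which experts to activate based on the features of the input tokens. It typically uses a simple feed-forward network (FFN) to compute a weight for each expert, and through training it gradually learns to send similar inputs to the experts that handle them best.

For example, when you type a question into a chatbot, the input is first broken down into smaller units called tokens. The gating network acts like a traffic controller: it computes a relevance score for each expert from the token's vector and decides which experts are best suited to handle that input. Finally, the results of the selected experts are combined to produce the final output. Imagine you want to learn the whole of science, from physics and chemistry to biology, which is an extremely daunting task for one person. If there is instead a group of students who each specialize in a different subject, learning becomes far more efficient. This is how Mixture of Experts (MoE) works in artificial intelligence (AI): it makes AI models smarter and more efficient, and able to handle huge amounts of information.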
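The routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not DeepSeek's or Google's implementation: the gate is a single linear layer, the experts are hypothetical stand-ins that only scale their input, and keeping the top 2 experts with renormalized weights is one common choice that the article itself does not specify.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale):
    """Hypothetical expert: a stand-in for an FFN that scales its input."""
    return lambda vector: [scale * x for x in vector]

def moe_layer(token, gate_weights, experts, top_k=2):
    """Route one token vector through its top_k experts.

    Returns (output, chosen), where chosen maps each selected expert
    index to its renormalized gate weight.
    """
    # Gating network: one linear score per expert, then softmax.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    # Keep only the top_k experts; the others are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: probs[i],
                 reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    chosen = {i: probs[i] / norm for i in top}
    # Combine the selected experts' outputs, weighted by the gate.
    output = [0.0] * len(token)
    for i, weight in chosen.items():
        expert_out = experts[i](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, chosen

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
output, chosen = moe_layer([2.0, 1.0], gate_weights, experts, top_k=2)
```

Only two of the four experts run for this token, which is where the compute saving comes from: the model's total parameter count grows with the number of experts, while the cost per token grows only with `top_k`.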