
Commit b90269f (1 parent: 4d67c63)

Optimize the image formats; update image links and content in the related docs

15 files changed

Lines changed: 20 additions & 18 deletions

content/zh/celebrity_insights/Andrej_Karpathy_2025_Review.md

Lines changed: 15 additions & 13 deletions
@@ -9,7 +9,7 @@ description: "Karpathy's end-of-2025 review of the year in large models
 cover:
   image: "/images/generated-covers/Andrej_Karpathy_2025_Review.webp"
   alt: "Karpathy: 2025 Annual Review of Large Language Models"
-wordCount: 3289
+wordCount: 3281
 readingTime: 9
 ---

@@ -72,7 +72,7 @@ OpenAI's o1 (late 2024) was the first prototype of an RLVR model,
 When verifiable tasks become RLVR's training ground, LLM capabilities **spike vertically** in those domains, ultimately displaying an absurd yet very real **"jagged intelligence"**.
 It is an erudite polymath one second, a grade schooler stumped by a brain teaser the next, and a second after that it may be sweet-talked by a jailbreak prompt into stealing your data.

-![human_vs_ai_intelligence](./images/human_vs_ai_intelligence.png)
+![human_vs_ai_intelligence](/images/optimized/png/human_vs_ai_intelligence_857w.webp)
 Human intelligence: blue; AI intelligence: red.
 I like this version of the meme (sorry, I forgot the original source): it reminds us that **human intelligence is itself jagged**, just with differently shaped "jags".

@@ -86,9 +86,9 @@ OpenAI's o1 (late 2024) was the first prototype of an RLVR model,
 **How can a model "crush every benchmark" and still be nowhere near AGI?**

 I have a longer write-up on the topic of this section:
-[Animals vs. Ghosts](link)
-[Verifiability](link)
-[The Space of Minds](link)
+[Animals vs. Ghosts](https://karpathy.bearblog.dev/animals-vs-ghosts/)
+[Verifiability](https://karpathy.bearblog.dev/verifiability/)
+[The Space of Minds](https://karpathy.bearblog.dev/the-space-of-minds)

 ---

@@ -219,28 +219,30 @@ Unlike the SFT and RLHF stage, which are both relatively thin/short stages (mino
 2. Ghosts vs. Animals / Jagged Intelligence
 2025 is where I (and I think the rest of the industry also) first started to internalize the "shape" of LLM intelligence in a more intuitive sense. We're not "evolving/growing animals", we are "summoning ghosts". Everything about the LLM stack is different (neural architecture, training data, training algorithms, and especially optimization pressure) so it should be no surprise that we are getting very different entities in the intelligence space, which are inappropriate to think about through an animal lens. Supervision bits-wise, human neural nets are optimized for survival of a tribe in the jungle but LLM neural nets are optimized for imitating humanity's text, collecting rewards in math puzzles, and getting that upvote from a human on the LM Arena. As verifiable domains allow for RLVR, LLMs "spike" in capability in the vicinity of these domains and overall display amusingly jagged performance characteristics - they are at the same time a genius polymath and a confused and cognitively challenged grade schooler, seconds away from getting tricked by a jailbreak to exfiltrate your data.

-《image》
+![human_vs_ai_intelligence](/images/optimized/png/human_vs_ai_intelligence_857w.webp)
+
 human intelligence: blue, AI intelligence: red. I like this version of the meme (I'm sorry I lost the reference to its original post on X) for pointing out that human intelligence is also jagged in its own different way.


 Related to all this is my general apathy and loss of trust in benchmarks in 2025. The core issue is that benchmarks are almost by construction verifiable environments and are therefore immediately susceptible to RLVR and weaker forms of it via synthetic data generation. In the typical benchmaxxing process, teams in LLM labs inevitably construct environments adjacent to little pockets of the embedding space occupied by benchmarks and grow jaggies to cover them. Training on the test set is a new art form.
 What does it look like to crush all the benchmarks but still not get AGI?
 I have written a lot more on the topic of this section here:
-Animals vs. Ghosts
-Verifiability
-The Space of Minds
-3. Cursor / new layer of LLM apps
+[Animals vs. Ghosts](https://karpathy.bearblog.dev/animals-vs-ghosts/)
+[Verifiability](https://karpathy.bearblog.dev/verifiability/)
+[The Space of Minds](https://karpathy.bearblog.dev/the-space-of-minds)
+
+1. Cursor / new layer of LLM apps
 What I find most notable about Cursor (other than its meteoric rise this year) is that it convincingly revealed a new layer of an "LLM app" - people started to talk about "Cursor for X". As I highlighted in my Y Combinator talk this year (transcript and video), LLM apps like Cursor bundle and orchestrate LLM calls for specific verticals:
 They do the "context engineering"
 They orchestrate multiple LLM calls under the hood strung into increasingly more complex DAGs, carefully balancing performance and cost tradeoffs.
 They provide an application-specific GUI for the human in the loop
 They offer an "autonomy slider"
 A lot of chatter has been spent in 2025 on how "thick" this new app layer is. Will the LLM labs capture all applications or are there green pastures for LLM apps? Personally I suspect that LLM labs will trend to graduate the generally capable college student, but LLM apps will organize, finetune and actually animate teams of them into deployed professionals in specific verticals by supplying private data, sensors and actuators and feedback loops.
-4. Claude Code / AI that lives on your computer
+1. Claude Code / AI that lives on your computer
 Claude Code (CC) emerged as the first convincing demonstration of what an LLM Agent looks like - something that in a loopy way strings together tool use and reasoning for extended problem solving. In addition, CC is notable to me in that it runs on your computer and with your private environment, data and context. I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of `localhost`. And while agent swarms running in the cloud feels like the "AGI endgame", we live in an intermediate and slow enough takeoff world of jagged capabilities that it makes more sense to simply run the agents on the computer, hand in hand with developers and their specific setup. CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.
-5. Vibe coding
+1. Vibe coding
 2025 is the year that AI crossed a capability threshold necessary to build all kinds of impressive programs simply via English, forgetting that the code even exists. Amusingly, I coined the term "vibe coding" in this shower of thoughts tweet totally oblivious to how far it would go :). With vibe coding, programming is not strictly reserved for highly trained professionals, it is something anyone can do. In this capacity, it is yet another example of what I wrote about in Power to the people: How LLMs flip the script on technology diffusion, on how (in sharp contrast to all other technology so far) regular people benefit a lot more from LLMs compared to professionals, corporations and governments. But not only does vibe coding empower regular people to approach programming, it empowers trained professionals to write a lot more (vibe coded) software that would otherwise never be written. In nanochat, I vibe coded my own custom highly efficient BPE tokenizer in Rust instead of having to adopt existing libraries or learn Rust at that level. I vibe coded many projects this year as quick app demos of something I wanted to exist (e.g. see menugen, llm-council, reader3, HN time capsule). And I've vibe coded entire ephemeral apps just to find a single bug because why not - code is suddenly free, ephemeral, malleable, discardable after single use. Vibe coding will terraform software and alter job descriptions.
-6. Nano banana / LLM GUI
+1. Nano banana / LLM GUI
 Google Gemini Nano banana is one of the most incredible, paradigm-shifting models of 2025. In my world view, LLMs are the next major computing paradigm similar to computers of the 1970s, 80s, etc. Therefore, we are going to see similar kinds of innovations for fundamentally similar kinds of reasons. We're going to see equivalents of personal computing, of microcontrollers (cognitive core), or internet (of agents), etc etc. In particular, in terms of the UIUX, "chatting" with LLMs is a bit like issuing commands to a computer console in the 1980s. Text is the raw/favored data representation for computers (and LLMs), but it is not the favored format for people, especially at the input. People actually dislike reading text - it is slow and effortful. Instead, people love to consume information visually and spatially and this is why the GUI has been invented in traditional computing. In the same way, LLMs should speak to us in our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc. The early and present version of this of course are things like emoji and Markdown, which are ways to "dress up" and lay out text visually for easier consumption with titles, bold, italics, lists, tables, etc. But who is actually going to build the LLM GUI? In this world view, nano banana is a first early hint of what that might look like. And importantly, one notable aspect of it is that it's not just about the image generation itself, it's about the joint capability coming from text generation, image generation and world knowledge, all tangled up in the model weights.
 ---
 TLDR. 2025 was an exciting and mildly surprising year of LLMs. LLMs are emerging as a new kind of intelligence, simultaneously a lot smarter than I expected and a lot dumber than I expected. In any case they are extremely useful and I don't think the industry has realized anywhere near 10% of their potential even at present capability. Meanwhile, there are so many ideas to try and conceptually the field feels wide open. And as I mentioned on my Dwarkesh pod earlier this year, I simultaneously (and on the surface paradoxically) believe that we will both see rapid and continued progress *and* that yet there is a lot of work to be done. Strap in.
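The custom BPE tokenizer mentioned in the vibe-coding item above is straightforward to sketch. The following is an illustrative byte-level BPE trainer in Python; it is my own minimal sketch of the general technique, not nanochat's Rust implementation, and the function names are invented for this example.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most frequent one."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def bpe_train(text, num_merges):
    """Learn `num_merges` BPE merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for new_id in range(256, 256 + num_merges):
        pair = most_frequent_pair(ids)   # greediest pair at this step
        merges[pair] = new_id            # record the learned merge rule
        ids = merge(ids, pair, new_id)
    return ids, merges
```

Applying the learned `merges` in order to new byte sequences gives the encoder; real tokenizers add regex pre-splitting and special tokens on top of this core loop.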

content/zh/celebrity_insights/Ilyasutskever.md

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ cover:
 In short, Ilya's current core thesis is: **AI's next breakthrough lies in discovering the elegant, innate learning mechanisms unique to humans, the value function in particular**, rather than continuing to rely purely on scale and piling up resources. The field needs to return to research and look for new, simpler, more intuitive "recipes".

 **Summary by analogy:**
-![Ilya_changes](./images/Ilya_changes.png)
+![Ilya_changes](/images/optimized/png/Ilya_changes_1024w.webp)

 We can read the shift in Ilya's views as follows: he has gone from an engineer who believed the world could be conquered by building **bigger, faster highways with more fuel (scaling)** to an architect of intelligent transportation systems who pursues designs that are **more elegant, more efficient, self-learning, and equipped with a built-in GPS and emotions (a value function)**. He has come to recognize that scaling alone only yields inefficiency and "pseudo-generalization", and that real progress requires deeper, more fundamental theoretical (research) breakthroughs.
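For readers unfamiliar with the term, a "value function" estimates expected future reward from a given state. A purely illustrative tabular TD(0) sketch on a toy chain environment (my own invention, not anything from Ilya's remarks):

```python
def td0_chain(num_states=5, episodes=200, alpha=0.1, gamma=0.9):
    """Learn state values V(s) on a chain 0 -> 1 -> ... -> 4,
    where entering the final state pays reward 1.0."""
    V = [0.0] * num_states
    for _ in range(episodes):
        s = 0
        while s < num_states - 1:
            s_next = s + 1                                  # deterministic step right
            r = 1.0 if s_next == num_states - 1 else 0.0    # reward only at the end
            V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) bootstrap update
            s = s_next
    return V
```

States closer to the reward end up with higher learned values (roughly gamma to the power of the remaining distance), which is exactly the "internal compass" role the analogy assigns to the value function.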

content/zh/celebrity_insights/Shunyu_Yao.md

Lines changed: 4 additions & 4 deletions
@@ -138,7 +138,7 @@ Shunyu_Yao homepage: <https://ysymyth.github.io/>
 I tried this quiz in Stanford's 224N course, and the answers were unsurprising: the Transformer, AlexNet, GPT-3, and so on. What do these papers have in common? They all proposed fundamental breakthroughs in training better models. At the same time, they all managed to get published by demonstrating (significant) improvements on some benchmark.

 There is one latent commonality, though: these "winners" are all training methods or models, not benchmarks or tasks. Even ImageNet, one of the most influential benchmarks, has less than a third of AlexNet's citation count. Everywhere else the method-versus-benchmark contrast is even starker; for example, the Transformer's main benchmark was WMT'14, whose workshop report has roughly 1,300 citations, while the Transformer itself has more than 160,000.
-[!image](./images/second_half_figure1.png)
+![image](/images/optimized/png/second_half_figure1_1024w.webp)

 This illustrates <b><span style="color:green">how the first half of the game was played: the focus was on building new models and methods</span></b>, with evaluation and benchmarks secondary (though necessary to keep the paper machinery running).

@@ -166,7 +166,7 @@ Shunyu_Yao homepage: <https://ysymyth.github.io/>

 Language pretraining created a good prior for chat, but not for controlling computers or playing video games. Why? These domains are far from the distribution of internet text, and supervised fine-tuning (SFT) / reinforcement learning (RL) directly on them generalizes poorly. I noticed this problem back in 2019, when GPT-2 had just come out and I applied SFT/RL on top of it to solve text-based games: [CALM](https://arxiv.org/abs/2010.02903) was the world's first agent built from a pretrained language model. But the agent needed millions of RL steps to hill-climb a single game, and it could not transfer to new games. Although that is exactly what RL is like, and nothing strange to RL researchers, I found it baffling, because we humans can easily play a new game and perform significantly better zero-shot. Then came the first eureka moment of my life: we generalize because we can choose not only actions like "go to cabinet 2", "open chest 3 with key 1", or "kill dungeon with sword", but also to think things like "The dungeon is dangerous and I need a weapon to fight it. There is no weapon in sight, so maybe I need to find one in a locked box or chest. Chest 3 is in cabinet 2; let me go there first and open it."

-[!image](./images/second_half_figure3.png)
+![image](/images/optimized/png/second_half_figure3_962w.webp)

 Thinking, or reasoning, is a strange kind of action: it does not act directly on the external world, yet the space of reasoning is open-ended and combinatorially infinite. You can think about a word, a sentence, an entire passage, or 10,000 random English words, and the environment around you does not immediately change. In classical RL theory this is a terrible situation that makes decision-making impossible. Imagine you must choose one of two boxes, one containing $1 million and the other empty; your expected winnings are $500,000. Now imagine I add infinitely many empty boxes; you should expect to win nothing. But by adding reasoning to the action space of any RL environment, we generalize using the prior from language pretraining, and we can flexibly allocate test-time compute across different decisions. This is a magical thing that I have not fully explained here; it may deserve a blog post of its own. Read ReAct for the original story of reasoning in agents, and for my thinking at the time. For now, my intuitive explanation is: even if you add infinitely many empty boxes, you have seen them in all kinds of games throughout your life, and choosing them trains you to better pick the box with the money in any given game. My abstract explanation is: <b><span style="color:blue">language generalizes through reasoning in agents</span></b>.
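The box thought experiment above is just arithmetic over a uniform choice; a minimal sketch of it (my own illustration, with a hypothetical `expected_winnings` helper):

```python
from fractions import Fraction

def expected_winnings(num_boxes, prize=1_000_000):
    """Pick one of `num_boxes` uniformly at random; exactly one holds `prize`.
    The expectation is prize / num_boxes."""
    return Fraction(prize, num_boxes)
```

With two boxes the expectation is $500,000; as empty boxes are added it tends to zero, which is why an unbounded "reasoning" action space looks catastrophic under classical RL assumptions.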

@@ -186,7 +186,7 @@ Shunyu_Yao homepage: <https://ysymyth.github.io/>
 - This recipe has essentially standardized and industrialized benchmark hill-climbing, without needing many new ideas. Because the recipe scales and generalizes so well, your new task-specific method might lift performance by 5%, while the next O-series model might lift it by 30% without explicitly targeting the task at all.
 - Even if we set up harder benchmarks, they will soon (and ever more quickly) be solved by this recipe. My colleague Jason Wei made a beautiful figure that captures the trend well:

-[!image](./images/second_half_figure4.png)]
+![image](/images/optimized/png/second_half_figure4_927w.webp)

 So what is left to play in the second half? If novel methods are no longer needed, and harder benchmarks will be cracked ever faster, what should we do?

@@ -201,7 +201,7 @@ Shunyu_Yao homepage: <https://ysymyth.github.io/>
 Maybe we will solve the utility problem soon, maybe not. Either way, its root cause may look deceptively simple: <b><span style="color:green">our evaluation setups differ from real-world setups in many basic ways</span></b>. Two examples:

 - **Evaluation "should" run automatically**: typically the agent receives a task input, completes the task autonomously, and then receives a task reward. But in reality an agent has to interact with people throughout the task; you would not text customer service one enormous message, wait 10 minutes, and then expect a single exhaustive reply that resolves everything. Questioning this setup, new benchmarks have been invented that either put real humans in the loop (e.g. [Chatbot Arena](https://lmarena.ai/)) or simulate users (e.g. [tau-bench](https://arxiv.org/abs/2406.12045)).
-[!image](./images/second_half_figure5.png)
+![image](/images/optimized/png/second_half_figure5_1024w.webp)

 - **Evaluation "should" run i.i.d.**: if you have a test set of 500 tasks, you run each task independently and average the per-task metrics to get an overall metric. But in reality tasks are solved sequentially, not in parallel. A Google SWE gets better and better at resolving google3 issues as they grow familiar with the codebase, yet a SWE agent resolving many issues in the same codebase gains no such familiarity. We clearly need long-term memory methods (and they exist), but academia has no proper benchmark to demonstrate the need, nor even the courage to question the i.i.d. assumption at the foundation of machine learning.
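The i.i.d.-versus-sequential point can be made concrete with a toy simulation. All numbers and names below are illustrative assumptions of mine, not from the essay: per-task score is modeled as a base skill plus a familiarity bonus that only accumulates when memory persists between tasks.

```python
def run_eval(num_tasks=500, base=0.2, gain=0.01, sequential=True):
    """Average task score under two evaluation protocols.

    i.i.d.: the agent is reset before every task (no accumulated familiarity).
    sequential: familiarity with the codebase carries over between tasks.
    """
    total, familiarity = 0.0, 0
    for _ in range(num_tasks):
        score = min(1.0, base + gain * familiarity)  # toy score model, capped at 1
        total += score
        if sequential:
            familiarity += 1   # memory persists to the next task
        # under i.i.d. evaluation, familiarity stays 0 for every task
    return total / num_tasks
```

Under this toy model the i.i.d. protocol reports the flat base skill, while the sequential protocol reports a much higher average, illustrating the gap that a long-term-memory benchmark would need to measure.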

Binary image files changed: 5.96 MB, 81.1 KB, 354 KB, 106 KB, 227 KB, 372 KB, 121 KB
