DeepSeek researchers detail mHC, a new architecture they used to train 3B, 9B, and 27B models, finding it scaled without adding significant computational burden
DeepSeek has published a technical paper co-authored by founder Liang Wenfeng proposing a rethink of its core deep learning architecture
South China Morning Post · Vincent Chow
Related Coverage
- mHC: Manifold-Constrained Hyper-Connections arXiv.org
- Get Ready to Talk About DeepSeek Again Bloomberg
- China's DeepSeek kicked off 2026 with a new AI training method that analysts say is a ‘breakthrough’ for scaling Business Insider · Lee Chong Ming
- DeepSeek Unveils ‘mHC’ Architecture to Fix AI Training Instability Amid Chip Bans WinBuzzer · Markus Kasanmascheff
- DeepSeek touts new training method as China pushes AI efficiency The Economic Times
- DeepSeek's New Architecture Paper Signals More Than Technical Progress Implicator.ai · Maria Garcia
- Deepseek says new method can train AI more efficiently and cheaply Computerworld · Viktor Eriksson
- DeepSeek develops mHC AI architecture to boost model performance SiliconANGLE · Maria Deutscher
- DeepSeek Kicks Off the Year With New Paper Introducing the mHC Architecture Pandaily
- DeepSeek Introduces mHC Architecture to Improve Large Model Training Blockonomi · Maxwell Mutuma
- DeepSeek just dropped a new architecture update. — If you've been following the evolution of Foundation Models, you know that Hyper-Connections … Somya Rai
- What if LLMs could scale across 3B, 9B, and 27B parameters—without an increase in computational burden? — Wait, does it mean we don't need nearly as many data centers as anticipated? … Hào Lǐ
Discussion
-
@saritharai
Saritha Rai
on x
DeepSeek touts new training method in a paper co-authored by reclusive founder Liang Wenfeng, as Chinese AI companies strive to build more efficient AI systems @business https://www.bloomberg.com/...
-
@chrmanning
Christopher Manning
on x
Great to see an AI lab doing and publishing science (as well as discussing engineering efficiencies)! Some of the other “frontier” labs should try it! Thx, @deepseek_ai!
-
@meer_aiit
Meer
on x
DeepSeek just dropped a core transformer architecture change. Manifold-Constrained Hyper-Connections replace the single residual stream with multiple parallel signal paths. Standard residual connections have been the foundation of deep learning since ResNets. They allow [image]
-
@iamgrigorev
George Grigorev
on x
residuals in transformers are great for stability and scaling; deeper layers update the signal along the residual stream. few people questioned this choice publicly, and since 2025 there's been progress. few thoughts about hyper connections (wrt the newly released DeepSeek paper …
-
@arjunkocher
Arjun
on x
mHC: Manifold-Constrained Hyper-Connections most recent paper from @deepseek_ai [image]
-
@jiqizhixin
@jiqizhixin
on x
On the first day of the New Year, DeepSeek released a major paper. They try to fix the training instability that plagues advanced neural network designs. Enter mHC: Manifold-Constrained Hyper-Connections. They take the powerful but unstable “Hyper-Connections” architecture [image…
-
@nathancgy4
Nathan Chen
on x
last week, a deepseek researcher told me that he believes the two biggest architectural innovations in 2025 are 1) muon and 2) hyper-connections. since muon was already heavily explored by kimi, i asked him why don't they do something with hyper-connections now it's [image]
-
@scaling01
@scaling01
on x
of course 2026 starts with a banger DeepSeek paper I quoted a good explanation of what they are doing [image]
-
@zephyr_z9
@zephyr_z9
on x
What are Chinese quant companies smoking to get this kind of performance??? Mogging Sonnet 4.5 with a 40B [image]
-
@dorialexander
Alexander Doria
on x
Unsurprisingly a new DeepSeek banger. I'll post my reading notes later but I already recommend going through the original hyper-connection ByteDance paper first: clearly explain the expected benefits, better layer specialization/management ("enhance the impact of each layer") [im…
-
@zephyr_z9
@zephyr_z9
on x
DeepSeek Moment 2.0 🤔🤔 (the timing matches)
-
@zephyr_z9
@zephyr_z9
on x
BRUH These numbers are absolutely insane for a 40B Beating everyone on a bunch of hard benchmarks [image]
-
@scottstts
Scott
on x
DeepSeek dropped a pretty significant paper yesterday mHC building on hyper connection, introduced clever math and algo maneuvers to prevent the unbounded amplification and attenuation issue of normal HC when scaled up (doubly stochastic matrices and Sinkhorn-Knopp projection) [i…
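The doubly stochastic constraint mentioned above can be made concrete with a small sketch. This is an illustrative NumPy implementation of the classic Sinkhorn-Knopp iteration (alternating row and column normalization of a positive matrix), not DeepSeek's actual projection code; the function name and iteration count are assumptions for the example.

```python
import numpy as np

def sinkhorn_knopp(M, iters=200):
    """Drive a positive matrix toward the doubly stochastic set
    (all rows and all columns summing to 1) by alternately
    normalizing rows, then columns — the Sinkhorn-Knopp iteration."""
    M = np.asarray(M, dtype=float)
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)  # make each row sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # make each column sum to 1
    return M

# Any strictly positive matrix converges; exp() guarantees positivity.
A = np.exp(np.random.default_rng(0).normal(size=(4, 4)))
P = sinkhorn_knopp(A)
print(np.allclose(P.sum(axis=1), 1, atol=1e-5))  # rows sum to ~1
print(np.allclose(P.sum(axis=0), 1, atol=1e-5))  # columns sum to ~1
```

Because a doubly stochastic matrix neither amplifies nor attenuates the total mass it mixes, constraining the inter-stream mixing matrix this way is a natural fix for the unbounded growth/decay the tweet describes.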
-
@dorialexander
Alexander Doria
on x
So the first major paper of 2026, DeepSeek mHC: Manifold-Constrained Hyper-Connections. This is actually an engineering paper, taking as a starting point ideas already exposed in the original Hyper-Connections (HC) paper from ByteDance, which is consequently a prerequisite for
-
@joelc_eth
Joel
on x
DeepSeek drops new paper: mHC (Manifold-Constrained Hyper-Connections). According to a DeepSeek researcher, “the two biggest architectural innovations in 2025 are 1) Muon and 2) Hyper-Connections.” So what does mHC bring to the table? Residual connections have been the backbone […
-
@shiwei_liu66
Shiwei Liu
on x
DeepSeek's new mHC is a nice step toward mitigating the curse of depth issues tied to residual, highlighted in our recent work https://arxiv.org/.... Glad to see frontier labs engaging with this direction. Congrats to Seed-Foundation-Model Team too https://arxiv.org/... [image]
-
@norxornor
Nor
on x
Quick read through of Deepseek's new Manifold-Constrained Hyper-Connections paper: - You want to increase residual size from 1×C to n×C (n streams instead of 1). Earlier residual update: x' = x + layer(x). Make the x be n×C, and use x' = Ax + B layer(Cx) instead. A, B, C are all …
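The update rule in that thread, x' = Ax + B·layer(Cx) over an n×C residual instead of x' = x + layer(x), can be sketched in a few lines of NumPy. This is a toy illustration of the hyper-connections idea only: `layer` is a stand-in for a transformer sublayer, the matrix shapes follow the tweet's description, and none of mHC's actual constraints (e.g. the doubly stochastic projection) are applied here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 4, 8  # n residual streams, each of width C (toy sizes)

def layer(h):
    """Stand-in for a transformer sublayer acting on one C-wide vector."""
    return np.tanh(h)

x = rng.normal(size=(n, C))  # widened residual: n streams instead of 1

# Illustrative mixing matrices for x' = A x + B layer(C x):
A = np.eye(n) + 0.01 * rng.normal(size=(n, n))  # stream-to-stream carry
B = 0.1 * rng.normal(size=(n, 1))               # scatter layer output back to n streams
Cmix = 0.1 * rng.normal(size=(1, n))            # read one layer input out of n streams

x_new = A @ x + B @ layer(Cmix @ x)
print(x_new.shape)  # same widened (n, C) residual shape as the input
```

Setting n = 1 with A = B = Cmix = 1 recovers the standard residual update x' = x + layer(x), which is why hyper-connections are described as a strict generalization of the residual stream.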
-
@teortaxestex
@teortaxestex
on x
ALERT, NEW YEAR GIFT FROM DEEPSEEK mHC: Manifold-Constrained Hyper-Connections it's a pretty crazy fundamental result! They show stable hyper-connection training. This lets them *scale residual stream width*, with minor compute & memory overhead This is a *huge model smell* recipe.…
-
@novasarc01
Λux
on x
interesting paper by deepseek. the part i liked most is how mHC keeps the multi-stream idea of hyper-connections but puts hard constraints on residual mixing. nice example of real research progress coming from stability analysis rather than chasing more expressivity. [image]
-
@jenzhuscott
Jen Zhu
on x
Excellent thread summarising @deepseek_ai Jan 1st mHC paper. Two quant funds' open-sourced labs in 🇨🇳 have already set a dizzying pace for 2026 in its first 24 hours. This builds nicely on the ByteDance Hyper-Connections paper - mHC's manifold constraint feels like a principled [image]
-
@rryssf_
Robert Youssef
on x
🚨 DeepSeek just dropped a paper that quietly exposes why modern neural networks get unstable as they scale. It's called mHC: Manifold-Constrained Hyper-Connections, and the core idea is deceptively simple: Neural networks keep breaking their own geometry. Here's what that means…
-
@adinayakup
Adina Yakup
on x
To start 2026, @deepseek_ai released mHC🔥 A new architecture that makes hyper-connections more stable when training large models, without losing their performance benefits. https://huggingface.co/...