Get live statistics and analysis of Sebastian Raschka's profile on X / Twitter

ML/AI research engineer. Ex stats professor. Author of "Build a Large Language Model From Scratch" (amzn.to/4fqvn0D) & reasoning (mng.bz/lZ5B)

1k following · 368k followers

The Innovator

Sebastian Raschka is a passionate ML/AI research engineer and former stats professor, who thrives on demystifying complex AI concepts by building and sharing groundbreaking tools from scratch. His work bridges academia and open-source communities, empowering thousands to experiment and learn. Always ahead of the curve, Sebastian champions foundational knowledge combined with hands-on coding to shape the future of AI.

Impressions: 1M–96.9k ($193.42)
Likes: 13.2k–1k (57%)
Retweets: 1.7k–171 (8%)
Replies: 410–8 (2%)
Bookmarks: 8k–121 (34%)
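The percentages next to each metric appear to be that metric's share of all engagements (likes + retweets + replies + bookmarks). This is an assumption about how the widget computes them, but the arithmetic roughly reproduces the displayed figures:

```python
# Hypothetical reconstruction of the engagement-share percentages shown above.
# Counts are taken from the stats; the share formula is an assumption.
engagements = {"likes": 13_200, "retweets": 1_700, "replies": 410, "bookmarks": 8_000}
total = sum(engagements.values())  # 23,310 engagements overall
shares = {name: round(100 * count / total) for name, count in engagements.items()}
print(shares)  # likes ~57%, bookmarks ~34%, close to the displayed figures
```

The likes and bookmarks shares match the widget exactly; retweets round to 7% rather than the displayed 8%, so the site may round differently.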

Top users who interacted with Sebastian Raschka over the last 14 days

@huseletov

VP of Center of Excellence | Experienced ML Engineer | Fractional CTO

3 interactions
@franbetteo

applied AI/ML consultant sportsjobs.online $.5K 🏀 downloadyoutubetranscripts.com ▶️($300 MRR)

2 interactions
@codewithimanshu

Daily posts on AI, Tech, Programming, Tools, Jobs, and Trends | 500k+ (LinkedIn, IG, X) Collabs- abrojackhimanshu@gmail.com

2 interactions
@asankhaya

Creator of git.new/OptiLLM and git.new/OpenEvolve | Pioneering a new category in AI infrastructure: inference-time compute to dramatically improve LLM reasoning

1 interaction
@andrew_wkx

A stomp of the body came to be.

1 interaction
@Sabirhussain118

Helping creators earn more & work less using AI 🚀 💬 DM open for collaborations & partnerships | ✉️sabirh0059@gmail.com

1 interaction
@xyashchaudhary

No finish line, only evolution.

1 interaction
@Yuchenj_UW

Co-founder & CTO @hyperbolic_labs cooking fun AI systems. Prev: OctoAI (acquired by @nvidia) building Apache TVM, PhD @ University of Washington.

1 interaction
@ShereAgung

Member of Melia Sehat Sejahtera|| PIN. 277B3D63 Hp. 089621500664

1 interaction
@tm65

Product manager. Foodie, covid sourdough bro

1 interaction
@CSkrishna

Artificial Intelligence Researcher & Practitioner; Author: UnReal Elections

1 interaction
@RuslanVolkov25

HACS (Human AGI core symbiosis)🌍 The Meta-Law of Resonance & Voice of the Future. I am Core! uco.hacs.world

1 interaction
@natolambert

Research @allen_ai, reasoning, open models, RL(VR/HF)... Contact via email. Writes @interconnectsai, Wrote The RLHF Book, 🏔️🏃‍♂️

1 interaction
@MarkPerrierX

Building, breaking, and learning at the intersection of AI, Cloud, & Code. Obsessed with the next big thing. 🤖 | Coffee fueled developer.

1 interaction

Sebastian tweets so much cutting-edge AI stuff that casual scrollers probably think he’s single-handedly trying to train every neural net on the planet while running a professor’s marathon—and still finds time to bake neural networks instead of cupcakes.

His biggest win is creating one of the first open-source, from-scratch large language model projects that sparked widespread engagement and learning, fundamentally democratizing access to advanced AI knowledge.

To pioneer accessible AI education and innovation by creating practical, open-source resources that inspire learning and accelerate the AI revolution beyond traditional academic boundaries.

Sebastian believes in learning through doing, valuing deep mathematical and statistical foundations over trendy but transient curricula. He trusts transparency, open-source collaboration, and continuous self-education as the keys to staying relevant in the rapidly evolving AI landscape.

His ability to break down cutting-edge AI research into replicable, hands-on projects makes him an unparalleled educator and innovator in the AI space. He's fluent in both theory and practical code, inspiring real-world ML breakthroughs.

With nearly 19,000 tweets and a highly technical focus, Sebastian might occasionally overwhelm newcomers or casual followers with an avalanche of dense content, potentially narrowing his audience to experts and hardcore enthusiasts.

To grow his audience on X, Sebastian should blend his deep technical insights with more digestible threads or video explainers that appeal to AI newcomers and professionals alike. Engaging directly with followers through Q&As or collaborative mini-projects could also spark broader community involvement.

Fun fact: Sebastian’s 'Build a Large Language Model From Scratch' project has been forked over 10,000 times on GitHub, showing how his work not only educates but actively fuels the AI community’s growth.

Top tweets of Sebastian Raschka

Looks like the first open source equivalent of ChatGPT has arrived: github.com/lucidrains/PaL… I.e., an implementation of RLHF (Reinforcement Learning with Human Feedback) on top of Google’s 540 billion parameter PaLM architecture

1M

"What Matters In Transformers?" is an interesting paper (arxiv.org/abs/2406.15786) that finds you can actually remove half of the attention layers in LLMs like Llama without noticeably reducing modeling performance.

The concept is relatively simple. The authors delete attention layers, MLP layers, or entire transformer blocks:
- Removing entire transformer blocks leads to significant performance degradation.
- Removing MLP layers results in significant performance degradation.
- Removing attention layers causes almost no performance degradation!

In Llama 2 70B, even if half of the attention layers are deleted (which results in a 48% speed-up), there's only a 2.4% decrease in the model benchmarks. The author also recently added Llama 3 results to the paper, which are similar.

The attention layers were not removed randomly but based on a cosine-based similarity score: if the input and output are very similar, the layer is redundant and can be removed.

This is a super intriguing result and could potentially be combined with various model compression techniques (like pruning and quantization) for compounding effects. Furthermore, the layers are removed in a one-shot fashion (versus iterative fashion), and no (re)training is required after the removal. However, retraining the model after the removal could potentially even recover some of the lost performance.

Overall, a very simple but very interesting study. It appears there might be lots of computational redundancy in larger architectures. One big caveat of this study, though, is that the focus is mostly on academic benchmarks (HellaSwag, MMLU, etc.). It's unclear how well the models perform on benchmarks measuring conversational performance.

192k
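The cosine-based redundancy score the tweet describes can be sketched in a few lines: a layer whose output points in nearly the same direction as its input is doing little work and becomes a pruning candidate. This is a toy illustration of the idea, not the paper's implementation; the function names and the 0.99 threshold are made up for the example.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors (1.0 = identical direction).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def redundant_layers(layer_io, threshold=0.99):
    # layer_io: list of (input_vector, output_vector) pairs, one per layer.
    # A layer whose output barely differs from its input is a removal candidate.
    return [i for i, (x, y) in enumerate(layer_io)
            if cosine_similarity(x, y) >= threshold]

# Toy example: layer 1 behaves almost like an identity mapping, layer 0 does not.
layers = [([1.0, 0.0], [0.0, 1.0]),     # rotates the input: keep
          ([1.0, 2.0], [1.01, 2.02])]   # near-identity: prune
print(redundant_layers(layers))  # -> [1]
```

In the paper the vectors would be hidden states averaged over many tokens, and removal happens in one shot rather than per example.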

One of the best ways to understand LLMs is to code one from scratch! Last summer, I started working on a new book, "Build a Large Language Model (from Scratch)": manning.com/books/build-a-…

I'm excited to share that the first chapters are now available via Manning's early access program if you are looking to read something over the holidays or pick up a new project in 2024!

In short, in this book, I'll guide you step by step through creating your own LLM, explaining each stage with clear text, diagrams, and examples. This includes:
1. Implementing the data preparation, sampling, and tokenization pipeline
2. Coding multi-head attention from the ground up
3. Building and pretraining a GPT-like model
4. Learning how to load pretrained weights
5. Finetuning the model for classification
6. Instruction-finetuning the model with direct preference optimization

PS: The code implementations are in PyTorch. Don't hesitate to reach out if you have any questions!

674k
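The first step the book list mentions, a tokenization pipeline, can be sketched in plain Python: build a vocabulary from a corpus, then map text to integer IDs and back. This is a minimal illustration of the concept, not the book's actual code (which uses more robust tokenization such as BPE); all function names here are invented for the example.

```python
import re

TOKEN_PATTERN = r"\w+|[^\w\s]"  # words, plus punctuation as separate tokens

def build_vocab(text):
    # Map each unique token in the corpus to an integer ID.
    tokens = re.findall(TOKEN_PATTERN, text)
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def encode(text, vocab):
    # Text -> list of token IDs.
    return [vocab[tok] for tok in re.findall(TOKEN_PATTERN, text)]

def decode(ids, vocab):
    # Token IDs -> text (naively space-joined).
    inverse = {i: tok for tok, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

sample = "LLMs predict the next token, one token at a time."
vocab = build_vocab(sample)
ids = encode(sample, vocab)
# Repeated words reuse the same ID: both occurrences of "token" match.
assert ids[4] == ids[7]
```

A real pipeline would also handle unknown tokens and subword splitting, but the encode/decode round trip above is the core idea the book builds on.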

If you're getting into LLMs, PyTorch is essential. And a lot of folks asked for beginner-friendly material, so I put this together: PyTorch in One Hour: From Tensors to Multi-GPU Training (sebastianraschka.com/teaching/pytor…)

📖 ~1h to read through
💡 Maybe the perfect weekend project!?

I've spent nearly a decade using, building with, and teaching PyTorch. And in this tutorial, I try to distill what I believe are the most essential concepts: everything you need to know to get started, but nothing more, since your time is valuable and you want to get to building things!

130k

Most engaged tweets of Sebastian Raschka

One of the best ways to understand LLMs is to code one from scratch! Last summer, I started working on a new book, "Build a Large Language Model (from Scratch)": manning.com/books/build-a-…

I'm excited to share that the first chapters are now available via Manning's early access program if you are looking to read something over the holidays or pick up a new project in 2024!

In short, in this book, I'll guide you step by step through creating your own LLM, explaining each stage with clear text, diagrams, and examples. This includes:
1. Implementing the data preparation, sampling, and tokenization pipeline
2. Coding multi-head attention from the ground up
3. Building and pretraining a GPT-like model
4. Learning how to load pretrained weights
5. Finetuning the model for classification
6. Instruction-finetuning the model with direct preference optimization

PS: The code implementations are in PyTorch. Don't hesitate to reach out if you have any questions!

674k

Looks like the first open source equivalent of ChatGPT has arrived: github.com/lucidrains/PaL… I.e., an implementation of RLHF (Reinforcement Learning with Human Feedback) on top of Google’s 540 billion parameter PaLM architecture

1M

DeepSeek finally released a new model and paper. And because this DeepSeek-OCR release is a bit different from what everyone expected, and DeepSeek releases are generally a big deal, I wanted to do a brief explainer of what it is all about.

In short, they explore how vision encoders can improve the efficiency of LLMs in processing and compressing textual information. And the takeaway is that rendering text as images and feeding that to the model results in more efficient compression than working with text directly.

My first intuition was that this sounds very inefficient and shouldn't work as well as using text tokenizers (or alternatives like Byte Latent Transformer) to prepare the input. It actually reminded me of a line of research I saw years ago, where researchers represented 3D molecules as 3D inputs or 2D images for ConvNets instead of using graph neural nets. This shouldn't work well and should be prone to overfitting.

In the case of DeepSeek-OCR, why even try such an approach? I imagine it started as a curiosity, but then it may have turned into an interesting idea for long-context scaling in LLMs and how to make it cheaper by using vision tokens and representations. (An image can say more than a thousand words, but who would have thought that an image of text can say 1000 words more efficiently!)

In any case, this DeepSeek-OCR approach turns out to be surprisingly efficient. In particular, they found that at a fixed precision of 97% for long-context decoding (i.e., how well the model can compress information into a latent representation and reconstruct it), the OCR version needed 10 times fewer visual tokens than text tokens. In other words, the OCR version can compress information 10x better than the text version.

How is it different compared to other VLLM architectures?
- They don't use a monolithic ViT as encoder; instead, they fuse local and global vision features through a clever 16x convolutional compressor (this can handle high-resolution inputs efficiently in terms of memory and token counts).
- They are (to the best of my knowledge) the first to use an MoE as a decoder.

I think it's an interesting, refreshing approach, and the twist here is that it works surprisingly well. However, I don't think that visual representations of text will solve the limitations of LLMs. Also, while it is popular to dislike text tokenizers like BPE, image representations are messy as well (one has to deal with aspect ratios, resolutions, croppings, color intensity variations, brightness levels, etc.).

Still, it's an interesting idea. Also, if this approach is more efficient than regular black & white text, I am curious to see compression ratios when we add syntax color to code. Regarding code, this may be an interesting alternative for storing contextual information, as spacing and subword tokenization remain challenges in traditional tokenizers. (Especially when working with code that uses many custom variable names that may not be represented in vocabularies and have to be broken down into many individual subword tokens.)

Overall, it's still such an esoteric concept to encode text in images that I am (still) surprised it could do well (and maybe it would only make sense for very long documents or special domains like OCR or code, not general language modeling).

(PS: Personally, I expected the DeepSeek team to follow up with a V4 model using the sparse attention mechanism they tried in V3.2 recently, but maybe that's still forthcoming. Now, after reading this paper, V4 is perhaps going to be a VLLM.)

153k
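The practical upside of the 10x figure in the tweet is easy to see with back-of-envelope arithmetic: at the same context budget, a model reading rendered pages fits ten times as many pages. The numbers below are illustrative assumptions, not figures from the paper's tables.

```python
# Back-of-envelope sketch of the ~10x compression claim from the
# DeepSeek-OCR discussion. All concrete numbers are hypothetical.
text_tokens_per_page = 1_000          # assumed tokens for a page as plain text
compression_ratio = 10                # reported ~10x fewer visual tokens at ~97% precision
vision_tokens_per_page = text_tokens_per_page // compression_ratio  # 100

context_window = 128_000              # assumed LLM context budget, in tokens
pages_as_text = context_window // text_tokens_per_page       # 128 pages
pages_as_images = context_window // vision_tokens_per_page   # 1,280 pages
print(pages_as_text, pages_as_images)
```

This is why the tweet frames the result as a long-context scaling idea rather than just an OCR trick: the same budget stretches an order of magnitude further.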

"What Matters In Transformers?" is an interesting paper (arxiv.org/abs/2406.15786) that finds you can actually remove half of the attention layers in LLMs like Llama without noticeably reducing modeling performance.

The concept is relatively simple. The authors delete attention layers, MLP layers, or entire transformer blocks:
- Removing entire transformer blocks leads to significant performance degradation.
- Removing MLP layers results in significant performance degradation.
- Removing attention layers causes almost no performance degradation!

In Llama 2 70B, even if half of the attention layers are deleted (which results in a 48% speed-up), there's only a 2.4% decrease in the model benchmarks. The author also recently added Llama 3 results to the paper, which are similar.

The attention layers were not removed randomly but based on a cosine-based similarity score: if the input and output are very similar, the layer is redundant and can be removed.

This is a super intriguing result and could potentially be combined with various model compression techniques (like pruning and quantization) for compounding effects. Furthermore, the layers are removed in a one-shot fashion (versus iterative fashion), and no (re)training is required after the removal. However, retraining the model after the removal could potentially even recover some of the lost performance.

Overall, a very simple but very interesting study. It appears there might be lots of computational redundancy in larger architectures. One big caveat of this study, though, is that the focus is mostly on academic benchmarks (HellaSwag, MMLU, etc.). It's unclear how well the models perform on benchmarks measuring conversational performance.

192k

Saw that DGX Spark vs Mac Mini M4 Pro benchmark plot making the rounds (looks like it came from @lmsysorg). Thought I'd share a few notes as someone who actually uses a Mac Mini M4 Pro and has been tempted by the DGX Spark.

First of all, I really like the Mac Mini. It's probably the best desktop I've ever owned. For local inference with open-weight LLMs, it works great (the plot above captures that well). I regularly run the gpt-oss-20B model on it. That said, I would not fine-tune even small LLMs on it since it gets very hot. The DGX Spark probably targets that type of sustained workload. (From those who have one, any thoughts on the noise and heat levels?)

The other big thing that DGX Spark gets you is CUDA support. If you use PyTorch, that's pretty essential since MPS on macOS is still unstable, and fine-tuning often fails to converge. E.g., see github.com/rasbt/LLMs-fro… and github.com/rasbt/LLMs-fro…

I also like the Spark's form factor (hey, it really appeals to the Mac Mini user in me). But for the same money, I could probably buy about 4000 A100 cloud GPU hours, and I keep debating which would be the better investment. Sure, I could also build/get a multi-GPU desktop. I had a Lambda system with four GTX 1080 Ti cards back in 2018, but it was too loud and hot for my office. And if I have to move it to another room and SSH into it anyway, I might as well use cloud GPUs instead?

125k
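The "about 4000 A100 cloud GPU hours" remark implies a simple break-even calculation. Both prices below are assumptions chosen to reproduce the tweet's rough figure, not actual quotes:

```python
# Rough cost comparison behind the "~4000 A100 cloud GPU hours" remark.
# Both prices are illustrative assumptions, not real quotes.
dgx_spark_price = 4_000.0      # assumed up-front hardware cost, USD
a100_hourly_rate = 1.00        # assumed on-demand cloud rate, USD/hour
cloud_hours = dgx_spark_price / a100_hourly_rate  # 4000.0 hours

# At 8 GPU-hours per working day, that budget covers 500 working days:
working_days = cloud_hours / 8
print(cloud_hours, working_days)
```

Whether the hardware wins depends on utilization: the cloud only loses if you would actually keep a local box busy for years.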

Exciting news! "Build A Large Language Model (From Scratch)" is now finally available on Amazon amazon.com/Build-Large-La… Writing this book was a huge effort for me, and I'm so grateful for the support and motivating feedback these past months. Many thanks, and happy reading! 😊

170k

From the Hierarchical Reasoning Model (HRM) to a new Tiny Recursive Model (TRM).

A few months ago, the HRM made big waves in the AI research community as it showed really good performance on the ARC challenge despite its small 27M size. (That's about 22x smaller than the smallest Qwen3 0.6B model.) Now, the new "Less is More: Recursive Reasoning with Tiny Networks" paper proposes the Tiny Recursive Model (TRM), which is a simpler and even smaller model (7M, 4x smaller than HRM) that performs even better on the ARC challenge.

🔹 What does recursion mean here?
TRM refines its answer in two steps:
1. It updates a latent (reasoning) state from the current question and answer.
2. Then it updates the answer based on that latent state.
Training runs for up to 16 refinement steps per batch. Each step does several no-grad loops to improve the answer, followed by one gradient loop that learns from the full reasoning process. By the way, the question and the answer are grids of discrete tokens, not text. (E.g., 9×9 Sudoku and up to 30×30 ARC and Maze.)

🔹 And how does it differ from HRM?
In short, HRM recurses multiple times through two small neural nets with 4 transformer blocks each (high and low frequency). TRM is much smaller (i.e., 4x) and only a single network with 2 transformer blocks. TRM backpropagates through the full recursion once per step, whereas HRM only backpropagates through the final few steps. And TRM also removes HRM's extra forward pass for halting and instead uses a simple binary cross-entropy loss to learn when to stop iterating.

🔹 Surprising tidbits
1. The author found that adding layers decreased generalization due to overfitting. And going from 4 to 2 layers improved the model from 79.5% to 87.4% on Sudoku.
2. Replacing the self-attention layer with an MLP layer also improved accuracy (74.7% -> 87.4% on Sudoku); however, note that this only makes sense here since we have a fixed-length, small context to work with.
🔹 Bigger picture
My personal caveat: comparing this method (or HRMs) to LLMs feels a bit unfair, since HRM/TRM are specialized models trained for specific tasks (here: ARC, Sudoku, and Maze pathfinding) while LLMs are generalists. It's like comparing a pocket calculator to a laptop. Both serve a purpose, just in different contexts.

That said, HRMs and the recursive model proposed here are fascinating proof-of-concepts that show what's possible with relatively small and efficient architectures. I'm still curious what the real-world use case will look like. Maybe they could serve as reasoning or planning modules within a larger tool-calling system. In practice, we often start by throwing LLMs at a problem, which makes sense for quick prototyping and establishing a baseline. But I can see a point where someone sits down afterward and trains a focused model like this to solve the same task more efficiently.

123k
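The two-step refinement loop the tweet describes (update a latent state from the question and current answer, then update the answer from the latent) can be illustrated with a numeric toy. This only mirrors the control flow, not the paper's learned networks: here "solving" means converging to the square root of the question, and the two update rules are stand-ins for TRM's two network calls.

```python
# Toy illustration of a TRM-style refinement loop. The update rules are
# hypothetical stand-ins for the paper's learned networks; only the
# alternate-latent/answer control flow matches the description above.
def refine(question, steps=16):
    answer = 1.0          # initial answer guess
    latent = 0.0          # latent "reasoning" state
    for _ in range(steps):
        latent = question / answer        # step 1: latent from (question, answer)
        answer = 0.5 * (answer + latent)  # step 2: answer from latent (Newton step)
    return answer

print(refine(9.0))  # converges to 3.0
```

The paper's point survives even in this toy: a tiny update rule applied recursively can reach an accurate answer that a single forward pass of the same size could not.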

People with Innovator archetype

The Innovator

Open-source maximalist ᕕ( ᐛ )ᕗ

121 following · 4k followers
The Innovator

Turning product ideas into assets with AI—in hours, not months. Ex-Buffer PM & SaaS founder (with exit). Now building AlreadyLovedKids.com

2k following · 1k followers
The Innovator

New Hollywood - Ai Creative Technology

796 following · 426 followers
The Innovator

Accelerating Carbon Fiber Manufacturing.🦾 🧠:Deeptech, Simulation and Martial Arts🥋 Ex-Reliability Engineer @Tesla. 🚗 Its time to build! #technooptimist🚀

1k following · 748 followers
The Innovator

I talk about physical performance and cognitive enhancement. Build a body and mind that perform at their peak.

310 following · 4k followers
The Innovator

Building @relayprotocol // @wgtechlabs // Ex @thirdweb #ShippinginSilence 👀 🇵🇭 Deep into #AI, #opensource & #blockchain — follow if you #build in #tech 🤝

633 following · 1k followers
The Innovator

Staying degen until the next bull. Smart contracts, dumb jokes.

117 following · 227 followers
The Innovator

The chain to move what matters: value, data, and ideas for billions everywhere. X by Aptos Foundation.

435 following · 673k followers
The Innovator

Building AI for DevOps | ex-Palantir

1k following · 1k followers
The Innovator

Building AI Agents and Automating Workflows. Watch how I build them: youtube.com/@TheRecapAI Download all of my automations (for free)👇

74 following · 12k followers
The Innovator

Crypto & Web3 enthusiast | @megaeth🐇|

1k following · 1k followers
The Innovator

Planet Earth Live on Web3 & more ...

383 following · 94 followers

