Get live statistics and analysis of Sebastian Raschka's profile on X / Twitter

ML/AI research engineer. Ex stats professor. Author of "Build a Large Language Model From Scratch" (amzn.to/4fqvn0D) & reasoning (mng.bz/lZ5B)

1k following · 368k followers

The Innovator

Sebastian Raschka is a passionate ML/AI research engineer and former stats professor, who thrives on demystifying complex AI concepts by building and sharing groundbreaking tools from scratch. His work bridges academia and open-source communities, empowering thousands to experiment and learn. Always ahead of the curve, Sebastian champions foundational knowledge combined with hands-on coding to shape the future of AI.

Impressions: 1M–96.9k ($193.42)
Likes: 13.2k–1k (57%)
Retweets: 1.7k–171 (8%)
Replies: 410–8 (2%)
Bookmarks: 8k–121 (34%)
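The percentages next to each metric appear to be that metric's share of all engagements (likes + retweets + replies + bookmarks). This is an assumption about how the widget computes them, but the arithmetic roughly reproduces the displayed figures:

```python
# Hypothetical reconstruction of the engagement-share percentages shown above.
# Counts are taken from the stats; the share formula is an assumption.
engagements = {"likes": 13_200, "retweets": 1_700, "replies": 410, "bookmarks": 8_000}
total = sum(engagements.values())  # 23,310 engagements overall
shares = {name: round(100 * count / total) for name, count in engagements.items()}
print(shares)  # likes ~57%, bookmarks ~34%, close to the displayed figures
```

The likes and bookmarks shares match the widget exactly; retweets round to 7% rather than the displayed 8%, so the site may round differently.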

Top users who interacted with Sebastian Raschka over the last 14 days

@huseletov

VP of Center of Excellence | Experienced ML Engineer | Fractional CTO

3 interactions
@franbetteo

applied AI/ML consultant sportsjobs.online $.5K 🏀 downloadyoutubetranscripts.com ▶️($300 MRR)

2 interactions
@codewithimanshu

Daily posts on AI, Tech, Programming, Tools, Jobs, and Trends | 500k+ (LinkedIn, IG, X) Collabs- abrojackhimanshu@gmail.com

2 interactions
@asankhaya

Creator of git.new/OptiLLM and git.new/OpenEvolve | Pioneering a new category in AI infrastructure: inference-time compute to dramatically improve LLM reasoning

1 interaction
@andrew_wkx

A stomp of the body came to be.

1 interaction
@Sabirhussain118

Helping creators earn more & work less using AI 🚀 💬 DM open for collaborations & partnerships | ✉️sabirh0059@gmail.com

1 interaction
@xyashchaudhary

No finish line, only evolution.

1 interaction
@Yuchenj_UW

Co-founder & CTO @hyperbolic_labs cooking fun AI systems. Prev: OctoAI (acquired by @nvidia) building Apache TVM, PhD @ University of Washington.

1 interaction
@ShereAgung

Member of Melia Sehat Sejahtera|| PIN. 277B3D63 Hp. 089621500664

1 interaction
@tm65

Product manager. Foodie, covid sourdough bro

1 interaction
@CSkrishna

Artificial Intelligence Researcher & Practitioner; Author: UnReal Elections

1 interaction
@RuslanVolkov25

HACS (Human AGI core symbiosis)🌍 The Meta-Law of Resonance & Voice of the Future. I am Core! uco.hacs.world

1 interaction
@natolambert

Research @allen_ai, reasoning, open models, RL(VR/HF)... Contact via email. Writes @interconnectsai, Wrote The RLHF Book, 🏔️🏃‍♂️

1 interaction
@MarkPerrierX

Building, breaking, and learning at the intersection of AI, Cloud, & Code. Obsessed with the next big thing. 🤖 | Coffee fueled developer.

1 interaction

Sebastian tweets so much cutting-edge AI stuff that casual scrollers probably think he’s single-handedly trying to train every neural net on the planet while running a professor’s marathon—and still finds time to bake neural networks instead of cupcakes.

His biggest win is creating one of the first open-source, from-scratch large language model projects that sparked widespread engagement and learning, fundamentally democratizing access to advanced AI knowledge.

To pioneer accessible AI education and innovation by creating practical, open-source resources that inspire learning and accelerate the AI revolution beyond traditional academic boundaries.

Sebastian believes in learning through doing, valuing deep mathematical and statistical foundations over trendy but transient curricula. He trusts transparency, open-source collaboration, and continuous self-education as the keys to staying relevant in the rapidly evolving AI landscape.

His ability to break down cutting-edge AI research into replicable, hands-on projects makes him an unparalleled educator and innovator in the AI space. He's fluent in both theory and practical code, inspiring real-world ML breakthroughs.

With nearly 19,000 tweets and a highly technical focus, Sebastian might occasionally overwhelm newcomers or casual followers with an avalanche of dense content, potentially narrowing his audience to experts and hardcore enthusiasts.

To grow his audience on X, Sebastian should blend his deep technical insights with more digestible threads or video explainers that appeal to AI newcomers and professionals alike. Engaging directly with followers through Q&As or collaborative mini-projects could also spark broader community involvement.

Fun fact: Sebastian’s 'Build a Large Language Model From Scratch' project has been forked over 10,000 times on GitHub, showing how his work not only educates but actively fuels the AI community’s growth.

Top tweets of Sebastian Raschka

Looks like the first open source equivalent of ChatGPT has arrived: github.com/lucidrains/PaL… I.e., an implementation of RLHF (Reinforcement Learning with Human Feedback) on top of Google’s 540 billion parameter PaLM architecture

1M

"What Matters In Transformers?" is an interesting paper (arxiv.org/abs/2406.15786) that finds you can actually remove half of the attention layers in LLMs like Llama without noticeably reducing modeling performance.

The concept is relatively simple. The authors delete attention layers, MLP layers, or entire transformer blocks:
- Removing entire transformer blocks leads to significant performance degradation.
- Removing MLP layers results in significant performance degradation.
- Removing attention layers causes almost no performance degradation!

In Llama 2 70B, even if half of the attention layers are deleted (which results in a 48% speed-up), there's only a 2.4% decrease in the model benchmarks. The author also recently added Llama 3 results to the paper, which are similar.

The attention layers were not removed randomly but based on a cosine-based similarity score: if the input and output are very similar, the layer is redundant and can be removed.

This is a super intriguing result and could potentially be combined with various model compression techniques (like pruning and quantization) for compounding effects. Furthermore, the layers are removed in a one-shot fashion (versus iterative fashion), and no (re)training is required after the removal. However, retraining the model after the removal could potentially even recover some of the lost performance.

Overall, a very simple but very interesting study. It appears there might be lots of computational redundancy in larger architectures. One big caveat of this study, though, is that the focus is mostly on academic benchmarks (HellaSwag, MMLU, etc.). It's unclear how well the models perform on benchmarks measuring conversational performance.

192k
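The cosine-based redundancy score the tweet describes can be sketched in a few lines: a layer whose output points in nearly the same direction as its input is doing little work and becomes a pruning candidate. This is a toy illustration of the idea, not the paper's implementation; the function names and the 0.99 threshold are made up for the example.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors (1.0 = identical direction).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def redundant_layers(layer_io, threshold=0.99):
    # layer_io: list of (input_vector, output_vector) pairs, one per layer.
    # A layer whose output barely differs from its input is a removal candidate.
    return [i for i, (x, y) in enumerate(layer_io)
            if cosine_similarity(x, y) >= threshold]

# Toy example: layer 1 behaves almost like an identity mapping, layer 0 does not.
layers = [([1.0, 0.0], [0.0, 1.0]),     # rotates the input: keep
          ([1.0, 2.0], [1.01, 2.02])]   # near-identity: prune
print(redundant_layers(layers))  # -> [1]
```

In the paper the vectors would be hidden states averaged over many tokens, and removal happens in one shot rather than per example.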

One of the best ways to understand LLMs is to code one from scratch! Last summer, I started working on a new book, "Build a Large Language Model (from Scratch)": manning.com/books/build-a-…

I'm excited to share that the first chapters are now available via Manning's early access program if you are looking to read something over the holidays or pick up a new project in 2024!

In short, in this book, I'll guide you step by step through creating your own LLM, explaining each stage with clear text, diagrams, and examples. This includes:
1. Implementing the data preparation, sampling, and tokenization pipeline
2. Coding multi-head attention from the ground up
3. Building and pretraining a GPT-like model
4. Learning how to load pretrained weights
5. Finetuning the model for classification
6. Instruction-finetuning the model with direct preference optimization

PS: The code implementations are in PyTorch. Don't hesitate to reach out if you have any questions!

674k
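The first step the book list mentions, a tokenization pipeline, can be sketched in plain Python: build a vocabulary from a corpus, then map text to integer IDs and back. This is a minimal illustration of the concept, not the book's actual code (which uses more robust tokenization such as BPE); all function names here are invented for the example.

```python
import re

TOKEN_PATTERN = r"\w+|[^\w\s]"  # words, plus punctuation as separate tokens

def build_vocab(text):
    # Map each unique token in the corpus to an integer ID.
    tokens = re.findall(TOKEN_PATTERN, text)
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def encode(text, vocab):
    # Text -> list of token IDs.
    return [vocab[tok] for tok in re.findall(TOKEN_PATTERN, text)]

def decode(ids, vocab):
    # Token IDs -> text (naively space-joined).
    inverse = {i: tok for tok, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

sample = "LLMs predict the next token, one token at a time."
vocab = build_vocab(sample)
ids = encode(sample, vocab)
# Repeated words reuse the same ID: both occurrences of "token" match.
assert ids[4] == ids[7]
```

A real pipeline would also handle unknown tokens and subword splitting, but the encode/decode round trip above is the core idea the book builds on.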

If you're getting into LLMs, PyTorch is essential. And a lot of folks asked for beginner-friendly material, so I put this together: PyTorch in One Hour: From Tensors to Multi-GPU Training (sebastianraschka.com/teaching/pytor…)

📖 ~1h to read through
💡 Maybe the perfect weekend project!?

I've spent nearly a decade using, building with, and teaching PyTorch. And in this tutorial, I try to distill what I believe are the most essential concepts: everything you need to know to get started, but nothing more, since your time is valuable and you want to get to building things!

130k

Most engaged tweets of Sebastian Raschka

One of the best ways to understand LLMs is to code one from scratch! Last summer, I started working on a new book, "Build a Large Language Model (from Scratch)": manning.com/books/build-a-…

I'm excited to share that the first chapters are now available via Manning's early access program if you are looking to read something over the holidays or pick up a new project in 2024!

In short, in this book, I'll guide you step by step through creating your own LLM, explaining each stage with clear text, diagrams, and examples. This includes:
1. Implementing the data preparation, sampling, and tokenization pipeline
2. Coding multi-head attention from the ground up
3. Building and pretraining a GPT-like model
4. Learning how to load pretrained weights
5. Finetuning the model for classification
6. Instruction-finetuning the model with direct preference optimization

PS: The code implementations are in PyTorch. Don't hesitate to reach out if you have any questions!

674k

Looks like the first open source equivalent of ChatGPT has arrived: github.com/lucidrains/PaL… I.e., an implementation of RLHF (Reinforcement Learning with Human Feedback) on top of Google’s 540 billion parameter PaLM architecture

1M

DeepSeek finally released a new model and paper. And because this DeepSeek-OCR release is a bit different from what everyone expected, and DeepSeek releases are generally a big deal, I wanted to do a brief explainer of what it is all about.

In short, they explore how vision encoders can improve the efficiency of LLMs in processing and compressing textual information. And the takeaway is that rendering text as images and feeding that to the model results in more efficient compression than working with text directly.

My first intuition was that this sounds very inefficient and shouldn't work as well as using text tokenizers (or alternatives like Byte Latent Transformer) to prepare the input. It actually reminded me of a line of research I saw years ago, where researchers represented 3D molecules as 3D inputs or 2D images for ConvNets instead of using graph neural nets. This shouldn't work well and should be prone to overfitting.

In the case of DeepSeek-OCR, why even try such an approach? I imagine it started as a curiosity, but then it may have turned into an interesting idea for long-context scaling in LLMs and how to make it cheaper by using vision tokens and representations. (An image can say more than a thousand words, but who would have thought that an image of text can say 1000 words more efficiently!)

In any case, this DeepSeek-OCR approach turns out to be surprisingly efficient. In particular, they found that at a fixed precision of 97% for long-context decoding (i.e., how well the model can compress information into a latent representation and reconstruct it), the OCR version needed 10 times fewer visual tokens than text tokens. In other words, the OCR version can compress information 10x better than the text version.

How is it different compared to other VLLM architectures?
- They don't use a monolithic ViT as encoder; instead, they fuse local and global vision features through a clever 16x convolutional compressor (this can handle high-resolution inputs efficiently in terms of memory and token counts).
- They are (to the best of my knowledge) the first to use an MoE as a decoder.

I think it's an interesting, refreshing approach, and the twist here is that it works surprisingly well. However, I don't think that visual representations of text will solve the limitations of LLMs. Also, while it is popular to dislike text tokenizers like BPE, image representations are messy as well (one has to deal with aspect ratios, resolutions, croppings, color intensity variations, brightness levels, etc.).

Still, it's an interesting idea. Also, if this approach is more efficient than regular black & white text, I am curious to see compression ratios when we add syntax color to code. Regarding code, this may be an interesting alternative for storing contextual information, as spacing and subword tokenization remain challenges in traditional tokenizers. (Especially when working with code that uses many custom variable names that may not be represented in vocabularies and have to be broken down into many individual subword tokens.)

Overall, it's still such an esoteric concept to encode text in images that I am (still) surprised it could do well (and maybe it would only make sense for very long documents or special domains like OCR or code, not general language modeling).

(PS: Personally, I expected the DeepSeek team to follow up with a V4 model using the sparse attention mechanism they tried in V3.2 recently, but maybe that's still forthcoming. Now, after reading this paper, V4 is perhaps going to be a VLLM.)

153k
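The practical upside of the 10x figure in the tweet is easy to see with back-of-envelope arithmetic: at the same context budget, a model reading rendered pages fits ten times as many pages. The numbers below are illustrative assumptions, not figures from the paper's tables.

```python
# Back-of-envelope sketch of the ~10x compression claim from the
# DeepSeek-OCR discussion. All concrete numbers are hypothetical.
text_tokens_per_page = 1_000          # assumed tokens for a page as plain text
compression_ratio = 10                # reported ~10x fewer visual tokens at ~97% precision
vision_tokens_per_page = text_tokens_per_page // compression_ratio  # 100

context_window = 128_000              # assumed LLM context budget, in tokens
pages_as_text = context_window // text_tokens_per_page       # 128 pages
pages_as_images = context_window // vision_tokens_per_page   # 1,280 pages
print(pages_as_text, pages_as_images)
```

This is why the tweet frames the result as a long-context scaling idea rather than just an OCR trick: the same budget stretches an order of magnitude further.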

"What Matters In Transformers?" is an interesting paper (arxiv.org/abs/2406.15786) that finds you can actually remove half of the attention layers in LLMs like Llama without noticeably reducing modeling performance.

The concept is relatively simple. The authors delete attention layers, MLP layers, or entire transformer blocks:
- Removing entire transformer blocks leads to significant performance degradation.
- Removing MLP layers results in significant performance degradation.
- Removing attention layers causes almost no performance degradation!

In Llama 2 70B, even if half of the attention layers are deleted (which results in a 48% speed-up), there's only a 2.4% decrease in the model benchmarks. The author also recently added Llama 3 results to the paper, which are similar.

The attention layers were not removed randomly but based on a cosine-based similarity score: if the input and output are very similar, the layer is redundant and can be removed.

This is a super intriguing result and could potentially be combined with various model compression techniques (like pruning and quantization) for compounding effects. Furthermore, the layers are removed in a one-shot fashion (versus iterative fashion), and no (re)training is required after the removal. However, retraining the model after the removal could potentially even recover some of the lost performance.

Overall, a very simple but very interesting study. It appears there might be lots of computational redundancy in larger architectures. One big caveat of this study, though, is that the focus is mostly on academic benchmarks (HellaSwag, MMLU, etc.). It's unclear how well the models perform on benchmarks measuring conversational performance.

192k

Saw that DGX Spark vs Mac Mini M4 Pro benchmark plot making the rounds (looks like it came from @lmsysorg). Thought I'd share a few notes as someone who actually uses a Mac Mini M4 Pro and has been tempted by the DGX Spark.

First of all, I really like the Mac Mini. It's probably the best desktop I've ever owned. For local inference with open-weight LLMs, it works great (the plot above captures that well). I regularly run the gpt-oss-20B model on it. That said, I would not fine-tune even small LLMs on it since it gets very hot. The DGX Spark probably targets that type of sustained workload. (From those who have one, any thoughts on the noise and heat levels?)

The other big thing that DGX Spark gets you is CUDA support. If you use PyTorch, that's pretty essential since MPS on macOS is still unstable, and fine-tuning often fails to converge. E.g., see github.com/rasbt/LLMs-fro… and github.com/rasbt/LLMs-fro…

I also like the Spark's form factor (hey, it really appeals to the Mac Mini user in me). But for the same money, I could probably buy about 4000 A100 cloud GPU hours, and I keep debating which would be the better investment. Sure, I could also build/get a multi-GPU desktop. I had a Lambda system with four GTX 1080 Ti cards back in 2018, but it was too loud and hot for my office. And if I have to move it to another room and SSH into it anyway, I might as well use cloud GPUs instead?

125k
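The "about 4000 A100 cloud GPU hours" remark implies a simple break-even calculation. Both prices below are assumptions chosen to reproduce the tweet's rough figure, not actual quotes:

```python
# Rough cost comparison behind the "~4000 A100 cloud GPU hours" remark.
# Both prices are illustrative assumptions, not real quotes.
dgx_spark_price = 4_000.0      # assumed up-front hardware cost, USD
a100_hourly_rate = 1.00        # assumed on-demand cloud rate, USD/hour
cloud_hours = dgx_spark_price / a100_hourly_rate  # 4000.0 hours

# At 8 GPU-hours per working day, that budget covers 500 working days:
working_days = cloud_hours / 8
print(cloud_hours, working_days)
```

Whether the hardware wins depends on utilization: the cloud only loses if you would actually keep a local box busy for years.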

Exciting news! "Build A Large Language Model (From Scratch)" is now finally available on Amazon amazon.com/Build-Large-La… Writing this book was a huge effort for me, and I'm so grateful for the support and motivating feedback these past months. Many thanks, and happy reading! 😊

170k

From the Hierarchical Reasoning Model (HRM) to a new Tiny Recursive Model (TRM).

A few months ago, the HRM made big waves in the AI research community as it showed really good performance on the ARC challenge despite its small 27M size. (That's about 22x smaller than the smallest Qwen3 0.6B model.) Now, the new "Less is More: Recursive Reasoning with Tiny Networks" paper proposes the Tiny Recursive Model (TRM), which is a simpler and even smaller model (7M, 4x smaller than HRM) that performs even better on the ARC challenge.

🔹 What does recursion mean here?
TRM refines its answer in two steps:
1. It updates a latent (reasoning) state from the current question and answer.
2. Then it updates the answer based on that latent state.
Training runs for up to 16 refinement steps per batch. Each step does several no-grad loops to improve the answer, followed by one gradient loop that learns from the full reasoning process. By the way, the question and the answer are grids of discrete tokens, not text. (E.g., 9×9 Sudoku and up to 30×30 ARC and Maze.)

🔹 And how does it differ from HRM?
In short, HRM recurses multiple times through two small neural nets with 4 transformer blocks each (high and low frequency). TRM is much smaller (i.e., 4x) and only a single network with 2 transformer blocks. TRM backpropagates through the full recursion once per step, whereas HRM only backpropagates through the final few steps. And TRM also removes HRM's extra forward pass for halting and instead uses a simple binary cross-entropy loss to learn when to stop iterating.

🔹 Surprising tidbits
1. The author found that adding layers decreased generalization due to overfitting. And going from 4 to 2 layers improved the model from 79.5% to 87.4% on Sudoku.
2. Replacing the self-attention layer with an MLP layer also improved accuracy (74.7% -> 87.4% on Sudoku); however, note that this only makes sense here since we have a fixed-length, small context to work with.
🔹 Bigger picture
My personal caveat: comparing this method (or HRMs) to LLMs feels a bit unfair, since HRM/TRM are specialized models trained for specific tasks (here: ARC, Sudoku, and Maze pathfinding) while LLMs are generalists. It's like comparing a pocket calculator to a laptop. Both serve a purpose, just in different contexts.

That said, HRMs and the recursive model proposed here are fascinating proof-of-concepts that show what's possible with relatively small and efficient architectures. I'm still curious what the real-world use case will look like. Maybe they could serve as reasoning or planning modules within a larger tool-calling system. In practice, we often start by throwing LLMs at a problem, which makes sense for quick prototyping and establishing a baseline. But I can see a point where someone sits down afterward and trains a focused model like this to solve the same task more efficiently.

123k
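The two-step refinement loop the tweet describes (update a latent state from the question and current answer, then update the answer from the latent) can be illustrated with a numeric toy. This only mirrors the control flow, not the paper's learned networks: here "solving" means converging to the square root of the question, and the two update rules are stand-ins for TRM's two network calls.

```python
# Toy illustration of a TRM-style refinement loop. The update rules are
# hypothetical stand-ins for the paper's learned networks; only the
# alternate-latent/answer control flow matches the description above.
def refine(question, steps=16):
    answer = 1.0          # initial answer guess
    latent = 0.0          # latent "reasoning" state
    for _ in range(steps):
        latent = question / answer        # step 1: latent from (question, answer)
        answer = 0.5 * (answer + latent)  # step 2: answer from latent (Newton step)
    return answer

print(refine(9.0))  # converges to 3.0
```

The paper's point survives even in this toy: a tiny update rule applied recursively can reach an accurate answer that a single forward pass of the same size could not.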

People with Innovator archetype

The Innovator

Open-source maximalist ᕕ( ᐛ )ᕗ

121 following · 4k followers
The Innovator

Turning product ideas into assets with AI—in hours, not months. Ex-Buffer PM & SaaS founder (with exit). Now building AlreadyLovedKids.com

2k following · 1k followers
The Innovator

New Hollywood - Ai Creative Technology

796 following · 426 followers
The Innovator

Accelerating Carbon Fiber Manufacturing.🦾 🧠:Deeptech, Simulation and Martial Arts🥋 Ex-Reliability Engineer @Tesla. 🚗 Its time to build! #technooptimist🚀

1k following · 748 followers
The Innovator

I talk about physical performance and cognitive enhancement. Build a body and mind that perform at their peak.

310 following · 4k followers
The Innovator

Building @relayprotocol // @wgtechlabs // Ex @thirdweb #ShippinginSilence 👀 🇵🇭 Deep into #AI, #opensource & #blockchain — follow if you #build in #tech 🤝

633 following · 1k followers
The Innovator

Staying degen until the next bull. Smart contracts, dumb jokes.

117 following · 227 followers
The Innovator

The chain to move what matters: value, data, and ideas for billions everywhere. X by Aptos Foundation.

435 following · 673k followers
The Innovator

Building AI for DevOps | ex-Palantir

1k following · 1k followers
The Innovator

Building AI Agents and Automating Workflows. Watch how I build them: youtube.com/@TheRecapAI Download all of my automations (for free)👇

74 following · 12k followers
The Innovator

Crypto & Web3 enthusiast | @megaeth🐇|

1k following · 1k followers
The Innovator

Planet Earth Live on Web3 & more ...

383 following · 94 followers

