Get live statistics and analysis of Andi Marafioti's profile on X / Twitter

cooking multimodal models @huggingface (prev @unity)

531 following · 5k followers

The Innovator

Andi Marafioti is a trailblazer in the world of multimodal AI models, passionately pushing the limits of what’s possible with open-source vision and OCR technology. Known for their unapologetic enthusiasm and lightning-fast breakthroughs, Andi bridges the gap between cutting-edge research and accessible tech. They’re the go-to voice if you want to know how to run powerful AI models on your laptop — or even your toaster!

Impressions: 597.7k-35.1k · $112.03
Likes: 4.7k-719 · 63%
Retweets: 363-42 · 5%
Replies: 270-18 · 4%
Bookmarks: 2.1k-111 · 28%

Top users who interacted with Andi Marafioti over the last 14 days

@iamRezaSayar

👨🏻‍🎓Life-long Learner👨🏻‍🎓 Kindness❤️, Helpfulness🫂 , AI🧠, Robotics🦾 & Reggaetón💃🏻

1 interaction
@JamieGale7

Transforming the world's physical data into decision-making insights

1 interaction
@ivanfioravanti

Co-founder and CTO of @CoreViewHQ GenAI/LLM addicted, Apple MLX, Microsoft 365, Azure, Kubernetes, builder of my personal dreams.

1 interaction
@lusxvr

CS @ TUM

1 interaction
@NandoDF

Writing my own AI story. Recent: NPI, AlphaGo tuning, learn to learn, AlphaCode, Gato, ReST, r-Gemma, Imagen3, Veo, Genie, MAI …

1 interaction
@ManuelFaysse

NLP Research, interning at FAIR @AIatMeta + PhD Candidate @CentraleSupelec Prev: @imperialcollege, @epfl

1 interaction
@MaziyarPanahi

AI x Healthcare | Creator of @OpenMed_AI | Open-Source AI Advocate ❤️ | eu/acc 🇫🇷🇪🇺

1 interaction
@ClementDelangue

Co-founder & CEO @HuggingFace 🤗, the open and collaborative platform for AI builders

1 interaction
@Observer_ofyou

cs • except cs

1 interaction
@dennis_nik75

Animator/Programmer, Christian, Conservative, Married, Father, Patriot. I love learning about aerospace, AI & robotics. Free speech equals free thought. 🚫DMs

1 interaction
@leothecurious

teaching robots to see by day, learning from nature by night. in search of elegant solutions to the metaproblem. infinitely curious.

1 interaction
@vedantadoestech

i write, research and eat. that’s all basically

1 interaction
@sir4K_zen

Automation specialist

1 interaction
@iamdiegopy

exploring the universe and building stuff

1 interaction
@Niccolg92

PhD in ML, now AI Research Lead in 🇱🇺. Here mostly AI, including sharing paper reviews. Chess, philosophy, and a travel pic may appear. Opinions are my own.

1 interaction
@altiamkabir

🧠 AI Educator | Career Coach | Founder 📧 DM for Collaboration 🚀 Want to Learn & Earn with AI? 🤝 Join our 100k+ AI community & learn AI with 27+ Free Gifts👇

1 interaction
@CodewithP

🙋‍♂️ SWE @ServiceNow 👨‍💻 Talks to software & cats 🐈 QE's are not my best friends 👀

1 interaction
@k7agar

your friendly neighbourhood engineer. prev world models @lossfunk.

1 interaction

Andi's idea of 'fine-tuning on a laptop' sounds like their laptop deserves a bodyguard: it's basically running a marathon in flip-flops while their code sprints effortlessly ahead.

Spearheading the release of SmolDocling and SmolVLM, Andi helped deliver record-breaking open-source vision models that outperform competitors up to 27x larger, redefining efficiency in AI OCR.

Their mission: to democratize AI technology by creating efficient, scalable, and open-source multimodal models that empower developers and enthusiasts to innovate without barriers.

They believe in transparency, open collaboration, and achieving state-of-the-art results through community-driven projects. Efficiency and accessibility are core values, proving that powerful AI can come in small, sleek packages without the need for monstrous hardware.

Andi is a fearless open-sourcer and energetic communicator who not only builds game-changing models but also makes complex AI concepts digestible and exciting for a broad audience.

Their casual and sometimes profanity-laced style might alienate more traditional or formal tech communities despite their undeniable expertise.

To grow their audience on X, Andi should continue blending technical depth with their authentic voice, leveraging informative threads with humor and transparency. Engaging more with followers through Q&As and live demos could turn passive viewers into loyal fans.

Fun fact: Andi’s SmolVLM models can run on less than 1GB of GPU memory, meaning you could practically fire one up on your toaster if it had a GPU! Talk about small but mighty.
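For readers who want to test that claim, a minimal sketch of running the smallest SmolVLM with the transformers library might look like the following. The checkpoint name, the chat-template call, and the local image path are assumptions based on the public Hugging Face Hub, not something documented on this page:

```python
# Hypothetical quick-start: the model id and preprocessing calls are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"          # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)  # fp32 on CPU is fine at 256M params

image = Image.open("photo.jpg")                           # any local image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image briefly."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```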

Top tweets of Andi Marafioti

Just read the Qwen2.5-Omni technical report from the Qwen team, it's super interesting. Here are my notes.

Qwen2.5-Omni is a unified end-to-end model that can perceive text, images, audio, and video — and generate both text and natural speech responses in a streaming fashion.

At its core is the Thinker-Talker architecture:
Thinker: a large language model that processes multimodal inputs and generates text.
Talker: an autoregressive speech decoder that turns Thinker's hidden states into speech tokens.
They're trained together, end-to-end.

Handling audio: audio is converted to 128-channel mel-spectrograms (16kHz, 25ms window, 10ms hop). Encoded via a modified Whisper model. Audio is processed in 2s blocks with streaming-compatible attention to reduce latency.

Handling video: uses a ViT-based encoder with dynamic frame sampling. Each frame is treated like an image. To sync with audio, they introduce TMRoPE — Time-aligned Multimodal RoPE — a novel positional embedding that aligns video and audio in time.

TMRoPE splits positional encoding into temporal, height, and width axes, letting Qwen2.5-Omni represent image/video/audio/text all on the same timeline. Interleaving of audio and visual tokens every 2 seconds enables synchronized fusion.

Streaming audio generation: audio tokens from Talker are decoded using a sliding-window DiT model + modified BigVGAN. The receptive field includes 2 lookback blocks and 1 lookahead to allow context-aware streaming audio generation.

Pretraining involved locking the LLM and training the audio/vision encoders first. Later stages unfreeze everything and train on a massive mix of audio-text, video-text, image-text, and long-sequence (32k tokens) data.

Post-training includes reinforcement learning for Talker to reduce hallucinations and improve pronunciation/timing. Plus, multi-speaker fine-tuning for better prosody and naturalness.

Qwen2.5-Omni achieves SOTA on OmniBench, AV-Odyssey, and strong results across text, image, audio, and video tasks. End-to-end speech instruction following is nearly on par with text-based inputs. That's rare.

Overall: a super ambitious and well-integrated multimodal model. The Thinker-Talker separation is elegant. TMRoPE is a clever solution to a tricky problem. That said, I wish the paper had included more ablation studies or experiments justifying some of the architectural decisions. Many claims are reasonable but would benefit from more empirical evidence.

Still, major kudos to the team. Qwen2.5-Omni is a big step toward real-time, unified multimodal assistants.

32k
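For a concrete sense of the audio front-end described in the notes above, here is a rough sketch of the 128-bin mel-spectrogram setup (16 kHz sampling, 25 ms window, 10 ms hop) sliced into roughly 2-second blocks. It is an illustration built on torchaudio using those stated parameters, not Qwen's actual preprocessing code:

```python
# Illustrative only: the mel-spectrogram parameters come from the tweet above;
# everything else (torchaudio, dummy audio, block slicing) is an assumption.
import torch
import torchaudio

SR = 16_000
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR,
    n_fft=400,        # 25 ms analysis window at 16 kHz
    win_length=400,
    hop_length=160,   # 10 ms hop -> 100 frames per second
    n_mels=128,       # 128 mel channels
)

waveform = torch.randn(1, SR * 6)       # 6 seconds of dummy mono audio
features = mel(waveform)                # shape: (1, 128, ~601 frames)
frames_per_block = (SR // 160) * 2      # roughly 2 seconds of frames per block
blocks = features.split(frames_per_block, dim=-1)
print(features.shape, len(blocks))
```

Each block would then be passed through the Whisper-style audio encoder with streaming-friendly attention, as the notes describe.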

Today, we share the tech report for SmolVLM: Redefining small and efficient multimodal models. 🔥
Explaining how to design a tiny 256M VLM that uses less than 1GB of RAM and outperforms our 80B models from 18 months ago!

Here are the coolest insights from our experiments:

✨ Longer context = Big wins: Increasing the context length from 2K to 16K gave our tiny VLMs a 60% performance boost!
✨ Smaller is smarter with SigLIP: Surprise! Smaller LLMs didn't benefit from the usual large SigLIP (400M). Instead, we use the 80M base SigLIP that performs equally well at just 20% of the original size!
✨ Pixel shuffling magic: Aggressively pixel shuffling helped our compact VLMs "see" better, achieving the same performance with sequences 16x shorter!
✨ Learned positional tokens FTW: For compact models, learned positional tokens significantly outperform raw text tokens, enhancing efficiency and accuracy.
✨ System prompts and special tokens are key: Introducing system prompts and dedicated media intro/outro tokens significantly boosted our compact VLM’s performance—especially for video tasks.
✨ Less CoT, more efficiency: Turns out, too much Chain-of-Thought (CoT) data actually hurts performance in small models. They dumb
✨ Longer videos, better results: Increasing video length during training enhanced performance on both video and image tasks.

🌟 State-of-the-Art Performance, SmolVLM comes in three powerful yet compact sizes—256M, 500M, and 2.2B parameters—each setting new SOTA benchmarks for their hardware constraints in image and video understanding.
📱 Real-world Efficiency: We've created an app using SmolVLM on an iPhone 15 and got real-time inference directly from its camera!
🌐 Browser-based Inference? Yep! We get lightning-fast inference speeds of 40-80 tokens per second directly in a web browser. No tricks, just compact, efficient models!

If you’re into efficient multimodal models, you’ll love this one.

64k
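The "pixel shuffling" point in the SmolVLM thread above is essentially a space-to-depth rearrangement of visual tokens: every r x r block of patch tokens is folded into a single wider token, so the language model reads r^2 times fewer tokens. The sketch below is a generic PyTorch illustration assuming a square patch grid and r = 4 (which matches the "16x shorter" sequences mentioned in the thread); it is not the SmolVLM implementation itself:

```python
# Generic space-to-depth "pixel shuffle" over visual tokens (not SmolVLM's code).
import torch

def pixel_shuffle_tokens(x: torch.Tensor, r: int = 4) -> torch.Tensor:
    """Fold each r x r block of tokens into one wider token.

    x: (batch, H*W, D) patch tokens from the vision encoder, square grid assumed.
    returns: (batch, H*W // r**2, D * r**2)
    """
    b, n, d = x.shape
    h = w = int(n ** 0.5)                   # assumes a square H x W patch grid
    x = x.view(b, h, w, d)
    x = x.view(b, h // r, r, w // r, r, d)  # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5)         # move the two block dims next to D
    return x.reshape(b, (h // r) * (w // r), d * r * r)

tokens = torch.randn(1, 1024, 768)          # e.g. a 32 x 32 grid of patch tokens
print(pixel_shuffle_tokens(tokens).shape)   # torch.Size([1, 64, 12288]), 16x fewer tokens
```

In practice a projection layer typically maps the widened D * r^2 features back down to the language model's hidden size, so the LLM sees far fewer, but information-richer, visual tokens.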


People with the Innovator archetype

The Innovator

Product designer Framer expert MagicPath expert ➞ framer.link/iziviziframers…

67 following · 329 followers
The Innovator

Helping professionals with breaking the traditional rules for creating a quality lifestyle. How will you find the right fit in life to create balance?

42 following · 36 followers
The Innovator

👨‍💻 Developer by day, AI explorer by night. Building with Data & AI. Sharing cutting-edge AI research, agents & dev tools.

1k following · 4k followers
The Innovator

I have a dream...

7k following · 7k followers
The Innovator

AI enthusiast / tech lover always exploring the future of innovation/ for Promotion 👉teajaay07@gmail.com 📧

1k following · 26k followers
The Innovator

AI & Tech Insights ⚡ | Ghostwriter 📝 | Passionate about AI & Tech Tools | Helping CEOs Build Personal Brands on Twitter 🚀 | DM for Collaborations 📩

120 following · 18k followers
The Innovator

Secure Nodes, Smart Stakes

2k following · 1k followers
The Innovator

Computer vision projects. Python @exactlab CFD in threejs @dualistic_twin captain @docker

1k following · 8k followers
The Innovator

Learning with Machines | Cosmophile

95 following · 174 followers
The Innovator

AR/VR Product Design and Design Engineering @LIV. Experiment with #spatialcomputing. Build creative tools.

1k following · 6k followers
The Innovator

Cofounder @ Machine Phase Systems | Solving humanity’s biggest challenges one atom at a time. x Blockstream, ZeroKnowledgeSystems, InJoy

12k following · 18k followers
The Innovator

AI tinkerer, I build my brains out, lately using AI. I experiment I ship. $30/12k MRR prontoshoot.com nostalgiapicturesai.com

694 following · 2k followers

