Get live statistics and analysis of Alex Dimakis's profile on X / Twitter

Professor, UC berkeley | Founder @bespokelabsai |

2k following21k followers

Archetype analysis

The Analyst

Alex Dimakis is a sharp and detail-oriented academic and AI founder who thrives on dissecting complex AI behaviors with precision and curiosity. As a professor at UC Berkeley and founder of Bespoke Labs AI, Alex blends deep theoretical insights with practical innovation to push the boundaries of understanding in machine learning. His tweets reveal a passion for unraveling the nuances of AI models and a commitment to fostering realistic expectations around their capabilities.

Recent engagement

Impressions

35.6k9.8k

Estimate earning$6.68

Likes

2584

74%

Retweets

27-6

Replies

12-7

Bookmarks

5213

15%

Get more insights about Alex Dimakis with SuperX

Social Circle

Top users who interacted with Alex Dimakis over the last 14 days

Crypto Daddy ֎@cryptodaaddy

Growth Strategist • Web3 Investor | 9-Figure Vision • EX @ezu_xyz

1 interactions

🔥 Roast

Alex probably proofreads his own tweets with a spellchecker set to 'nitpicky professor mode'—so much detail that even robots get overwhelmed and start doubting their own reasoning skills.

⚡️ Nice achievement

Alex has successfully combined an academic career at a top institution with founding an AI startup, while maintaining influential thought leadership through detailed, high-engagement tweets dissecting the nuances of modern AI systems.

🌟 Life's purpose

To deepen the collective understanding of AI’s capabilities and limitations, bridging theoretical research with real-world applications, while educating and challenging the AI community to think critically about model reasoning and performance.

💬 Values and Beliefs

Alex values scientific rigor, transparency, and intellectual honesty. He believes in thorough empirical analysis, embracing nuance over hype, and the importance of advancing AI responsibly through careful scrutiny. He is skeptical of oversimplifications and champions a nuanced, data-driven approach to AI research.

💪 Strength

Alex’s greatest strength lies in his exceptional analytical mind and ability to communicate complex AI research insights clearly. His academic background combined with entrepreneurial experience allows him to critically assess AI models while influencing the field with fresh, practical ideas.

🫣 Weakness

Tending toward deep technical dives and critical scrutiny, Alex sometimes risks coming across as overly cautious or skeptical, potentially limiting his appeal to audiences craving more optimistic or simplified AI narratives.

⚡️ Growth audience tips

On X, Alex should leverage his expertise by sharing thread-style deep dives that break down complex AI topics with accessible analogies, paired with engaging visuals or simplified summaries. Regular interactive Q&A sessions could boost engagement and attract followers interested in thoughtful AI discourse.

💁 Bonus

Alex often highlights surprising weaknesses in state-of-the-art models, such as GPT-4’s struggles with basic multiplication and the counterintuitive observation that wrong reasoning model answers tend to be longer than correct ones.

Alex Dimakis@AlexGDimakis · Sep 05, 2024

GPT is having a profound effect on how students write. Its verbose style, full of cliches and 'fancy', out of place vocabulary is in every paper and draft I read. A few years back, there were grammar errors and awkwardness -- but at least people had their own voice. Now, scholarship is getting full of robotic triviality.

929k

Alex Dimakis@AlexGDimakis · Aug 16, 2023

I was surprised by a talk Yejin Choi (an NLP expert) gave yesterday in Berkeley, on some surprising weaknesses of GPT4: As many humans know, 237*757=179,409 but GPT4 said 179,289. For the easy problem of multiplying two 3 digit numbers, they measured GPT4 accuracy being only 59% accuracy on 3 digit number multiplication. Only 4% on 4 digit number multiplication and zero on 5x5. Adding scratchpad helped GPT4 but only to 92% accuracy on multiplying two 3 digit numbers. Even more surprisingly, finetuning GPT3 on 1.8m examples of 3 digit multiplication still only gives 55 percent test accuracy (in distribution). ¯\_(⊙︿⊙)_/¯ So whats going on? Multiplication is algorithmically very challenging (as are less known algorithmic problems). The authors hypothesize that Transformers have a hard time because they learn linear patterns that they can memorize, maybe compose, but not generally reason with. The paper raises interesting theoretical and practical questions on understanding what Transformers can learn. The paper "Faith and Fate: Limits of Transformers on Compositionality" says: "Our empirical findings suggest that Transformers solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem solving skills"

Alex Dimakis@AlexGDimakis · Jun 19, 2024

This paper seems very interesting: say you train an LLM to play chess using only transcripts of games of players up to 1000 elo. Is it possible that the model plays better than 1000 elo? (i.e. "transcends" the training data performance?). It seems you get something from nothing, and some information theory arguments that this should be impossible were discussed in conversations I had in the past. But this paper shows this can happen: training on 1000 elo game transcripts and getting an LLM that plays at 1500! Further the authors connect to a clean theoretical framework for why: it's ensembling weak learners, where you get "something from nothing" by averaging the independent mistakes of multiple models. The paper argued that you need enough data diversity and careful temperature sampling for the transcendence to occur. I had been thinking along the same lines but didn't think of using chess as a clean measurable way to scientifically measure this. Fantastic work that I'll read I'll more depth.

387k

Alex Dimakis@AlexGDimakis · Jan 31, 2025

Discovered a very interesting thing about DeepSeek-R1 and all reasoning models: The wrong answers are much longer while the correct answers are much shorter. Even on the same question, when we re-run the model, it sometimes produces a short (usually correct) answer or a wrong verbose one. Based on this, I'd like to propose a simple idea called Laconic decoding: Run the model 5 times (in parallel) and pick the answer with the smallest number of tokens. Our preliminary results show that this decoding gives +6-7% on AIME24 with only a few parallel runs. I think this is better (and faster) than consensus decoding.

218k

Alex Dimakis@AlexGDimakis · Jan 21, 2025

Most AI researchers I talk to have been a bit shocked by DeepSeek-R1 and its performance. My preliminary understanding nuggets: 1. Simple post-training recipe called GRPO: Start with a good model and reward for correctness and style outcomes. No PRM, no MCTS no fancy reward models. Basically checks if the answer is correct. 😅 2. Small models can reason very very well with correct distillation post-training. They released a 1.5B model (!) that is better than Claude and Llama 405B in AIME24. Also, their distilled 7B model seems better than o1 preview. 🤓 3. The datasets used are not released, if I understand correctly. 🫤 4. DeepSeek seems to be the best at executing Open AI's original mission right now. We need to catch up.

181k

Alex Dimakis@AlexGDimakis · May 10

"RL with only one training example" and "Test-Time RL" are two recent papers that I found fascinating. In the "One Training example" paper the authors find one question and ask the model to solve it again and again. Every time, the model tries 8 times (the Group in GRPO), and a gradient step is performed, to increase the reward which is a very simple verification of the correct answers, repeated thousands of times on the same problem. The shocking finding is that the model does not overfit to this one question: RL on one example, makes the model better in MATH500 and other benchmarks. (If instead you did SFT repeating one training question-solution finetuning, the model would quickly memorize this answer and overfit). But with RL, the model has to solve the problem itself, since it only sees the question, not the answer. Every time it produces different answers, and this seems to prevent overfitting. The other papers are relying on the same phenomenon: you can have a small number of training questions and re-solve them thousands of times. You can do this for the test set (as test-time RL does) and still not overfit. We also independently saw this by doing RL training on half the test set and seeing benefits in the other half for BFCL agents. My thought now is that this shows our RL learning algorithm must be extremely inefficient. When a human is learning by solving a math puzzle, they immediately learn what they can learn by solving it once (or twice). No further benefit would come by assigning the same homework problem to students a tenth time. But in RL, we keep asking the model to re-solve the same question thousands of times, and the model slowly gets better. We should be able to have much better RL learning algorithms since the information is there. (1/2)

349k

Alex Dimakis@AlexGDimakis · Dec 17, 2024

Life update: I am excited to announce that I will be starting as a Professor in UC Berkeley in the EECS Department. I spend 12 wonderful years teaching in UT Austin and I am grateful to all my colleagues and students there and extremely proud of what we have achieved in AI in UT Austin, and I plan to continue my numerous UT close collaborations. I will also continue as Chief Scientist in Bespoke Labs, making it much easier now being in the Bay area. I received my Phd in 2008 from @Berkeley_EECS and I am thrilled to be back. I am grateful for this new opportunity.

108k

Alex Dimakis@AlexGDimakis · Jan 22, 2025

DeepSeek-R1 is amazing but they did not release their reasoning dataset. We release a high-quality open reasoning dataset building on the Berkeley NovaSky Sky-T1 pipeline and R1. Using this, we post-train a 32B model Bespoke-Stratos-32B that shows o1-Preview reasoning performance. Surprisingly, we get good performance with only 17k questions-answers while DeepSeek distillation used 800k, i.e. 47x more data. We open-source everything for the community to experiment with.

111k

Alex Dimakis@AlexGDimakis · Oct 09, 2024

For the first (and probably last) time in my life I understand the technical details of both the physics and chemistry Nobel prizes.

33k

Alex Dimakis@AlexGDimakis · Nov 23, 2023

youtube.com/watch?v=zjkBMF… Probably the best 1h introduction to LLMs that I've seen. And after 20mins its not an introduction, its getting into cutting edge research updates updated up to this month. I had not heard of the data exfiltration by prompt injection or the recent finetuning Poisoning attacks.

73k

Alex Dimakis@AlexGDimakis · Mar 15, 2022

I was informed that Alexander Vardy, a giant in coding theory passed away. A tragic loss for his family, UCSD and academia. Alex's many discoveries include the Polar decoding algorithm used in the 5G wireless standard, (1/3)

Alex Dimakis@AlexGDimakis · May 07, 2023

New neural renderer by Nvidia. The model adds fingerprints, smudges and dust and generates renders indistinguishable from real to me. Oh, and its done at *real-time!*. Can't wait to see games using this. (1/2)

29k

Alex Dimakis@AlexGDimakis · Sep 11

What are RL environments? Are they just evals? There is significant confusion in the community, so here is my opinion: My answer is inspired by Terminal-bench, an elegant framework for creating RL environments, evaluating agents and even training agents. First, an RL environment is simply a Docker container. It contains three things: 1. A snapshot of the state of the world when a problem happened. 2. A task description and 3. A reward that verifies if the agent has solved the task. Can be using LLM as a judge or run tests. For example, lets take the 'broken-python' environment in Terminal bench. The Dockerfile setups the container: Installs Python and intentionally breaks it by removing critical files E.g. install python and RUN rm -rf /usr/local/lib/python3.13/site-packages/pip The task is "There's something wrong with my python- I can't install packages with pip." The verifier tests if pip works by trying to install a test package. Now the agent can try anything it wants to fix pip, by writing bash commands, or using any tools available. (1/2)

25k

Alex Dimakis@AlexGDimakis · Jul 29

I am excited to announce that our AI institute (Institute for Foundations of Machine Learning, IFML) has been renewed. IFML was part of the first cohort of AI Institutes announced in 2020. Led by UT Austin, the new award will build on the trajectory of the past five years and develop new foundational tools to advance generative AI. NSF IFML's work on diffusion models is a key technology behind major Google products, powering widely used generative models such as Stable Diffusion 3 and Flux. In it's next phase, NSF IFML will expand generative AI to new domains, including protein engineering, clinical imaging, new methods to handle noisy data, improve agent reliability and open source AI. (1/n)

26k

Alex Dimakis@AlexGDimakis · Jun 05

I'm excited to announce what we have been working on for months. Announcing OpenThinker3, the strongest 7B reasoning model with open data. Also more than 1000 experiments on what works and what doesn't for post-training data curation.

17k

Alex Dimakis@AlexGDimakis · Nov 07, 2024

Ok this paper seems super interesting and also makes me want to teach graphical models again. The question is, when does chain of thought help, and the answer proposed is “ finding that intermediate steps are only helpful when the training data is locally structured with respect to dependencies between variables.” So it depends on the training data and they test that by training on different types of synthetic datasets. Also has theory and seems to do the entire formulation using Bayes nets which is very cool, and I’ll try to understand this more. Any insights welcome.

36k

Most engaged tweets of Alex Dimakis

Alex Dimakis@AlexGDimakis · Sep 05, 2024

929k

Alex Dimakis@AlexGDimakis · Aug 16, 2023

Alex Dimakis@AlexGDimakis · Jan 31, 2025

218k

Alex Dimakis@AlexGDimakis · Jun 19, 2024

387k

Alex Dimakis@AlexGDimakis · Dec 17, 2024

108k

Alex Dimakis@AlexGDimakis · May 10

349k

Alex Dimakis@AlexGDimakis · Jul 29

26k

Alex Dimakis@AlexGDimakis · Jan 21, 2025

181k

Alex Dimakis@AlexGDimakis · Jan 22, 2025

111k

Alex Dimakis@AlexGDimakis · Jan 24, 2025

We are trying to check for contamination in math and reasoning datasets. I have a question: Let's say the training dataset has the question: "How many ways are there to put 5 balls in 3 boxes" and the test set has: "How many ways are there to put 6 balls in 2 boxes" Is this contamination in your opinion?

12k

Alex Dimakis@AlexGDimakis · Oct 09, 2024

For the first (and probably last) time in my life I understand the technical details of both the physics and chemistry Nobel prizes.

33k

Alex Dimakis@AlexGDimakis · Mar 15, 2022

Alex Dimakis@AlexGDimakis · Nov 18, 2024

github.com/mlfoundations/… I’m excited to introduce Evalchemy 🧪, a unified platform for evaluating LLMs. If you want to evaluate an LLM, you may want to run popular benchmarks on your model, like MTBench, WildBench, RepoBench, IFEval, AlpacaEval etc as well as standard pre-training metrics like MMLU. This requires you to download and install more than 10 repos, each with different dependencies and issues. This is, as you might expect, an actual nightmare. (1/n)

146k

Alex Dimakis@AlexGDimakis · Jul 16

Interesting post. However, it seems to be in conflict with the most central problem in theoretical computer science: P vs NP ,which is exactly the question: is it fundamentally easier to verify a solution rather than solve a problem. Most people believe that verification is easier than solution, ie we believe that P!=NP. But the post claims that ‘All tasks that are possible to solve and easy to verify will be solved by AI.’ As a counter-example I would propose colouring a graph with 3 colors (color vertices so that all adjacent vertices have different colors) assuming the input graph is 3 colorable. Very easy to verify, satisfies all requirements of the post, but RL won’t solve this problem in polynomial time. (Any NP complete problem will work obviously just giving an easy example ).

14k

Alex Dimakis@AlexGDimakis · Aug 09, 2024

Excited to launch the first model from our startup: Bespoke Labs. Bespoke-Minicheck-7B is a grounded factuality checker: super lightweight and fast. Outperforms all big foundation models including Claude 3.5 Sonnet, Mistral-Large m2 and GPT 4o and its only 7B. Also, I want to congratulate Greg Durrett and his group for making the best benchmark and leaderboard for grounded factuality.

18k

Alex Dimakis@AlexGDimakis · Aug 15

For anyone thinking that LORA alignment has any safety guarantees: If we are given a few different LORA finetunings of a model, we can reconstruct exactly the original weights of the pre-trained model. (i.e. linear algebra question: given a few low-rank perturbations of an unknown matrix, we can reconstruct the original matrix. ) I would think that given multiple dense SFT finetunings, the original weights should be recoverable even without LORA. Low-rank matrix completion experts here is some some fresh butter for your LLM bread. @PNetrapalli @jainprateek_ @sujaysanghavi

20k