If anyone's interested, I made Colab notebooks with free GPUs for both GRPO (the algo DeepSeek used) to train a reasoning model from scratch, and also general finetuning, which the Berkeley team employed!
GRPO notebook for Llama 3.1 8B: https://colab.research.google.com/github/unslothai/notebooks...
General finetuning notebook: https://colab.research.google.com/github/unslothai/notebooks...
The Berkeley team's 17K dataset: https://huggingface.co/datasets/NovaSky-AI/Sky-T1_data_17k Hugging Face also released a 220K dataset: https://huggingface.co/datasets/open-r1/OpenR1-Math-220k
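If you just want to peek at that data first, here's a rough sketch (separate from the notebooks above) using the Hugging Face datasets library; it prints the column names rather than assuming a particular schema:

    # Minimal sketch: inspect the Sky-T1 17K dataset before fine-tuning.
    from datasets import load_dataset

    ds = load_dataset("NovaSky-AI/Sky-T1_data_17k", split="train")
    print(ds.num_rows)       # number of training examples
    print(ds.column_names)   # check the actual fields before mapping them
    print(ds[0])             # peek at one QwQ-generated reasoning example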
How long does this take on a free tier T4? This is really neat, I’d assumed this type of “playing with the guts” work was more difficult to access as a normie programmer. Looks like something I’d like to try!
Weird that they had to resort to clickbait by using "O1 preview" in their name.
I expected some sort of way to actually get o1-preview retrained (and downloadable).
Also, calling it O1 preview based on just 7 benchmarks is not correct. What if someone comes up with use cases where o1-preview does better than this?
Apart from that, it's good that things are becoming cheaper.
It’s dishonest because they not only point towards a specific language model, but the beta version of a specific model. WTH?
You should always assume headlines are hyperbolic, and 'verb your own noun for cheap' headlines are always offering a way to make your own version of $expensive_thing for hobby prices, not to provide a copy of $expensive_thing.
If you see a headline saying 'make your own James Webb Space Telescope in a weekend', they're offering a project that leverages some tech concept from the JWST, like mirror arrays or a particular sort of sensor. They're not promising that you will be able to build a space-capable telescope the size of a semi truck.
It's not dishonest, it's simple human behavior.
The vocabulary used to describe the culturally prevailing leader will be used to explain similar concepts and create analogies. That's an easier tool to communicate to the masses than crafting super tailored messages for only domain experts.
It's why we keep doing this, and it's also why trademarks become generics.
"Google it", "Uber for X", "band aid", "the band sounds like Y", "the actor looks like Z", etc. etc.
This is a core part of how human language works and how we as a species communicate with one another.
"Build your own Lamborghini Huracan at home for $450"
"Wow! Quite a feat to deliver an iconic design, a 631 horsepower engine, and performance of 0-150 mph in 15.4 seconds on such a small budget!"
"Actually what we mean is, like the Lamborghini Huracan, our vehicle has two seats."
When I see a thing like that I assume they're abstracting one or two things that are unique to (or at least strongly associated with) the desired object. idk, perhaps 'significantly increase the output power of your little hobby engine with this one weird trick' where said trick turns out to be cylinder firing order and a custom made drive shaft.
$450 for a Lamborghini clone is a lot more impressive when it compares favorably on (some) benchmarks.
Also, at $450 no one expects it to truly be a from-scratch recreation of a model that cost hundreds of millions to produce.
Instead, they built a model (via fine-tuning) using similar techniques and got similar results within the area of experimentation they created their training data for.
I personally was not misled by the title at all.
Nothing OpenAI has produced is a Lamborghini Huracan level above other generic AI models, though.
There are open source models better than OpenAI's image and video models, and OpenAI is not winning the LLM space by any measure.
The hobbyist absolutely won't feel as though they're trying to fake a Huracan with a Camry here. They're going to build useful products with whatever they choose, regardless of what vendor or open source project produced the model.
Your analogy is silly. OpenAI is more like Band-Aid® than Lamborghini Huracan.
ChatGPT is the market leader; nobody except enthusiasts distinguishes between their models, or any models. And the enthusiasts know the difference.
Verdict: dishonest
Yeah, I agree. The "O1 preview" naming feels a bit misleading. It sets an expectation of broader coverage than just those specific benchmarks. It's cool to see cost reductions, but the marketing could be more transparent about the scope.
I do love competition.
In the last few weeks we are seeing a torrent of advances, just because someone opened their architectures.
Imagine where we could go if the training datasets were also publicly available and unbounded by any copyright laws. (I'm not talking about doing anything illegal).
I can only dream, I guess.
A torrent of advances is the right way to word it, especially after it has been discovered what Meta trained their models on :)
Those training datasets can never be free, as almost all of their content is copyrighted.
Japan has said AI can train on copyrighted materials.
https://www.privacyworld.blog/2024/03/japans-new-draft-guide...
I imagine if copyright is a big issue for AI, Japanese startups will have an advantage.
Does China need to say anything or can you guess their policy?
Perhaps copyright needs to be updated. In any case, my personal belief is that training on publicly released data, as well as purchased media, is fair use.
If anything it needs to be updated to actually prevent the rampant profit extraction from human creation in order to protect actual creators.
Not OP, but that should be part of the update, I think.
I think we can all agree there does need to be an update. You don't want to forever outlaw deep learning (even if you do want to, that's not going to happen so it's worth helping to shape the future)
It's very complicated with a bunch of moving parts but I really want society to start arguing about it so we can get to a semi-fair place
Each time someone clicks "send" on chatGPT, Warner Bros gets 1c
$25 to Elsevier per GPU purchase
I don't think you will ever see any law to benefit the creators. Better to eliminate it and at least give artists the freedom to work with any media they want. Artists will generally still be poor, but they'll be more creative.
Creativity and productivity are two completely different things.
I don't see how any of these authors loses money when you use chatgpt, even in theory.
You weren't going to buy a book instead of asking a question.
The claim that authors lose money from ChatGPT's use of their works in training is the same idea as the claim that piracy costs music labels money.
I'll be honest, even if this comment won't fly: it is impossible to change the views here on this point. Specifically, here.
I do share your opinion. Others may argue "What about x country? They don't care!", even though that position is about as good as making anything excusable because someone else did it.
I might add, I'm really not trying to be toxic. Just saying this based on what I see when this comes up.
Yeah, that's a good idea. Stop the most important advance in storing, retrieving, and disseminating knowledge since the printing press because muh copyright!!1!!
Never mind that you've just handed control of an incredibly-powerful tool over to nations that DGAF about copyright law.
If copyright interests want to fight AI, then copyright has to go. It's that simple. It's an unnecessary fight, but somebody needs to convince them of that.
Why should it be? I'd personally be pissed if my book, which came from my own hard work and is sold per person, all of a sudden got subsumed by a general AI. Even worse if it is commercialized and I get nothing for it.
What if a classroom of students learnt from your book and ended up with high-paying jobs, innovation, or production, none of which makes any profit for you as the author of said book (except for the copy sold to each student)?
The UK government is doing that at the behest of the AI companies, which tends to indicate they have been misbehaving up to now.
Share the non-copyrighted ones and it's still a win if you make it possible for people to contribute, through PRs, testing, and discussion.
almost all free things are copyrighted
It seems like the torrent was already happening and DeepSeek's part is just one example of that. They did help bring attention to those advancements, and that's led to lots more people contributing and finding more niche applications.
Isn't the general attitude these days to just break laws and bribe officials once you own the hottest startup? /s
edit: re. the /s I was living offshore and running the most popular bitcoin casino at the time, spending a vast amount of money and energy to block any player who might be American. As a result I didn't make that much money. And I tried to calculate how much I would need to make if I wanted to break the law and hide out forever. I figured I could make $10-15M a year but that wouldn't be enough to hide. I fucked up, I guess. Because the richest man in the world made most of his first round of money facilitating gambling transactions, and he's now got his snout in every federal agency. I should have had the balls, I guess, to ask forgiveness rather than permission.
This was always like this. YouTube started out by publishing mostly copyrighted content, then Google settled with the copyright owners. Google, by the way, has perfected the "art" of training its algorithms on content without approval from copyright owners.
The blog post was a little unclear, so my summary was:
- They used QwQ to generate training data (with some cleanup using GPT-4o-mini)
- The training data was then used to FT Qwen2.5-32B-Instruct (non-reasoning model)
- Result was that Sky-T1 performs slightly worse than QwQ but much better than Qwen2.5 on reasoning tasks
There are a few dismissive comments here but I actually think this is pretty interesting as it shows how you can FT a foundation model to do better at reasoning.
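To make the middle step concrete, here's a rough sketch of what that SFT stage could look like with trl's SFTTrainer - this is not the Berkeley code, the dataset likely needs mapping into whatever text/chat format the trainer expects, and the hyperparameters are placeholders:

    # Rough sketch of SFT-ing Qwen2.5-32B-Instruct on the QwQ-generated traces.
    # Exact argument names vary across trl versions; treat values as placeholders.
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    train_ds = load_dataset("NovaSky-AI/Sky-T1_data_17k", split="train")
    # train_ds = train_ds.map(...)  # reshape records into the format SFTTrainer expects

    trainer = SFTTrainer(
        model="Qwen/Qwen2.5-32B-Instruct",   # the non-reasoning base model
        train_dataset=train_ds,
        args=SFTConfig(
            output_dir="sky-t1-sft",
            num_train_epochs=3,
            per_device_train_batch_size=1,
            gradient_accumulation_steps=16,
            bf16=True,
        ),
    )
    trainer.train()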
I wish they had compared to the R1 distills of Qwen2.5.
Inference-time compute is still very underutilized in actual AI deployments. Lots of folks are working on foundation models, which require reasoning about broad problem domains. Not enough people are using the same techniques for task-specific performance improvements. You can easily distill the reasoning from larger models like R1 for your task. Often better, you can mix in custom thinking instructions for specific sub-problems so a fine-tuned model learns a mix of task-specific reasoning and custom logic. It's not hard and easily beats prompt iteration. When you find bugs, you can fix them.
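As a sketch of what collecting those task-specific traces can look like (the endpoint, model name, and helper below are hypothetical placeholders, not any particular vendor's API):

    # Sketch: collect reasoning traces from a larger "teacher" model through an
    # OpenAI-compatible endpoint, mixing in custom thinking instructions per example.
    import json
    from openai import OpenAI

    client = OpenAI(base_url="https://your-inference-host/v1", api_key="...")

    def make_trace(question: str, thinking_hint: str) -> dict:
        resp = client.chat.completions.create(
            model="teacher-reasoning-model",  # e.g. an R1-class model you have access to
            messages=[
                {"role": "system", "content": f"Think step by step. {thinking_hint}"},
                {"role": "user", "content": question},
            ],
        )
        return {"question": question, "response": resp.choices[0].message.content}

    with open("distill_train.jsonl", "w") as f:
        for q, hint in [("Example question ...", "For sub-problem A, reason about edge cases first.")]:
            f.write(json.dumps(make_trace(q, hint)) + "\n")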
I made a GitHub project for distilling thinking models (and custom CoT inference-time fine-tuning): https://docs.getkiln.ai/docs/guide-train-a-reasoning-model
Thanks for linking to this. That’s a good resource!
Do you have any pointers on assembling fine-tuning data not for isolated tasks, but for a flexible range of queries in a particular problem domain? Similar to general purpose instruction-tuning, but much more focused.
For example, suppose you’re building an app that helps doctors search through research literature to aid in diagnosis, check hypotheses, etc. Of course you would want to have some domain experts and real users available to see what kind of queries they would create. But getting from that point to a well-balanced dataset that adequately represents the distribution of possible queries, instructions, writing/cognitive styles, formatting, dialog flows, etc. your app will encounter - it just seems kind of hard to know how to approach a task like that. It seems there are infinitely many dimensions you could accidentally overfit on.
General advice? Collect data, train a model, note the mistakes in the model, mistakes in the data, and think critically about what it is that you're ending up teaching. Repeat many, many, many times. For some tasks, don't be surprised if it ends up taking months or a year or several. It took me 6 months of building a dataset, by hand, by myself, to produce ~1600 'gold standard' text examples (bolstered by ~100K synthetic examples) - texts plus 20 dimensions rated 1-4. But I managed to beat SOTA models in this task from all the frontier labs by doing so. It also makes sense to consider all of the various "lacks" of the competing models.
It's quite difficult to foresee all the future decisions you will make due to future insights about future versions of the whole loop. But you will need to make some.
I will say one more concrete thing though: the more metadata you collect, generally, the better, but this can make it more expensive.
Also, if you ever need to update your schema... well, this is actually one reason why text data for LLMs is nice: your schema is essentially fluid in the first place, so you could e.g. stick metadata in the text itself if at some future point you start collecting it.
I guess, also, it's a good thing to constantly add new benchmarks, if possible. Treat your model's capabilities as knowable, but never treat your model's capabilities as actually known.
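For what it's worth, one record in that kind of dataset can be as simple as a JSON line like the following; every field name here is invented for illustration:

    # Hypothetical shape of one hand-labeled "gold standard" record:
    # free text, per-dimension ratings (1-4), plus whatever metadata is cheap to
    # collect now and hard to reconstruct later.
    import json

    record = {
        "text": "...",
        "ratings": {"clarity": 3, "factuality": 4, "reasoning_depth": 2},  # ~20 dims in practice
        "metadata": {"source": "annotator_07", "annotated_at": "2025-02-01"},
    }

    with open("gold.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")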
Thanks for the input. It sounds like the task is about as daunting as it seems, then, but doable. Are there any resources (such as papers) you’ve found especially helpful?
To answer my own question in case anyone else has it: The Tülu 3 paper is really illuminating:
> Language model post-training is applied to refine behaviors and unlock new skills across a wide range of language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce Tülu 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. Tülu 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With Tülu 3, we build a multi-task evaluation scheme for post-training with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. The Tülu 3 release includes model weights, a demo, and the complete recipe — datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the Tülu 3 approach to more domains.
https://arxiv.org/pdf/2411.15124
So this is a fine-tune and not from scratch, which makes the proposition much more reasonable.
That said, for someone who's not in the game but been curious as to the details of fine-tuning, it's great to get both the dataset and the code.
Better URL: https://novasky-ai.github.io/posts/sky-t1/
True. The previous discussion on this is here: https://news.ycombinator.com/item?id=42681417
They trained on QwQ traces and in their evaluation they are… mostly slightly worse than QwQ.
Hardly a huge win.
> The model training finishes in 19 hours on 8 H100 with DeepSpeed Zero-3 offload (~ $450 according to Lambda Cloud pricing).
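The arithmetic roughly checks out if you assume on-demand H100 pricing of about $3 per GPU-hour (the exact Lambda rate may differ):

    # Back-of-the-envelope cost check, assuming ~$3 per H100 per hour on demand.
    gpus, hours, usd_per_gpu_hour = 8, 19, 3.0
    print(gpus * hours * usd_per_gpu_hour)  # 456.0, in line with the ~$450 figure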
Has anyone tested whether the consensus of the top 4-5 mini models together would outperform the best frontier model?
It's not from scratch, though, right? Am I missing something here as to why it's at the top of the posts?
There’s no real reason to start from true scratch anymore. You don’t harvest wheat, mill flour, milk a cow, and churn butter for your cake.
Looks like they need to put quotes around the $450.
Wait, so Qwen trained QwQ-32B from Qwen 32B, and then they distill QwQ back into Qwen 32B? What's the point?
This is a massive marketing scam. Borderline academic dishonesty.
I wouldn't go that far, but I agree, my reaction to reading the details was to go "huh?"
From the title, my best guess was they applied some kind of RL/GRPO to an existing model.
But... they took an existing model that had already undergone SFT for reasoning... and then used it to generate data to SFT the exact same model... nothing wrong with that, but it doesn't seem to warrant the title they chose.
So you are better off just using QwQ
Not sure if it's a scam; honestly it depends on the data, and sometimes it might work.
The goal of distillation is to distill into smaller models like 7B or 1.5B.
They didn't even change the model size, let alone try a different class of models.
Getting an expert model's trajectories is trivial if you have vLLM to do batched inference.
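For reference, a minimal sketch of batch-generating those trajectories with vLLM (model name and sampling settings are illustrative):

    # Minimal vLLM sketch for batch-generating expert trajectories.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/QwQ-32B-Preview")        # the "expert" teacher model
    params = SamplingParams(temperature=0.7, max_tokens=4096)

    prompts = ["Solve: ...", "Prove: ..."]          # your task prompts
    outputs = llm.generate(prompts, params)
    traces = [o.outputs[0].text for o in outputs]   # reasoning trajectories to keep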
Just several weeks ago, OpenAI was still using reasoning as part of its tech moat to partially justify its hugely inflated valuation. Just weeks after the release of DeepSeek and Kimi and their papers on how to do it, average Joes can now train it at home for less than the purchase cost of a single mid-range gaming GPU.
I just want to make music with AI and it is very difficult. The Meta model on Hugging Face gives an error when used through the website, and no one will ever fix it.
It depends on how much you want it to do for you. I've used ChatGPT to come up with song briefs which I then turn into music myself.
Suno?
I find I can only give them one sentence to describe the music I want, which is not good enough - has this changed at all?
You can describe or upload the first N seconds, then extend from that by using another description, then extend from N further seconds etc. But Suno music within a genre has a pretty limited range.
It's still only 240 characters or whatever, but it pays to be dense. So rather than "Write a song that sounds like polka etc etc" just keyword pack it.
Yeah. If you want to play AI researcher, by all means go play around with Hugging Face and build a local AI GPU rig. If you want to make some music, just use Suno.