Import AI 381: Chips for Peace; Facebook segments the world; and open source decentralized training
How much chain of thought reasoning do humans do themselves?
Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.
Facebook makes it easier to label and categorize the world for AI systems:
…Segment Anything 2 makes it easy to segment objects in images and videos…
Facebook has released SAM 2, a follow-up to its earlier 'Segment Anything' model. SAM 2 is a system that "can segment any object in any video or image—even for objects and visual domains it has not seen previously, enabling a diverse range of use cases without custom adaptation." Segmenting objects is the task of figuring out which things in an image or video are distinct from one another - e.g., correctly labeling a skateboarder versus their background, or distinguishing the skateboard from the human riding on top of it.
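In practice, SAM 2 is promptable - you give it a click, box, or mask and it returns candidate segmentation masks. Here's a minimal sketch of image-mode usage loosely based on the GitHub repo; the module paths, config name, and checkpoint filename are assumptions drawn from the repo's README and may differ from your install.

```python
import numpy as np
import torch
from PIL import Image

# Assumed module paths from the facebookresearch SAM 2 repo - check the
# README for the exact names in your version.
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor(
    build_sam2("sam2_hiera_l.yaml", "./checkpoints/sam2_hiera_large.pt")
)

image = np.array(Image.open("skateboarder.jpg").convert("RGB"))
with torch.inference_mode():
    predictor.set_image(image)
    # A single foreground click (label=1) on the skateboard prompts the
    # model to segment that object; it returns candidate masks with scores.
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[450, 600]]),  # hypothetical pixel location
        point_labels=np.array([1]),
    )
```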
"SAM 2 has many potential real-world applications. For example, the outputs of SAM 2 can be used with a generative video model to create new video effects and unlock new creative applications. SAM 2 could also aid in faster annotation tools for visual data to build better computer vision systems," Facebook writes.
What SAM 2 was built out of: SAM 2 was built via SA-V, a dataset containing 51k distinct videos with 643k spatio-temporal segmentation masks (masklets). "Out of the 643K masklets, 191K were created via SAM 2-assisted manual annotation and 452K were automatically generated by SAM 2 and verified by annotators."
Why this matters - utility systems for a better world: SAM 2 is a generic, utility AI capability that anyone can now access. By making it easy and effective to label and segment the world - even when seen via video - SAM 2 will make it easier to build AI systems that are more context-aware; one use case Meta imagines is smart glasses, but there are many more.
And while things like SAM 2 can potentially be misused, the misuse is much more bounded and controllable than with large-scale foundation models.
Read the blog: Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images (Meta AI Research).
Try a SAM 2 demo online here (Meta).
Get the dataset used to train SAM 2 here (SA-V, Meta).
Get the SAM 2 model here (SAM 2: Segment Anything in Images and Videos, Facebook Research, GitHub).
***
Could "Chips for Peace" reduce race conditions around AI development?
…One way to solve international AI policy…
AI researcher (and, disclosure, former dear colleague of mine) Cullen O'Keefe has tried to figure out how states can coordinate on AI development in a way that reduces race conditions. Their idea is "Chips for Peace", modeled on the "Atoms for Peace" framework pursued in the mid-20th century. The key idea is that states with a leading edge in AI development can use their lead to export a regulatory model - as well as the benefits of the technology - to other states.
Three key ingredients for Chips for Peace:
1) "States would commit to regulating their domestic frontier AI development and deployment to reduce risks to public safety and global security."
2) "States would agree to share the benefits of safe frontier AI systems broadly, especially with states that would not benefit by default."
3) "States would coordinate to ensure that nonmembers cannot undercut the other two commitments."
Key issues with this idea:
"Chips for Peace probably works best if most frontier AI development is done by private actors, and member states can be largely trusted to regulate their domestic sectors rigorously and in good faith."
"Chips for Peace would likely need a sizable budget to function properly, but there is no guarantee that states will be more financially generous in the future."
"I have left open the question of whether membership should be open only to democracies… Chips for Peace would be seriously weakened unless China was admitted."
Why this matters - governance versus payouts: Chips for Peace, like many ideas in policy, relies on restricting and controlling a technology in the name of public safety; in return, the public (and various countries around the world) gets a payout. The key issue here is how powerful people expect AI to be - if you think AI can truly decide the fate of nations (as many do), then it's hard to see a state being comfortable with a world where other states offer to export it some 'safe' AI technology while controlling the means of production for the underlying stuff.
Ideas like Chips for Peace point in the right direction, but until we have a payout mechanic that reckons with the essential nation-state desire for sovereignty, it might be hard to get support for them.
Read more: Chips for Peace: how the U.S. and its allies can lead on safe and beneficial AI (Institute for Law & AI).
***
Making AI policy harder with open source decentralized training code:
…OpenDiLoCo will make it harder for people to figure out where large training runs can come from…
PrimeIntellect, an AI startup providing decentralized training services, has published OpenDiLoCo, an open source implementation of Google's distributed training 'DiLoCo' system (Import AI #349). "We demonstrate its effectiveness by training a model across two continents and three countries, while maintaining 90-95% compute utilization," they write.
What DiLoCo is and what they did: DiLoCo is a way to split up a training job across multiple clusters that can be located at large geographic distances from one another, giving researchers a way to pool the compute of many different systems into one big machine for training a model. Here, the PrimeIntellect researchers make an open source version of the code and also extend it to billion+ parameter-scale training. "The original DiLoCo paper demonstrated the efficacy of the method up to model sizes of 400 million parameters. We expand on this and test the scalability of DiLoCo to larger model sizes by pre-training a 1.1 billion parameter model," they write. "We use four DiLoCo workers, each with eight H100 GPUs, located in Canada, Finland, and two different states within the United States. Figure 8 shows the network bandwidth between the workers, which varies between 127 and 935 Mbit/s. We train our 1.1B parameter model with 500 local steps, as in our scaling experiment. The gradients are all-reduced in FP16."
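To make the mechanics concrete, here's a minimal PyTorch sketch of a DiLoCo-style outer loop as the papers describe it - an illustration under assumptions, not the OpenDiLoCo code itself. `get_local_batch` and the optimizer setup are hypothetical, and the outer SGD-with-Nesterov-momentum choice comes from the original DiLoCo paper.

```python
import torch
import torch.distributed as dist

H = 500  # local steps between synchronizations, matching the quoted experiment

def diloco_round(model, outer_params, inner_opt, outer_opt, get_local_batch, loss_fn):
    """One DiLoCo round: H local steps, then one low-bandwidth sync."""
    # Snapshot the globally synchronized weights at the start of the round.
    start = [p.detach().clone() for p in model.parameters()]

    # Inner loop: ordinary local training, no cross-cluster communication.
    for _ in range(H):
        inputs, targets = get_local_batch()  # hypothetical data helper
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    # Average the "pseudo-gradients" (start weights minus current weights)
    # across workers in FP16 - this is the only inter-cluster traffic.
    world_size = dist.get_world_size()
    for p, s, outer_p in zip(model.parameters(), start, outer_params):
        pseudo_grad = (s - p.detach()).half()
        dist.all_reduce(pseudo_grad, op=dist.ReduceOp.SUM)
        outer_p.grad = pseudo_grad.float() / world_size

    # Outer update (the DiLoCo paper uses SGD with Nesterov momentum), then
    # copy the new global weights back into the local model for the next round.
    outer_opt.step()
    outer_opt.zero_grad()
    with torch.no_grad():
        for p, outer_p in zip(model.parameters(), outer_params):
            p.copy_(outer_p)
```

The appeal is visible in the structure: with 500 local steps per round, the clusters only need to exchange one FP16 tensor per parameter every 500 steps, which is why 127-935 Mbit/s links between continents suffice.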
It mostly works, though with some hiccups: It's not perfect - models trained in this distributed way are a little crappier than ones trained in the more standard, dense form. However, the startup tells me on Twitter that it is currently "scaling decentralized training to 10b model size and beyond", so we may soon get more evidence of the viability of this approach.
Why this matters - transcontinental training collectives break some policies around AI control: Some AI policy is oriented around applying 'know your customer' policies to people who buy more than a certain amount of compute. These policies rest on the notion that customers will be buying big blobs of compute in individual allotments. Techniques like OpenDiLoCo push us towards a world where customers can instead buy a few smaller blobs of compute from different providers and then chain them together, letting them perform training runs that would otherwise be closely monitored.
Read more: OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training (arXiv).
Get the code here: OpenDiLoCo (PrimeIntellect, GitHub).
***
It now costs ~$2000 to approximate the performance of a 2022 model that cost ~$100k+:
…"Micro diffusion" shows how cheap the frontier eventually becomes…
Researchers with Sony AI and the University of California at Riverside have trained a really good and really cheap image generation model, spending ~$1,890 to approximate the performance of models that cost $100k+ to train in 2022.
What they did: "Using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only $1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset," they write. The resulting model compares favorably to popular image generators from a couple of years ago like Stable Diffusion 1.5, though still significantly lags much more expensive contemporary models like Dall-E 3.
"The wall-clock time of our training is only 2.6 days on a single 8×H100 GPU machine, 14× lower than the current state-of-the-art approach that would take 37.6 training days ($28,400 GPU cost)," they write.
The key result - it approximates the performance of Stable-Diffusion-1.5: The best way to understand this work is to compare FID-30K scores (lower is better): the new model gets 12.66, versus 11.18 for Stable-Diffusion-1.5 (released in 2022) and 17.89 for the original Dall-E (released in 2021). By comparison, the modern frontier is defined by larger-scale systems like Dall-E 2 (released late 2022, FID 10.39) and Parti-20B (2022, 7.23). The original Stable Diffusion models cost $100,000s+ to train back in summer 2022, per Stability founder Emad Mostaque.
Additionally, the compute comparisons are favorable - MicroDiT used 6.6 8×A100 GPU-days, versus 781 for Stable Diffusion 1.5.
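The multipliers fall straight out of the numbers quoted above - a quick sanity check:

```python
# Sanity-checking the claimed speedups using the figures quoted above.
micro_days, sota_days = 2.6, 37.6          # wall-clock days on one 8xH100 machine
print(f"wall-clock speedup: {sota_days / micro_days:.1f}x")            # ~14.5x

micro_gpu_days, sd15_gpu_days = 6.6, 781.0  # 8xA100 GPU-days
print(f"compute reduction vs SD-1.5: {sd15_gpu_days / micro_gpu_days:.0f}x")  # ~118x
```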
Why this matters - algorithmic progress + hardware progress + good enough models = massive proliferation: Yes, frontier models still cost order(s) of magnitude more than the prices listed here, but this paper is a demonstration of how, once you know a thing can be done (e.g., good text-to-image diffusion models), it becomes significantly cheaper to train a simple version of the thing. It also illustrates how AI systems can create the fuel to train miniature versions of themselves - some of the training data for this model was synthetic data generated by other models.
Read more: Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget (arXiv).
***
Facebook pushes synthetic data generation further with the "LLM-as-a-Meta-Judge" approach:
…Bootstraps an 8B Llama 3 model to be somewhat competitive with GPT4-Turbo and Claude Opus…
Researchers with Facebook AI Research, the University of California at Berkeley, and New York University have developed a new way to generate synthetic data with language models via a technique called Meta-Rewarding.
The key here is to not only generate synthetic data and have a synthetic judge filter that data, but to "introduce a third role of meta-judge, whose task is to evaluate the model's own judgements. While the judge evaluates the actor's responses, the meta-judge evaluates the judge's judgements (including rewards that it assigns) using a mechanism similar to LLM-as-a-Judge, which we term LLM-as-a-Meta-Judge". Though this sounds horrendously complicated and recursive - it's LLMs all the way down, folks! - the technique seems to work well: "the meta-judge enables us to build training data containing preference pairs of judgements, in addition to the standard preferences between actor responses derived from the standard judge".
How it works: "Our method is an iterative training scheme that starts from a given seed LLM, which assumes all three roles. An iteration starts with the actor generating multiple response variations for each prompt. This is followed by the judge evaluating each response using an LLM-as-a-Judge prompt and generating a judgement that contains a score. This score then allows us to build preference pairs of responses for training the actor. For training the judge, we pick a single response and let the meta-judge compare two of its judgement variations generated by the judge to determine which one is better using an LLM-as-a-Meta-Judge prompt," Facebook writes.
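Here's a minimal sketch of what one such iteration might look like in code. The helpers (`generate`, `judge_prompt`, `meta_judge_prompt`, `parse_score`, `dpo_train`) are hypothetical stand-ins for a real sampling/training stack, and the preference-optimization step is assumed to be DPO-style; all three roles are played by the same underlying model.

```python
def meta_rewarding_iteration(model, prompts, n_responses=4):
    actor_pairs, judge_pairs = [], []
    for prompt in prompts:
        # Actor role: sample several candidate responses per prompt.
        responses = [generate(model, prompt) for _ in range(n_responses)]

        # Judge role: score each response with an LLM-as-a-Judge prompt;
        # the best/worst pair becomes preference data for training the actor.
        scores = [parse_score(generate(model, judge_prompt(prompt, r)))
                  for r in responses]
        ranked = sorted(zip(scores, responses))
        actor_pairs.append((prompt, ranked[-1][1], ranked[0][1]))

        # Meta-judge role: generate two judgement variations for one response,
        # then ask the model which judgement is better - the winning and losing
        # judgements become preference data for training the judge.
        r = responses[0]
        j_a = generate(model, judge_prompt(prompt, r))
        j_b = generate(model, judge_prompt(prompt, r))
        verdict = generate(model, meta_judge_prompt(prompt, r, j_a, j_b))
        # Hypothetical verdict format: the meta-judge names "A" or "B".
        winner, loser = (j_a, j_b) if verdict.strip().startswith("A") else (j_b, j_a)
        judge_pairs.append((judge_prompt(prompt, r), winner, loser))

    # Train on both kinds of preference pairs, then repeat with the new model.
    return dpo_train(model, actor_pairs + judge_pairs)
```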
…And it works! The technique is promising: Facebook takes a basic instruction-finetuned Llama-3-8B-Instruct model, then conducts an iterative training process to try and bootstrap the 8B model into higher quality. In tests on AlpacaEval 2 (an automatic system for evaluating language models), they show significant improvements: the base model goes from a 22.57% win rate against GPT4-Turbo to 39.45%. Similarly, when controlling for length, it goes from a 22.9% win rate against Claude Opus to 39.4%.
So far, the technique only works for four iterations - beyond that it seems to lead to reduced performance - but bear in mind that a year or two ago most synthetic data techniques only worked for one or two iterations before mode collapse, so the number of iterations we can safely run seems to be increasing over time.
Why this matters - synthetic data is real, despite what you've read: Recently, people have become more bearish on synthetic data, mostly based on the idea that after using too much of it you induce some kind of mode collapse and end up in a recursive 'garbage in, garbage out' situation. This is true! But it ignores the fact that there's tons of evidence that using a modest amount of synthetic data is helpful today, and it also skips over the fact that scientists are actively developing techniques that increase the amount of synthetic data you can safely use. Papers like this from Facebook show how it's possible to push that frontier further via clever techniques, like using LLMs to judge the judges of synthetic data pipelines.
Read more: Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (arXiv).
***
Tech Tales:
Path Dependency
I stopped talking to the machine because it kept on telling me all my futures ended in misery.
Do you believe it?
Of course I don't. But it freaks me out that it believes it.
How will you know if it's right?
I guess I'd die? Or get poor?
That's tough.
So, how has it been going?
Alright. My partner and I broke up but that was on the cards for a while.
Did you talk to the system about it?
I did and it referred me to a past prediction where it said this would happen.
How did that make you feel?
I told it part of why we broke up was because I said the machine thought we should and that kicked off this argument which spiraled out of control.
What did it say?
It asked me if based on this experience I would change my actions in line with its recommendations.
What did you say?
I stopped the session and went to the pub.
That looks quite serious.
It looks worse than it is - there isn't a fracture.
Have you been drinking more lately?
Yes.
Why?
Because my life has been shit lately.
I'm sorry.
…
Is there anything you think you could be doing differently?
Yes, but then I wouldn't be me. That thing really got to me. I keep thinking about what it said.
Things that inspired this story: Generative models and theory of mind; inevitability and agency.
Thanks for reading!