Import AI 326: Chinese AI regulations; Stability's new LMs
If AI is fashionable in 2023, then what will be fashionable in 2024?
What would you do at the end of time?
Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.
Want better AI policy? Figure out how to measure what you care about:
…Tim O'Reilly lists some simple ways for better AI governance…
If we want to govern AI systems, we need to be able to measure and assess their properties, says Tim O'Reilly. "Alignment will be impossible without robust institutions for disclosure and auditing," he writes. "If we want prosocial outcomes, we need to design and report on the metrics that explicitly aim for those outcomes and measure the extent to which they have been achieved".
Measurement rules everything around me: O'Reilly's basic idea is that AI regulation comes from measuring AI systems for positives and negatives and then designing regulatory frameworks around that. The best way to start here is for regulators to draw on what AI companies themselves do.
"Regulators should start by formalizing and requiring detailed disclosure about the measurement and control methods already used by those developing and operating advanced AI systems," he writes. "Regulations should first focus on disclosure of current monitoring and best practices. In that way, companies, regulators, and guardians of the public interest can learn together how these systems work, how best they can be managed, and what the systemic risks really might be."
One thing measurement doesn't help with: There is one area of AI policy where measurement isn't necessarily going to be that helpful: "with small LLMs that can be run on a laptop, there is a risk of an irreversible and uncontrollable proliferation of technologies that are still poorly understood," he writes.
Why this matters: You can't manage what you can't measure: The more AI policy runs on rhetorical soundbites and the less it runs on quantitative methods, the harder it's going to be to get down to brass tacks about what behaviors are good, what behaviors are bad, and what behaviors people should pay attention to. Proposals like O'Reilly's are eminently sensible - but of course I'd say this, as I've proposed similar ideas myself!
Read more: You Can’t Regulate What You Don’t Understand (O'Reilly).
####################################################
China publishes some very detailed generative AI regulations:
…Broad regulations see China try to exert control over generative ideological engines…
Chinese policymakers have published draft generative AI regulations which would target services and products offered in China. Stanford's DigiChina project has published an analysis of the regulations as well as a full translation of them. The takeaway from the regulations is that the Chinese government wants to exercise a lot more control over what AI-imbued services are allowed in its country, and it also wants to place a lot more responsibility and liability onto the providers of the underlying generative AI models.
What the regulations mean: It's worth reading them in full, but here are some highlights (translated via Stanford's 'DigiChina' project):
"Content generated through the use of generative AI shall reflect the Socialist Core Values"
"Respect intellectual property rights and commercial ethics"
"Organizations or individuals that use generative AI to provide services such as chat, text, image, or audio generation … including providing programmable interfaces … bear responsibility as the producer of the content generated by the product."
"Before using generative AI products to provide services to the public, a security assessment must be submitted to the state cyberspace and information department"
"When providing generative AI services, users shall be required to provide real identity information"
"When generated content that does not conform to the requirements of these Measures is discovered during operations or reported by users … repeat generation is to be prevented through such methods as optimization training within three months."
AI companies are political parties: One interpretation of this rulemaking is a recognition by the Chinese government that AI models - and therefore the companies that make them - are political forces which produce political artifacts; here, AI systems which magnify specific ideologies.
""Suddenly, instead of trying to control searches on websites and monitor forbidden terms in emails, the system will have to deal with individual users being able to ask questions to a generative AI application without any ability to monitor and block the output for sensitivity and offending word," writes Paul Triolo, Senior Associate, Trustee Chair in Chinese Business and Economics, Center for Strategic and International Studies, in DigiChina. ""Beijing and the CAC are in the initial stages of coming up with a regulatory regime that pushes companies toward political alignment as they develop their models. This is new territory for regulatory bodies like CAC, and for the entire Internet censorship apparatus that China has developed over the past three decades."
The 'Tiananmen problem' - one thought about AI safety and authoritarianism: I think it's probably just as hard to get models to not help you make an explosive as it is to get models to not display knowledge of Tiananmen Square in 1989. I think this illustrates how radically different ideological frames may end up having a strange area of agreement when it comes to investing in technologies relating to safety and alignment.
Read more: Translation: Measures for the Management of Generative Artificial Intelligence Services (Draft for Comment) – April 2023 (DigiChina).
Read more: How will China’s Generative AI Regulations Shape the Future? A DigiChina Forum (DigiChina).
####################################################
Stability tries to catch lightning in a bottle twice with release of 'StableLM' LLMs:
…Open source models++…
Stability AI, the company which released the open source 'Stable Diffusion' model into the world, has released a pair of language models, called StableLM, at 3bn and 7bn parameters. Stability plans to soon release 15bn and 65bn parameter models as well. "Developers can freely inspect, use, and adapt our StableLM base models for commercial or research purposes, subject to the terms of the CC BY-SA-4.0 license."
What's special about StableLM? This year, tons of open source language models have been released, ranging from Dolly-2 and Cerebras-GPT to Eleuther's Pythia models, Facebook's lab leak 'LLaMa' model, and more. StableLM differs from these by virtue of being trained on a new dataset which, at 1.5 trillion tokens of content, is even larger than the 1.2 trillion token dataset (RedPajama) written about elsewhere in this issue.
"We will release details on the dataset in due course. The richness of this dataset gives StableLM surprisingly high performance in conversational and coding tasks, despite its small size of 3 to 7 billion parameters," Stability writes.
Stability has also released some models finetuned for instruction following - "these fine-tuned models are intended for research use only and are released under a noncommercial CC BY-NC-SA 4.0 license," the company wrote.
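If you want to try the base models yourself, here's a minimal loading sketch using the Hugging Face transformers library. The repository identifier below is my assumption based on Stability's naming for the 7B alpha release - check the StableLM GitHub repo or Hugging Face pages for the exact model IDs.

```python
# Minimal sketch: load a StableLM base model and sample from it.
# The repo id is an assumption based on Stability's release naming;
# verify it against the official StableLM GitHub / Hugging Face pages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-base-alpha-7b"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Open source language models are"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```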
Why this matters: Stability believes that open source is the safest and best way to deploy AI in a large-scale manner, while many other organizations (e.g. OpenAI) skew more towards proprietary control. Both groups hold their beliefs due to a combination of idiosyncratic philosophies around the safety impacts of different types of release, as well as by virtue of their distinct business models. In the coming years we'll get to see which approach is more correct.
Read more: Stability AI Launches the First of its StableLM Suite of Language Models (stability.ai blog).
Get the StableLM models here (Stability GitHub).
Chat with a 7B StableLM model here (StableLM-Tuned-Alpha-7b Chat, Hugging Face).
####################################################
Better language models via retrieval:
…Retrieval might just be a generically good idea…
Researchers with NVIDIA, the University of Illinois Urbana-Champaign, and Arizona State University have trained and released some language models using a technique called 'retrieval', based on DeepMind's RETRO paper. The idea of retrieval is that you give your language model a module that looks up relevant text from a large external database during training - the idea seems effective, so in this research the scientists try to answer the question "Shall we pretrain autoregressive (decoder-only) LMs with retrieval by default or not?"
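To make the mechanism concrete, here's a toy sketch of the retrieval idea. RETRO integrates retrieval into pretraining via chunked cross-attention over neighbors; the sketch below just shows the simpler inference-time flavor of the same idea - fetch nearest-neighbor passages for a query and condition generation on them. The datastore, queries, and helper names are illustrative, not from the paper.

```python
# Toy sketch of the retrieval idea: fetch nearest-neighbor passages from an
# external datastore and condition the language model on them. RETRO does this
# during pretraining with chunked cross-attention; this is the simpler
# inference-time version, purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

datastore = [
    "RETRO retrieves neighbors from a trillion-token database during training.",
    "Retrieval-augmented models look up external text instead of memorizing it.",
    "Autoregressive language models predict the next token in a sequence.",
]

vectorizer = TfidfVectorizer().fit(datastore)
index = NearestNeighbors(n_neighbors=2, metric="cosine").fit(
    vectorizer.transform(datastore)
)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages closest to the query."""
    _, idx = index.kneighbors(vectorizer.transform([query]), n_neighbors=k)
    return [datastore[i] for i in idx[0]]

query = "How do retrieval-augmented language models work?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
# `prompt` would then be fed to any language model's generate() call.
print(prompt)
```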
What they did: In tests, their models (called RETRO), "outperforms GPT on text generation with much less degeneration (i.e., repetition), moderately higher factual accuracy, and slightly lower toxicity with a nontoxic retrieval database," they write. "Our findings demonstrate that RETRO can leverage retrieved neighbors and significantly improves accuracy for knowledge intensive tasks in zero-shot evaluations."
They test out their approach on models which range from 148M up to 9.5B parameters in size.
How well does it work? "Shall we pretrain decoder-only LMs with retrieval? We observe consistent improvements in text generation quality, factual accuracy, lower toxicity, and downstream task accuracy, especially for knowledge-intensive tasks, including open-domain QA," they write. "Given the ∼ 25% percentage of additional GPU hours for pretraining, we argue pre-training generative language models with retrieval is a promising direction."
Why this matters - retrieval might just be a robustly good idea: Papers like this show that techniques like retrieval might be sufficiently good that it's worth just broadly integrating them into most language models.
Read more: Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study (arXiv).
More about RETRO: Improving language models by retrieving from trillions of tokens (DeepMind blog).
####################################################
Together.xyz releases a vast dataset for training huge language models:
…Distributed AI research startup releases the ingredients to replicate a large LLaMa…
Together.xyz, an AI startup pushing decentralized training and an open AI ecosystem, has published RedPajama. RedPajama is "an effort to produce a reproducible, fully-open, leading language model. RedPajama is a collaboration between Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute."
As a first step, Together has released a vast dataset to help people train large language models. "We aim to create a fully open-source reproduction of LLaMA, which would be available for commercial applications, and provide a more transparent pipeline for research," the company says.
The dataset: The full dataset, RedPajama-Data-1T, is 1.2 trillion tokens, totalling ~5TB unzipped on disk and ~3TB to download compressed. The dataset consists of seven large-scale data slices (a minimal loading sketch follows the list). These are:
CommonCrawl: Five dumps of CommonCrawl, filtered for quality.
C4: The standard C4 dataset.
GitHub: GitHub data, filtered by licenses and quality.
arXiv: Scientific articles with boilerplate removed.
Books: A corpus of open books.
Wikipedia: Subset of Wikipedia pages with boilerplate removed.
StackExchange: Popular websites under StackExchange, with boilerplate removed.
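If you want to poke at the data without downloading all of it, here's a minimal streaming sketch using the Hugging Face datasets library. The repository ID, the 'arxiv' config name, and the 'text' field are assumptions based on Together's release and the slice names above - check the RedPajama page for the exact identifiers.

```python
# Minimal sketch: stream one slice of RedPajama-Data-1T rather than
# downloading the full ~3TB compressed dump.
# Repo id, config name, and field name are assumptions - verify against
# the official RedPajama release.
from datasets import load_dataset

ds = load_dataset(
    "togethercomputer/RedPajama-Data-1T",  # assumed repo id
    "arxiv",                               # assumed name for one of the seven slices
    split="train",
    streaming=True,                        # iterate without a full download
)

for i, example in enumerate(ds):
    print(example["text"][:200])  # field name assumed; inspect example.keys()
    if i == 2:
        break
```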
Why this matters: The biggest AI policy debate of the 2020s relates to centralization versus decentralization - will AI models be controlled by a tiny set of actors or will they be broadly developed and distributed by a collective? Companies like Stability.ai (of Stable Diffusion fame) and Together.xyz are betting on the latter.
Read more: RedPajama, a project to create leading open-source models, starts by reproducing LLaMA training dataset of over 1.2 trillion tokens (Together.xyz).
####################################################
Synthetic + Real images = more performance than training on reality alone:
…Google paper shows tantalizing hints of being able to speed up another part of AI research…
Researchers with Google have shown that they can augment a dataset (ImageNet) with AI-generated images, and then get greater performance from models trained on the augmented dataset. This means that by combining synthetic imagery with real imagery you can train models with greater performance than if they were trained on reality alone. This has big implications - it suggests that synthetically generated data may not only be a substitute for real data but may (sometimes) let you get better results than with real data alone.
"Augmenting the ImageNet training set with samples from the resulting models yields significant improvements in ImageNet classification accuracy over strong ResNet and Vision Transformer baselines," they write. "We show that performance of models trained on generative data further improves by combining synthetic data with real data, with larger amounts of synthetic data, and with longer training times. These results hold across a host of convolutional and Transformer-based architectures."
What they did: They mix Imagen-generated images into the larger ImageNet dataset and the result is a model with better performance and more accurate labels (e.g. some of the original ImageNet dataset is mislabeled, so the generated images offset this a bit). "Our results indicate that the fine-tuned generative diffusion model outperforms the previous methods by a substantial margin," they say. "As one might expect, models trained solely on generated samples perform worse than models trained on real data. Nevertheless, augmenting real data with synthetic images from the diffusion model yields a substantial boost in performance across all classifiers tested."
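The general recipe is simple to sketch: generate class-conditional images with a diffusion model, drop them into the same class-folder layout as the real data, and train a classifier on the union. The sketch below is my illustration of that recipe (not the paper's code); the paths and the synthetic-image folder are hypothetical placeholders.

```python
# Minimal sketch of the general recipe (not the paper's code): augment a real
# image dataset with a folder of synthetic, model-generated images and train a
# classifier on the combined set. Paths are hypothetical placeholders.
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

real = datasets.ImageFolder("data/imagenet/train", transform=transform)        # real images
synthetic = datasets.ImageFolder("data/generated", transform=transform)        # synthetic images, same class folders

combined = ConcatDataset([real, synthetic])
loader = DataLoader(combined, batch_size=256, shuffle=True, num_workers=8)

model = models.resnet50(num_classes=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

for images, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
```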
Why this matters - the 'AI production inputs' keep getting cheaper: For a long time, people said AI had three main ingredients - compute, algorithms, and data. Well, in recent years, compute has got ever cheaper (thanks, Moore's Law), and algorithms have become somewhat cheaper (most people use transformer-architecture models for an increasingly wide range of tasks), but the costs of data have seemed quite stable - you need to create or scrape it from some part of the world.
Papers like this suggest that the cost of data as an input might fall as a consequence of being able to 'mix in' synthetic data via increasingly capable models. All of this adds up to further speedups in AI development as a consequence of the reduction of the costs of basic inputs into AI research.
Read more: Synthetic Data from Diffusion Models Improves ImageNet Classification (arXiv).
####################################################
Tech tales
Unregistered Computer
We had a big Unregistered Computer built out of a bunch of pre-Tracking Accords hardware. We used it to make money off of porn and illegal-ideology models and weapons systems and the other things that the ruling class sought to control or stamp out.
We had to bring data in via disc or USB and getting it out was even more complicated - we had to launder the data through a few different mediums before we let it touch the internet, so that it'd be hard for anyone to isolate the trail and find our computer.
We made a lot of jokes about getting found out by the Compute Police and going to jail. One year, we made some money by making T-shirts that said 'Don't Tread On Me' and had a picture of a GPU on them. Then we made mugs that said 'Out of My Cold Dead Hands' with two hands clutching the circle-and-line cat's cradle symbol of a neural net.
As the years went on, we found ourselves dealing more and more with criminals and less and less with hobbyists. Things got scarier and the software we were asked to run felt stranger to allow. We started doing lots of disinformation operations for third parties who probably represented nation states, or intelligence agency cut-outs.
One time, someone asked us to run some very particular scientific questions about some very particular chemicals - we could never work out if this was for drugs or poison or explosives, and we were too scared to check.
Another time, we trained some model and whenever we ran inferences off of it to test it during training we found it did strange things to us - after looking at the outputs, people reported confusing left and right, or finding it difficult to spell words that previously had been easy to spell.
The problem was that as time went on the Unregistered Computer became so valuable that the criminals started 'protecting' it - which meant they both protected us and watched us. So here we are, working like cooks in a meth lab for some drug dealer, watching over servers and hot-swapping hard drives, maintaining a baroque machine so that it can produce things banned by polite society.
Things that inspired this story: Thinking about what happens if AI policy ends up leading to compute controls; the logic of the criminal underground; libertarian AI; data centers; distributed training over heterogeneous computing nodes.