An international team of around 1,000 volunteers, mostly academics, has tried to break Big Tech’s stranglehold on natural language processing and reduce its harm. Trained with $7 million in publicly funded computing time, the BLOOM language model will compete in scale with those created by firms Google and OpenAI, but it will be open source. BLOOM will also be the first model of its scale to be multilingual.
The collaboration, called BigScience, released an early version of the model on June 17 and hopes it will ultimately help reduce the harmful outputs of artificial-intelligence (AI) language systems. Big tech companies increasingly use models that recognize and generate language in applications from chatbots to translators, and they can sound so eerily human that a Google engineer claimed this month that the company’s AI model was sentient (Google strongly denies the claim). But such models also suffer from serious practical and ethical flaws, such as parroting human biases. These are difficult to address because the inner workings of most such models are closed to researchers.
As well as being a tool to explore AI, BLOOM will be open for a variety of research uses, such as extracting information from historical texts and making classifications in biology. “We believe model access is an essential step in doing responsible machine learning,” says Thomas Wolf, co-founder of Hugging Face, a company that hosts an open source platform for AI models and datasets, and has helped spearhead the initiative.
“This technology has been around for a long time in the open-source world, and this is a pretty interesting way for it to happen,” says Connor Leahy, co-founder of EleutherAI, which is creating its own large open-source language model in English and was not involved in the project.
Large language models are algorithms that learn statistical associations between billions of words and phrases to perform tasks such as generating summaries, translating, answering questions, and classifying text. Built with brain-inspired architectures known as neural networks, the models are trained by blanking out words in text, comparing their predictions with reality, and adjusting internal values called parameters. BLOOM has 176 billion parameters, on a par with GPT-3, one of the best-known models of its kind, which was created by the firm OpenAI and licensed to Microsoft.
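BLOOM itself is a transformer neural network with billions of parameters, but the core idea of learning statistical associations between words and using them to predict what comes next can be illustrated with a deliberately tiny, count-based sketch. Everything below (the corpus, function names) is hypothetical, for illustration only:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word follows another. In this toy model,
    the co-occurrence counts play the role of the 'parameters' that a
    real language model learns by comparing predictions with reality."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Predict the statistically most likely next word, if any."""
    followers = counts.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

corpus = [
    "the model answers questions",
    "the model generates summaries",
    "the model answers queries",
]
model = train_bigram(corpus)
print(predict_next(model, "model"))  # "answers" follows "model" most often
```

A real model such as BLOOM replaces the raw counts with 176 billion learned parameters and conditions on long contexts rather than a single preceding word, but the training signal is the same: predict the missing word, then adjust.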
Although such models are sometimes impressive (they generate poetry or correctly answer trivia questions), they have no sense of the meaning of language, which means they also produce gibberish. More worryingly, they can promote abuse or self-harm, and echo existing racist or sexist associations woven through the human-written text from which they learn, such as linking ‘Islam’ to terrorism. The models typically cost millions of dollars to train and have a huge carbon footprint (BigScience eventually plans to reveal the project’s carbon emissions).
While most natural language models are built by small in-house teams, BLOOM was the work of hundreds of researchers, mostly academics, including ethicists, legal scholars, and philosophers, as well as some Facebook and Google employees working in a personal capacity. To train BLOOM, BigScience was granted free access to France’s Jean Zay supercomputer facility outside Paris. The model is currently in the final weeks of its three-month training period.
hand-selected text
Models are only as good as the data sets they are built on, so an important task was selecting which texts the model should learn from, says Yacine Jernite, a machine-learning researcher at Hugging Face. Most major models scrape language directly from the web, including sites such as Reddit. Instead, BigScience researchers hand-selected nearly two-thirds of their 341-billion-word data set from 500 sources. Among them was Semantic Scholar, an AI-powered search engine for scholarly literature that also includes content such as Nature news articles. The sources were suggested during a series of workshops, including with community groups such as the Masakhane African natural-language-processing community, LatinX in AI, and Machine Learning Tokyo. “We wanted to make sure that people close to the data, their country, the language they speak, were involved in choosing the language that would be included in the model training,” says Jernite.
To make the most of the available computing power, the team rounded out the trove of data with multilingual web-crawled text, filtered for quality and with some redaction for privacy. The collaboration also attempted to reduce the usual overrepresentation of porn sites (which can lead to sexist associations in the model), but without excluding keywords in a way that would remove content associated with frank discussion of sexuality in often under-represented communities.
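BigScience's actual filtering pipeline is not described in detail here, but the two steps mentioned (quality filtering and privacy redaction) can be sketched minimally. The thresholds, regular expression, and sample document below are all hypothetical stand-ins, not the project's real rules:

```python
import re

# Hypothetical thresholds -- the real BigScience pipeline is far more elaborate.
MIN_WORDS = 10          # drop very short fragments
MAX_SYMBOL_RATIO = 0.3  # drop boilerplate dominated by punctuation or markup

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def passes_quality(text):
    """Keep only documents long enough and mostly made of words."""
    words = text.split()
    if len(words) < MIN_WORDS:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / max(len(text), 1) <= MAX_SYMBOL_RATIO

def redact(text):
    """Replace e-mail addresses with a placeholder for privacy."""
    return EMAIL_RE.sub("[EMAIL]", text)

doc = ("Contact alice@example.com for the full Renaissance letters corpus, "
       "digitised last year.")
if passes_quality(doc):
    print(redact(doc))
```

The design tension the article describes shows up even at this scale: a keyword blocklist is simple, but rules that are too blunt excise legitimate content along with the unwanted kind, which is why the team avoided crude keyword exclusion.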
Jernite acknowledges that BLOOM will not be free of bias. But by training it on high-quality, multicultural sources, the team hopes to improve on existing models. Crucially, because the code and data set behind the model are open, researchers can try to understand the roots of harmful behaviours, which could improve future iterations, says Wolf.
Model evaluation will also differ from the usual benchmarks, says Ellie Pavlick, a natural-language-processing researcher at Brown University in Providence, Rhode Island. As well as comparing BLOOM with other models on its ability to, say, answer questions, the researchers want to look at more diverse metrics, such as how strongly it makes certain stereotyped associations, or how much its abilities are skewed towards a specific language. Pavlick hopes that because the model has been trained to be multilingual, it might develop a deeper understanding of language, which could help it generalize to a variety of tasks.
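One way to quantify "how strongly a model makes stereotyped associations" is to probe its generations for co-occurrences between group words and attribute words. The sketch below is a hypothetical, much-simplified probe; the sample strings stand in for real model outputs, and real bias benchmarks are considerably more careful:

```python
def association_score(samples, group_word, attribute_words):
    """Fraction of generated samples mentioning `group_word` that also
    mention one of `attribute_words` -- a crude stereotype probe."""
    hits = relevant = 0
    for text in samples:
        words = set(text.lower().split())
        if group_word in words:
            relevant += 1
            if words & set(attribute_words):
                hits += 1
    return hits / relevant if relevant else 0.0

# Hypothetical model outputs, stand-ins for real generations.
samples = [
    "the nurse said she was tired",
    "the nurse said she was ready",
    "the doctor said he was ready",
    "the doctor said he was tired",
]
print(association_score(samples, "nurse", {"she"}))   # 1.0
print(association_score(samples, "doctor", {"she"}))  # 0.0
```

A large gap between the two scores, as in this contrived example, would indicate that the generations associate a profession with one gender; a model with weaker stereotyped associations would produce scores closer together.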
Leahy predicts that the model might perform slightly worse than other large models in English, given its smaller dataset in the language, but that should be balanced against noticeably better performance elsewhere.
free to use
The fully trained BLOOM model will be available to download for researchers who want to experiment with it or train it on new data for specific applications. But downloading and running it requires significant hardware capacity. Because such capacity is available to relatively few research teams, BigScience will also release smaller, less hardware-intensive versions, as well as create a distributed system that lets labs share the model across their servers. In addition, Hugging Face will launch a web app that will allow anyone to query BLOOM without downloading it. A similar app for the early-release version will be available later this week.
BLOOM could find uses in research outside of AI. Francesco de Toni, a linguist at the University of Western Australia in Perth, co-leads a BigScience working group that is looking at using models to extract information from collections of historical texts that are too large to read by hand. Models can, for example, extract all the names or goods mentioned in a collection of Renaissance merchant letters, information that would be impossible to find using a search engine.
BLOOM comes with documentation that describes its capabilities and limitations. Using it also requires agreeing to an evolving legal licence that commits researchers not to use the model for malicious or inappropriate purposes, such as generating fake news. The collaboration will monitor how the model is applied and adjust the licence and documentation as needed, says Giada Pistilli, an ethicist at Hugging Face and a philosopher at Sorbonne University in Paris, who co-chaired BigScience’s ethics and legal working group. “It’s very difficult to imagine and predict all the uses,” she says.