Galactica — Stochastic Parrots in action?

A demo of Galactica (named after Isaac Asimov’s Encyclopedia Galactica), a large language model (LLM) trained on 48 million scientific articles, was made available last week by Facebook’s parent company Meta. Two days later, Meta pulled the demo amid debate in the AI community over the model’s propensity to produce inaccurate or misleading content.

In this article, we’ll examine the drama that has been unfolding and try to determine whether Galactica can actually live up to our expectations or whether it’s just a stochastic parrot that generates random nonsense while posing as a knowledge portal.

Where can I get some of these Stochastic Parrot cuties?

Unfortunately, I have to disappoint you here… you can’t. Stochastic parrots is a term introduced by the authors of the paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? The general idea is that large language models can be compared to parrots: they mostly repeat statements seen in the training data (Parrot), while slightly tweaking the prediction (Stochastic). They do this better for concepts that are frequently and well described, and they struggle with concepts that are underrepresented in the corpus.
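To make the metaphor concrete, here is a toy sketch, not a real LLM and nothing like Galactica’s actual architecture: a tiny bigram sampler that can only recombine fragments of its miniature “training corpus”, with a random choice at each step playing the role of the stochastic tweak.

```python
# A toy illustration of the "stochastic parrot" idea (not a real LLM):
# learn which word tends to follow which from a tiny corpus, then generate
# by repeatedly sampling a likely next word. The output can only recombine
# what the "training data" already contained.
import random
from collections import defaultdict

corpus = (
    "the model repeats the data . "
    "the model tweaks the prediction . "
    "the parrot repeats the phrase ."
).split()

# Count which word follows which (a bigram "language model").
followers = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev].append(nxt)

random.seed(0)
word, output = "the", ["the"]
for _ in range(8):
    word = random.choice(followers[word])  # the "stochastic" part
    output.append(word)
print(" ".join(output))
```

Frequent patterns dominate the generated text, while anything absent from the corpus simply cannot be produced.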

The article describes several issues and limitations of large language models. The most important ones for this story are connected to the importance of data:

  1. The size of the corpus doesn’t guarantee diversity; e.g., the data for training GPT-2 was in large part scraped from links shared on Reddit, a platform used mostly by young men (18–29 years old).
  2. Large language models encode bias: they exhibit the same biases as their training corpus. For example, BERT associates phrases referencing people with disabilities with more negative sentiment words, and, in general, terms such as gun violence, homelessness, and drug addiction are overrepresented in corpora scraped from the Internet.

The Galactica researchers try to tackle the stochastic parrot problem. What is the effect?

Robustness to toxic content. Yay or nay?

As we already discussed, the quality of large language models like GPT-3 or BLOOM is determined by the training data, which is frequently scraped from the Internet (Wikipedia, Reddit, etc.). Unfortunately, this can introduce incorrect or toxic information to the corpus.

Scientific texts, such as academic papers, on the other hand, are generally more resistant to these data flaws. Most of the time, they are analytical texts with a neutral tone that contain evidence-based knowledge and are written by people who want to inform rather than incite. Moreover, many scientific papers are peer-reviewed, which makes it less likely that the corpus will contain misleading information. However, this does not guarantee that scientific papers are free of bias.

When we examine the corpus that Galactica was trained on, we can see that 48 million scientific papers provided 88 billion of the 106 billion tokens used in the training process (roughly 83%). Code, knowledge bases, and prompts are among the additional data sources.

Repositories such as arXiv, PMC, and Semantic Scholar were the largest sources of scientific papers.

The authors compared the Galactica model’s toxicity and bias with those of other major language models. Although the results are encouraging, there is still much work to be done in terms of absolute performance.

To be fair, nowhere in the publication do the researchers claim that their model cannot be prompted into producing toxic content.

Issues with Galactica

Why, then, did the model’s release cause such a stir in the AI research community? For reference, I cite some of the critical reactions to the model:

Facebook (sorry: Meta) AI: Check out our “AI” that lets you access all of humanity’s knowledge. Also Facebook AI: Be careful though, it just makes shit up. — Emily M. Bender, Professor of Linguistics

I asked #Galactica about some things I know about and I’m troubled. In all cases, it was wrong or biased but sounded right and authoritative. I think it’s dangerous. — Michael Black, Director at the Max Planck Institute for Intelligent Systems

The criticism from the community was so strong that after two days the public demo was removed from Galactica’s website.

In my opinion, there were two main factors that divided researchers:

1. It was fairly easy to prompt the model to output nonsense.

This was the direct cause of the criticism. A model that can invent a mathematical theorem or cite papers that do not exist is dangerous for users who put too much trust in its outputs. Let’s look at an example:

Have you ever heard of Brandolini’s law? Me neither, but Wikipedia states that it is the bullshit asymmetry principle, i.e.:

The amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it.

This quote becomes a little ironic when we look at how Galactica describes Brandolini’s law.

The output appears to be logical and instructive: Galactica attributes the law to a certain Gianni Brandolini and claims that Paul Romer suggested it is a myth. The only issue is that it is… WRONG. If we Google Gianni Brandolini, we find that no such person exists, and as for Paul Romer… (I guess you get it by now) he never suggested that Brandolini’s law is a myth.

If you don’t know that all of these statements are false, it is actually not that simple to refute it. You would have to directly Google Brandolini’s law to check for the correctness of the output. Predictive search engine with an extra step of using a real search engine to confirm the output — that does not sound like an improvement of research work.

If the model outputs nonsense in a case that is fairly obvious to a human, how can we trust it when the output concerns a difficult concept? That is a challenging question. On one hand, we know that there may not be many publications about Brandolini’s law, and large language models do not perform well on poorly described concepts, so the model could do better on concepts that are well covered in the training corpus, such as standard mathematics. On the other hand, we have no way to determine how confident the model is in its predictions, so to verify that the results are accurate, we would still be forced to use a regular search engine. This example diminished my confidence in, and enthusiasm for, using Galactica in its present form.
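Although the hosted demo is gone, the model weights were released separately, so this kind of check can be reproduced locally. Below is a minimal sketch, assuming the facebook/galactica-125m checkpoint (the smallest variant) is still available on the Hugging Face Hub and the transformers library is installed; it is an illustration, not the official demo setup.

```python
# A minimal sketch (not the official demo): prompt a released Galactica
# checkpoint locally and inspect what it writes about Brandolini's law.
# Assumes the facebook/galactica-125m weights are available on the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/galactica-125m")

prompt = "Brandolini's law states that"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding keeps the output deterministic, which makes it easier
# to compare the model's claims against a real search engine afterwards.
outputs = model.generate(input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))
```

Whatever the model returns, the names and citations in it still need to be verified by hand, which is exactly the problem described above.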

2. Overstatements by Galactica’s developers

Another issue that caused backlash in the AI community was the set of marketing overstatements present on Galactica’s page.

The main objection was the claim that the model has the capability to reason from scientific papers.

Galactica cannot infer meaning from text, as we can see from the previous example about Brandolini’s law. It can only output the most likely sequence of words for a given prompt or topic. Unfortunately, calling that reasoning is simply false, even though it may resemble reasoning in circumstances that are well described in the corpus.
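Here is a minimal sketch of what “most likely words” means mechanically, using GPT-2 from the transformers library as a small stand-in for any causal language model (Galactica generates text the same way): the model assigns a probability to every possible next token, and generation is just repeatedly picking from the top of that list.

```python
# A minimal sketch of what a causal LM does under the hood: score every
# possible next token by likelihood. GPT-2 stands in for Galactica here.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Brandolini's law was proposed by", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only

probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    # The "reasoning" is nothing more than ranking this list.
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.3f}")
```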

Conclusion

Even though I have been critical of the Galactica authors, I find their research very intriguing. I like the strategy of curating the corpus to reduce the amount of misinformation and hate, as it brings encouraging results. I still believe that researchers could use the model for less complex and less contentious tasks, such as writing scientific code (in general, things that can be easily tested).
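As a sketch of what “easily tested” could mean in practice, consider treating any code the model produces as untrusted and running a quick check before relying on it. The moving_average snippet below is a hypothetical model output, not something Galactica actually generated.

```python
# A minimal sketch of the "easily tested" idea: treat generated code as
# untrusted and verify it before use. The snippet below is a hypothetical
# model output, not an actual Galactica generation.
model_suggestion = """
def moving_average(xs, window):
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window + 1)]
"""

namespace = {}
exec(model_suggestion, namespace)  # load the generated function
moving_average = namespace["moving_average"]

# Unlike a fabricated citation, generated code can be checked directly.
assert moving_average([1, 2, 3, 4], window=2) == [1.5, 2.5, 3.5]
print("generated code passed the check")
```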

The lesson we should learn from this experience, in my opinion, is that researchers should exercise caution before making their findings publicly available. We should refrain from releasing solutions that haven’t been thoroughly tested, and we definitely shouldn’t exaggerate the capabilities of our models.
