16: Miscellaneous

Evidence of a predictive coding hierarchy in the human brain listening to speech (Mar 2023)

(link)

The paper compares brain signals recorded with fMRI to the activations of artificial neurons in an LLM. It finds that:

  • Predictive coding theory offers an explanation for why LLMs fall short on certain problems: while language models are optimized to predict nearby words, the human brain continuously predicts a hierarchy of representations that spans multiple timescales.

  • From the paper: “First, we confirmed that the activations of modern language models linearly map onto the brain responses to speech. Second, we showed that enhancing these algorithms with predictions that span multiple timescales improves this brain mapping. Finally, we showed that these predictions are organized hierarchically: frontoparietal cortices predict higher-level, longer-range and more contextual representations than temporal cortices.”
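
To make the “predictions that span multiple timescales” idea concrete, here is a minimal sketch, not the paper’s code: each word’s language-model activation is concatenated with the activations of words several positions ahead, so a linear model can test whether such longer-range “forecast” features improve the mapping to brain signals. The distances and array shapes are assumptions chosen for illustration.

```python
import numpy as np

def add_forecast_features(activations: np.ndarray, distances=(1, 2, 4, 8)) -> np.ndarray:
    """Concatenate each word's activation with the activations of words d steps ahead.

    activations: (n_words, dim) per-word LM activations (hypothetical input).
    Returns an array of shape (n_words, dim * (1 + len(distances))), zero-padded
    where the story ends.
    """
    n_words, dim = activations.shape
    blocks = [activations]
    for d in distances:
        future = np.zeros_like(activations)
        future[: n_words - d] = activations[d:]  # representation of the word d positions ahead
        blocks.append(future)
    return np.concatenate(blocks, axis=1)

# Example: 1,000 words of 768-dimensional GPT-2 activations -> 768 * 5 = 3,840 features.
features = add_forecast_features(np.random.randn(1000, 768))
print(features.shape)  # (1000, 3840)
```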

Methodology:

  • Have roughly 300 participants listen to short stories while their brain activity is recorded with fMRI.

  • “We then fitted, for each voxel and each individual independently, a linear ridge regression to predict the fMRI signals from the activations of several deep language models.”

  • “We then computed the corresponding ‘brain scores’ using held-out data, that is, the voxel-wise correlation between the fMRI signals and the predictions of the ridge regression input with the activations of a given language model.” (sketched in code after this list)

  • Interesting: the analyses focus on the activations of the eighth layer of Generative Pre-trained Transformer 2 (GPT-2), a 12-layer causal deep neural network, because that layer best predicts brain activity.
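
A minimal sketch of that pipeline, using random placeholder arrays instead of real GPT-2 activations and fMRI recordings: fit a ridge regression from activations to voxel signals, then compute the voxel-wise correlation (the “brain score”) on held-out data. The shapes, the RidgeCV alphas, and the train/test split below are illustrative choices, not the paper’s exact setup.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 768))    # stand-in for GPT-2 layer-8 activations per fMRI volume
Y = rng.standard_normal((500, 2000))   # stand-in for the BOLD signal of 2,000 voxels

# Keep the held-out block contiguous in time (shuffle=False) to mimic held-out stories.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, shuffle=False)

# One multi-output ridge regression predicts every voxel from the activations.
model = RidgeCV(alphas=np.logspace(-1, 4, 6)).fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Brain score: Pearson correlation between predicted and measured signal, per voxel.
Y_test_c = Y_test - Y_test.mean(axis=0)
Y_pred_c = Y_pred - Y_pred.mean(axis=0)
brain_scores = (Y_test_c * Y_pred_c).sum(axis=0) / np.sqrt(
    (Y_test_c**2).sum(axis=0) * (Y_pred_c**2).sum(axis=0)
)
print(brain_scores.shape, float(brain_scores.mean()))
```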


Low-resource Languages

Training models requires datasets, and for many languages those digital datasets are far sparser than for others; such languages are called low-resource languages. This paper (2020) counts labeled data per language in the LDC Catalog and ELRA Map and unlabeled data on Wikipedia, then clusters languages by data availability:

This is how the clusters emerge, with some example languages for each cluster. For example, unlabeled data availability for languages in cluster 5 is two orders of magnitude larger than for languages in cluster 3.
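
As a rough illustration of how such a clustering could be computed (the counts below are invented for the example, not the paper’s figures), one can place each language on a log scale of labeled and unlabeled data and run a standard clustering algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (labeled examples, unlabeled tokens) per language -- not real figures.
availability = {
    "English":  (1e6, 1e10),
    "German":   (5e5, 1e9),
    "Estonian": (1e4, 1e8),
    "Yoruba":   (1e3, 1e6),
    "Wolof":    (1e2, 1e5),
}

# Log scale, because availability spans many orders of magnitude.
X = np.log10(np.array(list(availability.values())))
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for (language, _), cluster in zip(availability.items(), clusters):
    print(f"{language}: cluster {cluster}")
```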

Recall that training an LLM from scratch requires on the order of 10^12 tokens, whereas instruction-tuning only requires on the order of 10^5 tokens, roughly seven orders of magnitude less data.

This paper (2021) looks at the task-specific data available per language. Even tiny Estonian has more task data available than any of the widely spoken African languages: