April 8, 2021

Bridging human and machine language processing

TL;DR: Human language processing signals such as eye-tracking and brain activity data can be leveraged to improve and evaluate machine learning models of natural language understanding.

This blog post is a high-level summary of my PhD thesis, which is available here. More information about current research projects can be found on my website.

Some of the work described in this blog post has been conducted in collaboration with Maria Barrett, Nicolas Langer, Lisa Beinborn, and Lena Jäger.

🧠 Let's consider human language processing first. Humans do many tasks unconsciously and without visible effort. Someone speaks, the sound enters our ears, and we understand immediately. We see a long string of characters, and we can almost immediately distinguish whether it is nonsense or a meaningful sentence; and once we’ve read it, we grasp not only the meaning but also the syntactic structure of this sentence. These cognitive processes are complex computational problems, and our brains contain dedicated information processing machinery to solve them. There are methods to collect these processing signals directly from the brain.

🖥️ On the other side of the abyss, computational language processing is waiting. Machine learning (ML) has brought immense improvements to the field of natural language processing. Computer programs are now much faster and better at understanding text and speech. However, current state-of-the-art ML algorithms for language understanding still lack certain skills that humans perform automatically and effortlessly. These skills include learning from limited data and generalization, multimodal learning, learning complex tasks and meta-learning.

🧠↔️🖥️ So, how can we bring these two fields of language processing closer together? Our goal is to find a systematic way in which both fields can benefit from each other. Can we capture this mental representation of language and use it to improve our machine learning systems? Can humans help to train machine learning systems without knowing it? And can a better comprehension of our computational algorithms further our understanding of the human brain?

To build this bridge, we leverage cognitive signals recorded from humans processing language with techniques such as eye tracking and brain activity measurements. The challenge in working with this type of data is its noisiness. We have to deal with this noise and decode the human signals so that we can then efficiently use them to improve machine learning methods for NLP. This is the empirical question that we are trying to answer.

Overview of research questions discussed in this post.

Research Questions

In recent years, there has been an increasing interest in this interdisciplinary research space at the intersection of cognitive science, natural language processing (NLP), and machine learning (ML). In this blog post, we discuss three research questions we have worked on for the past few years and the insights we have collected.

1. Can we compile a dataset of recorded human language processing signals which fulfills the state of the art in neuroscience and is usable for ML applications?
2. Can signals recorded during human language processing be applied to improve machine learning based NLP tasks?
3. Can human language processing signals be applied to evaluate the quality and cognitive plausibility of computational language models?

Let's take a closer look at what these research questions entail...

1. Collecting Human Language Processing Signals

As is the case for any NLP application, we want as much data as possible to train our ML models. But where does the human data come from?

As we read, our eyes reveal which words go together, and which are the most important in the given context. We can collect this information using eye-tracking technology. Eye-trackers provide millisecond-accurate records of where a person is looking on the screen. We can use these eye movement signals as a window into the mysterious black boxes of our brains. Eye-tracking signals are generally considered to be an indirect measure of cognitive load.

As we read, our brain also produces electrical signals while processing the information contained in a text. Electroencephalography (EEG) measures the brain's electrical activity over a period of time, using multiple electrodes placed along the scalp. EEG measures the voltage fluctuations resulting from the activity of neurons in the brain.
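To make the idea of eye-tracking records more concrete, here is a minimal sketch of how raw fixations could be aggregated into word-level reading measures. The fixation tuples, the function name `word_level_features`, and the feature names are all hypothetical and chosen for illustration; they do not reflect the format of any particular eye-tracker or dataset.

```python
from collections import defaultdict

# Hypothetical fixation records for one four-word sentence, in chronological
# order: (index of the fixated word, fixation duration in milliseconds).
fixations = [(0, 180), (1, 210), (1, 95), (3, 250), (2, 120), (3, 140)]


def word_level_features(fixations, n_words):
    """Aggregate raw fixations into simple per-word reading measures."""
    durations = defaultdict(list)
    for word_idx, duration_ms in fixations:
        durations[word_idx].append(duration_ms)

    features = []
    for idx in range(n_words):
        word_durations = durations.get(idx, [])
        features.append({
            # How often the word was fixated (0 means it was skipped).
            "n_fixations": len(word_durations),
            # Duration of the first fixation that landed on the word.
            "first_fixation_ms": word_durations[0] if word_durations else 0,
            # Sum of all fixation durations on the word.
            "total_reading_time_ms": sum(word_durations),
        })
    return features


features = word_level_features(fixations, n_words=4)
for idx, word_features in enumerate(features):
    print(idx, word_features)
```

Word-level measures of this kind are what can later be fed to an NLP model alongside the text itself.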

One concern and general bottleneck in machine learning is often that not enough data is available. This is especially tricky when working with signals recorded from humans, since the recording procedure is expensive and time-consuming. Even though many language research studies are conducted by neuroscientists and psycholinguists, many can't or won't share their data. Fortunately, this is starting to change now and more datasets are shared in the research community. We've put together a collection of datasets of cognitive signals that can be used for NLP here.

Another concern is the experimental setup itself. For ML-based natural language processing, we want real-world, naturally occurring text spans. Traditionally, cognitive language studies entail the careful construction of stimulus sentences to contain specific linguistic phenomena, and often the reading process is recorded word by word and not on full sentences (or longer texts). However, thanks to recent advancements in technology, naturalistic language studies are on the rise. Therefore, we compiled our own freely available dataset in such a naturalistic experiment setting. The Zurich Cognitive Language Processing Corpus (ZuCo) is specifically tailored to these interdisciplinary research questions. The data can be downloaded here.

We present the Zurich Cognitive Language Processing Corpus (ZuCo), a dataset combining electroencephalography (EEG) and eye-tracking recordings from subjects reading natural sentences. ZuCo includes high-density EEG and eye-tracking data of 12 healthy adult native English speakers, each reading natural English text for 4–6 hours. The recordings span two normal reading tasks and one task-specific reading task, resulting in a dataset that encompasses EEG and eye-tracking data of 21,629 words in 1107 sentences and 154,173 fixations. We believe that this dataset represents a valuable resource for natural language processing (NLP). The EEG and eye-tracking signals lend themselves to train improved machine-learning models for various tasks, in particular for information extraction tasks such as entity and relation extraction and sentiment analysis. Moreover, this dataset is useful for advancing research into the human reading and language understanding process at the level of brain activity and eye-movement.
- Hollenstein et al. (2018)

Examining language comprehension in fully naturalistic environments will not only advance our understanding of human language processing. The possibility to test theories of speech and language comprehension with high temporal resolution will also open up new frontiers in leveraging this data for NLP and machine learning. The trend in machine learning of using ever-increasing quantities of labeled data isn’t really sustainable, so using human signals like eye-tracking and brain activity is an intriguing way to make machines a little more intuitive and to improve their generalization abilities. We will see how this can be done in the following sections.

2. Improving NLP Applications with Cognitive Data

Now that we have collected human language processing signals, we leverage these data to help neural networks understand language. We use these cognitive signals to augment ML models for NLP. Compared to purely text-based models, we show consistent improvements across a range of tasks with both eye tracking and brain activity data. The tasks include sentiment analysis, relation extraction and named entity recognition.

There are different methods of multi-modal learning that can be used to augment ML models with cognitive features:

  • Early fusion: Concatenating linguistic features (e.g., pre-trained word embeddings) with cognitive features as the input for our models (as we did here).
  • Late fusion: Learning individual network components for the text features and for the cognitive features, and joining them just before the prediction. We experiment with this method using EEG data.
  • Attention: Cognitive features can be used to train attention mechanisms for ML models, e.g., with gaze data or brain activity data. Human attention, i.e., how long humans fixate certain words or how active their neurons are during the processing of words, is used as a proxy for machine attention.
  • Fine-tuning: Pre-trained language models fine-tuned on brain activity data show increased performance on NLP tasks.
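The difference between the first two strategies in the list above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the architecture used in the studies: the inputs are random placeholders, the weights are untrained, and all names (`dense`, `early_fusion`, `late_fusion`, the parameter dictionaries and dimensions) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, emb_dim, cog_dim, hidden_dim, n_classes = 5, 8, 3, 4, 2

# Placeholder inputs: random stand-ins for pre-trained word embeddings and
# for word-level cognitive features (e.g., gaze or EEG measures).
word_embeddings = rng.normal(size=(n_words, emb_dim))
cognitive_features = rng.normal(size=(n_words, cog_dim))


def dense(x, w, b):
    """A single fully connected layer with tanh activation."""
    return np.tanh(x @ w + b)


def early_fusion(text, cog, params):
    # Concatenate both feature types first, then feed one shared network.
    fused_input = np.concatenate([text, cog], axis=-1)
    hidden = dense(fused_input, params["w_shared"], params["b_shared"])
    return hidden @ params["w_out"]


def late_fusion(text, cog, params):
    # Encode each modality separately and join just before the prediction.
    text_hidden = dense(text, params["w_text"], params["b_text"])
    cog_hidden = dense(cog, params["w_cog"], params["b_cog"])
    joined = np.concatenate([text_hidden, cog_hidden], axis=-1)
    return joined @ params["w_out"]


# Untrained, randomly initialized weights, only to make the sketch runnable.
early_params = {
    "w_shared": rng.normal(size=(emb_dim + cog_dim, hidden_dim)),
    "b_shared": np.zeros(hidden_dim),
    "w_out": rng.normal(size=(hidden_dim, n_classes)),
}
late_params = {
    "w_text": rng.normal(size=(emb_dim, hidden_dim)),
    "b_text": np.zeros(hidden_dim),
    "w_cog": rng.normal(size=(cog_dim, hidden_dim)),
    "b_cog": np.zeros(hidden_dim),
    "w_out": rng.normal(size=(2 * hidden_dim, n_classes)),
}

early_logits = early_fusion(word_embeddings, cognitive_features, early_params)
late_logits = late_fusion(word_embeddings, cognitive_features, late_params)
print(early_logits.shape, late_logits.shape)
```

In early fusion the model sees one combined input vector per word, whereas in late fusion each modality gets its own component and the two are only combined at the output layer.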

The results of these studies show that we can achieve consistent improvements across NLP tasks, but these improvements are often still modest. The main challenge in working with cognitive signals is their noisiness. Hence, more research is needed to effectively decode brain activity for language processing, to find or learn better features from human data and to deal with the limited amounts of training data.

Of course, using eye-tracking or brain activity data in all NLP applications is not very practical. However, these insights can help to advance technologies at the intersection of language processing and cognitive science, such as brain-computer interfaces and other wearable interfaces. Moreover, reader identities, reader expertise and text comprehension levels can be inferred from eye-tracking data recorded during reading. This is also helpful in the diagnostic space, for example, to predict characteristics of reading difficulties.

3. Analyzing and Probing NLP Models

Finally, we discuss the potential to use cognitive signals for the analysis and interpretability of NLP models. Pre-trained language models have become the cornerstones of state-of-the-art NLP models. Computational word representations are low-dimensional vectors representing the meaning of a word or sentence. Deep contextualized word representations model both complex syntactic and semantic characteristics of word use, and how these uses vary across linguistic contexts. Unfortunately, current state-of-the-art machine learning algorithms for language understanding are still mostly black box algorithms. The link between a vector of numbers and a humanly interpretable representation of semantics is hidden. This means we cannot comprehend or track the decision-making process of NLP models. However, being able to understand the algorithms’ decisions is key for many NLP applications.

In relation to the interpretability of NLP models, we can also ask whether these models are cognitively plausible. Do word embeddings and language models reflect human representations in the brain?

Say we want to evaluate word embeddings, but not by how well they perform on a downstream NLP task, for which the embeddings can be specially tuned, and not through an intrinsic evaluation against human judgements, such as word analogy, which assesses only one aspect of semantics. Instead, we want to evaluate word representations on a more global level and ask how well these embeddings reflect the mental lexical representation of the meaning of a word.

This question can be approached through cognitive lexical semantics, a theory proposing that words are defined by how they are organized in the brain. It has been shown in a neuroscientific study how words are represented in semantic maps across the brain. Recordings of brain activity and gaze patterns play a central role in furthering our understanding of how human language works and can be used to inspect certain linguistic characteristics in NLP models. Moreover, as we discussed above, language representations tuned on brain activity show improved performance on NLP tasks. Hence, it seems natural to evaluate language models against human language processing data.

Therefore, we presented CogniVal, a framework for cognitive evaluation of English word embeddings. Specifically, embeddings are evaluated by their performance at predicting a wide range of cognitive data sources recorded during language comprehension, including multiple eye tracking datasets and brain activity recordings such as electroencephalography (EEG) and functional magnetic resonance imaging (fMRI). Since cognitive data is very noisy, we evaluate the embeddings against a wide range of datasets covering different recording modalities and language stimuli.
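The core idea of evaluating embeddings by how well they predict cognitive data can be illustrated with a small regression sketch. This is not CogniVal's implementation: the data is synthetic, closed-form ridge regression is swapped in as a simple stand-in regressor, and the names `ridge_fit` and `evaluate_embeddings` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n_words, emb_dim = 200, 10

# Synthetic stand-ins: the "informative" embeddings linearly drive the
# cognitive signal by construction, the "random" embeddings are unrelated.
informative_emb = rng.normal(size=(n_words, emb_dim))
random_emb = rng.normal(size=(n_words, emb_dim))
true_weights = rng.normal(size=emb_dim)
cognitive_signal = informative_emb @ true_weights + 0.1 * rng.normal(size=n_words)


def ridge_fit(x, y, alpha=1.0):
    """Closed-form ridge regression: (X^T X + alpha * I)^-1 X^T y."""
    gram = x.T @ x + alpha * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def evaluate_embeddings(emb, signal, n_train=150):
    """Fit on the first n_train words, return the MSE on the held-out rest."""
    weights = ridge_fit(emb[:n_train], signal[:n_train])
    predictions = emb[n_train:] @ weights
    return float(np.mean((predictions - signal[n_train:]) ** 2))


mse_informative = evaluate_embeddings(informative_emb, cognitive_signal)
mse_random = evaluate_embeddings(random_emb, cognitive_signal)
print(f"informative: {mse_informative:.3f}, random: {mse_random:.3f}")
```

Embeddings that carry information about the cognitive signal yield a lower held-out error than unrelated ones. Repeating this across many datasets and modalities is what helps to average out the noise in any single recording.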

The CogniVal framework is available as a command line interface. It saves NLP researchers the time needed to preprocess cognitive data sources. Additionally, it is easily extensible to include other intrinsic and extrinsic evaluation methods, which is essential for achieving a global evaluation metric. We find strong correlations in the results between cognitive datasets, across recording modalities, and with the embeddings' performance on extrinsic NLP tasks.

Next, we can go from a general estimation of the cognitive plausibility of word representations to investigating specific linguistic and cognitive aspects of current language models. For example, do (multilingual) language models accurately predict human reading behavior? We address this question in a recent paper. The goal is to use human behavioral data to investigate which reading patterns – in the form of eye-tracking data – can also be found in state-of-the-art language models.

We investigate to what extent human reading behavior can be predicted by state-of-the-art pretrained language models. Previous work on psycholinguistic modelling finds good fits between cognitive signals and transformer language models. However, the direct prediction of these features has not been explored previously. Some eye movement patterns are universal across languages, hence we perform all experiments on multilingual and monolingual models. We compare the performance of language-specific and multilingual pretrained transformer models to predict reading time measures reflecting natural human sentence processing on Dutch, English, German and Russian text.  This results in accurate models of human reading behavior and yields insights into the workings of transformer language models. We find that BERT and XLM models successfully predict a range of gaze features. In a series of experiments we analyze the cross-domain and cross-language abilities of these models and show how they reflect human sentence processing. We observe that the models learn to reflect characteristics of human reading such as the word length effect and higher accuracy in more easily readable sentences.

For instance, we analyze the word length effect, i.e., the observation that gaze patterns are strongly correlated with word length. The language models fine-tuned on eye-tracking data successfully learn to predict higher fixation proportions for longer words. Similar patterns emerge for all four languages. Notably, the pre-trained models before fine-tuning do not reflect the word length effect at all.
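One simple way to check for a word length effect is to correlate word length with a gaze measure. The sketch below uses made-up fixation proportions (here taken to mean the share of readers who fixated a word) purely for illustration; the words, the values, and the helper `pearson` are assumptions and not data or code from the study.

```python
import math

# Hypothetical tokens with made-up fixation proportions. Real values would
# come from an eye-tracking corpus or from a fine-tuned model's predictions.
words = ["the", "cat", "quickly", "jumped", "over", "a", "surprisingly", "tall", "fence"]
fixation_proportion = [0.30, 0.55, 0.85, 0.80, 0.60, 0.15, 0.95, 0.60, 0.70]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


word_lengths = [len(word) for word in words]
r = pearson(word_lengths, fixation_proportion)
print(f"Correlation between word length and fixation proportion: {r:.2f}")
```

Running the same check on a model's predicted fixation proportions before and after fine-tuning is one way to see whether the model has picked up the effect.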

While the benefits of pre-trained transformer language models have been established, we have yet to understand to what extent these models incorporate human language processing behavior. We took a step in this direction, and our analyses provide insights into the differences between human processing strategies and computational models. The ability of transformer models to achieve such high results in modelling human sentence processing indicates that we can learn more about the cognitive plausibility of these models by predicting behavioral metrics. Of course, more research is needed to analyze this in more depth. For instance, this recent shared task poses the challenge of predicting eye-tracking-based metrics recorded during English sentence processing to advance our understanding of language processing.

Final Thoughts

In this post, we described how human language processing signals can be leveraged to improve and evaluate machine learning models of natural language understanding. To conclude, we briefly want to mention the ethical considerations that arise when working with human language processing signals for NLP.

First, we want to highlight the necessity of considering the high-level consequences of our work. It becomes increasingly relevant to examine the implications of the interaction between humans and machines, between what can be recorded from a human brain and what can be extracted from those signals. What is the objective of the final application? What is the impact on people and society? Second, it is essential to remember the responsibility towards research subjects and towards protecting the individual when working with data recorded from human participants. And finally, the origins of the data and any biases within them should be considered. Demographic biases can be extracted from text, and are also reflected in eye movements and brain activity recordings. It is important to remember that with extensive reuse of the same corpora these biases – participant sampling as well as experimental biases – are propagated to many experiments. We should thus be careful in the interpretation of the results.

Based on our previous research combining human language processing data such as eye tracking and brain activity with natural language understanding methods, we found that leveraging cognitive data for NLP is very promising and shows great potential for multi-modal machine learning as well as for interpretation approaches. Ultimately, the goal of this line of research is to bridge human intelligence with machine intelligence to build a general, interpretable framework for multi-modal NLP using a diverse range of cognitive signals. We believe that through human-grounded learning we can build truly generalizable models of natural language processing. We have shown how human language processing signals can be leveraged to increase the performance of models and to pursue explanatory research for a better understanding of the differences between human and machine language processing.