Meet the Researcher: Jesse Vig “Deconstructs” Natural Language Processing

Jesse Vig is a researcher at PARC specializing in conversational agents and Natural Language Processing (NLP). He also explores the intersection of machine learning and human-computer interaction, particularly around data visualization.

In the article “Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters,” Jesse Vig uses his visualization tool to describe the inner workings of BERT, Google’s Natural Language Processing (NLP) algorithm. We sat down with Jesse to learn more about what his visualization tool revealed about BERT and other NLP models.

Jesse, can you briefly describe BERT?  

BERT is a language understanding model: an algorithm that takes language – either written or spoken – and distills its meaning into a numerical format that computers can process. This is important for many different NLP applications, such as sentiment analysis or question answering.

BERT is built on a newer model architecture called the Transformer, which allows it to extract deeper meaning from text. BERT is also pre-trained, so it doesn’t need to learn each new task from scratch. This pre-training is done through a fill-in-the-blank exercise, in which the model is given a sentence with some words removed and asked to guess the missing words. This is a challenging task that requires both syntactic and semantic understanding of language. By repeating this exercise millions of times, the model gains a deep understanding of the workings of language that can then be applied to more useful tasks. As a result, BERT may require orders of magnitude less training data than previous models.
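
The fill-in-the-blank exercise can be illustrated with a minimal Python sketch. It is not BERT's actual training code: the function name `make_fill_in_the_blank`, its parameters, and the sample sentence are all hypothetical, and it only shows how a sentence might be turned into a "guess the hidden words" example. The default of 15% mirrors the masking rate reported for BERT, whose real procedure includes further details omitted here.

```python
import random

MASK = "[MASK]"  # placeholder for a hidden word

def make_fill_in_the_blank(tokens, mask_fraction=0.15, seed=0):
    """Hide a random subset of tokens and return (masked_tokens, answers).

    `answers` maps each hidden position to the original word that the
    model would be asked to guess. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    # Always hide at least one token so every sentence yields an exercise.
    n_to_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    answers = {}
    for pos in positions:
        answers[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, answers

sentence = "the doctor asked the nurse to review the chart".split()
masked, answers = make_fill_in_the_blank(sentence, mask_fraction=0.3)
print(" ".join(masked))  # the sentence with some words replaced by [MASK]
print(answers)           # hidden positions and the words to be guessed
```

During pre-training, the model sees only the masked sentence and is scored on how well it recovers the hidden words.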

Why is it important to understand NLP models such as BERT?

Being able to interpret a model is important for several reasons. First, the model may be learning from features of the data that are irrelevant or even misleading. For example, there was a case where a model that diagnosed disease from X-rays learned to base its decisions on textual artifacts printed on the X-ray. As a result, the model was not able to properly interpret X-rays from machines that didn’t produce those same artifacts.

Second, the model may be learning biases based on factors such as gender or ethnicity. For example, if the model sees a sentence that mentions a doctor and a nurse, followed by a sentence with the pronoun “she,” the model will often assume that “she” refers to the nurse. If we understand the source of these biases, we can potentially correct for them.

How does your visualization tool work?

My visualization tool allows you to peer into the NLP model through the lens of “attention”. In machine learning, “attention” is a modeling approach in which the model focuses on particular parts of the input more than others.

By visualizing these attention patterns, the tool can help us understand how the model thinks. I’ve found the attention patterns to be even more interpretable than expected, and they seem to encode many of the same structures that human engineers have explicitly encoded in earlier language models such as Recurrent Neural Networks (RNNs).
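
To make the idea of attention concrete, here is a minimal Python sketch of how attention weights can be computed and printed as a small table. It is not the visualization tool itself. It uses scaled dot-product attention, the form used in Transformer models, but the three token vectors are invented for illustration, and the same vectors serve as both queries and keys, whereas a real model derives them from learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention: one row of weights per query token."""
    dim = len(keys[0])
    rows = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        rows.append(softmax(scores))
    return rows

tokens = ["the", "cat", "sat"]
# Hypothetical 2-dimensional vectors, invented for illustration only.
vectors = [[1.0, 0.2], [0.1, 1.0], [0.9, 0.3]]
weights = attention_weights(vectors, vectors)

# Print how strongly each token attends to every token in the input.
for token, row in zip(tokens, weights):
    cells = "  ".join(f"{other}:{w:.2f}" for other, w in zip(tokens, row))
    print(f"{token:>4} -> {cells}")
```

Each printed row shows how one token distributes its attention across the input. A visualization tool renders this kind of table graphically, for example as lines whose intensity reflects each weight.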

What did your visualization tool reveal about BERT and other NLP models?

In general, models such as BERT are difficult to interpret and are often considered “black boxes”. However, the visualization tool revealed that there are many aspects of these models that are interpretable by humans. For example, the tool shows how the model generates acronyms by matching letters to corresponding words, and how the model is able to predict last names of people by finding names of relatives mentioned in the text.

What do your findings tell us about the future of NLP algorithms and their potential applications?

BERT and similar models will continue to improve NLP applications across the board, but one area in particular that will benefit is conversational AI. Smart speakers like Amazon Alexa and Google Home are now ubiquitous, but they are mainly used for simple tasks like listening to music or getting the weather forecast. The most transformative applications of conversational AI, such as intelligent tutoring systems or physician assistants, will need to support much more complex interactions.

While speech recognition (converting audio to text) is almost a solved problem, understanding text at a deeper level is still challenging for conversational agents. Similarly, agents can generate surprisingly human-like speech from text but are still limited in generating that text in the first place.

It turns out that the Transformer model is really good at generating language in addition to understanding it. The GPT-2 Transformer model from OpenAI, for example, can generate a wide variety of content such as news articles, stories and conversational responses.

Giving our algorithms the ability to emulate humans also brings great risks, such as fake news generation. Consequently, researchers are developing countermeasures against these malicious applications. Model interpretability and visualization can potentially help in these efforts.

