Differences Between Word2Vec and BERT

Lavanya Gupta · Published in The Startup · Nov 12, 2020 · 3 min read

With so many rapid advances taking place in Natural Language Processing (NLP), it can become overwhelming to keep the differences between models straight.

It is important to understand not only how these models differ from each other, but also how one model overcomes the shortcomings of another.

Below I have drawn out a comparison between two very popular models — Word2Vec and BERT.

1. Context

Word2Vec generates embeddings that are context-independent: there is just one vector (numeric) representation for each word. Different senses of the word (if any) are collapsed into a single vector.

BERT, however, generates a different vector representation for the same word depending on the context in which it is used. Thus, BERT embeddings are context-dependent.

For example, in the figure below, the word bank is used in two different contexts: a) a financial institution, and b) the land along a river.
Word2Vec will generate the same single vector for the word bank in both sentences. BERT, in contrast, will generate two different vectors for bank, one for each context. One vector will be close to the vectors of words like money and cash; the other will be close to the vectors of words like beach and coast.

The Word2Vec embedding for the word "bank" is therefore a muddled representation, because the different contexts have been collapsed into a single vector.
The BERT embeddings are able to distinguish and capture the two different semantic meanings by producing two different vectors for the same word "bank".

Figure: the same word (bank) used in two different contexts. Source: https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
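As a quick illustration, here is a minimal sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (my choice of tooling, not something the article prescribes). It extracts the contextual vector for "bank" from two different sentences and shows that the two vectors differ:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    # Encode the full sentence and run it through BERT.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Find the position of the token "bank" and return its contextual vector.
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return outputs.last_hidden_state[0, tokens.index("bank")]

v_finance = bank_vector("i deposited cash at the bank")
v_river = bank_vector("we walked along the river bank")

# The same surface word gets two different vectors because the contexts differ.
similarity = torch.nn.functional.cosine_similarity(v_finance, v_river, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")
```

A Word2Vec lookup, by contrast, would return the identical vector for bank in both sentences.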

2. Word Ordering

Word2Vec embeddings do not take word position into account; the words in a context window are treated as an unordered bag during training.

The BERT model, in contrast, explicitly takes the position (index) of each token in the sentence as input before calculating its embedding.
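For the curious, this positional information is visible as a learned embedding table inside the model. A minimal sketch, assuming the Hugging Face transformers implementation of BERT:

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# BERT adds a learned position embedding to each token embedding before the
# transformer layers: one vector per position, up to the maximum length (512).
print(model.embeddings.position_embeddings)  # Embedding(512, 768)

# Word2Vec has no comparable component: its training objective treats the
# words in a context window as an unordered bag.
```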

3. Embeddings

Pre-trained Word2Vec embeddings are available to use directly off-the-shelf as a 1-to-1 mapping (key-value pairs) between words and vectors. There is no need to keep the model itself; all we need are the embeddings it generated. The input is a single word, and the output is that word's vector representation.
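For example, here is a minimal sketch using gensim and its publicly released Google News vectors (the library and vector set are my choice, not the article's):

```python
import gensim.downloader as api

# Downloads the pre-trained Google News vectors on first use (~1.6 GB).
wv = api.load("word2vec-google-news-300")

# A plain key-value lookup: one static 300-dimensional vector per word.
vector = wv["bank"]
print(vector.shape)                      # (300,)
print(wv.most_similar("bank", topn=3))   # nearest neighbours in vector space
```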

On the other hand, since BERT generates contextual embeddings, the input to the model is a whole sentence rather than a single word, because BERT needs to see the surrounding words before it can produce a word's vector. We therefore need the trained model itself at inference time to generate embeddings for our input. In the output, we get a contextual vector for every token in the sentence, from which a fixed-length representation of the whole sentence can also be derived (for example, by pooling).
(Note: Of course, you can use the BERT model to pre-compute static single-word vectors, similar to Word2Vec, but that defeats the purpose of having contextualized embeddings.)
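A minimal sketch of this workflow, again assuming the Hugging Face transformers library (the sentence-level vector below uses simple mean pooling, which is just one common choice):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

# The whole sentence goes in, so BERT can see the context around every word.
inputs = tokenizer("I deposited cash at the bank", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state    # one 768-d vector per token
sentence_vector = token_vectors.mean(dim=1)  # mean-pooled sentence representation
print(token_vectors.shape, sentence_vector.shape)
```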

4. Out-of-Vocabulary (OOV)

The Word2Vec model learns embeddings at the word level. That is, if your Word2Vec model was trained on a corpus containing, say, 1 million unique words, then the model will generate 1 million word embeddings: one vector for each word in the vocabulary. However, such a representation cannot produce vectors for words outside that vocabulary. In other words, Word2Vec does not support out-of-vocabulary (OOV) words, which is one of its major disadvantages.

BERT, on the other hand, learns representations at the subword level (using WordPiece tokenization). Subwords can be thought of as a sweet spot between character-level and word-level embeddings. Thus, a BERT model may have a vocabulary of only, say, 50k subword units, despite being trained on a corpus containing, say, 1 million unique words. This kind of modeling has become very popular because the model can compose a vector for any arbitrary word out of its subword pieces, so it is not limited to a fixed vocabulary; it is effectively an open-ended vocabulary. In other words, BERT supports out-of-vocabulary (OOV) words, as the sketch below illustrates.
Thus, Word2Vec and BERT also differ in the granularity of the representations they learn.
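To make the contrast concrete, here is a minimal sketch assuming the Hugging Face transformers tokenizer (the made-up word is purely illustrative):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A word BERT has never seen is split into known WordPiece subwords, so the
# model can still build a representation for it from the pieces.
print(tokenizer.tokenize("embeddingology"))
# e.g. something like ['em', '##bed', '##ding', '##ology'] (exact split may vary)

# A Word2Vec key-value lookup, by contrast, simply fails for an unseen word:
#   wv["embeddingology"]   # raises KeyError
```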

Summary

These are the most important differences to remember. I hope this helps!
