- Continuous bag-of-words model: predicts the middle word based on surrounding context words. The context consists of a few words before and after the current (middle) word. This architecture is called a bag-of-words model as the order of words in the context is not important.
- Continuous skip-gram model: predicts words within a certain range before and after the current word in the same sentence. A worked example of this is given below.
Skip-gram and negative sampling
While a bag-of-words model predicts a word given the neighboring context, a skip-gram model predicts the context (or neighbors) of a word, given the word itself. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (see the diagram below for an example). The context of a word can be represented through a set of skip-gram pairs of (target_word, context_word) where context_word appears in the neighboring context of target_word.
Consider the following sentence of eight words:

The wide road shimmered in the hot sun.

The context words for each of the eight words of this sentence are defined by a window size. The window size determines the span of words on either side of a target_word that can be considered a context word. Below is a table of skip-grams for target words based on different window sizes.
Note: For this tutorial, a window size of n implies n words on each side with a total window span of 2*n+1 words across a word.
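As a concrete illustration, a few lines of plain Python can enumerate the skip-gram pairs for this sentence (a sketch; the helper function and variable names here are illustrative, not part of the tutorial code):

```python
# A minimal pure-Python sketch of skip-gram pair generation for one sentence.
# `window_size` follows the tutorial's convention: n words on each side.
sentence = "The wide road shimmered in the hot sun".split()

def skip_gram_pairs(tokens, window_size):
    pairs = []
    for i, target in enumerate(tokens):
        # The context spans up to `window_size` words on either side of the target.
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skip_gram_pairs(sentence, window_size=2)
print(pairs[:4])
# The first target "The" pairs with its two right-hand neighbors.
```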
The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words w₁, w₂, … w_T, the objective can be written as the average log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)$$

where c is the size of the training context. The basic skip-gram formulation defines this probability using the softmax function:

$$p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$

where $v$ and $v'$ are the target and context vector representations of words and $W$ is the vocabulary size.
Computing the denominator of this formulation involves performing a full softmax over the entire vocabulary, which is often large (10⁵–10⁷ terms).
The noise contrastive estimation (NCE) loss function is an efficient approximation for a full softmax. With an objective to learn word embeddings instead of modeling the word distribution, the NCE loss can be simplified to use negative sampling.
The simplified negative sampling objective for a target word is to distinguish the context word from num_ns negative samples drawn from noise distribution Pₙ(w) of words. More precisely, an efficient approximation of full softmax over the vocabulary is, for a skip-gram pair, to pose the loss for a target word as a classification problem between the context word and num_ns negative samples.
A negative sample is defined as a (target_word, context_word) pair such that the context_word does not appear in the window_size neighborhood of the target_word. For the example sentence, these are a few potential negative samples (when window_size is 2).
Setup
Vectorize an example sentence
Consider the following sentence:

The wide road shimmered in the hot sun.

Tokenize the sentence:
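A minimal sketch of this step, building a vocabulary (with index 0 reserved for padding) and encoding the sentence as integers:

```python
# Tokenize the sentence and build a vocabulary plus an inverse vocabulary,
# then encode the sentence as a sequence of integer indices.
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())

vocab = {'<pad>': 0}  # index 0 is reserved for padding
index = 1
for token in tokens:
    if token not in vocab:
        vocab[token] = index
        index += 1
inverse_vocab = {index: token for token, index in vocab.items()}

example_sequence = [vocab[word] for word in tokens]
print(example_sequence)  # [1, 2, 3, 4, 5, 1, 6, 7]
```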
Generate skip-grams from one sentence
The tf.keras.preprocessing.sequence module provides useful functions that simplify data preparation for word2vec. You can use tf.keras.preprocessing.sequence.skipgrams to generate skip-gram pairs from the example_sequence with a given window_size from tokens in the range [0, vocab_size).
Note: negative_samples is set to 0 here, as batching negative samples generated by this function requires a bit of code. You will use another function to perform negative sampling in the next section.
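Under the setup above, the call might look like this (the small vocabulary is rebuilt inline so the snippet is self-contained):

```python
import tensorflow as tf

# Rebuild the example sequence from the earlier section.
sentence = "The wide road shimmered in the hot sun"
tokens = list(sentence.lower().split())
vocab = {'<pad>': 0}
for token in tokens:
    vocab.setdefault(token, len(vocab))
example_sequence = [vocab[word] for word in tokens]
vocab_size = len(vocab)
window_size = 2

# Generate positive skip-gram pairs only; negative_samples=0 because
# negative sampling is performed separately in the next section.
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
    example_sequence,
    vocabulary_size=vocab_size,
    window_size=window_size,
    negative_samples=0)
print(len(positive_skip_grams))
```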
Negative sampling for one skip-gram
The skipgrams function returns all positive skip-gram pairs by sliding over a given window span. To produce additional skip-gram pairs that would serve as negative samples for training, you need to sample random words from the vocabulary. Use the tf.random.log_uniform_candidate_sampler function to sample num_ns negative samples for a given target word in a window. You can call the function on one skip-gram's target word and pass the context word as the true class to exclude it from being sampled.
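A sketch of that call for a single hypothetical skip-gram pair (the specific indices, vocabulary size, and SEED value are illustrative):

```python
import tensorflow as tf

SEED = 42
num_ns = 4      # number of negative samples per positive pair
vocab_size = 8  # illustrative vocabulary size from the example sentence

# Suppose the positive skip-gram is (target_word=1, context_word=2).
context_word = 2

# The true class must have shape (num_true, 1) = (1, 1).
context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # class that should be sampled as "positive"
    num_true=1,                  # each positive skip-gram has 1 context word
    num_sampled=num_ns,          # number of negative context words to sample
    unique=True,                 # all negative samples should be unique
    range_max=vocab_size,        # pick samples from [0, vocab_size)
    seed=SEED)
print(negative_sampling_candidates)
```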
Key point: num_ns (the number of negative samples per positive context word) in the [5, 20] range is shown to work best for smaller datasets, while num_ns in the [2, 5] range suffices for larger datasets.
Construct one training example
For a given positive (target_word, context_word) skip-gram, you now also have num_ns negatively sampled context words that do not appear in the window-size neighborhood of target_word. Batch the 1 positive context_word and num_ns negative context words into one tensor. This produces a set of positive skip-grams (labeled as 1) and negative samples (labeled as 0) for each target word.
The (target, context, label) tensors constitute one training example for training your skip-gram negative-sampling word2vec model. Notice that the target is of shape (1,), while the context and label are of shape (1+num_ns,).
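A sketch of this batching step, using made-up indices for the positive pair and the sampled negatives:

```python
import tensorflow as tf

num_ns = 4
# Hypothetical positive pair (target=3, context=4) with 4 sampled negatives
# (these particular indices are made up for illustration).
target_word = 3
context_class = tf.constant([4], dtype="int64")
negative_sampling_candidates = tf.constant([1, 5, 2, 7], dtype="int64")

# Concatenate the 1 positive context word and num_ns negatives into a single
# context tensor, and label the positive 1 and the negatives 0.
context = tf.concat([context_class, negative_sampling_candidates], 0)
label = tf.constant([1] + [0] * num_ns, dtype="int64")
target = tf.constant([target_word], dtype="int64")

print(f"target : {target}")   # shape (1,)
print(f"context: {context}")  # shape (1 + num_ns,)
print(f"label  : {label}")    # shape (1 + num_ns,)
```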
Summary
This diagram summarizes the procedure of generating a training example from a sentence:
Notice that the words temperature and code are not part of the input sentence. They belong to the vocabulary like certain other indices used in the diagram above.
Compile all steps into one function
Skip-gram sampling table
A large dataset means a larger vocabulary with a higher number of frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as the, is, on) don't add much useful information for the model to learn from. Mikolov et al. suggest subsampling frequent words as a helpful practice to improve embedding quality.
The tf.keras.preprocessing.sequence.skipgrams function accepts a sampling table argument to encode the probability of sampling any token. You can use tf.keras.preprocessing.sequence.make_sampling_table to generate a word-frequency rank-based probabilistic sampling table and pass it to the skipgrams function. Inspect the sampling probabilities for a vocab_size of 10.
sampling_table[i] denotes the probability of sampling the i-th most common word in a dataset. The function assumes a Zipf’s distribution of the word frequencies for sampling.
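This can be inspected directly:

```python
import tensorflow as tf

# Probabilistic sampling table for a vocabulary of 10 tokens, assuming a
# Zipf-like rank/frequency distribution as described above.
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(sampling_table)
# sampling_table[i] is the probability of sampling the i-th most common word;
# frequent (low-rank) words get lower probabilities, so they are subsampled.
```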
Key point: The tf.random.log_uniform_candidate_sampler already assumes that the vocabulary frequency follows a log-uniform (Zipf’s) distribution. Using these distribution weighted sampling also helps approximate the Noise Contrastive Estimation (NCE) loss with simpler loss functions for training a negative sampling objective.
Generate training data
Compile all the steps described above into a function that can be called on a list of vectorized sentences obtained from any text dataset. Notice that the sampling table is built before sampling skip-gram word pairs. You will use this function in the later sections.

Prepare training data for word2vec

With an understanding of how to work with one sentence for a skip-gram negative-sampling-based word2vec model, you can proceed to generate training examples from a larger list of sentences!

Download text corpus
You will use a text file of Shakespeare's writing for this tutorial. Change the following line to run this code on your own data. Then construct a tf.data.TextLineDataset object for the next steps:
Vectorize sentences from the corpus
You can use the TextVectorization layer to vectorize sentences from the corpus. Learn more about using this layer in the Text classification tutorial. Notice from the first few sentences above that the text needs to be in one case and punctuation needs to be removed. To do this, define a custom_standardization function that can be used in the TextVectorization layer.
Call TextVectorization.adapt on the text dataset to create the vocabulary.
Once the layer has been adapted, retrieve the vocabulary with TextVectorization.get_vocabulary. This function returns a list of all vocabulary tokens sorted (descending) by their frequency.
vectorize_layer can now be used to generate vectors for each element in the text_ds (a tf.data.Dataset). Apply Dataset.batch, Dataset.prefetch, Dataset.map, and Dataset.unbatch.
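Putting these vectorization steps together on a tiny inline corpus (the corpus, vocabulary size, and sequence length here are illustrative stand-ins for the Shakespeare dataset):

```python
import re
import string
import tensorflow as tf

# Lowercase text and strip punctuation. This runs inside the
# TextVectorization layer, so it must use TensorFlow string ops.
def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    return tf.strings.regex_replace(
        lowercase, '[%s]' % re.escape(string.punctuation), '')

vectorize_layer = tf.keras.layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=4096,            # illustrative vocabulary size
    output_mode='int',
    output_sequence_length=10)  # illustrative padded sequence length

# A tiny stand-in corpus; in the tutorial this is the Shakespeare text.
text_ds = tf.data.Dataset.from_tensor_slices(
    ["The wide road shimmered in the hot sun.",
     "The hot sun shimmered."])
vectorize_layer.adapt(text_ds.batch(1024))
print(vectorize_layer.get_vocabulary()[:5])

# Vectorize the text dataset: batch for throughput, map the layer over it,
# then unbatch back to one integer sequence per sentence.
text_vector_ds = (text_ds.batch(1024)
                  .prefetch(tf.data.AUTOTUNE)
                  .map(vectorize_layer)
                  .unbatch())
sequences = list(text_vector_ds.as_numpy_iterator())
print(len(sequences), sequences[0].shape)
```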
Obtain sequences from the dataset
You now have a tf.data.Dataset of integer-encoded sentences. To prepare the dataset for training a word2vec model, flatten it into a list of sentence vector sequences. This step is required because you will iterate over each sentence in the dataset to produce positive and negative examples.
Note: Since the generate_training_data() defined earlier uses non-TensorFlow Python/NumPy functions, you could also use a tf.py_function or tf.numpy_function with tf.data.Dataset.map.
Inspect a few examples from sequences:
Generate training examples from sequences
sequences is now a list of int-encoded sentences. Just call the generate_training_data function defined earlier to generate training examples for the word2vec model. To recap, the function iterates over each word from each sequence to collect positive and negative context words. The lengths of target, contexts, and labels should be the same, representing the total number of training examples.
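For reference, one way to assemble the earlier steps into such a generate_training_data function (a sketch; the exact signature is an assumption):

```python
import tensorflow as tf

SEED = 42

def generate_training_data(sequences, window_size, num_ns, vocab_size, seed=SEED):
    # Elements of each training example are appended to these lists.
    targets, contexts, labels = [], [], []

    # Build the sampling table before sampling skip-gram pairs, so that
    # frequent (low-rank) tokens are subsampled.
    sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(vocab_size)

    for sequence in sequences:
        # Positive skip-gram pairs for this sentence.
        positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
            sequence,
            vocabulary_size=vocab_size,
            sampling_table=sampling_table,
            window_size=window_size,
            negative_samples=0)

        # For each positive pair, sample num_ns negative context words.
        for target_word, context_word in positive_skip_grams:
            context_class = tf.expand_dims(
                tf.constant([context_word], dtype="int64"), 1)
            negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
                true_classes=context_class,
                num_true=1,
                num_sampled=num_ns,
                unique=True,
                range_max=vocab_size,
                seed=seed)

            # Batch the positive context word with the negative samples.
            context = tf.concat(
                [tf.constant([context_word], dtype="int64"),
                 negative_sampling_candidates], 0)
            label = tf.constant([1] + [0] * num_ns, dtype="int64")

            targets.append(target_word)
            contexts.append(context)
            labels.append(label)

    return targets, contexts, labels
```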
Configure the dataset for performance
To perform efficient batching for the potentially large number of training examples, use the tf.data.Dataset API. After this step, you will have a tf.data.Dataset of ((target_word, context_word), (label)) elements to train your word2vec model!
Apply Dataset.cache and Dataset.prefetch to improve performance:
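A sketch of this step, with small toy arrays standing in for the real targets, contexts, and labels:

```python
import numpy as np
import tensorflow as tf

BATCH_SIZE = 4
BUFFER_SIZE = 10

# Toy training examples standing in for the output of generate_training_data:
# 8 examples, each with 1 positive + 4 negative context words.
targets = np.array([1, 2, 3, 4, 5, 6, 7, 1], dtype=np.int64)
contexts = np.random.randint(1, 8, size=(8, 5)).astype(np.int64)
labels = np.tile(np.array([1, 0, 0, 0, 0], dtype=np.int64), (8, 1))

# Shape the dataset as ((target, context), label) elements, then shuffle and
# batch it, and cache + prefetch for training throughput.
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = (dataset.shuffle(BUFFER_SIZE)
           .batch(BATCH_SIZE, drop_remainder=True)
           .cache()
           .prefetch(buffer_size=tf.data.AUTOTUNE))
print(dataset)
```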
Model and training
The word2vec model can be implemented as a classifier to distinguish between true context words from skip-grams and false context words obtained through negative sampling. You can perform a dot product multiplication between the embeddings of target and context words to obtain predictions for labels and compute the loss function against true labels in the dataset.

Subclassed word2vec model
Use the Keras Subclassing API to define your word2vec model with the following layers:
- target_embedding: A tf.keras.layers.Embedding layer, which looks up the embedding of a word when it appears as a target word. The number of parameters in this layer is (vocab_size * embedding_dim).
- context_embedding: Another tf.keras.layers.Embedding layer, which looks up the embedding of a word when it appears as a context word. The number of parameters in this layer is the same as in target_embedding, i.e. (vocab_size * embedding_dim).
- dots: A tf.keras.layers.Dot layer that computes the dot product of target and context embeddings from a training pair.
- flatten: A tf.keras.layers.Flatten layer to flatten the results of the dots layer into logits.
Define a call() function that accepts (target, context) pairs, which can then be passed into their corresponding embedding layers. Reshape the context_embedding to perform a dot product with target_embedding and return the flattened result.
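One way to write the subclassed model is sketched below; here tf.einsum computes the per-context dot products in a single step, playing the role of the Dot and Flatten layers described above (the hyperparameter values at the bottom are illustrative):

```python
import tensorflow as tf

class Word2Vec(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # Embedding for words appearing as targets.
        self.target_embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_dim, name="w2v_embedding")
        # Embedding for words appearing as (positive or negative) contexts.
        self.context_embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_dim)

    def call(self, pair):
        target, context = pair
        # target: (batch,), context: (batch, 1 + num_ns)
        word_emb = self.target_embedding(target)       # (batch, embedding_dim)
        context_emb = self.context_embedding(context)  # (batch, 1+num_ns, embedding_dim)
        # Dot product of the target embedding with each context embedding,
        # producing one logit per candidate context word.
        dots = tf.einsum('be,bce->bc', word_emb, context_emb)
        return dots

# Illustrative hyperparameters; compile with the Adam optimizer.
embedding_dim = 128
word2vec = Word2Vec(vocab_size=4096, embedding_dim=embedding_dim)
word2vec.compile(
    optimizer='adam',
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
```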
Key point: The target_embedding and context_embedding layers can be shared as well. You could also use a concatenation of both embeddings as the final word2vec embedding.
Define loss function and compile model
For simplicity, you can use tf.keras.losses.CategoricalCrossentropy as an alternative to the negative sampling loss. If you would like to write your own custom loss function, you can do so as follows:
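Since each (target, context) row is a classification between the one positive and num_ns negatives, one way to write such a custom loss is elementwise sigmoid cross-entropy over the logits:

```python
import tensorflow as tf

# Custom negative-sampling-style loss: sigmoid cross-entropy between the
# per-context logits and the 1/0 labels produced earlier.
def custom_loss(x_logit, y_true):
    return tf.nn.sigmoid_cross_entropy_with_logits(
        logits=x_logit, labels=y_true)
```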
Compile the model with the tf.keras.optimizers.Adam optimizer.
Train the model on the dataset for some number of epochs:
Embedding lookup and analysis
Obtain the weights from the model using Model.get_layer and Layer.get_weights. The TextVectorization.get_vocabulary function provides the vocabulary to build a metadata file with one token per line.
Download the vectors.tsv and metadata.tsv files to analyze the obtained embeddings in the Embedding Projector:
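A sketch of the export step; the tiny stand-in model and vocabulary below are placeholders for the trained word2vec model and the adapted vectorize_layer vocabulary:

```python
import io
import tensorflow as tf

# Stand-in for the trained model: an untrained embedding layer with the same
# "w2v_embedding" name, and a made-up 10-token vocabulary.
vocab_size, embedding_dim = 10, 4
word2vec = tf.keras.Sequential(
    [tf.keras.layers.Embedding(vocab_size, embedding_dim, name="w2v_embedding")])
word2vec(tf.constant([[1]]))  # build the model by calling it once
vocab = ['', '[UNK]', 'the', 'sun', 'hot', 'wide',
         'road', 'in', 'shimmered', 'a']

# Pull the target-embedding weights out of the model.
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]

# Write one vector per line to vectors.tsv and one token per line to
# metadata.tsv, skipping index 0 (the padding token).
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')
for index, word in enumerate(vocab):
    if index == 0:
        continue  # skip the padding token
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
out_v.close()
out_m.close()
```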
Next steps
This tutorial has shown you how to implement a skip-gram word2vec model with negative sampling from scratch and visualize the obtained word embeddings.

- To learn more about word vectors and their mathematical representations, refer to these notes.
- To learn more about advanced text processing, read the Transformer model for language understanding tutorial.
- If you’re interested in pre-trained embedding models, you may also be interested in Exploring the TF-Hub CORD-19 Swivel Embeddings, or the Multilingual Universal Sentence Encoder.
- You may also like to train the model on a new dataset (there are many available in TensorFlow Datasets).

