How is BERTScore calculated mathematically?

Understanding how well a machine-generated text matches a reference text has become crucial in evaluating natural language processing tasks like translation, summarization, and text generation. One of the most widely used embedding-based methods for this evaluation is BERTScore.

In this article, we will explore how BERTScore is calculated mathematically, explain its inner workings step by step, and provide clear examples to make it easy to understand—even if you’re new to the concept. By the end, you’ll know exactly what BERTScore measures and why it’s preferred over traditional text similarity metrics.

What is BERTScore?

Before diving into the mathematical details, it’s important to understand what BERTScore actually is.

BERTScore is a metric that evaluates the similarity between two pieces of text using contextual embeddings from transformer models, particularly BERT. Unlike traditional metrics like BLEU or ROUGE that rely on exact word matches, BERTScore measures semantic similarity. This means it can recognize that “happy” and “joyful” are similar, even if the words don’t match exactly.

This ability makes BERTScore particularly useful in modern natural language tasks, where paraphrasing and nuanced meanings are common.

How BERTScore Uses Word Embeddings

The foundation of BERTScore lies in word embeddings. Every word in a sentence is converted into a high-dimensional vector that captures its meaning based on context. For example, the word “bank” in “river bank” has a different vector than “bank” in “financial bank.”

BERTScore compares the embeddings of each word in a candidate sentence with the embeddings of words in the reference sentence to determine their similarity. This approach is more flexible than counting exact matches, allowing the metric to focus on meaning rather than form.

Precision, Recall, and F1 in BERTScore

Mathematically, BERTScore calculates three key components: precision, recall, and F1 score. These are similar to measures used in information retrieval but applied to vector similarities.

  • Precision: The average of each candidate token’s best similarity to any reference token — how much of the candidate is supported by the reference.
  • Recall: The average of each reference token’s best similarity to any candidate token — how much of the reference is covered by the candidate.
  • F1 Score: The harmonic mean of precision and recall, giving a single balanced score.

These metrics ensure that BERTScore evaluates both completeness (recall) and accuracy (precision) of the text.

Mathematical Calculation of BERTScore

Now let’s look at how BERTScore is calculated mathematically in detail.

Step 1: Generate Word Embeddings

First, each token in the reference and candidate sentences is converted into a contextual embedding vector using a pre-trained BERT model. (Note that BERT tokenizes text into subword units, so a single word may correspond to more than one token.)

For a reference sentence with tokens \( r_1, r_2, \dots, r_m \) and a candidate sentence with tokens \( c_1, c_2, \dots, c_n \), BERT outputs embeddings:

\[
\mathbf{v}_{r_i} \text{ for reference tokens and } \mathbf{v}_{c_j} \text{ for candidate tokens}.
\]

These embeddings are high-dimensional vectors, typically with 768 dimensions for the base BERT model.

Step 2: Compute Cosine Similarity

Next, BERTScore calculates the cosine similarity between every pair of candidate and reference token embeddings:

\[
\text{cosine\_sim}(\mathbf{v}_{c_j}, \mathbf{v}_{r_i}) = \frac{\mathbf{v}_{c_j} \cdot \mathbf{v}_{r_i}}{\|\mathbf{v}_{c_j}\| \, \|\mathbf{v}_{r_i}\|}
\]

Cosine similarity ranges from -1 to 1, where 1 indicates vectors pointing in the same direction (maximal similarity) and -1 indicates vectors pointing in opposite directions.
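As a minimal sketch (using NumPy, with small toy vectors standing in for 768-dimensional BERT embeddings), cosine similarity can be computed like this:

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional vectors standing in for real BERT embeddings.
u = np.array([0.2, 0.5, 0.1, 0.7])
v = np.array([0.2, 0.5, 0.1, 0.7])   # identical direction -> similarity ~1
w = np.array([-0.2, -0.5, -0.1, -0.7])  # opposite direction -> similarity ~-1

print(cosine_sim(u, v))
print(cosine_sim(u, w))
```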

Step 3: Maximal Matching

For each token in the candidate sentence, BERTScore finds the maximum similarity with any token in the reference sentence. This ensures that each candidate word is compared to the most similar reference word:

\[
\text{max\_sim}(c_j) = \max_{i=1}^{m} \text{cosine\_sim}(\mathbf{v}_{c_j}, \mathbf{v}_{r_i})
\]

Similarly, for each reference token, the maximum similarity with candidate tokens is found.

Step 4: Compute Precision, Recall, and F1

Once maximal similarities are determined, the scores are calculated as follows:

  • Precision (P): Average of maximal similarities for candidate tokens:

\[
P = \frac{1}{n} \sum_{j=1}^{n} \text{max\_sim}(c_j)
\]

  • Recall (R): Average of maximal similarities for reference tokens:

\[
R = \frac{1}{m} \sum_{i=1}^{m} \text{max\_sim}(r_i)
\]

  • F1 Score (F1): Harmonic mean of precision and recall:

\[
F_1 = \frac{2 \cdot P \cdot R}{P + R}
\]

The F1 score is often reported as the final BERTScore because it balances precision and recall.
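Steps 2–4 combine into a few lines of NumPy. This is a simplified sketch of the unweighted formulas above, applied to toy embedding matrices rather than real model outputs:

```python
import numpy as np

def bertscore_prf(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """Unweighted BERTScore precision, recall, and F1 from token embeddings.
    cand_emb: (n, d) candidate token embeddings; ref_emb: (m, d) reference."""
    cand_n = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref_n = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand_n @ ref_n.T                 # (n, m) pairwise cosine similarities
    precision = sim.max(axis=1).mean()     # average best match per candidate token
    recall = sim.max(axis=0).mean()        # average best match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: two tokens each; the second pair matches only partially.
cand = np.array([[1.0, 0.0], [0.0, 1.0]])
ref = np.array([[1.0, 0.0], [0.6, 0.8]])
P, R, F1 = bertscore_prf(cand, ref)  # P = R = F1 = 0.9 for these toy vectors
```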

Step 5: Optional IDF Weighting

To improve accuracy, BERTScore can use Inverse Document Frequency (IDF) weighting. Words that appear frequently in many sentences (like “the” or “is”) contribute less to the score, while rarer, more meaningful words carry more weight.

The IDF-weighted precision and recall formulas are:

\[
P_{\text{idf}} = \frac{\sum_{j=1}^{n} \text{IDF}(c_j) \cdot \text{max\_sim}(c_j)}{\sum_{j=1}^{n} \text{IDF}(c_j)}
\]

\[
R_{\text{idf}} = \frac{\sum_{i=1}^{m} \text{IDF}(r_i) \cdot \text{max\_sim}(r_i)}{\sum_{i=1}^{m} \text{IDF}(r_i)}
\]

The F1 score is then computed using these weighted values. This method often produces a more accurate measure of semantic similarity.
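The weighted formulas replace the simple averages with IDF-weighted ones. In the sketch below the embeddings and IDF weights are invented for illustration (in practice IDF values come from a corpus):

```python
import numpy as np

def bertscore_idf(cand_emb, ref_emb, cand_idf, ref_idf):
    """IDF-weighted BERTScore sketch.
    cand_idf / ref_idf hold one IDF weight per candidate / reference token."""
    cand_n = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref_n = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand_n @ ref_n.T
    p = np.sum(cand_idf * sim.max(axis=1)) / np.sum(cand_idf)
    r = np.sum(ref_idf * sim.max(axis=0)) / np.sum(ref_idf)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

# Same toy embeddings as before; the first token of each sentence is
# treated as rarer (IDF 2.0), so its match contributes more.
cand = np.array([[1.0, 0.0], [0.0, 1.0]])
ref = np.array([[1.0, 0.0], [0.6, 0.8]])
P_idf, R_idf, F1_idf = bertscore_idf(
    cand, ref, np.array([2.0, 1.0]), np.array([2.0, 1.0])
)
```

With uniform IDF weights this reduces exactly to the unweighted precision and recall.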

Practical Example of BERTScore

Let’s look at a simple example:

Reference Sentence: “The cat is sitting on the mat.”
Candidate Sentence: “A cat sits on a mat.”

  1. Each word is converted into embeddings.
  2. Cosine similarity is calculated between all word pairs.
  3. Maximal similarities are selected for candidate and reference words.
  4. Precision and recall are computed, showing how much the candidate overlaps with the reference.
  5. The F1 score yields a high final BERTScore (illustratively, around 0.92), reflecting strong semantic similarity despite small surface differences.

This example illustrates why BERTScore is more effective than traditional metrics, which might penalize differences like “sitting” vs. “sits.”

Why BERTScore is Important

Semantic Understanding

BERTScore captures meaning, not just exact words. This is essential for applications like summarization or machine translation, where wording can vary widely but meaning should remain intact.

Flexibility Across Languages

Because multilingual BERT variants produce contextual embeddings for many languages, BERTScore can evaluate text well beyond English, making it a broadly applicable metric for text evaluation.

Alignment With Human Judgment

Studies have shown that BERTScore correlates better with human judgment than traditional metrics, giving developers a more reliable tool for evaluating text quality.

Tips for Using BERTScore Effectively

Choose the Right Model

Using a pre-trained BERT variant suitable for your language or domain can significantly improve score accuracy.

Consider IDF Weighting

For longer texts, IDF weighting can prevent common words from skewing results and emphasize meaningful terms.

Combine With Other Metrics

While BERTScore is powerful, combining it with traditional metrics like BLEU or ROUGE can provide a more comprehensive evaluation.

FAQ About BERTScore

1. What does BERTScore measure?
BERTScore measures the semantic similarity between a candidate and reference sentence using contextual word embeddings.

2. How is BERTScore calculated mathematically?
It’s calculated using word embeddings, cosine similarity, maximal matching, and precision, recall, and F1 formulas, optionally weighted by IDF.

3. Is BERTScore better than BLEU?
In most semantic evaluation settings, yes: BERTScore focuses on meaning rather than exact word matches and typically aligns better with human judgment, though BLEU remains a cheap, widely reported baseline.

4. Can BERTScore be used for languages other than English?
Absolutely. Multilingual BERT models allow BERTScore to evaluate text similarity in many languages.

5. Does BERTScore require long texts to be accurate?
No, it works for both short and long texts, though IDF weighting improves accuracy for longer sentences or documents.
