4. Evaluation

Evaluation of classifications plays an important role in record linkage. Express your classification quality in terms of accuracy, recall and F-score, based on true positives, false positives, true negatives and false negatives.

recordlinkage.reduction_ratio(links_pred, *total)

Compute the reduction ratio.

The reduction ratio is 1 minus the ratio of the number of candidate matches to the maximum number of pairs possible.

Parameters:
  • links_pred (int, pandas.MultiIndex) – The number of candidate record pairs or the pandas.MultiIndex with record pairs.

  • *total (pandas.DataFrame object(s)) – The DataFrames are used to compute the full index size with the full_index_size function.

Returns:

float – The reduction ratio.
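As a worked illustration of the formula above, the following plain-Python sketch computes a reduction ratio from two pair counts. The helper name reduction_ratio_sketch and the numbers are hypothetical; this is not the library's implementation.

```python
def reduction_ratio_sketch(n_candidates, n_total):
    """Return 1 - n_candidates / n_total (hypothetical helper)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 1 - n_candidates / n_total


# Deduplicating 1000 records allows 1000 * 999 / 2 = 499500 pairs at most.
n_total = 1000 * 999 // 2

# Assume an indexing step kept 4995 candidate pairs (illustrative number).
rr = reduction_ratio_sketch(4995, n_total)
print(rr)
```

A value close to 1 means the indexing step discarded most of the possible pairs.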

recordlinkage.true_positives(links_true, links_pred)

Count the number of True Positives.

Returns the number of correctly predicted links, also called the number of True Positives (TP).

Parameters:
Returns:

int – The number of correctly predicted links.

recordlinkage.true_negatives(links_true, links_pred, total)

Count the number of True Negatives.

Returns the number of correctly predicted non-links, also called the number of True Negatives (TN).

Parameters:
Returns:

int – The number of correctly predicted non-links.

recordlinkage.false_positives(links_true, links_pred)

Count the number of False Positives.

Returns the number of incorrect predictions of true non-links (true non-links, but predicted as links). This value is known as the number of False Positives (FP).

Parameters:
Returns:

int – The number of false positives.

recordlinkage.false_negatives(links_true, links_pred)

Count the number of False Negatives.

Returns the number of incorrect predictions of true links (true links, but predicted as non-links). This value is known as the number of False Negatives (FN).

Parameters:
Returns:

int – The number of false negatives.
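The four counts can be illustrated with plain Python sets. This is a minimal sketch that assumes record pairs are represented as (id_a, id_b) tuples and that total is the number of all possible pairs; the library itself works on pandas objects, and the data here is made up.

```python
# Hypothetical data: pairs are (id_a, id_b) tuples.
links_true = {(1, 1), (2, 2), (3, 3), (4, 4)}
links_pred = {(1, 1), (2, 2), (3, 5)}
total = 25  # e.g. linking two datasets of 5 records each: 5 * 5 pairs

tp = len(links_true & links_pred)  # true links that were predicted as links
fp = len(links_pred - links_true)  # predicted links that are not true links
fn = len(links_true - links_pred)  # true links that were not predicted
tn = total - tp - fp - fn          # every remaining pair is a correct non-link

print(tp, fp, fn, tn)
```

The true negatives cannot be counted from the two sets alone, which is why a total is needed.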

recordlinkage.confusion_matrix(links_true, links_pred, total=None)

Compute the confusion matrix.

The confusion matrix is of the following form:

                  Predicted Positives     Predicted Negatives
True Positives    True Positives (TP)     False Negatives (FN)
True Negatives    False Positives (FP)    True Negatives (TN)

The confusion matrix is an informative way to analyse a prediction. The matrix can be used to compute measures like precision and recall. The count of true positives is at [0,0], false negatives at [0,1], false positives at [1,0] and true negatives at [1,1].

Parameters:
Returns:

numpy.array – The confusion matrix with TP, TN, FN, FP values.

Note

The number of True Negatives is computed based on the total argument. This argument is the number of record pairs of the entire matrix.
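The layout described above can be sketched with a nested list. The helper name confusion_matrix_sketch and the counts are hypothetical, and a plain list stands in for the numpy array that the library returns.

```python
def confusion_matrix_sketch(tp, fp, fn, tn):
    """Arrange the four counts as [[TP, FN], [FP, TN]] (hypothetical helper)."""
    return [[tp, fn],
            [fp, tn]]


# Illustrative counts: 25 pairs in total.
matrix = confusion_matrix_sketch(tp=2, fp=1, fn=2, tn=20)

# Index positions follow the description in the text.
tp_count = matrix[0][0]
fn_count = matrix[0][1]
fp_count = matrix[1][0]
tn_count = matrix[1][1]
print(matrix)
```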

recordlinkage.precision(links_true, links_pred)

Compute the precision.

The precision is given by TP/(TP+FP).

Parameters:
Returns:

float – The precision

recordlinkage.recall(links_true, links_pred)

Compute the recall/sensitivity.

The recall is given by TP/(TP+FN).

Parameters:
Returns:

float – The recall
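As a small worked example of the precision and recall formulas, the sketch below evaluates them for made-up counts. The helper names are hypothetical and the code only mirrors the formulas; it is not the library's implementation.

```python
def precision_sketch(tp, fp):
    """TP / (TP + FP): share of predicted links that are true links."""
    return tp / (tp + fp)


def recall_sketch(tp, fn):
    """TP / (TP + FN): share of true links that were predicted."""
    return tp / (tp + fn)


# Illustrative counts: 2 true positives, 1 false positive, 2 false negatives.
p = precision_sketch(tp=2, fp=1)
r = recall_sketch(tp=2, fn=2)
print(p, r)
```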

recordlinkage.accuracy(links_true, links_pred, total)

Compute the accuracy.

The accuracy is given by (TP+TN)/(TP+FP+TN+FN).

Parameters:
Returns:

float – The accuracy

recordlinkage.specificity(links_true, links_pred, total)

Compute the specificity.

The specificity is given by TN/(FP+TN).

Parameters:
Returns:

float – The specificity
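Accuracy and specificity both depend on the number of true negatives. The sketch below evaluates the two formulas for made-up counts; the helper names are hypothetical and the code is not the library's implementation.

```python
def accuracy_sketch(tp, fp, tn, fn):
    """(TP + TN) / (TP + FP + TN + FN): share of all pairs classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def specificity_sketch(fp, tn):
    """TN / (FP + TN): share of true non-links predicted as non-links."""
    return tn / (fp + tn)


# Illustrative counts for 25 record pairs.
acc = accuracy_sketch(tp=2, fp=1, tn=20, fn=2)
spec = specificity_sketch(fp=1, tn=20)
print(acc, spec)
```

Because non-links usually dominate in record linkage, both measures tend to be high even when few true links are found.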

recordlinkage.fscore(links_true, links_pred)

Compute the F-score.

The F-score is given by 2*(precision*recall)/(precision+recall).

Parameters:
Returns:

float – The fscore

Note

If there are no pairs predicted as links, this measure will raise a ZeroDivisionError.
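The sketch below applies the F-score formula to raw counts and shows how an empty prediction leads to a division by zero. The helper fscore_sketch is hypothetical plain Python that mirrors the formula; it is not the library function.

```python
def fscore_sketch(tp, fp, fn):
    """F-score from raw counts (hypothetical helper)."""
    precision = tp / (tp + fp)  # fails when nothing was predicted as a link
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)


# Illustrative counts: precision 2/3, recall 1/2.
f = fscore_sketch(tp=2, fp=1, fn=2)
print(f)

# With no predicted links, tp + fp == 0 and the precision is undefined.
try:
    fscore_sketch(tp=0, fp=0, fn=4)
    raised = False
except ZeroDivisionError:
    raised = True
print(raised)
```

A caller that may receive an empty prediction can catch the exception and report the score as undefined.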

recordlinkage.max_pairs(shape)

[DEPRECATED] Compute the maximum number of record pairs possible.

recordlinkage.full_index_size(*args)

Compute the number of record pairs in a full index.

Compute the number of record pairs in a full index without building the index itself. The result is the maximum number of record pairs possible. This function is especially useful in measures like the reduction_ratio.

Deduplication: Given a DataFrame A with length N, the full index size is N*(N-1)/2.

Linking: Given a DataFrame A with length N and a DataFrame B with length M, the full index size is N*M.

Parameters:

*args (int, pandas.MultiIndex, pandas.Series, pandas.DataFrame) – A pandas object or an int representing the length of a dataset to link. When there is one argument, it is assumed that the record linkage is a deduplication process.

Examples

Use integers:

>>> full_index_size(10)  # deduplication: 45 pairs
>>> full_index_size(10, 10)  # linking: 100 pairs

or pandas objects:

>>> full_index_size(DF)  # deduplication: len(DF)*(len(DF)-1)/2 pairs
>>> full_index_size(DF, DF)  # linking: len(DF)*len(DF) pairs
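The two formulas can be checked with a short plain-Python sketch. The helper full_index_size_sketch is hypothetical and accepts only integer lengths; it mirrors the formulas above and is not the library's implementation.

```python
def full_index_size_sketch(*lengths):
    """Maximum number of record pairs for one or two dataset lengths."""
    if len(lengths) == 1:
        # Deduplication: unordered pairs within a single dataset.
        n = lengths[0]
        return n * (n - 1) // 2
    if len(lengths) == 2:
        # Linking: every record of A paired with every record of B.
        n, m = lengths
        return n * m
    raise ValueError("expected one or two dataset lengths")


dedup_pairs = full_index_size_sketch(10)
link_pairs = full_index_size_sketch(10, 10)
print(dedup_pairs, link_pairs)
```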