Evaluation

Evaluating classifications plays an important role in record linkage. Express the quality of a classification in terms of accuracy, recall and F-score, all of which are computed from the numbers of true positives, false positives, true negatives and false negatives.

reduction_ratio(links_pred, *total)

Compute the reduction ratio.

The reduction ratio is 1 minus the ratio of the number of candidate matches to the maximum number of pairs possible.

Parameters:
  • links_pred (int, pandas.MultiIndex) – The number of candidate record pairs or the pandas.MultiIndex with record pairs.
  • *total (pandas.DataFrame object(s)) – The DataFrames are used to compute the full index size with the full_index_size function.
Returns:

float – The reduction ratio.
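The arithmetic behind the reduction ratio can be sketched in plain Python. This is only an illustration of the formula above, not the library's implementation; the helper name reduction_ratio_sketch and the example sizes are assumptions made for this sketch.

```python
def reduction_ratio_sketch(n_candidates, n_full):
    """Return 1 - n_candidates / n_full (hypothetical helper, not the library API)."""
    if n_full <= 0:
        raise ValueError("n_full must be a positive number of record pairs")
    return 1 - n_candidates / n_full


# Linking 1,000 records against 2,000 records allows 1000 * 2000 pairs.
n_full = 1000 * 2000

# Suppose an indexing step keeps 5,000 candidate pairs.
rr = reduction_ratio_sketch(5000, n_full)
print(rr)
```

A value close to 1 means the indexing step discarded almost all of the possible pairs.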

max_pairs(shape)

[DEPRECATED] Compute the maximum number of record pairs possible.

full_index_size(*args)

Compute the number of records in a full index.

Compute the number of record pairs in a full index without building the index itself. The result is the maximum number of record pairs possible. This function is especially useful for measures such as reduction_ratio.

Deduplication: Given a DataFrame A with length N, the full index size is N*(N-1)/2.

Linking: Given a DataFrame A with length N and a DataFrame B with length M, the full index size is N*M.

Parameters:
  • *args (int, pandas.MultiIndex, pandas.Series, pandas.DataFrame) – A pandas object or an int representing the length of a dataset to link. When there is one argument, the record linkage is assumed to be a deduplication process.

Examples

Use integers:

>>> full_index_size(10)  # deduplication: 45 pairs
>>> full_index_size(10, 10)  # linking: 100 pairs

or pandas objects:

>>> full_index_size(DF)  # deduplication: len(DF)*(len(DF)-1)/2 pairs
>>> full_index_size(DF, DF)  # linking: len(DF)*len(DF) pairs
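The two formulas can be checked against a brute-force enumeration of the pairs. The sketch below is plain Python and only mirrors the formulas stated above; full_index_size_sketch is a hypothetical helper that accepts integer lengths only, and it is not the library function.

```python
from itertools import combinations, product


def full_index_size_sketch(*lengths):
    """Number of pairs in a full index for one or two dataset lengths (illustrative)."""
    if len(lengths) == 1:
        # Deduplication: every unordered pair of distinct records.
        n = lengths[0]
        return n * (n - 1) // 2
    if len(lengths) == 2:
        # Linking: every record of A paired with every record of B.
        n, m = lengths
        return n * m
    raise ValueError("this sketch expects one or two lengths")


# Brute-force counts for comparison.
dedup_pairs = len(list(combinations(range(10), 2)))
link_pairs = len(list(product(range(10), range(10))))

print(full_index_size_sketch(10), dedup_pairs)
print(full_index_size_sketch(10, 10), link_pairs)
```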

true_positives(links_true, links_pred)

Count the number of True Positives.

Returns the number of correctly predicted links, also called the number of True Positives (TP).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted links.
Returns:

int – The number of correctly predicted links.
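Conceptually, the true positives are the record pairs that occur in both collections. A minimal sketch with pandas, assuming both arguments are pandas.MultiIndex objects (the record identifiers below are made up for illustration, and this is not necessarily how the library computes the count):

```python
import pandas as pd

# Hypothetical record pairs: (index in dataset A, index in dataset B).
links_true = pd.MultiIndex.from_tuples([(1, 11), (2, 12), (3, 13), (4, 14)])
links_pred = pd.MultiIndex.from_tuples([(1, 11), (2, 12), (5, 15)])

# Pairs that are both true links and predicted links.
tp = len(links_true.intersection(links_pred))
print(tp)  # 2
```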

true_negatives(links_true, links_pred, total)

Count the number of True Negatives.

Returns the number of correctly predicted non-links, also called the number of True Negatives (TN).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted links.
  • total (int, pandas.MultiIndex) – The count of all record pairs (both links and non-links). When the argument is a pandas.MultiIndex, the length of the index is used.
Returns:

int – The number of correctly predicted non-links.
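Because non-links are usually not listed explicitly, the number of true negatives follows from the total: it is whatever remains after removing the true positives, false positives and false negatives. The sketch below uses plain Python sets as stand-ins for the link collections, and the total of 100 pairs is an assumed value.

```python
# Hypothetical record pairs; sets stand in for pandas.MultiIndex objects.
links_true = {(1, 11), (2, 12), (3, 13), (4, 14)}
links_pred = {(1, 11), (2, 12), (5, 15)}
total = 100  # assumed count of all record pairs (links and non-links)

tp = len(links_true & links_pred)  # predicted and true
fp = len(links_pred - links_true)  # predicted but not true
fn = len(links_true - links_pred)  # true but not predicted

# Everything else was correctly left unlinked.
tn = total - tp - fp - fn
print(tn)  # 95
```

The same number is obtained as total minus the size of the union of both collections.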

false_positives(links_true, links_pred)

Count the number of False Positives.

Returns the number of incorrectly predicted links (true non-links that are predicted as links). This value is known as the number of False Positives (FP).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted links.
Returns:

int – The number of false positives.

false_negatives(links_true, links_pred)

Count the number of False Negatives.

Returns the number of incorrectly predicted non-links (true links that are predicted as non-links). This value is known as the number of False Negatives (FN).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted links.
Returns:

int – The number of false negatives.
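False positives and false negatives are the two set differences between the predicted and the true links. The sketch below illustrates this with plain Python sets standing in for the link collections; the record pairs are made up.

```python
# Hypothetical record pairs; sets stand in for pandas.MultiIndex objects.
links_true = {(1, 11), (2, 12), (3, 13), (4, 14)}
links_pred = {(1, 11), (2, 12), (5, 15)}

# Predicted as links, but not true links: false positives.
false_pos = links_pred - links_true

# True links that were not predicted: false negatives.
false_neg = links_true - links_pred

print(len(false_pos), len(false_neg))  # 1 2
```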

confusion_matrix(links_true, links_pred, total=None)

Compute the confusion matrix.

The confusion matrix is of the following form:

                    Predicted Positives     Predicted Negatives
Actual Positives    True Positives (TP)     False Negatives (FN)
Actual Negatives    False Positives (FP)    True Negatives (TN)

The confusion matrix is an informative way to analyse a prediction. The matrix can be used to compute measures like precision and recall. The count of true positives is at [0,0], false negatives at [0,1], false positives at [1,0] and true negatives at [1,1].

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted links.
  • total (int, pandas.MultiIndex) – The count of all record pairs (both links and non-links). When the argument is a pandas.MultiIndex, the length of the index is used. If the total is None, the number of True Negatives is not computed. Default None.
Returns:

numpy.array – The confusion matrix with TP, TN, FN, FP values.

Note

The number of True Negatives is computed based on the total argument. This argument is the number of record pairs of the entire matrix.
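The layout described above can be sketched with a nested list. This is an illustration only: confusion_matrix_sketch is a hypothetical helper, it returns a list rather than a numpy array, and using None as the placeholder when total is not given is an assumption of this sketch.

```python
def confusion_matrix_sketch(links_true, links_pred, total=None):
    """Return [[TP, FN], [FP, TN]] for two collections of record pairs."""
    true_set, pred_set = set(links_true), set(links_pred)
    tp = len(true_set & pred_set)
    fn = len(true_set - pred_set)
    fp = len(pred_set - true_set)
    # Without a total, the true negatives cannot be derived.
    tn = None if total is None else total - tp - fn - fp
    return [[tp, fn], [fp, tn]]


links_true = [(1, 11), (2, 12), (3, 13), (4, 14)]
links_pred = [(1, 11), (2, 12), (5, 15)]

matrix = confusion_matrix_sketch(links_true, links_pred, total=100)
print(matrix)  # [[2, 2], [1, 95]]
```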

precision(links_true, links_pred)

Compute the precision.

The precision is given by TP/(TP+FP).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) collection of links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted collection of links.
Returns:

float – The precision.

recall(links_true, links_pred)

Compute the recall/sensitivity.

The recall is given by TP/(TP+FN).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) collection of links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted collection of links.
Returns:

float – The recall.
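A small worked example of the precision and recall formulas, using plain Python sets as stand-ins for the link collections (the record pairs are made up for illustration):

```python
# Hypothetical record pairs; sets stand in for pandas.MultiIndex objects.
links_true = {(1, 11), (2, 12), (3, 13), (4, 14)}
links_pred = {(1, 11), (2, 12), (5, 15)}

tp = len(links_true & links_pred)  # 2
fp = len(links_pred - links_true)  # 1
fn = len(links_true - links_pred)  # 2

precision = tp / (tp + fp)  # share of predicted links that are correct
recall = tp / (tp + fn)     # share of true links that were found

print(precision, recall)
```

Here two of the three predicted links are correct (precision 2/3) and two of the four true links are found (recall 1/2).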

accuracy(links_true, links_pred, total)

Compute the accuracy.

The accuracy is given by (TP+TN)/(TP+FP+TN+FN).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) collection of links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted collection of links.
  • total (int, pandas.MultiIndex) – The count of all record pairs (both links and non-links). When the argument is a pandas.MultiIndex, the length of the index is used.
Returns:

float – The accuracy.

specificity(links_true, links_pred, total)

Compute the specificity.

The specificity is given by TN/(FP+TN).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) collection of links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted collection of links.
  • total (int, pandas.MultiIndex) – The count of all record pairs (both links and non-links). When the argument is a pandas.MultiIndex, the length of the index is used.
Returns:

float – The specificity.
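Accuracy and specificity both depend on the number of true negatives, which is why they take the total argument. A worked sketch in plain Python, assuming example counts of TP=2, FP=1 and FN=2 out of 100 record pairs:

```python
# Assumed example counts; total is the number of all record pairs.
tp, fp, fn = 2, 1, 2
total = 100

# The true negatives are derived from the total.
tn = total - tp - fp - fn  # 95

accuracy = (tp + tn) / (tp + fp + tn + fn)
specificity = tn / (fp + tn)

print(accuracy, specificity)
```

With so few true links among the pairs, both values are dominated by the true negatives; this class imbalance is typical in record linkage, so high values here say little about how well the links themselves are predicted.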

fscore(links_true, links_pred)

Compute the F-score.

The F-score is given by 2*(precision*recall)/(precision+recall).

Parameters:
  • links_true (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The true (or actual) collection of links.
  • links_pred (pandas.MultiIndex, pandas.DataFrame, pandas.Series) – The predicted collection of links.
Returns:

float – The F-score.

Note

If there are no pairs predicted as links, this measure will raise a ZeroDivisionError.
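The formula and the zero-division case can be sketched as follows. fscore_sketch is a hypothetical helper that works on counts rather than on link collections; the explicit check mirrors the note above, but the error message is made up for this sketch.

```python
def fscore_sketch(tp, fp, fn):
    """F-score from counts: 2 * (precision * recall) / (precision + recall)."""
    if tp + fp == 0:
        # No pairs were predicted as links, so the precision is undefined.
        raise ZeroDivisionError("no pairs predicted as links")
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)


f = fscore_sketch(tp=2, fp=1, fn=2)
print(f)

try:
    fscore_sketch(tp=0, fp=0, fn=3)
except ZeroDivisionError as exc:
    print("undefined:", exc)
```

With TP=2, FP=1 and FN=2 the result equals 4/7, which matches the equivalent form 2*TP/(2*TP+FP+FN).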