Record pair classification

In the context of record linkage, classification refers to the process of dividing record pairs into matches and non-matches (distinct pairs). There are dozens of classification algorithms for record linkage. Roughly speaking, classification algorithms fall into two groups:

  • supervised learning algorithms - These algorithms require training data, i.e. record pairs for which the true match status is known. If you have such training data, you can use supervised learning algorithms. Most supervised learning algorithms offer good accuracy and reliability. Examples of supervised learning algorithms in the Python Record Linkage Toolkit are Logistic Regression, Naive Bayes and Support Vector Machines.
  • unsupervised learning algorithms - These algorithms do not need training data. The Python Record Linkage Toolkit supports K-means clustering and an Expectation/Conditional Maximisation classifier.

First things first

The examples below make use of the Krebs register (German for cancer registry) dataset. The Krebs register dataset contains comparison vectors of a large set of record pairs. For each record pair, it is known whether the records represent the same person (match) or not (non-match); this was determined through an extensive clerical review. First, import the recordlinkage module and load the Krebs register data. The dataset contains 5749132 compared record pairs and has the following variables: first name, last name, sex, birthday, birth month, birth year and zip code. Of these record pairs, 20931 are matches (len(krebs_match) == 20931).

In [2]:
import recordlinkage as rl
from recordlinkage.datasets import load_krebsregister

krebs_data, krebs_match = load_krebsregister(missing_values=0)
krebs_data
Out[2]:
cmp_firstname1 cmp_firstname2 cmp_lastname1 cmp_lastname2 cmp_sex cmp_birthday cmp_birthmonth cmp_birthyear cmp_zipcode
id1 id2
22161 38467 1.00000 0.0 0.14286 0.0 1 0.0 1.0 0.0 0.0
38713 75352 0.00000 0.0 0.57143 0.0 1 0.0 0.0 0.0 0.0
13699 32825 0.16667 0.0 0.00000 0.0 0 1.0 1.0 1.0 0.0
22709 37682 0.28571 0.0 1.00000 0.0 1 0.0 0.0 0.0 0.0
2342 69060 0.25000 0.0 0.12500 0.0 1 1.0 1.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ...
52124 53629 1.00000 0.0 0.28571 0.0 1 0.0 0.0 1.0 0.0
30007 76846 0.75000 0.0 0.00000 0.0 1 1.0 0.0 0.0 0.0
50546 59461 0.75000 0.0 0.00000 0.0 1 0.0 1.0 0.0 0.0
43175 62151 1.00000 0.0 0.11111 0.0 1 0.0 1.0 0.0 0.0
11651 57925 1.00000 0.0 0.00000 0.0 1 0.0 0.0 1.0 0.0

5749132 rows × 9 columns

Most classifiers cannot handle comparison vectors with missing values. To prevent issues with the classification algorithms, we convert the missing values into disagreeing comparisons (using the argument missing_values=0). This approach to handling missing values is widely used in record linkage applications.
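
In essence, this conversion is equivalent to filling missing similarity values with zero. A minimal sketch of the idea on hypothetical toy data (not the loader's actual implementation):

import numpy as np
import pandas as pd

# Hypothetical comparison vectors with missing similarity values
vectors = pd.DataFrame({
    "cmp_lastname1": [1.0, np.nan, 0.5],
    "cmp_zipcode": [0.0, 1.0, np.nan],
})

# Treat a missing comparison as full disagreement (similarity 0)
vectors = vectors.fillna(0.0)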

In [3]:
krebs_data.describe().T
Out[3]:
count mean std min 25% 50% 75% max
cmp_firstname1 5749132.0 0.71278 0.38884 0.0 0.28571 1.00000 1.00000 1.0
cmp_firstname2 5749132.0 0.01623 0.12520 0.0 0.00000 0.00000 0.00000 1.0
cmp_lastname1 5749132.0 0.31563 0.33423 0.0 0.10000 0.18182 0.42857 1.0
cmp_lastname2 5749132.0 0.00014 0.01008 0.0 0.00000 0.00000 0.00000 1.0
cmp_sex 5749132.0 0.95500 0.20730 0.0 1.00000 1.00000 1.00000 1.0
cmp_birthday 5749132.0 0.22443 0.41721 0.0 0.00000 0.00000 0.00000 1.0
cmp_birthmonth 5749132.0 0.48879 0.49987 0.0 0.00000 0.00000 1.00000 1.0
cmp_birthyear 5749132.0 0.22272 0.41607 0.0 0.00000 0.00000 0.00000 1.0
cmp_zipcode 5749132.0 0.00552 0.07407 0.0 0.00000 0.00000 0.00000 1.0

Supervised learning

As described above, supervised learning algorithms require training data: comparison vectors for which the true match status is known. In the example in this section, we assume that the true match status of the first 5000 record pairs of the Krebs register data is known.

In [4]:
golden_pairs = krebs_data[0:5000]
golden_matches_index = golden_pairs.index.intersection(krebs_match)  # 2093 matching pairs

Logistic regression

The recordlinkage.LogisticRegressionClassifier is an application of the logistic regression model. This supervised learning method is one of the oldest classification algorithms used in record linkage. Given enough training data, the algorithm gives relatively good results.

In [5]:
# Initialize the classifier
logreg = rl.LogisticRegressionClassifier()

# Train the classifier
logreg.learn(golden_pairs, golden_matches_index)
print ("Intercept: ", logreg.intercept)
print ("Coefficients: ", logreg.coefficients)
Intercept:  -6.2980435710064135
Coefficients:  [  4.90452843e-01   1.21640484e-01   2.15040485e+00  -2.84818101e-03
  -1.79712465e+00   9.61085558e-01   6.72610441e-02   1.03408608e+00
   4.30556110e+00]
In [6]:
# Predict the match status for all record pairs
result_logreg = logreg.predict(krebs_data)

len(result_logreg)
Out[6]:
20150
In [7]:
conf_logreg = rl.confusion_matrix(krebs_match, result_logreg, len(krebs_data))
conf_logreg
Out[7]:
array([[  19884,    1047],
       [    266, 5727935]])
In [8]:
# The F-score for this prediction is
rl.fscore(conf_logreg)
Out[8]:
0.96804
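
The F-score follows directly from the confusion matrix above, which is laid out as [[TP, FN], [FP, TN]]. As a check, computing precision and recall by hand:

tp, fn, fp = 19884, 1047, 266

precision = tp / (tp + fp)  # ~0.98680
recall = tp / (tp + fn)     # ~0.94998
fscore = 2 * precision * recall / (precision + recall)
print(fscore)  # ~0.96804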

The predicted number of matches (20150) is close to the 20931 true matches. This result was achieved with a small training dataset of only 5000 record pairs.

In (older) literature, record linkage procedures are often divided into deterministic record linkage and probabilistic record linkage. The Logistic Regression Classifier belongs to the deterministic record linkage methods. Each feature/variable has a certain importance (called a weight). Each weight is multiplied by the corresponding value in the comparison/similarity vector. If the total sum exceeds a certain threshold, the pair is considered to be a match.
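
A rough sketch of this scoring rule on a single comparison vector, using the same hand-picked weights as in the next cell (the vector itself is hypothetical; a pair is classified as a match when the linear score exceeds zero, the logistic regression decision boundary):

import numpy as np

intercept = -9
coefficients = np.array([2.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0])

# Hypothetical comparison vector: agreement on last name, sex,
# birthday, birth year and zip code
vector = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])

score = intercept + np.dot(coefficients, vector)  # -9 + 10 = 1
print(score > 0)  # True: the pair is classified as a match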

In [9]:
intercept = -9
coefficients = [2.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0]

logreg = rl.LogisticRegressionClassifier(coefficients, intercept)

# predict without calling LogisticRegressionClassifier.learn
matches = logreg.predict(krebs_data)
print(len(matches))
21303
In [10]:
conf_logreg = rl.confusion_matrix(krebs_match, matches, len(krebs_data))
conf_logreg

# The F-score for this classification is
rl.fscore(conf_logreg)
Out[10]:
0.98769

For these hand-picked coefficients, the F-score is even better than the one obtained with training data. Surprising? Not really: the classifier above was trained on only 5000 record pairs (use more training data and the trained result will improve).

Naive Bayes

In contrast to the logistic regression classifier, the Naive Bayes classifier is a probabilistic classifier. The probabilistic record linkage framework by Fellegi and Sunter (1969) is the most well-known probabilistic classification method for record linkage. Later, this framework was shown to be mathematically equivalent to the Naive Bayes method when independence between the comparison variables is assumed.
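
To give a flavour of the Fellegi and Sunter idea (with hypothetical m- and u-probabilities, not values estimated from the Krebs register data): each attribute contributes a log-likelihood ratio to the total match weight of a pair.

import math

# Hypothetical probabilities for a single attribute (e.g. last name)
m = 0.95  # P(agreement | pair is a match)
u = 0.01  # P(agreement | pair is a non-match)

agreement_weight = math.log(m / u)                 # added if the attribute agrees
disagreement_weight = math.log((1 - m) / (1 - u))  # added if it disagrees

# Under the independence assumption, the total weight of a pair is the
# sum of the per-attribute weights; pairs above a threshold are matches.
print(agreement_weight, disagreement_weight)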

In [11]:
# Train the classifier
nb = rl.NaiveBayesClassifier()
nb.learn(golden_pairs, golden_matches_index)

# Predict the match status for all record pairs
result_nb = nb.predict(krebs_data)

len(result_nb)
Out[11]:
20345
In [12]:
conf_nb = rl.confusion_matrix(krebs_match, result_nb, len(krebs_data))
conf_nb
Out[12]:
array([[  20023,     908],
       [    322, 5727879]])
In [13]:
# The F-score for this classification is
rl.fscore(conf_nb)
Out[13]:
0.97020

Support Vector Machines

Support Vector Machines (SVM) have become increasingly popular in record linkage. The algorithm performs well when only a small amount of training data is available. The SVM implementation in the Python Record Linkage Toolkit is a linear SVM algorithm.
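
A linear SVM separates matches from non-matches with a hyperplane in the space of comparison vectors. The sketch below illustrates this with scikit-learn's LinearSVC as a stand-in (an assumption for clarity; the toolkit's internal implementation may differ), reusing the training data defined above:

from sklearn.svm import LinearSVC

# Label each golden pair as match (True) or non-match (False)
y_train = golden_pairs.index.isin(golden_matches_index)

clf = LinearSVC()
clf.fit(golden_pairs.values, y_train)

# Predict a match status for every comparison vector
predicted = clf.predict(krebs_data.values)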

In [14]:
# Train the classifier
svm = rl.SVMClassifier()
svm.learn(golden_pairs, golden_matches_index)

# Predict the match status for all record pairs
result_svm = svm.predict(krebs_data)

len(result_svm)
Out[14]:
20839
In [15]:
conf_svm = rl.confusion_matrix(krebs_match, result_svm, len(krebs_data))
conf_svm
Out[15]:
array([[  20825,     106],
       [     14, 5728187]])
In [16]:
# The F-score for this classification is
rl.fscore(conf_svm)
Out[16]:
0.99713

Unsupervised learning

In situations without training data, unsupervised learning can be a solution for record linkage problems. In this section, we discuss two unsupervised learning methods. One algorithm is K-means clustering; the other is an implementation of the Expectation/Conditional Maximisation (ECM) algorithm. Unsupervised learning algorithms usually take more computation time because of their iterative structure.

K-means clustering

The K-means clustering algorithm is well known and widely used in big data analysis. The K-means classifier in the Python Record Linkage Toolkit is configured in such a way that it can be used for linking records. For more information about K-means clustering, see Wikipedia.
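
The underlying idea is to cluster the comparison vectors into two groups and to label the cluster with the higher overall similarity as the matches. A rough sketch of that logic with scikit-learn (an illustrative assumption about how such a classifier can work, not necessarily the toolkit's exact implementation):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(krebs_data.values)

# The cluster whose centroid has the higher mean similarity is
# interpreted as the match cluster
match_cluster = kmeans.cluster_centers_.mean(axis=1).argmax()
predicted_matches = krebs_data.index[labels == match_cluster]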

In [17]:
kmeans = rl.KMeansClassifier()
result_kmeans = kmeans.learn(krebs_data)

# The predicted number of matches
len(result_kmeans)
Out[17]:
371525

The classifier is now trained and the comparison vectors are classified. Note that the predicted number of matches (371525) is far larger than the 20931 true matches, which explains the low F-score below.

In [18]:
cm_kmeans = rl.confusion_matrix(krebs_match, result_kmeans, len(krebs_data))

rl.fscore(cm_kmeans)
Out[18]:
0.10598

Expectation/Conditional Maximisation Algorithm

The ECM algorithm is an Expectation-Maximisation algorithm with some additional constraints. It is closely related to the Naive Bayes algorithm, and to estimating the parameters of the Fellegi and Sunter (1969) framework. Like Naive Bayes, the algorithm assumes that the comparison attributes are independent of each other.
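
To give a flavour of the estimation, the sketch below runs a deliberately simplified EM iteration for a single binary attribute (toy data and initial guesses; the toolkit's ECM classifier estimates all attributes jointly under the independence assumption):

import numpy as np

def em_step(x, p_match, m, u):
    # E-step: posterior probability that each pair is a match
    like_m = p_match * m**x * (1 - m)**(1 - x)
    like_u = (1 - p_match) * u**x * (1 - u)**(1 - x)
    g = like_m / (like_m + like_u)
    # M-step: re-estimate the match proportion and the m- and u-probabilities
    return g.mean(), (g * x).sum() / g.sum(), ((1 - g) * x).sum() / (1 - g).sum()

# Toy binary agreement indicators, one per record pair
x = np.array([1, 1, 0, 0, 0, 1, 0, 0], dtype=float)
p_match, m, u = 0.5, 0.9, 0.1  # initial guesses
for _ in range(25):
    p_match, m, u = em_step(x, p_match, m, u)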

In [19]:
# Train the classifier. The ECM classifier is trained on binary
# comparison vectors, so the similarity values are binarised with
# a threshold of 0.8.
ecm = rl.ECMClassifier()
result_ecm = ecm.learn((krebs_data > 0.8).astype(int))

len(result_ecm)
Out[19]:
19817
In [20]:
conf_ecm = rl.confusion_matrix(krebs_match, result_ecm, len(krebs_data))
conf_ecm
Out[20]:
array([[  19813,    1118],
       [      4, 5728197]])
In [21]:
# The F-score for this classification is
rl.fscore(conf_ecm)
Out[21]:
0.97246