# Classification algorithms¶

In the context of record linkage, classification refers to the process of dividing record pairs into matches and non-matches (distinct pairs). There are dozens of classification algorithms for record linkage. Roughly speaking, classification algorithms fall into two groups:

**Supervised learning algorithms** - These algorithms make use of training data. If you have training data, you can use supervised learning algorithms. Most supervised learning algorithms offer good accuracy and reliability. Examples of supervised learning algorithms in the *Python Record Linkage Toolkit* are *Logistic Regression*, *Naive Bayes* and *Support Vector Machines*.

**Unsupervised learning algorithms** - These algorithms do not need training data. The *Python Record Linkage Toolkit* supports *K-means clustering* and an *Expectation/Conditional Maximisation* classifier.

**First things first**

The examples below make use of the Krebs register (German for cancer registry) dataset. This dataset contains comparison vectors for a large set of record pairs. For each record pair, it is known whether the records represent the same person (match) or not (non-match); this was established with a large-scale clerical review. First, import the recordlinkage module and load the Krebs register data. The dataset contains 5749132 compared record pairs with the following variables: first name, last name, sex, birthday, birth month, birth year and zip code. Of these pairs, 20931 (`len(krebs_true_links)`) are matches.

```
[2]:
```

```
import recordlinkage as rl
from recordlinkage.datasets import load_krebsregister
krebs_X, krebs_true_links = load_krebsregister(missing_values=0)
krebs_X
```

```
[2]:
```

| id1 | id2 | cmp_firstname1 | cmp_firstname2 | cmp_lastname1 | cmp_lastname2 | cmp_sex | cmp_birthday | cmp_birthmonth | cmp_birthyear | cmp_zipcode |
|---|---|---|---|---|---|---|---|---|---|---|
| 22161 | 38467 | 1.00000 | 0.0 | 0.14286 | 0.0 | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 38713 | 75352 | 0.00000 | 0.0 | 0.57143 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 13699 | 32825 | 0.16667 | 0.0 | 0.00000 | 0.0 | 0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 22709 | 37682 | 0.28571 | 0.0 | 1.00000 | 0.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2342 | 69060 | 0.25000 | 0.0 | 0.12500 | 0.0 | 1 | 1.0 | 1.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52124 | 53629 | 1.00000 | 0.0 | 0.28571 | 0.0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 30007 | 76846 | 0.75000 | 0.0 | 0.00000 | 0.0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 50546 | 59461 | 0.75000 | 0.0 | 0.00000 | 0.0 | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 43175 | 62151 | 1.00000 | 0.0 | 0.11111 | 0.0 | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 11651 | 57925 | 1.00000 | 0.0 | 0.00000 | 0.0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 |

5749132 rows × 9 columns

Most classifiers cannot handle comparison vectors with missing values. To prevent issues with the classification algorithms, we convert the missing values into disagreeing comparisons (using the argument `missing_values=0`). This approach to handling missing values is widely used in record linkage applications.
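What `missing_values=0` effectively does can be sketched on a tiny hypothetical comparison table (the column names below are illustrative, not the Krebs register columns):

```python
import numpy as np
import pandas as pd

# Hypothetical comparison vectors with missing similarity scores
vectors = pd.DataFrame({
    "cmp_lastname": [1.0, np.nan, 0.5],
    "cmp_zipcode": [np.nan, 0.0, 1.0],
})

# Treat a missing comparison as full disagreement (score 0), which is
# what load_krebsregister(missing_values=0) arranges for its dataset.
vectors_filled = vectors.fillna(0.0)
print(vectors_filled["cmp_lastname"].tolist())  # [1.0, 0.0, 0.5]
```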

```
[3]:
```

```
krebs_X.describe().T
```

```
[3]:
```

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| cmp_firstname1 | 5749132.0 | 0.71278 | 0.38884 | 0.0 | 0.28571 | 1.00000 | 1.00000 | 1.0 |
| cmp_firstname2 | 5749132.0 | 0.01623 | 0.12520 | 0.0 | 0.00000 | 0.00000 | 0.00000 | 1.0 |
| cmp_lastname1 | 5749132.0 | 0.31563 | 0.33423 | 0.0 | 0.10000 | 0.18182 | 0.42857 | 1.0 |
| cmp_lastname2 | 5749132.0 | 0.00014 | 0.01008 | 0.0 | 0.00000 | 0.00000 | 0.00000 | 1.0 |
| cmp_sex | 5749132.0 | 0.95500 | 0.20730 | 0.0 | 1.00000 | 1.00000 | 1.00000 | 1.0 |
| cmp_birthday | 5749132.0 | 0.22443 | 0.41721 | 0.0 | 0.00000 | 0.00000 | 0.00000 | 1.0 |
| cmp_birthmonth | 5749132.0 | 0.48879 | 0.49987 | 0.0 | 0.00000 | 0.00000 | 1.00000 | 1.0 |
| cmp_birthyear | 5749132.0 | 0.22272 | 0.41607 | 0.0 | 0.00000 | 0.00000 | 0.00000 | 1.0 |
| cmp_zipcode | 5749132.0 | 0.00552 | 0.07407 | 0.0 | 0.00000 | 0.00000 | 0.00000 | 1.0 |

## Supervised learning¶

As described before, supervised learning algorithms require training data: data for which the true match status of each comparison vector is known. In this section, we assume that the true match status of the first 5000 record pairs of the Krebs register data is known.

```
[4]:
```

```
golden_pairs = krebs_X[0:5000]
golden_matches_index = golden_pairs.index.intersection(krebs_true_links)  # 2093 matching pairs
```

### Logistic regression¶

The `recordlinkage.LogisticRegressionClassifier` classifier is an application of the logistic regression model. This supervised learning method is one of the oldest classification algorithms used in record linkage. Given enough training data, the algorithm gives relatively good results.

```
[5]:
```

```
# Initialize the classifier
logreg = rl.LogisticRegressionClassifier()
# Train the classifier
logreg.fit(golden_pairs, golden_matches_index)
print("Intercept: ", logreg.intercept)
print("Coefficients: ", logreg.coefficients)
```

```
Intercept: -6.298043571006437
Coefficients: [ 4.90452843e-01 1.21640484e-01 2.15040485e+00 -2.84818101e-03
-1.79712465e+00 9.61085558e-01 6.72610441e-02 1.03408608e+00
4.30556110e+00]
```

```
[6]:
```

```
# Predict the match status for all record pairs
result_logreg = logreg.predict(krebs_X)
len(result_logreg)
```

```
[6]:
```

```
20150
```

```
[7]:
```

```
rl.confusion_matrix(krebs_true_links, result_logreg, len(krebs_X))
```

```
[7]:
```

```
array([[ 19884, 1047],
[ 266, 5727935]])
```

```
[8]:
```

```
# The F-score for this prediction is
rl.fscore(krebs_true_links, result_logreg)
```

```
[8]:
```

```
0.96804
```

The predicted number of matches (20150) is close to the 20931 true matches. This result was achieved with a small training dataset of only 5000 record pairs.
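The F-score reported here is the harmonic mean of precision and recall, and both can be recovered from the confusion matrix, which recordlinkage lays out as `[[TP, FN], [FP, TN]]`. A quick check with the numbers above:

```python
# Entries from the confusion matrix for the logistic regression prediction
tp, fn, fp = 19884, 1047, 266

precision = tp / (tp + fp)  # share of predicted matches that are correct
recall = tp / (tp + fn)     # share of the 20931 true matches that were found
fscore = 2 * precision * recall / (precision + recall)
print(round(fscore, 5))  # 0.96804, matching rl.fscore above
```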

In (older) literature, record linkage procedures are often divided into **deterministic record linkage** and **probabilistic record linkage**. The Logistic Regression Classifier belongs to the deterministic record linkage methods. Each feature/variable has a certain importance (called a weight). The weights are multiplied with the comparison/similarity vector, and if the total sum exceeds a certain threshold, the pair is considered to be a match.
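The weighted-sum rule can be sketched in a few lines of plain Python; the weights and threshold below are illustrative, not values fitted on the Krebs data:

```python
# A minimal sketch of the weighted-sum rule; one weight per
# comparison variable, in the column order of the Krebs data.
weights = [2.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0]
threshold = 9.0

def classify(comparison_vector):
    """Declare a match when the weighted similarity sum exceeds the threshold."""
    score = sum(w * v for w, v in zip(weights, comparison_vector))
    return score > threshold

# Agreement on last name, birth year and zip code alone scores 3 + 2 + 3 = 8
print(classify([0, 0, 1, 0, 0, 0, 0, 1, 1]))  # False
```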

```
[9]:
```

```
intercept = -9
coefficients = [2.0, 1.0, 3.0, 1.0, 1.0, 1.0, 1.0, 2.0, 3.0]
logreg = rl.LogisticRegressionClassifier(coefficients, intercept)
# predict without calling LogisticRegressionClassifier.fit
result_logreg_pretrained = logreg.predict(krebs_X)
print(len(result_logreg_pretrained))
```

```
21303
```

```
[10]:
```

```
rl.confusion_matrix(krebs_true_links, result_logreg_pretrained, len(krebs_X))
```

```
[10]:
```

```
array([[ 20857, 74],
[ 446, 5727755]])
```

```
[11]:
```

```
# The F-score for this classification is
rl.fscore(krebs_true_links, result_logreg_pretrained)
```

```
[11]:
```

```
0.98769
```

For these hand-picked coefficients, the F-score is better than the one obtained by training on 5000 record pairs. Surprising? Not really: with more training data, the trained classifier's result will improve as well.

### Naive Bayes¶

In contrast to the logistic regression classifier, the Naive Bayes classifier is a probabilistic classifier. The probabilistic record linkage framework by Fellegi and Sunter (1969) is the best-known probabilistic classification method for record linkage. Later, it was shown that the Fellegi and Sunter method is mathematically equivalent to the Naive Bayes method when independence between the comparison variables is assumed.
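In the Fellegi-Sunter framework, each comparison variable contributes an agreement weight log2(m/u) and a disagreement weight log2((1-m)/(1-u)), where m and u are the probabilities that the variable agrees for matches and for non-matches respectively. A small sketch with illustrative m- and u-values:

```python
from math import log2

# Illustrative probabilities for a single variable (e.g. zip code):
# m = P(agree | pair is a match), u = P(agree | pair is a non-match)
m, u = 0.95, 0.01

agreement_weight = log2(m / u)                 # added when the variable agrees
disagreement_weight = log2((1 - m) / (1 - u))  # added when it disagrees

# Agreement on a field that rarely coincides by chance is strong
# evidence for a match; disagreement is moderate evidence against.
print(round(agreement_weight, 2), round(disagreement_weight, 2))  # 6.57 -4.31
```

Summing these per-variable weights and thresholding the total is the probabilistic counterpart of the weighted-sum rule shown earlier.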

```
[12]:
```

```
# Train the classifier
nb = rl.NaiveBayesClassifier(binarize=0.3)
nb.fit(golden_pairs, golden_matches_index)
# Predict the match status for all record pairs
result_nb = nb.predict(krebs_X)
len(result_nb)
```

```
[12]:
```

```
19837
```

```
[13]:
```

```
rl.confusion_matrix(krebs_true_links, result_nb, len(krebs_X))
```

```
[13]:
```

```
array([[ 19825, 1106],
[ 12, 5728189]])
```

```
[14]:
```

```
# The F-score for this classification is
rl.fscore(krebs_true_links, result_nb)
```

```
[14]:
```

```
0.97258
```

### Support Vector Machines¶

Support Vector Machines (SVMs) have become increasingly popular in record linkage. The algorithm performs well when only a small amount of training data is available. The SVM implementation in the Python Record Linkage Toolkit is a linear SVM.

```
[15]:
```

```
# Train the classifier
svm = rl.SVMClassifier()
svm.fit(golden_pairs, golden_matches_index)
# Predict the match status for all record pairs
result_svm = svm.predict(krebs_X)
len(result_svm)
```

```
[15]:
```

```
20839
```

```
[16]:
```

```
rl.confusion_matrix(krebs_true_links, result_svm, len(krebs_X))
```

```
[16]:
```

```
array([[ 20825, 106],
[ 14, 5728187]])
```

```
[17]:
```

```
# The F-score for this classification is
rl.fscore(krebs_true_links, result_svm)
```

```
[17]:
```

```
0.99713
```

## Unsupervised learning¶

In situations without training data, unsupervised learning can be a solution for record linkage problems. In this section, we discuss two unsupervised learning methods: K-means clustering and an implementation of the Expectation/Conditional Maximisation algorithm. Unsupervised learning algorithms usually take more computational time because of their iterative structure.

### K-means clustering¶

The K-means clustering algorithm is well known and widely used in big data analysis. The K-means classifier in the Python Record Linkage Toolkit is configured in such a way that it can be used for linking records. For more information about K-means clustering, see Wikipedia.

```
[18]:
```

```
kmeans = rl.KMeansClassifier()
result_kmeans = kmeans.fit_predict(krebs_X)
# The predicted number of matches
len(result_kmeans)
```

```
[18]:
```

```
371525
```

The classifier is now trained and the comparison vectors are classified.

```
[19]:
```

```
rl.confusion_matrix(krebs_true_links, result_kmeans, len(krebs_X))
```

```
[19]:
```

```
array([[ 20797, 134],
[ 350728, 5377473]])
```

```
[ ]:
```

```
rl.fscore(krebs_true_links, result_kmeans)
```

```
0.10598
```

### Expectation/Conditional Maximization Algorithm¶

The ECM algorithm is an Expectation-Maximisation algorithm with some additional constraints. It is closely related to the Naive Bayes algorithm and to estimating the parameters of the Fellegi and Sunter (1969) framework. Like Naive Bayes, it assumes that the comparison variables are independent of each other.
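The idea behind the EM estimation can be sketched on synthetic data: treat the comparison vectors as a two-component mixture and alternate between computing match posteriors (E-step) and re-estimating the match prevalence and the m- and u-probabilities (M-step). This is a simplified sketch with invented data, not the toolkit's ECM implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 10000, 3  # record pairs, binary comparison variables

# Synthetic agreement patterns: 5% matches agreeing with probability 0.9
# per variable, 95% non-matches agreeing with probability 0.05.
is_match = rng.random(n) < 0.05
prob = np.where(is_match[:, None], 0.9, 0.05)
gamma = (rng.random((n, k)) < prob).astype(float)

# Initial guesses for the mixture parameters
p = 0.1             # match prevalence
m = np.full(k, 0.8) # m-probabilities
u = np.full(k, 0.1) # u-probabilities

for _ in range(100):
    # E-step: posterior probability that each pair is a match,
    # assuming conditional independence of the comparison variables
    like_m = p * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
    like_u = (1 - p) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
    g = like_m / (like_m + like_u)
    # M-step: re-estimate prevalence and the m-/u-probabilities
    p = g.mean()
    m = (g[:, None] * gamma).sum(axis=0) / g.sum()
    u = ((1 - g)[:, None] * gamma).sum(axis=0) / (1 - g).sum()

# The estimates typically land near the generating values
# (p close to 0.05, m close to 0.9, u close to 0.05)
print(round(p, 3), np.round(m, 2), np.round(u, 2))
```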

```
[ ]:
```

```
# Train the classifier
ecm = rl.ECMClassifier(binarize=0.8)
result_ecm = ecm.fit_predict(krebs_X)
len(result_ecm)
```

```
[ ]:
```

```
rl.confusion_matrix(krebs_true_links, result_ecm, len(krebs_X))
```

```
[ ]:
```

```
# The F-score for this classification is
rl.fscore(krebs_true_links, result_ecm)
```