Annotation

Manually labeled record pairs are useful for training and validation tasks. Labeled training data is rarely available in record linkage applications because it is highly dataset- and sample-specific. The Python Record Linkage Toolkit comes with a browser-based user interface for manually classifying record pairs. A hosted version of RecordLinkage ANNOTATOR can be found on GitHub.

Review screen of RecordLinkage ANNOTATOR

Generate annotation file

The RecordLinkage ANNOTATOR software requires a structured annotation file. The required schema of the annotation file is open. The function recordlinkage.write_annotation_file() can be used to render and save an annotation file. The function can be used for both linking and deduplication purposes.

recordlinkage.write_annotation_file(fp, pairs, df_a, df_b=None, dataset_a_name=None, dataset_b_name=None, *args, **kwargs)

Render and export annotation file.

This function renders an annotation object and stores it in a JSON file. The function is a wrapper around the AnnotationWrapper class.

Parameters:
  • fp (str) – The path to the annotation file.

  • pairs (pandas.MultiIndex) – The record pairs to annotate.

  • df_a (pandas.DataFrame) – The data frame with full record information for the pairs.

  • df_b (pandas.DataFrame) – In case of data linkage, this is the second data frame. Default None.

  • dataset_a_name (str) – The name of the first data frame.

  • dataset_b_name (str) – In case of data linkage, the name of the second data frame. Default None.

Linking

This is a simple example of the code to render an annotation file for linking records:

import recordlinkage as rl
from recordlinkage.index import Block
from recordlinkage.datasets import load_febrl4

df_a, df_b = load_febrl4()

blocker = Block("surname", "surname")
pairs = blocker.index(df_a, df_b)

rl.write_annotation_file(
    "annotation_demo_linking.json",
    pairs[0:50],
    df_a,
    df_b,
    dataset_a_name="Febrl4 A",
    dataset_b_name="Febrl4 B"
)
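The slice pairs[0:50] above takes the first 50 candidate pairs in index order. If a less order-dependent sample is wanted for labeling, one possible alternative is to draw the pairs at random. The sketch below is only an illustration: sample_pairs is a hypothetical helper, not part of the toolkit, and the toy MultiIndex stands in for the output of blocker.index(df_a, df_b).

```python
import numpy as np
import pandas as pd


def sample_pairs(pairs, n, seed=0):
    """Return up to n randomly chosen pairs, keeping their original order.

    Hypothetical helper for illustration; not part of recordlinkage.
    """
    n = min(n, len(pairs))
    rng = np.random.default_rng(seed)
    # Draw n distinct positions, then sort them to preserve index order.
    positions = np.sort(rng.choice(len(pairs), size=n, replace=False))
    return pairs.take(positions)


# Toy MultiIndex standing in for the candidate pairs from a blocker.
toy_pairs = pd.MultiIndex.from_product(
    [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3"]]
)

subset = sample_pairs(toy_pairs, 5)
print(len(subset))
```

In the linking example, sample_pairs(pairs, 50) would take the place of pairs[0:50].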

Deduplication

This is a simple example of the code to render an annotation file for duplicate detection:

import recordlinkage as rl
from recordlinkage.index import Block
from recordlinkage.datasets import load_febrl1

df_a = load_febrl1()

blocker = Block("surname", "surname")
pairs = blocker.index(df_a)

rl.write_annotation_file(
    "annotation_demo_dedup.json",
    pairs[0:50],
    df_a,
    dataset_a_name="Febrl1 A"
)
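The files written by the examples above are plain JSON, so they can be checked with the standard library before they are loaded into the annotator. The sketch below summarizes the top level of a JSON file without assuming any particular annotation schema. The helper name summarize_json_file and the contents of the stand-in file are hypothetical and exist only so the snippet runs on its own.

```python
import json
import os
import tempfile


def summarize_json_file(path):
    """Describe the top-level structure of a JSON file.

    Hypothetical helper for illustration; not part of recordlinkage.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    if not isinstance(data, dict):
        return {"<root>": type(data).__name__}

    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} items"
        else:
            summary[key] = repr(value)
    return summary


# Stand-in content (an assumption, NOT the real annotation schema).
stand_in = {"version": 1, "pairs": [{}, {}, {}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stand_in_annotation.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(stand_in, f)
    summary = summarize_json_file(path)

print(summary)
```

For a real file, the path passed to summarize_json_file would be the one given to write_annotation_file, for example "annotation_demo_dedup.json".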

Manual labeling

Go to RecordLinkage ANNOTATOR or start the server yourself.

Choose the annotation file on the landing screen or use the drag and drop functionality. A new screen shows the first record pair to label. Start labeling the data manually. Use the Match button for record pairs belonging to the same entity. Use the Distinct button for record pairs belonging to different entities. After all record pairs are labeled by hand, the result can be saved to a file.

Export/read annotation file

After labeling all record pairs, you can export the annotation file to a JSON file. Use the function recordlinkage.read_annotation_file() to read the results.

import recordlinkage as rl

result = rl.read_annotation_file('my_annotation.json')
print(result.links)

The function recordlinkage.read_annotation_file() reads the file and returns a recordlinkage.annotation.AnnotationResult object. The links and distinct attributes of this object each return a pandas.MultiIndex.
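Since manually labeled pairs are meant for training and validation, it can be convenient to merge the two MultiIndex objects into a single label vector. The following is a minimal sketch using pandas only. The helper labels_from_annotation and the record identifiers are hypothetical, and the None checks are a defensive assumption for the case where no pair received a given label.

```python
import pandas as pd


def labels_from_annotation(links, distinct):
    """Build a 0/1 Series indexed by record pair.

    Hypothetical helper for illustration; not part of recordlinkage.
    1 marks a match, 0 marks a distinct pair.
    """
    parts = []
    if links is not None and len(links) > 0:
        parts.append(pd.Series(1, index=links))
    if distinct is not None and len(distinct) > 0:
        parts.append(pd.Series(0, index=distinct))
    if not parts:
        return pd.Series(dtype="int64", name="is_match")
    return pd.concat(parts).rename("is_match")


# Toy indexes standing in for result.links and result.distinct.
links = pd.MultiIndex.from_tuples([("rec-1-org", "rec-1-dup-0")])
distinct = pd.MultiIndex.from_tuples(
    [("rec-2-org", "rec-3-dup-0"), ("rec-4-org", "rec-5-dup-0")]
)

labels = labels_from_annotation(links, distinct)
print(labels)
```

With a real annotation file, the inputs would be result.links and result.distinct from the example above.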

recordlinkage.read_annotation_file(fp)

Read annotation file.

This function can be used to read the annotation file and extract the results like the linked pairs and distinct pairs.

Parameters:

fp (str) – The path to the annotation file.

Returns:

AnnotationResult – An AnnotationResult object.

Example

Read the links from an annotation file:

>>> from recordlinkage import read_annotation_file
>>> annotation = read_annotation_file("result.json")
>>> print(annotation.links)

class recordlinkage.annotation.AnnotationResult(pairs=None, version=1)

Result of (manual) annotation.

Parameters:
  • pairs (list) – Raw data of each record pair in the annotation file.

  • version (str) – The version number corresponding to the file structure.

property links

Return the links.

Returns:

pandas.MultiIndex – The links stored in a pandas MultiIndex.

property distinct

Return the distinct pairs.

Returns:

pandas.MultiIndex – The distinct pairs stored in a pandas MultiIndex.

property unknown

Return the unknown or unlabeled pairs.

Returns:

pandas.MultiIndex – The unknown or unlabeled pairs stored in a pandas MultiIndex.

classmethod from_dict(d)

Create an AnnotationResult from a dict.

Parameters:

d (dict) – The annotation file as a dict.

Returns:

AnnotationResult – An AnnotationResult object.

classmethod from_file(fp)

Create an AnnotationResult from a file.

Parameters:

fp (str) – The path to the annotation file.

Returns:

AnnotationResult – An AnnotationResult object.