Implements an unsupervised maximum entropy classification algorithm for record linkage. The algorithm iteratively estimates probability or density ratios and uses them to classify record pairs as matches or non-matches based on their comparison vectors.

Usage

mec(
  A,
  B,
  variables,
  comparators = NULL,
  methods = NULL,
  start_params = NULL,
  nonpar_hurdle = FALSE,
  set_construction = NULL,
  target_flr = 0.03,
  max_iter_bisection = 100,
  tol = 0.005,
  delta = 0.5,
  eps = 0.05,
  controls_nleqslv = list(),
  controls_kliep = control_kliep(),
  true_matches = NULL
)

Arguments

A

A duplicate-free data.frame or data.table.

B

A duplicate-free data.frame or data.table.

variables

A character vector of key variables used to create comparison vectors.

comparators

A named list of functions for comparing pairs of records.

methods

A named list of methods used for estimation ("binary", "continuous_parametric" or "continuous_nonparametric").

start_params

Start parameters for the binary and continuous_parametric methods.

nonpar_hurdle

Logical indicating whether to use a hurdle model (used only if the "continuous_nonparametric" method has been chosen for at least one variable).

set_construction

A method for constructing the predicted set of matches ("size" or "flr").

target_flr

A target false link rate (FLR) (used only if set_construction == "flr").

max_iter_bisection

A maximum number of iterations for the bisection procedure (used only if set_construction == "flr").

tol

Error tolerance in the bisection procedure (used only if set_construction == "flr").

delta

A numeric value specifying the tolerance for the change in the estimated number of matches between iterations.

eps

A numeric value specifying the tolerance for the change in model parameters between iterations.

controls_nleqslv

Controls passed to the nleqslv function (used only if the "continuous_parametric" method has been chosen for at least one variable).

controls_kliep

Controls passed to the kliep function (used only if the "continuous_nonparametric" method has been chosen for at least one variable).

true_matches

A data.frame or data.table indicating known matches.
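
How target_flr, tol and max_iter_bisection interact can be illustrated with a minimal base R sketch. It assumes that every candidate pair already has an estimated match probability, and that the FLR of a candidate set is estimated as one minus the mean probability of its pairs. The vector g and the helpers flr_at() and bisect_threshold() are hypothetical names introduced for illustration; this is not the internal implementation of mec().

```r
# Hypothetical match probabilities for eight candidate pairs (illustrative values only).
g <- c(0.99, 0.98, 0.95, 0.90, 0.70, 0.40, 0.10, 0.05)

# Assumed FLR estimate for the set of pairs with probability >= threshold:
# one minus the mean match probability of the selected pairs.
flr_at <- function(threshold, g) {
  selected <- g[g >= threshold]
  if (length(selected) == 0) return(0)
  1 - mean(selected)
}

# Bisection on the threshold: a higher threshold keeps fewer, more certain pairs,
# so the estimated FLR does not increase as the threshold grows.
bisect_threshold <- function(g, target_flr = 0.03, tol = 0.005, max_iter = 100) {
  lower <- 0
  upper <- 1
  mid <- (lower + upper) / 2
  flr <- flr_at(mid, g)
  iter <- 0
  for (iter in seq_len(max_iter)) {
    mid <- (lower + upper) / 2
    flr <- flr_at(mid, g)
    # Stop once the estimated FLR is within tol of the target.
    if (abs(flr - target_flr) <= tol) break
    # FLR too high: raise the threshold; FLR too low: lower it.
    if (flr > target_flr) lower <- mid else upper <- mid
  }
  list(threshold = mid, flr = flr, iterations = iter)
}

res <- bisect_threshold(g)
res
```

Because the estimated FLR is a step function of the threshold, the tolerance may be unreachable for some inputs; in that case the loop simply stops after max_iter iterations, which is why a cap on the number of iterations is useful.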

Value

Returns a list containing:

  • M_est – a data.table with predicted matches,

  • n_M_est – estimated classification set size,

  • flr_est – estimated false link rate (FLR),

  • mmr_est – estimated missing match rate (MMR),

  • iter_bisection – the number of iterations in the bisection procedure,

  • b_vars – a character vector of variables used for the "binary" method (with the prefix "gamma_"),

  • cpar_vars – a character vector of variables used for the "continuous_parametric" method (with the prefix "gamma_"),

  • cnonpar_vars – a character vector of variables used for the "continuous_nonparametric" method (with the prefix "gamma_"),

  • b_params – parameters estimated using the "binary" method,

  • cpar_params – parameters estimated using the "continuous_parametric" method,

  • ratio_kliep – the result of the kliep function,

  • variables – a character vector of key variables used for comparison,

  • set_construction – a method for constructing the predicted set of matches,

  • eval_metrics – metrics for quality assessment (if true_matches is provided),

  • confusion – confusion matrix (if true_matches is provided).
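
The evaluation metrics reported when true_matches is supplied can be related to a confusion matrix with a short base R sketch. It uses the standard textbook definitions and assumes that both the predicted and the known matches are given as data frames with columns a and b holding row indices in A and B. The function pair_metrics() and its arguments are hypothetical; the computation inside mec() may differ in detail.

```r
# Pair-level metrics (in percentages) from predicted and known matches.
# n_a and n_b are the numbers of records in A and B, so n_a * n_b is the
# total number of record pairs.
pair_metrics <- function(predicted, truth, n_a, n_b) {
  key <- function(d) paste(d$a, d$b, sep = "_")
  pred_keys <- unique(key(predicted))
  true_keys <- unique(key(truth))

  tp <- sum(pred_keys %in% true_keys)   # predicted and true
  fp <- length(pred_keys) - tp          # predicted but not true
  fn <- length(true_keys) - tp          # true but not predicted
  tn <- n_a * n_b - tp - fp - fn        # all remaining pairs

  recall <- tp / (tp + fn)
  precision <- tp / (tp + fp)

  100 * c(
    recall = recall,
    precision = precision,
    fpr = fp / (fp + tn),
    fnr = fn / (fn + tp),
    accuracy = (tp + tn) / (n_a * n_b),
    specificity = tn / (tn + fp),
    f1_score = 2 * precision * recall / (precision + recall)
  )
}

# Toy input: three of four known matches are found, plus one false link.
predicted <- data.frame(a = c(1, 2, 3, 4), b = c(1, 2, 3, 5))
truth <- data.frame(a = 1:4, b = 1:4)
metrics <- pair_metrics(predicted, truth, n_a = 5, n_b = 6)
round(metrics, 2)
```

The sketch does not guard against empty denominators (for example, no predicted pairs), which would yield NaN values.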

References

Lee, D., Zhang, L.-C. and Kim, J. K. (2022). Maximum entropy classification for record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 1.

Vo, T. H., Chauvet, G., Happe, A., Oger, E., Paquelet, S., and Garès, V. (2023). Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system. Computational Statistics & Data Analysis, 179, 107656.

Sugiyama, M., Suzuki, T., Nakajima, S. et al. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746. doi:10.1007/s10463-008-0197-x

Author

Adam Struzik

Examples

df_1 <- data.frame(
  name = c("Emma", "Liam", "Olivia", "Noah", "Ava",
           "Ethan", "Sophia", "Mason", "Isabella", "James"),
  surname = c("Smith", "Johnson", "Williams", "Brown", "Jones",
              "Garcia", "Miller", "Davis", "Rodriguez", "Wilson"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix",
           "Philadelphia", "San Antonio", "San Diego", "Dallas", "San Jose")
)

df_2 <- data.frame(
  name = c(
    "Emma", "Liam", "Olivia", "Noah",
    "Ava", "Ehtan", "Sopia", "Mson",
    "Charlotte", "Benjamin", "Amelia", "Lucas"
  ),
  surname = c(
    "Smith", "Johnson", "Williams", "Brown",
    "Jnes", "Garca", "Miler", "Dvis",
    "Martinez", "Lee", "Hernandez", "Clark"
  ),
  city = c(
    "New York", "Los Angeles", "Chicago", "Houston",
    "Phonix", "Philadelpia", "San Antnio", "San Dieg",
    "Seattle", "Miami", "Boston", "Denver"
  )
)
true_matches <- data.frame(
  "a" = 1:8,
  "b" = 1:8
)

variables <- c("name", "surname", "city")
comparators <- list(
  "name" = jarowinkler_complement(),
  "surname" = jarowinkler_complement(),
  "city" = jarowinkler_complement()
)
methods <- list(
  "name" = "continuous_parametric",
  "surname" = "continuous_parametric",
  "city" = "continuous_parametric"
)

set.seed(1)
result <- mec(A = df_1, B = df_2,
              variables = variables,
              comparators = comparators,
              methods = methods,
              true_matches = true_matches)
result
#> Record linkage based on the following variables: name, surname, city.
#> ========================================================
#> The algorithm predicted 8 matches.
#> The first 6 predicted matches are:
#>        a     b ratio / 1000
#>    <num> <num>        <num>
#> 1:     6     6 1.433031e+08
#> 2:     8     8 3.198692e+07
#> 3:     7     7 9.673745e+05
#> 4:     5     5 2.813745e+04
#> 5:     1     1 3.375000e+00
#> 6:     2     2 3.375000e+00
#> ========================================================
#> The construction of the classification set was based on estimates of its size.
#> Estimated false link rate (FLR): 0.2066 %.
#> Estimated missing match rate (MMR): 0.0000 %.
#> ========================================================
#> Variables selected for the continuous parametric method: name, surname, city.
#> Estimated parameters for the continuous parametric method:
#>         variable p_0_M    alpha_M   beta_M      p_0_U  alpha_U    beta_U
#>           <char> <num>      <num>    <num>      <num>    <num>     <num>
#> 1:    gamma_name 0.625 138.462279 2199.107 0.04166667 6.516736 11.173089
#> 2: gamma_surname 0.500 120.665706 1974.530 0.03333333 4.622775  7.167261
#> 3:    gamma_city 0.500   6.512723  135.163 0.03333333 5.233194  9.313035
#> ========================================================
#> Evaluation metrics (presented in percentages):
#>      recall   precision         fpr         fnr    accuracy specificity 
#>         100         100           0           0         100         100 
#>    f1_score 
#>         100