Trains a supervised record linkage model using probability or density ratio estimation.

Usage

train_rec_lin(
  A,
  B,
  matches,
  variables,
  comparators = NULL,
  methods = NULL,
  prob_ratio = NULL,
  controls_nleqslv = list(),
  controls_kliep = control_kliep()
)

Arguments

A

A duplicate-free data.frame or data.table (the first dataset to be linked).

B

A duplicate-free data.frame or data.table (the second dataset to be linked).

matches

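A data.frame or data.table of known matches between A and B. In the example on this page it has two columns, a and b, pairing records from A with records from B.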

variables

A character vector of key variables used to create comparison vectors.

comparators

A named list of functions for comparing pairs of records, with names corresponding to the entries of variables.

methods

A named list assigning an estimation method to each variable: "binary", "continuous_parametric" or "continuous_nonparametric".

prob_ratio

Probability ratio type ("1" or "2").

controls_nleqslv

Controls passed to the nleqslv function. Used only if the "continuous_parametric" method has been chosen for at least one variable.

controls_kliep

Controls passed to the kliep function. Used only if the "continuous_nonparametric" method has been chosen for at least one variable.
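
To make the roles of A, B, matches and the comparison vectors more concrete, here is a minimal base R sketch. It is not the package's implementation. It assumes that the columns a and b of matches hold row indices into A and B (as the example on this page suggests), and exact_disagreement is a hypothetical comparator introduced only for illustration.

```r
# Toy data mirroring the example on this page (one variable only).
A <- data.frame(name = c("John", "Emily", "Mark", "Anna", "David"),
                stringsAsFactors = FALSE)
B <- data.frame(name = c("John", "Emely", "Marc", "Michael"),
                stringsAsFactors = FALSE)

# Assumption: `a` and `b` are row indices into A and B.
matches <- data.frame(a = 1:3, b = 1:3)

# Hypothetical binary comparator: 0 when the values agree, 1 otherwise.
exact_disagreement <- function(x, y) as.numeric(x != y)

# All candidate pairs (Cartesian product of row indices).
pairs <- expand.grid(a = seq_len(nrow(A)), b = seq_len(nrow(B)))

# One comparison value per pair; the "gamma_" prefix follows the
# naming described in the Value section.
pairs$gamma_name <- exact_disagreement(A$name[pairs$a], B$name[pairs$b])

# Flag the pairs listed in `matches`.
pairs$match <- paste(pairs$a, pairs$b) %in% paste(matches$a, matches$b)

nrow(pairs)       # number of candidate pairs
sum(pairs$match)  # number of known matches
```

A comparator such as jarowinkler_complement() would presumably return a value on a continuous scale instead of 0 or 1, which is why the example on this page pairs it with a continuous method.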

Value

Returns a list containing:

  • b_vars – a character vector of variables used for the "binary" method (with the prefix "gamma_"),

  • cpar_vars – a character vector of variables used for the "continuous_parametric" method (with the prefix "gamma_"),

  • cnonpar_vars – a character vector of variables used for the "continuous_nonparametric" method (with the prefix "gamma_"),

  • b_params – parameters estimated using the "binary" method,

  • cpar_params – parameters estimated using the "continuous_parametric" method,

  • ratio_kliep – the result of the kliep function,

  • ml_model – always NULL for models trained with this function,

  • pi_est – the estimated prior probability of matching,

  • match_prop – the proportion of matches in the smaller dataset,

  • variables – a character vector of key variables used for comparison,

  • comparators – a list of functions used to compare pairs of records,

  • methods – a list of methods used for estimation.
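
As a rough illustration of pi_est and match_prop, the sketch below redoes the arithmetic for the example on this page (5 records in A, 4 in B, 3 known matches). It assumes pi_est is the share of known matches among all pairs, which agrees with the value 0.15 printed in the example, though the package may compute it differently. The names n_A, n_B and n_M are introduced here for illustration only.

```r
n_A <- 5  # records in A
n_B <- 4  # records in B
n_M <- 3  # known matches

# Assumed definition: share of matches among all n_A * n_B pairs.
pi_est <- n_M / (n_A * n_B)

# Share of matches relative to the smaller dataset.
match_prop <- n_M / min(n_A, n_B)

pi_est      # 0.15
match_prop  # 0.75
```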

References

Lee, D., Zhang, L.-C. and Kim, J. K. (2022). Maximum entropy classification for record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 1.

Vo, T. H., Chauvet, G., Happe, A., Oger, E., Paquelet, S., and Garès, V. (2023). Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system. Computational Statistics & Data Analysis, 179, 107656.

Sugiyama, M., Suzuki, T., Nakajima, S. et al. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746. doi:10.1007/s10463-008-0197-x

Author

Adam Struzik

Examples

df_1 <- data.frame(
  "name" = c("John", "Emily", "Mark", "Anna", "David"),
  "surname" = c("Smith", "Johnson", "Taylor", "Williams", "Brown")
)
df_2 <- data.frame(
  "name" = c("John", "Emely", "Marc", "Michael"),
  "surname" = c("Smith", "Jonson", "Tailor", "Henderson")
)
comparators <- list("name" = jarowinkler_complement(),
                    "surname" = jarowinkler_complement())
matches <- data.frame("a" = 1:3, "b" = 1:3)
methods <- list("name" = "continuous_nonparametric",
                "surname" = "continuous_nonparametric")
model <- train_rec_lin(A = df_1, B = df_2, matches = matches,
                       variables = c("name", "surname"),
                       comparators = comparators,
                       methods = methods)
model
#> Record linkage model based on the following variables: name, surname.
#> The prior probability of matching is 0.15.
#> ========================================================
#> Variables selected for the continuous nonparametric method: name, surname.