Predicts matches between records in two datasets based on a given record linkage model.

Usage

# S3 method for class 'rec_lin_model'
predict(
  object,
  newdata_A,
  newdata_B,
  set_construction = c("size", "flr"),
  fixed_method = "Newton",
  target_flr = 0.05,
  tol = 10^(-3),
  max_iter = 50,
  data_type = c("data.frame", "data.table", "matrix"),
  true_matches = NULL,
  ...
)

Arguments

object

A rec_lin_model object from the train_rec_lin or custom_rec_lin_model functions.

newdata_A

A duplicate-free data.frame or data.table.

newdata_B

A duplicate-free data.frame or data.table.

set_construction

The method for constructing the predicted set of matches: "size" (the default, based on the estimated size of the match set) or "flr" (based on a target false link rate).

fixed_method

A method for solving fixed-point equations using the FixedPoint function (used only if set_construction == "size").

target_flr

A target false link rate (FLR) (used only if set_construction == "flr").

tol

Error tolerance in the bisection procedure (used only if set_construction == "flr").

max_iter

The maximum number of iterations for the bisection procedure (used only if set_construction == "flr").

data_type

Data type for predictions with a custom ML model ("data.frame", "data.table" or "matrix"; used only if object is from the custom_rec_lin_model function).

true_matches

An optional data.frame or data.table indicating true matches. If provided, the returned list also includes eval_metrics and confusion.

...

Additional controls passed to the predict function for a custom ML model (used only if object is from the custom_rec_lin_model function).
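
The roles of target_flr, tol and max_iter can be illustrated with a generic bisection search in base R. This is a minimal sketch under stated assumptions, not the package's internal code: the vector g of match probabilities, the helper est_flr and the function bisect_threshold are hypothetical names, and the FLR of a candidate set is assumed to be estimated as the mean of 1 - g over the selected pairs.

```r
# Hypothetical match probabilities for candidate record pairs (illustrative values).
g <- c(0.99, 0.97, 0.95, 0.80, 0.60, 0.40, 0.10)

# Assumed FLR estimate for the set of pairs with g >= threshold:
# the average probability that a selected pair is not a match.
est_flr <- function(threshold, g) {
  selected <- g[g >= threshold]
  if (length(selected) == 0) return(0)
  mean(1 - selected)
}

# Bisection on the threshold: `lower` always violates the target FLR,
# `upper` always satisfies it, and the interval is halved until it is
# narrower than `tol` or `max_iter` iterations have been used.
bisect_threshold <- function(g, target_flr = 0.05, tol = 1e-3, max_iter = 50) {
  lower <- 0
  upper <- 1
  iter <- 0
  while (upper - lower > tol && iter < max_iter) {
    mid <- (lower + upper) / 2
    if (est_flr(mid, g) > target_flr) lower <- mid else upper <- mid
    iter <- iter + 1
  }
  list(threshold = upper, flr_est = est_flr(upper, g), iter = iter)
}

res <- bisect_threshold(g)
n_selected <- sum(g >= res$threshold)  # size of the resulting set
```

A smaller tol narrows the final interval at the cost of more iterations, while max_iter caps the work if the tolerance is not reached.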

Value

Returns a list containing:

  • M_est – a data.table with predicted matches,

  • set_construction – a method for constructing the predicted set of matches,

  • n_M_est – estimated classification set size,

  • flr_est – estimated false link rate (FLR),

  • mmr_est – estimated missing match rate (MMR),

  • iter – the number of iterations in the bisection procedure,

  • eval_metrics – metrics for quality assessment, if true_matches is provided,

  • confusion – confusion matrix, if true_matches is provided.
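
When true matches are known, the false link rate and missing match rate can be computed directly from the predicted and true pairs. The base R sketch below assumes the usual definitions (FLR as the share of predicted links that are wrong, MMR as the share of true matches that were not linked) and uses hypothetical tables with columns a and b, mirroring the column names in the example output. It does not reproduce how the package builds eval_metrics or confusion.

```r
# Hypothetical predicted and true match tables; a and b index records in A and B.
predicted <- data.frame(a = c(1, 2, 3, 4), b = c(1, 2, 4, 5))
truth     <- data.frame(a = c(1, 2, 3, 5), b = c(1, 2, 3, 6))

# Build one comparable key per record pair.
pair_key <- function(d) paste(d$a, d$b, sep = "_")

tp <- sum(pair_key(predicted) %in% pair_key(truth))  # correct links
fp <- nrow(predicted) - tp                           # false links
fn <- nrow(truth) - tp                               # missed matches

flr <- fp / (tp + fp)  # false link rate
mmr <- fn / (tp + fn)  # missing match rate
```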

References

Lee, D., Zhang, L.-C. and Kim, J. K. (2022). Maximum entropy classification for record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 1.

Vo, T. H., Chauvet, G., Happe, A., Oger, E., Paquelet, S., and Garès, V. (2023). Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system. Computational Statistics & Data Analysis, 179, 107656.

Sugiyama, M., Suzuki, T., Nakajima, S. et al. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746. doi:10.1007/s10463-008-0197-x

Author

Adam Struzik

Examples

df_1 <- data.frame(
  "name" = c("John", "Emily", "Mark", "Anna", "David"),
  "surname" = c("Smith", "Johnson", "Taylor", "Williams", "Brown")
)
df_2 <- data.frame(
  "name" = c("John", "Emely", "Marc", "Michael"),
  "surname" = c("Smith", "Jonson", "Tailor", "Henderson")
)
comparators <- list("name" = jarowinkler_complement(),
                    "surname" = jarowinkler_complement())
matches <- data.frame("a" = 1:3, "b" = 1:3)
methods <- list("name" = "continuous_nonparametric",
                "surname" = "continuous_nonparametric")
model <- train_rec_lin(A = df_1, B = df_2, matches = matches,
                       variables = c("name", "surname"),
                       comparators = comparators,
                       methods = methods)

df_new_1 <- data.frame(
  "name" = c("Jame", "Lia", "Tomas", "Matthew", "Andrew"),
  "surname" = c("Wilsen", "Thomsson", "Davis", "Robinson", "Scott")
)
df_new_2 <- data.frame(
  "name" = c("James", "Leah", "Thomas", "Sophie", "Mathew", "Andrew"),
  "surname" = c("Wilson", "Thompson", "Davies", "Clarks", "Robins", "Scots")
)
predict(model, df_new_1, df_new_2)
#> The algorithm predicted 5 matches.
#> The first 5 predicted matches are:
#>        a     b ratio / 1000
#>    <num> <num>        <num>
#> 1:     3     3    0.6466869
#> 2:     4     5    0.5865049
#> 3:     1     1    0.5696382
#> 4:     5     6    0.3103742
#> 5:     2     2    0.2935612
#> ========================================================
#> The construction of the classification set was based on estimates of its size.
#> Estimated false link rate (FLR): 1.1486 %.
#> Estimated missing match rate (MMR): 1.1486 %.