
Creates a supervised record linkage model using a custom machine learning (ML) classifier.

Usage

custom_rec_lin_model(ml_model, vectors)

Arguments

ml_model

A trained ML model that predicts the probability of a match based on comparison vectors.

vectors

An object of class comparison_vectors (as returned by the comparison_vectors function) that was used to train ml_model.
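To make the shape of the training data more concrete, here is a minimal base-R sketch of a table with one row per candidate pair, one gamma_ column per key variable, and a match label. It reuses the toy data from the Examples section. The names omega_sketch and norm_edit_dist are hypothetical, and the comparator is a stand-in built on utils::adist, not the package's jarowinkler_complement(); the presence of a and b index columns is also an assumption. Only the gamma_name, gamma_surname, and match column names are taken from this page.

```r
# Toy inputs copied from the Examples section of this page.
df_1 <- data.frame(
  name = c("John", "Emily", "Mark", "Anna", "David"),
  surname = c("Smith", "Johnson", "Taylor", "Williams", "Brown")
)
df_2 <- data.frame(
  name = c("Jon", "Emely", "Marc", "Michael"),
  surname = c("Smitth", "Jonson", "Tailor", "Henderson")
)
matches <- data.frame(a = 1:3, b = 1:3)

# Stand-in comparator (an assumption, NOT jarowinkler_complement()):
# edit distance scaled to [0, 1] by the length of the longer string.
norm_edit_dist <- function(x, y) {
  d <- mapply(function(s, t) utils::adist(s, t)[1, 1], x, y)
  unname(d / pmax(nchar(x), nchar(y)))
}

# One row per candidate pair: a indexes df_1, b indexes df_2.
omega_sketch <- expand.grid(a = seq_len(nrow(df_1)), b = seq_len(nrow(df_2)))

# One comparison value per key variable and pair.
omega_sketch$gamma_name <- norm_edit_dist(
  df_1$name[omega_sketch$a], df_2$name[omega_sketch$b]
)
omega_sketch$gamma_surname <- norm_edit_dist(
  df_1$surname[omega_sketch$a], df_2$surname[omega_sketch$b]
)

# Label the pairs listed in `matches` with 1, all other pairs with 0.
omega_sketch$match <- as.integer(
  paste(omega_sketch$a, omega_sketch$b) %in% paste(matches$a, matches$b)
)
```

A classifier passed as ml_model would be trained on the gamma_ columns as features and the match column as the label, as the Examples section does with vectors$Omega.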

Value

Returns a list containing:

  • b_vars – here NULL,

  • cpar_vars – here NULL,

  • cnonpar_vars – here NULL,

  • b_params – here NULL,

  • cpar_params – here NULL,

  • ratio_kliep – here NULL,

  • ml_model – ML model used for creating the record linkage model,

  • pi_est – a prior probability of matching,

  • match_prop – proportion of matches in the smaller dataset,

  • variables – a character vector of key variables used for comparison,

  • comparators – a list of functions used to compare pairs of records,

  • methods – here NULL.

Author

Adam Struzik

Examples

if (requireNamespace("xgboost", quietly = TRUE)) {
  # Two small datasets to be linked; df_2 contains misspelled variants.
  df_1 <- data.frame(
    "name" = c("John", "Emily", "Mark", "Anna", "David"),
    "surname" = c("Smith", "Johnson", "Taylor", "Williams", "Brown")
  )
  df_2 <- data.frame(
    "name" = c("Jon", "Emely", "Marc", "Michael"),
    "surname" = c("Smitth", "Jonson", "Tailor", "Henderson")
  )
  # One comparison function per key variable.
  comparators <- list("name" = jarowinkler_complement(),
                      "surname" = jarowinkler_complement())
  # Known matches: row a of df_1 corresponds to row b of df_2.
  matches <- data.frame("a" = 1:3, "b" = 1:3)
  vectors <- comparison_vectors(A = df_1, B = df_2, variables = c("name", "surname"),
                               comparators = comparators, matches = matches)
  # Features are the comparison values; the label is the match indicator.
  train_data <- xgboost::xgb.DMatrix(
    data = as.matrix(vectors$Omega[, c("gamma_name", "gamma_surname")]),
    label = vectors$Omega$match
  )
  params <- list(objective = "binary:logistic",
                 eval_metric = "logloss")
  # Train a classifier that outputs match probabilities.
  model_xgb <- xgboost::xgboost(data = train_data, params = params,
                                nrounds = 50, verbose = 0)
  # Wrap the trained classifier in a record linkage model.
  custom_xgb_model <- custom_rec_lin_model(model_xgb, vectors)
  custom_xgb_model
}
#> Record linkage model based on the following variables: name, surname.
#> A custom ML model was used.
#> The prior probability of matching is 0.15.
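
The printed prior of 0.15 is consistent with 3 labeled matches among the 5 × 4 = 20 candidate pairs in the example. The sketch below reproduces that arithmetic in base R. It is an illustration under that assumption, not the package's internal code. The variable names are hypothetical, and the match_prop line reflects one reading of "proportion of matches in the smaller dataset" that this page does not confirm.

```r
# Sizes taken from the example above.
n_a <- 5L        # rows in df_1
n_b <- 4L        # rows in df_2
n_matches <- 3L  # rows in `matches`

# Every record in df_1 is paired with every record in df_2.
n_pairs <- n_a * n_b

# Assumed reading of pi_est: share of candidate pairs that are matches.
pi_est_sketch <- n_matches / n_pairs

# Assumed reading of match_prop: matches relative to the smaller dataset.
match_prop_sketch <- n_matches / min(n_a, n_b)

pi_est_sketch
match_prop_sketch
```

Under these assumptions pi_est_sketch equals 0.15, which agrees with the printed output; the value of match_prop is not shown on this page, so the second quantity remains unverified.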