Train a Record Linkage Model
Trains a supervised record linkage model using probability or density ratio estimation.
Usage
train_rec_lin(
  A,
  B,
  matches,
  variables,
  comparators = NULL,
  methods = NULL,
  prob_ratio = NULL,
  controls_nleqslv = list(),
  controls_kliep = control_kliep()
)
Arguments
- A
A duplicate-free data.frame or data.table.
- B
A duplicate-free data.frame or data.table.
- matches
A data.frame or data.table indicating known matches.
- variables
A character vector of key variables used to create comparison vectors.
- comparators
A named list of functions for comparing pairs of records.
- methods
A named list of methods used for estimation ("binary", "continuous_parametric" or "continuous_nonparametric").
- prob_ratio
Probability ratio type ("1" or "2").
- controls_nleqslv
Controls passed to the nleqslv function (used only if the "continuous_parametric" method has been chosen for at least one variable).
- controls_kliep
Controls passed to the kliep function (used only if the "continuous_nonparametric" method has been chosen for at least one variable).
Value
Returns a list containing:
- b_vars – a character vector of variables used for the "binary" method (with the prefix "gamma_"),
- cpar_vars – a character vector of variables used for the "continuous_parametric" method (with the prefix "gamma_"),
- cnonpar_vars – a character vector of variables used for the "continuous_nonparametric" method (with the prefix "gamma_"),
- b_params – parameters estimated using the "binary" method,
- cpar_params – parameters estimated using the "continuous_parametric" method,
- ratio_kliep – the result of the kliep function,
- ml_model – here NULL,
- pi_est – the prior probability of matching,
- match_prop – the proportion of matches in the smaller dataset,
- variables – a character vector of key variables used for comparison,
- comparators – a list of functions used to compare pairs of records,
- methods – a list of methods used for estimation.
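The two proportion-type elements, pi_est and match_prop, can be illustrated with base R alone. The sketch below is not taken from the package source. It assumes that the prior is the share of all record pairs that are known matches and that match_prop divides the number of matches by the size of the smaller dataset. The first assumption agrees with the prior of 0.15 printed in the Examples section (3 matches among 5 x 4 pairs). The names n_A, n_B and pairs are illustrative only.

```r
# Toy sizes mirroring the Examples section: 5 records in A, 4 in B.
n_A <- 5
n_B <- 4
matches <- data.frame(a = 1:3, b = 1:3)

# All candidate record pairs (Cartesian product of row indices).
pairs <- expand.grid(a = seq_len(n_A), b = seq_len(n_B))

# Flag the pairs that are listed in `matches`.
pairs$match <- paste(pairs$a, pairs$b) %in% paste(matches$a, matches$b)

# Assumed definitions (hypothetical, for illustration only):
pi_est <- mean(pairs$match)                  # share of pairs that are matches
match_prop <- nrow(matches) / min(n_A, n_B)  # matches relative to the smaller dataset

pi_est
match_prop
```

With these toy inputs the sketch gives 3 / 20 for pi_est and 3 / 4 for match_prop.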
References
Lee, D., Zhang, L.-C. and Kim, J. K. (2022). Maximum entropy classification for record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 1.
Vo, T. H., Chauvet, G., Happe, A., Oger, E., Paquelet, S., and Garès, V. (2023). Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system. Computational Statistics & Data Analysis, 179, 107656.
Sugiyama, M., Suzuki, T., Nakajima, S. et al. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746. doi:10.1007/s10463-008-0197-x
Examples
df_1 <- data.frame(
  "name" = c("John", "Emily", "Mark", "Anna", "David"),
  "surname" = c("Smith", "Johnson", "Taylor", "Williams", "Brown")
)
df_2 <- data.frame(
  "name" = c("John", "Emely", "Marc", "Michael"),
  "surname" = c("Smith", "Jonson", "Tailor", "Henderson")
)
comparators <- list("name" = jarowinkler_complement(),
                    "surname" = jarowinkler_complement())
matches <- data.frame("a" = 1:3, "b" = 1:3)
methods <- list("name" = "continuous_nonparametric",
                "surname" = "continuous_nonparametric")
model <- train_rec_lin(A = df_1, B = df_2, matches = matches,
                       variables = c("name", "surname"),
                       comparators = comparators,
                       methods = methods)
model
#> Record linkage model based on the following variables: name, surname.
#> The prior probability of matching is 0.15.
#> ========================================================
#> Variables selected for the continuous nonparametric method: name, surname.