Predict Matches Based on a Given Record Linkage Model
predict.rec_lin_model.Rd
Predicts matches between records in two datasets based on a given record linkage model.
Arguments
- object
A rec_lin_model object from the train_rec_lin or custom_rec_lin_model functions.
- newdata_A
A duplicate-free data.frame or data.table.
- newdata_B
A duplicate-free data.frame or data.table.
- set_construction
A method for constructing the predicted set of matches ("size" or "flr").
- fixed_method
A method for solving fixed-point equations using the FixedPoint function (used only if set_construction == "size").
- target_flr
A target false link rate (FLR) (used only if set_construction == "flr").
- tol
Error tolerance in the bisection procedure (used only if set_construction == "flr").
- max_iter
The maximum number of iterations for the bisection procedure (used only if set_construction == "flr").
- data_type
Data type for predictions with a custom ML model ("data.frame", "data.table" or "matrix"; used only if object is from the custom_rec_lin_model function).
- true_matches
A data.frame or data.table indicating true matches.
- ...
Additional controls passed to the predict function for the custom ML model (used only if object is from the custom_rec_lin_model function).
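The tol and max_iter arguments govern a bisection procedure. As a rough illustration of how an error tolerance and an iteration cap usually interact in such a search, here is a generic base-R sketch. It is not the package's internal code: the helper bisect, its default values, and the toy function are hypothetical, and how the package relates the search to target_flr is not described on this page.

```r
# Generic bisection: find x in [lower, upper] such that f(x) is close to 0.
# Hypothetical helper for illustration only; not part of the package.
bisect <- function(f, lower, upper, tol = 0.005, max_iter = 100) {
  f_lower <- f(lower)
  # The root must be strictly bracketed by the two end points.
  stopifnot(f_lower * f(upper) < 0)
  iter <- 0
  mid <- (lower + upper) / 2
  while (iter < max_iter) {
    iter <- iter + 1
    mid <- (lower + upper) / 2
    f_mid <- f(mid)
    # Stop as soon as the error is within the tolerance.
    if (abs(f_mid) < tol) break
    if (f_lower * f_mid < 0) {
      upper <- mid          # sign change in [lower, mid]
    } else {
      lower <- mid          # sign change in [mid, upper]
      f_lower <- f_mid
    }
  }
  list(root = mid, iter = iter)
}

# Toy function standing in for "achieved value minus target value".
res <- bisect(function(x) x^2 - 0.3, lower = 0, upper = 1, tol = 1e-6)
res$root  # close to sqrt(0.3)
res$iter  # number of iterations actually used
```

A smaller tol generally costs more iterations, and max_iter bounds the work if the tolerance cannot be reached.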
Value
Returns a list containing:
- M_est – a data.table with predicted matches,
- set_construction – a method for constructing the predicted set of matches,
- n_M_est – estimated classification set size,
- flr_est – estimated false link rate (FLR),
- mmr_est – estimated missing match rate (MMR),
- iter – the number of iterations in the bisection procedure,
- eval_metrics – metrics for quality assessment, if true_matches is provided,
- confusion – confusion matrix, if true_matches is provided.
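When the true matches are known, the false link rate and missing match rate can also be computed directly rather than estimated. The base-R sketch below shows the usual definitions, assuming both the predicted and the true match sets are data frames with columns a and b holding record indices and no duplicated pairs. The data are made up, and this page does not list which metrics eval_metrics contains, so the code is illustrative only.

```r
# Hypothetical predicted and true match sets (row indices in A and B).
predicted <- data.frame(a = c(1, 2, 3, 4), b = c(1, 2, 3, 5))
truth     <- data.frame(a = c(1, 2, 3, 5), b = c(1, 2, 3, 6))

# Encode each pair as a single key so pairs can be compared with %in%.
key <- function(df) paste(df$a, df$b, sep = "-")
is_correct <- key(predicted) %in% key(truth)

tp <- sum(is_correct)         # correctly predicted links
fp <- nrow(predicted) - tp    # false links
fn <- nrow(truth) - tp        # missed matches

flr <- fp / nrow(predicted)   # share of predicted links that are wrong
mmr <- fn / nrow(truth)       # share of true matches that were missed
```

Here three of the four predicted pairs are true matches, so one link is false and one true match is missed.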
References
Lee, D., Zhang, L.-C. and Kim, J. K. (2022). Maximum entropy classification for record linkage. Survey Methodology, Statistics Canada, Catalogue No. 12-001-X, Vol. 48, No. 1.
Vo, T. H., Chauvet, G., Happe, A., Oger, E., Paquelet, S., and Garès, V. (2023). Extending the Fellegi-Sunter record linkage model for mixed-type data with application to the French national health data system. Computational Statistics & Data Analysis, 179, 107656.
Sugiyama, M., Suzuki, T., Nakajima, S. et al. (2008). Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60, 699–746. doi:10.1007/s10463-008-0197-x
Examples
df_1 <- data.frame(
  "name" = c("John", "Emily", "Mark", "Anna", "David"),
  "surname" = c("Smith", "Johnson", "Taylor", "Williams", "Brown")
)
df_2 <- data.frame(
  "name" = c("John", "Emely", "Marc", "Michael"),
  "surname" = c("Smith", "Jonson", "Tailor", "Henderson")
)
comparators <- list("name" = jarowinkler_complement(),
                    "surname" = jarowinkler_complement())
matches <- data.frame("a" = 1:3, "b" = 1:3)
methods <- list("name" = "continuous_nonparametric",
                "surname" = "continuous_nonparametric")
model <- train_rec_lin(A = df_1, B = df_2, matches = matches,
                       variables = c("name", "surname"),
                       comparators = comparators,
                       methods = methods)
df_new_1 <- data.frame(
  "name" = c("Jame", "Lia", "Tomas", "Matthew", "Andrew"),
  "surname" = c("Wilsen", "Thomsson", "Davis", "Robinson", "Scott")
)
df_new_2 <- data.frame(
  "name" = c("James", "Leah", "Thomas", "Sophie", "Mathew", "Andrew"),
  "surname" = c("Wilson", "Thompson", "Davies", "Clarks", "Robins", "Scots")
)
predict(model, df_new_1, df_new_2)
#> The algorithm predicted 5 matches.
#> The first 5 predicted matches are:
#> a b ratio / 1000
#> <num> <num> <num>
#> 1: 3 3 0.6466869
#> 2: 4 5 0.5865049
#> 3: 1 1 0.5696382
#> 4: 5 6 0.3103742
#> 5: 2 2 0.2935612
#> ========================================================
#> The construction of the classification set was based on estimates of its size.
#> Estimated false link rate (FLR): 1.1486 %.
#> Estimated missing match rate (MMR): 1.1486 %.