Function creates shingles (strings with 2 characters, default) or vectors using a given model (e.g., GloVe), applies approximate nearest neighbour (ANN) algorithms via the rnndescent, RcppHNSW, RcppAnnoy and mlpack packages, and creates blocks using graphs via igraph.
Usage
blocking(
x,
y = NULL,
representation = c("shingles", "vectors"),
model,
deduplication = TRUE,
on = NULL,
on_blocking = NULL,
ann = c("nnd", "hnsw", "annoy", "lsh", "kd"),
distance = c("cosine", "euclidean", "l2", "ip", "manhatan", "hamming", "angular"),
ann_write = NULL,
ann_colnames = NULL,
true_blocks = NULL,
verbose = c(0, 1, 2),
graph = FALSE,
seed = 2023,
n_threads = 1,
control_txt = controls_txt(),
control_ann = controls_ann()
)
Arguments
- x
reference data (a character vector or a matrix),
- y
query data (a character vector or a matrix), if not provided NULL by default and thus deduplication is performed,
- representation
method of representing input data (possible
c("shingles", "vectors")
; default"shingles"
),- model
a matrix containing word embeddings (e.g., GloVe), required only when
representation = "vectors"
,- deduplication
whether deduplication should be applied (default TRUE as y is set to NULL),
- on
variables for ANN search (currently not supported),
- on_blocking
variables for blocking records before ANN search (currently not supported),
- ann
algorithm to be used for searching for ann (possible,
c("nnd", "hnsw", "annoy", "lsh", "kd")
, default"nnd"
which corresponds to nearest neighbour descent method),- distance
distance metric (default
cosine
, more options are possible see details),- ann_write
writing an index to file. Two files will be created: 1) an index, 2) and text file with column names,
- ann_colnames
file with column names if
x
ory
are indices saved on the disk (currently not supported),- true_blocks
matrix with true blocks to calculate evaluation metrics (standard metrics based on confusion matrix as well as all metrics from compare are returned).
- verbose
whether log should be provided (0 = none, 1 = main, 2 = ANN algorithm verbose used),
- graph
whether a graph should be returned (default FALSE),
- seed
seed for the algorithms (for reproducibility),
- n_threads
number of threads used for the ANN algorithms and adding data for index and query,
- control_txt
list of controls for text data (passed only to itoken_parallel or itoken), used only when
representation = "shingles"
,- control_ann
list of controls for the ANN algorithms.
Value
Returns a list containing:
result
–data.table
with indices (rows) of x, y, block and distance between pointsmethod
– name of the ANN algorithm used,deduplication
– information whether deduplication was applied,representation
– information whether shingles or vectors were used,metrics
– metrics for quality assessment, iftrue_blocks
is provided,confusion
– confusion matrix, iftrue_blocks
is provided,colnames
– variable names (colnames) used for search,graph
–igraph
class object.
Examples
## an example using RcppHNSW
df_example <- data.frame(txt = c("jankowalski", "kowalskijan", "kowalskimjan",
"kowaljan", "montypython", "pythonmonty", "cyrkmontypython", "monty"))
result <- blocking(x = df_example$txt,
ann = "hnsw",
control_ann = controls_ann(hnsw = control_hnsw(M = 5, ef_c = 10, ef_s = 10)))
result
#> ========================================================
#> Blocking based on the hnsw method.
#> Number of blocks: 2.
#> Number of columns used for blocking: 28.
#> Reduction ratio: 0.5714.
#> ========================================================
#> Distribution of the size of the blocks:
#> 4
#> 2
## an example using GloVe and RcppAnnoy
if (FALSE) { # \dontrun{
old <- getOption("timeout")
options(timeout = 500)
utils::download.file("https://nlp.stanford.edu/data/glove.6B.zip", destfile = "glove.6B.zip")
utils::unzip("glove.6B.zip")
glove_6B_50d <- readr::read_table("glove.6B.50d.txt",
col_names = FALSE,
show_col_types = FALSE)
data.table::setDT(glove_6B_50d)
glove_vectors <- glove_6B_50d[,-1]
glove_vectors <- as.matrix(glove_vectors)
rownames(glove_vectors) <- glove_6B_50d$X1
## spaces between words are required
df_example_spaces <- data.frame(txt = c("jan kowalski", "kowalski jan", "kowalskim jan",
"kowal jan", "monty python", "python monty", "cyrk monty python", "monty"))
result_annoy <- blocking(x = df_example_spaces$txt,
ann = "annoy",
representation = "vectors",
model = glove_vectors)
result_annoy
options(timeout = old)
} # }