Model for the outcome for the mass imputation estimator based on predictive mean matching (PMM). The implementation is currently based on the RANN::nn2 function and therefore uses Euclidean distance to match units from \(S_{\mathrm{NP}}\) (non-probability) to \(S_{\mathrm{P}}\) (probability) on values predicted from the auxiliary variables \(\boldsymbol{x}_i\) by a model fitted with either method_glm or method_npar. The mean is estimated on the \(S_{\mathrm{P}}\) sample: when pop_size is supplied this is the known-\(N\) Horvitz-Thompson mean; otherwise it reduces to the usual ratio mean with \(\hat{N} = \sum_{i\in S_{\mathrm{P}}} d_{\mathrm{P}, i}\). The pop_size argument is not converted into a finite population correction; if an fpc is needed, it should be supplied in svydesign, where it is handled by the {survey} variance routines. Matching ties are randomized by the nearest-neighbour step before donor values are aggregated.
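The difference between the two mean estimators can be illustrated with a few lines of base R. This is only a sketch on made-up numbers: the vectors d_p and y_imp and the value of pop_size are hypothetical, and the code does not call any package function.

```r
# Hypothetical toy data: design weights and imputed values for S_P.
d_p   <- c(4, 4, 4, 4)   # design weights d_{P,i}
y_imp <- c(1, 0, 1, 1)   # imputed values (illustrative only)

# Ratio mean: N is estimated by the sum of the design weights.
n_hat      <- sum(d_p)
mean_ratio <- sum(d_p * y_imp) / n_hat

# Known-N mean: the same weighted total divided by a supplied pop_size.
pop_size     <- 20
mean_known_n <- sum(d_p * y_imp) / pop_size

c(ratio = mean_ratio, known_N = mean_known_n)
```

The two values coincide only when pop_size equals the sum of the design weights.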

This implementation extends the approach of Yang et al. (2021) as described in Chlebicki et al. (2025), namely:

pmm_weights

if k > 1, a weighted aggregation of the donor values is used to compute the mean for a given unit. The weights are based on the distance matrix returned by the RANN::nn2 function (pmm_weights from the control_out() function)

nn_exact_se

if the non-probability sample is small we recommend using a mini-bootstrap approach to estimate variance from the non-probability sample (nn_exact_se from the control_inf() function). If non-constant pseudo-weights are supplied, bootstrap samples are drawn with probabilities proportional to inverse weights and the resampled weights are used in each refitted outcome model.

pmm_k_choice

the main nonprob function allows for dynamic selection of k neighbours based on a full-grid variance minimization procedure over 1:n_NP (pmm_k_choice from the control_out() function)

Usage

method_pmm(
  y_nons,
  X_nons,
  X_rand,
  svydesign,
  weights = NULL,
  family_outcome = "gaussian",
  start_outcome = NULL,
  vars_selection = FALSE,
  pop_totals = NULL,
  pop_size = NULL,
  control_outcome = control_out(),
  control_inference = control_inf(),
  verbose = FALSE,
  se = TRUE
)

Arguments

y_nons

target variable from non-probability sample

X_nons

a model.matrix with auxiliary variables from non-probability sample

X_rand

a model.matrix with auxiliary variables from probability sample

svydesign

a svydesign object

weights

case / frequency weights from non-probability sample. If nn_exact_se=TRUE, non-constant weights also define mini-bootstrap sampling probabilities proportional to their inverses and are resampled for each bootstrap refit.

family_outcome

family for the glm model

start_outcome

start parameters

vars_selection

whether variable selection should be conducted

pop_totals

a placeholder (not used in method_pmm)

pop_size

population size from the nonprob function. If NULL, the method uses sum(weights(svydesign)). If supplied, it is used as the known-\(N\) denominator for the mean and variance scaling, but it does not modify the finite population correction of svydesign.

control_outcome

controls passed by the control_out function

control_inference

controls passed by the control_inf function

verbose

parameter passed from the main nonprob function

se

whether standard errors should be calculated

Value

an object of class nonprob_method, which is a list with the following entries

model_fitted

fitted model, either a glm.fit or a cv.ncvreg object

y_nons_pred

predicted values for the non-probability sample

y_rand_pred

predicted values for the probability sample or population totals

coefficients

coefficients for the model (if available)

svydesign

an updated survey.design2 object (new column y_hat_MI is added)

y_mi_hat

estimated population mean for the target variable

vars_selection

whether variable selection was performed

var_prob

variance for the probability sample component (if available)

var_nonprob

variance for the non-probability sample component

model

model type (character "pmm")

family

depends on the method selected for estimating E(Y|X)

Details

Matching

In the package we support two types of matching:

  1. \(\hat{y} - \hat{y}\) matching (default; control_out(pmm_match_type = 1)).

  2. \(\hat{y} - y\) matching (control_out(pmm_match_type = 2)).
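The two matching types can be sketched in base R for k = 1. This is an illustration under stated assumptions, not the package code: the data are made up, the helper nearest() is hypothetical, and a brute-force search on absolute distance stands in for RANN::nn2. Type 2 is read here as matching predictions for \(S_{\mathrm{P}}\) against observed values in \(S_{\mathrm{NP}}\).

```r
# Hypothetical predictions and outcomes (not package data).
y_nons      <- c(1.2, 0.4, 2.5, 2.1)   # observed y in S_NP
y_nons_pred <- c(1.0, 0.5, 3.0, 2.0)   # fitted values for S_NP
y_rand_pred <- c(0.6, 2.4)             # predicted values for S_P

# Index of the single nearest donor for each target value.
nearest <- function(target, pool) {
  vapply(target, function(v) which.min(abs(pool - v)), integer(1))
}

donor_type1 <- nearest(y_rand_pred, y_nons_pred)  # yhat - yhat matching
donor_type2 <- nearest(y_rand_pred, y_nons)       # yhat - y matching

# Imputed values are always the observed y of the selected donors.
y_imp_type1 <- y_nons[donor_type1]
y_imp_type2 <- y_nons[donor_type2]
```

In this toy example the second unit of \(S_{\mathrm{P}}\) receives a different donor under each matching type, because the fitted value and the observed value of the third non-probability unit differ.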

Analytical variance

The variance of the mean is estimated in two parts.

(a) non-probability part (\(S_{\mathrm{NP}}\) with size \(n_{\mathrm{NP}}\); denoted as var_nonprob in the result)

This part is currently estimated using the non-parametric mini-bootstrap estimator proposed by Chlebicki et al. (2025, Algorithm 2). The estimator has not been proved to be consistent, but it has good finite population properties. This bootstrap is enabled with control_inf(nn_exact_se = TRUE) and can be summarized as follows:

  1. Sample \(n_{\mathrm{NP}}\) units from \(S_{\mathrm{NP}}\) with replacement to create \(S_{\mathrm{NP}}'\). If non-constant pseudo-weights are supplied through weights, sampling probabilities are proportional to their inverses; equal weights use uniform resampling.

  2. Estimate regression model \(\mathbb{E}[Y|\boldsymbol{X}]=m(\boldsymbol{X}, \cdot)\) based on \(S_{\mathrm{NP}}'\) from step 1, using the resampled weights when supplied.

  3. Compute \(\hat{\nu}'(i,t)\) for \(t=1,\dots,k, i\in S_{\mathrm{P}}\) using estimated \(m(\boldsymbol{x}', \cdot)\) and \(\left\lbrace(y_{j},\boldsymbol{x}_{j})| j\in S_{\mathrm{NP}}'\right\rbrace\).

  4. Compute \(\displaystyle\frac{1}{k}\sum_{t=1}^{k}y_{\hat{\nu}'(i,t)}\) using \(Y\) values from \(S_{\mathrm{NP}}'\).

  5. Repeat steps 1-4 \(M\) times (\(M=50\) is hard-coded in the current implementation).

  6. Estimate \(\hat{V}_1=\mathrm{var}({\hat{\boldsymbol{\mu}}})\) from the \(M\) bootstrap replicates and save it as var_nonprob.
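The steps above can be sketched in base R as follows. This is a minimal illustration under simplifying assumptions, not the package implementation: the data are simulated, weights are equal (so resampling is uniform), lm() stands in for the outcome model, and a brute-force search on predicted values replaces RANN::nn2. All object names are hypothetical.

```r
set.seed(123)

# Hypothetical toy samples (not package data).
n_np <- 40
x_np <- runif(n_np)
y_np <- 1 + 2 * x_np + rnorm(n_np, sd = 0.3)  # S_NP: y and x observed
x_p  <- runif(15)                              # S_P: only x observed
d_p  <- rep(10, 15)                            # design weights for S_P
k    <- 3                                      # number of donors
M    <- 50                                     # bootstrap replicates

# Step 5: repeat steps 1-4 M times.
boot_means <- replicate(M, {
  # Step 1: resample S_NP with replacement (uniform, equal weights).
  idx  <- sample.int(n_np, n_np, replace = TRUE)
  boot <- data.frame(y = y_np[idx], x = x_np[idx])

  # Step 2: refit the outcome model on the bootstrap sample.
  fit     <- lm(y ~ x, data = boot)
  pred_np <- fitted(fit)
  pred_p  <- predict(fit, newdata = data.frame(x = x_p))

  # Steps 3-4: for each unit in S_P, find the k nearest donors by
  # predicted value and average their observed y.
  y_imp <- vapply(pred_p, function(v) {
    donors <- order(abs(pred_np - v))[seq_len(k)]
    mean(boot$y[donors])
  }, numeric(1))

  # Weighted mean over S_P for this replicate.
  sum(d_p * y_imp) / sum(d_p)
})

# Step 6: variance of the M replicate means.
var_nonprob <- var(boot_means)
```

Because the model is refitted and the donors are re-selected in every replicate, var_nonprob reflects both model estimation and matching variability in the non-probability sample.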

(b) probability part (\(S_{\mathrm{P}}\) with size \(n_{\mathrm{P}}\); denoted as var_prob in the result)

This part uses functionalities of the {survey} package and the variance is estimated using the following equation:

$$ \hat{V}_2=\frac{1}{N^2} \sum_{i=1}^{n_{\mathrm{P}}} \sum_{j=1}^{n_{\mathrm{P}}} \frac{\pi_{i j}-\pi_i \pi_j}{\pi_{i j}} \frac{m(\boldsymbol{x}_i; \hat{\boldsymbol{\beta}})}{\pi_i} \frac{m(\boldsymbol{x}_j; \hat{\boldsymbol{\beta}})}{\pi_j}. $$

Note that \(\hat{V}_2\) in principle can be estimated in various ways depending on the type of the design and whether population size is known or not.
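As one illustration of how the design enters \(\hat{V}_2\), consider Poisson sampling, where units are selected independently so that \(\pi_{ij}=\pi_i\pi_j\) for \(i\neq j\) and \(\pi_{ii}=\pi_i\). The sketch below, on hypothetical inclusion probabilities and predictions, evaluates the double sum directly and compares it with the single sum that remains once the off-diagonal terms vanish. It does not use the {survey} package and is not how the package computes var_prob.

```r
# Hypothetical inclusion probabilities and model predictions for S_P.
pi_i  <- c(0.10, 0.20, 0.25, 0.50)
m_hat <- c(2.0, 3.5, 1.0, 4.0)
N     <- 20   # assumed known population size

# Joint inclusion probabilities under Poisson sampling.
pi_ij <- outer(pi_i, pi_i)
diag(pi_ij) <- pi_i

# Double sum over all pairs (i, j) in S_P, pairing m(x_i) / pi_i
# with m(x_j) / pi_j.
z <- m_hat / pi_i
v2_double_sum <- sum((pi_ij - outer(pi_i, pi_i)) / pi_ij * outer(z, z)) / N^2

# Only the i = j terms remain: (1 - pi_i) * m_i^2 / pi_i^2.
v2_simplified <- sum((1 - pi_i) * m_hat^2 / pi_i^2) / N^2

all.equal(v2_double_sum, v2_simplified)
```

For other designs the joint inclusion probabilities differ and the off-diagonal terms generally do not vanish.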

Examples


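library(survey)      # provides svydesign()
library(nonprobsvy)  # provides method_pmm() and control_out()

sample_a <- data.frame(y = c(1, 0, 1, 0, 1), x = c(0, 1, 2, 3, 4))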
sample_b <- data.frame(x = c(0.5, 1.5, 2.5, 3.5), w = c(4, 4, 4, 4))
sample_b_svy <- svydesign(ids = ~1, weights = ~w, data = sample_b)

res_pmm <- method_pmm(
  y_nons = sample_a$y,
  X_nons = model.matrix(~x, sample_a),
  X_rand = model.matrix(~x, sample_b),
  svydesign = sample_b_svy,
  control_outcome = control_out(k = 1),
  se = FALSE
)

res_pmm
#> Mass imputation model (PMM approach). Estimated mean: 0.7500 (se: NA)