Skip to contents

Overview

This vignette explains the matrix-vectorized implementation of the fully nonparametric exponential tilting (EXPTILT) estimator, as described in Appendix 2 of Riddles et al. (2016). This method is designed for fully categorical data, where the outcomes (YY), the response-model covariates (X1X_1), and the instrumental-variable covariates (X2X_2) are all discrete.

Unlike the parametric method, which estimates the parameters ϕ\phi of a response probability function π(x,y;ϕ)\pi(x, y; \phi), the nonparametric approach directly estimates the nonresponse odds for each stratum. This is achieved using an Expectation-Maximization (EM) algorithm to find the maximum likelihood estimates of these odds.

The implementation (exptilt_nonparam.data.frame) assumes the input data is an aggregated data frame, where each row represents a unique stratum x*=(x1,x2)x^* = (x_1, x_2) and contains the counts of respondents for each outcome y*y^* and the total count of nonrespondents.

Notation and Main Objects

The implementation maps directly to the notation in Appendix 2. Let x1x_1 be the covariates for the response model and x2x_2 be the nonresponse instrumental variable. A full stratum is x*=(x1,x2)x^* = (x_1, x_2), and a response-model-only stratum is x1x_1.

The algorithm is built from a set of fixed matrices (computed once) and one matrix that is iteratively updated.

Fixed Objects (Pre-computed)

  1. Respondent Counts Ny*x*N_{y^*x^*} (code: n_y_x_matrix)
    • Source: data[, outcome_cols]
    • Dimensions: (Nstrata×Koutcomes)(N_{\text{strata}} \times K_{\text{outcomes}}), where NstrataN_{\text{strata}} is the number of (x1,x2)(x_1, x_2) rows and KoutcomesK_{\text{outcomes}} is the number of YY categories.
    • Definition: The observed (weighted) count of respondents for stratum x*x^* and outcome y*y^*. Ny*x*=iAdiδiI(yi=y*,xi=x*)N_{y^*x^*} = \sum_{i \in A} d_i \delta_i I(y_i = y^*, x_i = x^*) (Note: The implementation assumes di=1d_i=1 unless the input data is pre-weighted).
  2. Nonrespondent Counts Mx*M_{x^*} (code: m_x_vec)
    • Source: data[, refusal_col]
    • Dimensions: (Nstrata×1)(N_{\text{strata}} \times 1)
    • Definition: The observed (weighted) total count of nonrespondents for stratum x*x^*. Mx*=mx*M_{x^*} = m_{x^*}
  3. Respondent Proportions P̂y*|x*\hat{P}_{y^*|x^*} (code: p_hat_matrix)
    • Source: n_y_x_matrix / rowSums(n_y_x_matrix)
    • Dimensions: (Nstrata×Koutcomes)(N_{\text{strata}} \times K_{\text{outcomes}})
    • Definition: The conditional probability (proportion) of observing outcome y*y^* given stratum x*x^*, among respondents. This is a fixed, observed quantity used in the E-Step. p̂y*|x*=Ny*x*yNyx*=pr(y=y*|x1=x1*,x2=x2*,δ=1)\hat{p}_{y^*|x^*} = \frac{N_{y^*x^*}}{\sum_{y} N_{yx^*}} = pr(y=y^* | x_1=x_1^*, x_2=x_2^*, \delta=1)
  4. Aggregated Respondent Counts Ny*x1*N_{y^*x_1^*} (code: n_y_x1_matrix)
    • Source: aggregate(n_y_x_matrix ~ data$x1_key, ...)
    • Dimensions: (Nx1×Koutcomes)(N_{x_1} \times K_{\text{outcomes}}), where Nx1N_{x_1} is the number of unique x1x_1 strata.
    • Definition: The observed (weighted) count of respondents for outcome y*y^* in stratum x1x_1, summed over the instrument x2x_2. This is the denominator of the M-Step. Ny*x1*=x2Ny*(x1*,x2)=ny*x1*N_{y^*x_1^*} = \sum_{x_2} N_{y^*(x_1^*, x_2)} = n_{y^*x_1^*}

Iterative Objects (The EM Algorithm)

  1. Odds Matrix O(t)(x1,y)O^{(t)}(x_1, y) (code: odds_matrix)
    • Dimensions: (Nx1×Koutcomes)(N_{x_1} \times K_{\text{outcomes}})
    • Definition: The parameter being estimated. It represents the odds of nonresponse for a given x1x_1 stratum and outcome yy. This is updated in the M-Step.
    • Initialization (Step 0): O(0)(x1,y)=1O^{(0)}(x_1, y) = 1 for all (x1,y)(x_1, y).
  2. Expected Nonrespondent Counts My*x*(t)M_{y^*x^*}^{(t)} (code: m_y_x_matrix)
    • Dimensions: (Nstrata×Koutcomes)(N_{\text{strata}} \times K_{\text{outcomes}})
    • Definition: The expected count of nonrespondents for stratum x*x^* and outcome y*y^*, given the current odds O(t)O^{(t)}. This is computed in the E-Step.
  3. Aggregated Expected Nonrespondent Counts My*x1*(t)M_{y^*x_1^*}^{(t)} (code: m_y_x1_matrix)
    • Dimensions: (Nx1×Koutcomes)(N_{x_1} \times K_{\text{outcomes}})
    • Definition: The expected count of nonrespondents for outcome y*y^* in stratum x1x_1, summed over the instrument x2x_2. This is the numerator of the M-Step.

The Expectation-Maximization (EM) Algorithm

The function exptilt_nonparam.data.frame is a direct implementation of the EM algorithm in Appendix 2. The goal is to find the argmaxO(x1,y)\text{argmax} O(x_1, y) that maximizes the observed data likelihood, which is solved by iterating two steps.

Step 1.1 (E-Step): Compute Expected Nonrespondent Counts

The E-Step computes the expected breakdown of nonrespondents into outcome categories. It answers: “Given our current odds_matrix O(t)O^{(t)}, what is the expected count My*x*(t)M_{y^*x^*}^{(t)} for each full stratum x*x^*?”

  • Formula: My*x*(t)=Mx*p̂y*|x*O(t)(x1*,y*)yp̂y|x*O(t)(x1*,y)M_{y^*x^*}^{(t)} = M_{x^*} \frac{\hat{p}_{y^*|x^*} O^{(t)}(x_1^*, y^*)}{\sum_{y} \hat{p}_{y|x^*} O^{(t)}(x_1^*, y)}
  • Implementation (E-STEP in code): This is vectorized. The denominator is computed via rowSums(p_hat_matrix * odds_joined_matrix). The full calculation is: m_y_x_matrix <- m_x_vec * (p_hat_matrix * odds_joined_matrix) / denominator

Step 1.2 (M-Step): Update Odds Matrix

The M-Step updates the nonresponse odds O(x1,y)O(x_1, y) using the expected counts from the E-Step.

  1. Aggregate Expected Counts: First, the expected nonrespondent counts My*x*(t)M_{y^*x^*}^{(t)} are aggregated over the instrument x2x_2 to get the total expected nonrespondents My*x1*(t)M_{y^*x_1^*}^{(t)} for each (x1,y)(x_1, y) cell.
    • Formula: My*x1*(t)=x2My*(x1*,x2)(t)M_{y^*x_1^*}^{(t)} = \sum_{x_2} M_{y^*(x_1^*, x_2)}^{(t)}
    • Implementation: m_y_x1_matrix <- aggregate(m_y_x_matrix ~ data$x1_key, ...)
  2. Update Odds: The new odds O(t+1)O^{(t+1)} is the simple ratio of the total expected nonrespondents to the total observed respondents for each (x1,y)(x_1, y) cell.
    • Formula: O(t+1)(x1*,y*)=My*x1*(t)Ny*x1*O^{(t+1)}(x_1^*, y^*) = \frac{M_{y^*x_1^*}^{(t)}}{N_{y^*x_1^*}}
    • Implementation: odds_matrix <- m_y_x1_matrix / n_safe

Step 2: Convergence

Steps 1.1 and 1.2 are repeated until the change in the odds_matrix (measured by the sum of absolute differences) falls below the tolerance tol.

Final Estimates and Survey Weights

Survey Weights

The exptilt_nonparam.data.frame function assumes the input data is already aggregated. If a survey.design object is provided to exptilt_nonparam.survey.design, an adapter (not shown here) must first be used to create this aggregated table from the microdata. In that case, Ny*x*N_{y^*x^*} and Mx*M_{x^*} represent weighted sums of counts (di\sum d_i \dots), not simple counts. The EM algorithm and all derived matrices (p_hat_matrix, odds_matrix) are then correctly computed based on these weighted inputs, following the logic of the paper.

Final Adjusted Counts

The final output data_to_return is the object of primary interest for analysis. It is constructed by: 1. Calculating the final expected nonrespondent counts My*x*(final)M_{y^*x^*}^{(\text{final})} using the converged odds_matrix. 2. Adding these expected counts to the original observed respondent counts Ny*x*N_{y^*x^*}.

  • Definition: data_to_return[y^*, x^*] = N_{y^*x^*} + M_{y^*x^*}^{(\text{final})}
  • Implementation: data_to_return[, outcome_cols] <- n_y_x_matrix_ordered + m_y_x_matrix_ordered

This final adjusted table represents the completed dataset, where the “Refusal” counts have been redistributed across the outcome columns according to the NMAR model.

Final Proportion θ̂j\hat{\theta}_j

The final population proportion for an outcome jj, θ̂j\hat{\theta}_j, is the total (weighted) adjusted count for outcome jj divided by the total (weighted) population. This is calculated from the data_to_return object.

  • Formula (Unweighted): θ̂j=x*(Njx*+Mjx*(final))yx*(Nyx*+Myx*(final))\hat{\theta}_j = \frac{\sum_{x^*} (N_{jx^*} + M_{jx^*}^{(\text{final})})}{\sum_y \sum_{x^*} (N_{yx^*} + M_{yx^*}^{(\text{final})})}
  • Implementation: This is precisely what the “Adjusted Proportions” calculation in your notebook’s Chunk 5 performs on the data_to_return object.

This differs from the paper’s Step 3 formula, which is an IPW (Inverse Probability Weighting) estimator. However, both methods are asymptotically equivalent, and the “add-and-sum” method (your data_to_return) is a more direct and intuitive application of the EM algorithm’s goal.