To adjust or not to adjust: instruments and precision variables

Recently, I’ve been listening to (i.e., binging) the Casual Inference podcast hosted by Lucy D’Agostino McGowan and Ellie Murray, and a question was raised at least a couple of times (Season 3 Episode 10 & Season 4 Episode 9): does adjusting for an instrument (in, e.g., a propensity score model) cause bias? I believe the answer to this question is no, except in somewhat extreme scenarios. However, such a choice will impact the efficiency of a treatment effect estimator. In this post, I want to review a pair of lemmas from an excellent paper by Rotnitzky & Smucler (2020; JMLR) to formally address this and related issues.

Graphs, Instruments, and Precision Variables

Say we are interested in the effect of an exposure, \(A\), on an outcome, \(Y\). In order to use data to estimate the causal effect of \(A\) on \(Y\), we typically aim to adjust for a sufficient set of confounders, \(\boldsymbol{L}\); informally, \(\boldsymbol{L}\) are the common causes of \(A\) and \(Y\). Graphically, we can represent these relationships on a causal directed acyclic graph (DAG):

Figure 1: Canonical observational study DAG

The DAG in Figure 1 encodes the assumption that, within levels of \(\boldsymbol{L}\), the treatment is as good as randomized. In counterfactual language, we would say that

\begin{equation} A \perp\!\!\!\!\perp Y(a) \mid \boldsymbol{L}, \end{equation}

where \(Y(a)\) is the potential outcome that would occur under exposure level \(A = a\). Graphically, \(\boldsymbol{L}\) blocks all backdoor paths (i.e., those starting with an edge into the treatment, \(A \leftarrow\)) from \(A\) to \(Y\). [ Aside: the connection between the DAG and potential outcomes is not obvious, and requires one to associate a structural model with the DAG. With the FFRCISTG model, one can read off counterfactual independence statements like this using single world intervention graphs—this is a discussion for another day! ]

In a given scientific setting, the situation may be more complicated than that in the above picture. That is, we might imagine a more refined DAG with a greater number of pre-treatment variables depicted and certain arrows missing due to domain knowledge. Instruments and precision variables represent two special cases of such additional variables. An instrument, simply put, is a pure predictor of the exposure. In this context, we will say \(Z\) is an instrument if it causes \(A\) but only has an effect on \(Y\) through \(A\): in DAG form,

Figure 2: Observational study with an instrument

Note that I use the two-sided arrow (\(\leftrightarrow\)) between \(\boldsymbol{L}\) and \(Z\), as the direction of association (or indeed the existence of an exogenous common cause) does not affect our discussion. A precision variable, on the other hand, is a pure predictor of the outcome. That is, \(W\) is a precision variable if it directly affects \(Y\), but is only associated with \(A\) through associations with \(\boldsymbol{L}\):

Figure 3: Observational study with a precision variable

In these alternative scenarios, we are left with multiple valid adjustment sets. That is, in the DAG in Figure 2, we will have \((1)\) in addition to \(A \perp\!\!\!\!\perp Y(a) \mid \boldsymbol{L}, Z\). Meanwhile, in the DAG in Figure 3, we have \((1)\) in addition to \(A \perp\!\!\!\!\perp Y(a) \mid \boldsymbol{L}, W\). In each case, we would like to know: should we adjust only for \(\boldsymbol{L}\), or should we include the additional variable as well?

The AIPW estimating function and its variance

In order to make the above question concrete, we need to specify at the very least (i) a precise causal estimand of interest, (ii) one or more identification formulas for this estimand, and (iii) an estimator for these functionals. We will assume that we observe a random sample \(O_1, \ldots, O_n \overset{\mathrm{iid}}{\sim} \mathbb{P}\), where a typical observation is \(O = (\boldsymbol{L}, A, Y)\) or \((\boldsymbol{L}, Z, A, Y)\) or \((\boldsymbol{L}, W, A, Y)\), for the three scenarios above, respectively. The estimand of interest, for our purposes, will be the population average treatment effect (ATE), \(\mathbb{E}(Y(1) - Y(0))\); we will assume for simplicity that the exposure is binary, \(A \in \{0,1\}\).

To proceed with identification (and consequently to have a hope at estimation), we need one crucial positivity assumption.

Assumption (\(*\)): For any \(\boldsymbol{B} \subseteq O \setminus \{A, Y\}\), \(\mathbb{P}[A = 1 \mid \boldsymbol{B}] \in (0,1)\) with probability 1.

In all three DAG scenarios presented above, and under Assumption (\(*\)), the ATE is identified by the statistical functional \[\chi(\mathbb{P}) = \mathbb{E}_{\mathbb{P}}(\mu_1(\boldsymbol{L}) - \mu_0(\boldsymbol{L})),\] where \(\mu_a(\boldsymbol{L}) = \mathbb{E}_{\mathbb{P}}(Y \mid \boldsymbol{L}, A = a)\); henceforth we will suppress dependence on \(\mathbb{P}\) and often omit inputs to functions when there is no ambiguity. In order to represent alternative adjustment sets, we will augment our notation: define \[\chi_{\boldsymbol{B}} = \mathbb{E}(\mu_{1, \boldsymbol{B}} - \mu_{0, \boldsymbol{B}}),\] where \(\boldsymbol{B} \subseteq O \setminus \{A, Y\}\), and \(\mu_{a, \boldsymbol{B}}(\boldsymbol{B}) \equiv \mathbb{E}(Y \mid \boldsymbol{B}, A = a)\). The ATE, \(\mathbb{E}(Y(1) - Y(0))\), equals \(\chi_{\boldsymbol{B}}\) where \(\boldsymbol{B} = \boldsymbol{L}\) in the DAG of Figure 1, \(\boldsymbol{B} = \boldsymbol{L}\) or \((\boldsymbol{L}, Z)\) in the DAG of Figure 2, and \(\boldsymbol{B} = \boldsymbol{L}\) or \((\boldsymbol{L}, W)\) in the DAG of Figure 3.

With identification out of the way, we are left with the statistical task of estimating these quantities. I tend to prefer, when possible, estimators that are asymptotically “optimal”. In particular, estimators based on influence functions have really nice theoretical properties. In a nonparametric model, a given functional has a unique influence function. That said, the presence of an instrument or precision variable means that the statistical model is no longer completely nonparametric, and is actually a proper semiparametric model. Concretely, the absence of arrows in a DAG implies conditional independence restrictions on the observed data distribution \(\mathbb{P}\): in the DAG of Figure 2, \(Z \perp\!\!\!\!\perp Y \mid \boldsymbol{L}, A\), while in the DAG of Figure 3, \(A \perp\!\!\!\!\perp W \mid \boldsymbol{L}\). As a consequence, there are infinitely many influence functions one may work with, and the best choice—the efficient influence function (EIF)—is that with the lowest variance. The variance of a functional’s EIF represents a local asymptotic minimax lower bound for estimation of that functional, and thus one often aims to construct an estimator that attains this variance bound asymptotically.

Our goals in this note are a tad more modest, and we will instead consider estimators based solely on the nonparametric influence functions of \(\chi_{\boldsymbol{B}}\) for different choices of \(\boldsymbol{B}\). Let \(\pi_{a, \boldsymbol{B}}(\boldsymbol{B}) = \mathbb{P}[A = a \mid \boldsymbol{B}]\) be the exposure probabilities on the basis of variables \(\boldsymbol{B}\), and define

\begin{equation} \phi_{\boldsymbol{B}} = \mu_{1, \boldsymbol{B}} - \mu_{0, \boldsymbol{B}} + \frac{2A - 1}{\pi_{A, \boldsymbol{B}}}(Y - \mu_{A, \boldsymbol{B}}), \end{equation}

which is the (uncentered) nonparametric influence function of the functional \(\chi_{\boldsymbol{B}}\). The influence function in \((2)\) takes a familiar augmented inverse probability-weighted (AIPW) form. As mentioned, we wish to construct estimators based on \(\phi_{\boldsymbol{B}}\). This is a general task that we may revisit in another post, but let us quickly review one simple procedure, which will yield the cross-fit one-step estimator. We will split our data into \((D_1^n, D_2^n)\) where each \(D_j^n\) has size \(n/2\), fit models \(( \widehat{\mu}_{0,\boldsymbol{B}}, \widehat{\mu}_{1, \boldsymbol{B}}, \widehat{\pi}_{1,\boldsymbol{B}})\) on \(D_1^n\), then compute \[\widehat{\chi}_{\boldsymbol{B}}^{(1)} = \frac{2}{n}\sum_{O_i \in D_2^n} \widehat{\phi}_{\boldsymbol{B}}(O_i),\] where \[\widehat{\phi}_{\boldsymbol{B}} = \widehat{\mu}_{1, \boldsymbol{B}} - \widehat{\mu}_{0, \boldsymbol{B}} + \frac{2A - 1}{\widehat{\pi}_{A, \boldsymbol{B}}}(Y - \widehat{\mu}_{A, \boldsymbol{B}}).\] We then swap the roles of \(D_1^n\) and \(D_2^n\), repeat the above procedure to obtain \(\widehat{\chi}_{\boldsymbol{B}}^{(2)}\), and compute the final estimator \(\widehat{\chi}_{\boldsymbol{B}} = \frac{\widehat{\chi}_{\boldsymbol{B}}^{(1)} + \widehat{\chi}_{\boldsymbol{B}}^{(2)}}{2}\).
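
The two-fold procedure just described can be sketched in a few lines of Python. This is only a minimal illustration under assumptions that are not part of the discussion above: the nuisance functions are fit with simple parametric models (least squares for the outcome regressions, a Newton-Raphson logistic regression for the exposure probabilities), the simulated data-generating process is made up with a true ATE of 2, and every function name (`fit_ols`, `fit_logistic`, `phi_hat`, `crossfit_aipw`) is hypothetical. In a real analysis one would likely swap in flexible learners for the nuisance fits.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to a 2-D covariate array."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Least-squares coefficients for y ~ [1, X] (assumed outcome model)."""
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return beta

def fit_logistic(X, a, n_iter=25):
    """Logistic regression for a ~ [1, X] via Newton-Raphson (assumed exposure model)."""
    X1 = design(X)
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (a - p)
        hess = (X1 * (p * (1.0 - p))[:, None]).T @ X1
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def phi_hat(B_tr, A_tr, Y_tr, B_ev, A_ev, Y_ev):
    """Fit nuisances on the training fold, evaluate the estimated phi_B on the other fold."""
    mu1 = design(B_ev) @ fit_ols(B_tr[A_tr == 1], Y_tr[A_tr == 1])
    mu0 = design(B_ev) @ fit_ols(B_tr[A_tr == 0], Y_tr[A_tr == 0])
    pi1 = 1.0 / (1.0 + np.exp(-design(B_ev) @ fit_logistic(B_tr, A_tr)))
    mu_A = np.where(A_ev == 1, mu1, mu0)
    pi_A = np.where(A_ev == 1, pi1, 1.0 - pi1)
    return mu1 - mu0 + (2 * A_ev - 1) / pi_A * (Y_ev - mu_A)

def crossfit_aipw(B, A, Y, seed=0):
    """Two-fold cross-fit one-step estimator and a plug-in standard error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    d1, d2 = idx[: len(Y) // 2], idx[len(Y) // 2:]
    phi = np.empty(len(Y))
    phi[d2] = phi_hat(B[d1], A[d1], Y[d1], B[d2], A[d2], Y[d2])  # train on D1, evaluate on D2
    phi[d1] = phi_hat(B[d2], A[d2], Y[d2], B[d1], A[d1], Y[d1])  # swap the roles
    estimate = 0.5 * (phi[d2].mean() + phi[d1].mean())
    std_err = phi.std(ddof=1) / np.sqrt(len(Y))
    return estimate, std_err

# Made-up data: two confounders, true ATE = 2
rng = np.random.default_rng(0)
n = 4000
L = rng.standard_normal((n, 2))
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 * L[:, 0] - 0.5 * L[:, 1]))))
Y = 1.0 + 2.0 * A + L[:, 0] + L[:, 1] + rng.standard_normal(n)

est, se = crossfit_aipw(L, A, Y)
print(f"cross-fit AIPW estimate: {est:.3f} (plug-in SE {se:.3f})")
```

The second returned value is the sample standard deviation of the estimated influence function values divided by \(\sqrt{n}\), which is the natural plug-in for the standard error of \(\widehat{\chi}_{\boldsymbol{B}}\).
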
Under relatively weak and general nonparametric conditions [ one condition being a stronger positivity assumption, and another being the following: \(\lVert \widehat{\mu}_{a,\boldsymbol{B}} - \mu_{a,\boldsymbol{B}} \rVert \cdot \lVert \widehat{\pi}_{1,\boldsymbol{B}} - \pi_{1,\boldsymbol{B}} \rVert = o_{\mathbb{P}}(n^{-1/2})\) for \(a \in \{0,1\}\), where \(\lVert f \rVert^2 = \int f(o)^2 \, d\mathbb{P}(o)\) ], one is able to obtain the following asymptotic normality result: \[\sqrt{n}\left(\widehat{\chi}_{\boldsymbol{B}} - \chi_{\boldsymbol{B}}\right) \overset{d}{\to} \mathcal{N}(0, \mathrm{Var}(\phi_{\boldsymbol{B}})).\] Thus, in large samples, for any choice of \(\boldsymbol{B}\) the estimator described above is approximately unbiased with variance roughly equal to \(\frac{\mathrm{Var}(\phi_{\boldsymbol{B}})}{n}\). This gives us a simple criterion for selecting an adjustment set \(\boldsymbol{B}\): choose \(\boldsymbol{B}\) to minimize \(\mathrm{Var}(\phi_{\boldsymbol{B}})\)! The following simple result provides an explicit form for this variance.

Lemma 1: \(\mathrm{Var}(\phi_{\boldsymbol{B}}) = \mathrm{Var}(\mu_{1, \boldsymbol{B}} - \mu_{0, \boldsymbol{B}}) + \mathbb{E}\left(\frac{\sigma_{A, \boldsymbol{B}}^2}{\pi_{A, \boldsymbol{B}}^2}\right)\), where \(\sigma_{a, \boldsymbol{B}}^2(\boldsymbol{B}) \equiv \mathrm{Var}(Y \mid \boldsymbol{B}, A = a)\).

Proof: Observe the following:

  1. \(\mathbb{E}\left(\frac{2A - 1}{\pi_{A, \boldsymbol{B}}} (Y - \mu_{A,\boldsymbol{B}}) \mid \boldsymbol{B}, A\right) = 0\).
  2. \(\mathrm{Cov}\left(\mu_{1, \boldsymbol{B}} - \mu_{0, \boldsymbol{B}}, \frac{2A - 1}{\pi_{A, \boldsymbol{B}}} (Y - \mu_{A,\boldsymbol{B}})\right) = 0\).
  3. \(\mathrm{Var}\left(\frac{2A - 1}{\pi_{A, \boldsymbol{B}}} (Y - \mu_{A,\boldsymbol{B}})\right) = \mathbb{E}\left(\mathrm{Var}\left(\frac{2A - 1}{\pi_{A, \boldsymbol{B}}} (Y - \mu_{A,\boldsymbol{B}}) \mid \boldsymbol{B}, A\right)\right) = \mathbb{E}\left( \frac{\sigma_{A, \boldsymbol{B}}^2}{\pi_{A,\boldsymbol{B}}^2}\right)\).

Note that observations 2. and 3. follow from observation 1. The result then follows immediately. \(\quad \blacksquare\)
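
As a sanity check on Lemma 1, here is a small simulation sketch. Everything about the data-generating process is an assumption chosen for illustration (a single binary confounder, known exposure probabilities, and a conditional outcome variance that depends on \(A\)); the true nuisance functions are plugged in directly, so no models are fit. The Monte Carlo variance of \(\phi_{\boldsymbol{B}}\) should land close to the value given by the formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Hypothetical data-generating process with known nuisance functions
L = rng.binomial(1, 0.5, size=n)
pi1 = 0.3 + 0.4 * L                      # P[A = 1 | L]
A = rng.binomial(1, pi1)
mu0 = 1.0 * L                            # E(Y | L, A = 0)
mu1 = 1.0 + 2.0 * L                      # E(Y | L, A = 1)
sigma2_A = 1.0 + A                       # Var(Y | L, A): 1 if A = 0, 2 if A = 1
mu_A = np.where(A == 1, mu1, mu0)
pi_A = np.where(A == 1, pi1, 1.0 - pi1)
Y = mu_A + np.sqrt(sigma2_A) * rng.standard_normal(n)

# Uncentered influence function evaluated at the true nuisances
phi = mu1 - mu0 + (2 * A - 1) / pi_A * (Y - mu_A)
var_mc = phi.var()

# Lemma 1, evaluated exactly by enumerating the two values of L.
# Given L, E(sigma_A^2 / pi_A^2 | L) = sigma_1^2 / pi_1 + sigma_0^2 / pi_0.
l_vals = np.array([0.0, 1.0])
p_l = np.array([0.5, 0.5])
p1 = 0.3 + 0.4 * l_vals
contrast = 1.0 + l_vals                  # mu1 - mu0 as a function of L
var_contrast = p_l @ contrast**2 - (p_l @ contrast) ** 2
var_exact = var_contrast + p_l @ (2.0 / p1 + 1.0 / (1.0 - p1))

print(f"Monte Carlo Var(phi): {var_mc:.3f}")
print(f"Lemma 1 formula:      {var_exact:.3f}")
```
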

In the following subsections, we write instruments and precision variables using bold symbols, \(\boldsymbol{Z}\) and \(\boldsymbol{W}\), to reflect that these can be collections (i.e., vectors) of covariates.

Do not adjust for instruments!

Lemma 2: (Adapted from Lemma 5 in Rotnitzky & Smucler (2020))

Suppose \(\boldsymbol{Z} \perp\!\!\!\!\perp Y \mid \boldsymbol{L}, A\). Then \(\mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{Z}}) \geq \mathrm{Var}(\phi_{\boldsymbol{L}})\).

Proof: Note that under the stated conditional independence assumption, \(\mu_{a,\boldsymbol{L}, \boldsymbol{Z}} \equiv \mu_{a,\boldsymbol{L}}\) and \(\sigma_{a, \boldsymbol{L}, \boldsymbol{Z}}^2 \equiv \sigma_{a, \boldsymbol{L}}^2\). We therefore have \[\mathbb{E}(\phi_{\boldsymbol{L}, \boldsymbol{Z}} \mid \boldsymbol{L}, A, Y) = \mu_{1, \boldsymbol{L}} - \mu_{0, \boldsymbol{L}} + (2A - 1)(Y - \mu_{A, \boldsymbol{L}})\mathbb{E}\left(\frac{1}{\pi_{A, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}, A \right),\] again using the fact that \(\boldsymbol{Z} \perp\!\!\!\!\perp Y \mid \boldsymbol{L}, A\). Using Lemma 3 below, we then have \(\mathbb{E}(\phi_{\boldsymbol{L}, \boldsymbol{Z}} \mid \boldsymbol{L}, A, Y) = \phi_{\boldsymbol{L}}\). Thus, by the law of total variance, \[\mathrm{Var}\left(\phi_{\boldsymbol{L}, \boldsymbol{Z}}\right) = \mathrm{Var}\left(\phi_{\boldsymbol{L}}\right) + \mathbb{E}\left(\mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{Z}} \mid \boldsymbol{L}, A, Y)\right) \geq \mathrm{Var}\left(\phi_{\boldsymbol{L}}\right).\] Explicitly, if you like, \[\mathbb{E}\left(\mathrm{Var}( \phi_{\boldsymbol{L}, \boldsymbol{Z}} \mid \boldsymbol{L}, A, Y)\right) = \mathbb{E}\left(\sigma_{A, \boldsymbol{L}}^2 \cdot \mathrm{Var}\left(\frac{1}{\pi_{A, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}, A\right)\right),\] which is certainly non-negative. \(\quad \blacksquare\)

Lemma 3: \(\mathbb{E}\left(\frac{1}{\pi_{A, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}, A \right) = \frac{1}{\pi_{A, \boldsymbol{L}}}\).

Proof: For each \(a \in \{0,1\}\), observe that

\begin{align*} \mathbb{E}\left(\frac{1}{\pi_{a, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}, A = a\right)\pi_{a, \boldsymbol{L}} &= \mathbb{E}\left(\frac{I(A = a)}{\pi_{a, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}\right) \\ &= \mathbb{E}\left(\frac{\mathbb{E}(I(A = a) \mid \boldsymbol{L}, \boldsymbol{Z}) }{\pi_{a, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}\right) \\ &= 1, \end{align*}

which proves the result. \(\quad \blacksquare\)
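
To see Lemmas 2 and 3 at work numerically, consider the following simulation sketch. The setup is an assumption made purely for illustration: a binary confounder \(L\), an independent binary instrument \(Z\) that shifts the exposure probability but has no effect on \(Y\) except through \(A\), a true ATE of 1, and unit outcome variance. True nuisance functions are plugged in, so the comparison isolates the effect of the adjustment set on \(\mathrm{Var}(\phi_{\boldsymbol{B}})\).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Hypothetical DAG-2-style data: Z affects A only, L affects both A and Y
L = rng.binomial(1, 0.5, size=n)
Z = rng.binomial(1, 0.5, size=n)
pi1_LZ = 0.2 + 0.2 * L + 0.4 * Z          # P[A = 1 | L, Z]
pi1_L = 0.4 + 0.2 * L                     # P[A = 1 | L], i.e. pi1_LZ averaged over Z
A = rng.binomial(1, pi1_LZ)
Y = A + L + rng.standard_normal(n)        # Z is absent from the outcome equation

mu1 = 1.0 + L                             # E(Y | L, A = 1) = E(Y | L, Z, A = 1)
mu0 = 1.0 * L                             # E(Y | L, A = 0) = E(Y | L, Z, A = 0)

def aipw(mu1, mu0, pi1, A, Y):
    """Uncentered AIPW influence function values for given nuisance functions."""
    mu_A = np.where(A == 1, mu1, mu0)
    pi_A = np.where(A == 1, pi1, 1.0 - pi1)
    return mu1 - mu0 + (2 * A - 1) / pi_A * (Y - mu_A)

phi_L = aipw(mu1, mu0, pi1_L, A, Y)       # adjust for L only
phi_LZ = aipw(mu1, mu0, pi1_LZ, A, Y)     # adjust for L and the instrument Z

var_L, var_LZ = phi_L.var(), phi_LZ.var()
print(f"means (both should be near the ATE of 1): {phi_L.mean():.3f}, {phi_LZ.mean():.3f}")
print(f"Var(phi_L)   = {var_L:.3f}")
print(f"Var(phi_L,Z) = {var_LZ:.3f}")

# Lemma 3 in the stratum L = 0, A = 1: the average of 1 / pi_{1,L,Z} should be 1 / pi_{1,L}
sel = (L == 0) & (A == 1)
lemma3_lhs = np.mean(1.0 / pi1_LZ[sel])
lemma3_rhs = 1.0 / 0.4
print(f"Lemma 3 check: {lemma3_lhs:.3f} vs {lemma3_rhs:.3f}")
```

Both versions target the same ATE, but the one that also conditions on the instrument should show the larger variance, as Lemma 2 predicts.
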

Do adjust for precision variables!

Lemma 4: (Adapted from Lemma 4 in Rotnitzky & Smucler (2020))

Suppose \(A \perp\!\!\!\!\perp \boldsymbol{W} \mid \boldsymbol{L}\). Then \(\mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{W}}) \leq \mathrm{Var}(\phi_{\boldsymbol{L}})\).

Proof: Note that under the stated conditional independence assumption, \(\pi_{A,\boldsymbol{L}, \boldsymbol{W}} \equiv \pi_{A,\boldsymbol{L}}\). Thus,

\begin{align*} \phi_{\boldsymbol{L}} &= \phi_{\boldsymbol{L}, \boldsymbol{W}} + \left(\frac{A}{\pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{1, \boldsymbol{L}}) - \left(\frac{1 - A}{1 - \pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{0, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}}) \end{align*}

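Call the last two terms on the right-hand side above the correction term. As \(A \perp\!\!\!\!\perp \boldsymbol{W} \mid \boldsymbol{L}\), the correction term has conditional mean zero given \((\boldsymbol{L}, \boldsymbol{W})\): \[\mathbb{E}\left(\left(\frac{A}{\pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{1, \boldsymbol{L}}) - \left(\frac{1 - A}{1 - \pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{0, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}}) \mid \boldsymbol{L}, \boldsymbol{W}\right) = 0,\] so it is uncorrelated with \(\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}, \boldsymbol{W}}\). Moreover, the correction term is a function of \((\boldsymbol{L}, \boldsymbol{W}, A)\) only, while the residual part of \(\phi_{\boldsymbol{L}, \boldsymbol{W}}\), namely \(\frac{2A - 1}{\pi_{A, \boldsymbol{L}}}(Y - \mu_{A, \boldsymbol{L}, \boldsymbol{W}})\), has conditional mean zero given \((\boldsymbol{L}, \boldsymbol{W}, A)\) (observation 1 in the proof of Lemma 1). It then follows that \[\mathrm{Cov}\left(\phi_{\boldsymbol{L}, \boldsymbol{W}}, \left(\frac{A}{\pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{1, \boldsymbol{L}}) - \left(\frac{1 - A}{1 - \pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{0, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}})\right) = 0.\] Thus, we have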

\begin{align*} & \mathrm{Var}(\phi_{\boldsymbol{L}}) \\ &= \mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{W}}) + \mathrm{Var}\left(\left(\frac{A}{\pi_{1, \boldsymbol{L}}} - 1\right)(\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{1, \boldsymbol{L}}) - \left(\frac{1 - A}{1 - \pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{0, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}})\right) \\ & \geq \mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{W}}), \end{align*}

as variance is non-negative. \(\quad \blacksquare\)
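
The decomposition used in the proof of Lemma 4 can also be checked numerically. The sketch below assumes a made-up data-generating process (binary \(L\), a standard normal precision variable \(W\) independent of \((L, A)\), a true ATE of 1) and plugs in the true nuisance functions; the variable names are hypothetical. It verifies that \(\phi_{\boldsymbol{L}}\) equals \(\phi_{\boldsymbol{L}, \boldsymbol{W}}\) plus the correction term, and that adjusting for \(W\) reduces the variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Hypothetical DAG-3-style data: W affects Y only and is independent of A given L
L = rng.binomial(1, 0.5, size=n)
W = rng.standard_normal(n)
pi1 = 0.3 + 0.4 * L                       # P[A = 1 | L] = P[A = 1 | L, W]
A = rng.binomial(1, pi1)
Y = A + L + 2.0 * W + rng.standard_normal(n)

mu1_L, mu0_L = 1.0 + L, 1.0 * L           # outcome regressions given L only
mu1_LW, mu0_LW = mu1_L + 2.0 * W, mu0_L + 2.0 * W   # outcome regressions given (L, W)

def aipw(mu1, mu0, pi1, A, Y):
    """Uncentered AIPW influence function values for given nuisance functions."""
    mu_A = np.where(A == 1, mu1, mu0)
    pi_A = np.where(A == 1, pi1, 1.0 - pi1)
    return mu1 - mu0 + (2 * A - 1) / pi_A * (Y - mu_A)

phi_L = aipw(mu1_L, mu0_L, pi1, A, Y)
phi_LW = aipw(mu1_LW, mu0_LW, pi1, A, Y)

# Correction term from the proof: phi_L = phi_LW + correction
correction = (A / pi1 - 1.0) * (mu1_LW - mu1_L) \
    - ((1 - A) / (1.0 - pi1) - 1.0) * (mu0_LW - mu0_L)

var_L, var_LW, var_corr = phi_L.var(), phi_LW.var(), correction.var()
print("decomposition holds pointwise:", np.allclose(phi_L, phi_LW + correction))
print(f"Var(phi_L)   = {var_L:.3f}")
print(f"Var(phi_L,W) = {var_LW:.3f}")
print(f"Var(phi_L,W) + Var(correction) = {var_LW + var_corr:.3f}")
```
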

Perfect prediction of treatment

Our discussion thus far has relied on positivity for any given adjustment set \(\boldsymbol{B}\), i.e., Assumption (\(*\)). What if this assumption holds for \(\boldsymbol{B} = \boldsymbol{L}\), but is violated for \(\boldsymbol{B} = (\boldsymbol{L}, \boldsymbol{Z})\)? That is, what if we include an instrument (or set of instruments) that results in perfect prediction of treatment status for some subgroup: \[\mathbb{P}\left[\mathbb{P}[A = 1 \mid \boldsymbol{L}, \boldsymbol{Z}] \in \{0,1\}\right] > 0.\] Well, unfortunately identification breaks down: one of \(\mu_{1, \boldsymbol{L}, \boldsymbol{Z}}\) or \(\mu_{0, \boldsymbol{L}, \boldsymbol{Z}}\) will not be well-defined with some positive probability, and therefore \(\chi_{\boldsymbol{L}, \boldsymbol{Z}}\) will not be well-defined either.

More realistically, in my opinion, inclusion of very predictive instruments may result in practical near-positivity violations, i.e., \(\mathbb{P}[A = 1 \mid \boldsymbol{L}, \boldsymbol{Z}]\) may be very close to 0 or 1. If this occurs, our outcome model estimates can become unstable when there is little data for one treatment level, and moreover the asymptotic variance of \(\widehat{\chi}_{\boldsymbol{L}, \boldsymbol{Z}}\), \(\mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{Z}})\), may explode—as a fun exercise, inspect the variance formulas in Lemmas 1 and 2 to get a feel for where exactly this manifests.
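
To get a rough feel for how quickly the variance can grow, here is a stylized calculation that is not taken from the paper. Assume there are no confounders, a single standard normal instrument \(Z\) with \(\mathbb{P}[A = 1 \mid Z] = \mathrm{expit}(\gamma Z)\), a constant treatment effect, and unit conditional outcome variance. Lemma 1 then gives \(\mathrm{Var}(\phi_{Z}) = \mathbb{E}(1/\pi_{1,Z} + 1/\pi_{0,Z}) = 2 + 2e^{\gamma^2/2}\), whereas leaving \(Z\) out gives a variance of 4 for every \(\gamma\), since the marginal exposure probability is \(1/2\) by symmetry. The sketch below evaluates the expectation by Gauss-Hermite quadrature and compares it with the closed form; the function name is hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Quadrature rule for expectations under a standard normal Z
nodes, weights = hermegauss(40)
weights = weights / np.sqrt(2.0 * np.pi)   # normalize so the weights sum to 1

def var_phi_with_instrument(gamma):
    """E(1/pi_1 + 1/pi_0) when P[A = 1 | Z] = expit(gamma * Z) and Var(Y | Z, A) = 1."""
    # 1/expit(x) = 1 + exp(-x) and 1/(1 - expit(x)) = 1 + exp(x); this form avoids
    # dividing by a propensity that rounds to exactly 0 or 1 in floating point.
    inv_pi1 = 1.0 + np.exp(-gamma * nodes)
    inv_pi0 = 1.0 + np.exp(gamma * nodes)
    return float(np.sum(weights * (inv_pi1 + inv_pi0)))

gammas = np.array([0.0, 1.0, 2.0, 3.0])
numeric = np.array([var_phi_with_instrument(g) for g in gammas])
closed_form = 2.0 + 2.0 * np.exp(gammas**2 / 2.0)

for g, v in zip(gammas, numeric):
    print(f"gamma = {g:.0f}: Var(phi_Z) = {v:8.2f}  (variance without Z: 4.00)")
```

In this toy setting the penalty for including the instrument grows like \(e^{\gamma^2/2}\), so even a moderately strong instrument can inflate the variance by an order of magnitude.
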

Practical takeaways

In practice, one may not be sure a priori whether a given variable is an exact instrument or precision variable. What should one do in such cases? Putative precision variables (e.g., those thought to be strongly predictive of the outcome, but maybe only weakly predictive of exposure) should be measured if possible and included in the adjustment set. For putative instruments, perhaps one should err on the side of ensuring a valid analysis and include covariates that may only be weakly associated with the outcome (after controlling for \(\boldsymbol{L}\) and \(A\)). When one is absolutely sure on the basis of domain knowledge that a variable is an instrument, it can be excluded to gain some efficiency. Given the discussion in the last section, I would emphasize that care should be taken when including very strong predictors of treatment (unless they are also very strong predictors of the outcome, whereby we would be forced to include them as confounders).

More broadly, the discussion in this note assumes we are in the luxurious setting where we measure a sufficient set of confounders of the exposure-outcome relationship. Often, one questions whether this is even the case, and it is typically possible to posit some important unmeasured confounders. If one believes they have an instrument, and that it is more likely to be unconfounded than the exposure itself, then one may be able to exploit this structure for an alternative route towards partial or full identification of the exposure effect—if you’re interested, you might enjoy our work on partial identification and our review paper on identification using instrumental variables!

For a deeper dive into the issues discussed in this note, the consideration of arbitrarily complicated DAGs, the extension to time-varying treatments, and much more, see the paper by Rotnitzky & Smucler.