# To adjust or not to adjust: instruments and precision variables

Recently, I’ve been listening to (i.e., binging) the Casual Inference
podcast hosted by Lucy D’Agostino McGowan and Ellie Murray, and a
question was raised at least a couple of times (Season 3 Episode 10 &
Season 4 Episode 9): does adjusting for an instrument (in, e.g., a
propensity score model) cause bias? I believe the answer to this
question is no, except in somewhat extreme scenarios. However, such a
choice will impact the *efficiency* of a treatment effect
estimator. In this post, I want to review a pair of lemmas from an
excellent paper by Rotnitzky & Smucler (2020; JMLR) to formally
address this and related issues.

## Graphs, Instruments, and Precision Variables

Say we are interested in the effect of an exposure, \(A\), on an
outcome, \(Y\). In order to use data to estimate the causal effect of
\(A\) on \(Y\), we typically aim to adjust for a sufficient set of
*confounders*, \(\boldsymbol{L}\)—informally, \(\boldsymbol{L}\) are the
common causes of \(A\) and \(Y\). Graphically, we can represent these
relationships on a causal directed acyclic graph (DAG):

*(Figure 1: a DAG with edges \(\boldsymbol{L} \to A\), \(\boldsymbol{L} \to Y\), and \(A \to Y\).)*

The DAG in Figure 1 encodes the assumption that, within levels of \(\boldsymbol{L}\), the treatment is as good as randomized. In counterfactual language, we would say that

\begin{equation} A \perp\!\!\!\!\perp Y(a) \mid \boldsymbol{L}, \end{equation}

where \(Y(a)\) is the potential outcome that would occur under exposure
level \(A = a\). Graphically, \(\boldsymbol{L}\) blocks all *backdoor*
paths (i.e., those starting with an edge *into* the treatment, \(A
\leftarrow\)) from \(A\) to \(Y\). [ **Aside**: the connection between the
DAG and potential outcomes is not obvious, and requires one to
associate a structural model with the DAG. With the FFRCISTG model,
one can read off counterfactual independence statements like this
using single world intervention graphs—this is a discussion for
another day! ]

In a given scientific setting, the situation may be more complicated
than that in the above picture. That is, we might imagine a more
refined DAG with a greater number of pre-treatment variables depicted
and certain arrows missing due to domain knowledge. Instruments and
precision variables represent two special cases of such additional
variables. An *instrument*, simply put, is a pure predictor of the
exposure. In this context, we will say \(Z\) is an instrument if it
causes \(A\) but only affects \(Y\) through \(A\): in DAG form,

*(Figure 2: the DAG of Figure 1 augmented with \(Z \to A\) and \(Z \leftrightarrow \boldsymbol{L}\).)*

Note that I use the two-sided arrow (\(\leftrightarrow\)) between
\(\boldsymbol{L}\) and \(Z\), as the direction of association (or indeed
the existence of an exogenous common cause) does not affect our
discussion. A *precision variable*, on the other hand, is a pure
predictor of the outcome. That is, \(W\) is a precision variable if it
directly affects \(Y\), but is only associated with \(A\) through
associations with \(\boldsymbol{L}\):

*(Figure 3: the DAG of Figure 1 augmented with \(W \to Y\) and \(W \leftrightarrow \boldsymbol{L}\).)*

In these alternative scenarios, we are left with multiple valid adjustment sets. That is, in the DAG in Figure 2, we will have \((1)\) in addition to \(A \perp\!\!\!\!\perp Y(a) \mid \boldsymbol{L}, Z\). Meanwhile, in the DAG in Figure 3, we have \((1)\) in addition to \(A \perp\!\!\!\!\perp Y(a) \mid \boldsymbol{L}, W\). In each case, we would like to know: should we adjust only for \(\boldsymbol{L}\), or should we include the additional variable as well?

## The AIPW estimating function and its variance

In order to make the above question concrete, we need to specify at the very least (i) a precise causal estimand of interest, (ii) one or more identification formulas for this estimand, and (iii) an estimator for these functionals. We will assume that we observe a random sample \(O_1, \ldots, O_n \overset{\mathrm{iid}}{\sim} \mathbb{P}\), where a typical observation is \(O = (\boldsymbol{L}, A, Y)\) or \((\boldsymbol{L}, Z, A, Y)\) or \((\boldsymbol{L}, W, A, Y)\), for the three scenarios above, respectively. The estimand of interest, for our purposes, will be the population average treatment effect (ATE), $\mathbb{E}(Y(1) - Y(0))$—we will assume for simplicity that the exposure is binary, \(A \in \{0,1\}\).

To proceed with identification (and, consequently, to have any hope of
estimation), we need one crucial *positivity* assumption.

**Assumption (\(*\))**: For any \(\boldsymbol{B} \subseteq O \setminus \{A,
Y\}\), \(\mathbb{P}[A = 1 \mid \boldsymbol{B}] \in (0,1)\) with
probability 1.

In all three DAG scenarios presented above, and under Assumption
(\(*\)), the ATE is identified by the *statistical* functional
\[\chi(\mathbb{P}) = \mathbb{E}_{\mathbb{P}}(\mu_1(\boldsymbol{L}) -
\mu_0(\boldsymbol{L})),\] where \(\mu_a(\boldsymbol{L}) =
\mathbb{E}_{\mathbb{P}}(Y \mid \boldsymbol{L}, A = a)\); henceforth we
will suppress dependence on \(\mathbb{P}\) and often omit inputs to
functions when there is no ambiguity. In order to represent
alternative adjustment sets, we will augment our notation: define
\[\chi_{\boldsymbol{B}} = \mathbb{E}(\mu_{1, \boldsymbol{B}} - \mu_{0,
\boldsymbol{B}}),\] where \(\boldsymbol{B} \subseteq O \setminus \{A,
Y\}\), and \(\mu_{a, \boldsymbol{B}}(\boldsymbol{B}) \equiv \mathbb{E}(Y
\mid \boldsymbol{B}, A = a)\). The ATE, \(\mathbb{E}(Y(1) - Y(0))\),
equals \(\chi_{\boldsymbol{B}}\) where \(\boldsymbol{B} =
\boldsymbol{L}\) in the DAG of Figure 1, \(\boldsymbol{B} =
\boldsymbol{L}\) or \((\boldsymbol{L}, Z)\) in the DAG of Figure 2, and
\(\boldsymbol{B} = \boldsymbol{L}\) or \((\boldsymbol{L}, W)\) in the DAG
of Figure 3.
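Before moving on to estimation, here is a quick numerical sanity check of this identification claim in the instrument scenario. The code below is a sketch with made-up probabilities (binary \(L\), \(Z\), \(A\), \(Y\)): it enumerates a joint distribution compatible with the DAG of Figure 2 and verifies that \(\chi_{\boldsymbol{L}} = \chi_{\boldsymbol{L}, Z}\).

```python
import numpy as np

# Hypothetical discrete example of the Figure 2 scenario: Z affects A but,
# by construction, has no direct effect on Y.
pL    = np.array([0.6, 0.4])                      # P(L = l)
pZ_L  = np.array([[0.5, 0.5], [0.3, 0.7]])        # P(Z = z | L = l)
pA_LZ = np.array([[[0.8, 0.2], [0.5, 0.5]],
                  [[0.7, 0.3], [0.2, 0.8]]])      # P(A = a | L = l, Z = z)
pY_LA = np.array([[[0.9, 0.1], [0.6, 0.4]],
                  [[0.7, 0.3], [0.3, 0.7]]])      # P(Y = y | L = l, A = a): no Z index!

# Full joint P(l, z, a, y), built by the chain rule along the DAG
P = np.einsum('l,lz,lza,lay->lzay', pL, pZ_L, pA_LZ, pY_LA)

def chi(P_B):
    """Plug-in chi_B, where P_B is the joint over (strata of B, A, Y)."""
    strata = P_B.reshape(-1, 2, 2)                      # (stratum, a, y)
    pB  = strata.sum(axis=(1, 2))                       # P(B = b)
    mu1 = strata[:, 1, 1] / strata[:, 1].sum(axis=1)    # E(Y | B, A = 1)
    mu0 = strata[:, 0, 1] / strata[:, 0].sum(axis=1)    # E(Y | B, A = 0)
    return float(np.sum(pB * (mu1 - mu0)))

chi_L  = chi(P.sum(axis=1))   # B = L: marginalize Z out first
chi_LZ = chi(P)               # B = (L, Z)
print(chi_L, chi_LZ)          # equal: both identify the ATE
```

The two plug-in computations agree, as they must, since the missing \(Z \to Y\) edge forces \(\mathbb{E}(Y \mid \boldsymbol{L}, Z, A) = \mathbb{E}(Y \mid \boldsymbol{L}, A)\).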

With identification out of the way, we are left with the statistical
task of estimating these quantities. I tend to prefer, when possible,
estimators that are asymptotically “optimal”. In particular,
estimators based on *influence functions* have really nice theoretical
properties. In a nonparametric model, a given functional has a unique
influence function. That said, the presence of an instrument or
precision variable means that the statistical model is no longer
completely nonparametric, and is actually a proper semiparametric
model. Concretely, the absence of arrows in a DAG implies conditional
independence restrictions on the observed data distribution
\(\mathbb{P}\): in the DAG of Figure 2, \(Z \perp\!\!\!\!\perp Y \mid
\boldsymbol{L}, A\), while in the DAG of Figure 3, \(A
\perp\!\!\!\!\perp W \mid \boldsymbol{L}\). As a consequence, there are
infinitely many influence functions one may work with, and the best
choice—the *efficient influence function* (EIF)—is that with the
lowest variance. The variance of a functional’s EIF represents a local
asymptotic minimax lower bound for estimation of that functional, and
thus one often aims to construct an estimator that attains this
variance bound asymptotically.

Our goals in this note are a tad more modest, and we will instead
consider estimators based solely on the *nonparametric* influence
functions of \(\chi_{\boldsymbol{B}}\) for different choices of
\(\boldsymbol{B}\). Let \(\pi_{a, \boldsymbol{B}}(\boldsymbol{B}) =
\mathbb{P}[A = a \mid \boldsymbol{B}]\) be the exposure probabilities
on the basis of variables \(\boldsymbol{B}\), and define

\begin{equation} \phi_{\boldsymbol{B}} = \mu_{1, \boldsymbol{B}} - \mu_{0, \boldsymbol{B}} + \frac{2A - 1}{\pi_{A, \boldsymbol{B}}}(Y - \mu_{A, \boldsymbol{B}}), \end{equation}

which is the (uncentered) nonparametric influence function of the
functional \(\chi_{\boldsymbol{B}}\). The influence function in \((2)\)
takes a familiar augmented inverse probability-weighted (AIPW)
form. As mentioned, we wish to construct estimators based on
\(\phi_{\boldsymbol{B}}\). This is a general task that we may revisit in
another post, but let us quickly review one simple procedure, which
will yield the *cross-fit one-step estimator*. We will split our data
into \((D_1^n, D_2^n)\) where each \(D_j^n\) has size \(n/2\), fit models \((
\widehat{\mu}_{0,\boldsymbol{B}}, \widehat{\mu}_{1, \boldsymbol{B}},
\widehat{\pi}_{1,\boldsymbol{B}})\) on \(D_1^n\), then compute
\[\widehat{\chi}_{\boldsymbol{B}}^{(1)} = \frac{2}{n}\sum_{O_i \in
D_2^n} \widehat{\phi}_{\boldsymbol{B}}(O_i),\] where
\[\widehat{\phi}_{\boldsymbol{B}} = \widehat{\mu}_{1,
\boldsymbol{B}} - \widehat{\mu}_{0, \boldsymbol{B}} + \frac{2A -
1}{\widehat{\pi}_{A, \boldsymbol{B}}}(Y - \widehat{\mu}_{A,
\boldsymbol{B}}).\] We then swap the roles of \(D_1^n\) and \(D_2^n\),
repeat the above procedure to obtain
\(\widehat{\chi}_{\boldsymbol{B}}^{(2)}\), then compute the final
estimator \(\widehat{\chi}_{\boldsymbol{B}} =
\frac{\widehat{\chi}_{\boldsymbol{B}}^{(1)} +
\widehat{\chi}_{\boldsymbol{B}}^{(2)}}{2}\). Under *relatively* weak
and general nonparametric conditions [ one condition being a stronger
positivity assumption, and another being the following: \(\lVert
\widehat{\mu}_{a,\boldsymbol{B}} - \mu_{a,\boldsymbol{B}} \rVert \cdot
\lVert \widehat{\pi}_{1,\boldsymbol{B}} - \pi_{1,\boldsymbol{B}}
\rVert = o_{\mathbb{P}}(n^{-1/2})\) for \(a \in \{0,1\}\), where \(\lVert
f \rVert^2 = \int f(o)^2 \, d\mathbb{P}(o)\) ], one is able to obtain
the following asymptotic normality result:
\[\sqrt{n}\left(\widehat{\chi}_{\boldsymbol{B}} -
\chi_{\boldsymbol{B}}\right) \overset{d}{\to} \mathcal{N}(0,
\mathrm{Var}(\phi_{\boldsymbol{B}})).\] Thus, in large samples, for
any choice of \(\boldsymbol{B}\) the estimator described above is
approximately unbiased with variance roughly equal to
\(\frac{\mathrm{Var}(\phi_{\boldsymbol{B}})}{n}\). This gives us a
simple criterion for selecting an adjustment set \(\boldsymbol{B}\):
choose \(\boldsymbol{B}\) to minimize
\(\mathrm{Var}(\phi_{\boldsymbol{B}})\)! The following simple result
provides an explicit form for this variance.
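To make this concrete, here is a minimal sketch of the cross-fit one-step estimator in Python. Everything about the data-generating process (a single binary confounder, a true ATE of 2, the sample size) is an assumption for illustration; with a binary \(\boldsymbol{L}\), the nuisance fits reduce to stratum-specific sample means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
L = rng.binomial(1, 0.5, n)
A = rng.binomial(1, 0.3 + 0.4 * L)          # P(A = 1 | L) = 0.3 + 0.4 L
Y = 1.0 + 2.0 * A + L + rng.normal(size=n)  # true ATE = 2

def fit(L_tr, A_tr, Y_tr):
    """Nuisance fits by stratum means (L is binary, so these are saturated models)."""
    mu = {(a, l): Y_tr[(A_tr == a) & (L_tr == l)].mean() for a in (0, 1) for l in (0, 1)}
    pi = {l: A_tr[L_tr == l].mean() for l in (0, 1)}
    return mu, pi

def one_step(L_ev, A_ev, Y_ev, mu, pi):
    """Average the estimated AIPW influence function over the evaluation fold."""
    mu1 = np.array([mu[(1, l)] for l in L_ev])
    mu0 = np.array([mu[(0, l)] for l in L_ev])
    muA = np.where(A_ev == 1, mu1, mu0)
    piA = np.array([pi[l] if a == 1 else 1 - pi[l] for a, l in zip(A_ev, L_ev)])
    phi = mu1 - mu0 + (2 * A_ev - 1) / piA * (Y_ev - muA)
    return phi.mean()

half = n // 2
idx1, idx2 = np.arange(half), np.arange(half, n)
chi1 = one_step(L[idx2], A[idx2], Y[idx2], *fit(L[idx1], A[idx1], Y[idx1]))
chi2 = one_step(L[idx1], A[idx1], Y[idx1], *fit(L[idx2], A[idx2], Y[idx2]))
chi_hat = (chi1 + chi2) / 2
print(round(chi_hat, 2))   # close to the true ATE of 2
```

In realistic problems one would replace the stratum means with flexible regression or machine learning fits; the cross-fitting logic stays exactly the same.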

**Lemma 1**: \(\mathrm{Var}(\phi_{\boldsymbol{B}}) = \mathrm{Var}(\mu_{1,
\boldsymbol{B}} - \mu_{0, \boldsymbol{B}}) +
\mathbb{E}\left(\frac{\sigma_{A, \boldsymbol{B}}^2}{\pi_{A,
\boldsymbol{B}}^2}\right)\), where \(\sigma_{a,
\boldsymbol{B}}^2(\boldsymbol{B}) \equiv \mathrm{Var}(Y \mid
\boldsymbol{B}, A = a)\).

*Proof*: Observe the following:

- \(\mathbb{E}\left(\frac{2A - 1}{\pi_{A, \boldsymbol{B}}} (Y - \mu_{A,\boldsymbol{B}}) \mid \boldsymbol{B}, A\right) = 0\).
- \(\mathrm{Cov}\left(\mu_{1, \boldsymbol{B}} - \mu_{0, \boldsymbol{B}}, \frac{2A - 1}{\pi_{A, \boldsymbol{B}}} (Y - \mu_{A,\boldsymbol{B}})\right) = 0\).
- \(\mathrm{Var}\left(\frac{2A - 1}{\pi_{A, \boldsymbol{B}}} (Y - \mu_{A,\boldsymbol{B}})\right) = \mathbb{E}\left(\mathrm{Var}\left(\frac{2A - 1}{\pi_{A, \boldsymbol{B}}} (Y - \mu_{A,\boldsymbol{B}}) \mid \boldsymbol{B}, A\right)\right) = \mathbb{E}\left( \frac{\sigma_{A, \boldsymbol{B}}^2}{\pi_{A,\boldsymbol{B}}^2}\right)\).

Note that observations 2 and 3 follow from observation 1. The result
then follows immediately. \(\quad \blacksquare\)
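As a quick Monte Carlo check of Lemma 1 (all parameters hypothetical: binary \(L\), unit-variance homoskedastic errors, and the true nuisance functions plugged in), the empirical variance of \(\phi_{\boldsymbol{L}}\) matches the closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
L = rng.binomial(1, 0.5, n)
pi1 = 0.3 + 0.4 * L                                   # pi_{1,L}(L)
A = rng.binomial(1, pi1)
mu1, mu0 = 2.0 + 2.0 * L, 1.0 + 1.0 * L               # mu_{1,L}, mu_{0,L}
Y = np.where(A == 1, mu1, mu0) + rng.normal(size=n)   # sigma^2_{a,L} = 1

piA = np.where(A == 1, pi1, 1 - pi1)
muA = np.where(A == 1, mu1, mu0)
phi = mu1 - mu0 + (2 * A - 1) / piA * (Y - muA)

# Lemma 1: Var(mu1 - mu0) + E(sigma^2_A / pi_A^2)
#        = Var(1 + L) + E(1/pi + 1/(1 - pi)), identical in both L strata here
analytic = 0.25 + (1 / 0.3 + 1 / 0.7)
print(round(phi.var(), 2), round(analytic, 2))   # empirical vs. closed form
```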

In the following subsections, we write instruments and precision variables using bold symbols, \(\boldsymbol{Z}\) and \(\boldsymbol{W}\), to reflect that these can be collections (i.e., vectors) of covariates.

### *Do not* adjust for instruments!

**Lemma 2**: (Adapted from Lemma 5 in Rotnitzky & Smucler (2020))

Suppose \(\boldsymbol{Z} \perp\!\!\!\!\perp Y \mid \boldsymbol{L}, A\). Then \(\mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{Z}}) \geq \mathrm{Var}(\phi_{\boldsymbol{L}})\).

*Proof*: Note that under the stated conditional independence
assumption, \(\mu_{a,\boldsymbol{L}, \boldsymbol{Z}} \equiv \mu_{a,\boldsymbol{L}}\)
and \(\sigma_{a, \boldsymbol{L}, \boldsymbol{Z}}^2 \equiv \sigma_{a,
\boldsymbol{L}}^2\). We therefore have
\[\mathbb{E}(\phi_{\boldsymbol{L}, \boldsymbol{Z}} \mid \boldsymbol{L}, A, Y) =
\mu_{1, \boldsymbol{L}} - \mu_{0, \boldsymbol{L}} + (2A - 1)(Y -
\mu_{A, \boldsymbol{L}})\mathbb{E}\left(\frac{1}{\pi_{A,
\boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}, A \right),\] again using the
fact that \(\boldsymbol{Z} \perp\!\!\!\!\perp Y \mid \boldsymbol{L}, A\). Using Lemma
3 below, we then have \(\mathbb{E}(\phi_{\boldsymbol{L}, \boldsymbol{Z}} \mid
\boldsymbol{L}, A, Y) = \phi_{\boldsymbol{L}}\). Thus, by the law of
total variance, \[\mathrm{Var}\left(\phi_{\boldsymbol{L}, \boldsymbol{Z}}\right) =
\mathrm{Var}\left(\phi_{\boldsymbol{L}}\right) +
\mathbb{E}\left(\mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{Z}} \mid
\boldsymbol{L}, A, Y)\right) \geq
\mathrm{Var}\left(\phi_{\boldsymbol{L}}\right).\] Explicitly, if you
like, \[\mathbb{E}\left(\mathrm{Var}( \phi_{\boldsymbol{L}, \boldsymbol{Z}} \mid
\boldsymbol{L}, A, Y)\right) = \mathbb{E}\left(\sigma_{A,
\boldsymbol{L}}^2 \cdot \mathrm{Var}\left(\frac{1}{\pi_{A,
\boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}, A\right)\right),\] which is
certainly non-negative. \(\quad \blacksquare\)
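Here is a small simulation sketch of Lemma 2 (hypothetical parameters throughout): a single binary instrument \(Z\) shifts \(P(A = 1 \mid \boldsymbol{L}, Z)\) but not the outcome, and evaluating the influence functions at the true nuisances shows the variance inflation directly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
L = rng.binomial(1, 0.5, n)
Z = rng.binomial(1, 0.5, n)                       # instrument: affects A only
pi_LZ = np.array([[0.2, 0.6], [0.3, 0.8]])[L, Z]  # P(A = 1 | L, Z)
A = rng.binomial(1, pi_LZ)
Y = 1.0 + A + L + rng.normal(size=n)              # no Z in Y: Z indep. of Y given (L, A)

pi_L = np.array([0.4, 0.55])[L]                   # P(A = 1 | L), Z averaged out

def phi(pi1):
    """Uncentered AIPW influence function at the true nuisances.

    Since Z is independent of Y given (L, A), mu_{a,L,Z} = mu_{a,L},
    so only the propensity score changes between the two adjustment sets."""
    mu1, mu0 = 2.0 + L, 1.0 + L
    muA = np.where(A == 1, mu1, mu0)
    piA = np.where(A == 1, pi1, 1 - pi1)
    return mu1 - mu0 + (2 * A - 1) / piA * (Y - muA)

v_L, v_LZ = phi(pi_L).var(), phi(pi_LZ).var()
print(round(v_L, 2), round(v_LZ, 2))   # adjusting for Z inflates the variance
```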

**Lemma 3**: \(\mathbb{E}\left(\frac{1}{\pi_{A, \boldsymbol{L}, \boldsymbol{Z}}} \mid
\boldsymbol{L}, A \right) = \frac{1}{\pi_{A, \boldsymbol{L}}}\).

*Proof*: For each \(a \in \{0,1\}\), observe that

\begin{align*} \mathbb{E}\left(\frac{1}{\pi_{a, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}, A = a\right)\pi_{a, \boldsymbol{L}} &= \mathbb{E}\left(\frac{I(A = a)}{\pi_{a, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}\right) \\ &= \mathbb{E}\left(\frac{\mathbb{E}(I(A = a) \mid \boldsymbol{L}, \boldsymbol{Z}) }{\pi_{a, \boldsymbol{L}, \boldsymbol{Z}}} \mid \boldsymbol{L}\right) \\ &= 1, \end{align*}

which proves the result. \(\quad \blacksquare\)
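Lemma 3 can also be verified numerically in a toy discrete example (made-up probabilities; a single binary \(Z\) with \(P(Z = 1 \mid \boldsymbol{L}) = 1/2\)):

```python
# Exact check of Lemma 3 with hypothetical numbers: binary L and Z.
pZ = 0.5                                                 # P(Z = 1 | L = l)
pi1 = {(0, 0): 0.2, (0, 1): 0.6, (1, 0): 0.3, (1, 1): 0.8}  # P(A = 1 | L = l, Z = z)

for l in (0, 1):
    for a in (0, 1):
        p_a = lambda z: pi1[(l, z)] if a == 1 else 1 - pi1[(l, z)]
        pi_aL = sum(pZ * p_a(z) for z in (0, 1))          # P(A = a | L = l)
        pZ_la = {z: pZ * p_a(z) / pi_aL for z in (0, 1)}  # P(Z = z | L = l, A = a), Bayes
        lhs = sum(pZ_la[z] / p_a(z) for z in (0, 1))      # E(1/pi_{a,L,Z} | L = l, A = a)
        assert abs(lhs - 1 / pi_aL) < 1e-12
print("Lemma 3 identity holds in all four (l, a) strata")
```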

### *Do* adjust for precision variables!

**Lemma 4**: (Adapted from Lemma 4 in Rotnitzky & Smucler (2020))

Suppose \(A \perp\!\!\!\!\perp \boldsymbol{W} \mid \boldsymbol{L}\). Then \(\mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{W}}) \leq \mathrm{Var}(\phi_{\boldsymbol{L}})\).

*Proof*: Note that under the stated conditional independence
assumption, \(\pi_{A,\boldsymbol{L}, \boldsymbol{W}} \equiv
\pi_{A,\boldsymbol{L}}\). Thus,

\begin{align*} \phi_{\boldsymbol{L}} &= \phi_{\boldsymbol{L}, \boldsymbol{W}} + \left(\frac{A}{\pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{1, \boldsymbol{L}}) - \left(\frac{1 - A}{1 - \pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{0, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}}) \end{align*}

Noting that, as \(A \perp\!\!\!\!\perp \boldsymbol{W} \mid \boldsymbol{L}\), \[\mathbb{E}\left(\left(\frac{A}{\pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{1, \boldsymbol{L}}) - \left(\frac{1 - A}{1 - \pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{0, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}}) \mid \boldsymbol{L}, \boldsymbol{W}\right) = 0,\] it then follows that \[\mathrm{Cov}\left(\phi_{\boldsymbol{L}, \boldsymbol{W}}, \left(\frac{A}{\pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{1, \boldsymbol{L}}) - \left(\frac{1 - A}{1 - \pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{0, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}})\right) = 0.\] Thus, we have

\begin{align*} & \mathrm{Var}(\phi_{\boldsymbol{L}}) \\ &= \mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{W}}) + \mathrm{Var}\left(\left(\frac{A}{\pi_{1, \boldsymbol{L}}} - 1\right)(\mu_{1, \boldsymbol{L}, \boldsymbol{W}} - \mu_{1, \boldsymbol{L}}) - \left(\frac{1 - A}{1 - \pi_{1, \boldsymbol{L}}} - 1\right) (\mu_{0, \boldsymbol{L}, \boldsymbol{W}} - \mu_{0, \boldsymbol{L}})\right) \\ & \geq \mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{W}}), \end{align*}

as variance is non-negative. \(\quad \blacksquare\)
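And here is the mirror-image simulation sketch for Lemma 4 (again with hypothetical parameters): a precision variable \(W\) that explains outcome variance shrinks the variance of the influence function.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
L = rng.binomial(1, 0.5, n)
W = rng.binomial(1, 0.5, n)                 # precision variable: affects Y only
pi1 = 0.3 + 0.4 * L                         # P(A = 1 | L) = P(A = 1 | L, W)
A = rng.binomial(1, pi1)
Y = 1.0 + A + L + 2.0 * W + rng.normal(size=n)

piA = np.where(A == 1, pi1, 1 - pi1)

def phi(mu1, mu0):
    """Uncentered AIPW influence function; the propensity score is shared,
    since A is independent of W given L."""
    muA = np.where(A == 1, mu1, mu0)
    return mu1 - mu0 + (2 * A - 1) / piA * (Y - muA)

# True outcome regressions with and without W in the adjustment set
v_LW = phi(2.0 + L + 2 * W, 1.0 + L + 2 * W).var()
v_L  = phi(3.0 + L,         2.0 + L).var()   # E(2W | L) = 1 folded into mu_{a,L}
print(round(v_LW, 2), round(v_L, 2))         # adjusting for W shrinks the variance
```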

## Perfect prediction of treatment

Our discussion thus far has relied on positivity for any given
adjustment set \(\boldsymbol{B}\), i.e., Assumption (\(*\)). What if this
assumption holds for \(\boldsymbol{B} = \boldsymbol{L}\), but is
violated for \(\boldsymbol{B} = (\boldsymbol{L}, \boldsymbol{Z})\)? That
is, what if we include an instrument (or set of instruments) that
results in *perfect* prediction of treatment status for some subgroup:
\[\mathbb{P}\left[\mathbb{P}[A = 1 \mid \boldsymbol{L},
\boldsymbol{Z}] \in \{0,1\}\right] > 0.\] Unfortunately, identification
then breaks down: one of \(\mu_{1, \boldsymbol{L}, \boldsymbol{Z}}\) or
\(\mu_{0, \boldsymbol{L}, \boldsymbol{Z}}\) fails to be well-defined on a
set of positive probability, and therefore \(\chi_{\boldsymbol{L},
\boldsymbol{Z}}\) is not well-defined.

More realistically, in my opinion, inclusion of very predictive instruments may result in practical near-positivity violations, i.e., \(\mathbb{P}[A = 1 \mid \boldsymbol{L}, \boldsymbol{Z}]\) may be very close to 0 or 1. If this occurs, our outcome model estimates can become unstable when there is little data for one treatment level, and moreover the asymptotic variance of \(\widehat{\chi}_{\boldsymbol{L}, \boldsymbol{Z}}\), \(\mathrm{Var}(\phi_{\boldsymbol{L}, \boldsymbol{Z}})\), may explode—as a fun exercise, inspect the variance formulas in Lemmas 1 and 2 to get a feel for where exactly this manifests.
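As a tiny illustration of that exercise (a hypothetical setup with no confounders, a single binary instrument, and \(\sigma^2 = 1\)), watch the penalty term \(\mathbb{E}(\sigma_A^2 / \pi_A^2)\) from Lemma 1 as \(P(A = 1 \mid Z)\) approaches 0 and 1:

```python
# Penalty term E(1 / pi_A^2) from Lemma 1 (sigma^2 = 1) as a binary instrument
# gets stronger: P(A = 1 | Z) takes values 0.5 - delta and 0.5 + delta, each w.p. 1/2.
# Note E(1/pi_A^2 | pi_1 = p) = p/p^2 + (1 - p)/(1 - p)^2 = 1/p + 1/(1 - p).
for delta in (0.0, 0.2, 0.4, 0.45, 0.49, 0.499):
    term = sum(0.5 * (1 / p + 1 / (1 - p))
               for p in (0.5 - delta, 0.5 + delta))
    print(f"delta = {delta:5.3f}  ->  E(1/pi_A^2) = {term:8.2f}")
```

The term explodes as the instrument approaches perfect prediction, which is exactly the inflation the asymptotic variance formula warns about.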

## Practical takeaways

In practice, one may not be sure *a priori* whether a given variable
is an *exact* instrument or precision variable. What should one do in
such cases? For putative precision variables (e.g., those thought to
be strongly predictive of the outcome, but maybe only weakly
predictive of exposure), it is clear that these should definitely be
measured if possible and included in an adjustment set. For putative
instruments, perhaps one should err on the side of ensuring a valid
analysis and include covariates that may only be weakly associated
with the outcome (after controlling for \(\boldsymbol{L}\) and
\(A\)). When one is confident, on the basis of domain knowledge, that a
variable is a pure instrument, it can be excluded to gain some
efficiency. Given the discussion in the last section, I would emphasize
that care should be taken when including very strong predictors of
treatment (unless they are also strong predictors of the outcome, in
which case we would be forced to include them as confounders).

More broadly, the discussion in this note assumes we are in the luxurious setting where we measure a sufficient set of confounders of the exposure-outcome relationship. Often, one questions whether this is even the case, and it is typically possible to posit some important unmeasured confounders. If one believes they have an instrument, and that it is more likely to be unconfounded than the exposure itself, then one may be able to exploit this structure for an alternative route towards partial or full identification of the exposure effect—if you’re interested, you might enjoy our work on partial identification and our review paper on identification using instrumental variables!

For a deeper dive into the issues discussed in this note, the
consideration of arbitrarily complicated DAGs, the extension to
time-varying treatments, and *much* more, see the paper by Rotnitzky &
Smucler.