MGFs, power series, and pushing the expectation through

I got stuck on what I thought was a simple point while working through a stats textbook! I figured it out, I think, but was this the most straightforward path to understanding? You tell me! In this note, we explore some oft-used facts about moment-generating functions, and explain a connection with tail and concentration inequalities.

Overview and my confusion

I am currently reading through Martin Wainwright’s High-dimensional statistics: A non-asymptotic viewpoint, to fill some gaps in my stat theory knowledge. While I’m only partway through, I can definitely already recommend this book for those wanting to get familiar with some of the modern theoretical machinery needed to validate statistical methods that we use on difficult problems!

The book starts off relatively lightly in Chapter 2 with an overview of fundamental tail and concentration bounds. A simple but powerful idea is that of Chernoff bounding, which combines the moment generating function of a random variable (when it exists), and Markov’s inequality to tightly bound tail probabilities. In reading through this chapter, one quickly comes across the sub-exponential class of distributions: a random variable \(X\) is sub-exponential if \(\mathbb{E}(|X|) < \infty\) and there exist constants \(\nu > 0\) and \(\alpha \in [0,\infty)\) such that \[\mathbb{E}(e^{t (X-\mu)}) \leq e^{\frac{\nu^2 t^2}{2}}, \text{for all } t \in \left(-\frac{1}{\alpha}, \frac{1}{\alpha}\right),\] where \(\mu \coloneqq \mathbb{E}(X)\) and we use the convention that \(\frac{1}{0} = \infty\).

Following this introduction, we are then exposed to a sufficient condition for \(X\) to be sub-exponential, the so-called Bernstein condition. In particular, \(X\) is said to be $b$-Bernstein, for some \(b > 0\), if \(\mathbb{E}(|X|^2) < \infty\), and letting \(\mu \coloneqq \mathbb{E}(X)\), \(\sigma^2 \coloneqq \mathrm{Var}(X)\), it holds that \[\left|\mathbb{E}[(X - \mu)^k]\right| \leq \frac{1}{2}k! \sigma^2 b^{k-2}, \text{ for all } k \in \mathbb{N} \setminus \{1\}.\] In proving that this condition implies that \(X\) is sub-exponential, the author employs the “power series decomposition” \(\mathbb{E}(e^{t(X- \mu)}) = \sum_{n = 0}^\infty \frac{t^n}{n!} \mathbb{E}[(X-\mu)^n]\), for \(t\) non-zero but sufficiently small. That said, I couldn’t find anywhere in the book where this equality was justified.

At first glance, this equality might seem obvious, as \(e^{z} = \sum_{n=0}^\infty \frac{z^n}{n!}\) for all \(z \in \mathbb{R}\), so one should just be able to push the expectation through this power series representation. But usually this sort of statement is justified by the dominated convergence theorem (DCT): if \(Y_n \to Y\), and the sequence \(Y_n\) is bounded by \(Z \in L_1\), i.e., \(|Y_n| \leq Z\) for all \(n \in \mathbb{N}\) and \(\mathbb{E}(Z) < \infty\), then \(\mathbb{E}(|Y_n|), \mathbb{E}(|Y|) < \infty\) and \(\mathbb{E}(Y) = \lim_{n \to \infty} \mathbb{E}(Y_n)\). In our problem, can we just directly apply the DCT? Well, we would like to say that, for small but non-zero \(t\), \[\mathbb{E}\left(\lim_{n \to \infty}\sum_{k=0}^n \frac{t^k}{k!}(X-\mu)^k\right) = \lim_{n \to \infty}\mathbb{E}\left(\sum_{k=0}^n \frac{t^k}{k!}(X-\mu)^k\right),\] i.e., \(Y_n = \sum_{k=0}^n \frac{t^k}{k!}(X-\mu)^k\) and \(Y = e^{t(X-\mu)}\). But what should be the dominating variable \(Z\)? We can easily see that \(Y_n \leq \sum_{k=0}^n \frac{|t|^n}{n!} |X-\mu|^n \leq e^{|t| |X-\mu|}\), but do we know that the latter is integrable? It wasn’t obvious to me, so let’s dive into the details.

Facts about moment generating functions

Let \(X\) be a random variable, and define \(\Psi_X(t) = \mathbb{E}(e^{tX})\), for \(t \in D\), where \(D = \{t \in \mathbb{R}: \mathbb{E}(e^{tX}) < \infty\}\). Note that \(e^{tX} \geq 0\), so that its expectation is always well-defined in \([0, \infty]\). Moreover, \(0 \in D\) no matter what, as \(\mathbb{E}(e^{0X}) = 1 < \infty\) — this could possibly be the only element of \(D\). The first interesting fact about \(\Psi_X\) is that its domain \(D\) is always an interval including zero.

Lemma 1: If \(t \in D \cap (0, \infty)\), then \([0, t] \subseteq D\). Similarly, if \(t \in D \cap (-\infty, 0)\), then \([t, 0] \subseteq D\).

Proof: Suppose \(t > 0\), \(\mathbb{E}(e^{tX}) < \infty\), and let \(s \in [0, t]\). Observe that \[\mathbb{E}(e^{sX}) = \mathbb{E}(e^{sX} \mathbf{1}_{[0, \infty)}(X)) + \mathbb{E}(e^{sX} \mathbf{1}_{(-\infty, 0)}(X)) \leq \mathbb{E}(e^{tX} \mathbf{1}_{[0,\infty)}) + P[X < 0] \leq \Psi_{X}(t) + 1 < \infty,\] so \(s \in D\). Similarly, if \(t < 0\), \(\mathbb{E}(e^{tX}) < \infty\), then let \(s \in [t, 0]\) and note that \[\mathbb{E}(e^{sX}) = \mathbb{E}(e^{sX} \mathbf{1}_{(0, \infty)}(X)) + \mathbb{E}(e^{sX} \mathbf{1}_{(-\infty, 0]}(X)) \leq P[X > 0] + \mathbb{E}(e^{tX} \mathbf{1}_{(-\infty, 0]}(X)) \leq 1 + \Psi_X(t),\] so \(s \in D\). \(\quad \blacksquare\)

MGFs are useful when they exist on an open interval around zero — it is in this case that they actually generate moments! This is formalized in the next result.

Proposition 1: Suppose \(t^* > 0\) is such that \((-t^*, t^*) \subseteq D\). Then \(\mathbb{E}(|X|^n) < \infty\), for all \(n \in \mathbb{N}\), \[\Psi_X(t) = \sum_{n = 0}^\infty \frac{t^n}{n!} \mathbb{E}(X^n)\] holds and this series is absolutely convergent, for all \(t \in (-t^*, t^*)\). Consequently, \(\mathbb{E}(X^n) = \left. \frac{d^n}{dt^n} \Psi_X(t) \right|_{t = 0}\).

Proof: Let \(t \in (0, t^*)\). The key observation is that \(e^{t|x|} \leq e^{-tx} + e^{tx}\), for all \(x \in \mathbb{R}\). Thus, \[\mathbb{E}(e^{t|X|}) \leq \Psi_X(-t) + \Psi_X(t) < \infty,\] given that \(-t, t \in D\). In other words, by the power series expansion of the exponential, \[\mathbb{E}(e^{t|X|}) = \mathbb{E}\left(\sum_{n=0}^{\infty} \frac{t^n}{n!} |X|^n\right) = \sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(|X|^n) < \infty,\] where the interchange of expectation and summation is permitted by the monotone convergence theorem (MCT). Immediately, we can deduce that \(\mathbb{E}(|X|^n) < \infty\) for all \(n \in \mathbb{N}\). We further can see by the DCT that \(\Psi_X(t) = \sum_{n=0}^\infty \frac{t^n}{n!} \mathbb{E}(X^n)\) for any \(t \in (-t^*, t^*)\), as the partial sums \(\sum_{k=0}^N \frac{t^n}{n!} X^n\), for \(N \in \mathbb{N}\), are uniformly bounded by \(e^{|t| \cdot |X|} \in L_1\).

Notice that we have shown that the power series \(\sum_{n=0}^{\infty} \frac{t^n}{n!} \mathbb{E}(X^n)\) converges absolutely on \((-t^*, t^*)\), i.e., its radius of convergence is at least \(t^*\). By basic facts about power series, \(\Psi_X(t)\) is infinitely differentiable on \((-t^*, t^*)\), and \(\mathbb{E}(X^n) = \left. \frac{d^n}{dt^n} \Psi_X(t) \right|_{t = 0}\). \(\quad \blacksquare\)

Inspecting the proof of Proposition 1, we can see that the real power (no pun intended) comes from \(e^{|tX|}\) being integrable. The crux of the argument was that \(\{-t, t\} \subseteq D\) is equivalent to \(\mathbb{E}(e^{|tX|}) < \infty\), and the latter readily establishes that \(\Psi_X\) permits the power series representation \(\Psi_X(s) = \sum_{n=0}^\infty \frac{s^n}{n!}\mathbb{E}(X^n)\), for \(s \in \{-t, t\}\). With Lemma 1, we pretty much immediately obtain the following nice chain of equivalences.

Theorem: For any \(t \in \mathbb{R}\):

\begin{align*} &\{-t, t\} \subseteq D \text{ (or say } [-|t|, |t|] \subseteq D \text{ if you like)}\\ \iff &\mathbb{E}(e^{|sX|}) < \infty, \textit{for all } s \in [-|t|, |t|] \\ \iff &\sum_{n=0}^\infty \frac{t^n}{n!} \mathbb{E}(X^n) \textit{ absolutely converges and equals } \Psi_X(t) \textit{ for } t \in [-|t|, |t|] \end{align*}

[Notice that this is useful only when \(t \neq 0\).]

Relation to sub-exponential distributions

At this point, we are equipped to understand a more general version of the problem we started out with (side note: are the necessary facts about MGFs just supposed to be common knowledge!?). Namely, we will verify that \(X\) is sub-exponential if and only if the domain \(D\) of \(\Psi_X\) contains an open interval around zero [for those reading the Wainwright text, this is part of Theorem 2.13]. Clearly, if \(\nu > 0\) and \(\alpha \in [0,\infty)\) satisfy \(\mathbb{E}(e^{t(X-\mu)}) \leq e^{\frac{\nu^2 t^2}{2}}\) (notice in particular that this is finite) for \(t \in (-\alpha^{-1}, \alpha^{-1})\), then \(D\) contains any pair \(\{-t, t\}\) for \(t \in (0, \alpha^{-1})\), and we see that \((-\alpha^{-1}, \alpha^{-1}) \subseteq D\). The following says that the converse result also holds.

Proposition 2: The MGF of \(X\) is defined on \(D \supseteq (-c, c)\) for some \(c > 0\) if and only if \(X\) is sub-exponential.

Proof: We have just argued in the “if” direction. For the converse, suppose \(c > 0\) and \((-c,c) \subseteq D\). By our theorem, it is kosher to use the power series expansion of \(\Psi_{X - \mu}(t)\) for any \(t \in (-c,c)\) — note by the way that the domain \(D\) of \(\Psi_X\) is equal to that of the MGF of the centered variable \(\Psi_{X - \mu}\), as they differ by a deterministic finite multiplicative factor \(e^{-t\mu}\). Further, as the radius of convergence of this power series is at least \(c\), we can differentiate term by term on \((-c,c)\) to obtain the derivative of \(\Psi_{X - \mu}\). Consequently, by Taylor’s theorem with Peano remainder, \[\Psi_{X - \mu}(t) = \Psi_{X-\mu}(0) + \Psi^{(1)}_{X - \mu}(0)t + \frac{\Psi^{(2)}_{X - \mu}(0)}{2}t^2 + o(t^2) = 1 + \frac{\sigma^2 t^2}{2} + o(t^2), \text{ as } t \to 0,\] where \(\sigma^2 = \mathbb{E}[(X - \mu)^2]\); recall that all moments are finite by Proposition 1. Similarly, for any \(\nu > 0\),

\begin{align*} \exp{\left\{\frac{\nu^2 t^2}{2}\right\}} &= \left[1 + \left\{\nu^2 s\exp{\left\{\frac{\nu^2 s^2}{2}\right\}}\right\} t + \frac{1}{2} \left\{\nu^2 \exp{\left\{\frac{\nu^2 s^2}{2}\right\}}\left(1 + \nu^2 s^2\right)\right\} t^2\right]_{s=0} + o(t^2) \\ &= 1 + \frac{\nu^2 t^2}{2} + o(t^2), \text{ as } t \to 0, \end{align*}

as \(e^x\) is smooth. Combining these facts, we obtain for any \(\nu > 0\), \[\exp{\left\{\frac{\nu^2 t^2}{2}\right\}} - \mathbb{E}(e^{t(X - \mu)}) = \frac{t^2}{2}(\nu^2 - \sigma^2)+ o(t^2), \text{ as } t \to 0.\] To make sure this exceeds zero, at least for small \(t\), we will choose an arbitrary \(\nu > \sigma\). Meanwhile, choose \(\delta > 0\) such that whenever \(|t| < \delta\), \[\left|\frac{1}{t^2}\left( \exp{\left\{\frac{\nu^2 t^2}{2}\right\}} - \mathbb{E}(e^{t(X - \mu)})\right) - \frac{\nu^2 - \sigma^2}{2}\right| < \epsilon,\] where we choose \(\epsilon = \frac{\nu^2 - \sigma^2}{4} > 0\). We then can see that \[\epsilon t^2 + \mathbb{E}(e^{t(X - \mu)}) \leq \exp{\left\{\frac{\nu^2 t^2}{2}\right\}} \text{ for all } t \in (-\delta, \delta),\] so \(X\) is $(ν, \frac{1}{\delta})$-sub-exponential. \(\quad \blacksquare\)

Control of the moments and Bernstein’s condition

Viewing the formula for the MGF as a power series, we can obtain one more useful characterization of its existence in an open interval around zero.

Proposition 3: The MGF of \(X\) is defined on \(D \supseteq (-c, c)\) for some \(c > 0\), if and only if \[\gamma \coloneqq \limsup_{n \to \infty} \left(\frac{\left|\mathbb{E}(X^n)\right|}{n!}\right)^{1/n} < \infty.\]

Proof: By our theorem, \((-c, c) \subseteq D\) for some \(c > 0\) if and only if \(X^n \in L_1\) for all \(n \in \mathbb{N}\) and the radius of convergence \(R\) of the power series \(\sum_{n=0}^\infty \frac{t^n}{n!} \mathbb{E}(X^n)\) is positive. But \(R = \frac{1}{\gamma}\) by standard power series facts, so we conclude that \(R > 0\) if and only if \(\gamma < \infty\). \(\quad \blacksquare\)

With this result in hand, we can more easily see that the Bernstein condition implies the sub-exponential property. Indeed, if \(X\) is $b$-Bernstein for some \(b > 0\), then \[\limsup_{n \to \infty} \left(\frac{\left|\mathbb{E}[(X - \mu)^n]\right|}{n!}\right)^{1/n} \leq b \limsup_{n \to \infty} \left(\frac{\sigma^2}{2b^2}\right)^{1/n} = b < \infty.\] Thus, \(\left(-\frac{1}{b}, \frac{1}{b}\right) \subseteq D\), again using that the domains of \(\Psi_X\) and \(\Psi_{X - \mu}\) are both \(D\). By Proposition 2, we are done!

BONUS: A direct analysis of the Bernstein condition

In talking through this with some peers at CMU, I learned (thanks to Tiger Zeng) that we could also directly verify that the Bernstein condition implies \(\mathbb{E}(e^{|tX|}) < \infty\) for \(t \in \left(-\frac{1}{b}, \frac{1}{b}\right)\). At that point we could just use the DCT to justify the interchange of expectation and sum. Since the idea was pretty clever, I’ve decided to write it out here.

Suppose \(X\) is $b$-Bernstein for \(b > 0\). We wish to show that \(\mathbb{E}(e^{|t| \cdot |X-\mu|}) < \infty\) for any \(t \in \left(-\frac{1}{b}, \frac{1}{b}\right)\), and recall that by the Taylor series for \(e^x\) and the MCT, this expectation always equals \(\sum_{n=0}^{\infty} \frac{|t|^n}{n!}\mathbb{E}(|X-\mu|^n)\). The difficulty comes from the fact that the Bernstein condition does not speak directly to \(\mathbb{E}(|X-\mu|^n)\), but rather to \(\left|\mathbb{E}[(X-\mu)^n]\right|\). For \(n\) even, these coincide, but for \(n\) odd, some care is needed.

Explicitly separating the even and odd terms, we have \[\mathbb{E}(e^{|t| \cdot |X-\mu|}) = \sum_{k=0}^\infty \frac{|t|^{2k}}{(2k)!}\mathbb{E}(|X-\mu|^{2k}) + \sum_{k=0}^\infty \frac{|t|^{2k + 1}}{(2k + 1)!}\mathbb{E}(|X-\mu|^{2k + 1}).\] We show these sums are both finite in turn. Starting with the even terms, since \(X\) is $b$-Bernstein, \[\sum_{k=0}^\infty \frac{t^{2k}}{(2k)!}\mathbb{E}[(X-\mu)^{2k}] = 1 + \sum_{k=1}^\infty \frac{t^{2k}}{(2k)!}\mathbb{E}[(X-\mu)^{2k}] \leq 1 + \frac{\sigma^2}{2} \sum_{k=1}^{\infty}t^{2k}b^{2k - 2} = 1 + \frac{\sigma^2 t^2 / 2}{1 - t^2b^2},\] by summing the geometric series, noting that \(t^2 b^2 = (|t|b)^2 < 1\) by assumption. For the odd terms, the trick is to forcefully introduce even moments using Cauchy-Schwarz: for any \(k \in \mathbb{N}\), \[\mathbb{E}(|X-\mu|^{2k + 1}) = \mathbb{E}(|X - \mu|^{k + 1} |X - \mu|^k) \leq \left\{\mathbb{E}[(X - \mu)^{2k + 2}]\mathbb{E}[(X - \mu)^{2k}]\right\}^{1/2},\] by Cauchy-Schwarz. Thus, as \(X\) is $b$-Bernstein, \[\mathbb{E}(|X-\mu|^{2k + 1}) \leq \frac{\sigma^2 b^{2k-1}}{2} \sqrt{(2k)! (2k+2)!} \leq (2k + 1)! \sigma^2 b^{2k - 1},\] since \(\sqrt{(2k)! (2k+2)!} = (2k + 1)! \sqrt{\frac{2k+2}{2k + 1}} < 2 (2k+1)!\) for all \(k \in \mathbb{N}\). Therefore, \[\sum_{k=0}^\infty \frac{|t|^{2k+1}}{(2k+1)!}\mathbb{E}(|X-\mu|^{2k+1}) \leq \mathbb{E}(|X - \mu|) + \sigma^2\sum_{k=1}^{\infty}|t|^{2k+1}b^{2k-1} = \mathbb{E}(|X - \mu|) + \frac{b \sigma^2 |t|^3}{1 - |t|b}.\] Putting everything together, we have \(\mathbb{E}(e^{|t| \cdot |X-\mu|}) < \infty\). Phew!