Jan. 13, 2019 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

Now we are ready to state some of the most important results in statistics. We begin with the law of large numbers.

Intuitively, the `law of large numbers` says that if we repeat a chance experiment enough times, the average result will be close to the mean of its distribution. For instance, if we flip a fair coin enough times, the proportion of heads will get close to 1/2. But what does "get close" really mean? If we let $X_n$ denote the number of heads in $n$ coin tosses, then as $n$ becomes large, the proportion $X_n/n$ will be close to $1/2$, but the difference between heads and tails, $D = X_n-(n-X_n) = 2X_n-n$, does not converge to 0, contrary to our initial intuition! The quantity $D$ is known as a symmetric random walk: after each flip, $D$ moves up or down by 1, each with probability 1/2.


```
set.seed(9)
flips <- rbinom(1000, 1, 1/2)       # 1000 fair coin flips (1 = heads)
D <- 2 * cumsum(flips) - (1:1000)   # D after i flips: 2*X_i - i
options(repr.plot.width=5, repr.plot.height=4)
plot(1:1000, D, type='l', xlab='Flips', ylab='D', main='Random Walk')
abline(h=0, col='red')
```

As a matter of fact, by the `central limit theorem`, which we will discuss shortly, the fluctuation $D$ grows proportionally to $\sqrt{n}$ as $n$ increases. This does not contradict the law of large numbers: to compute the proportion we divide by $n$, which dominates $\sqrt{n}$ as $n$ gets large, so the fluctuation becomes asymptotically negligible in the proportion.
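This $\sqrt{n}$ growth is easy to check by simulation. Below is a minimal sketch (the sample sizes and the 20,000 replications are arbitrary choices): for several values of $n$ we simulate many copies of $D = 2X_n - n$ and compare the empirical standard deviation with $\sqrt{n}$.

```
set.seed(1)
for (n in c(100, 400, 1600)) {
  # 20000 copies of D = 2*X_n - n, where X_n ~ Binomial(n, 1/2)
  D <- 2 * rbinom(20000, n, 1/2) - n
  cat(sprintf("n = %4d: sd(D) = %6.2f, sqrt(n) = %5.2f\n", n, sd(D), sqrt(n)))
}
```

Quadrupling $n$ should roughly double the empirical standard deviation.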

Formally speaking, let $T_1,...,T_n$ denote the results of the $n$ tosses where $T_i=0$ if the $i$th toss is tails and $T_i=1$ if heads. The difference can be represented by

$$ \begin{aligned} D &= 2(T_1+\cdots + T_n) - n \\ &= 2(T_1+\cdots + T_n - n/2) \end{aligned} $$

By the central limit theorem, the quantity $T_1+\cdots + T_n - n/2$ has approximately a normal distribution with mean $0$ and standard deviation $\sigma\sqrt{n}$, where $\sigma=1/2$ is the standard deviation of the $T_i$'s. This explains the growth of the fluctuation. On the other hand, the proportion $P_n$ of heads in $n$ coin tosses is represented as

$$ P_n = \frac{1}{n}(T_1+\cdots +T_n), $$

which by the central limit theorem is approximately normal with mean $0.5$ and standard deviation $\frac{1}{2\sqrt{n}}$. This means the proportion becomes more and more concentrated around 0.5. Let's formalize the notion of the LLN below.
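The shrinking standard deviation $\frac{1}{2\sqrt{n}}$ can also be checked empirically. A small sketch (the values of $n$ and the 20,000 replications are arbitrary choices):

```
set.seed(2)
for (n in c(100, 400, 1600)) {
  # 20000 simulated proportions of heads in n tosses
  P <- rbinom(20000, n, 1/2) / n
  cat(sprintf("n = %4d: sd(P_n) = %.4f, 1/(2*sqrt(n)) = %.4f\n",
              n, sd(P), 1 / (2 * sqrt(n))))
}
```

Each quadrupling of $n$ should roughly halve the spread of the proportion.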

Theorem 1 (The Weak Law of Large Numbers). Let $X_1,...,X_n$ be iid random variables with finite mean $\mu$ and finite standard deviation $\sigma$. Then $$\overline{X}_n = \frac{X_1+\cdots+X_n}{n} \overset{p}{\to}\mu. $$

*Proof* By Markov's inequality,

$$ \begin{aligned} \P{\abs{\overline{X}_n-\mu}>\epsilon} &=\P{\abs{\overline{X}_n-\mu}^2 >\epsilon^2} \\ &\leq \frac{\text{Var}\left(\overline{X}_n\right)}{\epsilon^2} & [\text{Markov}]\\ &= \frac{\sigma^2}{n\epsilon^2} & [\sigma <\infty]. \end{aligned} $$

This quantity tends to 0 as $n \to \infty$. $\blacksquare$
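The proof's bound $\sigma^2/(n\epsilon^2)$ can be compared against a Monte Carlo estimate of the left-hand side. A sketch for fair coin flips, so $\mu = 1/2$ and $\sigma^2 = 1/4$ (the choices $n=100$, $\epsilon=0.1$, and 20,000 replications are arbitrary); the bound is typically quite loose:

```
set.seed(3)
n <- 100; eps <- 0.1; reps <- 20000
xbar <- rbinom(reps, n, 1/2) / n           # sample means of n Bernoulli(1/2) draws
empirical <- mean(abs(xbar - 0.5) > eps)   # Monte Carlo estimate of the tail
bound <- (1/4) / (n * eps^2)               # sigma^2 / (n * eps^2)
cat("empirical:", empirical, " Chebyshev-type bound:", bound, "\n")
```

The empirical tail probability should sit well below the bound, illustrating that the proof's inequality is far from tight.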

Before proceeding, we need the following two important results.

Lemma 1 (Chernoff Bound). Let $X$ be a random variable with moment-generating function $M_X(t)=\E{e^{tX}}$. Then $$ \P{X>\epsilon} \leq \inf_{t\geq 0} e^{-t\epsilon}M_X(t). $$

*Proof* For any $t>0$, we have

$$ \begin{aligned} \P{X>\epsilon} &= \P{e^{tX}>e^{t\epsilon}} \\ &\leq e^{-t\epsilon}\E{e^{tX}} & [\text{Markov}] \\ &= e^{-t\epsilon}M_X(t). \end{aligned} $$

Since this holds for every $t>0$, and the bound is trivially true at $t=0$ (where it reads $\P{X>\epsilon}\leq 1$), the lemma is proved. $\blacksquare$
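As a quick illustration, for $X \sim N(0,1)$ the MGF is $M_X(t) = e^{t^2/2}$, so the infimum of $e^{-t\epsilon}M_X(t)$ over $t \geq 0$ is attained at $t=\epsilon$ and equals $e^{-\epsilon^2/2}$. The sketch below compares this analytic infimum, a numerical minimization, and the exact tail (the choice $\epsilon = 2$ is arbitrary):

```
eps <- 2
chernoff <- exp(-eps^2 / 2)   # analytic infimum of exp(-t*eps) * exp(t^2/2), t >= 0
numeric_inf <- optimize(function(t) exp(-t * eps + t^2 / 2), c(0, 10))$objective
exact <- 1 - pnorm(eps)       # exact tail probability P(X > eps)
cat("Chernoff:", chernoff, " numeric inf:", numeric_inf, " exact:", exact, "\n")
```

The bound is valid but not tight; for Gaussian tails it captures the correct $e^{-\epsilon^2/2}$ decay rate while missing the polynomial prefactor.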

Lemma 2 (Borel-Cantelli). Let $E_1,E_2,...$ be a sequence of events. Then $$\sum_{n=1}^{\infty} \P{E_n}<\infty \implies \P{\limsup_{n\to\infty} E_n} = 0,$$

where the limsup is defined as

$$\limsup_{n\to\infty} E_n = \bigcap_{n=1}^{\infty}\bigcup_{m\geq n} E_m,$$

which represents the set of outcomes that occur `infinitely often` (i.o.) in the infinite sequence of events $\{E_n\}$.

*Proof* Since the series $\sum \P{E_n}$ converges, we must have the tail sum go to zero, i.e.,

$$ \lim_{n\to\infty} \sum_{m\geq n}\P{E_m} = 0. $$

This implies that

$$ \inf_{n\geq 1} \sum_{m\geq n}\P{E_m} = 0. $$

Hence,

$$ \begin{aligned} \P{\limsup_{n\to\infty} E_n} &= \P{\bigcap_{n=1}^{\infty}\bigcup_{m\geq n}E_m} \\ &= \inf_{n\geq 1}\P{\bigcup_{m\geq n}E_m} & [\text{decreasing sets}] \\ &\leq \inf_{n\geq 1}\sum_{m\geq n}\P{E_m} & [\text{Bonferroni}]\\ &=0. \end{aligned} $$

This concludes the proof. $\blacksquare$

Theorem 2 (Strong Law of Large Numbers). Under the same conditions as in Theorem 1, we have $\overline{X}_n \overset{a.s.}{\to}\mu$.

*Proof* Fix $\epsilon>0$. We have

$$ \P{\abs{\overline{X}_n-\mu}\geq\epsilon} =\P{\overline{X}_n \geq \mu+\epsilon} + \P{\overline{X}_n\leq\mu-\epsilon} $$

Assume that the moment-generating function $M_X(t)$ of the $X_i$'s exists. Then the moment-generating function of $X_1+\cdots+X_n$ is $\left[M_X(t)\right]^n$. By the Chernoff bound, we have

$$ \begin{aligned} \P{\overline{X}_n \geq \mu + \epsilon} &\leq \inf_{t\geq0} e^{-n(\mu+\epsilon)t} \left[M_X(t)\right]^n \\ &= \inf_{t\geq 0}\left[e^{-(\mu+\epsilon)t}M_X(t)\right]^n \\ &\leq \lambda^n, \end{aligned} $$

where $\lambda = \inf_{t\geq 0} e^{-(\mu+\epsilon)t}M_X(t)$. Let $G(t) = e^{-(\mu+\epsilon)t}M_X(t)$. Since $M_X(0)=1$ and $M'_X(0)=\mu$, we have $G(0)=1$, and

$$ G'(0) = -(\mu+\epsilon)+\mu = -\epsilon <0. $$

Since $G$ is continuous with $G(0)=1$ and $G'(0)=-\epsilon<0$, there is some $t>0$ with $G(t)<1$; because $G$ is nonnegative, we conclude that $0\leq \lambda <1$. Similarly, we can show that

$$ \P{\overline{X}_n \leq \mu-\epsilon} \leq \eta^n, $$

where $\eta = \inf_{t\leq 0} e^{-(\mu-\epsilon)t} M_X(t)$; for the lower tail, Markov's inequality is applied to $e^{tX}$ with $t\leq 0$, and the same derivative argument (the slope of $e^{-(\mu-\epsilon)t}M_X(t)$ at $t=0$ is now $+\epsilon$, so moving $t$ slightly below 0 pushes the value below 1) gives $0\leq\eta<1$. Thus, we have

$$ \begin{aligned} \sum_{n=1}^{\infty}\P{\abs{\overline{X}_n-\mu}\geq\epsilon} &\leq \sum_{n=1}^{\infty}(\lambda^n+\eta^n) \\ &=\frac{\lambda}{1-\lambda}+\frac{\eta}{1-\eta} \\ &< \infty. \end{aligned} $$

Now invoking the Borel-Cantelli lemma, for any positive integer $k$ we have $\P{A_k}=0$, where

$$ \begin{aligned} A_{k} &= \limsup_{n\to\infty} \left\{\omega\in\Omega \st \abs{\overline{X}_n(\omega)-\mu}\geq \frac{1}{k} \right\} \\ &= \left\{\omega\in\Omega \st \abs{\overline{X}_n(\omega)-\mu}\geq \frac{1}{k} \text{ infinitely often}\right\}. \end{aligned} $$

The complement of $A_{k}$, denoted by $A^c_{k}$, is the set

$$ A^c_{k} =\left\{\omega\in\Omega \st \abs{\overline{X}_n(\omega)-\mu}<\frac{1}{k} \text{ for all sufficiently large } n \right\}, $$

where $\P{A^c_{k}}=1$. As $k$ increases, $\{A^c_k\}_{k=1}^{\infty}$ forms a decreasing sequence of sets, and $\bigcap_{k=1}^{\infty}A^c_k$ is exactly the event that $\overline{X}_n\to\mu$. Using the continuity property of probability measures,

$$ \begin{aligned} \P{\lim_{n\to\infty} \overline{X}_n=\mu} &= \P{\bigcap_{k=1}^{\infty} A^c_k} \\ &= \lim_{k\to\infty} \P{A^c_{k}} & [\text{continuity}]\\ &=1. \end{aligned} $$

This proves almost sure convergence. $\blacksquare$
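Almost-sure convergence is a statement about entire sample paths, which we can visualize: along a single simulated sequence of coin flips, the running mean should enter a band around $1/2$ and stay there. A sketch (the path length and the burn-in cutoff are arbitrary choices):

```
set.seed(5)
N <- 50000
flips <- rbinom(N, 1, 1/2)              # a single path of N fair coin flips
running_mean <- cumsum(flips) / (1:N)   # sample mean after each flip
# worst deviation from 1/2 anywhere on the path beyond the burn-in
max_dev <- max(abs(running_mean[10000:N] - 0.5))
cat("largest deviation from 1/2 beyond n = 10000:", max_dev, "\n")
```

Note the difference from the weak law: here we bound the *supremum* of the deviation over the whole tail of one path, not just the deviation at a single $n$.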

So what does the strong version say that the weak version can't? The strong law of large numbers says that no matter how small $\epsilon$ is, the sample mean will eventually be within a distance $\epsilon$ of $\mu$ and *stay* there. This conclusion cannot be drawn from the weak law of large numbers, which only states that for a specified large $n$, the sample mean is *likely* to be near $\mu$. The weak statement is sharpened by the central limit theorem, which says that for a fixed $n$,

$$ \P{\abs{\overline{X}_n - \mu} \geq \epsilon} \approx 2\left(1-\Phi(\epsilon\sqrt{n}/\sigma)\right), $$

where $\sigma$ is the standard deviation of each $X_i$. With the strong law of large numbers, we can further claim that for any $\epsilon>0$, almost surely there exists an $N$ (depending on the sample path) such that the error is at most $\epsilon$ whenever $n>N$.
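The CLT approximation above can be checked by Monte Carlo. A sketch using $\text{Uniform}(0,1)$ samples, so $\mu = 1/2$ and $\sigma = 1/\sqrt{12}$ (the choices of $n$, $\epsilon$, and the number of replications are arbitrary):

```
set.seed(6)
n <- 100; eps <- 0.06; reps <- 20000
sigma <- 1 / sqrt(12)                               # sd of Uniform(0,1)
xbar <- rowMeans(matrix(runif(n * reps), reps, n))  # reps sample means of size n
empirical <- mean(abs(xbar - 0.5) >= eps)           # Monte Carlo tail estimate
clt_approx <- 2 * (1 - pnorm(eps * sqrt(n) / sigma))
cat("Monte Carlo:", empirical, " CLT approximation:", clt_approx, "\n")
```

The two numbers should agree closely, since sums of uniforms are nearly normal even for moderate $n$.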

Now we show an interesting application of the law of large numbers. Consider a machine that ages and needs to be replaced once in a while. The machine has a continuously distributed lifetime and is replaced either upon failure or when its age reaches $T>0$. If the machine lasts until time $T$, there is a fixed cost $c_1>0$ for the planned replacement. On the other hand, if the machine fails before time $T$, the cost of the failure replacement is $c_2>c_1$. A reward $R_k$ is earned for the $k$th machine during its lifetime $L_k\leq T$. This is a classic example of a `regenerative` stochastic process: each time we replace an old machine, the process restarts, independent of its past. The problem can be approached using a Monte Carlo simulation.

We generate iid lifetimes $L_1,L_2,...$ by taking the minimum of $T=1$ and a value drawn from an exponential distribution with rate parameter $\lambda=1$. Furthermore, we assume that the reward $R_k$ earned by the $k$th machine during its lifetime $L_k$ is distributed as $\text{Poisson}(L_k)$. So we have the following hierarchical process:

$$ \begin{aligned} R_k \given L_k &\sim \text{Poisson}(L_k) \\ L_k &= \min(T, Y) \\ Y &\sim \text{Exponential}(1) \end{aligned} \tag{1} $$

First we generate $n$ cycle lengths $L_1,...,L_n$ from the exponential distribution with rate 1, truncating at $T$ any lifetime that exceeds it, where $T$ is the age at which a planned replacement occurs. At the same time, we generate the rewards according to (1), subtracting the appropriate replacement cost.


```
n <- 1000        # number of cycles
c1 <- 0.1        # cost of a planned replacement
c2 <- 0.2        # cost of a failure replacement
lambda <- 1      # rate of the exponential lifetime
Tmax <- 1        # maximum age before a planned replacement
                 # (named Tmax to avoid masking R's built-in T)
L <- rep(0, n)   # storage for cycle lengths
R <- rep(0, n)   # storage for net rewards
set.seed(39)     # for reproducible results
for (i in 1:n) {
  L[i] <- min(Tmax, rexp(1, lambda))  # generate L_k as in (1)
  Rk <- rpois(1, L[i])                # reward earned during the cycle
  if (L[i] == Tmax) {
    R[i] <- Rk - c1                   # planned replacement
  } else {
    R[i] <- Rk - c2                   # failure replacement
  }
}
```

Let $R(t)$ denote the cumulative reward earned up to time $t$. We are interested in the ratio $R(t)/t$. Intuitively we expect it to converge to something, but to what? Let's illustrate this graphically by plotting $t$ against the ratio $R(t)/t$. In R this can be easily accomplished using the `cumsum` function.


```
t <- cumsum(L)                         # elapsed time after each cycle
plot(t, cumsum(R) / t, type='l', xlab='t', ylab='R(t)/t')
abline(h=mean(R) / mean(L), col='red') # ratio of the sample means
```

We see that the ratio $R(t)/t$ converges to the ratio between the average of $R_k$ and the average of $L_k$! Let's formalize this result.

Definition 1. A stochastic process $\{S(t), t\in \Gamma\}$ is `regenerative` if there exist time points $0= T_0<T_1<T_2<\cdots$ such that the processes $\{S(T_k+t), t\in \Gamma\}$ are independent and identically distributed for all $k$. The time interval between two successive regenerations, $L_{k+1}=T_{k+1}-T_k$, is called a `cycle`. Let the random variable $R_k$ denote the `reward` earned in cycle $k$. For any $t>0$, we let $R(t)$ denote the `cumulative reward` earned up to time $t$. The iid assumption implies that both the $L_k$'s and the $R_k$'s are iid random variables.

Now for a regenerative continuous-time stochastic process $\{S(t), t\in \Gamma\}$, assume that $L_k$'s and $R_k$'s are iid with finite expected value. We have the following result.

Theorem 3 (Renewal-Reward). With probability 1, $$ \lim_{t\to\infty}\frac{R(t)}{t} = \frac{\E{R_1}}{\E{L_1}}.$$

*Proof* For any $t>0$, let $N(t)$ denote the number of cycles completed up to time $t$. First note that

$$ \sum_{k=1}^{N(t)} L_k \leq t < \sum_{k=1}^{N(t)+1} L_k. $$

Dividing all terms by $N(t)$, we obtain

$$ \frac{1}{N(t)}\sum_{k=1}^{N(t)}L_k \leq \frac{t}{N(t)} < \frac{1}{N(t)}\sum_{k=1}^{N(t)+1}L_k. $$

Taking the limit as $t\to\infty$ and noting that $N(t)\to\infty$ almost surely (each $L_k$ is finite), the strong law of large numbers shows that the leftmost and rightmost terms both converge almost surely to $\E{L_1}$, and hence so does the middle term $t/N(t)$. This shows that with probability 1,

$$ \lim_{t\to\infty} \frac{N(t)}{t} = \frac{1}{\E{L_1}}. \tag{2} $$

Next we consider the case when all the rewards $R_k$ are nonnegative. By definition,

$$ \sum_{k=1}^{N(t)}R_k \leq R(t) \leq \sum_{k=1}^{N(t)+1}R_k. $$

This implies that

$$ \frac{1}{N(t)}\sum_{k=1}^{N(t)}R_k \times \frac{N(t)}{t} \leq \frac{R(t)}{t} \leq \frac{1}{N(t)}\sum_{k=1}^{N(t)+1} R_k \times \frac{N(t)}{t}. $$

Taking the limit as $t\to\infty$ and using (2), we arrive at the desired result. Finally if $R_k$ is negative for some $k$, we can take $R_k = R_k^+ - R_k^-$ where $R_k^+ =\max(R_k,0)$ and $R_k^-=-\min(R_k,0)$, and repeat the proof for both parts. $\blacksquare$
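For the machine-replacement example, both expectations in Theorem 3 are available in closed form under model (1) with $T=1$ and $\lambda=1$: $\E{L_1} = \int_0^1 e^{-t}\,dt = 1-e^{-1}$, and since $\E{R_k \given L_k} = L_k - \text{cost}$, we get $\E{R_1} = (1-e^{-1}) - \left(c_1 e^{-1} + c_2(1-e^{-1})\right)$. The sketch below (the replication count is an arbitrary choice) compares the simulated long-run ratio with this theoretical value:

```
set.seed(7)
n <- 20000; c1 <- 0.1; c2 <- 0.2; Tmax <- 1
L <- pmin(Tmax, rexp(n, 1))                   # truncated exponential lifetimes
R <- rpois(n, L) - ifelse(L == Tmax, c1, c2)  # Poisson reward minus replacement cost
EL <- 1 - exp(-1)                             # E[L_1] in closed form
ER <- EL - (c1 * exp(-1) + c2 * (1 - exp(-1)))# E[R_1] in closed form
cat("simulated R(t)/t:", mean(R) / mean(L), " theoretical:", ER / EL, "\n")
```

The agreement of the two numbers is exactly what Theorem 3 predicts: the long-run reward rate is the ratio of the per-cycle expectations, not the expectation of the per-cycle ratio.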