---
title: Probabilities
author: Pierre Veron
---

<p><strong><span style="font-size: xx-large;">Probability memo</span></strong></p>
:::info
École normale supérieure -- PSL, Département de biologie
:mortar_board: [Maths training 2024-2025](https://codimd.math.cnrs.fr/s/hmbX8GuA4#)
:bust_in_silhouette: Pierre Veron
:email: `pveron [at] bio.ens.psl.eu`
:::
[TOC]
# 1 Basics of probabilities
## 1.1 Definitions, random variables
* the **sample space**, denoted $\Omega$, is the set of all possible outcomes of a random experiment.
* an **outcome** is an element of $\Omega$.
* an **event** is a subset of $\Omega$ on which we can calculate a probability (we say it is measurable).
* the **distribution** describes the measure of probabilities, i.e. a distribution characterises the value of $P(A)$ for every event $A \subset \Omega$.
> Example.
> We toss a coin and observe the result. The sample space is $\Omega = \{\text{heads}, \text{tails}\}$. Each element of $\Omega$ is an outcome.
> An **event** is a measurable subset of $\Omega$. For instance $\{\text{heads}\}$ is an event.
> In this example, the probability distribution is characterized by the value of $p = P(\{\text{heads}\})$ since $P(\{\text{tails}\}) = 1-p$.
A probability space is often denoted $(\Omega, \mathcal{F}, P)$ where $\Omega$ is the sample space, $\mathcal{F}$ is the set of events and $P$ the distribution or measure.
We can now give the formal definition of a random variable.
:::danger
**Definition**
Given a probability space $(\Omega, \mathcal{F}, P)$, a **random variable** $X$ is a function $\Omega \to \mathbb{R}$.
The image of that random variable $X(\Omega)$ is the set of all values taken by the random variable.
* If $X(\Omega)$ is finite or countable, we say $X$ is a **discrete** random variable.
* If $X(\Omega)$ is neither finite nor countable, we say $X$ is a **continuous** random variable.
:::
> Examples
> * We roll a die and note $X$ the obtained result. $X$ is a discrete random variable with $X(\Omega) = \{1, 2, 3, 4, 5, 6\}$.
> * We roll a die until we obtain 6 and note $T$ the number of throws needed to reach $6$. $T$ is a discrete random variable with $T(\Omega) = \mathbb{N}^* = \{1, 2, 3, ...\}$.
> * We pick a random place on Earth and note $\Theta$ its current temperature, in degrees. $\Theta$ is a continuous random variable with $\Theta(\Omega) = \mathbb{R}$.
## 1.2 What is a distribution?
The distribution of a random variable determines the probabilities of all the measurable events in $X(\Omega)$. The distribution is a function on $X(\Omega)$ but it is defined differently depending on whether we work with discrete or continuous random variables.
### 1.2.1 Probability mass function (PMF) of a discrete random variable
:::danger
**Definition: probability mass function (PMF) of a discrete random variable**
Let $X$ be a discrete random variable. The probability mass function $p_X$ is the function:
\begin{equation}
p_X: \left(\begin{array}{rcl} X(\Omega) & \to & [0,1] \\ x & \mapsto & P(X = x) \end{array} \right)
\end{equation}
:::
A discrete random variable can be determined by the probability of each of its values.
A PMF has to satisfy the condition:
\begin{equation}
\sum_{x \in X(\Omega)} p_X(x) = 1.
\end{equation}
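To make this concrete, here is a minimal Python sketch (Python is our choice of illustration language, not part of the memo itself) that stores the PMF of a fair die and checks the normalisation condition:
```python
# PMF of a fair six-sided die, stored as {value: probability}.
pmf = {x: 1 / 6 for x in range(1, 7)}

# A PMF must sum to 1 over the support X(Omega).
assert abs(sum(pmf.values()) - 1) < 1e-12

# The probability of an event is the sum of the PMF over that event,
# e.g. "the result is even":
p_even = sum(p for x, p in pmf.items() if x % 2 == 0)
print(p_even)  # 0.5
```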
### 1.2.2 Probability density function (PDF) of a continuous random variable
:::danger
**Definition: probability density function (PDF) of a continuous random variable**
Let $X$ be a continuous random variable. The probability density function $f_X$ is the function $$f_X : X(\Omega) \to \mathbb R_+$$ such that, for all $a, b$ with $a<b$,
\begin{equation}
P(a \le X \le b) = \int_a^b f_X(x) dx.
\end{equation}
:::
A continuous random variable can be determined by its probability density function.
A PDF has to satisfy the condition:
\begin{equation}
\int_{X(\Omega)} f_X(x) dx = 1.
\end{equation}
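As a numerical illustration (assuming `scipy` is available), the sketch below checks the normalisation condition for the exponential PDF $f_X(x) = \lambda e^{-\lambda x}$, which is introduced formally in section 2.2.3, and computes $P(a \le X \le b)$ by integration:
```python
import numpy as np
from scipy.integrate import quad

lam = 2.0  # example rate, chosen arbitrarily
f = lambda x: lam * np.exp(-lam * x)  # exponential PDF on [0, +inf)

# Normalisation: the PDF must integrate to 1 over the support.
total, _ = quad(f, 0, np.inf)
print(total)  # ~ 1.0

# P(a <= X <= b) is the integral of the PDF between a and b.
p_ab, _ = quad(f, 0.5, 1.5)
print(p_ab)  # ~ exp(-1) - exp(-3) ≈ 0.318
```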
### 1.2.3 Cumulative distribution function (CDF) of a random variable
:::danger
**Definition: cumulative distribution function of a random variable**
Let $X$ be a random variable. The cumulative distribution function of $X$ is the function $F_X$ defined by:
\begin{equation}
F_X: \left(\begin{array}{rcl} \mathbb R & \to & [0,1] \\ x & \mapsto & P(X \le x) \end{array} \right).
\end{equation}
:::
**Properties of the CDF**
* Left and right limits
\begin{equation}
\lim_{x \to -\infty} F_X (x) = 0 \quad \text{and} \quad \lim_{x \to +\infty} F_X (x) = 1
\end{equation}
* $F_X$ is a non-decreasing function: if $a<b$ then $F_X(a) \le F_X(b)$.
* For $a<b$:
\begin{equation}
F_X(b) - F_X(a) = P(a < X \le b).
\end{equation}
* If $X$ is a continuous random variable, then $F_X$ is differentiable and the PDF is the derivative of the CDF:
\begin{equation}
f_X(x) = \frac{d F_X(x)}{d x} \quad \Leftrightarrow F_X(x) = \int_{-\infty}^x f_X(y) dy.
\end{equation}
* If $X$ is a discrete random variable with finite state space, then $F_X$ is piecewise-constant with a jump at each value of $X(\Omega)$:
\begin{equation}
\forall x \in \mathbb R, \quad F_X(x) = \sum_{y\in X(\Omega) | y \le x} p_X(y)
\end{equation}
where $p_X$ is the PMF of the discrete random variable.
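As an illustration of this last property, here is a small Python sketch (our own example, with the fair die from above) computing the step CDF as a sum of PMF values:
```python
import numpy as np

values = np.arange(1, 7)   # support of a fair die
pmf = np.full(6, 1 / 6)    # p_X(y) = 1/6 for each value

def F(x):
    """CDF of the die: sum of p_X(y) over all y <= x."""
    return pmf[values <= x].sum()

print(F(0.5))  # 0.0 : below the support
print(F(3.5))  # 0.5 : constant between the jumps at 3 and 4
print(F(6.0))  # 1.0 : the CDF reaches 1 at the last value
```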
## 1.3 Expected value
The expected value (or mathematical expectation) can be understood as the arithmetic mean of the possible values a random variable can take, weighted by the probability of those outcomes.
### 1.3.1 Formal definition of the expected value
The definition of the expected value depends on whether the variable is discrete or continuous.
:::danger
**Definition: expected value of a ==discrete== random variable**
If $X$ is a discrete random variable with PMF $p_X$, the expected value is the number
\begin{equation}
E[X] = \sum_{x \in X(\Omega)} x p_X(x) = \sum_{x \in X(\Omega)} x P(X = x).
\end{equation}
:::
> Example: dice
> We roll a fair die. $X(\Omega) = \{1,2,3,4,5,6\}$. For all $x \in X(\Omega)$, $p_X(x) = 1/6$. So the expected value is:
> \begin{equation}
> E[X] = \sum_{x = 1}^{6} x \times \frac{1}{6} = \frac{1}{6} \times (1+2+3+4+5+6) = 3.5
> \end{equation}
:::danger
**Definition: expected value of a ==continuous== random variable**
If $X$ is a continuous random variable with PDF $f_X$, the expected value is the number
\begin{equation}
E[X] = \int_{X(\Omega)} x f_X(x) dx
\end{equation}
:::
:::info
Some random variables have an infinite expected value even though all their values are finite (for example a discrete RV with $p_X(k) \propto 1/k^2$ on $\mathbb N^*$).
:::
### 1.3.2 Properties of the expected value
* Linearity: if $X$ and $Y$ are random variables and $a$ is a number, then $E[aX+Y] = aE[X] + E[Y]$.
* Non-negativity: if $X \ge 0$ then $E[X] \ge 0$.
> More precisely, if $P(X \ge 0) = 1$ then $E[X] \ge 0$.
* If $X \le Y$ then $E[X] \le E[Y]$.
* If $X$ and $Y$ are independent random variables (see the definition [in this section](#1.5-Conditional-probabilities-and-Bayes-theorem)) then $E[XY] = E[X] E[Y]$.
> This property is not an equivalence: one can **not** conclude that $X$ and $Y$ are independent based only on the observation that $E[XY]= E[X] E[Y]$.
* If $g : E \to \mathbb R$ where $X(\Omega) \subset E$, then
\begin{equation}
E[g(X)] = \sum_{x \in X(\Omega)} g(x) P(X = x) \quad \text{in the case of a discrete RV,}
\end{equation}
or
\begin{equation}
E[g(X)] = \int_{X(\Omega)} g(x) f_X(x) dx \quad \text{in the case of a continuous RV.}
\end{equation}
:::success
**Property: Markov inequality**
If $X$ is a *non-negative* random variable with a finite expected value, then, for all $a > 0$:
\begin{equation}
P(X \ge a) \le \frac{E[X]}{a}.
\end{equation}
:::
:::spoiler proof
In the case of a continuous RV, since $X(\Omega) \subset [0, +\infty[$:
\begin{align}
E[X] &= \int_{0}^{+\infty} x f_X(x) dx \\
&= \int_{0}^{a} x f_X(x) dx + \int_{a}^{+\infty} x f_X(x) dx \\
& \ge \int_{a}^{+\infty} x f_X(x) dx \quad \text{since the first integral is non-negative} \\
& \ge \int_{a}^{+\infty} a f_X(x) dx \quad \text{since $x \ge a$ on the interval of this integral} \\
& \ge a \int_{a}^{+\infty} f_X(x) dx \\
& \ge a P(X \ge a)
\end{align}
so $P(X \ge a) \le \frac{E[X]}{a}. \quad \square$
:::
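A quick simulation makes the inequality tangible. The sketch below (with our own choice of distribution, an exponential with rate 1, so $E[X]=1$) compares the empirical value of $P(X \ge a)$ with the bound $E[X]/a$:
```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # non-negative RV with E[X] = 1

for a in (1, 2, 5):
    empirical = (x >= a).mean()   # estimate of P(X >= a)
    bound = x.mean() / a          # Markov bound E[X]/a (estimated)
    print(a, empirical, bound)    # the probability always stays below the bound
```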
### 1.3.3 Law of large numbers
The law of large numbers states that if we repeat the same experiment a large number of times, the average result will be close to the expected value of this experiment.
Formally it can be written in two ways, a weak and a strong version. The difference between the two is subtle and lies in the mode of convergence (in probability for the weak law, almost surely for the strong law), so we will not go into the details and we will only present the strong law.
:::success
**Law of large numbers**
Let $X_1, X_2, X_3,...$ be an infinite sequence of independent and identically distributed random variables such that $E[|X_1|]<\infty$. It means that all of them have the same distribution and they are independent. Let us introduce the sample mean:
\begin{equation}
\overline X_n:= \frac{X_1 + X_2 + \cdots + X_n}{n}
\end{equation}
Denoting $\mu = E[X_1]$ the expected value of these variables (it is the same value for all of them because they are identically distributed), we have:
\begin{equation}
\overline X_n \longrightarrow \mu = E[X_1] \quad \text{almost surely, as $n \to \infty$}
\end{equation}
> "Almost surely" here means that
> \begin{equation}
> P(\lim_{n\to \infty} \overline X_n = \mu) = 1.
> \end{equation}
:::
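The law is easy to observe numerically. Here is a sketch (again with die rolls, whose expected value is 3.5) showing the sample mean approaching $\mu$ as $n$ grows:
```python
import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=100_000)  # iid die rolls, E[X_1] = 3.5

# Running sample mean for increasing n.
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)
for n in (10, 1_000, 100_000):
    print(n, running_mean[n - 1])  # approaches 3.5 as n grows
```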
## 1.4 Variance
### 1.4.1 Definition of variance
:::danger
**Definition: variance**
Let $X$ be a random variable. We define the **variance** of $X$ as the number:
\begin{equation}
V(X) := E\left[ (X - E[X])^2\right].
\end{equation}
This other definition is equivalent: $V(X) = E[X^2] - E[X]^2$.
:::
:::spoiler proof of the equivalence of the two definitions
\begin{align*}
E\left[ (X - E[X])^2\right] &= E[ X ^2 - 2X E[X] + E[X]^2] \\
&= E[X^2] - 2E[X E[X]] + E[E[X]^2] \quad \text{by linearity of the expected value} \\
&= E[X^2] - 2E[X] E[X] + E[X]^2 \quad \text{since $E[X]$ is a number, not a random variable} \\
&= E[X^2] - E[X]^2. \quad \square
\end{align*}
:::
:::info
Not all variables have a finite variance. Naturally, a variable that has no finite expected value has no finite variance.
:::
The standard deviation is the square root of the variance, so the variance is sometimes denoted $\sigma^2$.
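The equivalence of the two definitions can also be checked numerically; the sketch below (with fair die rolls of our own making, for which $V(X) = 35/12 \approx 2.917$) computes both expressions:
```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(1, 7, size=100_000).astype(float)  # fair die rolls

v1 = np.mean((x - x.mean()) ** 2)      # E[(X - E[X])^2]
v2 = np.mean(x ** 2) - x.mean() ** 2   # E[X^2] - E[X]^2
print(v1, v2)  # both ~ 35/12 ≈ 2.917, up to sampling noise
```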
### 1.4.2 Properties of variance
* Non-negativity: for any RV, we have $V(X) \ge 0$.
* $V(X)=0$ if and only if $X$ is almost surely constant, i.e. there is a number $x$ such that $P(X=x) = 1$.
* Affine transformation: if $a$ and $b$ are real numbers and $X$ a RV, then $V(aX+b) = a^2 V(X)$.
* Sum of two independent variables: if $X$ and $Y$ are *independent* RV, then $V(X+Y) = V(X) + V(Y)$. See the section on [covariance](#1.4.3-Covariance) for the general case when $X$ and $Y$ are not independent.
### 1.4.3 Covariance
Let us try to calculate the variance of a sum of two RV without making the assumption of independence:
\begin{align*}
V(X+Y) &= E\left[(X+Y-E[X+Y])^2\right] \\
&= E\left[(X + Y -E[X] - E[Y])^2\right] \quad \text{since $E[X+Y] = E[X]+E[Y]$ in all cases} \\
&= E\left[((X - E[X]) + (Y -E[Y]))^2\right] \\
&= E\left[(X - E[X])^2 + 2 (X-E[X])(Y-E[Y]) + (Y-E[Y])^2\right] \\
&= E\left[(X-E[X])^2\right] + 2 E[(X-E[X])(Y-E[Y])] + E\left[(Y-E[Y])^2\right] \\
& \qquad \qquad \text{by linearity of the expected value} \\
&= V(X) + 2 E[(X-E[X])(Y-E[Y])] + V(Y).
\end{align*}
The middle term is the expected value of the product of the fluctuations of $X$ and $Y$ (value minus expected value): this is called the covariance.
:::danger
**Definition: covariance**
Let $X$ and $Y$ be two random variables with a finite variance. The covariance between $X$ and $Y$ is the number
\begin{equation}
\mathrm{Cov}(X,Y) = E[(X-E[X])(Y-E[Y])]
\end{equation}
This alternative definition is equivalent: $\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y]$.
:::
**Properties**
* If $X$ and $Y$ are independent, then $\mathrm{Cov}(X,Y) = 0$ (the converse is false).
* $V(X+Y) = V(X) + V(Y) + 2 \mathrm{Cov}(X,Y)$
* symmetry: $\mathrm{Cov}(X,Y) = \mathrm{Cov}(Y,X)$
* $\mathrm{Cov}(X,X) = V(X)$
* homogeneity: $\mathrm{Cov}(aX, Y) = a \mathrm{Cov}(X, Y)$
* bilinearity: if $Z$ and $W$ are two other RV, $\mathrm{Cov}(X+Y, Z+W) = \mathrm{Cov}(X,Z) + \mathrm{Cov}(X,W) + \mathrm{Cov}(Y, Z) + \mathrm{Cov}(Y,W)$.
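These identities are easy to check by simulation. Below is a sketch with two correlated variables of our own making ($Y = X + \varepsilon$ with independent noise $\varepsilon$), verifying $V(X+Y) = V(X) + V(Y) + 2 \mathrm{Cov}(X,Y)$:
```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)   # correlated with x by construction

cov = np.mean((x - x.mean()) * (y - y.mean()))  # empirical Cov(X, Y)
lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * cov
print(cov)       # ~ 1, since Cov(X, X + eps) = V(X)
print(lhs, rhs)  # both ~ 5: the identity holds up to sampling noise
```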
## 1.5 Conditional probabilities and Bayes theorem
:::danger
**Definition: conditional probability**
Let $A$ and $B$ be two events such that $P(B) > 0$. We define the **probability of $A$ knowing $B$**:
\begin{equation}
P(A | B) := \frac{P(A \cap B)}{P(B)}.
\end{equation}
:::
:::danger
**Definition: independence**
The events $A$ and $B$ are **independent** if $P(A \cap B) = P(A) \times P(B)$.
:::
:::success
**Property: alternative definition**
If $P(B)>0$, $A$ and $B$ are independent if and only if $P(A | B) = P(A)$.
:::
In other words, knowing $B$ has no influence on the probability of $A$.
:::spoiler proof
\begin{align*}
A \; \text{and}\; B\; \text{independent} & \Leftrightarrow P(A \cap B) = P(A) P(B) \\
& \Leftrightarrow P(A | B) P(B) = P(A) P(B) \quad \text{by definition of conditional probabilities} \\
& \Leftrightarrow P(A | B) = P(A) \quad \text{after simplification.}\quad \square
\end{align*}
:::
We can also define:
:::danger
**Definition: independent random variables**
Two random variables $X$ and $Y$ are independant if and only if for all $x \in X(\Omega)$ and $y \in Y(\Omega)$ the events $\{X \le x\}$ and $\{Y \le y\}$ are independant.
:::
:::warning
**Bayes' theorem**
Let $A$ and $B$ be two events such that $P(A)>0$ and $P(B)>0$. We then have
\begin{equation}
P(A | B) = P(B | A) \times \frac{P(A)}{P(B)}.
\end{equation}
:::
:::spoiler proof
By definition
\begin{equation}
P(A | B) = \frac{P(A \cap B)}{P(B)}
\end{equation}
then write $P(A\cap B) = P(B|A)P(A)$. $\square$
:::
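Bayes' theorem can also be verified empirically. In the sketch below (our own toy events on a fair die: $A$ = "the result is $\ge 5$", $B$ = "the result is even"), both sides of the formula are estimated from the same sample:
```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(1, 7, size=100_000)  # fair die rolls

A = x >= 5        # event "the result is 5 or 6"
B = x % 2 == 0    # event "the result is even"

p_A_given_B = (A & B).mean() / B.mean()        # P(A|B) by definition
p_B_given_A = (A & B).mean() / A.mean()        # P(B|A) by definition
bayes_rhs = p_B_given_A * A.mean() / B.mean()  # P(B|A) * P(A) / P(B)
print(p_A_given_B, bayes_rhs)  # both ~ 1/3
```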
## 1.6 Conditional expected value
In this section we will consider only discrete RV, but the definitions can be extended to continuous RV without difficulty.
:::danger
**Definition: conditional expected value**
Let $X$ be a discrete RV and $A$ an event such that $P(A)>0$. The expected value of $X$ conditionally on $A$, also called "expected value of $X$ knowing $A$", is the number:
\begin{equation}
E[X | A ] := \sum_{x \in X(\Omega)} x P(X = x | A)
\end{equation}
:::
> Example: dice
> Let us roll a fair die, note $X$ the result and note $A$ the event: "the result is even". Then naturally $P(A) = 0.5$. The expected result knowing that the result is even is:
> \begin{equation}
> E[X| A] = 1 \times 0 + 2 \times \frac{1}{3} + 3 \times 0 + 4 \times \frac{1}{3} + 5 \times 0 + 6 \times \frac{1}{3} = 4
> \end{equation}
> since $P(X = x | A) = 0$ when $x$ is odd and $1/3$ when $x$ is even.
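The same example can be checked by simulation: the conditional expected value is simply the mean of the results restricted to the conditioning event.
```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.integers(1, 7, size=100_000)  # fair die rolls

A = x % 2 == 0        # conditioning event "the result is even"
print(x[A].mean())    # ~ 4 = E[X | A]
```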
# 2 Usual distributions
## 2.1 Discrete distributions
### 2.1.1 Bernoulli distribution
:::info
**Bernoulli distribution** $\mathcal{B}(p)$
A Bernoulli random variable is a variable that takes value 1 with probability $p$ (success) and 0 with probability $1-p$ (failure).
If $X \hookrightarrow \mathcal{B}(p)$:
* support $X(\Omega) = \{0, 1\}$
* probability function $P(X = 0) = 1-p$ and $P(X = 1) = p$.
* expected value $E[X] = p$
* variance $V(X) = p (1-p)$.
:::
### 2.1.2 Binomial distribution
:::info
**Binomial distribution** $\mathcal{B}(n,p)$
The binomial distribution with parameters $n,p$ is the distribution of the number of successes after $n$ independent Bernoulli experiments with a success probability $p$.
If $X \hookrightarrow \mathcal{B}(n,p)$:
* support $X(\Omega) = \{0, 1, 2, ..., n\}$
* probability function $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$ for all $k \in \{0, 1, 2, ..., n\}$
* expected value $E[X] = n p$
* variance $V(X) = n p (1-p)$.
:::
The number $\binom{n}{k}$ is called the **binomial coefficient** and is pronounced "n choose k". It is defined, for integers $0\le k\le n$, by
\begin{equation}
\binom{n}{k} := \frac{n!}{k!(n-k)!}
\end{equation}
where $n! = n(n-1)(n-2)\times ... \times 2 \times 1$ is the factorial of $n$. $\binom{n}{k}$ is the number of ways to choose $k$ elements among a set of $n$ elements.
Notably, $\forall n\ge 1, \forall k = 0, 1, ..., n$:
* $\binom{n}{0} = \binom{n}{n} = 1$
* $\binom{n}{1} = \binom{n}{n-1} = n$
* $\binom{n}{2} = \binom{n}{n-2} = \frac{n(n-1)}{2}$
* symmetry: $\binom{n}{k} = \binom{n}{n-k}$.
The binomial coefficient in the PMF of the binomial distribution corresponds to the number of possible choices for the positions of the $k$ successes among the $n$ trials. Since each sequence involving $k$ successes and $n-k$ failures has a probability $p^k (1-p)^{n-k}$, one can understand this expression easily.
> Example:
> We roll a die 5 times and count the number of 6s. The probability of a success (S) is $p=1/6$ and the probability of a failure (F) is $5/6$.
> Let us calculate for instance $P(X = 3)$, the probability of having exactly 3 successes.
> Here are the sequences that have exactly 3 successes:
> SSSFF, SSFSF, SSFFS, SFSSF, SFSFS, SFFSS, FSSSF, FSSFS, FSFSS, FFSSS
> There are 10 different sequences. We could have obtained this count more easily with the binomial coefficient:
> \begin{equation}
> {5 \choose 3} = \frac{5!}{3!(5-3)!} = \frac{5\times4\times3\times2}{3\times2\times2} = 10.
> \end{equation}
> Each of these sequences has the same probability:
> \begin{equation}
> P(\mathrm{SSSFF}) = \left( \frac{1}{6} \right)^3 \times \left( \frac{5}{6} \right)^2 = \frac{25}{6^5}
> \end{equation}
> So the probability of having exactly 3 successes is
> \begin{equation}
> P(X= 3) = \binom{5}{3} p^3 (1-p)^2 \approx 0.032.
> \end{equation}
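For practical work, `scipy.stats` ships these distributions ready-made; the sketch below (assuming `scipy` is installed) recovers the value computed by hand above:
```python
from scipy.stats import binom

# The dice example: X ~ B(n=5, p=1/6).
print(binom.pmf(3, n=5, p=1/6))  # P(X = 3) ≈ 0.032

# Sanity checks against the formulas in the box above:
print(binom.mean(n=5, p=1/6))    # n p ≈ 0.833
print(binom.var(n=5, p=1/6))     # n p (1-p) ≈ 0.694
```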
### 2.1.3 Geometric distribution
:::info
**Geometric distribution** $\mathcal{G}(p)$
The geometric distribution with parameter $p$ is the distribution of the number of trials of a Bernoulli experiment (with success probability $p$) until we get one success.
* support $X(\Omega) = \mathbb{N}^* = \{1, 2, 3, ...\}$
* probability function, $P(X = k) = p \times (1-p)^{k-1}$ for all $k$ positive integer
* expected value $E[X] = 1/p$
* variance $V(X) = (1-p)/p^2$.
:::
### 2.1.4 Poisson distribution
:::info
**Poisson distribution** $\mathcal{P}(\lambda)$
The Poisson distribution with parameter $\lambda>0$ is the distribution of the number of events occurring in a fixed interval of time, if these events occur with a known constant mean rate $\lambda$ and independently of the time since the last event.
* support $X(\Omega) = \mathbb{N}$
* probability function $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ for all $k \in \mathbb{N}$
* expected value $E[X] = \lambda$
* variance $V(X) = \lambda$.
:::
## 2.2 Continuous distributions
### 2.2.1 Uniform distribution
:::info
**Uniform distribution** $\mathcal{U}([a,b])$
The uniform distribution over an interval $[a,b]$ is defined by this PDF:
\begin{equation}
f_X(x) = \begin{cases}
\frac{1}{b-a} & \text{if}\quad x \in [a,b] \\
0 & \text{otherwise}
\end{cases}
\end{equation}
* support $[a,b]$
* expected value $E[X] = (a+b)/2$
* variance $V(X) = (b-a)^2 / 12$

:::
### 2.2.2 Normal / Gaussian distribution
:::info
**Normal (or Gaussian) distribution** $\mathcal{N}(\mu, \sigma^2)$
The normal distribution is the famous bell-shaped distribution. It has two parameters, the mean $\mu$ and the variance $\sigma^2 > 0$. It is defined by the PDF:
\begin{equation}
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
* support $\mathbb{R}$
* expected value $E[X] = \mu$
* variance $V(X) = \sigma^2$

The particular case of $\mathcal{N}(0, 1)$ is called the standard normal distribution.
:::
Here are some interesting properties of the normal distribution.
* If $X \hookrightarrow \mathcal{N}(\mu, \sigma^2)$, $a$ is a real number, then
* $X+a \hookrightarrow \mathcal{N}(\mu+a, \sigma^2)$
* $aX \hookrightarrow \mathcal{N}(a\mu, (a\sigma)^2)$
* $(X-\mu)/\sigma \hookrightarrow \mathcal{N}(0,1)$. This is called the standardized (centred and reduced) variable.
* If $X \hookrightarrow \mathcal{N}(\mu_1, \sigma_1^2)$ and $Y \hookrightarrow \mathcal{N}(\mu_2, \sigma_2^2)$ and $X, Y$ are independent then $X+Y\hookrightarrow \mathcal{N}(\mu_1+\mu_2, \sigma_1^2 + \sigma_2^2)$.
:::warning
**Specific case of the standard normal distribution**
The standard normal distribution $\mathcal{N}(0,1)$ has the PDF:
\begin{equation}
f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}.
\end{equation}
Since the standard normal distribution is involved in the CLT, it is worth knowing some of its confidence intervals. If we draw many independent values from this distribution then approximately:
* 68.2% of values are in [-1,1]
* 95% of values are in [-1.96, 1.96]
* 95.4% of values are in [-2,2]
* 99% of values are in [-2.58, 2.58]
* 99.7% of values are in [-3,3]
:::
:wrench: [Geogebra interactive calculator](https://www.geogebra.org/probability) is very useful to calculate the intervals of the normal distribution. Alternatively, you can use the old-fashioned [tables](https://www.growingknowing.com/Images/NormalTable1.png).
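These intervals can also be recomputed directly; here is a short sketch using `scipy.stats.norm` (assumed available):
```python
from scipy.stats import norm

# Central probabilities P(-z <= Z <= z) for Z ~ N(0, 1):
for z in (1, 1.96, 2, 2.58, 3):
    print(z, norm.cdf(z) - norm.cdf(-z))
# ~ 0.683, 0.950, 0.954, 0.990, 0.997

# Conversely, the boundary of the 95% central interval:
print(norm.ppf(0.975))  # ≈ 1.96
```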
### 2.2.3 Exponential distribution
:::info
**Exponential distribution** $\mathcal{E}(\lambda)$
The exponential distribution with rate $\lambda>0$ is the distribution with the PDF:
\begin{equation}
f_X(x) = \begin{cases}
\lambda e^{-\lambda x} &\text{if $x>0$} \\
0 &\text{otherwise} \end{cases}
\end{equation}
* Support $X(\Omega) = \mathbb R_+$
* Expected value $E[X] = 1/\lambda$
* Variance $V(X) = \frac{1}{\lambda^2}$.

:::
The most important property of the exponential distribution is its absence of memory.
:::warning
**Absence of memory of the exponential distribution**
Let $X \hookrightarrow \mathcal{E}(\lambda)$ and $s, t> 0$. Then
\begin{equation}
P(X > s+t | X > t) = P(X > s) = e^{-\lambda s}.
\end{equation}
:::
In other words, if $X$ is a waiting time for an event, the information that this event did not occur before time $t$ does not change the distribution of the remaining time. That is why we say that the exponential distribution is memoryless.
:::spoiler proof
For a given time $t$, we have
\begin{equation}
P(X > t) = \int_t^{+\infty} \lambda e^{-\lambda x} dx = [-e^{-\lambda x}]_t^{+\infty} = e^{-\lambda t}
\end{equation}
Since $X>t+s \Rightarrow X> t$ we have $(X>t+s) \cap (X> t) = (X>t+s)$.
By definition of the conditional probabilities:
\begin{align}
P(X > s+t | X > t) &= \frac{P((X>s+t) \cap (X>t))}{P(X>t)} \\
&= \frac{P(X>s+t)}{P(X>t)} \\
&= \frac{e^{-\lambda (s+t)}}{e^{-\lambda t}} \\
&= e^{-\lambda s} \\
&= P(X > s). \quad \square
\end{align}
:::
Here is another interesting property of exponential distributions.
:::warning
**Property: minimum of exponential variables**
Let $X$ and $Y$ be two independant RV such that $X\hookrightarrow \mathcal{E}(\lambda)$ and $Y\hookrightarrow \mathcal{E}(\mu)$. Let us call $Z$ the minimum of the two variables: $Z := \min (X, Y)$. Then
\begin{equation}
Z\hookrightarrow \mathcal E (\lambda+\mu)
\end{equation}
Additionally the minimum is $X$ with a probability $\lambda/(\lambda+\mu)$ and it is $Y$ with a probability $\mu/(\lambda+\mu)$.
As a consequence, if $X_1, ..., X_n$ are $n$ independent RV following exponential distributions with rates $\lambda_1,..., \lambda_n$ then:
\begin{equation}
\min (X_1, ..., X_n) \hookrightarrow \mathcal E (\lambda_1+...+\lambda_n).
\end{equation}
:::
The minimum of independent exponential variables is still an exponential variable with a rate equal to the sum of the rates.
> Example:
> Let us consider a cell that can either die with a death rate $d$ or divide with a birth rate $b$. The events of death and birth are independent and the waiting times are exponentially distributed. Let us call $T$ the time before the cell either dies or divides. Then $T$ is exponentially distributed with a rate $b+d$ and the probability that this event is a birth is $b/(b+d)$.
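This cell example is easy to simulate. The sketch below (with arbitrary rates $b=2$, $d=1$ of our choosing) checks both the rate of the minimum and the probability that the first event is a birth:
```python
import numpy as np

rng = np.random.default_rng(5)
b, d = 2.0, 1.0  # arbitrary birth and death rates
n = 100_000
birth = rng.exponential(scale=1 / b, size=n)  # waiting time before division
death = rng.exponential(scale=1 / d, size=n)  # waiting time before death

t = np.minimum(birth, death)     # time of the first event
print(t.mean())                  # ~ 1/(b+d) ≈ 0.333: T ~ Exp(b+d)
print((birth < death).mean())    # ~ b/(b+d) ≈ 0.667: P(first event is a birth)
```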
### 2.2.4 Link between Poisson and exponential distributions: Poisson point processes
:::danger
**Definition: Poisson point process**
A Poisson point process is a random variable $N(t)$ indexed by time $t\ge 0$ characterized by:
* $N(0) = 0$
* for any $t$, $N(t)$ is an integer and $N$ is non-decreasing
* $(N(t))_t$ has independent increments, i.e. for all $t_1<t_2<t_3<t_4$ the variables $N(t_2)-N(t_1)$ and $N(t_4) - N(t_3)$ are independent
* the number of events in any interval of size $\tau$ follows a Poisson distribution with parameter $\lambda \tau$: for all $t$, $N(t+\tau) - N(t) \hookrightarrow \mathcal{P}(\lambda \tau)$.
:::
A Poisson point process belongs to the class of *jump processes*: it counts the number of events occurring through time.
Let us consider a Poisson point process with parameter $\lambda$. Let us denote $T_1, T_2, T_3, ...$ the jump times. The $T_i$ are random variables such that $T_1<T_2<T_3<...$. By convention we set $T_0=0$ and we have:
* $\forall t < T_1$, $N(t) = 0$
* $\forall t \in [T_1, T_2[, N(t) = 1$,
* ...
* $\forall t \in [T_i, T_{i+1}[, N(t) = i$.
We then have the following property.
:::warning
**Property**
The waiting times are independent and exponentially distributed with rate $\lambda$. This can be written formally:
* if $i\ne j$ then $T_{i+1}-T_{i}$ and $T_{j+1}-T_j$ are independant RV
* for all $i$
\begin{equation}
T_{i+1}-T_{i} \hookrightarrow \mathcal E (\lambda).
\end{equation}
:::
A Poisson point process can therefore be defined in two equivalent ways (the sketch below illustrates the equivalence):
* the process that counts events occurring on $\mathbb R_+$ with independent increments such that the number of events is Poisson distributed on each interval, or
* the process that counts events separated by independent waiting times, exponentially distributed with a given rate.
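A numerical illustration of our own: build the process from exponential waiting times (second definition) and check that the count on a window is Poisson distributed (first definition); the rate and window below are arbitrary choices.
```python
import numpy as np

rng = np.random.default_rng(6)
lam, t_max = 3.0, 10.0  # rate and observation window (arbitrary)

def count_events():
    """Number of jumps before t_max, with Exp(lam) waiting times."""
    waits = rng.exponential(scale=1 / lam, size=100)  # enough to cover [0, t_max]
    return np.sum(np.cumsum(waits) <= t_max)

counts = np.array([count_events() for _ in range(10_000)])
# For a Poisson(lam * t_max) variable, mean and variance are both lam * t_max.
print(counts.mean(), counts.var())  # both ~ 30
```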
# 3 Central limit theorem
The central limit theorem (CLT) is an important result in the fields of probability theory and statistics. It is a convergence theorem, which gives us the behavior of the observed mean of a large number of iid random variables. We have already seen the [law of large numbers](#1.3.3-Law-of-large-numbers), which claims that the observed mean converges to the expected value as the number of samples increases, but it does not say how this convergence occurs. The CLT refines this result. It is a very powerful theorem because it claims that, under some conditions:
1. the observed mean converges to the expected value
2. the fluctuations between the observed mean and the expected value are normally distributed for a large number of samples
3. the standard deviation of these fluctuations is proportional to $1/\sqrt{n}$ where $n$ is the number of samples.
Here is the precise statement of the CLT.
:::success
**Central limit theorem**
Let $X_1, X_2, X_3,...$ be an infinite sequence of independent and identically distributed random variables with a finite variance. Let us introduce the sample mean:
\begin{equation}
\overline X_n := \frac{X_1 + X_2 + ... + X_n}{n}
\end{equation}
Denoting $\mu = E[X_1]$ and $\sigma^2 = V(X_1)$ the expected value and variance of these variables, we have:
\begin{equation}
\sqrt{n} \left(\overline X_n - \mu \right) \longrightarrow^d \mathcal{N}(0, \sigma^2) \quad \text{as $n \to \infty$}.
\end{equation}
Here the arrow $\longrightarrow^d$ means that the convergence is a "convergence in distribution", that we will not define here, but see [here](https://en.wikipedia.org/wiki/Convergence_of_random_variables#Convergence_in_distribution).
:::
Here is a formulation of the CLT that is less rigorous but has a direct statistical interpretation.
:::info
**Statistician's view of the CLT**
Let us assume that we observe $n$ values independently and that these values come from the same distribution with an expected value $\mu$ and a variance $\sigma^2$. We calculate the mean $m = (x_1+x_2+...+x_n)/n$ of these samples. The value taken by $m$ fluctuates around $\mu$ but will not be exactly equal to $\mu$. Indeed, the CLT states that $m$ can be approximated by a random variable drawn from a normal distribution with mean $\mu$ and variance $\sigma^2 / n$.
However, in general, we do not know the values of $\sigma^2$ and $\mu$ and we only have access to the observations. Therefore statisticians developed a complex framework to get rid of these unknowns, but most of this field is still based on the CLT.
:::
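The statistician's statement can be visualised by simulation. In the sketch below (uniform samples on $[0,1]$, so $\mu = 1/2$ and $\sigma^2 = 1/12$, an arbitrary choice), the sample means indeed fluctuate like a normal variable with variance $\sigma^2/n$:
```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 100, 50_000

# reps independent experiments, each averaging n uniform samples.
means = rng.random((reps, n)).mean(axis=1)

print(means.mean())  # ~ mu = 0.5
print(means.std())   # ~ sigma / sqrt(n) = sqrt(1/12) / 10 ≈ 0.0289

# ~95% of the means fall within 1.96 sigma/sqrt(n) of mu, as the CLT predicts.
half_width = 1.96 * np.sqrt(1 / 12 / n)
print((np.abs(means - 0.5) <= half_width).mean())  # ≈ 0.95
```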
*[countable]: A countable set is a set that can be put in one-to-one correspondence with the set of natural numbers (N). For instance the set of integers (Z) or the set of rational numbers (Q) are countable. The set of real numbers (R) however is not countable.
*[piecewise-constant]: A piecewise constant (or step) function is a function whose support can be partitioned into a finite number of intervals, with the function constant on each of these intervals.
*[PMF]: Probability mass function
*[PDF]: Probability density function
*[CDF]: Cumulative distribution function
*[RV]: random variable
*[CLT]: central limit theorem