# CSE 1210 Probability Theory and Statistics — Complete Summary

---

## Lecture 1: Introduction to Course; Uncertainty

### Key Concepts — Probability Foundations

- **Sample space** $\Omega$: set of all possible outcomes
- **Event**: subset $A \subseteq \Omega$
- **Elementary outcome**: single outcome $\omega \in \Omega$ with probability $p(\omega)$

**Probability Axioms**:
1. $P(A) \geq 0$ for all events $A$
2. $P(\Omega) = 1$
3. **Additivity** (for disjoint events $A \cap B = \emptyset$):
   $$P(A \cup B) = P(A) + P(B)$$

**Probability function** on a generic (finite) sample space:
- Assign probabilities $p(\omega)$ to each elementary outcome
- $P(A) = \sum_{\omega \in A} p(\omega)$

**Uniform probability**: if all outcomes equally likely, $p(\omega) = 1/|\Omega|$, so:
$$P(A) = \frac{|A|}{|\Omega|}$$

### Key Concepts — Probabilistic vs Statistical Reasoning

- **Probabilistic reasoning**: known model $\to$ predict data (forward)
- **Statistical reasoning**: observed data $\to$ infer model (reverse)

### WARNING
- $P(A \cup B) = P(A) + P(B)$ only holds when $A$ and $B$ are **disjoint**. For general events: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
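
As a quick check of the warning above, the sketch below compares the naive sum with inclusion-exclusion on a uniform sample space. The die and the events `A` and `B` are illustrative choices, not taken from the lectures.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (an illustrative choice).
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Uniform probability: P(A) = |A| / |Omega|."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}      # "even"
B = {4, 5, 6}      # "at least 4"

p_union = prob(A | B)
p_naive = prob(A) + prob(B)                    # wrong here: A and B overlap
p_incl_excl = prob(A) + prob(B) - prob(A & B)  # general formula
```

Here `p_naive` overshoots because the outcomes in `A & B` are counted twice, while `p_incl_excl` agrees with `p_union`.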

---

## Lecture 2: Conditional Probability and Bayes' Rule

### Key Concepts

- **Conditional probability**: given event $C$ with $P(C) > 0$:
  $$P(A|C) = \frac{P(A \cap C)}{P(C)}$$
- **Multiplication rule**:
  $$P(A \cap C) = P(A|C) \cdot P(C)$$
- **Union of non-disjoint events**:
  $$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
- **Complement rule**:
  $$P(A^c) = 1 - P(A)$$

### Key Concepts — Bayes' Rule

Given a **partition** $C_1, \ldots, C_n$ of $\Omega$ (mutually exclusive, exhaustive):

$$P(C_i|A) = \frac{P(A|C_i) \cdot P(C_i)}{\sum_{j=1}^{n} P(A|C_j) \cdot P(C_j)}$$

The denominator is the **Law of Total Probability**:
$$P(A) = \sum_{j=1}^{n} P(A|C_j) \cdot P(C_j)$$

### Key Concepts — Independence

- **Independent events**: $A$ and $B$ are independent iff:
  $$P(A \cap B) = P(A) \cdot P(B)$$
  Equivalently: $P(A|B) = P(A)$ (if $P(B) > 0$)
- **Conditionally independent** given $C$: $P(A \cap B|C) = P(A|C) \cdot P(B|C)$

### Method: Apply Bayes' Rule
1. **Identify the partition** $C_1, \ldots, C_n$ (the "causes")
2. **Identify the evidence** $A$ (what was observed)
3. **Find prior probabilities** $P(C_i)$ for each cause
4. **Find likelihoods** $P(A|C_i)$ for each cause
5. **Compute total probability**: $P(A) = \sum_j P(A|C_j)P(C_j)$
6. **Apply Bayes**: $P(C_i|A) = \dfrac{P(A|C_i)P(C_i)}{P(A)}$
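
The six steps above can be sketched as one small function. The name `bayes_posterior` and the diagnostic-test numbers are hypothetical and only serve as an illustration.

```python
def bayes_posterior(priors, likelihoods):
    """Return P(C_i | A) for each cause C_i in a partition.

    priors[i]      = P(C_i)      (must sum to 1)
    likelihoods[i] = P(A | C_i)
    """
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1 (the C_i form a partition)")
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)            # law of total probability: P(A)
    if total == 0:
        raise ValueError("P(A) = 0, so conditioning on A is undefined")
    return [j / total for j in joint]

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false-positive rate.
posterior = bayes_posterior(priors=[0.01, 0.99], likelihoods=[0.95, 0.05])
```

With these made-up inputs, `posterior[0]` is far below the 95% sensitivity, which illustrates why $P(A|C)$ and $P(C|A)$ must not be confused.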

### WARNING
- Independence means $P(A \cap B) = P(A)P(B)$. **Pairwise independence does NOT imply mutual independence** for more than 2 events
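- $P(A|B) \neq P(B|A)$ in general. Swapping the two is a common mistake ("confusion of the inverse"); it usually comes from ignoring the prior probabilities in Bayes' rule (the "base rate fallacy")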

---

## Lecture 3: Random Variables and Distributions

### Key Concepts — Discrete Random Variables

- **Discrete RV**: takes values in a countable set $\{a_1, a_2, \ldots\}$
- **Probability mass function (PMF)**: $p(a_i) = P(X = a_i)$
  - Properties: $p(a_i) \geq 0$, $\sum_i p(a_i) = 1$
- **Distribution function (CDF)**: $F(x) = P(X \leq x) = \sum_{a_i \leq x} p(a_i)$
  - Right-continuous, non-decreasing, $0 \leq F(x) \leq 1$

### Key Concepts — Continuous Random Variables

- **Continuous RV**: takes values in an interval (or union of intervals)
- **Probability density function (PDF)**: $f(x)$
  - $f(x) \geq 0$
  - $P(a \leq X \leq b) = \int_a^b f(x)\,dx$
  - $f(x)$ is **not** a probability (can exceed 1)
- **CDF**: $F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t)\,dt$
  - By Fundamental Theorem of Calculus: $f(x) = F'(x)$ (where differentiable)

### Key Concepts — Connection to Empirical Data

- **Histogram** (scaled so that its total area is 1) approximates the **PDF** as the sample size grows and the bins shrink
- **Empirical CDF (ecdf)** approximates the **CDF** as sample size grows

### Method: Find Probabilities Using the Standard Normal Table

For $Z \sim N(0, 1)$, use the table to find $\Phi(x) = P(Z \leq x)$:

1. **$P(Z \leq a)$**: read directly from table as $\Phi(a)$
2. **$P(Z > a)$**: $1 - \Phi(a)$
3. **$P(a < Z < b)$**: $\Phi(b) - \Phi(a)$
4. **Symmetry**: $\Phi(-x) = 1 - \Phi(x)$
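
When no table is at hand, the same lookups can be sketched with Python's standard library, where `statistics.NormalDist().cdf` plays the role of $\Phi$. The evaluation points are arbitrary examples.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)   # standard normal
phi = Z.cdf                         # phi(x) = P(Z <= x)

p_left = phi(1.0)                   # P(Z <= 1)
p_right = 1 - phi(1.0)              # P(Z > 1)
p_between = phi(1.96) - phi(-1.96)  # P(-1.96 < Z < 1.96)
symmetry_gap = abs(phi(-0.7) - (1 - phi(0.7)))   # should be ~0 by symmetry
```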

### WARNING
- For continuous RVs: $P(X = a) = 0$, so $P(X \leq a) = P(X < a)$
- For discrete RVs: $P(X \leq a) = P(X < a) + P(X = a)$, so the two differ whenever $P(X = a) > 0$. Check the endpoint carefully

---

## Lecture 4: Modelling — Standard Distributions

### Key Concepts — Discrete Distributions

| Distribution | Parameters | PMF | $E[X]$ | $\text{Var}(X)$ | Support |
|---|---|---|---|---|---|
| **Bernoulli** | $p$ | $p(x) = p^x(1-p)^{1-x}$, $x \in \{0,1\}$ | $p$ | $p(1-p)$ | $\{0, 1\}$ |
| **Binomial** | $n, p$ | $\binom{n}{k} p^k (1-p)^{n-k}$ | $np$ | $np(1-p)$ | $\{0, 1, \ldots, n\}$ |
| **Geometric** | $p$ | $p(k) = (1-p)^{k-1}p$, $k \geq 1$ | $1/p$ | $(1-p)/p^2$ | $\{1, 2, \ldots\}$ |
| **Poisson** | $\mu$ | $\frac{\mu^k}{k!} e^{-\mu}$, $k \geq 0$ | $\mu$ | $\mu$ | $\{0, 1, \ldots\}$ |

- **Binomial**: number of successes in $n$ independent Bernoulli($p$) trials
- **Geometric**: number of trials to get the first success
- **Poisson**: approximation of Binomial$(n, p)$ when $n$ large, $p$ small ($\mu = np$)
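
To see how close the Poisson approximation can be, the sketch below compares the two PMFs pointwise. The values of `n` and `p` are illustrative assumptions, chosen so that $n$ is large and $p$ is small.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    return mu**k * exp(-mu) / factorial(k)

n, p = 1000, 0.003          # illustrative values: n large, p small
mu = n * p                  # matching Poisson parameter
# Largest pointwise difference between the two PMFs for k = 0, ..., 20.
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu)) for k in range(21))
```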

### Key Concepts — Continuous Distributions

| Distribution | Parameters | PDF | $E[X]$ | $\text{Var}(X)$ | Support |
|---|---|---|---|---|---|
| **Uniform** | $a, b$ | $f(x) = \frac{1}{b-a}$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$ | $[a, b]$ |
| **Exponential** | $\lambda$ | $f(x) = \lambda e^{-\lambda x}$ | $\frac{1}{\lambda}$ | $\frac{1}{\lambda^2}$ | $[0, \infty)$ |
| **Pareto** | $\alpha$ | $f(x) = \frac{\alpha}{x^{\alpha+1}}$ | $\frac{\alpha}{\alpha-1}$ | $\frac{\alpha}{(\alpha-1)^2(\alpha-2)}$ | $[1, \infty)$ |
| **Normal** | $\mu, \sigma^2$ | $\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ | $\mu$ | $\sigma^2$ | $(-\infty, \infty)$ |

### Key Concepts — Quantiles

- **$p$-quantile** of distribution with CDF $F$: the value $x_p$ such that $F(x_p) = p$
- **Median**: $0.5$-quantile
- **Interquartile range (IQR)**: $x_{0.75} - x_{0.25}$
- **Empirical quantiles**: from sorted sample $x_{(1)} \leq \cdots \leq x_{(n)}$, the $p$-quantile is approximately $x_{(\lceil np \rceil)}$ (conventions differ, e.g. interpolating between neighbouring order statistics)

### Method: Exponential Distribution Computations
1. **CDF**: $F(x) = 1 - e^{-\lambda x}$ for $x \geq 0$
2. **Memoryless property**: $P(X > s+t | X > s) = P(X > t)$
3. **$P(X > a)$**: $e^{-\lambda a}$
4. **$P(a < X < b)$**: $e^{-\lambda a} - e^{-\lambda b}$
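
The formulas above translate directly into code. The sketch below uses illustrative values for `lam`, `s` and `t`, and checks the memoryless property numerically.

```python
from math import exp

def exp_cdf(x, lam):
    """F(x) = 1 - exp(-lam * x) for x >= 0, and 0 for x < 0."""
    return 1 - exp(-lam * x) if x >= 0 else 0.0

def exp_tail(x, lam):
    """P(X > x) = exp(-lam * x) for x >= 0, and 1 for x < 0."""
    return exp(-lam * x) if x >= 0 else 1.0

lam, s, t = 0.5, 2.0, 3.0                              # illustrative values
p_interval = exp_tail(1.0, lam) - exp_tail(4.0, lam)   # P(1 < X < 4)
cond_tail = exp_tail(s + t, lam) / exp_tail(s, lam)    # P(X > s+t | X > s)
```

`cond_tail` coincides with `exp_tail(t, lam)`, which is exactly the memoryless property.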

### WARNING
- Geometric distribution: check if it counts trials **until** first success (starts at 1) or **failures before** first success (starts at 0)
- Pareto: mean exists only for $\alpha > 1$, variance only for $\alpha > 2$
- Poisson has equal mean and variance — a key diagnostic property

---

## Lecture 5: Independence and Expectation

### Key Concepts

- **Independence of RVs**: $X$ and $Y$ are independent iff for all sets $A, B$:
  $$P(X \in A, Y \in B) = P(X \in A) \cdot P(Y \in B)$$
  Equivalently for continuous: joint PDF factors as $f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y)$

- **Expectation (mean)**:
  - Discrete: $E[X] = \sum_i a_i \cdot p(a_i)$
  - Continuous: $E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx$
  - Requires $\sum |a_i| p(a_i) < \infty$ or $\int |x| f(x)\,dx < \infty$

- **Expectation of a function** (discrete and continuous unified):
  $$E[g(X)] = \sum_i g(a_i) \cdot p(a_i) \quad \text{or} \quad \int_{-\infty}^{\infty} g(x) \cdot f(x)\,dx$$

### Key Properties of Expectation

- **Linearity**: $E[aX + bY + c] = aE[X] + bE[Y] + c$ (holds for **any** $X, Y$, independent or not)
- **Mean vs Average**:
  - Population mean: $E[X]$
  - Sample average: $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$
  - For i.i.d. samples: $E[\bar{X}_n] = \mu$

### Method: Change-of-Variable Law
If $Y = aX + b$, then:
$$E[Y] = aE[X] + b$$

### Method: Change-of-Units Law
If $Y = aX$ (scaling only, e.g. converting units), the distribution of $Y$ is that of $X$ rescaled by $a$, and the expectation scales the same way:
$$E[Y] = aE[X]$$

### WARNING
- $E[XY] = E[X] \cdot E[Y]$ does **not** hold in general. It holds if $X$ and $Y$ are independent, and more generally exactly when they are uncorrelated ($\text{Cov}(X, Y) = 0$)
- Linearity always holds: $E[X + Y] = E[X] + E[Y]$ regardless of independence

---

## Lecture 6: Estimation

### Key Concepts

- **Parameter**: unknown characteristic of a distribution (e.g., $p$ in Binomial$(n,p)$, $\lambda$ in Exponential$(\lambda)$)
- **Estimator**: a rule/statistic $\hat{\theta}$ used to estimate a parameter $\theta$
- **Estimate**: the realized value of the estimator from observed data

### Method: Method of Moments

1. Compute the theoretical population moment $E[X^k]$ as a function of parameters
2. Compute the sample moment $\frac{1}{n} \sum_{i=1}^{n} X_i^k$
3. **Set equal**: population moment = sample moment
4. **Solve** for the parameters

For the mean (first moment): set $E[X] = \bar{X}$
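
As a minimal sketch of the recipe for one concrete model: for Exponential$(\lambda)$ the first moment is $E[X] = 1/\lambda$, so setting $1/\lambda = \bar{X}$ gives $\hat{\lambda} = 1/\bar{X}$. The function name and the data are made up for illustration.

```python
def mom_exponential_rate(sample):
    """Method-of-moments estimate of lambda: solve 1/lambda = sample mean."""
    if not sample:
        raise ValueError("sample must be non-empty")
    mean = sum(sample) / len(sample)
    if mean <= 0:
        raise ValueError("sample mean must be positive")
    return 1.0 / mean

waiting_times = [0.5, 1.5, 2.0, 4.0]     # made-up data, sample mean = 2.0
lam_hat = mom_exponential_rate(waiting_times)
```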

### Key Concepts — Bias

- **Bias of an estimator**: $\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$
- **Unbiased estimator**: $\text{Bias}(\hat{\theta}) = 0$, i.e., $E[\hat{\theta}] = \theta$
- **Sample mean is unbiased**: $E[\bar{X}_n] = \mu$
- **Sample variance** $S^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2$ is unbiased for variance
  - Note: $\frac{1}{n} \sum (X_i - \bar{X})^2$ is **biased** (underestimates)
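
The bias of the $\frac{1}{n}$ version can be verified exactly on a toy population by enumerating every possible sample. The fair-coin population and $n = 3$ are illustrative assumptions; exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import product

values = [0, 1]      # population: a fair coin, so Var(X) = 1/4
n = 3                # sample size

def var_estimate(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample) / divisor

samples = list(product(values, repeat=n))      # all 2^n equally likely samples
e_unbiased = sum(var_estimate(s, n - 1) for s in samples) / len(samples)
e_biased = sum(var_estimate(s, n) for s in samples) / len(samples)
```

The expected value of the $n-1$ version equals the true variance $1/4$, while the $n$ version averages to $\frac{n-1}{n} \cdot \frac{1}{4} = \frac{1}{6}$.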

### WARNING
- Unbiasedness is a long-run property — an unbiased estimator can still be far from $\theta$ in a single sample
- Method of moments estimators are not always efficient (may have high variance)

---

## Lecture 7: Variance and MSE

### Key Concepts — Variance

- **Definition**: $\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$
- **Standard deviation**: $\sigma_X = \sqrt{\text{Var}(X)}$
- **Properties**:
  - $\text{Var}(aX + b) = a^2 \text{Var}(X)$
  - $\text{Var}(\bar{X}_n) = \frac{\sigma^2}{n}$ for i.i.d. $X_1, \ldots, X_n$ with variance $\sigma^2$
- **Sample variance**:
  $$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

### Key Concepts — Mean Squared Error (MSE)

For an estimator $\hat{\theta}$:
$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Var}(\hat{\theta}) + \left[\text{Bias}(\hat{\theta})\right]^2$$

- **MSE decomposition**: variance + squared bias
- An estimator with small bias but large variance can have higher MSE than one with small variance and moderate bias

### Key Concepts — Law of Large Numbers (LLN)

For i.i.d. $X_1, \ldots, X_n$ with $E[X_i] = \mu$:
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \xrightarrow{P} \mu$$

The sample average converges in probability to the population mean as $n \to \infty$.

### Key Concepts — Covariance and Correlation

- **Covariance**: $\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$
- **Correlation coefficient**:
  $$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \cdot \sigma_Y}$$
  - $-1 \leq \rho \leq 1$
  - $\rho = 1$: perfect positive linear relationship
  - $\rho = -1$: perfect negative linear relationship
  - $\rho = 0$: **uncorrelated** (does NOT imply independence!)

### WARNING
- **Independence $\Rightarrow$ uncorrelated**, but **uncorrelated $\not\Rightarrow$ independent**
- $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$
- For independent $X, Y$: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
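
A standard counterexample for "uncorrelated $\not\Rightarrow$ independent" is $X$ uniform on $\{-1, 0, 1\}$ with $Y = X^2$. The sketch below (an illustrative example, not taken from the lectures) computes the covariance exactly and checks one product rule that independence would require.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1} and Y = X^2, so Y is a function of X.
outcomes = [(x, x * x) for x in (-1, 0, 1)]
p = Fraction(1, 3)                       # probability of each (x, y) pair

e_x = sum(p * x for x, _ in outcomes)
e_y = sum(p * y for _, y in outcomes)
e_xy = sum(p * x * y for x, y in outcomes)
cov = e_xy - e_x * e_y                   # Cov(X, Y) = E[XY] - E[X]E[Y]

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_joint = Fraction(1, 3)                         # P(X=0, Y=0)
p_product = Fraction(1, 3) * Fraction(1, 3)      # P(X=0) * P(Y=0)
```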

---

## Lecture 8: Joint Continuous Distribution, Correlation

### Key Concepts — Joint Distributions

- **Joint PDF** $f(x, y)$ of continuous RVs $X, Y$:
  - $f(x, y) \geq 0$
  - $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$
  - $P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy$

- **Marginal PDFs**:
  $$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy, \quad f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$$

- **Independence**: $X$ and $Y$ independent iff $f(x, y) = f_X(x) \cdot f_Y(y)$ for all $x, y$

### Key Concepts — Functions of Two RVs

- **Expectation of $g(X, Y)$**:
  $$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) \cdot f(x, y)\,dx\,dy$$

### Key Concepts — Covariance and Correlation (Joint)

- $\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$
- $\rho(X, Y) = \dfrac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
- $|\rho| = 1$ iff $Y = aX + b$ with probability 1 for some $a \neq 0$ (exact linear relationship); the sign of $\rho$ is the sign of $a$
- $\rho > 0$: positive linear association; $\rho < 0$: negative

### Method: Compute Covariance from Joint PDF
1. Compute marginal PDFs: $f_X(x) = \int f(x, y)\,dy$, $f_Y(y) = \int f(x, y)\,dx$
2. Compute $E[X] = \int x \cdot f_X(x)\,dx$, $E[Y] = \int y \cdot f_Y(y)\,dy$
3. Compute $E[XY] = \int\int xy \cdot f(x, y)\,dx\,dy$
4. $\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$
5. Compute $\sigma_X = \sqrt{E[X^2] - (E[X])^2}$, $\sigma_Y$ similarly
6. $\rho = \dfrac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$
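
The steps above can be sketched numerically. The joint PDF $f(x, y) = x + y$ on the unit square is an assumed example, and the integrals are approximated by a midpoint Riemann sum rather than solved by hand. For brevity the expectations are taken directly against the joint PDF, which gives the same values as going through the marginals.

```python
from math import sqrt

def f(x, y):
    """Illustrative joint PDF on the unit square: f(x, y) = x + y."""
    return x + y

def expect(g, m=200):
    """Approximate E[g(X, Y)] over [0, 1]^2 with a midpoint Riemann sum."""
    h = 1.0 / m
    pts = [(i + 0.5) * h for i in range(m)]
    return sum(g(x, y) * f(x, y) for x in pts for y in pts) * h * h

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_xy = expect(lambda x, y: x * y)
cov = e_xy - e_x * e_y
sd_x = sqrt(expect(lambda x, y: x * x) - e_x**2)
sd_y = sqrt(expect(lambda x, y: y * y) - e_y**2)
rho = cov / (sd_x * sd_y)
```

For this PDF the exact values are $\text{Cov}(X, Y) = -1/144$ and $\rho = -1/11$, which the numerical result should match closely.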

---

## Lecture 9: Sums of Random Variables and the Multivariate Gaussian

### Key Concepts — Sums of Independent RVs

- If $X_1, \ldots, X_n$ are **independent**:
  $$\text{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \text{Var}(X_i)$$
- For i.i.d. with mean $\mu$, variance $\sigma^2$:
  - $E[\sum X_i] = n\mu$
  - $\text{Var}(\sum X_i) = n\sigma^2$
  - $\text{Var}(\bar{X}_n) = \sigma^2/n$

### Key Concepts — Convolution

To find the PDF of $Z = X + Y$ where $X, Y$ are independent:
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(x) \cdot f_Y(z - x)\,dx$$

Special case: **Gamma addition** — if $X \sim \text{Gamma}(\alpha_1, \lambda)$ and $Y \sim \text{Gamma}(\alpha_2, \lambda)$ independent, then $X + Y \sim \text{Gamma}(\alpha_1 + \alpha_2, \lambda)$

### Key Concepts — Moment Generating Functions (MGFs)

- **MGF of $X$**: $M_X(t) = E[e^{tX}]$
- **Uniqueness**: if the MGF is finite on an open interval around $t = 0$, it uniquely determines the distribution
- **Sum of independent RVs**: $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$
- **MGF of constants**: $M_{aX+b}(t) = e^{bt} \cdot M_X(at)$

### Key Concepts — Multivariate Normal

- **Multivariate normal vector** $\mathbf{X} = (X_1, \ldots, X_d)^T$: any linear combination $\sum a_i X_i$ is normally distributed
- **Mean vector**: $\boldsymbol{\mu} = (E[X_1], \ldots, E[X_d])^T$
- **Covariance matrix** $\Sigma$:
  $$\Sigma_{ij} = \text{Cov}(X_i, X_j)$$
  - Symmetric, positive semi-definite
  - Diagonal: $\Sigma_{ii} = \text{Var}(X_i)$

### Method: Compute the Sample Covariance Matrix
1. Center the data: subtract the sample mean from each variable
2. $S = \frac{1}{n-1} X_c^T X_c$ where $X_c$ is the centered data matrix
3. Diagonal elements are sample variances; off-diagonal are covariances
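
A plain-Python sketch of these three steps, without a linear algebra library. The function name and the tiny data set are illustrative; rows are observations and columns are variables.

```python
def sample_covariance(rows):
    """Sample covariance matrix of data given as n rows of d variables."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two observations")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]   # X_c
    # S[j][k] = (1 / (n - 1)) * sum_i X_c[i][j] * X_c[i][k]
    return [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
            for j in range(d)]

data = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]     # made-up data, n = 3, d = 2
S = sample_covariance(data)
```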

### WARNING
- $X$ and $Y$ can both be normal while $(X, Y)$ is **not** bivariate normal. Normal marginals do not determine the joint distribution, so check the joint distribution
- For multivariate normal, **uncorrelated $\Rightarrow$ independent** (unlike general RVs)

---

## Lecture 10: CLT and Maximum Likelihood

### Key Concepts — Central Limit Theorem (CLT)

For i.i.d. $X_1, \ldots, X_n$ with mean $\mu$ and variance $\sigma^2$:
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0, 1)$$

Equivalently, for large $n$:
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$

Or for the sum: $\sum_{i=1}^{n} X_i \approx N(n\mu, n\sigma^2)$

### Method: Apply the CLT to Approximate Probabilities
1. **Identify** $\mu = E[X]$ and $\sigma^2 = \text{Var}(X)$
2. **Standardize**: $Z = \dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \approx N(0, 1)$
3. **Compute**: $P(\bar{X}_n \leq a) \approx P\left(Z \leq \dfrac{a - \mu}{\sigma/\sqrt{n}}\right) = \Phi\left(\dfrac{a - \mu}{\sigma/\sqrt{n}}\right)$
4. For sums: $P(\sum X_i \leq s) \approx \Phi\left(\dfrac{s - n\mu}{\sigma\sqrt{n}}\right)$
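
As a sketch of step 4, the code below approximates the probability that the sum of 100 fair die rolls is at most 360. The die example is an assumption for illustration, and no continuity correction is applied.

```python
from math import sqrt
from statistics import NormalDist

def clt_sum_cdf(s, n, mu, sigma):
    """CLT approximation of P(X_1 + ... + X_n <= s) for i.i.d. X_i."""
    z = (s - n * mu) / (sigma * sqrt(n))
    return NormalDist().cdf(z)

# One fair die: mu = 3.5, sigma^2 = 35/12.
mu, sigma = 3.5, sqrt(35 / 12)
p_approx = clt_sum_cdf(360, n=100, mu=mu, sigma=sigma)
```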

### Key Concepts — Maximum Likelihood Estimation (MLE)

- **Likelihood function**: given data $x_1, \ldots, x_n$:
  $$L(\theta) = \prod_{i=1}^{n} p_\theta(x_i) \quad \text{(discrete)} \quad \text{or} \quad \prod_{i=1}^{n} f_\theta(x_i) \quad \text{(continuous)}$$
- **Log-likelihood**: $\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln p_\theta(x_i)$
- **MLE**: $\hat{\theta}_{MLE} = \arg\max_\theta L(\theta) = \arg\max_\theta \ell(\theta)$

### Method: Compute the MLE
1. **Write the likelihood**: $L(\theta) = \prod_{i=1}^{n} f_\theta(x_i)$
2. **Take the log**: $\ell(\theta) = \sum \ln f_\theta(x_i)$
3. **Differentiate**: $\dfrac{d\ell}{d\theta}$
4. **Set to zero** and solve for $\theta$
5. **Check second derivative**: $\dfrac{d^2\ell}{d\theta^2} < 0$ (maximization)
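
For a concrete instance of these steps, take an Exponential$(\lambda)$ sample: $\ell(\lambda) = n \ln \lambda - \lambda \sum x_i$, so $\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0$ gives $\hat{\lambda} = n / \sum x_i$, and $\frac{d^2\ell}{d\lambda^2} = -n/\lambda^2 < 0$. The sketch below uses made-up data and adds a crude numerical check that the closed form is a local maximum.

```python
from math import log

def exp_log_likelihood(lam, xs):
    """l(lam) = n*ln(lam) - lam*sum(xs) for an Exponential(lam) sample."""
    return len(xs) * log(lam) - lam * sum(xs)

def exp_mle(xs):
    """Closed form from dl/dlam = n/lam - sum(xs) = 0."""
    return len(xs) / sum(xs)

xs = [0.5, 1.5, 2.0, 4.0]          # made-up data
lam_hat = exp_mle(xs)
# Numerical sanity check: the log-likelihood is lower on either side.
is_local_max = all(
    exp_log_likelihood(lam_hat, xs) > exp_log_likelihood(lam_hat + d, xs)
    for d in (-0.1, 0.1)
)
```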

### WARNING
- CLT requires **large enough** $n$ and finite variance. For heavy-tailed distributions, convergence is very slow
- The MLE must lie **in the parameter space**. A critical point of $\ell$ may fall outside it, or the maximum may sit on the boundary, in which case setting the derivative to zero does not find it
- $\hat{\sigma}^2 = \frac{1}{n}\sum (X_i - \bar{X})^2$ is the MLE for variance but is **biased** (the unbiased version uses $n-1$)

---

## Lecture 11: Confidence Intervals

### Key Concepts

- **Point estimate**: single value $\hat{\theta}$ for parameter $\theta$
- **Confidence interval (CI)**: interval $[L, U]$ constructed from data such that:
  $$P(L \leq \theta \leq U) = 1 - \alpha$$
  (e.g., $95\%$ CI when $\alpha = 0.05$)
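- **Not**: the probability that $\theta$ lies in the interval computed from one particular dataset. $\theta$ is fixed and the interval is random, so a realized interval either contains $\theta$ or it does not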
- **Confidence level** $1-\alpha$: long-run proportion of intervals containing $\theta$

### Method — CI for Mean: Normal Data, Known Variance

For $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with **known** $\sigma^2$:

$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

**$1-\alpha$ CI**:
$$\left[\bar{x} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}},\;\; \bar{x} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}\right]$$

where $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$ (e.g., $z_{0.025} = 1.96$ for $95\%$ CI)
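
A minimal sketch of this interval, using `statistics.NormalDist().inv_cdf` for $\Phi^{-1}$. The function name and the input values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha=0.05):
    """(1 - alpha) CI for the mean of normal data with known sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)     # z_{alpha/2}
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

lower, upper = z_interval(xbar=10.0, sigma=2.0, n=16)   # illustrative inputs
```

With these inputs the margin is about $1.96 \cdot 0.5 = 0.98$, so the interval is roughly $[9.02, 10.98]$.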

### Method — CI for Mean: Normal Data, Unknown Variance (t-Interval)

When $\sigma^2$ is **unknown**, use sample standard deviation $S$:

$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$

**$1-\alpha$ CI**:
$$\left[\bar{x} - t_{n-1, \alpha/2} \cdot \frac{s}{\sqrt{n}},\;\; \bar{x} + t_{n-1, \alpha/2} \cdot \frac{s}{\sqrt{n}}\right]$$

where $t_{n-1, \alpha/2}$ is the critical value from the t-distribution with $n-1$ degrees of freedom

### Method — CI for Mean: Large Samples (Unknown Variance)

For large $n$ (regardless of distribution):

$$\bar{X} \approx N\left(\mu, \frac{S^2}{n}\right)$$

**$1-\alpha$ CI (approximate)**:
$$\left[\bar{x} - z_{\alpha/2} \cdot \frac{s}{\sqrt{n}},\;\; \bar{x} + z_{\alpha/2} \cdot \frac{s}{\sqrt{n}}\right]$$

### Method — CI for a Proportion

For $X_i \sim \text{Bernoulli}(p)$, $\hat{p} = \bar{X}$:

**$1-\alpha$ CI (asymptotic)**:
$$\left[\hat{p} - z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\;\; \hat{p} + z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

### Method — CI via the Pivotal Quantity Method (General Principle)
1. Find a **pivotal quantity** $Q(\bar{X}_n, \theta)$ whose distribution does **not** depend on $\theta$
2. Find $a, b$ such that $P(a \leq Q \leq b) = 1 - \alpha$
3. **Rewrite**: $P(a \leq Q \leq b) = P(L \leq \theta \leq U)$ by algebraic manipulation

### Key Concepts — CI Width and Sample Size

- **Width** of a CI is proportional to $\dfrac{\sigma}{\sqrt{n}}$
- To **halve** the width, **quadruple** the sample size
- Width decreases with $1/\sqrt{n}$
- Higher confidence level ($1-\alpha$) $\Rightarrow$ wider interval (larger $z_{\alpha/2}$ or $t$)

### WARNING
- t-distribution is used when variance is **unknown** and data is **normal** (or approximately normal for large $n$)
- t-tables in the formula sheet: look up row with $n-1$ degrees of freedom, column for $1-\alpha/2$
- For proportions, the CI requires $n\hat{p} \geq 5$ and $n(1-\hat{p}) \geq 5$ for the normal approximation to hold

---

## Lecture 12: Hypothesis Testing

### Key Concepts

- **Null hypothesis** $H_0$: the default assumption (e.g., $\mu = \mu_0$)
- **Alternative hypothesis** $H_1$: what we want to detect (e.g., $\mu \neq \mu_0$, $\mu > \mu_0$, or $\mu < \mu_0$)
- **Test statistic**: a function of the data whose distribution is known (or approximately known) under $H_0$
- **p-value**: $P(\text{data as extreme or more extreme} | H_0)$. Probability of observing the result (or more extreme) if $H_0$ is true
- **Significance level** $\alpha$: threshold for rejecting $H_0$ (commonly 0.05, 0.01, 0.10)
- **Decision**: reject $H_0$ if p-value $< \alpha$

### Key Concepts — Type I and Type II Errors

| | $H_0$ True | $H_0$ False |
|---|---|---|
| **Don't reject $H_0$** | Correct | **Type II error** (false negative) |
| **Reject $H_0$** | **Type I error** (false positive) | Correct (Power) |

- **Type I error rate**: $P(\text{reject } H_0 | H_0 \text{ is true}) = \alpha$
- **Type II error rate**: $\beta = P(\text{don't reject } H_0 | H_0 \text{ is false})$
- **Power**: $1 - \beta = P(\text{reject } H_0 | H_0 \text{ is false})$

### Method — One-Sample t-Test (Mean, Unknown Variance)

For $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$ (or one-sided):

1. **Compute test statistic**: $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$
2. **Under $H_0$**: $t \sim t_{n-1}$
3. **p-value**:
   - Two-sided: $2 \cdot P(T_{n-1} > |t|)$
   - Right-sided: $P(T_{n-1} > t)$
   - Left-sided: $P(T_{n-1} < t)$
4. **Reject $H_0$** if p-value $< \alpha$

### Method — Asymptotic z-Test for Proportion

For $H_0: p = p_0$ vs $H_1: p \neq p_0$:

1. **Compute test statistic**: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
2. **Under $H_0$**: $z \approx N(0, 1)$
3. **p-value**: $2(1 - \Phi(|z|))$ for two-sided
4. **Reject $H_0$** if p-value $< \alpha$
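
The four steps can be sketched as follows. The function name and the counts (60 successes in 100 trials, $p_0 = 0.5$) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Two-sided asymptotic z-test of H0: p = p0. Returns (z, p_value)."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = proportion_z_test(successes=60, n=100, p0=0.5)
reject = p_value < 0.05        # decision at significance level alpha = 0.05
```

Note that the standard error uses $p_0$, not $\hat{p}$, because the statistic is evaluated under $H_0$.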

### Key Concepts — Critical Region

- The **critical region** (or rejection region) is the set of test statistic values leading to rejection of $H_0$
- For two-sided z-test at level $\alpha$: reject if $|z| > z_{\alpha/2}$
- Equivalently: reject if p-value $< \alpha$

### WARNING
- **Small p-value** means the data is **inconsistent** with $H_0$, NOT that $H_0$ is false with high probability
- **Large p-value** does NOT mean $H_0$ is true — it means there's insufficient evidence against it
- The p-value depends on the alternative hypothesis direction (one-sided vs two-sided)
- For one-sided tests, make sure the direction of $H_1$ matches the research question

---

## Lecture 13: Linear Regression Part I

### Key Concepts

- **Regression model**: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i \sim N(0, \sigma^2)$ (i.i.d.)
- $Y$: response/dependent variable
- $x$: predictor/independent variable (fixed, known)
- $\beta_0$: intercept
- $\beta_1$: slope
- $\varepsilon_i$: random error, i.i.d. $N(0, \sigma^2)$

### Method — Least Squares Estimation (Normal Equations)

The **residual sum of squares (RSS)**, i.e. the sum of squared residuals:
$$\text{RSS} = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

Minimize RSS by setting the partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero:

$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

### Key Concepts — Fitted Values and Residuals

- **Fitted value**: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
- **Residual**: $e_i = y_i - \hat{y}_i$
- Properties: $\sum e_i = 0$ and $\sum x_i e_i = 0$
- The regression line always passes through $(\bar{x}, \bar{y})$
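
The least squares formulas and the residual properties can be checked on a small data set. The sketch below uses made-up data and a hypothetical helper `fit_line`.

```python
def fit_line(xs, ys):
    """Least squares estimates (beta0_hat, beta1_hat) for y = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are equal; slope is undefined")
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
```

Up to floating-point rounding, `sum(residuals)` and the sum of `x * e` over the data are both zero, as stated above.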

### Key Concepts — Decomposition

$$\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$$

**Total Sum of Squares (TSS)** = **Residual Sum of Squares (RSS)** + **Explained Sum of Squares (ESS)**

### WARNING
- Least squares estimates $\hat{\beta}_0, \hat{\beta}_1$ are **unbiased**: $E[\hat{\beta}_0] = \beta_0$, $E[\hat{\beta}_1] = \beta_1$
- $\hat{\sigma}^2 = \frac{1}{n-2}\sum e_i^2$ is the unbiased estimator of error variance (degrees of freedom = $n - 2$)

---

## Lecture 14: Linear Regression Part II

### Key Concepts — Coefficient of Determination

- **$R^2$** (coefficient of determination):
  $$R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2} = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2}$$
- Proportion of variance in $Y$ explained by the regression
- $0 \leq R^2 \leq 1$; higher means better fit
- For simple linear regression: $R^2 = r^2$, where $r$ is the sample correlation between $x_i$ and $y_i$ (this also equals the squared correlation between $y_i$ and $\hat{y}_i$)
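
A small sketch that computes $R^2$ from the sums of squares and compares it with the squared sample correlation between $x$ and $y$. The data are made up, and the coefficients $2.2$ and $0.6$ are the least squares fit for exactly this data set.

```python
def r_squared(ys, fitted):
    """R^2 = 1 - RSS / TSS."""
    y_bar = sum(ys) / len(ys)
    rss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    tss = sum((y - y_bar) ** 2 for y in ys)
    return 1 - rss / tss

def sample_corr(xs, ys):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / (s_xx * s_yy) ** 0.5

xs = [1.0, 2.0, 3.0, 4.0, 5.0]                 # made-up data
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.2 + 0.6 * x for x in xs]           # least squares line for this data
r2 = r_squared(ys, fitted)
r = sample_corr(xs, ys)
```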

### Key Concepts — Categorical Predictors

- **Dummy variables** for categorical variable with $k$ levels: use $k-1$ dummy variables
- Base category is represented by all dummies = 0
- Coefficients represent **difference from base category**

Example: season with 4 levels (Spring, Summer, Fall, Winter) $\to$ 3 dummies:
- $D_1 = 1$ if Summer, 0 otherwise
- $D_2 = 1$ if Fall, 0 otherwise
- $D_3 = 1$ if Winter, 0 otherwise
- Spring is the base category

### Key Concepts — Model Diagnostics (Residual Analysis)

**Check residual plots** to verify model assumptions:
1. **Residuals vs fitted values**: should show no pattern (random scatter around 0)
   - Pattern $\Rightarrow$ non-linear relationship
   - Funnel shape $\Rightarrow$ non-constant variance
2. **Q-Q plot of residuals**: should follow straight line (normality assumption)
   - Deviations from line $\Rightarrow$ non-normal errors
3. **Residuals vs predictors**: check for omitted variables

### Key Concepts — Transformations

When assumptions are violated:
- **Log transformation**: $Y \to \ln(Y)$ for right-skewed responses or multiplicative effects
- **Square root**: $Y \to \sqrt{Y}$ for count-like data
- **Reciprocal**: $Y \to 1/Y$ for rate data
- **Transform predictor**: e.g., $x \to x^2$ for quadratic relationship

After transformation, **check residuals again** to verify improvement.

### WARNING
- $R^2$ **always** increases (or stays the same) when adding predictors — doesn't mean the model is better
- For models with categorical predictors, $R^2$ interpretation is similar but the decomposition involves more terms
- Don't just optimize $R^2$ — check residuals for assumption violations

---

## Lecture 15: More Hypothesis Testing

### Key Concepts — t-Test for Regression Coefficients

For the model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:

**Test for slope**: $H_0: \beta_1 = 0$ vs $H_1: \beta_1 \neq 0$

- **Test statistic**: $t = \dfrac{\hat{\beta}_1 - 0}{\widehat{\text{se}}(\hat{\beta}_1)}$
  where $\widehat{\text{se}}(\hat{\beta}_1) = \dfrac{\hat{\sigma}}{\sqrt{S_{xx}}}$ and $\hat{\sigma} = \sqrt{\dfrac{1}{n-2}\sum e_i^2}$
- **Under $H_0$**: $t \sim t_{n-2}$
- **Reject $H_0$** if p-value $< \alpha$ (or $|t| > t_{n-2, \alpha/2}$)

**Test for intercept**: $H_0: \beta_0 = 0$ vs $H_1: \beta_0 \neq 0$:
- $t = \dfrac{\hat{\beta}_0}{\widehat{\text{se}}(\hat{\beta}_0)} \sim t_{n-2}$ under $H_0$

### Key Concepts — Power of a Test

- **Power**: $1 - \beta = P(\text{reject } H_0 | H_0 \text{ is false})$
- Power depends on:
  1. **Effect size** ($\delta$): the true difference from $H_0$ (e.g., $|\mu - \mu_0|$)
  2. **Significance level** ($\alpha$): higher $\alpha \Rightarrow$ higher power
  3. **Noise** ($\sigma$): more noise $\Rightarrow$ lower power
  4. **Sample size** ($n$): larger $n \Rightarrow$ higher power

### Method — Power Calculation (One-Sample t-Test)

For $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$, true mean = $\mu_1$:

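1. **Critical value** under $H_0$: reject if $|\bar{X} - \mu_0| > t_{n-1, \alpha/2} \cdot \dfrac{s}{\sqrt{n}}$. For a hand calculation, treat $\sigma$ as known and use the threshold $z_{\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}}$ (a normal approximation; the exact power of the t-test involves the noncentral $t$-distribution)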
2. **Under $H_1$**: $\bar{X} \sim N(\mu_1, \sigma^2/n)$
3. **Power** = $P(\text{reject } H_0 | \mu = \mu_1)$, computed using the distribution under $\mu_1$
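
Under the simplifying assumption that $\sigma$ is known, so that the test is a two-sided z-test, the power can be sketched as below. The function name and inputs are illustrative, and `delta` stands for $\mu_1 - \mu_0$.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of the two-sided z-test when the true mean is mu0 + delta."""
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma          # mean of the z statistic under H1
    # Reject when Z > z_crit or Z < -z_crit, where Z ~ N(shift, 1).
    return (1 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)

power = z_test_power(delta=0.5, sigma=1.0, n=32)   # illustrative inputs
size = z_test_power(delta=0.0, sigma=1.0, n=32)    # equals alpha when delta = 0
```

Evaluating the function over a grid of `delta` values for several `n` would trace out power curves.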

### Key Concepts — Power Curves

- Plot power as a function of effect size $\delta = |\mu - \mu_0|$
- Power increases as effect size increases
- Curves for different sample sizes show how $n$ affects power
- The **minimum detectable effect** at a given power level: find $\delta$ such that power = $1 - \beta$

### WARNING
- Power is computed at a **specific** alternative value (not for all possible alternatives)
- Low power means you may fail to detect a real effect — plan sample size accordingly
- Using $t$ instead of $z$ reduces power slightly (larger critical value, so a smaller rejection region)

---

## Lecture 16: Even More Hypothesis Testing

### Key Concepts — Sample Size Determination

To achieve power $1 - \beta$ at significance level $\alpha$ for detecting effect size $\delta$:

For a **one-sample z-test** (known $\sigma$), two-sided:

$$n \approx \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot \sigma^2}{\delta^2}$$

For a **one-sided** test, replace $z_{\alpha/2}$ with $z_{\alpha}$:
$$n \approx \frac{(z_{\alpha} + z_{\beta})^2 \cdot \sigma^2}{\delta^2}$$
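
Both formulas can be sketched in one function, rounding up to the next integer. Here $z_\beta = \Phi^{-1}(1 - \beta)$, matching the convention used for $z_{\alpha/2}$; the function name and inputs are illustrative.

```python
from math import ceil
from statistics import NormalDist

def required_n(delta, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Smallest n from n ~ (z_a + z_b)^2 * sigma^2 / delta^2, rounded up."""
    std = NormalDist()
    z_a = std.inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = std.inv_cdf(power)                 # z_beta with beta = 1 - power
    return ceil((z_a + z_b) ** 2 * sigma**2 / delta**2)

n_two_sided = required_n(delta=0.5, sigma=1.0)                   # illustrative
n_one_sided = required_n(delta=0.5, sigma=1.0, two_sided=False)
```

For the same $\alpha$, power and effect size, the one-sided test needs fewer observations because $z_\alpha < z_{\alpha/2}$.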

### Key Concepts — Multiple Testing

- When performing $m$ hypothesis tests, the **family-wise error rate** (FWER) increases. For $m$ **independent** tests, each at level $\alpha$, with all null hypotheses true:
  $$P(\text{at least one Type I error}) = 1 - (1 - \alpha)^m$$
- For $m = 20$ tests at $\alpha = 0.05$: $P(\text{at least one false positive}) \approx 64\%$

### Method — Bonferroni Correction

To maintain family-wise error rate at $\alpha$ when performing $m$ tests:
- Use significance level $\alpha/m$ for **each** individual test
- Adjusted p-value: multiply each p-value by $m$ (capped at 1)
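
A short sketch of the FWER formula (which assumes independent tests) and the Bonferroni adjustment. The p-values are made up for illustration.

```python
def fwer_independent(alpha, m):
    """P(at least one Type I error) for m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply by m and cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

fwer_20 = fwer_independent(alpha=0.05, m=20)
adjusted = bonferroni_adjust([0.001, 0.02, 0.04, 0.30])   # made-up p-values
significant = [p < 0.05 for p in adjusted]
```

Three of the four raw p-values are below $0.05$, but only the smallest one stays significant after the adjustment.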

### Key Concepts — p-Hacking

**p-hacking practices to avoid**:
- **Data snooping**: trying many hypotheses on the same data and reporting only significant ones
- **Optional stopping**: collecting data, checking p-value, collecting more data, repeating
- **Outlier cherry-picking**: removing outliers until results are significant
- **Hypothesizing after the result is known (HARKing)**
- **Multiple outcomes**: testing many dependent variables and only reporting significant ones

### WARNING
- **Correlation does not imply causation** — multiple testing corrections only address false positives, not confounding
- Bonferroni is **conservative** — it controls FWER but may reduce power significantly
- Always pre-specify hypotheses and analysis plan to avoid p-hacking

---

## Quick Reference

### Distributions Cheat Sheet

| Distribution | PMF/PDF | Mean | Variance |
|---|---|---|---|
| Bernoulli($p$) | $p^x(1-p)^{1-x}$ | $p$ | $p(1-p)$ |
| Binomial$(n,p)$ | $\binom{n}{k}p^k(1-p)^{n-k}$ | $np$ | $np(1-p)$ |
| Geometric($p$) | $(1-p)^{k-1}p$ | $1/p$ | $(1-p)/p^2$ |
| Poisson($\mu$) | $\mu^k e^{-\mu}/k!$ | $\mu$ | $\mu$ |
| Uniform($a,b$) | $1/(b-a)$ | $(a+b)/2$ | $(b-a)^2/12$ |
| Exponential($\lambda$) | $\lambda e^{-\lambda x}$ | $1/\lambda$ | $1/\lambda^2$ |
| Pareto($\alpha$) | $\alpha/x^{\alpha+1}$, $x \geq 1$ | $\alpha/(\alpha-1)$ | $\alpha/[(\alpha-1)^2(\alpha-2)]$ |
| Normal($\mu,\sigma^2$) | $\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$ | $\mu$ | $\sigma^2$ |

### Key Formulas

- **Union**: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- **Conditional**: $P(A|C) = \dfrac{P(A \cap C)}{P(C)}$
- **Bayes**: $P(C_i|A) = \dfrac{P(A|C_i)P(C_i)}{\sum_j P(A|C_j)P(C_j)}$
- **Variance**: $\text{Var}(X) = E[X^2] - (E[X])^2$
- **Covariance**: $\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$
- **Correlation**: $\rho = \dfrac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}$
- **MSE**: $\text{MSE}(\hat{\theta}) = \text{Var}(\hat{\theta}) + \text{Bias}(\hat{\theta})^2$
- **MGF**: $M_X(t) = E[e^{tX}]$
- **CLT**: $\dfrac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)$

### Confidence Intervals

| Setting | CI Formula |
|---|---|
| Normal, known $\sigma$ | $\left[\bar{x} \pm z_{\alpha/2} \cdot \dfrac{\sigma}{\sqrt{n}}\right]$ |
| Normal, unknown $\sigma$ | $\left[\bar{x} \pm t_{n-1, \alpha/2} \cdot \dfrac{s}{\sqrt{n}}\right]$ |
| Large sample, unknown $\sigma$ | $\left[\bar{x} \pm z_{\alpha/2} \cdot \dfrac{s}{\sqrt{n}}\right]$ |
| Proportion (large $n$) | $\left[\hat{p} \pm z_{\alpha/2} \cdot \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}\right]$ |

### Hypothesis Tests

| Test | Statistic | Distribution under $H_0$ |
|---|---|---|
| Mean (known $\sigma$) | $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ | $N(0, 1)$ |
| Mean (unknown $\sigma$) | $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ | $t_{n-1}$ |
| Proportion | $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$ | $N(0, 1)$ |
| Regression slope | $t = \dfrac{\hat{\beta}_1}{\widehat{\text{se}}(\hat{\beta}_1)}$ | $t_{n-2}$ |
| Regression intercept | $t = \dfrac{\hat{\beta}_0}{\widehat{\text{se}}(\hat{\beta}_0)}$ | $t_{n-2}$ |

### Critical Values (Common $\alpha$)

| $\alpha$ | $z_{\alpha/2}$ |
|---|---|
| 0.10 | 1.645 |
| 0.05 | 1.96 |
| 0.01 | 2.576 |

---

*This summary covers all 16 lectures of CSE 1210 Probability Theory and Statistics. Use it alongside the textbook (Dekking et al.), lecture notes, and Grasple exercises for comprehensive preparation.*
