# PTS Endterm Study Plan (Based on Endterm 2025 + Resit 2025)

## Exam Overview

- **Format:** Written exam, 2h 30min, closed book
- **Points:** 90 points total (grading: 1 + 0.1S where S = score out of 90)
- **Allowed:** Non-graphing, non-symbolic calculator; self-made formula sheet (A4, both sides)
- **Sections:** Multiple choice (Q1–5, 5 pts each), Short answer (Q6–12, 3–5 pts each), Open questions (Q10–15, 5–10 pts each)
- **Endterm 2025:** 45 points, 2h, formula sheet (A4, two sides) — grading: 1 + 0.2S
- **Resit 2025:** 90 points, 2h 30min — grading: 1 + 0.1S
- **Textbook:** *A Modern Introduction to Probability and Statistics* (Dekking & Kraaikamp)

**PTS has NO HTML pattern file.** Based on analysis of Endterm 2025 and Resit 2025, there are ~15 core patterns (P1–P15+) covering confidence intervals, hypothesis tests, regression, CLT, MLE, convolution, and probability calculations.

---

## Tier 1: Very High frequency + on both Endterm & Resit — study these first

| # | Topic | Endterm 2025 Q | Resit 2025 Q | Lectures | Pattern | Done |
|---|-------|---------------|-------------|----------|---------|------|
| 1 | Confidence Interval for Normal Mean (unknown variance) | Q1 | Q4 | L11 | P1 | [ ] |
| 2 | Regression t-test (coefficient significance) | Q2 | Q11 | L13, L14 | P2 | [ ] |
| 3 | CLT application (sum of i.i.d. approx. normal) | Q6 | Q1 | L9, L10 | P3 | [ ] |
| 4 | Binomial Proportion Test | Q5 | — | L12, L15 | P4 | [ ] |
| 5 | R-squared Interpretation | Q9 | Q9 | L14 | P5 | [ ] |

---

## Tier 2: High frequency + on at least one exam

| # | Topic | Endterm 2025 Q | Resit 2025 Q | Lectures | Pattern | Done |
|---|-------|---------------|-------------|----------|---------|------|
| 6 | Convolution for Sum of Independent RVs | Q14 | Q7 | L9 | P6 | [ ] |
| 7 | MLE and Functional Invariance | Q15a, Q15b | — | L10 | P7 | [ ] |
| 8 | Dummy Variables in Regression | Q12 | Q8, Q12 | L14 | P8 | [ ] |
| 9 | Power Calculations | Q13b | Q10b | L15 | P9 | [ ] |
| 10 | Gamma / Exponential Hypothesis Test | Q15c | Q10a | L10, L12 | P10 | [ ] |
| 11 | Type I / II Error Computation | Q7 | — | L12, L15 | P11 | [ ] |

---

## Tier 3: On one exam but lower frequency — review if time permits

| # | Topic | Endterm 2025 Q | Resit 2025 Q | Lectures | Pattern | Done |
|---|-------|---------------|-------------|----------|---------|------|
| 12 | Conditional Probability with Continuous Distributions | — | Q2 | L2, L4 | P12 | [ ] |
| 13 | Bayes' Rule (with continuous/mixed) | — | Q3 | L2 | P13 | [ ] |
| 14 | t-Test Statistic Computation + Conclusion | — | Q5 | L12, L15 | P14 | [ ] |
| 15 | Binomial Distribution Identification | — | Q6 | L4 | P15 | [ ] |
| 16 | Confidence Interval for Regression Coefficient | — | Q10 | L13, L14 | P16 | [ ] |
| 17 | Method of Moments / Unbiased Estimators | — | Q11 | L6 | P17 | [ ] |
| 18 | MLE for Non-Standard Distributions | — | Q12 | L10 | P18 | [ ] |

---

## Tier 4: Topic areas that appear in exercises / midterms — review if time permits

| # | Topic | Lectures | Pattern | Done |
|---|-------|----------|---------|------|
| 19 | Pareto Distribution Calculations | L4 | P19 | [ ] |
| 20 | Chebychev's Inequality | L7 | P20 | [ ] |
| 21 | Joint Distributions / Independence Checks | L8 | P21 | [ ] |
| 22 | Correlation / Covariance Computation | L7, L8 | P22 | [ ] |
| 23 | Two-Sample t-Test | L16 | P23 | [ ] |
| 24 | Multiple Testing / p-hacking | L16 | P24 | [ ] |
| 25 | Distribution Identification from Histogram/Boxplot | L3, L6 | P25 | [ ] |
| 26 | Change-of-Variable Formula / Transformation | L5 | P26 | [ ] |
| 27 | Estimator Bias / MSE Calculation | L6, L7 | P27 | [ ] |
| 28 | Gamma Distribution Properties | L9 | P28 | [ ] |

---

## Endterm 2025 Questions

| Q# | Pts | Topic | Pattern | Lectures | Done |
|----|-----|-------|---------|----------|------|
| MC 1 | 1 | a in CI for mean (sample mean) | P1 | L11 | [ ] |
| MC 2 | 1 | b = critical value t (t_2,0.025 = 4.303) | P1 | L11 | [ ] |
| MC 3 | 1 | c = sample std dev s | P1 | L11 | [ ] |
| MC 4 | 1 | d = sqrt(n) = sqrt(3) | P1 | L11 | [ ] |
| MC 5 | 4 | Binomial proportion test — p-value for 0 fours in 12 throws | P4 | L12, L15 | [ ] |
| MC 6 | 4 | CLT: P(-1 <= S_400 <= 1) for sum of bounded RVs | P3 | L9, L10 | [ ] |
| SA 7 | 4 | Type I error (0.1) and Type II error (0.75) for U(-theta,theta) test | P11 | L12, L15 | [ ] |
| SA 8 | 3 | Degrees of freedom for t in regression (n - p = 7633) | P2 | L13, L14 | [ ] |
| SA 9 | 3 | q = 1 - R-squared = 0.071 | P5 | L14 | [ ] |
| SA 10 | 3 | 90% CI for beta_0 using z-critical (1.645) with large n | P2, P16 | L13, L14 | [ ] |
| SA 11 | 3 | Test beta_2 = -0.240: value is in 95% CI [-0.249, -0.217] so do not reject | P2 | L13, L14 | [ ] |
| SA 12 | 3 | Add 2 dummy variables for 3-level categorical (relative humidity: dry, humid, very humid) | P8 | L14 | [ ] |
| O 13 | 4 | (a) Critical value for mean test with known variance; (b) Sample size for power 0.9 | P9 | L15 | [ ] |
| O 14 | 5 | (a) Support of Z = Exp + Uniform; (b) f_Z(3) via convolution integral | P6 | L9 | [ ] |
| O 15 | 5 | (a) MLE for lambda (Exp); (b) Functional invariance for 1/lambda^2; (c) Gamma test for lambda > 0.1 | P7, P10 | L10, L12 | [ ] |

---

## Resit 2025 Questions

| Q# | Pts | Topic | Pattern | Lectures | Done |
|----|-----|-------|---------|----------|------|
| MC 1 | 5 | CLT: P(S_72 < 100) for sum of Pareto(4) asteroid masses | P3 | L9, L10 | [ ] |
| MC 2 | 5 | Conditional Pareto: P(X>2 | X>1.5) = 2^{-4}/1.5^{-4} = 0.3164 | P12 | L4 | [ ] |
| MC 3 | 5 | Bayes' rule: P(Variety A | not infected) = 30/53 | P13 | L2 | [ ] |
| MC 4 | 5 | 99% CI for mean with unknown variance (t_5,0.005) | P1 | L11 | [ ] |
| MC 5 | 5 | t-test: t = 1.771, do not reject (t_9 critical = 1.833) | P14 | L12, L15 | [ ] |
| SA 6 | 5 | X = remaining points after 10 throws: X = 10 - Bin(10, 1/6) | P15 | L4 | [ ] |
| SA 7 | 5 | Convolution integral for sum of two independent Pareto(1): f_Z(z) = integral of f_X(t)f_Y(z-t) | P6 | L9 | [ ] |
| SA 8 | 6 | (a) Adjust prediction when changing dummy category; (b) Diagnose residual pattern | P8 | L14 | [ ] |
| SA 9 | 6 | (a) R-squared always increases with more variables (adjusted R-squared punishes); (b) Multiple testing correction | P5, P24 | L14, L16 | [ ] |
| SA 10 | 3 | 90% CI for beta_0 with large sample (use z = 1.645) | P16 | L13, L14 | [ ] |
| SA 11 | 3 | Test beta_2 = -0.240 using 95% CI: value lies in [-0.249, -0.217], do not reject | P2 | L13, L14 | [ ] |
| SA 12 | 3 | Add 2 dummy variables for 3-level categorical (dry, rather humid, very humid) | P8 | L14 | [ ] |
| O 10 | 10 | (a) Significance level alpha from two-tailed critical values; (b) Power at lambda = 0.6 using Exp(6) | P10, P9 | L10, L15 | [ ] |
| O 11 | 9 | (a) Unbiased estimator for p using count A; (b) Using sample mean; (c) MSE computation | P17 | L6 | [ ] |
| O 12 | 9 | (a) Integral for E[X]; (b) Method of moments estimator for theta; (c) Bias of max-based estimator | P18, P27 | L6, L10 | [ ] |

---

## How to Solve Each Pattern

### P1 — Confidence Interval for Normal Mean (Unknown Variance)

**How to recognize:** "Give a CI for mu given a random sample from N(mu, sigma^2) with unknown sigma"

**Formula:**
$$\bar{x} \pm t_{n-1, \alpha/2} \cdot \frac{s}{\sqrt{n}}$$

**Steps:**
1. Compute sample mean: $$\bar{x} = \frac{1}{n}\sum x_i$$
2. Compute sample std dev: $$s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$$
3. Find t-critical from t-table: t_{n-1, alpha/2}
4. Plug into formula: [x̄ - t·s/√n, x̄ + t·s/√n]

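**Key insight:** With unknown sigma, use the t-distribution with n-1 df; this matters most for small samples (n < 30). For large n (n > 100) the t critical value is close to the normal one, so z ≈ 1.96 (95%) or 1.645 (90%) works as an approximation.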

**2025 Endterm Q1–4:** n=3, x̄ = 13.1, s = 1.609, 95% CI → t_{2,0.025} = 4.303. CI = [13.1 ± 4.303 · 1.609/√3]
**Resit Q4:** n=6, x̄ = 13.0927, s = 1.4091, 99% CI → t_{5,0.005}
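
The steps above can be checked with a minimal Python sketch. The function name `t_confidence_interval` is illustrative, and the critical value is passed in by hand (read from the t-table) rather than computed.

```python
import math

def t_confidence_interval(mean, s, n, t_crit):
    """Return (lower, upper) for mean +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin

# Endterm 2025 Q1-4: n = 3, sample mean 13.1, s = 1.609, t_{2,0.025} = 4.303
lower, upper = t_confidence_interval(13.1, 1.609, 3, 4.303)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

With only three observations the margin is about 4.0, so the interval is wide, roughly [9.10, 17.10].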

---

### P2 — Regression t-Test (Coefficient Significance)

**How to recognize:** "Test whether beta_j = some_value" or "Is the coefficient significant?"

**Formula:**
$$t = \frac{\hat{\beta}_j - \beta_{j,0}}{\text{se}(\hat{\beta}_j)} \sim t_{n-p}$$

**Steps:**
1. Read coef and std err from regression output
2. Compute t-statistic: (coef - hypothesized value) / std err
3. Find critical value: t_{n-p, alpha/2} where p = number of parameters (including intercept)
4. Compare: |t| > t_crit → reject H0
5. **Shortcut:** If hypothesized value lies within the reported 95% CI, do NOT reject at alpha=0.05

**Degrees of freedom:** df = n - p where p = number of estimated parameters (intercept + all slopes)

**2025 Endterm SA 11:** beta_2 = -0.2329, CI = [-0.249, -0.217]. Testing beta_2 = -0.240: -0.240 is IN the CI → do not reject.
**2025 Endterm SA 10:** Large sample (n=7638), so z ≈ 1.645 for 90% CI. CI = 453.57 ± 1.645 · 10.93 = [435.59, 471.55]
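
A small Python sketch of the two regression calculations above, using only the standard library. The helper names `coefficient_ci` and `rejects_at_ci` are made up for illustration, and the z-based interval is only appropriate under the large-sample assumption stated in the notes.

```python
from statistics import NormalDist

def coefficient_ci(coef, se, level):
    """Large-sample CI for a regression coefficient using a z critical value."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return coef - z * se, coef + z * se

def rejects_at_ci(hypothesized, ci):
    """CI shortcut for a two-sided test: reject only if the value is outside."""
    lower, upper = ci
    return not (lower <= hypothesized <= upper)

# Endterm 2025 SA 10: 90% CI for beta_0 (coef 453.57, se 10.93, n = 7638)
ci_beta0 = coefficient_ci(453.57, 10.93, 0.90)

# Endterm 2025 SA 11: is beta_2 = -0.240 rejected, given the 95% CI?
reject_beta2 = rejects_at_ci(-0.240, (-0.249, -0.217))

print(ci_beta0, reject_beta2)
```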

---

### P3 — CLT Application (Sum of i.i.d. Approx. Normal)

**How to recognize:** "Sum of n independent RVs, find probability the sum is between values"

**Steps:**
1. Find E[X] and Var(X) for the individual RV (may be given or computed)
2. For sum S_n: E[S_n] = n·E[X], Var(S_n) = n·Var(X)
3. By CLT: S_n ≈ N(n·mu, n·sigma^2)
4. Standardize: Z = (S_n - n·mu) / (sigma·√n)
5. Look up standard normal probabilities in table

**2025 Endterm Q6:** X on [-1,1], f(x) = (3/4)(1-x^2), E[X] = 0, E[X^2] = 1/5, so Var(X) = 1/5. S_400 ≈ N(0, 80). P(-1 <= S_400 <= 1) = P(-1/√80 <= Z <= 1/√80) with 1/√80 ≈ 0.11, so the table gives ≈ 2·(Φ(0.11) - 0.5) = 2·0.0438 = 0.0876.

**Resit Q1:** Pareto(alpha=4), E[X] = 4/3, Var(X) = 2/9. S_72 ≈ N(96, 16). P(S_72 < 100) = P(Z < 1) ≈ 0.8413.
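
Both examples can be reproduced with `statistics.NormalDist` from the Python standard library. This is a sketch of the normal approximation only; `clt_sum_distribution` is an illustrative name. Because it does not round z to two decimals, the Endterm value comes out near 0.089 rather than the table-based 0.0876.

```python
import math
from statistics import NormalDist

def clt_sum_distribution(n, mean, var):
    """Normal approximation for the sum of n i.i.d. variables."""
    return NormalDist(mu=n * mean, sigma=math.sqrt(n * var))

# Endterm 2025 Q6: E[X] = 0, Var(X) = 1/5, n = 400
s400 = clt_sum_distribution(400, 0.0, 1 / 5)
p_endterm = s400.cdf(1) - s400.cdf(-1)

# Resit 2025 Q1: Pareto(4) with E[X] = 4/3, Var(X) = 2/9, n = 72
s72 = clt_sum_distribution(72, 4 / 3, 2 / 9)
p_resit = s72.cdf(100)

print(round(p_endterm, 4), round(p_resit, 4))
```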

---

### P4 — Binomial Proportion Test

**How to recognize:** "Testing if a die/proportion is fair" with count of successes out of n trials

**Steps:**
1. Under H0: Y ~ Bin(n, p0) where p0 = expected proportion (e.g., 1/6 for fair die, 1/4 for 4-sided)
2. Count observed successes (or more extreme)
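3. p-value = P(Y <= observed) for a left-tailed test, P(Y >= observed) for a right-tailed test; for a two-tailed test, add both tails: P(Y <= observed) + P(Y >= observed'), where observed' is the equally extreme count on the other side of n·p0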
4. Compare p-value to alpha

**2025 Endterm Q5:** 4-sided die, p0 = 1/4, n = 12, observed 0 fours. H1: p < 1/4. p-value = P(Y=0 | p=1/4) = (3/4)^12 ≈ 0.0317. Since 0.0317 < 0.05, reject H0.
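
A minimal sketch of the left-tailed p-value computation, assuming the exact binomial distribution under H0. The function names are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def left_tail_p_value(observed, n, p0):
    """P(Y <= observed) under H0: Y ~ Bin(n, p0)."""
    return sum(binom_pmf(k, n, p0) for k in range(observed + 1))

# Endterm 2025 Q5: 0 fours in 12 throws of a 4-sided die, H1: p < 1/4
p_value = left_tail_p_value(0, 12, 1 / 4)
reject_h0 = p_value < 0.05
print(round(p_value, 4), reject_h0)
```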

---

### P5 — R-Squared Interpretation

**How to recognize:** "What does R-squared mean?" or "Is adding variables justified?" or computing 1-R²

**Key facts:**
- R² = 1 - SSR/SST, where SSR is the residual sum of squares and SST the total sum of squares; R² is the proportion of variance in Y explained by the model
- R² NEVER decreases when adding variables (always increases or stays same)
- Adjusted R² penalizes for extra variables: can decrease if new variable adds little
- 1 - R² = SSR/SST = ratio of residual variance to total variance
- For large samples, adjusted R² ≈ R²

**2025 Endterm SA 9:** q = SSR/SST = 1 - R² = 1 - 0.929 = 0.071
**Resit SA 9(a):** R² increased from 0.690 to 0.692 — this is expected with more variables. Need to check adjusted R² and multiple testing.
**Resit SA 9(b):** With 5 tests (beta_1 to beta_5), use Bonferroni: alpha/5 = 0.01 threshold.

---

### P6 — Convolution for Sum of Independent RVs

**How to recognize:** "X and Y are independent, find the distribution/pdf of Z = X + Y"

**Formula:**
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(t) \cdot f_Y(z-t) \, dt$$

**Steps:**
1. Determine the support of Z (sum of supports of X and Y)
2. Write f_X(t) and f_Y(z-t) with their piecewise definitions
3. Find integration limits where BOTH densities are non-zero:
   - f_X(t) > 0 requires: t in support of X
   - f_Y(z-t) > 0 requires: z-t in support of Y, i.e., t in z - support_of_Y
4. Intersect the limits and integrate

**2025 Endterm Q14:** X ~ Exp(lambda), Y ~ U(2,5), Z = X + Y.
- Support of Z: [0, infinity) + [2, 5] = [2, infinity)
- f_Z(3): Need t >= 0 (for Exp) AND 2 <= 3-t <= 5, i.e., -1 <= -t <= 2, i.e., -2 <= t <= 1. Combined: 0 <= t <= 1.
- f_Z(3) = int_0^1 lambda·e^{-lambda·t} · (1/3) dt = (1/3)(1 - e^{-lambda})

**Resit Q7:** X, Y ~ Pareto(alpha=1, x0=1), independent.
- f_X(x) = x^{-2} for x >= 1, f_Y(y) = y^{-2} for y >= 1
- f_Z(z) = int_1^{z-1} t^{-2} · (z-t)^{-2} dt for z >= 2 (limits from t >= 1 and z-t >= 1)
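
To sanity-check integration limits, the convolution integral can be approximated numerically and compared with the closed form. The sketch below does this for the Endterm example with a midpoint rule. The value lambda = 1 is an assumption chosen for illustration (the exam keeps lambda symbolic), and `convolution_at` is a hypothetical helper.

```python
import math

def exp_pdf(t, lam):
    """Density of Exp(lam)."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0

def uniform_pdf(y, a, b):
    """Density of U(a, b)."""
    return 1.0 / (b - a) if a <= y <= b else 0.0

def convolution_at(z, lam, a, b, t_max=50.0, steps=200_000):
    """Midpoint-rule approximation of int f_X(t) f_Y(z - t) dt over [0, t_max]."""
    h = t_max / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += exp_pdf(t, lam) * uniform_pdf(z - t, a, b)
    return total * h

lam = 1.0  # illustrative value, not taken from the exam
numeric = convolution_at(3.0, lam, 2.0, 5.0)
closed_form = (1 - math.exp(-lam)) / 3
print(numeric, closed_form)
```

The integrand is automatically zero outside 0 <= t <= 1, so agreement between the two numbers confirms the limits derived by hand.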

---

### P7 — MLE and Functional Invariance

**How to recognize:** "Find the MLE for theta" then "Find the MLE for g(theta)"

**Steps for MLE:**
1. Write the likelihood: L(theta) = product of f(x_i; theta)
2. Take log: l(theta) = log(L(theta))
3. Differentiate w.r.t. theta, set to 0, solve
4. Verify maximum (second derivative < 0)

**Functional Invariance:** If theta_hat is the MLE of theta, then for any function g, the MLE of g(theta) is g(theta_hat).

**2025 Endterm Q15:** X ~ Exp(lambda), sample of 4: x = {2.8, 3.1, 2.5, 3.9}
- L(lambda) = lambda^4 · e^{-lambda·12.3}
- l(lambda) = 4·ln(lambda) - 12.3·lambda
- dl/dlambda = 4/lambda - 12.3 = 0 → lambda_hat = 4/12.3 ≈ 0.325
- By invariance: MLE of 1/lambda^2 = 1/(0.325)^2 ≈ 9.456
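
The worked example can be verified in a few lines of Python. This sketch assumes the Exp(lambda) model from the question; the function names are illustrative.

```python
import math

def log_likelihood(lam, sample):
    """Log-likelihood of an Exp(lam) sample: n*ln(lam) - lam*sum(x)."""
    return len(sample) * math.log(lam) - lam * sum(sample)

def exp_mle(sample):
    """Closed-form MLE of lambda: n / sum(x_i)."""
    return len(sample) / sum(sample)

sample = [2.8, 3.1, 2.5, 3.9]
lam_hat = exp_mle(sample)

# Functional invariance: the MLE of g(lambda) = 1/lambda^2 is g(lam_hat)
inv_lam_sq_hat = 1 / lam_hat**2

print(round(lam_hat, 4), round(inv_lam_sq_hat, 3))
```

Evaluating `log_likelihood` slightly to either side of `lam_hat` gives smaller values, which is a quick numerical stand-in for the second-derivative check.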

---

### P8 — Dummy Variables in Regression

**How to recognize:** "Add a categorical variable with k levels" or "Interpret regression output with dummy variables"

**Key rules:**
- For k categories, add k-1 dummy variables (one base category omitted)
- The intercept beta_0 is the expected value for the base category
- Each dummy coefficient is the difference from the base category
- To predict for category C (not base): intercept + beta_C_dummy

**2025 Endterm SA 12 / Resit SA 12:** 3 humidity levels → 2 dummy variables. If base = "moderately humid", then add x_4 (dry) and x_5 (very humid).
**Resit SA 8:** Model Y = beta_0 + beta_1·x_1 + beta_2·x_2 + beta_3·x_3 + beta_4·x_4. Base = A. If prediction for B was 20.1 (using x_1=1), and we need C (x_1=0, x_2=1): y_hat = 20.1 - beta_1 + beta_2.
**Resit SA 8(a):** y_hat = 20.1 - 1.8221 + (-2.8831) = 15.3948

---

### P9 — Power Calculations

**How to recognize:** "How large should the sample be to achieve power pi?" or "What is the power of this test?"

**For known variance, one-sided test H0: mu = mu_0 vs H1: mu > mu_0:**

**Critical value:** Reject if x̄ >= mu_0 + z_{alpha} · sigma/√n

**Power at mu = mu_a:**
$$\pi = P_{\mu_a}\left(\bar{X} \ge \mu_0 + z_\alpha \cdot \frac{\sigma}{\sqrt{n}}\right) = P\left(Z \ge z_\alpha - \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma}\right)$$

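**Solving for sample size given desired power pi:** set the threshold equal to -z_{1-pi}, where z_{1-pi} is the upper-tail critical value with P(Z >= z_{1-pi}) = 1 - pi (for example z_{0.10} = 1.28 for power 0.9); round n up to the next integer.
$$z_\alpha - \frac{\sqrt{n}(\mu_a - \mu_0)}{\sigma} = -z_{1-\pi}$$
$$n = \left(\frac{(z_\alpha + z_{1-\pi})\sigma}{\mu_a - \mu_0}\right)^2$$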

**2025 Endterm Q13:** sigma^2 = 4, H0: mu = 5 vs H1: mu > 5, alpha = 0.05, power = 0.9 at mu = 6.
- (a) Critical value: x̄ >= 5 + 1.645 · 2/√n = 5 + 3.29/√n
- (b) Power: P(Z >= 1.645 - √n/2) = 0.9 → 1.645 - √n/2 = -1.28 → √n = 5.85 → n ≈ 34.2, round up to n = 35
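
A sketch of the power and sample-size calculation for the one-sided test with known variance, using `statistics.NormalDist`. The function names are illustrative, and the critical values are computed rather than read from a table, so intermediate numbers differ slightly from the rounded 1.645 and 1.28.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal

def power_one_sided(n, mu0, mu_a, sigma, alpha):
    """Power of the test H0: mu = mu0 vs H1: mu > mu0 at the alternative mu_a."""
    z_alpha = Z.inv_cdf(1 - alpha)
    threshold = z_alpha - math.sqrt(n) * (mu_a - mu0) / sigma
    return 1 - Z.cdf(threshold)

def required_sample_size(mu0, mu_a, sigma, alpha, power):
    """Smallest n from the closed-form formula, rounded up."""
    z_alpha = Z.inv_cdf(1 - alpha)
    z_power = Z.inv_cdf(power)  # upper-tail critical value for 1 - power
    n = ((z_alpha + z_power) * sigma / (mu_a - mu0)) ** 2
    return math.ceil(n)

# Endterm 2025 Q13: sigma = 2, mu0 = 5, mu_a = 6, alpha = 0.05, power 0.9
n_needed = required_sample_size(5, 6, 2, 0.05, 0.9)
print(n_needed, round(power_one_sided(n_needed, 5, 6, 2, 0.05), 4))
```

Checking the power at n - 1 is a useful habit: here n = 34 falls just short of 0.9, which is why rounding up matters.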

---

### P10 — Gamma / Exponential Hypothesis Test

**How to recognize:** "Sum of exponential observations follows a gamma" or "Minimum of exponentials is exponential"

**Key facts:**
- If X_i ~ Exp(lambda), then S_n = sum X_i ~ Gamma(n, lambda) with pdf: f(s) = lambda^n · s^{n-1} · e^{-lambda·s} / (n-1)!
- M_n = min(X_1,...,X_n) ~ Exp(n·lambda)
- For H0: lambda = lambda_0, S_n ~ Gamma(n, lambda_0)
- Larger lambda → smaller X values → reject H0 if S_n is very small (for H1: lambda > lambda_0)

**2025 Endterm Q15(c):** S_4 ~ Gamma(4, lambda), test H0: lambda = 0.1 vs H1: lambda > 0.1
- S_4 = 12.3, larger lambda → smaller observations → reject for small S_4
- p-value = P(S_4 <= 12.3 | lambda=0.1)
- From table: Gamma(0.025) = 10.90, Gamma(0.05) = 13.66
- So p-value is between 0.025 and 0.05
- Since p < 0.05 = alpha → reject H0
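
When no table is at hand, the p-value can be computed directly, because a Gamma distribution with integer shape n (an Erlang distribution) has a closed-form CDF. The sketch below uses that standard identity; `erlang_cdf` is an illustrative name.

```python
import math

def erlang_cdf(s, n, lam):
    """P(S <= s) for S ~ Gamma(n, lam) with integer shape n."""
    x = lam * s
    tail = sum(math.exp(-x) * x**k / math.factorial(k) for k in range(n))
    return 1 - tail

# Endterm 2025 Q15(c): S_4 = 12.3, H0: lambda = 0.1, reject for small S_4
p_value = erlang_cdf(12.3, 4, 0.1)
print(round(p_value, 4))
```

The result is consistent with the table bracketing above: the p-value lies between 0.025 and 0.05.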

**Resit Q10:** H0: lambda = 0.2 vs H1: lambda != 0.2, M_10 ~ Exp(10·0.2) = Exp(2) under H0
- (a) Alpha = P(M_10 <= 0.0204) + P(M_10 >= 1.6094) under Exp(2)
  = (1-e^{-2·0.0204}) + e^{-2·1.6094} = 0.04 + 0.04 = 0.08
- (b) Power at lambda = 0.6: M_10 ~ Exp(6)
  Power = P(M_10 <= 0.0204 | lambda=0.6) + P(M_10 >= 1.6094 | lambda=0.6)
  = (1-e^{-6·0.0204}) + e^{-6·1.6094} ≈ 0.1152 + 0.0001 = 0.1153
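
Both parts of the Resit question use the same two-tailed rejection region, only with a different rate for the minimum. A minimal sketch, with `rejection_probability` as an illustrative helper name:

```python
import math

def rejection_probability(rate, lower, upper):
    """P(M <= lower) + P(M >= upper) for M ~ Exp(rate)."""
    return (1 - math.exp(-rate * lower)) + math.exp(-rate * upper)

n = 10
lower, upper = 0.0204, 1.6094  # critical values from the question

alpha = rejection_probability(n * 0.2, lower, upper)  # under H0: lambda = 0.2
power = rejection_probability(n * 0.6, lower, upper)  # at lambda = 0.6

print(round(alpha, 4), round(power, 4))
```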

---

### P11 — Type I / II Error Computation

**How to recognize:** "Compute the probability of type I error" or "type II error at theta = ..."

**Definitions:**
- **Type I error (alpha):** Reject H0 when H0 is true = P(reject region | H0)
- **Type II error (beta):** Fail to reject H0 when H1 is true = P(not in reject region | H1 parameter)
- **Power:** 1 - beta = P(reject | H1 parameter)

**Steps:**
1. Identify the reject region R from the test description
2. Type I: Integrate pdf over R assuming H0 parameter
3. Type II: Integrate pdf over R^c (complement of R) assuming H1 parameter

**2025 Endterm Q7:** H0: theta = 1 (X ~ U(-1,1)), H1: theta < 1. Reject if -0.1 <= X <= 0.1.
- (a) Type I error = P(-0.1 <= X <= 0.1 | theta=1) = 0.2/2 = 0.1
- (b) Type II at theta = 0.4: P(not reject | theta=0.4) = 1 - P(-0.1 <= X <= 0.1 | theta=0.4) = 1 - 0.2/0.8 = 1 - 0.25 = 0.75
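
For a uniform distribution, "integrating the pdf" reduces to interval length divided by the width of the support. The sketch below encodes that for U(-theta, theta); the helper name is illustrative.

```python
def uniform_interval_prob(a, b, theta):
    """P(a <= X <= b) for X ~ U(-theta, theta)."""
    lo, hi = max(a, -theta), min(b, theta)
    return max(hi - lo, 0.0) / (2 * theta)

# Endterm 2025 Q7: reject H0 if -0.1 <= X <= 0.1
type_1 = uniform_interval_prob(-0.1, 0.1, 1.0)      # reject although theta = 1
type_2 = 1 - uniform_interval_prob(-0.1, 0.1, 0.4)  # fail to reject at theta = 0.4

print(type_1, type_2)
```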

---

### P12 — Conditional Probability with Continuous Distributions

**How to recognize:** "Given that X > a, what is P(X > b)?" (for continuous distributions)

**Formula:**
$$P(X > b \mid X > a) = \frac{P(X > b \cap X > a)}{P(X > a)} = \frac{P(X > b)}{P(X > a)} \quad \text{for } b > a$$

**2025 Resit Q2:** Pareto(alpha=4, x0=1), P(X > 2 | X > 1.5) = P(X > 2)/P(X > 1.5)
- Pareto survival: P(X > x) = x^{-alpha} = x^{-4}
- = 2^{-4}/1.5^{-4} = (1.5/2)^4 = 0.75^4 ≈ 0.3164
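
The same ratio-of-survival-functions idea in Python, assuming the Pareto(alpha, x0) survival function stated above. The function names are illustrative.

```python
def pareto_survival(x, alpha, x0=1.0):
    """P(X > x) for X ~ Pareto(alpha, x0)."""
    return (x0 / x) ** alpha if x >= x0 else 1.0

def conditional_tail(b, a, alpha, x0=1.0):
    """P(X > b | X > a) for b > a."""
    return pareto_survival(b, alpha, x0) / pareto_survival(a, alpha, x0)

# Resit 2025 Q2: Pareto(alpha = 4, x0 = 1)
p = conditional_tail(2.0, 1.5, 4)
print(round(p, 4))
```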

---

### P13 — Bayes' Rule (with Multiple Hypotheses)

**How to recognize:** "Given prior probabilities of several causes, and an observation, find posterior probability of each cause"

**Formula (with partitions):**
$$P(H_i \mid E) = \frac{P(E \mid H_i)P(H_i)}{\sum_j P(E \mid H_j)P(H_j)}$$

**2025 Resit Q3:** Varieties A, B, C with P(A)=1/2, P(B)=1/3, P(C)=1/6
- P(not infected | A) = 1, P(not infected | B) = 3/4, P(not infected | C) = 4/5
- P(A | not infected) = (1 · 1/2) / (1 · 1/2 + 3/4 · 1/3 + 4/5 · 1/6)
- = (1/2) / (1/2 + 1/4 + 2/15) = (1/2) / (30/60 + 15/60 + 8/60) = (1/2)/(53/60) = 30/53 ≈ 0.5660
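
Exact fractions avoid rounding slips in Bayes' rule. The sketch below uses `fractions.Fraction` from the standard library; `posterior` is an illustrative helper that applies the partition formula to dictionaries of priors and likelihoods.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Posterior P(H | E) for each hypothesis H via Bayes' rule."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())  # P(E) by the law of total probability
    return {h: joint[h] / total for h in joint}

# Resit 2025 Q3: priors for varieties and P(not infected | variety)
priors = {"A": Fraction(1, 2), "B": Fraction(1, 3), "C": Fraction(1, 6)}
not_infected = {"A": Fraction(1), "B": Fraction(3, 4), "C": Fraction(4, 5)}

post = posterior(priors, not_infected)
print(post["A"], float(post["A"]))
```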

---

### P14 — t-Test Statistic Computation + Conclusion

**How to recognize:** "Compute the test statistic and draw conclusion" for a one-sample t-test

**Steps:**
1. t = (x̄ - mu_0) / (s/√n)
2. Determine df = n-1
3. Find critical value t_{df, alpha} (one-tailed) or t_{df, alpha/2} (two-tailed)
4. For one-tailed H1: mu > mu_0: reject if t > t_crit
5. For one-tailed H1: mu < mu_0: reject if t < -t_crit

**Resit Q5:** H0: mu = 50 vs H1: mu > 50, alpha = 0.05, n = 10
- t = (52.38 - 50) / (4.25/√10) = 2.38 / 1.3436 ≈ 1.771
- t_crit = t_{9, 0.05} ≈ 1.833
- 1.771 < 1.833 → do not reject H0
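
A short sketch of the one-sample t-statistic and the right-tailed decision. The critical value 1.833 is taken from the t-table as in the notes, not computed; the function name is illustrative.

```python
import math

def one_sample_t(mean, mu0, s, n):
    """t = (mean - mu0) / (s / sqrt(n))."""
    return (mean - mu0) / (s / math.sqrt(n))

# Resit 2025 Q5: H0: mu = 50 vs H1: mu > 50, alpha = 0.05, n = 10
t_stat = one_sample_t(52.38, 50, 4.25, 10)
t_crit = 1.833  # t_{9, 0.05} from the table
reject_h0 = t_stat > t_crit

print(round(t_stat, 3), reject_h0)
```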

---

### P15 — Binomial Distribution Identification

**How to recognize:** "Count the number of successes in n independent trials" → Binomial

**Key facts:**
- Count of successes in n independent Bernoulli trials → Bin(n, p)
- Geometric: number of trials until first success
- Sum of independent geometrics → Negative binomial
- P(X = k) = C(n,k) · p^k · (1-p)^{n-k}

**Resit Q6:** 10 throws of fair die, each six = lose 1 point. X = remaining points = 10 - (number of sixes). Number of sixes ~ Bin(10, 1/6). So X = 10 - Bin(10, 1/6).

---

### P16 — Confidence Interval for Regression Coefficient

**How to recognize:** "Give a CI for beta_j" in a regression context

**Formula:**
$$\hat{\beta}_j \pm t_{n-p, \alpha/2} \cdot \text{se}(\hat{\beta}_j)$$

**For large n (n > 1000):** t ≈ z, so z_{0.05} = 1.645 for 90%, z_{0.025} = 1.96 for 95%

**2025 Endterm SA 10:** n = 7638, large sample. 90% CI for beta_0:
- z_{0.05} = 1.645, se = 10.93, beta_0 = 453.57
- CI = [453.57 - 1.645·10.93, 453.57 + 1.645·10.93] = [435.59, 471.55]

**Resit Q10:** Same setup, same formula

---

### P17 — Method of Moments / Unbiased Estimators

**How to recognize:** "Find an unbiased estimator for theta" or "Find the method of moments estimator"

**Method of Moments:**
1. Compute E[X] as a function of theta
2. Set E[X] = x̄ (sample mean)
3. Solve for theta in terms of x̄

**Unbiased Estimator:**
- T is unbiased for theta if E[T] = theta
- If E[g(X)] = c·theta, then T = g(X)/c is unbiased

**Resit Q11:** X takes {-1, 0, 1, 2} with probs {p, 3p, 0.5-2p, 0.5-2p}
- E[X] = -p + 0 + (0.5-2p) + 2(0.5-2p) = 1.5 - 7p
- p = (1.5 - E[X])/7 → T_2 = (1.5 - X̄)/7 = 3/14 - X̄/7
- Count of 1s: A = sum I(X_i = 1). P(X=1) = 0.5-2p. E[A] = n(0.5-2p).
- p = (0.5 - E[A]/n)/2 = 1/4 - A/(2n) → T_1 = 1/4 - A/(2n)
- MSE(T_1) = Var(T_1) = (1/(4n^2)) · n · (0.5-2p)(1-(0.5-2p)) = (1/(4n))·(1/4 - 4p^2)
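
Unbiasedness of both estimators can be confirmed exactly for a concrete p. The sketch below uses `fractions.Fraction`; the value p = 1/10 and the sample size n = 20 are hypothetical choices for illustration (any 0 < p < 1/4 works), and the function names are not from the exam.

```python
from fractions import Fraction

def pmf(p):
    """Distribution of X on {-1, 0, 1, 2} from Resit Q11."""
    half = Fraction(1, 2)
    return {-1: p, 0: 3 * p, 1: half - 2 * p, 2: half - 2 * p}

def mean_of_x(p):
    return sum(x * prob for x, prob in pmf(p).items())

def expected_t2(p):
    # T_2 = (3/2 - Xbar) / 7 and E[Xbar] = E[X]
    return (Fraction(3, 2) - mean_of_x(p)) / 7

def expected_t1(p):
    # T_1 = 1/4 - A/(2n) and E[A]/n = P(X = 1)
    return Fraction(1, 4) - pmf(p)[1] / 2

def var_t1(p, n):
    # A ~ Bin(n, q) with q = P(X = 1), so Var(T_1) = q(1 - q) / (4n)
    q = pmf(p)[1]
    return q * (1 - q) / (4 * n)

p = Fraction(1, 10)  # hypothetical parameter value
print(expected_t1(p), expected_t2(p), var_t1(p, 20))
```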

---

### P18 — MLE for Non-Standard Distributions

**How to recognize:** "Find the MLE for theta" for a non-standard pdf

**Steps:**
1. Write the likelihood based on the given pdf
2. Note that each observation contributes a factor depending on whether it falls in different regions
3. Take log, differentiate, solve

**Resit Q12:** f_X(x) = (pi/(2theta))·cos(pi·x/(2theta)) for 0 <= x <= theta
- E[X] = ((pi-2)/pi)·theta
- MoM: Set x̄ = ((pi-2)/pi)·theta → theta_hat = (pi/(pi-2))·x̄
- Max-based: T = ((n+1)/n)·max(X_i). For U(0,theta), this is unbiased. But cos-density has more mass near 0, so max tends to be smaller → T is negatively biased.

**Resit Q13:** Mixed distribution from coin toss:
- If Heads: U(0,1), if Tails: U(-1,0). P(Heads) = alpha
- Likelihood: L(alpha) = alpha^{n_+} · (1-alpha)^{n_-} where n_+ = count of positive values
- MLE: alpha_hat = n_+/n
- MSE = Var(alpha_hat) = alpha(1-alpha)/n

---

### P19 — Pareto Distribution Calculations

**Key formulas:**
- Pareto(alpha, x0): f(x) = alpha·x0^alpha / x^{alpha+1} for x >= x0
- P(X > x) = (x0/x)^alpha for x >= x0
- E[X] = alpha·x0/(alpha-1) for alpha > 1
- Var(X) = alpha·x0^2/((alpha-1)^2·(alpha-2)) for alpha > 2

**Resit MC 1:** Pareto(4, 1): E[X] = 4/3, Var(X) = 4/(9·2) = 2/9
- S_72 ≈ N(96, 16) by CLT
- P(S_72 < 100) ≈ P(Z < 1) = 0.8413

---

### P20 — Chebychev's Inequality

**Inequality:**
$$P(|X - E[X]| \ge a) \le \frac{\text{Var}(X)}{a^2}$$

**Usage:** Verify the inequality holds for a given distribution by computing both sides.

---

### P21 — Joint Distributions / Independence Checks

**How to recognize:** "Are X and Y independent?" given a joint pdf

**Test:** X and Y are independent iff f_{X,Y}(x,y) = f_X(x) · f_Y(y) for all x, y.

**Shortcut:** If the joint pdf cannot be factored into g(x)·h(y), then they are dependent.

**Example:** f(x,y) = (3/16)(2-x^2-y^2) cannot be factored → X, Y dependent.

---

### P22 — Correlation / Covariance Computation

**Formulas:**
- Cov(X,Y) = E[XY] - E[X]·E[Y]
- Corr(X,Y) = Cov(X,Y) / (sigma_X · sigma_Y)

**For independent RVs:** Cov = 0, Corr = 0 (the converse fails in general: zero correlation does not imply independence)

**Qualitative:** If large X tends to go with large Y → positive correlation; large X with small Y → negative.

---

### P23 — Two-Sample t-Test

**How to recognize:** "Compare means of two independent groups"

**Equal variance assumed (Levene's test p > 0.05):**
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \cdot \sqrt{1/n_1 + 1/n_2}}$$
where s_p^2 = ((n_1-1)s_1^2 + (n_2-1)s_2^2) / (n_1 + n_2 - 2)
df = n_1 + n_2 - 2

**Unequal variance (Welch):** df computed via Welch-Satterthwaite formula

---

### P24 — Multiple Testing / p-hacking

**Key insight:** When testing k hypotheses at level alpha, the family-wise error rate increases.

**Bonferroni correction:** Use alpha/k as the significance threshold for each individual test.

**Resit SA 8(b):** Residuals show a quadratic pattern → add an x_4^2 term to the model (a model-diagnosis point rather than a multiple-testing one).
**Resit SA 9(b):** 5 tests → Bonferroni threshold = 0.05/5 = 0.01.
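
To see why the correction is needed, the sketch below computes the family-wise error rate with and without Bonferroni. The formula 1 - (1 - alpha)^k assumes the k tests are independent, which is a simplifying assumption and not something the exam states.

```python
def family_wise_error_rate(alpha, k):
    """P(at least one false rejection) for k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(alpha, k):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / k

k, alpha = 5, 0.05
uncorrected = family_wise_error_rate(alpha, k)
threshold = bonferroni_threshold(alpha, k)
corrected = family_wise_error_rate(threshold, k)

print(round(uncorrected, 4), threshold, round(corrected, 4))
```

Without correction, five tests at 0.05 give roughly a 23% chance of at least one false rejection; with the 0.01 threshold it drops back below 5%.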

---

### P25 — Distribution Identification from Histogram/Boxplot

**Clues:**
- Symmetric, bell-shaped → Normal
- Right-skewed with long tail → Exponential, Pareto
- Uniform boxplot (median centered, whiskers equal) → Uniform
- Boxplot with Q1 far from median, long upper whisker → Right-skewed (Exp, Pareto)
- Five-number summary with median not at center → Check for skew

**Resit MC 3:** Median = 3.443, not halfway between Q1 and Q3 (1.440, 6.035). Max = 18.203 much further from Q3 than Min = 0.028 is from Q1. → Right-skewed. Most likely Exp(0.25) (median = ln(2)/0.25 ≈ 2.77, close to 3.443).

---

### P26 — Change-of-Variable Formula

**For Y = g(X) where g is monotone:**
$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left|\frac{d}{dy}g^{-1}(y)\right|$$

**Or using CDF method:**
$$F_Y(y) = P(g(X) \le y) = P(X \le g^{-1}(y)) = F_X(g^{-1}(y))$$
Then differentiate: f_Y(y) = d/dy F_Y(y)

---

### P27 — Estimator Bias / MSE Calculation

**Definitions:**
- **Bias:** Bias(T) = E[T] - theta
- **MSE:** MSE(T) = E[(T - theta)^2] = Var(T) + Bias(T)^2
- **Unbiased:** Bias = 0, so MSE = Var

**If T is unbiased:** MSE(T) = Var(T)
**If T is biased:** MSE(T) = Var(T) + (E[T] - theta)^2

---

## Confidence Assessment

### Strong — confident in solving these without help
- Confidence intervals for normal mean (P1) — formula is mechanical
- Binomial proportion test (P4) — straightforward calculation
- R-squared interpretation (P5) — conceptual understanding
- Dummy variables (P8) — follow the k-1 rule
- Type I/II error for uniform distributions (P11) — just integrate over the reject region
- t-test statistic computation (P14) — plug into formula
- Binomial distribution identification (P15) — count successes

### Needs review — review formulas and practice problems
- Regression t-tests (P2) — need to be careful with degrees of freedom
- CLT applications (P3) — need to compute E[X] and Var(X) correctly first
- MLE + functional invariance (P7) — practice the log-likelihood steps
- Convolution integrals (P6) — setting up limits is the tricky part
- Power calculations (P9) — remember the sign flip: z_alpha - sqrt(n)*delta/sigma = -z_{1-pi} (e.g. -1.28 for power 0.9)
- Gamma distribution hypothesis tests (P10) — need to read quantile tables correctly
- Method of moments / unbiased estimators (P17) — need to solve for theta correctly

### Not attempted — review if time permits
- Conditional Pareto probability (P12)
- Bayes' rule with continuous/mixed (P13)
- CI for regression coefficients (P16)
- MLE for non-standard distributions (P18)
- Pareto calculations (P19)
- Chebychev's inequality (P20)
- Joint distribution independence checks (P21)
- Correlation/covariance computation (P22)
- Two-sample t-test (P23)
- Multiple testing / p-hacking (P24)
- Distribution identification from plots (P25)
- Change-of-variable formula (P26)
- Estimator bias / MSE (P27)
- Gamma distribution properties (P28)

---

## Points Distribution Summary

Based on analysis of Endterm 2025 and Resit 2025 question weights:

| Category | Key Patterns | Approx. Points | % of Exam |
|----------|-------------|---------------|-----------|
| Hypothesis Testing | P4, P9, P10, P11, P14, P23, P24 | 33 | 37% |
| Confidence Intervals | P1, P16 | 21 | 23% |
| Regression | P2, P5, P8, P24 | 17 | 19% |
| CLT / Convolution | P3, P6, P12, P19 | 18 | 20% |
| MLE / Estimation | P7, P17, P18, P26, P27 | 6 | 7% |
| Probability Foundations | P13, P15, P20–P22, P25 | — | (included above) |
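| **Total** | | **95** | **106%** |

Note: the category rows sum to 95 points (106% of a 90-point exam) rather than 90, likely because some questions are counted in more than one category (P24, for example, is listed under both Hypothesis Testing and Regression).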

**Priority focus:** Hypothesis testing (37%) and Confidence intervals (23%) together cover 60% of the exam. CLT/Convolution (20%) is the third largest area and often appears as multi-part open questions, closely followed by Regression (19%).

**Highest-yield patterns to master first:** P1 (CI), P2 (regression t-test), P3 (CLT), P4 (binomial test), P9 (power), P10 (gamma test), P11 (type I/II error), P14 (t-test computation).
