# IDM Endterm Study Plan (Based on Exam Pattern Analysis)

## Exam Overview

- **180 minutes**, computer-based exam
- Mix of open questions, multiple-answer questions, calculations, and diagrams
- One double-sided handwritten A4 cheat sheet allowed
- **8 past exams analyzed**: 4 Endterms (2020, 2022, 2023, 2024) + 4 Resits (2020, 2022, 2023, 2024)
- **~55 total questions** across all exams, distilled into **18 distinct patterns**
- Based on patterns that recur across multiple exam sittings — study in tier order

---

## Tier 1: Very High frequency + on Endterm 2024 -- study these first

| # | Topic | 2024 Endterm Q | Lectures | Pattern | Done |
|---|-------|---------------|----------|---------|------|
| 1 | Cost-Based Access Path Selection (Size Estimation & Algorithm Choice) | Q1.1 | L09, L10 | P1 | [ ] |
| 2 | General Discussion Questions (Agree/Disagree with Claims) | Q4 | L01–L15 | P2 | [ ] |
| 3 | Logging & Recovery (UNDO / REDO / UNDO-REDO) | Q3 | L11, L12 | P4 | [ ] |
| 4 | CAP Theorem | Q4.2 | L13, L14 | P5 | [ ] |
| 5 | Conflict Serializability (Precedence / Conflict Graph) | Q2 | L11, L12 | P6 | [ ] |
| 6 | Join Order Optimization | Q1.2.2 | L09, L10 | P7 | [ ] |

---

## Tier 2: High frequency (appeared on multiple exams, not on Endterm 2024)

| # | Topic | Lectures | Pattern | Done |
|---|-------|----------|---------|------|
| 7 | Two-Phase Commit Protocol (2PC) | L11, L12 | P3 | [ ] |
| 8 | Relational Algebra Queries | L01, L02 | P8 | [ ] |
| 9 | Reduction Factor Accuracy & Size Estimation Discussion | L09, L10 | P9 | [ ] |
| 10 | Algebraic Query Optimization Heuristics | L09, L10 | P10 | [ ] |
| 11 | Advanced SQL Queries | L01, L02 | P11 | [ ] |
| 12 | Indexing Technique Selection (Hash-Index vs B+Tree) | L06, L07 | P12 | [ ] |

---

## Tier 3: Low frequency (past endterms, less recurring)

| # | Topic | Lectures | Pattern | Done |
|---|-------|----------|---------|------|
| 13 | Row-Store vs Column-Store | L05 | P14 | [ ] |
| 14 | Two-Phase Locking (2PL) Schedule Validation | L11, L12 | P15 | [ ] |
| 15 | Data Integration (Schema Matching, Merging, Quality) | — | P16 | [ ] |

---

## Tier 4: Resits only (appeared on resits, not endterms)

| # | Topic | Lectures | Pattern | Done |
|---|-------|----------|---------|------|
| 16 | B+Tree / KD-Tree Insertion | L06, L07 | P13 | [ ] |
| 17 | Conceptual Modelling (EER Diagrams) | — | P17 | [ ] |
| 18 | Functional Dependencies & Normalization (2NF, 3NF) | — | P18 | [ ] |

---

## Endterm 2024 Questions

| Q# | Topic | Pattern | Lectures | Done |
|----|-------|---------|----------|------|
| Q1.1 | Cost-Based Access Path Selection: size estimation, algorithm choice (Relation Scan vs Index Scan, join costs) | P1 | L09, L10 | [ ] |
| Q1.2.2 | Join Order Optimization: evaluate operator tree join order, propose better order | P7 | L09, L10 | [ ] |
| Q2 | Conflict Serializability: given schedule, build precedence graph, determine serializability | P6 | L11, L12 | [ ] |
| Q3 | Logging & Recovery: fill in log entries (UNDO/REDO/UNDO-REDO), determine disk values before/after crash | P4 | L11, L12 | [ ] |
| Q4 | General Discussion Questions (multiple sub-questions): query optimization goals, prepared statements, SQL injection, schema-free, transaction processing | P2 | L01–L15 | [ ] |
| Q4.2 | CAP Theorem: evaluate vendor claims about consistency/availability/partition tolerance | P5 | L13, L14 | [ ] |

---

## How to Solve Each Pattern

### P1 -- Cost-Based Access Path Selection (Size Estimation & Algorithm Choice) (10 pts)
**How to recognize:** "Given a schema with table/column statistics and an operator tree, estimate the size of intermediate results, then choose the cheapest algorithm"

**Methods to Solve:**
- **Reduction factor for equality selections:** rf = 1 / #distinct_values(column). E.g., σ_name='X' → rf = 1 / #dv(name).
- **Reduction factor for range selections:** rf = (value − low) / (max − low + 1) for < operator, or (high − value) / (max − low + 1) for > operator.
- **Combined reduction factor:** Multiply individual reduction factors when multiple conditions exist: rf = rf₁ × rf₂.
- **Estimated tuples:** #records × rf. Round up to next integer for result set size estimations.
- **Bytes per record:** Sum all attribute sizes. Calculate records/block = ⌊block_size / bytes_per_record⌋. Blocks = ⌈tuples / records_per_block⌉.
- **Selection cost — Relation Scan:** c_RS = #blocks(table). Read the entire table.
- **Selection cost — Index Scan:** c_IS = #matching_tuples × (c_ix + 1), where c_ix = index access cost (tree height − 1, or given). For multiple conditions on individually indexed attributes, use index on the most selective attribute and filter the rest in memory.
- **Join size estimation:** |R ⋈ S| = (|R| × |S|) / max(#dv(R.join_attr), #dv(S.join_attr)).
- **Join cost — Block Nested Loop (BNL):** c_BNL = #blocks_outer + (#blocks_outer × #blocks_inner) + c_result. Use the smaller relation as the outer.
- **Join cost — Index Nested Loop (INL):** c_INL = #blocks_outer + #records_outer × (c_ix + 1) + c_result. Requires an index on the inner relation's join attribute.
- **Join cost — Sort-Merge Join (MJ):** Only applicable if both relations are sorted on the join attribute. c_MJ = #blocks_R + #blocks_S + c_result.
- **Commutativity:** Always check if swapping inner/outer in BNL or INL gives a cheaper cost. The join operator is commutative!
- **c_result:** Always include the cost of writing the intermediate result to disk (= #blocks of result).

**Example (Resit 2024 Q1.1):** For σ_price<1000 ∧ max_occupants=3(Apartment) with 10,000 records, 69 blocks: rf_price = 500/2001 ≈ 0.25, rf_max_occ = 1/6. Combined rf ≈ 0.0416 → 417 records, 3 blocks. Index Scan on max_occupants: 1/6 × 10000 × 2 ≈ 3334 block accesses. Relation Scan: 69 blocks. → Relation Scan is cheaper.

---

### P2 -- General Discussion Questions (Agree/Disagree with Claims) (6 pts)
**How to recognize:** A statement is presented (e.g., about query optimization, transactions, NoSQL, SQL injection, schemas) and you must agree, disagree, or say "it depends," providing 1–2 strong arguments with justification.

**Methods to Solve:**
- **Take a nuanced position:** Pure "agree" or "disagree" without qualification often scores 0. "It depends" with good reasoning is usually best.
- **"Goal of query optimization is the best plan":** Disagree. The goal is a *good enough* plan found quickly, not the optimal one. Finding the true optimum is NP-hard for large join orders, and heuristics are coarse anyway.
- **"Hash-index lookup is always more expensive than B+Tree":** Disagree. Hash-index is O(1) for equality lookups (1 block access if no overflow). B+Tree is O(log n). But B+Tree supports range queries; hash does not.
- **"SQL injection is developers' responsibility":** Partially agree. Developers must use prepared statements / parameterized queries. But DBAs can also add constraints and access controls as defense-in-depth.
- **"Prepared statements only accelerate processing":** Disagree. They also prevent SQL injection by separating code from data.
- **"Schema-free is better":** Disagree for relational systems. Schemas enforce data integrity, enable query optimization, and prevent inconsistencies. Schema-free suits flexible/evolving data but sacrifices these guarantees.
- **"Transaction processing not needed in DB":** Disagree. Application-level transaction management is complex, error-prone, and unreliable. DB-level transactions (ACID) are essential for consistency in concurrent/failure scenarios.
- **"Any relational app can use aggregate data model":** Partially. Aggregates work for simple access patterns but struggle with complex cross-aggregate queries and relationships that relational joins handle naturally.
- **Provide specific arguments:** Reference concrete consequences (data loss, performance issues, complexity) rather than abstract statements.

**Example (Endterm 2022 Q6a):** "The goal of query optimization is to find the best possible execution plan." — No, the goal is to find a good plan quickly. Finding the optimal plan can be prohibitively expensive (especially for large join orders), and heuristics are coarse approximations anyway.

---

### P3 -- Two-Phase Commit Protocol (2PC) (6 pts)
**How to recognize:** Mark TRUE/FALSE statements about 2PC properties (atomicity, consistency, network partitions, majority voting), and/or analyze a 2PC scenario where a node fails at a specific step and determine what happens next.

**Methods to Solve:**
- **2PC ensures atomicity:** TRUE. All nodes either commit or abort — that is the core purpose of 2PC.
- **2PC ensures serializability/consistency:** FALSE. 2PC handles commit coordination, not concurrency control. You need 2PL or other protocols for serializability.
- **2PC works with network partitions:** FALSE. 2PC is a blocking protocol; it cannot progress during network partitions. Nodes may be stuck waiting indefinitely.
- **2PC commits if majority acknowledges:** FALSE. 2PC requires *all* nodes to vote "ready." Even one "abort" vote causes the entire transaction to abort.
- **2PC prevents inconsistency:** TRUE (when it completes). It ensures all nodes reach the same decision.
- **2PC can be used for concurrency control:** FALSE. 2PC is strictly for distributed commit coordination.
- **Scenario analysis — failure after commit received:** If all nodes received the commit message and logged it, nodes can independently apply changes even if the network goes down. The transaction is already decided.
- **Scenario analysis — failure before all ACKs:** If the coordinator has sent "commit" and all nodes received it but one crashes before ACK — the coordinator can declare success because the commit decision is final once sent. The failed node will recover from its log.
- **Key rule:** Once the coordinator decides to commit (after all "ready" votes), that decision is *irrevocable*. Nodes that crash after receiving "commit" will redo upon recovery.

**Example (Endterm 2020 Q5.1.2):** After coordinator sends commit, all nodes receive it, all send ACK except S1 (pulled plug, disk OK). "Coordinator should send rollback and abort" → FALSE. "Coordinator can declare success after S1 comes back" → TRUE. "Coordinator can declare success even before S1 returns (enough info in S1's log)" → TRUE.

---

### P4 -- Logging & Recovery (UNDO / REDO / UNDO-REDO) (6 pts)
**How to recognize:** Given a table of actions (READ, WRITE, OUTPUT, computations) with memory/disk values, fill in the log entries and FLUSH LOG commands. Alternatively, determine disk values before/after recovery following a crash.

**Methods to Solve:**
- **UNDO logging rules:**
  - Log entry <T, Variable, *OLD* value> is written **before** the WRITE (to memory buffer) or OUTPUT (to disk).
  - FLUSH LOG **before** OUTPUT — the log must be on disk before the data reaches disk.
  - <COMMIT T> is written **after** all OUTPUTs are done.
  - Recovery: For uncommitted transactions, restore old values from log (undo changes).
- **REDO logging rules:**
  - Log entry <T, Variable, *NEW* value> is written at WRITE time.
  - <COMMIT T> is written **before** any OUTPUT to disk.
  - FLUSH LOG before OUTPUT — commit must be on disk before data goes to disk.
  - Recovery: For committed transactions, redo all writes from log to disk. Uncommitted → ignore.
- **UNDO-REDO logging rules:**
  - Log entry <T, Variable, OLD value, NEW value> records both old and new values.
  - FLUSH LOG before OUTPUT (WAL — Write-Ahead Logging).
  - <COMMIT T> can appear at any point relative to OUTPUT.
  - Recovery: Redo committed transactions, undo uncommitted ones.
- **Determining disk values after crash (before recovery):** Track which OUTPUTs executed before the crash. Only OUTPUTs that completed change the disk. Memory state is lost.
- **Determining disk values after recovery:** Apply the appropriate recovery protocol (undo uncommitted / redo committed) based on what's in the log on disk (only entries before the last FLUSH LOG before crash are guaranteed on disk).
- **<START T>:** Always the first log entry for a transaction.

**Example (Resit 2024 Q4 — UNDO):** UNDO log with crash at line 13 (after OUTPUT(A) at line 11, before OUTPUT(B)). Before recovery: A = 20 (OUTPUT completed), B = 4 (OUTPUT not reached). After recovery: since COMMIT was not flushed, transaction is uncommitted → undo A back to old value 6. Final: A = 6, B = 4.

---

### P5 -- CAP Theorem (6 pts)
**How to recognize:** Discuss or evaluate claims about the CAP theorem: whether a distributed system can simultaneously guarantee Consistency, Availability, and Partition tolerance. Often framed as "can a vendor guarantee consistency during network partitions?" or TRUE/FALSE about AP/CP system properties.

**Methods to Solve:**
- **CAP theorem:** In the presence of a network partition (P), a distributed system must choose between Consistency (C) and Availability (A). You can only have 2 of 3.
- **CP system (Consistent + Partition-tolerant):** Sacrifices availability. During a partition, the system may refuse to answer queries to maintain consistency. It provides serializable consistency and can recover to a consistent state after partitions heal.
- **AP system (Available + Partition-tolerant):** Sacrifices consistency. Even during partitions, all live nodes answer queries, but responses may be stale/inconsistent. When partitions heal, the system will eventually reach a consistent state.
- **"Consistency during partition = refuse requests":** TRUE. The only way to guarantee consistency during a partition is to stop serving requests (sacrifice availability). This is the CP trade-off.
- **"Vendor guarantees consistency + availability during partitions":** FALSE. This violates the CAP theorem. The vendor is misleading.
- **"Amazon-like systems enjoy consistency during partitions":** FALSE. Amazon prioritizes availability (AP), so consistency is sacrificed during network disruptions.
- **AP system properties:** (a) Live nodes always return answers — TRUE. (b) Eventually consistent after partitions heal — TRUE. (c) Can execute 2PC during partition — FALSE (2PC requires all nodes).
- **CP system properties:** (a) Provides serializable consistency — TRUE. (b) Can tolerate partitions and recover — TRUE. (c) Replicas can answer during partition — FALSE (availability is sacrificed).

**Example (Endterm 2022 Q4.2):** "The only solution to ensure consistency during network partition is to refuse answering user requests." — Agree. By the CAP theorem, if you want consistency (C) during a partition (P), you must sacrifice availability (A), meaning the system must refuse some requests.

---

### P6 -- Conflict Serializability (Precedence / Conflict Graph) (6 pts)
**How to recognize:** Given a schedule of concurrent transactions with read/write operations, draw the conflict (precedence) graph and determine whether the schedule is conflict-serializable.

**Methods to Solve:**
- **Identify conflicting operations:** Two operations conflict if they are from *different* transactions, access the *same* data item, and *at least one is a write*. The three conflict types: R-W, W-R, W-W.
- **Build the precedence graph:** Create a node for each transaction. For every pair of conflicting operations, draw an edge from T_i → T_j if T_i's operation comes first in the schedule.
- **Check for cycles:** If the graph has a cycle, the schedule is **NOT** conflict-serializable. If acyclic, it IS conflict-serializable, and any topological sort gives a valid serial order.
- **Systematic approach:** Go through the schedule left-to-right. For each write w_i(X), find all later r_j(X) or w_j(X) where j ≠ i → edge i → j. For each read r_i(X), find all later w_j(X) where j ≠ i → edge i → j.
- **State the conclusion clearly:** "The schedule is (not) conflict-serializable because the precedence graph is (a)cyclic." If acyclic, give the equivalent serial order.

**Example (Endterm 2024 Q2):** Schedule: T1 reads(A), T2 writes(A), T1 writes(A). Conflicts on A: T1 r(A) before T2 w(A) → T1→T2. T2 w(A) before T1 w(A) → T2→T1. Cycle: T1→T2→T1. → NOT conflict-serializable.

---

### P7 -- Join Order Optimization (6 pts)
**How to recognize:** Evaluate whether a given operator tree has a good join order, and if not, propose a better one. Argue why a particular join order is better based on intermediate result sizes.

**Methods to Solve:**
- **Heuristic: minimize intermediate result sizes.** Join the smallest relations first to keep intermediate results small throughout the plan.
- **Deep-left trees are generally preferred:** They allow pipelining and are more efficient for nested-loop joins since the inner relation is repeatedly scanned.
- **Push selections before joins:** Apply selections as early as possible to reduce the size of relations before joining.
- **Avoid Cartesian products:** Never join relations that don't share a join attribute unless absolutely necessary. Cartesian products explode intermediate sizes.
- **Join selective relations first:** If one relation becomes very small after selection (e.g., a single ingredient name), join it early to filter down the data.
- **Greedy heuristic:** At each step, pick the join that produces the smallest intermediate result. Start with the two smallest relations/results.
- **Cannot guarantee optimality:** The greedy heuristic finds a good join order, but not necessarily the best one. Full cost-based enumeration is needed for that (but is expensive).
- **Argue with sizes, not just intuition:** Reference table sizes, reduction factors, and expected intermediate sizes to justify your proposed order.

**Example (Endterm 2023 Q1.2.2):** Original tree joins all Users with all Ratings first (100,000 tuples), then filters for Dandelion recipes. Better: start with σ_name='Dandelion'(Ingredient) ⋈ uses ⋈ Recipe ⋈ rating ⋈ User. This filters to ~1 ingredient first, drastically reducing all subsequent joins.

---

### P8 -- Relational Algebra Queries (6 pts)
**How to recognize:** Translate natural-language queries into relational algebra expressions using selection (σ), projection (π), join (⋈), union (∪), difference (−), rename (ρ), and extended operators like grouping/aggregation (γ).

**Methods to Solve:**
- **Selection (σ):** Filter rows by condition. σ_age>30(Customer).
- **Projection (π):** Select specific columns. π_name,age(Customer). Remember: RA uses set semantics (duplicates removed).
- **Natural join (⋈):** Join on shared attribute names. For explicit conditions, use theta-join: R ⋈_R.id=S.id S.
- **Grouping/Aggregation (γ):** γ_group_attrs, agg_func(attr)(R). E.g., γ_city, SUM(revenue)(Company).
- **Set difference (−):** For "never" or "not" queries: find all entities, then subtract those that match the condition.
- **Union (∪):** For "or" queries combining two sets. Both operands must have the same schema.
- **Rename (ρ):** Used for self-joins or to resolve attribute name conflicts.
- **Strategy:** Start from the innermost tables, apply selections and joins step by step, and project at the end. Build inside-out.
- **"At least N" pattern:** Join, group, aggregate with COUNT, then select σ_count≥N.
- **"For each" pattern:** Typically requires grouping (γ) on the grouping attribute.

**Example (Resit 2022 Q2.1):** "For each date in booking, show total payment." → γ_booking_date, SUM(payment_amount)(booking).

---

### P9 -- Reduction Factor Accuracy & Size Estimation Discussion (4 pts)
**How to recognize:** Discuss whether the standard heuristic formula (1/#distinct_values or (value−low)/(max−low+1)) gives accurate size estimates for a specific query. Identify when heuristics fail.

**Methods to Solve:**
- **Uniform distribution assumption:** The heuristic rf = 1/#dv assumes values are uniformly distributed. If true (e.g., unique ingredient names), the estimate is accurate.
- **Correlated attributes:** When two selection conditions are on correlated attributes (e.g., price and square meters in housing), multiplying independent reduction factors produces a wrong estimate. Correlation means the conditions are not independent.
- **Skewed distributions:** If values are not uniformly distributed (e.g., most recipes are difficulty 1-2, few are 5), the uniform assumption over-/under-estimates.
- **Histograms fix non-uniform distributions:** Histograms capture the actual distribution of values per attribute, giving much better reduction factor estimates than the simple 1/#dv formula.
- **Histograms do NOT fix correlation:** Histograms address single-attribute distribution issues, not multi-attribute correlation.
- **Estimate direction:** When attributes are positively correlated (e.g., large apartments cost more) and you select for cheap + large, the estimate will be *too high* because reality has fewer matches than independence assumes.

**Example (Resit 2024 Q2.2):** σ_price<700 ∧ sqm>90(apartment). The heuristic assumes price and sqm are independent, but in reality they are positively correlated (larger = more expensive). Asking for cheap AND large yields far fewer results than estimated → estimate is too high.

---

### P10 -- Algebraic Query Optimization Heuristics (4 pts)
**How to recognize:** Identify which algebraic heuristics (push selections down, push projections down, replace selection+Cartesian product with join) should be applied to an operator tree, and discuss their impact on query cost.

**Methods to Solve:**
- **Push selections down:** Move σ operators as close to the leaf (base) relations as possible. This reduces the number of tuples early, making subsequent joins cheaper.
- **Push projections down:** Move π operators down to remove unnecessary attributes early, reducing tuple sizes and thus block counts. *But be careful:* do not remove attributes needed by later joins or selections!
- **Replace σ + × with ⋈:** A selection over a Cartesian product is equivalent to a join. Joins are much cheaper than computing the full Cartesian product first.
- **Impact on cost-based results:** After pushing selections, intermediate result sizes shrink, which changes the cost calculations for algorithm selection. Recompute sizes with the new tree.
- **Pushing projection pitfall:** If you remove an attribute that is needed for a join higher in the tree, the query becomes *incorrect* (not just suboptimal). Always verify that projected attributes include everything needed upstream.

**Example (Resit 2024 Q2.4):** Removing a_id from projection π_o_id, a_id, end_date to π_o_id, end_date seems like "push projection" optimization. But a_id is needed for the join at i7 above → this breaks the query. Answer: NOT beneficial.

---

### P11 -- Advanced SQL Queries (6 pts)
**How to recognize:** Write SQL queries involving JOINs, GROUP BY, HAVING, subqueries (EXISTS, IN, NOT EXISTS), CTEs (WITH clause), aggregate functions, and conditions on dates/ages.

**Methods to Solve:**
- **GROUP BY + HAVING:** Use GROUP BY for per-group aggregation. HAVING filters groups (not rows). E.g., "agencies with at least 10 bookings" → GROUP BY t_id HAVING COUNT(*) ≥ 10.
- **WITH (CTE):** Use Common Table Expressions to break complex queries into readable steps. E.g., first compute city totals, then find the minimum.
- **NOT EXISTS for negation:** "Customers who never visited X" → WHERE NOT EXISTS (SELECT 1 FROM visits WHERE ...). This is the standard pattern for universal negation.
- **Correlated subqueries:** For "same price to all companies" → use NOT EXISTS with a self-join checking for different prices.
- **JOIN strategy:** Identify which tables to join by tracing foreign key relationships in the schema. Only join tables needed for the query.
- **Date comparisons:** Use standard date arithmetic. "Booking performed at least 3 days before starting" → start_date − booking_date ≥ 3.
- **DISTINCT:** Use when the query asks for unique results or when joins might produce duplicates.
- **Set operators:** UNION, INTERSECT, EXCEPT for combining/comparing result sets.

**Example (Resit 2022 Q1.1):** "List travel agencies with ≥10 bookings after 01-01-2022." → SELECT ta.name, COUNT(*) FROM travel_agency ta JOIN booking b ON ta.t_id = b.t_id WHERE b.booking_date > '2022-01-01' GROUP BY ta.t_id, ta.name HAVING COUNT(*) ≥ 10.

---

### P12 -- Indexing Technique Selection (Hash-Index vs B+Tree) (4 pts)
**How to recognize:** Given queries with different WHERE conditions, select which index type (Hash-Index or B+Tree) is best suited for each attribute used. Also includes TRUE/FALSE statements about index properties.

**Methods to Solve:**
- **Hash-Index:** Best for *equality* lookups (= operator). O(1) access. Cannot handle range queries (<, >, BETWEEN).
- **B+Tree:** Supports both equality and *range* queries. Cost is O(log n) = tree height. Preferred when the query uses <, >, ≤, ≥, BETWEEN, or ORDER BY.
- **If only equality:** Hash-Index is better (fewer block accesses: typically 1 vs tree height).
- **If range query:** B+Tree is the only viable option. Hash-Index cannot be used.
- **Primary key with equality:** Hash is optimal if no range queries on that key.
- **"Does not matter":** If no index exists on the attribute, or if a full table scan is needed regardless, the index type doesn't matter (mark N/A or "does not matter").
- **Secondary indexes:** Multiple secondary indexes can exist per table. They *decrease* write performance (need to update index on insert/delete) but *improve* read performance for selections on non-key attributes.
- **Relation scan vs Index scan:** When selectivity is very low (many tuples match), a full relation scan may be cheaper than an index scan because index scan has per-tuple overhead.

**Example (Endterm 2020 Q3.1):** SELECT * FROM Person WHERE age = 32 → age: Hash-Index (equality), id: N/A (not used). SELECT * FROM Person WHERE id > 23 AND age > 32 → age: B+Tree (range), id: B+Tree (range).

---

### P13 -- B+Tree / KD-Tree Insertion (4 pts) [RESITS]
**How to recognize:** Draw the resulting B+Tree after inserting a value (may require node splits), or draw a KD-Tree after inserting a multi-dimensional point.

**Methods to Solve:**
- **B+Tree insertion:**
  - Navigate to the correct leaf node using the search key.
  - If the leaf has space, insert the key in sorted order.
  - If the leaf is full, **split**: distribute keys evenly between old and new leaf. The *middle key* is copied up to the parent (for leaf splits) or pushed up (for internal node splits).
  - If the parent is also full, recursively split upward. A split at the root creates a new root.
  - Leaf nodes maintain a linked list (next-leaf pointers).
- **KD-Tree insertion:**
  - Alternate splitting dimensions at each level (e.g., level 0: Salary, level 1: Age, level 2: Salary, ...).
  - Navigate: compare the insertion point's value on the current dimension with the node's splitting value.
  - Go left if < (or ≤ depending on convention), right if ≥ (or >).
  - Insert as a new leaf when reaching a null child pointer.
- **Pay attention to conventions:** The exam specifies whether left is ≤ or <. Follow exactly as stated.

**Example (Resit 2020 Q4.1):** Insert 100 into a B+Tree with capacity 3 keys per node. 100 goes to the rightmost leaf [46, 47, 50], which is full. Split: [46, 47] and [50, 100]. Push 50 up to parent [22, 33, 44], which is also full. Split parent: [22, 33] and [50]. Push 44 up as new root.

---

### P14 -- Row-Store vs Column-Store (4 pts)
**How to recognize:** Given query characteristics (OLTP updates, full-row retrieval, column aggregations), determine whether row-wise or column-wise storage is more suitable.

**Methods to Solve:**
- **Row-store (OLTP):** Best for queries that access *complete rows* (SELECT *, INSERT, UPDATE, DELETE). Each row is stored contiguously → one I/O to read/write a full record.
- **Column-store (OLAP):** Best for queries that aggregate *specific columns* (SUM, AVG, COUNT on one attribute). Each column is stored contiguously → reads only the needed column's blocks.
- **Updates on multiple tables with joins:** Row-store. Updates touch entire rows, and row-store keeps all attributes together.
- **Queries returning complete rows:** Row-store. Column-store would need to reconstruct rows from separate column files.
- **Aggregations over specific columns:** Column-store. Only reads the relevant column, skipping all others → far fewer I/Os.

---

### P15 -- Two-Phase Locking (2PL) Schedule Validation (4 pts)
**How to recognize:** Given schedules using lock/unlock/read/write notation (L, U, R, W), determine whether each schedule follows a valid two-phase locking protocol.

**Methods to Solve:**
- **2PL rule:** For each transaction, there must be a **growing phase** (only acquires locks) followed by a **shrinking phase** (only releases locks). Once a transaction releases *any* lock, it must never acquire another.
- **Check per transaction:** For each transaction T_i, extract all L_i and U_i operations in order. Verify that no L_i appears after any U_i.
- **Other rules:** A transaction must lock an item before reading/writing it. It must not lock an item it already holds. These are basic well-formedness rules.
- **If ANY transaction violates 2PL:** The entire schedule is NOT a valid 2PL schedule.
- **ALL transactions must comply:** Every transaction in the schedule must individually follow the growing-then-shrinking pattern.

**Example (Endterm 2020 Q5.3):** Schedule: L1(B);R1(B); L2(A);R2(A); W1(B); U1(B);L3(B);W3(B); L1(C);... — Transaction 1 unlocks B then later locks C → violation of 2PL for T1. Schedule is FALSE (not valid 2PL).

---

### P16 -- Data Integration (Schema Matching, Merging, Quality) (6 pts)
**How to recognize:** Describe the key tasks of data integration (e.g., when two companies merge databases) and list data quality dimensions with examples.

**Methods to Solve:**
- **Data integration tasks (know any 4):**
  - **Schema matching:** Find correspondences between attributes in different source schemas (e.g., "product_name" in DB1 = "item_title" in DB2).
  - **Schema merging:** Combine matched schemas into a unified global schema.
  - **Schema mapping:** Define how to translate queries/data between the global schema and source schemas.
  - **Data matching:** Identify records in different databases that refer to the same real-world entity (entity resolution/deduplication).
  - **Query reformulation:** Translate queries on the global schema into sub-queries on the source schemas.
- **Data quality dimensions (know any 2):**
  - **Accuracy:** Data values match real-world values.
  - **Completeness:** No missing data; all real-world states are represented.
  - **Timeliness:** Data is sufficiently up-to-date for the task.
  - **Consistency:** No contradictions; data follows defined semantic rules.
  - **Duplication:** No unwanted duplicate records within or across systems.

**Example (Endterm 2023 Q4.3):** Jumbo and Albert Heijn merge. Schema matching: find that Jumbo's "artikel" = AH's "product". Data matching: identify that "Coca-Cola 1L" in both DBs is the same item. Data quality issue: Duplication (same product in both DBs) and Consistency (different price formats).

---

### P17 -- Conceptual Modelling (EER Diagrams) [RESITS]
**How to recognize:** Given a scenario description, complete or design a Conceptual EER schema: identify entities, relationships, attributes, primary keys, cardinalities, specialization/generalization (ISA), and participation constraints.

**Methods to Solve:**
- **Identify entities:** Nouns that represent independent objects (Camper, Staff, Facility, Activity).
- **Identify relationships:** Verbs connecting entities (Participates, Uses, Organizes). Determine cardinalities (1:1, 1:N, M:N).
- **Attributes:** Properties of entities. Underline primary keys. Use ovals in ER notation.
- **Specialization/Generalization (ISA):** If entities share common attributes but have distinct sub-types (e.g., Staff → C1/C2/C3), use ISA hierarchy. Mark as total/partial and disjoint/overlapping.
- **Weak entities:** Entities that cannot be uniquely identified without their owner entity (partial key + owner's key).
- **Participation constraints:** Total (every entity must participate) vs partial. Double line = total participation.
- **Constraints that cannot be modeled:** Age ranges (5-17), "exactly 2 campers per tent," minimum age requirements — these are business rules that ER diagrams cannot express.
- **Multi-valued attributes:** Use double-line ovals for attributes that can have multiple values.

---

### P18 -- Functional Dependencies & Normalization (2NF, 3NF) [RESITS]
**How to recognize:** Given a relation R(A, B, C, D, E) and a set of functional dependencies, derive candidate keys, then decompose the relation into 2NF and 3NF.

**Methods to Solve:**
- **Finding candidate keys:**
  - Compute the closure of attribute sets. A candidate key K is minimal such that K⁺ = all attributes.
  - Start with attributes that appear only on the left side of FDs (they must be in every key).
  - Use Armstrong's axioms: reflexivity, augmentation, transitivity to expand closures.
- **2NF decomposition:**
  - A relation is in 2NF if every non-prime attribute is *fully* functionally dependent on every candidate key (no partial dependencies).
  - If a non-prime attribute depends on only *part* of a composite key, decompose: create a separate relation for that partial dependency.
  - If the key is a single attribute, the relation is automatically in 2NF.
- **3NF decomposition (dependency-preserving):**
  - A relation is in 3NF if for every FD X → A, either X is a superkey or A is a prime attribute (part of some candidate key).
  - Algorithm: For each FD X → Y, create a relation R(X, Y). Then ensure a relation containing a candidate key exists. Merge relations with the same key.
  - If already in 3NF, write R(A, B, C, D, E) unchanged.
- **Common FD patterns:**
  - {A → BCDE}: Key is A, already in 2NF and 3NF.
  - {B → CD}: Key must include B+more. Check for partial deps.
  - {A → B, B → CE}: Transitive dependency A → B → CE. Violates 3NF. Decompose into R1(B, C, E) and R2(A, B, D).

---

## Confidence Assessment (based on this session)

### Strong — solved with minimal hints

### Needs review — required multiple hints or conceptual confusion

### Not attempted — review if time permits

---

## Points Distribution Summary

| Category | Patterns | Points | % of exam |
|----------|----------|--------|-----------|
| Cost-Based Access Path Selection | P1 | 10 | ~18% |
| Transaction & Concurrency (2PC + Logging/Recovery + Conflict Serializability + 2PL) | P3 + P4 + P6 + P15 | 22 | ~39% |
| General Discussion (Agree/Disagree) | P2 | 6 | ~11% |
| CAP Theorem | P5 | 6 | ~11% |
| Query Optimization (Join Order + Algebraic Heuristics + Reduction Factors) | P7 + P10 + P9 | 14 | ~25% |
| SQL & Relational Algebra | P8 + P11 | 12 | ~21% |
| Indexing | P12 | 4 | ~7% |
| Data Integration | P16 | 6 | ~11% |
| Row-Store vs Column-Store | P14 | 4 | ~7% |
| **Total (Endterm-relevant)** | **15 patterns** | **~56** | **~100%** |
| Resit-only patterns (P13, P17, P18) | 3 patterns | ~12 | (Tier 4 — review if time) |

**Priority: Transaction/Concurrency (~39%) and Query Optimization (~25%) together cover ~64% of the endterm exam. Study P1, P4, P6, P7, P2, and P5 first.**
