CSE1505 -- Exam Pattern Analysis
Based on 8 past exams (2020--2024): 4 Endterms, 4 Resits
1. Cost-Based Access Path Selection (Size Estimation & Algorithm Choice) Very High
END 2020 Q5.1 • 2022 Q5.1 • 2023 Q1.1 • 2024 Q1.1 |
RES 2020 Q4.2 • 2022 Q3.2 • 2024 Q1
Given a schema with table/column statistics and an operator tree, estimate the size of intermediate results (number of tuples and blocks using reduction factors), then choose the cheapest algorithm (Relation Scan, Index Scan, BNL, INL, MJ) by computing block-access costs.
Methods to Solve
- Reduction factor for equality selections: rf = 1 / #distinct_values(column). E.g., σname='X' ⇒ rf = 1 / #dv(name).
- Reduction factor for range selections: rf = (value − min) / (max − min + 1) for the < operator, or (max − value) / (max − min + 1) for the > operator.
- Combined reduction factor: Multiply individual reduction factors when multiple conditions exist: rf = rf1 × rf2.
- Estimated tuples: #records × rf. Round up to next integer for result set size estimations.
- Bytes per record: Sum all attribute sizes. Calculate records/block = ⌊block_size / bytes_per_record⌋. Blocks = ⌈tuples / records_per_block⌉.
- Selection cost — Relation Scan: c_RS = #blocks(table). Read the entire table.
- Selection cost — Index Scan: c_IS = #matching_tuples × (c_ix + 1), where c_ix = index access cost (tree height − 1, or given). For multiple conditions on individually indexed attributes, use the index on the most selective attribute and filter the rest in memory.
- Join size estimation: |R ⋈ S| = (|R| × |S|) / max(#dv(R.join_attr), #dv(S.join_attr)).
- Join cost — Block Nested Loop (BNL): c_BNL = #blocks_outer + (#blocks_outer × #blocks_inner) + c_result. Use the smaller relation as the outer.
- Join cost — Index Nested Loop (INL): c_INL = #blocks_outer + #records_outer × (c_ix + 1) + c_result. Requires an index on the inner relation's join attribute.
- Join cost — Sort-Merge Join (MJ): Only applicable if both relations are sorted on the join attribute. c_MJ = #blocks_R + #blocks_S + c_result.
- Commutativity: Always check if swapping inner/outer in BNL or INL gives a cheaper cost. The join operator is commutative!
- c_result: Always include the cost of writing the intermediate result to disk (= #blocks of the result).
Example (Resit 2024 Q1.1): For σprice<1000 ∧ max_occupants=3(Apartment) with 10,000 records, 69 blocks: rf_price = 500/2001 ≈ 0.25, rf_max_occ = 1/6. Combined rf ≈ 0.0416 ⇒ 417 records, 3 blocks. Index Scan on max_occupants: 1/6 × 10,000 × 2 ≈ 3334 block accesses. Relation Scan: 69 blocks. ⇒ Relation Scan is cheaper.
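The worked example above can be replayed as a quick sanity check. A minimal Python sketch (statistics taken from the example; variable and helper names are mine):

```python
import math

# Statistics from the Resit 2024 Q1.1 example.
n_records = 10_000
n_blocks = 69
c_ix = 1  # per-lookup index access cost (tree height - 1)

# Range selection price < 1000 with min = 500, max = 2500,
# equality selection max_occupants = 3 with 6 distinct values.
rf_price = (1000 - 500) / (2500 - 500 + 1)   # ~0.25
rf_occ = 1 / 6
rf = rf_price * rf_occ

est_tuples = math.ceil(n_records * rf)       # round up

# Index scan on the most selective indexed attribute (max_occupants):
# every tuple matching that condition costs c_ix + 1 block accesses.
c_index_scan = math.ceil(rf_occ * n_records) * (c_ix + 1)

c_relation_scan = n_blocks

print(est_tuples, c_index_scan, c_relation_scan)  # 417 3334 69
```

Since 69 < 3334, the relation scan wins, exactly as argued in the example.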
2. General Discussion Questions (Agree/Disagree with Claims) Very High
END 2020 Q6 • 2022 Q6 • 2023 Q4 • 2024 Q4 |
RES 2020 Q6 • 2022 (within Q4) • 2024 Q2.2--2.5
A statement is presented (e.g., about query optimization, transactions, NoSQL, SQL injection, schemas) and you must agree, disagree, or say "it depends," providing 1--2 strong arguments with justification. Recurring topics: "best possible execution plan" goal, transaction processing necessity, SQL injection responsibility, schema-free vs schema-based, aggregate data models, hash vs B+Tree lookups.
Methods to Solve
- Take a nuanced position: Pure "agree" or "disagree" without qualification often scores 0. "It depends" with good reasoning is usually best.
- "Goal of query optimization is the best plan": Disagree. The goal is a good enough plan found quickly, not the optimal one. Finding the true optimum is NP-hard for large join orders, and heuristics are coarse anyway.
- "Hash-index lookup is always more expensive than B+Tree": Disagree. Hash-index is O(1) for equality lookups (1 block access if no overflow). B+Tree is O(log n). But B+Tree supports range queries; hash does not.
- "SQL injection is developers' responsibility": Partially agree. Developers must use prepared statements / parameterized queries. But DBAs can also add constraints and access controls as defense-in-depth.
- "Prepared statements only accelerate processing": Disagree. They also prevent SQL injection by separating code from data.
- "Schema-free is better": Disagree for relational systems. Schemas enforce data integrity, enable query optimization, and prevent inconsistencies. Schema-free suits flexible/evolving data but sacrifices these guarantees.
- "Transaction processing not needed in DB": Disagree. Application-level transaction management is complex, error-prone, and unreliable. DB-level transactions (ACID) are essential for consistency in concurrent/failure scenarios.
- "Any relational app can use aggregate data model": Partially. Aggregates work for simple access patterns but struggle with complex cross-aggregate queries and relationships that relational joins handle naturally.
- Provide specific arguments: Reference concrete consequences (data loss, performance issues, complexity) rather than abstract statements.
Example (Endterm 2022 Q6a): "The goal of query optimization is to find the best possible execution plan." — No, the goal is to find a good plan quickly. Finding the optimal plan can be prohibitively expensive (especially for large join orders), and heuristics are coarse approximations anyway.
3. Two-Phase Commit Protocol (2PC) Very High
END 2020 Q5.1 • 2022 Q4.1 |
RES 2020 Q5.1 • 2022 Q4.1 • 2023 Q5.1 • 2024 Q3.1
Mark TRUE/FALSE statements about 2PC properties (atomicity, consistency, network partitions, majority voting), and/or analyze a 2PC scenario where a node fails at a specific step and determine what happens next.
Methods to Solve
- 2PC ensures atomicity: TRUE. All nodes either commit or abort — that is the core purpose of 2PC.
- 2PC ensures serializability/consistency: FALSE. 2PC handles commit coordination, not concurrency control. You need 2PL or other protocols for serializability.
- 2PC works with network partitions: FALSE. 2PC is a blocking protocol; it cannot progress during network partitions. Nodes may be stuck waiting indefinitely.
- 2PC commits if majority acknowledges: FALSE. 2PC requires all nodes to vote "ready." Even one "abort" vote causes the entire transaction to abort.
- 2PC prevents inconsistency: TRUE (when it completes). It ensures all nodes reach the same decision.
- 2PC can be used for concurrency control: FALSE. 2PC is strictly for distributed commit coordination.
- Scenario analysis — failure after commit received: If all nodes received the commit message and logged it, nodes can independently apply changes even if the network goes down. The transaction is already decided.
- Scenario analysis — failure before all ACKs: If the coordinator has sent "commit" and all nodes received it, but one crashes before sending its ACK, the coordinator can still declare success: the commit decision became final when it was logged. The failed node will redo the commit from its log on recovery.
- Key rule: Once the coordinator decides to commit (after all "ready" votes), that decision is irrevocable. Nodes that crash after receiving "commit" will redo upon recovery.
Example (Endterm 2020 Q5.1.2): After coordinator sends commit, all nodes receive it, all send ACK except S1 (pulled plug, disk OK). "Coordinator should send rollback and abort" ⇒ FALSE. "Coordinator can declare success after S1 comes back" ⇒ TRUE. "Coordinator can declare success even before S1 returns (enough info in S1's log)" ⇒ TRUE.
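The all-or-nothing voting rule above is the crux of most TRUE/FALSE items. A minimal sketch of the coordinator's phase-1 decision (function and node names are mine, not from the course):

```python
def coordinator_decision(votes):
    """Phase 1 outcome: commit only if every participant voted 'ready'.

    A single 'abort', or a missing vote / timeout (modeled here as None),
    forces a global abort. Majorities play no role in 2PC.
    """
    if all(v == "ready" for v in votes.values()):
        return "commit"  # irrevocable once logged and sent
    return "abort"

print(coordinator_decision({"S1": "ready", "S2": "ready"}))  # commit
print(coordinator_decision({"S1": "ready", "S2": "abort"}))  # abort
print(coordinator_decision({"S1": "ready", "S2": None}))     # abort (timeout)
```

This also explains why 2PC blocks during partitions: a participant whose vote cannot reach the coordinator is indistinguishable from an aborting one, so no progress can be made.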
4. Logging & Recovery (UNDO / REDO / UNDO-REDO) Very High
END 2020 Q5.2 • 2023 Q3 • 2024 Q3 |
RES 2020 Q5.2 • 2022 Q4.2 • 2024 Q4
Given a table of actions (READ, WRITE, OUTPUT, computations) with memory/disk values, fill in the log entries and FLUSH LOG commands. Alternatively, determine disk values before/after recovery following a crash. Variants: UNDO logging, REDO logging, or UNDO-REDO logging.
Methods to Solve
- UNDO logging rules:
- Log entry <T, Variable, OLD value> is written at WRITE time (before the new value overwrites the buffer), and must be on disk before the corresponding OUTPUT.
- FLUSH LOG before OUTPUT — the log must be on disk before the data reaches disk.
- <COMMIT T> is written after all OUTPUTs are done.
- Recovery: For uncommitted transactions, restore old values from log (undo changes).
- REDO logging rules:
- Log entry <T, Variable, NEW value> is written at WRITE time.
- <COMMIT T> is written before any OUTPUT to disk.
- FLUSH LOG before OUTPUT — commit must be on disk before data goes to disk.
- Recovery: For committed transactions, redo all writes from log to disk. Uncommitted ⇒ ignore.
- UNDO-REDO logging rules:
- Log entry <T, Variable, OLD value, NEW value> records both old and new values.
- FLUSH LOG before OUTPUT (WAL — Write-Ahead Logging).
- <COMMIT T> can appear at any point relative to OUTPUT.
- Recovery: Redo committed transactions, undo uncommitted ones.
- Determining disk values after crash (before recovery): Track which OUTPUTs executed before the crash. Only OUTPUTs that completed change the disk. Memory state is lost.
- Determining disk values after recovery: Apply the appropriate recovery protocol (undo uncommitted / redo committed) based on what's in the log on disk (only entries before the last FLUSH LOG before crash are guaranteed on disk).
- <START T>: Always the first log entry for a transaction.
Example (Resit 2024 Q4 — UNDO): UNDO log with crash at line 13 (after OUTPUT(A) at line 11, before OUTPUT(B)). Before recovery: A = 20 (OUTPUT completed), B = 4 (OUTPUT not reached). After recovery: since COMMIT was not flushed, transaction is uncommitted ⇒ undo A back to old value 6. Final: A = 6, B = 4.
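The UNDO recovery rule (restore old values of uncommitted transactions, scanning the log backwards) can be sketched in a few lines. This is my own illustrative code, not lecture material; it replays the Resit 2024 Q4 scenario:

```python
def undo_recover(log, disk):
    """UNDO-log recovery sketch.

    log: list of entries, either ("START", txn), ("COMMIT", txn), or an
         update triple (txn, variable, old_value).
    disk: dict of on-disk values at crash time.
    Restores old values of uncommitted transactions, newest entry first.
    """
    committed = {e[1] for e in log if e[0] == "COMMIT"}
    for entry in reversed(log):
        if len(entry) == 3 and entry[0] not in committed:
            txn, var, old = entry
            disk[var] = old
    return disk

# Resit 2024 Q4: crash after OUTPUT(A), before OUTPUT(B); COMMIT not flushed.
log = [("START", "T"), ("T", "A", 6), ("T", "B", 4)]
print(undo_recover(log, {"A": 20, "B": 4}))  # {'A': 6, 'B': 4}
```

Because no <COMMIT T> reached the disk-resident log, A is rolled back from 20 to 6, matching the example's final state.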
5. CAP Theorem Very High
END 2020 Q4.2 • 2022 Q4.2 • 2023 Q4.2 • 2024 Q4.2 |
RES 2020 Q6a • 2024 Q3.2
Discuss or evaluate claims about the CAP theorem: whether a distributed system can simultaneously guarantee Consistency, Availability, and Partition tolerance. Often framed as "can a vendor guarantee consistency during network partitions?" or TRUE/FALSE about AP/CP system properties.
Methods to Solve
- CAP theorem: In the presence of a network partition (P), a distributed system must choose between Consistency (C) and Availability (A). You can only have 2 of 3.
- CP system (Consistent + Partition-tolerant): Sacrifices availability. During a partition, the system may refuse to answer queries to maintain consistency. It provides serializable consistency and can recover to a consistent state after partitions heal.
- AP system (Available + Partition-tolerant): Sacrifices consistency. Even during partitions, all live nodes answer queries, but responses may be stale/inconsistent. When partitions heal, the system will eventually reach a consistent state.
- "Consistency during partition = refuse requests": TRUE. The only way to guarantee consistency during a partition is to stop serving requests (sacrifice availability). This is the CP trade-off.
- "Vendor guarantees consistency + availability during partitions": FALSE. This violates the CAP theorem. The vendor is misleading.
- "Amazon-like systems enjoy consistency during partitions": FALSE. Amazon prioritizes availability (AP), so consistency is sacrificed during network disruptions.
- AP system properties: (a) Live nodes always return answers — TRUE. (b) Eventually consistent after partitions heal — TRUE. (c) Can execute 2PC during partition — FALSE (2PC requires all nodes).
- CP system properties: (a) Provides serializable consistency — TRUE. (b) Can tolerate partitions and recover — TRUE. (c) Replicas can answer during partition — FALSE (availability is sacrificed).
Example (Endterm 2022 Q4.2): "The only solution to ensure consistency during network partition is to refuse answering user requests." — Agree. By the CAP theorem, if you want consistency (C) during a partition (P), you must sacrifice availability (A), meaning the system must refuse some requests.
6. Conflict Serializability (Precedence / Conflict Graph) Very High
END 2020 Q4.3 • 2022 Q4.3 • 2023 Q2 • 2024 Q2 |
RES 2022 Q4.3
Given a schedule of concurrent transactions with read/write operations, draw the conflict (precedence) graph and determine whether the schedule is conflict-serializable. If there is a cycle, it is NOT conflict-serializable.
Methods to Solve
- Identify conflicting operations: Two operations conflict if they are from different transactions, access the same data item, and at least one is a write. The three conflict types: R-W, W-R, W-W.
- Build the precedence graph: Create a node for each transaction. For every pair of conflicting operations, draw an edge from Ti → Tj if Ti's operation comes first in the schedule.
- Check for cycles: If the graph has a cycle, the schedule is NOT conflict-serializable. If acyclic, it IS conflict-serializable, and any topological sort gives a valid serial order.
- Systematic approach: Go through the schedule left-to-right. For each write wi(X), find all later rj(X) or wj(X) where j ≠ i ⇒ edge i → j. For each read ri(X), find all later wj(X) where j ≠ i ⇒ edge i → j.
- State the conclusion clearly: "The schedule is (not) conflict-serializable because the precedence graph is (a)cyclic." If acyclic, give the equivalent serial order.
Example (Endterm 2024 Q2): Schedule: T1 reads(A), T2 writes(A), T1 writes(A). Conflicts on A: T1 r(A) before T2 w(A) ⇒ T1→T2. T2 w(A) before T1 w(A) ⇒ T2→T1. Cycle: T1→T2→T1. ⇒ NOT conflict-serializable.
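The edge-building and cycle-check steps above are mechanical enough to automate. A small Python sketch (the schedule encoding and function names are mine):

```python
def precedence_edges(schedule):
    """Edges of the precedence graph for a schedule of (txn, op, item)
    triples with op in {'r', 'w'}: Ti -> Tj whenever an operation of Ti
    conflicts with a LATER operation of Tj (same item, at least one write)."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))
    return edges

def is_conflict_serializable(schedule):
    """Conflict-serializable iff the precedence graph is acyclic (DFS check)."""
    edges = precedence_edges(schedule)
    nodes = {t for t, _, _ in schedule}
    adj = {n: [b for a, b in edges if a == n] for n in nodes}

    def has_cycle(n, stack):
        if n in stack:
            return True
        return any(has_cycle(m, stack | {n}) for m in adj[n])

    return not any(has_cycle(n, set()) for n in nodes)

# Endterm 2024 Q2 schedule: r1(A), w2(A), w1(A) -> cycle T1 -> T2 -> T1.
s = [("T1", "r", "A"), ("T2", "w", "A"), ("T1", "w", "A")]
print(is_conflict_serializable(s))  # False
```

On the exam you draw the graph by hand, but the same two passes (collect conflicts left-to-right, then look for a cycle) apply.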
7. Join Order Optimization Very High
END 2020 Q5.2a • 2022 Q5.2 • 2023 Q1.2.2 • 2024 Q1.2.2 |
RES 2024 Q2.1
Evaluate whether a given operator tree has a good join order, and if not, propose a better one. Argue why a particular join order is better based on intermediate result sizes (without necessarily computing exact costs).
Methods to Solve
- Heuristic: minimize intermediate result sizes. Join the smallest relations first to keep intermediate results small throughout the plan.
- Left-deep trees are generally preferred: They allow pipelining of intermediate results, and the inner operand of each join is a base relation, so existing indexes can be exploited (e.g., for INL).
- Push selections before joins: Apply selections as early as possible to reduce the size of relations before joining.
- Avoid Cartesian products: Never join relations that don't share a join attribute unless absolutely necessary. Cartesian products explode intermediate sizes.
- Join selective relations first: If one relation becomes very small after selection (e.g., a single ingredient name), join it early to filter down the data.
- Greedy heuristic: At each step, pick the join that produces the smallest intermediate result. Start with the two smallest relations/results.
- Cannot guarantee optimality: The greedy heuristic finds a good join order, but not necessarily the best one. Full cost-based enumeration is needed for that (but is expensive).
- Argue with sizes, not just intuition: Reference table sizes, reduction factors, and expected intermediate sizes to justify your proposed order.
Example (Endterm 2023 Q1.2.2): Original tree joins all Users with all Ratings first (100,000 tuples), then filters for Dandelion recipes. Better: start with σname='Dandelion'(Ingredient) ⋈ uses ⋈ Recipe ⋈ rating ⋈ User. This filters to ~1 ingredient first, drastically reducing all subsequent joins.
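The greedy heuristic from the list above can be sketched as follows. The relation names and the size estimator here are illustrative stand-ins (the real estimator would use the dv-based join-size formula from topic 1):

```python
def greedy_join_order(sizes, est):
    """Greedy join-order heuristic: at each step, join the pair with the
    smallest estimated result. Finds a good order, not necessarily the best.

    sizes: dict mapping relation name -> tuple count.
    est(sa, sb): estimated join result size from the two input sizes.
    """
    current = dict(sizes)
    order = []
    while len(current) > 1:
        pairs = [(a, b) for a in current for b in current if a < b]
        a, b = min(pairs, key=lambda p: est(current[p[0]], current[p[1]]))
        result = est(current[a], current[b])
        order.append((a, b, result))
        del current[a], current[b]
        current[a + "⋈" + b] = result  # intermediate result joins on later
    return order

# Toy estimator standing in for |R|*|S| / max(dv): here a flat divisor.
est = lambda x, y: x * y // 1000
print(greedy_join_order({"Ingredient": 10, "Recipe": 500, "User": 2000}, est))
```

Note how the two smallest inputs are joined first, keeping every intermediate result small, which is exactly the argument expected in the exam answer.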
8. Relational Algebra Queries Very High
END 2020 Q2 • 2022 Q2 |
RES 2020 Q3 • 2022 Q2 • 2023 Q3
Translate natural-language queries into relational algebra expressions using selection (σ), projection (π), join (⋈), union (∪), difference (−), rename (ρ), and extended operators like grouping/aggregation (γ).
Methods to Solve
- Selection (σ): Filter rows by condition. σage>30(Customer).
- Projection (π): Select specific columns. πname, age(Customer). Remember: RA uses set semantics (duplicates removed).
- Natural join (⋈): Join on shared attribute names. For explicit conditions, use theta-join: R ⋈R.id=S.id S.
- Grouping/Aggregation (γ): γgroup_attrs, agg_func(attr)(R). E.g., γcity, SUM(revenue)(Company).
- Set difference (−): For "never" or "not" queries: find all entities, then subtract those that match the condition.
- Union (∪): For "or" queries combining two sets. Both operands must have the same schema.
- Rename (ρ): Used for self-joins or to resolve attribute name conflicts.
- Strategy: Start from the innermost tables, apply selections and joins step by step, and project at the end. Build inside-out.
- "At least N" pattern: Join, group, aggregate with COUNT, then select σcount≥N.
- "For each" pattern: Typically requires grouping (γ) on the grouping attribute.
Example (Resit 2022 Q2.1): "For each date in booking, show total payment." ⇒ γbooking_date, SUM(payment_amount)(booking).
9. Reduction Factor Accuracy & Size Estimation Discussion High
END 2023 Q1.2.1 • 2024 Q1.2.1 |
RES 2024 Q2.2 • 2024 Q2.3
Discuss whether the standard heuristic formula (1/#distinct_values or (value−low)/(max−low+1)) gives accurate size estimates for a specific query. Identify when heuristics fail (correlated attributes, skewed distributions) and suggest improvements like histograms.
Methods to Solve
- Uniform distribution assumption: The heuristic rf = 1/#dv assumes values are uniformly distributed. If true (e.g., unique ingredient names), the estimate is accurate.
- Correlated attributes: When two selection conditions are on correlated attributes (e.g., price and square meters in housing), multiplying independent reduction factors produces a wrong estimate. Correlation means the conditions are not independent.
- Skewed distributions: If values are not uniformly distributed (e.g., most recipes are difficulty 1-2, few are 5), the uniform assumption over-/under-estimates.
- Histograms fix non-uniform distributions: Histograms capture the actual distribution of values per attribute, giving much better reduction factor estimates than the simple 1/#dv formula.
- Histograms do NOT fix correlation: Histograms address single-attribute distribution issues, not multi-attribute correlation.
- Estimate direction: When attributes are positively correlated (e.g., large apartments cost more) and you select for cheap + large, the estimate will be too high because reality has fewer matches than independence assumes.
Example (Resit 2024 Q2.2): σprice<700 ∧ sqm>90(apartment). The heuristic assumes price and sqm are independent, but in reality they are positively correlated (larger = more expensive). Asking for cheap AND large yields far fewer results than estimated ⇒ estimate is too high.
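The correlation effect in the example is easy to demonstrate on synthetic data. All numbers below are made up for illustration (price grows with size by construction), not exam statistics:

```python
import random

random.seed(42)

# Synthetic apartments where price is positively correlated with size.
apartments = []
for _ in range(10_000):
    sqm = random.randint(30, 150)
    price = sqm * 10 + random.randint(-100, 100)  # bigger -> pricier
    apartments.append((price, sqm))

n = len(apartments)
sel_price = sum(p < 700 for p, s in apartments) / n   # selectivity of price < 700
sel_sqm = sum(s > 90 for p, s in apartments) / n      # selectivity of sqm > 90
actual = sum(p < 700 and s > 90 for p, s in apartments) / n

independent_estimate = sel_price * sel_sqm
print(f"independence estimate: {independent_estimate:.4f}, actual: {actual:.4f}")
```

With this construction no apartment is both cheap and large, so the actual selectivity is 0 while the independence assumption predicts a sizable fraction: the multiplied reduction factors grossly overestimate.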
10. Algebraic Query Optimization Heuristics High
END 2020 Q5.2b • 2022 Q5.2 |
RES 2020 Q4.2c • 2024 Q2.4
Identify which algebraic heuristics (push selections down, push projections down, replace selection+Cartesian product with join) should be applied to an operator tree, and discuss their impact on query cost.
Methods to Solve
- Push selections down: Move σ operators as close to the leaf (base) relations as possible. This reduces the number of tuples early, making subsequent joins cheaper.
- Push projections down: Move π operators down to remove unnecessary attributes early, reducing tuple sizes and thus block counts. But be careful: do not remove attributes needed by later joins or selections!
- Replace σ + × with ⋈: A selection over a Cartesian product is equivalent to a join. Joins are much cheaper than computing the full Cartesian product first.
- Impact on cost-based results: After pushing selections, intermediate result sizes shrink, which changes the cost calculations for algorithm selection. Recompute sizes with the new tree.
- Pushing projection pitfall: If you remove an attribute that is needed for a join higher in the tree, the query becomes incorrect (not just suboptimal). Always verify that projected attributes include everything needed upstream.
Example (Resit 2024 Q2.4): Removing a_id from projection πo_id, a_id, end_date to πo_id, end_date seems like "push projection" optimization. But a_id is needed for the join at i7 above ⇒ this breaks the query. Answer: NOT beneficial.
11. Advanced SQL Queries High
END 2020 Q1 • 2022 Q1 |
RES 2022 Q1
Write SQL queries involving JOINs, GROUP BY, HAVING, subqueries (EXISTS, IN, NOT EXISTS), CTEs (WITH clause), aggregate functions, and conditions on dates/ages. Queries range from simple aggregations to complex nested/correlated subqueries.
Methods to Solve
- GROUP BY + HAVING: Use GROUP BY for per-group aggregation. HAVING filters groups (not rows). E.g., "agencies with at least 10 bookings" ⇒ GROUP BY t_id HAVING COUNT(*) ≥ 10.
- WITH (CTE): Use Common Table Expressions to break complex queries into readable steps. E.g., first compute city totals, then find the minimum.
- NOT EXISTS for negation: "Customers who never visited X" ⇒ WHERE NOT EXISTS (SELECT 1 FROM visits WHERE ...). This is the standard pattern for universal negation.
- Correlated subqueries: For "same price to all companies" ⇒ use NOT EXISTS with a self-join checking for different prices.
- JOIN strategy: Identify which tables to join by tracing foreign key relationships in the schema. Only join tables needed for the query.
- Date comparisons: Use standard date arithmetic. "Booking performed at least 3 days before starting" ⇒ start_date - booking_date ≥ 3.
- DISTINCT: Use when the query asks for unique results or when joins might produce duplicates.
- Set operators: UNION, INTERSECT, EXCEPT for combining/comparing result sets.
Example (Resit 2022 Q1.1): "List travel agencies with ≥10 bookings after 01-01-2022." ⇒ SELECT ta.name, COUNT(*) FROM travel_agency ta JOIN booking b ON ta.t_id = b.t_id WHERE b.booking_date > '2022-01-01' GROUP BY ta.t_id, ta.name HAVING COUNT(*) >= 10.
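The GROUP BY / HAVING pattern can be tried out on an in-memory SQLite database. The schema below is a guess at the exam's tables, reduced to the columns the query needs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE travel_agency (t_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE booking (b_id INTEGER PRIMARY KEY, t_id INTEGER,
                          booking_date TEXT);
""")
con.execute("INSERT INTO travel_agency VALUES (1, 'SunTours'), (2, 'GoAway')")
con.executemany(
    "INSERT INTO booking (t_id, booking_date) VALUES (?, ?)",
    [(1, "2022-03-01")] * 10 + [(2, "2022-03-01")] * 3 + [(1, "2021-12-01")] * 5,
)

rows = con.execute("""
    SELECT ta.name, COUNT(*)
    FROM travel_agency ta JOIN booking b ON ta.t_id = b.t_id
    WHERE b.booking_date > '2022-01-01'   -- WHERE filters rows ...
    GROUP BY ta.t_id, ta.name
    HAVING COUNT(*) >= 10                 -- ... HAVING filters groups
""").fetchall()
print(rows)  # [('SunTours', 10)]
```

SunTours has 10 qualifying bookings (the 5 from 2021 are filtered out by WHERE before grouping), GoAway only 3, so HAVING keeps just one group.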
12. Indexing Technique Selection (Hash-Index vs B+Tree) High
END 2020 Q3.1 • 2022 Q3.1 |
RES 2024 Q2.5
Given queries with different WHERE conditions, select which index type (Hash-Index or B+Tree) is best suited for each attribute used. Also includes TRUE/FALSE statements about index properties (primary vs secondary, insertion cost, etc.).
Methods to Solve
- Hash-Index: Best for equality lookups (= operator). O(1) access. Cannot handle range queries (<, >, BETWEEN).
- B+Tree: Supports both equality and range queries. Cost is O(log n) = tree height. Preferred when the query uses <, >, ≤, ≥, BETWEEN, or ORDER BY.
- If only equality: Hash-Index is better (fewer block accesses: typically 1 vs tree height).
- If range query: B+Tree is the only viable option. Hash-Index cannot be used.
- Primary key with equality: Hash is optimal if no range queries on that key.
- "Does not matter": If no index exists on the attribute, or if a full table scan is needed regardless, the index type doesn't matter (mark N/A or "does not matter").
- Secondary indexes: Multiple secondary indexes can exist per table. They decrease write performance (need to update index on insert/delete) but improve read performance for selections on non-key attributes.
- Relation scan vs Index scan: When selectivity is very low (many tuples match), a full relation scan may be cheaper than an index scan because index scan has per-tuple overhead.
Example (Endterm 2020 Q3.1): SELECT * FROM Person WHERE age = 32 ⇒ age: Hash-Index (equality), id: N/A (not used). SELECT * FROM Person WHERE id > 23 AND age > 32 ⇒ age: B+Tree (range), id: B+Tree (range).
13. B+Tree / KD-Tree Insertion High
RES 2020 Q4.1 (B+Tree) • 2022 Q3.1 (KD-Tree) • 2023 Q4.1 (B+Tree)
Draw the resulting B+Tree after inserting a value (may require node splits), or draw a KD-Tree after inserting a multi-dimensional point. Follow specified semantics (≤ left, > right, or < left, ≥ right).
Methods to Solve
- B+Tree insertion:
- Navigate to the correct leaf node using the search key.
- If the leaf has space, insert the key in sorted order.
- If the leaf is full, split: distribute keys evenly between old and new leaf. The middle key is copied up to the parent (for leaf splits) or pushed up (for internal node splits).
- If the parent is also full, recursively split upward. A split at the root creates a new root.
- Leaf nodes maintain a linked list (next-leaf pointers).
- KD-Tree insertion:
- Alternate splitting dimensions at each level (e.g., level 0: Salary, level 1: Age, level 2: Salary, ...).
- Navigate: compare the insertion point's value on the current dimension with the node's splitting value.
- Go left if < (or ≤ depending on convention), right if ≥ (or >).
- Insert as a new leaf when reaching a null child pointer.
- Pay attention to conventions: The exam specifies whether left is ≤ or <. Follow exactly as stated.
Example (Resit 2020 Q4.1): Insert 100 into a B+Tree with capacity 3 keys per node. 100 goes to the rightmost leaf [46, 47, 50], which is full. Split: [46, 47] and [50, 100]. Push 50 up to parent [22, 33, 44], which is also full. Split parent: [22, 33] and [50]. Push 44 up as new root.
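KD-Tree insertion is short enough to sketch end to end. This is my own illustrative code (using the < left / ≥ right convention; the exam states which side gets ties):

```python
class KDNode:
    def __init__(self, point):
        self.point = point  # e.g., (salary, age)
        self.left = None
        self.right = None

def kd_insert(root, point, depth=0, dims=2):
    """Insert into a KD-Tree: compare on dimension depth % dims,
    go left on <, right on >= (follow the convention the exam specifies)."""
    if root is None:
        return KDNode(point)  # new leaf at a null child pointer
    d = depth % dims
    if point[d] < root.point[d]:
        root.left = kd_insert(root.left, point, depth + 1, dims)
    else:
        root.right = kd_insert(root.right, point, depth + 1, dims)
    return root

# Level 0 splits on salary, level 1 on age, then salary again, alternating.
root = None
for p in [(3000, 40), (2000, 50), (4000, 30), (2500, 55)]:
    root = kd_insert(root, p)
print(root.left.point, root.left.right.point)  # (2000, 50) (2500, 55)
```

Note how (2500, 55) first goes left on salary (2500 < 3000), then right on age (55 ≥ 50): the alternating dimension is the part most often gotten wrong.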
14. Row-Store vs Column-Store Low
END 2020 Q3.2 • 2022 Q3.2
Given query characteristics (OLTP updates, full-row retrieval, column aggregations), determine whether row-wise or column-wise storage is more suitable.
Methods to Solve
- Row-store (OLTP): Best for queries that access complete rows (SELECT *, INSERT, UPDATE, DELETE). Each row is stored contiguously ⇒ one I/O to read/write a full record.
- Column-store (OLAP): Best for queries that aggregate specific columns (SUM, AVG, COUNT on one attribute). Each column is stored contiguously ⇒ reads only the needed column's blocks.
- Updates on multiple tables with joins: Row-store. Updates touch entire rows, and row-store keeps all attributes together.
- Queries returning complete rows: Row-store. Column-store would need to reconstruct rows from separate column files.
- Aggregations over specific columns: Column-store. Only reads the relevant column, skipping all others ⇒ far fewer I/Os.
15. Two-Phase Locking (2PL) Schedule Validation Low
END 2020 Q5.3 |
RES 2020 Q5.3
Given schedules using lock/unlock/read/write notation (L, U, R, W), determine whether each schedule follows a valid two-phase locking protocol.
Methods to Solve
- 2PL rule: For each transaction, there must be a growing phase (only acquires locks) followed by a shrinking phase (only releases locks). Once a transaction releases any lock, it must never acquire another.
- Check per transaction: For each transaction Ti, extract all Li and Ui operations in order. Verify that no Li appears after any Ui.
- Other rules: A transaction must lock an item before reading/writing it. It must not lock an item it already holds. These are basic well-formedness rules.
- If ANY transaction violates 2PL: The entire schedule is NOT a valid 2PL schedule.
- ALL transactions must comply: Every transaction in the schedule must individually follow the growing-then-shrinking pattern.
Example (Endterm 2020 Q5.3): Schedule: L1(B);R1(B); L2(A);R2(A); W1(B); U1(B);L3(B);W3(B); L1(C);... — Transaction 1 unlocks B then later locks C ⇒ violation of 2PL for T1. Schedule is FALSE (not valid 2PL).
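The growing/shrinking check can be expressed in a few lines. A minimal sketch (my encoding of the schedule; well-formedness checks such as lock-before-read are omitted for brevity):

```python
def is_two_phase(schedule):
    """Check the 2PL growing/shrinking rule per transaction.

    schedule: list of (op, txn, item) with op in {'L', 'U', 'R', 'W'};
    only L and U matter for the two-phase property itself.
    """
    unlocked = set()  # transactions already in their shrinking phase
    for op, txn, _ in schedule:
        if op == "U":
            unlocked.add(txn)
        elif op == "L" and txn in unlocked:
            return False  # lock after an unlock: 2PL violated
    return True

# Endterm 2020 Q5.3 pattern: T1 unlocks B, then later locks C -> not 2PL.
s = [("L", 1, "B"), ("R", 1, "B"), ("W", 1, "B"), ("U", 1, "B"),
     ("L", 3, "B"), ("W", 3, "B"), ("L", 1, "C")]
print(is_two_phase(s))  # False
```

A single violating transaction (here T1) makes the whole schedule invalid, which is the rule stated above.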
16. Data Integration (Schema Matching, Merging, Data Quality) Low
END 2023 Q4.3 • 2024 Q4.3
Describe the key tasks of data integration (e.g., when two companies merge databases) and list data quality dimensions with examples. Typically worth 6 points with multiple sub-parts.
Methods to Solve
- Data integration tasks (know any 4):
- Schema matching: Find correspondences between attributes in different source schemas (e.g., "product_name" in DB1 = "item_title" in DB2).
- Schema merging: Combine matched schemas into a unified global schema.
- Schema mapping: Define how to translate queries/data between the global schema and source schemas.
- Data matching: Identify records in different databases that refer to the same real-world entity (entity resolution/deduplication).
- Query reformulation: Translate queries on the global schema into sub-queries on the source schemas.
- Data quality dimensions (know any 2):
- Accuracy: Data values match real-world values.
- Completeness: No missing data; all real-world states are represented.
- Timeliness: Data is sufficiently up-to-date for the task.
- Consistency: No contradictions; data follows defined semantic rules.
- Duplication: No unwanted duplicate records within or across systems.
Example (Endterm 2023 Q4.3): Jumbo and Albert Heijn merge. Schema matching: find that Jumbo's "artikel" = AH's "product". Data matching: identify that "Coca-Cola 1L" in both DBs is the same item. Data quality issue: Duplication (same product in both DBs) and Consistency (different price formats).
17. Conceptual Modelling (EER Diagrams) Low
RES 2020 Q1 • 2023 Q1
Given a scenario description, complete or design a Conceptual EER schema: identify entities, relationships, attributes, primary keys, cardinalities, specialization/generalization (ISA), and participation constraints. Comment on requirements that cannot be modeled.
Methods to Solve
- Identify entities: Nouns that represent independent objects (Camper, Staff, Facility, Activity).
- Identify relationships: Verbs connecting entities (Participates, Uses, Organizes). Determine cardinalities (1:1, 1:N, M:N).
- Attributes: Properties of entities. Underline primary keys. Use ovals in ER notation.
- Specialization/Generalization (ISA): If entities share common attributes but have distinct sub-types (e.g., Staff → C1/C2/C3), use ISA hierarchy. Mark as total/partial and disjoint/overlapping.
- Weak entities: Entities that cannot be uniquely identified without their owner entity (partial key + owner's key).
- Participation constraints: Total (every entity must participate) vs partial. Double line = total participation.
- Constraints that cannot be modeled: Age ranges (5-17), "exactly 2 campers per tent," minimum age requirements — these are business rules that ER diagrams cannot express.
- Multi-valued attributes: Use double-line ovals for attributes that can have multiple values.
18. Functional Dependencies & Normalization (2NF, 3NF) Low
RES 2020 Q2 • 2023 Q2
Given a relation R(A, B, C, D, E) and a set of functional dependencies, derive candidate keys, then decompose the relation into 2NF and 3NF. Multiple rows with different FD sets are given in a table format.
Methods to Solve
- Finding candidate keys:
- Compute the closure of attribute sets. A candidate key K is minimal such that K+ = all attributes.
- Start with attributes that appear only on the left side of FDs (they must be in every key).
- Use Armstrong's axioms: reflexivity, augmentation, transitivity to expand closures.
- 2NF decomposition:
- A relation is in 2NF if every non-prime attribute is fully functionally dependent on every candidate key (no partial dependencies).
- If a non-prime attribute depends on only part of a composite key, decompose: create a separate relation for that partial dependency.
- If the key is a single attribute, the relation is automatically in 2NF.
- 3NF decomposition (dependency-preserving):
- A relation is in 3NF if for every FD X → A, either X is a superkey or A is a prime attribute (part of some candidate key).
- Algorithm: For each FD X → Y, create a relation R(X, Y). Then ensure a relation containing a candidate key exists. Merge relations with the same key.
- If already in 3NF, write R(A, B, C, D, E) unchanged.
- Common FD patterns:
- {A → BCDE}: Key is A, already in 2NF and 3NF.
- {B → CD}: A, B, and E never appear on a right-hand side, so every candidate key contains them ⇒ key is ABE. C and D then depend on B alone, a partial dependency ⇒ violates 2NF; decompose.
- {A → B, B → CE}: Transitive dependency A → B → CE. Violates 3NF. Decompose into R1(B, C, E) and R2(A, B, D).
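The closure computation behind the candidate-key step can be sketched as follows (my own illustrative code, replaying the {A → B, B → CE} pattern from the list above):

```python
def closure(attrs, fds):
    """Attribute closure under a set of FDs given as (lhs, rhs) strings:
    repeatedly add rhs whenever lhs is already contained in the closure."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# FDs A -> B and B -> CE over R(A, B, C, D, E): A+ misses D, so A alone
# is not a key; (AD)+ covers everything, so AD is the candidate key.
fds = [("A", "B"), ("B", "CE")]
print(sorted(closure("A", fds)))   # ['A', 'B', 'C', 'E']
print(sorted(closure("AD", fds)))  # ['A', 'B', 'C', 'D', 'E']
```

Since B → CE holds with B a non-superkey and C, E non-prime, the transitive dependency violates 3NF, matching the decomposition given above.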
Generated from 8 exams: Endterms 2020, 2022, 2023, 2024 | Resits 2020, 2022, 2023, 2024
CSE1505 Information and Data Management, TU Delft