# CSE 1505 Information and Data Management — Complete Summary

---

## Lecture 1: Introduction / SQL

### Key Concepts
- **Relational Model**: A relation is a set of tuples with the same domains. $R \subseteq D_1 \times \cdots \times D_n$
- **Database Schemas**: Conceptual (ER), Logical (relational tables), Physical (storage structures)
- **Data Models**: Relational, Document, Key-Value, Graph, Wide-Columnar
- **SQL is Declarative**: Describe what, not how
- **SQL Query Structure**:
```sql
SELECT [DISTINCT] <columns> FROM <tables> WHERE <condition> GROUP BY <cols> HAVING <condition> ORDER BY <cols>
```
- **SQL Error Handling**: SQLCODE, SQLSTATE (e.g., SQL0304N = value out of range for host variable)
- **Natural Keys vs. Surrogate Keys**: Natural keys are domain-meaningful (matriculation number); surrogate keys are synthetic (auto-increment, UUID)

### Key Concepts — SQL Types
- **DDL** (Data Definition Language): CREATE, ALTER, DROP
- **DML** (Data Manipulation Language): SELECT, INSERT, UPDATE, DELETE
- **SQL allows multi-sets (bags)**: Duplicates are allowed by default; eliminate with DISTINCT

### WARNING
- SQL result sets can be empty — this is a valid result, not an error
- Even a single scalar value is a result set (one row, one column)

---

## Lecture 2: Complex SQL

### Key Concepts
- **JOIN Types**:
  - INNER JOIN (same as JOIN): only matching tuples
  - LEFT JOIN: all left tuples, NULL-padded right side
  - RIGHT JOIN: all right tuples, NULL-padded left side
  - FULL JOIN: all tuples from both sides, NULL-padded where no match
- **Set Operators**: UNION, INTERSECT, EXCEPT — require union-compatible relations (same number of attributes, compatible data types). Eliminates duplicates by default; use ALL to keep duplicates.
- **Subqueries**:
  - **Uncorrelated**: inner query evaluated once
  - **Correlated**: inner query uses attributes from outer query → re-evaluated per outer tuple → inefficient
  - **Scalar subquery**: returns exactly one row, one column; usable in expressions
- **EXISTS**: semi-join pattern; true if inner query returns any rows
- **IN**: true if value is in the set returned by inner query
- **WITH clause** (CTE): named temporary result; useful for readability and repeated use

### Method: Writing Complex SQL Queries
1. **Identify the base tables** needed
2. **Determine the join structure** (which tables, which join types)
3. **Apply WHERE filtering** early
4. **Apply GROUP BY and HAVING** for aggregate filters
5. **Use subqueries/CTEs** for multi-step logic
6. **Order the result** with ORDER BY
7. **Check result schema** matches expected column names

### Method: LEFT JOIN with Filter Conditions
- When filtering on the right table, put the condition in the ON clause, NOT the WHERE clause (otherwise the LEFT JOIN behavior is defeated)
```sql
SELECT s.id, MAX(e.grade)
FROM students s
LEFT JOIN exams e ON s.id = e.student_id AND e.course_id = 117
GROUP BY s.id
```

### WARNING
- Correlated subqueries are inefficient — avoid when possible
- H2 database (used in Weblab) has a weaker optimizer than PostgreSQL; some queries with IN may timeout after 1 minute
- Result schema column names must match exactly (use AS for aliases)

---

## Lecture 3: Application Programming (JDBC / JPA)

### Key Concepts
- **Impedance Mismatch**: Object-oriented data models differ from relational tables (no implicit identity in relational model, complex object relationships vs. flat tables)
- **Object Persistence**: Storing object state beyond application lifetime
- **Surrogate Keys**: Required for object-relational mapping when no natural key exists
  - **Sequence keys**: DBMS-generated, short, easy to debug, but bottleneck and network traffic
  - **Hi-Lo keys**: App generates low part, DBMS provides high part; reduces network calls
  - **UUIDs**: 128-bit, no central control, easy cross-source integration, but hard to debug and use more storage

### Key Concepts — UUID Types
| Type | Generation |
|------|-----------|
| Type-1 | MAC address + timestamp (traceable) |
| Type-2 | Type-1 + POSIX UID |
| Type-3 | MD5 hash of namespace + name |
| Type-4 | Cryptographically random (collision prob: $4 \times 10^{-16}$ at $2^{36}$) |
| Type-5 | SHA-1 hash of namespace + name |

### Method: JDBC Workflow
1. Load the driver
2. Define connection URL
3. Establish connection (`DriverManager.getConnection()`)
4. Create statement(s)
5. Execute statement(s)
6. Process result(s)
7. Close the connection

### Method: JDBC Transactions
1. `conn.setAutoCommit(false)` — disable auto-commit
2. Execute multiple statements
3. `conn.commit()` — or `conn.rollback()` on failure
4. Optional: `conn.setSavepoint(name)` for savepoints

### Method: JPA Persistence
1. Annotate entities with `@Entity`, `@Table`, `@Id`, `@GeneratedValue`
2. Define relationships: `@OneToMany`, `@ManyToOne`, `@ManyToMany`
3. Define fetch strategy: `FetchType.EAGER` vs. `FetchType.LAZY`
4. Create `persistence.xml` with persistence-unit
5. Use `EntityManager`: `persist()`, `merge()`, `remove()`, `find()`
6. JPQL queries use entity names and attributes, not table/column names

### WARNING
- EAGER loading fetches entire object graph immediately; LAZY loading fetches on demand
- UUIDs are hard to debug: `WHERE id = '8ac7fb3d4f4047419c7f7d22d1802fe3'`
- JPA adds a performance penalty; "automagic" nature can hide details
- Always use PreparedStatements in JDBC for security and performance

---

## Lecture 4: Towards Query Processing

### Key Concepts
- **Query Processing Pipeline**: SQL → Scanner/Parser → Relational Algebra Expression → Query Optimizer → Execution Plan → Evaluation Engine
- **Relational Completeness**: Query languages equivalent in expressive power to Relational Algebra (Codd's Theorem: RA $\equiv$ Tuple/Domain Calculus)
- **Relational Algebra is Procedural**: Shows step-by-step computation; SQL is Declarative
- **Evaluation Primitives**: Tuple scan, selection, index scan, join algorithms, sort, duplicate elimination
- **Operator Trees**: Tree-shaped sequences of RA operators; relations as leaves, operators as internal nodes

### Key Concepts — Basic RA Operators
| Operator | Symbol | Description |
|----------|--------|-------------|
| Selection | $\sigma$ | Filter rows by condition |
| Projection | $\pi$ | Select columns |
| Union | $\cup$ | Combine tuples from two relations |
| Set Difference | $-$ | Tuples in R but not S |
| Cartesian Product | $\times$ | All combinations of tuples |
| Rename | $\rho$ | Rename attributes or relations |
| Join | $\bowtie$ | Combination of $\times$ + $\sigma$ |
| Division | $\div$ | "For all" queries |

### Method: Translate SQL to Relational Algebra
1. FROM clause $\to$ Cartesian Product ($\times$) or Join ($\bowtie$)
2. WHERE clause $\to$ Selection ($\sigma$)
3. SELECT clause $\to$ Projection ($\pi$)
4. GROUP BY/HAVING $\to$ Gamma ($\gamma$) aggregation
5. Aliases $\to$ Rename ($\rho$)

### WARNING
- RA operators produce **sets** (no duplicates); SQL produces **bags** (duplicates allowed)
- Joins are associative: $(R \bowtie S) \bowtie T \equiv R \bowtie (S \bowtie T)$
- Selections are commutative and can be decomposed: $\sigma_{c1 \land c2}(R) \equiv \sigma_{c1}(\sigma_{c2}(R))$

---

## Lecture 5: Storage Media

### Key Concepts
- **RAID Levels**:
  - **RAID 0**: Striping only; max performance, no fault tolerance
  - **RAID 1**: Mirroring; 100% overhead, single disk failure tolerance
  - **RAID 3**: Byte-level striping with dedicated parity disk
  - **RAID 5**: Block-level striping with distributed parity; tolerates 1 failure, good read performance
- **Row-Store vs. Column-Store**:

| Scenario | Best Choice | Reason |
|----------|-------------|--------|
| Queries scanning most columns (e.g., SELECT *) | Row-store | Fewer block reads |
| Queries on few columns with aggregation | Column-store | Only read needed columns |
| OLTP (many point lookups on full rows) | Row-store | Natural row access pattern |
| OLAP (analytical queries, aggregations) | Column-store | Columnar compression + selective reads |
| Full table scan with GROUP BY on few columns | Column-store | Less I/O for aggregated columns |
| Full table scan needing all columns | Row-store | No overhead from column joins |

### WARNING
- Column-store has overhead for full-row retrieval and update-heavy workloads
- Row-store reads unnecessary columns for narrow queries
- Choose based on the **dominant workload pattern**, not edge cases

### Method: Choosing Storage Type
1. Identify which columns the query accesses
2. If querying most columns $\to$ row-store
3. If querying few columns with aggregation $\to$ column-store
4. If the query scans the entire table $\to$ row-store (avoids column-join overhead)

---

## Lecture 6: Indexing #1

### Key Concepts
- **B+ Tree Index**:
  - Ordered, balanced tree; efficient for equality and range queries
  - High fanout in disk-based systems minimizes tree height $\to$ fewer disk seeks
  - Leaf nodes form a linked list (efficient for range scans)
  - Internal nodes are only for navigation
- **Hash Index**:
  - O(1) lookup for exact matches only
  - No range query support
  - Overflow lists grow with continuous insertions without rehashing
- **When to Use Which**:

| Query Pattern | Best Index |
|---------------|-----------|
| Equality ($=$) | Hash index or B-tree |
| Range ($<$, $>$, BETWEEN) | B-tree only |
| Prefix search (LIKE 'B%') | B-tree (some systems) |
| Multi-attribute equality | Hash on composite or individual |

### WARNING
- B-trees require rebalancing on frequent updates
- Hash indexes cannot support range queries at all
- B-tree on a composite key concatenates attributes into a single string
- Index overhead matters for update-heavy workloads

---

## Lecture 7: Indexing #2

### Key Concepts
- **KD-Tree**: Multi-dimensional index for spatial/range queries
  - Alternates splitting by dimension at each tree level
  - Efficient for multi-dimensional range queries
  - Worst case for single-dimensional range queries (limited pruning)
- **Index Selection for Workloads**:

| Workload | Recommended Index |
|----------|------------------|
| Single-dimensional range ($price > 3$) | B-tree on that attribute |
| Multi-dimensional range on same table (BETWEEN on price AND weight) | KD-tree on both attributes |
| Multi-table filter joins | Individual indexes on each table's filter columns |
| Exact match on multiple attributes (price = 199.99 AND weight = 10) | Hash index on composite key |
| Date range queries | B-tree (likely yes, depends on workload) |

### Method: Index Selection Decision Tree
1. **Single attribute range query** $\to$ B-tree
2. **Single attribute equality query** $\to$ Hash or B-tree
3. **Multi-attribute range on same table** $\to$ KD-tree
4. **Multi-attribute equality on same table** $\to$ Hash on composite
5. **Multi-table joins with filters** $\to$ Individual indexes per table
6. **Prefix search (LIKE 'B%')** $\to$ B-tree (system-dependent)
7. **High-update frequency** $\to$ Avoid excessive indexing (rebalancing cost)

### WARNING
- KD-trees cannot span multiple tables (indexes are per-table)
- Multi-dimensional indexes help when queries filter on all dimensions
- If only one dimension is filtered in a KD-tree, performance degrades toward linear scan

---

## Lecture 8: Query Execution

### Key Concepts
- **Cost Measurement**: Primarily disk I/O (seeks + block reads + block writes); CPU time usually negligible
  - Cost = (seeks $\times$ seek\_cost) + (block reads $\times$ read\_cost) + (block writes $\times$ write\_cost)
  - Write cost > read cost (data read back for verification)
- **Join Algorithms**:
  - **Nested Loop Join (NLJ)**: For each tuple in R, scan all tuples in S
  - **Block Nested Loop Join (BNLJ)**: Buffer blocks of R, scan S once per block
  - **Hash Join**: Build hash table on small relation, probe with large relation
  - **Sort-Merge Join**: Sort both relations on join attribute, merge sequentially
  - **Index Nested Loop Join**: Use index to probe S for each tuple of R

### Method: Choosing Join Algorithm
1. **Small table + large table** (no PK info) $\to$ Block Nested Loop Join
2. **One relation is sorted on join attribute** $\to$ Sort-Merge Join
3. **Hash index available on join attribute** $\to$ Hash Join
4. **Secondary index on join attribute** $\to$ Index Nested Loop Join (if selectivity is high)
5. **Both relations sorted AND index available** $\to$ Sort-Merge Join
6. **Large intermediate result + index available** $\to$ Block Nested Loop Join or Index Nested Loop Join

### Method: Index Scan vs. Table Scan
- **Index Scan** is better when:
  - The selection is highly selective (few matching tuples)
  - Hash index on primary key (equality on PK)
- **Table Scan** is better when:
  - The selection is not selective (many matching tuples)
  - Reading most columns (index scan would require additional lookups)
  - The index is larger than the table

### WARNING
- Merge Sort Join requires sorted input — only use if the preceding selection preserves order
- BNLJ cost: $B_R \times B_S$ (brute force) vs. $B_R + \lceil B_R/B_f \rceil \times B_S$ (block with buffer $B_f$)
- In-memory databases shift cost balance toward CPU time

---

## Lecture 9: Query Optimization (Heuristics)

### Key Concepts
- **Query Optimizer**: Rewrites naive query plan into efficient evaluation plan
  - Query Rewriting (algebraic equivalences)
  - Cost Estimation (using DB statistics)
  - Evaluation Plan Generation
- **Algebraic Optimization Heuristics**:
  - **Push selections** as far down the tree as possible
  - **Break selections** into conjuncts and push each individually
  - **Push projections** to reduce intermediate result sizes
  - **Combine Cartesian Products + matching selections into joins**
  - **Eliminate redundant operations**
- **Reduction Factors**: For a selection with condition $c$, estimate:
  - $RF(c) = 1 / \text{number of distinct values}$ (for equality)
  - For range: $RF(c) = \text{fraction of range}$ (e.g., passing grades $\geq 6$ on 19 values $\to$ 9/19)
  - For conjunctive selections: $RF(c_1 \land c_2) = RF(c_1) \times RF(c_2)$

### Method: Estimate Intermediate Result Sizes
1. Start with base table size $|R|$ and block count $B_R$
2. For selection $\sigma_c(R)$: estimated size $= |R| \times RF(c)$, blocks $= \lceil \text{size} / \text{tuples per block} \rceil$
3. For join $R \bowtie S$: estimated size $= |R| \times |S| \times RF(\text{join})$
4. For projection $\pi_A(R)$: estimate distinct values of A

### WARNING
- The formula $|R| \times \prod RF(c_i)$ assumes **uniform value distribution**
- If values are **not uniformly distributed** (e.g., grades, birthdays), the formula **overestimates or underestimates**
- Use **histograms** for more accurate estimates
- For PK equality: formula is exact (at most +/- 1 error depending on whether the value exists)
- For non-PK equality (e.g., shirt color): formula likely **overestimates**

---

## Lecture 10: Query Optimization (Cost-Based)

### Key Concepts
- **Greedy Heuristic for Join Order**:
  1. Compute intermediate sizes for all pairs: $|R \bowtie S| = |R| \times |S| \times RF(R \bowtie S)$
  2. Pick the pair with the smallest intermediate result
  3. Join the result with remaining relations, always choosing the smallest intermediate
- **Join Order Optimization**:
  - $(R \bowtie S) \bowtie T$ vs. $R \bowtie (S \bowtie T)$ can differ enormously in cost
  - Choose the order that produces the smallest intermediate results first
- **Pushing Projections**: Reduces the number of attributes in intermediate results $\to$ smaller tuples $\to$ fewer blocks
- **DB Statistics Needed**:
  - Table size (records, blocks)
  - Number of distinct values per attribute
  - Value distributions (histograms for accuracy)
  - Index availability
  - Join selectivity/reduction factors

### Method: Optimal Join Order (Greedy)
1. Compute all pairwise join sizes using reduction factors
2. Select the pair producing the smallest result
3. Iteratively join the result with remaining relations, picking the smallest intermediate each time
4. Example: $R(1000) \bowtie S(500) \bowtie T(200)$ with $RF(R\bowtie S)=0.1$, $RF(S\bowtie T)=0.2$, $RF(R\bowtie T)=0.01$
   - $R \bowtie S: 1000 \times 500 \times 0.1 = 50,000$
   - $S \bowtie T: 500 \times 200 \times 0.2 = 20,000$
   - $R \bowtie T: 1000 \times 200 \times 0.01 = 2,000$ (smallest $\to$ join first)
   - Optimal: $((R \bowtie T) \bowtie S)$

### WARNING
- Cost is dominated by **disk I/O**, not CPU
- Join order significantly impacts intermediate result sizes and overall cost
- The greedy heuristic doesn't always find the globally optimal order, but works well in practice

---

## Lecture 11: Transactions #1

### Key Concepts
- **Transaction**: A unit of work that performs a single logical function; must commit or abort as a single unit
- **ACID Properties**:
  - **Atomicity**: All or nothing — if a transaction fails, its effects are undone
  - **Consistency**: Transforms a consistent state to another consistent state (enforced by integrity constraints)
  - **Isolation**: Transactions don't see effects of other uncommitted transactions
  - **Durability**: Committed transactions have permanent effects
- **Concurrency Problems**:
  - Lost update: Two transactions update the same data, one overwrites the other
  - Dirty read: Reading uncommitted data from another transaction
  - Non-repeatable read: Data changes between two reads in the same transaction
  - Phantom read: New rows appear between two reads in the same transaction
- **Transaction Schedules**:
  - **Serial**: Transactions execute one after another (always consistent)
  - **Concurrent**: Transactions interleave (need to check serializability)
  - **Conflict Serializable**: Equivalent to a serial schedule via swap of non-conflicting operations
  - **Conflict**: Two operations conflict if they access the same data item and at least one is a write

### WARNING
- Isolation does NOT mean transactions run simultaneously; it means the concurrent execution is equivalent to some serial execution
- 2PL (Two-Phase Locking) ensures conflict serializability:
  - **Growing phase**: Acquire locks, no releases
  - **Shrinking phase**: Release locks, no acquisitions

---

## Lecture 12: Transactions #2

### Key Concepts
- **Two-Phase Commit (2PC)**: Protocol for distributed transactions
  - **Phase 1 (Prepare)**: Coordinator sends PREPARE to all participants; participants check if they can commit and log their readiness
  - **Phase 2 (Commit/Abort)**: If all READY $\to$ COMMIT; if any NO $\to$ ABORT
- **2PC Failure Scenarios**:
  - Coordinator crashes after PREPARE but before COMMIT/ABORT $\to$ participants **block**, waiting
  - Coordinator sends COMMIT but network partition prevents some nodes from receiving $\to$ abort after timeout, temporary inconsistency
- **2PC cannot commit** in case of network partition (participants may block indefinitely)
- **2PC is NOT a concurrency control protocol** — it's for distributed atomicity
- **Logging for Recovery**:
  - **UNDO**: For uncommitted transactions at crash time
  - **REDO**: For committed transactions whose writes may not have reached disk
  - **UNDO/REDO**: Both — standard approach
  - **WAL (Write-Ahead Logging)**: Log records written to disk BEFORE data pages

### Method: Recovery Procedures
1. **Scan log** from last checkpoint
2. **REDO list**: Transactions that have committed (have COMMIT record)
3. **UNDO list**: Transactions that started but did not commit (crash during execution)
4. **Forward pass**: Identify REDO and UNDO lists
5. **Backward pass**: UNDO uncommitted changes; REDO committed changes that may not be on disk

### WARNING
- In REDO-only logging: if a committed transaction's pages were flushed to disk before crash, REDO is redundant but safe
- In UNDO-only logging: uncommitted transactions must have their changes reversed
- WAL ensures the log is on disk before the modified data page — the log is the source of truth for recovery
- 2PC participants block if the coordinator crashes after sending PREPARE — this is a known limitation

---

## Lecture 13: NoSQL — Document Data Models

### Key Concepts
- **Document Model**: Data stored as documents (JSON, BSON, XML)
  - Each document is self-describing with its own schema
  - Flexible schema: different documents in the same collection can have different fields
  - Nested structures: embedded documents and arrays
- **Document DBMS Examples**: MongoDB, CouchDB, Firebase Firestore
- **Advantages over Relational**:
  - Schema flexibility (no need for JOINs for nested data)
  - Easy horizontal scaling (sharding)
  - Natural fit for hierarchical/nested data
- **Trade-offs**:
  - No ACID guarantees (typically BASE: Basically Available, Soft state, Eventual consistency)
  - Limited JOIN support
  - Query patterns optimized for document retrieval, not cross-document analysis

### Key Concepts — CAP Theorem
- **CAP Theorem**: In a distributed system, you can guarantee at most 2 of 3:
  - **C**onsistency: Every read receives the most recent write
  - **A**vailability: Every request receives a response (without guarantee it's the latest)
  - **P**artition Tolerance: System continues to operate despite network partitions

### CAP Scenarios
| Scenario | Trade-off | Reason |
|----------|-----------|--------|
| Online retailer (prevent overselling) | CP | Consistency is critical; availability sacrificed during partitions |
| Social media feed | AP | Responsiveness > freshness; eventual consistency |
| Banking transactions | CP | Accuracy is paramount; users accept delays |
| Collaborative document editing | AP | Always allow editing; conflict resolution later |

### WARNING
- AP systems may return stale data; CP systems may become unavailable during partitions
- Document stores typically choose AP (or at best, weak consistency)
- The CAP theorem applies only to distributed systems — single-node systems don't face partitions

---

## Lecture 14: NoSQL — Graph DBs

### Key Concepts
- **Graph Model**: Data as nodes (entities) and edges (relationships)
  - Nodes have properties; edges have labels and properties
  - Native storage of relationships (no JOINs needed)
  - Ideal for connected data: social networks, recommendation engines, fraud detection
- **Graph DBMS Examples**: Neo4j, Amazon Neptune, JanusGraph
- **Advantages**:
  - Efficient traversal of connected data
  - Relationships are first-class citizens (not derived via JOINs)
  - Flexible schema evolution
- **Query Languages**:
  - **Cypher** (Neo4j): Pattern-based declarative query language
  - **Gremlin**: Property graph query language

### Key Concepts — Graph vs. Relational
| Aspect | Relational | Graph |
|--------|-----------|-------|
| Data model | Tables, rows, columns | Nodes, edges, properties |
| Relationships | Foreign keys + JOINs | First-class edges |
| Deep traversal | Expensive (many JOINs) | Efficient (pointer chase) |
| Schema | Rigid | Flexible |
| Best for | Structured data, transactions | Connected data, analytics |

### WARNING
- Graph DBs are not a replacement for relational DBs — they solve different problems
- If your queries involve few relationships and mostly point lookups, relational may be better
- Graph DBs can have high write overhead due to relationship maintenance

---

## Lecture 15: Q&A / Review

### Key Concepts — Data Normalization
- **1NF**: Atomic values only (no sets or substructures)
- **2NF**: In 1NF + no partial dependencies (non-key attributes depend on the full candidate key)
- **3NF**: In 2NF + no transitive dependencies (non-key attributes determined only by the candidate key)
- **BCNF**: Stronger than 3NF — every determinant is a candidate key
- **Functional Dependencies** underpin keys:
  - **Candidate Key**: Irreducible set of attributes that uniquely identifies a tuple
  - **Superkey**: Superset of a candidate key
  - **Primary Key**: Chosen candidate key
  - **Foreign Key**: References a primary key in another table

### Key Concepts — DDL Constraints
| Constraint | Description |
|-----------|-------------|
| PRIMARY KEY | NOT NULL + UNIQUE; identifies the row |
| UNIQUE | No duplicate values in column(s) |
| NOT NULL | NULL values not allowed |
| CHECK | User-defined condition must be satisfied |
| FOREIGN KEY | Referential integrity; references another table's PK |
| DEFAULT | Default value if none specified |
| REFERENCES | Define foreign key relationship |

### ON DELETE / ON UPDATE Actions
| Action | Behavior |
|--------|---------|
| NO ACTION | Reject the operation (default) |
| SET NULL | Set referencing FK to NULL |
| CASCADE | Propagate the operation to referencing rows |

---

## Quick Reference: When to Use What

| Problem Type | Approach |
|-------------|----------|
| Equality queries | Hash index or B-tree |
| Range queries | B-tree (or KD-tree for multi-dimensional) |
| Multi-dimensional range on same table | KD-tree |
| Prefix search (LIKE 'B%') | B-tree |
| Full-row retrieval | Row-store |
| Aggregation on few columns | Column-store |
| Full table scan needing all columns | Row-store |
| Small + large table join (no PK) | Block Nested Loop Join |
| One sorted relation | Sort-Merge Join |
| Hash index available | Hash Join |
| Highly selective index on join attribute | Index Nested Loop Join |
| Pushing selections down | Reduces intermediate result sizes |
| Pushing projections down | Reduces intermediate result sizes |
| Greedy join order | Pick smallest intermediate result first |
| 2PC coordinator crash after PREPARE | Participants block |
| UNDO on recovery | Uncommitted transactions at crash time |
| REDO on recovery | Committed transactions not flushed to disk |
| Online retailer (prevent overselling) | CP (Consistency + Partition tolerance) |
| Social media feed | AP (Availability + Partition tolerance) |
| Banking | CP (Consistency + Partition tolerance) |
| Collaborative editing | AP (Availability + Partition tolerance) |
| Impedance mismatch solution | JPA/Hibernate with annotations or XML mapping |
| Surrogate key with minimal network traffic | Hi-Lo keys |
| Surrogate key with no central authority | UUID (Type-4 random) |
| Eager vs. Lazy loading | EAGER: fetch all at once; LAZY: fetch on demand |

---

## SQL Schema Reference (Common Across Exams)

```sql
CREATE TABLE addresses (
    address_id character varying(14) PRIMARY KEY,
    address character varying(30) NOT NULL,
    city character varying(32) NOT NULL,
    country character varying(14) NOT NULL
);

CREATE TABLE customers (
    id integer PRIMARY KEY,
    first_name character varying(12) NOT NULL,
    last_name character varying(13) NOT NULL,
    email character varying(35) NOT NULL,
    phone_number character varying(17),
    age integer NOT NULL
);

CREATE TABLE suppliers (
    name character varying(61) PRIMARY KEY,
    founding_year integer NOT NULL,
    country character varying(22) NOT NULL,
    industry character varying(62)
);

CREATE TABLE products (
    id integer PRIMARY KEY,
    name character varying(38) NOT NULL,
    manufacturer character varying(56) NOT NULL,
    lot_number integer NOT NULL,
    price numeric(7, 2) NOT NULL,
    availability integer NOT NULL,
    weight integer NOT NULL,
    FOREIGN KEY (manufacturer) REFERENCES suppliers(name)
);

CREATE TABLE orders (
    id integer PRIMARY KEY,
    customer_id integer NOT NULL,
    address_id character varying(14) NOT NULL,
    date date NOT NULL,
    FOREIGN KEY (customer_id) REFERENCES customers(id),
    FOREIGN KEY (address_id) REFERENCES addresses(address_id)
);

CREATE TABLE customer_address (
    customer_id integer NOT NULL,
    address_id character varying(14) NOT NULL,
    PRIMARY KEY (customer_id, address_id),
    FOREIGN KEY (customer_id) REFERENCES customers(id),
    FOREIGN KEY (address_id) REFERENCES addresses(address_id)
);

CREATE TABLE order_product (
    order_id integer NOT NULL,
    product_id integer NOT NULL,
    quantity integer NOT NULL,
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id) REFERENCES orders(id),
    FOREIGN KEY (product_id) REFERENCES products(id)
);
```