DBMS Notes

Chapter 5

Query Processing and Optimization

Chapter Notes


5.1 Query Processing Pipeline

Parser

  • Checks SQL syntax (balanced parentheses, valid keywords).
  • Checks semantics: do the referenced tables and columns exist?
  • Produces a parse tree (or query tree).

Translator

  • Converts the parse tree into a relational algebra expression.
  • Example: SELECT name FROM Student WHERE dept='CS' becomes
    π_name(σ_{dept='CS'}(Student))
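The selection and projection in the algebra expression above can be mimicked with ordinary list operations — a toy sketch (the `Student` rows are made up for illustration):

```python
# Toy relation: a list of dicts stands in for the Student table (made-up data).
student = [
    {"name": "Asha", "dept": "CS"},
    {"name": "Bibek", "dept": "EE"},
    {"name": "Carol", "dept": "CS"},
]

# sigma_{dept='CS'}(Student): selection keeps only matching rows.
selected = [row for row in student if row["dept"] == "CS"]

# pi_name(...): projection keeps only the 'name' column.
result = [row["name"] for row in selected]

print(result)  # ['Asha', 'Carol']
```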

Optimizer

  • Applies equivalence rules to rewrite into cheaper equivalent plans.
  • Uses statistics (table size, column cardinality, index availability) to estimate cost.
  • Selects the plan with the lowest estimated cost (estimates are approximate, so the chosen plan is not guaranteed to be truly optimal).

Executor

  • Runs the physical plan operator by operator.
  • Returns result to the client.

5.2 Equivalence Rules for Optimization

Key equivalence rules (all preserve the result set):

1. Cascade of selections:
   σ_{p1 ∧ p2}(R) ≡ σ_{p1}(σ_{p2}(R))

2. Commutativity of selection:
   σ_{p1}(σ_{p2}(R)) ≡ σ_{p2}(σ_{p1}(R))

3. Cascade of projections:
   π_{L1}(π_{L2}(R)) ≡ π_{L1}(R)   (when L1 ⊆ L2)

4. Commutativity of selection and projection:
   π_{L}(σ_{p}(R)) ≡ σ_{p}(π_{L}(R))   (when attrs(p) ⊆ L; in general an outer π_{L} is needed: π_{L}(σ_{p}(R)) ≡ π_{L}(σ_{p}(π_{L∪attrs(p)}(R))))

5. Commutativity of join:
   R ⋈ S ≡ S ⋈ R

6. Associativity of join:
   (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)

7. Push selection into join:
   σ_{p}(R ⋈ S) ≡ σ_{p}(R) ⋈ S   (when p only refers to R's attrs)

8. Push projection into join:
   π_{L}(R ⋈ S) ≡ π_{L}(π_{L1}(R) ⋈ π_{L2}(S))   (L1, L2 = the attributes of R, S that appear in L or in the join condition)

Heuristic strategy: push σ and π as close to leaves (base relations) as possible.
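The payoff of pushing a selection below a join can be seen in a small sketch — the relations `R`, `S` and the predicate are invented, and rule 7 guarantees both plans return the same rows:

```python
# Made-up relations R(a, x) and S(a, y), equi-joined on attribute 'a'.
R = [{"a": i, "x": i % 10} for i in range(100)]
S = [{"a": i, "y": i * 2} for i in range(100)]

def join(r, s):
    """Naive nested-loop equi-join on 'a'."""
    return [{**t, **u} for t in r for u in s if t["a"] == u["a"]]

pred = lambda t: t["x"] == 0      # predicate touches only R's attributes

# Plan 1: join first, then select  --  sigma_p(R |><| S)
plan1 = [t for t in join(R, S) if pred(t)]

# Plan 2: push the selection down  --  sigma_p(R) |><| S
plan2 = join([t for t in R if pred(t)], S)

assert plan1 == plan2             # same result set (rule 7)
print(len([t for t in R if pred(t)]))  # only 10 R-tuples reach the join
```

Plan 1 feeds all 100 R-tuples into the join; plan 2 feeds only 10, which is exactly the reduction the heuristic is after.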

5.3 Cost-Based vs Heuristic Optimization

| Approach | How | Pro | Con |
|---|---|---|---|
| Heuristic | Apply rules like "push selections down" blindly | Fast, no statistics needed | May miss the optimal plan |
| Cost-based | Enumerate plans; estimate cost with statistics; pick cheapest | Finds better plans | Needs up-to-date statistics; slow for many joins |

Cost factors: number of disk I/Os, CPU time, memory usage. Disk I/O usually dominates.


5.4 Evaluation Strategies

Materialization

Each operator reads its input, computes its full output, and writes it to a temporary relation on disk before the next operator reads it.

  • Pro: simple to implement, easy to restart on failure.
  • Con: expensive in I/O and memory; must wait for entire result before passing it on.
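A minimal sketch of materialization — the operator names and data are invented; the point is that the first operator writes its entire output to disk before the second one may start:

```python
import csv
import os
import tempfile

# Made-up base data; each operator fully materializes its output.
rows = [("Asha", "CS"), ("Bibek", "EE"), ("Carol", "CS")]

def select_cs(inp, out_path):
    """Operator 1: write the FULL filtered result to a temp file on disk."""
    with open(out_path, "w", newline="") as f:
        w = csv.writer(f)
        for name, dept in inp:
            if dept == "CS":
                w.writerow([name, dept])

def project_name(in_path):
    """Operator 2: may start only after operator 1 has finished."""
    with open(in_path, newline="") as f:
        return [r[0] for r in csv.reader(f)]

tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".csv").name
select_cs(rows, tmp)          # intermediate relation hits disk first
names = project_name(tmp)     # only now can the next operator read it
print(names)                  # ['Asha', 'Carol']
os.remove(tmp)
```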

Pipelining

Operators pass tuples directly to the next operator without materializing the full intermediate result.

  • Demand-driven (pull): the consumer asks for the next tuple (iterator model).
  • Producer-driven (push): the producer fills a shared buffer.
  • Pro: reduces I/O, can produce first output tuple early, lower memory footprint.
  • Con: some operators (sort, hash-join) are blocking — they must see all input before producing output, so pipelining breaks there.
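The demand-driven (pull) iterator model maps naturally onto Python generators — a sketch with invented operators, where each `next()` pulls one tuple through the whole pipeline, and `sorted()` plays the role of a blocking operator:

```python
def scan(rows):
    """Leaf operator: yields base tuples one at a time."""
    yield from rows

def select(child, pred):
    """Pipelined selection: pulls one tuple, emits it if it qualifies."""
    for t in child:
        if pred(t):
            yield t

def project(child, col):
    """Pipelined projection: emits one column per pulled tuple."""
    for t in child:
        yield t[col]

rows = [{"name": "Asha", "gpa": 3.9}, {"name": "Bibek", "gpa": 2.1},
        {"name": "Carol", "gpa": 3.5}]

plan = project(select(scan(rows), lambda t: t["gpa"] > 3.0), "name")
print(next(plan))    # 'Asha' -- first output before the scan has finished

# A blocking operator breaks the pipeline: sorted() must drain its input.
print(sorted(plan))  # ['Carol'] -- consumes everything that remains
```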

5.5 Denormalization vs Materialized Views

| | Denormalization | Materialized View |
|---|---|---|
| What | Intentionally add redundancy to the schema | Precomputed, stored query result |
| When refreshed | Manually maintained via application logic | On demand / on schedule (REFRESH) |
| Data consistency risk | High — must update redundant data everywhere | Moderate — stale until refreshed |
| Use case | Extremely high-read, low-write OLTP | OLAP/reporting dashboards |
| Schema change | Permanent schema alteration | No schema change to base tables |

5.6 Performance Tuning Checklist

  1. Profile first — identify slow queries with EXPLAIN / EXPLAIN ANALYZE.
  2. Add indexes — on WHERE columns, JOIN columns, ORDER BY columns.
  3. Rewrite queries — replace correlated subqueries with joins where possible.
  4. Partition tables — horizontal partitioning for very large tables.
  5. Denormalize selectively — only where read performance is critical.
  6. Update statistics — run ANALYZE / UPDATE STATISTICS.
  7. Buffer pool tuning — increase shared_buffers to reduce disk I/O.

Exam Quick-Reference

  • Pipeline: 4 stages — Parse → Translate → Optimize → Execute.
  • Push selections and projections down toward leaves (heuristic rule 1).
  • Materialization: writes intermediates to disk; pipelining streams tuples directly.
  • Blocking operators (sort, hash-join) break pipelining — they need full input first.
  • Denormalization = schema change; Materialized view = stored result, no schema change.


Solved PYQs

A few past questions with short answers.

Chapter 5 · 8 marks · asked 1× · 2082 Kartik B

What are the steps in query processing? Explain briefly. List out any 3 equivalence rules for relational algebra for query optimization.

Solution

Stages of Query Processing:

1. Parsing: The SQL query is tokenized and parsed against the SQL grammar. Syntax errors are caught here. A parse tree is produced.

2. Translation: The parse tree is converted into a relational algebra expression (or an equivalent internal representation — a query tree).

3. Optimization: The most important stage. The optimizer applies equivalence rules (push selections down, reorder joins, combine projections) and estimates the I/O cost of candidate plans using statistics (table sizes, cardinalities, histogram data). The plan with the lowest estimated cost is selected.

4. Execution: The selected physical execution plan is executed by the query executor. Physical operators (sequential scan, index scan, nested-loop join, hash join, sort-merge join) are invoked.

5. Result delivery: Rows flow from the executor to the client (via pipelining) or are materialized to a buffer first.

Chapter 5 · 8 marks · asked 1× · 2081 Chaitra R

List out and describe the main steps involved in query processing in an RDBMS. Compare cost-based optimization and heuristic optimization.

Solution

Query Optimization:

Query optimization is the process of selecting the most efficient execution plan for a SQL query from among all equivalent execution plans. The optimizer does not change the query's semantics — the result is always the same — but chooses the physical plan that minimizes I/O (and sometimes CPU) cost.

Two Main Approaches:

1. Heuristic (Rule-Based) Optimization: Applies transformation rules without looking at statistics. Fast but not always optimal. Key heuristics:

  • Push selections (σ) as far down the query tree as possible — reduce rows before joining.
  • Push projections (π) down — reduce columns early.
  • Replace Cartesian products + selections with joins.
  • Perform the most selective operations first.
  • Reorder joins to join smallest intermediate results first.

2. Cost-Based Optimization: Enumerates multiple candidate plans, estimates the I/O cost of each using statistical information (table size n_r, distinct values V(A,r), histograms), and selects the plan with the lowest estimated cost.

Cost model: primarily counts block transfers (disk I/Os), since disk I/O dominates query execution time.

Statistics Used:

  • n_r: number of tuples in relation r
  • b_r: number of disk blocks containing r's tuples
  • V(A, r): number of distinct values of attribute A in r
  • Index information: whether an index exists on A, its type (B+ tree vs hash)
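These statistics feed the textbook size-estimation formulas — a sketch under the usual uniform-distribution assumption (the numbers are invented):

```python
def estimate_selection_size(n_r, v_a_r):
    """Estimated output size of sigma_{A=v}(r): n_r / V(A, r),
    assuming A's values are uniformly distributed (the textbook simplification)."""
    return n_r / v_a_r

def estimate_join_size(n_r, n_s, v_a_r, v_a_s):
    """Estimated size of R |><| S on a common attribute A:
    (n_r * n_s) / max(V(A, r), V(A, s))."""
    return (n_r * n_s) / max(v_a_r, v_a_s)

# 10,000 students spread uniformly over 25 departments:
print(estimate_selection_size(10_000, 25))      # 400.0 expected matches
# joined with 500 course rows on dept (25 distinct depts on each side):
print(estimate_join_size(10_000, 500, 25, 25))  # 200000.0 expected tuples
```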

Chapter 5 · 8 marks · asked 1× · 2080 Chaitra R

What are the different steps involved in Query Processing? Explain with a flow diagram. What are the different approaches for Query Optimization? Explain in brief.

Solution

Query Evaluation Methods — Join Algorithms:

The choice of join algorithm is critical for query performance. Different algorithms suit different conditions (data size, available memory, existence of indexes, whether data is sorted).

1. Nested-Loop Join: For each tuple r in R (outer), scan all of S (inner) for matching tuples.

  • Cost: n_r × b_s + b_r block transfers.
  • Worst case when neither relation fits in memory.
  • Simple but slow for large relations.
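A tuple-at-a-time sketch of the algorithm (in-memory lists stand in for disk blocks, so the code shows the loop structure behind the n_r × b_s term, not real I/O):

```python
def nested_loop_join(R, S, attr):
    """For each outer tuple, scan the entire inner relation."""
    out = []
    for r in R:                      # outer relation: read once (b_r)
        for s in S:                  # inner relation: re-scanned per OUTER TUPLE
            if r[attr] == s[attr]:
                out.append({**r, **s})
    return out

R = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Bibek"}]
S = [{"id": 2, "dept": "EE"}, {"id": 1, "dept": "CS"}]
print(nested_loop_join(R, S, "id"))
# [{'id': 1, 'name': 'Asha', 'dept': 'CS'}, {'id': 2, 'name': 'Bibek', 'dept': 'EE'}]
```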

2. Block Nested-Loop Join: Load a block of R, then scan all of S. For each block of R, only one full scan of S is needed.

  • Cost: ⌈b_r / (M-2)⌉ × b_s + b_r (M = available buffer pages).
  • Much better than tuple-at-a-time when buffer is large enough.

3. Indexed Nested-Loop Join: For each tuple r in R (outer), use an index on S to find matching tuples — no full scan of S.

  • Cost: b_r + n_r × (index lookup cost). Best when an index exists on the inner relation's join attribute.

4. Sort-Merge Join: Sort both R and S on the join attribute, then merge in a single pass.

  • Cost: sort cost + b_r + b_s. Excellent when both relations are already sorted or the result needs to be ordered.
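The merge pass can be sketched in memory — both inputs are sorted, then a single coordinated scan pairs up runs of equal keys (relation contents are invented):

```python
def sort_merge_join(R, S, attr):
    """Sort both inputs on the join key, then merge in one pass;
    handles duplicate keys on both sides."""
    R = sorted(R, key=lambda t: t[attr])
    S = sorted(S, key=lambda t: t[attr])
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][attr] < S[j][attr]:
            i += 1
        elif R[i][attr] > S[j][attr]:
            j += 1
        else:
            # gather the run of equal keys on each side, join them pairwise
            k = R[i][attr]
            i2 = i
            while i2 < len(R) and R[i2][attr] == k:
                i2 += 1
            j2 = j
            while j2 < len(S) and S[j2][attr] == k:
                j2 += 1
            out += [{**r, **s} for r in R[i:i2] for s in S[j:j2]]
            i, j = i2, j2
    return out

R = [{"k": 1, "r": "a"}, {"k": 2, "r": "b"}, {"k": 2, "r": "c"}]
S = [{"k": 2, "s": "x"}, {"k": 3, "s": "y"}]
result = sort_merge_join(R, S, "k")
print(result)  # [{'k': 2, 'r': 'b', 's': 'x'}, {'k': 2, 'r': 'c', 's': 'x'}]
```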

5. Hash Join: Build a hash table on the smaller (build) relation; probe with each tuple of the larger (probe) relation.

  • Cost: 3 × (b_r + b_s) in the simple case. Best for large equality joins with no index.
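A minimal in-memory sketch of the build/probe idea (real systems partition both relations to disk when the build side exceeds memory, which is where the 3 × (b_r + b_s) figure comes from):

```python
def hash_join(R, S, attr):
    """Build a hash table on the smaller relation, probe with the larger."""
    build, probe = (R, S) if len(R) <= len(S) else (S, R)
    table = {}
    for t in build:                          # build phase: one pass
        table.setdefault(t[attr], []).append(t)
    out = []
    for t in probe:                          # probe phase: one pass
        for match in table.get(t[attr], []):
            out.append({**match, **t})
    return out

R = [{"id": 1, "name": "Asha"}, {"id": 2, "name": "Bibek"}]
S = [{"id": 2, "dept": "EE"}, {"id": 1, "dept": "CS"}, {"id": 3, "dept": "ME"}]
print(hash_join(R, S, "id"))
# [{'id': 2, 'name': 'Bibek', 'dept': 'EE'}, {'id': 1, 'name': 'Asha', 'dept': 'CS'}]
```

Note that the output comes back in probe-side order, and the unmatched `id = 3` row is simply dropped, as an inner join requires.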

More PYQs

Additional practice from this chapter.