DBMS NotesPYQs

Chapter 5

Query Processing and Optimization

Full Syllabus Outline

Every topic from the syllabus data is preserved here.

Chapter Notes

Block-based notes with markdown, diagrams, code, and math.

Chapter 5 — Query Processing and Optimization

5.1 Query Processing Pipeline

When you submit an SQL query, the DBMS processes it through 4 major stages before returning results.

Stage 1: Parser

  • Checks syntax: Is the SQL grammatically correct?
  • Checks semantics: Do the referenced tables, columns, and functions actually exist?
  • Produces a parse tree representing the syntactic structure

Stage 2: Translator

  • Converts the parse tree into a relational algebra expression (the logical plan)
  • Specifies what to compute, not how
  • Example: SELECT name FROM Student WHERE dept='CS' → π_{name}(σ_{dept='CS'}(Student))

Stage 3: Optimizer (most critical)

  • Applies equivalence rules to generate alternative, semantically equivalent forms
  • Uses statistics (table sizes, column selectivity, index availability) to estimate cost
  • Selects the cheapest physical plan — specifies exact algorithm for each operation (e.g., hash join vs. merge join; index scan vs. sequential scan)
  • Cost is measured primarily in disk I/O (disk is ~50,000× slower than RAM)

Stage 4: Executor (Evaluation Engine)

  • Executes the physical plan operator by operator
  • Coordinates with the buffer manager to fetch/write disk pages
  • Returns result set to the client

5.2 Equivalence Rules for Query Transformation

Key intuition: push selections (σ) and projections (π) as close to the leaves (base relations) as possible — reduces intermediate result sizes.

Key Equivalence Rules (all semantically identical):

Rule 1 — Cascade selections (split conjuncts):
  σ_{p1 ∧ p2}(R) ≡ σ_{p1}(σ_{p2}(R))

Rule 2 — Commutativity of selection (order doesn't matter):
  σ_{p1}(σ_{p2}(R)) ≡ σ_{p2}(σ_{p1}(R))

Rule 3 — Cascade projections (outermost dominates):
  π_{L1}(π_{L2}(R)) ≡ π_{L1}(R)     (requires L1 ⊆ L2)

Rule 4 — Push selection through projection:
  π_{L}(σ_{p}(R)) ≡ σ_{p}(π_{L ∪ attrs(p)}(R))

Rule 5 — Commutativity of join:
  R ⋈ S ≡ S ⋈ R

Rule 6 — Associativity of join (allows optimizer to choose join ORDER):
  (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)

Rule 7 — Push selection into join (MOST IMPORTANT for optimization):
  σ_{p}(R ⋈ S) ≡ σ_{p}(R) ⋈ S     (when p references only R's attributes)
  σ_{p}(R ⋈ S) ≡ R ⋈ σ_{p}(S)     (when p references only S's attributes)
  σ_{p1 ∧ p2}(R ⋈ S) ≡ σ_{p1}(R) ⋈ σ_{p2}(S)  (when p1 is about R, p2 about S)

Rule 8 — Push projection into join:
  π_{L}(R ⋈ S) ≡ π_{L1}(R) ⋈ π_{L2}(S)

Heuristic optimization strategy:
  Step 1: Push all σ as far DOWN the tree as possible → filter early
  Step 2: Replace Cartesian products + selections with joins
  Step 3: Push π down to eliminate unneeded columns early
  Step 4: Identify and reuse common sub-expressions

5.3 Cost-Based vs Heuristic Optimization

ApproachHow It WorksAdvantagesDisadvantages
Heuristic OptimizationApply fixed rules ("push σ down") without estimating costsFast; no statistics needed; predictableSuboptimal — may miss the best plan
Cost-Based OptimizationEnumerate many equivalent plans; estimate I/O + CPU for each; choose cheapestFinds near-optimal plans; adapts to dataNeeds up-to-date statistics; exponential search space for many joins

Cost estimation parameters:

  • n_r = number of tuples in relation r (from catalog)
  • b_r = number of disk blocks for r
  • V(A,r) = number of distinct values for attribute A (selectivity)
  • Selectivity of σ_{A=v}: estimated as 1/V(A,r) (uniform distribution assumption)

5.4 Query Evaluation Strategies: Materialization vs Pipelining

Materialization

Each operator reads its full input, computes its full output, and writes it to a temporary relation on disk before the next operator reads it.

Pro: Simple; restartable; any algorithm works for any operator.
Con: Expensive — every intermediate result requires a full disk round-trip.

Pipelining

Operators pass tuples directly to the next operator without materializing the full intermediate result.

Pipelining Models

ModelHow It WorksAlso Called
Demand-driven (pull)Consumer calls "get_next()" on producer; tuples generated on demandIterator model; lazy evaluation
Producer-driven (push)Producer generates tuples as fast as possible into a shared buffer; consumer reads from bufferEager evaluation

Pro: Reduces disk I/O; produces first output tuple early; lower memory footprint.
Con: Some operators are blocking — they must see ALL input before producing ANY output.

Blocking operators (break pipelining):

  • Sorting — must see all tuples to find the smallest
  • Hash-join — must build the complete hash table from the smaller relation first
  • GROUP BY with aggregation — must process all rows in a group

Exam question: "Describe the pipelining execution strategy for query evaluation."
Answer: Pipelining is a query evaluation strategy where operators in a query plan form a pipeline — each operator passes output tuples directly to the next operator as they are produced, without materializing the full intermediate result to disk. This reduces disk I/O and allows the first output tuple to be produced quickly. Two variants: demand-driven (consumer pulls tuples via get_next()) and producer-driven (producer pushes tuples into a buffer). Blocking operators such as sort and hash-join break the pipeline because they must consume all input before producing any output.


5.5 Denormalization vs Materialized Views

Both techniques sacrifice some data consistency/normalization for performance. This is a frequently tested comparison.

AspectDenormalizationMaterialized View
What it isIntentionally introducing redundancy into the base schemaPre-computing and storing the result of a query
Schema impactPermanent change to table structureNo change to base table schema
FreshnessAlways current (data IS in the base table)Stale until refreshed (REFRESH command)
Update overheadApplication must update redundant copies manuallyBase table writes are unaffected; refresh is batched
Data redundancyHigh — same data in multiple placesModerate — stored result duplicates base data
MotivationAvoid expensive joins on hot read pathsAvoid recomputing expensive aggregations repeatedly
Typical useHigh-traffic OLTP readsOLAP dashboards, reports

Exam distinction: Denormalization = schema change (adding redundant columns permanently); Materialized view = no schema change to base tables (precomputing a query result into a new storage object).


5.6 Materialized Views and Performance Tuning

Materialized view (recap): stores the physical result of a SELECT query. Must be REFRESHed to stay current.

Performance tuning checklist:

  1. Profile first — identify slow queries with EXPLAIN ANALYZE
  2. Add indexes — on WHERE, JOIN, and ORDER BY columns
  3. Rewrite queries — replace correlated subqueries with joins where possible
  4. Partition tables — horizontal partitioning for very large tables
  5. Denormalize selectively — only where read performance is critical
  6. Use materialized views — pre-compute expensive aggregations
  7. Update statistics — run ANALYZE / UPDATE STATISTICS regularly
  8. Tune buffer pool — increase shared_buffers to reduce disk I/O

Exam Quick-Reference — Chapter 5

  • 4 stages of query processing: Parse (syntax + semantics) → Translate (→ relational algebra) → Optimize (choose cheapest plan) → Execute
  • Heuristic optimization: push σ down, push π down — no cost statistics needed; may be suboptimal
  • Cost-based optimization: estimate I/O costs for multiple plans using catalog statistics; choose cheapest
  • Materialization: each operator writes full result to disk; simple but expensive
  • Pipelining: operators stream tuples directly; low I/O; fast first result; broken by blocking operators (sort, hash-join)
  • Denormalization = schema change with redundancy; Materialized view = stored query result, no schema change

Solved PYQs

A few past questions with short answers.

CT30105002Chapter 58 marksasked 1x2082 Kartik B

What are the steps in query processing? Explain briefly. List out any 3 equivalence rules for relational algebra for query optimization.

Solution

Steps: Parse (syntax check) → Translate (to RA tree) → Optimize (apply rules + cost estimate) → Execute (physical operators) → Return results.

3 Equivalence Rules:

  1. Cascade Selections: σ_{p1∧p2}(R) ≡ σ_{p1}(σ_{p2}(R))
  2. Push Selection through Join: σ_{p}(R⋈S) ≡ σ_{p}(R)⋈S (if p uses only R attrs)
  3. Join Commutativity: R⋈S ≡ S⋈R (enables build/probe side selection)
CT30105003Chapter 58 marksasked 1x2081 Chaitra R

List out and describe the main steps involved in query processing in an RDBMS. Compare cost-based optimization and heuristic optimization.

Solution

Steps: Parsing → Translation → Optimization → Execution.

Heuristic: Rule-based. Push selections/projections down, reorder joins. Fast, ignores data stats. Good baseline. Cost-Based: Estimates I/O/CPU using stats (cardinality, histograms, index selectivity). Evaluates multiple plans, picks cheapest. Slower but adapts to actual data distribution. Modern optimizers combine both.

CT30105004Chapter 58 marksasked 1x2080 Chaitra R

What are the different steps involved in Query Processing? Explain with a flow diagram. What are the different approaches for Query Optimization? Explain in brief.

Solution

Approaches:

  1. Heuristic: Applies transformation rules (push σ down) blindly. Fast, rule-driven.
  2. Cost-Based: Uses DB statistics to estimate plan costs (I/O, CPU). Chooses minimum cost plan. Handles data skew better.
  3. Rule+Cost Hybrid: Modern systems apply heuristics first, then cost-evaluate remaining join orders.

More PYQs

Additional practice from this chapter.