Chapter 5

Query Processing and Optimization

4 hours · 6 marks · 16 past questions

Full Syllabus Outline

Every topic from the syllabus data is preserved here.

5.1
Query processing, optimization and evaluation
DE
Description and significance of query processing and optimization. Stages of query processing: parsing, translation to relational algebra, optimization, evaluation. Block diagram of the query processing pipeline.
5.2
Transformation of relational expressions
DEI
Equivalence rules for transforming relational expressions: selection cascading, selection pushdown, projection cascading, join associativity/commutativity, combining selection and Cartesian product into join. Examples and exercises.
5.3
Techniques of implementing query optimization
DEI
Cost-based optimization: statistics (relation size, distinct values, histograms), cost estimation for selection, join, and sort operations. Heuristic (rule-based) optimization: push selections down, perform projections early, avoid Cartesian products.
5.4
Query evaluation — Materialization and pipelining
DEIDM
Query evaluation and execution plan. Materialization: evaluate and store intermediate results. Pipelining: pass tuples directly between operators without storing. Demand-driven (pull/iterator) pipelining. Producer-driven (push) pipelining. Demonstration of query execution plan using DBMS platform (MS SQL Server / PostgreSQL EXPLAIN).
5.5
Denormalization for performance
DE
Concept and motivation for denormalization. Trade-off between query performance and data redundancy/consistency. Practical examples.
5.6
Materialized view
DEDM
Definition and significance of materialized views. Difference from regular views. Refresh strategies (complete, incremental). Demonstration using PostgreSQL.
5.7
Performance tuning
DEDM
Causes of database performance degradation. Profiling and workload analysis. Index selection. Using performance tuning wizards. Demonstration using MS SQL Server or PostgreSQL.

Chapter Notes

Block-based notes with markdown, diagrams, code, and math.

Chapter 5 — Query Processing and Optimization

5.1 Query Processing Pipeline

When you submit an SQL query, the DBMS processes it through 4 major stages before returning results.

Stage 1: Parser

Checks syntax: Is the SQL grammatically correct?
Checks semantics: Do the referenced tables, columns, and functions actually exist?
Produces a parse tree representing the syntactic structure

Stage 2: Translator

Converts the parse tree into a relational algebra expression (the logical plan)
Specifies what to compute, not how
Example: SELECT name FROM Student WHERE dept='CS' → π_{name}(σ_{dept='CS'}(Student))
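
To make the translation concrete, here is a minimal Python sketch that evaluates the logical plan π_{name}(σ_{dept='CS'}(Student)) over an in-memory list of rows. The helper names `select` and `project` and the sample rows are illustrative assumptions, not part of any DBMS API.

```python
# Toy relational algebra over lists of dicts (illustrative only).

def select(predicate, relation):
    """σ: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(attrs, relation):
    """π: keep only the listed attributes and drop duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

# Hypothetical sample data for the Student relation.
student = [
    {"id": 1, "name": "Asha", "dept": "CS"},
    {"id": 2, "name": "Bikram", "dept": "EE"},
    {"id": 3, "name": "Chandra", "dept": "CS"},
]

# SELECT name FROM Student WHERE dept='CS'
# becomes π_{name}(σ_{dept='CS'}(Student))
result = project(["name"], select(lambda r: r["dept"] == "CS", student))
print(result)
```

The nesting of the calls mirrors the algebra expression: the selection is the inner operation and the projection is the outer one. Note that relational-algebra projection removes duplicates, whereas SQL keeps them unless DISTINCT is given.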

Stage 3: Optimizer (most critical)

Applies equivalence rules to generate alternative, semantically equivalent forms
Uses statistics (table sizes, column selectivity, index availability) to estimate cost
Selects the cheapest physical plan — specifies exact algorithm for each operation (e.g., hash join vs. merge join; index scan vs. sequential scan)
Cost is measured primarily in disk I/O, because a random disk access is several orders of magnitude slower than a RAM access
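
The effect of index availability on the chosen plan can be observed directly. The sketch below uses SQLite as a stand-in, only because it ships with Python; its `EXPLAIN QUERY PLAN` output differs from the EXPLAIN output of PostgreSQL or MS SQL Server, and the table, index name, and data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO Student (name, dept) VALUES (?, ?)",
    [(f"s{i}", "CS" if i % 10 == 0 else "EE") for i in range(1000)],
)

QUERY = "SELECT name FROM Student WHERE dept = 'CS'"

def plan(sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description

plan_without_index = plan(QUERY)
conn.execute("CREATE INDEX idx_student_dept ON Student(dept)")
plan_with_index = plan(QUERY)

print("without index:", plan_without_index)  # expected: a full table scan
print("with index:   ", plan_with_index)     # expected: a search via idx_student_dept
```

The query text is identical in both runs; only the physical plan changes once the optimizer has an index to consider.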

Stage 4: Executor (Evaluation Engine)

Executes the physical plan operator by operator
Coordinates with the buffer manager to fetch/write disk pages
Returns result set to the client

5.2 Equivalence Rules for Query Transformation

Key intuition: push selections (σ) and projections (π) as close to the leaves (base relations) as possible — reduces intermediate result sizes.

Key Equivalence Rules (in each rule, both sides produce the same result):

Rule 1 — Cascade selections (split conjuncts):
  σ_{p1 ∧ p2}(R) ≡ σ_{p1}(σ_{p2}(R))

Rule 2 — Commutativity of selection (order doesn't matter):
  σ_{p1}(σ_{p2}(R)) ≡ σ_{p2}(σ_{p1}(R))

Rule 3 — Cascade projections (outermost dominates):
  π_{L1}(π_{L2}(R)) ≡ π_{L1}(R)     (requires L1 ⊆ L2)

Rule 4 — Push selection through projection:
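  π_{L}(σ_{p}(R)) ≡ π_{L}(σ_{p}(π_{L ∪ attrs(p)}(R)))     (in general; the outer π_{L} drops the extra attributes needed only by p)
  π_{L}(σ_{p}(R)) ≡ σ_{p}(π_{L}(R))     (when attrs(p) ⊆ L)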

Rule 5 — Commutativity of join:
  R ⋈ S ≡ S ⋈ R

Rule 6 — Associativity of join (allows optimizer to choose join ORDER):
  (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T)

Rule 7 — Push selection into join (MOST IMPORTANT for optimization):
  σ_{p}(R ⋈ S) ≡ σ_{p}(R) ⋈ S     (when p references only R's attributes)
  σ_{p}(R ⋈ S) ≡ R ⋈ σ_{p}(S)     (when p references only S's attributes)
  σ_{p1 ∧ p2}(R ⋈ S) ≡ σ_{p1}(R) ⋈ σ_{p2}(S)  (when p1 is about R, p2 about S)

Rule 8 — Push projection into join:
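  π_{L}(R ⋈ S) ≡ π_{L}(π_{L1}(R) ⋈ π_{L2}(S))
  (L1 = attributes of R that appear in L or in the join condition; L2 likewise for S; the outer π_{L} can be dropped when L already contains all join attributes)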

Heuristic optimization strategy:
  Step 1: Push all σ as far DOWN the tree as possible → filter early
  Step 2: Replace Cartesian products + selections with joins
  Step 3: Push π down to eliminate unneeded columns early
  Step 4: Identify and reuse common sub-expressions
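
Rule 7 and Step 1 can be checked on a small example. The following sketch, with hypothetical relations and helper names, evaluates σ_{p}(Student ⋈ Department) and σ_{p}(Student) ⋈ Department and compares both the results and the number of tuples fed into the join.

```python
def natural_join(r, s, attr):
    """Nested-loop join of r and s on one shared attribute."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

def select(pred, rel):
    """σ: keep the rows that satisfy the predicate."""
    return [row for row in rel if pred(row)]

# Hypothetical data: 1,000 students spread over 3 departments.
student = [{"sid": i, "dept": ["CS", "EE", "ME"][i % 3], "year": 1 + i % 4}
           for i in range(1000)]
department = [{"dept": d, "building": b}
              for d, b in [("CS", "A"), ("EE", "B"), ("ME", "C")]]

def is_final_year(row):
    return row["year"] == 4   # p references only Student's attributes

# Plan 1: σ_p(Student ⋈ Department), join first and filter afterwards.
joined = natural_join(student, department, "dept")
plan1 = select(is_final_year, joined)

# Plan 2: σ_p(Student) ⋈ Department, filter first and then join.
filtered = select(is_final_year, student)
plan2 = natural_join(filtered, department, "dept")

print(len(joined), len(filtered))  # size of the intermediate result in each plan
print(plan1 == plan2)              # both plans return the same tuples
```

Both plans return the same 250 tuples, but the first builds a 1,000-tuple intermediate result while the second passes only 250 tuples into the join.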

5.3 Cost-Based vs Heuristic Optimization

Approach	How It Works	Advantages	Disadvantages
Heuristic Optimization	Apply fixed rules ("push σ down") without estimating costs	Fast; no statistics needed; predictable	Suboptimal — may miss the best plan
Cost-Based Optimization	Enumerate many equivalent plans; estimate I/O + CPU for each; choose cheapest	Finds near-optimal plans; adapts to data	Needs up-to-date statistics; exponential search space for many joins
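
To give a feel for why the search space is a problem, the sketch below counts join orders for n relations. It relies on two standard textbook counts, stated here as assumptions of the sketch: n! left-deep orders, and (2(n-1))!/(n-1)! orders when bushy trees are allowed. The function names are made up for illustration.

```python
from math import factorial

def left_deep_orders(n):
    """Join orders in which every join takes a base relation as its right input."""
    return factorial(n)

def all_join_orders(n):
    """All join orders, bushy trees included: (2(n-1))! / (n-1)!"""
    return factorial(2 * (n - 1)) // factorial(n - 1)

for n in (3, 5, 7, 10):
    print(n, left_deep_orders(n), all_join_orders(n))
```

This growth is the reason practical optimizers prune the search, for example with dynamic programming or by first applying heuristics, rather than costing every plan.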

Cost estimation parameters:

n_r = number of tuples in relation r (from catalog)
b_r = number of disk blocks for r
V(A,r) = number of distinct values of attribute A in r (used to estimate selectivity)
Selectivity of σ_{A=v}: estimated as 1/V(A,r) under a uniform-distribution assumption, so the expected result size is n_r / V(A,r)
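
As a worked example of the estimate n_r / V(A,r), the sketch below compares the estimated and actual result sizes of an equality selection on two made-up columns, one uniform and one skewed. The data and the function name `estimate_eq_rows` are assumptions for illustration.

```python
from collections import Counter

def estimate_eq_rows(values):
    """Estimated size of σ_{A=v}: n_r / V(A,r), assuming uniform distribution."""
    n_r = len(values)          # number of tuples
    v_a = len(set(values))     # V(A,r): number of distinct values
    return n_r / v_a

# Uniform column: 10,000 tuples spread evenly over 50 departments.
uniform = [f"D{i % 50}" for i in range(10_000)]
# Skewed column: same size and same number of distinct values,
# but 90% of the tuples belong to 'CS'.
skewed = ["CS"] * 9_000 + [f"D{i % 49}" for i in range(1_000)]

uniform_estimate = estimate_eq_rows(uniform)
uniform_actual = Counter(uniform)["D7"]
skewed_estimate = estimate_eq_rows(skewed)
skewed_actual = Counter(skewed)["CS"]

print(uniform_estimate, uniform_actual)
print(skewed_estimate, skewed_actual)
```

On the uniform column the estimate matches the actual count. On the skewed column the same formula is off by a wide margin, which is why optimizers also keep histograms.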

5.4 Query Evaluation Strategies: Materialization vs Pipelining

Materialization

Each operator reads its full input, computes its full output, and writes it to a temporary relation on disk before the next operator reads it.

Pro: Simple; restartable; any algorithm works for any operator.
Con: Expensive — every intermediate result requires a full disk round-trip.
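
A rough sketch of materialization, assuming JSON-lines temporary files stand in for the temporary relations a DBMS would write; the helper names and data are hypothetical.

```python
import json
import os
import tempfile

def materialize(rows):
    """Write an operator's full output to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    with os.fdopen(fd, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path

def read_back(path):
    """Read a materialized intermediate result from disk."""
    with open(path) as f:
        return [json.loads(line) for line in f]

student = [{"name": f"s{i}", "dept": "CS" if i % 2 else "EE"} for i in range(6)]

# Operator 1 (selection) is fully computed and written out before operator 2 starts.
tmp1 = materialize(r for r in student if r["dept"] == "CS")
# Operator 2 (projection) reads the stored intermediate result.
tmp2 = materialize({"name": r["name"]} for r in read_back(tmp1))

result = read_back(tmp2)
print(result)

for path in (tmp1, tmp2):
    os.remove(path)  # clean up the temporary relations
```

Every operator boundary costs one full write and one full read of the intermediate result, which is the disk round-trip named in the "Con" above.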

Pipelining

Operators pass tuples directly to the next operator without materializing the full intermediate result.

Pipelining Models

Model	How It Works	Also Called
Demand-driven (pull)	Consumer calls "get_next()" on producer; tuples generated on demand	Iterator model; lazy evaluation
Producer-driven (push)	Producer generates tuples as fast as possible into a shared buffer; consumer reads from buffer	Eager evaluation
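
The two models in the table can be sketched in a few lines of Python. In this illustration, generators play the role of iterators with `get_next()`, and a thread with a bounded queue plays the role of the producer and its shared buffer; all names and data are assumptions, not how any particular DBMS implements them.

```python
import queue
import threading

rows = [{"name": f"s{i}", "dept": "CS" if i % 2 == 0 else "EE"} for i in range(6)]

# Demand-driven (pull): each operator is an iterator; next() plays the role
# of get_next(), and a tuple is produced only when the consumer asks for it.
def scan(relation):
    for row in relation:
        yield row

def select_cs(child):
    for row in child:
        if row["dept"] == "CS":
            yield row

pull_plan = select_cs(scan(rows))
first = next(pull_plan)          # only now does the scan start running
pulled = [first] + list(pull_plan)

# Producer-driven (push): the producer runs on its own thread and pushes
# tuples into a bounded buffer; the consumer drains the buffer.
DONE = object()                  # sentinel marking the end of the stream
buffer = queue.Queue(maxsize=2)  # small buffer, so the producer may have to wait

def producer():
    for row in rows:
        if row["dept"] == "CS":
            buffer.put(row)
    buffer.put(DONE)

threading.Thread(target=producer).start()
pushed = []
while (item := buffer.get()) is not DONE:
    pushed.append(item)

print(pulled == pushed)
```

Both variants deliver the same tuples in the same order; they differ only in which side drives the flow.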

Pro: Reduces disk I/O; produces first output tuple early; lower memory footprint.
Con: Some operators are blocking — they must see ALL input before producing ANY output.

Blocking operators (break pipelining):

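Sorting: must see all tuples before it can emit the smallest
Hash join (build phase): must build the complete hash table on the build input, usually the smaller relation, before probing; the probe input can then be streamed
GROUP BY with aggregation: must see every row of a group before emitting its result, which for unsorted input means reading all of the input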
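
The effect of a blocking operator can be measured by counting how many input tuples are read before the first output tuple appears. The sketch below does this for a pipelined selection and for the same plan with a sort on top; the operators, data, and counter are illustrative assumptions.

```python
reads = {"count": 0}  # number of tuples the scan has produced so far

def scan(relation):
    """Leaf operator that counts every tuple it hands out."""
    for row in relation:
        reads["count"] += 1
        yield row

def select(pred, child):
    for row in child:
        if pred(row):
            yield row

def sort(key, child):
    """Blocking: sorted() drains the whole child before anything is yielded."""
    yield from sorted(child, key=key)

marks = [{"sid": i, "score": (i * 37) % 100} for i in range(1000)]

def passed(row):
    return row["score"] >= 40

# Fully pipelined plan: the first result needs only a few input tuples.
reads["count"] = 0
next(select(passed, scan(marks)))
pipelined_reads = reads["count"]

# Same plan with a sort on top: the first result needs the entire input.
reads["count"] = 0
next(sort(lambda r: r["score"], select(passed, scan(marks))))
blocking_reads = reads["count"]

print(pipelined_reads, blocking_reads)
```

With this data the pipelined plan reads 3 tuples before producing its first result, while the plan containing the sort reads all 1,000.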

Exam question: "Describe the pipelining execution strategy for query evaluation."
Answer: Pipelining is a query evaluation strategy in which the operators of a query plan form a pipeline: each operator passes its output tuples directly to the next operator as they are produced, without materializing the full intermediate result on disk. This reduces disk I/O and lets the first output tuple appear early. There are two variants: demand-driven (the consumer pulls tuples via get_next()) and producer-driven (the producer pushes tuples into a buffer). Blocking operators such as sort and the build phase of a hash join break the pipeline because they must consume all of their input before producing any output.

5.5 Denormalization vs Materialized Views

Both techniques sacrifice some data consistency/normalization for performance. This is a frequently tested comparison.

Aspect	Denormalization	Materialized View
What it is	Intentionally introducing redundancy into the base schema	Pre-computing and storing the result of a query
Schema impact	Permanent change to table structure	No change to base table schema
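Freshness	Current only while every write updates all redundant copies (the data lives in the base tables)	Stale until refreshed (e.g., REFRESH MATERIALIZED VIEW in PostgreSQL)
Update overhead	Application (or triggers) must update every redundant copy on each write	Base table writes are unaffected when refresh is on demand; the refresh cost is paid in batches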
Data redundancy	High — same data in multiple places	Moderate — stored result duplicates base data
Motivation	Avoid expensive joins on hot read paths	Avoid recomputing expensive aggregations repeatedly
Typical use	High-traffic OLTP reads	OLAP dashboards, reports

Exam distinction: Denormalization = schema change (adding redundant columns permanently); Materialized view = no schema change to base tables (precomputing a query result into a new storage object).
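
A small sketch of denormalization and its consistency cost, using SQLite from Python as a stand-in DBMS. The `customer` and `orders` tables and the redundant `customer_name` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    -- Denormalized: customer_name duplicates customer.name to avoid a join.
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        customer_name TEXT,
        amount REAL
    );
    INSERT INTO customer VALUES (1, 'Asha');
    INSERT INTO orders VALUES (100, 1, 'Asha', 250.0);
""")

# Fast read path: the report needs no join.
report = conn.execute("SELECT customer_name, amount FROM orders").fetchall()

# A rename must now touch both tables; updating only one leaves them inconsistent.
conn.execute("UPDATE customer SET name = 'Asha K.' WHERE id = 1")
stale = conn.execute("SELECT customer_name FROM orders WHERE id = 100").fetchone()[0]

conn.execute("UPDATE orders SET customer_name = 'Asha K.' WHERE customer_id = 1")
fixed = conn.execute("SELECT customer_name FROM orders WHERE id = 100").fetchone()[0]

print(report, stale, fixed)
```

The read is cheaper because the join is gone, but every write to the duplicated attribute now has to update each copy.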

5.6 Materialized Views and Performance Tuning

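Materialized view (recap): stores the result of a SELECT query physically. It stays current only when refreshed (in PostgreSQL, with REFRESH MATERIALIZED VIEW). Refresh strategies are complete (recompute the whole result) or incremental (apply only the changes since the last refresh).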
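
SQLite, used here because it ships with Python, has no materialized views, so the sketch below only simulates one: a table created from the view query acts as the stored result, and a hypothetical `refresh()` helper performs a complete refresh. In PostgreSQL the corresponding statements would be CREATE MATERIALIZED VIEW and REFRESH MATERIALIZED VIEW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('east', 10), ('east', 20), ('west', 5);
""")

VIEW_QUERY = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

def refresh():
    """Complete refresh: discard the stored result and recompute it."""
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute("CREATE TABLE sales_by_region AS " + VIEW_QUERY)

def read_view():
    """Read the stored result without touching the base table."""
    return dict(conn.execute("SELECT region, total FROM sales_by_region"))

refresh()
conn.execute("INSERT INTO sales VALUES ('west', 100)")
before = read_view()   # still the old totals: the stored copy is stale
refresh()
after = read_view()    # the totals now include the new row
print(before, after)
```

Reads of the stored result never recompute the aggregation; the price is that they see old totals until the next refresh.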

Performance tuning checklist:

Profile first — identify slow queries with EXPLAIN ANALYZE
Add indexes — on WHERE, JOIN, and ORDER BY columns
Rewrite queries — replace correlated subqueries with joins where possible
Partition tables — horizontal partitioning for very large tables
Denormalize selectively — only where read performance is critical
Use materialized views — pre-compute expensive aggregations
Update statistics — run ANALYZE / UPDATE STATISTICS regularly
Tune buffer pool — increase shared_buffers to reduce disk I/O
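
As an illustration of the "rewrite queries" item, the sketch below runs a correlated subquery and an equivalent join against a small SQLite database and checks that they return the same rows. The schema and data are made up; the rewrite is valid here because `dept.name` is a primary key, so each student matches at most one department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (name TEXT PRIMARY KEY, budget INTEGER);
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, dept TEXT);
    INSERT INTO dept VALUES ('CS', 900), ('EE', 300);
    INSERT INTO student VALUES (1, 'Asha', 'CS'), (2, 'Bikram', 'EE'), (3, 'Chandra', 'CS');
""")

# Correlated subquery: the inner query refers to the outer row (s.dept).
correlated = """
    SELECT s.name FROM student s
    WHERE (SELECT d.budget FROM dept d WHERE d.name = s.dept) > 500
    ORDER BY s.id
"""

# Equivalent join: the optimizer is free to pick the join order and algorithm.
joined = """
    SELECT s.name FROM student s
    JOIN dept d ON d.name = s.dept
    WHERE d.budget > 500
    ORDER BY s.id
"""

rows_correlated = conn.execute(correlated).fetchall()
rows_joined = conn.execute(joined).fetchall()
print(rows_correlated == rows_joined, rows_joined)
```

Whether the join form is actually faster depends on the DBMS and the data, so the rewrite should be confirmed by profiling, as the first checklist item suggests.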

Exam Quick-Reference — Chapter 5

4 stages of query processing: Parse (syntax + semantics) → Translate (→ relational algebra) → Optimize (choose cheapest plan) → Execute
Heuristic optimization: push σ down, push π down — no cost statistics needed; may be suboptimal
Cost-based optimization: estimate I/O costs for multiple plans using catalog statistics; choose cheapest
Materialization: each operator writes full result to disk; simple but expensive
Pipelining: operators stream tuples directly; low I/O; fast first result; broken by blocking operators (sort, hash-join)
Denormalization = schema change with redundancy; Materialized view = stored query result, no schema change

Solved PYQs

A few past questions with short answers.

CT30105002 · Chapter 5 · 8 marks · asked 1x · 2082 Kartik B

What are the steps in query processing? Explain briefly. List out any 3 equivalence rules for relational algebra for query optimization.

Solution

Steps: Parse (syntax check) → Translate (to RA tree) → Optimize (apply rules + cost estimate) → Execute (physical operators) → Return results.

3 Equivalence Rules:

Cascade Selections: σ_{p1∧p2}(R) ≡ σ_{p1}(σ_{p2}(R))
Push Selection through Join: σ_{p}(R⋈S) ≡ σ_{p}(R)⋈S (if p uses only R attrs)
Join Commutativity: R⋈S ≡ S⋈R (enables build/probe side selection)

CT30105003 · Chapter 5 · 8 marks · asked 1x · 2081 Chaitra R

List out and describe the main steps involved in query processing in an RDBMS. Compare cost-based optimization and heuristic optimization.

Solution

Steps: Parsing → Translation → Optimization → Execution.

Heuristic: Rule-based. Pushes selections and projections down and reorders joins by fixed rules. Fast, ignores data statistics. Good baseline.
Cost-Based: Estimates I/O and CPU cost using statistics (cardinality, histograms, index selectivity). Evaluates multiple plans and picks the cheapest. Slower, but adapts to the actual data distribution.
Modern optimizers combine both.

CT30105004 · Chapter 5 · 8 marks · asked 1x · 2080 Chaitra R

What are the different steps involved in Query Processing? Explain with a flow diagram. What are the different approaches for Query Optimization? Explain in brief.

Solution

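Steps: Parsing → Translation (to relational algebra) → Optimization → Execution.
Flow: SQL query → Parser → Translator → Optimizer → Executor → Result.

Approaches: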

Heuristic: Applies transformation rules (push σ down) blindly. Fast, rule-driven.
Cost-Based: Uses DB statistics to estimate plan costs (I/O, CPU). Chooses minimum cost plan. Handles data skew better.
Rule+Cost Hybrid: Modern systems apply heuristics first, then cost-evaluate remaining join orders.

More PYQs

Additional practice from this chapter.

Describe the Pipelining execution strategy for query evaluation. Explain the key differences between Denormalization and Materialized Views, focusing on motivation, data redundancy and consistency.
CT30105001 · 5 marks · asked 1x · Model Q
What are the steps in query processing? Explain briefly. List out any 3 equivalence rules for relational algebra for query optimization.
CT30105002 · 8 marks · asked 1x · 2082 Kartik B
List out and describe the main steps involved in query processing in an RDBMS. Compare cost-based optimization and heuristic optimization.
CT30105003 · 8 marks · asked 1x · 2081 Chaitra R
What are the different steps involved in Query Processing? Explain with a flow diagram. What are the different approaches for Query Optimization? Explain in brief.
CT30105004 · 8 marks · asked 1x · 2080 Chaitra R
Describe the entire query processing events to retrieve data from database given a SQL query. Explain the strategies that can be used for query evaluation along with examples.
CT30105005 · 8 marks · asked 1x · 2081 Ashwin B
