DBMS NotesPYQs

Chapter 9

Advanced Database Concepts

Full Syllabus Outline

Every topic from the syllabus data is preserved here.

Chapter Notes

Block-based notes with markdown, diagrams, code, and math.

Chapter 9 — Advanced Database Concepts

9.1 Object-Oriented and Object-Relational Databases

Why Object-Oriented Databases?

Traditional relational databases struggle with:

  • Complex nested objects: a document with embedded sections, a product with variant sub-products
  • Methods on data: behavior associated with data (not just raw attribute values)
  • Inheritance hierarchies: "an Employee IS-A Person" — hard to express cleanly in flat tables
  • Object identity: two objects with identical attribute values may still be distinct entities

Core OO Concepts Applied to Databases

OO ConceptMeaning in Database ContextExample
Object identityEach object has a unique OID independent of its attribute valuesTwo Person objects with same name and age are still distinct objects
EncapsulationObject's internal state is hidden; accessed only through defined methodsgetAge() method instead of direct column access
InheritanceA subtype inherits all attributes and methods of its supertypeEmployee inherits name, age from Person; adds salary
PolymorphismSame method name behaves differently for different object typesdescribe() for Employee shows job title; for Student shows GPA

Object-Relational Model (SQL:1999 and later)

Extends the relational model with OO features while maintaining SQL compatibility:

  • User-defined types (UDTs): composite types (structured types), distinct types
  • Collection types: arrays, multisets (bags)
  • Table inheritance: a table can inherit from another table
  • Methods: define methods on types, callable from SQL
-- ── OBJECT-RELATIONAL SQL EXAMPLES (PostgreSQL) ──────────────────────────

-- User-defined composite type (structured / nested type)
CREATE TYPE address_t AS (
  street   VARCHAR(200),
  city     VARCHAR(100),
  district VARCHAR(50),
  zipcode  VARCHAR(10)
);

-- Use composite type as a column (avoids a separate Address table + join)
CREATE TABLE Customer (
  cid     INT PRIMARY KEY,
  name    VARCHAR(100),
  address address_t          -- composite attribute — nested type
);

-- Access nested attributes with dot notation
SELECT cid, name, (address).city, (address).zipcode
FROM   Customer
WHERE  (address).district = 'Lalitpur';

-- Array type (multivalued attribute stored inline)
CREATE TABLE Student_OR (
  sid    INT PRIMARY KEY,
  name   VARCHAR(100),
  phones VARCHAR(20)[]       -- array of phone numbers
);
INSERT INTO Student_OR VALUES (1, 'Alice', ARRAY['9800000001','9800000002']);
SELECT sid, name, phones[1] AS primary_phone,
       array_length(phones, 1) AS total_phones
FROM Student_OR;

-- TABLE INHERITANCE (PostgreSQL extension — Employee IS-A Person)
CREATE TABLE Person (
  pid     INT PRIMARY KEY,
  name    VARCHAR(100) NOT NULL,
  age     INT
);

CREATE TABLE Employee_OO (
  emp_id  VARCHAR(20) NOT NULL,
  salary  DECIMAL(10,2),
  dept    VARCHAR(50)
) INHERITS (Person);   -- inherits pid, name, age from Person

-- Querying:
-- SELECT * FROM Person          → returns BOTH Person-only AND Employee_OO rows
-- SELECT * FROM ONLY Person     → returns ONLY Person-only rows (ONLY keyword)
-- SELECT * FROM ONLY Employee_OO → returns only Employee_OO rows

Object-Relational Mapping (ORM)

An ORM bridges the gap between application-level objects and a relational database:

  • Developer defines classes (e.g., Student with fields name, gpa, courses)
  • ORM automatically generates SQL for INSERT, SELECT, UPDATE, DELETE
  • Developer works entirely with objects; never writes raw SQL
ORM FrameworkLanguage
HibernateJava
SQLAlchemyPython
Prisma, TypeORMTypeScript / Node.js
ActiveRecordRuby on Rails
Entity FrameworkC# / .NET

Advantages: Reduces SQL boilerplate; type-safe queries; easier unit testing
Disadvantages: Can generate inefficient SQL; N+1 query problem (fetching a list of 100 students then issuing 100 separate queries for each student's courses = 101 queries instead of 1 JOIN)

Persistence in OO Databases

Persistence refers to an object surviving beyond the lifetime of the process that created it. In OO databases, objects are stored in the database and retrieved as fully-formed objects. In relational databases with ORM, persistence is achieved by serializing objects to rows.


9.2 Distributed Databases

A distributed database stores and manages data across multiple geographically separate sites connected by a network, but appears as a single unified database to users and applications.

Key design goals:

  • Location transparency: the user does not need to know where data is physically stored
  • Replication transparency: the user does not need to know how many copies exist
  • Fragmentation transparency: the user does not need to know how data is split across sites

Data Fragmentation Strategies

Fragmentation distributes parts of a relation across multiple sites to reduce data transfer between sites for common queries.

Fragmentation in Detail

Horizontal FragmentationVertical Fragmentation
What is distributedRows (subsets by predicate)Columns (subsets by access pattern)
How to reconstruct RUNION of all fragmentsNATURAL JOIN (or JOIN on PK) of all fragments
Useful whenDifferent sites serve different customer regionsDifferent departments need different columns (privacy/security)
ExampleNepal customers at Site 1; India customers at Site 2Personal info (name, address) at Site 1; Financial info (salary, balance) at Site 2

Correctness requirements for fragmentation:

  1. Completeness: every record of R appears in at least one fragment
  2. Reconstruction: R can be fully reconstructed from its fragments
  3. Disjointness: for horizontal, each record appears in exactly one fragment; for vertical, only the PK may repeat

Replication Strategies

Replication stores copies of data at multiple sites for availability and read performance.

TypeDescriptionKey AdvantageKey Disadvantage
Full replicationComplete database copy at every siteMaximum availability; all reads localHigh update cost (must update all copies)
Partial replicationSome fragments replicated at some sitesBalanced cost and availabilityMore complex to manage
No replicationEach fragment on exactly one siteNo redundancy overheadSingle point of failure per fragment

Exam answer — "Describe two primary strategies for distributing data across multiple physical locations, stating a key advantage of each."

Data Replication: maintains full or partial copies of the database at multiple sites. Key advantage: high availability — if one site fails, queries can be served from another site without interruption.

Data Fragmentation: distributes different parts of the data (by rows or columns) to different sites. Key advantage: reduced data transfer — a site can answer queries about its local data without needing to access remote sites, minimizing network communication.

Homogeneous vs Heterogeneous Distributed Databases

HomogeneousHeterogeneous
DBMS softwareSame DBMS at all sites (e.g., all PostgreSQL)Different DBMS at different sites (Oracle at Site 1, MySQL at Site 2)
SchemaIdentical or similarMay differ significantly
Management complexityLower — uniform query language and interfaceHigher — requires schema mapping, query translation, data type conversion
Common scenarioA company's branches all running the same systemTwo companies merging their databases after an acquisition

9.3 Data Warehousing and OLAP

OLTP vs OLAP

CharacteristicOLTP (Online Transaction Processing)OLAP (Online Analytical Processing)
Primary purposeDay-to-day operational transactionsBusiness intelligence, analytics, reporting
Query typeSimple: single-record INSERT/UPDATE/SELECTComplex: aggregates, multi-table joins, GROUP BY across millions of rows
Query frequencyVery high (thousands per second)Low (a few per minute or hour)
Data currencyCurrent, up-to-dateHistorical (months to years)
Data volumeGB rangeTB to PB range
Database designNormalized (3NF) to avoid redundancyDenormalized (star/snowflake schema) for fast reads
Typical usersClerks, customers, automated systemsBusiness analysts, executives, data scientists
ExampleBank transfer, e-commerce checkoutQuarterly sales report, customer churn analysis

ETL Process (Extract, Transform, Load)

PhaseDescriptionExample Operations
ExtractPull raw data from OLTP systems, files, APIs, external sourcesDatabase queries, CSV parsing, REST API calls
TransformClean, integrate, and reshape the data to fit warehouse schemaDeduplication, null handling, unit conversion, joining multiple sources, applying business rules
LoadWrite the transformed data into the data warehouseBulk INSERT, incremental delta loads, slowly changing dimension updates

Warehouse Schemas

Star Schema: Central fact table surrounded by denormalized dimension tables.

  • Simpler joins (fewer joins needed), faster queries
  • Some redundancy in dimension tables
              DimTime                      DimProduct
         (time_id, date, year,         (prod_id, name, category,
          quarter, month)               subcategory, price)
                 |                              |
                 └─────────── FactSales ────────┘
                              (time_id FK,
                               prod_id FK,
                               region_id FK,
                               cust_id FK,
                               sales_amount,
                               quantity_sold)
                                    |
                               DimRegion
                             (region_id, city,
                              country, zone)

Snowflake Schema: Fact table + normalized dimension tables. Each dimension table may join to sub-dimension tables.

  • Reduces redundancy in dimensions
  • Requires more JOINs → slightly slower queries

OLAP Operations on a Data Cube

A data cube is the multidimensional view of data in a warehouse. OLAP operations navigate this cube:

OperationDefinitionAnalogy / Example
Roll-up (aggregation)Move UP the dimension hierarchy — summarize dataMonthly sales → Quarterly sales → Annual sales
Drill-downMove DOWN the hierarchy — increase detail levelAnnual → Monthly → Daily
SliceFix one dimension at a specific value; result is a 2D cross-sectionAll sales in Q1 2024 only (fix time dimension to Q1 2024)
DiceFix two or more dimensions simultaneously; result is a smaller sub-cubeSales in Q1 2024 in Nepal only (fix time AND region)
Pivot (rotate)Rotate the data cube — swap the row and column axesRows = time, Columns = product → flip to Rows = product, Columns = time

9.4 NoSQL and Big Data

Why NoSQL?

Relational databases struggle at massive web scale due to:

  • Horizontal scalability limits: hard to shard (split) a relational DB across thousands of commodity nodes
  • Schema rigidity: every row must conform to a fixed schema; adding fields requires ALTER TABLE on a billion-row table
  • Semi-structured / unstructured data: JSON documents, graphs, time-series don't fit neatly into tables
  • Write-heavy workloads at massive scale: relational ACID can be a throughput bottleneck

NoSQL databases trade some of relational model's guarantees (e.g., full ACID, strict schema) for scalability and flexibility.

NoSQL Database Categories

CategoryData ModelKey PropertyUse CasesExamples
Key-ValueSimple map: key → value blobO(1) get/set by key; horizontally scalableCaching, session storage, shopping cartsRedis, DynamoDB, Memcached
DocumentCollections of JSON/BSON documentsFlexible schema; rich nested queriesContent management, catalogs, user profilesMongoDB, CouchDB, Firestore
Column-Family (Wide Column)Rows with dynamic column familiesFast writes; good for time-series dataIoT telemetry, logs, time-series, analyticsCassandra, HBase, Bigtable
GraphNodes + directed edges + propertiesEfficient multi-hop graph traversalSocial networks, recommendations, fraud detectionNeo4j, Amazon Neptune, JanusGraph

Big Data — The 5 V's

VFull NameDescriptionExample
VolumeScale of dataTB to PB of data generated continuously4 PB/day at Facebook
VelocitySpeed of data generationReal-time streams and sensor data500 M tweets/day on Twitter
VarietyDiversity of data formatsStructured, semi-structured, unstructuredRelational tables + JSON logs + images + video
VeracityTrustworthiness and qualityNoisy, incomplete, inconsistent raw dataSocial media has bots, typos, spam
ValueBusiness value extractedTurning raw big data into actionable insightsCustomer churn prediction, fraud detection

Hadoop Ecosystem

Hadoop is an open-source framework for distributed storage and processing of very large datasets on commodity hardware.

ComponentFunctionKey Idea
HDFSHadoop Distributed File System — distributed storageFiles split into 128 MB blocks; each block replicated 3× across different nodes by default
MapReduceParallel batch processing frameworkMap phase: filter/transform each record independently; Reduce phase: aggregate the mapped results
YARNYet Another Resource Negotiator — cluster resource managerAllocates CPU and RAM to jobs across the cluster
HiveSQL-like query language (HiveQL) on HDFSTranslates SQL queries into MapReduce or Tez jobs for batch analysis
SparkIn-memory distributed processing10–100× faster than MapReduce; supports SQL, streaming, machine learning, graph processing
HBaseColumn-family NoSQL database on top of HDFSRandom read/write access to huge tables (HDFS itself is write-once)

MapReduce Framework in Detail

Map phase: Input data is partitioned into chunks (splits). Each Map task processes one split independently, emitting (key, value) pairs.

Shuffle and Sort phase: The framework collects all (key, value) pairs with the same key and groups them together. Fully automatic.

Reduce phase: Each Reduce task receives all values for a given key and aggregates them into a final result.

Classic example — Word Count:

  • Map: for each word in a document → emit (word, 1)
  • Shuffle: group all ("hello", 1) pairs together, all ("world", 1) pairs together, etc.
  • Reduce: for each word → sum all the 1s → emit (word, total_count)

Exam Quick-Reference — Chapter 9

Distributed Databases:

  • Horizontal fragmentation = rows partitioned by predicate → reconstruct with UNION
  • Vertical fragmentation = columns partitioned → reconstruct with JOIN on PK
  • Full replication = maximum availability; No replication = no redundancy overhead
  • Homogeneous = same DBMS everywhere; Heterogeneous = different DBMS, needs schema mapping

Data Warehousing:

  • ETL = Extract (from OLTP) → Transform (clean, integrate) → Load (into warehouse)
  • OLTP: current, normalized, high-frequency simple transactions
  • OLAP: historical, denormalized, low-frequency complex analytical queries
  • OLAP operations: Roll-up (aggregate up hierarchy); Drill-down (more detail); Slice (fix 1 dimension); Dice (fix 2+ dimensions); Pivot (rotate axes)
  • Star schema: fact + denormalized dimensions; Snowflake schema: fact + normalized dimensions (more joins)

NoSQL and Big Data:

  • NoSQL types: Key-Value (Redis), Document (MongoDB), Column-family (Cassandra), Graph (Neo4j)
  • Big Data 5 Vs: Volume, Velocity, Variety, Veracity, Value
  • HDFS: 128 MB blocks, 3× replication across nodes
  • MapReduce: Map (transform per record) + Shuffle (group by key) + Reduce (aggregate)
  • Spark = in-memory alternative to MapReduce; much faster for iterative algorithms

Solved PYQs

A few past questions with short answers.

CT30109002Chapter 96 marksasked 3x2082 Kartik B, 2081 Chaitra R, 2080 Ashwin B

Write short notes on: (a) Distributed databases (b) Data warehouses.

Solution

(a) Distributed DB: Single logical DB spread across sites. Supports OLTP. Focus: transparency, distributed transactions (2PC), consistency. (b) Data Warehouse: Central historical repository from multiple sources. Supports OLAP. ETL loaded. Star schema. Read-heavy, analytical, non-volatile.

CT30109003Chapter 96 marksasked 2x2080 Chaitra R

Write short notes on: (a) Parallel databases (b) Data warehouse.

Solution

(a) Parallel DB: Multiple CPUs/disks execute queries concurrently. Architectures: Shared-Memory, Shared-Disk, Shared-Nothing (MPP). Speeds intra/inter-query processing. (b) Data Warehouse: Integrated historical data for BI. ETL pipeline. Denormalized schemas. OLAP ops (roll-up, drill-down). MPP hardware common.

CT30109005Chapter 96 marksasked 2x2081 Ashwin B

Write short notes on: (a) Data warehouse database and its application (b) Parallel database architecture.

Solution

(a) DW & Apps: Central historical data for decision support. ETL. Star/Snowflake. Apps: Sales trends, forecasting, customer segmentation, BI dashboards. (b) Parallel Arch: Shared-Memory (fast, limited), Shared-Disk (moderate, needs lock mgr), Shared-Nothing/MPP (best scalability, data partitioned, Redshift/Snowflake).

More PYQs

Additional practice from this chapter.