Chapter 9

Advanced Database Concepts

4 hours4 marks15 past questionsBack to chapters Open filtered PYQs Previous: Ch 8

Full Syllabus Outline

Every topic from the syllabus data is preserved here.

9.1
Concept of object-oriented databases
DE
Object-based database concepts: encapsulation, inheritance, polymorphism, object identity. Object-relational data model. Complex data types: collections, nested relations. Persistence. Object-relational mapping (ORM).
9.2
Distributed database model
DE
Data fragmentation: horizontal (selection-based) and vertical (projection-based). Data replication: full and partial. Homogeneous vs heterogeneous distributed database systems. Basic concepts of distributed file systems.
9.3
Concept of data warehousing and online analytical processing
DE
Data warehouse architecture: operational DBs → ETL → data warehouse → OLAP tools. ETL operations (Extract, Transform, Load). Multidimensional data model. Warehouse schemas: star schema, snowflake schema. OLAP operations: roll-up, drill-down, slice, dice, pivot. Introduction to OLAP.
9.4
Basic concepts of NoSQL and big data
DEDM
Big Data characteristics: volume, velocity, variety. Big Data storage systems. NoSQL categories: key-value (Redis), document (MongoDB), column-family (Cassandra), graph (Neo4j). Hadoop distributed file system (HDFS). MapReduce paradigm. Demonstration with MongoDB.

Chapter Notes

Block-based notes with markdown, diagrams, code, and math.

Chapter 9 — Advanced Database Concepts

9.1 Object-Oriented and Object-Relational Databases

Why Object-Oriented Databases?

Traditional relational databases struggle with:

Complex nested objects: a document with embedded sections, a product with variant sub-products
Methods on data: behavior associated with data (not just raw attribute values)
Inheritance hierarchies: "an Employee IS-A Person" — hard to express cleanly in flat tables
Object identity: two objects with identical attribute values may still be distinct entities

Core OO Concepts Applied to Databases

OO Concept	Meaning in Database Context	Example
Object identity	Each object has a unique OID independent of its attribute values	Two Person objects with same name and age are still distinct objects
Encapsulation	Object's internal state is hidden; accessed only through defined methods	`getAge()` method instead of direct column access
Inheritance	A subtype inherits all attributes and methods of its supertype	`Employee` inherits `name`, `age` from `Person`; adds `salary`
Polymorphism	Same method name behaves differently for different object types	`describe()` for Employee shows job title; for Student shows GPA

Object-Relational Model (SQL:1999 and later)

Extends the relational model with OO features while maintaining SQL compatibility:

User-defined types (UDTs): composite types (structured types), distinct types
Collection types: arrays, multisets (bags)
Table inheritance: a table can inherit from another table
Methods: define methods on types, callable from SQL

-- ── OBJECT-RELATIONAL SQL EXAMPLES (PostgreSQL) ──────────────────────────

-- User-defined composite type (structured / nested type)
CREATE TYPE address_t AS (
  street   VARCHAR(200),
  city     VARCHAR(100),
  district VARCHAR(50),
  zipcode  VARCHAR(10)
);

-- Use composite type as a column (avoids a separate Address table + join)
CREATE TABLE Customer (
  cid     INT PRIMARY KEY,
  name    VARCHAR(100),
  address address_t          -- composite attribute — nested type
);

-- Access nested attributes with dot notation
SELECT cid, name, (address).city, (address).zipcode
FROM   Customer
WHERE  (address).district = 'Lalitpur';

-- Array type (multivalued attribute stored inline)
CREATE TABLE Student_OR (
  sid    INT PRIMARY KEY,
  name   VARCHAR(100),
  phones VARCHAR(20)[]       -- array of phone numbers
);
INSERT INTO Student_OR VALUES (1, 'Alice', ARRAY['9800000001','9800000002']);
SELECT sid, name, phones[1] AS primary_phone,
       array_length(phones, 1) AS total_phones
FROM Student_OR;

-- TABLE INHERITANCE (PostgreSQL extension — Employee IS-A Person)
CREATE TABLE Person (
  pid     INT PRIMARY KEY,
  name    VARCHAR(100) NOT NULL,
  age     INT
);

CREATE TABLE Employee_OO (
  emp_id  VARCHAR(20) NOT NULL,
  salary  DECIMAL(10,2),
  dept    VARCHAR(50)
) INHERITS (Person);   -- inherits pid, name, age from Person

-- Querying:
-- SELECT * FROM Person          → returns BOTH Person-only AND Employee_OO rows
-- SELECT * FROM ONLY Person     → returns ONLY Person-only rows (ONLY keyword)
-- SELECT * FROM ONLY Employee_OO → returns only Employee_OO rows

Object-Relational Mapping (ORM)

An ORM bridges the gap between application-level objects and a relational database:

Developer defines classes (e.g., Student with fields name, gpa, courses)
ORM automatically generates SQL for INSERT, SELECT, UPDATE, DELETE
Developer works entirely with objects; never writes raw SQL

ORM Framework	Language
Hibernate	Java
SQLAlchemy	Python
Prisma, TypeORM	TypeScript / Node.js
ActiveRecord	Ruby on Rails
Entity Framework	C# / .NET

Advantages: Reduces SQL boilerplate; type-safe queries; easier unit testing
Disadvantages: Can generate inefficient SQL; N+1 query problem (fetching a list of 100 students then issuing 100 separate queries for each student's courses = 101 queries instead of 1 JOIN)

Persistence in OO Databases

Persistence refers to an object surviving beyond the lifetime of the process that created it. In OO databases, objects are stored in the database and retrieved as fully-formed objects. In relational databases with ORM, persistence is achieved by serializing objects to rows.

9.2 Distributed Databases

A distributed database stores and manages data across multiple geographically separate sites connected by a network, but appears as a single unified database to users and applications.

Key design goals:

Location transparency: the user does not need to know where data is physically stored
Replication transparency: the user does not need to know how many copies exist
Fragmentation transparency: the user does not need to know how data is split across sites

Data Fragmentation Strategies

Fragmentation distributes parts of a relation across multiple sites to reduce data transfer between sites for common queries.

Fragmentation in Detail

	Horizontal Fragmentation	Vertical Fragmentation
What is distributed	Rows (subsets by predicate)	Columns (subsets by access pattern)
How to reconstruct R	UNION of all fragments	NATURAL JOIN (or JOIN on PK) of all fragments
Useful when	Different sites serve different customer regions	Different departments need different columns (privacy/security)
Example	Nepal customers at Site 1; India customers at Site 2	Personal info (name, address) at Site 1; Financial info (salary, balance) at Site 2

Correctness requirements for fragmentation:

Completeness: every record of R appears in at least one fragment
Reconstruction: R can be fully reconstructed from its fragments
Disjointness: for horizontal, each record appears in exactly one fragment; for vertical, only the PK may repeat

Replication Strategies

Replication stores copies of data at multiple sites for availability and read performance.

Type	Description	Key Advantage	Key Disadvantage
Full replication	Complete database copy at every site	Maximum availability; all reads local	High update cost (must update all copies)
Partial replication	Some fragments replicated at some sites	Balanced cost and availability	More complex to manage
No replication	Each fragment on exactly one site	No redundancy overhead	Single point of failure per fragment

Exam answer — "Describe two primary strategies for distributing data across multiple physical locations, stating a key advantage of each."

Data Replication: maintains full or partial copies of the database at multiple sites. Key advantage: high availability — if one site fails, queries can be served from another site without interruption.

Data Fragmentation: distributes different parts of the data (by rows or columns) to different sites. Key advantage: reduced data transfer — a site can answer queries about its local data without needing to access remote sites, minimizing network communication.

Homogeneous vs Heterogeneous Distributed Databases

	Homogeneous	Heterogeneous
DBMS software	Same DBMS at all sites (e.g., all PostgreSQL)	Different DBMS at different sites (Oracle at Site 1, MySQL at Site 2)
Schema	Identical or similar	May differ significantly
Management complexity	Lower — uniform query language and interface	Higher — requires schema mapping, query translation, data type conversion
Common scenario	A company's branches all running the same system	Two companies merging their databases after an acquisition

9.3 Data Warehousing and OLAP

OLTP vs OLAP

Characteristic	OLTP (Online Transaction Processing)	OLAP (Online Analytical Processing)
Primary purpose	Day-to-day operational transactions	Business intelligence, analytics, reporting
Query type	Simple: single-record INSERT/UPDATE/SELECT	Complex: aggregates, multi-table joins, GROUP BY across millions of rows
Query frequency	Very high (thousands per second)	Low (a few per minute or hour)
Data currency	Current, up-to-date	Historical (months to years)
Data volume	GB range	TB to PB range
Database design	Normalized (3NF) to avoid redundancy	Denormalized (star/snowflake schema) for fast reads
Typical users	Clerks, customers, automated systems	Business analysts, executives, data scientists
Example	Bank transfer, e-commerce checkout	Quarterly sales report, customer churn analysis

ETL Process (Extract, Transform, Load)

Phase	Description	Example Operations
Extract	Pull raw data from OLTP systems, files, APIs, external sources	Database queries, CSV parsing, REST API calls
Transform	Clean, integrate, and reshape the data to fit warehouse schema	Deduplication, null handling, unit conversion, joining multiple sources, applying business rules
Load	Write the transformed data into the data warehouse	Bulk INSERT, incremental delta loads, slowly changing dimension updates

Warehouse Schemas

Star Schema: Central fact table surrounded by denormalized dimension tables.

Simpler joins (fewer joins needed), faster queries
Some redundancy in dimension tables

              DimTime                      DimProduct
         (time_id, date, year,         (prod_id, name, category,
          quarter, month)               subcategory, price)
                 |                              |
                 └─────────── FactSales ────────┘
                              (time_id FK,
                               prod_id FK,
                               region_id FK,
                               cust_id FK,
                               sales_amount,
                               quantity_sold)
                                    |
                               DimRegion
                             (region_id, city,
                              country, zone)

Snowflake Schema: Fact table + normalized dimension tables. Each dimension table may join to sub-dimension tables.

Reduces redundancy in dimensions
Requires more JOINs → slightly slower queries

OLAP Operations on a Data Cube

A data cube is the multidimensional view of data in a warehouse. OLAP operations navigate this cube:

Operation	Definition	Analogy / Example
Roll-up (aggregation)	Move UP the dimension hierarchy — summarize data	Monthly sales → Quarterly sales → Annual sales
Drill-down	Move DOWN the hierarchy — increase detail level	Annual → Monthly → Daily
Slice	Fix one dimension at a specific value; result is a 2D cross-section	All sales in Q1 2024 only (fix time dimension to Q1 2024)
Dice	Fix two or more dimensions simultaneously; result is a smaller sub-cube	Sales in Q1 2024 in Nepal only (fix time AND region)
Pivot (rotate)	Rotate the data cube — swap the row and column axes	Rows = time, Columns = product → flip to Rows = product, Columns = time

9.4 NoSQL and Big Data

Why NoSQL?

Relational databases struggle at massive web scale due to:

Horizontal scalability limits: hard to shard (split) a relational DB across thousands of commodity nodes
Schema rigidity: every row must conform to a fixed schema; adding fields requires ALTER TABLE on a billion-row table
Semi-structured / unstructured data: JSON documents, graphs, time-series don't fit neatly into tables
Write-heavy workloads at massive scale: relational ACID can be a throughput bottleneck

NoSQL databases trade some of relational model's guarantees (e.g., full ACID, strict schema) for scalability and flexibility.

NoSQL Database Categories

Category	Data Model	Key Property	Use Cases	Examples
Key-Value	Simple map: key → value blob	O(1) get/set by key; horizontally scalable	Caching, session storage, shopping carts	Redis, DynamoDB, Memcached
Document	Collections of JSON/BSON documents	Flexible schema; rich nested queries	Content management, catalogs, user profiles	MongoDB, CouchDB, Firestore
Column-Family (Wide Column)	Rows with dynamic column families	Fast writes; good for time-series data	IoT telemetry, logs, time-series, analytics	Cassandra, HBase, Bigtable
Graph	Nodes + directed edges + properties	Efficient multi-hop graph traversal	Social networks, recommendations, fraud detection	Neo4j, Amazon Neptune, JanusGraph

Big Data — The 5 V's

V	Full Name	Description	Example
Volume	Scale of data	TB to PB of data generated continuously	4 PB/day at Facebook
Velocity	Speed of data generation	Real-time streams and sensor data	500 M tweets/day on Twitter
Variety	Diversity of data formats	Structured, semi-structured, unstructured	Relational tables + JSON logs + images + video
Veracity	Trustworthiness and quality	Noisy, incomplete, inconsistent raw data	Social media has bots, typos, spam
Value	Business value extracted	Turning raw big data into actionable insights	Customer churn prediction, fraud detection

Hadoop Ecosystem

Hadoop is an open-source framework for distributed storage and processing of very large datasets on commodity hardware.

Component	Function	Key Idea
HDFS	Hadoop Distributed File System — distributed storage	Files split into 128 MB blocks; each block replicated 3× across different nodes by default
MapReduce	Parallel batch processing framework	Map phase: filter/transform each record independently; Reduce phase: aggregate the mapped results
YARN	Yet Another Resource Negotiator — cluster resource manager	Allocates CPU and RAM to jobs across the cluster
Hive	SQL-like query language (HiveQL) on HDFS	Translates SQL queries into MapReduce or Tez jobs for batch analysis
Spark	In-memory distributed processing	10–100× faster than MapReduce; supports SQL, streaming, machine learning, graph processing
HBase	Column-family NoSQL database on top of HDFS	Random read/write access to huge tables (HDFS itself is write-once)

MapReduce Framework in Detail

Map phase: Input data is partitioned into chunks (splits). Each Map task processes one split independently, emitting (key, value) pairs.

Shuffle and Sort phase: The framework collects all (key, value) pairs with the same key and groups them together. Fully automatic.

Reduce phase: Each Reduce task receives all values for a given key and aggregates them into a final result.

Classic example — Word Count:

Map: for each word in a document → emit (word, 1)
Shuffle: group all ("hello", 1) pairs together, all ("world", 1) pairs together, etc.
Reduce: for each word → sum all the 1s → emit (word, total_count)

Exam Quick-Reference — Chapter 9

Distributed Databases:

Horizontal fragmentation = rows partitioned by predicate → reconstruct with UNION
Vertical fragmentation = columns partitioned → reconstruct with JOIN on PK
Full replication = maximum availability; No replication = no redundancy overhead
Homogeneous = same DBMS everywhere; Heterogeneous = different DBMS, needs schema mapping

Data Warehousing:

ETL = Extract (from OLTP) → Transform (clean, integrate) → Load (into warehouse)
OLTP: current, normalized, high-frequency simple transactions
OLAP: historical, denormalized, low-frequency complex analytical queries
OLAP operations: Roll-up (aggregate up hierarchy); Drill-down (more detail); Slice (fix 1 dimension); Dice (fix 2+ dimensions); Pivot (rotate axes)
Star schema: fact + denormalized dimensions; Snowflake schema: fact + normalized dimensions (more joins)

NoSQL and Big Data:

NoSQL types: Key-Value (Redis), Document (MongoDB), Column-family (Cassandra), Graph (Neo4j)
Big Data 5 Vs: Volume, Velocity, Variety, Veracity, Value
HDFS: 128 MB blocks, 3× replication across nodes
MapReduce: Map (transform per record) + Shuffle (group by key) + Reduce (aggregate)
Spark = in-memory alternative to MapReduce; much faster for iterative algorithms

Solved PYQs

A few past questions with short answers.

CT30109002Chapter 96 marksasked 3x2082 Kartik B, 2081 Chaitra R, 2080 Ashwin B

Write short notes on: (a) Distributed databases (b) Data warehouses.

Solution

(a) Distributed DB: Single logical DB spread across sites. Supports OLTP. Focus: transparency, distributed transactions (2PC), consistency. (b) Data Warehouse: Central historical repository from multiple sources. Supports OLAP. ETL loaded. Star schema. Read-heavy, analytical, non-volatile.

CT30109003Chapter 96 marksasked 2x2080 Chaitra R

Write short notes on: (a) Parallel databases (b) Data warehouse.

Solution

(a) Parallel DB: Multiple CPUs/disks execute queries concurrently. Architectures: Shared-Memory, Shared-Disk, Shared-Nothing (MPP). Speeds intra/inter-query processing. (b) Data Warehouse: Integrated historical data for BI. ETL pipeline. Denormalized schemas. OLAP ops (roll-up, drill-down). MPP hardware common.

CT30109005Chapter 96 marksasked 2x2081 Ashwin B

Write short notes on: (a) Data warehouse database and its application (b) Parallel database architecture.

Solution

(a) DW & Apps: Central historical data for decision support. ETL. Star/Snowflake. Apps: Sales trends, forecasting, customer segmentation, BI dashboards. (b) Parallel Arch: Shared-Memory (fast, limited), Shared-Disk (moderate, needs lock mgr), Shared-Nothing/MPP (best scalability, data partitioned, Redshift/Snowflake).

More PYQs

Additional practice from this chapter.

Describe the two primary strategies for distributing data across multiple physical locations: Data Replication and Data Fragmentation, stating a key advantage of each.
CT301090014 marksasked 1xModel Q
Write short notes on: (a) Distributed databases (b) Data warehouses.
CT301090026 marksasked 3x2082 Kartik B
Write short notes on: (a) Parallel databases (b) Data warehouse.
CT301090036 marksasked 2x2080 Chaitra R
Explain the main features of an object-relational database. Explain fragmentation and replication in distributed databases.
CT301090046 marksasked 1x2081 Chaitra R
Write short notes on: (a) Data warehouse database and its application (b) Parallel database architecture.
CT301090056 marksasked 2x2081 Ashwin B

Advanced Database Concepts

Concept of object-oriented databases

Distributed database model

Concept of data warehousing and online analytical processing

Basic concepts of NoSQL and big data

Chapter 9 — Advanced Database Concepts

9.1 Object-Oriented and Object-Relational Databases

Why Object-Oriented Databases?

Core OO Concepts Applied to Databases

Object-Relational Model (SQL:1999 and later)

Object-Relational Mapping (ORM)

Persistence in OO Databases

9.2 Distributed Databases

Data Fragmentation Strategies

Fragmentation in Detail

Replication Strategies

Homogeneous vs Heterogeneous Distributed Databases

9.3 Data Warehousing and OLAP

OLTP vs OLAP

ETL Process (Extract, Transform, Load)

Warehouse Schemas

OLAP Operations on a Data Cube

9.4 NoSQL and Big Data

Why NoSQL?

NoSQL Database Categories

Big Data — The 5 V's

Hadoop Ecosystem

MapReduce Framework in Detail

Exam Quick-Reference — Chapter 9