Chapter 9 — Advanced Database Concepts
9.1 Object-Oriented and Object-Relational Databases
Why Object-Oriented Databases?
Traditional relational databases struggle with:
- Complex nested objects: a document with embedded sections, a product with variant sub-products
- Methods on data: behavior associated with data (not just raw attribute values)
- Inheritance hierarchies: "an Employee IS-A Person" — hard to express cleanly in flat tables
- Object identity: two objects with identical attribute values may still be distinct entities
Core OO Concepts Applied to Databases
| OO Concept | Meaning in Database Context | Example |
|---|
| Object identity | Each object has a unique OID independent of its attribute values | Two Person objects with same name and age are still distinct objects |
| Encapsulation | Object's internal state is hidden; accessed only through defined methods | getAge() method instead of direct column access |
| Inheritance | A subtype inherits all attributes and methods of its supertype | Employee inherits name, age from Person; adds salary |
| Polymorphism | Same method name behaves differently for different object types | describe() for Employee shows job title; for Student shows GPA |
Object-Relational Model (SQL:1999 and later)
Extends the relational model with OO features while maintaining SQL compatibility:
- User-defined types (UDTs): composite types (structured types), distinct types
- Collection types: arrays, multisets (bags)
- Table inheritance: a table can inherit from another table
- Methods: define methods on types, callable from SQL
-- ── OBJECT-RELATIONAL SQL EXAMPLES (PostgreSQL) ──────────────────────────
-- User-defined composite type (structured / nested type)
CREATE TYPE address_t AS (
street VARCHAR(200),
city VARCHAR(100),
district VARCHAR(50),
zipcode VARCHAR(10)
);
-- Use composite type as a column (avoids a separate Address table + join)
CREATE TABLE Customer (
cid INT PRIMARY KEY,
name VARCHAR(100),
address address_t -- composite attribute — nested type
);
-- Access nested attributes with dot notation
SELECT cid, name, (address).city, (address).zipcode
FROM Customer
WHERE (address).district = 'Lalitpur';
-- Array type (multivalued attribute stored inline)
CREATE TABLE Student_OR (
sid INT PRIMARY KEY,
name VARCHAR(100),
phones VARCHAR(20)[] -- array of phone numbers
);
INSERT INTO Student_OR VALUES (1, 'Alice', ARRAY['9800000001','9800000002']);
SELECT sid, name, phones[1] AS primary_phone,
array_length(phones, 1) AS total_phones
FROM Student_OR;
-- TABLE INHERITANCE (PostgreSQL extension — Employee IS-A Person)
CREATE TABLE Person (
pid INT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
age INT
);
CREATE TABLE Employee_OO (
emp_id VARCHAR(20) NOT NULL,
salary DECIMAL(10,2),
dept VARCHAR(50)
) INHERITS (Person); -- inherits pid, name, age from Person
-- Querying:
-- SELECT * FROM Person → returns BOTH Person-only AND Employee_OO rows
-- SELECT * FROM ONLY Person → returns ONLY Person-only rows (ONLY keyword)
-- SELECT * FROM ONLY Employee_OO → returns only Employee_OO rows
Object-Relational Mapping (ORM)
An ORM bridges the gap between application-level objects and a relational database:
- Developer defines classes (e.g.,
Student with fields name, gpa, courses)
- ORM automatically generates SQL for INSERT, SELECT, UPDATE, DELETE
- Developer works entirely with objects; never writes raw SQL
| ORM Framework | Language |
|---|
| Hibernate | Java |
| SQLAlchemy | Python |
| Prisma, TypeORM | TypeScript / Node.js |
| ActiveRecord | Ruby on Rails |
| Entity Framework | C# / .NET |
Advantages: Reduces SQL boilerplate; type-safe queries; easier unit testing
Disadvantages: Can generate inefficient SQL; N+1 query problem (fetching a list of 100 students then issuing 100 separate queries for each student's courses = 101 queries instead of 1 JOIN)
Persistence in OO Databases
Persistence refers to an object surviving beyond the lifetime of the process that created it. In OO databases, objects are stored in the database and retrieved as fully-formed objects. In relational databases with ORM, persistence is achieved by serializing objects to rows.
9.2 Distributed Databases
A distributed database stores and manages data across multiple geographically separate sites connected by a network, but appears as a single unified database to users and applications.
Key design goals:
- Location transparency: the user does not need to know where data is physically stored
- Replication transparency: the user does not need to know how many copies exist
- Fragmentation transparency: the user does not need to know how data is split across sites
Data Fragmentation Strategies
Fragmentation distributes parts of a relation across multiple sites to reduce data transfer between sites for common queries.
Fragmentation in Detail
| Horizontal Fragmentation | Vertical Fragmentation |
|---|
| What is distributed | Rows (subsets by predicate) | Columns (subsets by access pattern) |
| How to reconstruct R | UNION of all fragments | NATURAL JOIN (or JOIN on PK) of all fragments |
| Useful when | Different sites serve different customer regions | Different departments need different columns (privacy/security) |
| Example | Nepal customers at Site 1; India customers at Site 2 | Personal info (name, address) at Site 1; Financial info (salary, balance) at Site 2 |
Correctness requirements for fragmentation:
- Completeness: every record of R appears in at least one fragment
- Reconstruction: R can be fully reconstructed from its fragments
- Disjointness: for horizontal, each record appears in exactly one fragment; for vertical, only the PK may repeat
Replication Strategies
Replication stores copies of data at multiple sites for availability and read performance.
| Type | Description | Key Advantage | Key Disadvantage |
|---|
| Full replication | Complete database copy at every site | Maximum availability; all reads local | High update cost (must update all copies) |
| Partial replication | Some fragments replicated at some sites | Balanced cost and availability | More complex to manage |
| No replication | Each fragment on exactly one site | No redundancy overhead | Single point of failure per fragment |
Exam answer — "Describe two primary strategies for distributing data across multiple physical locations, stating a key advantage of each."
Data Replication: maintains full or partial copies of the database at multiple sites. Key advantage: high availability — if one site fails, queries can be served from another site without interruption.
Data Fragmentation: distributes different parts of the data (by rows or columns) to different sites. Key advantage: reduced data transfer — a site can answer queries about its local data without needing to access remote sites, minimizing network communication.
Homogeneous vs Heterogeneous Distributed Databases
| Homogeneous | Heterogeneous |
|---|
| DBMS software | Same DBMS at all sites (e.g., all PostgreSQL) | Different DBMS at different sites (Oracle at Site 1, MySQL at Site 2) |
| Schema | Identical or similar | May differ significantly |
| Management complexity | Lower — uniform query language and interface | Higher — requires schema mapping, query translation, data type conversion |
| Common scenario | A company's branches all running the same system | Two companies merging their databases after an acquisition |
9.3 Data Warehousing and OLAP
OLTP vs OLAP
| Characteristic | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing) |
|---|
| Primary purpose | Day-to-day operational transactions | Business intelligence, analytics, reporting |
| Query type | Simple: single-record INSERT/UPDATE/SELECT | Complex: aggregates, multi-table joins, GROUP BY across millions of rows |
| Query frequency | Very high (thousands per second) | Low (a few per minute or hour) |
| Data currency | Current, up-to-date | Historical (months to years) |
| Data volume | GB range | TB to PB range |
| Database design | Normalized (3NF) to avoid redundancy | Denormalized (star/snowflake schema) for fast reads |
| Typical users | Clerks, customers, automated systems | Business analysts, executives, data scientists |
| Example | Bank transfer, e-commerce checkout | Quarterly sales report, customer churn analysis |
ETL Process (Extract, Transform, Load)
| Phase | Description | Example Operations |
|---|
| Extract | Pull raw data from OLTP systems, files, APIs, external sources | Database queries, CSV parsing, REST API calls |
| Transform | Clean, integrate, and reshape the data to fit warehouse schema | Deduplication, null handling, unit conversion, joining multiple sources, applying business rules |
| Load | Write the transformed data into the data warehouse | Bulk INSERT, incremental delta loads, slowly changing dimension updates |
Warehouse Schemas
Star Schema: Central fact table surrounded by denormalized dimension tables.
- Simpler joins (fewer joins needed), faster queries
- Some redundancy in dimension tables
DimTime DimProduct
(time_id, date, year, (prod_id, name, category,
quarter, month) subcategory, price)
| |
└─────────── FactSales ────────┘
(time_id FK,
prod_id FK,
region_id FK,
cust_id FK,
sales_amount,
quantity_sold)
|
DimRegion
(region_id, city,
country, zone)
Snowflake Schema: Fact table + normalized dimension tables. Each dimension table may join to sub-dimension tables.
- Reduces redundancy in dimensions
- Requires more JOINs → slightly slower queries
OLAP Operations on a Data Cube
A data cube is the multidimensional view of data in a warehouse. OLAP operations navigate this cube:
| Operation | Definition | Analogy / Example |
|---|
| Roll-up (aggregation) | Move UP the dimension hierarchy — summarize data | Monthly sales → Quarterly sales → Annual sales |
| Drill-down | Move DOWN the hierarchy — increase detail level | Annual → Monthly → Daily |
| Slice | Fix one dimension at a specific value; result is a 2D cross-section | All sales in Q1 2024 only (fix time dimension to Q1 2024) |
| Dice | Fix two or more dimensions simultaneously; result is a smaller sub-cube | Sales in Q1 2024 in Nepal only (fix time AND region) |
| Pivot (rotate) | Rotate the data cube — swap the row and column axes | Rows = time, Columns = product → flip to Rows = product, Columns = time |
9.4 NoSQL and Big Data
Why NoSQL?
Relational databases struggle at massive web scale due to:
- Horizontal scalability limits: hard to shard (split) a relational DB across thousands of commodity nodes
- Schema rigidity: every row must conform to a fixed schema; adding fields requires ALTER TABLE on a billion-row table
- Semi-structured / unstructured data: JSON documents, graphs, time-series don't fit neatly into tables
- Write-heavy workloads at massive scale: relational ACID can be a throughput bottleneck
NoSQL databases trade some of relational model's guarantees (e.g., full ACID, strict schema) for scalability and flexibility.
NoSQL Database Categories
| Category | Data Model | Key Property | Use Cases | Examples |
|---|
| Key-Value | Simple map: key → value blob | O(1) get/set by key; horizontally scalable | Caching, session storage, shopping carts | Redis, DynamoDB, Memcached |
| Document | Collections of JSON/BSON documents | Flexible schema; rich nested queries | Content management, catalogs, user profiles | MongoDB, CouchDB, Firestore |
| Column-Family (Wide Column) | Rows with dynamic column families | Fast writes; good for time-series data | IoT telemetry, logs, time-series, analytics | Cassandra, HBase, Bigtable |
| Graph | Nodes + directed edges + properties | Efficient multi-hop graph traversal | Social networks, recommendations, fraud detection | Neo4j, Amazon Neptune, JanusGraph |
Big Data — The 5 V's
| V | Full Name | Description | Example |
|---|
| Volume | Scale of data | TB to PB of data generated continuously | 4 PB/day at Facebook |
| Velocity | Speed of data generation | Real-time streams and sensor data | 500 M tweets/day on Twitter |
| Variety | Diversity of data formats | Structured, semi-structured, unstructured | Relational tables + JSON logs + images + video |
| Veracity | Trustworthiness and quality | Noisy, incomplete, inconsistent raw data | Social media has bots, typos, spam |
| Value | Business value extracted | Turning raw big data into actionable insights | Customer churn prediction, fraud detection |
Hadoop Ecosystem
Hadoop is an open-source framework for distributed storage and processing of very large datasets on commodity hardware.
| Component | Function | Key Idea |
|---|
| HDFS | Hadoop Distributed File System — distributed storage | Files split into 128 MB blocks; each block replicated 3× across different nodes by default |
| MapReduce | Parallel batch processing framework | Map phase: filter/transform each record independently; Reduce phase: aggregate the mapped results |
| YARN | Yet Another Resource Negotiator — cluster resource manager | Allocates CPU and RAM to jobs across the cluster |
| Hive | SQL-like query language (HiveQL) on HDFS | Translates SQL queries into MapReduce or Tez jobs for batch analysis |
| Spark | In-memory distributed processing | 10–100× faster than MapReduce; supports SQL, streaming, machine learning, graph processing |
| HBase | Column-family NoSQL database on top of HDFS | Random read/write access to huge tables (HDFS itself is write-once) |
MapReduce Framework in Detail
Map phase: Input data is partitioned into chunks (splits). Each Map task processes one split independently, emitting (key, value) pairs.
Shuffle and Sort phase: The framework collects all (key, value) pairs with the same key and groups them together. Fully automatic.
Reduce phase: Each Reduce task receives all values for a given key and aggregates them into a final result.
Classic example — Word Count:
- Map: for each word in a document → emit (
word, 1)
- Shuffle: group all ("hello", 1) pairs together, all ("world", 1) pairs together, etc.
- Reduce: for each word → sum all the 1s → emit (
word, total_count)
Exam Quick-Reference — Chapter 9
Distributed Databases:
- Horizontal fragmentation = rows partitioned by predicate → reconstruct with UNION
- Vertical fragmentation = columns partitioned → reconstruct with JOIN on PK
- Full replication = maximum availability; No replication = no redundancy overhead
- Homogeneous = same DBMS everywhere; Heterogeneous = different DBMS, needs schema mapping
Data Warehousing:
- ETL = Extract (from OLTP) → Transform (clean, integrate) → Load (into warehouse)
- OLTP: current, normalized, high-frequency simple transactions
- OLAP: historical, denormalized, low-frequency complex analytical queries
- OLAP operations: Roll-up (aggregate up hierarchy); Drill-down (more detail); Slice (fix 1 dimension); Dice (fix 2+ dimensions); Pivot (rotate axes)
- Star schema: fact + denormalized dimensions; Snowflake schema: fact + normalized dimensions (more joins)
NoSQL and Big Data:
- NoSQL types: Key-Value (Redis), Document (MongoDB), Column-family (Cassandra), Graph (Neo4j)
- Big Data 5 Vs: Volume, Velocity, Variety, Veracity, Value
- HDFS: 128 MB blocks, 3× replication across nodes
- MapReduce: Map (transform per record) + Shuffle (group by key) + Reduce (aggregate)
- Spark = in-memory alternative to MapReduce; much faster for iterative algorithms