DBMS NotesPYQs

Chapter 6

File Structure and Hashing

Chapter Notes

Block-based notes with markdown, diagrams, code, and math.

Chapter 6 — File Structure and Hashing

6.1 Storage Hierarchy

RAID Levels

RAID LevelTechniqueReadWriteFault ToleranceUse
0Striping onlyFastFastNonePerformance only
1MirroringFastSlower1 disk failureCritical data
5Striping + distributed parityFastModerate1 disk failureGeneral purpose
6Striping + dual parityFastSlower2 disk failuresHigh availability
10RAID 1 + RAID 0Very fastFast1 disk per mirrorDatabases

6.2 Record Organization

Fixed-Length Records

Every record occupies the same number of bytes. Access: record i starts at byte i × record_size.

  • Pro: simple, fast random access.
  • Con: wastes space if fields are often short/null.

Variable-Length Records

Fields can vary; null values take no space. Requires a slotted page structure.


6.3 Ordered Indices

An index maps search key values to records, allowing faster lookups than a full scan.

Index TypeDescription
Dense indexOne index entry per record in the data file
Sparse indexOne entry per data block (only works on sorted data)
Primary (clustering) indexOn the sort key of a sorted file
Secondary (non-clustering) indexOn any non-sort key; always dense
Multi-level indexSparse index on top of a dense index to reduce search space

Trade-off: dense index → faster search; sparse index → less space but only on sorted data.


6.4 B+ Tree Index

B+ tree is the standard index structure in all major RDBMS (PostgreSQL, MySQL InnoDB, Oracle).

B+ Tree Properties (order n)

  • All leaves are at the same depth (balanced).
  • Internal nodes: ⌈n/2⌉ to n pointers.
  • Leaf nodes: ⌈(n-1)/2⌉ to n-1 keys + pointers to records.
  • Leaf nodes are linked in a doubly-linked list → efficient range queries.
  • Root has at least 2 children (unless it is also a leaf).

B-tree vs B+ Tree

B-treeB+ tree
Data pointersIn internal + leaf nodesOnly in leaf nodes
Range queriesSlower (in-order traversal)Fast (linked leaf list)
SpaceLess redundancySlightly more (key repeated in leaf)
Used in practiceRarelyAlmost universally

6.5 Hashing

Static Hashing

  • A hash function h(K) maps a key K to a bucket number 0…B-1.
  • Each bucket holds one or more pages.
  • Overflow buckets are chained when a bucket fills.
  • Problem: if the table grows, performance degrades. Reorganization is expensive.

Extendable (Dynamic) Hashing

  • Uses a directory (array of pointers to buckets).
  • Global depth d: directory size = 2^d.
  • Local depth l per bucket: number of bits used to hash to that bucket.
  • On overflow: split the bucket; if l = d, double the directory.
Extendable Hashing — Insert example

Initial state: global depth = 1, 2 buckets
  Directory: [0→BucketA, 1→BucketB]
  BucketA (local depth 1): keys with h(k) ending in 0
  BucketB (local depth 1): keys with h(k) ending in 1

Insert key 15 (h(15)=1111):
  → goes to BucketB
  If BucketB overflows (local depth < global depth):
    → Split BucketB into B0 (ending 01) and B1 (ending 11)
    → Local depth of new buckets = 2
    → Update directory entries 01 → B0, 11 → B1

  If BucketB overflows AND local depth = global depth:
    → Double the directory: global depth becomes 2
    → Directory now has 4 entries: 00, 01, 10, 11
    → Then split as above

Search: compute h(k), take last d bits, follow directory pointer → O(1)

Hashing vs B+ Tree

B+ TreeHash Index
Exact matchO(log n)O(1) average
Range queryO(log n + result)Not supported
Ordered outputYesNo
Best forMost queriesPoint lookups only

Exam Quick-Reference

  • RAID 5 = striping + parity (1 disk failure tolerated); RAID 6 = 2 disk failures.
  • Dense index: one entry per record; Sparse: one entry per block (sorted file only).
  • B+ tree leaves are linked → range queries scan leaves sequentially.
  • Extendable hashing: global depth controls directory size; local depth controls bucket split.
  • Static hashing → overflow chaining; extendable → split + directory doubling.

Full Syllabus Outline

Every topic from the syllabus data is preserved here.

Solved PYQs

A few past questions with short answers.

Chapter 68 marksasked 1x2081 Chaitra R

Explain the different levels of RAID briefly. What is indexing in database? Differentiate between dense index and sparse index.

Solution

RAID and Indexing:

RAID (Redundant Array of Independent Disks):

LevelTechniqueFault TolerancePerformanceStorage Overhead
RAID 0Striping onlyNoneExcellent R/WNone
RAID 1Mirroring1 disk failureExcellent R; good W
RAID 5Striping + distributed parity1 disk failureGood R; moderate W1 disk
RAID 6Striping + 2 parity disks2 disk failuresGood R; slower W2 disks
RAID 10RAID 1+0 (mirror then stripe)1 per mirror pairExcellent R/W

RAID 10 is preferred for databases because it combines the write performance of RAID 0 striping with the fault tolerance of RAID 1 mirroring.

Index Types:

Dense Index: One index entry per data record (tuple). Can find any record by looking up its key in the index directly. More storage but faster search.

Sparse Index: One index entry per data block (only works on sorted/clustered data). Fewer entries but requires scanning within the block after finding it via the index. Less storage.

Primary Index: Built on the attribute(s) that determine the physical ordering of records in the file (the search key matches the file's sort order).

Secondary Index: Built on an attribute that is not the file's sort order. Must be dense (since records are not sorted by this attribute).

Chapter 68 marksasked 1x2081 Ashwin B

What is the purpose of using RAID? How can a variable length record be implemented? Compare dense and sparse ordered indices.

Solution

Storage Hierarchy:

Database storage is organized in a hierarchy from fastest/smallest/most expensive to slowest/largest/cheapest:

LevelTechnologyAccess TimeCapacityVolatility
L1/L2/L3 CacheSRAM~1 nsKB–MBVolatile
Main MemoryDRAM~100 nsGBVolatile
SSDFlash NAND~100 μsTBNon-volatile
HDDMagnetic disk~10 msTBNon-volatile
Tape / ArchiveMagnetic tapeSecondsPBNon-volatile

DBMS Buffer Pool: The DBMS manages a buffer pool (pool of fixed-size pages in main memory). Pages are brought from disk into the buffer pool on demand and evicted when the pool is full. The block/page (typically 4KB–16KB) is the unit of I/O transfer.

Buffer Replacement Policies:

  • LRU (Least Recently Used): Evict the page not accessed for the longest time. Works well for random access.
  • Clock Algorithm: Efficient approximation of LRU using a reference bit per page.
  • MRU (Most Recently Used): Evict the most recently used page — counterintuitively better for sequential scans (nested-loop joins).
  • Pinned Pages: Pages currently being used by a transaction cannot be evicted.
Chapter 68 marksasked 1x2082 Kartik B

Explain fixed-length and variable-length records in File Organization. Explain how a hash index works. Distinguish between static and dynamic hashing.

Solution

Fixed-Length vs Variable-Length Records:

Fixed-Length Records: All records in the file have the same size. The i-th record starts at byte offset i × record_size from the start of the file.

  • Advantages: random access by record number is O(1); simple deletion (mark as free or use a free list).
  • Disadvantages: wastes space for variable-length data (e.g., name padded to max length).

Variable-Length Records: Records have different sizes due to variable-length attributes (VARCHAR) or optional fields.

  • Implemented using a slotted page structure: each page has a header containing an array of (offset, length) pairs — one per record on the page. Records are packed from the end of the page toward the header.
  • Advantages: no wasted space for short values.
  • Disadvantages: more complex access; records can only be accessed by page+slot, not by absolute byte offset.

Hashing vs B+ Tree Index (Index Structure Comparison):

PropertyHash IndexB+ Tree Index
Equality searchO(1) averageO(log n)
Range searchNot supportedO(log n + result)
Ordered outputNot supportedNaturally ordered
Space overheadModerateModerate
Insertion/deletionO(1) averageO(log n) with splits
Best use caseExact-match queriesMixed workloads, ORDER BY

More PYQs

Additional practice from this chapter.