Chapter 6

File Structure and Hashing

5 hours8 marks18 past questionsBack to chapters Open filtered PYQs Previous: Ch 5 Next: Ch 7

Full Syllabus Outline

Every topic from the syllabus data is preserved here.

6.1
Disks and storage
DE
Physical storage media: magnetic disk, SSD, optical, tape. Storage hierarchy. RAID levels (0–6): striping, mirroring, parity. Selecting an appropriate RAID level.
6.2
Records organizations
DEI
File organization concepts. Fixed-length records. Variable-length records: slotted page structure. Sequential record organization. Multi-cluster (heap) record organization. Database buffer management.
6.3
Ordered indices
DEI
Primary (clustering) index vs secondary (non-clustering) index. Dense index vs sparse index. Index sequential file. Searching, inserting, and deleting in ordered indices. Multi-level sparse indices.
6.4
B+ tree index
DEI
B+ tree node structure (internal nodes and leaf nodes). Order-n definition. Search, insert, and delete operations with examples. Properties and advantages. Comparison with B-tree.
6.5
Hashing concepts — Static and dynamic hashing
DEI
Hash function basics. Hash index structure. Static hashing: bucket array, overflow chaining, search and insert. Limitations of static hashing. Extendable (dynamic) hashing: global depth, local depth, directory doubling. Bitmap indices.

Chapter Notes

Block-based notes with markdown, diagrams, code, and math.

Chapter 6 — File Structure and Hashing

6.1 Storage Hierarchy

Data in a DBMS moves between different storage levels based on access frequency. The further from the CPU, the cheaper but slower the storage.

Primary storage: CPU cache + RAM — fast, volatile (data lost on power failure)
Secondary storage: SSD + HDD — slower, non-volatile, much cheaper per GB
Tertiary storage: Tape/optical — very slow, used for archival backups

The DBMS buffer manager controls which disk pages are kept in RAM (the buffer pool). Most DBMS use LRU (Least Recently Used) or Clock replacement to decide which pages to evict.

RAID (Redundant Array of Independent Disks)

RAID provides higher performance and/or fault tolerance by distributing data across multiple physical disks.

RAID Level	Technique	Read Perf	Write Perf	Fault Tolerance	Typical Use
RAID 0	Striping only (data split across disks)	Very fast	Very fast	None — one disk failure = total data loss	Scratch/temp data
RAID 1	Mirroring (exact copy on second disk)	Fast (read from either)	Slower (write to both)	1 disk failure	Critical data, write-ahead logs
RAID 5	Striping + distributed parity	Fast	Moderate	1 disk failure	General-purpose databases
RAID 6	Striping + dual parity	Fast	Slower	2 disk failures	High-availability systems
RAID 10	RAID 1 + RAID 0 (mirror then stripe)	Very fast	Fast	1 disk per mirror group	High-performance databases

Stable storage: Storage that survives any single failure — approximated by writing to multiple independent disks (RAID 1/10) before acknowledging a write.

6.2 Record Organization

A record is the storage unit for one tuple. Records are stored on disk pages/blocks (typically 4 KB–16 KB).

Fixed-Length Records

Every record occupies exactly the same number of bytes.

Record i starts at byte offset: i × record_length (from start of file)
Deletion: mark slot as "deleted" with a tombstone byte, or maintain a free list

Advantages: Simple; fast direct access by record number.
Disadvantages: Wastes space if variable-length fields are common; NULL values still consume space.

Variable-Length Records

Records can have different sizes (VARCHAR, TEXT fields).

Two common storage approaches:

Fixed-length fields first, then variable-length fields separated by special delimiters
Slotted page structure (most common in modern RDBMS)

Key advantage of slotted pages: Records can be moved within the page to compact free space. External pointers reference (page_id, slot_number) — the slot array is updated when the record moves internally, so external references stay valid.

File Organization Strategies

Strategy	Description	Good For	Bad For
Heap file	Records in no particular order; new records appended	Bulk loads; full table scans	Range queries; sorted access
Sequential/sorted file	Records sorted by a search key	Range queries; merge joins; ORDER BY	Random inserts (must maintain sort order)
Clustered file	Related records from different tables stored on same block	Joins on common key	Mixed access patterns

Buffer Management

The buffer pool is a pool of RAM frames, each holding one disk block. On a block request:

Buffer hit — block already in RAM → return directly (no disk I/O)
Buffer miss — block not in RAM → read from disk into a free frame (evict another frame using LRU if buffer is full)

6.3 Ordered Indices

An index maps search key values to record locations, enabling faster lookups without a full sequential scan.

Dense vs Sparse Index

	Dense Index	Sparse Index
Entries	One index entry for every record	One index entry for every disk block
Space	More (one entry per record)	Less (one entry per block)
Requirement	Works on any file organization	Only works on sorted data files
Lookup	Direct: find key, follow pointer	Find nearest block boundary, then scan within block

Primary vs Secondary Index

	Primary (Clustering) Index	Secondary (Non-clustering) Index
Built on	The sort key of a physically sorted file	Any other attribute or combination
Organization	Index order = file physical order	Index order ≠ file order
Type	Can be sparse or dense	Always dense (must point to each record)
Number	At most one per relation	Many allowed per relation

Multi-Level Index

Even a sparse primary index may be too large to search sequentially for a huge table.
Solution: build a sparse index on top of the index → index of index. At the top level, only one block needs to be loaded. Then follow pointers down through levels.
This is the foundational idea behind B+ trees.

6.4 B+ Tree Index

The B+ tree is the standard index structure used in all major RDBMS (PostgreSQL, MySQL InnoDB, Oracle, SQL Server). It is a balanced, ordered tree where all data pointers live in the leaf nodes.

B+ Tree Structural Properties (order n)

Property	Rule
All leaves at same depth	Guarantees O(log n) height for all keys — fully balanced
Internal nodes	Between ⌈n/2⌉ and n child pointers; between ⌈n/2⌉ − 1 and n − 1 keys
Leaf nodes	Between ⌈(n−1)/2⌉ and n−1 key-pointer pairs
Leaf linkage	Leaves are linked in a doubly-linked list → efficient range queries
Root (non-leaf)	Has at least 2 children

Search in B+ Tree

Starting at root, compare the search key against each internal node's keys to follow the correct child pointer. At the leaf, use binary search within the node.

Complexity: O(log_n(N)) where N = number of records, n = order
Typical height: 3–4 levels for millions of records (n ≈ 100 in practice → 100³ = 1M records in 3 levels)

Range Queries with B+ Tree

Find the leaf containing the start of the range
Follow the linked leaf list scanning forward until the end of the range This is extremely efficient — sequential I/O on the leaf level.

Why B+ Tree is Preferred Over B-Tree

Feature	B-tree	B+ tree
Data pointers	In both internal and leaf nodes	Only in leaf nodes
Range queries	In-order traversal of tree (complex, random I/O)	Scan linked leaves (simple, sequential I/O)
Internal node fanout	Lower (nodes store data too)	Higher (nodes store keys only → wider branching)
Consistent access time	Not always (keys may be in internal nodes)	Always — every search reaches a leaf
Used in RDBMS	Rarely	Universally

Exam answer — "Why is B+ tree preferred?"
B+ tree stores all data pointers only in leaf nodes, which are linked in a sorted list. This makes range queries very efficient — find the start key and then scan leaves sequentially. Internal nodes store only keys, allowing a higher branching factor and a shallower tree. Every search always traverses from root to a leaf, giving consistent performance.

B+ Tree Insertions and Deletions

Insert: Find the correct leaf and insert. If the leaf overflows (> n−1 keys), split: the median key moves up to the parent. If the parent overflows, split recursively. If the root splits, the tree grows taller.
Delete: Find and remove the key. If the leaf underflows (< ⌈(n−1)/2⌉ keys), either borrow from a sibling (redistribute) or merge with a sibling. Merging propagates up, potentially reducing height.

6.5 Hashing

Hash Function

A hash function h(K) maps search key K to a bucket number in range [0, B−1]. A good hash function distributes keys uniformly across buckets. Example: h(K) = K mod B (for integer keys)

Static Hashing

Fixed number of buckets B
Each bucket holds one or more pages
Overflow buckets: when a bucket fills, chain overflow pages with pointers
Problem: as the file grows, overflow chains lengthen and performance degrades. Reorganizing the entire hash file is expensive (recompute h for all keys with new B).

Extendable (Dynamic) Hashing

Grows the directory dynamically to handle more data without a full file reorganization.

Key concepts:

Directory: an array of 2^d pointers to buckets
Global depth d: the number of hash bits used to index the directory; directory size = 2^d
Local depth l_i: the number of bits used to hash to bucket i (l_i ≤ d at all times)

Extendable Hashing — Search and Insert

SEARCH for key K:
  1. Compute h(K)
  2. Use the LAST d bits of h(K) as an index into the directory
  3. Follow the directory pointer to the bucket
  4. Sequentially search within the bucket
  Complexity: O(1) average

INSERT key K:
  1. Compute h(K), find the target bucket (as above)
  2. If bucket has free space → insert key directly ✓
  3. If bucket OVERFLOWS:
     Case A — local depth l_i < global depth d:
       → Split the bucket: create two new buckets, local depth becomes l_i + 1
       → Redistribute keys between the two new buckets based on the (l_i+1)th hash bit
       → Update directory entries (only those currently pointing to the split bucket)
       → Directory SIZE does not change

     Case B — local depth l_i = global depth d:
       → First, DOUBLE the directory:
           d becomes d + 1; directory size goes from 2^d to 2^(d+1)
           Each old entry i is replicated: entries i and i+2^d both point to same bucket
       → Then proceed with Case A: split the overflowing bucket

──────────────────────────────────────────────────────────────────────
Worked Example:

Initial state: d=1 (global depth), 2 buckets
  Directory: [0 → Bucket A (l=1), 1 → Bucket B (l=1)]
  Bucket A: holds keys where h(k) ends in bit '0'
  Bucket B: holds keys where h(k) ends in bit '1'

Insert key with h(k) = 1111 → last 1 bit = 1 → goes to Bucket B
  Suppose Bucket B is FULL (overflows). Local depth l=1 = global depth d=1.

  Step 1: Double directory → d becomes 2
    Old directory:  [0→A, 1→B]
    New directory:  [00→A, 01→B, 10→A, 11→B]   (entries duplicated)

  Step 2: Split Bucket B (keys ending in '1', l=1)
    Keys ending in '01' → new Bucket B0 (l=2)
    Keys ending in '11' → new Bucket B1 (l=2)
    Update directory: 01 → B0, 11 → B1

  Final directory: [00→A, 01→B0, 10→A, 11→B1]

Search after restructure: any key still found in O(1) using last d=2 bits
No existing keys needed to be rehashed to a completely new file!

Bitmap Indices

A bitmap index stores a bit-array for each distinct value of a low-cardinality attribute.

Bit i is set to 1 if tuple i has that attribute value
Very efficient for AND, OR, NOT queries using bitwise operations
Ideal for data warehouses where attributes have few distinct values (e.g., gender, status, department code)
Not efficient for high-cardinality attributes (e.g., student ID — would need one bitmap per student)

Hashing vs B+ Tree Index

Operation	B+ Tree	Hash Index
Exact match (=)	O(log n)	O(1) average
Range query (<, >, BETWEEN, LIKE 'A%')	O(log n + result size)	❌ Not supported
Ordered output (ORDER BY)	Yes — scan leaves in order	No
Worst case	O(log n) — guaranteed	O(n) — if many hash collisions
Best use case	General-purpose; most queries	Point lookups only (caching, session storage)

Exam Quick-Reference — Chapter 6

Storage hierarchy: Cache (ns) → RAM (100 ns) → SSD (0.1 ms) → HDD (10 ms) → Tape (seconds)
RAID 1 = mirroring (1 disk failure tolerated); RAID 5 = striping + parity (1 disk); RAID 6 = 2 disks; RAID 10 = mirror + stripe (fastest + resilient)
Dense index: one entry per record; Sparse index: one entry per block (sorted files only)
B+ tree leaves are linked → range queries = find start leaf, then scan linked list
B+ tree internal nodes store keys only (no data) → higher branching factor → shallower tree → consistent O(log n)
Extendable hashing: global depth d → directory has 2^d entries; overflow → split bucket; if local depth = global depth → double directory first
Static hashing → overflow chains; Extendable hashing → split + directory doubling; no full file reorg
Hash index: O(1) exact match, NO range queries; B+ tree: O(log n) for everything including ranges

Solved PYQs

A few past questions with short answers.

CT30106006Chapter 68 marksasked 2x2080 Chaitra R

What do you mean by RAID? Briefly explain the different levels of RAID. Explain the working of static hash indexing along with suitable examples.

Solution

RAID: Redundant Array of Independent Disks. Levels: 0(striping), 1(mirroring), 5(distributed parity), 10(mirror+stripe). Static Hashing: Fixed N buckets. h(K) mod N determines bucket. Overflow chains handle collisions. Example: 100 buckets, h(id)=id%100. Insert 1001 → bucket 1. Fast equality, but requires full rehash to add buckets.

CT30106003Chapter 68 marksasked 1x2081 Chaitra R

Explain the different levels of RAID briefly. What is indexing in database? Differentiate between dense index and sparse index.

Solution

RAID: 0(striping, fast, no redundancy), 1(mirroring, safe), 5(striping+parity, balances speed/safety), 6(dual parity), 10(mirror+stripe, best for DB). Indexing: Data structure speeding lookups without full table scan. Dense: Entry per record. Works sorted/unsorted. Fast, high storage. Sparse: Entry per block. Requires sorted file. Less storage, requires block scan after index lookup.

CT30106004Chapter 68 marksasked 1x2081 Ashwin B

What is the purpose of using RAID? How can a variable length record be implemented? Compare dense and sparse ordered indices.

Solution

RAID Purpose: Improve I/O throughput (striping) and fault tolerance (mirroring/parity). Variable Records: Slotted page structure. Page header has slot array (offset, length). Records packed from end to start. Allows inserts/deletes without shifting whole pages. Dense vs Sparse: Dense=1 entry/record, works any order, larger. Sparse=1 entry/block, requires sorted file, smaller, needs intra-block scan.

More PYQs

Additional practice from this chapter.

Why is the B+ tree the preferred index structure for file organization in most modern relational database systems?
CT301060013 marksasked 1xModel Q
Briefly describe the mechanism that allows Extendible/Dynamic Hashing to grow without reorganizing the entire file.
CT301060024 marksasked 1xModel Q
Explain the different levels of RAID briefly. What is indexing in database? Differentiate between dense index and sparse index.
CT301060038 marksasked 1x2081 Chaitra R
What is the purpose of using RAID? How can a variable length record be implemented? Compare dense and sparse ordered indices.
CT301060048 marksasked 1x2081 Ashwin B
Explain fixed-length and variable-length records in File Organization. Explain how a hash index works. Distinguish between static and dynamic hashing.
CT301060058 marksasked 1x2082 Kartik B

File Structure and Hashing

Disks and storage

Records organizations

Ordered indices

B+ tree index

Hashing concepts — Static and dynamic hashing

Chapter 6 — File Structure and Hashing

6.1 Storage Hierarchy

RAID (Redundant Array of Independent Disks)

6.2 Record Organization

Fixed-Length Records

Variable-Length Records

File Organization Strategies

Buffer Management

6.3 Ordered Indices

Dense vs Sparse Index

Primary vs Secondary Index

Multi-Level Index

6.4 B+ Tree Index

B+ Tree Structural Properties (order n)

Search in B+ Tree

Range Queries with B+ Tree

Why B+ Tree is Preferred Over B-Tree

B+ Tree Insertions and Deletions

6.5 Hashing

Hash Function

Static Hashing

Extendable (Dynamic) Hashing

Bitmap Indices

Hashing vs B+ Tree Index

Exam Quick-Reference — Chapter 6