Chapter 6 — File Structure and Hashing
6.1 Storage Hierarchy
Data in a DBMS moves between different storage levels based on access frequency. The further from the CPU, the cheaper but slower the storage.
- Primary storage: CPU cache + RAM — fast, volatile (data lost on power failure)
- Secondary storage: SSD + HDD — slower, non-volatile, much cheaper per GB
- Tertiary storage: Tape/optical — very slow, used for archival backups
The DBMS buffer manager controls which disk pages are kept in RAM (the buffer pool). Most DBMS use LRU (Least Recently Used) or Clock replacement to decide which pages to evict.
RAID (Redundant Array of Independent Disks)
RAID provides higher performance and/or fault tolerance by distributing data across multiple physical disks.
| RAID Level | Technique | Read Perf | Write Perf | Fault Tolerance | Typical Use |
|---|
| RAID 0 | Striping only (data split across disks) | Very fast | Very fast | None — one disk failure = total data loss | Scratch/temp data |
| RAID 1 | Mirroring (exact copy on second disk) | Fast (read from either) | Slower (write to both) | 1 disk failure | Critical data, write-ahead logs |
| RAID 5 | Striping + distributed parity | Fast | Moderate | 1 disk failure | General-purpose databases |
| RAID 6 | Striping + dual parity | Fast | Slower | 2 disk failures | High-availability systems |
| RAID 10 | RAID 1 + RAID 0 (mirror then stripe) | Very fast | Fast | 1 disk per mirror group | High-performance databases |
Stable storage: Storage that survives any single failure — approximated by writing to multiple independent disks (RAID 1/10) before acknowledging a write.
6.2 Record Organization
A record is the storage unit for one tuple. Records are stored on disk pages/blocks (typically 4 KB–16 KB).
Fixed-Length Records
Every record occupies exactly the same number of bytes.
- Record i starts at byte offset: i × record_length (from start of file)
- Deletion: mark slot as "deleted" with a tombstone byte, or maintain a free list
Advantages: Simple; fast direct access by record number.
Disadvantages: Wastes space if variable-length fields are common; NULL values still consume space.
Variable-Length Records
Records can have different sizes (VARCHAR, TEXT fields).
Two common storage approaches:
- Fixed-length fields first, then variable-length fields separated by special delimiters
- Slotted page structure (most common in modern RDBMS)
Key advantage of slotted pages: Records can be moved within the page to compact free space. External pointers reference (page_id, slot_number) — the slot array is updated when the record moves internally, so external references stay valid.
File Organization Strategies
| Strategy | Description | Good For | Bad For |
|---|
| Heap file | Records in no particular order; new records appended | Bulk loads; full table scans | Range queries; sorted access |
| Sequential/sorted file | Records sorted by a search key | Range queries; merge joins; ORDER BY | Random inserts (must maintain sort order) |
| Clustered file | Related records from different tables stored on same block | Joins on common key | Mixed access patterns |
Buffer Management
The buffer pool is a pool of RAM frames, each holding one disk block. On a block request:
- Buffer hit — block already in RAM → return directly (no disk I/O)
- Buffer miss — block not in RAM → read from disk into a free frame (evict another frame using LRU if buffer is full)
6.3 Ordered Indices
An index maps search key values to record locations, enabling faster lookups without a full sequential scan.
Dense vs Sparse Index
| Dense Index | Sparse Index |
|---|
| Entries | One index entry for every record | One index entry for every disk block |
| Space | More (one entry per record) | Less (one entry per block) |
| Requirement | Works on any file organization | Only works on sorted data files |
| Lookup | Direct: find key, follow pointer | Find nearest block boundary, then scan within block |
Primary vs Secondary Index
| Primary (Clustering) Index | Secondary (Non-clustering) Index |
|---|
| Built on | The sort key of a physically sorted file | Any other attribute or combination |
| Organization | Index order = file physical order | Index order ≠ file order |
| Type | Can be sparse or dense | Always dense (must point to each record) |
| Number | At most one per relation | Many allowed per relation |
Multi-Level Index
Even a sparse primary index may be too large to search sequentially for a huge table.
Solution: build a sparse index on top of the index → index of index.
At the top level, only one block needs to be loaded. Then follow pointers down through levels.
This is the foundational idea behind B+ trees.
6.4 B+ Tree Index
The B+ tree is the standard index structure used in all major RDBMS (PostgreSQL, MySQL InnoDB, Oracle, SQL Server). It is a balanced, ordered tree where all data pointers live in the leaf nodes.
B+ Tree Structural Properties (order n)
| Property | Rule |
|---|
| All leaves at same depth | Guarantees O(log n) height for all keys — fully balanced |
| Internal nodes | Between ⌈n/2⌉ and n child pointers; between ⌈n/2⌉ − 1 and n − 1 keys |
| Leaf nodes | Between ⌈(n−1)/2⌉ and n−1 key-pointer pairs |
| Leaf linkage | Leaves are linked in a doubly-linked list → efficient range queries |
| Root (non-leaf) | Has at least 2 children |
Search in B+ Tree
Starting at root, compare the search key against each internal node's keys to follow the correct child pointer. At the leaf, use binary search within the node.
- Complexity: O(log_n(N)) where N = number of records, n = order
- Typical height: 3–4 levels for millions of records (n ≈ 100 in practice → 100³ = 1M records in 3 levels)
Range Queries with B+ Tree
- Find the leaf containing the start of the range
- Follow the linked leaf list scanning forward until the end of the range
This is extremely efficient — sequential I/O on the leaf level.
Why B+ Tree is Preferred Over B-Tree
| Feature | B-tree | B+ tree |
|---|
| Data pointers | In both internal and leaf nodes | Only in leaf nodes |
| Range queries | In-order traversal of tree (complex, random I/O) | Scan linked leaves (simple, sequential I/O) |
| Internal node fanout | Lower (nodes store data too) | Higher (nodes store keys only → wider branching) |
| Consistent access time | Not always (keys may be in internal nodes) | Always — every search reaches a leaf |
| Used in RDBMS | Rarely | Universally |
Exam answer — "Why is B+ tree preferred?"
B+ tree stores all data pointers only in leaf nodes, which are linked in a sorted list. This makes range queries very efficient — find the start key and then scan leaves sequentially. Internal nodes store only keys, allowing a higher branching factor and a shallower tree. Every search always traverses from root to a leaf, giving consistent performance.
B+ Tree Insertions and Deletions
- Insert: Find the correct leaf and insert. If the leaf overflows (> n−1 keys), split: the median key moves up to the parent. If the parent overflows, split recursively. If the root splits, the tree grows taller.
- Delete: Find and remove the key. If the leaf underflows (< ⌈(n−1)/2⌉ keys), either borrow from a sibling (redistribute) or merge with a sibling. Merging propagates up, potentially reducing height.
6.5 Hashing
Hash Function
A hash function h(K) maps search key K to a bucket number in range [0, B−1].
A good hash function distributes keys uniformly across buckets.
Example: h(K) = K mod B (for integer keys)
Static Hashing
- Fixed number of buckets B
- Each bucket holds one or more pages
- Overflow buckets: when a bucket fills, chain overflow pages with pointers
- Problem: as the file grows, overflow chains lengthen and performance degrades. Reorganizing the entire hash file is expensive (recompute h for all keys with new B).
Extendable (Dynamic) Hashing
Grows the directory dynamically to handle more data without a full file reorganization.
Key concepts:
- Directory: an array of 2^d pointers to buckets
- Global depth d: the number of hash bits used to index the directory; directory size = 2^d
- Local depth l_i: the number of bits used to hash to bucket i (l_i ≤ d at all times)
Extendable Hashing — Search and Insert
SEARCH for key K:
1. Compute h(K)
2. Use the LAST d bits of h(K) as an index into the directory
3. Follow the directory pointer to the bucket
4. Sequentially search within the bucket
Complexity: O(1) average
INSERT key K:
1. Compute h(K), find the target bucket (as above)
2. If bucket has free space → insert key directly ✓
3. If bucket OVERFLOWS:
Case A — local depth l_i < global depth d:
→ Split the bucket: create two new buckets, local depth becomes l_i + 1
→ Redistribute keys between the two new buckets based on the (l_i+1)th hash bit
→ Update directory entries (only those currently pointing to the split bucket)
→ Directory SIZE does not change
Case B — local depth l_i = global depth d:
→ First, DOUBLE the directory:
d becomes d + 1; directory size goes from 2^d to 2^(d+1)
Each old entry i is replicated: entries i and i+2^d both point to same bucket
→ Then proceed with Case A: split the overflowing bucket
──────────────────────────────────────────────────────────────────────
Worked Example:
Initial state: d=1 (global depth), 2 buckets
Directory: [0 → Bucket A (l=1), 1 → Bucket B (l=1)]
Bucket A: holds keys where h(k) ends in bit '0'
Bucket B: holds keys where h(k) ends in bit '1'
Insert key with h(k) = 1111 → last 1 bit = 1 → goes to Bucket B
Suppose Bucket B is FULL (overflows). Local depth l=1 = global depth d=1.
Step 1: Double directory → d becomes 2
Old directory: [0→A, 1→B]
New directory: [00→A, 01→B, 10→A, 11→B] (entries duplicated)
Step 2: Split Bucket B (keys ending in '1', l=1)
Keys ending in '01' → new Bucket B0 (l=2)
Keys ending in '11' → new Bucket B1 (l=2)
Update directory: 01 → B0, 11 → B1
Final directory: [00→A, 01→B0, 10→A, 11→B1]
Search after restructure: any key still found in O(1) using last d=2 bits
No existing keys needed to be rehashed to a completely new file!
Bitmap Indices
A bitmap index stores a bit-array for each distinct value of a low-cardinality attribute.
- Bit i is set to 1 if tuple i has that attribute value
- Very efficient for AND, OR, NOT queries using bitwise operations
- Ideal for data warehouses where attributes have few distinct values (e.g., gender, status, department code)
- Not efficient for high-cardinality attributes (e.g., student ID — would need one bitmap per student)
Hashing vs B+ Tree Index
| Operation | B+ Tree | Hash Index |
|---|
| Exact match (=) | O(log n) | O(1) average |
| Range query (<, >, BETWEEN, LIKE 'A%') | O(log n + result size) | ❌ Not supported |
| Ordered output (ORDER BY) | Yes — scan leaves in order | No |
| Worst case | O(log n) — guaranteed | O(n) — if many hash collisions |
| Best use case | General-purpose; most queries | Point lookups only (caching, session storage) |
Exam Quick-Reference — Chapter 6
- Storage hierarchy: Cache (ns) → RAM (100 ns) → SSD (0.1 ms) → HDD (10 ms) → Tape (seconds)
- RAID 1 = mirroring (1 disk failure tolerated); RAID 5 = striping + parity (1 disk); RAID 6 = 2 disks; RAID 10 = mirror + stripe (fastest + resilient)
- Dense index: one entry per record; Sparse index: one entry per block (sorted files only)
- B+ tree leaves are linked → range queries = find start leaf, then scan linked list
- B+ tree internal nodes store keys only (no data) → higher branching factor → shallower tree → consistent O(log n)
- Extendable hashing: global depth d → directory has 2^d entries; overflow → split bucket; if local depth = global depth → double directory first
- Static hashing → overflow chains; Extendable hashing → split + directory doubling; no full file reorg
- Hash index: O(1) exact match, NO range queries; B+ tree: O(log n) for everything including ranges