Intro

DBMS system different purposes:

temporary hot data
long-lived cold storage
complex analytical queries
access values by the key
optimized to store time-series data
store large blobs efficiently

DBMS three major categories:

Online transaction processing (OLTP) dbs
- handle a large number of user-facing requests and transactions.
- queries are often predefined and short-lived
Online analytical processing (OLAP) dbs
- handle complex aggregations
- used for analytics and data warehousing
- capable of handling complex, long-running ad hoc queries
Hybrid transactional and analytical processing (HTAP) dbs
- combine properties of both OLTP and OLAP stores

DBMS Architecture

DBMS use a client/server model, where DBS instances (nodes) take the role of servers and application instances take the role of clients

System Architecture

Transport
- Cluster Communication
- Client Communication
Query Processor
- Query Parser
- Query Optimizer
  - first eliminates impossible and redundant parts of the query
  - then attempts to find the most efficient way to execute it based on internal statistics and data placement
  - handles both relational operations and optimizations
Execution Engine
- Remote Execution
  - involve writing and reading data to and from other nodes in the cluster and replication
- Local Execution
Storage Engine
- Transaction Manager
  - This manager schedules transactions and ensures they cannot leave that database in a logically inconsistent state
- Lock Manager
  - locks on the database objects for the running transactions, ensuring that concurrent operations do not violate physical data integrity
- Access Methods(storage structures)
  - manage access and organizing data on disk
  - include heap files and storage structures such as B-Trees and LSM Trees
- Buffer Manager
  - caches data pages in memory
- Recovery Manager
  - maintains the operation log and restoring the system state in case of a failure

![[Pasted image 20241103183620.png]]

Memory Versus Disk-Based DBMS

In-memory DBMS (main memory DBMS)
- stores data primarily in memory and use the disk for recovery and logging
- stores the contents almost exclusively in RAM
Disk-based DBMS
- hold most of the data on disk and use memory for caching disk contents or as a temporary storage

Durability in Memory-Based Stores

Core Concepts

Basic Principles

In-memory databases maintain disk backups for durability
Prevents loss of volatile data
Some databases are memory-only (without durability) but not covered in scope

Durability Mechanism

Write-ahead Logging
- Operations must be written to sequential log file
- Essential for operation completion
- Enables recovery capabilities
Backup Copy Management
- Maintained as sorted disk-based structure
- Modifications often asynchronous
- Applied in batches to reduce I/O operations
- Used for recovery alongside logs

Checkpointing Process

Log records applied to backup in batches
After processing, backup represents specific time-point snapshot
Previous log contents can be discarded
Benefits:
- Reduces recovery times
- Keeps disk-resident database updated
- Doesn't block client operations

Key Differences from Disk-Based Systems

Performance Considerations

In-memory databases ≠ disk databases with large page cache
Reasons:
- Serialization format creates overhead
- Data layout limitations
- Less optimization flexibility

Storage Structure Optimization

Memory-Based Advantages
- Fast pointer following
- Quick random memory access
- More data structure options
- Greater optimization possibilities
Disk-Based Characteristics
- Uses specialized storage structures
- Optimized for disk access
- Often uses wide, short trees
- Requires special handling for variable-size data

Use Cases

Suitable Scenarios

Dataset fits entirely in memory
Bounded datasets like:
- School student records
- Corporate customer data
- Online store inventory
Record sizes typically few Kb
Limited number of records

Practical Implementation

Pointer-based value references for variable data
More flexible data structure choices
Enhanced optimization capabilities

Notes

Important distinction: In-memory database performance advantages come from fundamental architectural differences, not just from avoiding disk access.

Column Versus Row-Oriented Database Management Systems

Basic Structure

Database Components

Tables consist of:
- Columns
- Rows
- Fields (intersection of column and row)

Field Characteristics

Represents single value of specific type
Fields in same column typically share data type
Example: User records table
- All names would be same type
- All names belong to same column

Storage Classification

Primary Classification Methods

Based on disk storage approach:
1. Row-wise storage
2. Column-wise storage

Partitioning Methods

Horizontal Partitioning
- Stores values belonging to same row together
- Row-oriented approach
Vertical Partitioning
- Stores values belonging to same column together
- Column-oriented approach

![[Pasted image 20241103223418.png]] Figure 1-2: Visual representation of storage methods

Part (a): Column-wise partitioning
Part (b): Row-wise partitioning

Common DBMS Examples

Row-Oriented Systems

Traditional relational databases including:

MySQL
PostgreSQL
Other traditional RDBMS

Column-Oriented Systems

Pioneer open source implementations:

MonetDB
C-Store
- Open source predecessor to Vertica

Visual Reference Notes

The provided image illustrates:

Clear comparison between column vs row storage
Symbol, Date, and Price field arrangements
Different organizational structures for each approach
Should be placed after the "Partitioning Methods" section for optimal reference

Row-Oriented Data Layout in DBMS

Basic Concept

Stores data in records/rows
Layout closely resembles tabular data
Each row contains same set of fields

Example Data Structure

| ID | Name  | Birth Date  | Phone Number    |
|----|-------|-------------|-----------------|
| 10 | John  | 01 Aug 1981 | +1 111 222 333 |
| 20 | Sam   | 14 Sep 1988 | +1 555 888 999 |
| 30 | Keith | 07 Jan 1984 | +1 333 444 555 |

Key Characteristics

Record Structure

Fields form complete records
Records uniquely identified by key
Example components:
- Name
- Birth date
- Phone number
- ID (monotonically incremented key)

Data Access Patterns

Record Operations
- Multiple fields read together
- Written together (e.g., during registration)
- Individual field modifications possible
Spatial Locality
- Improved by storing entire rows together
- Beneficial for row-based access patterns

Storage Implications

Block-wise Access

Persistent medium (disk) accessed in blocks
Single block contains data for all columns
Advantages:
- Efficient for accessing complete user records

Performance Considerations

Efficient for:
- Full record retrieval
- Complete record insertions
- Row-based operations
Less Efficient for:
- Individual field queries across multiple records
- Example: Querying only phone numbers
- Reason: Unnecessary data for other fields must be paged in

Use Cases

User registration systems
Transaction records
Any scenario where entire records are frequently accessed together

Notes

Spatial locality optimization important for performance
Block-level access affects query efficiency
Trade-off between full record access and column-specific queries

Column-Oriented Data Layout in DBMS

Basic Concept

Partitions data vertically (by column)
Values from same column stored contiguously on disk
Contrasts with row-oriented storage

Data Representation

Logical View (Table Format)

| ID | Symbol | Date        | Price      |
|----|--------|-------------|------------|
| 1  | DOW    | 08 Aug 2018 | 24,314.65 |
| 2  | DOW    | 09 Aug 2018 | 24,136.16 |
| 3  | S&P    | 08 Aug 2018 | 2,414.45  |
| 4  | S&P    | 09 Aug 2018 | 2,232.32  |

Physical Layout (Column-Based Storage)

Symbol: 1:DOW; 2:DOW; 3:S&P; 4:S&P
Date:   1:08 Aug 2018; 2:09 Aug 2018; 3:08 Aug 2018; 4:09 Aug 2018
Price:  1:24,314.65; 2:24,136.16; 3:2,414.45; 4:2,232.32

Key Features

Storage Organization

Values from same column stored together
Separate files or file segments for different columns
Enables efficient column-specific queries

Data Reconstruction

Metadata Requirements
- Need metadata at column level
- Used to identify associated data points across columns
- Important for:
  - Joins
  - Filtering
  - Multirow aggregates
Identification Methods
- Virtual IDs (implicit identifiers)
- Uses value position/offset for mapping
- More efficient than explicit keys

Advantages

Query Efficiency
- Single-pass column reading
- No need to read entire rows
- Avoids loading unnecessary columns
Analytical Capabilities
- Excellent for computing aggregates
- Efficient for:
  - Finding trends
  - Computing averages
  - Complex analytical queries

Modern Implementations

Column-Oriented File Formats

Apache Parquet
Apache ORC
RCFile

Column-Oriented Stores

Apache Kudu
ClickHouse
Various others

Use Cases

Analytical workloads
Complex aggregations
Scenarios where specific columns have different importance
Large datasets requiring efficient column-specific access

Historical Context

Growing popularity due to:
- Increasing demand for complex analytical queries
- Growing dataset sizes
- Need for efficient data processing

Distinctions and Optimizations: Row vs Column Stores

Key Concept

Difference extends beyond mere data storage methods
Data layout is just one aspect of many optimizations
Choice depends heavily on specific use cases

Performance Optimizations

Cache Utilization

Column Store Benefits
- Multiple values from same column read in one run
- Improved cache utilization
- Enhanced computational efficiency
CPU Optimization
- Modern CPUs can use vectorized instructions
- Single CPU instruction processes multiple data points
- Parallel processing capabilities

Compression Advantages

Data Type Grouping
- Similar data types stored together
- Numbers with numbers
- Strings with strings
Compression Benefits
- Better compression ratios
- Type-specific compression algorithms
- Optimized compression methods per data type

Selection Criteria

Row-Oriented Preference

Best for:

Record-based data consumption
Most/all columns frequently requested
Workloads involving:
- Point queries
- Range scans

Column-Oriented Preference

Optimal for:

Scans spanning many rows
Aggregate computations over column subsets
Analytical workloads

Decision Framework

Key Considerations

Access Patterns
- How is data typically accessed?
- What percentage of columns are usually needed?
- Frequency of full record retrieval
Workload Type
- Nature of typical queries
- Balance between read and write operations
- Analytical vs transactional processing
Performance Requirements
- Cache efficiency needs
- Compression importance
- CPU optimization requirements

Best Practices

Thoroughly analyze access patterns before choosing
Consider hybrid approaches for mixed workloads
Factor in both current and future requirements
Evaluate compression needs and benefits
Consider hardware optimization capabilities

Data Files and Index Files

Core Concepts

Database systems use specialized file organization rather than simple filesystem hierarchies to store and manage data efficiently. This approach offers several key advantages over flat files.

Key Advantages of Specialized File Organization

1. Storage Efficiency

Files are organized to minimize storage overhead per stored data record
Optimized storage formats and structures

2. Access Efficiency

Records can be located in the minimum possible number of steps
Reduces search time and computational overhead

3. Update Efficiency

Record updates are optimized to minimize disk changes
Efficient modification of existing data

File Organization Structure

Data Storage

Records containing multiple fields are stored in tables
Each table is typically represented as a separate file
Records can be looked up using search keys

Index System

Indexes serve as auxiliary data structures
Enable efficient record location without full table scans
Built using subset of fields that identify records

File Separation

Data Files
- Store actual data records
- Contain the primary information
Index Files
- Store record metadata
- Used to locate records in data files
- Typically smaller than data files

Page Organization

Files are partitioned into pages
Page size typically matches single or multiple disk blocks
Two main organization methods:
- Sequences of records
- Slotted pages

Record Management

Record Operations

New Records (Insertions)
- Represented as key/value pairs
- Added to appropriate pages
Updates
- Also handled as key/value pairs
- Modify existing record data
Deletions
- Modern systems use deletion markers (tombstones)
- Contain deletion metadata:
  - Key
  - Timestamp

Garbage Collection

Process to reclaim space from deleted/updated records
Operations:
1. Reads pages
2. Writes live (nonshadowed) records to new location
3. Discards shadowed records
Helps maintain storage efficiency over time

Intro

DBMS Architecture

System Architecture

Transport

Query Processor

Execution Engine

Storage Engine

Memory Versus Disk-Based DBMS

In-memory DBMS (main memory DBMS)

Disk-based DBMS

Durability in Memory-Based Stores

Core Concepts

Basic Principles

Durability Mechanism

Checkpointing Process

Key Differences from Disk-Based Systems

Performance Considerations

Storage Structure Optimization

Use Cases

Suitable Scenarios

Practical Implementation

Notes

Column Versus Row-Oriented Database Management Systems

Basic Structure

Database Components

Field Characteristics

Storage Classification

Primary Classification Methods

Partitioning Methods

Common DBMS Examples

Row-Oriented Systems

Column-Oriented Systems

Visual Reference Notes

Row-Oriented Data Layout in DBMS

Basic Concept

Example Data Structure

Key Characteristics

Record Structure

Data Access Patterns

Storage Implications

Block-wise Access

Performance Considerations

Use Cases

Notes

Column-Oriented Data Layout in DBMS

Basic Concept

Data Representation

Logical View (Table Format)

Physical Layout (Column-Based Storage)

Key Features

Storage Organization

Data Reconstruction

Advantages

Modern Implementations

Column-Oriented File Formats

Column-Oriented Stores

Use Cases

Historical Context

Distinctions and Optimizations: Row vs Column Stores

Key Concept

Performance Optimizations

Cache Utilization

Compression Advantages

Selection Criteria

Row-Oriented Preference

Column-Oriented Preference

Decision Framework

Key Considerations

Best Practices

Data Files and Index Files

Core Concepts

Key Advantages of Specialized File Organization

1. Storage Efficiency

2. Access Efficiency

3. Update Efficiency

File Organization Structure

Data Storage

Index System

File Separation

Page Organization