Databases: Theoretical Squeeze
Comprehensive guide to Database internals: Indexing, Normalization, ACID, Transactions, and Optimization strategies.
SQL vs NoSQL
SQL (Relational)
- Stores strictly structured data about real-world entities (people, cities).
- Data is stored in Tables which are interconnected via keys.
- Scale: Vertically (stronger server).
NoSQL (Non-Relational)
- Stores data in hierarchical/document structures (e.g., JSON).
- Flexible schema.
- Scale: Horizontally (more servers).
- Analogy: What SQL stores in multiple linked tables, NoSQL might store in a single document.
Normalization
Normalization is the process of structuring a relational database to reduce redundancy and improve data integrity.
1. First Normal Form (1NF)
- Atomicity: Values must be atomic.
- Rule: One field = one value. No lists separated by commas.
2. Second Normal Form (2NF)
- Uniqueness: Each row must have a unique Primary Key.
- No partial dependency on a composite key.
3. Third Normal Form (3NF)
- No Transitive Dependency: Non-key attributes must depend only on the primary key, not on other non-key attributes.
- Rule: Move independent data to separate tables.
Denormalization
Intentionally bringing the database into a "lower" normal form (adding redundancy) to speed up reads by avoiding JOINs. Trade-off: Faster reads, slower writes, risk of data inconsistency.
Indexes
Indexes are special lookup tables that the database search engine can use to speed up data retrieval.
How it works: Instead of a Linear Search (scanning every row, O(N)), indexes often use a B-Tree structure (O(log N)). Indexes point to the physical location (page/byte) of the data on the disk.
Types of Indexes
1. Clustered Index
- Stores the actual data rows in the leaf nodes of the index tree.
- Sorts the physical storage of data.
- Limit: Only ONE per table (usually the Primary Key).
- Analogy: A dictionary (the book IS the index).
2. Non-Clustered Index
- Stores pointers to the data rows.
- The physical order of rows is not changed.
- Limit: Multiple per table.
- Analogy: The index at the back of a book.
3. Composite Index
- An index on multiple columns (e.g.,
Lastname, Firstname). - Left-to-Right Rule: Crucial! An index on
(A, B)helps queries onAandA+B, but NOT onBalone. - Analogy: A phone book sorted by Lastname then Firstname.
4. Prefix Index
- Indexing only the first N characters of a string to save space.
- Useful for long
VARCHARorTEXTfields. - Trade-off: Lower selectivity if prefixes are similar.
5. Invisible Index (MySQL 8.0+)
- An index that is maintained but ignored by the optimizer. Used for testing performance impacts without deleting the index.
Key Constraints
- Primary Key: Unique identifier. Cannot be NULL. Acts as Clustered Index in MySQL.
- Unique Key: Ensures uniqueness. Can allow NULLs. Useful for business logic (e.g., email).
Optimization Strategies
1. Hardware
- Storage: SSDs drastically improve random I/O performance over HDDs.
- Configuration: Tune memory allocation (Buffer Pool), max connections, and cache sizes for the specific workload.
2. Schema & Indexes
- Don't over-index: Indexes slow down INSERT/UPDATE operations.
- Remove unused indexes: They waste space and CPU.
- Maintenance: Rebuild fragmented indexes periodically.
3. Application Interaction
- Connection Management: Close sessions properly. Use Connection Pools.
- Partitioning: Split large tables into logical parts (e.g., by month) to speed up queries and management.
- Replication:
- Master-Slave: Write to Master, Read from Slaves.
- Offload heavy analytics to a read-only replica.
- Batching: Send data in batches (e.g., bulk inserts) instead of one by one.
- Asynchrony: Don't block the user interface for DB operations.
4. Query Optimization
- EXPLAIN / EXPLAIN ANALYZE: Use these tools to see the execution plan (index usage, rows scanned).
- Selectivity: Select only necessary columns (avoid
SELECT *). - Filtering: Filter data early (before JOINs).
- Subqueries vs Joins: Joins are generally faster, but modern optimizers handle subqueries well. Avoid correlated subqueries.
- Pagination:
OFFSETis slow on large datasets; use keyset pagination (cursor-based) where possible.
Transactions & ACID
Transaction: A sequence of operations treated as a single logical unit. Either all execute, or none.
ACID Properties:
- Atomicity: All or nothing. If one part fails, the whole transaction rolls back.
- Consistency: Data remains in a valid state (constraints satisfied) before and after.
- Isolation: Concurrent transactions don't interfere with each other (Levels: Read Uncommitted, Read Committed, Repeatable Read, Serializable).
- Durability: Once committed, changes are permanent (survive power loss).
Scaling: Partitioning vs Sharding
Partitioning (Logical)
- Splitting a table into smaller parts on the same server.
- Benefit: Faster queries on ranges, easier data management (e.g., drop old partition).
Sharding (Physical)
- Distributing data across multiple physical servers.
- Benefit: Horizontal scaling (CPU, RAM, Disk).
- Complexity: Routing logic, cross-shard joins are difficult.
MySQL vs PostgreSQL
MySQL
- Thread-based: One thread per connection. Lower memory overhead per connection.
- Better for simple read-heavy workloads initially.
- Supported by Oracle.
PostgreSQL
- Process-based: One process per connection. Higher isolation and stability, but higher memory cost.
- Bouncer: Almost always needs a connection pooler (PgBouncer) for high loads.
- More advanced features (JSONB, GIS, partial indexes).
- Open Source community.
SQL Cheat Sheet
Order of Operations
FROM&JOINWHERE(Restrict rows)GROUP BY(Aggregation)HAVING(Filter groups)SELECT(Columns)ORDER BYLIMIT
Joins
- INNER JOIN: Matches in both tables.
- LEFT JOIN: All from Left, matching from Right (or NULL).
- RIGHT JOIN: All from Right, matching from Left.
- FULL JOIN: All from both.
Examples
Basic Select
sqlSELECT DISTINCT t.surname, t.age, l.name FROM teacher t INNER JOIN lesson l ON t.id = l.teacher_id WHERE t.surname NOT LIKE '%trov%' AND t.age BETWEEN 31 AND 80 ORDER BY l.name DESC LIMIT 3;
Subquery with ANY/ALL
sql-- Find products cheaper than the cheapest Apple product SELECT * FROM Products WHERE Price < (SELECT MIN(Price) FROM Products WHERE Manufacturer='Apple');
Window Functions
sql-- Find the time of the NEXT event for each user SELECT user_id, created_at, LEAD(created_at) OVER(PARTITION BY user_id ORDER BY created_at ASC) as next_event_time FROM events;