2024-07-25
#databases
#sql
#backend
#optimization
#system-design

Databases: Theoretical Squeeze

Comprehensive guide to Database internals: Indexing, Normalization, ACID, Transactions, and Optimization strategies.

SQL vs NoSQL

SQL (Relational)

  • Stores strictly structured data about real-world entities (people, cities).
  • Data is stored in Tables which are interconnected via keys.
  • Scale: Vertically (stronger server).

NoSQL (Non-Relational)

  • Stores data in hierarchical/document structures (e.g., JSON).
  • Flexible schema.
  • Scale: Horizontally (more servers).
  • Analogy: What SQL stores in multiple linked tables, NoSQL might store in a single document.

Normalization

Normalization is the process of structuring a relational database to reduce redundancy and improve data integrity.

1. First Normal Form (1NF)

  • Atomicity: Values must be atomic.
  • Rule: One field = one value. No lists separated by commas.

2. Second Normal Form (2NF)

  • Uniqueness: Each row must have a unique Primary Key.
  • No partial dependency on a composite key.

3. Third Normal Form (3NF)

  • No Transitive Dependency: Non-key attributes must depend only on the primary key, not on other non-key attributes.
  • Rule: Move independent data to separate tables.

Denormalization

Intentionally bringing the database into a "lower" normal form (adding redundancy) to speed up reads by avoiding JOINs. Trade-off: Faster reads, slower writes, risk of data inconsistency.


Indexes

Indexes are special lookup tables that the database search engine can use to speed up data retrieval.

How it works: Instead of a Linear Search (scanning every row, O(N)), indexes often use a B-Tree structure (O(log N)). Indexes point to the physical location (page/byte) of the data on the disk.

Types of Indexes

1. Clustered Index

  • Stores the actual data rows in the leaf nodes of the index tree.
  • Sorts the physical storage of data.
  • Limit: Only ONE per table (usually the Primary Key).
  • Analogy: A dictionary (the book IS the index).

2. Non-Clustered Index

  • Stores pointers to the data rows.
  • The physical order of rows is not changed.
  • Limit: Multiple per table.
  • Analogy: The index at the back of a book.

3. Composite Index

  • An index on multiple columns (e.g., Lastname, Firstname).
  • Left-to-Right Rule: Crucial! An index on (A, B) helps queries on A and A+B, but NOT on B alone.
  • Analogy: A phone book sorted by Lastname then Firstname.

4. Prefix Index

  • Indexing only the first N characters of a string to save space.
  • Useful for long VARCHAR or TEXT fields.
  • Trade-off: Lower selectivity if prefixes are similar.

5. Invisible Index (MySQL 8.0+)

  • An index that is maintained but ignored by the optimizer. Used for testing performance impacts without deleting the index.

Key Constraints

  • Primary Key: Unique identifier. Cannot be NULL. Acts as Clustered Index in MySQL.
  • Unique Key: Ensures uniqueness. Can allow NULLs. Useful for business logic (e.g., email).

Optimization Strategies

1. Hardware

  • Storage: SSDs drastically improve random I/O performance over HDDs.
  • Configuration: Tune memory allocation (Buffer Pool), max connections, and cache sizes for the specific workload.

2. Schema & Indexes

  • Don't over-index: Indexes slow down INSERT/UPDATE operations.
  • Remove unused indexes: They waste space and CPU.
  • Maintenance: Rebuild fragmented indexes periodically.

3. Application Interaction

  • Connection Management: Close sessions properly. Use Connection Pools.
  • Partitioning: Split large tables into logical parts (e.g., by month) to speed up queries and management.
  • Replication:
    • Master-Slave: Write to Master, Read from Slaves.
    • Offload heavy analytics to a read-only replica.
  • Batching: Send data in batches (e.g., bulk inserts) instead of one by one.
  • Asynchrony: Don't block the user interface for DB operations.

4. Query Optimization

  • EXPLAIN / EXPLAIN ANALYZE: Use these tools to see the execution plan (index usage, rows scanned).
  • Selectivity: Select only necessary columns (avoid SELECT *).
  • Filtering: Filter data early (before JOINs).
  • Subqueries vs Joins: Joins are generally faster, but modern optimizers handle subqueries well. Avoid correlated subqueries.
  • Pagination: OFFSET is slow on large datasets; use keyset pagination (cursor-based) where possible.

Transactions & ACID

Transaction: A sequence of operations treated as a single logical unit. Either all execute, or none.

ACID Properties:

  • Atomicity: All or nothing. If one part fails, the whole transaction rolls back.
  • Consistency: Data remains in a valid state (constraints satisfied) before and after.
  • Isolation: Concurrent transactions don't interfere with each other (Levels: Read Uncommitted, Read Committed, Repeatable Read, Serializable).
  • Durability: Once committed, changes are permanent (survive power loss).

Scaling: Partitioning vs Sharding

Partitioning (Logical)

  • Splitting a table into smaller parts on the same server.
  • Benefit: Faster queries on ranges, easier data management (e.g., drop old partition).

Sharding (Physical)

  • Distributing data across multiple physical servers.
  • Benefit: Horizontal scaling (CPU, RAM, Disk).
  • Complexity: Routing logic, cross-shard joins are difficult.

MySQL vs PostgreSQL

MySQL

  • Thread-based: One thread per connection. Lower memory overhead per connection.
  • Better for simple read-heavy workloads initially.
  • Supported by Oracle.

PostgreSQL

  • Process-based: One process per connection. Higher isolation and stability, but higher memory cost.
  • Bouncer: Almost always needs a connection pooler (PgBouncer) for high loads.
  • More advanced features (JSONB, GIS, partial indexes).
  • Open Source community.

SQL Cheat Sheet

Order of Operations

  1. FROM & JOIN
  2. WHERE (Restrict rows)
  3. GROUP BY (Aggregation)
  4. HAVING (Filter groups)
  5. SELECT (Columns)
  6. ORDER BY
  7. LIMIT

Joins

  • INNER JOIN: Matches in both tables.
  • LEFT JOIN: All from Left, matching from Right (or NULL).
  • RIGHT JOIN: All from Right, matching from Left.
  • FULL JOIN: All from both.

Examples

Basic Select

sql
SELECT DISTINCT t.surname, t.age, l.name 
FROM teacher t
INNER JOIN lesson l ON t.id = l.teacher_id 
WHERE t.surname NOT LIKE '%trov%' AND t.age BETWEEN 31 AND 80 
ORDER BY l.name DESC 
LIMIT 3;

Subquery with ANY/ALL

sql
-- Find products cheaper than the cheapest Apple product
SELECT * FROM Products
WHERE Price < (SELECT MIN(Price) FROM Products WHERE Manufacturer='Apple');

Window Functions

sql
-- Find the time of the NEXT event for each user
SELECT 
    user_id,
    created_at,
    LEAD(created_at) OVER(PARTITION BY user_id ORDER BY created_at ASC) as next_event_time
FROM events;

Connected Thoughts

Egor Zdioruc | Lead Full Stack Developer | Laravel & AI Solutions