Comprehensive guide to Database internals: Indexing, Normalization, ACID, Transactions, and Optimization strategies.

SQL vs NoSQL

SQL (Relational)

Stores strictly structured data about real-world entities (people, cities).
Data is stored in Tables which are interconnected via keys.
Scale: Vertically (stronger server).

NoSQL (Non-Relational)

Stores data in hierarchical/document structures (e.g., JSON).
Flexible schema.
Scale: Horizontally (more servers).
Analogy: What SQL stores in multiple linked tables, NoSQL might store in a single document.

Normalization

Normalization is the process of structuring a relational database to reduce redundancy and improve data integrity.

1. First Normal Form (1NF)

Atomicity: Values must be atomic.
Rule: One field = one value. No lists separated by commas.

2. Second Normal Form (2NF)

Uniqueness: Each row must have a unique Primary Key.
No partial dependency on a composite key.

3. Third Normal Form (3NF)

No Transitive Dependency: Non-key attributes must depend only on the primary key, not on other non-key attributes.
Rule: Move independent data to separate tables.

Denormalization

Intentionally bringing the database into a "lower" normal form (adding redundancy) to speed up reads by avoiding JOINs. Trade-off: Faster reads, slower writes, risk of data inconsistency.

Indexes

Indexes are special lookup tables that the database search engine can use to speed up data retrieval.

How it works: Instead of a Linear Search (scanning every row, O(N)), indexes often use a B-Tree structure (O(log N)). Indexes point to the physical location (page/byte) of the data on the disk.

Types of Indexes

1. Clustered Index

Stores the actual data rows in the leaf nodes of the index tree.
Sorts the physical storage of data.
Limit: Only ONE per table (usually the Primary Key).
Analogy: A dictionary (the book IS the index).

2. Non-Clustered Index

Stores pointers to the data rows.
The physical order of rows is not changed.
Limit: Multiple per table.
Analogy: The index at the back of a book.

3. Composite Index

An index on multiple columns (e.g., Lastname, Firstname).
Left-to-Right Rule: Crucial! An index on (A, B) helps queries on A and A+B, but NOT on B alone.
Analogy: A phone book sorted by Lastname then Firstname.

4. Prefix Index

Indexing only the first N characters of a string to save space.
Useful for long VARCHAR or TEXT fields.
Trade-off: Lower selectivity if prefixes are similar.

5. Invisible Index (MySQL 8.0+)

An index that is maintained but ignored by the optimizer. Used for testing performance impacts without deleting the index.

Key Constraints

Primary Key: Unique identifier. Cannot be NULL. Acts as Clustered Index in MySQL.
Unique Key: Ensures uniqueness. Can allow NULLs. Useful for business logic (e.g., email).

Optimization Strategies

1. Hardware

Storage: SSDs drastically improve random I/O performance over HDDs.
Configuration: Tune memory allocation (Buffer Pool), max connections, and cache sizes for the specific workload.

2. Schema & Indexes

Don't over-index: Indexes slow down INSERT/UPDATE operations.
Remove unused indexes: They waste space and CPU.
Maintenance: Rebuild fragmented indexes periodically.

3. Application Interaction

Connection Management: Close sessions properly. Use Connection Pools.
Partitioning: Split large tables into logical parts (e.g., by month) to speed up queries and management.
Replication:
- Master-Slave: Write to Master, Read from Slaves.
- Offload heavy analytics to a read-only replica.
Batching: Send data in batches (e.g., bulk inserts) instead of one by one.
Asynchrony: Don't block the user interface for DB operations.

4. Query Optimization

EXPLAIN / EXPLAIN ANALYZE: Use these tools to see the execution plan (index usage, rows scanned).
Selectivity: Select only necessary columns (avoid SELECT *).
Filtering: Filter data early (before JOINs).
Subqueries vs Joins: Joins are generally faster, but modern optimizers handle subqueries well. Avoid correlated subqueries.
Pagination: OFFSET is slow on large datasets; use keyset pagination (cursor-based) where possible.

Transactions & ACID

Transaction: A sequence of operations treated as a single logical unit. Either all execute, or none.

ACID Properties:

Atomicity: All or nothing. If one part fails, the whole transaction rolls back.
Consistency: Data remains in a valid state (constraints satisfied) before and after.
Isolation: Concurrent transactions don't interfere with each other (Levels: Read Uncommitted, Read Committed, Repeatable Read, Serializable).
Durability: Once committed, changes are permanent (survive power loss).

Scaling: Partitioning vs Sharding

Partitioning (Logical)

Splitting a table into smaller parts on the same server.
Benefit: Faster queries on ranges, easier data management (e.g., drop old partition).

Sharding (Physical)

Distributing data across multiple physical servers.
Benefit: Horizontal scaling (CPU, RAM, Disk).
Complexity: Routing logic, cross-shard joins are difficult.

MySQL vs PostgreSQL

MySQL

Thread-based: One thread per connection. Lower memory overhead per connection.
Better for simple read-heavy workloads initially.
Supported by Oracle.

PostgreSQL

Process-based: One process per connection. Higher isolation and stability, but higher memory cost.
Bouncer: Almost always needs a connection pooler (PgBouncer) for high loads.
More advanced features (JSONB, GIS, partial indexes).
Open Source community.

SQL Cheat Sheet

Order of Operations

FROM & JOIN
WHERE (Restrict rows)
GROUP BY (Aggregation)
HAVING (Filter groups)
SELECT (Columns)
ORDER BY
LIMIT

Joins

INNER JOIN: Matches in both tables.
LEFT JOIN: All from Left, matching from Right (or NULL).
RIGHT JOIN: All from Right, matching from Left.
FULL JOIN: All from both.

Examples

Basic Select

sql
SELECT DISTINCT t.surname, t.age, l.name 
FROM teacher t
INNER JOIN lesson l ON t.id = l.teacher_id 
WHERE t.surname NOT LIKE '%trov%' AND t.age BETWEEN 31 AND 80 
ORDER BY l.name DESC 
LIMIT 3;

Subquery with ANY/ALL

sql
-- Find products cheaper than the cheapest Apple product
SELECT * FROM Products
WHERE Price < (SELECT MIN(Price) FROM Products WHERE Manufacturer='Apple');

Window Functions

sql
-- Find the time of the NEXT event for each user
SELECT 
    user_id,
    created_at,
    LEAD(created_at) OVER(PARTITION BY user_id ORDER BY created_at ASC) as next_event_time
FROM events;

Databases: Theoretical Squeeze