Here is a quick cheat sheet for databases: when and why to use them. There are also sections covering common database questions and vocabulary. It's not exhaustive, and AI helped write it, so use it as a starting point. (I didn't have the opportunity to test all the use cases in production.)
Database Technologies Comparison
| Category | Use Case | Technology | Why Use It | Specifications |
|---|---|---|---|---|
| SQL1 | High-volume transactions | PostgreSQL | Strong consistency guarantees for systems needing reliable transactions | 40,000+ TPS2, ACID3 compliance, <1ms query latency on indexed columns |
| SQL1 | Multi-tenant SaaS4 | MySQL | Isolates customer data while sharing infrastructure | 10,000+ connections, table partitioning by tenant_id, read replicas for scaling |
| SQL1 | Geospatial queries | PostGIS | Fast location-based queries and spatial calculations | Sub-100ms for radius queries, GiST5/SP-GiST indexes, handles 2D/3D geometries |
| Cache | User session management | Redis | Ultra-fast temporary data storage with rich data structures | 100,000+ ops/sec, <1ms latency, in-memory storage, 5GB-512GB RAM6 capacity |
| Cache | Distributed cache | Memcached | Reduces database load by caching frequently accessed data | 1M+ gets/sec, <1ms latency, simple key-value, multi-threaded |
| Cache | Configuration management | etcd | Stores configuration with automatic change notifications | 10,000 writes/sec, strong consistency via Raft7, watch API8 for changes |
| Document | Product catalog | MongoDB | Flexible schema for varying data structures without migrations | Flexible schema, 16MB document size limit, horizontal scaling to 100+ shards |
| Wide-Column | High-write log aggregation | Apache Cassandra | Handles massive write loads with high availability | 1M+ writes/sec per node, linear scalability, 99.99% availability |
| Key-Value (Cloud) | File metadata and object storage | DynamoDB + S3 | Scalable file storage with fast metadata queries | 3,500 PUT/sec per prefix, 5,500 GET/sec, 11 nines durability |
| Time-Series | IoT9 sensor data | InfluxDB | Optimized for time-stamped data with high write throughput | 1M+ writes/sec, 10:1 compression ratio, optimized for time-based queries |
| OLAP10 | Analytics on historical data | ClickHouse | Fast analytical queries on large historical datasets | 1B+ rows/sec scan rate, columnar storage, 10-100x compression |
| Graph | Social network | Neo4j | Fast traversal of complex relationships and connections | 1M+ relationship traversals/sec, native graph storage, 34B+ nodes capacity |
| Search | Full-text search | Elasticsearch | Full-text search with fuzzy matching and relevance scoring | 2-5 second indexing latency, 100+ search queries/sec, relevance scoring |
| Streaming | Event sourcing and audit trails | Apache Kafka + PostgreSQL | Captures and stores all events for audit and replay | 1M+ events/sec throughput, immutable log, partition-based parallelism |
Database Vocabulary: Essential Concepts
Understanding database systems requires familiarity with several fundamental concepts that define how data is structured, accessed, and maintained.
Transaction Properties and Consistency Models
ACID3 Properties
The foundation of reliable database transactions:
- Atomicity: All operations succeed or all fail. No half-completed transactions.
- Consistency: Data always follows defined rules and constraints.
- Isolation: Concurrent transactions don’t interfere with each other.
- Durability: Completed transactions survive crashes and persist permanently.
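Atomicity is the easiest property to see in action. A minimal sketch using Python's built-in `sqlite3` (an illustrative stand-in, not one of the databases above): a simulated failure mid-transfer rolls back the whole transaction.

```python
import sqlite3

# Atomicity sketch: either both halves of a transfer happen, or neither does.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
        raise RuntimeError("crash before crediting bob")  # simulated failure
except RuntimeError:
    pass

# The debit was rolled back: no half-completed transfer.
balance = conn.execute("SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100
```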
Transaction Isolation Levels
Controls how concurrent transactions interact:
- Read Uncommitted: Fastest but allows reading uncommitted data
- Read Committed: Prevents reading uncommitted data
- Repeatable Read: Same query returns same results throughout a transaction
- Serializable: Strictest isolation, transactions run as if sequential
Eventual Consistency
An alternative to strict consistency where data copies can temporarily show different values but will eventually sync up. When you update data, the change spreads to all copies over time rather than instantly. During this window, different users might see different values.
Systems like Cassandra and DynamoDB choose this approach to stay fast and available even when servers fail or networks are slow. They prioritize speed and uptime over instant consistency.
Data Organization and Schema Design
Normalization
Organizing data to reduce duplication in SQL/relational databases. Normalization prevents data duplication and update errors but requires JOINs. De-normalization duplicates data for faster reads.
Progressive levels (NF = Normal Form):
- 1NF (First Normal Form): Each cell contains a single atomic value (no arrays or lists). Don’t store the same data repeatedly.
- 2NF (Second Normal Form): All non-key columns must depend on the complete key (a key is composed of one or more ID columns in the table), not just part of it. If a value stays the same for multiple rows that share only one part of a composite key, extract it to a separate table and reference it by ID.
- 3NF (Third Normal Form): Data depends directly on the key, not on other non-key columns. If two columns are related to each other (changing one means you should change the other), extract them to a separate table.
NoSQL databases often intentionally break these rules (denormalize) for performance, since they don’t have JOINs or handle them poorly.
Cardinality
Cardinality refers to the uniqueness of data values in a column:
- High cardinality: Many unique values (emails, transaction IDs). Indexes work best here.
- Low cardinality: Few distinct values (booleans, status flags). Less effective for indexing.
Query Performance Optimization
Indexing
Creates shortcuts to find data faster.
Index types:
- B-tree11: Supports range queries and ordering
- Hash: Fast equality lookups
- GiST5: Handles complex data types (geospatial, arrays)
Trade-off: Faster reads but slower writes and more storage (indexes need to be updated on every write).
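This effect is easy to observe with SQLite's `EXPLAIN QUERY PLAN` (a sketch, with hypothetical table and index names; exact plan wording varies by SQLite version):

```python
import sqlite3

# Sketch: an index turns a full-table scan into a B-tree search.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")

plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@example.com",)
).fetchone()[3]
print(plan_before)  # e.g. "SCAN users" — a full table scan

conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@example.com",)
).fetchone()[3]
print(plan_after)  # e.g. "SEARCH users USING INDEX idx_users_email (email=?)"
```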
Inverted Index
An inverted index reverses the normal document-to-words relationship.
Given these documents:
Doc1: "database performance tuning"
Doc2: "database indexing strategies"
Doc3: "performance optimization tips"
The inverted index stores:
"database" → [Doc1, Doc2]
"performance" → [Doc1, Doc3]
"tuning" → [Doc1]
"indexing" → [Doc2]
"strategies" → [Doc2]
"optimization"→ [Doc3]
"tips" → [Doc3]
Instead of “Document 1 contains words X, Y, Z”, it creates a lookup of “Word X appears in Documents 1, 2, 3”. This takes milliseconds instead of reading every document to check if it contains those words.
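The example above can be built in a few lines; a minimal sketch of the index construction:

```python
from collections import defaultdict

# Build the inverted index from the example documents above:
# word -> list of documents containing it.
docs = {
    "Doc1": "database performance tuning",
    "Doc2": "database indexing strategies",
    "Doc3": "performance optimization tips",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].append(doc_id)

print(index["database"])     # ['Doc1', 'Doc2']
print(index["performance"])  # ['Doc1', 'Doc3']
```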
Full-text search engines like Elasticsearch use this to enable quick text searches across large document sets. They also provide features like exact phrase matching, frequency-based ranking, and fuzzy matching.
Trade-off: Fast searches but requires storage space and rebuilding when documents change.
Bloom Filter
A quick test that answers: “Is this element definitely NOT in the set?”
Can say “definitely absent” or “possibly present” (but never misses an item that’s actually there, no false negative). Databases like Cassandra use this to skip checking files that definitely don’t contain requested data.
Trade-off: Saves disk reads but uses memory and might have false positives.
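A minimal Bloom filter sketch (a toy implementation, not how Cassandra's is built): k hash probes set bits on insert, and a lookup answers "possibly present" only if all k bits are set.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash probes into a fixed-size bit array."""
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k positions by salting the item with the probe number.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False => definitely absent; True => possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True — no false negatives
print(bf.might_contain("user:999"))  # almost certainly False
```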
Scaling and Distribution Strategies
Partitioning vs Sharding
Both split data into smaller pieces for better performance: queries scan only the relevant piece instead of the full dataset, and pieces can be processed in parallel. Sharding also distributes load across multiple servers. The key difference is scope:
- Partitioning: Splits tables within a single database. Improves query performance and maintenance.
- Sharding: Splits data across multiple databases/servers. Enables horizontal scaling beyond one machine’s limits.
Pitfall: A poor key choice creates hot spots where one piece gets all the traffic. For example, a UUID spreads writes evenly, while a timestamp concentrates traffic on recent data.
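A hash-based routing function is the usual way to spread keys evenly; a minimal sketch (the shard count and key format are made up for illustration):

```python
import hashlib

# Sketch of hash-based shard routing: a stable hash of the key spreads
# traffic evenly, avoiding the hot spots a timestamp key would create.
NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # md5 is used for distribution, not security; Python's built-in hash()
    # is avoided because it is randomized between runs.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user-3f9a"))  # always the same shard for the same key
```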
Replication
Keeps copies of data on multiple servers for backup and load balancing.
Replication modes:
- Synchronous: Waits for copies before confirming writes. No data loss, but higher latency.
- Asynchronous: Faster, but risks losing recent data if the primary fails.
- Cascading: Replicas copy from other replicas, reducing load on the primary.
Durability and Recovery
Write-Ahead Logging (WAL)
A safety mechanism where the database writes every change to a sequential log file before modifying the actual data files.
How it works:
- You update a record (e.g., “Set user’s balance to $100”)
- Database first writes this change to the WAL file (fast sequential write)
- Database then updates the actual data pages in memory
- Later, modified pages are flushed to disk in the background
If the database crashes after step 2 but before step 4, no problem. On restart, the database replays the WAL12 to reconstruct any changes that didn’t make it to disk. Without WAL, the crash would lose your transaction.
WAL also enables replication (replicas read the log) and point-in-time13 backups (archive old WAL files to restore to any moment).
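The log-then-apply sequence above can be sketched in a few lines (a toy key-value store, not a real WAL implementation):

```python
import json, os, tempfile

# Toy write-ahead log: append each change to a log file before touching
# the in-memory state; replaying the log reconstructs state after a crash.
class TinyWAL:
    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()

    def _replay(self):
        # On restart, rebuild in-memory state from the log.
        if os.path.exists(self.path):
            with open(self.path) as f:
                for line in f:
                    change = json.loads(line)
                    self.data[change["key"]] = change["value"]

    def set(self, key, value):
        # Step 1: durable sequential append to the log.
        with open(self.path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # Step 2: apply to in-memory state.
        self.data[key] = value

log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
db = TinyWAL(log_path)
db.set("balance:alice", 100)

# Simulate a crash: drop the in-memory state, then recover from the log.
recovered = TinyWAL(log_path)
print(recovered.data["balance:alice"])  # 100
```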
Replication Recovery
Depending on the architecture, recovery strategies vary:
Master-replica (most common): One write node, multiple read-only copies. When primary fails:
- Automatic failover: Tools like Patroni detect failures and promote replicas
- Manual failover: Human verifies before promoting (safer for critical systems)
- Split-brain prevention: Ensure only one primary exists (otherwise data conflicts)
Multi-master (advanced): Multiple nodes accept writes. Enables geographic distribution but adds complexity:
- Conflict resolution needed: When two masters update the same data
- Last-write-wins (timestamp decides, simple but loses data)
- Application logic (custom rules: sum counters, keep higher version)
- CRDTs (mathematical approach guaranteeing eventual consistency)
- Most applications avoid multi-master due to this complexity
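Last-write-wins is the simplest of the strategies above; a minimal sketch (timestamps and values are made up for illustration):

```python
# Sketch of last-write-wins conflict resolution between two masters:
# each write carries a timestamp, and the newer one wins.
def resolve_lww(a, b):
    """a and b are (timestamp, value) pairs from conflicting writes."""
    return a if a[0] >= b[0] else b

master1 = (1700000005, "alice@new.com")   # later write
master2 = (1700000001, "alice@old.com")
winner = resolve_lww(master1, master2)
print(winner[1])  # alice@new.com — the older write is silently lost
```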
Trade-off: Resilient architectures combine synchronous replication (high availability), asynchronous replicas (read scaling), CDC14 (event-driven integration), and PITR13 backups (Point-in-time disaster recovery) to survive hardware failures, network partitions, and operator errors.
Database Types and Workloads
Wide-Column Store
NoSQL database where each row can have different columns, and columns are grouped into column families (logical groupings of related columns stored together).
How it differs from SQL:
- Traditional SQL table (fixed columns for all rows):
  users table:

  | id | name | email | city |
  |---|---|---|---|
  | 1 | Alice | alice@email.com | NYC |
  | 2 | Bob | bob@email.com | LA |

- Wide-column store (flexible columns per row):

  ```
  Row key: user:1
    Column family "profile": {name: "Alice", age: 30}
    Column family "contact": {email: "alice@email.com", phone: "555-0100"}

  Row key: user:2
    Column family "profile": {name: "Bob"}
    Column family "contact": {email: "bob@email.com"}
    Column family "preferences": {theme: "dark", lang: "en"}
  ```

`user:2` has a "preferences" family that `user:1` doesn't have, and `user:1` has an "age" that `user:2` doesn't. Each row can have different columns without schema changes.
Column families group related data so queries like “get all contact info” read only the “contact” family without loading profile or preferences data.
Systems like Cassandra excel at high-write workloads (1M+ writes/sec). Ideal for time-series data and sparse attributes.
OLAP10 vs OLTP15
Two different database workload types with opposing optimization goals:
OLAP (Online Analytical Processing): Optimized for complex queries over huge datasets. Uses columnar storage and compression.
- Example: ClickHouse. Good for: “Calculate year-over-year revenue by category.”
OLTP (Online Transaction Processing): Optimized for high-volume individual transactions. Row-based storage.
- Examples: PostgreSQL, MySQL. Good for: Processing individual orders and payments.
ETL (Extract, Transform, Load)
Moving data from sources to data warehouses:
- Extract: Read from databases, APIs8, files
- Transform: Clean, aggregate, reshape data
- Load: Write to destination
A variation, ELT16, loads the raw data first and transforms it later using the warehouse's processing power. Modern data warehouses like Snowflake, BigQuery, or Redshift have massive parallel computing resources that can handle large-scale transformations efficiently.
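The three steps can be sketched end to end (a toy pipeline; the source data and table names are made up, and SQLite stands in for the warehouse):

```python
import sqlite3

# Minimal ETL sketch: extract raw rows, transform (clean + aggregate),
# load the summaries into a destination table.
raw_orders = [  # extract: pretend this came from an API or source database
    {"customer": "alice", "amount": "20.00"},
    {"customer": "bob",   "amount": "5.00"},
    {"customer": "alice", "amount": "30.00"},
]

# transform: parse the string amounts and aggregate revenue per customer
totals = {}
for order in raw_orders:
    totals[order["customer"]] = totals.get(order["customer"], 0.0) + float(order["amount"])

# load: write the summaries to the destination
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE revenue_by_customer (customer TEXT, total REAL)")
warehouse.executemany("INSERT INTO revenue_by_customer VALUES (?, ?)", totals.items())

alice_total = warehouse.execute(
    "SELECT total FROM revenue_by_customer WHERE customer = 'alice'"
).fetchone()[0]
print(alice_total)  # 50.0
```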
Performance Metrics
TPS and QPS
You will often see these two acronyms when discussing database performance:
- TPS (Transactions Per Second): Measures write throughput
- QPS (Queries Per Second): Measures read throughput
Context matters: Simple lookups = higher QPS. Complex queries = lower QPS. Actual performance varies widely based on query complexity and hardware.
Common Database Questions
Why Are Caches Faster Than Databases?
Caches are simpler and skip expensive operations:
Speed advantages:
- All data in memory (no disk access): 100-500 microseconds vs 1-5ms
- Simple hash lookups instead of tree traversals
- No query parsing or optimization
- Skip ACID guarantees (no locking or logging)
Trade-offs:
- No durability: restarts lose data
- Limited queries: only key-based lookups
- Cache invalidation: hard to keep in sync with source data
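The cache-in-front-of-database pattern can be sketched as a TTL cache (a toy in-process version; real deployments would use Redis or Memcached over the network):

```python
import time

# Sketch of a TTL cache in front of a slow lookup: results are kept in
# memory and expire after ttl_seconds (invalidation by timeout).
class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (expires_at, value)

    def get(self, key, compute):
        now = time.monotonic()
        hit = self.store.get(key)
        if hit and hit[0] > now:
            return hit[1]                     # fast path: in-memory hash lookup
        value = compute(key)                  # slow path: hit the database
        self.store[key] = (now + self.ttl, value)
        return value

calls = []
def slow_db_lookup(key):
    calls.append(key)  # track how often the "database" is actually hit
    return f"row-for-{key}"

cache = TTLCache(ttl_seconds=60)
cache.get("user:1", slow_db_lookup)
cache.get("user:1", slow_db_lookup)
print(len(calls))  # 1 — the second read never touched the "database"
```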
Why Can’t Databases Be Optimized for Both Reads and Writes?
Fundamental architectural conflict:
Write-optimized (Cassandra):
- Use LSM (Log-Structured Merge) trees: writes go to memory first, then flush to disk in batches as sorted files. Converts random writes to fast sequential appends.
- Achieve millions of writes/sec
- Tradeoff: data fragments across multiple files, so reads must check several files (read amplification)
Read-optimized (PostgreSQL, MySQL):
- Use B-trees (Balanced trees): data stored in sorted order in a tree structure. Finding data is like binary search - eliminate half the possibilities at each step. Fast lookups even with millions of rows.
- Tradeoff: writes must find the correct sorted position, potentially reorganize the tree (node splits), and update indexes - all random disk access
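The LSM write path and its read amplification can be sketched as a toy in-memory model (lists stand in for on-disk sorted files):

```python
# Toy LSM sketch: writes land in an in-memory memtable and are flushed to
# sorted immutable segments; reads check the memtable, then the segments
# newest-first — the read amplification mentioned above.
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # sorted (key, value) lists, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value  # fast in-memory write
        if len(self.memtable) >= self.limit:
            # Flush as a sorted immutable segment (sequential append on disk).
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            for k, v in segment:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # triggers a flush of {a, b}
db.put("a", 9)   # the newer value lives in the memtable
print(db.get("a"))  # 9
print(db.get("b"))  # 2 — found only after checking a flushed segment
```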
Other conflicts:
- Indexes: Help reads find data instantly (like a book index), but every write must update all indexes. 5 indexes = 5x more work per write.
- Compression: Packed data reads faster from disk, but writing requires decompressing, modifying, and recompressing.
- Durability: Syncing to disk ensures data survives crashes but adds latency. Async writes are faster but risk data loss.
Modern databases use hybrid approaches (buffering, caching, tunable settings) to balance read and write performance, but can’t be optimal for both simultaneously.
When to Use Document Databases vs SQL with JSON?
Choose Document DB (like MongoDB) when:
- Schema varies significantly: Electronics have specs (CPU17, RAM6), clothing has sizes (S, M, L). SQL would need mostly empty columns or complex workarounds. Documents store only relevant fields.
- Hierarchical data with nesting: Store a blog post with all its comments in one document. One read gets everything. SQL needs multiple JOINs across tables.
- Rapid schema changes: Add new fields without altering tables or running migrations. Just start storing the new field in new documents.
Choose SQL (like PostgreSQL, MySQL) with JSON18 when:
- Stable core schema with occasional flexibility: User table with standard fields (id, email, name) plus a JSON preferences field. Core structure stays predictable while allowing custom settings.
- Complex queries and JOINs: SQL excels at queries like “find all orders from users in California who bought product X in the last month”, combining multiple tables with filters. Document DBs struggle with cross-collection queries.
- Strong consistency and transactions: Need to update multiple tables atomically (transfer money: deduct from one account, add to another). SQL guarantees both happen or neither does. Document DBs have limited multi-document transactions.
- Analytics and ad-hoc queries: SQL query optimizers handle complex aggregations, grouping, and filtering efficiently. You can write new queries without restructuring data.
What Is Pre-Computing?
Calculate expensive queries once, store results, serve repeatedly. Can speed up queries 100-1000x.
Common implementations include:
- Materialized views: Store query results, refresh periodically
- Aggregation tables: Daily/weekly/monthly summaries
- Denormalization: Duplicate data to avoid JOINs
- Computed columns: Auto-calculated fields
- Cache warming: Pre-generate results before requests
Trade-offs:
- Uses 3-5x more storage
- Complex to keep in sync with source data
- Stale data until next update
When to use: High read/write ratio. Otherwise, compute on-demand.
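An aggregation table is the simplest of the implementations above; a minimal sketch with SQLite (table names and data are made up for illustration):

```python
import sqlite3

# Sketch of an aggregation table: compute daily totals once, then serve
# reads from the small summary instead of re-scanning the raw events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, amount INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("2024-01-01", 10), ("2024-01-01", 15), ("2024-01-02", 7),
])

# Pre-compute once (in practice, refreshed on a schedule).
conn.execute("""
    CREATE TABLE daily_totals AS
    SELECT day, SUM(amount) AS total FROM events GROUP BY day
""")

total = conn.execute(
    "SELECT total FROM daily_totals WHERE day = '2024-01-01'"
).fetchone()[0]
print(total)  # 25
```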
Change Data Capture (CDC)
Publishes database changes as events for other systems to consume in near-real-time. Bridges databases with event-driven architectures.
Approaches:
- Log-based: Reads database logs via tools like Debezium. Captures everything, minimal performance impact.
- Trigger-based: Database triggers write changes to tables. Simpler but slower, may miss bulk operations.
- Timestamp-based: Polls for `updated_at` changes. Simple, but misses deletions and has polling delay.
Common uses:
- Sync data across microservices
- Update search indexes (Elasticsearch)
- Stream data to analytics warehouses (like ClickHouse)
- Invalidate caches when source data changes
Kafka is usually involved in this type of architecture for streaming data.
Challenges:
- Schema changes need coordination
- Additional infrastructure complexity
- Initial data snapshots need special handling
- Consumers must handle duplicate events (idempotency)
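The trigger-based approach is easy to see end to end; a minimal sketch with SQLite (table and trigger names are made up, and a real setup would capture inserts and deletes too):

```python
import sqlite3

# Sketch of trigger-based CDC: a trigger copies every update into a
# changes table that downstream consumers can poll.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE changes (op TEXT, user_id INTEGER, email TEXT);
    CREATE TRIGGER users_cdc AFTER UPDATE ON users
    BEGIN
        INSERT INTO changes VALUES ('UPDATE', NEW.id, NEW.email);
    END;
""")

conn.execute("INSERT INTO users VALUES (1, 'old@example.com')")
conn.execute("UPDATE users SET email = 'new@example.com' WHERE id = 1")

# A consumer polls the changes table for new events.
rows = conn.execute("SELECT op, email FROM changes").fetchall()
print(rows)  # [('UPDATE', 'new@example.com')]
```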
Footnotes
There are a lot of acronyms; their meanings are listed below.
1. SQL = Structured Query Language
2. TPS = Transactions Per Second
3. ACID = Atomicity, Consistency, Isolation, Durability
4. SaaS = Software as a Service
5. GiST = Generalized Search Tree
6. RAM = Random Access Memory
7. Raft = Consensus algorithm for distributed systems
8. API = Application Programming Interface
9. IoT = Internet of Things
10. OLAP = Online Analytical Processing
11. B-tree = Balanced tree
12. WAL = Write-Ahead Logging
13. PITR = Point-In-Time Recovery
14. CDC = Change Data Capture
15. OLTP = Online Transaction Processing
16. ELT = Extract, Load, Transform
17. CPU = Central Processing Unit
18. JSON = JavaScript Object Notation