Database Sharding — Splitting Your Database Like a Pizza
When your database has 100 million rows and queries are slow, sharding may be the answer. Let's break it down with an analogy that actually makes sense.
What is Database Sharding?
Imagine you have a massive phone book with 10 crore (100 million) entries. Finding someone's number takes forever because you're searching through one giant book.
Now imagine splitting that phone book into 26 smaller books — one for each letter of the alphabet. Looking for "Sahil"? Go straight to the "S" book. Much faster.
That's sharding — splitting a large database into smaller, faster, more manageable pieces called shards.
Why Do We Need Sharding?
As your app grows, a single database becomes a bottleneck in three ways:
• Storage: one machine can only hold so much data
• Query speed: scans and index lookups slow down as tables grow
• Write throughput: every write competes for the same server's resources
Sharding solves all three by distributing data across multiple database servers.
Real-World Example: PhonePe Transactions
PhonePe processes crores of transactions daily. They can't store all transactions in one database — it would be impossibly slow.
Instead, they might shard by user ID.
When you check your transaction history, the system knows exactly which shard to query based on your user ID.
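As a minimal sketch of that routing (the shard count and naming scheme here are made up for illustration):

```python
# Hypothetical routing: derive the shard from the user ID.
NUM_SHARDS = 4  # assumed shard count

def shard_for_user(user_id: int) -> str:
    """Map a user ID to one of the transaction shards."""
    return f"transactions_shard_{user_id % NUM_SHARDS}"

# Every lookup for the same user lands on the same shard.
print(shard_for_user(12345))  # → transactions_shard_1
```

Because the mapping is deterministic, the system never has to search every shard for one user's history.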
Sharding Strategies
1. Range-Based Sharding
Split data by ranges of a key.
Example: Users with IDs 1–1,000,000 go to Shard 1, 1,000,001–2,000,000 to Shard 2, and so on.
Pros: Simple to implement
Cons: Uneven distribution (what if most active users are in Shard 1?)
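A range lookup can be sketched with a sorted list of boundaries (the boundaries here are assumed, matching the example above):

```python
import bisect

# Hypothetical upper bounds: shard 0 holds IDs 1-1,000,000,
# shard 1 holds 1,000,001-2,000,000, and so on.
BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    # bisect_left finds the first boundary >= user_id,
    # i.e. the shard whose range contains this ID.
    return bisect.bisect_left(BOUNDARIES, user_id)

print(range_shard(500_000))    # → 0
print(range_shard(1_000_000))  # → 0 (boundary belongs to the lower shard)
print(range_shard(1_500_000))  # → 1
```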
2. Hash-Based Sharding
Apply a hash function to the key and use the result to determine the shard.
shard_number = hash(user_id) % number_of_shards
Example: hash(12345) % 4 = 1 → goes to Shard 1
Pros: Even distribution
Cons: Hard to add new shards (re-hashing needed)
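Here's a small sketch of hash-based routing (shard count assumed). One caveat: Python's built-in hash() is randomized per process, so a stable digest like MD5 is used instead:

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count

def hash_shard(user_id: int) -> int:
    """Stable hash -> shard number. MD5 is used because Python's
    built-in hash() gives different results across processes."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Same ID always maps to the same shard, and IDs spread evenly overall.
counts = [0] * NUM_SHARDS
for uid in range(10_000):
    counts[hash_shard(uid)] += 1
print(counts)  # roughly 2,500 per shard
```

The even spread is the whole point; the downside shows up when NUM_SHARDS changes, because almost every ID then hashes to a different shard.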
3. Geographic Sharding
Split data by location.
Example: Indian users → Mumbai database, US users → Virginia database.
Pros: Low latency for users, data residency compliance
Cons: Cross-region queries are complex
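Geographic routing is often just a lookup table (region names and database labels below are made up for illustration):

```python
# Hypothetical region -> shard map.
REGION_SHARDS = {
    "IN": "db-mumbai",
    "US": "db-virginia",
}
DEFAULT_SHARD = "db-virginia"  # fallback for unmapped regions

def geo_shard(country_code: str) -> str:
    return REGION_SHARDS.get(country_code, DEFAULT_SHARD)

print(geo_shard("IN"))  # → db-mumbai
```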
The Challenges
1. Cross-Shard Queries
If you need data from multiple shards (e.g., "show all transactions above ₹10,000"), you have to query ALL shards and merge results. This is slow and complex.
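This scatter-gather pattern can be sketched with in-memory lists standing in for shards (the transaction data is made up):

```python
# Each "shard" is just a list here, standing in for a database server.
shards = [
    [{"txn_id": 1, "amount": 15_000}, {"txn_id": 2, "amount": 500}],
    [{"txn_id": 3, "amount": 80_000}],
    [{"txn_id": 4, "amount": 9_999}, {"txn_id": 5, "amount": 12_000}],
]

def find_large_transactions(min_amount: int) -> list[dict]:
    """Query every shard, then merge and sort the partial results."""
    results = []
    for shard in shards:  # scatter: there's no way to skip any shard
        results.extend(t for t in shard if t["amount"] > min_amount)
    # gather: merge and sort the partial results into one answer
    return sorted(results, key=lambda t: t["amount"], reverse=True)

print(find_large_transactions(10_000))  # txns 3, 1, 5
```

Notice that every shard had to be scanned to answer one query — exactly the cost the text describes.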
2. Rebalancing
When one shard gets too big, you need to split it. Moving data between shards without downtime is hard.
3. Joins Across Shards
SQL JOINs across different database servers? Very painful. This is why sharded systems often denormalize data.
4. Consistent Hashing
When you add/remove shards, hash-based sharding requires rehashing everything. Consistent hashing minimizes this — only a fraction of data needs to move.
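A toy consistent-hash ring (no virtual nodes, for brevity — real implementations add them to smooth out the distribution) shows the difference: with plain modulo, going from 3 to 4 shards moves roughly 75% of keys; on a ring, only the keys that land on the new node move.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: each shard owns one point on the ring."""
    def __init__(self, shards):
        self._ring = sorted((_hash(s), s) for s in shards)

    def shard_for(self, key: str) -> str:
        points = [p for p, _ in self._ring]
        # First ring position clockwise from the key's hash (wrap around).
        idx = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
before = {k: ring.shard_for(k) for k in map(str, range(1000))}

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
after = {k: ring.shard_for(k) for k in map(str, range(1000))}

moved = sum(before[k] != after[k] for k in before)
print(f"{moved} of 1000 keys moved")  # only the new node's arc is affected
```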
Sharding vs. Replication
Don't confuse them:
• Sharding: Different data on different servers (horizontal partitioning)
• Replication: Same data copied to multiple servers (for redundancy)
Most production systems use both — shard for scale, replicate each shard for reliability.
When to Shard
Don't shard prematurely. Try these first:
• Optimize queries and add indexes
• Vertical scaling (a bigger machine)
• Read replicas
• Caching (e.g., Redis)
If you've done all that and you're still hitting limits → time to shard.
Key Takeaway
Sharding is about divide and conquer. Split your data smartly, and each piece becomes manageable. But it adds complexity — so only do it when you truly need it.
In interviews, always mention the trade-offs. Interviewers love candidates who understand that sharding isn't free — it comes with cross-shard query complexity, rebalancing challenges, and operational overhead.
Software Engineer at Spense. I write about system design, web development, and fintech — explained simply for students and developers.