Database Sharding — Splitting Your Database Like a Pizza
When your database has 100 million rows and queries are slow, sharding may be the answer. Let's break it down with an analogy that actually makes sense.
What is Database Sharding?
Imagine you have a massive phone book with 10 crore (100 million) entries. Finding someone's number takes forever because you're searching through one giant book.
Now imagine splitting that phone book into 26 smaller books — one for each letter of the alphabet. Looking for "Sahil"? Go straight to the "S" book. Much faster.
That's sharding — splitting a large database into smaller, faster, more manageable pieces called shards.
Why Do We Need Sharding?
As your app grows, a single database becomes a bottleneck in three ways:
• Storage: one machine can only hold so much data
• Query speed: scans and index lookups slow down as tables grow
• Write throughput: every write competes for the same server's resources
Sharding solves all three by distributing data across multiple database servers.
Real-World Example: PhonePe Transactions
PhonePe processes crores of transactions daily. They can't store all transactions in one database — it would be impossibly slow.
Instead, they might shard by user ID.
When you check your transaction history, the system knows exactly which shard to query based on your user ID.
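As a minimal sketch of that routing (the shard count and naming scheme here are made up for illustration):

```python
# Hypothetical routing: derive the shard from the user ID.
NUM_SHARDS = 4  # assumed shard count

def shard_for_user(user_id: int) -> str:
    """Map a user ID to one of the transaction shards."""
    return f"transactions_shard_{user_id % NUM_SHARDS}"

# Every lookup for the same user lands on the same shard.
print(shard_for_user(12345))  # → transactions_shard_1
```

Because the mapping is deterministic, the system never has to search every shard for one user's history.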
Sharding Strategies
1. Range-Based Sharding
Split data by ranges of a key.
Example: Users with IDs 1–1,000,000 go to Shard 1, 1,000,001–2,000,000 to Shard 2, and so on.
Pros: Simple to implement
Cons: Uneven distribution (what if most active users are in Shard 1?)
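A range lookup can be sketched with a sorted list of boundaries (the boundaries here are assumed, matching the example above):

```python
import bisect

# Hypothetical upper bounds: shard 0 holds IDs 1-1,000,000,
# shard 1 holds 1,000,001-2,000,000, and so on.
BOUNDARIES = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    # bisect_left finds the first boundary >= user_id,
    # i.e. the shard whose range contains this ID.
    return bisect.bisect_left(BOUNDARIES, user_id)

print(range_shard(500_000))    # → 0
print(range_shard(1_000_000))  # → 0 (boundary belongs to the lower shard)
print(range_shard(1_500_000))  # → 1
```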
2. Hash-Based Sharding
Apply a hash function to the key and use the result to determine the shard.
shard_number = hash(user_id) % number_of_shards
Example: hash(12345) % 4 = 1 → goes to Shard 1
Pros: Even distribution
Cons: Hard to add new shards (re-hashing needed)
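Here's a small sketch of hash-based routing (shard count assumed). One caveat: Python's built-in hash() is randomized per process, so a stable digest like MD5 is used instead:

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count

def hash_shard(user_id: int) -> int:
    """Stable hash -> shard number. MD5 is used because Python's
    built-in hash() gives different results across processes."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Same ID always maps to the same shard, and IDs spread evenly overall.
counts = [0] * NUM_SHARDS
for uid in range(10_000):
    counts[hash_shard(uid)] += 1
print(counts)  # roughly 2,500 per shard
```

The even spread is the whole point; the downside shows up when NUM_SHARDS changes, because almost every ID then hashes to a different shard.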
3. Geographic Sharding
Split data by location.
Example: Indian users → Mumbai database, US users → Virginia database.
Pros: Low latency for users, data residency compliance
Cons: Cross-region queries are complex
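Geographic routing is often just a lookup table (region names and database labels below are made up for illustration):

```python
# Hypothetical region -> shard map.
REGION_SHARDS = {
    "IN": "db-mumbai",
    "US": "db-virginia",
}
DEFAULT_SHARD = "db-virginia"  # fallback for unmapped regions

def geo_shard(country_code: str) -> str:
    return REGION_SHARDS.get(country_code, DEFAULT_SHARD)

print(geo_shard("IN"))  # → db-mumbai
```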
The Challenges
1. Cross-Shard Queries
If you need data from multiple shards (e.g., "show all transactions above ₹10,000"), you have to query ALL shards and merge results. This is slow and complex.
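This scatter-gather pattern can be sketched with in-memory lists standing in for shards (the transaction data is made up):

```python
# Each "shard" is just a list here, standing in for a database server.
shards = [
    [{"txn_id": 1, "amount": 15_000}, {"txn_id": 2, "amount": 500}],
    [{"txn_id": 3, "amount": 80_000}],
    [{"txn_id": 4, "amount": 9_999}, {"txn_id": 5, "amount": 12_000}],
]

def find_large_transactions(min_amount: int) -> list[dict]:
    """Query every shard, then merge and sort the partial results."""
    results = []
    for shard in shards:  # scatter: there's no way to skip any shard
        results.extend(t for t in shard if t["amount"] > min_amount)
    # gather: merge and sort the partial results into one answer
    return sorted(results, key=lambda t: t["amount"], reverse=True)

print(find_large_transactions(10_000))  # txns 3, 1, 5
```

Notice that every shard had to be scanned to answer one query — exactly the cost the text describes.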
2. Rebalancing
When one shard gets too big, you need to split it. Moving data between shards without downtime is hard.
3. Joins Across Shards
SQL JOINs across different database servers? Very painful. This is why sharded systems often denormalize data.
4. Consistent Hashing
When you add/remove shards, hash-based sharding requires rehashing everything. Consistent hashing minimizes this — only a fraction of data needs to move.
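A toy consistent-hash ring (no virtual nodes, for brevity — real implementations add them to smooth out the distribution) shows the difference: with plain modulo, going from 3 to 4 shards moves roughly 75% of keys; on a ring, only the keys that land on the new node move.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring: each shard owns one point on the ring."""
    def __init__(self, shards):
        self._ring = sorted((_hash(s), s) for s in shards)

    def shard_for(self, key: str) -> str:
        points = [p for p, _ in self._ring]
        # First ring position clockwise from the key's hash (wrap around).
        idx = bisect.bisect(points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c"])
before = {k: ring.shard_for(k) for k in map(str, range(1000))}

ring = ConsistentHashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
after = {k: ring.shard_for(k) for k in map(str, range(1000))}

moved = sum(before[k] != after[k] for k in before)
print(f"{moved} of 1000 keys moved")  # only the new node's arc is affected
```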
Sharding vs. Replication
Don't confuse them:
• Sharding: Different data on different servers (horizontal partitioning)
• Replication: Same data copied to multiple servers (for redundancy)
Most production systems use both — shard for scale, replicate each shard for reliability.
When to Shard
Don't shard prematurely. Try these first:
• Optimize queries and add indexes
• Vertical scaling (a bigger machine)
• Read replicas
• Caching (e.g., Redis)
If you've done all that and you're still hitting limits → time to shard.
Key Takeaway
Sharding is about divide and conquer. Split your data smartly, and each piece becomes manageable. But it adds complexity — so only do it when you truly need it.
In interviews, always mention the trade-offs. Interviewers love candidates who understand that sharding isn't free — it comes with cross-shard query complexity, rebalancing challenges, and operational overhead.
Software Engineer at Spense. I write about system design, web development, and fintech — explained simply for students and developers.