I Replaced Redis Locks with Database Atomicity and You Should Too
Picture this: you're a dev at a payment services company. The thing with payment transactions is that each one (e.g. sending money from John's account to Jane's) is supposed to be processed exactly once. It's 3 AM, your PagerDuty is blowing up with alerts about cron jobs processing transactions multiple times, and you're staring at the Redis locks that should prevent this exact problem. Sound familiar?
We didn't end up sending money to someone 20 times, but our problem was similar. This is the story of how we went from a complex Redis-based locking nightmare to a beautifully simple database-centric solution, and why you should probably ditch those distributed locks too.
The Great Redis Lock Disaster of 2025
Let me set the scene. We had this automated testing platform where developers push their OpenAPI specs and our job workers generate tests in the background. Think Postman, but with the test generation automated. Simple enough, right? Wrong.
The system worked like this:
- Developer pushes an OpenAPI spec with 50 endpoints to our platform
- We create 50 "pending" test generation tasks in the database
- Multiple job instances fight over who gets to process what
- Chaos ensues, developers get duplicate test suites
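Before going further, it helps to picture the table everything below leans on. Here's a rough sketch of its shape (an assumption on my part: the real schema has more columns, but these are the ones the queries in this post rely on):

CREATE TABLE tasks (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',   -- 'pending' | 'processing' | 'done'
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);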
Our "brilliant" solution was Redis user-level locking:
def main():
    redis = redis_client()

    # Get users with pending tasks
    users_with_pending_tasks = get_users_with_pending_tasks()

    # Try to acquire a lock for a user
    for user in users_with_pending_tasks:
        user_id = user["user_id"]
        lock_acquired = redis.setnx(f"process_lock_{user_id}", user_id)

        if lock_acquired:
            print(f"Acquired lock for user {user_id}")
            process_user_tasks(user_id)
            redis.delete(f"process_lock_{user_id}")  # Release lock
            break
This looked solid on paper. One job per user, clean separation, what could go wrong?
Turns out, a lot. Here's what we discovered after days of debugging.
The Ghost Lock Problem
Jobs would crash (because who writes perfect code?), leaving behind zombie locks in Redis. Developer #12345's tasks would be forever locked, waiting for a job that no longer exists. We'd have to manually clean these up, which is about as fun as debugging CSS alignment issues.
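To make the failure concrete, here's a stripped-down sketch (reusing process_user_tasks from the snippet above). The first version is effectively what we had; the second is the usual band-aid of an expiring lock, which trades ghost locks for timeout tuning rather than fixing the design:

import redis

r = redis.Redis()  # illustration only

def process_user_with_lock(user_id):
    lock_key = f"process_lock_{user_id}"
    # The bug we shipped: if process_user_tasks raises, the delete below
    # never runs, and nothing ever expires the lock.
    if r.setnx(lock_key, user_id):
        process_user_tasks(user_id)
        r.delete(lock_key)

def process_user_with_expiring_lock(user_id):
    lock_key = f"process_lock_{user_id}"
    # The usual band-aid: give the lock a TTL so a crash can't strand it forever.
    # Now you're tuning timeouts instead of removing the lock entirely.
    if r.set(lock_key, user_id, nx=True, ex=300):
        try:
            process_user_tasks(user_id)
        finally:
            r.delete(lock_key)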
The Race Condition Ballet
Even worse, we had this beautiful race condition where two jobs would:
- Both check if a developer has pending tasks ✓
- Both try to acquire the same user's lock
- One succeeds, one fails, but...
- The winner sometimes processed tasks that were already being handled
It was like watching two developers simultaneously fix the same bug in different branches.
The "It Should Work" Syndrome
The most frustrating part? The logic was sound. User-level locking should prevent duplicate processing. But we were still getting duplicate test suites generated, and developers were opening GitHub issues faster than we could close them.
After staring at this code for the hundredth time, my teammate dropped this gem: "Why are we even using Redis for this? Isn't our database already designed to handle concurrency?"
The Database Epiphany
They were right. Here we were, adding this complex external dependency when PostgreSQL has been solving concurrency problems since before Redis was even a twinkle in antirez's eye.
The solution was embarrassingly simple:
def claim_task(task_id):
    result = execute_query(
        "UPDATE tasks SET status = 'processing' WHERE id = %s AND status = 'pending'",
        (task_id,)
    )
    return result.rowcount > 0  # True if we successfully claimed it
That's it. A dozen lines of Redis complexity replaced by a single conditional UPDATE.
Why This Actually Works
When two jobs try to claim the same task simultaneously:
- Job A executes the UPDATE first (we're talking microseconds here)
- Job B executes the UPDATE immediately after
- Job A finds status = 'pending', updates it to 'processing', and returns rowcount = 1
- Job B finds status = 'processing' (not 'pending'), updates nothing, and returns rowcount = 0
The database engine handles all the locking, isolation, and consistency for us. It's literally what ACID properties were designed for.
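If you're wondering what execute_query hides, here's a minimal sketch with psycopg2 (assumptions: a single autocommit connection and a hypothetical DSN; a production version would use a connection pool, error handling, and return SELECT rows as dicts for the queries further down):

import psycopg2

conn = psycopg2.connect("dbname=testgen")  # hypothetical connection string
conn.autocommit = True  # each UPDATE commits on its own, so a claim is one atomic statement

def execute_query(sql, params=None):
    cur = conn.cursor()
    cur.execute(sql, params)
    return cur  # cur.rowcount tells us whether the UPDATE actually matched a row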
The New Approach: Beautifully Boring
Here's what our main processing loop became:
def main():
    # Get all pending tasks - dead simple
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # Try to atomically claim this task
        if claim_task(task["id"]):
            print(f"Got task {task['id']}, let's go!")
            process_task(task)
        else:
            print(f"Task {task['id']} stolen by another worker, moving on...")

def claim_task(task_id):
    result = execute_query(
        "UPDATE tasks SET status = 'processing' WHERE id = %s AND status = 'pending'",
        (task_id,)
    )
    return result.rowcount > 0
No Redis calls. No lock cleanup. No timeouts. No complex error handling. Just pure, boring database operations.
Enter the Resource Hog Problem
But wait, there's more! (There's always more, isn't there?)
Our celebration was short-lived. Within a week, we discovered a new problem: one startup (let's call them MegaCorp) uploaded their monolithic API spec with 1,000 endpoints (yes, we've all been there). Guess what happened? All our job instances started fighting over that one user's tasks, completely ignoring everyone else.
Meanwhile, Sarah uploaded her simple microservice spec with 3 endpoints and watched it sit in the queue for hours while MegaCorp's monolith hogged all the workers. Classic tragedy of the commons.
This is where the simplicity of our solution became both a blessing and a curse. It was fair at the task level but blind at the user level: every pending task was treated equally, no matter how many tasks that user already had in flight.
The Fairness Fix: Not All Users Are Created Equal
We went with the simplest solution first: limit each user to 2 concurrent test generation tasks max.
def main():
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # Check if this user is already hogging resources
        user_task_count = get_user_task_count(task["user_id"], "processing")

        if user_task_count >= 2:  # Max 2 concurrent tasks per user
            print(f"User {task['user_id']} hit their limit, skipping...")
            continue

        # Try to claim the task
        if claim_task(task["id"]):
            print(f"Processing task for user {task['user_id']}")
            process_task(task)

def get_user_task_count(user_id, status):
    result = execute_query(
        "SELECT COUNT(*) as count FROM tasks WHERE user_id = %s AND status = %s",
        (user_id, status)
    )
    return result["count"]
Simple, effective, and Sarah is happy again.
There Are A Few Other Ways To Solve This
The Single-Query Approach: For the SQL Wizards
If you want to be fancy (and reduce database calls), you can do the fairness check and task claiming in one atomic operation:
def claim_task_with_fairness(task_id, max_concurrent=2):
    result = execute_query("""
        UPDATE tasks t1
        SET status = 'processing'
        WHERE t1.id = %s AND t1.status = 'pending'
        AND (
            SELECT COUNT(*)
            FROM tasks t2
            WHERE t2.user_id = t1.user_id
            AND t2.status = 'processing'
        ) < %s
    """, (task_id, max_concurrent))
    return result.rowcount > 0
This is beautiful from a database perspective - one query does it all. But it's harder to debug when things go wrong, and trust me, things will go wrong.
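If you do go this route, the worker loop shrinks to a single call per task, since the fairness check rides along with the claim (a sketch using the same helpers as before):

def main():
    for task in get_tasks_by_status("pending"):
        # One round trip: the claim fails if the task was already taken
        # or the user is at their concurrency limit.
        if claim_task_with_fairness(task["id"], max_concurrent=2):
            process_task(task)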
The Round-Robin Approach: Maximum Fairness
For maximum fairness, you can prioritize users who have fewer tasks running:
def get_fair_pending_tasks():
    """Get tasks ordered by user fairness - users with fewer running tasks go first"""
    return execute_query("""
        SELECT t1.* FROM tasks t1
        LEFT JOIN (
            SELECT user_id, COUNT(*) as running_count
            FROM tasks
            WHERE status = 'processing'
            GROUP BY user_id
        ) t2 ON t1.user_id = t2.user_id
        WHERE t1.status = 'pending'
        ORDER BY COALESCE(t2.running_count, 0) ASC, t1.created_at ASC
    """)
This query is doing some heavy lifting:
- Get all pending tasks
- Count how many tasks each user has running
- Order by fewest running tasks first, then by creation time
It's more complex but gives you true round-robin fairness. MegaCorp's monolith still gets processed, but not at Sarah's microservice's expense.
The Time-Based Approach: "When Did I Last Process This User?"
You could also add a last_processed_at timestamp to users and prioritize those who haven't been processed recently. But honestly, that's probably overkill unless you're running something like GitHub Actions at scale.
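If you ever did need it, a hedged sketch (assuming a users table with a last_processed_at column, which is not part of the schema sketched earlier) might look like:

def get_stalest_users_tasks():
    # Hypothetical variant: assumes users.last_processed_at exists and is
    # updated whenever a batch finishes for that user.
    return execute_query("""
        SELECT t.*
        FROM tasks t
        JOIN users u ON u.id = t.user_id
        WHERE t.status = 'pending'
        ORDER BY u.last_processed_at ASC NULLS FIRST, t.created_at ASC
    """)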
The Lessons I Wish I'd Learned Sooner
1. Your Database Is Smarter Than You Think
I spent weeks building a distributed locking system when PostgreSQL was sitting there like "I've literally been doing this since 1996, but okay..."
Databases are designed for concurrency. ACID properties exist for exactly these scenarios. Use them.
2. Complexity Is a Bug, Not a Feature
Every line of Redis locking code was a potential failure point:
- Network timeouts
- Lock cleanup failures
- Race conditions between services
- Memory management in Redis
The database solution eliminated all of this. Sometimes the best code is the code you don't write.
3. Fairness Isn't Optional in Multi-Tenant Systems
We learned this the hard way when MegaCorp's monolith starved all the microservice users. If you're building anything where multiple users compete for resources, think about fairness from day one, not when your users start filing angry GitHub issues.
4. Start Simple, Then Optimize
We went with the two-query approach (check user count, then claim task) rather than the fancy single-query version. Why? Because when something breaks at 3 AM, you want to be able to debug it quickly.
Performance Reality Check
Let's be honest about the trade-offs:
Database Approach:
- ✅ No Redis memory usage
- ✅ Leverages existing database infrastructure
- ✅ ACID guarantees
- ❌ Extra SELECT query for fairness checks
- ❌ Slightly higher database load
Redis Approach:
- ✅ Fast in-memory operations
- ✅ Dedicated locking primitives
- ❌ Additional infrastructure to maintain
- ❌ Network calls can fail
- ❌ Lock cleanup complexity
- ❌ Memory management
For our use case, the database approach was clearly better. Your mileage may vary.
When To Use Each Approach
Go with database atomicity when:
- Your tasks live in the database anyway
- You want strong consistency guarantees
- You're trying to reduce infrastructure complexity
- You trust your database more than your distributed systems skills (smart choice)
Stick with Redis/distributed locks when:
- You need locks across multiple databases
- Tasks involve complex multi-step operations
- You already have Redis infrastructure you're comfortable with
- You're building something like a workflow engine with complex state
The Bottom Line
We replaced 50 lines of complex Redis locking logic with 5 lines of SQL and immediately solved our duplicate processing problem. Sometimes the best engineering solution is the boring one.
Your database has been solving concurrency problems longer than most of us have been writing code. Maybe it's time to trust it.