I Replaced Redis Locks with Database Atomicity and You Should Too
Picture this: you're a dev at a payment services company. The thing with payment transactions is that each one (e.g. sending money from John's account to Jane's) is supposed to be processed exactly once. It's 3 AM, your PagerDuty is blowing up with alerts about cron jobs processing transactions multiple times, and you're staring at the Redis locks that should prevent this exact problem. Sound familiar?
We didn't end up sending money to someone 20 times, but our problem was similar. This is the story of how we went from a complex Redis-based locking nightmare to a beautifully simple database-centric solution, and why you should probably ditch those distributed locks too.
The Great Redis Lock Disaster of 2025
Let me set the scene. We had this automated testing platform where developers push their OpenAPI specs and our job workers generate tests in the background. Think Postman, but with the test generation automated. Simple enough, right? Wrong.
The system worked like this:
- Developer pushes an OpenAPI spec with 50 endpoints to our platform
- We create 50 "pending" test generation tasks in the database
- Multiple job instances fight over who gets to process what
- Chaos ensues, developers get duplicate test suites
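Before going further, it helps to picture the table everything below leans on. Here's a rough sketch of its shape (an assumption on my part: the real schema has more columns, but these are the ones the queries in this post rely on):

CREATE TABLE tasks (
    id          BIGSERIAL PRIMARY KEY,
    user_id     BIGINT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending',   -- 'pending' | 'processing' | 'done'
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);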
Our "brilliant" solution was Redis user-level locking:
def main():
    redis = redis_client()

    # Get users with pending tasks
    users_with_pending_tasks = get_users_with_pending_tasks()

    # Try to acquire a lock for a user
    for user in users_with_pending_tasks:
        user_id = user["user_id"]
        lock_acquired = redis.setnx(f"process_lock_{user_id}", user_id)

        if lock_acquired:
            print(f"Acquired lock for user {user_id}")
            process_user_tasks(user_id)
            redis.delete(f"process_lock_{user_id}")  # Release lock
            break
This looked solid on paper. One job per user, clean separation, what could go wrong?
Turns out, a lot. Here's what we discovered after days of debugging.
The Ghost Lock Problem
Jobs would crash (because who writes perfect code?), leaving behind zombie locks in Redis. Developer #12345's tasks would be forever locked, waiting for a job that no longer exists. We'd have to manually clean these up, which is about as fun as debugging CSS alignment issues.
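To make the failure concrete, here's a stripped-down sketch (reusing process_user_tasks from the snippet above). The first version is effectively what we had; the second is the usual band-aid of an expiring lock, which trades ghost locks for timeout tuning rather than fixing the design:

import redis

r = redis.Redis()  # illustration only

def process_user_with_lock(user_id):
    lock_key = f"process_lock_{user_id}"
    # The bug we shipped: if process_user_tasks raises, the delete below
    # never runs, and nothing ever expires the lock.
    if r.setnx(lock_key, user_id):
        process_user_tasks(user_id)
        r.delete(lock_key)

def process_user_with_expiring_lock(user_id):
    lock_key = f"process_lock_{user_id}"
    # The usual band-aid: give the lock a TTL so a crash can't strand it forever.
    # Now you're tuning timeouts instead of removing the lock entirely.
    if r.set(lock_key, user_id, nx=True, ex=300):
        try:
            process_user_tasks(user_id)
        finally:
            r.delete(lock_key)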
The Race Condition Ballet
Even worse, we had this beautiful race condition where two jobs would:
- Both check if a developer has pending tasks ✓
- Both try to acquire the same user's lock
- One succeeds, one fails, but...
- The winner sometimes processed tasks that were already being handled
It was like watching two developers simultaneously fix the same bug in different branches.
The "It Should Work" Syndrome
The most frustrating part? The logic was sound. User-level locking should prevent duplicate processing. But we were still getting duplicate test suites generated, and developers were opening GitHub issues faster than we could close them.
After staring at this code for the hundredth time, my teammate dropped this gem: "Why are we even using Redis for this? Isn't our database already designed to handle concurrency?"
The Database Epiphany
They were right. Here we were, adding this complex external dependency when PostgreSQL has been solving concurrency problems since before Redis was even a twinkle in antirez's eye.
The solution was embarrassingly simple:
def claim_task(task_id):
    result = execute_query(
        "UPDATE tasks SET status = 'processing' WHERE id = %s AND status = 'pending'",
        (task_id,)
    )
    return result.rowcount > 0  # True if we successfully claimed it
That's it. A dozen lines of Redis complexity replaced by a single conditional UPDATE.
Why This Actually Works
When two jobs try to claim the same task simultaneously:
- Job A executes the UPDATE first (we're talking microseconds here)
- Job B executes the UPDATE immediately after
- Job A finds status = 'pending', updates it to 'processing', and returns rowcount = 1
- Job B finds status = 'processing' (not 'pending'), updates nothing, and returns rowcount = 0
The database engine handles all the locking, isolation, and consistency for us. It's literally what ACID properties were designed for.
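If you're wondering what execute_query hides, here's a minimal sketch with psycopg2 (assumptions: a single autocommit connection and a hypothetical DSN; a production version would use a connection pool, error handling, and return SELECT rows as dicts for the queries further down):

import psycopg2

conn = psycopg2.connect("dbname=testgen")  # hypothetical connection string
conn.autocommit = True  # each UPDATE commits on its own, so a claim is one atomic statement

def execute_query(sql, params=None):
    cur = conn.cursor()
    cur.execute(sql, params)
    return cur  # cur.rowcount tells us whether the UPDATE actually matched a row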
The New Approach: Beautifully Boring
Here's what our main processing loop became:
def main():
    # Get all pending tasks - dead simple
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # Try to atomically claim this task
        if claim_task(task["id"]):
            print(f"Got task {task['id']}, let's go!")
            process_task(task)
        else:
            print(f"Task {task['id']} stolen by another worker, moving on...")

def claim_task(task_id):
    result = execute_query(
        "UPDATE tasks SET status = 'processing' WHERE id = %s AND status = 'pending'",
        (task_id,)
    )
    return result.rowcount > 0
No Redis calls. No lock cleanup. No timeouts. No complex error handling. Just pure, boring database operations.
Enter the Resource Hog Problem
But wait, there's more! (There's always more, isn't there?)
Our celebration was short-lived. Within a week, we discovered a new problem: one startup (let's call them MegaCorp) uploaded their monolithic API spec with 1,000 endpoints (yes, we've all been there). Guess what happened? All our job instances started fighting over that one user's tasks, completely ignoring everyone else.
Meanwhile, Sarah uploaded her simple microservice spec with 3 endpoints and watched it sit in the queue for hours while MegaCorp's monolith hogged all the workers. Classic tragedy of the commons.
This is where the simplicity of our solution became both a blessing and a curse. It was fair at the task level but blind at the user level: every pending task was treated equally, no matter how many tasks that user already had in flight.
The Fairness Fix: Not All Users Are Created Equal
We went with the simplest solution first: limit each user to 2 concurrent test generation tasks max.
def main():
    pending_tasks = get_tasks_by_status("pending")

    for task in pending_tasks:
        # Check if this user is already hogging resources
        user_task_count = get_user_task_count(task["user_id"], "processing")

        if user_task_count >= 2:  # Max 2 concurrent tasks per user
            print(f"User {task['user_id']} hit their limit, skipping...")
            continue

        # Try to claim the task
        if claim_task(task["id"]):
            print(f"Processing task for user {task['user_id']}")
            process_task(task)

def get_user_task_count(user_id, status):
    result = execute_query(
        "SELECT COUNT(*) as count FROM tasks WHERE user_id = %s AND status = %s",
        (user_id, status)
    )
    return result["count"]
Simple, effective, and Sarah is happy again.
There Are A Few Other Ways To Solve This
The Single-Query Approach: For the SQL Wizards
If you want to be fancy (and reduce database calls), you can do the fairness check and task claiming in one atomic operation:
def claim_task_with_fairness(task_id, max_concurrent=2):
    result = execute_query("""
        UPDATE tasks t1
        SET status = 'processing'
        WHERE t1.id = %s AND t1.status = 'pending'
        AND (
            SELECT COUNT(*)
            FROM tasks t2
            WHERE t2.user_id = t1.user_id
            AND t2.status = 'processing'
        ) < %s
    """, (task_id, max_concurrent))
    return result.rowcount > 0
This is beautiful from a database perspective - one query does it all. But it's harder to debug when things go wrong, and trust me, things will go wrong.
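If you do go this route, the worker loop shrinks to a single call per task, since the fairness check rides along with the claim (a sketch using the same helpers as before):

def main():
    for task in get_tasks_by_status("pending"):
        # One round trip: the claim fails if the task was already taken
        # or the user is at their concurrency limit.
        if claim_task_with_fairness(task["id"], max_concurrent=2):
            process_task(task)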
The Round-Robin Approach: Maximum Fairness
For maximum fairness, you can prioritize users who have fewer tasks running:
def get_fair_pending_tasks():
    """Get tasks ordered by user fairness - users with fewer running tasks go first"""
    return execute_query("""
        SELECT t1.* FROM tasks t1
        LEFT JOIN (
            SELECT user_id, COUNT(*) as running_count
            FROM tasks
            WHERE status = 'processing'
            GROUP BY user_id
        ) t2 ON t1.user_id = t2.user_id
        WHERE t1.status = 'pending'
        ORDER BY COALESCE(t2.running_count, 0) ASC, t1.created_at ASC
    """)
This query is doing some heavy lifting:
- Get all pending tasks
- Count how many tasks each user has running
- Order by fewest running tasks first, then by creation time
It's more complex but gives you true round-robin fairness. MegaCorp's monolith still gets processed, but not at Sarah's microservice's expense.
The Time-Based Approach: "When Did I Last Process This User?"
You could also add a last_processed_at timestamp to users and prioritize those who haven't been processed recently. But honestly, that's probably overkill unless you're running something like GitHub Actions at scale.
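If you ever did need it, a hedged sketch (assuming a users table with a last_processed_at column, which is not part of the schema sketched earlier) might look like:

def get_stalest_users_tasks():
    # Hypothetical variant: assumes users.last_processed_at exists and is
    # updated whenever a batch finishes for that user.
    return execute_query("""
        SELECT t.*
        FROM tasks t
        JOIN users u ON u.id = t.user_id
        WHERE t.status = 'pending'
        ORDER BY u.last_processed_at ASC NULLS FIRST, t.created_at ASC
    """)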
The Lessons I Wish I'd Learned Sooner
1. Your Database Is Smarter Than You Think
I spent weeks building a distributed locking system when PostgreSQL was sitting there like "I've literally been doing this since 1996, but okay..."
Databases are designed for concurrency. ACID properties exist for exactly these scenarios. Use them.
2. Complexity Is a Bug, Not a Feature
Every line of Redis locking code was a potential failure point:
- Network timeouts
- Lock cleanup failures
- Race conditions between services
- Memory management in Redis
The database solution eliminated all of this. Sometimes the best code is the code you don't write.
3. Fairness Isn't Optional in Multi-Tenant Systems
We learned this the hard way when MegaCorp's monolith starved all the microservice users. If you're building anything where multiple users compete for resources, think about fairness from day one, not when your users start filing angry GitHub issues.
4. Start Simple, Then Optimize
We went with the two-query approach (check user count, then claim task) rather than the fancy single-query version. Why? Because when something breaks at 3 AM, you want to be able to debug it quickly.
Performance Reality Check
Let's be honest about the trade-offs:
Database Approach:
- ✅ No Redis memory usage
- ✅ Leverages existing database infrastructure
- ✅ ACID guarantees
- ❌ Extra SELECT query for fairness checks
- ❌ Slightly higher database load
Redis Approach:
- ✅ Fast in-memory operations
- ✅ Dedicated locking primitives
- ❌ Additional infrastructure to maintain
- ❌ Network calls can fail
- ❌ Lock cleanup complexity
- ❌ Memory management
For our use case, the database approach was clearly better. Your mileage may vary.
When To Use Each Approach
Go with database atomicity when:
- Your tasks live in the database anyway
- You want strong consistency guarantees
- You're trying to reduce infrastructure complexity
- You trust your database more than your distributed systems skills (smart choice)
Stick with Redis/distributed locks when:
- You need locks across multiple databases
- Tasks involve complex multi-step operations
- You already have Redis infrastructure you're comfortable with
- You're building something like a workflow engine with complex state
The Bottom Line
We replaced 50 lines of complex Redis locking logic with 5 lines of SQL and immediately solved our duplicate processing problem. Sometimes the best engineering solution is the boring one.
Your database has been solving concurrency problems longer than most of us have been writing code. Maybe it's time to trust it.