Race Conditions
Race conditions (CWE-362) are, in my opinion, the most insidious class of security bugs you’ll encounter. They occur when the behaviour of a program depends on the relative timing of concurrent operations, and at least one of those operations modifies shared state. The window between a check and a subsequent use of the checked value, the classic time-of-check to time-of-use (TOCTOU) pattern, is the most exploited form, but races also show up in counter increments, balance updates, session management, and file operations. What makes race conditions uniquely dangerous is their non-determinism: the bug may not manifest in thousands of test runs, then appear under production load when two requests arrive within microseconds of each other. I want to walk through race conditions in Python, Go, Java, and Rust, from the obvious unprotected counter to the subtle channel-based ordering assumption that passes every test but fails under contention.
Why Race Conditions Are Exploitable
A race condition creates a window where the program’s assumptions about state are violated. Attackers exploit this by:
- Double-spend: Two concurrent requests both read a balance of $100, both approve a $100 withdrawal, and the account ends up at -$100 instead of $0.
- Privilege escalation: A TOCTOU race between checking a user’s role and performing an action allows the role to change between the check and the use.
- File system races: A program checks that a file is safe to read, then reads it, but between the check and the read, an attacker replaces the file with a symlink to /etc/shadow.
- Authentication bypass: A session token is validated and then used, but a concurrent request invalidates the session between the two steps.
The non-deterministic nature makes race conditions extremely difficult to detect in testing. They require specific timing, which varies with CPU load, thread scheduling, and I/O latency. The research literature describes techniques like thread spraying (sending thousands of concurrent requests) to increase the probability of hitting the race window. It’s surprisingly effective, and it’s one of those things that really drove home for me how different “works in testing” is from “correct under load.”
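To make the technique concrete, here is a minimal sketch of the idea in Python: a threading.Barrier releases all the "attacker" threads at once so they pile into the race window together. The time.sleep(0) is an artificial stand-in for the I/O a real request handler would do; it widens the window so the race manifests reliably instead of once in a million runs.

```python
import threading
import time

def spray(n=50):
    """Fire n concurrent withdrawal attempts at a $100 balance."""
    state = {"balance": 100, "approved": 0}
    barrier = threading.Barrier(n)

    def attempt():
        barrier.wait()                  # release every thread at once
        if state["balance"] >= 100:     # check
            time.sleep(0)               # yield: stands in for real I/O
            state["balance"] -= 100     # use: the check may now be stale
            state["approved"] += 1

    threads = [threading.Thread(target=attempt) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["approved"]

# A correct implementation would approve exactly one $100 withdrawal;
# under the race, many of the n attempts are approved.
print("approved:", spray())
```

With the yield in place, most of the threads pass the balance check before any of them performs the decrement, so the number of approved withdrawals is usually far greater than one.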
The Easy-to-Spot Version
Python: Unprotected Shared Counter
```python
import threading

class BankAccount:
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if self.balance >= amount:
            current = self.balance
            self.balance = current - amount
            return True
        return False

    def deposit(self, amount):
        current = self.balance
        self.balance = current + amount

account = BankAccount(1000)

def make_withdrawals():
    for _ in range(100):
        account.withdraw(10)

threads = [threading.Thread(target=make_withdrawals) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Final balance: {account.balance}")
```
The withdraw method reads self.balance, checks it, then writes back the decremented value. Between the read and the write, another thread can read the same value and perform its own withdrawal. With 10 threads each attempting 100 withdrawals of $10, there are $10,000 of attempted withdrawals against a $1,000 balance, so a correct implementation approves exactly 100 of them and drains the account to exactly $0. But due to the race, two threads can read the same balance and both succeed against it; the lost decrement means more withdrawals are approved than the balance could cover, and the account has effectively “created” money even when the final printed balance looks plausible.
One thing that tripped me up when I first studied this: the common belief that “Python has the GIL, so it’s thread-safe.” Python’s GIL does not prevent this race, because the check-then-modify sequence is not atomic; the interpreter can switch threads between the if check and the assignment. The GIL protects interpreter internals, not your application logic.
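You can see this directly with the dis module: in a stripped-down Account with the same withdraw logic, the balance check and the write-back compile to separate bytecode instructions, and CPython may switch threads between any two of them. A quick sketch:

```python
import dis

class Account:
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if self.balance >= amount:
            self.balance -= amount
            return True
        return False

# The read of the balance (LOAD_ATTR) and the write back (STORE_ATTR) are
# separate instructions; a thread switch can land anywhere between them.
dis.dis(Account.withdraw)
```

The exact opcodes vary by CPython version, but the read, compare, subtract, and store always remain distinct instructions; nothing glues them into one atomic step.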
Go: Unsynchronized Map Access
```go
package main

import (
	"fmt"
	"sync"
)

type SessionStore struct {
	sessions map[string]string
}

func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: make(map[string]string)}
}

func (s *SessionStore) Set(key, value string) {
	s.sessions[key] = value
}

func (s *SessionStore) Get(key string) (string, bool) {
	val, ok := s.sessions[key]
	return val, ok
}

func (s *SessionStore) Delete(key string) {
	delete(s.sessions, key)
}

func main() {
	store := NewSessionStore()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			key := fmt.Sprintf("session-%d", id)
			store.Set(key, fmt.Sprintf("user-%d", id))
			store.Get(key)
			store.Delete(key)
		}(i)
	}
	wg.Wait()
	fmt.Println("Done")
}
```
Go’s built-in map is not safe for concurrent access. The runtime ships a lightweight, best-effort check that aborts the program with fatal error: concurrent map writes — a fatal error that, unlike a panic, cannot be recovered. The Go race detector (go run -race) catches the race immediately and reliably. When the runtime check misses the race, the behaviour is undefined: the map’s internal hash table can be corrupted, leading to incorrect lookups, infinite loops in hash chain traversal, or memory corruption. This is one of the more common Go production issues I’ve come across in code reviews and bug reports.
The Hard-to-Spot Version
Python: TOCTOU in File Processing
```python
import os
import json

class ConfigManager:
    def __init__(self, config_dir):
        self.config_dir = config_dir

    def load_config(self, name):
        path = os.path.join(self.config_dir, name)
        if not os.path.isfile(path):
            raise FileNotFoundError(f"Config not found: {name}")
        if not os.path.realpath(path).startswith(os.path.realpath(self.config_dir)):
            raise PermissionError("Path traversal detected")
        stat = os.stat(path)
        if stat.st_size > 1024 * 1024:
            raise ValueError("Config file too large")
        # TOCTOU: file could be replaced between stat() and open()
        with open(path, 'r') as f:
            return json.load(f)

    def update_config(self, name, data):
        path = os.path.join(self.config_dir, name)
        tmp_path = path + '.tmp'
        with open(tmp_path, 'w') as f:
            json.dump(data, f)
        # TOCTOU: another process could modify tmp_path before rename
        os.rename(tmp_path, path)
```
The load_config method checks that the file exists, is within the allowed directory, and is not too large, then opens it. Between the os.stat and the open, an attacker can replace the file with a symlink to a different file (e.g., /etc/passwd), or replace it with a much larger file. The checks pass against the original file, but the open reads the replacement.
The update_config method writes to a temporary file and renames it over the original. But between the write and the rename, another process could modify or replace the temporary file, whose name (path + '.tmp') is predictable. And rename is atomic only within a single filesystem; across filesystems os.rename fails outright on POSIX, and any copy-based fallback reopens the corruption window.
The more I researched this, the more I realised this pattern is pervasive in configuration management, file upload processing, and any code that validates a file before using it. It’s easy to explain but hard to eliminate, because the “check then use” pattern feels so natural.
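For intuition, here is a sketch of the attacker’s half of the race: a loop that flips the config path back and forth between a benign file (so the victim’s checks pass) and a sensitive target (so the victim’s open reads something else). The paths and timing are illustrative; os.replace is used because it atomically swaps the new symlink into place, so there is no moment where the path is missing.

```python
import os
import time

def flip_loop(path, benign, target, duration=0.2):
    """Race the victim: alternate `path` between symlinks to two files."""
    deadline = time.monotonic() + duration
    flips = 0
    while time.monotonic() < deadline:
        for dest in (benign, target):
            tmp = f"{path}.swap"
            os.symlink(dest, tmp)
            os.replace(tmp, path)  # atomic rename over the old symlink
            flips += 1
    return flips
```

If the victim’s stat() lands while the path points at the benign file and its open() lands after a flip, every check passes against one file and the read happens against the other, which is exactly the window load_config leaves open.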
Go: Goroutine Ordering Assumption with Channels
```go
package main

import (
	"fmt"
	"sync"
)

type WorkerPool struct {
	results map[int]string
	mu      sync.Mutex
}

func NewWorkerPool() *WorkerPool {
	return &WorkerPool{results: make(map[int]string)}
}

func (wp *WorkerPool) Process(items []string) map[int]string {
	done := make(chan struct{})
	var count int
	for i, item := range items {
		go func(id int, value string) {
			result := transform(value)
			wp.mu.Lock()
			wp.results[id] = result
			count++
			if count == len(items) {
				close(done)
			}
			wp.mu.Unlock()
		}(i, item)
	}
	<-done
	// Safe only because the mutex serializes everything: goroutines may
	// still be between mu.Unlock() and function return when we read
	// wp.results here, so the snapshot must also hold the lock.
	wp.mu.Lock()
	snapshot := make(map[int]string, len(wp.results))
	for k, v := range wp.results {
		snapshot[k] = v
	}
	wp.mu.Unlock()
	return snapshot
}

func transform(s string) string {
	return fmt.Sprintf("processed-%s", s)
}

func main() {
	pool := NewWorkerPool()
	results := pool.Process([]string{"a", "b", "c", "d"})
	fmt.Println(results)
}
```
The count variable is incremented inside the mutex, and the channel is closed when count reaches the expected total. This looks correct: the channel signal means all items have been processed. But there is a subtle hazard: the goroutine that closes the channel still holds the mutex at that point, and the main goroutine immediately tries to acquire the mutex after <-done returns. This specific code happens to be safe because the mutex serializes all access to results and count. However, if the count++ and close(done) were moved outside the mutex, or if the snapshot read did not acquire the mutex, the race would manifest. The pattern is fragile: a small refactor, such as moving the count increment outside the lock for “performance”, breaks the synchronization guarantee.
What I find interesting about this one is the more insidious variant: if Process is called concurrently on the same WorkerPool, the results map accumulates entries across calls, so one call’s snapshot can include, or be overwritten by, another call’s results. This kind of bug tends to hide in worker pools where someone assumed the pool was single-use; it works perfectly in unit tests and then breaks under real concurrency.
Java: Double-Checked Locking with Mutable State
```java
import java.util.HashMap;
import java.util.Map;

public class ConnectionPool {
    private volatile Map<String, Object> connections = new HashMap<>();
    private final Object lock = new Object();

    public Object getConnection(String key) {
        Object conn = connections.get(key);
        if (conn != null) {
            return conn;
        }
        synchronized (lock) {
            conn = connections.get(key);
            if (conn != null) {
                return conn;
            }
            conn = createConnection(key);
            Map<String, Object> newMap = new HashMap<>(connections);
            newMap.put(key, conn);
            connections = newMap;
            return conn;
        }
    }

    public void removeConnection(String key) {
        Map<String, Object> newMap = new HashMap<>(connections);
        newMap.remove(key);
        connections = newMap;
    }

    private Object createConnection(String key) {
        return new Object(); // Simulated connection
    }
}
```
The getConnection method uses double-checked locking: it reads connections without synchronization, and only acquires the lock if the key is missing. The connections field is volatile, so the reference assignment is visible across threads. But removeConnection is not synchronized: it creates a new map and assigns it. If removeConnection runs concurrently with getConnection, the following sequence is possible:
1. Thread A calls getConnection("db"), reads connections (which contains "db"), and gets the connection.
2. Thread B calls removeConnection("db"), creates a new map without "db", and assigns it to connections.
3. Thread A returns the connection object, which Thread B considers removed.
4. Thread B closes the underlying resource associated with "db".
5. Thread A uses the now-closed connection.
The race is between the unsynchronized read in getConnection and the unsynchronized write in removeConnection. The volatile keyword ensures visibility of the reference, but not consistency of the operation sequence. There is also a lost-update race between writers: removeConnection can copy connections while the locked writer in getConnection is building its own copy, and whichever assignment lands last silently discards the other’s change. This pattern typically shows up as intermittent “connection closed” exceptions under load; it’s the kind of bug that’s maddening to reproduce and only makes sense once you trace through the thread interleaving.
Rust: Unsafe Shared Mutable State Behind Arc
```rust
use std::sync::Arc;
use std::thread;

struct Metrics {
    counters: Vec<u64>,
}

impl Metrics {
    fn new(size: usize) -> Self {
        Metrics {
            counters: vec![0; size],
        }
    }

    fn increment(&mut self, index: usize) {
        if index < self.counters.len() {
            self.counters[index] += 1;
        }
    }

    fn get(&self, index: usize) -> u64 {
        self.counters[index]
    }
}

fn main() {
    let metrics = Arc::new(Metrics::new(10));
    let mut handles = vec![];
    for i in 0..4 {
        let m = Arc::clone(&metrics);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                // This won't compile as-is: Arc<Metrics> doesn't allow &mut.
                // A developer might use unsafe to work around this:
                let ptr = Arc::as_ptr(&m) as *mut Metrics;
                unsafe {
                    (*ptr).increment(i % 10);
                }
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("Counter 0: {}", metrics.get(0));
}
```
Rust’s type system prevents data races at compile time: Arc<Metrics> only hands out shared (&) references, never mutable (&mut) ones. But I’ve come across cases where developers who want mutable access without the overhead of Mutex cast through a raw pointer using unsafe. This bypasses Rust’s aliasing rules: multiple threads now have mutable access to the same Vec<u64>, the += 1 operation is not atomic, and the program has undefined behaviour. Increments are lost, and on architectures with weak memory ordering the counters may contain garbage values.
The correct Rust approach is Arc<Mutex<Metrics>> or Arc<RwLock<Metrics>>, or using AtomicU64 for individual counters. The unsafe cast is a red flag that should be caught in code review. As a general rule, any time someone reaches for unsafe to get around the borrow checker for shared mutable access, they’re introducing a data race.
Detection Strategies
Static Analysis
| Tool | Language | What It Catches | Limitations |
|---|---|---|---|
| go run -race | Go | Data races on memory access | Runtime only, requires triggering the race |
| ThreadSanitizer (TSan) | C/C++/Rust | Data races, lock order violations | Runtime only, 5-15x slowdown |
| SpotBugs | Java | Inconsistent synchronization, double-checked locking | Limited to known patterns |
| pylint / bandit | Python | Limited race condition detection | Cannot reason about thread interleavings |
| clippy | Rust | Warns about unsafe blocks, suggests Mutex/RwLock | Cannot detect races inside unsafe |
| Semgrep | All | Pattern matching for known race-prone patterns | Cannot reason about concurrency semantics |
Runtime Detection
| Tool | How It Works | Overhead |
|---|---|---|
| Go Race Detector | Instruments memory accesses, detects concurrent read/write | 2-10x slowdown, 5-10x memory |
| ThreadSanitizer | Tracks happens-before relationships between memory accesses | 5-15x slowdown |
| Java fail-fast iterators (ConcurrentModificationException) | Detect structural modification during iteration | Minimal overhead |
| Rust Miri | Detects undefined behavior, including data races, in unsafe code | Interpretation mode, very slow |
Manual Review Indicators
- Shared mutable state without synchronization: any field accessed by multiple threads without a lock, atomic, or channel.
- Check-then-act patterns: if (condition) { act() } where condition can change between the check and the act.
- volatile without synchronization in Java: volatile ensures visibility but not atomicity of compound operations.
- Python’s GIL misconception: the GIL prevents concurrent execution of bytecode but does not make multi-step operations atomic.
- unsafe blocks in Rust that cast to *mut: bypassing the borrow checker for shared mutable access is almost always a data race.
- File operations with separate check and use: os.path.exists() followed by open() is a TOCTOU race.
- Lazy initialization without synchronization: singleton patterns, connection pools, and caches that initialize on first access.
Remediation
Python: Use Threading Lock
```python
import threading

class BankAccount:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False

    def deposit(self, amount):
        with self._lock:
            self.balance += amount
```
The threading.Lock ensures that the check-and-modify sequence is atomic with respect to other threads. The with statement guarantees the lock is released even if an exception occurs.
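A quick way to convince yourself the lock works is to replay the load from the vulnerable example against the locked version and count approvals. This standalone sketch always drains the $1,000 balance to exactly $0 with exactly 100 approved withdrawals, no matter how the threads interleave.

```python
import threading

# Replay the earlier load against a locked account:
# 10 threads x 100 attempted withdrawals of $10 from a $1,000 balance.
class LockedAccount:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False

account = LockedAccount(1000)
approved = [0] * 10          # one slot per thread: no shared-counter race

def worker(slot):
    for _ in range(100):
        if account.withdraw(10):
            approved[slot] += 1

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The lock serializes check-and-decrement, so exactly 100 of the 1,000
# attempts succeed and the balance lands on exactly $0, every run.
print(account.balance, sum(approved))  # prints: 0 100
```

Run it in a loop if you like; unlike the racy version, the result never varies.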
For file TOCTOU races, the fix is to open the file first (obtaining a file descriptor), then perform checks on the open descriptor rather than the path:
```python
import os
import json

def safe_load_config(config_dir, name):
    base = os.path.realpath(config_dir)
    real_path = os.path.realpath(os.path.join(config_dir, name))
    if not real_path.startswith(base + os.sep):
        raise PermissionError("Path traversal detected")
    # Open first, then check the open descriptor: every later operation now
    # refers to this exact file, no matter what happens to the path.
    # O_NOFOLLOW (POSIX) additionally refuses to open a symlink.
    fd = os.open(real_path, os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0))
    try:
        stat = os.fstat(fd)
        if stat.st_size > 1024 * 1024:
            raise ValueError("Config file too large")
    except:
        os.close(fd)
        raise
    # fdopen takes ownership of fd and closes it when the with-block exits
    with os.fdopen(fd, 'r') as f:
        return json.load(f)
```
Go: Use sync.RWMutex or sync.Map
```go
package main

import (
	"fmt"
	"sync"
)

type SessionStore struct {
	sessions map[string]string
	mu       sync.RWMutex
}

func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: make(map[string]string)}
}

func (s *SessionStore) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[key] = value
}

func (s *SessionStore) Get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	val, ok := s.sessions[key]
	return val, ok
}

func (s *SessionStore) Delete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.sessions, key)
}
```
sync.RWMutex allows concurrent reads (RLock) but exclusive writes (Lock). For simple key-value stores, sync.Map is an alternative that is optimised for the case where keys are mostly read and rarely written. Starting with sync.RWMutex is generally the safer bet since it’s more predictable, and you can switch to sync.Map if profiling shows contention.
Java: Proper Synchronization for Connection Pool
```java
import java.util.concurrent.ConcurrentHashMap;

public class ConnectionPool {
    private final ConcurrentHashMap<String, Object> connections = new ConcurrentHashMap<>();

    public Object getConnection(String key) {
        return connections.computeIfAbsent(key, this::createConnection);
    }

    public void removeConnection(String key) {
        connections.remove(key);
    }

    private Object createConnection(String key) {
        return new Object();
    }
}
```
ConcurrentHashMap.computeIfAbsent atomically checks for the key and creates the value if absent. There is no race window between the check and the insert. The ConcurrentHashMap handles all synchronization internally, eliminating the need for manual locking and the fragile double-checked locking pattern. Here’s what clicked for me when studying Java concurrency: if you find yourself writing double-checked locking, step back and ask whether ConcurrentHashMap or java.util.concurrent already solves your problem; it almost always does.
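For what it’s worth, CPython has a rough analogue of this atomic check-and-insert in dict.setdefault on a builtin dict — a sketch, with the caveat that, unlike computeIfAbsent, the default value is constructed eagerly, so an expensive factory may run even when the key already exists:

```python
import threading

pool = {}

def create_connection(key):
    return object()          # stand-in for a real connection object

def get_connection(key):
    # On a builtin dict, setdefault is a single atomic check-and-insert
    # under the GIL, so racing threads all receive the same stored value.
    return pool.setdefault(key, create_connection(key))

results = []

def worker():
    results.append(get_connection("db"))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every caller got the same connection object, even though
# create_connection may have run more than once (the eager default).
print(len(set(map(id, results))))  # prints: 1
```

When construction is expensive or has side effects, wrap the create-and-insert in a threading.Lock instead of relying on this shortcut.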
Rust: Use Mutex or Atomic Types
```rust
use std::sync::{Arc, Mutex};
use std::thread;

struct Metrics {
    counters: Vec<u64>,
}

impl Metrics {
    fn new(size: usize) -> Self {
        Metrics {
            counters: vec![0; size],
        }
    }

    fn increment(&mut self, index: usize) {
        if index < self.counters.len() {
            self.counters[index] += 1;
        }
    }

    fn get(&self, index: usize) -> u64 {
        self.counters[index]
    }
}

fn main() {
    let metrics = Arc::new(Mutex::new(Metrics::new(10)));
    let mut handles = vec![];
    for i in 0..4 {
        let m = Arc::clone(&metrics);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                let mut guard = m.lock().unwrap();
                guard.increment(i % 10);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let guard = metrics.lock().unwrap();
    println!("Counter 0: {}", guard.get(0));
}
```
Arc<Mutex<Metrics>> is the idiomatic Rust pattern for shared mutable state. The Mutex ensures exclusive access, and the type system enforces that you cannot access the data without holding the lock. For individual counters where a full mutex is too heavy, use AtomicU64:
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

struct AtomicMetrics {
    counters: Vec<AtomicU64>,
}

impl AtomicMetrics {
    fn new(size: usize) -> Self {
        AtomicMetrics {
            counters: (0..size).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    fn increment(&self, index: usize) {
        if index < self.counters.len() {
            self.counters[index].fetch_add(1, Ordering::Relaxed);
        }
    }

    fn get(&self, index: usize) -> u64 {
        self.counters[index].load(Ordering::Relaxed)
    }
}

fn main() {
    let metrics = Arc::new(AtomicMetrics::new(10));
    let mut handles = vec![];
    for i in 0..4 {
        let m = Arc::clone(&metrics);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                m.increment(i % 10);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("Counter 0: {}", metrics.get(0));
}
```
Atomic operations are lock-free and significantly faster than mutex-based synchronization for simple counter increments. Ordering::Relaxed is sufficient when counters are independent and no ordering between different counters is required. One thing I appreciate about Rust is how it makes you think about memory ordering explicitly; it’s more work upfront, but it means you actually understand what guarantees you’re getting.
Key Takeaways
- Race conditions are non-deterministic. A test suite that passes 10,000 times can still have a race that manifests under production load. Use race detectors (go run -race, ThreadSanitizer) in CI.
- The GIL does not prevent races in Python. Multi-step operations (check-then-act, read-modify-write) are not atomic even with the GIL. Use threading.Lock for shared mutable state.
- Prefer language-provided concurrent data structures: sync.Map in Go, ConcurrentHashMap in Java, Arc<Mutex<T>> in Rust. These are tested, optimised, and correct.
- TOCTOU races in file operations are real. Open the file first, then check the file descriptor. Never check a path and then open it separately.
- unsafe in Rust for shared mutable access is almost always wrong. If you need mutable access from multiple threads, use Mutex, RwLock, or atomic types. The type system is trying to protect you.
- Double-checked locking is fragile. In Java, use ConcurrentHashMap.computeIfAbsent or synchronized blocks. The performance gain from avoiding synchronization is rarely worth the correctness risk.