Race Conditions
Race conditions (CWE-362) are, in my opinion, the most insidious class of security bugs you’ll encounter. They occur when the behaviour of a program depends on the relative timing of concurrent operations, and at least one of those operations modifies shared state. The window between a check and a subsequent use of the checked value, the classic time-of-check to time-of-use (TOCTOU) pattern, is the most exploited form, but races also show up in counter increments, balance updates, session management, and file operations. What makes race conditions uniquely dangerous is their non-determinism: the bug may not manifest in thousands of test runs, then appear under production load when two requests arrive within microseconds of each other. I want to walk through race conditions in Python, Go, Java, and Rust, from the obvious unprotected counter to the subtle channel-based ordering assumption that passes every test but fails under contention.
Why Race Conditions Are Exploitable
A race condition creates a window where the program’s assumptions about state are violated. Attackers exploit this by:
- Double-spend: Two concurrent requests both read a balance of $100, both approve a $100 withdrawal, and the account ends up at -$100 instead of $0.
- Privilege escalation: A TOCTOU race between checking a user’s role and performing an action allows the role to change between the check and the use.
- File system races: A program checks that a file is safe to read, then reads it, but between the check and the read, an attacker replaces the file with a symlink to /etc/shadow.
- Authentication bypass: A session token is validated and then used, but a concurrent request invalidates the session between the two steps.
The non-deterministic nature makes race conditions extremely difficult to detect in testing. They require specific timing, which varies with CPU load, thread scheduling, and I/O latency. The research literature describes techniques like thread spraying (sending thousands of concurrent requests) to increase the probability of hitting the race window. It’s surprisingly effective, and it’s one of those things that really drove home for me how different “works in testing” is from “correct under load.”
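To make the technique concrete, here is a minimal sketch of the idea in Python: a threading.Barrier releases all the "attacker" threads at once so they pile into the race window together. The time.sleep(0) is an artificial stand-in for the I/O a real request handler would do; it widens the window so the race manifests reliably instead of once in a million runs.

```python
import threading
import time

def spray(n=50):
    """Fire n concurrent withdrawal attempts at a $100 balance."""
    state = {"balance": 100, "approved": 0}
    barrier = threading.Barrier(n)

    def attempt():
        barrier.wait()                  # release every thread at once
        if state["balance"] >= 100:     # check
            time.sleep(0)               # yield: stands in for real I/O
            state["balance"] -= 100     # use: the check may now be stale
            state["approved"] += 1

    threads = [threading.Thread(target=attempt) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["approved"]

# A correct implementation would approve exactly one $100 withdrawal;
# under the race, many of the n attempts are approved.
print("approved:", spray())
```

With the yield in place, most of the threads pass the balance check before any of them performs the decrement, so the number of approved withdrawals is usually far greater than one.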
The Easy-to-Spot Version
Python: Unprotected Shared Counter
```python
import threading

class BankAccount:
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if self.balance >= amount:
            current = self.balance
            self.balance = current - amount
            return True
        return False

    def deposit(self, amount):
        current = self.balance
        self.balance = current + amount

account = BankAccount(1000)

def make_withdrawals():
    for _ in range(100):
        account.withdraw(10)

threads = [threading.Thread(target=make_withdrawals) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"Final balance: {account.balance}")
```
The withdraw method reads self.balance, checks it, then writes back the decremented value. Between the read and the write, another thread can read the same value and perform its own withdrawal. With 10 threads each attempting 100 withdrawals of $10, there are $10,000 of attempted withdrawals against a $1,000 balance, so a correct implementation approves exactly 100 of them and drains the account to exactly $0. But due to the race, two threads can read the same balance and both succeed against it; the lost decrement means more withdrawals are approved than the balance could cover, and the account has effectively “created” money even when the final printed balance looks plausible.
One thing that tripped me up when I first studied this: the common belief that “Python has the GIL, so it’s thread-safe.” Python’s GIL does not prevent this race, because the check-then-modify sequence is not atomic; the interpreter can switch threads between the if check and the assignment. The GIL protects interpreter internals, not your application logic.
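You can see this directly with the dis module: in a stripped-down Account with the same withdraw logic, the balance check and the write-back compile to separate bytecode instructions, and CPython may switch threads between any two of them. A quick sketch:

```python
import dis

class Account:
    def __init__(self, balance):
        self.balance = balance

    def withdraw(self, amount):
        if self.balance >= amount:
            self.balance -= amount
            return True
        return False

# The read of the balance (LOAD_ATTR) and the write back (STORE_ATTR) are
# separate instructions; a thread switch can land anywhere between them.
dis.dis(Account.withdraw)
```

The exact opcodes vary by CPython version, but the read, compare, subtract, and store always remain distinct instructions; nothing glues them into one atomic step.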
Go: Unsynchronized Map Access
```go
package main

import (
	"fmt"
	"sync"
)

type SessionStore struct {
	sessions map[string]string
}

func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: make(map[string]string)}
}

func (s *SessionStore) Set(key, value string) {
	s.sessions[key] = value
}

func (s *SessionStore) Get(key string) (string, bool) {
	val, ok := s.sessions[key]
	return val, ok
}

func (s *SessionStore) Delete(key string) {
	delete(s.sessions, key)
}

func main() {
	store := NewSessionStore()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			key := fmt.Sprintf("session-%d", id)
			store.Set(key, fmt.Sprintf("user-%d", id))
			store.Get(key)
			store.Delete(key)
		}(i)
	}
	wg.Wait()
	fmt.Println("Done")
}
```
Go’s built-in map is not safe for concurrent access. The runtime ships a lightweight, best-effort check that aborts the program with fatal error: concurrent map writes — a fatal error that, unlike a panic, cannot be recovered. The Go race detector (go run -race) catches the race immediately and reliably. When the runtime check misses the race, the behaviour is undefined: the map’s internal hash table can be corrupted, leading to incorrect lookups, infinite loops in hash chain traversal, or memory corruption. This is one of the more common Go production issues I’ve come across in code reviews and bug reports.
The Hard-to-Spot Version
Python: TOCTOU in File Processing
```python
import os
import json

class ConfigManager:
    def __init__(self, config_dir):
        self.config_dir = config_dir

    def load_config(self, name):
        path = os.path.join(self.config_dir, name)
        if not os.path.isfile(path):
            raise FileNotFoundError(f"Config not found: {name}")
        if not os.path.realpath(path).startswith(os.path.realpath(self.config_dir)):
            raise PermissionError("Path traversal detected")
        stat = os.stat(path)
        if stat.st_size > 1024 * 1024:
            raise ValueError("Config file too large")
        # TOCTOU: file could be replaced between stat() and open()
        with open(path, 'r') as f:
            return json.load(f)

    def update_config(self, name, data):
        path = os.path.join(self.config_dir, name)
        tmp_path = path + '.tmp'
        with open(tmp_path, 'w') as f:
            json.dump(data, f)
        # TOCTOU: another process could modify tmp_path before rename
        os.rename(tmp_path, path)
```
The load_config method checks that the file exists, is within the allowed directory, and is not too large, then opens it. Between the os.stat and the open, an attacker can replace the file with a symlink to a different file (e.g., /etc/passwd), or replace it with a much larger file. The checks pass against the original file, but the open reads the replacement.
The update_config method writes to a temporary file and renames it over the original. But between the write and the rename, another process could modify or replace the temporary file, whose name (path + '.tmp') is predictable. And rename is atomic only within a single filesystem; across filesystems os.rename fails outright on POSIX, and any copy-based fallback reopens the corruption window.
The more I researched this, the more I realised this pattern is pervasive in configuration management, file upload processing, and any code that validates a file before using it. It’s easy to explain but hard to eliminate, because the “check then use” pattern feels so natural.
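For intuition, here is a sketch of the attacker’s half of the race: a loop that flips the config path back and forth between a benign file (so the victim’s checks pass) and a sensitive target (so the victim’s open reads something else). The paths and timing are illustrative; os.replace is used because it atomically swaps the new symlink into place, so there is no moment where the path is missing.

```python
import os
import time

def flip_loop(path, benign, target, duration=0.2):
    """Race the victim: alternate `path` between symlinks to two files."""
    deadline = time.monotonic() + duration
    flips = 0
    while time.monotonic() < deadline:
        for dest in (benign, target):
            tmp = f"{path}.swap"
            os.symlink(dest, tmp)
            os.replace(tmp, path)  # atomic rename over the old symlink
            flips += 1
    return flips
```

If the victim’s stat() lands while the path points at the benign file and its open() lands after a flip, every check passes against one file and the read happens against the other, which is exactly the window load_config leaves open.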
Go: Goroutine Ordering Assumption with Channels
```go
package main

import (
	"fmt"
	"sync"
)

type WorkerPool struct {
	results map[int]string
	mu      sync.Mutex
}

func NewWorkerPool() *WorkerPool {
	return &WorkerPool{results: make(map[int]string)}
}

func (wp *WorkerPool) Process(items []string) map[int]string {
	done := make(chan struct{})
	var count int
	for i, item := range items {
		go func(id int, value string) {
			result := transform(value)
			wp.mu.Lock()
			wp.results[id] = result
			count++
			if count == len(items) {
				close(done)
			}
			wp.mu.Unlock()
		}(i, item)
	}
	<-done
	// Safe only because the mutex serializes everything: goroutines may
	// still be between mu.Unlock() and function return when we read
	// wp.results here, so the snapshot must also hold the lock.
	wp.mu.Lock()
	snapshot := make(map[int]string, len(wp.results))
	for k, v := range wp.results {
		snapshot[k] = v
	}
	wp.mu.Unlock()
	return snapshot
}

func transform(s string) string {
	return fmt.Sprintf("processed-%s", s)
}

func main() {
	pool := NewWorkerPool()
	results := pool.Process([]string{"a", "b", "c", "d"})
	fmt.Println(results)
}
```
The count variable is incremented inside the mutex, and the channel is closed when count reaches the expected total. This looks correct: the channel signal means all items have been processed. But there is a subtle hazard: the goroutine that closes the channel still holds the mutex at that point, and the main goroutine immediately tries to acquire the mutex after <-done returns. This specific code happens to be safe because the mutex serializes all access to results and count. However, if the count++ and close(done) were moved outside the mutex, or if the snapshot read did not acquire the mutex, the race would manifest. The pattern is fragile: a small refactor, such as moving the count increment outside the lock for “performance”, breaks the synchronization guarantee.
What I find interesting about this one is the more insidious variant: if Process is called concurrently on the same WorkerPool, the results map accumulates entries across calls, so one call’s snapshot can include, or be overwritten by, another call’s results. This kind of bug tends to hide in worker pools where someone assumed the pool was single-use; it works perfectly in unit tests and then breaks under real concurrency.
Java: Double-Checked Locking with Mutable State
```java
import java.util.HashMap;
import java.util.Map;

public class ConnectionPool {
    private volatile Map<String, Object> connections = new HashMap<>();
    private final Object lock = new Object();

    public Object getConnection(String key) {
        Object conn = connections.get(key);
        if (conn != null) {
            return conn;
        }
        synchronized (lock) {
            conn = connections.get(key);
            if (conn != null) {
                return conn;
            }
            conn = createConnection(key);
            Map<String, Object> newMap = new HashMap<>(connections);
            newMap.put(key, conn);
            connections = newMap;
            return conn;
        }
    }

    public void removeConnection(String key) {
        Map<String, Object> newMap = new HashMap<>(connections);
        newMap.remove(key);
        connections = newMap;
    }

    private Object createConnection(String key) {
        return new Object(); // Simulated connection
    }
}
```
The getConnection method uses double-checked locking: it reads connections without synchronization, and only acquires the lock if the key is missing. The connections field is volatile, so the reference assignment is visible across threads. But removeConnection is not synchronized: it creates a new map and assigns it. If removeConnection runs concurrently with getConnection, the following sequence is possible:
1. Thread A calls getConnection("db"), reads connections (which contains "db"), and gets the connection.
2. Thread B calls removeConnection("db"), creates a new map without "db", and assigns it to connections.
3. Thread A returns the connection object, which Thread B considers removed.
4. Thread B closes the underlying resource associated with "db".
5. Thread A uses the now-closed connection.
The race is between the unsynchronized read in getConnection and the unsynchronized write in removeConnection. The volatile keyword ensures visibility of the reference, but not consistency of the operation sequence. There is also a lost-update race between writers: removeConnection can copy connections while the locked writer in getConnection is building its own copy, and whichever assignment lands last silently discards the other’s change. This pattern typically shows up as intermittent “connection closed” exceptions under load; it’s the kind of bug that’s maddening to reproduce and only makes sense once you trace through the thread interleaving.
Rust: Unsafe Shared Mutable State Behind Arc
```rust
use std::sync::Arc;
use std::thread;

struct Metrics {
    counters: Vec<u64>,
}

impl Metrics {
    fn new(size: usize) -> Self {
        Metrics {
            counters: vec![0; size],
        }
    }

    fn increment(&mut self, index: usize) {
        if index < self.counters.len() {
            self.counters[index] += 1;
        }
    }

    fn get(&self, index: usize) -> u64 {
        self.counters[index]
    }
}

fn main() {
    let metrics = Arc::new(Metrics::new(10));
    let mut handles = vec![];
    for i in 0..4 {
        let m = Arc::clone(&metrics);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                // This won't compile as-is: Arc<Metrics> doesn't allow &mut.
                // A developer might use unsafe to work around this:
                let ptr = Arc::as_ptr(&m) as *mut Metrics;
                unsafe {
                    (*ptr).increment(i % 10);
                }
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("Counter 0: {}", metrics.get(0));
}
```
Rust’s type system prevents data races at compile time: Arc<Metrics> only hands out shared (&) references, never mutable (&mut) ones. But I’ve come across cases where developers who want mutable access without the overhead of Mutex cast through a raw pointer using unsafe. This bypasses Rust’s aliasing rules: multiple threads now have mutable access to the same Vec<u64>, the += 1 operation is not atomic, and the program has undefined behaviour. Increments are lost, and on architectures with weak memory ordering the counters may contain garbage values.
The correct Rust approach is Arc<Mutex<Metrics>> or Arc<RwLock<Metrics>>, or using AtomicU64 for individual counters. The unsafe cast is a red flag that should be caught in code review. As a general rule, any time someone reaches for unsafe to get around the borrow checker for shared mutable access, they’re introducing a data race.
Detection Strategies
Static Analysis
| Tool | Language | What It Catches | Limitations |
|---|---|---|---|
| go run -race | Go | Data races on memory access | Runtime only, requires triggering the race |
| ThreadSanitizer (TSan) | C/C++/Rust | Data races, lock order violations | Runtime only, 5-15x slowdown |
| SpotBugs | Java | Inconsistent synchronization, double-checked locking | Limited to known patterns |
| pylint / bandit | Python | Limited race condition detection | Cannot reason about thread interleavings |
| clippy | Rust | Warns about unsafe blocks, suggests Mutex/RwLock | Cannot detect races inside unsafe |
| Semgrep | All | Pattern matching for known race-prone patterns | Cannot reason about concurrency semantics |
Runtime Detection
| Tool | How It Works | Overhead |
|---|---|---|
| Go Race Detector | Instruments memory accesses, detects concurrent read/write | 2-10x slowdown, 5-10x memory |
| ThreadSanitizer | Tracks happens-before relationships between memory accesses | 5-15x slowdown |
| Java fail-fast iterators (ConcurrentModificationException) | Detect structural modification during iteration | Minimal overhead |
| Rust Miri | Detects undefined behavior, including data races, in unsafe code | Interpretation mode, very slow |
Manual Review Indicators
- Shared mutable state without synchronization: any field accessed by multiple threads without a lock, atomic, or channel.
- Check-then-act patterns: if (condition) { act() } where condition can change between the check and the act.
- volatile without synchronization in Java: volatile ensures visibility but not atomicity of compound operations.
- Python’s GIL misconception: the GIL prevents concurrent execution of bytecode but does not make multi-step operations atomic.
- unsafe blocks in Rust that cast to *mut: bypassing the borrow checker for shared mutable access is almost always a data race.
- File operations with separate check and use: os.path.exists() followed by open() is a TOCTOU race.
- Lazy initialization without synchronization: singleton patterns, connection pools, and caches that initialize on first access.
Remediation
Python: Use Threading Lock
```python
import threading

class BankAccount:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False

    def deposit(self, amount):
        with self._lock:
            self.balance += amount
```
The threading.Lock ensures that the check-and-modify sequence is atomic with respect to other threads. The with statement guarantees the lock is released even if an exception occurs.
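A quick way to convince yourself the lock works is to replay the load from the vulnerable example against the locked version and count approvals. This standalone sketch always drains the $1,000 balance to exactly $0 with exactly 100 approved withdrawals, no matter how the threads interleave.

```python
import threading

# Replay the earlier load against a locked account:
# 10 threads x 100 attempted withdrawals of $10 from a $1,000 balance.
class LockedAccount:
    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def withdraw(self, amount):
        with self._lock:
            if self.balance >= amount:
                self.balance -= amount
                return True
            return False

account = LockedAccount(1000)
approved = [0] * 10          # one slot per thread: no shared-counter race

def worker(slot):
    for _ in range(100):
        if account.withdraw(10):
            approved[slot] += 1

threads = [threading.Thread(target=worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The lock serializes check-and-decrement, so exactly 100 of the 1,000
# attempts succeed and the balance lands on exactly $0, every run.
print(account.balance, sum(approved))  # prints: 0 100
```

Run it in a loop if you like; unlike the racy version, the result never varies.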
For file TOCTOU races, the fix is to open the file first (obtaining a file descriptor), then perform checks on the open descriptor rather than the path:
```python
import os
import json

def safe_load_config(config_dir, name):
    base = os.path.realpath(config_dir)
    real_path = os.path.realpath(os.path.join(config_dir, name))
    if not real_path.startswith(base + os.sep):
        raise PermissionError("Path traversal detected")
    # Open first, then check the open descriptor: every later operation now
    # refers to this exact file, no matter what happens to the path.
    # O_NOFOLLOW (POSIX) additionally refuses to open a symlink.
    fd = os.open(real_path, os.O_RDONLY | getattr(os, "O_NOFOLLOW", 0))
    try:
        stat = os.fstat(fd)
        if stat.st_size > 1024 * 1024:
            raise ValueError("Config file too large")
    except:
        os.close(fd)
        raise
    # fdopen takes ownership of fd and closes it when the with-block exits
    with os.fdopen(fd, 'r') as f:
        return json.load(f)
```
Go: Use sync.RWMutex or sync.Map
```go
package main

import (
	"fmt"
	"sync"
)

type SessionStore struct {
	sessions map[string]string
	mu       sync.RWMutex
}

func NewSessionStore() *SessionStore {
	return &SessionStore{sessions: make(map[string]string)}
}

func (s *SessionStore) Set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[key] = value
}

func (s *SessionStore) Get(key string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	val, ok := s.sessions[key]
	return val, ok
}

func (s *SessionStore) Delete(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.sessions, key)
}
```
sync.RWMutex allows concurrent reads (RLock) but exclusive writes (Lock). For simple key-value stores, sync.Map is an alternative that is optimised for the case where keys are mostly read and rarely written. Starting with sync.RWMutex is generally the safer bet since it’s more predictable, and you can switch to sync.Map if profiling shows contention.
Java: Proper Synchronization for Connection Pool
```java
import java.util.concurrent.ConcurrentHashMap;

public class ConnectionPool {
    private final ConcurrentHashMap<String, Object> connections = new ConcurrentHashMap<>();

    public Object getConnection(String key) {
        return connections.computeIfAbsent(key, this::createConnection);
    }

    public void removeConnection(String key) {
        connections.remove(key);
    }

    private Object createConnection(String key) {
        return new Object();
    }
}
```
ConcurrentHashMap.computeIfAbsent atomically checks for the key and creates the value if absent. There is no race window between the check and the insert. The ConcurrentHashMap handles all synchronization internally, eliminating the need for manual locking and the fragile double-checked locking pattern. Here’s what clicked for me when studying Java concurrency: if you find yourself writing double-checked locking, step back and ask whether ConcurrentHashMap or java.util.concurrent already solves your problem; it almost always does.
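For what it’s worth, CPython has a rough analogue of this atomic check-and-insert in dict.setdefault on a builtin dict — a sketch, with the caveat that, unlike computeIfAbsent, the default value is constructed eagerly, so an expensive factory may run even when the key already exists:

```python
import threading

pool = {}

def create_connection(key):
    return object()          # stand-in for a real connection object

def get_connection(key):
    # On a builtin dict, setdefault is a single atomic check-and-insert
    # under the GIL, so racing threads all receive the same stored value.
    return pool.setdefault(key, create_connection(key))

results = []

def worker():
    results.append(get_connection("db"))

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every caller got the same connection object, even though
# create_connection may have run more than once (the eager default).
print(len(set(map(id, results))))  # prints: 1
```

When construction is expensive or has side effects, wrap the create-and-insert in a threading.Lock instead of relying on this shortcut.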
Rust: Use Mutex or Atomic Types
```rust
use std::sync::{Arc, Mutex};
use std::thread;

struct Metrics {
    counters: Vec<u64>,
}

impl Metrics {
    fn new(size: usize) -> Self {
        Metrics {
            counters: vec![0; size],
        }
    }

    fn increment(&mut self, index: usize) {
        if index < self.counters.len() {
            self.counters[index] += 1;
        }
    }

    fn get(&self, index: usize) -> u64 {
        self.counters[index]
    }
}

fn main() {
    let metrics = Arc::new(Mutex::new(Metrics::new(10)));
    let mut handles = vec![];
    for i in 0..4 {
        let m = Arc::clone(&metrics);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                let mut guard = m.lock().unwrap();
                guard.increment(i % 10);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    let guard = metrics.lock().unwrap();
    println!("Counter 0: {}", guard.get(0));
}
```
Arc<Mutex<Metrics>> is the idiomatic Rust pattern for shared mutable state. The Mutex ensures exclusive access, and the type system enforces that you cannot access the data without holding the lock. For individual counters where a full mutex is too heavy, use AtomicU64:
```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

struct AtomicMetrics {
    counters: Vec<AtomicU64>,
}

impl AtomicMetrics {
    fn new(size: usize) -> Self {
        AtomicMetrics {
            counters: (0..size).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    fn increment(&self, index: usize) {
        if index < self.counters.len() {
            self.counters[index].fetch_add(1, Ordering::Relaxed);
        }
    }

    fn get(&self, index: usize) -> u64 {
        self.counters[index].load(Ordering::Relaxed)
    }
}

fn main() {
    let metrics = Arc::new(AtomicMetrics::new(10));
    let mut handles = vec![];
    for i in 0..4 {
        let m = Arc::clone(&metrics);
        handles.push(thread::spawn(move || {
            for _ in 0..1000 {
                m.increment(i % 10);
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("Counter 0: {}", metrics.get(0));
}
```
Atomic operations are lock-free and significantly faster than mutex-based synchronization for simple counter increments. Ordering::Relaxed is sufficient when counters are independent and no ordering between different counters is required. One thing I appreciate about Rust is how it makes you think about memory ordering explicitly; it’s more work upfront, but it means you actually understand what guarantees you’re getting.
Key Takeaways
- Race conditions are non-deterministic. A test suite that passes 10,000 times can still have a race that manifests under production load. Use race detectors (go run -race, ThreadSanitizer) in CI.
- The GIL does not prevent races in Python. Multi-step operations (check-then-act, read-modify-write) are not atomic even with the GIL. Use threading.Lock for shared mutable state.
- Prefer language-provided concurrent data structures: sync.Map in Go, ConcurrentHashMap in Java, Arc<Mutex<T>> in Rust. These are tested, optimised, and correct.
- TOCTOU races in file operations are real. Open the file first, then check the file descriptor. Never check a path and then open it separately.
- unsafe in Rust for shared mutable access is almost always wrong. If you need mutable access from multiple threads, use Mutex, RwLock, or atomic types. The type system is trying to protect you.
- Double-checked locking is fragile. In Java, use ConcurrentHashMap.computeIfAbsent or synchronized blocks. The performance gain from avoiding synchronization is rarely worth the correctness risk.