Description
We have observed a severe performance issue in our high-throughput application using prometheus-net (v8.2.0+).
The issue stems from the use of ReaderWriterLockSlim in ManagedLifetimeMetricHandle to protect metric leases. ReaderWriterLockSlim enforces a Writer Preference policy: when the background "Reaper" task requests the write lock to clean up expired metrics, the lock immediately blocks all new concurrent readers (metric updates), even before the writer has actually acquired it.
The Mechanism of Failure (a minimal repro sketch follows this list):
- High Throughput Readers: Our application calls GetOrCreateLifetimeAndIncrementLeaseCount (acquiring the read lock) thousands of times per second.
- Suspected infrequent Writers:
  a) The background "Reaper" task (TakeExpiredLeases) wakes up periodically to clean up expired metrics and calls EnterWriteLock().
  b) GetOrCreateLifetimeAndIncrementLeaseCount calls EnterWriteLock() to initialize a new metric if it is not found.
- The "Dam" Effect: As soon as a writer requests the lock, ReaderWriterLockSlim blocks all new readers to prevent writer starvation.
- The Stampede and Kernel Thrashing: Due to the high throughput, thousands of reader threads queue up instantly while waiting for the writer to enter and complete. This stampede causes extreme CPU contention (kernel time) as threads spin and call Thread.Yield/do_sched_yield, leading to thread pool starvation and application unresponsiveness.
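To illustrate the "dam" behavior in isolation, here is a minimal stand-alone sketch (ours, not prometheus-net code) of the pattern described above: many reader threads on the read lock and one periodic writer.

using System;
using System.Threading;
using System.Threading.Tasks;

// Minimal repro sketch: many reader threads hammer EnterReadLock while one
// periodic writer requests EnterWriteLock. As soon as the writer is queued,
// new readers are no longer admitted until the writer has entered and exited,
// so the readers pile up behind it and spin/yield.
class WriterPreferenceDemo
{
    static readonly ReaderWriterLockSlim RwLock = new ReaderWriterLockSlim();

    static void Main()
    {
        // Hot path: simulate high-throughput metric updates taking the read lock.
        for (int i = 0; i < Environment.ProcessorCount * 4; i++)
        {
            Task.Run(() =>
            {
                while (true)
                {
                    RwLock.EnterReadLock();
                    try { /* record a measurement */ }
                    finally { RwLock.ExitReadLock(); }
                }
            });
        }

        // "Reaper": an infrequent writer that holds the write lock briefly.
        while (true)
        {
            Thread.Sleep(1000);
            RwLock.EnterWriteLock();
            try { Thread.Sleep(50); /* sweep expired leases */ }
            finally { RwLock.ExitWriteLock(); }
        }
    }
}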
Expected Behavior
Background maintenance tasks (like cleaning up expired metrics or adding a new metric) should not block the hot path of metric collection, or at least should not strictly prioritize themselves over the application's primary workload to the point of creating a denial-of-service.
Actual Behavior
The application enters a "death spiral." The write lock request effectively pauses the application's metric recording. The resulting queue of blocked threads causes the lock primitives to thrash the CPU scheduler upon release.
Evidence / Stack Traces
We captured the crash using Linux perf. The trace shows threads stuck in ReaderWriterLockSlim.SpinLock.EnterSpin, calling Thread.Sleep(1) (mapped to do_sched_yield), consuming 100% CPU.
Stack Trace:
kernel.kallsyms!do_sched_yield
...
System.Threading.ReaderWriterLockSlim+SpinLock.EnterSpin(...)
System.Threading.ReaderWriterLockSlim.TryEnterReadLockCore(...)
Prometheus.ManagedLifetimeMetricHandle...GetOrCreateLifetimeAndIncrementLeaseCount(...)
Prometheus.ManagedLifetimeMetricHandle....WithLease(...)
Prometheus.MeterAdapter.OnMeasurementRecorded(...)
Attached screenshots: CPU stats, the thread pool stats, the Linux perf trace, the starved readers, and the Reaper's writer lock.
Related Issues
This investigation may have identified the likely root cause of #499. In that report we observed identical thread pool exhaustion. We captured a process dump at the time, but it was inconclusive: a dump provides a static snapshot of thread states (Waiting/Running) but cannot capture CPU contention dynamics. It showed threads waiting on locks, which could have been just a symptom.
Relevant Code
- The usage of ReaderWriterLockSlim on the hot path:
  _lifetimesLock.EnterReadLock();
- The "Reaper" task taking the write lock:
  _lifetimesLock.EnterWriteLock();
- GetOrCreateLifetimeAndIncrementLeaseCount taking the write lock to initialize a missing metric (see the paraphrased sketch below):
  _lifetimesLock.EnterWriteLock();
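Put together, the pattern these call sites describe looks roughly like the following paraphrase. Everything except _lifetimesLock, GetOrCreateLifetimeAndIncrementLeaseCount and TakeExpiredLeases (the Lifetime class, the string key, the field layout) is an illustrative placeholder, not the actual prometheus-net source:

using System;
using System.Collections.Generic;
using System.Threading;

// Paraphrased sketch of the locking pattern described above.
public sealed class LifetimeTableSketch
{
    private sealed class Lifetime { public int LeaseCount; public DateTime LastLeaseReturned; }

    private readonly ReaderWriterLockSlim _lifetimesLock = new ReaderWriterLockSlim();
    private readonly Dictionary<string, Lifetime> _lifetimes = new Dictionary<string, Lifetime>();

    public void GetOrCreateLifetimeAndIncrementLeaseCount(string key)
    {
        // Hot path: taken thousands of times per second by metric updates.
        _lifetimesLock.EnterReadLock();
        try
        {
            if (_lifetimes.TryGetValue(key, out var lifetime))
            {
                Interlocked.Increment(ref lifetime.LeaseCount);
                return;
            }
        }
        finally { _lifetimesLock.ExitReadLock(); }

        // Miss: the hot path itself becomes a writer (case b above).
        _lifetimesLock.EnterWriteLock();
        try
        {
            if (!_lifetimes.TryGetValue(key, out var lifetime))
                _lifetimes[key] = lifetime = new Lifetime();
            Interlocked.Increment(ref lifetime.LeaseCount);
        }
        finally { _lifetimesLock.ExitWriteLock(); }
        // (Lease release / LastLeaseReturned update omitted for brevity.)
    }

    public void TakeExpiredLeases(TimeSpan expireAfter)
    {
        // The "Reaper": an infrequent but global writer.
        _lifetimesLock.EnterWriteLock();
        try
        {
            foreach (var key in new List<string>(_lifetimes.Keys))
            {
                var lifetime = _lifetimes[key];
                if (lifetime.LeaseCount == 0 && DateTime.UtcNow - lifetime.LastLeaseReturned > expireAfter)
                    _lifetimes.Remove(key);
            }
        }
        finally { _lifetimesLock.ExitWriteLock(); }
    }
}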
Suggested Fixes
- Alternative Primitives: Consider a standard lock (Monitor or the .NET 9+ System.Threading.Lock). While a standard lock also blocks, it lacks the strict "Writer Preference" that dams up readers before acquisition, which might reduce the severity of the queue pile-up. Ironically, ReaderWriterLockSlim replaced a standard lock in "Rewrite lifetime tracking in ManagedLifetimeMetricHandle to be leaner", part of Performance optimizations #458, but we could not find any actual issue explaining why that change was made.
- Granularity: Use finer-grained locking (ConcurrentDictionary) or lock-free structures for the lease handles to avoid a global lock on the hot path (a sketch follows this list).
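To make the granularity suggestion concrete, here is a rough sketch under the assumption that each lease handle only needs an atomic counter and a last-returned timestamp; the LeaseTable/LeaseEntry names are illustrative, not proposed API:

using System;
using System.Collections.Concurrent;
using System.Threading;

// Illustrative sketch only: a lock-free lease table using ConcurrentDictionary,
// so the hot path never waits behind a global writer.
public sealed class LeaseTable
{
    private sealed class LeaseEntry
    {
        public int LeaseCount;          // active leases on this metric child
        public long LastReturnedTicks;  // set when LeaseCount drops to zero
    }

    private readonly ConcurrentDictionary<string, LeaseEntry> _entries =
        new ConcurrentDictionary<string, LeaseEntry>();

    public void AcquireLease(string key)
    {
        // Lock-free hot path: create-if-missing and increment without a global lock.
        var entry = _entries.GetOrAdd(key, _ => new LeaseEntry());
        Interlocked.Increment(ref entry.LeaseCount);
    }

    public void ReleaseLease(string key)
    {
        if (_entries.TryGetValue(key, out var entry) &&
            Interlocked.Decrement(ref entry.LeaseCount) == 0)
        {
            Interlocked.Exchange(ref entry.LastReturnedTicks, DateTime.UtcNow.Ticks);
        }
    }

    public void ReapExpired(TimeSpan expireAfter)
    {
        // The reaper only touches individual entries; readers are never blocked globally.
        var cutoff = DateTime.UtcNow - expireAfter;
        foreach (var pair in _entries)
        {
            var entry = pair.Value;
            if (Volatile.Read(ref entry.LeaseCount) == 0 &&
                new DateTime(Interlocked.Read(ref entry.LastReturnedTicks)) < cutoff)
            {
                // Note: a real implementation must handle the race where a new lease
                // is taken between this check and the removal.
                _entries.TryRemove(pair.Key, out _);
            }
        }
    }
}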
A Workaround
Please confirm that it is possible to effectively disable the Reaper task by setting
MeterAdapterOptions.MetricsExpireAfter = Timeout.InfiniteTimeSpan;
and that this is a safe thing to do for an application that suppresses Debug Metrics but collects Process Metrics, Event Counters and Meters with the configuration below (a combined sketch follows the snippet)?
Prometheus.Metrics.SuppressDefaultMetrics(new SuppressDefaultMetricOptions
{
SuppressDebugMetrics = true,
SuppressProcessMetrics = false,
SuppressEventCounters = false,
SuppressMeters = false
});
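For clarity, the combined configuration we have in mind would look roughly like this; we are assuming Metrics.ConfigureMeterAdapter is the intended way to apply MeterAdapterOptions.MetricsExpireAfter:

using System.Threading;
using Prometheus;

public static class MetricsSetup
{
    public static void Configure()
    {
        // Keep process metrics, event counters and meters; drop only debug metrics.
        Metrics.SuppressDefaultMetrics(new SuppressDefaultMetricOptions
        {
            SuppressDebugMetrics = true,
            SuppressProcessMetrics = false,
            SuppressEventCounters = false,
            SuppressMeters = false
        });

        // Never expire meter-derived metrics, so the Reaper has nothing to sweep.
        // (Assumption: ConfigureMeterAdapter is the right entry point for this option.)
        Metrics.ConfigureMeterAdapter(options =>
        {
            options.MetricsExpireAfter = Timeout.InfiniteTimeSpan;
        });
    }
}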