
ManagedLifetimeMetricHandle locks ReaderWriterLockSlim, causing reader lock contention, high CPU and thread pool exhaustion. #507

@baal2000

Description


We have observed a severe performance issue in our high-throughput application using prometheus-net (v8.2.0+).

The issue stems from the usage of ReaderWriterLockSlim in ManagedLifetimeMetricHandle to protect metric leases. ReaderWriterLockSlim enforces a Writer Preference policy. This means that when the background "Reaper" task requests the write lock to clean up expired metrics, it immediately blocks all new concurrent readers (metric updates), even before the writer has acquired the lock.
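
For context, here is a minimal sketch of the locking pattern as we understand it. This is an illustrative reconstruction, not the actual prometheus-net source; the type and member names are simplified stand-ins.

    // Illustrative reconstruction of the locking pattern described above, not the
    // actual prometheus-net source; names are simplified stand-ins.
    using System.Collections.Generic;
    using System.Threading;

    sealed class Lifetime
    {
        public int LeaseCount;
    }

    sealed class LifetimeRegistry
    {
        private readonly ReaderWriterLockSlim _lock = new();
        private readonly Dictionary<string, Lifetime> _lifetimes = new();

        // Hot path: metric updates take only the read lock and bump the lease count.
        public Lifetime GetOrCreateAndIncrementLease(string id)
        {
            _lock.EnterReadLock();              // blocks as soon as any writer is queued
            try
            {
                if (_lifetimes.TryGetValue(id, out var existing))
                {
                    Interlocked.Increment(ref existing.LeaseCount);
                    return existing;
                }
            }
            finally
            {
                _lock.ExitReadLock();
            }

            // Slow path: initialize a new lifetime under the write lock.
            _lock.EnterWriteLock();
            try
            {
                if (!_lifetimes.TryGetValue(id, out var lifetime))
                    _lifetimes[id] = lifetime = new Lifetime();
                Interlocked.Increment(ref lifetime.LeaseCount);
                return lifetime;
            }
            finally
            {
                _lock.ExitWriteLock();
            }
        }

        // Infrequent writer: the background "Reaper" removes expired lifetimes.
        public void RemoveExpired(IEnumerable<string> expired)
        {
            _lock.EnterWriteLock();             // requesting this dams up all new readers
            try
            {
                foreach (var id in expired)
                    _lifetimes.Remove(id);
            }
            finally
            {
                _lock.ExitWriteLock();
            }
        }
    }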

The Mechanism of Failure:

  • High Throughput Readers: Our application calls GetOrCreateLifetimeAndIncrementLeaseCount (acquiring the Read Lock) thousands of times per second.
  • Suspected infrequent Writers:
    a) The background "Reaper" task (TakeExpiredLeases) wakes up periodically to clean up expired metrics and calls EnterWriteLock().
    b) GetOrCreateLifetimeAndIncrementLeaseCount calls EnterWriteLock() to initialize a new metric if it is not found.
  • The "Dam" Effect: As soon as the Writer requests the lock, ReaderWriterLockSlim blocks all new Readers to prevent writer starvation (a minimal repro of this behavior follows this list).
  • The Stampede and Kernel Thrashing: Due to high throughput, thousands of reader threads queue up instantly while waiting for the Writer to enter and complete. This stampede causes extreme CPU contention (kernel time) as threads spin and call Thread.Yield / do_sched_yield, leading to thread pool starvation and application unresponsiveness.
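
A minimal repro of the dam effect (an assumed standalone demonstration, not code taken from our application): once a writer is queued, a brand-new reader is refused even though only readers currently hold the lock.

    using System;
    using System.Threading;

    // Demonstrates that once a writer is queued on ReaderWriterLockSlim, brand-new
    // readers are refused even though only readers currently hold the lock.
    class WriterPreferenceDemo
    {
        static void Main()
        {
            var rwLock = new ReaderWriterLockSlim();
            using var readerHoldsLock = new ManualResetEventSlim();
            using var releaseReader = new ManualResetEventSlim();

            // An existing reader holds the lock (simulating the hot path).
            var reader = new Thread(() =>
            {
                rwLock.EnterReadLock();
                readerHoldsLock.Set();
                releaseReader.Wait();
                rwLock.ExitReadLock();
            });
            reader.Start();
            readerHoldsLock.Wait();

            // The "Reaper" requests the write lock and queues behind the reader.
            var writer = new Thread(() =>
            {
                rwLock.EnterWriteLock();
                rwLock.ExitWriteLock();
            });
            writer.Start();
            Thread.Sleep(200); // give the writer time to register its request

            // A brand-new reader is now turned away immediately: the pending writer dams it up.
            bool admitted = rwLock.TryEnterReadLock(0);
            Console.WriteLine($"New reader admitted while a writer is pending: {admitted}"); // False

            if (admitted) rwLock.ExitReadLock();
            releaseReader.Set();    // let the original reader finish so the writer can proceed
            reader.Join();
            writer.Join();
        }
    }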

Expected Behavior

Background maintenance tasks (like cleaning up expired metrics or adding a new metric) should not block the hot path of metric collection, or at least should not strictly prioritize themselves over the application's primary workload to the point of creating a denial of service.

Actual Behavior

The application enters a "death spiral." The write lock request effectively pauses the application's metric recording. The resulting queue of blocked threads causes the lock primitives to thrash the CPU scheduler upon release.

Evidence / Stack Traces

We captured the crash using Linux perf. The trace shows threads stuck in ReaderWriterLockSlim.SpinLock.EnterSpin, calling Thread.Sleep(1) (mapped to do_sched_yield), consuming 100% CPU.

Stack Trace:

kernel.kallsyms!do_sched_yield
...
System.Threading.ReaderWriterLockSlim+SpinLock.EnterSpin(...)
System.Threading.ReaderWriterLockSlim.TryEnterReadLockCore(...)
Prometheus.ManagedLifetimeMetricHandle...GetOrCreateLifetimeAndIncrementLeaseCount(...)
Prometheus.ManagedLifetimeMetricHandle....WithLease(...)
Prometheus.MeterAdapter.OnMeasurementRecorded(...)

CPU stats (screenshot attached)

The thread pool stats (screenshot attached)

The Linux perf trace (screenshots attached):

  • The starved readers
  • The Reaper's writer lock

Related Issues

This investigation may have identified the likely root cause of #499. In that report, we observed identical thread pool exhaustion. We captured a process dump at the time, but it was inconclusive: a dump provides a static snapshot of thread states (Waiting/Running) but cannot capture CPU contention dynamics. It showed threads waiting on locks, which could have been just a symptom.

Relevant Code

Suggested Fixes

  • Alternative Primitives: Consider a standard lock (Monitor or the .NET 9+ System.Threading.Lock). While a standard lock also blocks, it lacks the strict "Writer Preference" that dams up readers before acquisition, which might reduce the severity of the queue pile-up. Ironically, ReaderWriterLockSlim replaced the standard lock in "Rewrite lifetime tracking in ManagedLifetimeMetricHandle to be leaner" (part of Performance optimizations #458), but we could not find any actual issue that explains why this change was made.
  • Granularity: Use finer-grained locking (e.g., ConcurrentDictionary) or lock-free structures for the lease handles to avoid a global lock on the hot path (see the sketch after this list).
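
A sketch of the granularity suggestion, assuming a hypothetical lease tracker rather than the existing ManagedLifetimeMetricHandle internals: per-metric lease state lives in a ConcurrentDictionary, so the hot path never takes a global lock and the reaper removes idle entries one key at a time.

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    // Hypothetical sketch, not the existing prometheus-net implementation.
    sealed class LeaseTracker
    {
        private sealed class Entry
        {
            public int LeaseCount;
            public long LastReleasedTicks = DateTime.UtcNow.Ticks;
        }

        private readonly ConcurrentDictionary<string, Entry> _entries = new();

        // Hot path: lock-free get-or-add plus an interlocked increment.
        public void AcquireLease(string id)
        {
            var entry = _entries.GetOrAdd(id, _ => new Entry());
            Interlocked.Increment(ref entry.LeaseCount);
        }

        public void ReleaseLease(string id)
        {
            if (_entries.TryGetValue(id, out var entry))
            {
                Interlocked.Decrement(ref entry.LeaseCount);
                Interlocked.Exchange(ref entry.LastReleasedTicks, DateTime.UtcNow.Ticks);
            }
        }

        // Reaper: removes only unleased, idle entries without ever blocking the hot path.
        // Best-effort: a lease acquired between the check and TryRemove can race with the
        // removal, so a production version would need a handoff protocol for that case.
        public void RemoveExpired(TimeSpan expireAfter)
        {
            long cutoff = DateTime.UtcNow.Ticks - expireAfter.Ticks;
            foreach (var pair in _entries)
            {
                if (Volatile.Read(ref pair.Value.LeaseCount) == 0 &&
                    Interlocked.Read(ref pair.Value.LastReleasedTicks) < cutoff)
                {
                    _entries.TryRemove(pair.Key, out _);
                }
            }
        }
    }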

A Workaround

Please confirm that it is possible to effectively disable the Reaper task with

MeterAdapterOptions.MetricsExpireAfter = Timeout.InfiniteTimeSpan;

and that this is safe for an application that suppresses Debug Metrics but still collects Process Metrics, Event Counters and Meters:

        Prometheus.Metrics.SuppressDefaultMetrics(new SuppressDefaultMetricOptions
        {
            SuppressDebugMetrics = true,
            SuppressProcessMetrics = false,
            SuppressEventCounters = false,
            SuppressMeters = false
        });
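
For reference, the full wiring of that workaround might look like the sketch below. We are assuming the adapter options are applied via Metrics.ConfigureMeterAdapter; the exact configuration entry point may differ between prometheus-net versions.

    using System.Threading;
    using Prometheus;

    static class MetricsBootstrap
    {
        public static void Configure()
        {
            // Keep Process Metrics, Event Counters and Meters; drop only Debug Metrics.
            Metrics.SuppressDefaultMetrics(new SuppressDefaultMetricOptions
            {
                SuppressDebugMetrics = true,
                SuppressProcessMetrics = false,
                SuppressEventCounters = false,
                SuppressMeters = false
            });

            // Never expire meter-derived metrics, so the Reaper has nothing to clean up
            // and (we expect) never contends for the write lock on the hot path.
            // Assumes Metrics.ConfigureMeterAdapter accepts an Action<MeterAdapterOptions>.
            Metrics.ConfigureMeterAdapter(adapterOptions =>
            {
                adapterOptions.MetricsExpireAfter = Timeout.InfiniteTimeSpan;
            });
        }
    }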
