Skip to content

Possible race condition on DotNetStats #173

@ellull

Description

@ellull

Hello,

We have a bunch of ASP.NET Core applications and we are using the prometheus-net.AspNetCore nuget to report metrics. Most of the applications are running on Ubuntu Linux 16.04 as self-contained applications.

For high availability of the metrics system, we hace a pair of Prometheus servers which scrape all the applications.

We noticed that randomly the metric "process_open_handles" is reported as 0 which we know that it is not possible (a process always have some open file).

We suspected that when the two Prometheus servers scrape the same application at the same time (of very close) is when the "process_open_handles" is reported as 0, we've written a minimal reproducer which seems to confirm our suspects:

using Prometheus;
using System;
using System.Collections.Generic;
using System.Diagnostics;

namespace HandleCount
{
    class Program
    {
        static void Main(string[] args)
        {
            Metrics.SuppressDefaultMetrics();
            DotNetMetrics.Register(Metrics.DefaultRegistry);
            var metricServer = new KestrelMetricServer(port: 9303);
            metricServer.Start();

            Console.WriteLine("Press ENTER to exit");
            Console.ReadLine();
        }
    }

    /// <summary>
    /// Collects basic .NET metrics about the current process. This is not meant to be an especially serious collector,
    /// more of a producer of sample data so users of the library see something when they install it.
    /// </summary>
    public sealed class DotNetMetrics
    {
        /// <summary>
        /// Registers the .NET metrics in the specified registry.
        /// </summary>
        public static void Register(CollectorRegistry registry)
        {
            var instance = new DotNetMetrics(registry);
            registry.AddBeforeCollectCallback(instance.UpdateMetrics);
        }

        private readonly Process _process;
        private readonly List<Counter.Child> _collectionCounts = new List<Counter.Child>();
        private Gauge _totalMemory;
        private Gauge _virtualMemorySize;
        private Gauge _workingSet;
        private Gauge _privateMemorySize;
        private Counter _cpuTotal;
        private Gauge _openHandles;
        private Gauge _startTime;
        private Gauge _numThreads;

        private DotNetMetrics(CollectorRegistry registry)
        {
            _process = Process.GetCurrentProcess();
            var metrics = Metrics.WithCustomRegistry(registry);

            var collectionCountsParent = metrics.CreateCounter("dotnet_collection_count_total", "GC collection count", new[] { "generation" });

            for (var gen = 0; gen <= GC.MaxGeneration; gen++)
            {
                _collectionCounts.Add(collectionCountsParent.Labels(gen.ToString()));
            }

            // Metrics that make sense to compare between all operating systems
            // Note that old versions of pushgateway errored out if different metrics had same name but different help string.
            // This is fixed in newer versions but keep the help text synchronized with the Go implementation just in case.
            // See https://github.com/prometheus/pushgateway/issues/194
            // and https://github.com/prometheus-net/prometheus-net/issues/89
            _startTime = metrics.CreateGauge("process_start_time_seconds", "Start time of the process since unix epoch in seconds.");
            _cpuTotal = metrics.CreateCounter("process_cpu_seconds_total", "Total user and system CPU time spent in seconds.");

            _virtualMemorySize = metrics.CreateGauge("process_virtual_memory_bytes", "Virtual memory size in bytes.");
            _workingSet = metrics.CreateGauge("process_working_set_bytes", "Process working set");
            _privateMemorySize = metrics.CreateGauge("process_private_memory_bytes", "Process private memory size");
            _openHandles = metrics.CreateGauge("process_open_handles", "Number of open handles");
            _numThreads = metrics.CreateGauge("process_num_threads", "Total number of threads");

            // .net specific metrics
            _totalMemory = metrics.CreateGauge("dotnet_total_memory_bytes", "Total known allocated memory");

            var epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
            _startTime.Set((_process.StartTime.ToUniversalTime() - epoch).TotalSeconds);
        }

        private void UpdateMetrics()
        {
            try
            {
                _process.Refresh();

                for (var gen = 0; gen <= GC.MaxGeneration; gen++)
                {
                    var collectionCount = _collectionCounts[gen];
                    collectionCount.Inc(GC.CollectionCount(gen) - collectionCount.Value);
                }

                _totalMemory.Set(GC.GetTotalMemory(false));
                _virtualMemorySize.Set(_process.VirtualMemorySize64);
                _workingSet.Set(_process.WorkingSet64);
                _privateMemorySize.Set(_process.PrivateMemorySize64);
                _cpuTotal.Inc(Math.Max(0, _process.TotalProcessorTime.TotalSeconds - _cpuTotal.Value));
                _openHandles.Set(_process.HandleCount);
                _numThreads.Set(_process.Threads.Count);

                if (_process.HandleCount == 0)
                {
                    Console.WriteLine("HandleCount = 0");
                }

            }
            catch (Exception)
            {
            }
        }
    }    
}

The DotNetMetrics class is a copy of the DotNetStats class but with a print statement when the _process.HandleCount is equal to 0.

If we use the apache benchmark tool using only one concurrent request, everything works fine: no print.
$ ab -n 1000 -c 1 http://localhost:9303/metrics

But if we increment the concurrency to just 2, some prints are executed:
$ ab -n 1000 -c 2 http://localhost:9303/metrics

Next we've modified the UpdateMetrics method to synchronize its execution:

private void UpdateMetrics()
        {
            lock (_process)
            {
                try
                {
                    _process.Refresh();

                    for (var gen = 0; gen <= GC.MaxGeneration; gen++)
                    {
                        var collectionCount = _collectionCounts[gen];
                        collectionCount.Inc(GC.CollectionCount(gen) - collectionCount.Value);
                    }

                    _totalMemory.Set(GC.GetTotalMemory(false));
                    _virtualMemorySize.Set(_process.VirtualMemorySize64);
                    _workingSet.Set(_process.WorkingSet64);
                    _privateMemorySize.Set(_process.PrivateMemorySize64);
                    _cpuTotal.Inc(Math.Max(0, _process.TotalProcessorTime.TotalSeconds - _cpuTotal.Value));
                    _openHandles.Set(_process.HandleCount);
                    _numThreads.Set(_process.Threads.Count);

                    if (_process.HandleCount == 0)
                    {
                        Console.WriteLine("HandleCount = 0");
                    }

                }
                catch (Exception)
                {
                }
            }
        }

With this modification no print is executed even using a concurrent requests in apache benchmark.

I don't know if synchronizing the UpdateMetrics method is the best solution but it clearly demonstrates that there is some kind of race condition here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions