Huge number of open file descriptors after days of running #12

Closed
noushi opened this issue Jul 14, 2016 · 15 comments

@noushi
Contributor

noushi commented Jul 14, 2016

Hi Will,

We've been running pg exporters for the last 2 weeks, and we noticed that they leaked file descriptors.

Here's a list of some pg exporters running on our site; the 2nd column is the number of open fds on the docker daemon (names have been masked):

prometheus-postgres-xxxxxxxxx-replica  18959
prometheus-postgres-xxxxxxxxx-master   11800
prometheus-postgres-xxxxxxxxx-master   11780
prometheus-postgres-xxxxxxxxx-master   11796
prometheus-postgres-xxxxxxxxx-master   11790
prometheus-postgres-xxxxxxxxx-master   11782
prometheus-postgres-xxxxxxxxx-master   11810
prometheus-postgres-xxxxxxxxx-replica  2868
prometheus-postgres-xxxxxxxxx-replica  2870
prometheus-postgres-xxxxxxxxx-replica  2875
prometheus-postgres-xxxxxxxxx-replica  2872
prometheus-postgres-xxxxxxxxx-replica  2862
prometheus-postgres-xxxxxxxxx-replica  2875

As a result, this exposes a docker bug: even after destroying those containers, the fds stay open on the docker daemon (marked as deleted).

Here's a list of open fd count on our DB servers:

docker_fd_count                      125996
docker_fd_count                      373688
docker_fd_count                      484407
docker_fd_count                      520493

The last one is at half a million open fds, and it runs our masters.

docker ps just hangs indefinitely.

To reproduce this bug, just run an exporter and wait; the open fds on the docker daemon will keep growing.
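
A simple way to watch this (assuming the default pid file at /var/run/docker.pid) is to poll the daemon's fd directory:

$ watch -n 60 'ls /proc/$(cat /var/run/docker.pid)/fd | wc -l'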

Thank you

@wrouesnel
Contributor

Thanks for the report, I'll look into it this weekend. Sounds like there's a query being improperly closed somewhere.

@wrouesnel
Contributor

Running the current release in a container as per the instructions I have in the README at the moment doesn't seem to cause any issues. My test setup:

$ docker run -d --net=host -e POSTGRES_PASSWORD=password postgres
$ docker run -d --net=host -e DATA_SOURCE_NAME="postgresql://postgres:password@localhost:5432/?sslmode=disable" -p 9113:9113 wrouesnel/postgres_exporter
$ while [ 1 ]; do wget -O - http://localhost:9113/metrics > /dev/null; done

Could you supply the output of lsof -p on the container process and the docker daemon? What does it seem to be holding open? Also, what version of docker are you using? My test system here is:

# docker -v
Docker version 1.11.2, build b9f10c9

Another source of debugging would be to check the exporter's own introspection data - the prometheus metrics should provide
process_max_fds and process_open_fds via http://localhost:9113/metrics, so you should be able to see if they're building up via prometheus (or just):

$ wget -O - http://localhost:9113/metrics | grep _fds

@noushi
Contributor Author

noushi commented Jul 16, 2016

Thank you for your reply.

I've left an exporter running for the last 2 days.

1e0014eecbaa        wrouesnel/postgres_exporter                                  "/postgres_exporter"     2 days ago          Up 2 days           0.0.0.0:9113->9113/tcp   pg_export_ipv6

At exporter start, we had:

docker_fd_count                    176
pg_export_ipv6                     6

fd count kept growing:

docker_fd_count                    227
pg_export_ipv6                     25

then

docker_fd_count                    293
pg_export_ipv6                     57

Now I have:

docker_fd_count                    5202
pg_export_ipv6                     2521

Over the container's lifespan, I've queried /metrics fewer than 10 times.

The exporter itself keeps few files open:

# wget -q -O - http://localhost:9113/metrics | grep _fds                                                                                                                                         
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 524288
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7
# docker -v
Docker version 1.11.1, build 5604cbe
# lsof -p <exporter_pid>
COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF       NODE NAME
postgres_ 16019 root  cwd    DIR  0,110     4096          2 /
postgres_ 16019 root  rtd    DIR  0,110     4096          2 /
postgres_ 16019 root  txt    REG  0,110 10936128         18 /postgres_exporter
postgres_ 16019 root    0r  FIFO    0,9      0t0 1155072868 pipe
postgres_ 16019 root    1w  FIFO    0,9      0t0 1155072869 pipe
postgres_ 16019 root    2w  FIFO    0,9      0t0 1155072870 pipe
postgres_ 16019 root    3u  sock    0,8      0t0 1155079186 can't identify protocol
postgres_ 16019 root    4u  0000   0,10        0       7715 anon_inode

Now on the docker daemon:

# lsof -p $(cat /var/run/docker.pid ) >pg_exporter_bug_info/lsof_p_docker_after_2_days.txt

# cat pg_exporter_bug_info/lsof_p_docker_after_2_days.txt  | wc -l
5235

# cat pg_exporter_bug_info/lsof_p_docker_after_2_days.txt | grep 1e0014eecbaa7110919c4bf45e794977eb1844eff349861d824c630627402815 | wc -l
2525

lsof_p_docker_after_2_days.txt is attached.

Thank you.

@wrouesnel
Contributor

So from the looks of that docker trace, what's being held open are docker's json log files, for some reason.

That looks like a docker issue you've run across rather than an exporter issue, as we never write to those files directly.

Try running the exporter with --log-driver=none; that will probably fix it in the short term, but docker shouldn't be behaving like this afaik.
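
With the same DSN as my test setup above, that would look something like:

$ docker run -d --net=host --log-driver=none -e DATA_SOURCE_NAME="postgresql://postgres:password@localhost:5432/?sslmode=disable" -p 9113:9113 wrouesnel/postgres_exporter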

@wrouesnel
Contributor

moby/moby#21231 looks like this might've been a docker problem up till quite recently...

@noushi
Contributor Author

noushi commented Jul 17, 2016

Indeed, it is odd that:

  1. such an issue occurred with this exporter, since it prints almost nothing;
  2. this issue occurs only with the postgres exporter.

I've started a new exporter with the log driver option, and I'll leave it for the weekend.

@noushi
Contributor Author

noushi commented Jul 18, 2016

Okay, so after a few days, I get this:

docker_fd_count                    2414
pg_export_ipv6                     1144

A similar run on another server gives:

docker_fd_count                    2314
pg_export_ipv6                     1074

@noushi
Contributor Author

noushi commented Jul 20, 2016

Okay, I think I found the issue: postgres_exporter runs as PID 1 inside the container, but it is not init-compliant.

I used the same binary with tini on two images based on:

  • krallin/ubuntu-tini:14.04
  • alpine:3.4 + apk install tini

And these are the results:

docker_fd_count                     5596
pg_export_ipv6-alpine_tini-release  4      <-- alpine:3.4 + apk install tini + the yaml queries patch
pg_export_ipv6-tini2                4      <-- alpine:3.4 + apk install tini
pg_export_ipv6-stock2               131    <-- wrouesnel/postgres_exporter
pg_export_ipv6-tini                 4      <-- krallin/ubuntu-tini:14.04
pg_export_ipv6                      2580
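
For reference, the alpine + tini variant was built roughly like this (a sketch; the exact Dockerfile, tini path and flags may differ slightly):

cat > Dockerfile <<'EOF'
FROM alpine:3.4
# the alpine package installs tini at /sbin/tini
RUN apk add --no-cache tini
COPY postgres_exporter /postgres_exporter
EXPOSE 9113
# tini runs as PID 1, forwards signals to the exporter and reaps children
ENTRYPOINT ["/sbin/tini", "--", "/postgres_exporter"]
EOF
docker build -t pg_export_ipv6-tini2 .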

@wrouesnel
Contributor

wrouesnel commented Jul 20, 2016

This still doesn't seem entirely right to me. init just requires dealing with zombie child processes as far as I'm aware. I'm not sure how switching init systems can cause problems here, because in postgres_exporter we don't fork child processes at all - we're not even connected to the docker json log files.

Looking at the lsof output you supplied again, all those open files look like they're connected to separate container instantiations somehow (the hash changes each time). Can I ask what distribution and kernel version (uname -a) you're running on any machine exhibiting this problem?

EDIT: Something else I'm not clear on: what command is producing the result values you've been printing? i.e. for pg_export_ipv6, where is that value derived from?

@noushi
Contributor Author

noushi commented Jul 20, 2016

Checking the tini source code, we can see that it deals not only with child process management but also with signal management. The lack of the latter is what I suspect is causing the issue.

A process running with PID 1 is handled differently by the kernel, both child- and signal-wise, and this has been documented and addressed in various places.

As for why this is happening, I think running a custom container that only reports signal events as it receives them would shed more light, but I can't do that personally at present.
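
A minimal sketch of what I mean, as a shell entrypoint that does nothing but log the signals it receives:

#!/bin/sh
# report-only entrypoint: trap a handful of signals and just log them
for sig in HUP INT QUIT TERM USR1 USR2 CHLD; do
  # $sig is expanded now, so each trap reports its own signal name
  trap "echo received SIG$sig" "$sig"
done
echo "signal reporter running as pid $$"
while :; do
  sleep 1
done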

I have to say that the postgres exporter is not the only image we're running without an init; there are a few others. They're Ruby-based, and the only relevant feature they have is explicit SIGTERM, SIGINT and SIGUSR1 handling.

Regarding the command I use to report open fds, it's a script that iterates over all existing containers and counts container-specific fds by looking at /proc/$(cat /var/run/docker.pid)/fd. Nothing fancy.
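
Roughly, it's a sketch like this (the exact formatting and matching in our version differ a bit):

#!/bin/sh
DOCKER_PID=$(cat /var/run/docker.pid)
# total fds held by the docker daemon
printf '%-36s %s\n' docker_fd_count "$(ls /proc/$DOCKER_PID/fd | wc -l)"
# per-container count: daemon fds whose symlink target mentions the container id
for id in $(docker ps -q --no-trunc); do
  name=$(docker inspect --format '{{.Name}}' "$id" | sed 's|^/||')
  count=$(ls -l /proc/$DOCKER_PID/fd 2>/dev/null | grep -c "$id")
  printf '%-36s %s\n' "$name" "$count"
done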

I just ran the script now, and here's the report:

docker_fd_count                     9581
prometheus-configurator-2           4
prometheus-initiator                4
prometheus-cadvisor                 4
pg_export_ipv6-alpine_tini-release  4
pg_export_ipv6-tini2                4
pg_export_ipv6-stock2               1126
pg_export_ipv6-tini                 4
pg_export_ipv6                      3574

(initiator and configurator are the Ruby-based containers I'm running)

uname -a:
Linux tiefighter11 3.19.0-59-generic #66~14.04.1-Ubuntu SMP Fri May 13 17:27:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

We're using Ubuntu Server 14.04 LTS.

That said, the tini PR has a side benefit: I can finally run docker exec -it pg_exporter /bin/sh and dig into the container when there's a problem.

Thank you!

@wrouesnel
Contributor

So I've been trying to replicate what you're seeing, and I just can't:

$ docker ps
77db2f3ff853        postgres                            "/docker-entrypoint.s"   17 hours ago        Up 17 hours                                    tender_brahmagupta
6cde27410eb0        wrouesnel/postgres_exporter         "/postgres_exporter"     17 hours ago        Up 17 hours         0.0.0.0:9113->9113/tcp     gloomy_ardinghelli

$ ls /proc/$(ps -ef | grep '/usr/bin/docker daemon' | head -n1 | tr -s ' ' | cut -d' ' -f2) | wc -l
49

$ uname -a
Linux will-desktop 4.6.0-040600-generic #201605151930 SMP Sun May 15 23:32:59 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

It very much seems like there is something broken with docker's pipe handling on your specific setup somehow (the container not being killed seems to point in that direction too).

@noushi
Contributor Author

noushi commented Jul 21, 2016

Yes, this is definitely specific to our setup, but since we're running production DB servers as containers, we can't just restart or upgrade Docker.

@wrouesnel
Contributor

Closing this unless it hits someone else or I can figure out a replication.

@jalenplayvs

jalenplayvs commented Dec 1, 2020

I am having a similar issue using AWS ECS Fargate. Eventually postgres_exporter stops collecting and spams logs with:

2020/11/02 17:26:37 http: Accept error: accept tcp [::]:9187: accept4: too many open files; retrying in 1s

I have Prometheus in 10 different environments; this bug appears in all of them after a similar amount of uptime.

@graudeejs

Apr 26 19:46:37 server[19974]: 2022/04/26 19:46:37 http: Accept error: accept tcp 127.0.0.1:10222: accept4: too many open files; retrying in 1s
Apr 26 19:46:38 server[19974]: 2022/04/26 19:46:38 http: Accept error: accept tcp 127.0.0.1:10222: accept4: too many open files; retrying in 1s

Got this as well on a server (no docker).
