Error managing cluster: Failed to obtain a DB connection from data source 'default'

For the last few mornings, I have been getting the following error in my logs:

2023-02-23 09:04:29.962 -04:00 [ERR] Error managing cluster: Failed to obtain DB connection from data source 'default': Npgsql.NpgsqlException (0x80004005): Exception while connecting
 ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (10055): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. 127.0.0.1:5432
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.RawOpen(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.<>c__DisplayClass38_0.<<Rent>g__RentAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.<>c__DisplayClass41_0.<<Open>g__OpenAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.Open()
   at Quartz.Impl.AdoJobStore.JobStoreSupport.GetConnection()
Quartz.JobPersistenceException: Failed to obtain DB connection from data source 'default': Npgsql.NpgsqlException (0x80004005): Exception while connecting
 ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (10055): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. 127.0.0.1:5432
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.RawOpen(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.<>c__DisplayClass38_0.<<Rent>g__RentAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.<>c__DisplayClass41_0.<<Open>g__OpenAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.Open()
   at Quartz.Impl.AdoJobStore.JobStoreSupport.GetConnection()
 ---> Npgsql.NpgsqlException (0x80004005): Exception while connecting
 ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (10055): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. 127.0.0.1:5432
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.RawOpen(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.<>c__DisplayClass38_0.<<Rent>g__RentAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.<>c__DisplayClass41_0.<<Open>g__OpenAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.Open()
   at Quartz.Impl.AdoJobStore.JobStoreSupport.GetConnection()
   --- End of inner exception stack trace ---
   at Quartz.Impl.AdoJobStore.JobStoreSupport.GetConnection()
   at Quartz.Impl.AdoJobStore.JobStoreSupport.DoCheckin(Guid requestorId, CancellationToken cancellationToken)
   at Quartz.Impl.AdoJobStore.ClusterManager.Manage() [See nested exception: Npgsql.NpgsqlException (0x80004005): Exception while connecting
 ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (10055): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full. 127.0.0.1:5432
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.Connect(NpgsqlTimeout timeout)
   at Npgsql.NpgsqlConnector.RawOpen(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.ConnectorPool.<>c__DisplayClass38_0.<<Rent>g__RentAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.<>c__DisplayClass41_0.<<Open>g__OpenAsync|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Npgsql.NpgsqlConnection.Open()
   at Quartz.Impl.AdoJobStore.JobStoreSupport.GetConnection()]

I’ll upload the full log here

I don’t know much about PostgreSQL, but I cannot find any reference to ‘default’ within it. I also don’t know what buffer space it is referring to or why it would be running out. So far the only fix is to restart the server, after which it works fine for about 24 hours before I start getting an HTTP 502 error when trying to access the server.

Is there any way to fix this issue?

Do you have the option to update Survey Solutions to the latest version? If so, please do. It might help in your case.

OK, I ended up upgrading Survey Solutions and PostgreSQL to their latest versions, but I’m having the exact same issue.

How many nodes are in your cluster?
You might need to increase max_connections and shared_buffers for your DB server. The default DB values are not sufficient for a multi-node environment:

max_connections = 100
shared_buffers = 24MB

You can use a simple multiplier based on the number of nodes in your cluster.
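
As a rough illustration of the multiplier idea, here is a minimal Python sketch. It assumes the baseline values quoted above (max_connections = 100, shared_buffers = 24MB) and a plain linear scale by node count; the function names are made up for this example, and the right values for a real server depend on its memory and workload.

```python
def scale_postgres_settings(node_count, base_max_connections=100,
                            base_shared_buffers_mb=24):
    """Scale the two baseline settings linearly by the number of nodes.

    The baselines are the values quoted in this thread (an assumption,
    not a recommendation for any particular server).
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return {
        "max_connections": base_max_connections * node_count,
        "shared_buffers": f"{base_shared_buffers_mb * node_count}MB",
    }


def render_conf(settings):
    """Format the settings as 'name = value' lines for postgresql.conf."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items())


# Example: print the scaled lines for a hypothetical three-node cluster.
print(render_conf(scale_postgres_settings(3)))
```

With a single node the multiplier is 1, so the values stay at the baseline.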

It’s just one node so it shouldn’t be an issue since it was working fine before.

After some investigation I’ve found that I’m getting Tcpip event 4231 in Event Viewer, which basically says that all TCP ports have been used up. I’ll have to figure out what is causing this. Any ideas?

EDIT: I’ve found that the issue is being caused by a network monitoring process on the server. I’ll have to contact my host to fix that issue.
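
For anyone tracing a similar Tcpip event 4231, one possible way to see which process is holding the ports is to count TCP connections per process ID in the output of `netstat -ano` on Windows. The Python sketch below is only an illustration: it assumes the usual five-column TCP rows (protocol, local address, foreign address, state, PID), and the function names are invented for this example.

```python
import subprocess
from collections import Counter


def count_tcp_by_pid(netstat_output):
    """Count TCP rows per owning PID in `netstat -ano` style text.

    Assumes TCP rows have five whitespace-separated fields:
    proto, local address, foreign address, state, PID.
    Header lines and UDP rows are skipped.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "TCP" and parts[4].isdigit():
            counts[int(parts[4])] += 1
    return counts


def run_netstat():
    """Run `netstat -ano` and return its text output (Windows)."""
    result = subprocess.run(["netstat", "-ano"], capture_output=True,
                            text=True, check=True)
    return result.stdout


# Usage on the affected server (not executed here):
#   for pid, n in count_tcp_by_pid(run_netstat()).most_common(5):
#       print(pid, n)
```

A PID with an unusually high count can then be looked up in Task Manager or with `tasklist` to identify the process.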

@dtrotman, it would help other users experiencing the same symptoms if you could mention specifically the “network monitoring process on the server” and the host name that is causing the error.

The specific process is the Advanced Monitoring Agent which is a part of N-Able remote monitoring and management software. My server host suspects that it might be a memory leak or some other issue but they have not resolved it as yet.

@dtrotman,

thank you very much for supplying this information.
I hope you will be able to get that fixed.

Best, Sergiy