External description of the problem

If a subscribed Control Point (CP) stops processing incoming eventing TCP connections without outright rejecting them, eventing stops for all connected CPs. The device may also stop accepting action commands and need a restart to become usable again.

This only happens with devices that generate enough events (for example, an OpenHome Media Renderer, which generates at least one event per second). The value of "enough" depends on the exact libupnp configuration.

The situation is most easily simulated by activating a firewall on the host running the "faulty" CP. If the firewall is set to silently drop connections, all control points become unresponsive. (If the firewall is set to noisily reject connections, only the control point running on the affected host stops working.)

It can also be "naturally" triggered in a number of ways:

  • By turning off the Wi-Fi on a connected tablet, bringing it out of range, or forcibly powering it off.

  • By a buggy CP that stops processing the connections (it does happen).

Depending on the conditions, the device may become functional again after a (long) time, or not at all. This is also an issue for devices with a single control point connected, because the device may have to be restarted before it can be used again, even if the CP itself recovers.

Internal mechanism

Two factors influence the problem:

  • The setting of the UPNP_ENABLE_NOTIFICATION_REORDERING switch.

  • The timeout length for the outgoing GENA connect() calls.

The second value is the system default if no option is passed to configure (this probably depends on the system; it is on the order of 60 s). It is 5 s when --disable-blocking_tcp_connections is passed to configure.
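For reference, a connect() timeout of this kind is typically implemented with a non-blocking socket and select(). The following is a generic sketch of the mechanism, not libupnp code; the function name and the timeout constant are made up for the example:

#include <sys/socket.h>
#include <sys/select.h>
#include <fcntl.h>
#include <errno.h>

#define CONNECT_TIMEOUT_SECS 5	/* illustrative value only */

/* Generic sketch (not libupnp code) of a connect() with a timeout: the
 * socket is made non-blocking, connect() is started, and select() waits up
 * to CONNECT_TIMEOUT_SECS for the connection to complete. */
static int connect_with_timeout(int fd, const struct sockaddr *addr,
	socklen_t addrlen)
{
	fd_set wset;
	struct timeval tv = { CONNECT_TIMEOUT_SECS, 0 };
	int flags = fcntl(fd, F_GETFL, 0);
	int err = 0;
	socklen_t len = sizeof(err);

	fcntl(fd, F_SETFL, flags | O_NONBLOCK);
	if (connect(fd, addr, addrlen) == 0)
		return 0;	/* connected immediately */
	if (errno != EINPROGRESS)
		return -1;	/* immediate failure */
	FD_ZERO(&wset);
	FD_SET(fd, &wset);
	if (select(fd + 1, NULL, &wset, NULL, &tv) <= 0)
		return -1;	/* timed out: the peer neither accepted nor refused */
	/* The socket became writable: check whether the connect succeeded. */
	if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) != 0 || err != 0)
		return -1;
	return 0;
}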

The problem manifests itself in different ways depending on the parameters passed to configure.

In what follows, I assume that the device generates on the order of one event per second.

Default configuration

With the default configuration, the following section of code (around line 343 in gena/gena_device.c, inside the genaNotifyThread() routine, which is the worker routine for GENA tasks on the gSendThreadPool thread pool) ends up dooming the device. This happens independently of the connect() timeout value.

#ifdef UPNP_ENABLE_NOTIFICATION_REORDERING
	/*If the event is out of order push it back to the job queue */
	if (in->eventKey != sub->ToSendEventKey) {
		TPJobInit(&job, (start_routine) genaNotifyThread, input);
		TPJobSetFreeFunction(&job, (free_function) free_notify_struct);
		TPJobSetPriority(&job, MED_PRIORITY);
		/* Sleep a little before creating another thread otherwise if there is
		 * a lot of notifications to send, the device will take 100% of the CPU
		 * to create threads and push them back to the job queue. */
		imillisleep(1);
		ThreadPoolAdd(&gSendThreadPool, &job, NULL);
		freeSubscription(&sub_copy);
		HandleUnlock();
		return;
	}
#endif

What this code does is push an out-of-order task back into the pool for later processing.

In practice, what happens with a non-responsive control point is that one task gets stuck inside a connect() call. When the device adds another event, libupnp starts to busy-loop, mostly executing the above code:

  1. A thread gets a task from the pool.

  2. It sees that the task is out of order (newer than the one still waiting to be sent, which is stuck in its connect() call).

  3. It pushes the job back into the pool.

  4. goto 1 :)

This loop prevents the UpnpNotify() call from the device from acquiring the global lock, so events cease to be generated.

The imillisleep(1) call inside the loop (added in recent years, not part of the original code) is mostly useless because it happens with the lock held (it does avoid taking a full 100% of the CPU).

I am not sure why the thread calling UpnpNotify() does not get more chances to run when HandleUnlock() is called by the code above, but it is a fact that it almost never gets through (with a timeout of 5 s set on the connect() call, it does get to run every few seconds; I have verified that the thread remains blocked in HandleLock() for several seconds at a time in this situation).

Moving the imillisleep(1) call out of the locked section (just before the "return", as sketched below) does change the situation: the device is able to generate events, and they get sent to the working control points. What happens then is that the thread pool slowly fills up with events destined for the sleeping CP: because of the reordering above, the connect() calls are not attempted in parallel, they are queued, and as long as events are generated faster than the connect() calls time out, the pool fills up. Once the pool is full, everything stops again. Note that this cannot be fixed with any reasonable value for the connect() timeout: as long as events are generated faster than the timeout, the pool will fill up.
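For clarity, here is a minimal sketch of that relocation, based on the code quoted above (this is an illustration of the experiment, not the actual committed change):

#ifdef UPNP_ENABLE_NOTIFICATION_REORDERING
	/* If the event is out of order push it back to the job queue */
	if (in->eventKey != sub->ToSendEventKey) {
		TPJobInit(&job, (start_routine) genaNotifyThread, input);
		TPJobSetFreeFunction(&job, (free_function) free_notify_struct);
		TPJobSetPriority(&job, MED_PRIORITY);
		ThreadPoolAdd(&gSendThreadPool, &job, NULL);
		freeSubscription(&sub_copy);
		HandleUnlock();
		/* Sleep with the lock released, so that the thread calling
		 * UpnpNotify() gets a chance to acquire it. */
		imillisleep(1);
		return;
	}
#endif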

The conclusion of this part is that the current event reordering code is flawed in principle in the presence of a non-responding CP: if events are generated faster than the connect() calls time out, the device ends up disabled, either for a long time or permanently.

The code would need to be modified to queue events per-subscription, outside of the ThreadPool, and to limit the number of events queued for each subscription (discarding events if the CP does not respond).
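Purely as an illustration, a bounded per-subscription queue along these lines could look like the following sketch. All names (subscription_event_queue, queue_notify_event, MAX_QUEUED_EVENTS, ...) are made up for the example and are not actual libupnp identifiers:

#include <stdlib.h>

/* Hypothetical sketch of a bounded per-subscription event queue; the names
 * and types are illustrative, not actual libupnp code. */
#define MAX_QUEUED_EVENTS 10	/* arbitrary discard threshold */

struct queued_event {
	char *propertySet;	/* event body to send */
	int eventKey;		/* sequence number for this subscription */
	struct queued_event *next;
};

struct subscription_event_queue {
	struct queued_event *head;	/* oldest queued event */
	struct queued_event *tail;
	int count;		/* current queue length */
	int job_active;		/* non-zero while a notify job is in the thread pool */
};

/* Called (with the handle lock held) each time the device generates an event
 * for this subscription. At most one thread pool job per subscription is
 * active; further events wait in the queue, and the oldest one is discarded
 * when the CP cannot keep up. */
static void queue_notify_event(struct subscription_event_queue *q,
	struct queued_event *ev)
{
	if (q->count >= MAX_QUEUED_EVENTS) {
		/* The CP is not keeping up: drop the oldest queued event. */
		struct queued_event *old = q->head;
		q->head = old->next;
		if (q->head == NULL)
			q->tail = NULL;
		q->count--;
		free(old->propertySet);
		free(old);
	}
	ev->next = NULL;
	if (q->tail != NULL)
		q->tail->next = ev;
	else
		q->head = ev;
	q->tail = ev;
	q->count++;
	if (!q->job_active) {
		/* Submit a single notify job for this subscription. When it
		 * completes, it would pop the next queued event (if any) and
		 * resubmit itself, keeping events ordered and at most one
		 * outgoing connect() per subscription. */
		q->job_active = 1;
		/* ThreadPoolAdd(&gSendThreadPool, &job, NULL); */
	}
}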

Influence of configuration

libupnp configured only with --disable-notification_reordering

In this case, multiple events can be in flight to the sleeping CP. All threads in the pool get stuck after a dozen seconds (each new event freezes one thread, until the 1 min connect() timeout begins to free some). Practically no events can then be sent to healthy CPs. This is what I initially guessed was happening (I had missed the reordering code). Because events keep coming in but cannot be processed, the job count limit also ends up being reached.

libupnp configured with --disable-notification_reordering and --disable-blocking_tcp_connections

In this situation, tasks destined for the sleeping CP still use multiple threads, but things work as long as the event generation rate remains reasonable. With one event per second and a 5 s connect() timeout, approximately 5-6 threads are wasted at any given time (roughly the event rate multiplied by the timeout), and the rest work for the healthy CPs.

This is still not right in principle: a fast enough event rate would still overflow the pool, but the acceptable rate moves into a practical range, since the time a thread takes to time out drops from 60 s to 5 s, and a dozen or so connect() attempts can run in parallel.

Events are not guaranteed to be delivered in order (but in practice they almost always would be).

This does seem to be the best default configuration for the library. Adding an interface to set the different timeouts would be useful.
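Purely as an illustration, such an interface might look like the hypothetical declarations below; these functions do not exist in libupnp:

/* Hypothetical API sketch only, not part of libupnp. */
/* Timeout, in seconds, for outgoing GENA connect() calls. */
int UpnpSetEventConnectTimeout(int seconds);
/* Timeout, in seconds, for sending the event body once connected. */
int UpnpSetEventSendTimeout(int seconds);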

Implemented solution

The implemented solution changes the way event tasks are queued for each subscription. It avoids misusing the thread pool for queuing, using instead an explicit per-subscription event queue. Only one active event per subscription is ever inserted in the thread pool, and events may be discarded if the CP fails to keep up.

This was implemented in the two following commits: