Bug #4226

SAT>IP with RTP/AVP/TCP unclosed connections

Added by Mono Polimorph 7 months ago. Updated 16 days ago.

Status:FixedStart date:2017-02-13
Priority:NormalDue date:
Assignee:-% Done:

100%

Category:SAT>IP
Target version:-
Found in version:last commit (2017-02-13) Affected Versions:

Description

Hi,

I found a fully reproducible bug using two instances of Tvheadend connected throught SAT>IP protocol over TCP.

Here the description of the enviroment:

- One Tvheadend instance exporting two tuners using the SAT>IP protocol.
- Another remote TVheadend instance connecting to the first over Internet (fast connection) forcing the discovery with the "--satip_xml" parameter.
- The SAT>IP client in the remote server are configured with "RTP/AVP/TCP (embedded data)" enabled. This forces to use only the TCP channel for both: RTSP commands and TS Streaming in an interlaved mode.

All works as expected... great! Until you saturate the link. In this case, the TCP socket in the remote server (the client) closes, and it tries to open another TCP socket with the second tuner (I have configured 2 tuners in the main server).

This isn't a problem. However, in the main server the first connection continues open (I see it in the UI). Also it appears the new one. But I can't manually close the connection (forcing close in the UI doesn't work). Futhermore, when three connections are open (two blocked, and a third running) as no more tuners are available the main server goes to an unusable state (it works, but I can't use any tuner or stablish new SAT>IP connections).

It's very frustrating! Please, can you fix this error?
You can reproduce this error, and I checked it with different releases (now with the last release from today compiled inhouse).

SATIP_Fixes_RTCP_errors_with_RTP_over_TCP.diff.txt Magnifier - Patch (865 Bytes) Mono Polimorph, 2017-08-02 22:45

SATIP-Add-socket_id-in-logs.diff Magnifier - Patch (2.34 KB) Mono Polimorph, 2017-08-07 17:31

SAT-IP-Fix-Interlaved-TCP-deadlock.diff Magnifier - Patch (979 Bytes) Mono Polimorph, 2017-08-09 17:31

Associated revisions

Revision 2582cde1
Added by Jaroslav Kysela about 1 month ago

satip server: handle TCP socket return code, destroy session for RTP/TCP on close, fixes #4226

Revision 052874ee
Added by Jaroslav Kysela about 1 month ago

satip server: add shutdown before rtsp_flush_requests() in rtsp_serve(), fixes #4226

Revision 97458ba6
Added by Jaroslav Kysela about 1 month ago

satip server: RTP handle more TCP errors, fixes #4226

From: Mono Polimorph

Revision 09607912
Added by Jaroslav Kysela about 1 month ago

satip server: try to shutdown the socket before RTP thread is called on RTSP timeout, fixes #4226

Revision 28386e99
Added by Jaroslav Kysela about 1 month ago

satip server: rtp - show the correct TCP port number in debug msg for streaming close, fixes #4226

Revision ebbbc3b1
Added by Jaroslav Kysela 24 days ago

http/satip server: change RTP/TCP data queuing to avoid dead-locks, fixes #4226

Revision b24c8e98
Added by Jaroslav Kysela 3 days ago

satip server: handle TCP socket return code, destroy session for RTP/TCP on close, fixes #4226

Revision cc279990
Added by Jaroslav Kysela 3 days ago

satip server: add shutdown before rtsp_flush_requests() in rtsp_serve(), fixes #4226

Revision 431e28e6
Added by Jaroslav Kysela 3 days ago

satip server: RTP handle more TCP errors, fixes #4226

From: Mono Polimorph

Revision 1793e280
Added by Jaroslav Kysela 3 days ago

satip server: try to shutdown the socket before RTP thread is called on RTSP timeout, fixes #4226

Revision 349a9f8e
Added by Jaroslav Kysela 3 days ago

satip server: rtp - show the correct TCP port number in debug msg for streaming close, fixes #4226

Revision a12c166b
Added by Jaroslav Kysela 3 days ago

http/satip server: change RTP/TCP data queuing to avoid dead-locks, fixes #4226

History

#1 Updated by Mono Polimorph 7 months ago

Hi,

Sorry, this problem doen't really exists in the last builds! It's fixed.

Please, close this issues.
Thank you!

#2 Updated by Jaroslav Kysela 7 months ago

  • Status changed from New to Invalid

#3 Updated by Mono Polimorph 7 months ago

Hi Jaroslav,

Please, don't close this ticket... after more testing the problem continues (I use the last commit release). It's the same problem described here: http://www.tvheadend.org/issues/4207#note-2

The trouble is like this:
If for some reason the RTSP connection (using a TCP socket) goes wrong, and the client reconnects, then the OLD connection continues active in the TVheadend, and a new connection appears. This creates some interference, and after some time (I feel after a new error in the TCP connection), the NEW connetion also fails. Then a new other connection is created. After several wrong connections the UI of TVheadend goes unresponsive. Then, if you like to close some connection in the UI (Status-Connections) it doesn't work (you accept, but the connetion isn't closed).

I'm doing some debugging to reproduce the problem. Any idea?

#4 Updated by Jaroslav Kysela 7 months ago

The SAT>IP is not basically useable when you have no enough bandwidth. TVH has no protection against these situations especially when you use AVP/RTP/TCP transfer protocol (TCP only). TVH tries to flush the data in queue and it may cause the stalls. For UDP, the packets are dropped by routers.

The TCP disconnection does not mean to close the streaming for RTSP. The client must send the SHUTDOWN command. But I see the problem for AVP/RTP/TCP - tvh should probably close streaming in this case, too.

#5 Updated by Mono Polimorph 7 months ago

Jaroslav Kysela wrote:

The SAT>IP is not basically useable when you have no enough bandwidth. TVH has no protection against these situations especially when you use AVP/RTP/TCP transfer protocol (TCP only). TVH tries to flush the data in queue and it may cause the stalls. For UDP, the packets are dropped by routers.

Hi Jaroslav,

Yes, I know the problem... with unsufficient bandwith the consequence is a lot of lost packets. Idem as with HTTP streaming of a pure TS stream. But the problem is not the errors in the streaming, but the unclosed sockets. This error implies a several "lost" connections in the server... until TVH goes locked. Then the only solution to continue working is restart the server.
So, I feel the code needs to clear all connections with unreliable TCP sockets. It's say: when any TCP sockets goes full instead of blocking the connection, you directly close the socket and clear the connection. You agree?

Jaroslav Kysela wrote:

The TCP disconnection does not mean to close the streaming for RTSP. The client must send the SHUTDOWN command. But I see the problem for AVP/RTP/TCP - tvh should probably close streaming in this case, too.

Yes. A regular behaviour is clear all connections when the socket disconnects. Futhermore, the RTSP protocol supports more than one connection (socket) for the same session. Then a client can use one socket for sending commands and another for receiving the streaming.

In any case, I suggest to autostop the streaming using AVP/RTP/TCP when the socket goes invalid... until some seconds (1 or 2, sufficient time for the client to reconnect).

I hope you can fix this problem soon! ;)

PS: Remember to change the status of the ticket... please!

#6 Updated by Mono Polimorph 7 months ago

Hi Jaroslav,

I do more tests, and when the TVH collapses this is the status:

- Both ports, RTSP 554 and HTTP 9981, contine working.

- When the problem appears, in the server with a 'netstat' command I see that the socket that has originated the problem is in a CLOSE_WAIT state... until the OS closes it. However, the TVH continues indicating that the connection exists.

- Using the API (http://TVH:9981/api/status/connections) I can see that ANY new connection creates a new connection, and neither are removed, even if the socket is closed. I do a simple check: "telnet TVH 554" and I write "GET / HTTP/1.1\n\n" this returns the 505 Unsupported HTML file, and it closes the socket; but the connection continues listed. Any new connection generates a new one.

Please, review the code. In the log I see this:

2017-02-17 08:25:01.633 [  DEBUG]:satips: -/56C1E24D/795: teardown from 192.168.1.31:19858
2017-02-17 08:25:18.766 [WARNING]:satips: -/56C1E24D/795: session closed (timeout)
2017-02-17 08:28:35.507 [  ERROR]:http: 192.168.224.33: HTTP/1.1 GET / -- 505
2017-02-17 08:54:47.811 [  ERROR]:http: 192.168.224.33: HTTP/1.1 GET / -- 505

The "teardown" is then executed, and the session is closed because a timeout... I feel the problem is here. FYI, the two HTTP gets are my telnet commands to check that the server socket continues active.

What more I can do to help you to fix this error?

#7 Updated by Mono Polimorph 7 months ago

Hi Jaroslav,

Please, can you change the status of the issue to "open"?
If not, I'll open a new one.

Thank you!

#8 Updated by Jaroslav Kysela 7 months ago

  • Status changed from Invalid to Accepted

#9 Updated by Jaroslav Kysela 7 months ago

It appears that the thread which handles the RTSP connection is not terminated correctly. Provide full backtrace when the situation occurs - https://tvheadend.org/projects/tvheadend/wiki/Debugging .

#10 Updated by Mono Polimorph 7 months ago

Jaroslav Kysela wrote:

It appears that the thread which handles the RTSP connection is not terminated correctly. Provide full backtrace when the situation occurs - https://tvheadend.org/projects/tvheadend/wiki/Debugging .

OK, I'll try to create it.

#11 Updated by Mono Polimorph 7 months ago

Hi Jaroslav,

I do some tests with the last commit (Revision 559ec46c):
http://www.tvheadend.org/projects/tvheadend/repository/revisions/559ec46c9614208c1e6aebeab2f744b91d27aa28/diff/src/tcp.c
But the problem persists.

In the meantime, until resolve this bug, I suggest to add a new SAT>IP server config: "Select RTP mode" (UDP or TCP or BOTH).
Now as a client the option "RTP/AVP/TCP (embedded data)" exists, but as server TVH allways accept both. Then why not select if you like to accept TCP and UDP, or only UDP (the standard), or TCP only (I know this has low sense, but it can). This can limit some troubles with some clients.

You agree to add this server option?

#12 Updated by Jaroslav Kysela 7 months ago

Provide the backtrace.

#13 Updated by Mono Polimorph about 1 month ago

Hi,

Finally, I found a method for reproduce the error. And seems that this bug is related to #4445.

Please, read first and configure an environment like this:

  • One regular SAT>IP client.
  • One instance of the TVH configured as a SAT>IP with two or more tuners.
  • One real SAT>IP server, this server runs as a server of the TVH instance.
  • Install the tool “tcpkill” (package “dsniff”) in the server running the TVH.

With this setup already configured execute these steps:

  1. Open the SAT>IP client and tune one frequency in the TVH.
  2. When the client is playing the program...
  3. See the for TCP sockets in the TVH server. Example:
    :~ # netstat -an | grep 554
    tcp        0      0 0.0.0.0:554            0.0.0.0:*               LISTEN
    tcp        0      0 192.168.1.42:51425     192.168.1.44:554        ESTABLISHED
    tcp        0      0 192.168.1.42:554       192.168.1.32:58882      ESTABLISHED
    
  4. Identify the socket used with the client and get the SOURCE port an the remote ADDRESS (here 42=TVH; 32=client; 44:SATIP server).
  5. Kill the socket with: “tcpkill -i eth0 host 192.168.1.32 and port 58882” (here the network interface in the TVH server is “eth0” and the address “192.168.1.32” is the address of the client and the port 58882 is the value returned by the netstat command).
  6. Then change the channel in the SAT>IP client.

This action (changing the channel) triggers a RSTP command over the TCP socket, that the tcpkill command interrupts. The result is that the socket is closed and the client reconnects. You can show it in the Web UI of the TVH the effect and the log:

2017-07-25 22:30:28.883 satips: -/BFCF2976/1: session closed (timeout)

But the subscription continues open... and only after a several minutes if the real SAT>IP server refuses the commands from the TVH, then the subscription closes:

2017-07-25 22:36:52.457 satip: SAT>IP DVB-S Tuner #2 (192.168.1.44) - RTSP error -90 (Message too long) [9874-0]
2017-07-25 22:36:52.750 subscription: 0023: service instance is bad, reason: Tuning failed

At the beginning I have associated this problem with a TCP buffer overflow, because the same problem appears when the TCP stream overkills the bandwidth. But this occurs because when the network gets saturated the TCP socket is closed. And the final result is the same: the socket is closed from the client without a proper shutdown.

So, as a summary: when the TCP socket of a RTSP session is closed without a proper shutdown, all running subscriptions in this RTSP session become invalid instead to be removed/released/closed.

Please, can you fix this problem?
It's a serious bug, as it can appear in a very different scenarios.
Thank you!

#14 Updated by Jaroslav Kysela about 1 month ago

If you see 'session closed (timeout)' then the session should be completely closed including subscriptions. It appears like a locking issue in src/satip/rtsp.c - rtsp_session_timer_cb(), but I don't see the problem for the first glance. Could you add more tvhinfo/tvhwarn lines to this fcn to see where the locking issue occurs?

#15 Updated by Mono Polimorph about 1 month ago

Jaroslav Kysela wrote:

If you see 'session closed (timeout)' then the session should be completely closed including subscriptions. It appears like a locking issue in src/satip/rtsp.c - rtsp_session_timer_cb(), but I don't see the problem for the first glance. Could you add more tvhinfo/tvhwarn lines to this fcn to see where the locking issue occurs?

Hi,

The message 'session closed (timeout)' appears some time after the client reconnects. Then a new session (identical) is created... and the "old" session will expire in the future.

I feel the problem isn't a race condition in the close of the session (in function rtsp_session_timer_cb() )... but a delayed close. Remember that when the client reconnects it uses the same stream and session identifiers. So the problem is that you create a new session and forget the previous. However, both are the same session. Why you create a new one? It's not preferable to reuse the same session? Or as a last result closing the session just when the socket is invalidated.

What you think?

#16 Updated by Mono Polimorph about 1 month ago

Hi,

Plaase, in order to search in the correct place... where are the code that writes to the TCP socket of the RTSP session? I want to review (and add logs) at this point, as it's the place where the error in the socket should be detected.

Thank you!

#17 Updated by Jaroslav Kysela about 1 month ago

Could you retest with latest? There are many changes regarding the session handling and perhaps, this issue is fixed, too.

#18 Updated by Mono Polimorph about 1 month ago

Jaroslav Kysela wrote:

Could you retest with latest? There are many changes regarding the session handling and perhaps, this issue is fixed, too.

I'll try today.
However, please where are the main "writes" to the RTSP socket? I want to review this part of the code.

Thank you!

#19 Updated by Jaroslav Kysela about 1 month ago

Look for http_send_header() in rtsp.c and satip_rtp_tcp_data() in rtp.c .

#20 Updated by Jaroslav Kysela about 1 month ago

v4.3-307-g4ee77e1 - another set of satip server changes

#21 Updated by Mono Polimorph about 1 month ago

Jaroslav Kysela wrote:

v4.3-307-g4ee77e1 - another set of satip server changes

Hi,

With last changes the problem persists!
Please, remember that the problem appears when using TWO TVH and one is the SAT>IP server of the other... and when the RTSP_over_TCP is used!

Another method for testing the problem: Make a VPN between both TVH servers. We name "remote" for the initial TVH SAT>IP server, and "local" for the TVH uses the other as SAT>IP server with RTSP_over_TCP enabled. Then the "local" TVH is used for seving any regular SAT>IP client. So, tune one program... and all is working, but make any SPEEDTEST or similar bandwith consumming tool over the VPN... after a short time the connection fails... and IN THE "REMOTE" TVH SERVER THE CONNECTION IS STALLED AND A NEW IS CREATED... then the server can only be killed using "kill -9 pid".

I don't know how to fix this problem, but it makes unusable the RTSP_over_TCP for this reason!

#22 Updated by Mono Polimorph about 1 month ago

Hi,

More info... If I don't kill the TVH when this error appears, around 10 minutes after, in the log of the TVH appears these messages:

2017-08-01 21:19:15.816 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:19:25.877 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:19:35.836 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:19:45.945 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:19:55.883 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:20:05.922 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:20:15.948 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:20:26.064 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:20:36.046 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:20:46.083 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:20:56.148 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:21:06.176 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:21:16.203 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:21:26.229 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:21:36.262 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:21:46.294 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:21:56.259 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:22:06.354 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:22:16.411 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:22:26.392 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:22:36.466 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:22:46.460 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:22:56.431 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:23:06.454 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:23:16.487 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:23:26.520 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:23:36.538 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:23:46.567 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:23:56.599 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:24:06.618 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:24:16.647 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:24:26.746 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:24:36.834 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:24:46.925 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:24:56.880 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:25:06.977 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:25:16.965 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:25:27.032 mpegts: too much queued table input data (over 2MB), discarding new
2017-08-01 21:25:37.067 mpegts: too much queued table input data (over 2MB), discarding new

At some point the TVH has a deadlock or a race condition!

Based on this message, the input (in this case from a multicast input) is already working. But the session needs to be closed, as the TCP socket when the RTSP data is send is closed.
Why not close (=kill) the session thread when the socket is closed?
I feel this needs to be done synchronous to overcome this trouble.

#23 Updated by Mono Polimorph about 1 month ago

Hi,

I'm completly crazy with this bug... so please, explain to me if I'm right:

  • In the satip/rtp.c file the point where the MPEG-TS packets are writed in the output socket is inside the loop "while (rtp->sq && !fatal)" in the function "satip_rtp_thread()" that is the loop of the thread, right?
  • Then if the write fails ('r' variable) then the loop-break-flag is set: "if (r) fatal = 1;"
  • However, the function "satip_rtp_tcp_loop()" has two coding errors:
  1. The first error is call to the function "satip_rtp_append_tcp_data()", that it's the method that writes to the socket, without checking the return code!
    (http://github.com/tvheadend/tvheadend/blob/master/src/satip/rtp.c#L325)
  1. The second error is that the function "satip_rtp_append_tcp_data()" used where "r = satip_rtp_append_tcp_data(rtp, data, 188);" never returns an invalid code!!!
    (http://github.com/tvheadend/tvheadend/blob/master/src/satip/rtp.c#L287)

Why this function only returns 0 ?
I'm missing something?
:( (!)

#24 Updated by Mono Polimorph about 1 month ago

Hi,

The problem is more complex!

The function "satip_rtp_append_tcp_data()" never returns a failure... however, it doesn't writes directly to the socket... it uses "other" functions that finally uses the "tvh_write()" function defined in "wrappers.c" for writing to the socket.

And two problems exist:

  1. Any failure detected in the function "tvh_write()" isn't propagated to the "satip_rtp_append_tcp_data()" function. So, if some error appears writing to the socket no one handleds it.
  2. Futhermore, the function "tvh_write()" doesn't catch socket errors. It only returns 0 or 1. Value 1 is OK, and 0 is... I can't write at time! No info about socket closed or other errors.

So, please, can you review the code to inform to upper functions about ANY socket error? Perhaps this isn't relevant for RTP over UDP... but for TCP is required!

#25 Updated by Jaroslav Kysela about 1 month ago

  • Status changed from Accepted to Fixed
  • % Done changed from 0 to 100

#26 Updated by Jaroslav Kysela about 1 month ago

This is not a fatal issue, because the socket should be closed in rtsp_serve(), so the errors are just ignored. Try v4.3-313-g2582cde .

#27 Updated by Mono Polimorph about 1 month ago

Hi,

Regarding the call to "tvh_write()" I created this graph of function calls inside "src/satip/rtp.c":

tvh_write()  <----  satip_rtp_tcp_data()  <----  satip_rtp_flush_tcp_data()  <----  satip_rtp_append_tcp_data()  <-x2-  satip_rtp_tcp_loop()  <--|
                                            |------  satip_rtcp_thread()       |                                                                 |
                                                                               |-------------------------  satip_rtp_thread()  ------------------|

So here my ideas for fixing this bug:

  • Add the parameter "int socket_err" to "satip_rtp_session_t". The objective is have a flag indicating if the output socket has a fatal error. You need to initialize it to 0, and 1 indicates that the socket is unusable.
  • In the function "satip_rtp_tcp_data()" check the return code of the "tvh_write()". If the code is 0, then set the flag "socket_err".
  • Inside the thread loop for the RTCP data (function "satip_rtcp_thread()"), after call to "satip_rtp_tcp_data()" check the socket error flag, and if it's true then set the flag "disable_rtcp". This is only for stopping this working thread, not for close it.
  • Inside the thread loop for the RTP data (function "satip_rtp_thread()"), after the call to "satip_rtp_tcp_loop()" do a check of the socket error flag. If the flag is set then set the loop-exit-flag "fatal". This will closes (kill) the thread in the next loop.

I feel that after killing the streaming thread for RTP the upper layers will be informed about the error and the session will be closed. It's this assumption right?
If so, can you implement this solution?
Thank you!

#28 Updated by Mono Polimorph about 1 month ago

Jaroslav Kysela wrote:

This is not a fatal issue, because the socket should be closed in rtsp_serve(), so the errors are just ignored. Try v4.3-313-g2582cde .

Hi,

I tested the last version (v4.3-313-g2582cde), and the problem continues. :(

More info: the socket status when the "remote" TVH server locks (the remote server is the one that serves the SAT>IP calls using the RTP_over_TCP) is "CLOSE_WAIT"; and in the "local" server (the server that works as a client receiving the streaming interlaved over the RTSP channel) is "FIN_WAIT2".

I need to see more in the code to find where is the problem... as the socket is never closed and the session (='connection' in the UI) continues open (even when it expires as it never be deleted).

Any idea?

#29 Updated by Jaroslav Kysela about 1 month ago

Look to rtsp_serve() and http_serve_requests() called from this fcn. The file descriptor should be closed just when http_serve_requests() finishes. I 'fixed' the shutdown sequence in v4.3-316-g052874e to avoid race condition (reuse file descriptor number for another purpose).

#30 Updated by Jaroslav Kysela about 1 month ago

Mono Polimorph wrote:

I feel that after killing the streaming thread for RTP the upper layers will be informed about the error and the session will be closed. It's this assumption right?

No, RTP threads are totaly independent from RTSP. The RTSP code should handle the proper shutdown sequence when the socket is out-of-service.

#31 Updated by Mono Polimorph about 1 month ago

Jaroslav Kysela wrote:

Mono Polimorph wrote:

I feel that after killing the streaming thread for RTP the upper layers will be informed about the error and the session will be closed. It's this assumption right?

No, RTP threads are totaly independent from RTSP. The RTSP code should handle the proper shutdown sequence when the socket is out-of-service.

OK. So if the RTP thread doesn't inform to the RTSP code... then how the RTSP code can handle the close of the socket? Remember that when an error closes the socket, only the RTP_over_TCP stream is interrupted. The RTSP manager will not receive any new message, as the socket is closed.

I feel we need to find how to "kill" the RTSP session.
You know a simple method to do it?

#32 Updated by Mono Polimorph about 1 month ago

Hi,

More info: Using the last version of the code, when the problem appears... that is, when a TCP connection using RTP_over_TCP is stalled (because insufficient bandwith). In this case, the SOCKET is OPEN on both ends (master server and local server). And the session in the master server is expired. But the client connection (in the local server) is still open. So, when I close (kill in the UI) the connection with the SAT>IP client, then the local server closes the socket with the master session and the session is finally free.

So, I feel part of the problem is with the SAT>IP client side. I'll review the code for handle stalled RTP_over_TCP.

#33 Updated by Mono Polimorph about 1 month ago

Hi,

The problem is really complex! :(

I do this change in the "src/satip/rtsp.c" file:

static void
rtsp_session_timer_cb(void *aux)
{
  session_t *rs = aux;

  tvhwarn(LS_SATIPS, "-/%s/%i: session closed (timeout)", rs->session, rs->stream);
  pthread_mutex_unlock(&global_lock);
  pthread_mutex_lock(&rtsp_lock);
  rtsp_close_session(rs);
  rtsp_free_session(rs);
  pthread_mutex_unlock(&rtsp_lock);
  pthread_mutex_lock(&global_lock);
+  tvhwarn(LS_SATIPS, "session closed (timeout) after mutex");
}

And sometimes (now, not all times) the second message isn't printed!
In this case the connection (session) continues open in the master server (remote server).

So, the problem is not only the client, but with the server as well.
Please, any idea why the mutex code isn't executed?

#34 Updated by Mono Polimorph about 1 month ago

Hi,

Stupid question: The order of the mutex lock/unlock in the function "rtsp_session_timer_cb()" is right?

I feel it quite confuse: UNlock(global),Lock(rtsp),...,unlock(rtsp),Lock(global).

It's this correct?

#35 Updated by Mono Polimorph about 1 month ago

Hi,

I present one patch to complement the changeset tvheadend|2582cde14a6a0ac40b8a58d054abd25f01de3a41 :

 src/satip/rtp.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/satip/rtp.c b/src/satip/rtp.c
index 0cdc82e..5679f7a 100644
--- a/src/satip/rtp.c
+++ b/src/satip/rtp.c
@@ -282,7 +282,7 @@ static inline int
 satip_rtp_flush_tcp_data(satip_rtp_session_t *rtp)
 {
   struct iovec *v = &rtp->tcp_data;
-  int r;
+  int r = 0;

   if (v->iov_len)
     r = satip_rtp_tcp_data(rtp, 0, v->iov_base, v->iov_len);
@@ -382,13 +382,13 @@ satip_rtp_thread(void *aux)
     sm = TAILQ_FIRST(&sq->sq_queue);
     if (sm == NULL) {
       if (tcp) {
-        satip_rtp_flush_tcp_data(rtp);
+        r = satip_rtp_flush_tcp_data(rtp);
       } else {
         r = satip_rtp_send(rtp);
-        if (r) {
-          fatal = 1;
-          continue;
-        }
+      }
+      if (r) {
+        fatal = 1;
+        continue;
       }
       tvh_cond_wait(&sq->sq_cond, &sq->sq_mutex);
       continue;
--
1.7.10.4

The patch is required to avoid this problem: After the first "session closed (timeout)" any consecutive new session generates this error:

2017-08-02 20:49:49.051 satips: RTCP send to error 192.168.1.131:42554 : Bad file descriptor
2017-08-02 20:49:49.201 satips: RTCP send to error 192.168.1.131:42554 : Bad file descriptor
2017-08-02 20:49:49.351 satips: RTCP send to error 192.168.1.131:42554 : Bad file descriptor
...

This patch completes and fixes this error.

Futhermore, I need to comment about the unresolved deadlock:

  • The most simple scenario to test: One "master" TVH and one "local" TVH, both connected using the Internet, using RTP_over_TCP for SAT>IP.
  • When you use one SAT>IP client with the "local" TVH, if you start a SpeedTest then when the UPLOAD process saturates your connection, then the RTSP Interlaved stream is stalled. This generates the infamous error "session closed (timeout)". The problem is related to the ACKs of the TCP. When the TCP socket don't receives on time the ACKs the buffer of socket gets full, and the RTP process can't write in the socket.
  • Two different troubles can appear when this error is visible:
  1. First is the "unclosed session". The visible effect is two identical connections from the same client (only changes the port). And after a lot of debug the problem is at this point: at the start of the function "rtsp_close_session()" in rtsp.c, called from the timed function "rtsp_session_timer_cb()". The call to "satip_rtp_close()" inside the start of the function never returns... I suspect because a mutex. However, this problem not appears all the time. Sometimes, the function "satip_rtp_close()" returns, and the duplicated connection not appears. The only effect is that the streaming is stalled. But if the client changes to another MUX then the streaming restarts.
  2. The second problem is the interruption of the streaming. All the time, when the write to the TCP socket fails the streaming is closed. This isn't a good solution, as all troubles with the TCP socket stops the streaming. More advanced behaviour can be: If the buffer is full then retry after a huge interval (for example RTSP_TIMEOUT*4). If the buffer continues full then discard some TS packets. And if the socket has a permanent error, then flush all and close the session. I feel this is required for a reliable streaming.

And yes, I know the problems with the streaming over TCP. However, the current behaviour is very unconfortable.

Please, can you help me to fix this?
Thank you!

#36 Updated by Jaroslav Kysela about 1 month ago

It seems that if write to socket stalls, the rtp thread is not terminated (you may check if pthread_join() in satip_rtp_close() stalls). Probably more logic to the output data buffering should be added to never fill the whole kernel's TCP buffer (and allow the 'atomic' data chunk interleaving with RTSP command replies).

#37 Updated by Mono Polimorph about 1 month ago

Jaroslav Kysela wrote:

It seems that if write to socket stalls, the rtp thread is not terminated (you may check if pthread_join() in satip_rtp_close() stalls). Probably more logic to the output data buffering should be added to never fill the whole kernel's TCP buffer (and allow the 'atomic' data chunk interleaving with RTSP command replies).

Hi,

Yes, when the two connections appear the pthread_join() is never reached. Any solution to this problem?

Regarding the TCP buffer is required some logic to overcome the full buffer exception. Any idea?

I like to work with you in resolve this problem. Please, let me know how I can do it.
Regards.

#38 Updated by Mono Polimorph about 1 month ago

Mono Polimorph wrote:

I like to work with you in resolve this problem. Please, let me know how I can do it.
Regards.

Hi,

I have openend a new issue to work in the TCP buffer optimization: #4517
At time, all transport over the Internet is stalled after a short period (between 10 seconds to 4 minutes). Futhermore, with last commits seems that the deadlock is resolved (I'm not sure, it's only an assumption). So, I'll work on the TCP optimization...

#39 Updated by Mono Polimorph about 1 month ago

Hi,

First, please, reopen this issue, isn't fixed! :(

Second, after fixing the timeout problem with my patch in #4517, the deadlock problem appears all the time when the bandwith is exhausted. Futhermore new messages are shown in the log. Here an example:

2017-08-03 22:30:21.328 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:21.479 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:21.629 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:21.779 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:21.929 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:22.080 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:22.230 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:22.381 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:22.531 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:24.619 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:28.066 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe
2017-08-03 22:30:28.415 satips: RTCP send to error 192.168.1.2:35905 : Broken pipe

This is the messages that the "master" (remote) TVH server prints when the TCP socket is closed and a new one is opened. The master server shows 2 connections from the client, but without subscriptions even if the client requests them. Futhermore, when the client disconnects the two connections continue appearing in the UI, even when the sockets are closed (checked with 'netstat'). And when a new client connects... the server can't serve any new request.

So, the deadlock exists. I don't know where, but when the connection is closed by an error it's not properly removed from the internal data.

Please, can you try to fix this?
Thank you!

#40 Updated by Mono Polimorph about 1 month ago

Hi,

Regarding this unresolved problem, I found more info:

First, a more clear description of the architecture:

"master" TVH <---SAT>IP(Interlaved RTP)(Internet)---> "local" TVH <---SAT>IP(RTP)(LAN)---> SAT>IP client

- "master" is the remote TVH server.
- "local" is the client TVH server.
- "client" is any SAT>IP client connected to the "local" TVE server.

On this scenario and with the current commits (plus my patch in #4517):

- Two tuners configured in the "local" server for connecting to the "master".
- One SAT>IP client using the "local" server.
- Internet network saturated.
- When this happens, the series of events is as follows:
  1. The first tuner goes invalid because the TCP socket collapses and the log is:
    2017-08-04 09:05:12.136 satip: SAT>IP DVB-S Tuner #1 (192.168.1.2) - RTSP error -90 (Message too long) [9874-0]
    2017-08-04 09:05:12.685 subscription: 0004: service instance is bad, reason: Tuning failed
    
  2. The "local" server changes to the second tuner, and opens a new connection to the "master" server.
  3. In the "master" server a new connection appears, but the old one continues, even if the TCP socket is closed.
  4. The "local" server can't retune and this appear in the log:
    2017-08-04 09:05:12.687 mpegts: 11837H in ASTRA-192 - tuning on SAT>IP DVB-S Tuner #2 (192.168.1.2)
    2017-08-04 09:05:12.687 subscription: 0004: "SAT>IP" subscribing to mux "11837H", weight: 100, adapter: "SAT>IP DVB-S Tuner #2 (192.168.1.2)", network: "ASTRA-192", service: "Raw PID Subscription", hostname="192.168.1.2" 
    2017-08-04 09:05:22.688 subscription: 0004: service instance is bad, reason: No input detected
    2017-08-04 09:05:24.688 subscription: 0004: No input source available for subscription "SAT>IP" to mux "11837H in ASTRA-192" 
    2017-08-04 09:05:26.688 subscription: 0004: No input source available for subscription "SAT>IP" to mux "11837H in ASTRA-192" 
    ...
    
  5. The "master" server show only this:
    2017-08-04 09:05:25.391 satips: -/BB14DD9B/2: session closed (timeout)
    
  6. However one subscription exists in the "master" server, but stalled as the input/output counters are freezed.
  7. And the two connections continue in the UI of the "master" forever... any new connection opens a new connection but never receives anything.

So, as a summary: One deadlock happens in the "master" server when a new connection from the same client connects and the old one is closed.

Please, can you review this?
Thank you!

#41 Updated by Mono Polimorph about 1 month ago

Hi,

More info:

I put one logging line before the command "pthread_join(rtp->tid, NULL);" inside the function "satip_rtp_close()@rtp.c", and another one just after this line.

Then, when I reproduce the deadlock (perfectly reproducible all the time!), this is the log:

2017-08-05 12:28:08.310 subscription: 0001: "SAT>IP" unsubscribing, hostname="192.168.1.2" 
2017-08-05 12:29:47.300 satips: satip_rtp_close: before pthread_join(rtp->tid, NULL);
2017-08-05 12:30:09.962 satips: -/7DCD70C0/2: session closed (timeout)

Conclusions:

  1. The function "pthread_join()" never returns and it produces a deadlock.
  2. The time between the error in the socket (when the unsubscription is done) and the call to the "satip_rtp_close()" function is 90 seconds!
  3. The call to "rtsp_session_timer_cb()@rtsp.c" happens 30 seconds after the call to the "satip_rtp_close()"

So the error seems to be in the RTSP thread, as it doesn't kill himself and it never signals to the RTP thread when an error happens in the socket.

Please, can you provide one fix or help me in the right direction for found a solution?
Thank you!

#42 Updated by Mono Polimorph about 1 month ago

Hi,

I show an inconsistency after a lot of testing & logging.
See this (simplyfied) log:

2017-08-05 17:09:14.783 satips: RTP streaming to 192.168.1.2:38723 open
2017-08-05 17:09:14.787 subscription: 0004: "SAT>IP" subscribing to mux ...
2017-08-05 17:09:14.827 satips: 3/5A2E8865/2: PLAY from 192.168.1.2:38723 DVB-S freq 11740000 V sym 27500000 fec 3/4 mod QPSK roff 35 is_id -1 pls_mode ROOT pls_code 0 pids 0,1,16,17,18,102,1501,1502,1503,8187
2017-08-05 17:09:14.829 satips: -/5A2E61D1/1: teardown from 192.168.1.2:38722
2017-08-05 17:09:14.829 satips: RTP streaming to 192.168.1.2:40002 closed (remote request)
2017-08-05 17:09:14.829 subscription: 0003: "SAT>IP" unsubscribing, hostname="192.168.1.2" 
2017-08-05 17:09:15.188 satips: 3/5A2E8865/2: PLAY from 192.168.1.2:38723 DVB-S freq 11740000 V sym 27500000 fec 3/4 mod QPSK roff 35 is_id -1 pls_mode ROOT pls_code 0 pids 0,1,16,17,18,101,102,103,104,105,201,202,203,1501,1502,1503,8187
2017-08-05 17:09:15.690 satips: 3/5A2E8865/2: PLAY from 192.168.1.2:38723 DVB-S freq 11740000 V sym 27500000 fec 3/4 mod QPSK roff 35 is_id -1 pls_mode ROOT pls_code 0 pids 0,1,16,17,18,102,1501,1502,1503,8187
2017-08-05 17:10:37.113 satips: RTP streaming to 192.168.1.2:1000000 closed (remote request)
2017-08-05 17:10:37.113 subscription: 0004: "SAT>IP" unsubscribing, hostname="192.168.1.2" 
2017-08-05 17:10:37.676 satips: 4/5A2EAEF9/3: SETUP from 192.168.1.2:38724 DVB-S freq 11740000 V sym 27500000 fec 3/4 mod QPSK roff 35 is_id -1 pls_mode ROOT pls_code 0 pids 
2017-08-05 17:10:37.678 mpegts: Stream ...
2017-08-05 17:10:37.683 subscription: 0005: "SAT>IP" subscribing to mux ...
2017-08-05 17:10:37.683 satips: RTP streaming to 192.168.1.2:38724 open
2017-08-05 17:10:37.713 satips: RTCP send to error 192.168.1.2:38724 : Broken pipe
2017-08-05 17:10:37.864 satips: RTCP send to error 192.168.1.2:38724 : Broken pipe
...

The key thing to observe is this line:

2017-08-05 17:10:37.113 satips: RTP streaming to 192.168.1.2:1000000 closed (remote request)

As you can check this RTP close corresponds to the one in the first line!
This has a remote port "38723", but the log indicates "1000000".

Futhermore, in other runs, the close line indicates a port that doesn't corresponds to the open port. Most of the time it's different to the open port; and sometimes the port+1. I checked at run (with netstat) that the open port is the correct. However, the close port is most the time incorrect.

I feel the code, because an incorrect mutex_lock, is closing an incorrect socket (fd) when the function "satip_rtp_close()" is called.
Please, can you review this function and try to fix the error?
Thank you!

#43 Updated by Jaroslav Kysela about 1 month ago

Try v4.3-322-g0960791 .

#44 Updated by Jaroslav Kysela about 1 month ago

Note that port value 1000000 belongs to the embedded TCP data mode.

#45 Updated by Mono Polimorph about 1 month ago

Jaroslav Kysela wrote:

Try v4.3-322-g0960791 .

I'll do asap! ;)
Thank you!

Jaroslav Kysela wrote:

Note that port value 1000000 belongs to the embedded TCP data mode.

Ah, OK!
However, sometimes the port is TCP+1, like when the log indicates "RTCP send to error : Broken pipe". In this case, seems that the "local" TVH opens another connection, and the "master" TVH closes the "new" TCP socket, instead of the "old" TCP socket.

#46 Updated by Mono Polimorph about 1 month ago

Jaroslav Kysela wrote:

Try v4.3-322-g0960791 .

Hi,

Problem only solved partially!
Now seems that the first connection is closed (great!), however the second one fails after few seconds... now all the time after the second connection is stablished, it appears in the log "RTCP send to error 192.168.1.2:38911 : Broken pipe" and now the port is the correct TCP port of the second connection. Why the RTCP thread fails to write to the TCP socket in the second time?

See the log

2017-08-07 15:40:24.079 satips: RTP streaming to 192.168.1.2:1000000 closed (remote request)
2017-08-07 15:40:24.079 subscription: 0002: "SAT>IP" unsubscribing, hostname="192.168.1.2" 
2017-08-07 15:40:26.858 satips: 4/4A237AA5/3: SETUP from 192.168.1.2:38911 DVB-S freq 12575000 H sym 22000000 fec 2/3 mod PSK/8 roff 35 is_id -1 pls_mode ROOT pls_code 0 pids 
2017-08-07 15:40:26.858 satips: RTP streaming to 192.168.1.2:38911 open
2017-08-07 15:40:26.862 subscription: 0003: "SAT>IP" subscribing to mux ...
2017-08-07 15:40:26.934 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:27.085 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:27.235 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:27.386 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:27.526 satips: 4/4A237AA5/3: PLAY from 192.168.1.2:38911 DVB-S freq 12575000 H sym 22000000 fec 2/3 mod PSK/8 roff 35 is_id -1 pls_mode ROOT pls_code 0 pids 0,1,16,17,18,102,1201,1202,1205,8187
2017-08-07 15:40:27.536 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:27.686 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:27.837 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:28.526 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:34.065 satips: RTCP send to error 192.168.1.2:38911 : Broken pipe
2017-08-07 15:40:34.537 satips: RTP streaming to 192.168.1.2:1000000 closed (remote request)
2017-08-07 15:40:34.537 subscription: 0003: "SAT>IP" unsubscribing, hostname="192.168.1.2" 

Futhermore, the problem in the client part persists. With two tuners, when the error appears in the tuner #1, then the tuner #2 is opened. However, when the error appears in the Tuner #2, the #1 isn't used even if it's free.

#47 Updated by Mono Polimorph about 1 month ago

Hi,

While working for fixing this bug, I created this patch:

---
 src/satip/rtp.c  |    6 +++---
 src/satip/rtsp.c |    8 ++++----
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/src/satip/rtp.c b/src/satip/rtp.c
index 5679f7a..5e12c60 100644
--- a/src/satip/rtp.c
+++ b/src/satip/rtp.c
@@ -910,7 +910,7 @@ satip_rtcp_thread(void *aux)
       if (tvhtrace_enabled()) {
         msg[len] = '\0';
         tcp_get_str_from_ip(&rtp->peer2, addrbuf, sizeof(addrbuf));
-        tvhtrace(LS_SATIPS, "RTCP send to %s:%d : %s", addrbuf, ntohs(IP_PORT(rtp->peer2)), msg + 16);
+        tvhtrace(LS_SATIPS, "RTCP send to %s:%d (sock:%d): %s", addrbuf, ntohs(IP_PORT(rtp->peer2)),(rtp->port == RTS
P_TCP_DATA)? rtp->fd_rtp:rtp->fd_rtcp, msg + 16);
       }
       if (rtp->port == RTSP_TCP_DATA) {
         err = satip_rtp_tcp_data(rtp, 1, msg, len);
@@ -925,8 +925,8 @@ satip_rtcp_thread(void *aux)
       if (r < 0) {
         if (err != ECONNREFUSED) {
           tcp_get_str_from_ip(&rtp->peer2, addrbuf, sizeof(addrbuf));
-          tvhwarn(LS_SATIPS, "RTCP send to error %s:%d : %s",
-                  addrbuf, ntohs(IP_PORT(rtp->peer2)), strerror(err));
+          tvhwarn(LS_SATIPS, "RTCP send to error %s:%d (sock:%d): %s",
+                  addrbuf, ntohs(IP_PORT(rtp->peer2)),(rtp->port == RTSP_TCP_DATA)? rtp->fd_rtp:rtp->fd_rtcp, strerro
r(err));
         } else {
           rtp->disable_rtcp = 1;
         }
diff --git a/src/satip/rtsp.c b/src/satip/rtsp.c
index 3c0baf9..4c2e08d 100644
--- a/src/satip/rtsp.c
+++ b/src/satip/rtsp.c
@@ -1163,9 +1163,9 @@ play:
   if (mpegts_pid_dump(&rs->pids, buf + r, sizeof(buf) - r, 0, 0) == 0)
     tvh_strlcatf(buf, sizeof(buf), r, "<none>");

-  tvhdebug(LS_SATIPS, "%i/%s/%d: %s from %s:%d %s",
+  tvhdebug(LS_SATIPS, "%i/%s/%d: %s from %s:%d (sock:%d) %s",
            rs->frontend, rs->session, rs->stream,
-           caller, hc->hc_peer_ipstr, ntohs(IP_PORT(*hc->hc_peer)), buf);
+           caller, hc->hc_peer_ipstr, ntohs(IP_PORT(*hc->hc_peer)), hc->hc_fd, buf);

 ok:
   errcode = 0;
@@ -1484,8 +1484,8 @@ rtsp_process_teardown(http_connection_t *hc)
     return 0;
   }

-  tvhdebug(LS_SATIPS, "-/%s/%i: teardown from %s:%d",
-           hc->hc_session, stream, hc->hc_peer_ipstr, ntohs(IP_PORT(*hc->hc_peer)));
+  tvhdebug(LS_SATIPS, "-/%s/%i: teardown from %s:%d (sock:%d)",
+           hc->hc_session, stream, hc->hc_peer_ipstr, ntohs(IP_PORT(*hc->hc_peer)), hc->hc_fd);

   pthread_mutex_lock(&rtsp_lock);
   rs = rtsp_find_session(hc, stream);
--
1.7.10.4

It helps to identify a faulty socket-id (fd) value in the satips part.
I suggest to also add similar code in other parts of the code for easy bugs search.
You agree to commit it?

Regards.

#48 Updated by Mono Polimorph about 1 month ago

Hi,

More info (and please reopen this issue, still it's not fixed!):

After doing this change in the code of "rtp.c":

//#define RTP_TCP_PAYLOAD (87*188+12+4) /* cca 16kB */
  #define RTP_TCP_PAYLOAD ((7*11*188)+12+4) /* less than 15kB */

First, why I do this change? The reason is this: Trying to fix the "broken pipe" with the RTCP messages (as the RTCP data is in fact writed in the socket, checked with a tcpdump, even if the "broken pipe" message appears) I see that using the RTP_TCP_PAYLOAD with the "original" value some "non complete writes" are done in the socket (that is two or more chunks in the TCP stream). So, I reduced the value to ((11*7)*188) bytes for TS packets. That is 11 times 7 packets of 188bytes (the original value is 7*12=87). This is less than 15kB, and as the Linux default TCP size is 16kB, then this is a safe value. So, I suggest to update the code and use this lower value.

However, using this value the deadlock now apeears all the times! Now, at every execution, when the bandwith is exhausted and a new connection from the local TVH server is created, the old connection in the master TVH server isn't closed... the code all the time be stalled in the function "satip_rtp_close()@rtp.c" inside the "pthread_join()".

Please, can you help me to fix this issue? I can see the effect of the trouble. And I can reproduce it every time. However, I can't see where is the bug. :(

#49 Updated by Mono Polimorph about 1 month ago

Hi,

One additional comment if it helps:

- Searching for the best size for the RTP_TCP_PAYLOAD value, I found one discussion about the Maximum interleaved RTP packet size. This thread is inside the mailinglist of the good-known Live555 RTSP library. See it at:

http://lists.live555.com/pipermail/live-devel/2015-April/019225.html

- One interesting comment is about a mutex for writing in the TCP socket. As I know, the TVH code at time doesn't do any blocking when writing the Interlaved data (stream=0 that it's the TS data, the stream=1 that it's the RTCP data, and the RTSP responses) to the TCP socket. So, it will be a good idea to write thread-safe when using the Interlaved mode?

- For sure, reducing the RTP_TCP_PAYLOAD value, the chances of incomplete writing will be lower.. So, I feel this the reason for a different behaviour reducing this value.

Please, post yours comments.

#50 Updated by Mono Polimorph about 1 month ago

Hi,

Please can you comment if my assumption is true?

- After a lot of review of the code, I think that the deadlock is caused by the functions '[email protected]_rtp_close()' and '[email protected]_rtp_thread()'.

- The first function waits until the second ends, as each one is running in a different thread.

- However, the problem seems to appear when the "fatal" variable inside the 'satip_rtp_thread()' changes to "1". When this variable is set the while loop inside the thread stops.

- It's possible that the 'pthread_join(rtp->tid, NULL)' inside the 'satip_rtp_close()' function never contiunes because the rtp thread is blocked because the "fatal=1"?

I don't know, and for this reason I ask!

#51 Updated by Mono Polimorph about 1 month ago

Hi,

With these two lines added to the function '[email protected]_rtp_close()':

+  tvhtrace(LS_SATIPS, "Closing RTP socket: waiting for the end of the rtp thread");
  pthread_join(rtp->tid, NULL);
+  tvhtrace(LS_SATIPS, "Closing RTP socket: continue after the end of the rtp thread");

This is the log when no deadlock:

2017-08-09 10:57:02.522 [  TRACE]:satips: rtp close 0x7f205ca4cae0
2017-08-09 10:57:02.522 [  TRACE]:satips: Closing RTP socket: waiting for the end of the rtp thread
2017-08-09 10:57:02.522 [  DEBUG]:satips: RTP streaming to 192.168.1.2:1000000 closed (remote request)
2017-08-09 10:57:02.525 [  TRACE]:satips: Closing RTP socket: continue after the end of the rtp thread
2017-08-09 10:57:02.525 [   INFO]:subscription: 0004: "SAT>IP" unsubscribing, hostname="192.168.1.2" 

And this is the log when the deadlock appears:

2017-08-09 10:58:17.344 [  TRACE]:satips: rtp close 0x7f205cae0790
2017-08-09 10:58:17.344 [  TRACE]:satips: Closing RTP socket: waiting for the end of the rtp thread

In this last I wait for 2minutes before kill the TVH manually (kill -9). And these two lines are the last in the log file.

So, the problem is the "pthread_join(rtp->tid, NULL)" call.
Please, can you fix this issue?

#52 Updated by Mono Polimorph about 1 month ago

Hi,

More info:

When the deadlock is produced and the code is in the "pthread_join(rtp->tid, NULL)" call I see this:

(gdb) info threads

  4    Thread 0x7ff548214700 (LWP 1164) "tvh:tcp-start" 0x00007ff54c4a9ee5 in pthread_join ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
  3    Thread 0x7ff5437fb700 (LWP 1166) "tvh:satip-rtp" 0x00007ff54c4afd5c in __lll_lock_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
* 2    Thread 0x7ff543bfd700 (LWP 1167) "tvh:tcp-start" 0x00007ff54c4afd5c in __lll_lock_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0

(gdb) thread 4
[Switching to thread 4 (Thread 0x7ff548214700 (LWP 1164))]
#1  0x00007ff54d5ae37f in satip_rtp_close (_rtp=0x7ff5505ebba0) at src/satip/rtp.c:602
602       pthread_join(rtp->tid, NULL);

(gdb) bt
#0  0x00007ff54c4a9ee5 in pthread_join () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ff54d5ae37f in satip_rtp_close (_rtp=0x7ff5505ebba0) at src/satip/rtp.c:602
#2  0x00007ff54d5a7e55 in rtsp_close_session (rs=0x7ff5505e7b00) at src/satip/rtsp.c:1621
#3  0x00007ff54d5a81f8 in rtsp_flush_requests (hc=0x7ff548213990) at src/satip/rtsp.c:1545
#4  rtsp_serve (fd=25, opaque=0x7ff550630218, peer=<optimized out>, self=<optimized out>)
    at src/satip/rtsp.c:1604
#5  0x00007ff54d5162a3 in tcp_server_start (aux=0x7ff5506301f0) at src/tcp.c:651
#6  0x00007ff54d511ee3 in thread_wrapper (p=0x7ff5505d8cc0) at src/wrappers.c:161
#7  0x00007ff54c4a8b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#8  0x00007ff54bd6895d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#9  0x0000000000000000 in ?? ()

(gdb) frame 1
#1  0x00007ff54d5ae37f in satip_rtp_close (_rtp=0x7ff5505ebba0) at src/satip/rtp.c:602
602       pthread_join(rtp->tid, NULL);

(gdb) info locals
rtp = 0x7ff5505ebba0
tbl = <optimized out>
sq = 0x7ff5505e7ce8

(gdb) print rtp->tid
$12 = 140691376158464

And the value "140691376158464" is an inexistent 'pthread_t'.

You know where is the problem?

#53 Updated by Mono Polimorph about 1 month ago

Hi,

Perhaps I found the cross deadlock...

This is from a gdb session when the deadlock is produced:

  11   Thread 0x7f9cd3fff700 (LWP 1351) "tvh:tcp-start" 0x00007f9cdc7a3ee5 in pthread_join ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
  10   Thread 0x7f9cd8910700 (LWP 1352) "tvh:satip-rtp" 0x00007f9cdc7a9d5c in __lll_lock_wait ()
   from /lib/x86_64-linux-gnu/libpthread.so.0

- Here the thread 10 is waiting for mutext owned by Thread 11 (checked with "p *(pthread_mutex_t)0x7f9cd3ffe990" in frame 3 of thread 10)

- And the thread 11 is waiting for the end of some "rtp thread", in this case the thread 10.

Where is then the problem?
When the TCP socket is stalled, and the session needs to free, the mutex of the TCP blocks the RTP thread at this point: http://www.tvheadend.org/projects/tvheadend/repository/revisions/97458ba6de0806119ca970c524b6fa6327e1ea6d/entry/src/satip/rtp.c#L274

So the solutions seems to unlock the TCP socket before calling to rtsp_close_session().

#54 Updated by Mono Polimorph about 1 month ago

Hi,

I'm glad to present a fix for this issue! :)

The problem is the cross deadlock between the TCP writes and the closing of the RTP thread. So here is the patch:

---
 src/satip/rtp.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/src/satip/rtp.c b/src/satip/rtp.c
index 5679f7a..d800c5a 100644
--- a/src/satip/rtp.c
+++ b/src/satip/rtp.c
@@ -33,7 +33,7 @@

 #define RTP_PACKETS 128
 #define RTP_PAYLOAD (7*188+12)
-#define RTP_TCP_PAYLOAD (87*188+12+4) /* cca 16kB */
+#define RTP_TCP_PAYLOAD ((7*6*188)+12+4) /* less than 8kB (FFMpeg and others have troubles with >8kB) */
 #define RTCP_PAYLOAD (1420)

 typedef struct satip_rtp_table {
@@ -592,12 +592,11 @@ void satip_rtp_close(void *_rtp)
   tvh_cond_signal(&sq->sq_cond, 0);
   pthread_mutex_unlock(&sq->sq_mutex);
   pthread_mutex_unlock(&satip_rtp_lock);
-  if (rtp->port == RTSP_TCP_DATA)
-    pthread_mutex_lock(rtp->tcp_lock);
   pthread_join(rtp->tid, NULL);
   if (rtp->port == RTSP_TCP_DATA) {
-    pthread_mutex_unlock(rtp->tcp_lock);
+    pthread_mutex_lock(rtp->tcp_lock);
     free(rtp->tcp_data.iov_base);
+    pthread_mutex_unlock(rtp->tcp_lock);
   } else {
     udp_multisend_free(&rtp->um);
   }
--
1.7.10.4

This patch includes another fix regarding the maximum length of TS packets in a interlaved stream. Several clients (like FFmpeg) use a buffer of 8KB. So the size is reduced to a less than 8KB (42 total TS packets).

I hope this patch will be commited soon. Also with my previous patch "SATIP-Add-socket_id-in-logs.diff" for printing in logs the number of the socket (file-id). This data is very usefull when debugging.

Finally, one comment: This patch doesn't fix the another two bugs that are still present. These bugs are:

  • The "RTCP send to error : Broken pipe" when using Interlaved RTP.
  • The use of the "next free tuner" when using more than one tuner in the client and the session is closed by error (when the tuner#1 fails, the tuner#2 starts... however, when fails the tuner#2 the tuner#1 isn't used!).

Please, can you commit this and try to fix these bugs?

#55 Updated by Mono Polimorph about 1 month ago

Hi,

Regarding the "RTCP send to error : Broken pipe" issue, I see this (after apply my previous patch):

- When I start the execution of the THV, the RTCP messages over the Interlaved stream are sended.

- However, after some TCP errors, the RTCP messages aren't sended. If I change to UDP transport without closing the TVH the messages flows. But if I rechange to TCP without stopping TVH then the RTCP messages are missing.

- Futhermore, with the small RTP_TCP_PAYLOAD value (8KB) it's very hard to generate TCP errors saturating the connection.

You know, or have any idea, about this issue?

#56 Updated by Jaroslav Kysela about 1 month ago

Could you test code in https://github.com/perexg/tvheadend/commits/satip-test ? Your 'fix' is only partial. The goal for commit https://github.com/perexg/tvheadend/commit/c670fc67ae9da5509ce9a16d93f0a2b7cfe2e0eb is to write data without blocking in the RTP thread but ensure the sync with the standard RTSP replies. So pthread_join() should never block now. Also, there's data overflow protection (actually at 1MB - perhaps, the value should be higher).

#57 Updated by Mono Polimorph 16 days ago

Jaroslav Kysela wrote:

Could you test code in https://github.com/perexg/tvheadend/commits/satip-test ? Your 'fix' is only partial. The goal for commit https://github.com/perexg/tvheadend/commit/c670fc67ae9da5509ce9a16d93f0a2b7cfe2e0eb is to write data without blocking in the RTP thread but ensure the sync with the standard RTSP replies. So pthread_join() should never block now. Also, there's data overflow protection (actually at 1MB - perhaps, the value should be higher).

Hi,

I do the check now (I'm in return)... and it crashes.
See the new issue #4573.
Please, rollback the commit or review the code.

Thank you for your effort!

Also available in: Atom PDF