Project

General

Profile

Bug #5833

memory leak recording two channels concurrently

Added by Dave Denholm 7 months ago. Updated 7 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
non-x86
Target version:
-
Start date:
2019-12-30
Due date:
% Done:

0%

Estimated time:
Found in version:
4.2.8-34 (raspian - possibly base 4.2.8-31 + raspian changes ?)
Affected Versions:

Description

I've recently upgraded my faithful first-generation rpi to raspian buster, with a corresponding update to tvheadend. Trying to record two programs concurrently (from the same multiplex) caused a memory leak, and resulted in the oom killer taking action.

I can reproduce it quite easily. Eg recording BBC1 and BBC2 together resulted in the vm size growing. Down below is the result of running

  while sleep 5 ; do cat /proc/27428/statm ; done

in a shell, which shows the virtual and resident page counts (first and second columns) growing. I've added some blank lines to show transitions between stable states. First few lines show it idle. Then the first jump is when I start recording BBC1. Then a few lines later I start recording BBC2. It takes a few seconds for the leak to manifest once the second recording is started. The blank line is the point at which I abort both recordings. It doesn't fall all the way to the starting values.

Probably not relevant, but this is using a Realtek RTL2832 USB DVB-T tuner using Freeview in the UK (Crystal Palace transmitter). The hardware has been absolutely fine for several years, with an older version of tvheadend (3.4 I think, on 'wheezy').

The 'about' box reports the version as 4.2.8-34~g24a2f59e9
The changelog.Debian file in the distribution says:

$ zcat /usr/share/doc/tvheadend/changelog.Debian.gz 
tvheadend (4.2.8-34~g24a2f59e9+rpt1~buster) buster; urgency=low

  * 24a2f59e9 rpt changes                                       
  * e673865e6 packaging: add missing DEBHELPER placeholder to ..
  * 2d020e8f2 autobuild: add build targets for buster i386, am..
  * 0a60f73ee revert bogus ONID and TSID remapping              
  * 60c2f807a mpegts: use 32-bit tsid/onid to define the NONE ..

  * The full changelog(s) can be found at 
    https://tvheadend.org/projects/tvheadend/wiki/Releases
    https://github.com/tvheadend/tvheadend

 -- Serge Schneider <[email protected]>  Tue, 13 Aug 2019 12:40:39 +0100

I can see change 0a60f73ee in the git log of the tvheadend project (as v4.2.8+31). Not sure what
the other 3 changes are. Perhaps in a raspian clone of the git tree somewhere.

I have an x86 box as a backup tvheadend. I tried building and running 0a60f73ee for that,
but it doesn't seem to be showing the same problem. But the configurations and
libraries are probably not identical. It's running Mint 18.1 Serena which appears
to be an Ubuntu xenial derivative ?

Anyway, I'll try building for the raspberry pi from source, but that's obviously going
to take me a while - I don't have cross-compiling set up for it yet.

Here's the statm log for the tvheadend process for the above test.

$ while sleep 5 ; do cat /proc/27428/statm ; done
62999 8865 1384 696 0 30465 0
62999 8865 1384 696 0 30465 0
62999 8865 1384 696 0 30465 0
62999 8865 1384 696 0 30465 0
62999 8903 1384 696 0 30465 0
62999 8940 1390 696 0 30465 0
62999 8949 1390 696 0 30465 0

63255 8954 1390 696 0 30606 0
63255 8954 1390 696 0 30606 0
63255 8954 1390 696 0 30606 0
63255 8954 1390 696 0 30606 0
63255 8954 1390 696 0 30606 0
63255 8954 1390 696 0 30606 0
63255 8954 1390 696 0 30606 0
63255 8954 1390 696 0 30606 0

63255 8993 1390 696 0 30606 0
63255 8993 1390 696 0 30606 0
63255 8993 1390 696 0 30606 0
63255 8993 1390 696 0 30606 0
63255 8993 1390 696 0 30606 0
63255 8993 1390 696 0 30606 0
63255 8993 1390 696 0 30606 0
63511 9392 1390 696 0 30787 0
64023 9942 1390 696 0 31337 0
64791 10532 1390 696 0 31927 0
65303 11009 1390 696 0 32427 0
65559 11386 1390 696 0 32781 0
66071 11803 1390 696 0 33198 0
66327 12077 1390 696 0 33472 0
66583 12499 1390 696 0 33894 0
66839 12764 1390 696 0 34159 0
67351 13155 1390 696 0 34550 0
67863 13741 1390 696 0 35136 0
68631 14398 1390 696 0 35793 0
69143 14830 1390 696 0 36243 0
69399 15270 1390 696 0 36665 0
69399 15316 1390 696 0 36711 0
69399 15316 1390 696 0 36711 0
69399 15316 1390 696 0 36711 0
69911 15581 1390 696 0 37003 0
70167 15967 1390 696 0 37362 0
70679 16460 1390 696 0 37855 0
70935 16934 1390 696 0 38228 0

70935 16934 1390 696 0 38228 0
69143 15204 1390 696 0 36570 0
69143 15204 1390 696 0 36570 0
69143 15204 1390 696 0 36570 0
^C

Files

log (11.3 KB) log journalctl output from when it popped Dave Denholm, 2019-12-30 19:39

History

#1

Updated by Dave Denholm 7 months ago

Sorry, I forgot to include the system logs from when it first manifested. 270M resident memory is more than this little pi can sustain.

#2

Updated by Dave Pickles 7 months ago

I've seen this issue, though in my case only when recording two DVB-T2 channels.

What are you using to store the recordings?

In my case I was using an ex-laptop HDD via a low-cost USB->SATA interface. I have seen it mentioned somewhere that these SATA adaptors are mainly intended for backups and archiving, where only one file is open at once. With two files open (plus the added complexity of EXT filesystems) the adaptor can't keep up and output data is queued in system memory until the oom killer intervenes. Looking in the GUI at Config->Debugging->Memory the count of Packet Buffers increased continually until the crash.

After a bit of trial-and-error I found a combination of HDD and adaptor which worked.

#3

Updated by Dave Denholm 7 months ago

I use an external USB disk (Samsung M3).

It's true that the rpi has limited USB bandwidth available, and is only USB 2. And we were watching something from the disk at the same time. But this was all working fine in tvh 3.4

If this was the problem, wouldn't the backlog show up as system memory (buffers that couldn't get written to disk) rather than virtual memory within tvheadend itself ? ie tvheadend can just do a write() and it's the system's problem on getting it from buffers to disk ? Unless you are saying that tvheadend has a mechanism to buffer stuff up in userspace if a write() call reports

Is there any info in the oom log I attached ? I'm not really sure what the various names are. active_file:696kB could mean there's less than 1M of file data waiting to be written out ?

#4

Updated by saen acro 7 months ago

Set different cache scheme to recording profile.

#5

Updated by Flole Systems 7 months ago

Use valgrind if you suspect a memory leak, it will show it.

#6

Updated by Dave Denholm 7 months ago

saen acro wrote:

Set different cache scheme to recording profile.

I think I've done that. I did come across that when setting up new profiles. I set it to 'Don't keep' which seemed the appropriate one. (The help mentioned ram-limited targets such as rpi.)

However, if it was due to this feature, that would still show up when doing a single recording, wouldn't it ? Two recordings would merely consume cache at twice the rate. But I see spectacular growth with 2 recordings, and none with 1.

I'll try out a memory-leak tool once I'm able to build for the pi from source. I had hoped a simple chop in git would be able to find a version that didn't show this behavior.

#7

Updated by Flole Systems 7 months ago

You might want to try the latest version and not some old 4.2 version.

#8

Updated by Dave Denholm 7 months ago

Flole Systems wrote:

You might want to try the latest version and not some old 4.2 version.

Yes. Again, once I can actually (preferably cross-) build for the target, I'll be in a position to actually debug this. But isn't 4.2 the current stable release ?

#9

Updated by Flole Systems 7 months ago

It's still super old and about to be replaced by 4.3 and many bugfixes are not backported, I would not call it stable at all but officially it's considered stable.

#10

Updated by Dave Denholm 7 months ago

saen acro wrote:

Set different cache scheme to recording profile.

I tried 'sync + don't keep' but that made no apparent difference.

#11

Updated by saen acro 7 months ago

What exactly is file system of drive?

#12

Updated by Dave Denholm 7 months ago

saen acro wrote:

What exactly is file system of drive?

ext4 on a 1.5Tb USB Samsung M3 portable external drive. It probably is pretty fragmented since it runs almost-full most of the time.

I mentioned that when it first happened, we were also playing a video off the drive. But in subsequent repros, the machine has been idle.

using atop, and recording one program, the disk reports writing 1.5MBw/s. If I run a cp on the disk at the same time, it gets up to 7.5MBw/s so the disk itself should be capable of receiving data fast enough for two concurrent recordings. I've had no problems doing so over the last few years using tvheadend 3.4. (But I don't want to disturb it right now with further tests since it's recording something I want. And tvheadend's memory is already high since I forgot to restart it after the last test I did :-( )

#13

Updated by saen acro 7 months ago

Is it journaled or no?
Performance is different.
https://raid6.com.au/posts/fs_ext4_external_journal/

#14

Updated by Dave Denholm 7 months ago

saen acro wrote:

Is it journaled or no?
Performance is different.
https://raid6.com.au/posts/fs_ext4_external_journal/

It is journaled, since it would take a very very long time for the pi to fsck after a crash. It's a bit sensitive to power glitches eg when plugging things into usb. But I should change the mode to use writeback rather than ordered. (Apparently I can't do that using remount, so I'll have find and stop everything currently using the disk.)

Not sure how wise putting journal onto the pi's sdcard would be - wouldn't it wear it out ?

All this would make sense if this was a new system, but it's been running fine for years prior to the upgrade. The only hardware change is using a new bigger sd card for the OS. Plus I'm experimenting with using f2fs for /var but that shouldn't really be visible to tvheadend. (I don't expect you to remember whether some change between tvheadend 3.4 and 4.2 would have affected all this !)

#15

Updated by Dave Pickles 7 months ago

If this was the problem, wouldn't the backlog show up as system memory (buffers that couldn't get written to disk) rather than virtual memory within tvheadend itself ? ie tvheadend can just do a write() and it's the system's problem on getting it from buffers to disk ? Unless you are saying that tvheadend has a mechanism to buffer stuff up in userspace if a write() call reports

From a very quick look at the source, recording uses the same code path as streaming. It would make sense to buffer in userspace when streaming so I suspect recording does too.

Do you see the count of Packet Buffers (Config->Debugging->Memory) exploding just before the crash?

#16

Updated by Dave Denholm 7 months ago

Dave Pickles wrote:

Do you see the count of Packet Buffers (Config->Debugging->Memory) exploding just before the crash?

Once the memory has ramped up a bit, the web gui becomes pretty unresponsive, so not really possible to tell.

I have had a quick look at the code. If I understand correctly, there are two threads doing work during this time : one is pulling data from the tuner and adding it to a fifo ; then the dvr thread is reading packets from this fifo and writing them out to the filesystem. So if it can't keep up, the queue is going to get longer and longer, giving the symptom I'm seeing.

So I now understand why the i/o speed is being questioned. cpu is possibly also significant, if tvheadend is doing more interesting stuff than it used to. Oh well - hardware upgrade was probably overdue anyway.

#17

Updated by Dave Denholm 7 months ago

Dave Denholm wrote:

Dave Pickles wrote:

Do you see the count of Packet Buffers (Config->Debugging->Memory) exploding just before the crash?

Once the memory has ramped up a bit, the web gui becomes pretty unresponsive, so not really possible to tell.

... but letting it record for a short while then aborting, I can then see that the peak Packet Buffers has exploded (15,367,564).

I'm guessing there's been lots more code added since 3.4 to examine the mpegts data as it is received, checking for errors etc. So it's now doing a lot more work than it used to. I suspect the increased cpu load may be what's causing the symptoms I'm seeing.

#18

Updated by Dave Denholm 7 months ago

Update: I changed journaling to writeback, but didn't obviously help.

I've rebuilt 3.4patch1 from source for the pi, and am running that. The difference is obvious - there are hardly any writes to the disk (according to atop), and when it does write, it does big writes. Whereas 4.2.8 was doing about 1000 small (4k ?) writes per second.

3.4 is still doing about 10% cpu in system mode, so it's presumably still writing the data out to the kernel. But I'm guessing 4.2.8 was doing something to force it to actually write it out to the disk, whereas 3.4 was just letting the kernel decide when to flush out to disk. (And as was mentioned, with the journaling, lots of small writes just escalates things even more.) That's something I can investigate on the backup machine now that I know what I'm looking for.

Recording two programs, it's only running at about 50% cpu, and it hasn't even pushed the clock rate up to max. Adding a third and it barely notices.

Yeah, 3.4 is very old, But so's my pi ;-) And this saves me having to rewrite all the scripts I have which trawl the upcoming events looking for event clashes.

Thanks all.

#19

Updated by Dave Denholm 7 months ago

Doh! It was due to the cache scheme config ! It turns out that the "Don't Keep" setting, which sounds benign, and is recommended for rpi, actually causes the problem. tvheadend tells the kernel it doesn't need the data, which the kernel seems to take that to mean that it must get rid of it immediately. So "Don't Keep" and "Sync" actually seem to end up doing much the same thing.

(I had misunderstood the description of "Don't Keep". I thought tvheadend itself was offering to keep the data in its buffers, in case it was needed for streaming, or something.)

My backup system doesn't seem to show the same behaviour. It's running an older kernel on x86.

Also available in: Atom PDF