Project

General

Profile

Bug #4802

Experiencing regular crashes

Added by Peter Brockamp about 1 year ago. Updated about 19 hours ago.

Status:
Invalid
Priority:
Normal
Assignee:
-
Category:
Crashes
Target version:
-
Start date:
2017-12-17
Due date:
% Done:

0%

Estimated time:
Found in version:
4.2.4-42~g1c9944bb2
Affected Versions:

Description

Hello!

I'm experiencing regular crashes. Dunno if this is a problem of tvh itself or the Linux kernel, but somewhere I need to start.

I'm using tvheadend on a headless Debian stretch system, running on a hexacore amd phenom with 4GB RAM:
Linux version 4.9.0-4-amd64 () (gcc version 6.3.0 20170516 (Debian 6.3.0-18) ) #1 SMP Debian 4.9.65-3 (2017-12-03)
Current tvh version: 4.2.4-42~g1c9944bb2
I'm using a SAT-IP server (Megasat SAT>IP Server 3) connected to my LAN for recording and live TV. Tvh has exclusive access to the server.
Up to now I could not recognize any systematics on when the crashes occur (like e. g. start of a recording).

However I already had problems with earlier tvh version and jessie as follows:
  • At first (Jessie) I experienced regular complete freezing(!) of the machine with no output in the syslog. "Regular" means it could be that the machine lived only for an hour, but maybe also for several days.
  • I started wondering if I'm having a hardware problem on my machine, Linux should at least Kernel panic and I was expecting something in the logs, but nothing like that. I didn't find the time to do some in depth checking, plus I'm using a board, CPU and RAM that had been running without problems under Win 10 for many months.
  • I decided to migrate from jessie to stretch (to see if this changes something) but this did not solve the freezing problem.
  • I did some of the usual apt-get updates and suddenly with one update behavior changed so that my machine no longer froze but tvh started crashing instead (I had no debug version installed at that time, so no useful stack dump available):

? tvheadend.service - (null)
Loaded: loaded (/etc/init.d/tvheadend)
Active: active (exited) since So 2017-09-24 13:27:06 CEST; 10h ago
Process: 601 ExecStart=/etc/init.d/tvheadend start (code=exited, status=0/SUCCESS)

Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: ??:0 0x563308e2fe2a 0x563308c30000
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: ??:0 0x7ff76e50d890 0x7ff76e4fe000
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: gsignal+0x37 (/lib/x86_64-linux-gnu/libc.so.6)
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: abort+0x148 (/lib/x86_64-linux-gnu/libc.so.6)
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: ??:0 0x563308dbb423 0x563308c30000
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: ??:0 0x563308e18980 0x563308c30000
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: ??:0 0x563308e1e7c9 0x563308c30000
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: ??:0 0x563308e224e5 0x563308c30000
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: ??:0 0x563308dfc670 0x563308c30000
Sep 24 19:30:57 TVHeadend tvheadend989: CRASH: ??:0 0x563308df81f4 0x563308c30000

  • After some more updates suddenly all problems seem to have been gone, the machine was running without problems and no crashes for a whole bunch of weeks and for several updates.
  • Unfortunately with the latest apt-get update the problems came back, I could even experience crashes after only 10 min uptime! The difference: Now is there is a kernel Oops:

Dec 17 21:00:39 TVHeadend kernel: [ 662.841326] BUG: unable to handle kernel paging request at fffff30207b85ee0
Dec 17 21:00:39 TVHeadend kernel: [ 662.844241] IP: [<ffffffffb8b47970>] __list_del_entry+0x0/0xb0
Dec 17 21:00:39 TVHeadend kernel: [ 662.847220] PGD 11f7f6067
Dec 17 21:00:39 TVHeadend kernel: [ 662.847275] PUD 0
Dec 17 21:00:39 TVHeadend kernel: [ 662.850253]
Dec 17 21:00:39 TVHeadend kernel: [ 662.853271] Oops: 0000 [#1] SMP

I attached the complete Oops plus the log with hardware information for my machine in case this is useful.
Apparently something's going badly awry here. I have no idea what's happening and why. Could anyone please support how to find out the reason for these crashes, be it tvh or the kernel, and how to proceed? A machine intended for doing unattended video recordings that crashes regularly has a very limited value and is of course rather frustrating... :-/

THX

Don

Kernel crash.txt (31.7 KB) Kernel crash.txt Kernel Oops Peter Brockamp, 2017-12-17 22:50
Boot.txt (66 KB) Boot.txt Boot sequence Peter Brockamp, 2017-12-17 22:50

History

#1 Updated by saen acro about 1 year ago

Do you use transcoding?

#2 Updated by Mark Clarkstone about 1 year ago

If you can install the debug packages (tvheadend-dbg) and get a full crash log that may show where the problem is (see Debugging & Traces).

#3 Updated by Mark Clarkstone about 1 year ago

  • Subject changed from Expieriencing regular crashes to Experiencing regular crashes

#4 Updated by Peter Brockamp 12 months ago

saen acro wrote:

Do you use transcoding?

You mean the output format for storing records? I tried both, pass thru and MKV, both behave identical, it crashes.

Some additional remarks:
  • The Kernel Oops I posted happened only once. In all other cases so far the machine completely froze.
  • Could it be that what I receive as "freezing" is in fact a kernel panic? As said I'm running the machine headless. For testing purposes I attached an LCD but got not picture after the crash. But maybe this is due to the screen being blanked due to powersafe timeout before the crash (if the kernel does not reenable the screen in case of a panic)? I added "consoleblank=0" to my grub configuration, let's see if this changes something.
  • After the latest updates freezing became even more nagging than previously, the machine hangs several times per day now, I could not experience a situation where it ran for a couple of days! So the problem is very present.
  • Crashing could happen during silent phases (AKA machine is currently idling, nobody's watching something, and only some recording timer is set), during an active recording and also while somebody's watching a recording. So this is quite weird and unspecific.

THX

Don

#5 Updated by Peter Brockamp 12 months ago

Mark Clarkstone wrote:

If you can install the debug packages (tvheadend-dbg) and get a full crash log that may show where the problem is (see Debugging & Traces).

Will try that and report. Please hold on.

THX

Don

#6 Updated by Peter Brockamp 12 months ago

Peter Brockamp wrote:

Some additional remarks:
  • Could it be that what I receive as "freezing" is in fact a kernel panic? As said I'm running the machine headless. For testing purposes I attached an LCD but got not picture after the crash. But maybe this is due to the screen being blanked due to powersafe timeout before the crash (if the kernel does not reenable the screen in case of a panic)? I added "consoleblank=0" to my grub configuration, let's see if this changes something.
Now here's the result:
  • "consoleblank=0" inhibits the blanking of the console and after a crash now there is a picture.
  • However there is no Kernel Panic shown, I simply see the login prompt of the first tty.
  • There is an underscore of the cursor, but it stopped blinking. So this in fact is a complete machine freeze, WTF! :(

So this again looks more like a hardware problem?? Can hardly imagine Linux crashes that lousy...

THX

Don

#7 Updated by Peter Brockamp 12 months ago

Mark Clarkstone wrote:

If you can install the debug packages (tvheadend-dbg) and get a full crash log that may show where the problem is (see Debugging & Traces).

Is there anything additional I have to do to enable logging? It seems to me as if I don't get any additional log entries. Things all look pretty normal, except that the machine suddenly freezes without any prior log entry hinting at some problem. It simply dies :(

Once during my tests I had a strange behavior, the machine went into full RAM load, started swapping like hell and didn't stop until I manually restarted tvh. In the log I saw this:


Dec 21 22:11:35 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: H264
#5111 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/Das Erste HD: H264 #5101 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/Das Erste HD: MPEG2AUDIO
#5102 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/Das Erste HD: MPEG2AUDIO #5103 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: MPEG2AUDIO
#5112 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/arte HD: MPEG2AUDIO #5113 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: TELETEXT
#5114 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/arte HD: MPEG2AUDIO #5117 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/Das Erste HD: TELETEXT
#5104 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/Das Erste HD: DVBSUB #5105 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: DVBSUB
#5118 Continuity counter error (total 1)
Dec 21 22:11:35 TVHeadend tvheadend1041: tsfix: transport stream AC3, DTS discontinuity. DTS = 108380160, last = 108285120
Dec 21 22:11:35 TVHeadend tvheadend1041: tsfix: transport stream AC3, DTS discontinuity. DTS = 462211, last = 367171
Dec 21 22:11:41 TVHeadend tvheadend1041: tsfix: transport stream AC3, DTS discontinuity. DTS = 1015171, last = 920131
Dec 21 22:11:41 TVHeadend tvheadend1041: tsfix: transport stream AC3, DTS discontinuity. DTS = 108944640, last = 108849600
Dec 21 22:11:46 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/arte HD: H264 #5111 Continuity counter error (total 3)
Dec 21 22:11:46 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/Das Erste HD: H264
#5101 Continuity counter error (total 3)
Dec 21 22:11:46 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/arte HD: MPEG2AUDIO #5112 Continuity counter error (total 3)
Dec 21 22:11:46 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: MPEG2AUDIO
#5113 Continuity counter error (total 3)
Dec 21 22:11:46 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/arte HD: AC3 #5116 Continuity counter error (total 1)
Dec 21 22:11:46 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/Das Erste HD: AC3
#5106 Continuity counter error (total 1)
Dec 21 22:11:46 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/Das Erste HD: TELETEXT #5104 Continuity counter error (total 3)
Dec 21 22:11:46 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: MPEG2AUDIO
#5117 Continuity counter error (total 3)
Dec 21 22:11:46 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/Das Erste HD: MPEG2AUDIO #5102 Continuity counter error (total 2)
Dec 21 22:11:46 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: TELETEXT
#5114 Continuity counter error (total 3)
Dec 21 22:11:46 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/Das Erste HD: MPEG2AUDIO #5103 Continuity counter error (total 2)
Dec 21 22:11:47 TVHeadend tvheadend[1041]: tsfix: transport stream AC3, DTS discontinuity. DTS = 1510531, last = 1415491
Dec 21 22:11:47 TVHeadend tvheadend[1041]: tsfix: transport stream AC3, DTS discontinuity. DTS = 109440000, last = 109344960
Dec 21 22:11:47 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: DVBSUB
#5118 Continuity counter error (total 3)
Dec 21 22:11:51 TVHeadend tvheadend1041: tsfix: transport stream AC3, DTS discontinuity. DTS = 109866240, last = 109771200
Dec 21 22:11:52 TVHeadend tvheadend1041: tsfix: transport stream AC3, DTS discontinuity. DTS = 1994371, last = 1841731
Dec 21 22:11:55 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/Das Erste HD: DVBSUB #5105 Continuity counter error (total 2)
Dec 21 22:11:56 TVHeadend tvheadend[1041]: tsfix: transport stream AC3, DTS discontinuity. DTS = 2339971, last = 2244931
Dec 21 22:11:56 TVHeadend tvheadend[1041]: tsfix: transport stream AC3, DTS discontinuity. DTS = 110269440, last = 110174400
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: /media/tvheadend/video/Liebe ist das perfekte Verbrechen/Liebe ist das perfekte Verbrechen.mkv from adapter: "SAT>IP DVB-S Tuner #3 (192.168.178.45)", network: "Astra 19.2e", mux: "11493.75H", provider: "ARD", service: "arte HD"
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: # type lang resolution aspect ratio sample rate channels
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: 1 H264 1280x720 16:9
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: 2 MPEG2AUDIO ger 48000 2
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: 3 MPEG2AUDIO fre 48000 2
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: 5 DVBSUB ger
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: 6 AC3 mul 48000 5.1
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: 7 MPEG2AUDIO mis 48000 2
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: 8 DVBSUB fre
Dec 21 22:12:00 TVHeadend tvheadend[1041]: dvr: 9 DVBSUB ger
Dec 21 22:12:00 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/Das Erste HD: H264
#5101 Continuity counter error (total 6)
Dec 21 22:12:00 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/arte HD: H264 #5111 Continuity counter error (total 6)
Dec 21 22:12:00 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/Das Erste HD: MPEG2AUDIO
#5102 Continuity counter error (total 5)
Dec 21 22:12:00 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/Das Erste HD: MPEG2AUDIO #5103 Continuity counter error (total 5)
Dec 21 22:12:00 TVHeadend tvheadend[1041]: TS: Astra 19.2e/11493.75H/arte HD: MPEG2AUDIO
#5117 Continuity counter error (total 6)
Dec 21 22:12:00 TVHeadend tvheadend1041: TS: Astra 19.2e/11493.75H/arte HD: TELETEXT #5114 Continuity counter error (total 6)

This went on for a while, then switching to lots of entries like this:


Dec 21 22:49:18 TVHeadend tvheadend[1041]: subscription: 000D: service instance is bad, reason: Tuning failed
Dec 21 22:49:18 TVHeadend tvheadend[1041]: subscription: 000D: "DVR: Liebe ist das perfekte Verbrechen" subscribing on channel "arte HD", weight: 300, adapter: "SAT>IP DVB-S Tuner #3 (192.168.178.45)", network: "Astra 19.2e", mux: "11493.75H", provider: "ARD", service: "arte HD", profile="matroska"
Dec 21 22:49:18 TVHeadend tvheadend[1041]: subscription: 000C: service instance is bad, reason: Tuning failed
Dec 21 22:49:18 TVHeadend tvheadend[1041]: mpegts: 12460.5H in Astra 19.2e - tuning on SAT>IP DVB-S Tuner #4 (192.168.178.45)
Dec 21 22:49:18 TVHeadend tvheadend[1041]: subscription: 000C: "DVR: The 100" subscribing on channel "SIXX", weight: 300, adapter: "SAT>IP DVB-S Tuner #4 (192.168.178.45)", network: "Astra 19.2e", mux: "12460.5H", provider: "ProSiebenSat.1", service: "SIXX", profile="matroska"
Dec 21 22:49:18 TVHeadend tvheadend[1041]: subscription: 000B: service instance is bad, reason: Tuning failed
Dec 21 22:49:18 TVHeadend tvheadend[1041]: subscription: 000B: "DVR: Nuhr 2017 - Der Jahresrückblick" subscribing on channel "Das Erste HD", weight: 300, adapter: "SAT>IP DVB-S Tuner #3 (192.168.178.45)", network: "Astra 19.2e", mux: "11493.75H", provider: "ARD", service: "Das Erste HD", profile="matroska"
Dec 21 22:49:18 TVHeadend tvheadend[1041]: subscription: 0008: service instance is bad, reason: Tuning failed
Dec 21 22:49:18 TVHeadend tvheadend[1041]: mpegts: 12187.5H in Astra 19.2e - tuning on SAT>IP DVB-S Tuner #2 (192.168.178.45)
Dec 21 22:49:18 TVHeadend tvheadend[1041]: subscription: 0008: "192.168.178.43 [ glotzer | Kodi Media Center ]" subscribing on channel "VOX", weight: 150, adapter: "SAT>IP DVB-S Tuner #2 (192.168.178.45)", network: "Astra 19.2e", mux: "12187.5H", provider: "CBC", service: "VOX", profile="htsp", hostname="192.168.178.43", username="glotzer", client="Kodi Media Center"
Dec 21 22:49:18 TVHeadend tvheadend[1041]: satip: SAT>IP DVB-S Tuner #2 (192.168.178.45) - RTSP error -111 (Verbindungsaufbau abgelehnt) [6-0]
Dec 21 22:49:18 TVHeadend tvheadend[1041]: satip: SAT>IP DVB-S Tuner #4 (192.168.178.45) - RTSP error -111 (Verbindungsaufbau abgelehnt) [6-0]
Dec 21 22:49:20 TVHeadend tvheadend[1041]: subscription: 000C: service instance is bad, reason: Tuning failed
Dec 21 22:49:20 TVHeadend tvheadend[1041]: mpegts: 12460.5H in Astra 19.2e - tuning on SAT>IP DVB-S Tuner #5 (192.168.178.45)
Dec 21 22:49:20 TVHeadend tvheadend[1041]: subscription: 000C: "DVR: The 100" subscribing on channel "SIXX", weight: 300, adapter: "SAT>IP DVB-S Tuner #5 (192.168.178.45)", network: "Astra 19.2e", mux: "12460.5H", provider: "ProSiebenSat.1", service: "SIXX", profile="matroska"
Dec 21 22:49:20 TVHeadend tvheadend[1041]: subscription: 0008: service instance is bad, reason: Tuning failed
Dec 21 22:49:20 TVHeadend tvheadend[1041]: mpegts: 12187.5H in Astra 19.2e - tuning on SAT>IP DVB-S Tuner #6 (192.168.178.45)
Dec 21 22:49:20 TVHeadend tvheadend[1041]: subscription: 0008: "192.168.178.43 [ glotzer | Kodi Media Center ]" subscribing on channel "VOX", weight: 150, adapter: "SAT>IP DVB-S Tuner #6 (192.168.178.45)", network: "Astra 19.2e", mux: "12187.5H", provider: "CBC", service: "VOX", profile="htsp", hostname="192.168.178.43", username="glotzer", client="Kodi Media Center"
Dec 21 22:49:28 TVHeadend tvheadend[1041]: subscription: 000D: service instance is bad, reason: No input detected
Dec 21 22:49:28 TVHeadend tvheadend[1041]: subscription: 000D: "DVR: Liebe ist das perfekte Verbrechen" subscribing on channel "arte HD", weight: 300, adapter: "SAT>IP DVB-S Tuner #3 (192.168.178.45)", network: "Astra 19.2e", mux: "11493.75H", provider: "ARD", service: "arte HD", profile="matroska"
Dec 21 22:49:28 TVHeadend tvheadend[1041]: subscription: 000B: service instance is bad, reason: No input detected
Dec 21 22:49:28 TVHeadend tvheadend[1041]: subscription: 000B: "DVR: Nuhr 2017 - Der Jahresrückblick" subscribing on channel "Das Erste HD", weight: 300, adapter: "SAT>IP DVB-S Tuner #3 (192.168.178.45)", network: "Astra 19.2e", mux: "11493.75H", provider: "ARD", service: "Das Erste HD", profile="matroska"
Dec 21 22:49:38 TVHeadend tvheadend[1041]: subscription: 000D: service instance is bad, reason: No input detected

And finally filling the log with


Dec 21 22:50:22 TVHeadend tvheadend[1041]: subscription: 000D: No input source available for subscription "DVR: Liebe ist das perfekte Verbrechen" to channel "arte HD"
Dec 21 22:50:24 TVHeadend tvheadend[1041]: subscription: 000D: No input source available for subscription "DVR: Liebe ist das perfekte Verbrechen" to channel "arte HD"
Dec 21 22:50:26 TVHeadend tvheadend[1041]: subscription: 000D: No input source available for subscription "DVR: Liebe ist das perfekte Verbrechen" to channel "arte HD"

and


Dec 22 07:23:25 TVHeadend tvheadend[901]: mpegts: too much queued table input data (over 2MB), discarding new
Dec 22 07:23:35 TVHeadend tvheadend[901]: mpegts: too much queued table input data (over 2MB), discarding new
Dec 22 07:23:45 TVHeadend tvheadend[901]: mpegts: too much queued table input data (over 2MB), discarding new
Dec 22 07:23:55 TVHeadend tvheadend[901]: mpegts: too much queued table input data (over 2MB), discarding new
Dec 22 07:24:06 TVHeadend tvheadend[901]: mpegts: too much queued table input data (over 2MB), discarding new
Dec 22 07:24:16 TVHeadend tvheadend[901]: mpegts: too much queued table input data (over 2MB), discarding new

This effect I only had once, seems like tvh was going berserk then.

What is relly annoying is that kern.log is spammed every ten seconds with


Jan 1 06:25:39 TVHeadend kernel: [60253.009168] i2c i2c-0: sendbytes: NAK bailout.
Jan 1 06:25:39 TVHeadend kernel: [60253.013559] i2c i2c-0: sendbytes: NAK bailout.
Jan 1 06:25:39 TVHeadend kernel: [60253.017926] i2c i2c-0: sendbytes: NAK bailout.
Jan 1 06:25:39 TVHeadend kernel: [60253.022374] i2c i2c-0: sendbytes: NAK bailout.
Jan 1 06:25:39 TVHeadend kernel: [60253.026616] i2c i2c-0: sendbytes: NAK bailout.
Jan 1 06:25:39 TVHeadend kernel: [60253.030364] [drm:radeon_vga_detect [radeon]] *ERROR* VGA-1: probed a monitor but no|invalid EDID

But no debug related log entries at all. Of course I enabled debug in Configuration\Debugging\}Configuration (and I'm aware this only lasts until the restarting tvh). May I have to uninstall the non-debug version of tvh in order to see debug output?

Again I bacame more suspicious that the machine dies due to a hardware issue. I will start a long run memtest run tomorrow in the hope to locate some problem. We'll see...

THX

Don

#8 Updated by Mark Clarkstone 12 months ago

Peter Brockamp wrote:

..snip..
Again I bacame more suspicious that the machine dies due to a hardware issue. I will start a long run memtest run tomorrow in the hope to locate some problem. We'll see...

The i2c errors are very suspect! If tvh is crashing and dropping a core (crash lines) you can install corekeeper to grab it & analyse it later.

#9 Updated by Peter Brockamp 12 months ago

Mark Clarkstone wrote:

The i2c errors are very suspect! If tvh is crashing and dropping a core (crash lines) you can install corekeeper to grab it & analyse it later.

I doubt this is really critical. If you google for "i2c i2c-0: sendbytes: NAK bailout" you'll find a couple of hits referring to e.g. noveau or other graphics driver stuff. Seems to occur under different conditions but obviously not critial or fatal ones. Unfortunately I could not yet find out definitely what the device 0 (i2c-0) in my case is, so this kernel message is not very enligthening.

However I'm doing an educated guess:
Reading the capabilities of a display (EDID) is also being done via an i2c bus. This works already for legacy VGA connectors (the first EDID specification is from 1994!), see here:
[[http://www.chrisbot.com/i2c-and-monitor-ddc.html]]
So I'm assuming the i2c message simply tell's me that the kernel attempts to get the monitor capabilities but as no monitor is connected of course it get's no reply. Thus it runs into a couple of timeouts and finally gives up with "probed a monitor but no|invalid EDID". This would be no problem if not the kernel repeated this attempts (and logs it) every 10 seconds. Retrying would be OK in order to detect an monitor as soons as it's being connected, but continuously spamming the log is a real nag.

The only strange point about it is that according to the specification reading EDID parameters should be possible also with a monitor connected but switched off (the device willing to send a video signal supplies 5V for reading EDID). I asume it should also be able to distinguish "switched off" from "not connected" and deal with that appropriately. Due to the crashes I currently don't run completely headless but I have a monitor attached which is most of the times either switched off or set to another input. Nevertheless the reading of the DDC fails which should not happen as I understand it. Could be that my monitor is not completely standard conformant and this is causing this strange behavior. Whatever is the case, I don't think this messages are connected to the crashes. It would be good to have at least a kernel parameter to shut down the spaming as those log messages have no value and I cannot fix the triggering problem.

Cheerz

#10 Updated by Peter Brockamp 12 months ago

Just started a long run memtest on the machine now with all CPU cores active for maximum stress and load on RAM. The first run passed succesfully without errors. CPU temperature partially rises above 70°C, fan regulation audibly kicks in, increases cooling efforts and brings the temparature rise to a halt. So this is all to be expected and seem to work OK and as designed. Will see how the long term progress is, I plan to have memtest running for at least two days or even more.

If this does not uncover any problems I will try to disable those hardware drivers that do not hamper current operation. E. g. the machine has an onboard firewire controller which is not used in the tvh scenario. I could disable it to check if this has an impact on crashes. Same the video driver, here I could fallback to the standard FB driver. And so on. However I even have no PCI cards plugged in, just the naked mainboard running, the onboard components are by far sufficient here and it's not a noname board, so things should be pretty stable.

I will also try to strip down BIOS settings to the most conservative ones. Annoyingly since a couple of years board vendors started to press users into overclocking components, even major brands. This really pisses me off, there's no obvious setting "F... off, I want no bloody OC, just give me stable, power saving options", you have to figure out for yourself and set it manually :(

#11 Updated by Peter Brockamp 11 months ago

Well, finished the Memtest run but it did not reveal any problems. No RAM errors detected and the machine did not crash or freeze...

#12 Updated by Mark Clarkstone 11 months ago

Peter Brockamp wrote:

Well, finished the Memtest run but it did not reveal any problems. No RAM errors detected and the machine did not crash or freeze...

This is a very strange one. If you can, install corekeeper & the tvheadend-dbg package. Paste the full trace lines here from the crash. Take a look at Debugging.

#13 Updated by Peter Brockamp 11 months ago

Mark Clarkstone wrote:

This is a very strange one. If you can, install corekeeper & the tvheadend-dbg package. Paste the full trace lines here from the crash. Take a look at Debugging.

OK, thanks for instructing:
  • I installed corekeeper - is there anything else I have to configure? man corekeeper seems not to exist...
  • I installed tvheadend-dbg - is it correct to keep tvheadend installed or must I uninstall it?
    apt list --installed
    tvheadend/stretch,now 4.2.5-14~g1665c355c~stretch amd64 [installiert]
    tvheadend-dbg/stretch,now 4.2.5-14~g1665c355c~stretch amd64 [installiert]
  • I globally switched on debugging for tvheadend in etc/default/tvheadend:
    TVH_DEBUG=1
    TVH_ARGS="-D -l /var/log/tvheadend"

    Any further command line arguments required?
  • /etc/security/limits.d/corekeeper.conf has both, hard and soft limit set to unlimited and I gave the daemon the "-D" parameter - is that sufficient to permanently enable debugging mode and core dumps?

Will try "-A" and see what is being dumped on my HDD...

Cheerz

#14 Updated by Peter Brockamp 11 months ago

Successfully configured the system to write coredumps:

ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 14460
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 14460
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Testing with -A what happens in case of a crash I get:

@service tvheadend status
● tvheadend.service
Loaded: loaded (/etc/init.d/tvheadend; generated; vendor preset: enabled)
Active: active (exited) since Sun 2018-01-14 20:56:28 CET; 43s ago
Docs: man:systemd-sysv-generator(8)
Process: 8409 ExecStop=/etc/init.d/tvheadend stop (code=exited, status=0/SUCCESS)
Process: 8416 ExecStart=/etc/init.d/tvheadend start (code=exited, status=0/SUCCESS)

Jan 14 20:56:30 TVHeadend tvheadend8423: scanfile: DVB-C - loaded 18 regions with 58 networks
Jan 14 20:56:30 TVHeadend tvheadend8423: scanfile: ATSC-T - loaded 2 regions with 11 networks
Jan 14 20:56:30 TVHeadend tvheadend8423: scanfile: ATSC-C - loaded 1 regions with 5 networks
Jan 14 20:56:30 TVHeadend tvheadend8423: scanfile: ISDB-T - loaded 2 regions with 1297 networks
Jan 14 20:56:30 TVHeadend tvheadend8423: CRASH: /project/repo/checkout/src/trap.c:148 0x55e7698096da 0x55e769608000
Jan 14 20:56:30 TVHeadend tvheadend8423: CRASH: ??:0 0x7fadd8c8d0c0 0x7fadd8c7c000
Jan 14 20:56:30 TVHeadend tvheadend8423: CRASH: gsignal+0xcf (/lib/x86_64-linux-gnu/libc.so.6)
Jan 14 20:56:30 TVHeadend tvheadend8423: CRASH: abort+0x16a (/lib/x86_64-linux-gnu/libc.so.6)
Jan 14 20:56:31 TVHeadend tvheadend8423: CRASH: /project/repo/checkout/src/main.c:1267 0x55e7697c5ed6 0x55e769608000
Jan 14 20:56:31 TVHeadend tvheadend8423: CRASH: __libc_start_main+0xf1 (/lib/x86_64-linux-gnu/libc.so.6)@

So this should be OK and I also do get coredumps in /tmp including debug informations:

2018-01-14 20:56:30.336 [ DEBUG]:epg: arm channel timer 2018-01-14;21:45:00(+0100) for ONE HD
2018-01-14 20:56:30.336 [ DEBUG]:epg: now/next 3149681/3149683 set on ONE HD
2018-01-14 20:56:30.336 [ DEBUG]:epg: inform HTSP of now event change on ONE HD
2018-01-14 20:56:30.336 [ DEBUG]:epg: arm channel timer 2018-01-14;21:45:00(+0100) for ONE HD
2018-01-14 20:56:30.339 [ INFO]:epgdb: loaded v2
2018-01-14 20:56:30.339 [ INFO]:epgdb: config 1
2018-01-14 20:56:30.339 [ INFO]:epgdb: brands 0
2018-01-14 20:56:30.339 [ INFO]:epgdb: seasons 0
2018-01-14 20:56:30.339 [ INFO]:epgdb: episodes 42893
2018-01-14 20:56:30.339 [ INFO]:epgdb: broadcasts 39850
2018-01-14 20:56:30.674 [ NOTICE]:START: HTS Tvheadend version 4.2.5-14~g1665c355c started, running as PID:8423 UID:110 GID:44, C$
2018-01-14 20:56:30.674 [ ALERT]:CRASH: Signal: 6 in PRG: /usr/bin/tvheadend (4.2.5-14~g1665c355c) [c95a251627e0455abd7ae5b43747$
2018-01-14 20:56:30.674 [ ALERT]:CRASH: Fault address 0x6e000020e7 (N/A)
2018-01-14 20:56:30.674 [ ALERT]:CRASH: Loaded libraries: linux-vdso.so.1 /usr/lib/libdvben50221.so /usr/lib/libdvbapi.so /usr/l$
2018-01-14 20:56:30.677 [ INFO]:htsp: Got connection from 192.168.178.43
2018-01-14 20:56:30.677 [ ALERT]:CRASH: Register dump [23]: 000000000000000000007ffe4aba7460000000000000000800000000000002460000$
2018-01-14 20:56:30.677 [ ALERT]:CRASH: STACKTRACE
2018-01-14 20:56:30.677 [ INFO]:htsp: 192.168.178.43: Welcomed client software: Kodi Media Center (HTSPv25)
2018-01-14 20:56:30.682 [ INFO]:htsp: 192.168.178.43 [ Kodi Media Center ]: Identified as user 'glotzer'
2018-01-14 20:56:30.682 [ INFO]:htsp: 192.168.178.43 [ glotzer | Kodi Media Center ]: Privileges updated
2018-01-14 20:56:30.795 [ INFO]:scanfile: DVB-S - loaded 1 regions with 113 networks
2018-01-14 20:56:30.795 [ INFO]:scanfile: DVB-T - loaded 44 regions with 1113 networks
2018-01-14 20:56:30.795 [ INFO]:scanfile: DVB-C - loaded 18 regions with 58 networks
2018-01-14 20:56:30.795 [ INFO]:scanfile: ATSC-T - loaded 2 regions with 11 networks
2018-01-14 20:56:30.795 [ INFO]:scanfile: ATSC-C - loaded 1 regions with 5 networks
2018-01-14 20:56:30.795 [ INFO]:scanfile: ISDB-T - loaded 2 regions with 1297 networks
2018-01-14 20:56:30.826 [ ALERT]:CRASH: /project/repo/checkout/src/trap.c:148 0x55e7698096da 0x55e769608000
2018-01-14 20:56:30.979 [ ALERT]:CRASH: ??:0 0x7fadd8c8d0c0 0x7fadd8c7c000
2018-01-14 20:56:30.979 [ ALERT]:CRASH: gsignal+0xcf (/lib/x86_64-linux-gnu/libc.so.6)
2018-01-14 20:56:30.980 [ ALERT]:CRASH: abort+0x16a (/lib/x86_64-linux-gnu/libc.so.6)
2018-01-14 20:56:31.126 [ ALERT]:CRASH: /project/repo/checkout/src/main.c:1267 0x55e7697c5ed6 0x55e769608000
2018-01-14 20:56:31.126 [ ALERT]:CRASH: __libc_start_main+0xf1 (/lib/x86_64-linux-gnu/libc.so.6)

However what is being logged in /var/log/debug and daemon is really huge, it dumped 170MB each in both files just today and the day's not over.... :-O
So I'll have to switch that off quickly before it fully flooded my HDD and I also have no idea how I should supply the log in case someone's asking for it. There also seems to be a lot of stuff in there that's not relevant for my problem, summing upto around 10 screen pages being logged every second:
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF0 (28400) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF1 (28401) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF2 (28402) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF3 (28403) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF4 (28404) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF5 (28405) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF6 (28406) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF7 (28407) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6EF8 (28408) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F03 (28419) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F04 (28420) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F05 (28421) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F06 (28422) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F07 (28423) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F08 (28424) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F0C (28428) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F0D (28429) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F0E (28430) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F0F (28431) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F10 (28432) running 4 free_ca 0
Jan 14 14:04:07 TVHeadend tvheadend[933]: tbl-base: sdt: sid 6F11 (28433) running 4 free_ca 0

Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S 19.2E freq 11538000 V sym 22000000 fec 5/6 mod QPSK roff 35 is_$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 03FE (1022) mux 11538V in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S2 19.2E freq 11552750 H sym 22000000 fec 2/3 mod PSK/8 roff 35 i$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 03FF (1023) mux 11552.75H in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S2 19.2E freq 11582250 H sym 22000000 fec 2/3 mod PSK/8 roff 35 i$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 0401 (1025) mux 11582.25H in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S 19.2E freq 11597000 V sym 22000000 fec 5/6 mod QPSK roff 35 is_$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 0402 (1026) mux 11597V in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S 19.2E freq 11611750 H sym 22000000 fec 5/6 mod QPSK roff 35 is_$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 0403 (1027) mux 11611.75H in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S 19.2E freq 11626500 V sym 22000000 fec 5/6 mod QPSK roff 35 is_$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 0404 (1028) mux 11626.5V in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S2 19.2E freq 11670750 H sym 22000000 fec 2/3 mod PSK/8 roff 35 i$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 0407 (1031) mux 11670.75H in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S 19.2E freq 11685500 V sym 22000000 fec 5/6 mod QPSK roff 35 is_$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 0408 (1032) mux 11685.5V in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S2 19.2E freq 10964250 H sym 22000000 fec 2/3 mod PSK/8 roff 35 i$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 0409 (1033) mux 10964.25H in Astra 19.2e
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: DVB-S 19.2E freq 10979000 V sym 22000000 fec 5/6 mod QPSK roff 35 is_$
Jan 14 14:04:16 TVHeadend tvheadend[933]: tbl-base: nit: onid 0001 (1) tsid 040A (1034) mux 10979V in Astra 19.2e

I also find many entries like these that seem to point to some problem:

Jan 14 20:42:00 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:00 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:04 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:04 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:07 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:08 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:09 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:10 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:11 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:14 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:14 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:17 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:18 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:19 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:20 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:21 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:24 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:25 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 20:42:27 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new

It just started somewhen without any obvious reason:

Jan 14 14:25:26 TVHeadend kernel: [ 4909.496005] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:26 TVHeadend kernel: [ 4909.499026] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:26 TVHeadend kernel: [ 4909.502035] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:26 TVHeadend kernel: [ 4909.505105] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:26 TVHeadend kernel: [ 4909.508097] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:26 TVHeadend kernel: [ 4909.510610] [drm:radeon_vga_detect [radeon]] *ERROR* VGA-1: probed a monitor but no|invalid $
Jan 14 14:25:37 TVHeadend kernel: [ 4919.736379] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:37 TVHeadend kernel: [ 4919.739410] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:37 TVHeadend kernel: [ 4919.742429] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:37 TVHeadend kernel: [ 4919.745518] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:37 TVHeadend kernel: [ 4919.748508] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:37 TVHeadend kernel: [ 4919.751024] [drm:radeon_vga_detect [radeon]] *ERROR* VGA-1: probed a monitor but no|invalid $
Jan 14 14:25:39 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 14:25:42 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 14:25:47 TVHeadend kernel: [ 4929.976761] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:47 TVHeadend kernel: [ 4929.979786] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:47 TVHeadend kernel: [ 4929.982799] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:47 TVHeadend kernel: [ 4929.985875] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:47 TVHeadend kernel: [ 4929.988868] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:47 TVHeadend kernel: [ 4929.991391] [drm:radeon_vga_detect [radeon]] *ERROR* VGA-1: probed a monitor but no|invalid $
Jan 14 14:25:49 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 14:25:52 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 14:25:57 TVHeadend kernel: [ 4940.217171] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:57 TVHeadend kernel: [ 4940.220199] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:57 TVHeadend kernel: [ 4940.223218] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:57 TVHeadend kernel: [ 4940.226301] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:57 TVHeadend kernel: [ 4940.229295] i2c i2c-0: sendbytes: NAK bailout.
Jan 14 14:25:57 TVHeadend kernel: [ 4940.231818] [drm:radeon_vga_detect [radeon]] *ERROR* VGA-1: probed a monitor but no|invalid $
Jan 14 14:26:00 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 14:26:01 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 14:26:02 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new
Jan 14 14:26:07 TVHeadend tvheadend[933]: mpegts: too much queued table input data (over 2MB), discarding new

So let's see what it will dump when the machine freezes next time. Admittedly I'm pessimistic about seing anything useful. If the machine simply freezes it will have little opportunity to log anything...

Cheerz

#15 Updated by Peter Brockamp 11 months ago

Well, it already froze just a couple of minutes ago. Though debug is active daemon.log is no more enligthening. Crash here at around 22::10::30+, reset pressed at 22:15::30:

Jan 14 22:10:00 TVHeadend tvheadend[9442]: epg: expire event 3371651 (Dirty Dancing) from RTL Television
Jan 14 22:10:00 TVHeadend tvheadend[9442]: epg: now/next 3371913/3371915 set on RTL Television
Jan 14 22:10:00 TVHeadend tvheadend[9442]: epg: inform HTSP of now event change on RTL Television
Jan 14 22:10:00 TVHeadend tvheadend[9442]: epg: arm channel timer
2018-01-14;22:55:00(+0100) for RTL Television
Jan 14 22:10:27 TVHeadend tvheadend9442: epg: expire event 3372075 (ran Football: Jacksonville Jaguars at Pittsburgh Steelers ($
Jan 14 22:10:27 TVHeadend tvheadend9442: epg: now/next 3372141/3372143 set on Pro7 MAXX
Jan 14 22:10:27 TVHeadend tvheadend9442: epg: inform HTSP of now event change on Pro7 MAXX
Jan 14 22:10:27 TVHeadend tvheadend9442: epg: arm channel timer 2018-01-14;22:19:54(+0100) for Pro7 MAXX
Jan 14 22:10:32 TVHeadend tvheadend[9442]: epg: expire event 3370423 (Mord im Paradies) from SAT.1 Gold
Jan 14 22:10:32 TVHeadend tvheadend[9442]: epg: now/next 3370487/3370489 set on SAT.1 Gold
Jan 14 22:10:32 TVHeadend tvheadend[9442]: epg: inform HTSP of now event change on SAT.1 Gold
Jan 14 22:10:32 TVHeadend tvheadend[9442]: epg: arm channel timer
2018-01-14;23:09:51(+0100) for SAT.1 Gold
Jan 14 22:15:30 TVHeadend systemd1: Starting Flush Journal to Persistent Storage...
Jan 14 22:15:30 TVHeadend systemd1: Started Flush Journal to Persistent Storage.
Jan 14 22:15:30 TVHeadend systemd1: Started Set the console keyboard layout.
Jan 14 22:15:30 TVHeadend systemd1: Reached target Local File Systems (Pre).
Jan 14 22:15:30 TVHeadend systemd1: Found device RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (P8P67 and other mot$
Jan 14 22:15:30 TVHeadend systemd1: Reached target Sound Card.
Jan 14 22:15:30 TVHeadend systemd1: Found device SanDisk_SDSSDA120G 5.
Jan 14 22:15:30 TVHeadend systemd1: Activating swap /dev/disk/by-uuid/cfc97f77-4f0f-44d1-8bfd-1ed8d0e928a2...
Jan 14 22:15:30 TVHeadend systemd1: Activated swap /dev/disk/by-uuid/cfc97f77-4f0f-44d1-8bfd-1ed8d0e928a2.
Jan 14 22:15:30 TVHeadend systemd1: Reached target Swap.
Jan 14 22:15:30 TVHeadend systemd1: Found device ST2000DM001-1ER164 Video.
Jan 14 22:15:30 TVHeadend systemd1: Starting File System Check on /dev/disk/by-uuid/d169a2e4-9f66-4d40-8c13-dadd5ca642d7...
Jan 14 22:15:30 TVHeadend systemd1: Started File System Check Daemon to report status.
Jan 14 22:15:30 TVHeadend systemd-fsck348: Video: Journal wird wiederhergestellt
Jan 14 22:15:30 TVHeadend systemd-fsck348: Video: sauber, 6927/122101760 Dateien, 279789397/488378390 Blöcke
Jan 14 22:15:30 TVHeadend systemd1: Started File System Check on /dev/disk/by-uuid/d169a2e4-9f66-4d40-8c13-dadd5ca642d7.
Jan 14 22:15:30 TVHeadend systemd1: Mounting /media/tvheadend...
Jan 14 22:15:30 TVHeadend systemd1: Mounted /media/tvheadend.
Jan 14 22:15:30 TVHeadend systemd1: Reached target Local File Systems.
Jan 14 22:15:30 TVHeadend systemd1: Starting Create Volatile Files and Directories...
@

WTF???

In /var/crash I see coredumps, so this is operational on priciple. However these coredumps are only the ones I provocated with -A:
from /'usr/bin/tvheadend -f -u hts -g video -A -D -l /tmp/tvheadend -s'
in all files...

No cordumps for the machines freezes - Just as I already was suspecting...

#16 Updated by Mark Clarkstone 11 months ago

Peter Brockamp wrote:

... snip ...

Hmm, not sure what else to suggest here. You could try disabling all OTA EPG grabbers or perhaps give unstable a run (back up your config before upgrading).

#17 Updated by Peter Brockamp 11 months ago

Mark Clarkstone wrote:

Hmm, not sure what else to suggest here. You could try disabling all OTA EPG grabbers or perhaps give unstable a run (back up your config before upgrading).

Well, the next thing I will try to do is to attempt annalyzing the debug enriched logs for several freezes. Maybe I could espy some similarities in the prelude of the freezes. I simply inserted a mv command on the logfile in /etc/init.d/tvheadend/do_start(). E. g. if I can picture that each freeze is preluded bei some specific EPG grabber action this may be a hint in what direction to search.

Nevertheless I'm again suspicous I have a hardware problem here. I'm on SATIP, so there's no tuner plugged in my mobo and of course no hardware driver that could trigger a fatal failure. Just a TCP or UDP channel on the NIC and I expect this to be rock solid, Linux doesn't freeze out of the blue...

#18 Updated by Peter Brockamp 11 months ago

So, what I can see from the logs recorded is this:
  • It often aborts when arming the epg for a channel timer - but not always:
    2018-01-22 02:38:04.013 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:38:04.014 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:38:04.021 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:38:04.285 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:40:03.533 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:40:03.801 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:40:04.206 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:40:04.573 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:40:18.489 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:42:03.053 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:42:04.102 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:42:04.114 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:42:04.405 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:42:14.769 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:43:01.385 [ DEBUG]:epg: now/next 3480199/3441887 set on France 24 (in English)
    2018-01-22 02:43:01.385 [ DEBUG]:epg: inform HTSP of now event change on France 24 (in English)
    2018-01-22 02:43:01.385 [ DEBUG]:epg: arm channel timer
    2018-01-22;02:57:00(+0100) for France 24 (in English)
    2018-01-22 02:43:02.866 [ DEBUG]:epg: now/next 3480199/3480201 set on France 24 (in English)
    2018-01-22 02:43:02.866 [ DEBUG]:epg: inform HTSP of now event change on France 24 (in English)
    2018-01-22 02:43:02.866 [ DEBUG]:epg: arm channel timer 2018-01-22;02:57:00(+0100) for France 24 (in English)
    2018-01-22 02:43:02.866 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:43:03.553 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:43:03.854 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:43:03.902 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:02.881 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:02.913 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:04.286 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:04.313 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:13.049 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    ^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^$
  • Also consider the crap chars at the end of the log - which is not always present:
    2018-01-23 20:45:00.002 [ DEBUG]:epg: expire event 3313177 (Verkehr und Wetter) from B5 aktuell
    2018-01-23 20:45:00.002 [ DEBUG]:epg: now/next 3313179/3313181 set on B5 aktuell
    2018-01-23 20:45:00.002 [ DEBUG]:epg: inform HTSP of now event change on B5 aktuell
    2018-01-23 20:45:00.002 [ DEBUG]:epg: arm channel timer
    2018-01-23;20:50:00(+0100) for B5 aktuell
    2018-01-23 20:45:00.002 [ DEBUG]:epg: expire event 3274221 (Umschau) from MDR Thüringen HD
    2018-01-23 20:45:00.002 [ DEBUG]:epg: now/next 3274223/3274225 set on MDR Thüringen HD
    2018-01-23 20:45:00.002 [ DEBUG]:epg: inform HTSP of now event change on MDR Thüringen HD
    2018-01-23 20:45:00.002 [ DEBUG]:epg: arm channel timer 2018-01-23;21:15:00(+0100) for MDR Thüringen HD
    2018-01-23 20:45:01.006 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-23 20:47:17.666 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-23 20:47:31.543 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-23 20:50:00.002 [ DEBUG]:epg: expire event 3313179 (Nachrichten) from B5 aktuell
    2018-01-23 20:50:00.002 [ DEBUG]:epg: now/next 3313181/3313183 set on B5 aktuell
    2018-01-23 20:50:00.002 [ DEBUG]:epg: inform HTSP of now event change on B5 aktuell
    2018-01-23 20:50:00.002 [ DEBUG]:epg: arm channel timer
    2018-01-23;20:55:00(+0100) for B5 aktuell
    2018-01-23 20:50:00.002 [ DEBUG]:epg: expire event 3312663 (Das Forum) from NDR Info Spez.
    2018-01-23 20:50:00.002 [ DEBUG]:epg: now/next 3312665/3312667 set on NDR Info Spez.
    2018-01-23 20:50:00.002 [ DEBUG]:epg: inform HTSP of now event change on NDR Info Spez.
    2018-01-23 20:50:00.002 [ DEBUG]:epg: arm channel timer 2018-01-23;21:00:00(+0100) for NDR Info Spez.
    2018-01-23 20:50:00.002 [ DEBUG]:epg: expire event 3312747 (Das Forum) from NDR Info
    2018-01-23 20:50:00.002 [ DEBUG]:epg: now/next 3312749/3312751 set on NDR Info
    2018-01-23 20:50:00.002 [ DEBUG]:epg: inform HTSP of now event change on NDR Info
    2018-01-23 20:50:00.002 [ DEBUG]:epg: arm channel timer
    2018-01-23;21:00:00(+0100) for NDR Info
  • There are a lot of "tbl-eit: eit: completed pid 18 table 00000000 / 00000000" entries, locks somewhat suspicious to me.
  • Often this is "enriched" with a lot of queue table data overflowing:
    2018-01-23 18:08:39.137 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:39.630 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:39.969 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:40.069 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:47.454 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:49.049 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:49.197 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:49.849 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:50.089 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:50.449 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:57.453 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:59.150 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:59.177 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:08:59.849 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 18:09:00.190 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    ^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^$
  • Queue overflow drops in suddenly and unexpectedly for no apparent reason:
    2018-01-23 02:04:44.211 [ DEBUG]:psip: psip: grab started
    2018-01-23 02:04:44.211 [ DEBUG]:mpegts: 12544.75H in Astra 19.2e - open PID 0012 (18) [20/0x7f6238021f90]
    2018-01-23 02:04:44.211 [ DEBUG]:tbl-eit: eit: installed table handlers
    2018-01-23 02:04:44.211 [ DEBUG]:tbl-eit: eit: grab started
    2018-01-23 02:04:44.211 [ DEBUG]:mpegts: 12544.75H in Astra 19.2e - open PID tables subscription [0042/0x7f6238050990]
    2018-01-23 02:04:44.211 [ INFO]:subscription: 0025: "epggrab" subscribing to mux "12544.75H", weight: 4, adapter: "SAT>IP DVB-S Tuner #7 (192.168.178.45)", network: "Astra 19.2e", service: "Raw PID Subscription"
    2018-01-23 02:04:44.211 [ DEBUG]:epggrab: no OTA modules active for 12574.25H in Astra 19.2e, check again next time
    2018-01-23 02:04:44.211 [ DEBUG]:mpegts: 12603.75H in Astra 19.2e - add raw service
    2018-01-23 02:04:44.211 [ DEBUG]:service: 9: 12603.75H in Astra 19.2e si 0x7f6238024030 SAT>IP DVB-S Tuner #8 (192.168.178.45) weight 4 prio 11 error 0
    2018-01-23 02:04:44.211 [ DEBUG]:service: 8: 12603.75H in Astra 19.2e si 0x7f6238023f70 SAT>IP DVB-S Tuner #7 (192.168.178.45) weight 4 prio 11 error 0
    2018-01-23 02:04:44.211 [ DEBUG]:service: 7: 12603.75H in Astra 19.2e si 0x7f6238023eb0 SAT>IP DVB-S Tuner #6 (192.168.178.45) weight 4 prio 11 error 0
    2018-01-23 02:04:44.211 [ DEBUG]:service: 6: 12603.75H in Astra 19.2e si 0x7f6238023df0 SAT>IP DVB-S Tuner #5 (192.168.178.45) weight 4 prio 11 error 0
    2018-01-23 02:04:44.211 [ DEBUG]:service: 5: 12603.75H in Astra 19.2e si 0x7f6238023d30 SAT>IP DVB-S Tuner #4 (192.168.178.45) weight 4 prio 11 error 0
    2018-01-23 02:04:44.211 [ DEBUG]:service: 4: 12603.75H in Astra 19.2e si 0x7f6238023c70 SAT>IP DVB-S Tuner #3 (192.168.178.45) weight 4 prio 11 error 0
    2018-01-23 02:04:44.211 [ DEBUG]:service: 3: 12603.75H in Astra 19.2e si 0x7f6238023bb0 SAT>IP DVB-S Tuner #2 (192.168.178.45) weight 4 prio 11 error 0
    2018-01-23 02:04:44.211 [ DEBUG]:service: 2: 12603.75H in Astra 19.2e si 0x7f6238023af0 SAT>IP DVB-S Tuner #1 (192.168.178.45) weight 4 prio 11 error 0
    2018-01-23 02:04:52.571 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 02:04:54.274 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 02:04:56.106 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 02:05:02.611 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 02:05:04.254 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 02:05:06.047 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 02:05:11.258 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new
    2018-01-23 02:05:12.590 [WARNING]:mpegts: too much queued table input data (over 2MB), discarding new

    ... (tons of that lines to follow, flooding the log)
  • And as said, no real crash protocol, it simply aborts and /var/crash is virginal empty:
    2018-01-22 02:43:01.385 [ DEBUG]:epg: now/next 3480199/3441887 set on France 24 (in English)
    2018-01-22 02:43:01.385 [ DEBUG]:epg: inform HTSP of now event change on France 24 (in English)
    2018-01-22 02:43:01.385 [ DEBUG]:epg: arm channel timer
    2018-01-22;02:57:00(+0100) for France 24 (in English)
    2018-01-22 02:43:02.866 [ DEBUG]:epg: now/next 3480199/3480201 set on France 24 (in English)
    2018-01-22 02:43:02.866 [ DEBUG]:epg: inform HTSP of now event change on France 24 (in English)
    2018-01-22 02:43:02.866 [ DEBUG]:epg: arm channel timer 2018-01-22;02:57:00(+0100) for France 24 (in English)
    2018-01-22 02:43:02.866 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:43:03.553 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:43:03.854 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:43:03.902 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:02.881 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:02.913 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:04.286 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:04.313 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    2018-01-22 02:44:13.049 [ DEBUG]:tbl-eit: eit: completed pid 18 table 00000000 / 00000000
    ^
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^$
  • If I disable the epg grabbers as suggested this will result in the epg no longer being available, won't it? Is there a way to do a manual run to populate it? Going completely without the epg is not a comfortable work around...
  • Would disabling all epg threads except one be a reasonable test to do?
  • When trying unstable (can hardly imagine it being more unstable than stable on my machine :-( ), may I simply restore the backed up configuration or do I need to change anything?

THX

DOn

#19 Updated by Peter Brockamp 10 months ago

OK, my current status, what did I try so far:
  • I disabled all hardware drivers (integrated in the kernel as well as loadable drivers) that I don't necessarily need, freezes persist.
  • I checked if any firmware updates are available for any of my hardware (BIOS, HDD/SSD firmware, etc.). But I already had installed the latest versions of everything, so no changes resulted here.
  • I disable all power managment related functions, freezes persist.
  • I disabled all EPG grabbers as was suggested here. This test is currently still running.
    For all changes I always changed only one item and did a test again, so this is why it's taking so much time :-//
What else is left that I will test?
  • Set most conservative BIOS settings.
  • Disable all CPU cores except one.
  • Switch to the developer branch of TVH.

If these also don't lead to a change in behavior I will slowly be at a loss and am running out of ideas. But first let's see how things are evolving.

THX

Don

#20 Updated by Peter Brockamp 22 days ago

OK, my current status, what did I try so far:
...
If these also don't lead to a change in behavior I will slowly be at a loss and am running out of ideas. But first let's see how things are evolving.

So here is an update of what has happened meanwhile:
  • I was expiriencing a phase where one reboot lead to a very long stable phase, I got an uptime of more than 100 days. This is also the reason why I did not post any updates, I wanted to see how long this takes.
  • However after the next regular (non freeze induced) reboot the system started hanging regularly again as before.
  • The difference between the non-hanging phase and the hanging phases: Nothing at all! The system hung, I pressed Reset, It ran for >100 days, I had to reboot, it began hanging again.
  • Finally, as I got rather stuck, I opened a request at a debian mailing list. There I got one hint that might be the solution. A guy wrote that he had been expiriencing quite the same problems on his machine with some RAM that was non standard, it required 1.65V instead of the regular 1.5V. He had a board that could neither automatically nor manually supply this voltage, so it was constantly undervolted. After he exchanged the RAM against standard RAM all problems were gone. He also ran tests with memtest against the non standard RAM but memtest did not find any errors. Despite he proved by trial that the RAM was the culprit, so memtest is not ultimately reliable in uncovering RAM problems..
  • With this hint in mind I slowly started increasing my RAM Voltage. I did this despite the fact that my RAM was absolutely standard @1.5V, no OC gimmicks, from a brand supplier (Adata), bought at a quality store. It was a bit of an act of despair. However after approaching 1.65V it seems like the machine is suddenly getting stable (it wasn't stable @1.6V). "It seems" means that meanwhile the machine ran one week without freezing, afterwards I intentionally rebooted to avoid having caught just another incidentially stable phase, and afterwards it also has not been frezing since almost a week. I plan to do several reboots in the upcomming weeks to make sure this is not a chimera. If I can let's say make 4 or 5 reboots with an intermediate problem free uptime of several days I will start believing I could locate the problem and I will close this ticket as then it is in fact a hardware problem.

So please be patient until this results are available. I will come back to report.

ZHX

Don

#21 Updated by Joe User 21 days ago

It could be a power supply problem. I have had a few systems where intermittent lockups were caused by cheap/faulty or overloaded power supplies. I had one system which worked fine for everything including transcoding at 100% CPU for an hour, but a simple "find" command on a RAID 10 caused crashes. Replacing with a better quality (same rating) power supply solved the problem.

#22 Updated by Peter Brockamp 21 days ago

Joe User wrote:

It could be a power supply problem. I have had a few systems where intermittent lockups were caused by cheap/faulty or overloaded power supplies. I had one system which worked fine for everything including transcoding at 100% CPU for an hour, but a simple "find" command on a RAID 10 caused crashes. Replacing with a better quality (same rating) power supply solved the problem.

Sure, considered that as well. From my expierience harddisks are sometimes even more critical concerning power supply, maybe that's why you were expierienced freezes on accessing the RAID. Also consider that a seeking RAID with several large harddisks with many platters per disk could draw a considerable amount of current from the power supply, don't unerestimate that.

However here I bought a quality power supply from be quiet, 400W, 80+ gold, it's new (not worn out over years of operation), it's sufficiently large (my system does not put a large load on the power supply, probably even the 400W are grossly oversized) and the BIOS readings show no anomalies. Of course I know that BIOS readings show only tendency values and I can't see e. g. spikes, line noise, short brownouts, etc. But I'm not suspicious there's was something wrong with the power, otherwise I would've looked at the lines with an oscilloscope. Also the power line should not cause problems, the machine is supplied via an UPS.

Currently the system is still running flawlessly with the increased RAM voltage, no freezes since increasing it. I will reboot it around the upcoming weekend and see how it is further performing. Currently the RAM being the culprit seems to be the most promissing option.

THX

Don

#23 Updated by Peter Brockamp about 21 hours ago

Peter Brockamp wrote:

Currently the system is still running flawlessly with the increased RAM voltage, no freezes since increasing it. I will reboot it around the upcoming weekend and see how it is further performing. Currently the RAM being the culprit seems to be the most promissing option.

And this hasn't changed for the elapsed weeks since then. With the increased RAM voltage the system became apparently stable, no more hanging anymore. I also tried several intended graceful restarts to assure I didn't incidentially catch an exceptional stable phase of the machine. But no change after a couple of reboots, system causes no problems anymore. I think we can now close this bug, apparently it was hardware induced and not a software issue.

Will someone close this bug or can I do it myself? How?

THX

Don

#24 Updated by Mark Clarkstone about 19 hours ago

  • Status changed from New to Invalid

Also available in: Atom PDF