Bug #5909

Invalid xmltv

Added by Ryan Li about 1 month ago. Updated 6 days ago.

Status:
Fixed
Priority:
Normal
Assignee:
-
Category:
IPTV
Target version:
-
Start date:
2020-05-30
Due date:
% Done:

0%

Estimated time:
Found in version:
4.3-1880~g2af3b9e2e
Affected Versions:

Description

Tvheadend generates invalid XMLTV output because the EPG contains invalid characters.
http://152.67.119.71:9981/xmltv/channels


Files

channels.xml (753 KB) channels.xml Ryan Li, 2020-06-03 08:02
channels.xml (135 KB) channels.xml Ryan Li, 2020-06-03 09:28
channels.xml (697 KB) channels.xml Ryan Li, 2020-06-04 05:27

History

#1

Updated by Ryan Li about 1 month ago

Xteve doesn't accept the xml and in Tivimate half of the epg is messed up. The epg works fine in Kodi with the Tvheadend addon.

#2

Updated by saen acro about 1 month ago

Look for the problem in the xmltv parser of the client software.

#3

Updated by Ryan Li about 1 month ago

I'm not really sure what you mean. All my sources are just IPTV automatic and epg is grabbed automatically.
Some of my sources are http://138.19.127.53:9981/playlist and http://183.179.246.16:9981/playlist

#4

Updated by Ryan Li about 1 month ago

It's not the client's fault. I raised the issue with xteve and they told me that the xml is invalid.
https://github.com/xteve-project/xTeVe/issues/129

#5

Updated by saen acro about 1 month ago

Why then doesn't hls-proxy have such problems?
Maybe because its developer found a way to ignore such "invalid" situations.

#6

Updated by Ryan Li about 1 month ago

I mean Tivimate accepts the xml, but more than half of the entries are displayed at the wrong times and on the wrong channels. Does hls-proxy end up doing that?

#7

Updated by Ryan Li about 1 month ago

Also when you open the link in chrome it even tells you there's a problem. If you open valid xmltv like https://techzyon.com/epg/plutotv_guide.xml it would display fine.

#8

Updated by Flole Systems about 1 month ago

Update to latest master.

#9

Updated by Ryan Li about 1 month ago

Same thing on 4.2.9~pre+201911151752-0~built202003190218~git5bdcfd8~ubuntu20.04.1 (2020-03-19T02:18:20+0000)
http://152.67.97.3:9981/xmltv/channels

#10

Updated by Flole Systems about 1 month ago

Once again: Use latest master and not some old 4.2 version.

#11

Updated by Ryan Li about 1 month ago

I just realized that isn't the latest stable, I'll build it manually from the git. Just ignore my last comment.

#12

Updated by Ryan Li about 1 month ago

I've got 4.3-1880~g2af3b9e2e (2020-05-31T02:55:05+0000) built, I'm just setting it up now.

#13

Updated by Ryan Li about 1 month ago

Same thing is happening with 4.3-1880~g2af3b9e2e (2020-05-31T02:55:05+0000).
http://152.67.97.3:9981/xmltv/channels

#14

Updated by Flole Systems about 1 month ago

  • Found in version changed from 4.2.8 to 4.3-1880~g2af3b9e2e

As you can see in line 2332 (at the moment), there is part of the XML File missing.

<sub-title>AWKWAFINA (“The Farewell”)2121500 +0000" channel="5ea1c021ca9007f932fdb3679853461d">

You need to figure out why that is missing.

There is a character there that wasn't decoded properly (�, seems to be "Vertical Tab"), and that is causing the parser to fail for some reason. It should be filtered in htsbuf_append_and_escape_xml though.

#15

Updated by saen acro about 1 month ago

This is a problem in the client software's parser, not in TVH.
The problem may also be in the operator's EPG generator.

#16

Updated by Ryan Li about 1 month ago

I'm not an expert on this but I tried adding "“" and "”" like you suggested but make gives me errors.
Can the switch statement match non-ascii characters like these two symbols?

src/htsbuf.c: In function ‘htsbuf_append_and_escape_xml’:
src/htsbuf.c:371:10: error: multi-character character constant [-Werror=multichar]
  371 |     case '”':  esc = "&quot;"; break;
      |          ^~~~~
src/htsbuf.c:372:10: error: multi-character character constant [-Werror=multichar]
  372 |     case '”':  esc = "&quot;"; break;
      |          ^~~~~

#17

Updated by Ryan Li about 1 month ago

The source is probably just a DTT tuner that grabs the free-to-air TV channels. I'm not 100% sure since I just found the links online.

#18

Updated by Ryan Li about 1 month ago

I meant DTMB not DTT

#19

Updated by Flole Systems about 1 month ago

The vertical tab character should be ignored anyways, so I am not sure why that is not happening. Adding it to the list won't help and my initial assumption that it's the ” characters was wrong.

#20

Updated by Em Smith about 1 month ago

Try removing those case statements that you added. Then add:

case (char)0x8a: esc = " ";break;

This is not really the correct place to fix it since it should be fixed in whatever is giving you an EPG, but it should make your Xteve work.

Fixing that one character makes the file parseable. But, I expect in the future you may encounter some other bad character. So, you need to determine the bad character via:

curl http://152.67.97.3:9981/xmltv/channels | xmllint -

And this will give you a result such as below indicating the 0x8A is bad:

Bytes: 0x8A 0x4D 0x41 0x52
  <sub-title>AWKWAFINA (“The Farewell”)�MARIO LOPEZ (“Access Hollywood”)

#21

Updated by Flole Systems about 1 month ago

That won't help as in line 378 only

h < 0x20

is accepted, and 0x8a is clearly not lower than 0x20. So there must be something else wrong.

#22

Updated by Em Smith about 1 month ago

With my patch, we would not reach line 378 because we take the other branch due to setting the "esc" variable to do the mapping of the invalid character to a valid character.

#23

Updated by Flole Systems about 1 month ago

I think I see the issue now.

Please try changing 378 to

} else if ((h < 0x20 && h != 0x09 && h != 0x0a && h != 0x0d) || (h >= 0x7F && h <= 0x9F && h != 0x85) || h > 0x10FFFF) {

#24

Updated by Em Smith about 1 month ago

I don't think that change is correct.

A character is often eight bits, so comparisons to be greater than 0x10ffff will never be met. Also, a character may be either signed or unsigned depending on the platform so the other comparisons may not work either since CHAR_MAX may be 0x7f.

#25

Updated by Flole Systems about 1 month ago

In my tests h was 0xFFFFFF8A, which would obviously not match in your case. The way I implemented it now is close to the XML standard, so it's definitely worth a try to see what happens with this change. It should have a higher chance of succeeding than your patch either way.
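
The signedness point raised in #24 and the 0xFFFFFF8A value seen in #25 can be illustrated with a minimal sketch. It assumes a two's complement platform and uses signed char explicitly to model a signed plain char. The helper names (widen_signed, widen_unsigned, in_c1_range, sign_extension_check) are hypothetical and are not part of Tvheadend.

```c
#include <assert.h>

/* Widen a byte the way `int h = *c;` does on a platform where char is signed. */
int widen_signed(signed char byte)
{
  return byte;
}

/* Widen through unsigned char, so the result is always in 0..255. */
int widen_unsigned(signed char byte)
{
  return (unsigned char)byte;
}

/* The 0x7F..0x9F range test from the change proposed for line 378. */
int in_c1_range(int h)
{
  return h >= 0x7F && h <= 0x9F && h != 0x85;
}

/* Self-check using the byte reported in this ticket. */
void sign_extension_check(void)
{
  const signed char byte = -118;   /* same bit pattern as 0x8A in two's complement */

  assert(widen_signed(byte) == -118);          /* 0xFFFFFF8A as a 32-bit pattern */
  assert(!in_c1_range(widen_signed(byte)));    /* range test misses the byte */
  assert(widen_unsigned(byte) == 0x8A);
  assert(in_c1_range(widen_unsigned(byte)));   /* range test now sees 0x8A */
}
```

Reading bytes through unsigned char makes the comparison independent of whether plain char is signed. Such a test still works on single bytes, though, not on decoded code points.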

#26

Updated by Ryan Li about 1 month ago

It's somehow worse, almost all the Chinese characters and even the channel titles have turned into invalid characters.
http://152.67.97.3:9981/xmltv/channels

#27

Updated by Ryan Li about 1 month ago

Since it migrated the backup I'm going to reset everything and see how it goes again with the patch.

config: backup: migrating config from 4.3-1880~g2af3b9e2e (running 4.3-1880~g2af3b9e2e-dirty)

Here's the very bad xml file attached.

#28

Updated by saen acro about 1 month ago

Ryan Li wrote:

Since it migrated the backup I'm going to reset everything and see how it goes again with the patch. [...]
Here's the very bad xml file attached.

https://github.com/hiroshiyui/epgrab
Try extracting the EPG with this.
The operator's EPG is 100% wrongly generated.

#29

Updated by Ryan Li about 1 month ago

I don't have a DVB tuner myself so I can't use epgrab. All the sources are online, from other people using Tvheadend, and I don't know their setup.

Though I tried out

case (char)0x8a: esc = " ";break;

and the result is that all the Chinese characters are now gibberish.

#30

Updated by Em Smith about 1 month ago

Unfortunately your input data is bad, so the output data is bad.

As Saen correctly says, many EPG operators generate bad data, usually by mixing encodings. So many programs just allow bad data through and try several different fallback methods to display the data on the assumption that the first method to parse it will display it ok.

But your particular data has a handful of bad characters mixed with good data, and it doesn't seem to be due to encoding.

You can try to filter the data. Replace the function with the version below (whole function included to make it easier to copy+paste since you've been making modifications). This attempts to skip the bad characters but still allow multi-byte characters, so you shouldn't end up with gibberish. It's not neat code, but given your original xmltv output, it appears to parse it correctly and only replaces the few wrong entries (based on running it in a test harness).

void
htsbuf_append_and_escape_xml(htsbuf_queue_t *hq, const char *s)
{
  const char *c = s;
  const char *e = s + strlen(s);
  const char *esc = 0;
  int h;

  if(e == s)
    return;

  while(c<e) {
    h = *c++;

    if (h & 0x80) {
      // Start of utf-8.  But we sometimes have bad UTF-8 (#5909).
      // So validate and ignore bad characters.

      // Number of top bits set indicates the total number of  bytes.
      const int num_bytes =
        ((h & 240) == 240) ? 4 :
        ((h & 224) == 224) ? 3 :
        ((h & 192) == 192) ? 2 : 0;

      if (!num_bytes) {
        // Completely invalid sequence, so we replace it with a space.
        htsbuf_append(hq, s, c - s - 1);
        htsbuf_append(hq, " ", 1);
        s=c;
        continue;
      } else {
        // Start of valid UTF-8.
        if (e - c < num_bytes - 1) {
          // Invalid sequence - too few characters left in buffer for the sequence.
          // Append what we already have accumulated and ignore remaining characters.
          htsbuf_append(hq, s, c - s - 1);
          break;
        } else {
          // We should probably check each character in the range is also valid.
          htsbuf_append(hq, s, c - s - 1);
          htsbuf_append(hq, c-1, num_bytes);
          s=c-1;
          s+=num_bytes;
          c=s;
          continue;
        }
      }
    }

    switch(h) {
    case '<':  esc = "&lt;";   break;
    case '>':  esc = "&gt;";   break;
    case '&':  esc = "&amp;";  break;
    case '\'': esc = "&apos;"; break;
    case '"':  esc = "&quot;"; break;
    default:   esc = NULL;     break;
    }

    if(esc != NULL) {
      htsbuf_append(hq, s, c - s - 1);
      htsbuf_append_str(hq, esc);
      s = c;
    } else if (h < 0x20 && h != 0x09 && h != 0x0a && h != 0x0d) {
      /* allow XML 1.0 valid characters only */
      htsbuf_append(hq, s, c - s - 1);
      s = c;
    }

    if(c == e) {
      htsbuf_append(hq, s, c - s);
      break;
    }
  }
}
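
The comment in the function above notes that each byte of a multi-byte sequence should probably be checked as well. A minimal sketch of such a check follows, assuming a hypothetical helper named utf8_seq_len that is not part of Tvheadend. It validates the lead byte and the 10xxxxxx pattern of each continuation byte. It does not reject every overlong form or UTF-16 surrogate, so it is a partial validator only.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the length (1..4) of the UTF-8 sequence starting at p, or 0 if the
 * lead byte is invalid, the buffer is too short, or a continuation byte does
 * not have the form 10xxxxxx. `avail` is the number of bytes readable at p.
 */
size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
  size_t need, i;

  assert(p != NULL);
  if (avail == 0)
    return 0;

  if (p[0] < 0x80)
    need = 1;                               /* plain ASCII */
  else if (p[0] >= 0xC2 && p[0] <= 0xDF)
    need = 2;
  else if (p[0] >= 0xE0 && p[0] <= 0xEF)
    need = 3;
  else if (p[0] >= 0xF0 && p[0] <= 0xF4)
    need = 4;
  else
    return 0;                               /* stray continuation or invalid lead byte */

  if (avail < need)
    return 0;                               /* sequence truncated by end of buffer */

  for (i = 1; i < need; i++)
    if ((p[i] & 0xC0) != 0x80)
      return 0;                             /* not a continuation byte */

  return need;
}
```

With the bytes reported by xmllint in #20 (0x8A 0x4D 0x41 0x52), the helper returns 0 for the 0x8A byte and 1 for the 0x4D that follows, while the three bytes of the left double quotation mark (0xE2 0x80 0x9C) are accepted as one sequence.
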

#31

Updated by Flole Systems about 1 month ago

Flole Systems wrote:

I think I see the issue now.

Please try changing 378 to
[...]

Please test the patch I mentioned above. If it works I will merge it, if not we need to come up with a different solution.

#32

Updated by Ryan Li about 1 month ago

Em Smith wrote:

Unfortunately your input data is bad, so the output data is bad.

As Saen correctly says, many EPG operators generate bad data, usually by mixing encodings. So many programs just allow bad data through and try several different fallback methods to display the data on the assumption that the first method to parse it will display it ok.

But your particular data has a handful of bad characters mixed with good data, and it doesn't seem to be due to encoding.

You can try to filter the data. Replace the function with the version below (whole function included to make it easier to copy+paste since you've been making modifications). This attempts to skip the bad characters but still allow multi-byte characters, so you shouldn't end up with gibberish. It's not neat code, but given your original xmltv output, it appears to parse it correctly and only replaces the few wrong entries (based on running it in a test harness).

[...]

This solution proposed by Em Smith is working great, Thank You.

#33

Updated by Flole Systems about 1 month ago

Unfortunately it's not respecting the XML standard (it's not ignoring characters that should not be used in an XML file), so I won't merge it in this state.

There are 3 options now: either you test my patch as well, or the other patch is updated to address this issue, or you just always manually patch this when updating (obviously with the risk that a future update might break it).

#34

Updated by Ryan Li about 1 month ago

I tested your patch of

} else if ((h < 0x20 && h != 0x09 && h != 0x0a && h != 0x0d) || (h >= 0x7F && h <= 0x9F && h != 0x85) || h > 0x10FFFF) {

but it ended up with most of the Chinese characters turned invalid, as I reported in https://tvheadend.org/issues/5909#note-26

I will test your patch with Em Smith's patch together and see how it goes.

#35

Updated by Flole Systems about 1 month ago

Ah alright, I thought you were referring to the other patch.

There's no point in trying to combine those, the patch just needs some slight changes and then it should be mergeable.

I am not sure about adding a space for an invalid character, as those characters don't have meaning it shouldn't be a space but they should be silently discarded.

Disallowed characters could be handled differently though (for example the vertical tab).
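
For reference, the character ranges that XML 1.0 permits can be sketched as a check on decoded code points. The function names below (xml10_char_ok, xml10_c1_discouraged, xml10_check) are hypothetical and not part of Tvheadend. The second function covers only the control-character part of the list that the specification discourages, so it is incomplete.

```c
#include <assert.h>
#include <stdint.h>

/* XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
int xml10_char_ok(uint32_t cp)
{
  return cp == 0x09 || cp == 0x0A || cp == 0x0D ||
         (cp >= 0x20 && cp <= 0xD7FF) ||
         (cp >= 0xE000 && cp <= 0xFFFD) ||
         (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* DEL and the C1 controls: allowed by the production above, but
 * discouraged by the specification. U+0085 (NEL) is excluded. */
int xml10_c1_discouraged(uint32_t cp)
{
  return (cp >= 0x7F && cp <= 0x84) || (cp >= 0x86 && cp <= 0x9F);
}

/* Self-check with characters mentioned in this ticket. */
void xml10_check(void)
{
  assert(!xml10_char_ok(0x0B));        /* vertical tab is not allowed */
  assert(xml10_char_ok(0x201C));       /* left double quotation mark is fine */
  assert(xml10_char_ok(0x8A));         /* allowed by the Char production... */
  assert(xml10_c1_discouraged(0x8A));  /* ...but in the discouraged range */
  assert(!xml10_c1_discouraged(0x85));
}
```

These ranges describe code points, so a check like this belongs after UTF-8 decoding. Applied to raw bytes, a 0x7F to 0x9F test can also match bytes that belong to valid multi-byte sequences, such as the 0x80 in 0xE2 0x80 0x9C.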

#36

Updated by Flole Systems about 1 month ago

@Em Smith:
Would you like to create a PR for that? Or should I take care of getting it in?

#37

Updated by Flole Systems 6 days ago

  • Status changed from New to Fixed
