Feature #4287

Feature: allow parsing season/episode from Title/Subtitle/Description

Added by Dietmar Konermann over 5 years ago. Updated almost 5 years ago.

Status: Fixed
Priority: Normal
Assignee:
Category: EPG
Target version:
Start date: 2015-01-01
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)

Description

Hi,

If an event does not have season/episode information, we should optionally try to parse it from the title/subtitle/description.

Maybe a configurable list of regexps could be added for each of season and episode? TVH could walk through the list and fill in the first match.

E.g. something like:

.*Season \([0-9]+\).*

.*Episode \([0-9]+\).*
.*Folge \([0-9]+\).*

Thanks!
Diddle
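
A minimal sketch of the first-match idea proposed above, written in Python rather than tvheadend's C. The function name `first_match`, the "Staffel" pattern and the sample event text are illustrative assumptions, and the patterns use unescaped `(...)` capture groups.

```python
import re

def first_match(patterns, text):
    """Return the number captured by the first pattern that matches, else None."""
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return int(match.group(1))
    return None

# Hypothetical pattern lists, modelled on the examples in the description.
SEASON_PATTERNS = [r"Season ([0-9]+)", r"Staffel ([0-9]+)"]
EPISODE_PATTERNS = [r"Episode ([0-9]+)", r"Folge ([0-9]+)"]

# Hypothetical event text.
text = "Season 2, Episode 7: The Return"
season = first_match(SEASON_PATTERNS, text)
episode = first_match(EPISODE_PATTERNS, text)
print(season, episode)
```

Because the walk stops at the first hit, more specific patterns would go earlier in each list.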


Files

also_scrape_title_and_desc.patch (4.77 KB) Em Smith, 2017-09-09 10:23
Discovery.jpg (614 KB) Title example channel, Petar Ivanov, 2017-09-18 01:29
Season1.jpg (424 KB) Petar Ivanov, 2017-09-18 01:36
Season.jpg (396 KB) Petar Ivanov, 2017-09-18 01:36

Subtasks

Feature #2584: scraping season/episode info from the event description in EPG data, for EIT streams (Fixed, Adam Sutton)

History

#1

Updated by Jaroslav Kysela over 5 years ago

  • Target version set to 4.4
#2

Updated by Jaroslav Kysela about 5 years ago

  • Status changed from New to Fixed
#3

Updated by g siviero about 5 years ago

Hi, I'm testing version 4.3-448~g2f07ea0. In the web interface I can find the new section "Scrape behaviour", but I don't understand how I should use it to scrape series/episode EIT information.
For example I would like to scrape the information from a subtitle field like "St.5 Ep.1", that would mean season 5 episode 1.
But I can't find any eit/scrape directory in my path /home/hts/.hts/tvheadend/epggrab/.
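
For a subtitle such as "St.5 Ep.1", patterns along the following lines might work. This is only a sketch: the two regular expressions are assumptions rather than anything shipped with tvheadend, the `scrape` helper is hypothetical, and Python's `re` module stands in for tvheadend's own matcher. The keys `season_num` and `episode_num` are the ones used in the scraper configs posted in this thread.

```python
import re

# Hypothetical patterns for a subtitle like "St.5 Ep.1". The dots are escaped
# so that they match a literal "." rather than any character.
config = {
    "season_num": [r"St\.([0-9]+)"],
    "episode_num": [r"Ep\.([0-9]+)"],
}

def scrape(patterns, text):
    """Return the number captured by the first matching pattern, else None."""
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return int(m.group(1))
    return None

subtitle = "St.5 Ep.1"
season = scrape(config["season_num"], subtitle)
episode = scrape(config["episode_num"], subtitle)
print(season, episode)
```

In a JSON config file the backslash itself has to be escaped, so the season pattern would be written as "St\\.([0-9]+)".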

#4

Updated by Mark Clarkstone about 5 years ago

Just tried this & it appears to be working great here, thanks guys!

#5

Updated by saen acro about 5 years ago

Mark Clarkstone wrote:

Just tried this & it appears to be working great here, thanks guys!

This is not documented well enough for me to set it up for my country; examples are needed.
Please add examples to the wiki.

#6

Updated by Dietmar Konermann about 5 years ago

Thanks for looking into this... works at least partially for me.
The current code only examines the summary; it would be great to have the description scanned as well. :)

Cheers,
Diddle.

#7

Updated by Em Smith about 5 years ago

@siviero, Sorry for lack of full documentation.

The eit/scrape directory should (by default) be part of your package. For example, on Debian/Ubuntu you could run:

dpkg -L tvheadend | grep scrape

and this gives me:
/usr/share/tvheadend/data/conf/epggrab/eit/scrape

There's currently only one scraper in there which scrapes from summary data and can be used as a basis for your country/grabber.

Perhaps the best way forward is to raise a separate bug for your specific grabber, mentioning all the grabbers enabled, the country (or a more descriptive name for your config), and a few exact examples of how season/episode look in your data (such as a screenshot, with the text pasted into the message too).

If you look at #4509 you'll see a sample patch for getting it to work for another config, but it will take time to get more written.

#8

Updated by Em Smith about 5 years ago

@Konermann
Do you have an example for summary and description? Would they use the same regular expression or need different ones?

#9

Updated by Dietmar Konermann about 5 years ago

@Em,

actually I'm using my own set of regexp, e.g.

{
    "season_num": [
        "([0-9]+)\\. *Staffel",
        "Staffel *([0-9]+)"
    ],
    "episode_num": [
        "([0-9]+)\\. *Folge",
        "Folge *([0-9]+)"
    ],
    "airdate": [
        "\\(([12][0-9][0-9][0-9])\\)"
    ]
}

Works fine, if present in summary... but often it's in the description:

Serie USA, 1. Staffel, Folge 5: Obwohl sich Frances (Sarah Jessica Parker) und Robert (Thomas Haden Church) auf eine Mediation geeinigt hatten, sucht Robert heimlich einen Anwalt auf.

And we have also cases where the title contains e.g. the episode.

So matching against all three, title + summary + description, would suit my needs best.

Cheers
Diddle.
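
As a rough check of how the config above behaves on the quoted description, here is a small Python sketch. It assumes first-match-wins semantics per key and uses Python's `re` in place of tvheadend's matcher; the `scrape` helper is hypothetical.

```python
import json
import re

# The scraper configuration quoted above, as JSON text.
CONFIG_JSON = r'''
{
  "season_num": ["([0-9]+)\\. *Staffel", "Staffel *([0-9]+)"],
  "episode_num": ["([0-9]+)\\. *Folge", "Folge *([0-9]+)"],
  "airdate": ["\\(([12][0-9][0-9][0-9])\\)"]
}
'''

def scrape(config, text):
    """For each key, store the number captured by the first matching pattern."""
    result = {}
    for field, patterns in config.items():
        result[field] = None
        for pattern in patterns:
            m = re.search(pattern, text)
            if m:
                result[field] = int(m.group(1))
                break
    return result

config = json.loads(CONFIG_JSON)
description = ("Serie USA, 1. Staffel, Folge 5: Obwohl sich Frances (Sarah Jessica Parker) "
               "und Robert (Thomas Haden Church) auf eine Mediation geeinigt hatten, "
               "sucht Robert heimlich einen Anwalt auf.")
result = scrape(config, description)
print(result)
```

Here the season comes from the first season pattern and the episode from the second episode pattern; nothing matches `airdate` because the text contains no year in parentheses.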

#10

Updated by Em Smith about 5 years ago

@Konermann
I've looked at the code and see no technical problems applying it across all three.

The only issue I see (other than a slight performance hit) is that we may get false matches. But let's try it, and if it fails we will have to consider a separate tick-box/regex.

I don't think my broadcaster gives me anything that will match title/description so I can't test. If I put a patch here can you compile and test before it's formally submitted? It will be a couple of days since I have one other change in the area I want to submit first.

#11

Updated by Dietmar Konermann about 5 years ago

Thanks, will be happy to test anything you throw at me. :)

#12

Updated by Em Smith about 5 years ago

Try the attached patch. I can't test it properly since I don't have description broadcast, but it still works OK with my summary scrape.

It applies against current master.

#13

Updated by Mark Clarkstone about 5 years ago

Em Smith wrote:

Try the attached patch. I can't test it properly since I don't have description broadcast, but it still works OK with my summary scrape.

It applies against current master.

Not sure this'll help, but worth mentioning it anyway. Em, you may be able to make use of the "--tsfile_tuners" and "--tsfile" options with mux samples (containing epg data)?

#14

Updated by Em Smith about 5 years ago

The --tsfile option sounds interesting, but I have not managed to get it to work yet.
Running "mediainfo" on an old ts recording shows EIT information, and tvh says it is scanning (after enabling the mux for uk_freesat), but it's not hitting my log statements. I'll look into it later.

#15

Updated by g siviero about 5 years ago

Em Smith wrote:

Try the attached patch. I can't test it properly since I don't have description broadcast, but it still works OK with my summary scrape.

It applies against current master.

I took a look at the proposed patch but have not tried it yet.
One question: why does the same line "changed |= EPG_CHANGED_EPISODE;" appear for both season and episode?
Why not, for example, "changed |= EPG_CHANGED_SEASON;" and "changed |= EPG_CHANGED_EPISODE;"?
Another question, actually: what would the value of EPG_CHANGED_EPISODE be?

+  /* search for season number */
+  char buffer[2048];
+  if (eit_pattern_apply_list(buffer, sizeof(buffer), str, &eit_mod->p_snum))
+    if ((en->s_num = atoi(buffer))) {
+      tvhtrace(LS_TBL_EIT,"  extract season number %d using %s", en->s_num, eit_mod->id);
+      changed |= EPG_CHANGED_EPISODE;
+    }
+
+  /* ...for episode number */
+  if (eit_pattern_apply_list(buffer, sizeof(buffer), str, &eit_mod->p_enum))
+    if ((en->e_num = atoi(buffer))) {
+      tvhtrace(LS_TBL_EIT,"  extract episode number %d using %s", en->e_num, eit_mod->id);
+      changed |= EPG_CHANGED_EPISODE;
+    }

If season/episode is present in both Title and Description (and Summary) would it be overwritten?
Or is there a check to skip further readings if the information was already retrieved from the previous fields?

#16

Updated by Dietmar Konermann about 5 years ago

@Em: works fine here, thanks a lot.

#17

Updated by Em Smith about 5 years ago

@Konermann
Glad it works. I'll give it a couple of days in case there are any issues (since this patch is also useful in Italy) and then submit it formally as a pull request.

@siviero
Good questions.

The series and episode are actually both contained inside an existing type called "epg_episode_num". So I used the same EPG_CHANGED_EPISODE flag for both since it is a change to the underlying epg_episode_num and requires a later call to "epg_episode_set_epnum" (which sets both series and episode number).

However, there exist EPG_CHANGED_EPSER_NUM and EPG_CHANGED_EPNUM_NUM, which are more closely related to EPG_CHANGED_FIRST_AIRED, so I'll change the patch to use those to make it clearer for people.

If details are present in all fields then they will be overwritten. So potentially you could scrape season from title and episode from description.
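
The overwrite behaviour described above can be sketched as follows. This is a Python illustration, not the patch itself: the order in which the fields are scanned, the helper names and the sample events are all assumptions.

```python
import re

def first_match(patterns, text):
    """Return the number captured by the first matching pattern, else None."""
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            return int(m.group(1))
    return None

def scrape_event(event, season_patterns, episode_patterns,
                 fields=("title", "summary", "description")):
    """Scan each field in turn; a later match overwrites an earlier one."""
    season = episode = None
    for name in fields:
        text = event.get(name) or ""
        found = first_match(season_patterns, text)
        if found is not None:
            season = found
        found = first_match(episode_patterns, text)
        if found is not None:
            episode = found
    return season, episode

SEASON = [r"Staffel *([0-9]+)"]
EPISODE = [r"Folge *([0-9]+)"]

# Hypothetical event: season only in the title, episode only in the description.
split_event = {"title": "Divorce, Staffel 1", "description": "Folge 5: Robert sucht einen Anwalt auf."}
# Hypothetical event: the episode appears twice, so the later field wins.
clash_event = {"title": "Folge 4", "description": "Folge 5"}

print(scrape_event(split_event, SEASON, EPISODE))
print(scrape_event(clash_event, SEASON, EPISODE))
```

A variant that skips fields once a value has been found would keep the first match instead of the last.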

#18

Updated by Mark Clarkstone about 5 years ago

@Em Smith

Parsing for UK subtitles works almost perfectly here, apart from where a subtitle uses lots of dots at the end. "This is a subtitle..." etc.

Also, would it be possible to enable this for OpenTV too please? :).

#19

Updated by Em Smith about 5 years ago

Do you have an example of both the title and description (preferably one a few days away)? My regex search in the GUI matches lots of split titles.

I took a look at OpenTV and it already has the scrape subtitle. I don't know how to configure it but an example is in the data/conf/epggrab/opentv/prov/skyit file under "subtitle".

#20

Updated by saen acro about 5 years ago

@Em Smith
Is there a way to use the Subtitle to fill in the Content Type?
It would be a challenge, because it needs translations for multiple languages.

#21

Updated by Em Smith about 5 years ago

Unfortunately scraping a string to populate the content type (which internally is a number) is not too easy. I think there are over a hundred genres in DVB spec (such as football, tennis, martial sports, soap, romance, religious, horror, etc).

#22

Updated by Petar Ivanov about 5 years ago

@Em Smith
1. Can you also use the Title for the Episode? In the example program the episode number is in the Title, not in the Subtitle as on most channels.
2. Why does this not work for season (сезон) and the short form of season, s. (с.)?
When the season comes after the episode it works, but when only the season is present it does not.

{
    "season_num": [
        "сезон ([0-9]+)",
        "[, ] сезон ([0-9]+)",
        "сез.? ([0-9]+)",
        "[, ] с[.] ([0-9]+)",
        "с[.] ([0-9]+), еп.",
        "с[.] ([0-9]+)",
        "еп[.] [0-9]+,.*, ([0-9]+), ?сез" 
#23

Updated by Em Smith about 5 years ago

The patch (in note 12 above) should already search the Title for the Episode. For Episode you need this in the episode_num section:
"Епизод ([0-9]+)"

If it doesn't work could you copy+paste an example using the format in #4577?

The season should also work and match "сезон ([0-9]+)" (season, space, digits). Is the patch in note 12 applied?
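
One way to check a pattern such as "сезон ([0-9]+)" outside tvheadend is to run it against pasted EPG text. The sketch below uses Python's `re`, which may differ in detail from tvheadend's regex engine, and the sample strings are invented. It assumes matching is case-sensitive, in which case a capitalised "Сезон" or a missing space prevents a match; whether either is the cause of the problem reported here is not known.

```python
import re

PATTERN = "сезон ([0-9]+)"  # season, space, digits

def season_from(text):
    """Return the season number found by PATTERN, or None if it does not match."""
    m = re.search(PATTERN, text)
    return int(m.group(1)) if m else None

# Invented sample strings: lowercase, capitalised, and without the space.
samples = ["Сериал, сезон 2", "Сериал, Сезон 2", "Сериал, сезон2"]
for sample in samples:
    print(repr(sample), "->", season_from(sample))
```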

#24

Updated by Petar Ivanov about 5 years ago

I had not applied the patch. Now that I have applied it, using the Title for the Episode works fine. Why not make a pull request to add it to the main code?

The season still does not work. I have tried different approaches, but the season is not shown.

#25

Updated by saen acro about 5 years ago

Em Smith wrote:

Unfortunately scraping a string to populate the content type (which internally is a number) is not too easy. I think there are over a hundred genres in DVB spec (such as football, tennis, martial sports, soap, romance, religious, horror, etc).

http://www.etsi.org/deliver/etsi_en/300400_300499/300468/01.11.01_60/en_300468v011101p.pdf
Page 40
(As this is the DVB standard, it does not include the hundreds of ATSC types.)

@Petar Ivanov

{
    "season_num": [
        "сезон ([0-9]+)",
        "[, ] сезон ([0-9]+)",
        "сез.? ([0-9]+)",
        "[, ] с[.] ([0-9]+)",
        "с[.] ([0-9]+), еп.",
        "с[.] ([0-9]+)",
        "еп[.] [0-9]+,.*, ([0-9]+), ?сез" 
    ],
    "episode_num": [
        "([0-9]+) серия ",
        "еп[.] ([0-9]+)",
        "[, ] ([0-9]+) еп[.]",
        "([0-9]+) еп. [,]",
        "епизод ([0-9]+)",
        "Епизод ([0-9]+)",
        "[, ] ([0-9]+) епизод ",
        "([0-9]+) епизод " 
    ],
    "airdate": [
        ", ([12][90][0-9][0-9])" 
    ]
}


The only problem is when a stupid EPG writer puts 3 events (episodes) into one event,
and the parser thinks "3 епизода/серии" (3 episodes) means S03.

#26

Updated by Petar Ivanov about 5 years ago

@saen acro

If you think I did not try this config: I did, and it does not work when only the season is present and only the season should be shown.

I tested it first and posted the result in the other issue: https://tvheadend.org/issues/4509#note-21
