Feature #4287

Feature: allow parsing season/episode from Title/Subtitle/Description

Added by Dietmar Konermann 6 months ago. Updated 3 days ago.

Status: Fixed
Start date: 2017-03-17
Priority: Normal
Due date:
Assignee: Adam Sutton
% Done: 0%
Category: EPG
Target version: 4.4

Description

Hi,

If an event does not have season/episode information, then we should optionally try to parse it from the title/subtitle/description.

Maybe a configurable list of regexps could be added for each of season and episode? TVH could walk through the list and fill in the first match.

E.g. something like:

.*Season \([0-9]+\).*

.*Episode \([0-9]+\).*
.*Folge \([0-9]+\).*

Thanks!
Diddle

also_scrape_title_and_desc.patch (4.77 KB) Em Smith, 2017-09-09 10:23

Discovery.jpg - Title example channel (614 KB) Petar Ivanov, 2017-09-18 01:29

Season1.jpg (424 KB) Petar Ivanov, 2017-09-18 01:36

Season.jpg (396 KB) Petar Ivanov, 2017-09-18 01:36

Associated revisions

Revision c5d2f7d1
Added by Em Smith 13 days ago

eit: Move opentv pattern list functions to separate file. (#4287).

The pattern list functions are used for regular expression matching.
We move them to a separate file and rename them to have an
eit prefix instead of opentv prefix so they can be shared with
other eit modules.

Issue: #4287

Revision 3673e974
Added by Em Smith 13 days ago

eit: Extract season/episode numbers from OTA EIT. (#4287).

Broadcasters often include season and episode number in
the summary text in the OTA EIT.

For example, UK broadcasters often, but not always,
have a description of "Lorem ipsum (S5 Ep 8)" or
"Lorem ipsum (S3 Ep 4/9)" or "Lorem ipsum (Ep 4)".

From this we can use a regular expression match to
extract the season and episode data on a best effort
basis. This logic is based on the opentv extractor.

This is done via config files that are named after the
grabber module and exist in this directory:
data/conf/epggrab/eit/fixup/
Example names would be uk_freeview.

If the module-specific config file does not exist then we
fall back to trying the first component of the filename.

In the above example that would be "uk". This avoids having
duplicate files in the case where we have DVB-S and DVB-T
in the same country that share the same extraction regex.

The configuration file should contain season_num and
episode_num sections that can contain multiple regular
expressions to apply in sequence until one produces
a match.

For DVB-S, the configuration file normally needs to be copied to
a file named "eit" since data is broadcast via that mechanism.
This isn't done by default since the eit grabber is used by
multiple countries that may use different regular expressions.

Issue: #4287
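
The "apply in sequence until one produces a match" behaviour described in this commit can be sketched roughly as follows. This is only an illustration built on POSIX regcomp/regexec, not the actual eit pattern code: the helper first_match_number and the demo_* pattern lists are hypothetical, and the patterns are guesses at what would match the UK examples quoted above.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pattern lists (not taken from any shipped config file). */
static const char *const demo_season_patterns[] = {
  "Season ([0-9]+)",
  "\\(S ?([0-9]+) Ep",
};
static const char *const demo_episode_patterns[] = {
  "Episode ([0-9]+)",
  "Ep ?([0-9]+)",
};

/* Hypothetical helper: try each pattern in order and return the number
 * captured by group 1 of the first pattern that matches, or 0 if no
 * pattern matches at all. */
static int first_match_number(const char *text,
                              const char *const *patterns, size_t count)
{
  for (size_t i = 0; i < count; i++) {
    regex_t re;
    regmatch_t m[2];

    if (regcomp(&re, patterns[i], REG_EXTENDED) != 0)
      continue; /* skip patterns that fail to compile */

    if (regexec(&re, text, 2, m, 0) == 0 && m[1].rm_so >= 0) {
      char buf[16];
      size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);

      if (len >= sizeof(buf))
        len = sizeof(buf) - 1; /* truncate over-long captures */
      memcpy(buf, text + m[1].rm_so, len);
      buf[len] = '\0';
      regfree(&re);
      return atoi(buf); /* first match wins; later patterns are not tried */
    }
    regfree(&re);
  }
  return 0;
}
```

With these assumed lists, calling first_match_number on "Lorem ipsum (S3 Ep 4/9)" would skip the first pattern in each list and take the season and episode from the second one, while "Lorem ipsum (Ep 4)" would yield an episode but no season.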

Revision d0cb77b1
Added by Em Smith 13 days ago

eit: Allow EIT scraping of season/episode to be disabled at GUI. (#4287).

We now have a tick box in the OTA configuration to enable/disable
the scraping of season/episode numbers from the eit grabbers.
This will allow us to add other scrapers and tidy-ups in the
future (such as removing "Also in HD" from the summary data
or "New:" from the title), and allow the user to disable ones
they do not want for very low-spec machines or due to their
duplicate rules relying on pre-tidy data.

To achieve this configuration, we now derive our eit grabbers
to be a "...scraper" type and hook in to the activate callback
to load/unload the regular expressions.

The loading of the config also had to be moved to the activate
rather than in the module create to allow us to access the
"scrape enabled" boolean.

Issue: #4287

Revision 72d087fe
Added by Em Smith 13 days ago

eit: Allow scraper configuration file to be configured at the GUI (#4287).

Previously the scraper was hard-coded based on the module name.
So "uk_freeview" module would check "uk_freeview" configuration file
and then the "uk" file.

However, this meant that the generic "eit" module (used by several
countries) had to be symlinked by the user to a specific configuration
for their country.

With this change, the user can simply enter "uk" in the GUI to read
that configuration.

Also renamed "fixup" to be called "scrape" since we are scraping
data from the EIT rather than fixing it.

Issue: #4287
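
For reference, the earlier hard-coded lookup ("uk_freeview", then "uk") amounts to stripping everything from the first separator onwards. A minimal sketch of that fallback, assuming the separator is an underscore as in the example names; scrape_fallback_name is a hypothetical helper, not a function from the tvheadend source:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: derive the fallback config name by keeping only the
 * part of the module id before the first underscore ("uk_freeview" -> "uk").
 * Returns 1 if a distinct fallback name was written to out, 0 otherwise. */
static int scrape_fallback_name(const char *module_id, char *out, size_t outlen)
{
  const char *sep = strchr(module_id, '_');
  size_t len;

  if (sep == NULL || sep == module_id)
    return 0; /* no prefix to fall back to, e.g. the generic "eit" module */

  len = (size_t)(sep - module_id);
  if (len >= outlen)
    return 0; /* output buffer too small for the prefix plus terminator */

  memcpy(out, module_id, len);
  out[len] = '\0';
  return 1;
}
```

The generic "eit" id has no prefix to strip, which is why it previously needed a symlink and now gets an explicit GUI setting instead.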

Revision b52e4485
Added by Em Smith 13 days ago

eit: Clear scraper patterns on shutdown. (#4287).

Issue: #4287

Revision c17b34d3
Added by Em Smith 13 days ago

eit: Add scraper for first aired date. (#4287).

Our broadcaster summary often has "(1995) Lorem ipsum", so we
can extract the first aired date of 1995 from this.

Issue: #4287.
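
As an illustration of the first-aired extraction, the sketch below pulls a parenthesised four-digit year out of a summary string. The helper name extract_first_aired_year and the exact pattern are assumptions made for this example; the real scraper reads its patterns from the configuration file.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return the four-digit year found in parentheses in
 * the summary (e.g. 1995 for "(1995) Lorem ipsum"), or 0 if there is none. */
static int extract_first_aired_year(const char *summary)
{
  regex_t re;
  regmatch_t m[2];
  int year = 0;

  /* Assumed pattern: "(", a year starting with 1 or 2, then ")". */
  if (regcomp(&re, "\\(([12][0-9]{3})\\)", REG_EXTENDED) != 0)
    return 0;

  if (regexec(&re, summary, 2, m, 0) == 0)
    year = atoi(summary + m[1].rm_so); /* atoi stops at the closing ")" */

  regfree(&re);
  return year;
}
```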

History

#1 Updated by Jaroslav Kysela 5 months ago

  • Target version set to 4.4

#2 Updated by Jaroslav Kysela 13 days ago

  • Status changed from New to Fixed

#3 Updated by g siviero 13 days ago

Hi, I'm testing version 4.3-448~g2f07ea0. In the web interface I can find the new section "Scrape behaviour", but I don't understand how I should use it to scrape series/episode information from the EIT.
For example, I would like to scrape the information from a subtitle field like "St.5 Ep.1", which would mean season 5 episode 1.
But I can't find any eit/scrape directory in my path /home/hts/.hts/tvheadend/epggrab/.

#4 Updated by Mark Clarkstone 13 days ago

Just tried this & it appears to be working great here, thanks guys!

#5 Updated by saen acro 13 days ago

Mark Clarkstone wrote:

Just tried this & it appears to be working great here, thanks guys!

This isn't documented well enough for me to set it up for my country; I need examples.
Please write some wiki examples.

#6 Updated by Dietmar Konermann 13 days ago

Thanks for looking into this... works at least partially for me.
The current code only examines the summary; it would be great to have the description scanned as well. :)

Cheers,
Diddle.

#7 Updated by Em Smith 13 days ago

@siviero, Sorry for lack of full documentation.

The eit/scrape directory (by default) should be part of your package. For example, on Debian/Ubuntu you could run

dpkg -L tvheadend | grep scrape

and this gives me:
/usr/share/tvheadend/data/conf/epggrab/eit/scrape

There's currently only one scraper in there which scrapes from summary data and can be used as a basis for your country/grabber.

Perhaps the best way forward is to raise a separate bug for your specific grabber, mentioning all the grabbers enabled, what the country is (or a more descriptive name for your config) and a few exact examples of how season/episode look in your data (such as a screenshot, with the text pasted into the message too).

If you look at #4509 you'll see a sample patch for getting it to work for another config, but it will take time to get more written.

#8 Updated by Em Smith 13 days ago

@Konermann
Do you have an example for summary and description? Would they use same regular expression or need different ones?

#9 Updated by Dietmar Konermann 13 days ago

@Em,

actually I'm using my own set of regexp, e.g.

{
"season_num": [
"([0-9]+)\\. *Staffel",
"Staffel *([0-9]+)"
],
"episode_num": [
"([0-9]+)\\. *Folge",
"Folge *([0-9]+)"
],
"airdate": [
"\\(([12][0-9][0-9][0-9])\\)"
]
}

Works fine, if present in summary... but often it's in the description:

Serie USA, 1. Staffel, Folge 5: Obwohl sich Frances (Sarah Jessica Parker) und Robert (Thomas Haden Church) auf eine Mediation geeinigt hatten, sucht Robert heimlich einen Anwalt auf.

And we have also cases where the title contains e.g. the episode.

So matching all, title+summary+description, would be matching my needs best.

Cheers
Diddle.

#10 Updated by Em Smith 13 days ago

@Konermann
I've looked at the code and see no technical problems applying it across all three.

The only issue I see (other than slight performance hit) is maybe we will get false matches. But, let's try and if it fails we will have to consider a separate tick-box/regex.

I don't think my broadcaster gives me anything that will match title/description so I can't test. If I put a patch here can you compile and test before it's formally submitted? It will be a couple of days since I have one other change in the area I want to submit first.

#11 Updated by Dietmar Konermann 12 days ago

Thanks, I will be happy to test anything you throw at me. :)

#12 Updated by Em Smith 12 days ago

Try the attached patch. I can't test it properly since I don't have description broadcast, but it still works OK with my summary scrape.

It applies against current master.

#13 Updated by Mark Clarkstone 12 days ago

Em Smith wrote:

Try the attached patch. I can't test it properly since I don't have description broadcast, but it still works OK with my summary scrape.

It applies against current master.

Not sure this'll help, but worth mentioning it anyway. Em, you may be able to make use of the "--tsfile_tuners" and "--tsfile" options with mux samples (containing epg data)?

#14 Updated by Em Smith 12 days ago

The --tsfile option sounds interesting, but I haven't managed to get it to work yet.
Running "mediainfo" on an old ts recording shows EIT information, and tvh says it is scanning (after enabling the mux for uk_freesat), but it's not hitting my log statements. I'll look into it later.

#15 Updated by g siviero 6 days ago

Em Smith wrote:

Try the attached patch. I can't test it properly since I don't have description broadcast, but it still works OK with my summary scrape.

It applies against current master.

I took a look at the proposed patch but have not tried it yet.
One question: why does the same line "changed |= EPG_CHANGED_EPISODE;" appear for both season and episode?
Why not, for example, "changed |= EPG_CHANGED_SEASON;" for the season and "changed |= EPG_CHANGED_EPISODE;" for the episode?
Actually, another question: what would the value of EPG_CHANGED_EPISODE be?

+  /* search for season number */
+  char buffer[2048];
+  if (eit_pattern_apply_list(buffer, sizeof(buffer), str, &eit_mod->p_snum))
+    if ((en->s_num = atoi(buffer))) {
+      tvhtrace(LS_TBL_EIT,"  extract season number %d using %s", en->s_num, eit_mod->id);
+      changed |= EPG_CHANGED_EPISODE;
+    }
+
+  /* ...for episode number */
+  if (eit_pattern_apply_list(buffer, sizeof(buffer), str, &eit_mod->p_enum))
+    if ((en->e_num = atoi(buffer))) {
+      tvhtrace(LS_TBL_EIT,"  extract episode number %d using %s", en->e_num, eit_mod->id);
+      changed |= EPG_CHANGED_EPISODE;
+    }

If season/episode is present in both Title and Description (and Summary) would it be overwritten?
Or is there a check to skip further readings if the information was already retrieved from the previous fields?

#16 Updated by Dietmar Konermann 5 days ago

@Em: works fine here, thanks a lot.

#17 Updated by Em Smith 4 days ago

@Konermann
Glad it works. I'll give it a couple of days in case there are any issues (since this patch is also useful in Italy) and then submit it formally as a pull request.

@siviero
Good questions.

The series and episode are actually both contained inside an existing type called "epg_episode_num". So I used the same EPG_CHANGED_EPISODE flag for both since it is a change to the underlying epg_episode_num and requires a later call to "epg_episode_set_epnum" (which sets both series and episode number).

However, there are also EPG_CHANGED_EPSER_NUM and EPG_CHANGED_EPNUM_NUM flags, which are more closely related to EPG_CHANGED_FIRST_AIRED, so I'll change the patch to use those to make it clearer for people.

If details are present in all fields then they will be overwritten. So potentially you could scrape season from title and episode from description.
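
The overwrite behaviour described above can be sketched like this. Everything prefixed demo_ or DEMO_ is hypothetical: the flag values stand in for the real EPG_CHANGED_* constants (whose values are not given in this thread), demo_epnum stands in for epg_episode_num, and the order in which fields are scanned is simply whatever order the caller passes them in.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical change flags, one per value, standing in for EPG_CHANGED_*. */
enum { DEMO_CHANGED_SEASON = 1 << 0, DEMO_CHANGED_EPISODE = 1 << 1 };

/* Hypothetical stand-in for the season/episode pair. */
struct demo_epnum { int s_num; int e_num; };

/* Return the number captured by group 1 of pattern in text, or 0 if the
 * pattern does not match. Group 1 is assumed to capture digits only. */
static int demo_capture(const char *pattern, const char *text)
{
  regex_t re;
  regmatch_t m[2];
  int value = 0;

  if (regcomp(&re, pattern, REG_EXTENDED) != 0)
    return 0;
  if (regexec(&re, text, 2, m, 0) == 0 && m[1].rm_so >= 0)
    value = atoi(text + m[1].rm_so);
  regfree(&re);
  return value;
}

/* Scan each field in order; a match in a later field overwrites the value
 * taken from an earlier one. Returns the accumulated change flags. */
static int demo_scrape(struct demo_epnum *en,
                       const char *const *fields, size_t count,
                       const char *season_re, const char *episode_re)
{
  int changed = 0;

  for (size_t i = 0; i < count; i++) {
    int n;

    if (fields[i] == NULL)
      continue; /* field not broadcast for this event */
    if ((n = demo_capture(season_re, fields[i])) != 0) {
      en->s_num = n;
      changed |= DEMO_CHANGED_SEASON;
    }
    if ((n = demo_capture(episode_re, fields[i])) != 0) {
      en->e_num = n;
      changed |= DEMO_CHANGED_EPISODE;
    }
  }
  return changed;
}
```

Passing a title containing "Folge 3", no summary, and a description containing "1. Staffel, Folge 5" would leave the season at 1 (from the description) and the episode at 5, because the description is scanned last and replaces the value taken from the title.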

#18 Updated by Mark Clarkstone 4 days ago

@Em Smith

Parsing for UK subtitles works almost perfectly here, apart from where a subtitle uses lots of dots at the end. "This is a subtitle..." etc.

Also, would it be possible to enable this for OpenTV too please? :).

#19 Updated by Em Smith 4 days ago

Have you an example of both the title and description (preferably one a few days away)? My regex search in the GUI matches lots of split titles.

I took a look at OpenTV and it already has the scrape subtitle. I don't know how to configure it but an example is in the data/conf/epggrab/opentv/prov/skyit file under "subtitle".

#20 Updated by saen acro 4 days ago

@Em Smith
Is there a way to use the Subtitle to fill in the Content Type?
It will be a challenge because it needs multi-language translation.

#21 Updated by Em Smith 4 days ago

Unfortunately scraping a string to populate the content type (which internally is a number) is not too easy. I think there are over a hundred genres in DVB spec (such as football, tennis, martial sports, soap, romance, religious, horror, etc).

#22 Updated by Petar Ivanov 4 days ago

@Em Smith
1. Can you make it use the Title for the Episode as well? In the example programme, the episode number is in the Title, not in the Subtitle as on most channels.
2. Why doesn't this work for season (сезон) and the abbreviation for season, s. (с.)?
When the season comes after the episode it works, but when there is only a season it doesn't.

{
    "season_num": [
        "сезон ([0-9]+)",
        "[, ] сезон ([0-9]+)",
        "сез.? ([0-9]+)",
        "[, ] с[.] ([0-9]+)",
        "с[.] ([0-9]+), еп.",
        "с[.] ([0-9]+)",
        "еп[.] [0-9]+,.*, ([0-9]+), ?сез" 

#23 Updated by Em Smith 4 days ago

This patch (in note 12 above) should already search Title for Episode. For Episode you need this in the episode_num section:
"Епизод ([0-9]+)"

If it doesn't work could you copy+paste an example using the format in #4577?

The season should also work and match "сезон ([0-9]+)" (season, space, digits). Is patch in note 12 applied?

#24 Updated by Petar Ivanov 4 days ago

I hadn't applied the patch; now that I have, everything works fine for taking the Episode from the Title. Why not make a pull request to add it to the main code?

The season still doesn't work. I tried different approaches, but the season is not shown.

#25 Updated by saen acro 3 days ago

Em Smith wrote:

Unfortunately scraping a string to populate the content type (which internally is a number) is not too easy. I think there are over a hundred genres in DVB spec (such as football, tennis, martial sports, soap, romance, religious, horror, etc).

http://www.etsi.org/deliver/etsi_en/300400_300499/300468/01.11.01_60/en_300468v011101p.pdf
Page 40
(because this is the DVB standard, it does not include the hundreds of ATSC types)

@Petar Ivanov

{
    "season_num": [
        "сезон ([0-9]+)",
        "[, ] сезон ([0-9]+)",
        "сез.? ([0-9]+)",
        "[, ] с[.] ([0-9]+)",
        "с[.] ([0-9]+), еп.",
        "с[.] ([0-9]+)",
        "еп[.] [0-9]+,.*, ([0-9]+), ?сез" 
    ],
    "episode_num": [
        "([0-9]+) серия ",
        "еп[.] ([0-9]+)",
        "[, ] ([0-9]+) еп[.]",
        "([0-9]+) еп. [,]",
        "епизод ([0-9]+)",
        "Епизод ([0-9]+)",
        "[, ] ([0-9]+) епизод ",
        "([0-9]+) епизод " 
    ],
    "airdate": [
        ", ([12][90][0-9][0-9])" 
    ]
}


The only problem is when a stupid EPG writer puts "3" events (series) into one event,
and the parser thinks "3 епизода/серии" (3 series/episodes) means S03.

#26 Updated by Petar Ivanov 3 days ago

@saen acro

In case you think I haven't tried this script: I have, and it fails only for the season, in cases where only the season should be shown.

I tested it first and posted in the other issue: https://tvheadend.org/issues/4509#note-21
