Feature #4509

EPG content from "Subtitle" to "Content type"

Added by saen acro over 1 year ago. Updated 11 months ago.

Status: New
Start date: 2017-07-31
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Target version: -

Заснемане.PNG - Drama (33.5 KB) saen acro, 2017-07-31 21:15

Заснемане2.PNG - Action, Adventure, Fantasy (34.7 KB) saen acro, 2017-07-31 21:16

Заснемане3.PNG - Comedy serial, Episode 5 (25.1 KB) saen acro, 2017-07-31 21:16

Заснемане4.PNG - E 01, S 01, serial (28.7 KB) saen acro, 2017-07-31 21:21

Заснемане5.PNG - E 02, serial, S02, repeat (25.6 KB) saen acro, 2017-07-31 21:21

Заснемане6.PNG - сезон 1, епизод 13, драма, романтичен, САЩ, 2014 (122 KB) saen acro, 2017-09-08 08:26

Заснемане7.PNG - сериал, еп. 45, 46 (126 KB) saen acro, 2017-09-08 08:31

Заснемане8.PNG - четвърти сезон (107 KB) saen acro, 2017-09-08 08:39

0001-eit-Initial-Bulsatcom_39E-configuration.-4509.patch (963 Bytes) Em Smith, 2017-09-08 14:23

Заснемане9.PNG (141 KB) saen acro, 2017-09-08 16:12

Заснемане10.PNG (76.4 KB) saen acro, 2017-09-08 16:19

Bulsatcom_39E (374 Bytes) Em Smith, 2017-09-08 19:37

BNT 1.jpg - BNT 1 (432 KB) Petar Ivanov, 2017-09-08 22:50

Nat Geo.jpg (435 KB) Petar Ivanov, 2017-09-08 22:50

btv comedy.jpg (422 KB) Petar Ivanov, 2017-09-08 22:50

AXN.jpg (464 KB) Petar Ivanov, 2017-09-08 22:50

BTV.jpg (442 KB) Petar Ivanov, 2017-09-08 22:50

Evrokom.png (129 KB) Petar Ivanov, 2017-09-08 23:11

kinonova.jpg (453 KB) Petar Ivanov, 2017-09-08 23:11

hobby tv hd.jpg (512 KB) Petar Ivanov, 2017-09-08 23:11

Заснемане11.PNG (110 KB) saen acro, 2017-09-09 18:52

channels.7z (228 KB) saen acro, 2017-09-24 22:55

0001-eit-Scrape-genre-from-text-in-OTA-EIT.-4509.patch (6.79 KB) Em Smith, 2017-10-19 19:28

0001-eit-Scrape-genre-from-text-in-OTA-EIT.-4509.patch (13.6 KB) Em Smith, 2017-12-16 16:11

csv2json (424 Bytes) Em Smith, 2017-12-16 20:18

History

#1 Updated by saen acro over 1 year ago

Some local operators put the content type and episode number at the beginning of the subtitle; see the examples in the pictures.

Is it possible to use the content between the commas to fill in the Episode and Content type fields?

Then it would be very easy for the DVR to record without duplication.

#2 Updated by Gabor Kovacs over 1 year ago

saen acro wrote:

Some local operators put the content type and episode number at the beginning of the subtitle; see the examples in the pictures.

Is it possible to use the content between the commas to fill in the Episode and Content type fields?

Then it would be very easy for the DVR to record without duplication.

I have the same issue: the episode number is in the subtitle. It would be useful to extract it somehow into the correct field.

#3 Updated by Em Smith over 1 year ago

Possible dup #4389 ("Regex in EIT grabber")?

#4 Updated by Em Smith about 1 year ago

I am assuming you are using "Bulsatcom Bula 39E" grabber?

I'm trying to work out two separate regular expressions that might help for season/episode, one for grabbing season and a separate one for matching episode that might work with a parser I'm working on.

Sorry if these questions sound obvious, but...

To confirm: "Сезон" is Season, "Епизод" is Episode.

Example 4 seems easy to parse, "episode 1, season 1". But in example 5 it seems to say "ep 2" and then "2, season" (rather than "season 2"). Am I right that sometimes it's "2, season" and sometimes "season 2"? I assume the words after "episode 2" are the genre in that example.

In example 3, it appears episode is abbreviated to just the letter "С." Does that abbreviation always mean episode?

Would you be able to paste the words used from a couple of examples and maybe just a couple of complete descriptions for season/episode?

Also, I noticed that you have dates such as ", 2015" and ", 1994". Is that the original air date (the date the movie was created)? If you type ", [0-9][0-9][0-9][0-9]" (comma, space) into the EPG and hit "fulltext", does it correctly match movies, or does it generate false matches with dramas and other programmes? Does the word after the year in example 2/3 mean something useful?

Thanks.

#5 Updated by saen acro about 1 year ago

Em Smith wrote:

I am assuming you are using "Bulsatcom Bula 39E" grabber?

No matter, this is for some transponders with simulcrypt.

I'm trying to work out two separate regular expressions that might help for season/episode, one for grabbing season and a separate one for matching episode that might work with a parser I'm working on.

Sorry if these questions sound obvious, but...

To confirm: "Сезон" is Season, "Епизод" is Episode.

Yes correct

Example 4 seems easy to parse, "episode 1, season 1". But in example 5 it seems to say "ep 2" and then "2, season" (rather than "season 2"). Am I right that sometimes it's "2, season" and sometimes "season 2"? I assume the words after "episode 2" are the genre in that example.

The operator's owners and their personnel have low IQ, and this is the result.

In example 3, it appears episode is abbreviated to just the letter "С." Does that abbreviation always mean episode?

No, it means "Сезон" (Season), but who knows, it could also be "Серия" (Episode). Again, the personnel have no standards of their own for abbreviations.

Would you be able to paste the words used from a couple of examples and maybe just a couple of complete descriptions for season/episode?

Which words do you need? I will type them for you.

Also, I noticed that you have dates such as ", 2015" and ", 1994". Is that the original air date (the date the movie was created)? If you type ", [0-9][0-9][0-9][0-9]" (comma, space) into the EPG and hit "fulltext", does it correctly match movies, or does it generate false matches with dramas and other programmes? Does the word after the year in example 2/3 mean something useful?

Thanks.

You mean the word "Режисьор" (Director).
Similar is "В ролите" (Starring).

#6 Updated by Em Smith about 1 year ago

Which grabbers do you have enabled in configuration->channel->epg grabber modules?

Do you have a couple of examples where "c." means episode and an example where it means season?

Your screenshots are good, but could you copy+paste from tvheadend a couple of examples with season/episode and one with episode only, in case Google Translate is using a different character mapping? I noticed your screenshots sometimes have "сез" and sometimes "сез." (with a period), so a few more examples will help find other differences.

Are Сезон and Епизод always in lowercase (сезон and епизод)?

#7 Updated by saen acro about 1 year ago

The enabled modules are "EIT: DVB Grabber" and "Bulsatcom: Bula 39E" (the same as the first, but on another PID).
Use Bing Translator to compare: https://www.bing.com/translator
In Bulgarian, "Серия" and "Епизод" have the same meaning.
Both are loanwords; the correct forms should be "Епизод", "Еп." or "Е."
But typing correctly by the rules of the language is another matter.

The only thing that is reliably in place as a separator is the comma ;)

сезон 1, епизод 13, драма, романтичен, САЩ, 2014

Season 1, Episode 13, drama/romantic, USA, 2014

And something harder:

сериал, еп. 45, 46

"еп. 45, 46" must be read as "ep. 45 and 46", i.e. two episodes.
(It is easier to type it once than to copy it to a second event.)

And even harder:

четвърти сезон

"Fourth season", with no episode number, because it is easier not to type it.

#8 Updated by Em Smith about 1 year ago

I've attached a patch. It needs the very latest tvheadend built.

Once built, go to Configuration->Channel->EPG Grabber Modules.

On "Bulsatcom" click "Scrape Episode".

On "EIT DVB Grabber" click "Scrape Episode" and in "Configuration to use" enter "Bulsatcom_39E".

You should see lines in the log file similar to:
[ INFO] tbl-eit: scraper Bulsatcom_39E attempt to load config "Bulsatcom_39E"
[ INFO] tbl-eit: scraper Bulsatcom_39E loaded config "Bulsatcom_39E"

Then EPGGrabber->Trigger OTA EPG Grabber.

I can't test the patch properly since I don't get your channels, but I've tested the configuration regex against the strings you provided and they seem to be ok. However it's possible you get nothing or bad results. Let me know and paste the bad match text in the note.

However, I still do not understand when "с." means season and when episode. I've assumed if it is at end of title then it is episode.

Patch does not handle "еп. 45, 46" and will use "45". I don't think tvheadend allows multiple episodes in same broadcast.

#9 Updated by saen acro about 1 year ago

I will test almost immediately if/when it is approved on GitHub ;)
I am currently on tvheadend_4.3-448~g2f07ea0.

Just an idea: is there a way to create a "situation dictionary" inside the UI?
Later it would be easy to share with friends.
I understand that there are a great many language situations to describe.

#10 Updated by saen acro about 1 year ago

The test is positive,
but to keep it simple I just created the file with "touch":

touch /usr/share/tvheadend/data/conf/epggrab/eit/scrape/Bulsatcom_39E

and pasted in the content from the patch:
{
    "season_num": [
        "сезон ([0-9]+)",
        "сез.? ([0-9]+)",
        "еп. [0-9]+,.*, ([0-9]+), ?сез" 
    ],
    "episode_num": [
        "епизод ([0-9]+)",
        "еп. ([0-9]+)",
        "с. ([0-9]+)$" 
    ],
    "airdate": [
        ", ([0-9][0-9][0-9][0-9])" 
    ]
}

The result is in the attachment.

Can we do the same modification for "Content type"?

#11 Updated by saen acro about 1 year ago

And here are some strings that the patch does not handle correctly.

#12 Updated by Em Smith about 1 year ago

Try this replacement. This will detect the new format "ep 43 & 44" but will only see it as "ep 44".

{
    "season_num": [
        "сезон ([0-9]+)",
        "сез.? ([0-9]+)",
        "с. ([0-9]+), еп.",
        "еп. [0-9]+,.*, ([0-9]+), ?сез"
    ],
    "episode_num": [
        "епизод ([0-9]+)",
        "еп. ([0-9]+)",
        "[, ] ([0-9]+) епизод",
        "с. ([0-9]+)$"
    ],
    "airdate": [
        ", ([0-9][0-9][0-9][0-9])"
    ]
}

#13 Updated by Em Smith about 1 year ago

Parsing content type is quite different code since it reads a string but has to convert it to a specific category number internally. I have to read the code more to learn how the config can be read and the exact numbers used.

#14 Updated by saen acro about 1 year ago

документален филм, 5 епизод 

e5 is not set.

сезон 4, епизод 12, криминален, драма, САЩ, 2016

s04.e12 is set correctly.

Wouldn't it be more correct to show it as S04.E12, with upper-case letters and a dot in between? I am not sure whether it is needed.

#15 Updated by Em Smith about 1 year ago

I don't know why "документален филм, 5 епизод" isn't matched. It matches the configuration line "[, ] ([0-9]+) епизод" when I try it here. Perhaps it has not rescanned that channel yet?

The EPG database file is in ".hts/tvheadend/epgdb.v2" (I don't know where that is on your system). If you stop tvheadend, remove the file, and restart tvheadend, you will have no EPG and can rescan to see if it works.

If that does not work, another reason it can fail is one I have seen here: two updates gave different descriptions for the same programme, and one did not have episode information, so it kept removing the episode data. If the rescan does not work, I can give you a patch to log whether that happens for you.

I agree with S04.E12 or S04E12, but existing GUI logic uses lowercase. I think it might be considered easier to read.

#16 Updated by saen acro about 1 year ago

If I understand the logic correctly, it expects the number to be the first value in the string.
But in this case it is the second.

#17 Updated by Em Smith about 1 year ago

Tvheadend tries each line in "episode_num" and uses the first that works.

So "сезон 4, епизод 12, криминален, драма, САЩ, 2016" matches line 1 "епизод ([0-9]+)" (епизод followed by a space followed by digits).

But "документален филм, 5 епизод" matches line 3 "[, ] ([0-9]+) епизод" (a comma or a space followed by digits followed by space followed by епизод).

If the rescan does not work, is there anything different for this fail? Is it in a different part of the programme info? If you press "EPG->fulltext" and paste "[, ] ([0-9]+) епизод" (without the ") does it match anything?
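
As a rough illustration of the "first line that works" rule described above, here is a minimal Python sketch. It is not tvheadend's actual code: `first_match` is a made-up helper name, and Python's `re` module is only assumed to behave like the scraper's regex engine for these simple patterns. The pattern list is the "episode_num" section posted in comment 12.

```python
import re

# Patterns copied from the "episode_num" list in comment 12.
EPISODE_PATTERNS = [
    "епизод ([0-9]+)",
    "еп. ([0-9]+)",
    "[, ] ([0-9]+) епизод",
    "с. ([0-9]+)$",
]


def first_match(patterns, text):
    """Return (line_number, captured_text) for the first pattern that matches.

    Hypothetical helper: tries each pattern in order and stops at the
    first hit. Returns None when nothing matches.
    """
    for line_number, pattern in enumerate(patterns, start=1):
        found = re.search(pattern, text)
        if found:
            return line_number, found.group(1)
    return None


# Matches line 1: "епизод" followed by a space and digits.
print(first_match(EPISODE_PATTERNS, "сезон 4, епизод 12, криминален, драма, САЩ, 2016"))
# Matches line 3: comma, space, digits, space, "епизод".
print(first_match(EPISODE_PATTERNS, "документален филм, 5 епизод"))
```

With these two strings the sketch reports line 1 with "12" and line 3 with "5", which is what the configuration is expected to do if the broadcast text is exactly as pasted.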

#18 Updated by Em Smith about 1 year ago

Try this attached file directly. Perhaps the pasted text above changed the spacing.

#19 Updated by saen acro about 1 year ago

No change, it's the same.

документален филм, 5 епизод, повторение
анимационен филм, 7 епизод

Strings of this type are not parsed :(

Somehow, strings like
"Value A, Value №, Value C"
"Value A, Value №"
are not parsed, but strings like
"Value №, Value X, Value Y"
do work.
The only difference is the position.

#20 Updated by Mark Clarkstone about 1 year ago

@Em Smith, would you like me to up the priority of this issue so that more people see it? Allowing them to request additional scrapers?

Also, instead of a text input box for the scraper selection, how about a drop down with the available scrapers instead?

#21 Updated by Petar Ivanov about 1 year ago

My version works, but it does not show "с. 1" and "сезон 4" when they appear alone; when both season and episode are present, they are shown.

I removed "с. ([0-9]+)$" from episode_num, because it means season.

{
    "season_num": [
        "сезон ([0-9]+)",
        "[, ] сезон ([0-9]+)",
        "сез.? ([0-9]+)",
        "[, ] с. ([0-9]+)",
        "с. ([0-9]+), еп.",
        "с. ([0-9]+)",
        "еп. [0-9]+,.*, ([0-9]+), ?сез" 
    ],
    "episode_num": [
        "([0-9]+) серия",
        "еп. ([0-9]+)",
        "[, ] ([0-9]+) еп.",
        "([0-9]+) еп.[,]",
        "епизод ([0-9]+)",
        "Епизод ([0-9]+)",
        "[, ] ([0-9]+) епизод",
        "([0-9]+) епизод" 
    ],
    "airdate": [
        ", ([0-9][0-9][0-9][0-9])" 
    ]

}

#22 Updated by saen acro about 1 year ago

This also does not help.
Petar, you can check BNT 1.
bTV has series without a season.
AXN works perfectly.

#23 Updated by Petar Ivanov about 1 year ago

As I said, this works here on BNT 1, but it does not work when there is only "с. 1".
See the screenshots.

#24 Updated by Petar Ivanov about 1 year ago

It does not work when there is "Сезон" or the short form "с."; see the screenshots for KinoNova and btv comedy. The others work.

#25 Updated by Em Smith about 1 year ago

@Mark Clarkstone
I like the drop-down idea. It would make it easier to configure for people. I'll have to find an example in the code where it's dynamically generated.

I think we should leave the priority for the moment, get in the change proposed by Konermann in #4287 (making the search also go across subtitle and description) and the test harness in #4577, and let early adopters see if there is any fallout that requires changes before advertising it more and getting more configs.

#26 Updated by saen acro about 1 year ago

I just found something wrong:

кулинарно предаване, 3 епизода

Literally, this says that three episodes will be played,
and the duration corresponds to this, but the scraper thinks it is S03.

#27 Updated by saen acro about 1 year ago

Also, here are some strings with an age rating:

еп. 7, комедия, САЩ, 2017. В ролите: Иса Рей, Ивон Орджи, Джей Елис, Лиса Джойс, Наташа Ротуел, [15+]

[15+] is the rating.

How do I fill in a converter for "First|second|third|fourth|fifth|...",
in my case "първи|втори|трети|четвърти|пети|...",
to convert the season number?

Airdate: does this work?

#28 Updated by Em Smith about 1 year ago

Unfortunately I can't scrape the first, second since it requires a string to number lookup (same as genre). Does first, second occur a lot?

Airdate may work. I tried to write it so it matches for examples you gave. I can't see any easy way for a user to verify it works since Kodi doesn't seem to display it.

Rating is interesting. I will look at whether the code can handle that. Is it always in the same format, "[18+]", "[15+]", etc.? What about children's programmes, are they "[7+]"?
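
To make the "string to number lookup" for ordinal seasons concrete, here is a small Python sketch. It is purely illustrative: nothing in this thread says tvheadend's scraper configuration supports such a table, and `ORDINALS` and `season_from_ordinal` are hypothetical names. The five words are the ones from comment 27.

```python
import re

# Hypothetical lookup table (ordinal word -> season number).
# The scraper config discussed in this thread has no such section.
ORDINALS = {
    "първи": 1,
    "втори": 2,
    "трети": 3,
    "четвърти": 4,
    "пети": 5,
}

# One alternation built from the table keys, followed by " сезон".
ORDINAL_SEASON = re.compile("(" + "|".join(ORDINALS) + ") сезон")


def season_from_ordinal(text):
    """Return the season number for text like "четвърти сезон", else None."""
    found = ORDINAL_SEASON.search(text)
    return ORDINALS[found.group(1)] if found else None


print(season_from_ordinal("четвърти сезон "))
print(season_from_ordinal("сезон 4, епизод 12"))
```

The point of the sketch is only that the regex finds the word and a separate table supplies the number, which is the extra step a plain capture group cannot do.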

#29 Updated by saen acro about 1 year ago

If I change the line
"[, ] ([0-9]+) епизод",
to
"[, ] ([0-9]+) епизод [, ]",
will this exclude the word "епизода" in the string below?

кулинарно предаване, 3 епизода

#30 Updated by Em Smith about 1 year ago

Are you using Petar Ivanov's changes from above? If yes, then it will match line 3 of "episode_num": '"[, ] ([0-9]+) еп."'.

The "еп." should be "еп[.]" in his file since "." means "any character".

But it will then match the last two lines.

So "кулинарно предаване, 3 епизода" is "cookery show, with episode number of 3 (E3)", or does it mean "total number of episodes is 3"?

Earlier (comment 14) you said "документален филм, 5 епизод" should mean "documentary, Episode 5"?

So it's not matching any episode?

I'm adding all the examples you and Petar have to my tests and hopefully we can get it solved.

The easiest way to test your changes is to use 'echo' for the string you have and 'egrep' with the regex. If it returns nothing then it does not match. If it returns something (maybe highlighted in red) then it has matched.

echo "кулинарно предаване, 3 епизода" | egrep "[, ] ([0-9]+) епизод " 

@Petar:
For BNT1, do you mean it should be "S1.E10", "S1.E05" for the first two examples in BNT1.jpg?

At the moment, we do not force a season if broadcast does not have one. Does it make sense to always force season 1 if you have an episode and no season? (In my region you often have Ep1 even though it might be S3 and they don't say it is S3).

For kinonova, first line should be S4?

#31 Updated by saen acro about 1 year ago

Em Smith wrote:

Are you using Petar Ivanov's changes from above? If yes, then it will match line 3 of "episode_num": '"[, ] ([0-9]+) еп."'.

The "еп." should be "еп[.]" in his file since "." means "any character".

The dot here is an abbreviation mark, as in "ex." for "example".

But it will then match the last two lines.

So "кулинарно предаване, 3 епизода" is "cookery show, with episode number of 3 (E3)", or does it mean "total number of episodes is 3"?

Earlier (comment 14) you said "документален филм, 5 епизод" should mean "documentary, Episode 5"?

So it's not matching any episode?

I'm adding all the examples you and Petar have to my tests and hopefully we can get it solved.

The easiest way to test your changes is to use 'echo' for the string you have and 'egrep' with the regex. If it returns nothing then it does not match. If it returns something (maybe highlighted in red) then it has matched.

echo "кулинарно предаване, 3 епизода" | egrep "[, ] ([0-9]+) епизод "

:~$ echo "кулинарно предаване, 3 епизода" | egrep "[, ] ([0-9]+) епизод "
:~$ echo "кулинарно предаване, 3 епизод" | egrep "[, ] ([0-9]+) епизод "
:~$ echo "кулинарно предаване, 3 епизод" | egrep "[, ] ([0-9]+) епизод"
кулинарно предаване, 3 епизод
:~$ echo "кулинарно предаване, 3 епизода" | egrep "[, ] ([0-9]+) епизод"
кулинарно предаване, 3 епизода

#32 Updated by saen acro about 1 year ago

My file:

{
    "season_num": [
        "сезон ([0-9]+)",
        "[, ] сезон ([0-9]+)",
        "сез.? ([0-9]+)",
        "[, ] с[.] ([0-9]+)",
        "с[.] ([0-9]+), еп.",
        "с[.] ([0-9]+)",
        "еп[.] [0-9]+,.*, ([0-9]+), ?сез" 
    ],
    "episode_num": [
        "([0-9]+) серия ",
        "еп[.] ([0-9]+) ",
        "[, ] ([0-9]+) еп.",
        "([0-9]+) еп[.][,]",
        "епизод ([0-9]+)",
        "Епизод ([0-9]+)",
        "[, ] ([0-9]+) епизод ",
        "([0-9]+) епизод " 
    ],
    "airdate": [
        ", ([0-9][0-9][0-9][0-9])" 
    ]
}

The result is attached.

#33 Updated by Em Smith about 1 year ago

I've submitted a pull request for a test harness (#4577). This has a simple file where you can put the EIT description and what you expect season and episode to be to check your changes.

In the config file "." means "any character" (rather than ex. as in example). I think it was my mistake that has been copied.

For example this egrep will match the character "е":

$ echo "кулинарно предаване, 3 епизода" | egrep "пр.да" 
кулинарно предаване, 3 епизода

My test harness suggests your config file does not parse this from your earlier example in comment 27: "еп. 7, комедия, САЩ".

Perhaps instead of "еп[.] ([0-9]+) " try "еп[.] ([0-9]+)[ ,]" (which means space or comma).
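
For anyone who prefers to check this outside the shell, the same two points can be reproduced in Python. This is only a sketch: Python's `re` module is assumed to agree with egrep and with the scraper for these simple patterns, and the strings are the ones quoted in this thread.

```python
import re

text = "кулинарно предаване, 3 епизода"
summary = "еп. 7, комедия, САЩ"

# An unescaped "." matches any character, so "пр.да" matches "преда".
dot_any = re.search("пр.да", text)
# "[.]" matches only a literal full stop, so this finds nothing.
dot_literal = re.search("пр[.]да", text)

# Requires a space after the number; fails because "7" is followed by ",".
strict = re.search("еп[.] ([0-9]+) ", summary)
# Allows a space or a comma after the number.
relaxed = re.search("еп[.] ([0-9]+)[ ,]", summary)

print(dot_any is not None)      # True
print(dot_literal is not None)  # False
print(strict is not None)       # False
print(relaxed.group(1))         # 7
```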

#34 Updated by Em Smith about 1 year ago

The test harness (#4577) is now available, so let's try and get this config working as best we can. We might not get 100% season/episode working, but let's try our best.

The file in support/testdata/eitscrape/bg contains examples from your bug reports, so "summary" is the text from EIT, "season" and "episode" are what you expect to receive, with "null" meaning don't expect anything.

cd support
PYTHONIOENCODING=utf-8 ./eitscrape_test.py ../data/conf/epggrab/eit/scrape/Bulsatcom_39E  ./testdata/eitscrape/bg

So over the next few days, please try and add other examples that work or don't work. Just add them to the support/testdata/eitscrape/bg test file and re-run the test harness to see if it works.

Then next weekend let's try and get the best config.

#35 Updated by saen acro about 1 year ago

This is an export of the current EPG.
How can I use the script on it?

#36 Updated by Em Smith about 1 year ago

That file is actually really useful. For example I can see lots of odd things such as:

<sub-title lang="bul">еп. 78  </sub-title>

(78 followed by two spaces, which I would never have known from the screenshot);
and
<desc lang="bul"> Сенките...

(space immediately before the description, but only for some programmes).

We can't use the file directly, but can copy+paste to make examples. For example if your GUI isn't showing a correct season/episode then find the information in your channels file and use it.

There exists the file "support/testdata/eitscrape/bg"

For example:

In your file channels.7z, search in the file for "сезон 2, епизод 4, драма, криминален, САЩ, 2015" and you will find:

  <sub-title lang="bul">сезон 2, епизод 4, драма, криминален, САЩ, 2015 </sub-title>

If we assumed that wasn't working then we would add an entry via copy+paste (ensure we get everything, including extra spaces):

{
        "summary": "сезон 2, епизод 4, драма, криминален, САЩ, 2015 ",
        "season" : 2, "episode": "4", "airdate" : "2015" 
},

Then we run the test harness:

cd support;
PYTHONIOENCODING=utf-8 ./eitscrape_test.py ../data/conf/epggrab/eit/scrape/Bulsatcom_39E  ./testdata/eitscrape/bg

(Assuming your regex file is called Bulsatcom_39E in that directory).

And you should see at the end NumOK: XX NumFailed: YY

So then we know if it worked or not and can alter the regex to try and match it.

Now, we don't actually need that as a new entry since I already have one very similar in the test file. But hopefully that gives an idea.
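
To give a feel for what such a check does, here is a minimal Python sketch of the idea. It is not the real eitscrape_test.py: the data shapes, the `scrape` and `run` helpers, and the use of strings for every expected value are assumptions made for illustration only.

```python
import re

# Assumed shape: field name -> list of regexes, tried in order.
config = {
    "season_num": ["сезон ([0-9]+)"],
    "episode_num": ["епизод ([0-9]+)", "[, ] ([0-9]+) епизод"],
}

# Assumed shape: the EIT text plus the expected values.
# None means "do not expect anything".
tests = [
    {"summary": "сезон 2, епизод 4, драма, криминален, САЩ, 2015 ",
     "season": "2", "episode": "4"},
    {"summary": "документален филм, 5 епизод",
     "season": None, "episode": "5"},
]


def scrape(patterns, text):
    """Return the first capture group of the first matching pattern, or None."""
    for pattern in patterns:
        found = re.search(pattern, text)
        if found:
            return found.group(1)
    return None


def run(config, tests):
    """Compare scraped season/episode with expectations and count results."""
    num_ok = num_failed = 0
    for entry in tests:
        got = (scrape(config["season_num"], entry["summary"]),
               scrape(config["episode_num"], entry["summary"]))
        if got == (entry["season"], entry["episode"]):
            num_ok += 1
        else:
            num_failed += 1
            print("FAILED:", repr(entry["summary"]), "got", got)
    return num_ok, num_failed


num_ok, num_failed = run(config, tests)
print("NumOK:", num_ok, "NumFailed:", num_failed)
```

Adding a failing broadcast string to `tests` and then editing `config` until the failure disappears mirrors the workflow described above.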

#37 Updated by Em Smith about 1 year ago

I've attached a patch for converting genre strings. My broadcaster sends content type, but I've tested the patch by forcing specific programmes to have a different genre.

You need to alter your scraper configuration file and add a section such as:

    "genre_16": ["(драма,  романтичен)"],
    "genre_23": ["(документален)"]

The codes at the end are the content specifier (hex) value from EN 300 468 table 28:
http://www.etsi.org/deliver/etsi_en/300400_300499/300468/01.11.01_60/en_300468v011101p.pdf

Each regex needs brackets to indicate it's a "match". You can have multiple regex per genre, or combine them such as "(a|b)".
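
As a sketch of how a key such as "genre_23" relates to the hex code and its two nibbles, consider the following Python. It is an illustration only, not the patch's code: `genre_code` and `scrape_genre` are hypothetical helpers, and returning the first matching genre is an assumption. In EN 300 468 table 28, 0x2/0x3 is "documentary".

```python
import re

# One entry from the example section above; the test string below is
# taken from comment 14.
GENRE_CONFIG = {
    "genre_23": ["(документален)"],
}


def genre_code(key):
    """Convert a key such as "genre_23" into (code, level_1, level_2)."""
    prefix, _, hex_digits = key.partition("_")
    if prefix != "genre":
        raise ValueError("not a genre key: " + key)
    code = int(hex_digits, 16)        # the suffix is hexadecimal
    return code, code >> 4, code & 0x0F


def scrape_genre(config, text):
    """Return the decoded code of the first genre whose regex matches, else None.

    Assumption: the first matching genre wins.
    """
    for key, patterns in config.items():
        for pattern in patterns:
            if re.search(pattern, text):
                return genre_code(key)
    return None


print(genre_code("genre_16"))
print(scrape_genre(GENRE_CONFIG, "документален филм, 5 епизод"))
```

Note that the key suffix is hex, so "genre_4a" would decode to level 1 = 0x4 and level 2 = 0xA.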

You can apply the patch via "git am file.patch".

The configuration file I have for Bulsatcom episodes still has a problem with not recognizing "кулинарно предаване, 3 епизода". Did you create a better file?

Let me know if you have questions.

#38 Updated by saen acro about 1 year ago

Em Smith wrote:

The configuration file I have for Bulsatcom episodes still has a problem with not recognizing "кулинарно предаване, 3 епизода". Did you create a better file?

This must be excluded. It cannot be determined which three episodes they are: they could be 1, 2, 3 or 256, 257, 258.
----
Translations table
ETSI EN 300 468 V1.11.1 (2010-04)
Table 28: Content_nibble level 1 and 2 assignments

Content_nibble_level_1 Content_nibble_level_2 ID Description
Movie/Drama:
0x1 0x0 10 movie/drama (general)
0x1 0x1 11 detective/thriller
0x1 0x2 12 adventure/western/war
0x1 0x3 13 science fiction/fantasy/horror
0x1 0x4 14 comedy
0x1 0x5 15 soap/melodrama/folkloric
0x1 0x6 16 romance
0x1 0x7 17 serious/classical/religious/historical movie/drama
0x1 0x8 18 adult movie/drama
News/Current affairs:
0x2 0x0 20 news/current affairs (general)
0x2 0x1 21 news/weather report
0x2 0x2 22 news magazine
0x2 0x3 23 documentary
0x2 0x4 24 discussion/interview/debate
Show/Game show:
0x3 0x0 30 show/game show (general)
0x3 0x1 31 game show/quiz/contest
0x3 0x2 32 variety show
0x3 0x3 33 talk show
Sports:
0x4 0x0 40 sports (general)
0x4 0x1 41 special events (Olympic Games, World Cup, etc.)
0x4 0x2 42 sports magazines
0x4 0x3 43 football/soccer
0x4 0x4 44 tennis/squash
0x4 0x5 45 team sports (excluding football)
0x4 0x6 46 athletics
0x4 0x7 47 motor sport
0x4 0x8 48 water sport
0x4 0x9 49 winter sports
0x4 0xA 4a equestrian
0x4 0xB 4b martial sports
Children's/Youth programmes:
0x5 0x0 50 children's/youth programmes (general)
0x5 0x1 51 pre-school children's programmes
0x5 0x2 52 entertainment programmes for 6 to 14
0x5 0x3 53 entertainment programmes for 10 to 16
0x5 0x4 54 informational/educational/school programmes
0x5 0x5 55 cartoons/puppets
Music/Ballet/Dance:
0x6 0x0 60 music/ballet/dance (general)
0x6 0x1 61 rock/pop
0x6 0x2 62 serious music/classical music
0x6 0x3 63 folk/traditional music
0x6 0x4 64 jazz
0x6 0x5 65 musical/opera
0x6 0x6 66 ballet
Arts/Culture (without music):
0x7 0x0 70 arts/culture (without music, general)
0x7 0x1 71 performing arts
0x7 0x2 72 fine arts
0x7 0x3 73 religion
0x7 0x4 74 popular culture/traditional arts
0x7 0x5 75 literature
0x7 0x6 76 film/cinema
0x7 0x7 77 experimental film/video
0x7 0x8 78 broadcasting/press
0x7 0x9 79 new media
0x7 0xA 7a arts/culture magazines
0x7 0xB 7b fashion
Social/Political issues/Economics:
0x8 0x0 80 social/political issues/economics (general)
0x8 0x1 81 magazines/reports/documentary
0x8 0x2 82 economics/social advisory
0x8 0x3 83 remarkable people
Education/Science/Factual topics:
0x9 0x0 90 education/science/factual topics (general)
0x9 0x1 91 nature/animals/environment
0x9 0x2 92 technology/natural sciences
0x9 0x3 93 medicine/physiology/psychology
0x9 0x4 94 foreign countries/expeditions
0x9 0x5 95 social/spiritual sciences
0x9 0x6 96 further education
0x9 0x7 97 languages
Leisure hobbies:
0xA 0x0 a0 leisure hobbies (general)
0xA 0x1 a1 tourism/travel
0xA 0x2 a2 handicraft
0xA 0x3 a3 motoring
0xA 0x4 a4 fitness and health
0xA 0x5 a5 cooking
0xA 0x6 a6 advertisement/shopping
0xA 0x7 a7 gardening
Special characteristics:
0xB 0x0 b0 original language
0xB 0x1 b1 black and white
0xB 0x2 b2 unpublished
0xB 0x3 b3 live broadcast

Also, to make it easy to translate into other languages:

"genre_10": ["(movie/drama (general))"],
"genre_11": ["(detective/thriller)"],
"genre_12": ["(adventure/western/war)"],
"genre_13": ["(science fiction/fantasy/horror)"],
"genre_14": ["(comedy)"],
"genre_15": ["(soap/melodrama/folkloric)"],
"genre_16": ["(romance)"],
"genre_17": ["(serious/classical/religious/historical movie/drama)"],
"genre_18": ["(adult movie/drama)"],
"genre_20": ["(news/current affairs (general))"],
"genre_21": ["(news/weather report)"],
"genre_22": ["(news magazine)"],
"genre_23": ["(documentary)"],
"genre_24": ["(discussion/interview/debate)"],
"genre_30": ["(show/game show (general))"],
"genre_31": ["(game show/quiz/contest)"],
"genre_32": ["(variety show)"],
"genre_33": ["(talk show)"],
"genre_40": ["(sports (general))"],
"genre_41": ["(special events (Olympic Games, World Cup, etc.))"],
"genre_42": ["(sports magazines)"],
"genre_43": ["(football/soccer)"],
"genre_44": ["(tennis/squash)"],
"genre_45": ["(team sports (excluding football))"],
"genre_46": ["(athletics)"],
"genre_47": ["(motor sport)"],
"genre_48": ["(water sport)"],
"genre_49": ["(winter sports)"],
"genre_4a": ["(equestrian)"],
"genre_4b": ["(martial sports)"],
"genre_50": ["(children's/youth programmes (general))"],
"genre_51": ["(pre-school children's programmes)"],
"genre_52": ["(entertainment programmes for 6 to 14)"],
"genre_53": ["(entertainment programmes for 10 to 16)"],
"genre_54": ["(informational/educational/school programmes)"],
"genre_55": ["(cartoons/puppets)"],
"genre_60": ["(music/ballet/dance (general))"],
"genre_61": ["(rock/pop)"],
"genre_62": ["(serious music/classical music)"],
"genre_63": ["(folk/traditional music)"],
"genre_64": ["(jazz)"],
"genre_65": ["(musical/opera)"],
"genre_66": ["(ballet)"],
"genre_70": ["(arts/culture (without music, general))"],
"genre_71": ["(performing arts)"],
"genre_72": ["(fine arts)"],
"genre_73": ["(religion)"],
"genre_74": ["(popular culture/traditional arts)"],
"genre_75": ["(literature)"],
"genre_76": ["(film/cinema)"],
"genre_77": ["(experimental film/video)"],
"genre_78": ["(broadcasting/press)"],
"genre_79": ["(new media)"],
"genre_7a": ["(arts/culture magazines)"],
"genre_7b": ["(fashion)"],
"genre_80": ["(social/political issues/economics (general))"],
"genre_81": ["(magazines/reports/documentary)"],
"genre_82": ["(economics/social advisory)"],
"genre_83": ["(remarkable people)"],
"genre_90": ["(education/science/factual topics (general))"],
"genre_91": ["(nature/animals/environment)"],
"genre_92": ["(technology/natural sciences)"],
"genre_93": ["(medicine/physiology/psychology)"],
"genre_94": ["(foreign countries/expeditions)"],
"genre_95": ["(social/spiritual sciences)"],
"genre_96": ["(further education)"],
"genre_97": ["(languages)"],
"genre_a0": ["(leisure hobbies (general))"],
"genre_a1": ["(tourism/travel)"],
"genre_a2": ["(handicraft)"],
"genre_a3": ["(motoring)"],
"genre_a4": ["(fitness and health)"],
"genre_a5": ["(cooking)"],
"genre_a6": ["(advertisement/shopping)"],
"genre_a7": ["(gardening)"],
"genre_b0": ["(original language)"],
"genre_b1": ["(black and white)"],
"genre_b2": ["(unpublished)"],
"genre_b3": ["(live broadcast)"],

#39 Updated by Em Smith about 1 year ago

I think there's a misunderstanding. The genre names are already translated in tvheadend.bg.po, so you will probably already have Мониторинг, Изящни изкуства, etc. in the GUI.

The regexes that you put in the configuration file need to match your broadcast data, such as the sub-title text shown in the comment 1 screenshots Заснемане.PNG and Заснемане2.PNG.

So from those images you could have regex such as:

"genre_10" : ["(^Драма, )"],
"genre_12" : ["(, Фантастични, )"],
"genre_14" : ["(^Комедия)"],

#40 Updated by saen acro about 1 year ago

Em Smith wrote:

I think there's a misunderstanding. The genre names are already translated in tvheadend.bg.po, so you will probably already have Мониторинг, Изящни изкуства, etc. in the GUI.

The regexes that you put in the configuration file need to match your broadcast data, such as the sub-title text shown in the comment 1 screenshots Заснемане.PNG and Заснемане2.PNG.

So from those images you could have regex such as:
[...]

Em Smith, can you describe the steps the EPG string goes through during parsing,
and how the words are sorted and passed to the corresponding values in the EPG table in the UI?
(A diagram or something.)

p.s.
Мониторинг = Monitoring ;)
It also reads and sounds the same ;)

#41 Updated by Em Smith about 1 year ago

The idea is that if "genre_xy" matches a regular expression then the programme will be assigned genre category code "xy".

So, if genre_10 has regex "(^Драма, )" then if your programme has "Драма" it will be given internally genre code 10.

The regex is only used for matching, the actual words it matches are not used. So, if you match against "(^Драма, )" then that word is not used, it is the category code ("genre_xy") that is used.

With other countries the DVB contains multiple genre codes so we simply copy them to internal structures. The regex in the patch connects to that logic.

If a programme matches multiple regex then it will be given multiple genres.

If you add the examples I gave and run an OTA grab then you should be able to use "filter by content type" on the EPG, which may make it clearer what is happening.
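As a rough illustration of the matching described above, here is a minimal Python sketch. It is not the Tvheadend code (which is C and uses a different regex library); the `GENRE_PATTERNS` table and the `match_genres` helper are hypothetical names, and the patterns are just the examples from this thread.

```python
import re

# Hypothetical scrape table: genre code (hex digits after "genre_") -> regexes.
# The patterns are the examples given earlier in this thread.
GENRE_PATTERNS = {
    "10": ["(^Драма, )"],
    "12": ["(, Фантастични, )"],
    "14": ["(^Комедия)"],
}


def match_genres(text, patterns=GENRE_PATTERNS):
    """Return the sorted genre codes whose regexes match the text.

    Only the fact that a regex matched is used; the matched words
    themselves are discarded, as described above.
    """
    codes = set()
    for code, regexes in patterns.items():
        if any(re.search(rx, text) for rx in regexes):
            codes.add(int(code, 16))
    return sorted(codes)


# A sub-title that matches two patterns gets two genre codes.
print([hex(c) for c in match_genres("Драма, Фантастични, САЩ, 2014")])
```

A sub-title that matches none of the patterns simply gets no scraped genre.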

#42 Updated by saen acro about 1 year ago

I still believe that a simple CSV table would be a better solution for parsing the genre type.

ex.
CODE | ETSI Description | language EN | language BG | language IT | language DE | language PL
14 | comedy | comedy | комедия | commedia | Komödie | komedia

Everybody can translate the corresponding column into their own language.

When the parser sees the word "Комедия" between commas or followed by a comma (", Комедия,"), it would search the variables loaded from the corresponding language column for "Комедия" and set the EPG genre value for the event.
Sorry for the clumsy explanation; I have only programmed in Basic, but I don't believe that in modern programming languages it is harder to GET/SET values after IF/THEN/ELSE logic.

P.S. Are small and capital letters treated differently?
Is Комедия = комедия?

#43 Updated by Em Smith about 1 year ago

The patterns are case-sensitive (small and capital letters are different). I couldn't find any obvious documentation for how it would handle insensitive regex locales (e.g., German ß) since it uses an older regex library to keep compatibility with systems that do not ship/build with the more powerful (but optional) pcre regex libraries. You can probably use "[Кк]омедия" to match both patterns.
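To show the difference, here is a small Python sketch. Python's `re` module is only a stand-in here (Tvheadend uses its own regex library), and the `matches` helper is a hypothetical name, but a plain character class like `[Кк]` behaves the same way in any regex flavour.

```python
import re


def matches(pattern, text):
    """Return True if the case-sensitive pattern is found in text."""
    return re.search(pattern, text) is not None


for text in ("Комедия, еп. 5", "комедия, еп. 5"):
    # First column: exact-case pattern; second column: character-class pattern.
    print(text, matches("(^Комедия)", text), matches("(^[Кк]омедия)", text))
```

The exact-case pattern only matches the capitalised sub-title, while the character-class pattern matches both.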

The CSV is less flexible because the regex is not per-language but per-region. For example, satellite in UK has completely different text in title/description to satellite in USA despite both being EN.

But, more importantly, many broadcasters already transmit the genre code as part of the data so drama is already tagged, and adding regex for those systems could be wrong since regex matching is not as accurate as broadcaster's knowledge of what they are broadcasting.

You see a similar per-region approach in language translation files where each region has its own file instead of all languages being in one file.

Another reason is that CSV files are quite difficult for developer merges since merges occur on a line-by-line basis and you get frequent conflicts.

The advantage of the config-based approach is that you don't have to define all genres if you don't want to, just define the major ones (10, 20, 30, etc) and ignore the minor ones (21, 22, 23).

#44 Updated by saen acro about 1 year ago

The CSV could be used only to load the variables once, rather than querying the file every time (memory is a little faster ;) )

All symbols such as ñåüúíóöøðßáæ are part of UTF-8 (ISO-8859-1 to ISO-8859-15 text just needs to be converted).

#45 Updated by Jaroslav Kysela about 1 year ago

I believe that we should extend the genres to the new three- or four-level ETSI TS 102 822-3-1 format (see #3753) before this code.

Also, it would probably be much more readable to use the English names as keys in the JSON config files, like:

{
  "Documentary": ["(документален)"] 
}

#46 Updated by saen acro about 1 year ago

Jaroslav Kysela wrote:

I believe that we should extend the genres to the new three- or four-level ETSI TS 102 822-3-1 format (see #3753) before this code.

Also, it would probably be much more readable to use the English names as keys in the JSON config files, like:

{

"Documentary": ["(документален)"]
}

In some cases there are two-word synonyms for "news", such as "новинарска емисия", "вечерни новини", etc.
So we need a text file that is easy to fill in and human-readable/editable.
I'd be very happy if a UI option appeared.

#47 Updated by Em Smith about 1 year ago

The ETSI categories are far more comprehensive and seem to be what I get in SD and parse out as categories. It's a shame their major numbers aren't compatible with the EN300468 numbers (so 3.1.1 for News vs 0x20). The one problem I see is that so many clients seem to hardcode the major/minor genre extraction, primarily to get colours in the EPG GUI. It looks like Kodi has EPG_GENRE_USE_STRING and so it can pass through arbitrary data.
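For anyone unfamiliar with the major/minor split that clients hardcode: in EN 300 468 the content code is one byte, with the major genre in the high nibble and the minor genre in the low nibble. A minimal Python sketch (the `split_genre` name is just for illustration):

```python
def split_genre(code):
    """Split an EN 300 468 content code into (major, minor) nibbles."""
    return code >> 4, code & 0x0F


# 0x11 (detective/thriller) belongs to major group 1, minor 1;
# 0x20 (news/current affairs, general) is major group 2, minor 0.
for code in (0x11, 0x20, 0x4B):
    major, minor = split_genre(code)
    print(hex(code), major, minor)
```

A client that colours the EPG by major group only needs the first value, which is why a multi-level ETSI code does not fit that scheme directly.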

I think if we proceed with this patch then the config file should probably move all the genre tags into a sub-tag map to keep them all nicely together (so have genres : [ genre_ab: [], genre_cd: [] ]).

I thought about using string names. The code should be easy (maybe 50 lines) but I thought it might be too complicated for config files and prone to typos. It's ok for some genres ("religion", "athletics") but since we have genres such as "special events (Olympic Games, World Cup, etc.)" it would be too difficult for people to type. So then we'd have to have shortened names. I preferred the numbers since it's then easy to see which genres are missing, e.g., if we have 40, 44, 45, then it's clear we are missing 41-43. However, I'm easy on adding it, or having both formats.

But with ETSI, the numbering is more difficult to read. Would we have "Religion/Philosophies" (3.1.2) and then append the sub-category "Religion" (3.1.2.1) and sub-sub-category "Buddhism" (3.1.2.1.1) to get "Religion/Philosophies/Religion/Buddhism"?

I don't think this needs a GUI nor a new config file format since I think all Tvheadend config is JSON. It is a system file that is likely to be modified only a couple of times a year at most. Even then, the mapping is only useful for broadcasters that don't send genre information but have data that is scrapable. Examples of existing config include data/conf/epggrab/opentv/dict/skyeng. It's certainly possible to post-process some other easier-to-edit format into JSON, which suggests that perhaps this config should be in a completely separate file.

I'll park this one pending #3573 and more agreement on the best way forward.

#48 Updated by saen acro about 1 year ago

Em Smith wrote:

I've attached a patch for converting genre strings. My broadcaster sends content type, but I've tested the patch by forcing specific programmes to have a different genre.
Let me know if you have questions.

This doesn't fill in any data.
It would be easier to give you an account on my TVH to test.

#49 Updated by Em Smith 11 months ago

Sorry I missed your previous post.

It's taken a while since I needed to get some other bugs resolved first. Please try the new patch when you have time. The configuration has changed slightly so it is:

 "genre" : [ {
      "Romance" : ["(Romcom)"],
      "Detective / Thriller": [ "(Detective)" ],
      "Cartoons / Puppets": [ "(Muppets|Sesame)" ],
      "Sports" : ["^(Snooker)"]
  }]

You should see lines logged at startup:

"Module Bulsatcom_39E - Scrape "Detective / Thriller" to genre 0x11" 

It doesn't use a separate file because of some slight complications with loading multiple different files. But hopefully they should be written once and then only modified very infrequently.

There are a few csv-to-json tools out there if it helps you.

There is a UK genre mapping file as an example.

The genre name must be in English, and the regex must have a capture group.

So:

"Documentary": ["(документален)"]

Not:
"Documentary": ["документален"]   (Missing brackets)

Not:
"документален": ["(документален)"]  (Left name must be English).

Also spaces are important in the left name, so "Detective / Thriller", not "Detective/Thriller" (missing space), not "Detective / thriller" (incorrect capital letters).
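If you want to check a mapping before restarting, the rules above can be sketched in a few lines of Python. This is only an illustration and not part of the patch: `KNOWN_GENRES` and `check_entry` are hypothetical names, and the table is a small assumed subset (only "Detective / Thriller" = 0x11 is confirmed by the log line above; the other codes are taken from the genre list earlier in this thread).

```python
import re

# Assumed subset of valid English genre names and their codes.
KNOWN_GENRES = {
    "Detective / Thriller": 0x11,
    "Romance": 0x16,
    "Documentary": 0x23,
    "Sports": 0x40,
    "Cartoons / Puppets": 0x55,
}


def check_entry(name, patterns):
    """Return a list of problems found in one "name": [regex, ...] entry."""
    problems = []
    # The name must match exactly, including spaces and capital letters.
    if name not in KNOWN_GENRES:
        problems.append(f"unknown genre name {name!r}")
    for pattern in patterns:
        try:
            compiled = re.compile(pattern)
        except re.error as exc:
            problems.append(f"invalid regex {pattern!r}: {exc}")
            continue
        # Every regex needs at least one capture group, i.e. brackets.
        if compiled.groups < 1:
            problems.append(f"regex {pattern!r} has no capture group")
    return problems


print(check_entry("Documentary", ["(документален)"]))
print(check_entry("Documentary", ["документален"]))
print(check_entry("Detective/Thriller", ["(Detective)"]))
```

An empty list means the entry follows the rules; note that Python's regex syntax is not identical to the library Tvheadend uses, so this only catches the obvious mistakes.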

If it doesn't work then I'll add a bit more debug logging.

I think you have 50mbps but my Internet is still RFC 1149 so me debugging on your system may be a bit slow, so will try a bit of extra logging first.

#50 Updated by saen acro 11 months ago

Is there a way to process the XMLTV XML file generated by TVH with this processing script?
The regular scenario is slower.

----
P.S. About internet speed:
I don't live in a banana country (or western Europe); my internet is 10 times faster than 50 Mbps,
and nearly 30% of households have a fiber connection.

#51 Updated by Em Smith 11 months ago

Perhaps csv is easier for generating the json? I've attached a quick script which uses exclamation mark ("!") as a separator instead of comma (",").

It will take a file like:

Romance!драма, романтичен
Documentary!(документален)
Detective / Thriller!(документален)

and it will add missing brackets and output this:
"genre" : [{
"Romance" : ["(драма, романтичен)"],
"Documentary" : ["(документален)"],
"Detective / Thriller" : ["(документален)"],   <---you have to remove this comma yourself
}]

You can then copy+paste into the Bulsatcom_39E file.
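The same conversion could be sketched in Python roughly as follows. This is not the attached csv2json script, just an assumed equivalent; `lines_to_genre_json` is a hypothetical name. Because it builds the output with the `json` module, there is no trailing comma to remove by hand.

```python
import json


def lines_to_genre_json(lines, separator="!"):
    """Convert "Name!regex" lines into a "genre" config fragment."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, regex = line.partition(separator)
        regex = regex.strip()
        # Wrap the regex in brackets unless it already starts with "("
        # and ends with ")".
        if not (regex.startswith("(") and regex.endswith(")")):
            regex = f"({regex})"
        # Several lines with the same name add several regexes to that genre.
        mapping.setdefault(name.strip(), []).append(regex)
    return '"genre" : ' + json.dumps([mapping], ensure_ascii=False, indent=2)


print(lines_to_genre_json([
    "Romance!драма, романтичен",
    "Documentary!(документален)",
]))
```

`ensure_ascii=False` keeps the Cyrillic text readable in the output instead of escaping it.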

I already added a couple of examples in the Bulsatcom_39E file to see if it works for you.

500Mbps for 10euro? Compare UK 2016: "the slowest download speed was 0.12Mbps...the fastest average broadband speed at 77.17Mbps".

#52 Updated by saen acro 11 months ago

I tried to apply the patch, but:

# patch <0001-eit-Scrape-genre-from-text-in-OTA-EIT.-4509.patch
patching file Bulsatcom_39E
can't find file to patch at input line 103
Perhaps you should have used the -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/data/conf/epggrab/eit/scrape/uk b/data/conf/epggrab/eit/scrape/uk
|index f7b383db0..bac2e8621 100644
|--- a/data/conf/epggrab/eit/scrape/uk
|+++ b/data/conf/epggrab/eit/scrape/uk
--------------------------
File to patch: /data/conf/epggrab/eit/scrape/uk
/data/conf/epggrab/eit/scrape/uk: No such file or directory
Skip this patch? [y]

#53 Updated by Em Smith 11 months ago

The patch is against master:
c1a5e434b59a0fa08e7a3c256ac6908fae2b0265
Thu Dec 14 21:28:22 2017 +0100

#54 Updated by Em Smith 11 months ago

Ah, actually you should use:

git am .....patch

(You can use git reset --hard HEAD^ to revert)

#55 Updated by Em Smith 11 months ago

The git reset command will delete all the changes from the patch including all changes you make to Bulsatcom_39E so take a copy of the config before resetting.

#56 Updated by saen acro 11 months ago

I made a copy of Bulsatcom_39E named bg in
/usr/share/tvheadend/data/conf/epggrab/eit/scrape

but I would prefer to have a copy here:
https://github.com/tvheadend/tvheadend/tree/master/data/conf/epggrab/eit/scrape

#57 Updated by Em Smith 11 months ago

If the patch works for you, then I'll submit a Pull Request. If approved, the configuration will be in github. Then you can create the extra genre mappings and we can submit them as new configuration.

Then we should rename the "Bulsatcom_39E" grabber to "bg_bulsatcom" (to be consistent with the other grabbers) and the configuration file will be renamed "bg".
