Feature #4577

EIT scraper basic test harness

Added by Em Smith over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
Category:
EPG - Grabbers
Target version:
-
Start date:
2017-09-09
Due date:
% Done:

0%

Estimated time:

Description

Issue #4287 implemented a basic EIT scraper that relies on regular expressions.

I have a very basic test harness in progress to check that we extract the correct data, since I think it will be easy to fix a regex for one programme or region but accidentally break it for another. Having samples of what we are trying to extract should help avoid breakage. It should also help in crafting configurations for other regions, since we can at least receive examples even if there is no working scraper for them.

Does storing a complete programme title/subtitle/description from the EIT in git count as "fair use" under copyright? I assume so.

The basic format of the JSON test file is shown below. From this EIT summary we expect to extract S13E11.

 
{
        "summary": "S13, E11. Lorem Ipsum",
        "season": "13", "episode": "11"
}

And from this we expect S5E31.
{
        "summary": "Lorem Ipsum. (S5, E31)",
        "season": "5", "episode": "31"
}
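
The check described above could be sketched roughly as follows. This is a minimal illustration, not the harness from the associated revision: the two regular expressions and the names `scrape` and `run_tests` are hypothetical, and the real patterns live in the scraper configuration files.

```python
import json
import re

# Hypothetical patterns for illustration only; the shipped scraper
# configuration may use different expressions.
SEASON_RE = re.compile(r"\bS(\d+)")
EPISODE_RE = re.compile(r"\bE(\d+)")


def scrape(summary):
    """Return the season and episode found in summary as strings (or None)."""
    season = SEASON_RE.search(summary)
    episode = EPISODE_RE.search(summary)
    return {
        "season": season.group(1) if season else None,
        "episode": episode.group(1) if episode else None,
    }


# The two examples from this issue, written as a JSON list.
TESTS = json.loads("""
[
  {"summary": "S13, E11. Lorem Ipsum", "season": "13", "episode": "11"},
  {"summary": "Lorem Ipsum. (S5, E31)", "season": "5", "episode": "31"}
]
""")


def run_tests(tests):
    """Return (summary, field, expected, got) for every mismatch."""
    failures = []
    for test in tests:
        got = scrape(test["summary"])
        for key in ("season", "episode"):
            if key in test and got[key] != test[key]:
                failures.append((test["summary"], key, test[key], got[key]))
    return failures


failures = run_tests(TESTS)
```

An empty `failures` list means every example was scraped as expected. Changing a pattern for one region and re-running all examples is what catches accidental breakage elsewhere.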

Associated revisions

Revision 1bd02b85 (diff)
Added by Em Smith over 2 years ago

eit: Add simple test harness for scraping EIT data. (#4577)

This python script parses a scraper configuration file
from data/conf/epggrab/eit/scrape and a unit test file
from support/testdata/eitscrape.

The unit test contains numerous examples and the expected
scrape results, such as season and episode number.

The top of the test harness configuration file contains
some comments fields that are unparsed but help document
what environment the test harness is meant to be testing.

Issue: #4577.

History

#1

Updated by Em Smith over 2 years ago

If you've been referred to this page it's because EIT scraping of your season/episode information is not working for your country/satellite and you want to improve it.

You do not need to be a programmer to help. Simply documenting what you see and expect will help.

So, we need specific examples of what you see as your title/subtitle/description. They need to be copied+pasted since spaces and commas are significant. Screen-grabs are useful, but we can't reproduce the characters from them easily.

If your tvheadend dialog says "S13, E11. Lorem Ipsum (R)" then we need that exactly, with the comma, full stop, and spaces. So you would write:

"summary" : "S13, E11. Lorem Ipsum (R)" 

Then we need to know what it means! In many cases we can guess, but some descriptions are too complicated for us to work out. In the above example the season is 13 and the episode is 11, so you write (the quotation marks are important):

"season" : "13", "episode" : "11" 

Put it all together and add a comma after the summary text and braces around the whole lot and you have:

{
"summary" : "S13, E11. Lorem Ipsum (R)",
"season" : "13", "episode" : "11" 
}

Congratulations! You've just documented the scraping for your region.

Please choose examples where season and episode are different. So, S1E1 is bad since season and episode are the same number.

Write a lot of extra examples. Create some examples that might be mis-interpreted such as where episode and season numbers are in a different order.
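
To show why matching numbers make a weak test, here is a small sketch that flags such examples. The function name `ambiguous_examples` is hypothetical and is not part of the test harness; it only illustrates the reasoning.

```python
def ambiguous_examples(tests):
    """Return examples whose season and episode are the same non-null value.

    A scraper that swapped season and episode would still pass these,
    so they cannot detect that kind of mistake.
    """
    return [
        t for t in tests
        if t.get("season") is not None and t.get("season") == t.get("episode")
    ]


examples = [
    {"summary": "S1, E1. Lorem Ipsum", "season": "1", "episode": "1"},
    {"summary": "S13, E11. Lorem Ipsum (R)", "season": "13", "episode": "11"},
]
flagged = ambiguous_examples(examples)
```

Here only the S1E1 example is flagged; the S13E11 example is a useful test because a swapped result would be noticed.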

If a test is showing something difficult then include a comment:

{
"comment" : "Episode is at the end instead of next to season.",
"summary" : "S13. Lorem Ipsum Ep 11. (R)",
"season" : "13", "episode" : "11" 
}

Then, include a couple of pieces of extra information so we know what this is for.

"language" : "en",
"location" : "uk",
"description" : "DVB-T/DVB-S configuration for UK using Freeview and Freesat",
"eitgrabber" : ["uk_freeview", "uk_freesat", "eit"]
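
If you want to check that none of these fields was forgotten, something like the following sketch would do. The four field names come from the example above; the helper `missing_fields` is hypothetical, and wrapping the fields in braces to parse them as JSON is an assumption about how you paste them.

```python
import json

# Field names taken from the example above.
REQUIRED = ("language", "location", "description", "eitgrabber")


def missing_fields(header):
    """Return the names of required fields that are absent from header."""
    return [key for key in REQUIRED if key not in header]


header = json.loads("""
{
  "language" : "en",
  "location" : "uk",
  "description" : "DVB-T/DVB-S configuration for UK using Freeview and Freesat",
  "eitgrabber" : ["uk_freeview", "uk_freesat", "eit"]
}
""")
missing = missing_fields(header)
```

An empty `missing` list means all four fields are present.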

Finally, any information you post may be added to the codebase to ensure there are no problems in the future. If you don't want this then please do not post.

Don't worry too much about whether you put spaces around the colons, whether you missed a comma, or whether you know exactly what information to write. As long as the information is mostly readable we can take a look, but we don't promise anything!

If you are a developer then you can run the tests yourself through the test harness in support/eitscrape_test.py. It can be run on any machine that has Python and does not need access to the tvheadend server.

#2

Updated by Em Smith over 2 years ago

Also if you have examples where programmes only have an episode (no season) then these can be documented as:

"season" : null
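
For developers: JSON `null` is loaded by Python's `json` module as `None`, so a test can tell "no season expected" apart from a season number. A minimal sketch, where the summary text is a made-up example:

```python
import json

# Made-up summary for illustration; only the episode is present.
example = json.loads(
    '{"summary": "E11. Lorem Ipsum", "season": null, "episode": "11"}'
)

# JSON null becomes Python None, so the expected value can be compared
# directly against a scraper result that found no season.
season_is_absent = example["season"] is None
```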

If your guide always has specific words that are consistently used and useful then document them in the comment. For example, if your guide has "Режисьор: Bob" to mean it is a movie directed by Bob then add it as a comment. We do not currently scrape or use this information, but knowing what is available will help if it were to be implemented in the future.
