I would like to share my thoughts on the state of the genre system within tvheadend, especially regarding EPG data sourced via XMLTV, in the hopes of initiating a discussion that will eventuate in a new sub-module for handling EPG genre information.
Firstly, this proposal is designed to be 100% backwards compatible with the existing system. Its implementation should, in theory, break nothing.
Internally, TVH appears to use the DVB/ETSI genre classification system. In addition, it is only capable of ingesting a subset of genre values.
There are multiple Digital TV systems in the world with a variety of accompanying EPG genre systems. In addition to this, there may be a number of third party EPG providers with their own proprietary systems.
For simplification, I will consider the three stages of EPG processing as follows:
1) Acquisition
2) Storage
3) Output
I would like to propose a proprietary extension to the XMLTV schema, as well as some internal EPG storage and processing changes in order to accommodate a more customisable genre data handling mechanism.
ACQUISITION
An example of the existing XMLTV 'category' syntax is as follows:
<category lang="en">Documentary</category>
TVH reads the value of this tag and compares it to an internal list of pre-defined text values in order to arrive at a 1-byte code that will be used to store this genre.
Until this point, this is still 100% backwards compatible with the existing system. This system, however, is not extensible.
My first proposal is to add a 'system' attribute to the 'category' tag as follows:
<category system="NAME" lang="en">Documentary</category>
If present, the 'system' attribute, along with the 'lang' attribute, will be used to identify a list of extensible genre acquisition rules to be used as a replacement for the existing hard-coded rules.
Together, the 'system' and 'lang' values will be used to identify a JSON file within the TVH XMLTV directory structure. There can be multiple JSON files to support multiple systems and multiple languages. These can be bundled with TVH or user-defined.
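As a rough sketch of how the 'system' and 'lang' values might select a rules file, the Python below assumes a hypothetical layout of '<dir>/<system>/<lang>.json', with user-defined directories searched before bundled ones. The layout, the directory order and the function name are all assumptions made for illustration; none of them are existing TVH paths or APIs.

```python
import re
from pathlib import Path
from typing import Optional, Sequence

# Only allow simple names so that an attribute value such as "../etc"
# cannot escape the rules directories.
_SAFE_NAME = re.compile(r"[a-z0-9_-]+")


def find_rules_file(system: str, lang: str,
                    search_dirs: Sequence[Path]) -> Optional[Path]:
    """Return the first existing rules file for (system, lang), or None.

    Assumed (hypothetical) layout: <dir>/<system>/<lang>.json, with
    search_dirs ordered from highest priority (user-defined) to lowest
    (bundled with TVH).
    """
    system = system.strip().lower()
    lang = lang.strip().lower()
    if not _SAFE_NAME.fullmatch(system) or not _SAFE_NAME.fullmatch(lang):
        return None
    for base in search_dirs:
        candidate = base / system / f"{lang}.json"
        if candidate.is_file():
            return candidate
    return None
```

If no file is found, the importer would presumably fall back to the existing hard-coded rules, which keeps the behaviour backwards compatible.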
The JSON file will consist of an array of rules for matching a 'category' string to a value code. Each rule would consist of the following 4 properties:
1) 'regex' - This is a regular expression used for case-insensitive matching of a category string. It should be flexible enough to cater for dialect or punctuation differences, for example, 'programme' and 'program' or 'Science/Technology' and 'Science / Technology', yet not so flexible that it produces too many false positives. Should a category string match multiple rules, the first match will take priority.
2) 'description' - This is the actual genre description to use. It could be the same as the regex, but in most cases should not be.
3) 'code' - This is the proprietary numerical code for this 'system' that matches the text/regex provided.
4) 'etsi' - This is the equivalent ETSI numerical code that matches the text/regex provided.
Example:
XML
<category system="atsc" lang="en">Hunting/Fishing/Outdoors</category>
JSON Rules
{
"regex": "Hunting[ ]{0,1}[-/\\\\][ ]{0,1}Fishing[ ]{0,1}[-/\\\\][ ]{0,1}Outdoors",
"description": "Hunting/Fishing/Outdoors",
"code": "0x94",
"etsi": "0xA0"
}
The rule regex would match any of the following descriptions: 'Hunting/Fishing/Outdoors', 'Hunting / Fishing / Outdoors', 'Hunting - Fishing - Outdoors', 'Hunting-Fishing-Outdoors', 'Hunting\Fishing\Outdoors', etc.
A regex need not be as precise as the example provided; 'hunt|fish|outdoor' would also yield an adequate match.
Once matched, the code for this 'system' '0x94' would be stored in the 'extensible genre' field of the EPG record. The ETSI code '0xA0' (Leisure Hobbies (general)) would be stored in the existing 'genre' field of the EPG record.
Note: '<category system="atsc" lang="es">Caza/Pesca/Al aire libre</category>' would look up the Spanish rules file, match a Spanish regex, but still result in '0x94' and '0xA0' being stored as the relevant codes.
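The matching step described above could be sketched roughly as follows. This uses Python's 're' module purely for illustration (TVH itself would use whichever regex library is chosen), and the function names are hypothetical. The backslash inside the character class is written as four backslashes in the JSON text so that the decoded pattern contains an escaped backslash, which Perl-style engines require. The second rule is an illustrative addition based on the ATSC '0x3B' / ETSI '0x23' pairing used elsewhere in this proposal.

```python
import json
import re
from typing import List, Optional, Tuple

# Example rules file content for system="atsc", lang="en".
RULES_JSON = r'''
[
  {
    "regex": "Hunting[ ]{0,1}[-/\\\\][ ]{0,1}Fishing[ ]{0,1}[-/\\\\][ ]{0,1}Outdoors",
    "description": "Hunting/Fishing/Outdoors",
    "code": "0x94",
    "etsi": "0xA0"
  },
  {
    "regex": "documentar",
    "description": "Documentary",
    "code": "0x3B",
    "etsi": "0x23"
  }
]
'''

Rule = Tuple["re.Pattern[str]", str, int, int]


def compile_rules(rules_text: str) -> List[Rule]:
    """Parse the JSON rules and pre-compile each regex (case-insensitive)."""
    compiled = []
    for rule in json.loads(rules_text):
        compiled.append((
            re.compile(rule["regex"], re.IGNORECASE),
            rule["description"],
            int(rule["code"], 16),   # proprietary code for this 'system'
            int(rule["etsi"], 16),   # equivalent ETSI code
        ))
    return compiled


def match_category(text: str, rules: List[Rule]) -> Optional[Tuple[str, int, int]]:
    """Return (description, code, etsi) for the first matching rule."""
    for pattern, description, code, etsi in rules:
        if pattern.search(text):
            return description, code, etsi
    return None
```

A rules file for another language, such as Spanish, would be loaded in the same way and would simply carry different 'regex' and 'description' values for the same 'code' and 'etsi' values.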
The use of regular expressions would come at the cost of additional CPU overhead during the XMLTV import process. To minimise this overhead, the JSON rules file could be sorted so that the most common genres (those most likely to match) appear first. The ordering could even be determined statistically as the XML EPG is loaded and then retained for future imports.
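One possible way to derive that ordering is to count how often each rule matches during an import and then sort on the counts. The sketch below assumes the hit counts are keyed by the rule's 'regex' string; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def reorder_by_hits(rules: Sequence[Dict[str, str]],
                    hits: "Counter[str]") -> List[Dict[str, str]]:
    """Return the rules sorted by descending hit count.

    sorted() is stable, so rules with equal counts (including rules
    that never matched) keep their original relative order.
    """
    return sorted(rules, key=lambda rule: -hits[rule["regex"]])
```

One caveat: because the first match takes priority, reordering is only safe when no category string can match more than one rule. Where rules overlap, the hand-written order would have to be preserved for those rules.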
My next proposal would result in regular expressions being avoided completely with the use of an additional custom attribute, 'code'.
Together, the 'system' and 'code' attributes could be used to bypass the regex process entirely and save the provided codes directly.
Example: <category system="atsc" code="0x3B" lang="en">Documentary</category>
The JSON rules file would be used to map ATSC '0x3B' directly to ETSI '0x23'. This would also make the tag text optional, as the attributes alone would be sufficient.
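A minimal sketch of the regex-free path, assuming the same rule properties as above: the rules are indexed once by their 'code' value, and the 'code' attribute is then resolved with a single dictionary lookup. Both function names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def build_code_index(rules: Iterable[dict]) -> Dict[int, Tuple[int, str]]:
    """Map each system code to (etsi, description); the first rule wins."""
    index: Dict[int, Tuple[int, str]] = {}
    for rule in rules:
        code = int(rule["code"], 16)
        index.setdefault(code, (int(rule["etsi"], 16), rule["description"]))
    return index


def resolve_by_code(code_attr: str,
                    index: Dict[int, Tuple[int, str]]) -> Optional[Tuple[int, int, str]]:
    """Resolve a 'code' attribute such as "0x3B" to (code, etsi, description)."""
    try:
        code = int(code_attr, 16)
    except ValueError:
        return None  # malformed attribute: caller may fall back to the regex path
    hit = index.get(code)
    if hit is None:
        return None
    etsi, description = hit
    return code, etsi, description
```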
STORAGE
The existing genre storage mechanism would remain unchanged.
An additional 'extensible genre' array will be created to store a tuple for each extensible genre, consisting of 1 byte for the 'system' and 1 byte for the 'code'.
It will be necessary to store both the system and code values, as a single TVH instance may be receiving XMLTV EPG data from multiple sources. For example, XMLTV for a DVB satellite service may supply ETSI genre codes whereas XMLTV for an ATSC OTA system may supply ATSC genre codes.
Each 'system' will be assigned a unique, persistent 8-bit integer when encountered for the first time on each TVH instance. This number will be used in the 'extensible genre' table. This table will map the 'system' names to their internal numbers as well as hold pointers to the regex, description and code values for each language that has been used. Some regex libraries can pre-compile the regex text prior to use. Where this is possible, the pre-compiled value can also be kept to increase performance.
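The assignment of persistent 'system' numbers could look something like the sketch below. The class name, the JSON persistence format and the choice of the lowest free number are assumptions for illustration only.

```python
import json
from typing import Dict, Optional


class SystemRegistry:
    """Assigns each 'system' name a stable integer in the range 0..255."""

    def __init__(self, known: Optional[Dict[str, int]] = None) -> None:
        self._ids: Dict[str, int] = dict(known or {})

    def id_for(self, name: str) -> int:
        """Return the existing number for name, or allocate the lowest free one."""
        key = name.strip().lower()
        if key not in self._ids:
            used = set(self._ids.values())
            free = next((i for i in range(256) if i not in used), None)
            if free is None:
                raise OverflowError("no free 8-bit system numbers left")
            self._ids[key] = free
        return self._ids[key]

    def to_json(self) -> str:
        """Serialise the mapping so it survives a restart."""
        return json.dumps(self._ids, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "SystemRegistry":
        return cls(json.loads(text))
```

Saving the mapping whenever a new 'system' is allocated would keep the numbers already stored in EPG records valid across restarts.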
The 'extensible genre' table should be able to cater for instances where a single channel broadcasts programmes in multiple languages with the EPG data for each programme matching its language. For example, one programme may have the genre text 'Informations (général)' and another 'Nachrichten/Aktuelles'. These should both be interpreted and stored as ETSI '0x20'.
Whichever regular expression library is used, it should be Unicode-aware. If, during the research phase, some 'systems' are found that use more than 256 genres, 16 bits should be allowed for the genre code from the outset.
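To illustrate the storage cost of the wider code, the sketch below packs each 'extensible genre' entry as an 8-bit 'system' followed by a 16-bit 'code', which is 3 bytes per entry. The byte order and the flat layout are assumptions, not a description of TVH's actual EPG storage format.

```python
import struct
from typing import Iterable, List, Tuple

# One entry: unsigned 8-bit system, unsigned 16-bit code, big-endian, no padding.
_ENTRY = struct.Struct(">BH")


def pack_extgenres(entries: Iterable[Tuple[int, int]]) -> bytes:
    """Pack (system, code) tuples into a flat byte string."""
    return b"".join(_ENTRY.pack(system, code) for system, code in entries)


def unpack_extgenres(blob: bytes) -> List[Tuple[int, int]]:
    """Inverse of pack_extgenres()."""
    return [_ENTRY.unpack_from(blob, offset)
            for offset in range(0, len(blob), _ENTRY.size)]
```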
It is also possible that the existing OTA grabbers could be modified to utilise the genre information contained within the JSON rules file and eventually do away with their hard-coded rules.
OUTPUT
EPG data is available via the HTSP and JSON APIs as well as the Web UI.
A new field will be added to both APIs to handle the new extended genre information. Something short like 'extgenre' would be preferable.
This new EPG data field would consist of an array of objects, one for each 'extensible genre' present for that programme. Each object would consist of a 'system' integer, and a 'code' integer.
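For illustration, the shape of the proposed field could be produced as follows. The 'extgenre' name comes from the suggestion above; the surrounding record and the function name are hypothetical and do not reflect the existing API output.

```python
import json
from typing import Dict, Iterable, List, Tuple


def extgenre_field(entries: Iterable[Tuple[int, int]]) -> List[Dict[str, int]]:
    """Turn stored (system, code) tuples into the proposed API objects."""
    return [{"system": system, "code": code} for system, code in entries]


# Hypothetical EPG record carrying one ATSC genre (system number 0, code 0x94).
record = {"title": "Example programme", "extgenre": extgenre_field([(0, 0x94)])}
print(json.dumps(record))
```

Note that JSON has no hexadecimal literals, so the codes would appear as plain decimal integers in the output.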
Perhaps an API request parameter could be added to include the 'system' name as well as the extended genre code and perhaps even the genre description with the data returned.
A new API call will be required to extract the map between the 'system' code and the 'system' name. This call could return one or all of the 'systems' depending upon the parameters used.
The existing JSON API 'epg/content_type/list' call will be modified to accept an additional argument of either the system name or the system code. If this additional argument is present, the API will return the contents of the JSON rules file excluding the 'regex' property.
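Stripping the 'regex' property before returning the rules is a small transformation, sketched here with a hypothetical function name:

```python
from typing import Dict, Iterable, List


def rules_for_api(rules: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return a copy of each rule without its 'regex' property."""
    return [{key: value for key, value in rule.items() if key != "regex"}
            for rule in rules]
```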
In order to display the 'extensible genre' description, a client would first have to obtain the EPG data. For each 'system' in the 'extgenre', the client would then have to execute an API call to retrieve all of the description texts for that system. A client would be expected to adopt a caching strategy that limits the number of calls made.
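A client-side cache along these lines would need only one call per 'system'. In the sketch below, 'fetch' stands in for whatever API call ends up returning the descriptions for a system; it is a hypothetical callable that returns a mapping of code to description.

```python
from typing import Callable, Dict, Optional


class GenreDescriptionCache:
    """Caches the code -> description map for each 'system' number."""

    def __init__(self, fetch: Callable[[int], Dict[int, str]]) -> None:
        self._fetch = fetch
        self._by_system: Dict[int, Dict[int, str]] = {}

    def describe(self, system: int, code: int) -> Optional[str]:
        """Return the description, fetching the system's map on first use."""
        if system not in self._by_system:
            self._by_system[system] = self._fetch(system)
        return self._by_system[system].get(code)
```

A real client would probably also want to invalidate the cache when the rules files change on the server, which this sketch does not attempt.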
Translation note: I realise that TVH already has a method for providing translations. However, due to the extensibility aspect, it may not be practical to use TVH translations for every extensible system/regex/description combination. More thought may be required here.
CONCLUSION
I do not intend for this to be 'the' solution that gets implemented. I personally only use DVB/modified-ETSI in my location, so it is possible, if not probable, that I am unaware of some of the nuances of other digital TV systems with respect to genre handling.
According to Wikipedia, there appear to be five digital TV systems in operation around the world: DVB, ATSC, ISDB, DTMB and DMB. I'm not sure which of these systems TVH currently supports. However, it would be good to get feedback from users of every system that TVH is able to support and, more importantly, on the genre classification system each one uses. Perhaps these core 'systems' could have hard-coded 'system' integer codes.
I know of a 3rd party EPG information provider here in Australia, so I expect that there will be many more worldwide. The extensible genre system should be flexible enough to support the published standards as well as 3rd party EPG information providers where possible.
I look forward to a healthy and respectful discussion.