While looking to reduce CPU usage, I noticed that the libdvbcsa library from Debian stretch was using a maximum batch size of 64 packets. The package shows it was compiled with both "enable-mmx" and "enable-sse2", but it appears to use MMX (batch size of 64) rather than SSE2, which uses a batch size of 128 packets. I downloaded the source code (https://code.videolan.org/videolan/libdvbcsa/tree/master) and found that if you configure with both options, only MMX is used. So, after configuring with SSE2 alone and rebuilding, I now see tvheadend using 128-packet batches to decrypt.
I then also found a fork of the videolan.org code with newer updates (https://github.com/glenvt18/libdvbcsa).
It supports the SSSE3 extension:
./configure --help
  --enable-uint32    Use native 32 bits integers for bitslice
  --enable-uint64    Use native 64 bits integers for bitslice
  --enable-mmx       Use MMX for bitslice
  --enable-sse2      Use SSE2 for bitslice
  --enable-ssse3     Use SSSE3 for bitslice
  --enable-avx2      Use AVX2 for bitslice
  --enable-altivec   Use AltiVec for bitslice
  --enable-neon      Use NEON for bitslice
  --enable-alt-sbox  Use alternative sbox lookup, may be faster on some targets
Although my CPU supports SSE4, it does not support SSSE3. Even so, when built with "--enable-sse2", this version performed a little better than the original videolan.org version built with the same option.
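If you are not sure which of these extensions your own CPU supports, something like the following might help. It is only a rough sketch: pick_simd_flag is a made-up helper name (not part of libdvbcsa), the preference order AVX2 > SSSE3 > SSE2 > MMX is my assumption rather than something I benchmarked, and it only covers x86 Linux, where /proc/cpuinfo has a "flags" line. The --enable-ssse3 and --enable-avx2 options only exist in the fork listed above.

```shell
# Hypothetical helper: choose one libdvbcsa ./configure SIMD option from a
# space-separated list of CPU flags (as found in /proc/cpuinfo on x86 Linux).
# Only ONE option is printed, since configuring with several at once may
# silently select the slower one.
pick_simd_flag() {
    flags=" $1 "   # pad with spaces so "sse2" cannot match inside another word
    case "$flags" in
        *" avx2 "*)  echo "--enable-avx2" ;;    # assumed fastest, fork only
        *" ssse3 "*) echo "--enable-ssse3" ;;   # fork only
        *" sse2 "*)  echo "--enable-sse2" ;;
        *" mmx "*)   echo "--enable-mmx" ;;
        *)           echo "--enable-uint32" ;;  # conservative fallback
    esac
}

# Read the flags of the first CPU listed, if the file is available.
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2) || true
    echo "Suggested option: $(pick_simd_flag "$cpu_flags")"
fi
```

The printed option is the one you would then pass on its own to ./configure in the libdvbcsa source tree before rebuilding.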
Using this version, CPU usage on my old 3-core CPU has dropped from 16-18% to 4-6% while descrambling similar content. So, anyone doing descrambling may want to check which libdvbcsa library they are using and whether a better build can be used.
I have created a related pull request so that, when tvheadend is run with "--trace tvhcsa", it reports the libdvbcsa batch size in use:
[ TRACE] tvhcsa: 0x7f3a70004a80: service "SERVICE" using CSA batch size = 128 for decryption
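To pull just the number out of such trace lines, a small filter along these lines could be used. This is a sketch that assumes the message format stays exactly as in the line above; csa_batch_size is a made-up helper name, not something tvheadend provides.

```shell
# Hypothetical filter: read log lines on stdin and print the batch size from
# any tvhcsa trace line of the form
#   ... tvhcsa: ... using CSA batch size = N for decryption
csa_batch_size() {
    sed -n 's/.*tvhcsa: .*using CSA batch size = \([0-9][0-9]*\) .*/\1/p'
}

# Example with the trace line shown above.
echo '[ TRACE] tvhcsa: 0x7f3a70004a80: service "SERVICE" using CSA batch size = 128 for decryption' | csa_batch_size
```

Lines that do not match are dropped, so the filter can be fed the whole log.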