Post subject: Parallelizing Encodes
Player (202)
Joined: 11/21/2019
Posts: 247
Location: Washington
I've got an encoding computer with 20 CPU cores, and the speed gain per core at SD resolution drops off precipitously at around 8 threads. So while HD encodes could saturate as many cores as I threw at them, my SD encodes would use 40 threads and get nearly the same speed as with just 8. This happens because there is only so much work (lib)x264 can assign to cores within a single encoding instance, and that amount of work is tied to the resolution of the video. The trick to overcoming this limitation is to encode multiple sections of the video at once. I have come up with a way to fully automate the process of slicing up SD encodes, encoding the slices in parallel, and then stitching them back together on the other side. It uses ffmpeg to do the concat muxing.

vidsplit.vbs
Option Explicit

Dim Frames, Segments, SegLength, i
Dim Output

If WScript.Arguments.Count = 2 Then
    Frames = CLng(WScript.Arguments.Item(0))
    Segments = CLng(WScript.Arguments.Item(1))
Else
    ' Wrong number of arguments: print usage and stop
    WScript.Echo "Usage: vidsplit.vbs <frames> <segments>"
    WScript.Quit 1
End If

SegLength = (Frames \ Segments) + 1

' Emit one "<start frame> <frame count>" line per segment
For i = 0 To Segments - 1
    Output = Output & vbCrLf & CStr(i * SegLength) & " " & CStr(SegLength)
Next

Wscript.Echo Output
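
For readers who don't run VBScript, the same calculation can be sketched in Python. This is only an illustration and not part of the encoding package: the function name split_frames is made up here. Unlike vidsplit.vbs, it clamps the last segment so that it never runs past the end of the video. The original presumably relies on x264 stopping at the end of the input instead.

```python
def split_frames(total_frames, segments):
    """Return (start_frame, frame_count) pairs covering total_frames.

    Hypothetical helper mirroring vidsplit.vbs: each segment is
    total_frames // segments + 1 frames long. The last segment is
    clamped here, which the original script does not do.
    """
    seg_length = total_frames // segments + 1
    pairs = []
    for i in range(segments):
        start = i * seg_length
        if start >= total_frames:
            break  # nothing left to encode
        pairs.append((start, min(seg_length, total_frames - start)))
    return pairs


if __name__ == "__main__":
    # Same "<start> <count>" text format that the batch loop reads
    for start, count in split_frames(600, 6):
        print(start, count)
```

Each pair maps onto the --seek and --frames values that the batch loop passes to x264.
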

global.bat
set segments=6
set threads=16

echo Encoding video...
".\programs\ffprobe" -hide_banner -v error -select_streams v -of default -show_entries stream=nb_frames encode.avs >> ".\temp\info.txt"
for /f "tokens=2 delims==" %%G in ('FINDSTR "nb_frames" "%~dp0temp\info.txt"') do (set frames=%%G)

set concat=".\temp\concat_modern.txt"
break > %concat%

for /f "tokens=1,2" %%G in ('cscript -nologo "programs\vidsplit.vbs" %frames% %segments%') do (
    set start=%%G
    set length=%%H
    set name=video_modern_!start!.mkv
    set filepath=.\temp\!name!
    start "!start!" ".\programs\x264_x64" --threads %threads% --stitchable --seek !start! --frames !length! --sar "%VAR%" --crf 20 --keyint 600 --ref 16 --no-fast-pskip --bframes 16 --b-adapt 2 --direct auto --me tesa --merange 64 --subme 11 --trellis 2 --partitions all --no-dct-decimate --input-range pc --range pc -o "!filepath!" --colormatrix smpte170m --output-csp i444 --profile high444 --output-depth 10 encode.avs
    echo file '!name!' >> %concat%
    Timeout /T 3 /NoBreak > NUL
    set previous=!next!
)

:MODERNLOOP
tasklist | find /i "x264_x64.exe" >nul 2>&1
IF ERRORLEVEL 1 (
  echo All segments finished.
) ELSE (
  Timeout /T 300 /Nobreak
  GOTO MODERNLOOP
)

ffmpeg -y -hide_banner -safe 0 -f concat -i %concat% -c copy .\temp\video.mkv

:: Muxing ::
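
The concat step in the batch excerpt above can also be sketched in Python. This is a minimal illustration under assumptions: the helper names write_concat_list and concat_command are hypothetical, and the segment names in the example simply follow the video_modern_<start>.mkv pattern. The ffmpeg options are the same ones used in the batch file.

```python
from pathlib import Path


def write_concat_list(list_path, segment_names):
    """Write one "file '<name>'" line per segment, in playback order.

    Relative names are resolved by ffmpeg against the directory that
    contains the list file, so the segments should sit next to it.
    """
    lines = [f"file '{name}'" for name in segment_names]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


def concat_command(list_path, output_path):
    """Build the ffmpeg call that stream-copies the segments into one file."""
    return [
        "ffmpeg", "-y", "-hide_banner",
        "-f", "concat", "-safe", "0",
        "-i", str(list_path),
        "-c", "copy", str(output_path),
    ]


if __name__ == "__main__":
    names = [f"video_modern_{start}.mkv" for start in (0, 101, 202)]
    write_concat_list("concat_modern.txt", names)
    print(" ".join(concat_command("concat_modern.txt", "video.mkv")))
```

The command list can be handed to subprocess.run once all the segment files exist.
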
Explanation

This technique is general and could be adapted to whatever encoding needs you have, but I'm presenting it as part of the encoding package because that's where I've got it.

vidsplit.vbs is a script that lives in the programs folder. Its job is to calculate the frame numbers that correspond to some splitting interval. It just spits out some text that later gets read in by a loop.

The section of code in global.bat is where the meat of this trick is. The segments and threads variables set the number of segments the video gets split into and the number of threads assigned to each encoding instance. What works here is really a dark art, and I had to spend a lot of time experimentally discovering what works best for my machine. I have 20 physical cores, and 6 segments with 16 threads each seemed to be the sweet spot for performance.

The loop makes use of the start command, which spins off a new process for a command. For each segment, an encoding instance is launched in its own window, and x264 is told which frame to start at (--seek) and how many frames to encode (--frames). Notice the --stitchable argument in there; it is very important. It prevents x264 from doing some optimizations at the beginning and the end of each segment, which makes ffmpeg's life a lot easier later when it's putting the segments back together. Without it, I noticed a lot of error messages when ffmpeg was remuxing the videos, and sometimes it would outright fail.

The "MODERNLOOP" thing is a process-checking loop. While the encoding instances are doing their thing, the main global.bat file spins in this loop. Every 5 minutes, it checks whether any x264 instances are still running. If there are, it waits another 5 minutes. If not, the segments are done and ready to be put together.

You may have noticed a "concat_modern.txt" file that got created earlier in the process.
This is a text file that gets built up with filenames during the encoder instance loop and is later fed into ffmpeg. ffmpeg has a special tool built in called the "concat demuxer". Rather than concatenating videos in the more traditional way and then re-encoding them as one video, it lets you take a bunch of videos that all have exactly the same properties (dimensions, framerate, colorspace, encode parameters, etc.) and just "play" them back-to-back into a container. This allows you to combine disparate segments into a single video without any extra processing. There is a tiny bit of inefficiency here, since the encoder has to end a GOP at each seam, but losing just 5 full-sized GOPs only ever accounts for a few hundred kilobytes in the end product (I've tested this).

Should you bother with this?

For fun I tested splitting videos up on an old 4-core Intel processor from 2011 and found that it didn't provide any speed benefit at all. For native-sized encodes, x264 is already able to fully saturate about 8 threads, so splitting it up didn't help a ton in my case. If you have a normal consumer 4-core processor, this probably won't help you at all. I am interested in hearing from anyone who tries this on a consumer-grade CPU and sees a performance increase, though.

If you've got a hexacore or octacore processor, however, you might be able to use this to speed things up. I would recommend trying two-segment encoding on either 6- or 8-core processors. You can set the thread count for each instance higher than the number of logical cores you actually have; each instance could have 8 or even 16. Ultimately, you would need to test various parameter combinations to home in on an effective one.

For HD videos, this probably isn't worth it unless you've got over 20 cores.
On my machine a 4K encode with the right settings can saturate all 20 cores, and x265 can do that even more effectively. If you've got a really serious core count, though, splitting up HD encodes will net exactly the same benefits that SD encodes enjoy.
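
Since finding a good segments/threads combination comes down to trial and error, a small timing harness can help with the testing. The sketch below is in Python rather than batch and is only an illustration: x264_segment_cmd and run_parallel are hypothetical names, only the segment-related x264 options from global.bat are included (add your own quality settings), and it waits on the child processes directly instead of polling tasklist every 5 minutes.

```python
import subprocess
import time


def x264_segment_cmd(x264, source, out_path, start, length, threads):
    """Build the x264 call for one segment (quality options omitted)."""
    return [
        x264,
        "--threads", str(threads),
        "--stitchable",            # keep segments safe to recombine
        "--seek", str(start),      # first frame of this segment
        "--frames", str(length),   # number of frames to encode
        "-o", out_path,
        source,
    ]


def run_parallel(commands):
    """Start every command at once, wait for all of them, and time the run.

    Returns (exit_codes, elapsed_seconds).
    """
    t0 = time.perf_counter()
    procs = [subprocess.Popen(cmd) for cmd in commands]
    codes = [p.wait() for p in procs]
    return codes, time.perf_counter() - t0
```

Running the same clip with, say, 2 segments at 8 threads and then 2 segments at 16 threads, and comparing the elapsed times, gives a direct way to pick the better combination for a particular machine.
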