So here's a horrible idea. Bear with me for a bit.
Right now, emulator botting is fundamentally a single-threaded endeavor, and CPUs haven't made drastic progress in that department lately (the best consumer CPU you'll find today probably won't even be 1.5 times faster on a typical single-threaded task than my 5.5-year-old i3-8350K from 2017 overclocked to 5.1 GHz). So if we want a bot to handle a complex enough task, we either have to run multiple instances of the emulator and make each instance search through a dedicated chunk of input permutations, or, in some very specific cases, isolate the game logic entirely and turn it into parallelized code that runs completely separately from the emulator. Both approaches are very labor-intensive (especially the latter, which also requires reverse engineering and knowledge of a high-performance programming language), neither has a generalized solution, and the former still doesn't run very fast, either.
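To make "a dedicated chunk of input permutations" concrete, that first option usually boils down to carving up an index space, something like the host-side sketch below. Every number in it (instance count, window length, inputs per frame) is made up purely for illustration.

```cpp
// Splitting a brute-force input search across N emulator instances.
// All the constants are illustrative, not taken from any real setup.
#include <cstdint>
#include <cstdio>

int main() {
    const int kInstances = 8;  // e.g. one emulator process per CPU core
    const int kFrames    = 10; // length of the input window being brute-forced
    const int kChoices   = 4;  // distinct inputs considered per frame

    uint64_t total = 1;
    for (int i = 0; i < kFrames; ++i) total *= kChoices; // 4^10 = 1,048,576 permutations

    // Each instance gets a contiguous chunk of the permutation index space;
    // inside an instance, index -> input sequence is just base-4 decoding.
    const uint64_t chunk = (total + kInstances - 1) / kInstances;
    for (int id = 0; id < kInstances; ++id) {
        const uint64_t begin = (uint64_t)id * chunk;
        const uint64_t end   = begin + chunk < total ? begin + chunk : total;
        printf("instance %d searches permutations [%llu, %llu)\n",
               id, (unsigned long long)begin, (unsigned long long)end);
    }
    return 0;
}
```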
The fact that emulator code is mostly not parallelizable has long been a serious problem for botting, and the main argument against porting it to CUDA or another compute-shader platform to take advantage of GPUs' massive parallelism. However, that argument has historically been built around the speed of a single emulator instance, which would indeed most likely be slower on a GPU, and it's a valid argument if single-instance speed is what we want to prioritize.
But do we? What if we instead used the GPU to run multiple cores, each within its own waterboxed instance, all in parallel?
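To show the shape of what I mean, here's a toy CUDA sketch of the "many independent instances, one kernel launch" idea. The step_core() function is a made-up stand-in for a real core's frame advance (porting a real core is the entire hard part, obviously); the state layout, sizes, and names are all hypothetical. The only property that matters here is that every instance lives in its own slab of VRAM, consumes its own input permutation, and never touches anyone else's state:

```cuda
// Toy sketch of "many independent core instances, one kernel launch".
// step_core() is a made-up stand-in for a real waterboxed core's frame
// advance; the state layout, sizes, and names are all hypothetical.
#include <cstdint>
#include <cstdio>

struct CoreState {           // stand-in for one instance's entire guest state
    uint32_t pc;
    uint32_t regs[16];
    uint8_t  ram[64 * 1024]; // a 64 KB "console", for illustration only
};

__device__ void step_core(CoreState& s, uint8_t input) {
    // Placeholder frame advance: a real port would execute guest code here.
    // The only property that matters is that each instance mutates only its
    // own state.
    s.regs[0] += input;
    s.ram[s.pc % sizeof(s.ram)] ^= input;
    s.pc += 1;
}

__global__ void run_instances(CoreState* states, const uint8_t* inputs,
                              int frames, uint32_t* results) {
    const int id = blockIdx.x * blockDim.x + threadIdx.x;
    CoreState& s = states[id];
    const uint8_t* my_inputs = inputs + (size_t)id * frames; // this instance's input permutation
    for (int f = 0; f < frames; ++f)
        step_core(s, my_inputs[f]);
    results[id] = s.regs[0]; // e.g. some per-instance "how far did we get" metric
}

int main() {
    const int kInstances = 60, kFrames = 600; // the ~60 instances from the napkin math below
    CoreState* d_states;  uint8_t* d_inputs;  uint32_t* d_results;
    cudaMalloc((void**)&d_states,  sizeof(CoreState) * kInstances);
    cudaMalloc((void**)&d_inputs,  (size_t)kInstances * kFrames);
    cudaMalloc((void**)&d_results, sizeof(uint32_t) * kInstances);
    cudaMemset(d_states, 0, sizeof(CoreState) * kInstances);
    cudaMemset(d_inputs, 1, (size_t)kInstances * kFrames); // dummy input permutations

    run_instances<<<1, kInstances>>>(d_states, d_inputs, kFrames, d_results);
    cudaDeviceSynchronize();

    uint32_t host_results[kInstances];
    cudaMemcpy(host_results, d_results, sizeof(host_results), cudaMemcpyDeviceToHost);
    printf("instance 0 result: %u\n", host_results[0]);

    cudaFree(d_states); cudaFree(d_inputs); cudaFree(d_results);
    return 0;
}
```

A real core would also have every thread executing a different guest instruction stream, so warp divergence (on top of lower GPU clocks) is exactly why each individual instance would run slower than it does on a CPU.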
Sure, each one would be slower, but on the other hand, we can run as many of them as we can fit into VRAM without much (if any) overhead, unlike running them on a >8-core CPU. Assuming a single core instance for an 8/16-bit platform takes roughly 200 MB on average (I just pulled that number out of my ass, don't judge me), and we're using a graphics card with ~12 GB of VRAM (an RTX 2060 12GB, RTX 3060, RX 6700 XT, or the 11 GB GTX 1080 Ti, most of which can be found for a couple hundred USD on the aftermarket), we can fit up to ~60 instances of our core at the same time, with a single interface for managing their inputs. So even if each core runs about twice as slow as it would have on a modern 5 GHz CPU core (out of my ass again), that's still a whopping 30x net speedup for the purposes of botting in particular.

And a graphics card with 16, 20, or 24 GB would result in a proportionately larger speedup still, which would make this a really damn good generalized solution in the longer term, as VRAM sizes keep increasing at a faster rate than the number of high-performance CPU cores. Then there's the possibility of running 3-4 GPUs in the same machine, and you can see how well it scales in principle: we're looking at overall speedups of at least two orders of magnitude in the near term, and up to 500x could already be achievable with today's technology if my napkin math is anywhere close to realistic. From there, it's only a relatively minor step to a Folding@home-style distributed computing network run by other TASVideos members on their own GPUs, each able to pick the games they'd like to dedicate their compute resources to.
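For transparency, here's that napkin math spelled out, with every guessed constant labeled as such; the only thing it adds is what the ~500x end of the range would actually require:

```cpp
// The napkin math above, spelled out. Every constant is the same guess as in
// the text; nothing here is measured.
#include <cstdio>

int main() {
    const double vram_mb         = 12000.0; // 12 GB card (RTX 3060 / RX 6700 XT class)
    const double mb_per_instance = 200.0;   // wild guess at one 8/16-bit core's footprint
    const double rel_core_speed  = 0.5;     // guess: each GPU instance at half a 5 GHz CPU core

    const double instances   = vram_mb / mb_per_instance;  // 60 instances per card
    const double net_speedup = instances * rel_core_speed; // 30x per card
    printf("one 12 GB card: %.0f instances, ~%.0fx net speedup\n", instances, net_speedup);

    // Scaling out: four 24 GB cards in one machine.
    const double big_rig = 4.0 * (24000.0 / mb_per_instance) * rel_core_speed; // ~240x
    printf("4x 24 GB cards: ~%.0fx\n", big_rig);
    // Hitting ~500x with today's hardware would additionally need the footprint
    // guess to come in closer to ~100 MB per instance.
    return 0;
}
```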
I've been avoiding the obvious elephant in the room, which is whether this is feasible to implement at all, and my honest answer is: I don't know. But if it is, it's worth considering, because the optimization problems we're running into keep getting more complex with both old and new games, so we'll be relying on bots more, not less, over time.