Happy ACEVideos day!
Last year, I submitted an unoptimized (almost working) TAS of Super Mario Bros. that demonstrates executing arbitrary code. Due to emulator inaccuracies with “Open Bus”, the run did not sync on console. One pull request and a year later, I’ll be executing a larger payload inside Super Mario Bros. which syncs on console! (Thank you Alyosha for verifying this!)
This tool assisted demonstration uses pretty much everything I know about the NES. Enjoy!
(This link is for the abridged version that skips the setup, and starts at the important part)
In these submission comments, I will explain how I created this run, how it works, the limitations I had to work with, and the ideas I used to make it *just* a bit nicer.
This TAS was made in Bizhawk 2.9.1.

Objectives:

  1. Do a little tomfoolery
  2. Execute arbitrary code
  3. Play the Bad Apple!! music video
  4. Earn the maximum score
  5. Beat the game in 4:53

What is Arbitrary Code Execution?

Before I dive into the details, I thought I would explain what an “Arbitrary Code Execution” exploit is. Let’s ignore the word “Arbitrary” here, and focus on how the NES executes code in general. The programmers for a NES game write code in “6502 Assembly”, where an instruction may look like “LDA #$80”. Converting assembly code to machine code is remarkably simple, as every possible instruction corresponds to a different value for a byte. For instance, this “LDA” instruction would become the byte “A9”. The full instruction, “LDA #$80” would include that operand, “80” as well, so written as bytes, that entire instruction is simply “A9 80”.
A list of 6502 instructions and the values of their opcodes can be found here: https://www.nesdev.org/obelisk-6502-guide/reference.html
A slightly different topic: This is a sentence. As you read this sentence, you’re probably not keeping an exact count of how many words have appeared on this page up to this point. If every word on this page was assigned an ID in order of appearance, you probably wouldn’t correctly guess the ID of this sentence’s last word. The NES on the other hand, needs to know exactly which byte should be processed next, and this information is kept in a 16 bit register called the “Program Counter”, or PC.
The opcode A9 is processed. “LDA”. The PC is incremented. Next byte. The operand “80” is processed. The LDA instruction runs with the value “80”. The PC is incremented. Next byte.
This process repeats hundreds of thousands of times a second. A byte is read and interpreted as an instruction, then the PC is moved and ready to interpret the next byte.
Some instructions, such as “JSR” or “JMP”, move the PC to any location of the programmer's choice. For example, suppose part of the code for collecting a coin is at address $BBFE. The instruction “JSR $BBFE” can move the PC there. A return address is also pushed to the stack, so the one-byte “RTS” instruction can later move the PC back to the instruction just after the “JSR”.
That’s the oversimplified basics of code execution. Bytes are interpreted as instructions, and all the code is just a bunch of bytes.
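To make the fetch-and-advance idea concrete, here is a minimal Python sketch of a toy interpreter that only understands three opcodes: LDA #imm (A9), JSR (20) and RTS (60). It is an illustration only, not how the NES or any emulator is implemented. The caller address $8000 is a made-up location, and the return stack is simplified to a Python list.

```python
def run(memory, pc, max_steps=16):
    """Interpret a tiny subset of 6502 machine code and return (A, trace of PCs)."""
    a = 0
    stack = []                  # simplified return stack (the real one lives in RAM)
    trace = []
    for _ in range(max_steps):
        opcode = memory[pc]
        trace.append(pc)
        if opcode == 0xA9:      # LDA #imm: opcode byte + 1 operand byte
            a = memory[pc + 1]
            pc += 2
        elif opcode == 0x20:    # JSR abs: opcode byte + 2 operand bytes
            target = memory[pc + 1] | (memory[pc + 2] << 8)
            stack.append(pc + 2)          # address of the JSR's last byte
            pc = target
        elif opcode == 0x60:    # RTS: resume just after the JSR
            if not stack:
                break
            pc = stack.pop() + 1
        else:                   # anything else stops this toy model
            break
    return a, trace


# Hypothetical layout: a caller at $8000 and a small routine at $BBFE.
memory = bytearray(0x10000)
memory[0x8000:0x8003] = bytes([0x20, 0xFE, 0xBB])   # JSR $BBFE
memory[0xBBFE:0xBC01] = bytes([0xA9, 0x80, 0x60])   # LDA #$80 ; RTS
```

Calling `run(memory, 0x8000)` visits $8000, $BBFE, $BC00 and then $8003, with the accumulator holding 0x80 at the end. The same bytes would be executed no matter where they live, which is the whole point of the next paragraph.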
Fun fact: RAM is just a bunch of bytes. Here’s where we talk about “Arbitrary Code Execution”. If the PC somehow ends up pointing to RAM, those bytes can be interpreted as instructions. More amusing though, if a player is able to manipulate the values in RAM where the PC will move to, they could write code that gets executed. Here’s what a player would need for an ACE exploit:
  1. An exploit to move the PC to RAM. Writing a payload that cannot be executed is pointless.
  2. The ability to manipulate RAM at a location the PC will execute. Moving the PC to RAM is not enough. If the PC jumps somewhere that cannot be manipulated, you cannot execute a payload. The more consecutive bytes that can be manipulated, the more power the player has.
  3. A general understanding of 6502 ASM. You should be able to look at a series of bytes and know which instructions they will be executed as.
  4. A goal. You should know what you are trying to achieve with your code.
With this general understanding of how ACE exploits work, let’s talk about Super Mario Bros. (henceforth referred to as “SMB”)

How does Arbitrary Code Execution work in SMB?

This TAS begins in world ‘N’, which is not one of the 8 worlds normally found when playing SMB.
For the curious, beyond world 8 is world 9, A, B… and so on. World N is the 23rd world. This world can be reached by changing address $7FD to a value of 0x16 (hexadecimal), and starting the game while holding the A Button.
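As a small sketch of that numbering, assuming the byte at $7FD is zero-based (world 1 is stored as 0) and that the displayed label simply continues from digits into letters as described, the mapping can be written like this. The `LABELS` string and the function name are illustrative, not taken from the game's code.

```python
# Digits first, then letters, matching "9, A, B... and so on".
LABELS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def world_label(stored_value):
    """Label shown for the zero-based world number stored at $7FD (assumed layout)."""
    return LABELS[stored_value + 1]
```

With this assumption, `world_label(0x16)` gives `'N'`, the 23rd world, and `world_label(7)` gives `'8'`.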
By the way, this TAS would actually begin inside another cartridge! In order to start the game in world ‘N’, the state of RAM must be manipulated in a specific way so SMB skips the routine that clears a few extra bytes in RAM when loading the title screen.
When you get a game over, SMB jumps to the same code that runs immediately after booting the game. Several NES games, SMB included, start by clearing almost every byte of RAM. Since your high score is displayed on the title screen, the developers need to check if the RAM values for your score are valid (if the game was already running before this code executed), or if it needs to be cleared too (if the game was not already running before this executed). If you were to use another game to set up RAM to pass the checks SMB1 makes, the stale memory can persist after swapping cartridges to SMB.
Another byte that doesn’t get cleared if the checks pass is the world you were in when you ran out of lives and returned to the title screen (address $7FD). If you hold down the A button while pressing Start, the game is programmed to start the game back at the start of the last world you were in. This TAS begins with modified RAM that passes the checks so these bytes aren’t cleared, and starts the game in world ‘N’. Ideally this modified RAM is set up through an incredibly short TAS of Super Mario Bros. 3, where total control can be achieved in less than a second, and RAM can be manipulated to any desired state. After the RAM is set up, the cartridge is swapped out for SMB, and this TAS begins.
Since modern emulators don’t have cartridge swapping as a feature, this TAS was created in a modified version of Bizhawk 2.9.1, where I added such a feature. The uploaded TAS simply omits the Super Mario Bros. 3 step, and begins at the start of SMB with the RAM already set up.
I consider this cartridge swapping, “Tomfoolery”, thus satisfying “Objective 1” of this TAS.

So what does starting the game in world ‘N’ achieve?

Here’s a bit of a tangent, but I promise it’s related. An interesting detail in SMB is what happens when you defeat Bowser by hitting him with fireballs. When most enemies are defeated by fireballs, they flip upside down and fall off screen. When Bowser is defeated this way, his sprite is replaced with a different sprite that will then flip upside down and fall off screen. The “lore” behind this detail is that Bowser's minions are disguising themselves as Bowser, and Mario burns away the disguise.
The game uses a LookUp Table to determine what Bowser is replaced with for each world. The order for worlds 1 through 8 is: Goomba, Koopa, Buzzy Beetle, Spiny, Lakitu, Blooper, Hammer Bro, Bowser. But what happens when you are playing beyond the 8 intended worlds of SMB? The table is parsed out of bounds, and unrelated bytes are read as if they were data from the table. In world ‘N’, Bowser gets replaced by an object with the ID 201, or 0xC9 in Hexadecimal.

What does Object C9 do?

The logic for how each sprite behaves is handled by a Jump Table, which is a list of 16 bit addresses to jump the PC to corresponding to the ID of the enemy. In a similar fashion to the way being in world ‘N’ parses the Bowser table out of bounds, object 0xC9 will parse this jump table out of bounds. Remarkably, it jumps to stable code at address $D007. The code at $D007 is near the end of the screen transition routine. What runs next is code that clears out the enemy sprite slots, increments the game’s “State machine value” (Address $0772), and then jumps to the end of the frame where the console waits for the next vertical blank.
The “State machine value” is used for one of the most important jump tables in the game’s logic. Here’s what happens based on the value of the state machine:
0. This is used when the game boots up to initialize everything and set up the title screen.
1. This leads to a jump table of screen updating routines.
2. This is used during a screen transition to set up the next area.
3. This is for the regular gameplay loop as you move Mario around.
This value was 3 when Bowser was defeated, and the jump to $D007 increments it to a value of 4. Once again, we’re out of the bounds of a table. This time the game does not jump to stable code, and in fact it doesn’t jump to code at all! The PC is being moved to address $53AE, which isn’t mapped to anything! Not RAM, not ROM, no registers… nothing.
Now when the CPU fetches the next instruction, nothing comes back. Not necessarily a value of zero. Since nothing gets fetched, the CPU ends up reading whatever value is currently on the console's Data Bus. This behavior is known as “Open Bus.” Let’s see what is currently on the Data Bus. An indirect jump instruction took us to address $53AE, and here’s how!
JMP ($0006) (Addresses $0006 and $0007 are 0xAE and 0x53)
1. The CPU Fetches the opcode 0x6C. Indirect Jump.
2. The CPU Fetches the first byte of the operand, 0x06: The low byte address of the pointer.
3. The CPU Fetches the second operand, 0x00: The high byte address of the pointer.
4. Using this pointer, the CPU fetches the target address low byte, 0xAE
5. Finally, the CPU fetches the target address high byte, 0x53
So 0x53 is the value currently on the Data Bus. Since the PC is at $53AE, nothing will get fetched, so the value on the databus will still be 0x53. Exhibiting Open Bus behavior, here’s how the following instruction is read.
1. The CPU Fetches the opcode. 0x53
2. The CPU fetches the 1-byte operand, 0x53
A brief note, opcode 0x53 is not an official opcode for the 6502 processor, and has no official name, though it is commonly called “(SRE),Y” which stands for “Indirect Logical Shift Right then Bitwise Exclusive OR Memory (with offset Y) with Accumulator”.
3. Using the pointer from the operand, Fetch target address low byte, 0x00 (Address $0053 has a value of 0x00)
4. Using the pointer from the operand, Fetch target address high byte, 0x00. (This byte also has a value of 0x00) Add the value of the Y register (0x0A) to this address.
The SRE instruction is about to modify the value at address $000A, which happens to be a byte the player can easily control. If they are holding down the A button on this frame (and not holding the B button), address $000A will have a value of 0x80, which will allow the PC to move somewhere useful.
5. Read from target address, 0x80
6. Read from target address again, 0x80
7. Store the old value (0x80) at target address, 0x80
8. Perform the SRE operation, and store this new value at the target address. 0x40
Now the SRE instruction has ended, and the PC is still not mapped to anything. This next instruction will once again exhibit Open Bus behavior. This time, 0x40 is on the bus.
1. The CPU fetches the opcode, 0x40. RTI …
The RTI instruction will pull 3 bytes from the stack. The processor status flags, the target address low byte, and the target address high byte. This ends up moving the PC to address $1181.
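The bus walkthrough above can be replayed with a short Python sketch. It only tracks the last value driven onto the data bus and the few RAM bytes named in the text; it is a simplified model, not a cycle-accurate emulator, and the function name and parameter are made up for illustration. The accumulator side of SRE (the exclusive OR) is left out because it does not affect the bus.

```python
def open_bus_walkthrough(value_at_000a=0x80):
    """Replay JMP ($0006) -> open-bus SRE ($53),Y -> next open-bus opcode."""
    ram = {0x0006: 0xAE, 0x0007: 0x53,      # pointer used by JMP ($0006)
           0x0053: 0x00, 0x0054: 0x00,      # pointer used by SRE ($53),Y
           0x000A: value_at_000a}           # 0x80 when only A is held
    y = 0x0A
    bus = 0

    # JMP ($0006): every fetch leaves its value on the bus.
    for value in (0x6C, 0x06, 0x00, ram[0x0006], ram[0x0007]):
        bus = value
    pc = ram[0x0006] | (ram[0x0007] << 8)   # $53AE, which is unmapped

    # Unmapped reads return whatever is still on the bus.
    opcode = bus                            # 0x53
    operand = bus                           # 0x53

    # SRE (zp),Y: build the target address, then read-modify-write it.
    pointer = ram[operand] | (ram[(operand + 1) & 0xFF] << 8)
    target = (pointer + y) & 0xFFFF
    old = ram[target]
    bus = old                               # read, then the old value is written back
    new = old >> 1                          # the shift-right half of SRE
    ram[target] = new
    bus = new                               # the final write is the last bus value

    next_opcode = bus                       # the next unmapped opcode fetch
    return pc, opcode, target, new, next_opcode
```

With the default argument this returns `(0x53AE, 0x53, 0x000A, 0x40, 0x40)`, so the next opcode fetched from open bus is 0x40, which is RTI.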

Executing Arbitrary Code

Finally, we’re about to execute the arbitrary code. Address $1181 is a location in RAM that SMB never clears when the game boots up. Since this RAM is uninitialized, in theory it could be anything! Uninitialized RAM is the perfect place to write a payload, since it will never be overwritten. Recall how this TAS uses a separate cartridge to start SMB in world 'N'. While the TAS is manipulating RAM pre-cartridge-swap, it can also write custom code at address $1181.
The code written at address $1181 will do the following:
  1. Read the controller, write that value to address $C3. (Target address High byte)
  2. Read the controller, write that value to address $C2. (Target address Low byte)
  3. Read the controller, write that value to address $C1. (Payload length)
  4. Set the X register to zero.
  5. Read the controller, write that value to (Target Address) with offset X.
  6. Increment X.
  7. Repeat steps 5 and 6 until X == (Payload length).
  8. Read the controller. If the value is zero, return to step 1.
  9. Read the controller twice, storing at $C3 and $C2.
  10. JMP ($00C2). This jumps to the address determined by step 9.
With this payload, RAM could be manipulated to nearly any state. (I say “nearly” any state, since overwriting the location where this code is written would take some finesse.)
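The ten steps can be modelled in Python to see how a stream of controller bytes turns into memory writes and a final jump. This is a sketch of the logic only, not the actual 6502 code. It assumes the copy loop runs at least once before X is compared with the length (as the wording of step 7 suggests), and it treats memory as one flat 64 KiB array without mirroring or hardware registers.

```python
def run_loader(inputs, ram):
    """Consume controller bytes as the loader would; return the final jump target."""
    stream = iter(inputs)

    def read():                       # stands in for "read the controller"
        return next(stream)

    while True:
        ram[0xC3] = read()            # step 1: target address high byte
        ram[0xC2] = read()            # step 2: target address low byte
        ram[0xC1] = read()            # step 3: payload length
        target = ram[0xC2] | (ram[0xC3] << 8)
        x = 0                         # step 4
        while True:
            ram[(target + x) & 0xFFFF] = read()   # step 5
            x = (x + 1) & 0xFF                    # step 6 (X is 8 bits wide)
            if x == ram[0xC1]:                    # step 7
                break
        if read() != 0:               # step 8: zero means "another block follows"
            break
    ram[0xC3] = read()                # step 9
    ram[0xC2] = read()
    return ram[0xC2] | (ram[0xC3] << 8)           # step 10: JMP ($00C2)


ram = bytearray(0x10000)
inputs = [0x03, 0x00, 0x02, 0xA9, 0x80,   # write A9 80 at $0300
          0x00,                           # zero: go back to step 1
          0x03, 0x02, 0x01, 0x60,         # write 60 at $0302
          0x01,                           # non-zero: continue to step 9
          0x03, 0x00]                     # jump target $0300
entry = run_loader(inputs, ram)
```

After this example runs, `ram[0x0300:0x0303]` holds `A9 80 60` and `entry` is `0x0300`. The addresses and payload bytes in the example are arbitrary.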
This is used to write code for updating the graphics and playing audio through the sample channel, which is how this TAS plays the video and music for Bad Apple. Once the code is written, a simple jump to the new custom code is all it takes to get the show started. (followed by reading the controller 5.8 million times over the next 3 and a half minutes to get all the visual and audio data)
Running arbitrary code is “Objective 2” of this TAS, which has now been satisfied. Objective 3, “Play the Bad Apple!! Music video” might be a bit tricky. Let’s figure this out one step at a time, beginning with the graphics.

Tangent: Let's actually talk about the gameplay before the ACE

World N-1:
N-1 is an exact copy of 8-3. In addition to beating the level as fast as possible, I also need to collect a fire flower. After submitting my TAS for last year's April Fools Day, I had help from Mizumaririn and Seraphmlll optimizing the gameplay. Out of curiosity, I attempted to save any frames if possible, despite knowing very little about TASing SMB1.
This TAS includes your typical SMB1 TAS strategies, such as pressing Left+Right to accelerate faster than normal. After parts where I needed to turn around, I also press Left before walking off a ledge, which can allow Mario to accelerate faster.
To collect a fire flower, one must first collect a mushroom. I was actually able to collect the mushroom a frame before Mizu, though I have no idea how to manipulate Hammer Bro RNG, so I lost a frame or two by slowing down to get the right pattern. In any case, I beat N-1 on the same framerule as Mizu was able to.
Despite reading a ton of documentation, and even getting the flagpole glitch, I still have no clue how it works or how I would ever do that again. I just compared Mario's position and subpixels with a different TAS, then lined everything up by holding and releasing the Right button while mid-air to manipulate the subpixels.
World N-2:
N-2 is the same as 8-4, except taking any pipes will warp you back to the start of the level. Instead, you simply run to the right and skip all of the pipes. When Seraphmlll TASed this section last year to optimize my ACE proof of concept TAS, the objective also included landing on the axe after Bowser is defeated. Since this TAS doesn't require that (and due to a frame I was able to save at the very start of the level), I was able to defeat Bowser 4 frames faster than the previous ACE TAS.
That makes this TAS my first submission to include optimized inputs with actual gameplay, as my other submissions have all utilized subframe inputs to complete a game from the title screen. (or my unoptimized ACEVideos submission last year)

The graphics

How was the music video transformed into SMB tiles?

SMB does not have CHR-RAM, which means the only graphics that can be drawn on screen are a pre-made set of 512 “characters”, 8 pixels by 8 pixels in size. By writing to the PPU register at address $2000, a specific group of 256 characters could be chosen for the Nametable (the background), while another group of 256 characters could be chosen for Objects (the sprites).
Here are the 256 characters used for the backgrounds in SMB: (using a grayscale color palette)
The first step in turning the music video into characters on the NES’s background, is to decide the resolution. My initial plan was to use the full 256x240 pixels, but since the video is in 4 by 3 aspect ratio, I took it down to 256x192. I wanted the video to play in full glorious 60 FPS, only to realize later down the road the video is actually, in fact, only 30 FPS. Even still, I need to change an incredibly large amount of tiles in a single frame, and to make it worse these can only be updated during a blanking period. (Well, they can only be updated properly during a blanking period. Some visual corruptions occur otherwise).
Each frame has 7,161 PPU cycles of Vertical Blank, or 2,387 CPU cycles. Suppose you wanted to draw the same character in every tile on a single screen. Here’s some ASM code to do that.
	LDA #$20	; A = 0x20. Let's change the PPU Address to $2000
	STA $2006	; Store 0x20 in the PPU Address. (this changes the high byte)
	LDA #0	 	; A = 0
	STA $2006	; Store 0x00 in the PPU Address. (this changes the low byte)
	LDA #$24	; We’re going to place Character 0x24 in every spot.
	LDX #$C0 	; X = 0xC0
	LDY #4	 	; Y = 4
Loop:			; This will run 0x3C0 times. Once per tile on a nametable.
	STA $2007 	; Store the target character at the next location on nametable.
	DEX		; Decrement X
	BNE Loop	; Loop unless X = 0
	DEY		; Decrement Y
	BNE Loop	; Loop unless Y = 0
But unless rendering was disabled, something has gone horribly wrong when executing this code. Only about a quarter of the Nametable was changed, and the visuals rendered to the screen were completely wrong.
Nametable:
Screen:
What happened? When the Vertical Blanking Interval ended, the PPU started drawing the screen using whatever PPU Address was currently loaded. Then, while drawing to the screen, the PPU address was still being changed by the code trying to update the Nametable. Two different processes were using the same address for different means, and they prevented either process from working as intended.
While the code for updating the table could be optimized (quite a bit, actually) the real crux of the issue is how limiting the vertical blanking interval is. If 2,387 CPU cycles isn’t enough to draw the whole screen with the exact same tile, then what code could possibly be written to draw a detailed frame with multiple characters? The solution: require fewer tiles to be drawn.
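To put numbers on the problem, the following Python sketch steps through the fill loop above using the standard 6502 cycle counts (STA absolute 4, DEX/DEY 2, a taken BNE 3, an untaken BNE 2, immediate loads 2) and counts how many `STA $2007` writes finish inside the 2,387-cycle budget. It assumes the code starts exactly when Vblank begins and that no branch crosses a page boundary, so treat the result as an estimate.

```python
VBLANK_CPU_CYCLES = 2387


def simulate_fill(budget=VBLANK_CPU_CYCLES):
    """Return (total tile writes, writes finished within budget, total cycles)."""
    cycles = 2 + 4 + 2 + 4 + 2 + 2 + 2   # LDA/STA/LDA/STA, LDA #$24, LDX, LDY
    x, y = 0xC0, 4
    total = 0
    in_budget = 0
    while True:
        cycles += 4                      # STA $2007
        total += 1
        if cycles <= budget:
            in_budget += 1
        x = (x - 1) & 0xFF
        cycles += 2                      # DEX
        if x != 0:
            cycles += 3                  # BNE Loop (taken)
            continue
        cycles += 2                      # BNE Loop (not taken)
        y = (y - 1) & 0xFF
        cycles += 2                      # DEY
        if y == 0:
            cycles += 2                  # BNE Loop (not taken), loop ends
            break
        cycles += 3                      # BNE Loop (taken); X wraps from 0 to 0xFF
    return total, in_budget, cycles
```

The loop body runs 0x3C0 times in total (0xC0 the first time through, then 0x100 for each of the remaining three passes), but under these assumptions only a little over a quarter of those writes land before the budget runs out.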
Though there were changes later in the development of this TAS, the first draft uses a resolution of 128x96, or 16x12 characters. A maximum of 192 tiles need to be updated in a single vertical blank, compared to the 960 tiles in the previous attempt.
Here’s the process of turning a frame from Bad Apple into a 16x12 character grid.
I began by making a Visual Studio .net forms application. I already made an application that I’ve been using to convert my ASM code into button presses, so I just added another tab to that program.
To begin with, I needed a bitmap of the frame from the music video. I found an archive online with every frame from the video at a 1440x1080 resolution in the form of .png files. With a frame loaded as a bitmap, I was able to squish the image into a 128x96 resolution bitmap. Then, for each 8x8 pixel cell of this squished bitmap, I would compare every pixel in this cell with the matching pixel in a character from the SMB character set. Since this entire video is black and white, I just recorded the difference for each pixel’s red channel and tallied them up. This number represents how different this SMB character is from the cell of the squished bitmap. After checking each of the 256 characters, the character with the lowest difference is chosen.
Here’s the C# code I wrote in the .net application to find the closest matching tile:
int score;
int best = int.MaxValue;
int bestID = -1;
for (int i = 0; i < 256; i++)
{
 	score = 0; //set up next tile
 	for (int x = 0; x < 8; x++)
 	{
 	 	for (int y = 0; y < 8; y++)
 	 	{
 	 	 	int chunkVal = badAppleTile.GetPixel(x, y).R; //value of bad apple pixel
 	 	 	int tileVal = tileMap.GetPixel((i % 16) * 8 + x, (i / 16) * 8 + y).R; //value of mario tile pixel
 	 	 	score += Math.Abs(tileVal - chunkVal); //get the difference, add it to the score
 	 	}
 	}
 	// this is like golf, so we're aiming for the lowest score
 	if (score < best)
 	{
 	 	best = score;
 	 	bestID = i;
 	}
}
// now we have the mario tile that closest matches this 8 pixel block of the music video
This process is repeated for every cell, for every frame. The output of this was a giant list of bytes: 192 bytes per frame, 6562 frames.
Of course, not every character will need to be redrawn every frame, and sometimes the same character is being drawn many tiles in a row.

What steps were taken to optimize rendering a frame inside the NES?

Consider the following:
I can only draw during a blanking interval. Vblank is only 2387 CPU cycles. The most efficient way of reading the controller will take 65 CPU cycles each read. It would be foolish to spend the time inside VBlank reading the controller over and over. Instead, once I finish updating the nametables for a frame, I’ll spend the remaining time reading the controller and writing out all the data for the next frame, storing it on the zero page. This will make the process of reading the information during VBlank as fast as possible.
The order of operations is now:
  1. Vblank begins.
  2. Read bytes from the zero page until the nametable is done
  3. Write the bytes for next frame’s data.
  4. Wait for Vblank
In theory, I could write code that is simply:
; -snip- there would be code here to set up the PPU address. This would happen every 16 tiles.

LDA <$00	; A = the next byte
STA $2007	; Store the next character on the next tile of the Nametable
LDA <$01	; A = the next byte
STA $2007	; Store the next character on the next tile of the Nametable
LDA <$02	; A = the next byte
STA $2007	; Store the next character on the next tile of the Nametable
…
And have that repeat 192 times. That’s only 1104 bytes of the available 2048. It would take exactly 1488 cycles every frame, which is less than the 2387 cycles of Vblank! Even though this would work, I wanted to minimize time spent on this even further. After all, the video is really only 30 FPS, so every other frame would spend 1488 cycles for no reason (also bloating the amount of inputs unnecessarily).
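The 1488-cycle figure can be reproduced with a few lines of Python. The per-tile cost comes straight from the listing (a zero page LDA is 3 cycles, an absolute STA is 4). The per-row cost is an assumption, since that code is snipped: it is taken to be two `LDA #imm` / `STA $2006` pairs, once for each of the 12 rows.

```python
TILES = 16 * 12          # the 128x96 frame, in 8x8 characters
ROWS = 12
LDA_ZERO_PAGE = 3        # cycles
LDA_IMMEDIATE = 2        # cycles
STA_ABSOLUTE = 4         # cycles
VBLANK_CPU_CYCLES = 2387


def unrolled_cycles():
    """Cycles for the fully unrolled LDA <$nn / STA $2007 approach."""
    per_tile = LDA_ZERO_PAGE + STA_ABSOLUTE
    per_row_setup = 2 * (LDA_IMMEDIATE + STA_ABSOLUTE)   # assumed $2006 setup
    return TILES * per_tile + ROWS * per_row_setup


def same_tile_row_cycles():
    """Cycles for writing one preloaded tile 16 times with STA $2007."""
    return 16 * STA_ABSOLUTE
```

Under those assumptions `unrolled_cycles()` is 192 × 7 + 12 × 12 = 1488, and `same_tile_row_cycles()` is 64.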
Instead, I started thinking about how often an entire row of 16 tiles would be all white tiles or all black tiles. It would be way faster to simply write `STA $2007` 16 times in a row. That’s only 64 cycles per row, instead of the 160 cycles of the previous method. An even more important modification to this code would be to simply not overwrite tiles that didn’t change between frames.
To fix all these issues, I imagined a “packet” of bytes to draw specific tiles at a specific location. 2 bytes to set the PPU address, 1 byte for the length, then 1 byte to decide what tile is being drawn. In this case, an entire row of tiles can be stored on the zero page as 4 bytes. Of course, most rows aren’t being entirely changed all at once, and this “compression” can start working against the goal of minimizing bytes / CPU cycles. Suppose an entire row has 16 different tiles. I updated the “packet” system to have two methods.
Method 1: 2 bytes for the PPU address, 1 byte for the length, and 1 byte to represent the tile being drawn ‘n’ times.
Method 2: 2 bytes for the PPU address, 1 byte for the length (with bit 7 set), then the following ‘n’ bytes are all the different tiles needed.
One more optimization: suppose we draw 3 unique tiles, then a strip of 13 identical tiles. What a waste of CPU cycles it would be to update the PPU Address when it doesn’t need to be changed. Since the high byte of the PPU address will always be between 0x20 and 0x23, I updated the packets so if the first byte is 0x00, the PPU address can stay where it is, and only use 1 byte for the address, saving an extra controller read.
These optimizations are primarily to make sure every frame has as much time as possible to play the audio. This method almost worked perfectly! There is a single frame that requires changing every single tile, and every row uses the more cycle-expensive method.
Unfortunately, there is just barely not enough time in VBlank to change every single tile using this current method. I’ll explain how this problem was solved later, but right now I want to explain the packet system a little more.

Packet examples.

So, as an example, suppose the following bytes were being read by my code:
21 82 06 24
21 and 82 are the PPU address, $2182
06 is the length. I add 1 to this number, so I can use length 0 to represent a single tile. In this case, we have 7 tiles to draw.
24 is the tile being drawn, which corresponds to a white square.
So this will place 7 white tiles from $2182 through $2188.
Suppose the bytes that followed were:
00 83 33 22 11 00
00 indicates that we’re not moving the PPU address, so it’s currently at $2189
83 means we have 4 unique tiles. (because I add 1 here, so 80 is a single unique tile. 81 is two tiles…)
Then the following bytes, 33 22 11 and 00 are the tiles being drawn.
The final addition to the packet system is the terminator. If the first byte of a packet is 0xFF, then we’ve finished the changes for this frame!
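Putting the rules together, a packet stream can be decoded with a short Python function. This is a reference sketch of the format as described here, not a translation of the 6502 parser. It assumes the PPU address advances by 1 after each tile write, which matches the $2182 through $2188 example.

```python
def decode_packets(data, ppu_addr=0x0000):
    """Decode one frame of packets into a {PPU address: tile} dictionary."""
    writes = {}
    i = 0
    while data[i] != 0xFF:                  # 0xFF terminates the frame
        if data[i] == 0x00:                 # keep the current PPU address
            i += 1
        else:                               # two-byte PPU address, high byte first
            ppu_addr = (data[i] << 8) | data[i + 1]
            i += 2
        length = data[i]
        i += 1
        count = (length & 0x7F) + 1         # stored length is "tiles minus one"
        if length & 0x80:                   # method 2: 'count' unique tiles follow
            tiles = list(data[i:i + count])
            i += count
        else:                               # method 1: one tile repeated 'count' times
            tiles = [data[i]] * count
            i += 1
        for tile in tiles:
            writes[ppu_addr] = tile
            ppu_addr += 1                   # assumed +1 auto-increment
    return writes


# The two example packets from this section, followed by the terminator.
example = bytes.fromhex("21 82 06 24  00 83 33 22 11 00  FF")
result = decode_packets(example)
```

For the example stream, `result` maps $2182 through $2188 to tile 0x24, and $2189 through $218C to 0x33, 0x22, 0x11 and 0x00.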
For instance, Frame 42 is the second frame to have any graphical changes. Here’s the entire series of packets for that frame:
21 37 00 40
21 57 00 40
21 77 00 40
21 97 00 A8
21 B7 00 A8
21 D7 00 3D
21 F7 00 7E
22 17 00 A3
22 36 81 9C 32
FF
This is the result:
Right now, all I have is a giant list of bytes representing every tile of every frame. Let's convert this data into packets.

How did the output list of bytes get converted into packets?

This was also done in the Visual Studio .NET Forms application.
To begin with, it was the process of splitting the frame into 12 rows of 16 bytes, and finding the most optimal way to represent those bytes in the packet format, specifically optimizing for the fewest CPU cycles to execute, even if it would take more controller reads to store this data on the zero page. For instance, if a row of bytes is:
24 24 24 24 3F 3E 27 27 27 27 27 27 27 27 27 A8
It could be represented as:
 21 28 03 24	; 4 tiles of 0x24
 00 81 3F 3E	; 2 unique tiles, 3F and 3E
 00 08 27	; 9 tiles of 27
 00 00 A8	; 1 tile of A8
(14 bytes)
The time spent reading the start of each packet begins to add up, and it saves a significant amount of CPU cycles to simply read the whole row as a series of “unique tiles”, like:
21 28 8F 24 24 24 24 3F 3E 27 27 27 27 27 27 27 27 27 A8 ; 16 “unique” tiles in a row.
(19 bytes)
19 is larger than 14 [citation needed]. Despite this, the 19 byte packet would take fewer CPU cycles to process compared to the set of 4 packets taking up fewer bytes.
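The two encodings of that row can be generated with a pair of small Python helpers. These are illustrative only and are not the encoder used for the TAS: `encode_runs` is a simple greedy splitter that happens to reproduce the four-packet version, and neither function models CPU cycles, which is what the real tool optimized for.

```python
def encode_all_unique(addr, row):
    """One method-2 packet: address, length with bit 7 set, then every tile."""
    return [addr >> 8, addr & 0xFF, 0x80 | (len(row) - 1)] + list(row)


def encode_runs(addr, row):
    """Greedy split: repeated tiles use method 1, neighbouring singles use method 2."""
    runs = []                                   # [tile, count] pairs
    for tile in row:
        if runs and runs[-1][0] == tile:
            runs[-1][1] += 1
        else:
            runs.append([tile, 1])
    out = []
    i = 0
    while i < len(runs):
        header = [addr >> 8, addr & 0xFF] if i == 0 else [0x00]
        j = i
        while j < len(runs) and runs[j][1] == 1:   # gather consecutive single tiles
            j += 1
        if j - i >= 2:                             # two or more singles: method 2
            singles = [run[0] for run in runs[i:j]]
            out += header + [0x80 | (len(singles) - 1)] + singles
            i = j
        else:                                      # a run (or a lone single): method 1
            tile, count = runs[i]
            out += header + [count - 1, tile]
            i += 1
    return out


row = [0x24] * 4 + [0x3F, 0x3E] + [0x27] * 9 + [0xA8]
as_runs = encode_runs(0x2128, row)
as_unique = encode_all_unique(0x2128, row)
```

Here `as_runs` is the 14-byte, four-packet form and `as_unique` is the 19-byte single packet. Which one is cheaper to execute depends on the parser's per-packet overhead.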
Of course, not every row is most optimal if read entirely as “unique tiles”. In several cases, most of the row doesn’t need to be changed at all. Suppose the difference between this frame and the one before is a single tile being moved one tile to the left, and replacing the old position with a white square.
22 6C 81 33 24 ; 2 unique tiles in a row
It changes only the tiles that need to change, and nothing more. But what if on one frame, there are the characters, 55 24 24 24 24 66, and on the next frame, it’s 33 24 24 24 24 44? It’s actually faster to overwrite all the existing “24 tiles” with the same values than to change the PPU address, so the packet would look like
22 28 85 33 24 24 24 24 44 ; 6 “unique” tiles in a row.
With this method, it’s faster to overwrite existing tiles for 5 unchanged tiles in a row, but any more than that and it’s faster to start a new packet at an updated PPU address.
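That rule of thumb can be sketched as a function that compares two rows and groups the changed tiles into spans, bridging gaps of up to 5 unchanged tiles. The function name and the span representation are made up for illustration, and the threshold is simply the figure quoted above.

```python
MAX_UNCHANGED_GAP = 5


def changed_spans(previous, current, max_gap=MAX_UNCHANGED_GAP):
    """Return (start, end) index pairs, inclusive, that should each become one packet."""
    spans = []
    for index, (old, new) in enumerate(zip(previous, current)):
        if old == new:
            continue
        # Number of unchanged tiles between the last span and this changed tile.
        if spans and index - spans[-1][1] - 1 <= max_gap:
            spans[-1][1] = index        # cheaper to overwrite the gap
        else:
            spans.append([index, index])  # start a new packet at a new address
    return [(start, end) for start, end in spans]
```

For the example above, `changed_spans([0x55, 0x24, 0x24, 0x24, 0x24, 0x66], [0x33, 0x24, 0x24, 0x24, 0x24, 0x44])` gives a single span `(0, 5)`, while a gap of 6 unchanged tiles would produce two separate spans.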
With this knowledge of optimizing for the fewest CPU cycles, I compressed the data into packets and stored it on the zero page each frame. If a frame has no changes, all that needs to be stored is 0xFF at address $0000.
The end result of this program, after creating all the bytes for the packets, converts these bytes into button presses, which I could paste into the TAStudio timeline.

Implementing 6502 ASM code to read the packets.

When the ACE exploit begins, all I have is a controller reading loop at address $1181. For the record, there are only 0x800 bytes in RAM. Address $1181 is an exact copy of address $0181, or as we call it in fancy 6502 terms, this address is a “mirror” of $0181.
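Mirroring is easy to express in code: CPU addresses $0000 through $1FFF all land in the same 0x800 bytes, so only the low 11 bits of the address matter. A minimal sketch (the function name is made up):

```python
RAM_SIZE = 0x800


def cpu_ram_index(address):
    """Map a CPU address in $0000-$1FFF to its index in the 2 KiB of internal RAM."""
    if not 0x0000 <= address <= 0x1FFF:
        raise ValueError("address is outside the internal RAM region")
    return address & (RAM_SIZE - 1)
```

`cpu_ram_index(0x1181)` and `cpu_ram_index(0x0181)` both give `0x0181`.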
The code changed quite a bit, though I generally kept everything in the same place through each iteration.
Address $000 through $0FF: The zero page. The "packet" data is stored here each frame.
Address $100 through $1FF: The original "Total control code" is still here. I don't change the stack at all during the video.
Address $200 through $2FF: Unused space. I'm not using OAM in the TAS, so I could have used this space, though I didn't need it.
Address $300 through $6FF: The code to parse the packets, then read the controller for writing the next frame data / writes to APU.
Address $700 through $7FF: Part of the packet parsing involves jump tables, which lead to a series of repeated instructions I put here.
The final version of the ASM code is shared at the end of this write-up.

I meant to talk about address $2002, and couldn’t find a good section to mention it.

One more detail about this TAS that I’ll gloss over here but explain at the very end: I read from $2002 to check for Vblank, which can result in a false negative due to a race condition. Anyway, I read the controller, store that value, and then re-use that value to prevent the race condition if needed. Again, I’ll explain how it prevents that in detail later, but you should know I re-use that controller read to exit the “music video loop”. If I press the A button with that controller read, the PC jumps to address $0181, the “total control payload” which lets me write more code. That’s only used at the end of the TAS so I can return to stable SMB1 gameplay. More on that later.
I had to mention it though, since I’ll talk about the $2002 race condition later, and it would make no sense if I didn’t even mention it being problematic yet. These following sections flow into each other, and this was the last good break to mention the race condition’s existence. Anyway, back to the TAS.

The audio

Ever since the TASBot showcase at GeekPWN 2016, I have been fascinated with the theoretical limits of the NES audio chip. Streaming music data through the controllers can sound way better than I thought, so why not use that method for the Bad Apple music?
The NES audio chip’s DMC channel can play 7-bit PCM audio. Playing this audio is done by repeated writes to address $4011. My controller reading + writing to $4011 loop takes exactly 71 CPU cycles. An NTSC NES CPU runs at roughly 1,789,773 cycles per second, so dividing that number by 71 gives the sample rate: I’m playing audio at about 25,208 Hz.
Here’s my code for optimally reading the controller and playing this:
PlayPCMSample:
	LDA #01
	STA $4016		; strobe the controllers
	LSR A
	STA $4016 		; strobe controllers part 2
    
	EOR $4016		; Very fast controller reading using Exclusive OR
	ASL A			; Credit goes to total for informing me of this.
	EOR $4016
	ASL A
	EOR $4016		; Also this loop is unrolled because it’s faster.
	ASL A
	EOR $4016		; Branches waste CPU cycles.
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0		; Flip the bits that were flipped from Open Bus.

	STA $4011		; Write this to the audio chip.
	CLC 			; Clear the carry flag. (the open bus + ASL would set this)
	ASL A			; Shove bit 7 into the carry flag
	BCC PlayPCMSample	; If bit 7 was set, we exit. Otherwise, loop.
And this will run as many times as possible before Vblank begins.
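For readers who want to check the routine, here is a Python model of the EOR/ASL sequence along with a cycle tally. It assumes a standard controller that only drives bit 0, and that the undriven upper bits of each `$4016` read come back as 0x40 (the high byte of the address, left over on the bus). Cycle counts are the standard 6502 timings with the final branch taken.

```python
OPEN_BUS = 0x40   # assumed upper bits of every $4016 read


def read_port(button_bit):
    """One read of $4016: open-bus upper bits plus one button in bit 0."""
    return OPEN_BUS | (button_bit & 1)


def fast_read(buttons, fix_open_bus=True):
    """Model of the unrolled read; 'buttons' is A,B,Select,Start,Up,Down,Left,Right."""
    bits = [(buttons >> (7 - i)) & 1 for i in range(8)]   # A is shifted out first
    a = 0x01 >> 1                      # LDA #01 ... LSR A leaves A = 0
    for i, bit in enumerate(bits):
        a ^= read_port(bit)            # EOR $4016
        if i < 7:
            a = (a << 1) & 0xFF        # ASL A (only between reads)
    if fix_open_bus:
        a ^= 0xC0                      # EOR #$C0
    return a


def loop_cycles():
    """CPU cycles for one pass of PlayPCMSample when the branch is taken."""
    cycles = 2 + 4 + 2 + 4             # LDA #01, STA $4016, LSR A, STA $4016
    cycles += 8 * 4 + 7 * 2            # eight EOR $4016, seven ASL A
    cycles += 2                        # EOR #$C0
    cycles += 4 + 2 + 2 + 3            # STA $4011, CLC, ASL A, BCC (taken)
    return cycles
```

In this model `fast_read(b)` returns `b` for every possible byte, `fast_read(0, fix_open_bus=False)` returns `0xC0` (the residue of the last two reads, since the earlier ones are shifted out), and `loop_cycles()` is 71.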

Tangent: How does this controller reading routine work?

While we’re here, those reading this with a basic understanding of 6502 ASM might be asking how exactly this controller reading code works, and what in the world that “EOR #$C0” is doing at the end? Let’s quickly review some standard controller reading code:
start:
	LDA  #1
	STA $4016	; Write 1 to the controller port
	LDA #0
	STA $4016	; Write 0 to the controller port
			; This “strobes” the controller. 
			; The controller has a shift register inside it that records the buttons pressed.
			; Reading from the controller grabs 1 button at a time.

	LDX #8		; X = 8. We will read the controller 8 times.
loop:
	LDA $4016 	; A = a single bit from the controller in bit 0
	LSR A		; A is shifted to the right, sending the contents of bit 0 into the carry flag.
	ROL <$00	; Address $00 is shifted left, and the carry flag is sent to bit 0.
	DEX		; Decrement X
	BNE loop	; Loop until X = 0
			; Once X = 0, the full 8 bits of the controller are stored in address $00
	LDA <$00	; A = the contents from address $00, which is the full controller data.
End:
This sets up the A register with the buttons stored in the following order (from most significant bit to least): A, B, Select, Start, Up, Down, Left, Right.
From “start” to “end”, this controller reading routine will take 144 CPU cycles. It also uses an address on the zero page to temporarily store the values being read. There is plenty of room for optimizations, but let’s learn one more fun fact about reading from the controller ports.
Only bits 0 through bit 4 are read back from register $4016.
In binary, we’re looking at 00011111.
For a little bit of terminology, the bits are numbered from right to left, 0 through 7. So when I say bit 0, it’s the one on the far right. Bits are numbered as 76543210.
Bit zero will be the value of the next button read from the controller. In our “00011111” example, the other bits set as one are used for various other controller inputs, such as the NES zapper and the Famicom controller’s microphone, which will always be a zero when read from a standard controller. Those bits in “00011111” that are zeroes represent the bits which are not set by a read from $4016. Remember Open Bus? These bits here are going to exhibit open bus behavior.
Let’s walk through the cycles that occur when reading from the controller.
LDA $4016 (AD 16 40):
  1. Fetch the opcode: AD (10101101)
  2. Fetch first operand: 16 (00010110)
  3. Fetch the second operand: 40 (01000000)
  4. Read from target address: (010xxxxx)
Fascinating! So whenever the console reads from $4016, it will always return a 010 pattern for bits 7, 6, and 5.
With this knowledge, let’s start optimizing the controller reading routine. Step 1, unroll the loop. Branches waste CPU cycles.
Step 2, instead of shifting bit 0 to the carry flag and shifting the carry flag into another byte, can’t we just keep everything inside the A register?
What if we try:
	ORA $4016	; Bitwise OR A with the controller data (Setting bit 0)
	ASL A		; Shift to the left. (Scooting bit 0 into bit 1)
	ORA $4016	; Bitwise OR A with the controller data (Setting bit 0)
	ASL A		; Shift to the left. (Scooting bit 1 into bit 2, and bit 0 into bit 1)
	; … and so on
Here’s where that pesky Open Bus behavior gets in the way. Suppose in this example all buttons on the controller are held. Bit 6 will be set as well as bit 0.
	LDA #0		; A = 00000000
	ORA $4016	; A = 01000001
	ASL A		; A = 10000010
	ORA $4016	; A = 11000011
	ASL A		; A = 10000110
	ORA $4016	; A = 11000111
	ASL A		; A = 10001110
	ORA $4016	; A = 11001111
	ASL A		; A = 10011110
	ORA $4016	; A = 11011111
	ASL A		; A = 10111110
	ORA $4016	; A = 11111111
	ASL A		; A = 11111110
	ORA $4016	; A = 11111111
	ASL A		; A = 11111110
	ORA $4016	; A = 11111111
Bit 6 will pollute our otherwise clean controller reading! If we press all the buttons it works out just fine, but if all buttons were unpressed, it will still read as if two of them were set! Now something interesting happens if instead of a bitwise OR we were to use a bitwise Exclusive OR.
	LDA #0		; A = 00000000
	EOR $4016	; A = 01000001
	ASL A		; A = 10000010
	EOR $4016	; A = 11000011
	ASL A		; A = 10000110
	EOR $4016	; A = 11000111
	ASL A		; A = 10001110
	EOR $4016	; A = 11001111
	ASL A		; A = 10011110
	EOR $4016	; A = 11011111
	ASL A		; A = 10111110
	EOR $4016	; A = 11111111
	ASL A		; A = 11111110
	EOR $4016	; A = 10111111
	ASL A		; A = 01111110
	EOR $4016	; A = 00111111
Instead of bits 6 and 7 always being set, they will be inverted. For instance, if no buttons were pressed, the result would be 11000000. If the buttons pressed should be read as 01010101, it would be read as 10010101. This can easily be solved by simply flipping bits 6 and 7 with EOR #$C0. (Exclusive or with the number 11000000. Flip those 2 bits.)
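To see why a single EOR #$C0 is enough, the unrolled routine can be simulated for every possible button state. This is a minimal Python sketch, not the code used in the TAS: `read_controller` and `OPEN_BUS` are names made up for this example, and it assumes every $4016 read returns the open bus value $40 plus the button in bit 0.

```python
OPEN_BUS = 0x40  # bits 7-5 of a $4016 read echo the high address byte ($40)

def read_controller(buttons):
    """Simulate the unrolled EOR/ASL routine for an 8-bit button state.

    `buttons` uses the usual order, A in bit 7 down to Right in bit 0.
    """
    a = 0  # A is zero after the strobe (LDA #1 / LSR A)
    for i in range(8):
        bit = (buttons >> (7 - i)) & 1  # the shift register reports A first
        a ^= OPEN_BUS | bit             # EOR $4016
        if i < 7:
            a = (a << 1) & 0xFF         # ASL A (bit 7 falls into the carry)
    return a ^ 0xC0                     # EOR #$C0

# Every possible button combination survives the round trip.
assert all(read_controller(b) == b for b in range(256))
print(format(read_controller(0b01010101), "08b"))  # 01010101
```

Only the open bus bits from the last two reads are still inside the byte after the shifts (one lands in bit 6, the other in bit 7), which is why flipping exactly those two bits cleans up the result.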
Some other optimizations:
	LDA #1		; A = 1
	STA $4016	; strobe the controllers
	LDA #0		; A = 0
	STA $4016 	; strobe controllers part 2
Can become:
	LDA #01		; A = 1
	STA $4016	; strobe the controllers
	LSR A		; A = 0
	STA $4016 	; strobe controllers part 2
That takes the same amount of CPU cycles, but it saves us one byte of our payload! This makes the new finalized input code:
Start:
	LDA #01		; A = 1
	STA $4016	; strobe the controllers
	LSR A		; A = 0
	STA $4016 	; strobe controllers part 2

	EOR $4016	; Bitwise XOR with the controller data (Setting bit 0 and 6)
	ASL A		; Shift to the left
	EOR $4016	; That bit 6 getting set is from Open Bus, as explained above.
	ASL A		; 
	EOR $4016	; 
	ASL A		; Also we unrolled the loop because it’s faster
	EOR $4016	; Branches waste CPU cycles.
	ASL A		; Also if we were using the X or Y registers…
	EOR $4016	; this avoids overwriting them.
	ASL A		; It takes more bytes to write it this way.
	EOR $4016	; But it runs much faster!
	ASL A		; 
	EOR $4016	;
	ASL A		;
	EOR $4016	;
	EOR #$C0	; Flip the bits that were flipped from Open Bus.
End:
This takes 60 cycles from “Start” to “End” (12 for the strobe, 32 for the eight reads, 14 for the seven shifts, and 2 for the final EOR).
One optimization I devised about half a month after this submission was made,
	LDA #01		; A = 1
	STA $4016	; strobe the controllers
	LSR A		; A = 0
	STA $4016 	; strobe controllers part 2
can become
	STY $4016	; strobe the controllers
	STX $4016 	; strobe controllers part 2
	TXA 	 	; A still needs to be zero here.
if X=0 and Y=1 (the first write has to be the 1, and TXA has to leave A at zero). This saves 2 cycles. Another 2 cycles can be saved by removing the CLC instruction in the audio loop, which could improve the audio up to 26,669 Hz.
Anyway, that’s how the controller reading code was optimized so I can fit as many audio samples into a frame as possible.

How did you turn the music into button presses?

Now that we know how the button presses turn into NES audio, let’s turn Bad Apple into button presses.
Remember that archive that had every frame as a 1080p .png file? The archive also provided a .wav file of the music!
The .wav file format is a single header, followed by PCM data for the entire remainder of the file. This .wav file had a 0x50 byte header, followed by 16 bit PCM audio (2 bytes per sample) at a rate of 48kHz (48,000 samples per second). Since the NES APU can play PCM audio, I just need to grab a byte from the wav file and convert that to 7 Bit PCM. Some simple math can be used to calculate which byte from the wav file I should convert for a given sample.
I’ll be playing my audio at 1 sample per 71 CPU cycles. In order to get the correct sample to send to the audio chip at the right time, I would need to know the exact number of CPU cycles that have passed since the TAS began playing the music video. Luckily, when the screen is rendering, every frame is the same amount of cycles! Well, it’s a consistent pattern of 29780, 29781, 29780, 29781… so we can average it out to 29780.5 cycles per frame.
To make a long story short, I used a custom 6502 emulator I wrote to emulate the frame until the first write to the APU and count all the cycles for me.
With the knowledge of exactly how many CPU cycles have occurred since the start of the video and playing the next audio sample, I can simply convert from CPU cycles to bytes inside the wav file. The file is 48,000 samples a second, and there are 1,786,830 CPU cycles per second. Accounting for the wav file’s 0x50 byte header, and how each sample is 2 bytes long, grabbing the desired byte (The high byte of the sample) is simply:
(Floor(TimeInSeconds * 48000) * 2) + 0x50 + 1
Where TimeInSeconds is the current CPU cycle divided by 1,786,830.
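The formula translates almost directly into code. The sketch below uses integer arithmetic so the floor is exact; the function names are hypothetical, and `to_7bit` is an assumed conversion (treating the high byte as signed 8-bit PCM), since the notes do not spell out how the byte is reduced to 7 bits.

```python
CPU_HZ = 1_786_830      # CPU cycles per second, as used in these notes
WAV_RATE = 48_000       # samples per second in the source .wav
HEADER_SIZE = 0x50      # bytes of header before the PCM data
BYTES_PER_SAMPLE = 2    # 16-bit PCM, low byte first

def wav_high_byte_offset(cpu_cycle):
    """File offset of the high byte of the sample playing at `cpu_cycle`."""
    sample_index = (cpu_cycle * WAV_RATE) // CPU_HZ  # Floor(TimeInSeconds * 48000)
    return sample_index * BYTES_PER_SAMPLE + HEADER_SIZE + 1

def to_7bit(high_byte):
    """Assumed conversion: signed 8-bit high byte -> unsigned 7-bit level."""
    return (high_byte ^ 0x80) >> 1

print(hex(wav_high_byte_offset(0)))   # 0x51
print(wav_high_byte_offset(CPU_HZ))   # 96081
print(to_7bit(0x00))                  # 64
```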
Knowing how many cycles I have left before the frame ends, I was able to pack in as many audio samples as possible before exiting the loop and waiting for VBlank. (Optimizing for the smallest possible wait for VBlank)
Of course, the huge drawback here is that for all the time spent updating the nametables and filling in the bytes on the zero page, audio data is halted. A flat line in the audio happens every frame.
Let’s brainstorm some ideas on playing audio during those busy parts of the frame.
In theory, I could try to use the DMC channel’s DPCM audio functionality for the split second between leaving the audio playing loop and re-entering it, but that sounded too complicated. Also using DPCM audio runs the risk of corrupting controller reads, which sounded scary.

Tangent: Using Differential Pulse-Code Modulation the way Nintendo definitely had planned.

So I tried it anyway.
DPCM audio cannot play samples from RAM (even if the audio chip overflows beyond address $FFFF, it will remain inside the ROM data by looping to address $8000 instead of $0000), so if I want to fill in these gaps with DPCM audio, I would have to use existing bytes inside the SMB ROM.
But this got me thinking. Recall how I used the existing character data to approximate the visuals. Okay, now think about that, but taking the game’s ROM and approximating DPCM audio to fill in the gaps where the PCM stops playing.
The plan: After the final audio input is written for a frame, I need to know when (How many CPU cycles since the movie began) the next frame’s first write to the audio chip will be.
With that info, I could create a spline along the .wav file’s PCM data from where the gap starts to where it ends, then test how SMB’s ROM data (when read as DPCM audio) compares. With my graphical data, I compared every pixel in the 8 by 8 cell against each of the characters. Here, I’ll build every possible “sample” (short audio clip) from the 256 possible starting positions and 16 possible sample rates, using whatever sample length gets as close as possible to the first controller read of the next frame without going over it. For each clip, I’ll compare every “sample” (value of the DMC output) against the .wav file’s sample data along the spline. Then, just like with the images, I’ll tally up the differences and see which clip is closest.
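A rough sketch of that comparison might look like the following. It is not the code written for this TAS: the function names are hypothetical, it only searches starting positions (a fuller search would also resample the target for each of the 16 rates), and it scores candidates with a plain sum of squared differences. The decoding rule is the DMC's usual one: each bit, least significant first, nudges the 7-bit output level up or down by 2.

```python
def dpcm_levels(data, start_level):
    """Decode DPCM bytes into the sequence of 7-bit DMC output levels.

    Each bit, least significant first, moves the level up by 2 (bit = 1)
    or down by 2 (bit = 0); a step that would leave 0..127 is skipped.
    """
    level = start_level
    levels = []
    for byte in data:
        for bit_index in range(8):
            if (byte >> bit_index) & 1:
                if level <= 125:
                    level += 2
            elif level >= 2:
                level -= 2
            levels.append(level)
    return levels

def squared_error(levels, target):
    """Sum of squared differences between a candidate and the target curve."""
    return sum((a - b) ** 2 for a, b in zip(levels, target))

def best_start(rom, target, start_level, length_bytes):
    """Try every 64-byte-aligned start offset and keep the closest match."""
    candidates = range(0, len(rom) - length_bytes + 1, 64)
    return min(
        candidates,
        key=lambda off: squared_error(
            dpcm_levels(rom[off:off + length_bytes], start_level), target
        ),
    )

# Toy "ROM": 64 bytes that ramp the level down, then 64 that ramp it up.
rom = bytes([0x00] * 64 + [0xFF] * 64)
rising = [66, 68, 70, 72, 74, 76, 78, 80]
print(best_start(rom, rising, 64, 1))  # 64
```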
Anyway, to make a long story short, the fastest sample rate of DPCM audio is not enough to get in the minimal sample length before I need the controller again (even for the frame with the longest break between audio loops), so I wrote the code to do that for nothing, and this section of my author’s notes could have been omitted. In theory, it would’ve been really cool though.

Plan B to make the audio cleaner.

Okay, so DPCM audio is no good.
The big issue that needs to be fixed might not be the large gaps, but rather the steep change when the audio loop resumes.
What if I spend the first 16th of the audio loop using linearly interpolated values between the last sample of the previous loop and the current sample?
This isn’t a perfect solution, but rest assured this makes the audio sound better. Every frame had a noticeable click before this change, now it’s only a gritty sound.
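One way to read that idea is as a short crossfade at the start of each loop. The Python sketch below is an interpretation rather than the actual tool code: `smooth_loop_start` is a made-up name, and it assumes each of the first few samples is pulled toward the previous loop's last value by a weight that fades out across the ramp.

```python
def smooth_loop_start(prev_last, samples):
    """Blend the first 16th of `samples` from `prev_last` toward the real data."""
    ramp = len(samples) // 16
    out = list(samples)
    for i in range(ramp):
        t = (i + 1) / (ramp + 1)  # 0 < t < 1, rising across the ramp
        out[i] = round(prev_last + (samples[i] - prev_last) * t)
    return out

# A jump from 0 to 64 is spread over the first two of 32 samples.
print(smooth_loop_start(0, [64] * 32)[:4])  # [21, 43, 64, 64]
```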

Plan C to make the audio cleaner.

While the previous method made the audio less ear-hurting, I still think there’s room for improvement.
The most time consuming aspect of this TAS is filling in the bytes for the zero page on frames that have a lot of data. The most dense frame being 0xE4 bytes! (That problematic frame I talked about earlier) That frame takes 15732 CPU cycles just to fill in, which is almost 53% of the frame spent not writing to the audio chip.
Some incredibly rough math done in my head would tell me that I just barely don’t have enough time in a frame to alternate a single write to the APU for each write to the zero page. I *could* write two bytes to the zero page and one byte to the audio chip in a loop though. To optimize this write to the APU, I’ll only read the first 7 bits of the controller, since only 7 bits are sent to the APU. With my method of controller reading, the inputs need to be shifted to the left. So if I wish to send 0x40 to the audio chip, what would normally be just the B button, is now just the A button for this input.
The (simplified) order of operations is now:
  1. Vblank begins
  2. Update Nametable
  3. Read controller to prevent the $2002 race condition / exit the music video loop.
  4. Read controller to determine number of bytes needed on zero page
  5. Read controller X + (X/2) times, where every third input is a bit shifted 7-bit input sent to the APU.
  6. Read controller as many times as possible for highest quality audio
  7. Wait for Vblank
While before, the audio would swap from 25166.62Hz to 0Hz, now the TAS will swap from 25166.62Hz, to (a microscopic 0Hz moment, to) 9163.23Hz, to 25166.62Hz, to very briefly, 0Hz when it updates the graphics.
The result is cleaner audio, but something still seems to make the audio sound grainy and bad.
Before I get desperate and try adding audio data to my packet system, I want to try writing to the APU after every write to the Zero Page. Perhaps the change from 25kHz to 9kHz is way too noticeable. This change will bump that 9.16kHz to 13.85kHz.
Big issue: There’s a single problematic frame where there’s not enough time to fill in the data on the zero page if I’m writing to the audio chip this often… right? The problematic frame has 0xE4 bytes to write, and the new loop takes 129 CPU cycles per byte. Hm- let’s see… 0xE4 * 129 = 29412. Which is… less than 29780!
Let’s consider the remaining cycles in this frame:
132 		; These happen before the controller reading loop.
+ 29412		; The cycles are from our graphics data/APU controller reading loop.
+ 69 		; A minimum of 1 read sent to the APU after the graphical data writing loop.
+ 15 		; The remaining cycles before spinning (waiting for V-Blank). 
That’s a total of 29628 cycles, which means there’s actually still room for some more writes to the APU! Just *barely*!
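The same budget can be written out as a small script. This is just a sketch of the arithmetic; it assumes each additional high quality write costs one full 71-cycle pass of the audio loop, and the dictionary labels are paraphrased from the list above.

```python
FRAME_CYCLES = 29780    # the shorter of the two alternating frame lengths
HQ_LOOP_CYCLES = 71     # assumed cost of one extra pass of the audio loop

budget = {
    "before the controller reading loop": 132,
    "zero page / APU loop (0xE4 bytes * 129)": 0xE4 * 129,
    "one guaranteed APU write afterwards": 69,
    "before spinning for VBlank": 15,
}

used = sum(budget.values())
spare = FRAME_CYCLES - used
extra_writes = spare // HQ_LOOP_CYCLES

print(spare > 0)      # True
print(extra_writes)   # 2
```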
Well, how does it sound? Does increasing the Hz by almost 5,000 fix the issue?
Oh absolutely. There’s still a gritty noise, though it’s much much quieter, and becomes unnoticeable when all the instruments kick in.
See if you can spot where the sample rate dips in the middle of this image:
It would seem like the logical next step would be to fix the audio abruptly stopping when the screen is being updated. Those writes to the nametable leave a pretty clear cut in the audio.
Unfortunately, while I could add audio data to the system of graphical packets (at an execution cost of only 10 CPU cycles per packet) adding that data would make that one frame take more than 0xE4 bytes of data, bringing it up to 0xF0 bytes, and then with my new interlaced “write to Zero Page, write to audio chip, write to Zero Page, write to audio chip” method of keeping the audio clean, that one frame would absolutely run out of time.
But what if there was a way to fix every one of these issues?

Better video, better audio. The big improvement. The second nametable.

After going through several hoops to improve the audio, I began wondering if I could up the resolution. After all, I initially created this under the assumption it would be 60FPS. If I’m rewriting the ASM code for the graphics with the knowledge that this is 30 FPS, I can utilize the second nametable. By utilizing the second nametable, I could spend 2 frames to draw it, then swap which nametable the screen is looking at. I would have 2 VBlanks at my disposal, doubling the amount I could draw for a visual frame!
Let’s do some math. With the current method, I can draw about 16*12 characters in a single VBlank, which is 192 characters. What’s the largest 4x3 integer ratio where dividing it by two would be less than 192? The answer is… 20 by 15. It’s an improvement, even if it’s a small one. Going a bit further, to 24 by 18 yields (24*18)/2 = 216, which is pretty close, but I don’t believe there’s enough time in a Vblank to make it happen. Let’s start by getting the 20 by 15 version working.
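The resolution search is easy to brute force. In this sketch, `largest_4_3_resolution` is a hypothetical helper, and the 192-tile limit is the 16*12 estimate from the paragraph above rather than a measured figure.

```python
TILES_PER_VBLANK = 192  # 16 * 12, the estimate from the paragraph above

def largest_4_3_resolution(limit=TILES_PER_VBLANK):
    """Largest 4:3 (width, height) in tiles whose half tile count is under `limit`."""
    best = None
    k = 1
    while 4 * k <= 32 and 3 * k <= 30:  # a nametable is 32 by 30 tiles
        width, height = 4 * k, 3 * k
        if (width * height) / 2 < limit:
            best = (width, height)
        k += 1
    return best

print(largest_4_3_resolution())  # (20, 15)
```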
There are some huge advantages to this Two-Nametable method. Previously I had to draw up to 192 tiles in a single VBlank. Now the limit is 150 tiles in one VBlank. That leaves me with even more time in a frame for the audio, and enough extra bytes on the zero page that I could add audio data into the packets! The more I think about it, the more obvious it is that I should have done this from the beginning. This fixes everything. Let’s make some modifications to the ASM code.
  • The rendered nametable needs to swap every 2nd frame.
  • The series of optimal write instructions to the nametables, as well as the jump tables to go there need to be extended to account for a row of up to 20 tiles.
  • The packets need to include 1 byte of audio data now.
After making these adjustments, I realized I still had 925 bytes in RAM being unused. That’s 45% of RAM! Keep in mind, this entire time I’ve been optimizing for the fewest CPU cycles, so my ASM code is already taking up way more bytes than necessary. Well, can I use these 925 bytes to improve anything? I’ve got another idea to improve the audio a bit.
What do you think sounds better?
  • 50% audio quality for 25% of the frame, followed by 100% audio quality for the rest.
  • 87% audio quality for 75% of the frame, followed by 100% audio quality for the rest.
My theory is the latter. I might be playing at 100% quality for a shorter amount of time, but I think the dips to 50% are going to be making the audio sound worse, even if it only happens for a very small amount of time. Let’s write some code for dynamic audio quality, based on the amount of bytes I need to write to the zero page!

Plan D to make the audio cleaner.

The time spent writing data to the zero page is currently interlacing one write to the audio chip, one write to zero page. Some frames require only a single write to the zero page, others will require upwards of 193. That means for some frames near the average, I might be able to interlace 2 writes to the audio chip, or even 3 writes for each byte depending on the amount of data I need to write to the zero page. The idea here will be to have multiple functions for interlacing audio and graphical data. Different functions will be used depending on how many writes to the zero page are required, and how many writes to the audio chip I can make between the graphical bytes.
In any other situation, any reasonable programmer would look at this code and be upset. Stackoverflow.com would have a field day. I would argue that in this particular situation, it would be foolish not to do this!
This was the final change made to the TAS.
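The selection between interlacing functions could be sketched roughly like this. Everything here is an assumption for illustration: the per-write costs are inferred from figures earlier in these notes (69 cycles per zero page byte, 129 with one interleaved APU write), the cap of three variants and the `cycles_available` values are made up, and the real code picks a function from the controller input rather than computing it on the console.

```python
ZP_WRITE_CYCLES = 69    # assumed: one controller read stored to the zero page
APU_WRITE_CYCLES = 60   # assumed: one extra controller read sent to $4011
MAX_APU_PER_BYTE = 3    # hypothetical number of interlacing variants

def pick_interlacing(zp_bytes, cycles_available):
    """Return how many APU writes to interleave after each zero page write."""
    for apu_writes in range(MAX_APU_PER_BYTE, 0, -1):
        cost = zp_bytes * (ZP_WRITE_CYCLES + apu_writes * APU_WRITE_CYCLES)
        if cost <= cycles_available:
            return apu_writes
    return 0  # not even one APU write per byte fits

print(pick_interlacing(0xE4, 29_500))  # 1
print(pick_interlacing(150, 29_500))   # 2
print(pick_interlacing(100, 29_500))   # 3
```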
The current order of operations is now:
  1. Wait for Vblank
    1. Disable rendering. (In theory this was unneeded, but good practice.)
  2. Make all the graphical changes
    1. Read the next byte from the zero page. Check for Terminator
    2. If it wasn't the terminator, check if the PPU Address needs to be moved.
      • If so, Write to $2006. Read the next byte, and also write to $2006.
    3. Read the next byte. Write PCM data to APU ($4011)
    4. Read the next byte. This is the length of data to write to PPU
    5. Determine if it’s ‘n’ identical tiles, or different tiles. (Is Length >= 0x80?)
      • If identical tiles, read the next byte. Write ‘n’ bytes to the PPU.
      • If different tiles, read the next ‘n’ bytes, writing each one to the PPU.
    6. Loop
  3. Post-Graphics stuff
    1. Set PPU Address to $2000 or $2400. This alternates which nametable is visible.
    2. Enable rendering.
  4. Read the controller to set up the RaceConditionFix / leave the loop when the video ends.
    • If ‘A’ is pressed, JMP to $0181. This exits the “Bad Apple Loop”.
    • If ‘B’ is pressed, waste a single CPU cycle at the end of the loop
  5. Read the controller to determine the length of the payload
    • Determine which interlacing function to use
  6. Interlaced writes to the audio chip and zero page.
  7. High quality audio until the frame ends
  8. Prepare to wait for Vblank
    • If needed, BNE 0 bytes. Taking the branch uses an extra cycle.
After setting this all up, I still have 414 unused bytes! I don’t know what else I could improve, so those bytes will just be left unused.
After these changes, the video is now 25% larger, and the audio sounds even better!
That’s “Objective 3” of this TAS complete!

Tangent: Okay, but what if we could improve the graphics again?

The next step up in graphical quality would be to increase the resolution to 24 by 18 tiles.
I did some math a few paragraphs ago and determined there’s not enough time with the current system of packets to make all of those changes before VBlank ends. That being said, I am still just estimating that it won’t be enough time with the packet system. The estimate that there won’t be enough time is based on the one problematic frame when I was using 12 packets of 20 bytes. This would require 9 packets of 29 bytes. (If I’m including the write to the APU in these packets)
As we’ve seen with previous examples, starting new packets takes up a lot of CPU cycles. If I’m using fewer packets, despite their larger size, is there enough time in VBlank to render it all? I opted to simply try and implement it rather than do some rough calculations. I want to be certain.
Now that a single row can have up to 24 different tiles, I’m thinking about ways I could optimize the big list of “LDA <$00, X”, “INX”, “STA $2007”. After all, running that 24 times is 240 cycles. Here’s what I’m thinking:
Remove the INX instructions. That’s 2 cycles each, adding up to 48 cycles if all 24 tiles need to be changed.
The pattern can now be:
	…
	LDA <$03, X ; Offset from three
	STA $2007
	LDA <$02, X ; Offset from two
	STA $2007
	LDA <$01, X ; Offset from one
	STA $2007
	LDA <$00, X ; Offset from zero
	STA $2007
Now an entire row of 24 tiles takes 192 cycles. This requires two changes to how the data is stored in the packets.
Change 1: The data in the packets here is now written in reverse order. So 21 80 00 83 33 22 11 00 will write the bytes, 00, 11, 22, 33 in that order.
Change 2: The X register will now need to be dramatically changed at the end of this sequence, somehow accounting for however many bytes were read. Since these bytes are already being read in reverse order, I’ll add that to be “index zero” in this list, so if the X register needs to be changed to, for example, 0x14, the previous packet would now look like this: 21 80 00 83 14 33 22 11 00
With that change, the end of the list, which now also updates the X register, looks like this:
	LDA <$03, X 	; Offset from three
	STA $2007
	LDA <$02, X 	; Offset from two
	STA $2007
	LDA <$01, X 	; Offset from one
	STA $2007
	LDA <$00, X 	; Offset from zero
	TAX 		; Fix the X register
Only 198 cycles for a stream of 24 bytes, when the previous code would’ve taken 240 cycles.
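A small simulation makes the reversed layout easier to follow. This Python sketch is only a model of the idea: `run_unrolled_tail` is a made-up name, the packet body is placed at an arbitrary zero page address ($40), and the cycle costs are the standard 4 cycles for LDA zero page,X, 4 for STA absolute, and 2 for TAX.

```python
def run_unrolled_tail(zero_page, x, tile_count):
    """Simulate the LDA <$nn,X / STA $2007 run followed by LDA <$00,X / TAX.

    Returns (bytes written to $2007, new X, cycles spent).
    """
    written = []
    cycles = 0
    for offset in range(tile_count, 0, -1):             # highest offset first
        written.append(zero_page[(x + offset) & 0xFF])  # zp,X wraps inside page zero
        cycles += 4 + 4                                 # LDA zp,X + STA $2007
    new_x = zero_page[x & 0xFF]                         # LDA <$00,X
    cycles += 4 + 2                                     # LDA zp,X + TAX
    return written, new_x, cycles

# The tail of the example packet: new X value first, then the tiles reversed.
zp = [0] * 256
zp[0x40:0x45] = [0x14, 0x33, 0x22, 0x11, 0x00]
tiles, new_x, _ = run_unrolled_tail(zp, 0x40, 4)
print([hex(t) for t in tiles], hex(new_x))  # ['0x0', '0x11', '0x22', '0x33'] 0x14
print(run_unrolled_tail(zp, 0x40, 24)[2])   # 198
```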
The largest possible packet is now 30 bytes, and there’s up to 9 packets in a frame. I probably could have done some really quick math before taking the time to optimize that function, because even 29*9 is greater than 256. I don’t have enough space on the zero page to store this much information!
Here’s the thing: If there are 9 packets (+ the terminator), I have enough time in Vblank to run 260 cycles per packet. So far, I’ve only accounted for the time spent after the jump to the huge list that writes to $2007, so let me add up the rest of the cycles in a packet. 57 cycles parsing through everything and jumping to the list, and 3 cycles to jump back to the beginning of the loop. 198 + 60 = 258 cycles, which is just barely under that 260 threshold I calculated.
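Putting the two constraints side by side shows where the plan breaks. The numbers in this sketch are the ones quoted in the text (57 cycles of parsing, 3 to jump back, a 260-cycle allowance, 29-byte packets); the variable names are just for illustration.

```python
ZERO_PAGE_SIZE = 256
PACKETS = 9
PACKET_BYTES = 29                 # 24 tiles plus header and audio byte
CYCLES_PER_PACKET_BUDGET = 260

tile_list_cycles = 24 * 8 + 6     # 24 LDA/STA pairs, then LDA + TAX
overhead_cycles = 57 + 3          # parsing and the jump back
packet_cycles = tile_list_cycles + overhead_cycles

print(packet_cycles, packet_cycles <= CYCLES_PER_PACKET_BUDGET)          # 258 True
print(PACKETS * PACKET_BYTES, PACKETS * PACKET_BYTES <= ZERO_PAGE_SIZE)  # 261 False
```

The cycle budget passes with 2 cycles to spare, but the data no longer fits on the zero page.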
I was going to suggest manually writing “less optimal” packets that require fewer bytes for these frames, but there simply are not enough cycles to spare to let that happen.

Alright then, the packet system doesn’t work here. Idea 2:

One idea I suggested was to replace the system of packets with an absolutely massive list of reading from the zero page and writing to the PPU. What if we’re only drawing 24 by 9 tiles in a single VBlank? That’s 216 bytes on the zero page (225 if you include the PCM data for the APU). Before we start seriously considering this option, let’s assume the PPU address pointers are magically updated each frame.
Could this even work? Let’s do some math.
The frame would begin with upwards of 5 cycles to exit the “spinning” loop. It would immediately begin to execute:
	LDA #PPU_Address_Pointer_Hi	; this magically changes between frames
	STA $2006			; $2006 expects the high byte first
	LDA #PPU_Address_Pointer_Lo	; this magically changes between frames
	STA $2006

	LDA <$00	; Read byte 0
	STA $2007	; Write to the PPU
	LDA <$01	; Read byte 1
	STA $2007	; Write to the PPU
	LDA <$02	; Read byte 2
	STA $2007	; Write to the PPU
	LDA <$03	; …
	STA $2007	; …
	; - snip -
	LDA <$17	; Read byte 23 (which is 0x17 in hexadecimal)
	STA $2007	; That’s the 24th byte. The row is complete

	LDA <$18	; Read the next byte
	STA $4011	; We still want audio data in this hypothetical code.

	LDA #PPU_Address_Pointer_Hi_2	; this magically changes between frames
	STA $2006
	LDA #PPU_Address_Pointer_Lo_2	; this magically changes between frames
	STA $2006	; this updates the PPU address pointer so we can skip all the blank tiles between the end of the previous row and the start of the next one.

	LDA <$19	; Read the next byte
	STA $2007	; Write to the PPU
	LDA <$1A	; …
	STA $2007	; …
	; - snip -	; and so on until we get through 9 rows
	LDA <$DF	; Read the next byte
	STA $2007	; The final PPU write

	LDA <$E0	; Read the next byte
	STA $4011	; The final APU write
(width+1)*(height/2)*(bytesPerLDA + bytesPerSTA) + (height/2)*(bytesPerLDA*2 + bytesPerSTA*2)
25*9*5 + 9*10 = 1215 bytes
(width+1)*(height/2)*(cyclesPerLDA + cyclesPerSTA) + (height/2)*(cyclesPerLDA*2 + cyclesPerSTA*2)
25*9*7 + 9*12 = 1683 cycles
Let’s see… This ASM code would take 1215 bytes of RAM, and it would take 1683 cycles to execute. Wait… Vblank is 2387 CPU cycles long. There’s an insane amount of cycles left over. Before I dare consider a 28 by 21 resolution, let’s see if there’s enough time in a frame to write 225 bytes to the zero page with decent audio quality.
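The byte and cycle totals can be reproduced with a short helper. This is a sketch built from standard 6502 instruction sizes and timings; `unrolled_cost` and its parameters are hypothetical names, and it assumes one PPU address update (two writes to $2006) per row.

```python
LDA_ZP = (2, 3)    # (bytes, cycles) for LDA zero page
LDA_IMM = (2, 2)   # (bytes, cycles) for LDA immediate
STA_ABS = (3, 4)   # (bytes, cycles) for STA absolute

def unrolled_cost(width, rows, audio_byte_per_row=True):
    """Return (bytes, cycles) for the fully unrolled nametable update."""
    copies_per_row = width + (1 if audio_byte_per_row else 0)
    copy_bytes = LDA_ZP[0] + STA_ABS[0]           # 5
    copy_cycles = LDA_ZP[1] + STA_ABS[1]          # 7
    addr_bytes = 2 * (LDA_IMM[0] + STA_ABS[0])    # 10, two writes to $2006
    addr_cycles = 2 * (LDA_IMM[1] + STA_ABS[1])   # 12
    total_bytes = copies_per_row * rows * copy_bytes + rows * addr_bytes
    total_cycles = copies_per_row * rows * copy_cycles + rows * addr_cycles
    return total_bytes, total_cycles

print(unrolled_cost(24, 9))                            # (1215, 1683)
print(unrolled_cost(24, 9, audio_byte_per_row=False))  # (1170, 1620)
```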
This method would no longer need the dynamic audio interlacing functions that were written for the previous method, since every frame would require the exact same amount of bytes for the zero page.
The controller read to prevent the $2002 race condition must still occur, but we don’t need to check for the length, since it will always be 225. This brings us to:
5	; Exiting the loop that waits for vblank
+1683	; Updating the nametable
+30	; Update the PPU address before the screen starts being drawn (this also checks for which nametable to draw for this frame)
+33	; Read only 2 bits from the controller. This is to prevent the race condition.
= 1751
This loop writing to the zero page / APU will begin 1751 cycles after the start of the frame. If it takes 129 cycles each loop, and a frame only has 29780.5 cycles, is there enough time to write 225 bytes?
(29780-1751) - (129*225) = -996
Nope! In order to render the video at 24 by 18 tiles, I would need to make massive audio sacrifices. It’s remarkably close though. Granted, this code still doesn't change the PPU Address pointers that are currently changing by "magic".
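For reference, here is the same budget as a few lines of Python, including how many zero page bytes would actually fit at 129 cycles each. The constants are the ones quoted above; the variable names are only for illustration.

```python
FRAME_CYCLES = 29780
LOOP_CYCLES = 129       # one zero page write plus one APU write
ZP_BYTES = 225

setup = 5 + 1683 + 30 + 33  # spin exit, nametable, PPU address, race fix
remaining = FRAME_CYCLES - setup - LOOP_CYCLES * ZP_BYTES

print(setup)                                  # 1751
print(remaining)                              # -996
print((FRAME_CYCLES - setup) // LOOP_CYCLES)  # 217
```

So the loop runs out of time 8 bytes short of the 225 it needs.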

What if I unroll the loop that interlaces APU writes and ZP writes as much as possible?

Instead of:
  1. Write to zero page
  2. Write to APU
  3. Branch back
It would be like:
  1. Write to zero page
  2. Write to APU
  3. Write to zero page
  4. Write to APU
  5. Write to zero page
  6. Write to APU
  7. Write to zero page
  8. Write to APU
  9. Branch back
And repeat that as much as possible without branching. This saves at most 50 cycles, which is nowhere near enough.

What if I remove the writes to the APU during VBlank?

This reduces both cycles spent inside VBlank and the required bytes written to the zero page. After these changes there’s a few hundred CPU cycles left over, but this doesn’t even account for the magic code that would need to update the PPU address pointers that are used during VBlank.
To put it bluntly, there’s not going to be a way to make the resolution any larger without completely ruining the audio.
Therefore, the TAS shall remain at 20 by 15 tiles!

Bonus Tangent: If I were to remove all audio, what’s the largest resolution possible at 30FPS?

Let’s take the ridiculous idea from the previous section to its logical extreme. First off, could we get the full 32 by 24 tiles? (Remember, the video is at a 4:3 aspect ratio)
The PPU Address no longer needs to be updated between rows, since there are now zero bytes between rows. We're looking at a series of "LDA Zero Page" (2 bytes) and "STA $2007" (3 bytes). We're updating half of the 32*24 in one VBlank, so that would be (2+3)*32*12 = 1920.
It would take 1920 bytes, and 2688 cycles to execute! Well, that’s how much it would be in some fantasy land wherein the zero page could be extended to 384 bytes. Since that’s not the case, many of those two-byte LDA instructions (taking 3 cycles) become three-byte instructions (taking 4 cycles). By making these adjustments, we’re now looking at exactly 2048 bytes. That's exactly the entirety of RAM. No room to read the controller, and no room for the 384 bytes needed to update the nametables. Worse yet, it would take longer than all the available time in Vblank to execute.
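The adjusted totals can be checked with a few lines of Python. This sketch assumes the best case where a full 256 tile bytes live on the zero page and the remaining 128 are read with LDA absolute (3 bytes, 4 cycles); the constant names are made up for the example.

```python
RAM_BYTES = 2048
VBLANK_CYCLES = 2387           # the VBlank length used in these notes
TILES = 32 * 12                # half of 32 by 24
ZP_TILES = 256                 # at most this many bytes fit on the zero page
ABS_TILES = TILES - ZP_TILES   # the rest need LDA absolute

code_bytes = ZP_TILES * (2 + 3) + ABS_TILES * (3 + 3)
cycles = ZP_TILES * (3 + 4) + ABS_TILES * (4 + 4)

print(code_bytes, code_bytes + TILES <= RAM_BYTES)  # 2048 False
print(cycles, cycles <= VBLANK_CYCLES)              # 2816 False
```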

What if I use "STA Indirect, Y" instructions to save space?

By sacrificing 2 bytes on the zero page, you could have a pointer to address $2007. Now the 3 byte STA instructions can be written as 2 bytes, at the cost of more CPU cycles. (I know this is already over the CPU cycle budget for this Vblank, but let's humor the idea for a bit)
Before, the entirety of RAM was filled with:
LDA <$00	; 2 bytes
STA $2007	; 3 bytes
LDA <$01	; 2 bytes
STA $2007	; 3 bytes
; ...
It could be replaced with
LDA <$00	; 2 bytes
STA ($FE),Y	; 2 bytes
LDA <$01	; 2 bytes
STA ($FE),Y	; 2 bytes
; ...
This brings the total byte count for our code down to 1666 bytes, taking 3586 cycles (STA Indirect, Y always takes 6 cycles). We need 384 bytes of data, and 2 bytes for that pointer to address $2007, which brings the total to 2052 bytes. Ooh! So close. There's still no room for code that can read the controller to write all this data though.
At the cost of even more CPU cycles, we could write a loop for all the non-zero page bytes. Of course, we were unrolling this for a reason, but there's no reasonable way to save around 1200 cycles here.

What about updating graphics in HBlank?

So, Horizontal blank contains about 20 CPU cycles where the PPU Address isn't used/updated. Ideally, you would change the PPU address, write something, then change it back to what it needs to be for the next scanline.
LDA #$21
LDX #$80
LDY #$11
; HBlank begins
STA $2006 ; 4 cycles in
STX $2006 ; 8 cycles in
STY $2007 ; 12 cycles in
; Now we need to set the PPU address to whatever it needs to be for the next scanline
LDA #$23 ; 14 cycles in
STA $2006 ; 18 cycles in
LDA #$E0 ; 20 cycles in... uh oh
; The PPU has begun reading from the VRAM to prepare the first tiles of the next scanline
STA $2006 ; 24 cycles, too late!  
So that's not happening.
Anyway, let's try and answer the question posed by this tangent, and drop the resolution.

How about 28 by 21?

336 bytes of data, so again, we're going to need to use more space than the zero page has to offer. Let's use STA Absolute instead of those indirect instructions. We need those CPU cycles.
(width*height*ZPInstructions) + (width*remainingHeight*AbsoluteInstructions) = 28*9*5 + 28*3*6 = 1764 bytes (+336 bytes of data = 2100) , taking 2436 cycles. Also, we'll need to move the PPU address between rows here, and seeing how this is already unrealistic without that addition, let's move on.

So it's 24 by 18?

We already confirmed it was possible to do this, at the cost of audio quality, but let's just recheck the math. This example will omit the audio data from the packets.
24*9 < 256, so this can all fit inside the zero page.
LDA Zero Page and STA Absolute take a total of 5 bytes and 7 CPU cycles.
For each row, the PPU address needs to be moved, which takes 10 bytes and 12 CPU cycles.
24*9*5 + 9*10 = 1170 bytes.
24*9*7 + 9*12 = 1620 cycles.
This is comfortably within VBlank. It looks like this might be the best resolution possible for a 4:3 video played in SMB at 30 FPS.
Finding a way to render 30FPS 32x24 tile videos in Super Mario Bros. has been left as an exercise for the reader.

A reader did the exercise.

A user in the YouTube comment section named Lorenz17 suggested replacing the LDA Zero Page and LDA Absolute instructions with LDA Immediate, which would save both bytes and cycles. This would take 1920 bytes (which is less than 2048) and 2304 CPU cycles (which is less than 2387), making it all fit! One small issue: with 1920 bytes taken, only 128 are left for all remaining code. 62 of these bytes are the initial code-writing payload, so we're down to 66 remaining bytes in RAM.
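Here is that budget as a quick Python check. It is a sketch with names of my own choosing. The only facts it relies on are that LDA Immediate plus STA Absolute is 5 bytes and 6 cycles per tile, and the RAM, VBlank and payload sizes quoted above.

```python
RAM_SIZE = 2048         # bytes of NES work RAM
VBLANK_CYCLES = 2387    # VBlank budget quoted above
INITIAL_PAYLOAD = 62    # bytes used by the code-writing payload

TILES = 384             # tiles updated per frame at 32x24

# LDA #imm (2 bytes, 2 cycles) + STA $2007 (3 bytes, 4 cycles)
code_bytes = TILES * 5
cycles = TILES * 6
free_bytes = RAM_SIZE - code_bytes - INITIAL_PAYLOAD

print(code_bytes, "bytes,", cycles, "cycles,", free_bytes, "bytes left over")
```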
I decided to take this challenge anyway, and see what I could do with this space.
AddressZero:
	<Prevent2002>		
	<WriteLo>
	<WriteHi>	
	<FrameCount>	;-4 bytes of available space
Nametabler:
	LDA #<PPUHi>
	STA $2006
	LDA #<PPULo>
	STA $2006	
	 
	LDA #<Byte1>
	STA $2007
	LDA #<Byte2>
	STA $2007
	; ...
	; snip ; This would be the 1920 bytes for updating the nametables.

	; We have about 70 CPU cycles before VBlank ends.
	; Update the PPU address to look at the correct nametable
	LDX #$20			; 2 bytes
	INC <FrameCount		 	; +2 = 4
	LDA <FrameCount			; +2 = 6 
	AND #1				; +2 = 8 
	BEQ Use20			; +2 = 10
	LDX #$24			; +2 = 12
Use20:
	STX $2006			; +3 = 15
	STY $2006			; +3 = 18 		; (Y=0)

	; Set up data writing pointer
	STY <WriteLo			; +2 = 20		;  It begins by overwriting "Prevent2002". 5 bytes later is PPUHi
	STY <WriteHi			; +2 = 22  		;  5 more bytes is PPULo, then 5 more is Byte1 and so on.

Controller:
	LDA #1				; +2 = 24
	STA $4016			; +3 = 27
	LSR A				; +1 = 28
	STA $4016			; +3 = 31
	LDX #8				; +2 = 33
Loop:
	EOR $4016			; +3 = 36
	DEX				; +1 = 37
	BNE Loop			; +2 = 39
	EOR #$C0			; +2 = 41

	BEQ PostLoop			; +2 = 43		; never draw tile 0 to the nametable
								; instead, use zero to exit the controller reading loop.
								; store at the "write pointer" then update it
	STA (WriteLo),Y			; +2 = 45
	LDA <WriteLo			; +2 = 47
	CLC				; +1 = 48
	ADC #5				; +2 = 50		; Move the pointer 5 bytes
	STA <WriteLo			; +2 = 52
	TYA				; +1 = 53		; (Y=0)
	ADC #0				; +2 = 55		; Update the high byte if needed.
	STA <WriteHi			; +2 = 57
	BPL Controller			; +2 = 59		; always branch

PostLoop:
	LDA <Prevent2002		; +2 = 61
	BNE #0				; +2 = 63
	CMP #$FF			; +2 = 65
	BNE Wait4VBlank		; +2 = 67
	JMP TotalControl		; +3 = 70

Wait4VBlank:
	LDA $2002			; +3 = 73
	BPL Wait4VBlank			; +2 = 75
	JMP Nametabler			; +3 = 78 		; 12 bytes over budget! (and 4 more bytes at address $0000)
Oh no! I used 16 too many bytes! Luckily, I have an ace up my sleeve. 18 bytes from the initial code-writing payload could be removed, then once I finally jump back to the code writing routine, I could simply use it to re-complete the payload!
As for actually testing this code, I ran into new fun issues. Recall when I said that overwriting all of RAM would require some finesse to overwrite the code-writing loop? I'm able to write a copy of it elsewhere and execute that, but it uses the stack. If I'm overwriting the entire stack, there are going to be bytes inside the stack that are being updated by the JSR instruction.
Never mind that- let's imagine we were actually able to set this code up and run it. How many cycles would it take to overwrite the operands for all 384 LDA instructions? This controller reading + pointer updating loop will take 115 cycles each iteration. To make a long story short, there's not enough time in the rest of the frame to update all those with this code. 115*384 > (29780 - 2304)
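The inequality above, spelled out in Python. This is just the arithmetic from the paragraph; the variable names are mine.

```python
CPU_CYCLES_PER_FRAME = 29780   # frame length used above
NAMETABLE_CYCLES = 2304        # the 384 unrolled LDA/STA pairs
LOOP_CYCLES = 115              # one controller read + pointer update
OPERANDS = 384                 # LDA operands to overwrite each frame

needed = LOOP_CYCLES * OPERANDS
available = CPU_CYCLES_PER_FRAME - NAMETABLE_CYCLES
max_operands = available // LOOP_CYCLES

print(needed, "cycles needed,", available, "available")
print("operands that fit in one frame:", max_operands)
```

So at 115 cycles per operand, this loop could only refresh a bit over half of the tiles in the time left in a frame.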
It might still be possible, but once again, it's an exercise for the reader.

How the TAS ends

I did it! The entire Bad Apple music video is played at the best possible audio and video quality I could manage. What happens next? To make the TAS have a nice ending, it should return the game to a stable state. Recall that the controller read that is used to prevent the $2002 race condition is also used to check if the video has ended, in which case it will run JMP $0181, taking the PC back to the original total control code writing loop. Some new code is written to spoof the level transition screen, set the nametable to mimic the end of 8-4, then almost every byte in RAM is modified to be a copy of RAM from a state where Mario is about to land on the axe in 8-4. Some bytes were specifically modified to set Mario’s position, the score, etc. (I modify Mario’s score to be the maxout of “9999990”, satisfying “Objective 4”.) Once RAM is all set up, the NMI is enabled and the PC jumps to where SMB waits for the next frame.
When the next frame runs, it’s almost as if nothing had happened at all! Mario gracefully falls to the axe, defeating Bowser and saving the princess. Mario touches the axe with a time of 4:53.78 (17656 frames) since the console boot. (Including the time spent inside Super Mario Bros. 3 setting up the RAM pre-cart-swap) This satisfies the final objective of this TAS!
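For anyone who wants to check the frame count against the time, here is a small Python sketch. It assumes the usual NTSC NES frame rate of 39375000 / 655171 frames per second (about 60.0988); the function name is mine.

```python
from fractions import Fraction

# NTSC NES frame rate, about 60.0988 frames per second.
NTSC_FPS = Fraction(39375000, 655171)


def frames_to_time(frames):
    """Format a frame count as minutes:seconds.hundredths."""
    seconds = float(frames / NTSC_FPS)
    minutes = int(seconds // 60)
    return "%d:%05.2f" % (minutes, seconds - minutes * 60)


print(frames_to_time(17656))
```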

Fun details about the creation of this TAS

Since my ASM code cannot be compiled, I had to manually convert it into bytes and place those bytes on the TAStudio timeline. To optimize this process, I wrote a program to turn a string of bytes into TAStudio’s input string format. Some smaller modifications to the code were simply done by modifying the inputs on the timeline. Changes to the code were made frequently while experimenting with the audio, which could occasionally make a perfectly fine branch instruction cross a page boundary, (every time the branch is taken, it costs 1 extra CPU cycle) throwing off my calculations for how many inputs I could fit before the end of a frame. Remember, all 5.8 million inputs of this TAS were generated by a program I wrote, which means I had to know exactly how many cycles I had to spare before exiting the audio loop, jumping to the “wait for VBlank” loop.
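As a rough idea of what that conversion involves, here is a minimal Python sketch that turns bytes into the buttons that must be held for each controller read. It is not the program I used, and it does not reproduce TAStudio's actual input string format. It only assumes the standard controller read order (A, B, Select, Start, Up, Down, Left, Right), with the first bit read ending up in bit 7 of the byte.

```python
# Order in which a standard controller reports its buttons.
# The first bit read lands in bit 7 of the assembled byte.
BUTTONS = ["A", "B", "Select", "Start", "Up", "Down", "Left", "Right"]


def byte_to_buttons(value):
    """Buttons that must be held for one 8-bit controller read to yield `value`."""
    return [name for i, name in enumerate(BUTTONS) if value & (0x80 >> i)]


def bytes_to_inputs(payload):
    """One list of held buttons per byte of the payload."""
    return [byte_to_buttons(b) for b in payload]


# "LDA #$80" is the two bytes A9 80.
for held in bytes_to_inputs([0xA9, 0x80]):
    print(held)
```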
Since I cannot use the NMI to wait for a new frame, I need to rely on reading address $2002 to check for VBlank, which can occasionally yield a false negative. Bit 7 is set to 1 once VBlank has begun, and reading from this address automatically clears bit 7. There is a PPU-cycle-specific window in which the read sees bit 7 still clear (VBlank has not quite begun), VBlank begins during the read, and the read then clears bit 7 anyway, so the flag is never seen that frame. To work around this, every frame spends a single controller read (either 0 or 1), and if “1” is read, I waste a single CPU cycle by branching 0 bytes. Since the branch is taken, one extra CPU cycle is spent moving the PC zero bytes, which makes the read instruction land on a different PPU cycle and prevents the race condition.
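A toy Python model of why one wasted CPU cycle is enough. This is a simplification and not how an emulator models the PPU. It assumes 3 PPU dots per CPU cycle (NTSC), a 7-cycle polling loop (LDA $2002 plus a taken BPL), and a single hypothetical "bad" dot on which a read misses VBlank. The dot numbers in the example are made up.

```python
PPU_DOTS_PER_CPU_CYCLE = 3   # NTSC
POLL_LOOP_CYCLES = 7         # LDA $2002 (4) + BPL taken (3)


def poll_dots(first_read_dot, count):
    """PPU dots on which successive reads of $2002 sample the flag."""
    step = POLL_LOOP_CYCLES * PPU_DOTS_PER_CPU_CYCLE
    return [first_read_dot + i * step for i in range(count)]


def hits_race(first_read_dot, bad_dot, count=100):
    """True if any poll lands exactly on the dot that loses the flag."""
    return bad_dot in poll_dots(first_read_dot, count)


def with_delay(first_read_dot, extra_cpu_cycles):
    """Shift the whole polling loop later by some CPU cycles."""
    return first_read_dot + extra_cpu_cycles * PPU_DOTS_PER_CPU_CYCLE


bad_dot = 1000    # hypothetical position of the race window
unlucky = 790     # 10 polls later, a read lands exactly on dot 1000

print(hits_race(unlucky, bad_dot))                  # the unlucky alignment
print(hits_race(with_delay(unlucky, 1), bad_dot))   # same loop, 1 cycle later
```

Because the polling period is a whole number of CPU cycles, a one-cycle shift moves every read by 3 dots, so none of them can land on the same dot as before.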
I got to learn about how the .wav file format works during the creation of this TAS, so I could convert the 16 Bit PCM data of a .wav file to 7 Bit data for the audio chip. I’ve also recently learned about the inner workings of the .gif format, and I’m enjoying learning about how file formats store their data.
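For the curious, the core of a 16-bit to 7-bit conversion can be sketched in a few lines of Python. This is a minimal illustration and not my actual converter. It assumes signed little-endian 16-bit samples (what PCM .wav files store), skips the .wav header entirely, and does no resampling or dithering.

```python
import struct


def unpack_pcm16(raw):
    """Decode raw little-endian signed 16-bit sample bytes into integers."""
    count = len(raw) // 2
    return list(struct.unpack("<%dh" % count, raw[:count * 2]))


def pcm16_to_7bit(samples):
    """Map signed 16-bit samples (-32768..32767) to unsigned 7-bit (0..127)."""
    return [(s + 32768) >> 9 for s in samples]


raw = b"\x00\x80\x00\x00\xff\x7f"   # lowest, middle and highest sample
print(pcm16_to_7bit(unpack_pcm16(raw)))
```

Keeping the result below 0x80 matters here, since the audio loop uses bit 7 of the byte it reads as its exit flag.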
While submitting this TAS, I ran into an issue, as the file size was larger than the maximum allowed size. By a pretty hefty amount too. The .bk2 file was 15 MB, and there's a 2 MB limit. With the help of 7z, I could get it down to 8.3 MB, which is still not nearly enough. Spikestuff was able to help me compress it into a 5.6 MB file, though it uses a compression method that Bizhawk is unable to parse.

Conclusion

This is, without hyperbole, my entire knowledge about the NES. 6502 Assembly, unofficial opcodes, ASM optimizations, controller reading, the audio chip, PCM audio, writing to the nametables, color palettes, attribute tables, open bus, uninitialized RAM, cartridge swapping, the address $2002 VBlank race condition, how many bytes and CPU cycles each instruction takes, HBlank, and more! It all comes together to form a music video being played inside an unmodified cartridge of Super Mario Bros.
I’m curious to know if a video using all 4 color palettes could be achieved. My process for optimally converting an image into “mario tiles” was easy, since I was working with greyscale images. It’s an entirely different beast if I need to consider all the possible colors, and how the attribute tables would assign palettes. That’s definitely worth exploring.
This website makes me do silly things.
Suggested screenshot (Frame 1,502,323 / Vblank 7768)

Special Thanks:

The NESdev wiki for its outstanding documentation of the NES.
Total for informing me of the extra fast controller reading with EOR.
Alyosha and Bigbass for their help last April replaying my SMB1 ACE proof of concept TAS on console and helping me discover the open bus inaccuracies inside Bizhawk pre-version-2.9.
DwangoAC for help console verifying this TAS, and his incredible patience as we ran into wall after wall debugging. We weren't able to play much of the video before desyncing, but the time spent working on this is greatly appreciated!
Bigbass again for additional help with console verification, and his incredible patience as we also ran into many walls debugging. We were able to get around 10 seconds in before something desynced, which, all things considered (using $2002 to check for VBlank, which is PPU cycle dependent) was amazing!
Spikestuff and CasualPokePlayer for help compressing this massive TAS to allow it to be uploaded.
negative_seven for her research into SMB1 ACE, which helped during my own investigations.
Alyosha again for console verifying this TAS!

Here’s the ASM code:

This was written before cartridge swapping to Super Mario Bros.
It can be written with a subframe TAS of SMB3 in 47 frames.
Initial payload: (This lets me write more ASM code in RAM)
			; $16B, The controller reading loop. 
ReadController:
	LDA #$01
	STA $4016	; Strobe the controller
	STA <$C0
	LSR
	STA $4016	; Strobe the controller part 2
ReadBit:
	LDA $4016	; Read one button
	LSR		; Move it into the carry flag
	ROL <$C0	; This isn’t the most optimal controller reading routine
	BCC ReadBit	; But it takes up fewer bytes, which is important here.
	LDA <$C0
	RTS		; This specific part is a subroutine.

Setup:			; $181
	LDX #$02	; this is the first line executed when the ACE begins. X=2
Loop1:
	JSR $016B	; Read the controller
	STA <$C1, X	; Store at C3, C2, and C1. That’s the 16 bit pointer and payload length.
	DEX		; X--
	BPL Loop1	; Loop unless X is 0xFF
	LDY #$00	; Y=0
Loop2:
	JSR $016B	; Read controller
	STA ($C2), Y	; Store at the target address with offset Y.
	INY		; Y++
	CPY <$C1	; If Y < Payload Length
	BCC Loop2	; Loop. This will loop until the payload is finished.
			; Post loop:
	JSR $016B	; Read controller. Do we write more code, or jump somewhere?
	BEQ Setup	; If zero, we got more code to write. Branch back to $181, the setup.

	JSR $016B	; Read controller. Let’s figure out where we’re jumping
	STA <$C3	; Store at $C3 (The high byte)
	JSR $016B	; Read controller
	STA <$C2	; Store at $C2 (The low byte)
	JMP ($00C2)	; This indirect jump goes wherever we just pointed.
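To make the loader's input protocol concrete, here is a small Python model of it: an encoder that builds the byte stream to feed through the controller, and a tiny simulator of the Setup/Loop1/Loop2 logic above. Both functions are illustrations of mine and not tools used for the TAS. The model ignores RAM mirroring and does not stop you from overwriting $C1-$C3.

```python
def encode_payload(address, data, jump_to=None):
    """Controller bytes that make the loader write `data` starting at `address`.

    Each chunk is: pointer high, pointer low, length, then the data bytes,
    then a flag byte (zero = more chunks follow, nonzero = jump address follows).
    """
    if not data:
        raise ValueError("the loader always writes at least one byte")
    stream = []
    chunks = [data[i:i + 255] for i in range(0, len(data), 255)]
    for n, chunk in enumerate(chunks):
        target = address + n * 255
        stream += [target >> 8, target & 0xFF, len(chunk)]  # -> $C3, $C2, $C1
        stream += list(chunk)
        if n == len(chunks) - 1 and jump_to is not None:
            stream += [0x01, jump_to >> 8, jump_to & 0xFF]  # nonzero, then hi, lo
        else:
            stream.append(0x00)                             # zero: back to Setup
    return stream


def run_loader(stream, ram):
    """Model of the loader; fills `ram` (a dict) and returns the jump target."""
    reads = iter(stream)
    while True:
        hi, lo, length = next(reads), next(reads), next(reads)
        pointer = (hi << 8) | lo
        y = 0
        while True:                      # Loop2 always runs at least once
            ram[(pointer + y) & 0xFFFF] = next(reads)
            y = (y + 1) & 0xFF
            if y >= length:              # CPY <$C1 / BCC Loop2
                break
        if next(reads) != 0:             # BEQ Setup not taken
            hi, lo = next(reads), next(reads)
            return (hi << 8) | lo


ram = {}
stream = encode_payload(0x0300, [0xA9, 0x00, 0xAA], jump_to=0x0300)
print([hex(b) for b in stream])
print(hex(run_loader(stream, ram)), ram)
```

Since the length is a single byte, anything longer than 255 bytes has to be split into several chunks, which is what the encoder does.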
The setup code: (Clear the nametable, set up the color palette, halt audio, etc.)
			; I wrote this at address $300
BadAppleSetup:
	LDA #$00	; A=0, This will be used to silence the APU
	TAX		; X=0
APULoop:
	STA $4000, X	; Store 0 at the APU register
	INX		; X++
	CPX #$10	; If X!=0x10
	BNE APULoop	; Loop.

	STX $400C	; I also want to write 0x10 to address $400C
	STA $2005	; Set the fine scroll X. A still equals 0.
	STA $2005	; Sets fine scroll Y. This step gets overwritten by writing to $2006, oops!
VBLoop:
	LDA $2002	; Wait for Vblank
	BPL VBLoop	; Loop

	LDA #$00	; A=0
	STA $2001	; Disable rendering
	LDA #$10	; A=0x10
	STA $2000	; PPU address increment of 1, background tiles from pattern table $1000, NMI off.
	LDA #$20	; Set up PPU Read/Write Address
	STA $2006	; High byte is 0x20
	LDA #$00	; Low byte is 0x00
	STA $2006	; Address set to $2000
	LDA #$27	; A = 0x27. This represents a blank tile
	LDX #$00	; X=0
	LDY #$04	; Y=4
NTLoop:
	STA $2007	; Write 0x400 bytes to $2000
	DEX		; X--
	BNE NTLoop	; if X != 0, loop
	DEY		; Y--
	BNE NTLoop	; if Y != 0, loop

	LDA #$23	; Set up PPU Address to $23C0
	STA $2006	; This is the attribute table.
	LDA #$C0	; We want the attribute table to be all zeroes.
	STA $2006	; PPU Address = $23C0
	LDA #$00	; A=0
	LDX #$40	; X=0x40
ATLoop:	
	STA $2007	; Write to attribute table
	DEX		; X--
	BNE ATLoop	; If X != 0, loop
	LDA #$24	; We need to clear a second name table.
	STA $2006	; This is pretty much the same code as before
	LDA #$00	; but using different PPU addresses.
	STA $2006	; PPU Address = $2400
	LDA #$27	; A=0x27, the blank tile.
	LDX #$00	; X=0;
	LDY #$04	; Y=4;
NTLoop2:
	STA $2007	; Write to nametable
	DEX		; X--
	BNE NTLoop2	; if X != 0, loop.
	DEY		; Y--
	BNE NTLoop2	; if Y != 0, loop

	STA $2006	; And the attribute table. A is already 0x27, which was the tile ID.
	LDA #$C0	; Remarkably, I’m setting up the address to $27C0
	STA $2006	; So I didn’t need to change A for the first write to $2006.
	LDA #$00	; A = 0
	LDX #$40	; X= 0x40
ATLoop2:
	STA $2007	; Write to attribute table
	DEX		; X--
	BNE ATLoop2	; if X != 0, loop

VBLoop2:
	LDA $2002	; Wait for Vblank. I’ll change the background color while rendering is disabled.
	BPL VBLoop2	; If not VBlank, loop. This would have a visible artifact if outside Vblank.

	LDA #$3F	; Palette stuff.
	STA $2006	; Set up PPU Address to $3F00
	LDA #$00	; We’re only going to change a single palette
	STA $2006	; The PPU Address is all set up.
	LDA #$30	; Color $30, white
	STA $2007	; Write to palettes
	LDA #$10	; Color $10, light grey
	STA $2007	; Write to palettes
	LDA #$00	; Color $00, grey
	STA $2007	; Write to palettes
	LDA #$0F	; Color $0F, black
	STA $2007	; Write to palettes
	LDA #$0E	; Enable rendering
	STA $2001	; This is still in VBlank, so it’s okay to do things in this order.
	LDA #$20	; Set up PPU Address to $2000
	STA $2006	; This is used for rendering the screen with $2000 in the upper left.
	LDA #$00	; Right now, the entire screen will be blank.
	STA $2006	; Final write to $2006 in the setup.
	JMP $0181	; Jump back to the initial payload. We have more to write.
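The values written to $2001 in these listings ($00, $0E, and later $1E) are bit flags. Here is a small Python decoder for them, as a reading aid. The bit meanings are the NTSC ones documented on the NESdev wiki; the function name is mine.

```python
# Bit 0 first. Meanings of the $2001 (PPUMASK) bits on an NTSC NES.
PPUMASK_BITS = [
    "greyscale",
    "show background in leftmost 8 pixels",
    "show sprites in leftmost 8 pixels",
    "show background",
    "show sprites",
    "emphasize red",
    "emphasize green",
    "emphasize blue",
]


def decode_ppumask(value):
    """List the flags that are set in a value written to $2001."""
    return [name for bit, name in enumerate(PPUMASK_BITS) if value & (1 << bit)]


for value in (0x00, 0x0E, 0x1E):
    print("$%02X:" % value, decode_ppumask(value))
```

So $0E turns the background on without sprites, which is all the video needs, and $1E turns both on.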
The Bad Apple Loop:
 	 		; I wrote this at address $300
Top:
	LDA $100	; Load from Address $100. If non-zero, waste a single CPU cycle.
	BNE #00		; Branch 0 bytes. If taken, it costs 1 extra cycle.
VBLoop:
	LDA $2002	; Wasting 1 cycle can be used to prevent the false-negative read.
	BPL VBLoop	; Wait for Vblank.
	LDA #$00	; A=0
	STA $2001	; Disable rendering
Main:			; This loop is at address $30F.
	LDA <$00, X	; Read the first byte of a packet. The first time this runs, X = 0.
	BMI Terminator	; If A == 0xFF, we exit this graphical loop, “Main”.
	BEQ NoPPU	; if A == 0, We don’t change the PPU read/write address.
	STA $2006	; We are changing the read/write address. Store the packet’s first byte.
	INX		; X++
	LDA <$00, X	; Read the next byte of the packet.
	STA $2006	; Also store this one to address $2006. The PPU address is set up.
NoPPU:
	INX		; X++
	LDA <$00, X	; Read the next byte. This will be PCM data for the audio chip.
	STA $4011	; Store it at $4011, writing to the audio chip.
	INX		; X++
	LDY <$00, X	; Read the next byte (length) into the Y register.
	BMI Indies	; If the length is >= 0x80, it’s a group of individual tiles.
	INX		; otherwise, X++
	LDA $0740, Y	; There’s a LookUp Table at $740, grab the right byte
	STA $0101	; Store at $101 (Address 102 = 0x07, so this will always point somewhere in page 7)
	LDA <$00, X	; Read the next byte (what tile to draw multiple times)
	JMP ($0101)	; Jump to a huge chain of “STA $2007” over and over.

Indies:
	INX		; X++
	LDA $074F, Y	; There’s a LookUp Table at $7CF. LDA from $74F, since bit 7 of Y is set, adding 0x80.
	STA $0101	; Store at $101 (Address 102 = 0x07, so this will always point somewhere in page 7)
	JMP ($0101)	; Jump to a huge chain of “STA $2007, INX, LDA <00,X” over and over.

	NOP		; I modified the code here, but I kept the length the same. This NOP never runs.

Terminator:		; If the first byte of a packet is 0xFF, it branches here.
	INC $0103	; Increment Address $103
	LDA $0103	; Address $103 is used to track which nametable the screen is looking at.
	AND #$03	; Only look at bits 0 and 1.
	TAX		; X = bits 0 and 1 of $103.
	LDA #$20	; A = 20
	CPX #$02	; if X < 2
	BMI Use20	; Branch, and write 0x20 to Address $2006
	ORA #$04	; Otherwise, we’re writing 0x24 there.
Use20:
	STA $2006	; Store 0x20 or 0x24 to $2006.
	LDA #$00	; A=0
	STA $2006	; Store 0 to $2006. The PPU Address is now either $2000 or $2400
	NOP		; I used to have a different system for swapping the nametable.
	NOP		; I replaced it with this code, which is more optimal, and took fewer bytes.
	NOP		; I didn’t want to figure out where all those LookUp Tables would end up.
	NOP		; I’m not actually compiling this code, so moving pointers is a pain.
	NOP		; More importantly though, the audio playing loop would cross a page boundary.
	NOP		; Then it would take 1 extra cycle each loop, breaking a lot of my math.
	LDA #$0E	; Anyway, A = 0x0E
	STA $2001	; Enable rendering.
	LDA #$01	; A=1
	STA $4016	; Strobe controllers
	LSR A		; A=0
	STA $4016	; Strobe controllers part 2
	LDA $4016	; Read from controller port.
	ASL A		; Shift the bits
	EOR $4016	; EOR next bit for controller
	AND #$0F	; Drop the open bus bits. Only bits 0 and 1 hold controller data here.
	STA $0100	; Store this at address $100. This is used to waste a single cycle before Vblank.
	CMP #$02	; This is also used to exit the loop, if A = 2.
	BNE SkipJMP	; If A != 2, branch ahead 3 bytes
	JMP $0181	; Jump out of the Bad Apple Loop, back to the total control loop.

SkipJMP:
	LDA #$01	; A=1
	STA $4016	; Strobe controllers
	LSR A		; A=0
	STA $4016	; Strobe controllers part 2
	EOR $4016	; Very fast controller reading with EOR
	ASL A		; Shift bits
	EOR $4016	; Read from controller
	ASL A		; Shift bits
	EOR $4016	; Read from controller
	ASL A		; Shift bits
	EOR $4016	; Read from controller
	ASL A		; Shift bits
	EOR $4016	; Read from controller
	ASL A		; Shift bits
	EOR $4016	; Read from controller
	ASL A		; Shift bits
	EOR $4016	; Read from controller
	ASL A		; Shift bits
	EOR $4016	; Read from controller
	EOR #$C0	; Exclusive or with 0xC0 to flip those bits that were affected by open bus.
	TAX		; X = the data read from the controller. That’s the length of data for the zero page.
	LDY #$00	; Y = 0
	CMP #$91	; If the Data Length >= 0x91
	BCS IMode1	; Branch so Y isn’t incremented at all. This will jump to “Interlace Mode 1”.
	CMP #$6E	; if the Data Length >= 0x6E
	BCS IMode2	; Branch to set Y up for a jump to “Interlace Mode 2”		
	CMP #$59	; If the Data Length >= 0x59
	BCS IMode3	; Branch to set Y up for a jump to “Interlace Mode 3”	
	INY		; Otherwise, Y is set up for “Interlace Mode 4”.
	INY		; The “Interlace Modes” will write ‘n’ bytes to the APU, 1 byte to the zero page.
IMode3:		 	; This is how I have dynamic audio quality based on the number of ZP writes.
	INY		; The more bytes required for the zero page, the fewer writes to the audio chip.
	INY		; Those magic numbers used in the CMP instructions are how many writes…
IMode2:		 	; … to the zero page I can make with each interlace mode before running out…
 	INY		; … of CPU cycles in a single frame.
 	INY		; This guarantees the highest possible audio quality during the ZP writes.
IMode1:
	NOP		; This was changed a few times, so here’s a few lingering NOPs.
	NOP		; I still need these to prevent the audio main loop from crossing a page boundary.
	LDA $0403, Y	; Finally, using the value in Y, grab bytes from a LookUp Table
	STA <$01	; Store the High byte at address $01
	LDA $0402, Y	; Grab the low byte from the LUT.
	STA <$00	; It’s perfectly fine to write these in the zero page. They will be overwritten.
	JMP ($0000)	; Jump to the interlacing mode that best fits this frame.

MainAudio:		; The interlace modes all end by jumping here. Let’s count CPU cycles!
	LDA #$01	; 2 cycles. A = 1. 
	STA $4016	; +4 = 6 cycles. Strobe controllers.
	LSR A		; +2 = 8 cycles. A = 0
	STA $4016	; +4 = 12 cycles. Strobe controllers part 2.
	EOR $4016	; +4 = 16 cycles. Very fast controller reading with EOR
	ASL A		; +2 = 18 cycles.
	EOR $4016	; +4 = 22 cycles.
	ASL A		; +2 = 24 cycles.
	EOR $4016	; +4 = 28 cycles.
	ASL A		; +2 = 30 cycles.
	EOR $4016	; +4 = 34 cycles.
	ASL A		; +2 = 36 cycles.
	EOR $4016	; +4 = 40 cycles.
	ASL A		; +2 = 42 cycles.
	EOR $4016	; +4 = 46 cycles.
	ASL A		; +2 = 48 cycles.
	EOR $4016	; +4 = 52 cycles.
	ASL A		; +2 = 54 cycles.
	EOR $4016	; +4 = 58 cycles.
	EOR #$C0	; +2 = 60 cycles.
	STA $4011	; +4 = 64 cycles. This write to the audio chip happens every 71 cycles.
	CLC		; +2 = 66 cycles. Clear carry flag
	ASL A		; +2 = 68 cycles. Bit 7 goes into the carry flag. 
	BCC MainAudio	; +3 = 71 cycles. If bit 7 was not set, loop!
	JMP $0300	; Now that all the time in a frame has run out, jump back to the top of the loop.

InterlaceModeJumpTable:	; Address $402
	.byte $0A, $04	; Address $040A, Interlace Mode 1
	.byte $68, $04	; Address $0468, Interlace Mode 2
	.byte $EF, $04	; Address $04EF, Interlace Mode 3
	.byte $9F, $05	; Address $059F, Interlace Mode 4

InterlaceMode1: 	; The comments for this section will be sparse
	LDA #$01	; The code here is repetitive.
	STA $4016	; Read the controller
	LSR A
	STA $4016	
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA <$FF, X	; Store the contents of the controller to the zero page
	LDA #$01	; Read the controller again
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to Audio chip.
	DEX		; X--. If X reaches 0, all the zero page data is written, so branch 3 bytes.
	BEQ PostInterlaceMode1 
	JMP $040A	; Otherwise, back to the start of InterlaceMode1
PostInterlaceMode1:
	JMP $03CE	; This jumps to MainAudio.

InterlaceMode2:
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
 	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA <$FF, X	; Write to ZP
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
 	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (1)
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (2)
	DEX
	BEQ PostInterlaceMode2
	JMP $0468	; Back to the top of InterlaceMode2
PostInterlaceMode2:
	JMP $03CE	; Jump to MainAudio

InterlaceMode3:
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA <$FF, X	; Write to Zero Page
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (1)
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (2)
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (3)
	DEX
	BEQ PostInterlaceMode3
	JMP $04EF	; Back to the top of InterlaceMode3
PostInterlaceMode3:
	JMP $03CE	; Jump to MainAudio

InterlaceMode4:
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA <$FF, X	; Write to Zero Page
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (1)
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (2)
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (3)
	LDA #$01	; Read controller
	STA $4016
	LSR A
	STA $4016
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	ASL A
	EOR $4016
	EOR #$C0
	STA $4011	; Write to APU (4)
	DEX
	BEQ PostInterlaceMode4
	JMP $059F	; Back to the top of InterlaceMode4
PostInterlaceMode4:
	JMP $03CE	; Jump to MainAudio

InitialSetup:
	LDX #$00	; This is where I jump to when I begin the Bad Apple Loop.
	JMP $0300	; Basically it just sets up the X register, then jumps to the main loop.

 	 	 	; There’s a handful of unused bytes here between $67D, and $6FF

 	 	 	; $700
IdenticalByteRow:
	STA $2007	; Here’s where the jump tables during VBlank end up.
	STA $2007	; Basically, if I need to draw the same tile 20 times, then “STA $2007” runs 20 times.
	STA $2007	; Taking a jump table is faster than looping.
	STA $2007	
	STA $2007	; Writing to $2007 is how you change a character on the nametable.
	STA $2007	
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	STA $2007
	INX		; X++, getting ready for the first byte of the next packet.
	JMP $030F	; Jump to “Main”

IdenticalByteLUT:
	.byte $39, $36, $33, $30, $2D, $2A, $27, $24, $21, $1E, $1B, $18, $15, $12, $0F, $0C, $09, $06, $03, $00
	 	 	; If Y = 0, for example, it will grab that “$39”, making a jump to $739, executing a single STA instruction
	 	 	; This is why it draws "Length+1" tiles. If Y = 19, it would jump to $700, executing all 20 STA's.
IndividualByteRow:
	LDA <$00, X	; Read the next byte of the packet
	INX		; X++
	STA $2007	; Store the value to the nametable.
	LDA <$00, X	; Read the next byte of the packet
	INX		; X++
	STA $2007	; Store the value to the nametable
	LDA <$00, X
	INX		; This is very similar to the previous function, but for a run of individual tiles.
	STA $2007	; Again, this is much faster than making a loop.
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX
	STA $2007
	LDA <$00, X
	INX		; X is now ready to read the first byte of the next packet.
	STA $2007
	JMP $030F	; Jump to “Main”

IndividualByteLUT:
	.byte $C6, $C0, $BA, $B4, $AE, $A8, $A2, $9C, $96, $90, $8A, $84, $7E, $78, $72, $6C, $66, $60, $5A, $54
	 	 	; Another LUT that is used for a jump table. 
	 	 	; And that’s it!
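Since I had to place these tables by hand, here is a Python sketch of how the two LookUp Tables follow from the layout above. The function is an illustration of mine and not something from my tooling. Each entry is the low byte of the address that leaves exactly (index + 1) repetitions before the final JMP.

```python
def jump_lut(base, entry_size, entries):
    """Low bytes of the jump targets into an unrolled chain, shortest run first."""
    return [
        (base + (entries - 1 - i) * entry_size) & 0xFF
        for i in range(entries)
    ]


# 20 x "STA $2007" (3 bytes each), starting at $700.
identical = jump_lut(0x0700, 3, 20)

# 20 x "LDA <$00,X / INX / STA $2007" (6 bytes each), starting at $754.
individual = jump_lut(0x0754, 6, 20)

print(" ".join("$%02X" % b for b in identical))
print(" ".join("$%02X" % b for b in individual))
```

The high byte never needs a table, because both chains sit entirely inside page 7.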
This runs after the video. It spoofs the level transition screen:
	 	 	; This was written at address $300
PostVideo:
	LDA $2002	; Wait for Vblank
	BPL PostVideo	; Branch until VBlank

	LDA #$00	; A = 0
	STA $2001	; Disable Rendering

	LDA #$3F	; Set PPU Address to $3F00
	STA $2006	; Palette info
	LDA #$00	; Restore the palette to the castle level palette
	STA $2006	; PPU Address = $3F00
	LDA #$0F	; A = 0x0F, Black
	STA $2007	; Write to palette
	LDA #$30	; A = 0x30, White
	STA $2007	; Write to palette
	LDA #$10	; A = 0x10, Light Grey
	STA $2007	; Write to palette
	LDA #$00	; A = 0x00, Grey
	STA $2007	; Write to palette
	LDA #$3F	; Set up PPU Address to $3F10
	STA $2006	; $3F10 is the palette used for Mario
	LDA #$10	; It's still using the fire flower palette
	STA $2006	; So I set it to the regular mario palette.
	LDA #$0F	; A = 0x0F, Black. The background color.
	STA $2007	; Write to palette
	LDA #$16	; A= 0x16, Red
	STA $2007	; Write to palette
	LDA #$27	; A = 0x27, Light Orange
	STA $2007	; Write to palette
	LDA #$18	; A = 0x18, a Brownish-Green
	STA $2007	; Write to palette
	LDA #$3F	; Set PPU Address to $3F0D
	STA $2006	; This is the color used for the blinking coin in the HUD
	LDA #$0D	; It's always 0x27 during the screen transitions.
	STA $2006	; 
	LDA #$27	; A = 0x27, a Light Orange
	STA $2007	; Write to palette

	LDA #$20	; Change PPU Address to $2000
	STA $2006	; ^
	LDA #$00	; ^
	STA $2006	; ^
	LDA #$24	; A = 0x24, an empty square
	LDX #$C0	; We're going to change $3C0 tiles.
	LDY #$04	; Recall that the blank square was tile 0x27 when setting up the bad apple loop.
NTLoop:
	STA $2007	; Now I need to cover the whole screen with tile 0x24.
	DEX		; X--
	BNE NTLoop	; If X != 0, loop
	DEY		; Y--
	BNE NTLoop	; If Y != 0, loop

	LDA #$20	; Now I'm going to draw individual tiles.
	STA $2006	; Set up the PPU Address to $2043
	LDA #$43	; ^
	STA $2006	; ^
	LDA #$16	; "M"
	STA $2007	;
	LDA #$0A	; "A"
	STA $2007	;
	LDA #$1B	; "R"
	STA $2007	;
	LDA #$12	; "I"
	STA $2007	;
	LDA #$18	; "O"
	STA $2007	;
	LDA #$20	; Change PPU Address
	STA $2006	; ^
	LDA #$52	; ^
	STA $2006	; ^
	LDA #$20	; "W"
	STA $2007	;
	LDA #$18	; "O"
	STA $2007	;
	LDA #$1B	; "R"
	STA $2007	;
	LDA #$15	; "L"
	STA $2007	;
	LDA #$0D	; "D"
	STA $2007	; 
	LDA #$24	; " "
	STA $2007	; 
	STA $2007	;
	LDA #$1D	; "T"
	STA $2007	;
	LDA #$12	; "I"
	STA $2007	;
	LDA #$16	; "M"
	STA $2007	;
	LDA #$0E	; "E"
	STA $2007	;
	LDA #$20	; Change PPU Address
	STA $2006	; ^
	LDA #$62	; ^
	STA $2006	; ^
	LDA #$09	; "9"
	STA $2007	; The number 9 is drawn 6 times.
	STA $2007	; This is Mario's score.
	STA $2007	;
	STA $2007	;
	STA $2007	;
	STA $2007	;
	LDA #$00	; "0" It ends in a zero.
	STA $2007	;
	LDA #$24	; " "
	STA $2007	; Two spaces
	STA $2007	;
	LDA #$2E	; "Coin Icon"
	STA $2007	;
	LDA #$29	; "x"
	STA $2007	;
	LDA #$00	; "0"
	STA $2007	; Zero is drawn twice
	STA $2007	;
	LDA #$24	; " "
	STA $2007	; 4 spaces 
	STA $2007	;
	STA $2007	;
	STA $2007	;
	LDA #$08	; "8"
	STA $2007	;
	LDA #$28	; "-"
	STA $2007	;
	LDA #$04	; "4"
	STA $2007	;
	LDA #$21	; Change PPU Address
	STA $2006	; ^
	LDA #$4B	; ^
	STA $2006	; ^
	LDA #$20	; "W"
	STA $2007	;
	LDA #$18	; "O"
	STA $2007	;
	LDA #$1B	; "R"
	STA $2007	;
	LDA #$15	; "L"
	STA $2007	;
	LDA #$0D	; "D"
	STA $2007	;
	LDA #$24	; " "
	STA $2007	;
	LDA #$08	; "8"
	STA $2007	;
	LDA #$28	; "-"
	STA $2007	;
	LDA #$04	; "4"
	STA $2007	;
	LDA #$21	; Change PPU Address
	STA $2006	; ^
	LDA #$CF	; ^
	STA $2006	; ^
	LDA #$29	; "x"
	STA $2007	;
	LDA #$24	; " "
	STA $2007	; two spaces
	STA $2007	;
	LDA #$03	; "3"
	STA $2007	; This number is how many lives Mario has.
	LDA #$23	; Change PPU Address
	STA $2006	; ^
	LDA #$C2	; ^
	STA $2006	; ^
	LDA #$C0	; Changing the Nametable's attribute table to make the coin the correct color.
	STA $2007	;
	LDA #$F8	; Clear OAM
	LDX #$00	; This entire time it was still the data of mario shooting fireballs at bowser.
OAMLoop:
	STA $0200, X	; All of bowser's hammers are still there too!
	DEX		; X--
	BNE OAMLoop	; If X != 0, loop

	LDX #$00	; X= 0
OAMLoop2:
	LDA $0513, X	; A = next byte from a LookUp Table. This fills in OAM with the mario graphic for the screen.
	STA $0200, X	; Store the next byte in OAM.
	INX		; X++
	CPX #$24	; if X != 0x24
	BNE OAMLoop2	; loop

	LDA #$02	; A = 2
	STA $4014	; OAM DMA. This sends $200 through $2FF to the PPU for sprite data.
	LDA #$24	; While we're here, let's restore Nametable 2 to Bowser's castle.
	STA $2006	; Set PPU Address to $2400
	LDA #$00	; ^
	STA $2006	; ^
	LDX #$00	; We have 0x400 bytes to change.
	LDY #$04	;
NT2Loop:
	JSR $016B	; Read the controller
	STA $2007	; Store it on the nametable.
	DEX		; X--
	BNE NT2Loop	; If X != 0, loop
	DEY		; Y--
	BNE NT2Loop	; if Y != 0, loop

	LDA $2002	; I made changes here, and could've made this 3 NOPs, but this takes fewer cycles.
WaitForVBlank:
	LDA $2002	; Wait for Vblank
	BPL WaitForVBlank

	LDA #$20	; Set the PPU Address to $2000
	STA $2006	; ^
	LDA #$00	; ^
	STA $2006	; ^
	LDA #$00	; Set the fine X scroll to zero
	STA $2005	; This step needs to happen after the writes to $2006
	LDA #$00	; Otherwise writing to $2006 messes with the fine Y scroll
	STA $2005	; Which will also be 0 in this case.
	LDA #$1E	; A = 0x1E
	STA $2001	; Enable rendering, both the background and sprites.

WaitAFewFrames:
	JSR $016B	; Read the controller
	BMI PostScreen	; If I press "A" this loop ends.
	BNE #0		; If I press anything else, this branch to the next instruction is taken, which wastes 1 cycle.
VBlankLoop2: 
	LDA $2002	; Vblank loop again, this time with false-negative prevention.
	BPL VBlankLoop2 ; Wait for VBlank, then always branch back to "WaitAFewFrames"
	BMI WaitAFewFrames

PostScreen:
	LDA #$00	; It's time to leave the fake screen, and restore the game to a stable state.
	STA $2001	; Disable rendering
	LDA #$20	; Set the PPU address to $2000
	STA $2006	; ^
	LDA #$00	; ^
	STA $2006	; ^
	LDX #$00	; Time to overwrite the nametable that was showing the fake screen transition.
	LDY #$04	; $400 bytes to change
NTLoop3:
	JSR $016B	; Read the controller
	STA $2007	; Write to nametable
	DEX		; X--
	BNE NTLoop3	; If X != 0, loop
	DEY		; Y--
	BNE NTLoop3	; if Y != 0, loop

	JMP $0181	; Return to total control loop
			; Then use the total control loop to write to every byte in RAM (except in the stack).
			; We're essentially "loading a savestate" by doing this.
			; Once that's all set up, enable the NMI and jump to where SMB waits for the next frame.
OAMLookUpTable: 	; Address $513
	.byte $F8, $00, $00, $00 	; This is a copy of OAM data during the screen transition.
	.byte $58, $FC, $00, $60 	; An invisible object at screen coordinates (60, 58). Mario is short here.
	.byte $58, $FC, $00, $68 	; An invisible object at screen coordinates (68, 58). If Mario were tall, this would be his upper half.
	.byte $60, $FC, $00, $60 	; An invisible object at screen coordinates (60, 60). Instead, these are invisible objects.
	.byte $60, $FC, $00, $68	; An invisible object at screen coordinates (68, 60).
	.byte $68, $3A, $00, $60	; The upper left of Mario's head at coords  (60, 68)
	.byte $68, $37, $00, $68	; The upper right of Mario's head at coords (68, 68)
	.byte $70, $4F, $00, $60	; The lower left of Mario's body at coords  (60, 70)
	.byte $70, $4F, $40, $68	; The lower right of Mario's body at coords (68, 70)
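
To make the raw numbers in the listing above easier to check, here is a minimal Python sketch (not part of the TAS; the helper names `decode_nametable_address` and `decode_oam` are made up for illustration). It assumes the standard NES layout: nametable 0 occupies $2000-$23BF as 30 rows of 32 tiles, followed by a 64 byte attribute table, and each OAM entry is 4 bytes in the order Y, tile, attributes, X.

```python
# Illustrative helpers only; nothing here runs on the NES.

def decode_nametable_address(addr):
    """Describe a PPU address inside nametable 0 ($2000-$23FF)."""
    offset = addr - 0x2000
    if not 0 <= offset < 0x400:
        raise ValueError("address is outside nametable 0")
    if offset >= 0x3C0:
        # The last 64 bytes select palettes (the attribute table).
        return ("attribute", offset - 0x3C0)
    # The first 960 bytes are 30 rows of 32 tiles each.
    return ("tile", offset // 32, offset % 32)


def decode_oam(data):
    """Split raw OAM bytes into one dict per sprite (Y, tile, attributes, X)."""
    sprites = []
    for i in range(0, len(data), 4):
        y, tile, attr, x = data[i:i + 4]
        sprites.append({
            "x": x,
            "y": y,
            "tile": tile,
            "flip_h": bool(attr & 0x40),  # bit 6: horizontal flip
            "flip_v": bool(attr & 0x80),  # bit 7: vertical flip
            "hidden": y >= 0xEF,          # this far down, the sprite is offscreen
        })
    return sprites


# The 0x24 bytes of the lookup table at $513, copied from the listing.
OAM_LOOKUP_TABLE = bytes([
    0xF8, 0x00, 0x00, 0x00,
    0x58, 0xFC, 0x00, 0x60,
    0x58, 0xFC, 0x00, 0x68,
    0x60, 0xFC, 0x00, 0x60,
    0x60, 0xFC, 0x00, 0x68,
    0x68, 0x3A, 0x00, 0x60,
    0x68, 0x37, 0x00, 0x68,
    0x70, 0x4F, 0x00, 0x60,
    0x70, 0x4F, 0x40, 0x68,
])

if __name__ == "__main__":
    for addr in (0x2062, 0x214B, 0x21CF, 0x23C2):
        print(f"${addr:04X} ->", decode_nametable_address(addr))
    for sprite in decode_oam(OAM_LOOKUP_TABLE):
        print(sprite)
```

Under those assumptions, $2062 lands on row 3, column 2 (where the score is drawn), $23C2 is attribute byte 2, and the last table entry is the same tile as the one before it with the horizontal flip bit set, which matches the "lower right of Mario's body" comment.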

Fun statistics

For a little extra bit of fun, here's an ordered list of the characters drawn on screen during the music video, ranked by number of appearances.
0. : Character 0x27 : 932396 instances.
1. : Character 0x24 : 764660 instances.
2. : Character 0x2A : 21826 instances.
3. : Character 0x40 : 21037 instances.
4. : Character 0xA2 : 16911 instances.
5. : Character 0x2F : 13054 instances.
6. : Character 0xA3 : 11675 instances.
7. : Character 0x7E : 11604 instances.
8. : Character 0xDD : 10895 instances.
9. : Character 0x9D : 9784 instances.
10. : Character 0xFF : 9646 instances.
11. : Character 0x3D : 8853 instances.
12. : Character 0x9B : 8005 instances.
13. : Character 0x9E : 7806 instances.
14. : Character 0x32 : 7457 instances.
15. : Character 0x77 : 7449 instances.
16. : Character 0x9C : 7222 instances.
17. : Character 0x5C : 6125 instances.
18. : Character 0x31 : 5872 instances.
19. : Character 0xAD : 5767 instances.
20. : Character 0xA8 : 5507 instances.
21. : Character 0x3F : 4972 instances.
22. : Character 0x26 : 4468 instances.
23. : Character 0xAF : 4406 instances.
24. : Character 0xCB : 4282 instances.
25. : Character 0xA5 : 4011 instances.
26. : Character 0xD3 : 3303 instances.
27. : Character 0xE5 : 2762 instances.
28. : Character 0x7D : 2709 instances.
29. : Character 0x39 : 2577 instances.
30. : Character 0xAC : 2540 instances.
31. : Character 0x33 : 2534 instances.
32. : Character 0xA6 : 2096 instances.
33. : Character 0xCA : 2073 instances.
34. : Character 0xD7 : 1833 instances.
35. : Character 0xA7 : 1781 instances.
36. : Character 0x0F : 1599 instances.
37. : Character 0x48 : 1274 instances.
38. : Character 0xAE : 1170 instances.
39. : Character 0xBF : 1157 instances.
40. : Character 0x66 : 1056 instances.
41. : Character 0x07 : 1050 instances.
42. : Character 0x25 : 1036 instances.
43. : Character 0x3E : 981 instances.
44. : Character 0x5B : 928 instances.
45. : Character 0xC2 : 875 instances.
46. : Character 0x4A : 757 instances.
47. : Character 0x46 : 734 instances.
48. : Character 0x79 : 630 instances.
49. : Character 0x30 : 583 instances.
50. : Character 0xD2 : 524 instances.
51. : Character 0x8B : 470 instances.
52. : Character 0xCC : 459 instances.
53. : Character 0xD0 : 459 instances.
54. : Character 0xE1 : 426 instances.
55. : Character 0x98 : 419 instances.
56. : Character 0x41 : 400 instances.
57. : Character 0xD1 : 398 instances.
58. : Character 0x99 : 345 instances.
59. : Character 0xD6 : 341 instances.
60. : Character 0x78 : 334 instances.
61. : Character 0x65 : 319 instances.
62. : Character 0xC6 : 309 instances.
63. : Character 0x6C : 298 instances.
64. : Character 0x9A : 291 instances.
65. : Character 0x8E : 289 instances.
66. : Character 0x15 : 285 instances.
67. : Character 0xD9 : 284 instances.
68. : Character 0x28 : 277 instances.
69. : Character 0xDC : 275 instances.
70. : Character 0x6D : 264 instances.
71. : Character 0x7F : 263 instances.
72. : Character 0x8F : 240 instances.
73. : Character 0x1D : 238 instances.
74. : Character 0xDE : 225 instances.
75. : Character 0x2B : 218 instances.
76. : Character 0x6E : 217 instances.
77. : Character 0x42 : 202 instances.
78. : Character 0xBD : 188 instances.
79. : Character 0xE8 : 183 instances.
80. : Character 0xBE : 171 instances.
81. : Character 0xD5 : 170 instances.
82. : Character 0x56 : 165 instances.
83. : Character 0xC7 : 164 instances.
84. : Character 0xC4 : 162 instances.
85. : Character 0xC9 : 150 instances.
86. : Character 0x95 : 149 instances.
87. : Character 0xD4 : 142 instances.
88. : Character 0x81 : 141 instances.
89. : Character 0x97 : 132 instances.
90. : Character 0x87 : 128 instances.
91. : Character 0xAB : 120 instances.
92. : Character 0xBC : 117 instances.
93. : Character 0x2C : 114 instances.
94. : Character 0xE2 : 105 instances.
95. : Character 0x2E : 97 instances.
96. : Character 0xCD : 94 instances.
97. : Character 0xB4 : 88 instances.
98. : Character 0x45 : 85 instances.
99. : Character 0x91 : 84 instances.
100. : Character 0x14 : 83 instances.
101. : Character 0x22 : 79 instances.
102. : Character 0x57 : 71 instances.
103. : Character 0x59 : 71 instances.
104. : Character 0x72 : 71 instances.
105. : Character 0x86 : 71 instances.
106. : Character 0x01 : 67 instances.
107. : Character 0x4D : 63 instances.
108. : Character 0x13 : 61 instances.
109. : Character 0xE3 : 60 instances.
110. : Character 0x16 : 53 instances.
111. : Character 0x7C : 53 instances.
112. : Character 0x4B : 52 instances.
113. : Character 0x19 : 51 instances.
114. : Character 0x34 : 51 instances.
115. : Character 0x23 : 50 instances.
116. : Character 0x1F : 48 instances.
117. : Character 0x68 : 47 instances.
118. : Character 0xCE : 46 instances.
119. : Character 0xF2 : 44 instances.
120. : Character 0x64 : 41 instances.
121. : Character 0x96 : 38 instances.
122. : Character 0xB8 : 37 instances.
123. : Character 0x0C : 36 instances.
124. : Character 0x67 : 36 instances.
125. : Character 0xA0 : 36 instances.
126. : Character 0x0E : 34 instances.
127. : Character 0x12 : 34 instances.
128. : Character 0x73 : 33 instances.
129. : Character 0xDF : 33 instances.
130. : Character 0xF1 : 33 instances.
131. : Character 0x8C : 32 instances.
132. : Character 0xBA : 31 instances.
133. : Character 0x60 : 28 instances.
134. : Character 0xB9 : 28 instances.
135. : Character 0x7B : 27 instances.
136. : Character 0x80 : 26 instances.
137. : Character 0x71 : 25 instances.
138. : Character 0xDB : 25 instances.
139. : Character 0x2D : 22 instances.
140. : Character 0x04 : 21 instances.
141. : Character 0x29 : 21 instances.
142. : Character 0x52 : 21 instances.
143. : Character 0x58 : 21 instances.
144. : Character 0x83 : 21 instances.
145. : Character 0xE9 : 21 instances.
146. : Character 0x6B : 20 instances.
147. : Character 0x85 : 20 instances.
148. : Character 0xA9 : 20 instances.
149. : Character 0x75 : 19 instances.
150. : Character 0xA1 : 18 instances.
151. : Character 0xAA : 18 instances.
152. : Character 0x6A : 17 instances.
153. : Character 0x6F : 17 instances.
154. : Character 0x74 : 17 instances.
155. : Character 0x35 : 16 instances.
156. : Character 0x70 : 16 instances.
157. : Character 0x8A : 16 instances.
158. : Character 0x17 : 15 instances.
159. : Character 0x92 : 14 instances.
160. : Character 0xF6 : 14 instances.
161. : Character 0x1E : 12 instances.
162. : Character 0x20 : 11 instances.
163. : Character 0x62 : 11 instances.
164. : Character 0x00 : 10 instances.
165. : Character 0x43 : 10 instances.
166. : Character 0x4F : 10 instances.
167. : Character 0x76 : 10 instances.
168. : Character 0x7A : 10 instances.
169. : Character 0x37 : 9 instances.
170. : Character 0x63 : 9 instances.
171. : Character 0xC0 : 9 instances.
172. : Character 0xC5 : 9 instances.
173. : Character 0xDA : 9 instances.
174. : Character 0x03 : 8 instances.
175. : Character 0x0D : 8 instances.
176. : Character 0x51 : 8 instances.
177. : Character 0x69 : 8 instances.
178. : Character 0x82 : 8 instances.
179. : Character 0xB5 : 8 instances.
180. : Character 0x05 : 7 instances.
181. : Character 0x09 : 7 instances.
182. : Character 0x11 : 7 instances.
183. : Character 0x54 : 7 instances.
184. : Character 0x61 : 7 instances.
185. : Character 0xB7 : 7 instances.
186. : Character 0x21 : 6 instances.
187. : Character 0x55 : 6 instances.
188. : Character 0xC3 : 6 instances.
189. : Character 0xC8 : 6 instances.
190. : Character 0xFA : 6 instances.
191. : Character 0x47 : 5 instances.
192. : Character 0x49 : 5 instances.
193. : Character 0x5F : 5 instances.
194. : Character 0x90 : 5 instances.
195. : Character 0x0A : 4 instances.
196. : Character 0x18 : 4 instances.
197. : Character 0x50 : 4 instances.
198. : Character 0xE0 : 4 instances.
199. : Character 0xFB : 4 instances.
200. : Character 0x44 : 3 instances.
201. : Character 0x53 : 3 instances.
202. : Character 0x5E : 3 instances.
203. : Character 0x8D : 3 instances.
204. : Character 0xB3 : 3 instances.
205. : Character 0xD8 : 3 instances.
206. : Character 0xFC : 3 instances.
207. : Character 0x02 : 2 instances.
208. : Character 0x06 : 2 instances.
209. : Character 0x1B : 2 instances.
210. : Character 0x38 : 2 instances.
211. : Character 0x3C : 2 instances.
212. : Character 0x4C : 2 instances.
213. : Character 0x5D : 2 instances.
214. : Character 0x84 : 2 instances.
215. : Character 0x9F : 2 instances.
216. : Character 0xA4 : 2 instances.
217. : Character 0xF0 : 2 instances.
218. : Character 0x1A : 1 instance.
219. : Character 0x36 : 1 instance.
220. : Character 0x3B : 1 instance.
221. : Character 0x5A : 1 instance.
222. : Character 0x88 : 1 instance.
223. : Character 0x93 : 1 instance.
224. : Character 0xB0 : 1 instance.
225. : Character 0xB2 : 1 instance.
226. : Character 0xB6 : 1 instance.
227. : Character 0xCF : 1 instance.
228. : Character 0xEB : 1 instance.
229. : Character 0xF8 : 1 instance.
230. : Character 0x08 : 0 instances.
231. : Character 0x0B : 0 instances.
232. : Character 0x10 : 0 instances.
233. : Character 0x1C : 0 instances.
234. : Character 0x3A : 0 instances.
235. : Character 0x4E : 0 instances.
236. : Character 0x89 : 0 instances.
237. : Character 0x94 : 0 instances.
238. : Character 0xB1 : 0 instances.
239. : Character 0xBB : 0 instances.
240. : Character 0xC1 : 0 instances.
241. : Character 0xE4 : 0 instances.
242. : Character 0xE6 : 0 instances.
243. : Character 0xE7 : 0 instances.
244. : Character 0xEA : 0 instances.
245. : Character 0xEC : 0 instances.
246. : Character 0xED : 0 instances.
247. : Character 0xEE : 0 instances.
248. : Character 0xEF : 0 instances.
249. : Character 0xF3 : 0 instances.
250. : Character 0xF4 : 0 instances.
251. : Character 0xF5 : 0 instances.
252. : Character 0xF7 : 0 instances.
253. : Character 0xF9 : 0 instances.
254. : Character 0xFD : 0 instances.
255. : Character 0xFE : 0 instances.
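
For anyone who wants to build a similar ranking, here is a minimal Python sketch of one way it could be tallied. This is an assumption about the method, not the tool actually used for the list above: `rank_tiles` and `format_ranking` are hypothetical names, and the two tiny "frames" in the demo are made-up data. Ties are broken by character value, which is consistent with the ordering in the list.

```python
from collections import Counter


def rank_tiles(frames):
    """Count how often each tile ID (0-255) appears across all frames.

    `frames` is any iterable of byte sequences, one per drawn frame.
    Returns (tile, count) pairs, most frequent first, ties by tile ID.
    """
    counts = Counter()
    for frame in frames:
        counts.update(frame)  # iterating bytes yields integers 0-255
    ordered = sorted(range(256), key=lambda tile: (-counts[tile], tile))
    return [(tile, counts[tile]) for tile in ordered]


def format_ranking(ranked):
    """Render the ranking in the same style as the list above."""
    lines = []
    for rank, (tile, n) in enumerate(ranked):
        noun = "instance" if n == 1 else "instances"
        lines.append(f"{rank}. : Character 0x{tile:02X} : {n} {noun}.")
    return lines


if __name__ == "__main__":
    # Made-up demo data: two three-tile "frames".
    demo_frames = [bytes([0x27, 0x27, 0x24]), bytes([0x27, 0x24, 0x2A])]
    for line in format_ranking(rank_tiles(demo_frames))[:4]:
        print(line)
```

Characters that never appear still get a row with a count of zero, so the output always has 256 entries.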

Very important notes for judges

The full TAS was too large to upload as a submission, so the full 15MB TAS is currently uploaded as a userfile: UserFiles/Info/638479355124534524. The submitted TAS is pretty much the first fifth of the video (and apparently desyncs, wow!). If possible (and if you're willing), please replace my submitted TAS with this userfile. I also super-definitely mangled the input log in an attempt to shrink the file size even more (which likely caused the desync). It doesn't like being loaded in TAStudio, but it runs fine!

InfoTeddy: nice
InfoTeddy: I know everyone has said this a billion times already but I have to say it too - this is genuinely really freaking impressive. Not only figuring out how to play the video and audio and streaming both of them through controller inputs, but also documenting the process thoroughly in this submission so others could do it too (if they wanted to). What's more, the audience really enjoyed it too, with staggeringly positive feedback, making it worth it in the end. Dare I say, this is one of the best April Fools' submissions ever made on this site.
I wasn't able to get the "official" submission working; my Bizhawk install crashes when I try it. But the uploaded userfile does sync for me, albeit really slowly on emulator. Unfortunately, I can't replace the submission with that file because I have a submission file limit too, so I think leaving the actual file used for sync as a userfile is acceptable.
Due to the impressive technical feats and overwhelmingly positive audience reception, I am accepting this to Alternative under the "arbitrary code execution" goal.

Spikestuff: Processing...
feos: Replacing with the correct movie.