Tool-assisted game movies
When human skills are just not enough

Submission #6466: Masterjun & ais523's NES Super Mario Bros. 3 "game end glitch" in 00:00.78

Console: Nintendo Entertainment System
Game name: Super Mario Bros. 3
Game version: USA PRG1
ROM filename: Super Mario Bros. 3 (U) (PRG1) [!].nes
Branch: game end glitch
Emulator: BizHawk 2.3.2
Movie length: 00:00.78
FrameCount: 47
Re-record count: 10904
Author's real name:
Author's nickname: Masterjun & ais523
Submitter: Masterjun
Submitted at: 2019-07-25 17:32:37
Text last edited at: 2019-08-05 15:43:10
Text last edited by: feos
Download: Download (2326 bytes)
Status: judging underway
Submission instructions
Discuss this submission (also rating / voting)
List all submissions by this submitter
List pages on this site that refer to this submission
View submission text history
Back to the submission list
Author's comments and explanations:
This site was founded when a certain Super Mario Bros. 3 TAS was discovered on the internet. Ever since then, the job of this site was to push video games to their very limits.

Today, almost 16 years later, it is possible to submit a Super Mario Bros. 3 TAS which might be the very embodiment of a video game being pushed to the limit.

(Link to video)

Game objectives

  • Aims to complete the game as fast as possible
  • Exploits a workaround to a hardware bug
  • Presses buttons real fast (requires SubNESHawk core to be enabled for even faster button pressing!)


You might have seen a similar short SMB3 run in the TAS block at SGDQ 2016, or maybe you've read some of the many articles on the internet which followed the showcase.

It is important to note that, despite dwangoAC saying "it is a valid completion" in the video, it was in fact not a valid completion of the game. The showcased run enters the peach cutscene and then softlocks when the world cutscenes were supposed to appear. This was due to the game being in the wrong mode.

The method to make the real ending appear and complete the game is as simple as changing the (NMI-)game mode at $0100 to the correct value 0x20 before jumping to the credits. What is not as simple to see is how much trouble this one address is, taking months of work just to accomplish this one additional write.

How it came to be

  <ais523>    Masterjun: that said, my default is to assume that any game
              has an ACE glitch unless it's very simple, and possibly even then
  <Masterjun> I'm guessing the same for at least the SNES games
  <Masterjun> and I'm betting a lot of NES games also have some kind of major
              glitch that simply wasn't discovered yet
  <Masterjun> I'm thinking about those bank switches and exact instruction timings
  <ais523>    this reminds me, I found a technique to create precise amounts of lag
              on the majority of NES games

DPCM bug

"If the DMC DMA is running, and happens to start a read in the same cycle that the CPU is trying to read from $4016 or $4017, the values read will become invalid." (full explanation)

This is a bug in the hardware of a NES console itself. In simple terms, it refers to audio processing occasionally interfering with input polling, leading to wrong button presses being read by the game.

DPCM bug workaround

It seems like developers of games for the NES were aware of this hardware bug. To avoid wrong button presses, they had to implement a software workaround. This can be approached in different ways or simply ignored.

In SMB3 specifically, developers programmed the game to repeatedly repoll the controller until two consecutive inputs matched. This means in normal play you usually repeat 2 loops (no bug occuring), or maybe 3 or 4 loops (bug occuring once) until you have two inputs matching.

An extra loop only takes 222 cycles (124 microseconds) of the ~30000 cycles in a frame, and it's unlikely that a human changes input 8000 times a second.

DPCM bug workaround exploit

Now this is TASVideos: When human skills are just not enough. So of course we can mash buttons really fast. This is what ais523 meant when talking about creating precise amounts of lag. By continuously changing inputs we can delay the game because it keeps waiting until two consecutive inputs match.

For convenience, the game has the controller reading routine inside the NMI, which is the interrupt that runs at the start of each frame. As soon as it begins we delay the execution by changing the input each loop. Important to notice here is how NMI switches to different banks at the start, and would switch them back at the end. Eventually the NMI is interrupted by IRQ, a different interrupt which is set to run at scanline 192 or 193 (= late in the frame). IRQ expects the NMI to have finished long ago and jumps to $A826. Unfortunately for the game, NMI did not yet finish and the banks are still switched. So it jumps into the middle of a wrong routine. A lot of crazy things happen (such as interrupts interrupting each other), and they keep getting more out of control because of IRQ trying to execute on wrong banks.

Until at some point the very unlikely scenario happens where a leftover byte from an indirect $9Axx jump is executed. Instruction 0x9A is TXS, Transfer X to Stack Pointer. Here, X happens to be 0x00, so the Stack Pointer (innitially 0xFF and in normal execution 0xC0-0xFF) is suddenly 0x00 and after a return it's 0x02. The Stack Pointer points to memory values $01xx, so after another BRK we will overwrite $0100 (= NMI mode) and $0101 (= IRQ mode). They change into "default mode" where IRQ finally doesn't jump into different banks. So we're now at the start of RAM filled mostly with 00's we can safely execute.

This is where the adventure begins.

The goal

We want to reach the peach cutscene and then the credits. This has 6 requirements.
  1. The $C000 bank needs to be 0x19
  2. The $A000 bank needs to be 0x18
  3. The PPU control register copy ($00FF) needs to be 0xA8
  4. NMI mode ($0100) needs to be set to 0x20
  5. Jump to $B85A
  6. The Stack Pointer needs to be sane (not lower than around 0x30), so it doesn't overwrite game modes.

At first, this seems too ambitious to be feasible. However, the first 3 requirements are already met and req. 6 works out automatically in most cases.

We can choose from two different approaches: First approach, set NMI mode (req. 4) and jump to credits (req. 5) manually. Or second approach, jump to a location which does it for us. But does such a convenient location exist? Yes, $8FE3 is what the game executes to prepare for the end sequence. There, the first 5 requirements are executed.

So we can either:
1. Set $0100 to 0x20 and jump to $B85A.
2. Jump to $8FE3.

The tools

What makes this whole movie possible in the first place are the bytes of controller input stored in RAM. We have two controllers with 8 buttons each. In order from most to least significant bit the buttons go: A B Select Start Up Down Left Right. In addition to currently pressed input bytes, we also have newly pressed input bytes (those are always a subset of the currently pressed bits). In particular $17 is P1 input, $18 is P1 new input. Then, $F5 is P1 new input, $F6 is P2 new input, $F7 is P1 input, $F8 is P2 input.

That's barely enough to do anything, and it gets even worse: Up+Down and Left+Right presses are cancelled out. This makes building bytes that end on x3,x7,xB,xC,xD,xE, or xF impossible, limiting our choice of opcodes.

As a first example, let's construct a jump to $B85A. There are three jump instructions: JSR(0x20), JMP(0x4C), indirect JMP(0x6C). As you can see, the JMP's are already impossible as they end on xC (= requiring an Up+Down press). So let's construct 20 5A B8. None of the bytes require opposite directional inputs, so that's good. Since two bytes are not enough, we have to use the second block of inputs. It doesn't really matter if we start at $F5 or $F6, but the important part to notice is how the 0x20 and 0xB8 are made with the same input. For this to be possible, all bits in the first byte need to occur in the second byte, which is the case here! This is exactly what the showcased run did, but unfortunately without setting $0100 (req. 4).

Now the second example, let's do the same thing except we jump to $8FE3. This gives us 20 E3 8F, which is impossible because 0xE3 ends on x3 and 0x8F ends on xF. It's also impossible because 0x8F doesn't have the required 0x20 bit.

The loop

Executing anything just once is not enough. We need a loop to be able to either set $0100, or somehow get that $8FE3 jump. Thankfully we have two areas of inputs to work with. We can use the 4-byte block as a loop back by executing 20 00 00, jumping back to the beginning of RAM. Then we can lag the game enough to get new inputs, execute something at the 2-byte block, execute until the 4-byte block, and loop back again.

It's at this point where it's possible to take a million approaches and have a million problems.

The problems

Just as an example, here is the list of some things that can go wrong:
  • Every BRK(0x00) instruction we execute is a 2 byte opcode, so we need to be careful how we're aligned, either executing even or odd addresses as opcodes.
  • The 3 byte opcode 0x1E sitting at $16, skipping the execution of your $17/$18 completely.
  • The counter at $15 counting upwards through all the opcodes.
  • The counter at $10 counting downwards through all the opcodes.
  • The execution in between, changing the value of A in unpredictable ways. A is unusable.
  • The two 0xA0's sitting at $8D and $8E, executing one of A0 A0 or A0 00 in each loop, setting Y to 0xA0 or 0x00. Y is unusable.
  • The code accidentally stumbling across one of the 12 KIL instructions, stopping execution completely.
  • The fact that setting X to 0xFF, then executing 'STA $01,x' does not write to $0100 but instead wraps around to $0000.

The execution

What is done in this movie is writing 0x8F to $F9 (which is just after the input bytes). Then we're able to form 20 E1 with input, making a jump to $8FE1 (not quite $8FE3, but it will reach the same place if the zero flag isn't set).

To make that write, we need at least two registers. The A and Y registers are unusable. But we can make Y usable by somehow avoiding the A0 A0 block at $8D. After both X and Y are usable, we can change their values and either use STX $ZZ,y or STY $ZZ,x to make the write.

Bytes Instruction Description
48 48 PHA PHA The Stack Pointer is at 0x04, and just about to overwrite the NMI mode with a bad value. We manipulate it to avoid that.
D6 14 DEC $14,x X is 0x00, so this decreases $14 from 0x00 to 0xFF. This is done to (eventually) skip over the counter at $15 and the byte at $16, as 0xFF is a 3 byte instruction. Note how using C6 14 (DEC $14) as an instruction wouldn't have worked due to the 0x10 bit in 0x14 not being in 0xC6.
CA (C2) DEX (NOP) This decreases X from 0x00 to 0xFF. We want X to be odd so the second byte is different (it happens to do nothing here).
D6 10 DEC $10,x Since X is 0xFF, this decreases $0F from 0x00 to 0xFF. Another preparation to skip over problematic addresses ($10 in this case).
C6 06 DEC $06 We decrease $06 from 0x00 to 0xFF. This finally completes the setup. No matter whether we execute odd or even addresses, we will hit the 0xFF at $0F (skipping over $10), then we will execute 0x00 at $12, and then 0xFF at $14 (skipping over $15 and $16), to assure we execute $17 every loop.
D6 40 DEC $40,x We decrease $3F from 0x00 to 0xFF to execute even addresses after this point. This is necessary because we can only write specific values to even addresses (using only X).
A2 20 LDX #$20 Load X with 0x20 for the next write.
86 86 STX $86 We write 0x20 to $86, which executes JSR $0000 for us without using the 4-byte block at the end. Additionally, we now skip the A0 A0 at $8D, so Y is now usable.
88 (08/00) DEY (PHP/BRK) We decrease Y from 0x00 all the way to 0xF9 to prepare for the write to $F9
A2 A0 LDX #$A0 We now set X to 0xA0 to be decreased to 0x8F, but we can use a trick here.
9A (18) TXS (CLC) We transfer X to the Stack Pointer. We can decrease the Stack Pointer faster than X.
(28) 20 (PLP) JSR $0000 $19 and $1A are both 00, so we can shorten the loop and decrease the Stack Pointer faster.
20 00 JSR $0000 Since we can lag the game whenever, we can precisely time the point where the Stack Pointer reaches 0x8F.
BA 9A TSX (TXS) We transfer the Stack Pointer back to X.
96 00 STX $00,y The setup is complete and we can finally write X (= 0x8F) into $00+Y (= $F9). Now we just need to break out of the loop we created.
84 84 STY $84 We write 0xF9 into $84. This is a 3 byte opcode so we jump over the 0x20 we wrote to $86 earlier.
20 E1 8F JSR $8FE1 The zero flag is not set so we execute $8FE3 and win the game (for real this time).

Special thanks to

  • Alyosha, for creating SubNESHawk, the core that allows button presses once per poll instead of once per video frame.
  • total, for initially playing around with this and creating a Lua script for allowing subframe input on FCEUX.
  • Site admins, for implementing the correct movie file parsing.

Suggested Screenshot

Maru: Judging.

Similar submissions (by title and categories where applicable):