I experimented with many different options for party member names in an attempt to find a faster credits warp than
[4468] NES Final Fantasy "game end glitch" by AmaizumiUni, Spikestuff & DJ Incendration in 01:36.47. Long story short, I didn't find any. The names よるえき
[0xaf 0xb2 0x8d 0x90] and ちごにむ
[0x9a 0x4c 0x9f 0xaa] for party members 2 and 4 are the fastest of many alternatives I tried.
Let's first walk through the exploit. Though
4468M is for the Japanese release, I'll use labels from the
USA disassembly.
The NES CPU's
stack occupies 256 bytes from 0x0100 to 0x01ff. The ultimate cause of the game end glitch is the fact that the variable
tmp_hi is located inside the stack area, at address 0x0110.
tmp_hi is used
during string formatting to store a pointer into a text buffer. Repeatedly climbing the stairs causes the stack to grow until it overlaps
tmp_hi. When you press Start after filling the stack, the text formatting for the pause menu ends up storing the value 0x0302 in
tmp_hi, overwriting the return address of an active stack frame. In this case, 0x0302 is a pointer into
str_buf (which happens to overlap with the data structure
ptygen that stores information about the party during party selection). When the stack frame returns, because of the overwritten return address, the CPU starts executing code at 0x0303. This address would be part of party member 1's name, except that the data for party member 1 data in
ptygen has been overwritten, having been used as scratch space for string formatting. That scratch string data executes as harmless code and eventually reaches the part of
ptygen reserved for party member 2 0x0310, which is where the first part of the exploit payload is stored.
ptygen is not where party information is permanently stored. Once the game starts, that memory may be reused for other purposes, including string formatting, as we have seen. At the time of the exploit, the data for party member 1 has been overwritten, but the data for party members 2–4 is still intact. Some of the memory is controllable (party member classes and names), and some are not (e.g. screen coordinates). Here's what
ptygen looks like at the time:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
0300 10 06 .. .. .. .. .. .. 7a 80 7a 80 00
0310 L2 00 N2 N2 N2 N2 15 .. .. .. .. .. .. .. .. ..
0320 .. .. .. .. .. .. .. .. .. .. .. .. 04 10 80 a0
0330 L4 00 N4 N4 N4 N4 15 18
- .. bytes are omitted because they always get jumped over.
- Ln and Nn bytes are the bytes we have control over:
- Ln is a party member's class, 0x00 to 0x05,
- Nn are the 4 bytes of a party member's name.
- All other bytes are fixed.
The first few instructions starting at 0x0303 are some harmless branches and nops:
0303 BPL 0x030b
030B NOP
030C NOP #0x7a
030E NOP #0x00
The first byte we have control over is
L2 at 0x0310. The second byte is always 0x00 and cannot be changed, but then we get to execute the 4 bytes of party member 2's name:
0310 L2
0311 0x00
0312 N2
0313 N2
0314 N2
0315 N2
Starting at 0x0316, there are are some non-modifiable instructions that halt the CPU. No matter what alignment we start with, we will hit a 0x12 at address 0x031c which encodes a
JAM instruction. Therefore, we are obligated to use the final
N2 byte for a branch instruction (either
BCC or
BVC) to take the next byte 0x15 as an operand and jump ahead 21 bytes, to arrive at 0x032c. At 0x32c, there is a nop and a branch that doesn't get taken, and then we get to execute the class and name of player character 4:
032C NOP 0x10
032E BMI 0x02d0
0330 L4
0331 0x00
0332 N4
0333 N4
0334 N4
0335 N4
That's the extent of our control, the 10 bytes representing the class and name of party members 2 and 4. The
L2 and
L4 bytes are limited because they can only represent one of the six classes using byte values 0x00 to 0x05. The
N2 and
N4 bytes are more flexible, but even then we are limited to a palette of 90 bytes that represent name characters. Because the final byte of party member 2's name must be a branch instruction, we really have just about 7 bytes to work with.
The goal of the game end glitch is to jump to
EnterEndingScene in memory bank 0x0d. For that, we want to jump to
VictoryLoop in bank 0x0f.
VictoryLoop loads bank 0x0d and jumps to
EnterEndingScene.
C935 JSR EnterBattle_L ; start the battle!
C938 BCC :+ ; see if this battle was the end game battle
@VictoryLoop:
C93A JSR LoadEpilogueSceneGFX
C93D LDA #BANK_ENDINGSCENE
C93F JSR SwapPRG_L
C942 JSR EnterEndingScene
C945 JMP @VictoryLoop
:
C948 JSR ReenterStandardMap ; if this was just a normal battle, reenter the map
Because
VictoryLoop is a loop, our jump target doesn't have to be exact. We can land at almost any address inside the loop and it will work. Even if we're misaligned with the intended instructions, it happens to still work out: the
JMP at 0xc945 eventually runs and restarts the loop with proper alignment. And because the CPU's carry flag happens to be set at this point, we can even land on the
BCC instruction before the loop. Therefore, we are looking to jump to any address between 0xc938 and 0xc945 inclusive.
We'd be happy with a jump target anywhere between 0xc938 and 0xc945. Unfortunately, we cannot just write a
JMP instruction, because none of the bytes 0xc9 and 0x38–0x45 are available to us. The table below shows the naming bytes we have to work with, along with what instructions they encode:
| 0x8a あ | 0x8b い | 0x8c う | 0x8d え | 0x8e お | 0x48 が | 0x49 ぎ | 0x4a ぐ | 0x4b げ | 0x4c ご |
|---|
| TXA impl | ANE # | STY abs | STA abs | STX abs | PHA impl | EOR # | LSR A | ALR # | JMP abs |
| 0x8f か | 0x90 き | 0x91 く | 0x92 け | 0x93 こ | 0x4d ざ | 0x4e じ | 0x4f ず | 0x50 ぜ | 0x51 ぞ |
|---|
| SAX abs | BCC rel | STA ind,Y | JAM | SHA ind,Y | EOR abs | LSR abs | SRE abs | BVC rel | EOR ind,Y |
| 0x94 さ | 0x95 し | 0x96 す | 0x97 せ | 0x98 そ | 0x52 だ | 0x53 ぢ | 0x54 づ | 0x55 で | 0x56 ど |
|---|
| STY zpg,X | STA zpg,X | STX zpg,Y | SAX zpg,Y | TYA impl | JAM | SRE ind,Y | NOP zpg,X | EOR zpg,X | LSR zpg,X |
| 0x99 た | 0x9a ち | 0x9b つ | 0x9c て | 0x9d と | 0x57 ば | 0x58 び | 0x59 ぶ | 0x5a べ | 0x5b ぼ |
|---|
| STA abs,Y | TXS impl | TAS abs,Y | SHY abs,X | STA abs,X | SRE zpg,X | CLI impl | EOR abs,Y | NOP impl | SRE abs,Y |
| 0x9e な | 0x9f に | 0xa0 ぬ | 0xa1 ね | 0xa2 の | 0x70 ぱ | 0x71 ぴ | 0x72 ぷ | 0x73 ぺ | 0x74 ぽ |
|---|
| SHX abs,Y | SHA abs,Y | LDY # | LDA X,ind | LDX # | BVS rel | ADC ind,Y | JAM | RRA ind,Y | NOP zpg,X |
| 0xa3 は | 0xa4 ひ | 0xa5 ふ | 0xa6 へ | 0xa7 ほ | 0x7d ゃ | 0x7e ゅ | 0x7f ょ | 0x7c っ | 0xb9 。 |
|---|
| LAX X,ind | LDY zpg | LDA zpg | LDX zpg | LAX zpg | ADC abs,X | ROR abs,X | RRA abs,X | NOP abs,X | LDA abs,Y |
| 0xa8 ま | 0xa9 み | 0xaa む | 0xab め | 0xac も | 0x80 0 | 0x81 1 | 0x82 2 | 0x83 3 | 0x84 4 |
|---|
| TAY impl | LDA # | TAX impl | LXA # | LDY abs | NOP # | STA X,ind | NOP # | SAX X,ind | STY zpg |
| 0xb0 ら | 0xb1 り | 0xb2 る | 0xb3 れ | 0xb4 ろ | 0x85 5 | 0x86 6 | 0x87 7 | 0x88 8 | 0x89 9 |
|---|
| BCS rel | LDA ind,Y | JAM | LAX ind,Y | LDY zpg,X | STA zpg | STX zpg | SAX zpg | DEY impl | NOP # |
| 0xad や | 0xae ゆ | 0xaf よ | 0xb5 わ | 0xb6 ん | 0xc2 - | 0xc4 ! | 0xc5 ? | 0xc3 ‥ | 0xff |
|---|
| LDA abs | LDX abs | LAX abs | LDA zpg,X | LDX zpg,Y | NOP # | CPY zpg | CMP zpg | DCP X,ind | ISC abs,X |
We do also have limited control over the
L2 and
L4 bytes, which represent party member classes and may take on any value from 0x00 to 0x05. The fastest option is to take the defaults, which are
L2 = Thief = 0x01 =
ORA X,ind and
L4 = Red Mage = 0x03 =
SLO X,ind. The Red Mage
SLO can cause us problems in some circumstances, because it may change the value of the
A register. We can work around the problem by setting and
L4 = White Mage = 0x04 =
NOP zpg, but that costs a few frames. The 0x00
BRK impl and 0x02
JAM instructions are not useful to us.
The exploit payload of
4468M is clever and subtle. Because we cannot directly enter any of the jump destination address bytes we need, we must cobble together an address using bytes already in memory, on the stack and elsewhere. The general strategy is this:
- Store a value in the A register that works as the low-order byte of a destination address in the range 0xc938 to 0xc945. Because we will eventually use an RTS instruction to do the jump, and RTS adds 1 to the address, A must actually be in the range 0x37 to 0x44.
- Store a value in the X register that points to one byte before any of the multiple appearances of 0xc9 on the stack.
- Transfer the value of the X register into the SP (stack pointer) register.
- Push the value of the A register onto the stack using PHA. The stack pointer now points to an address 0xc9?? that works as a destination jump address.
- Jump to the address on the stack with an RTS instruction.
It's a lot to accomplish in 7 bytes. Let's walk through it.
0310 01 00 ORA (0x00,X) ; character class 0x01, (0x00,X) points to zero so there is no effect
0312 AF B2 8D LAX 0x8db2 ; set A = X = 0x38, the value stored at 0x8db2
0313 90 15 BCC 0x032c ; use the preexisting 0x15 operand, jump ahead to character 4 data
The 0x01 byte (representing the Thief class) encodes a two-byte
ORA instruction, which has no effect given the current state of the registers. The exploit then spends 3 bytes on an instance of the undocumented/illegal instruction
LAX, which assigns to the
A register and the
X register simultaneously. The final byte of party member 2's name must be used for a branch instruction (here
BCC) to jump to party member 4's data—we have no choice.
Assigning to
A and
X simultaneously is a good use of space, but it means we need a single value that works for both registers.
A must be in the range 0x37 to 0x44, and
X must point to one byte before a 0xc9 on the stack. Here's the relevant part of the stack:
00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
0130 92 c9 01 18 00 0b 05 92
0140 c9 01 08 00 0b 05
There are only two values that satisfy both constraints: 0x38 and 0x3f. We can take either of these values from anywhere in memory. The payload takes it from address 0x8db2, which contains 0x38.
(0x38, as a value for
A and
X, has a non-obvious advantage over 0x3f. The
SLO (0x00,X) instruction that comes from from executing character class 0x03 (Red Mage) happens to leave the
A register and processor flags unchanged when
X = 0x38, but not when
X = 0x3f. 0x3f can still work, but it requires changing party member 4's class to White Mage, which encodes a
NOP instruction. This is the part that pirohiko marked "be careful" in
Post #505719.)
0330 03 00 SLO (0x00,X) ; character class 0x03, doesn't change A when X = 0x38
0332 9A TXS ; set SP = X, SP points to one byte before 0xc9
0333 4C 9F AA JMP 0xaa9f ; jump to code that will run PHA and RTS for us
The first byte of party member 4's name is spent on a
TXS instruction to set the stack pointer. We have just 3 bytes of payload left, and we still need to do a
PHA to push
A on the stack and an
RTS to do the jump. We cannot directly encode an
RTS instruction (0x60 is not one of the bytes we have access to), but we can encode a
JMP (0x4c). So what we can do is execute an absolute
JMP to an
RTS somewhere else in the code. But a
JMP takes 3 bytes, which leaves no room for the
PHA we also need. Therefore we need to jump to a place in memory that executes
both a
PHA and (eventually) an
RTS. There's only one range of addresses that works, 0xaa93 to 0xaa9f. The payload uses 0xaa9f.
It's hard to imagine triggering the exploit any faster: start a new game, walk directly north to the stairs, walk up and down and open the menu a few times. Any improvement can only come from entering the party information faster. Is there a payload (a configuration of party member classes and names) that accomplishes the same exploit, but is faster to enter? That is what I tried and failed to find.
My strategy was to keep the basic exploit technique, but generate a number of variations on the payload (that use different opcodes and addresses but have the same effect), then time each of those payloads using an
automatic name entry script.
These are some of the variations I tried:
- LAX opcodes other than 0xaf, including 2-byte encodings.
- Source addresses for LAX other than 0x8db2.
- Destination addresses for JMP other than 0xaa9f.
- BVC as an alternative to BCC.
- Moving the PHA instruction into the payload, which gives more address freedom in the JMP to RTS.
- Opcodes and placement of NOP instructions, in payloads that are short enough to leave room for a NOP.
4468M used a 3-byte form of the
LAX instruction, opcode 0xaf, with an
absolute addressing mode. The 2-byte operand is the address of a memory location that contains 0x38. Our repertoire of name characters gives us access to addressing modes of LAX whose encoding is only 2 bytes:
| Opcode | Instruction | Bytes |
|---|
| 0xa3 | LAX X,ind | 2 bytes |
| 0xa7 | LAX zpg | 2 bytes |
| 0xaf | LAX abs | 3 bytes |
| 0xb3 | LAX ind,Y | 2 bytes |
A search of the possible second operand bytes, using the values of the registers at the time of the exploit (
X = 0x05,
Y = 0x0e), finds just two possibilities for a 2-byte
LAX instruction that places a 0x38 or 0x3f into both
A and
X. The first is
LAX (0x59),Y; that is, the indirect, Y-indexed address mode with an operand of 0x59, whose encoding is
0xb3 0x59. The second possibility is only possible when you hold certain controller inputs at the time of the exploit, in order to stage a 0x38 value in the
joy variable at address 0x0020. With this preparation you can used
LAX (0x50,X), whose encoding is
0xa3 0x50.
A 2-byte
LAX gives us a lot of flexibility. Saving 1 byte means we can put the necessary
PHA instruction directly in the payload, which means our
JMP only has to hit an
RTS, not
PHA then RTS. Only needing to hit an
RTS gives us a lot more possibilities for jump targets. Alternatively, we could continue jumping to
PHA+
RTS, and fill in the extra byte in the payload with a
NOP instruction (or any 1-byte instruction that is effectively a nop). Or, as a special case in the final position, we can just repeat the previous name byte, because the
JMP will happen before it gets executed.
If we must jump to an address that executes
PHA and then
RTS, the only possibilities are 0xaa93–0xaa9f. I found this using a script—see gameend-search-jmp.fnl in the source code below. Of these possibilities, four are equally fast to enter, following a 0x4c byte: 0xaa95, 0xaa96, 0xaa9a, and 0xaa9f. 0xaa9f is the cleanest, as it points directly to the
PHA instruction.
Besides the variations listed above, I also tried things such as substituting 2-byte
LDA followed by
TAX for 3-byte
LAX. See the file gameend-payloads.py in the source code for the full expression of variations:
Language: python
# 4468M payload, with all possibilities for branch instruction and PHA+RTS code address.
# Also consider 2-byte LDA, ADC, EOR, SRE, or RRA followed by TAX as an alternative for 3-byte LAX.
# Also consider 2-byte LDX followed by TXA as an alternative for 3-byte LAX.
(LAX3 | (LDA2 | ADC2 | EOR2) + TAX | LDX2 + TXA) + BRANCH + TXS + JMP_PHA_RTS,
# SRE and RRA can also serve to set A, though they may also change the value
# of processor flags, so try more options for the branch (BVS and BCS in
# addition to BVC and BCC).
(SRE2 | RRA2) + TAX + ALL_BRANCH + TXS + JMP_PHA_RTS,
# When we do 2-byte LDX first, we also have the option of doing TXS before TXA.
(LDX2 + TXS) + BRANCH + TXA + JMP_PHA_RTS,
# 2-byte LAX with jump to PHA+RTS, padding with 1-byte NOPs (or what are
# effectively NOPs) in every possible position. What works as a NOP differs
# in each position.
# Anything that assigns to A or X is a NOP if it happens before LAX.
# (As long as it doesn't interfere with the address mode of LAX.)
# PHA and TXS work too, as we're not using the stack yet. Also include
# TAY, even though it may interfere with the address mode of the LAX.
(NOP | TXA | TYA | TAX | TAY | LSRA | PHA | TXS) + LAX2 + BRANCH + TXS + JMP_PHA_RTS,
# After LAX, A and X are equal, so TXA and TAX are NOPs. We may now
# clobber Y with TAY or DEY. PHA and TXS are still available to us too.
LAX2 + (NOP | TXA | TAX | TAY | DEY | PHA | TXS) + BRANCH + TXS + JMP_PHA_RTS,
# After TXS, we may still interchange A and X, or clobber Y. We can also
# use a second TXS as a NOP. But we cannot use PHA anymore, because the
# stack is now set up.
LAX2 + TXS + BRANCH + (NOP | TXA | TAX | TAY | DEY | TXS) + JMP_PHA_RTS,
# In the final position, don't use a NOP, just repeat the previous byte.
LAX2 + TXS + BRANCH + JMP_PHA_RTS + REP,
# 2-byte LAX with PHA and jump to RTS.
LAX2 + TXS + BRANCH + PHA + JMP_RTS,
Sadly I could not find a use for the
TAS instruction :(
In total I generated and timed 8168 candidate payloads. The list, sorted from fastest to slowest, is
ff1-gameend-payloads-time.log.
There are a few payloads that are equally fast (52 frames) as the one used in
4468M. These are basically the same,
[LAX 0x8db2; BCC] [TXS; JMP ADDR], just using equivalent
JMP addresses:
[0xaf 0xb2 0x8d 0x90] 22 [0x9a 0x4c 0x95 0xaa] 30 52
[0xaf 0xb2 0x8d 0x90] 22 [0x9a 0x4c 0x96 0xaa] 30 52
[0xaf 0xb2 0x8d 0x90] 22 [0x9a 0x4c 0x9a 0xaa] 30 52
[0xaf 0xb2 0x8d 0x90] 22 [0x9a 0x4c 0x9f 0xaa] 30 52
The
LAX 0x8db2 instruction, despite needing 3 bytes to encode, uses characters that are all close to each other on the name entry screen.
After that, 1 frame slower, there is a similar batch that uses address 0xc3b5 in place of 0x8db2 in the
LAX instruction, and uses
BVC in place of
BCC:
[LAX 0xc3b5; BCC] [TXS; JMP ADDR].
[0xaf 0xb5 0xc3 0x50] 23 [0x9a 0x4c 0x95 0xaa] 30 53
[0xaf 0xb5 0xc3 0x50] 23 [0x9a 0x4c 0x96 0xaa] 30 53
[0xaf 0xb5 0xc3 0x50] 23 [0x9a 0x4c 0x9a 0xaa] 30 53
[0xaf 0xb5 0xc3 0x50] 23 [0x9a 0x4c 0x9f 0xaa] 30 53
The first payloads that are notably different are 3 frames slower overall. I'll highlight just a few of them. One uses a 2-byte
LAX and an explicit
PHA followed by a
JMP to
RTS:
[LAX (0x59),Y; TXS; BCC] [PHA; JMP 0xc34b].
[0xb3 0x59 0x9a 0x90] 32 [0x48 0x4c 0x4b 0xc3] 23 55
Another uses the space gained by a 2-byte
LAX to store a
TXA (which is effectively a nop) at the beginning of party member 4's name:
[LAX (0x59),Y; TXS; BCC] [TXA; JMP 0xaa9f]. The
TXA opcode 0x8a is in the upper left of the name entry palette and requires no cursor movement to enter.
[0xb3 0x59 0x9a 0x90] 32 [0x8a 0x4c 0x9f 0xaa] 23 55
This one uses an alternative 2-byte
LAX, no explicit
PHA, and a
JMP to
PHA+
RTS. The final name character, which is not executed, can be a repeat of the previous character, which requires no cursor movement:
[LAX (0x50,X); TXS; BCC] [JMP 0xaa9f].
[0xa3 0x50 0x9a 0x90] 32 [0x4c 0x9f 0xaa 0xaa] 23 55
In summary, I am impressed by the high degree of optimization achieved by
[4468] NES Final Fantasy "game end glitch" by AmaizumiUni, Spikestuff & DJ Incendration in 01:36.47. Despite my best efforts, I was not able to improve it.
My source code, data files, and development notes are available from
git clone https://www.bamsoftware.com/git/ff1.git. The commit as of this writing is d5cd9d40f891729a441eb950da98e371b43167d2.
A brief howto is: run gameend-payloads.py to generate a list of candidate payloads in gameend-payloads.fnl. Run gameend-payloads-time.fnl (in BizHawk) to time how the names take to enter. Run gameend-payloads-test.fnl to check whether they actually result in a game end glitch.