I'm currently in a data science / machine learning class, and I just got approval for a final project to analyze RAM in NES/SMS games for patterns. I'm hoping it can make finding interesting memory addresses easier or potentially allow us to find more than we could before, for instance abstract things like RNG.
We currently have RAM searching, but it's a little primitive. It works well for some things where you expect a specific value or a clear change, such as position or HP. However it's a bit more difficult to find more elusive things like RNG, leaving us to throw spaghetti at the wall with rerecords until we get what we want.
I would be doing this programmatically, so that makes it a bit challenging to find more interactive memory addresses. Point is, I need a way to objectively define in a general sense what an "interesting" memory address is.
I have some thoughts on what an interesting address might be:
* It affects the value of many other memory addresses from frame to frame. This should identify things such as velocity, input, and (maybe) RNG. Inferring these kinds of relations would require a quadratic amount of memory and time on the total amount of input RAM and therefore probably can't practically scale beyond 2Kb (NES) unless you can do some serious filtering on which addresses you include in the search.
* An address that holds a predictable value that isn't constant. It might be periodic or it might be affected by other memory addresses. Periodicity would be mostly reasonable if you bound it and wouldn't explode as much as a dependency network/matrix.
* ...? Not sure what else
(In case you couldn't tell, I'm really interested in finding RNG seeds)
I could use some ideas on how to approach this. They probably should have linear complexity on space (for those of you who are not well versed in CS, that essentially means that I can't reasonably bookkeep on every possible pair of addresses) or be able to filter out temp values such as stack data. It would be ideal to be able to scale this to memory sizes larger than 2 Kb.
Joined: 4/17/2010
Posts: 11495
Location: Lake Chargoggagoggmanchauggagoggchaubunagungamaugg
Well, at first you record 2 versions of the movie with some even occurring and not occurring, and find its ram value, then you just see when it's written to (a breakpoint) and tracelog that frame for both scenarios. You examine them side-by-side going back in time. That's quite an effective way to fund the real RNG affected. Repeat every time you see something that you can't explain or predict and in the end it's all written down and clear.
Here's my take on reverse engineering involved in tasing:
http://tasvideos.org/ReverseEngineering.html
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
Oh... That's good to know.
Sorta leaves me with less of a project though. Or I guess it just calls for a change of direction.
I still wonder what kinds of other patterns can be gleaned from throwing data science at the system RAM. It might turn out to be fruitful for something unexpected.
First, what kind of output do you want to get? Should this problem be solved with classification or regression? Classification would imply to define some categories such as: this byte(the value of an address) is being labeled as "some-behavior/relationship".
A regression would be more specific about his confidence level. For instance, on a scale from -1 to 1 how likely does a byte is correlated to X? Where X is a RNG or the actually seed or any specific thing(a "counter", a "constant", a variable from the "physics-engine", etc).
Since there's a decent database of labelled memory for the NES collection, in order to do actual training, you will probably have to record some gameplay session, then dump the ram data while filtering useless data(such as constant).
Here's some suggestions how you could approach the problem.
Getting the data is a problem. How much time can you afford just to collect data? If you can get data on different games, then you might get the best results while avoiding overfitting. Building and formating your validation/test sets correctly can be quite slow. I'll suggest recording session of 1 to 5 minutes for at least 20-30 games or more with some RNG action during the gameplay. There should be enough candidate for this(Tetris, Final Fantasy 1, etc...). This way you can at least define 10 games+ as your train/validation set and 10 other for your test sets.
Once you get the data, as you said filtering is important to avoid large complexity. You can probably filter out things such as constants. Filtering periodic data is possible as well, if you can detect a cycle repeating over and over, but be extra careful how you handle those.
On another note, as feos point out, the ram data alone might not be enough to find the RNG. With disassembly/debugger info, you can get quite a lot of valuable data and basically solve RNG problem in a manual fashion(using breakpoints). Or by extracting particular metrics from the tracelog such as the number of "writes for a single frame" a byte can get(on an average number of frames) and filter out the lowest number of other bytes, but this might be too specific for a game... Thing is, if you use such data, then maybe you are out of scope of your initial proposal which is something along the line of labelling memory addresses by finding abstract patterns over fuzzy data(the ram) that human have no clue about.
Another almost completely different idea(maybe easier to do):
Right now, with our current tools we don't even know right away if a set of consecutive byte is actually a long, int, short, word, byte, etc(well, despite you check directly how the disassembler write his info), does a byte is signed or unsigned? Maybe it would be easier to simply tag what kind of data bytes any given byte is. Also, to do this you may not need a large dataset as input(number of columns), you simply need some of the adjacent bytes. Then, for F numbers of frames(of actual gameplay) you can try to predict the datatype for a byte.
That being said, what kind of ML frameworks did you intend to use?
Joined: 5/1/2004
Posts: 4096
Location: Rio, Brazil
One thing that'd be interesting to know is how many times each address changed since you started logging and maybe give the last 10 values for each address could be nice too.
There are a few things I can think of where having an automated appraoch with data science concepts could help out in analzying RAM. Here are couple I thought of ordered in what difficulty I expect them to be.
Easy: tracing RNG usage: as feos mentioned, finding RNG is not so difficult. Finding all the things it impacts though can be more time consuming. If you could automatically make some lists of what values RNG impacts (and even indirectly impacts) that might be really useful.
Medium: Finding object tables: Usually enemy/sprite/player data used by a game is clumped into tables in RAM, if we could automatically get lists of object slots, hit boxes, attributes, and such that could save a lot of time.
Hard: Improving Emulation: Currently one of the challenges in NES emulation is how initial console state impacts the game as it plays out. The dfferences are subtle and might only emerge after several minutes of playing, and might go undtected until something like a desync happens. This could really use some data sceincy type of stuff to find exactly where things go awry.
Extra Hard: Stack Corruption: If you can heuristically analyze what actions effect the stack, you might be able to force situations where the stack becomes corrupted (overflow, underflow, gets overwritten etc) This would be like an automated form of glitch hunting which might be feasible using RAM. Might get some really cool results if it works.
Joined: 4/17/2010
Posts: 11495
Location: Lake Chargoggagoggmanchauggagoggchaubunagungamaugg
What Alyosha said please.
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
Joined: 3/2/2010
Posts: 2178
Location: A little to the left of nowhere (Sweden)
Isn't the hitbox data usually packed in the ROM. Then values in RAM together with some look-up functions in the game code are used to determine the hitbox data for each entity in the game? I am not sure the extraction of hitbox data is a medium-level task in terms of complexity if you want it to work for a big set of games.
I've seen games that use RAM to store the (x_ini, y_ini, x_end, y_end), others that have an index that point to the ROM, others that have poses, vertical/horizontal flips, etc.
It's probably impossible to find it semi-automatically for a generic game. Knowing a bit of ASM is required.
It depends on the game. I'll use two examples I know fairly well: Ninja Gaiden uses an "ID" associated with each object in RAM as an offset to look up hit-boxes, points, health, damage in PRG-ROM every time they are needed. Ninja Gaiden II just loads all that into RAM ($0460-$069F) when it spawns things.
I expect (but haven't looked at enough games to back this up) that the former approach was more common in earlier games.
First, what kind of output do you want to get? Should this problem be solved with classification or regression? Classification would imply to define some categories such as: this byte(the value of an address) is being labeled as "some-behavior/relationship".
I was planning on doing a classification-ish of "most interesting addresses" Hopefully, this finds things such as position, HP, RNG seeds, velocity, etc... Though I'm not exactly sure how to go about this. I could have it track patterns and trends and that would probably get a good starting point. Then I could feed addresses found with that into another script that tracks which memory addresses were read immediately before writing to those.
It would be doubly cool if I could not only label addresses as interesting, but give a good guess of what the address actually represents. Humans would probably have to correct it most of the time, but it would probably help analysis.
I originally planned on finding correlation, but I quickly realized that would scale poorly- requiring at least 32 MB to analyze an NES game (4 million address pairs, 8 bytes for a double-precision float each)
While that would probably work for non-real-time analysis of NES and maybe SMS games (requiring 512 MB for correlation analysis), it pretty much explodes if you want to do realtime analysis or later generations. Using single-precision floats wouldn't help deal with the issue of scale.
BadPotato wrote:
Since there's a decent database of labelled memory for the NES collection, in order to do actual training, you will probably have to record some gameplay session, then dump the ram data while filtering useless data(such as constant).
I was planning on using that as comparisons so I can figure out what works well.
BadPotato wrote:
Getting the data is a problem.
I plan on using Lua scripts to dump RAM at every frame, making this pretty breezy. It would probably make sense to throw in joystick input as part of the data as well. This might slow down emulation a bit, though probably not enough to be an issue.
BadPotato wrote:
Another almost completely different idea(maybe easier to do):
Right now, with our current tools we don't even know right away if a set of consecutive byte is actually a long, int, short, word, byte, etc(well, despite you check directly how the disassembler write his info), does a byte is signed or unsigned? Maybe it would be easier to simply tag what kind of data bytes any given byte is. Also, to do this you may not need a large dataset as input(number of columns), you simply need some of the adjacent bytes. Then, for F numbers of frames(of actual gameplay) you can try to predict the datatype for a byte.
That is a good thought. This could cause hiccups in analysis as it is. This might need to be solved before much other progress can be made. I'd imagine that larger datatypes are typically aligned to their size, though that assumption may not be true on 8-bit hardware. Then there's the matter of endianness to worry about.
That being said, what kind of ML frameworks did you intend to use?
Probably Numpy since I'm mostly just doing analysis right now (the class is more focused on big data analysis than actual ML). I'm not exactly sure if Neural Networks would help much. That's already sort of been done before with some interesting results.
There's going to have to be two-way communication somehow, as some work will have to be done by Lua (disassembly and event hooks)
Does Lua's popen allow r/w pipes, because that would be, by far, the simplest way to handle 2-way talking. I know you can do read-only and write-only pipes, but it isn't very clear in the documentation. (Also, does BizHawk even support popen?)
Alyosha wrote:
tracing RNG usage: as feos mentioned, finding RNG is not so difficult. Finding all the things it impacts though can be more time consuming. If you could automatically make some lists of what values RNG impacts (and even indirectly impacts) that might be really useful.
This requires disassembly for sure, but I have an idea of how to automate this process somewhat.
Alyosha wrote:
Finding object tables: Usually enemy/sprite/player data used by a game is clumped into tables in RAM, if we could automatically get lists of object slots, hit boxes, attributes, and such that could save a lot of time.
That is a great thought. I was hoping to be able to do this sort of thing.
It also points out that related data tends to be clumped together, which probably would help with automated analysis.
-----
Thanks everyone for the responses so far and keep it coming!
There's going to have to be two-way communication somehow, as some work will have to be done by Lua (disassembly and event hooks)
Does Lua's popen allow r/w pipes, because that would be, by far, the simplest way to handle 2-way talking. I know you can do read-only and write-only pipes, but it isn't very clear in the documentation. (Also, does BizHawk even support popen?)
I believe the pipe api doesn't really work well on the Windows platform. But there are workarounds that may suit your need.
First, I guess pipes could work with a custom version of lua on a console application, but that won't do it with Bizhawk. At this point, you may as well ask for the BizHawk dev to directly add some way to IPC with Bizhawk... I guess such feature would be awesome.
Meanwhile, you can lookup the named pipe from the lua winapi library. I'm not sure how well it runs if you intent to make a lot call between the emulator and your python program. But this might just be the easiest solution.
Another idea is to checkout for a message queue library. The zeromq lua client library with a zeromq python server might work out. If you want to practice a bit, before attempting this, I'll suggest trying out some tutorials with rabbitmq for a single client/server with python, since the tutorial is a bit more user-friendly than zeromq.
Yet, another alternative idea is to use the luasocket library, but it can a challenge in itself.
One last idea if you are not a fan of async with lua, involves using the file system directly to send your data and simply poll for a last change on your files to get notified, yet this is definitely not an efficient solution.
Beefster wrote:
Probably Numpy since I'm mostly just doing analysis right now (the class is more focused on big data analysis than actual ML).
Ok, then maybe you should simply look for a basic algorithm such as: KNN, Bayes, Decision Tree, Random Forest or SVM. You should be able to find a couple basic examples on the web about how to write those with only a couple of lines of code from scratch.
I ended up using Bayes for type inference (which is all I really had time for - no "interesting" detection mostly partly I really had no idea how to objectively define an "interesting" address- maybe I should have used a neural network?), but it ended up not working at all. I probably needed a better model to determine the likelihood of data given a type as it really ended up only finding volatile addresses and calling them u32s and u16s. It was also really slow, so that was kind of a bummer.
I'll put up the paper I wrote at some point so that you know what doesn't work.
On the other hand, I was able to make a really straightforward and fast LFSR RNG detector which easily detected the known RNG in SMB3 and found some currently undocumented likely RNG addresses in Dr. Mario (0x00EA-0x00EB) and Zelda 1 (0x0017-0x0023).
You can get my LFSR detector here. You will need Python 3 and numpy to run the analysis tool and (naturally) Bizhawk to dump RAM from a game. I ended up not using pipes. Maybe I will someday.
I'll put it up on GitHub eventually. I'm just feeling lazy right now.