Post subject: Machine Learning project with FCEUX: technical questions
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
Hi everybody, I study Machine Learning and for one of my course I intend to work on designing an algorithm aimed at learning to play a game by statistical means. The method used is called Reinforcement Learning (RL). In this field of machine learning, we try to develop algorithms that compute strategies revolving around improvement over a high number of simulation (die and retry style, or pavlovian learning style, but computed). I've been casually watching the TAS community for a while and this project is the best occasion to do some hacking myself, so I've downloaded FCEUX (both linux version and wine) and began to test the lua scripting tool on Gradius, helped by the datacrystal RAM mapping. At first glance I was amazed by the flexibility and the robustness of the engine. However after some digging I've found a couple of inconsistencies, hence my questions: Version: FCEUX 2.2.2 Turbo mode in Linux: it's not implemented ? maximum mode works but turbo doesn't. emu.frameadvance and turbo mode: when turbo/maximum mode is on, what is the behaviour of emu.frameadvance() ? I designed a random pathing AI that speedfire and take random direction inputs at each frame, and it doesn't behave well in turbo mode.. does it means that computations are skipped in turbo mode ? Side note: if you're interested by the project - which is super exciting since machine learning could lead to amazing TAS techniques - feel free to ask more questions on the subjects). It will happen in a short window of time for me but I'd be interested to continue it afterward. Sincerely.
Site Admin, Skilled player (1234)
Joined: 4/17/2010
Posts: 11251
Location: RU
What exactly is wrong when you turbo?
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
What my script does: select random directions at each frames. What I see in normal mode: Vic Viper taking small step in random directions, back and forth. What I see in turbo mode: Vic Viper crashing into the sides of the screen.
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
Answer to myself: sorry, the problem is not here. My script is just bad, I just realised put math.randomseed(os.script()) in a loop. Tired days. I'm not familiar with Lua though. Should I delete the thread ? -.-" I will post again in the TAS technique section.
Site Admin, Skilled player (1234)
Joined: 4/17/2010
Posts: 11251
Location: RU
No need to delete of course, you will likely still have questions. Not sure if it should be moved either.
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
Pokota
He/Him
Joined: 2/5/2014
Posts: 778
RL_scientist wrote:
Answer to myself: sorry, the problem is not here. My script is just bad, I just realised put math.randomseed(os.script()) in a loop. Tired days. I'm not familiar with Lua though. Should I delete the thread ? -.-" I will post again in the TAS technique section.
I would suggest sharing the lua script with us if you can; it's easier for us to know what you're doing if we can see your code.
[code=lua ] lua code goes here [/code ]
...without the spaces in the tags, though.
Adventures in Lua When did I get a vest?
Joined: 1/26/2009
Posts: 558
Location: Canada - Québec
I did a bit of ML lately as well and I seriously really don't think using lua could lead to any great implementation which can be easily maintained or been improved. It's true that MarI/O did quite a bit of buzz, but the result is actually quite simple, since the fitness function only check for horizontal scroll. So, if you don't go the lua-way, maybe could be to directly hack fceux in C... or: An another possibility could be to add a socket capability which can remotely call the lua API from an external language. If you are interested, I started a simple project that call the bizhawk lua API from an external app using java with UDP. Here you go:
    -Server(luascript+luasocket+timer) -Client(java... but you could re-implement it to any language you want).
Of course, it would be much better to embed the udp-server directly in the emulator rather than lua.... also huh, the project I made is kind of unfinished, but you could definitely build on something like that. And I guess it go without saying, but the benefit of this approach is that you can make a real distributed system accross different computer(and on different platform), so your processing can easily scale. So far, I thought this could help for a genetic algorithm performing several prediction from a large dataset. In any case, let me known if you like the idea or if this sound too confusing, maybe I'll just try to eleborate. edit: fixed github branch url
Site Admin, Skilled player (1234)
Joined: 4/17/2010
Posts: 11251
Location: RU
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
edit: the script given below is much more impressive if you allow Vic Viper to shoot :p (it goes to the final boss of the level 1 in a few interations) edit2: I've fixed it so it now learn correctly the pacifist run. You can play with the parameter so it can be faster. I guess if you run it for long enough it would end the game... (but it woud be smarter to just restart it at each level with a savestate of course). Hi, thank for the answers
BadPotato wrote:
I did a bit of ML lately as well and I seriously really don't think using lua could lead to any great implementation which can be easily maintained or been improved.
I've never used Lua but I've heard it's good for ML because of the library Torch for manipulating arrays, with implementation for math operations, optimization and neural networks, and it has a good CUDA support for heavy GPU computation (too bad my computer is a potato).
BadPotato wrote:
It's true that MarI/O did quite a bit of buzz, but the result is actually quite simple, since the fitness function only check for horizontal scroll.
There have been a lot more on Mario already, check those 3 papers: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6156425&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6156425 (PM me for full version it just describes this http://www.marioai.org/) http://cs229.stanford.edu/proj2012/LiaoYiYang-RLtoPlayMario.pdf This is one learning approach that does a lot more than checking for horizontal scrolling (ennemies, position of ennemies, etc) and learn to be responsive in all kind of situation, not only pre-loaded ones. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf I haven't gone through deeply into this paper but apparently, deep networks with reinforcement learning have shown amazing results on Atari games.
BadPotato wrote:
So, if you don't go the lua-way, maybe could be to directly hack fceux in C... or: An another possibility could be to add a socket capability which can remotely call the lua API from an external language. If you are interested, I started a simple project that call the bizhawk lua API from an external app using java with UDP. Here you go:
    -Server(luascript+luasocket+timer) -Client(java... but you could re-implement it to any language you want).
Of course, it would be much better to embed the udp-server directly in the emulator rather than lua.... also huh, the project I made is kind of unfinished, but you could definitely build on something like that. And I guess it go without saying, but the benefit of this approach is that you can make a real distributed system accross different computer(and on different platform), so your processing can easily scale. So far, I thought this could help for a genetic algorithm performing several prediction from a large dataset.
This is interesting, however that's not the way I want to explore (and I'm not that good of a programmer myself). I think one good computer with good GPU and several days of computations at most could be enough for most tasks. I have more specific ML-related ideas for now.
Pokota wrote:
I would suggest sharing the lua script with us if you can; it's easier for us to know what you're doing if we can see your code.
I post now a first script for Gradius I've worked on those past days. Let me give a few comment before you read/try it. First of all I'm new to Lua and my code might be bad-looking. I wanted to use the library Torch, but the main implementation of arrays (the "Tensors") doesn't support sparse matrix so it was a waste of time because my potato has 2GB of RAM. I used the native implementation of arrays in Lua instead, which is good for sparse but not very convenient and you have to do low(ish)-level stuff. It's a naïve implementation of q-learning. The only input it takes is the position of Vic Viper and if it's alive or not. The default strategy is to give no input, and then Vic tries blindly different paths if the previous one led to death. It's a very raw and blind die-and-retry but as you can see, it would probably be enough to do pacifist runs of many shoot'em up with reasonable computation time. The code looks long, but the only lines that really matter (where you can tune parameters for better learning) are:
Language: lua

-- update visit_counter if visit_counter[state_action] then visit_counter[state_action]= visit_counter[state_action]+1; else visit_counter[state_action]=1; end a=visit_counter[state_action];
and
Language: lua

-- Define a reward R = 0; if not isAlive then R = -math.huge; end
and
Language: lua

-- Update the Q-Value q_value[state_action]=(1-1/a)*q_value[state_action] +(1/a)*(R+1*max_q_value);
also you can tune the input frame rate:
Language: lua

-- press those inputs for 12 frames -- (approximately 300 apm) for frame=1,12,1 do joypad.set(1,inputs); emu.frameadvance(); end
As @BadPotato remarks, this script only checks for horizontal scrolling and the position of Vic Viper so it's a bit disappointing. but it can be improved by adding more states (cf the Mario paper linked above): number/type/position/speed/hitbox of ennemies, of bonuses, information on the sprites where you can go or where you would crash, etc. Then Vic Viper could be trained to react to any random environment and not only the frame/position data. But that would demand a more exhaustive mapping of the RAM. To a better extent, I wonder if it's possible to take into account all of the RAM as a state. It's probably too heavy for a fast computation (2000*8bits) and too noisy for a good learning (the bits encoding the music or the sprites are useless) but maybe there's a way to pre-process it to remove the useless stuff and shrink the size. (mentionning this thread would be appreciated if you reuse the code)
Language: lua

-- Launch ROM and this script and watch. -- One file is created in the folder of the script -- that records the best inputs over all runs. -- Vitesse de l'émulation ("turbo", "normal" or "maximum") -- What works depend on the OS emu.speedmode("turbo"); -- Useful Inputs: pressStart={ up=false, down=false, left=false, right=false, A=false, B=false, start=true, select=false }; releasePad={ up=false, down=false, left=false, right=false, A=false, B=false, start=false, select=false }; -- Parameters for the environment frame_max = 200000; x_max = 256; y_max = 256; action_max = 5; best_frame_alive=0; -- Best policy at current iteration policy = {}; -- Counting how many time you visit each state visit_counter={}; -- Initialize q-value q_value={}; while true do -- Reset the engine emu.softreset() -- Press start two times for starting a 1P game emu.frameadvance(); emu.frameadvance(); emu.frameadvance(); emu.frameadvance(); joypad.set(1, pressStart); emu.frameadvance(); emu.frameadvance(); joypad.set(1, pressStart); -- Wait for the game to begin -- i.e that Vic Viper appears while not isAlive do emu.frameadvance(); isAlive=memory.readbyte(0x0100)==1; end -- Initialize input recording input_history={}; -- Number of framed passed so far frame_ni = emu.framecount(); frame_alive = 0; -- Initial state: vic_x_i=memory.readbyte(0x07A0); vic_y_i=memory.readbyte(0x07C0); state_i=frame_alive+frame_max*(vic_x_i+x_max*vic_y_i); -- While Vic Viper is still alive while isAlive do -- choose an action to perform -- according the best policy computed so far action=0; if policy[state_i] then action = policy[state_i]; end -- Record input input_history[frame_alive]=action; -- compute index of (frame_alive,vic_x_i,vic_y_i,action) state_action=frame_alive+frame_max*(vic_x_i+x_max*(vic_y_i+y_max*action)); -- update visit_counter if visit_counter[state_action] then visit_counter[state_action]= visit_counter[state_action]+1; else visit_counter[state_action]=1; end a=visit_counter[state_action]; -- define inputs to be pressed this frame if action==0 then inputs=releasePad; else go_up= (action==1)or(action==2); go_left=(action==1)or(action==3); press_A=false; press_B=false; -- format inputs={ up= go_up, down= not go_up, left=go_left, right=not go_left, A=press_A, B=press_B, start=false, select=false }; end -- press those inputs for 12 frames -- (approximately 300 apm) for frame=1,12,1 do joypad.set(1,inputs); emu.frameadvance(); end -- Get new state new_frame=emu.framecount()-frame_ni; vic_x=memory.readbyte(0x07A0); vic_y=memory.readbyte(0x07C0); state=new_frame+frame_max*(vic_x+x_max*vic_y); -- Check if alive isAlive=memory.readbyte(0x0100)==1; -- Define a reward R = 0; if not isAlive then R = -math.huge; end -- max_a Q(X_t+1,a) max_q_value = -math.huge; for new_action=0,(action_max-1),1 do new_state_action=new_frame+ frame_max*(vic_x+ x_max*(vic_y+ y_max*new_action)); if q_value[new_state_action] then new_q_value=q_value[new_state_action] if new_q_value > max_q_value then max_q_value=new_q_value; end else max_q_value = 0; end end -- Initialise Q-value if needed -- for all actions possible for the state for possible_action=0,(action_max-1),1 do possible_state_action=frame_alive+ frame_max*(vic_x_i+ x_max*(vic_y_i+ y_max*possible_action)); if not q_value[possible_state_action] then q_value[possible_state_action]=0; end end -- Update the Q-Value q_value[state_action]= (1-1/a)*q_value[state_action] +(1/a)*(R+1*max_q_value); -- Update the state for the next loop vic_x_i=vic_x; vic_y_i=vic_y; frame_alive = new_frame; state_i=frame_alive+frame_max*(vic_x_i+x_max*vic_y_i); end -- compute argmax(qvalue) over the actions qvalue_max={}; qvalue_argmax={}; for key,value in pairs(q_value) do action=math.floor(key/(frame_max*x_max*y_max)); vic_y=math.floor(key/(frame_max*x_max))-y_max*action; vic_x=math.floor(key/(frame_max))-x_max*(vic_y+y_max*action); frame_alive=key-frame_max*(vic_x+x_max*(vic_y+y_max*action)); state=frame_alive+frame_max*(vic_x+x_max*vic_y); if not qvalue_max[state] then qvalue_max[state]=value; qvalue_argmax[state]=action; elseif value > qvalue_max[state] then qvalue_max[state]=value; qvalue_argmax[state]=action; end end -- update the best policy only if it improves the q-value for key,value in pairs(qvalue_max) do previous_action=0; if policy[key] then previous_action=policy[key]; end previous_state_action=key +previous_action*frame_max*x_max*y_max; if q_value[previous_state_action]~=value then policy[key]=qvalue_argmax[key]; end end -- save the inputs if it's the best run if frame_alive > best_frame_alive then best_time=io.open('best_time.value',"w+"); best_time:write(frame_alive); best_time:write('\n'); for key,value in pairs(input_history) do best_time:write(value); best_time:write('\n'); end best_time:close(); best_frame_alive=frame_alive; end end
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
This is so much fun. In less that one hour Vic Viper learn to pass the first level including the two bosses, without shooting a missile. The default action is no input, and when it's not possible you can see it dance with the missiles/ennemies, with the accuracy you want (choose 1input/frame for frame-perfect dodges, but it will be long to learn). Is there a way to disable the video to have faster emulation ?
Joined: 1/26/2009
Posts: 558
Location: Canada - Québec
Thanks for all the resources. Keep it up.
RL_scientist wrote:
Is there a way to disable the video to have faster emulation ?
Out of of the box, I believe FCEUX is a bit limited with these kind of rendering option.
RL_scientist wrote:
To a better extent, I wonder if it's possible to take into account all of the RAM as a state. It's probably too heavy for a fast computation (2000*8bits) and too noisy for a good learning (the bits encoding the music or the sprites are useless) but maybe there's a way to pre-process it to remove the useless stuff and shrink the size.
Yeah, any static data such as music or sprite, wouldn't be relevant 99.9% of the time. Maybe you can tune up some function to filter those... so again you don't get out of hand with too much processing. That being said, if you can take a screenshot, I guess this could be reusable data for ML. It's true that with a resolution of 256x240 pixel for the NES, it can be hard to settle down with a particuliar kind of algorithm you can use without getting out of hand. But I still think it could be relevant data. As for the script using the q-value, well I'll need to check up more stuff about how Q-learning works, but I just wanted to point out there's a 0 on the numerator division when you update the Q-value : (1-1/a)*q_value[state_action]+... Also do you think math.random() could help for the choice of input? In any case, your script seem to work just fine for gradius... yet for this particular game, I'm a bit unsure how well it perform in comparison to simply bruteforcing or backtracking on the long run.
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
BadPotato wrote:
Yes, any static data such as music or sprite, wouldn't be relevant 99.9% of the time. Maybe you can tune up some function to filter those... so again you don't get out of hand with too much processing. That being said, if you can take a screenshot, I guess this could be reusable data for ML. It's true that with a resolution of 256x240 pixel for the NES, it can be hard to settle down with a particuliar kind of algorithm you can use without getting out of hand. But I still think it could be relevant data.
Yes, the screenshots can be interesting inputs too. But what exactly is encoded in the ram ? is there the information of every object loaded in the screen ? i thought that if the sprites are in the RAM, the information of the screenshot is already there..?
BadPotato wrote:
As for the script using the q-value, well I'll need to check up more stuff about how Q-learning works, but I just wanted to point out there's a 0 on the numerator division when you update the Q-value : (1-1/a)*q_value[state_action]+...
The "a" is always at least equal to 1 because of how it is defined here:
Language: lua

-- update visit_counter if visit_counter[state_action] then visit_counter[state_action]= visit_counter[state_action]+1; else visit_counter[state_action]=1; end a=visit_counter[state_action];
On q-value in general, there should be litterature available on the internet, or i'll look for some good papers when some more time.
BadPotato wrote:
Also do you think math.random() could help for the choice of input? In any case, your script seem to work just fine for gradius... yet for this particular game, I'm a bit unsure how well it perform in comparison to simply bruteforcing or backtracking on the long run.
Well, if by backtracking you mean "play some sequence of inputs, and when it's bad, go to the frame it was still ok and from there try different inputs"... this first script is just an elegant implementation of that, it does no more. The q-value can be a lot more than that (cf. second paper on mario in last post), this was just the more obvious thing to try to begin with. It's because the parameters taken for describing a state of the game is the horizontal scrolling and the position of the ship. Moreover, there isn't any random exploration of states at all (as we could add with math.random), it's really just deterministic backtracking (more or less if you play with the parameters). I don't think it's a good idea to use random exploration of states on this simple example, but it is for more sophisticated models in the future. Btw I've opened a git repository for the last script: https://gitlab.com/FranckC/nes_learning/ learn_gradius.lua is the one from last post (polished and with some remaining bugs corrected) play_performance.lua can be used to play the result of learning at normal speed. also: finish_gradius.lua is the same thing, but implemented with savestates (unfortunately, it doesn't finish gradius, as it was unable to pass alone the wall from the end of lvl 2, even with shots activated :(.. this is a challenge now :D ) and play_run.lua to play the result of previous script.
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
Thanks to the kind mod that edited my message to format it correctly. It was late in the night and I surrendered to the tags, with the intent to fix it in the morning... Just a little up to let you know that I've adapted the same script to dodonpachi, with fba-rr just for fun. It's available in the repository. Launch the rom, pause it, load a savestate to a running instance of a level with the ship alive, start the script "finish_dodon.lua", and unpause to watch it learn. It works with the red ship, I haven't tried the others. There's another script to play the video when it's done, with the same method than above. (play_run_dodon.lua). I'm watching it now and it passed successfully lvl 2, I think it can do it! Not 100% sure though because I haven't used the variables for the momentum and the speed of the ship, just the position. So, yeah, conceptually it brings nothing new... I hope next updates will introduce more intelligent algorithm. edit: blue fighter does finish loop 1 of the game with fire on. it would finish the loop2 too but it crashes, idk why.
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
Up: I'm back to Gradius, trying to learn to play the game with a deep neural network applied to Q-learning, and taking the RAM of the game as the state space. It looks very promising!
RL_scientist
He/Him
Joined: 12/4/2015
Posts: 9
Location: MVA
This idea is taken from this paper. You can find a lua script here. It runs with fceux/gradius, and you could easily tune it to work with other games or emulators that support lua. What you need to run it: - Lua5.1, Torch7 , and net-toolkit. Torch7 is not available on windows. - Fceux and the rom for Gradius - To expect results in a reasonable time, you need a powerful cpu (at least 4 cores) dedicated to the task. I'd recommend to monitor the temperature during the learning. It has been tested with archlinux and linux mint. On linux mint you might run into errors because of relative paths, in this case just replace it with absolute paths. What it does: it trains a deep network to predict a score for each possible action for the player, such that the best possible action has the highest score. (the score is the so-called q-value). The training includes taking random inputs. It reads the RAM to know the state of the game (but not the screenshots). The script saves periodically the network to the disk (with net-toolkit) so you can stop the learning process at any time and resume it later, and it saves the scores of the consecutive runs so you can plot the progress. I haven't had time to observe a major improvement in the learning because my computer is a potato, and up to several thousands runs could be expected before the learning looks tangible. So far I wouldn't bet on a spectacular result before more tuning but it's worth a try. You can try to tune the parameters, or the architecture of the neural network, or the state/action spaces, the reward, etc, for better results. The script is commented in this regard. I probably won't work much more on this project, however i'll let you know how it looks after several days of processing.
Joined: 1/26/2009
Posts: 558
Location: Canada - Québec
So, any result? I guess it all come on how you do training. So I'm not sure if random input alone can build a good enough model to apply new data. Even if you probably have a set of rule to reward/punish to handle randomness, maybe adding a human playthrough into your training set can help.