Posts for RL_scientist

Posted: 1/15/2016 12:07 PM

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

This idea is taken from this paper. You can find a lua script here. It runs with fceux/gradius, and you could easily tune it to work with other games or emulators that support lua. What you need to run it: - Lua5.1, Torch7 , and net-toolkit. Torch7 is not available on windows. - Fceux and the rom for Gradius - To expect results in a reasonable time, you need a powerful cpu (at least 4 cores) dedicated to the task. I'd recommend to monitor the temperature during the learning. It has been tested with archlinux and linux mint. On linux mint you might run into errors because of relative paths, in this case just replace it with absolute paths. What it does: it trains a deep network to predict a score for each possible action for the player, such that the best possible action has the highest score. (the score is the so-called q-value). The training includes taking random inputs. It reads the RAM to know the state of the game (but not the screenshots). The script saves periodically the network to the disk (with net-toolkit) so you can stop the learning process at any time and resume it later, and it saves the scores of the consecutive runs so you can plot the progress. I haven't had time to observe a major improvement in the learning because my computer is a potato, and up to several thousands runs could be expected before the learning looks tangible. So far I wouldn't bet on a spectacular result before more tuning but it's worth a try. You can try to tune the parameters, or the architecture of the neural network, or the state/action spaces, the reward, etc, for better results. The script is commented in this regard. I probably won't work much more on this project, however i'll let you know how it looks after several days of processing.

Posted: 1/13/2016 12:26 AM

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

Up: I'm back to Gradius, trying to learn to play the game with a deep neural network applied to Q-learning, and taking the RAM of the game as the state space. It looks very promising!

Posted: 12/13/2015 7:58 PM

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

Thanks to the kind mod that edited my message to format it correctly. It was late in the night and I surrendered to the tags, with the intent to fix it in the morning... Just a little up to let you know that I've adapted the same script to dodonpachi, with fba-rr just for fun. It's available in the repository. Launch the rom, pause it, load a savestate to a running instance of a level with the ship alive, start the script "finish_dodon.lua", and unpause to watch it learn. It works with the red ship, I haven't tried the others. There's another script to play the video when it's done, with the same method than above. (play_run_dodon.lua). I'm watching it now and it passed successfully lvl 2, I think it can do it! Not 100% sure though because I haven't used the variables for the momentum and the speed of the ship, just the position. So, yeah, conceptually it brings nothing new... I hope next updates will introduce more intelligent algorithm. edit: blue fighter does finish loop 1 of the game with fire on. it would finish the loop2 too but it crashes, idk why.

Posted: 12/13/2015 12:32 AM

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

BadPotato wrote:

Yes, any static data such as music or sprite, wouldn't be relevant 99.9% of the time. Maybe you can tune up some function to filter those... so again you don't get out of hand with too much processing. That being said, if you can take a screenshot, I guess this could be reusable data for ML. It's true that with a resolution of 256x240 pixel for the NES, it can be hard to settle down with a particuliar kind of algorithm you can use without getting out of hand. But I still think it could be relevant data.

Yes, the screenshots can be interesting inputs too. But what exactly is encoded in the ram ? is there the information of every object loaded in the screen ? i thought that if the sprites are in the RAM, the information of the screenshot is already there..?

BadPotato wrote:

As for the script using the q-value, well I'll need to check up more stuff about how Q-learning works, but I just wanted to point out there's a 0 on the numerator division when you update the Q-value : (1-1/a)*q_value[state_action]+...

The "a" is always at least equal to 1 because of how it is defined here:

Language: lua
-- update visit_counter
if visit_counter[state_action] then
  visit_counter[state_action]=
    visit_counter[state_action]+1;     
else
  visit_counter[state_action]=1;
end

a=visit_counter[state_action];

On q-value in general, there should be litterature available on the internet, or i'll look for some good papers when some more time.

BadPotato wrote:

Also do you think math.random() could help for the choice of input? In any case, your script seem to work just fine for gradius... yet for this particular game, I'm a bit unsure how well it perform in comparison to simply bruteforcing or backtracking on the long run.

Well, if by backtracking you mean "play some sequence of inputs, and when it's bad, go to the frame it was still ok and from there try different inputs"... this first script is just an elegant implementation of that, it does no more. The q-value can be a lot more than that (cf. second paper on mario in last post), this was just the more obvious thing to try to begin with. It's because the parameters taken for describing a state of the game is the horizontal scrolling and the position of the ship. Moreover, there isn't any random exploration of states at all (as we could add with math.random), it's really just deterministic backtracking (more or less if you play with the parameters). I don't think it's a good idea to use random exploration of states on this simple example, but it is for more sophisticated models in the future. Btw I've opened a git repository for the last script: https://gitlab.com/FranckC/nes_learning/ learn_gradius.lua is the one from last post (polished and with some remaining bugs corrected) play_performance.lua can be used to play the result of learning at normal speed. also: finish_gradius.lua is the same thing, but implemented with savestates (unfortunately, it doesn't finish gradius, as it was unable to pass alone the wall from the end of lvl 2, even with shots activated :(.. this is a challenge now :D ) and play_run.lua to play the result of previous script.

Posted: 12/6/2015 1:21 PM

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

This is so much fun. In less that one hour Vic Viper learn to pass the first level including the two bosses, without shooting a missile. The default action is no input, and when it's not possible you can see it dance with the missiles/ennemies, with the accuracy you want (choose 1input/frame for frame-perfect dodges, but it will be long to learn). Is there a way to disable the video to have faster emulation ?

Posted: 12/6/2015 10:37 AM
(Edited: 12/6/2015 2:00 PM)

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

edit: the script given below is much more impressive if you allow Vic Viper to shoot :p (it goes to the final boss of the level 1 in a few interations) edit2: I've fixed it so it now learn correctly the pacifist run. You can play with the parameter so it can be faster. I guess if you run it for long enough it would end the game... (but it woud be smarter to just restart it at each level with a savestate of course). Hi, thank for the answers

BadPotato wrote:

I did a bit of ML lately as well and I seriously really don't think using lua could lead to any great implementation which can be easily maintained or been improved.

I've never used Lua but I've heard it's good for ML because of the library Torch for manipulating arrays, with implementation for math operations, optimization and neural networks, and it has a good CUDA support for heavy GPU computation (too bad my computer is a potato).

BadPotato wrote:

It's true that MarI/O did quite a bit of buzz, but the result is actually quite simple, since the fitness function only check for horizontal scroll.

There have been a lot more on Mario already, check those 3 papers: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6156425&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6156425 (PM me for full version it just describes this http://www.marioai.org/) http://cs229.stanford.edu/proj2012/LiaoYiYang-RLtoPlayMario.pdf This is one learning approach that does a lot more than checking for horizontal scrolling (ennemies, position of ennemies, etc) and learn to be responsive in all kind of situation, not only pre-loaded ones. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf I haven't gone through deeply into this paper but apparently, deep networks with reinforcement learning have shown amazing results on Atari games.

BadPotato wrote:

So, if you don't go the lua-way, maybe could be to directly hack fceux in C... or: An another possibility could be to add a socket capability which can remotely call the lua API from an external language. If you are interested, I started a simple project that call the bizhawk lua API from an external app using java with UDP. Here you go:
-Server(luascript+luasocket+timer) -Client(java... but you could re-implement it to any language you want).
Of course, it would be much better to embed the udp-server directly in the emulator rather than lua.... also huh, the project I made is kind of unfinished, but you could definitely build on something like that. And I guess it go without saying, but the benefit of this approach is that you can make a real distributed system accross different computer(and on different platform), so your processing can easily scale. So far, I thought this could help for a genetic algorithm performing several prediction from a large dataset.

This is interesting, however that's not the way I want to explore (and I'm not that good of a programmer myself). I think one good computer with good GPU and several days of computations at most could be enough for most tasks. I have more specific ML-related ideas for now.

Pokota wrote:

I would suggest sharing the lua script with us if you can; it's easier for us to know what you're doing if we can see your code.

I post now a first script for Gradius I've worked on those past days. Let me give a few comment before you read/try it. First of all I'm new to Lua and my code might be bad-looking. I wanted to use the library Torch, but the main implementation of arrays (the "Tensors") doesn't support sparse matrix so it was a waste of time because my potato has 2GB of RAM. I used the native implementation of arrays in Lua instead, which is good for sparse but not very convenient and you have to do low(ish)-level stuff. It's a naïve implementation of q-learning. The only input it takes is the position of Vic Viper and if it's alive or not. The default strategy is to give no input, and then Vic tries blindly different paths if the previous one led to death. It's a very raw and blind die-and-retry but as you can see, it would probably be enough to do pacifist runs of many shoot'em up with reasonable computation time. The code looks long, but the only lines that really matter (where you can tune parameters for better learning) are:

Language: lua
-- update visit_counter
if visit_counter[state_action] then
	visit_counter[state_action]=
		visit_counter[state_action]+1;			
else
	visit_counter[state_action]=1;
end

a=visit_counter[state_action];

and

Language: lua
-- Define a reward
R = 0;
if not isAlive then
	R = -math.huge;
end

and

Language: lua
-- Update the Q-Value
q_value[state_action]=(1-1/a)*q_value[state_action]
			+(1/a)*(R+1*max_q_value);

also you can tune the input frame rate:

Language: lua
-- press those inputs for 12 frames
-- (approximately 300 apm)
for frame=1,12,1 do
	joypad.set(1,inputs);
	emu.frameadvance();
end

As @BadPotato remarks, this script only checks for horizontal scrolling and the position of Vic Viper so it's a bit disappointing. but it can be improved by adding more states (cf the Mario paper linked above): number/type/position/speed/hitbox of ennemies, of bonuses, information on the sprites where you can go or where you would crash, etc. Then Vic Viper could be trained to react to any random environment and not only the frame/position data. But that would demand a more exhaustive mapping of the RAM. To a better extent, I wonder if it's possible to take into account all of the RAM as a state. It's probably too heavy for a fast computation (2000*8bits) and too noisy for a good learning (the bits encoding the music or the sprites are useless) but maybe there's a way to pre-process it to remove the useless stuff and shrink the size. (mentionning this thread would be appreciated if you reuse the code)

Language: lua
-- Launch ROM and this script and watch. 
-- One file is created in the folder of the script
-- that records the best inputs over all runs.

-- Vitesse de l'émulation ("turbo", "normal" or "maximum")
-- What works depend on the OS
emu.speedmode("turbo");

-- Useful Inputs:
pressStart={
	up=false, down=false, 
	left=false, right=false,
	A=false, B=false,
	start=true, select=false
	};

releasePad={
	up=false, down=false, 
	left=false, right=false,
	A=false, B=false,
	start=false, select=false
	};

-- Parameters for the environment
frame_max = 200000;
x_max = 256;
y_max = 256;
action_max = 5;

best_frame_alive=0;

-- Best policy at current iteration
policy = {};

-- Counting how many time you visit each state
visit_counter={};

-- Initialize q-value
q_value={};

while true do
-- Reset the engine
  	emu.softreset()
	
-- Press start two times for starting a 1P game
	emu.frameadvance();
	emu.frameadvance();
	emu.frameadvance();
	emu.frameadvance();
	joypad.set(1, pressStart);
	emu.frameadvance();
	emu.frameadvance();
	joypad.set(1, pressStart);
		
	-- Wait for the game to begin
	-- i.e that Vic Viper appears
	while not isAlive  do
		emu.frameadvance();
		isAlive=memory.readbyte(0x0100)==1;
	end
	
	-- Initialize input recording
	input_history={};
	
	-- Number of framed passed so far
	
	frame_ni = emu.framecount();
	frame_alive = 0;

	-- Initial state:
	vic_x_i=memory.readbyte(0x07A0);
	vic_y_i=memory.readbyte(0x07C0);
	state_i=frame_alive+frame_max*(vic_x_i+x_max*vic_y_i);	
			
	-- While Vic Viper is still alive
	while isAlive do
			
		-- choose an action to perform
                -- according the best policy computed so far
		action=0;
		if policy[state_i] then
			action = policy[state_i];
		end
		
		-- Record input
		input_history[frame_alive]=action;
		
		-- compute index of (frame_alive,vic_x_i,vic_y_i,action)
		state_action=frame_alive+frame_max*(vic_x_i+x_max*(vic_y_i+y_max*action));

		-- update visit_counter
		if visit_counter[state_action] then
			visit_counter[state_action]=
				visit_counter[state_action]+1;			
		else
			visit_counter[state_action]=1;
		end
		
		a=visit_counter[state_action];
		
		-- define inputs to be pressed this frame
		if action==0 then
			inputs=releasePad;
		else
			go_up= (action==1)or(action==2);	
			go_left=(action==1)or(action==3);
			press_A=false;
			press_B=false;

			-- format
			inputs={
				up= go_up, down= not go_up,
				left=go_left, right=not go_left,
				A=press_A, B=press_B, 
				start=false, select=false
				};
		end
		
		-- press those inputs for 12 frames
		-- (approximately 300 apm)
		for frame=1,12,1 do
			joypad.set(1,inputs);
			emu.frameadvance();
		end
		
		-- Get new state
		new_frame=emu.framecount()-frame_ni;
		vic_x=memory.readbyte(0x07A0);
		vic_y=memory.readbyte(0x07C0);
		state=new_frame+frame_max*(vic_x+x_max*vic_y);

		-- Check if alive
		isAlive=memory.readbyte(0x0100)==1;
		
		-- Define a reward
		R = 0;
		if not isAlive then
			R = -math.huge;
		end

		-- max_a Q(X_t+1,a)
		max_q_value = -math.huge;
		for new_action=0,(action_max-1),1 do
			new_state_action=new_frame+
				frame_max*(vic_x+
				x_max*(vic_y+
				y_max*new_action));
			if q_value[new_state_action] then
				new_q_value=q_value[new_state_action] 
				if new_q_value > max_q_value then
					max_q_value=new_q_value;
				end
			else 
				max_q_value = 0;
			end
			
		end

		-- Initialise Q-value if needed
		-- for all actions possible for the state
		for possible_action=0,(action_max-1),1 do
			possible_state_action=frame_alive+
				frame_max*(vic_x_i+
				x_max*(vic_y_i+
				y_max*possible_action));
			if not q_value[possible_state_action] then
				q_value[possible_state_action]=0;
			end
		end

		-- Update the Q-Value
		q_value[state_action]=
			(1-1/a)*q_value[state_action]
			+(1/a)*(R+1*max_q_value);

		-- Update the state for the next loop
		vic_x_i=vic_x;
		vic_y_i=vic_y;
		frame_alive = new_frame;
		state_i=frame_alive+frame_max*(vic_x_i+x_max*vic_y_i);	
	end

	-- compute argmax(qvalue) over the actions
	qvalue_max={};
	qvalue_argmax={};
	for key,value in pairs(q_value) do
		action=math.floor(key/(frame_max*x_max*y_max));
		vic_y=math.floor(key/(frame_max*x_max))-y_max*action;
		vic_x=math.floor(key/(frame_max))-x_max*(vic_y+y_max*action);
		frame_alive=key-frame_max*(vic_x+x_max*(vic_y+y_max*action));
		state=frame_alive+frame_max*(vic_x+x_max*vic_y);
		if not qvalue_max[state] then
			qvalue_max[state]=value;
			qvalue_argmax[state]=action;
		elseif value > qvalue_max[state] then
			qvalue_max[state]=value;
			qvalue_argmax[state]=action;
		end
	end
	
        -- update the best policy only if it improves the q-value
	for key,value in pairs(qvalue_max) do
		previous_action=0;
		if policy[key] then
			previous_action=policy[key];
		end
		previous_state_action=key
			+previous_action*frame_max*x_max*y_max;
		
		if q_value[previous_state_action]~=value then
			policy[key]=qvalue_argmax[key];
		end
	end
			
	-- save the inputs if it's the best run

	if frame_alive > best_frame_alive  then
		best_time=io.open('best_time.value',"w+");
		best_time:write(frame_alive);
		best_time:write('\n');
		for key,value in pairs(input_history) do
			best_time:write(value);
			best_time:write('\n');
		end
		best_time:close();
		best_frame_alive=frame_alive;
	end
end

Posted: 12/4/2015 3:22 PM

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

Answer to myself: sorry, the problem is not here. My script is just bad, I just realised put math.randomseed(os.script()) in a loop. Tired days. I'm not familiar with Lua though. Should I delete the thread ? -.-" I will post again in the TAS technique section.

Posted: 12/4/2015 3:16 PM

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

What my script does: select random directions at each frames. What I see in normal mode: Vic Viper taking small step in random directions, back and forth. What I see in turbo mode: Vic Viper crashing into the sides of the screen.

Posted: 12/4/2015 3:02 PM

Post subject: Machine Learning project with FCEUX: technical questions

Quote

RL_scientist

He/Him

Experienced Forum User

Joined: 12/4/2015
Posts: 9
Location: MVA

Hi everybody, I study Machine Learning and for one of my course I intend to work on designing an algorithm aimed at learning to play a game by statistical means. The method used is called Reinforcement Learning (RL). In this field of machine learning, we try to develop algorithms that compute strategies revolving around improvement over a high number of simulation (die and retry style, or pavlovian learning style, but computed). I've been casually watching the TAS community for a while and this project is the best occasion to do some hacking myself, so I've downloaded FCEUX (both linux version and wine) and began to test the lua scripting tool on Gradius, helped by the datacrystal RAM mapping. At first glance I was amazed by the flexibility and the robustness of the engine. However after some digging I've found a couple of inconsistencies, hence my questions: Version: FCEUX 2.2.2 Turbo mode in Linux: it's not implemented ? maximum mode works but turbo doesn't. emu.frameadvance and turbo mode: when turbo/maximum mode is on, what is the behaviour of emu.frameadvance() ? I designed a random pathing AI that speedfire and take random direction inputs at each frame, and it doesn't behave well in turbo mode.. does it means that computations are skipped in turbo mode ? Side note: if you're interested by the project - which is super exciting since machine learning could lead to amazing TAS techniques - feel free to ask more questions on the subjects). It will happen in a short window of time for me but I'd be interested to continue it afterward. Sincerely.