Post subject: Improving Lua performance using baked functions
Player (96)
Joined: 12/12/2013
Posts: 376
Location: Russia
Lua internals Alright, lets start from some info about Lua internals. As for many other script languages, each mention of some variable in some Lua expression is actually taking time to get its value. For example, if I write following code:
Language: lua

for i = 0, 10 do print(i) end
It takes some time to get value of i and pass it into print. More important that here print is also variable. It's so called global variable. Here it's storing API function provided by environment. So, getting print function is also taking time. Again, more important, that taking here value of print is much slower than taking value of i. Because i here is local variable. Local variables is working faster than global variables, mostly because local variables doesn't have name in runtime, instead of that, they have indexes. And all actions "take local variable value" is just getting value by index, which is fast. On the other hand, global variables are stored in _G table. It's global table for all global variables. And getting some global variable is essentially taking value from table by its name. So following code is same:
Language: lua

_G["print"]("qwe") print("qwe")
And because indexing by string is more harder task, including hash calculations if it's using hashtables as implementation, or other stuff - it works much slower than just getting value by index. This means that this stupid-looking code is actually works faster:
Language: lua

local print = print -- store global value into local for i = 0, 10 do print(i) end -- use local variable
Now, a bit about print itself. As I said, it's API function provided by environment. Each time you call environment function, Lua doing following: Lua->API->Environment 1) Lua calls API function provided by environment 2) API function pulling some info about: arguments count, their types 3) API function does some sanity checks about call correctness 4) API function does actual work with environment internals Lua<-API<-Environment 5) API function is taking results from environment 6) API function is pushing results to Lua 7) Lua continues execution of Lua code So, basically each environment API call consists of two translations: from Lua to Environment and backwards. Even though you can't avoid calling API, you may try to reduce count of calls, to reduce unneccessary calls to improve speed. Case We want to draw collision data in 2D platformer. In this post I'll talk mostly about this case. What do I mean by "collision data"? Simple: if some object is changing behavior when it's touched <- it's "collision data". As for now, I'll talk only about tile based 16x16 pixel collision data. Alright. We have 16x16 collision blocks, they have pixels, each value of pixel may have different meaning, like: ground, ceil, wall, ladder, and so on. It's very handy to draw overlay over game screen where each collision pixel is colored by its color. But how code would look like? It would be something like this:
Language: lua

for x = ?, ? do -- here ? is some expression calculating bounds for x for y = ?, ? do -- here ? is some expression calculating bounds for y local tile = get_collision_tile(x, y) -- to avoid finding corresponding tile 16*16=256 times for i = 0, 15 do -- 16 by y for j = 0, 15 do -- 16 by x local v = get_collision_data(tile, i, j) -- get collision pixel if colors[v] ~= nil then -- if it has color, then you want to draw here gui.drawPixel(x+i-scrollx, y+j-scrolly, colors[v]) -- scrollx, scrolly is camera position end end end end end
If colors[v] is nil this means that v value doesn't affect behavior at all, in other words: nothing to draw. For example you should have v for air, and you'll set colors[air] = nil, to avoid draw every space where you can walk freely. (Demolition Man for Sega Genesis) This code above is actually already optimized. It'll only translate values to colors and position relative to tile into global position. Nothing to optimize, right? No! First optimization that is often done is: process only tiles that is on screen. It's done by more carefully cooking ranges of first two for loops. Example is something like that:
Language: lua

local left,right,top,bottom = scrollx,scrollx+320,scrolly,scrolly+224 for y = max(0,floor(top/256)), min(hblocks-1,floor((bottom-1)/256)) do for x = max(0,floor(left/512)), min(wblocks-1,floor((right-1)/512)) do
I don't want to explain it in details, but idea is simple. x and y should take values of block if and only if some part of it is visible on the screen. For 320x224 screen as in my case it will process at most 21x15=315 blocks. Also, for each tile we can calculate sx, sy = x-scrollx, y-scrolly and use them in gui.drawPixel call reducing few operations: get value of scrollx, scrolly, and two subtractions. Now, what is next? Next is what is all lua internals part was about. What API environment function do we see here in code above? It's gui.drawPixel. It should be something about it. Nope. (actually yes but... later) We are calling get_collision_tile and get_collision_data. But wait, it's not API environment functions. It's our custom made functions for reading game data. Yep. But what it has inside? It has multiple calls memory.read_something calls, each of them is environment API call. First, you should make local variable for them, to avoid looking for them in global table _G every time you want to read something from memory. Time to remind our goal: we want to reduce count of API calls. Here API calls are memory.read_something. How can we get rid of them? Answer is simple: read level only once, and store its whole structure on Lua side! But how can we do that? Just each time when you want to draw collision overlay, check for current level. If it changed, forget all data about previous level, and load whole level into Lua again. It'll require time only when you change level. After that, code of get_collision_tile and get_collision_data should not have any API calls. After you did that, you may think: wait, why do we always convert collision pixel value(v) into its color? (colors[v]) Yes, now we can "bake in" collision color instead of collision value straight into collision tile data! What's next? You may notice that many of collision tiles are just empty. Full of air. (wait, you said empty!) So, you don't want to spend time on those tiles. You may make some table of empty tiles and before processing 16x16 pixels check is it in list of empty. If it is, then skip processing. Note: when I say "check in list" - I don't actually mean that you should have it as some kind of list. Best way to implement that is just one big table, and if value at index of tile is true then it is empty. So, code is just
Language: lua

if empty_tiles[tile_id] then continue end
You may came up with some similar ideas, like store table of full_somevalue tiles, and so on... Well, it's way long introduction to actual new approach. Baking time! What if we would be able to make list of calls for each collision tile? For example, if it has 16 non-empty pixels, why not to store function with its arguments into list, and then just execute it? Fortunately, we able to do that! Like this:
Language: lua

calls_list = { {gui.drawPixel, 5, 7, 0xFFFF00FF}, {gui.drawPixel, 10, 10, 0xFF00FFFF} } for i = 1, #calls_list do local b = calls_list[i] -- single list entry b[1](x + b[2], y + b[3], b[4]) -- where x, y is shift to align with camera position. end
Using this technique, you may bake all collision tiles into lists of calls, during level loading on Lua side. Now we have solution for making and running call lists. How about optimizing call lists? What else do we have besides gui.drawPixel? We have gui.drawLine! Can we utilize it? Of course we can! You may think of gui.drawLine as batch of gui.drawPixel with same color. So, if we able to split collision pixels into bunch of lines, then we can draw them instead. It will reduce count of API calls and thus, it'll improve performance of script. How can we do that? For example like this. If color of previous line is same as next pixel, and there is no gap in between - just increase length of line. Here is a picture: https://i.imgur.com/k7Xcw9e.gifv Also, good implementations of drawLine method is doing sanity checks only in the beginning, and drawing all pixels straight ahead. This is done using so called "clipping". Original line reduced to segment of it that is on screen, and only pixels of that segment is efficiently enumerated, without checks for leaving bounds. On the other hand, each call of drawPixel does check weather position is on screen or out of bounds. But drawLine does it once for the whole line. Next, you may think of gui.drawBox as batch run of gui.drawLine. Can we utilize it? Of course we can! Similarly to lines, we will look for previous line, and if it has appropriate starting and ending position, then we will convert it into box. Here is a picture: https://i.imgur.com/PxeTKei.gifv Similarly, good implementation of drawBox does coordinates check only once. It's done by "clipping" - transforming coords passed to the method into new coords - part of box that is on screen. And no additional checks required during fill. Now we have a small trouble. gui.drawPixel, gui.drawLine, gui.drawBox have different arguments. To handle variable count of arguments for functions, you would need some workarounds. You may want to use unpack function, but you need to add camera shift to coordinates, so it's not an elegant way. You may check vs function that you want to call, and then call it as intended. You may make supplementary functions that use always four coordinates and then calling API function, but this will lead to unnecessary addition for coords that is not used. gui.drawPixel for example. Also, gui.drawBox has two colors: fill and outline. Checks like that, intermediate functions, and other stuff may reduce gain. Perhaps we can "bake" even more? We need to bake more! Even though Lua is interpreter, each time you include some file using require, it's actually compiling code that is loaded in some kind of intermediate language code. This step is required to have decent performance. Otherwise you'll end up spending most of the time parsing each line over and over again. This means, in theory we can generate .lua files and then load them using require, with all tiles recompiled into functions, and then simply call that functions. But in practice, there is better function. There is loadstring. It is loading code pretty much like require does, but is loading code straight from the string! Also, it doesn't check "was it loaded already?" which is annoying part if you would use require because you would have to always generate new name for generated file. That's it! Just generate Lua code for each collision tile, and then simply call it like this:
Language: lua

cdata[idx1](sx,sy) -- shift x, shift y to align with camera
And, to make functions work fast, you should make local variables for functions, like this:
Language: lua

local gui_box, gui_line, gui_pixel = gui.drawBox, gui.drawLine, gui.drawPixel
So, for the following picture: https://i.imgur.com/4kmJbBK.gifv Code is following:
Language: lua

local gui_box, gui_line, gui_pixel = gui.drawBox, gui.drawLine, gui.drawPixel return function (x,y) gui_pixel(x+0,y+0,16777113) gui_box(x+1,y+0,x+2,y+1,16777113,16777113) gui_pixel(x+2,y+2,16777113) gui_box(x+3,y+0,x+4,y+3,16777113,16777113) gui_pixel(x+4,y+4,16777113) gui_box(x+5,y+0,x+6,y+5,16777113,16777113) gui_pixel(x+6,y+6,16777113) gui_box(x+7,y+0,x+8,y+7,16777113,16777113) gui_pixel(x+8,y+8,16777113) gui_box(x+9,y+0,x+10,y+9,16777113,16777113) gui_pixel(x+10,y+10,16777113) gui_box(x+11,y+0,x+12,y+11,16777113,16777113) gui_pixel(x+12,y+12,16777113) gui_box(x+13,y+0,x+14,y+13,16777113,16777113) gui_pixel(x+14,y+14,16777113) gui_line(x+15,y+0,x+15,y+15,16777113) end
So, whole problem is reduced to following task: how to split picture into minimum count of axis aligned rectangular areas of same color? Depending on your solution of this task, you may gain more speed. In the end, if you have way to draw pictures, you may "bake" each collision tile into a picture, and then draw it instead. But solution discussed above works even if you can't draw picture, or if it is slow. Ahhh. Here is part of code that is actually implementation of algorithm above: https://gist.github.com/realmonster/93090d4ef0b40c1b88d62f8840fad675/5310f3bf8463ae6e0a766e4608f268fd28762184#file-demolition-man-lua-L107 And, according to pictures above, atm it has some bug with building blocks not as intended. Anyway, it works.
Joined: 12/16/2018
Posts: 1
That's amazing, I didn't know about any of these. It massively reduces readability, though. Gotta pay attention to that when you optimize code that you are going to keep working on. Might be a good idea to keep the un-optimized code somewhere in case you need to re-visit it later.
Post subject: Re: Improving Lua performance using baked functions
creaothceann
He/Him
Editor
Joined: 4/7/2005
Posts: 1874
Location: Germany
r57shell wrote:
[...] More important that here print is also variable. It's so called global variable. Here it's storing API function provided by environment. So, getting print function is also taking time. Again, more important, that taking here value of print is much slower than taking value of i. Because i here is local variable. Local variables is working faster than global variables, mostly because local variables doesn't have name in runtime, instead of that, they have indexes. And all actions "take local variable value" is just getting value by index, which is fast. On the other hand, global variables are stored in _G table. It's global table for all global variables. And getting some global variable is essentially taking value from table by its name.
Sounds like it's time to switch to an actually usable language...
Post subject: Re: Improving Lua performance using baked functions
Player (96)
Joined: 12/12/2013
Posts: 376
Location: Russia
creaothceann wrote:
Sounds like it's time to switch to an actually usable language...
Any suggestions?
creaothceann
He/Him
Editor
Joined: 4/7/2005
Posts: 1874
Location: Germany
Unfortunately I only do (Free) Pascal, which has Pascal Script.
Emulator Coder, Site Developer, Former player
Joined: 11/6/2004
Posts: 833
I feel like we're overdoing it here. I mean, if it comes down to it you could just write a C function and require() it into Lua and have it do all the heavy lifting. As an aside, _G["string goes here"] is faster than you think. Lua has an internal table of strings and so the hashtable lookup is really just looking up an integer field. There's no actual hashing of the function name on each iteration. And besides, complaining about the slow execution of a script feels hypocritical when you're emulating the CPU, etc of another platform the rest of the time.
Editor, Skilled player (1172)
Joined: 9/27/2008
Posts: 1085
For particularly large loops (including drawing every tile on screen, which admittedly is a few thousand iterations at most, and probably more like several hundred; it does go up some more if you want to extend the viewing area), using techniques to reduce some overhead per loop would still help performance a bit. Although, feel free to let me know how expensive a drawing function is compared to a pair of hash calls. I have a habit of using an alias for relatively common functions, such as... local R1u = memory.readbyte or local R1u = memory.read_u8 Porting a script across different emulators usually just means changing the top alias lines in this case. It's good to know it has another advantage of using a local reference directly instead of having to hash things out. In this case, at least, I'm not looking for _G["memory"]["read_u8"] now. For some reason, I only rarely rename drawing functions. I'll be sure to give an alias to them more often. Except the colors... and fonts... and order of parameters... and, eh...
Site Admin, Skilled player (1236)
Joined: 4/17/2010
Posts: 11267
Location: RU
DeHackEd wrote:
And besides, complaining about the slow execution of a script feels hypocritical when you're emulating the CPU, etc of another platform the rest of the time.
My script for Gargoyles slows down bizhawk from 800fps to 80fps, and that's already after all the improvements FatRatKnight and I could come up with. EDIT: In Gens the same script runs at crazy speed, because there's no need for Lua-C#-C crosstalk.
DeHackEd wrote:
I mean, if it comes down to it you could just write a C function and require() it into Lua and have it do all the heavy lifting.
Can this C thing keep using emulator lua API?
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
Player (96)
Joined: 12/12/2013
Posts: 376
Location: Russia
creaothceann wrote:
Unfortunately I only do (Free) Pascal, which has Pascal Script.
Probably. I just want to say that some popular scripting languages like JavaScript, Python, Perl for example are kinda same in this aspect. Don't know about Pascal Script though.
DeHackEd wrote:
I feel like we're overdoing it here. I mean, if it comes down to it you could just write a C function and require() it into Lua and have it do all the heavy lifting.
It would not help with environment API calls: like gui.drawPixel in bizhawk. It would not reduce their count. Also, when you make Lua dll you should compile it with same Lua version. This is additional headache.
DeHackEd wrote:
As an aside, _G["string goes here"] is faster than you think. Lua has an internal table of strings and so the hashtable lookup is really just looking up an integer field. There's no actual hashing of the function name on each iteration.
You're right. I didn't know that. Hashing is done as soon as string is made. It doesn't change fact that it does hashtable lookup in _G. On the other hand, local variables are on the stack. Also, I've separated version of this algorithm. Here: https://gist.github.com/realmonster/93090d4ef0b40c1b88d62f8840fad675/afc1653f74b91d7b95b9a90d80b9ddc5796237eb#file-demolition-man-lua-L82 Now it also scans tile in all eight orders, and picks best result it was able to find. You can't apply it mindlessly. You need to know full set of tiles that you want to bake. To bake tile, you just need to make new instance of tile_splitter, then fill it with add, and then call split, and it will return list of rectangles of same color. Last step is completely your responsibility: you need to convert list of rectangles into string that contains Lua code suitable for your case. Code example is there, just look for tile_splitter call in level loading, and tile_generator that does conversion of list into Lua code.
Amaraticando
It/Its
Editor, Player (158)
Joined: 1/10/2012
Posts: 673
Location: Brazil
Lua is used because it's one of the easiest languages to embed in a C (or C++) program and you have an API in which you can write the hardcore parts of the program. Combine that with the fact that it's one of the fastest scripting languages and is very small. I haven't seen it yet, but I wonder whether it's feasible to use luajit instead of the main Lua implementation (here we sacrifice portability and some Lua 5.3 features). I tried that in lsnes without success.
Site Admin, Skilled player (1236)
Joined: 4/17/2010
Posts: 11267
Location: RU
To be frank, bizhawk does provide an alternative to lua, it's called APIHawk. It's totally unused, but allows some neater stuff potentially, though it is about having compiled DLLs. https://github.com/Hathor86/HelloWorld_BizHawkTool.git Needs to be built for x64 to work with 2.x hawk. Currently the build and the src are in ExternalTools folder. I tested this in a few releases and it works only in 1.11.4. I think APIHawk has access to everything that bizhawk tools have access to. And we have this page now http://tasvideos.org/Bizhawk/LuaAlternatives.html
Warning: When making decisions, I try to collect as much data as possible before actually deciding. I try to abstract away and see the principles behind real world events and people's opinions. I try to generalize them and turn into something clear and reusable. I hate depending on unpredictable and having to make lottery guesses. Any problem can be solved by systems thinking and acting.
Player (96)
Joined: 12/12/2013
Posts: 376
Location: Russia
Alright. For sake of completeness, I've made some comparison. https://gist.github.com/realmonster/e7ceadff746750e795822e0a95f42eb1 It has 5 files: demolition_man_1_raw.lua - naive implementation of drawing collision overlay. demolition_man_2_ranges.lua - naive implementation but having tricky ranges calculation. demolition_man_3_loadlevel.lua - get rid of all memory_readsomething by loading whole level on Lua side. demolition_man_4_empty.lua - also mark empty tiles and tiles full of same color demolition_man_5_baked.lua - proposed solution Test: PC: my shitty laptop. Bizhawk 1.13.2 First level, first floors. FPS: 1-2, 1-2, 6-7, 6-8, 60 - Bang.
Skilled player (1706)
Joined: 9/17/2009
Posts: 4952
Location: ̶C̶a̶n̶a̶d̶a̶ "Kanatah"
This looks fantastic, but looking at the example code, it becomes progressively harder to understand. Is there a way to semiautomaticly convert from the 1st approach to the last file, so I can add more features to the script in a easier to understand format, then optimize it later? Alternatively deoptimize it to change a function, then reoptimize it again ?
Player (96)
Joined: 12/12/2013
Posts: 376
Location: Russia
Sadly there is no automatic way to optimize it in this way. You have to know game mechanics and tune it. The whole idea is to bake building blocks of level into premade functions. Someone should find out what are building blocks, and how they should be baked.