Wednesday, 21 September 2016

Yah! - or if that's cheating - Soon we'll be livin' high and wide

I'll skip the post on profiling and come back to that next time.

Tonight, at the suggestion of the author of the Atari 800 Pentagram port, I thought I'd look at the masking logic. It certainly seemed like a good candidate, given the amount of time spent in the rendering routines and the fact that the masking requires two table look-ups per byte rendered.

Rather than try to optimise it, I thought I'd disable it altogether and see what effect it had on the performance. I was surprised, and a little disappointed, that it seemed to have very little in fact. Well, perhaps not completely insignificant at 4% in the 'busy' room, taking the frame rate from 8-9fps (see screenshot below) to a solid 9fps.

The infamous 'busy' screen, with fps counter

And because I now had a metric, I disabled the Z-ordering again. I clearly underestimated the effect last time, because the performance jumped to 13fps, and a reduction in the corresponding routine (calc_display_order_and_render) from 25% all the way down to 2%.

I was now a little bummed that disabling masking and Z-order completely still resulted in 13fps, a few frames short of my target 15fps. Curious as to how this compared to the original, I fired up the ZX Spectrum and Coco3 emulators side-by-side and entered the 'busy' room on each.

To my surprise, the Coco3 (with no masking or Z-order) was significantly faster. So I re-enabled Z-order. It was still faster. So I re-enabled masking - back to the complete code - and it was still faster at 8-9fps!!! For this screen at least, I had been trying to optimise something that was already too fast!

So where does that leave the project? What I need to do now is visit a significant number of screens and compare them, side-by-side, with the ZX Spectrum original. If the Coco3 is running faster in every case, then my work here is done! In fact, I'll need to (re-enable and) tweak the throttling function so that the screens are all roughly the same speed.

EDIT: I think I'm going to back-port my fps ISR from the Coco3 to the ZX Spectrum!

Then there's the matter of beefing up the Z80 R register emulation, fixing a graphical glitch that is simply a result of not being able to emulate the Spectrum's attribute bytes, and we're done!

Livin' high and wide, one might even say!

Saturday, 17 September 2016

My heart's calculatin'

Just a quick update since it's late...

My profiler appears to be working for the most part, although any delusions I had about writing a generic 6809 profiler are pretty much dashed. I'll go into more details next post, but the nature of assembler makes it difficult - nay impossible - to identify the context (subroutine) of the executing code without some comprehensive code analysis (smells like a halting problem to me).

Regardless, with some inside knowledge of Knight Lore, I've got a pretty good handle on what's taking most of the CPU time now.

Addr   Routine           Count   Cycles
----   ----------------  -----   ------
0xE97A calc_pixel_XY_     1626 15175159 ( 58%)
0xE19E calc_display_o      249  2924962 ( 11%)
0xE8AD blit_to_screen      501  1081830 ( 4%)
0xD858 fill_window         521   845652 ( 3%)
0xE98F print_sprite        129   820025 ( 3%)
0xDB20 upd_16_to_21_2      235   561370 ( 2%)
0xE02E set_draw_objs_      235   501364 ( 2%)
0xE12B save_2d_info      10210   459450 ( 2%)
0xC852 toggle_audio_h      461   411865 ( 2%)
0xE610 get_ptr_object    12960   401760 ( 2%)
0xE967 flip_sprite        2239   380273 ( 1%)
0xE144 list_objects_t      249   337709 ( 1%)
0xE799 update_screen         2   291910 ( 1%)
0xE7BC render_dynamic      249   249443 ( 1%)
0xE790 clear_scrn_buf        2   172068 ( 1%)
0xD6CF print_sun_moon      248   145080 ( 1%)
0xDA55 upd_2_4             498   131802 ( 1%)
(snip)
0xFEF7 _IRQ_               904    38264 ( 0%)
(snip)
0xFEF4 _FIRQ_               58     1334 ( 0%)
(snip)
----------------------        ---------
Total Cycles                   26085292


That top routine is actually calc_pixel_XY_and_render(), which does some trivial calcs and then calls into print_sprite, which is obviously where all the time is spent!

More on this topic next post...

Thursday, 15 September 2016

Cut 'em out

Whilst I haven't been working on Knight Lore code per se, I have done some work that will ultimately assist in the optimisation.

I need a profiler, and since none of the Coco3 emulators have that functionality, I have to roll my own. My first preference was to hack MAME/MESS, but the barrier-to-entry is rather high. So the next candidate is Vcc.

Unfortunately Vcc currently compiles under Microsoft Visual C, which I actively try to avoid in my own projects, preferring GNU or otherwise open source toolchains. So I decided to try my hand at porting Vcc to GCC/MINGW. After hitting a few roadblocks that had me stumped for a day or two, I managed to finally get it all building and running! Mostly.

There wasn't a lot of code to modify, in fact about 10 lines in a handful of files. One problem I couldn't figure out how to overcome was a call to AfxInitRichEdit() which is part of MFC and therefore not available under GCC. Interestingly there is a RICHED20 library in the GCC/MINGW distribution which supposedly implements AfxInitRichEdit2(), but I had no luck. Somewhat encouragingly though, the documentation suggests that it's not always necessary to call this function.

So the solution for now was - Cut 'em out!

There are a couple of issues with the GCC build - for example the tape configuration dialog is mysteriously blank - but the real stick-in-the-mud for me is the fact that Knight Lore doesn't actually run. Galactic Attack ran just fine, so perhaps it's an issue with 32KB cartridges?

It's never easy, is it?

UPDATE: There was/is an issue with 32KB cartridge images; Vcc expects the two 16KB banks to be swapped - for some unknown reason - unlike MAME/MESS and also unlike you'd burn to FLASH or EEPROM for that matter. I've done a quick hack for myself so that Knight Lore will run as-is.

The tape configuration dialog was blank because it contained rich edit controls; after adding a call to explicitly load riched20.dll that's now sorted too.

That's the last of the obvious GCC/MINGW issues.

Now I should be able to start on the profiling functionality!

Friday, 26 August 2016

Keep them doggies unrollin'

More incremental optimisations...

The most significant, from an effort point-of-view at least, was unrolling the shifted (non-byte-aligned) sprite rendering routine. That took a bit to get right. Previously it was also reading & writing to each video byte twice; that's now remedied too. FWIW it makes <1fps difference on the 'moving block' screen.

It's worth noting that it requires 75 cycles to render a single (shifted) byte. That entails reading 4 bytes from data memory, performing 4 table lookups (across 2 different tables) before a read-modify-write of a single byte in video memory.

I also opted to duplicate the (small) routine that calculated the video buffer address from X,Y position. It was originally returning the result in U, however the code always then transferred it to either X or Y, depending on whether it was used as a source or destination pointer. 16-bit register transfers are actually surprisingly expensive (6 cycles) and one case required two transfers to preserve U as well. I also optimised the calculation itself to save a few cycles.

I really need to profile the code properly to identify the bottlenecks. Chipping away at the more obvious optimisations isn't having much of an effect on fps.

And just because I haven't posted any pictures for a while...

Showing the fps counter lower right (49fps)

There's also definitely some subtle graphics corruption, or rather, garbage. It appears to be limited to human->wulf transformations, and only when in certain orientations. I'll do more experimenting to nail down the exact conditions, then see if it can be reproduced on the ZX Spectrum...

EDIT: Another upside of unrolling the sprite rendering loops is that it should be easier to add support for CPC (4-colour) graphics!

Thursday, 25 August 2016

Unrollin' unrollin' unrollin'

Still haven't determined the main bottleneck but have made more progress.

After tweaking a few routines to eliminate unnecessary branches and some direct-page & register juggling I returned my focus to the sprite rendering routine. I was recently reminded that the original Z80 code used a separate execution path for byte-aligned sprites, so I decided to optimise that section.

After adding code to test for and then render specifically byte-aligned sprites, I then unrolled the loop for this case. To make matters worse (or better, depending on how you look at it), I had unnecessarily transferred a loop counter to & from DP memory and register B when I need only decrement the in-memory value. So worst case - the widest sprite is 5 bytes - it was taking 65 cycles per sprite line (the tallest sprite is 64 lines) just to test and loop. Contrast that with just 26 cycles set-up per sprite, and no per-line test and loop!

Empty screens are now too fast to control (turning accurately is problematic) and my troublesome 'moving block' screen, which started out at 5fps, is now 9-10fps - and I'm yet to optimise the non-byte-aligned (shifted) sprite rendering logic! Having said that, patching the code to render all sprites as byte-aligned did little to improve the frame rate on this particular screen, though it does improve on other screens.

The above-mentioned 'moving block' screen appears to be puzzlingly slow, considering there's only a pair of blocks to animate. Patching the code again to remove the blocks did increase the frame rate by around 2fps, but I can see little opportunity to speed up this particular sprite handler.

However during that process, I noticed that a dozen or so routines branched (near/relative) to another routine that simply jumped (far/absolute) to a third routine. Whilst this could conceivably have been done to reduce code size (slightly) it is rather detrimental to performance, so I of course remedied the situation by 'removing the middleman' so-to-speak.

I've not compared the performance with the original Z80 code since starting on the optimisations, but I would guess that it's starting to approach it now. It would be ideal if I could exceed it slightly, then tweak the frame rate smoothing logic to bring it back on par with the ZX Spectrum.

Watch this space!

Monday, 22 August 2016

Good News is No News

I'm still perplexed, but I do have some 'good' news. I had a typo that resulted in the main loop attempting to smooth the frame rate when profiling was enabled, when it should have been disabled.

So my frame rate for 'empty' screens is actually up around 43fps (not 29fps). Of course on 'busy' screens the frame rate smoothing logic did nothing so it's still limping along around 7-8fps.

What has me completely bamboozled is the fact that disabling the Z-order calculations still results in no more than 1fps improvement (and I had my hopes up when I first saw the above-mentioned typo), whilst another porter (to 6502) has seen significant improvement by optimising the Z-order algorithm. This means either the 6809 code is vastly more efficient than the translated 6502 code or, more likely, I've done something brain-dead in my attempt to circumvent it.

Oh and I think I saw some transient pixel corruption on one of the animated screens... perhaps that's a clue to something I've done wrong, or perhaps only a side-effect of disabling the Z-order calculations - I'm not sure.

So my quest continues to find the bottleneck on busier screens...

Monday, 15 August 2016

1fps

Had a bit of spare time so I started on the optimisation tonight. The focus was the sprite print routine, though from all reports that's actually not one of the hotter spots in the code. Regardless, I managed to find a few cycles here and there which equated to roughly one frame-per-second improvement on a busier screen - instead of 5/6 it's now 6/7 fps.

Nothing too tricky; found a PSHS/PULS inside a loop that was completely unnecessary, pre-computed an LEAX outside a loop, used post-increment to remove an LEAU, and finally moved a few temporary variables from the stack to the direct page.

Also found a bug in the process, but one that doesn't appear to affect the code - a word-sized variable on the direct page was only allocated a single byte, and AFAICT wasn't used at the same time as the next variable.

Next step is to locate and review my correspondence with fellow porters to identify the critical areas I should be looking at!