Author Topic: SD to psram performance  (Read 10553 times)


Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #30 on: September 11, 2010, 02:18:50 AM »
 :P

You DO realize that

Code: [Select]
        lw $9,($15)
is merely

Code: [Select]
        lw $9,0x0000($15)
don't you?

From the MIPS user manual...

Quote
Load and Store instructions move data between memory and general registers. They are all I-type instructions, since the only addressing mode supported is base register + 16-bit immediate offset.

So any difference in speed you think you got out of changing all those 0x6060($15) into ($15) is all in your head... and imprecision in how you time things.

There's a reason I simply loaded the top of register $15 and used 0x6060($15).  ^-^
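
To make the point concrete, here is a small C sketch (an illustration, not code from the menu source) that builds the I-type machine word for `lw` by hand. The helper name `encode_lw` is made up for this example; the field layout (6-bit opcode, 5-bit base register, 5-bit target register, 16-bit immediate) and the `lw` opcode 0x23 come from the MIPS instruction set.

```c
#include <stdint.h>

/* Hypothetical helper: encode "lw rt, imm(base)" as a MIPS I-type word.
   Layout: opcode(6) | base(5) | rt(5) | immediate(16). */
static uint32_t encode_lw(unsigned rt, unsigned base, int16_t imm)
{
    const uint32_t OP_LW = 0x23u;

    return (OP_LW << 26)
         | ((uint32_t)(base & 31u) << 21)
         | ((uint32_t)(rt & 31u) << 16)
         | (uint32_t)(uint16_t)imm;
}
```

`encode_lw(9, 15, 0)` gives 0x8DE90000, which is the single word behind both `lw $9,($15)` and `lw $9,0x0000($15)`. `encode_lw(9, 15, 0x6060)` gives 0x8DE96060: only the low 16 bits differ. The offset field is always present in the instruction, so writing it as zero cannot save anything.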

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #31 on: September 11, 2010, 02:41:48 AM »
Well, I wouldn't even dare to challenge you in asm  :-[ , but I've made so many changes: http://code.google.com/p/neo-myth-menu/source/detail?r=215#
So it might have been something else.

Anyway, the timings are still the ones I just posted: 32MB ROM = 1m24s  8)
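
For scale, that timing can be turned into an average transfer rate. This is a quick sketch, assuming "32MB" means 32 MiB (33,554,432 bytes) and that the 84 seconds cover the whole load; `throughput_kib_per_s` is just a name picked for this example.

```c
#include <stdint.h>

/* Average throughput in KiB/s for a transfer of `bytes` taking `seconds`. */
static double throughput_kib_per_s(uint64_t bytes, double seconds)
{
    return ((double)bytes / 1024.0) / seconds;
}
```

`throughput_kib_per_s(32u * 1024u * 1024u, 84.0)` comes to roughly 390 KiB/s, so each second shaved off is worth a little over 1% of the total load time.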
And I improved your routine a bit more:
-Replaced 4 constant loads with registers (2 are saved/restored). This could also be optimized with a few constants, but oh well.
-Removed 4 immediate ORs (new)
-Replaced 4 ANDs with their immediate-AND equivalents (new)
(Again, those improvements can't really cure the slow reads/writes from/to PSRAM, but hey, it was a challenge to improve an already fast loop, wasn't it?)

Try it, compare it with the previous revision and see.

..End of the hunt-for-the-next-second challenge, I guess.
« Last Edit: September 11, 2010, 04:35:26 AM by Conle »

Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #32 on: September 11, 2010, 05:07:50 AM »
While they may not make any noticeable difference to the speed, every little bit helps you learn more about assembly programming, which could yield BIG results in something else later. If you remember, your initial 68000 assembly was god-awful, but now you're doing much better.

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #33 on: September 11, 2010, 03:09:40 PM »
Quote
While they may not make any noticeable difference to the speed

Of course there is no noticeable difference unless you do timing. (By the way, sometime I'll make a profiler using your precise timing routines,
since in the final timing those little delay(30)'s before loading the game are counted in.) And there is no way to get another 20 seconds
like in the last build, since one read from a GBA-mapped address costs the same as 350-400 nops. (Try replacing hw_delay() with a
load from PSRAM and you'll notice it works the same way without any issues! You probably know this already, but I just had to mention it.) So yeah, I just wanted to get that next second before I stop
bothering.

The last thing I thought of optimizing was the progress bar: it added about 1.5s~2.0s of loading time for 64Mb games the last time I tried, so it probably adds a bit of overhead for large
ROMs. Maybe we could do the same thing as with the MD menu, but it's such a fast processor that it would look lame.

Quote
every little bit helps you learn more about assembly programming, which could yield BIG results in something else later.

Yep , there's a personal benefit behind this  :D

Quote
If you remember, your initial 68000 assembly was god-awful, but now you're doing much better

Yes, translating C to ASM is even worse than generated code without optimizations.
Now I don't see instructions. I just see how they would look in a higher-level language. That's a skill
I got while making that compiler.
 8)

Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #34 on: September 12, 2010, 02:39:24 AM »
Well, the only thing that changes in the loader screen is the progress bar. We could draw the background and text to both frame buffers, then on the update just draw the new progress bar by itself, leaving the rest of the screen alone.

So maybe call progress_screen() to start with the text and fill values to make it draw the screen, then with -1 for those during the loop to just update the progress bar.

Code: [Select]
void progress_screen(char *str1, char *str2, int frac, int total, int bfill)
{
    display_context_t dcon;
    char temp[40];

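    // get next buffer to draw in
    dcon = lockVideo(1);

    // Passing (char *)-1 for both strings and -1 for bfill means "bar only":
    // skip the full redraw (and the clear) so the rest of the screen is left alone.
    if ((str1 != (char *)-1) || (str2 != (char *)-1) || (bfill != -1))
    {
        graphics_fill_screen(dcon, 0);
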
        if (loading && (bfill == 4))
        {
            drawImage(dcon, loading, loading_w, loading_h);
        }
        else if ((bfill < 3) && (pattern[bfill] != NULL))
        {
            rdp_sync(SYNC_PIPE);
            rdp_set_default_clipping();
            rdp_enable_texture_copy();
            rdp_attach_display(dcon);
            // Draw pattern
            rdp_sync(SYNC_PIPE);
            rdp_load_texture(0, 0, MIRROR_DISABLED, pattern[bfill]);
            for (int j=0; j<240; j+=pattern[bfill]->height)
                for (int i=0; i<320; i+=pattern[bfill]->width)
                    rdp_draw_sprite(0, i, j);
            rdp_detach_display();
        }

        graphics_set_color(gTextColors.sel_game, 0);
        printText(dcon, str1, 20 - strlen(str1)/2, 3);

        graphics_set_color(gTextColors.usel_game, 0);
        strncpy(temp, str2, 34);
        temp[34] = 0;
        printText(dcon, temp, 20-strlen(temp)/2, 5);
        for (int ix=34, iy=6; ix<strlen(str2); ix+=34, iy++)
        {
            strncpy(temp, &str2[ix], 34);
            temp[34] = 0;
            printText(dcon, temp, 20-strlen(temp)/2, iy);
        }

        // show display
        unlockVideo(dcon);

        // get next buffer to draw in
        dcon = lockVideo(1);
        graphics_fill_screen(dcon, 0);

        if (loading && (bfill == 4))
        {
            drawImage(dcon, loading, loading_w, loading_h);
        }
        else if ((bfill < 3) && (pattern[bfill] != NULL))
        {
            rdp_sync(SYNC_PIPE);
            rdp_set_default_clipping();
            rdp_enable_texture_copy();
            rdp_attach_display(dcon);
            // Draw pattern
            rdp_sync(SYNC_PIPE);
            rdp_load_texture(0, 0, MIRROR_DISABLED, pattern[bfill]);
            for (int j=0; j<240; j+=pattern[bfill]->height)
                for (int i=0; i<320; i+=pattern[bfill]->width)
                    rdp_draw_sprite(0, i, j);
            rdp_detach_display();
        }

        graphics_set_color(gTextColors.sel_game, 0);
        printText(dcon, str1, 20 - strlen(str1)/2, 3);

        graphics_set_color(gTextColors.usel_game, 0);
        strncpy(temp, str2, 34);
        temp[34] = 0;
        printText(dcon, temp, 20-strlen(temp)/2, 5);
        for (int ix=34, iy=6; ix<strlen(str2); ix+=34, iy++)
        {
            strncpy(temp, &str2[ix], 34);
            temp[34] = 0;
            printText(dcon, temp, 20-strlen(temp)/2, iy);
        }
    }

    if (frac)
        graphics_draw_box(dcon, 32, 160, 256*frac/total, 6, graphics_make_color(0x3F, 0xFF, 0x3F, 0xFF));
    if (frac<total)
        graphics_draw_box(dcon, 32+256*frac/total, 160, 256-256*frac/total, 6, graphics_make_color(0xFF, 0x3F, 0x3F, 0xFF));

    // show display
    unlockVideo(dcon);
}

If that wasn't fast enough, you could draw the bar directly to the currently displaying buffer.
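
The intended call pattern can be sketched as follows. This is only an illustration: `progress_screen_stub`, `NO_STR` and `load_with_progress` are made-up names, the strings are placeholders, and the stub merely counts what the real function would draw, so the pattern can be tried without the N64 libraries.

```c
#include <stdint.h>

/* Hypothetical marker for "leave the text alone", i.e. the -1 described above. */
#define NO_STR ((char *)(intptr_t)-1)

/* Stand-in for progress_screen(): counts full redraws and bar updates. */
static int full_redraws, bar_updates;

static void progress_screen_stub(char *str1, char *str2, int frac, int total, int bfill)
{
    (void)frac;
    (void)total;
    if (str1 != NO_STR || str2 != NO_STR || bfill != -1)
        full_redraws++;   /* background and text go to both frame buffers */
    bar_updates++;        /* the bar itself is drawn on every call */
}

/* Hypothetical loader loop: one full draw, then bar-only updates per chunk. */
static void load_with_progress(int chunks)
{
    progress_screen_stub("Loading", "game.z64", 0, chunks, 4);
    for (int i = 1; i <= chunks; i++) {
        /* ... copy one chunk from SD to PSRAM here ... */
        progress_screen_stub(NO_STR, NO_STR, i, chunks, -1);
    }
}
```

With 8 chunks this performs one full redraw and nine bar draws, instead of nine full-screen redraws.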

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #35 on: September 12, 2010, 03:28:50 AM »
I'll try it and post back.

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #36 on: October 13, 2010, 01:47:41 AM »
For the sake of it, I'm posting another implementation here, but at the end I explain why it's a waste of time and why we have to know where to stop :)


Facts:
-MIPS has 32 GPRs. -- forget cop0, FPRs --
-We can use up to 15 non-callee-saved registers and we can use 10 more callee-saved registers.
-The SD multi-block read routine reads blocks of 512 bytes.
-From a GBA-mapped address we can read a byte, a word, or a double word --no quads--.
-To map a different GBA region we have to tell the core, and that can take a few cycles to complete.
-Writing data to PSRAM requires a load (or 400 nops) right after a store.
-PSRAM needs region switching every 128Mb
-A CRC follows each sector

So we have:
-25 registers to use
-The sd multi-block read routine
-512 bytes for each sector
-Up to dword stores/loads
-GBA region switching -- which is slow --
-PSRAM timing
-PSRAM region switching
-CRC tossing
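
The bookkeeping implied by the sector size and the region limit can be sketched in C. This is only a sketch under one loud assumption: that "128Mb" means 128 megabits, i.e. 16 MiB per PSRAM region. The function names are invented for the example.

```c
#include <stdint.h>

#define SECTOR_BYTES 512u
/* ASSUMPTION: "128Mb" = 128 megabits = 16 MiB per PSRAM region. */
#define REGION_BYTES (16u * 1024u * 1024u)

/* Number of 512-byte sectors needed to hold rom_bytes. */
static uint32_t sector_count(uint32_t rom_bytes)
{
    return (rom_bytes + SECTOR_BYTES - 1u) / SECTOR_BYTES;
}

/* Index of the PSRAM region that contains byte `offset`. */
static uint32_t region_of(uint32_t offset)
{
    return offset / REGION_BYTES;
}

/* True when the sector starting at `offset` is the first one of a new
   region, i.e. the (slow) region switch has to happen before writing it. */
static int needs_region_switch(uint32_t offset)
{
    return offset != 0u && (offset % REGION_BYTES) == 0u;
}
```

Under that assumption a 32 MiB ROM is 65536 sectors and crosses a region boundary exactly once, so the switch cost is negligible next to the per-word PSRAM wait.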

Additional information:

To read a sector we have to do 512 single-byte reads and form 128 32-bit values.
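
In C, forming one of those 32-bit values looks like this. It is a minimal sketch of the shift-and-OR step only; `pack_be32` is a name chosen for the example, and the first byte read is assumed to be the most significant one (Z64 order).

```c
#include <stdint.h>

/* Combine four bytes into one 32-bit word, first byte most significant. */
static uint32_t pack_be32(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return ((uint32_t)b0 << 24)
         | ((uint32_t)b1 << 16)
         | ((uint32_t)b2 << 8)
         |  (uint32_t)b3;
}
```

For example, `pack_be32(0x12, 0x34, 0x56, 0x78)` yields 0x12345678. Doing this 128 times consumes the 512 bytes of one sector.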

Execution:


save non-callee
lui $t0,0xB000 /*reserved psram offset*/
main loop:
   map to gba offset
..
..
..
..
   sector copy loop:
      read byte1 to next free non-callee reg
      read byte2 to next free non-callee reg
      read byte3 to next free non-callee reg
      read byte4 to next free non-callee reg /*4 regs are already reserved */
      store to next free non-callee reg ( reg with byte 1 << 24 )
      store to next free non-callee reg ( reg with byte 2 << 16 )
      store to next free non-callee reg ( reg with byte 3  << 8 )
      "or them" : or $resReg,res0,res1
      "or them" : or $resReg,$resReg,res2
      "or them" : or $resReg,$resReg,res3
      Good , now we have the 32bit value!
      NOW SAVE ON STACK the 32bit result
      toss checksum bits!
      adjust offsets
      branch to sector copy

   stack to psram:
   At this point stack is perfectly aligned to 128 * 4

   write enable psram
   select offset
   branch : Check written size , if more than 128Mb switch region once and set flag to a register
      ::  => lui $t0,0xB000 <= ::
   skip:
   disable interrupts

   /*Hardcoded copy of 128 32bit values -- reverse order -- */
   
   lw $t1,512-4($sp)
   sw $t1,($t0)
   addiu $t0,$t0,4

   lbu $t1,(0xB0000000) /*wait bus*/

   lw $t1,512-8($sp)
   sw $t1,($t0)
   addiu $t0,$t0,4

   lbu $t1,(0xB0000000) /*wait bus*/

   lw $t1,512-12($sp)
   sw $t1,($t0)
   addiu $t0,$t0,4

   lbu $t1,(0xB0000000) /*wait bus*/

   ...
   ..
   .
   .
   ..
   ...

   lw $t1,0($sp)
   sw $t1,($t0)
   addiu $t0,$t0,4

   lbu $t1,(0xB0000000) /*wait bus*/

   addiu $sp,$sp,(128 * 4) /* pop stack: the stack grows down, so popping adds */
   branch to mainloop
..
..
..
..
restore non-callee!

Handle the loop with the 10 callee registers -- stack handling included --


Short explanation of the code:
It's simple -- we abuse the stack to assemble the sector's data, then we write it to PSRAM
with a hardcoded (unrolled) copy, without passing it through buffers.
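
The same two-phase idea, modelled in portable C rather than MIPS assembly. This is a sketch, not the proposed routine: `src` stands in for the bytes delivered by the SD read, `dst` for the PSRAM window, and `bus` for the address that gets a dummy read after every store. All names are placeholders, and CRC tossing and region switching are left out.

```c
#include <stdint.h>

#define WORDS_PER_SECTOR 128

/* Copy one 512-byte sector: stage it on the stack, then stream it out. */
static void copy_sector_staged(const uint8_t src[512],
                               volatile uint32_t *dst,
                               volatile const uint8_t *bus)
{
    uint32_t staged[WORDS_PER_SECTOR];   /* local array, so it lives on the stack */

    /* Phase 1: assemble all 128 words before touching PSRAM. */
    for (int i = 0; i < WORDS_PER_SECTOR; i++) {
        staged[i] = ((uint32_t)src[4 * i]     << 24)
                  | ((uint32_t)src[4 * i + 1] << 16)
                  | ((uint32_t)src[4 * i + 2] << 8)
                  |  (uint32_t)src[4 * i + 3];
    }

    /* Phase 2: write the words out, with a dummy load after each store
       to give the PSRAM write time to complete. */
    for (int i = 0; i < WORDS_PER_SECTOR; i++) {
        dst[i] = staged[i];
        (void)*bus;   /* volatile read: this is the "wait bus" step */
    }
}
```

The assembly plan unrolls phase 2 completely and keeps the pointer in a register. The C version only shows the data flow: all the SD reads happen first, then PSRAM is written in one uninterrupted burst.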

Final comments:
With the above I'm pretty sure we could easily get under 1 minute for a 32MB game.
But why did I say that this isn't a solution?

Here is why :

-Only Z64 binaries are supported, or we need 4 versions of the copy loop to fix endianness
-It's a hack! The point is that anyone can use the SDK with their own code, without having to build a special
module just to copy a game!
-The code will become extremely messy! The assembly loop would have to call the progress function and the BC sim function
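
To show what the endianness point costs, here is a hedged C sketch of the per-word fix-up that each extra loop variant would have to fold in. The enum and function names are invented for the example. Z64 (big-endian), V64 (byte-swapped) and little-endian images are the commonly seen layouts; the word-swapped case is only a guess at what the fourth variant would be.

```c
#include <stdint.h>

/* Hypothetical names for the byte layouts a ROM image may arrive in. */
enum rom_layout { LAYOUT_Z64, LAYOUT_V64, LAYOUT_N64_LE, LAYOUT_WORD_SWAPPED };

/* Convert one word, assembled first-byte-highest from the file, to Z64 order. */
static uint32_t to_z64(uint32_t w, enum rom_layout layout)
{
    switch (layout) {
    case LAYOUT_V64:           /* bytes swapped within each 16-bit half */
        return ((w & 0xFF00FF00u) >> 8) | ((w & 0x00FF00FFu) << 8);
    case LAYOUT_N64_LE:        /* all four bytes reversed */
        return (w >> 24) | ((w >> 8) & 0x0000FF00u)
             | ((w << 8) & 0x00FF0000u) | (w << 24);
    case LAYOUT_WORD_SWAPPED:  /* 16-bit halves exchanged (assumed variant) */
        return (w >> 16) | (w << 16);
    case LAYOUT_Z64:
    default:                   /* already in the order the console expects */
        return w;
    }
}
```

Using the standard header word 0x80371240 as a check: a V64 image starts with 0x37804012 and a little-endian image with 0x40123780, and both map back to 0x80371240. In the unrolled assembly copy, each layout would need its own shift/OR sequence, which is where the "4 versions" come from.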


That's it.
I hope it was interesting to read :D

ps - the above solution is "universal". We can do the same with MD/SNES, but don't expect such high performance from MD/SNES. We have to do timing on stack accesses ^-^
« Last Edit: October 13, 2010, 02:16:10 AM by Conle »