Author Topic: SD to psram performance  (Read 10609 times)

Offline jsmith

  • Newbie
  • *
  • Posts: 1
  • NEO newbie
SD to psram performance
« on: September 05, 2010, 09:39:24 AM »
Hi, I have briefly looked at the N64 menu code. As a starting point I looked at the SD card to psram transfer. I think the transfer is pretty quick but improving the speed is always a good thing. Am I correct in thinking the copying of data from ram to psram is the major performance bottleneck? Particularly the bus delay in this section of code:

Code: [Select]
void neo_xferto_psram(void *src, int pstart, int len)
{
    // copy data
    for (int ix=0; ix<len; ix+=4)
    {
        *(vu32 *)(0xB0000000+pstart+ix) = *(u32 *)(src+ix);
        bus_delay(96);
    }
}

I don't have the datasheet for the psram, but is 96 cycles the required wait time? If so, does the compiler unroll this loop?

Code: [Select]
inline void bus_delay(int cnt)
{
    for(int ix=0; ix<cnt; ix++)
        asm("\tnop\n"::);
}

If it doesn't, there is an overhead of at least two cycles for each nop (incrementing the loop counter and the conditional branch), which means roughly 192 cycles are wasted for each 32-bit word transferred to psram. To fix this, the loop could be unrolled or written in asm, the actual number of cycles per iteration could be worked out, and the integer argument could then be divided down to the correct value, which would give a performance boost. Again, I've only briefly looked at the code and am trying to get an understanding of it. Any feedback on why this observation is right or wrong would be appreciated.
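
To make that arithmetic concrete, here is a minimal C sketch of the "divide the requested cycles by the cost of one iteration" idea. The names `CYCLES_PER_ITER`, `delay_iterations` and `delay_cost` are hypothetical, and the value 3 (nop, counter increment, branch) is an assumption for illustration, not a measured figure for the N64 CPU.

```c
#include <stdint.h>

/* Assumed cost of one pass through the delay loop: nop + increment + branch.
 * This is a guess for illustration, not a measured value. */
#define CYCLES_PER_ITER 3u

/* Convert a desired wait in CPU cycles into loop iterations, rounding up
 * so the wait is never shorter than requested. */
static inline uint32_t delay_iterations(uint32_t cycles)
{
    return (cycles + CYCLES_PER_ITER - 1u) / CYCLES_PER_ITER;
}

/* Cycles spent for a given iteration count under the same assumption. */
static inline uint32_t delay_cost(uint32_t iterations)
{
    return iterations * CYCLES_PER_ITER;
}
```

Under this assumption bus_delay(96) costs about 288 cycles per 32-bit word (96 useful plus 192 of overhead), and a true 96-cycle wait would need only 32 iterations.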


Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #1 on: September 05, 2010, 10:30:22 AM »
Quote
Hi, I have briefly looked at the N64 menu code. As a starting point I looked at the SD card to psram transfer. I think the transfer is pretty quick but improving the speed is always a good thing. Am I correct in thinking the copying of data from ram to psram is the major performance bottleneck?

No, actually, it isn't. The major bottleneck is the time it takes to read the SD interface on the N64 Myth. That's a "rom" cycle to the N64, and rom READ cycles are slow since the N64 was made at a time when fast roms were expensive. In recent menus, I decrease the time in reading the rom space that the SD interface uses, but it's still pretty slow, and there's not much we can do about it with the current core.


Quote
Particularly the bus delay in this section of code:

Code: [Select]
void neo_xferto_psram(void *src, int pstart, int len)
{
    // copy data
    for (int ix=0; ix<len; ix+=4)
    {
        *(vu32 *)(0xB0000000+pstart+ix) = *(u32 *)(src+ix);
        bus_delay(96);
    }
}

I don't have the datasheet for the psram, but is 96 cycles the required wait time? If so, does the compiler unroll this loop?

No, and if it did, we'd have to increase it to compensate. That still doesn't make it the slow part of the transfer, believe it or not. It's maybe 10 to 15 seconds of the total transfer time of 1:45 (for a 32MB rom).
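
As a rough check of that estimate, here is a small C sketch of the bound it implies. The helper names are hypothetical and this is only arithmetic on the numbers quoted above (10 to 15 seconds out of 1:45, i.e. 105 seconds), not code from the menu.

```c
/* Best-case total time if a part of the work taking `part_s` seconds
 * out of `total_s` seconds were made infinitely fast. */
static double best_case_seconds(double total_s, double part_s)
{
    return total_s - part_s;
}

/* Fraction of the total time spent in that part. */
static double fraction_of_total(double total_s, double part_s)
{
    return part_s / total_s;
}
```

With 15 of 105 seconds spent in the copy, the copy is about 14% of the total, and even a free copy would still leave a 90 second transfer.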

Quote
Code: [Select]
inline void bus_delay(int cnt)
{
    for(int ix=0; ix<cnt; ix++)
        asm("\tnop\n"::);
}

If it doesn't, there is an overhead of at least two cycles for each nop (incrementing the loop counter and the conditional branch), which means roughly 192 cycles are wasted for each 32-bit word transferred to psram. To fix this, the loop could be unrolled or written in asm, the actual number of cycles per iteration could be worked out, and the integer argument could then be divided down to the correct value, which would give a performance boost. Again, I've only briefly looked at the code and am trying to get an understanding of it. Any feedback on why this observation is right or wrong would be appreciated.

That delay is needed when you init the SD card because they init at a slow speed for compatibility. Once the card is initialized, that loop is set to 0. The majority of the transfer is spent right here:

Code: [Select]
int neo2_recv_sd_multi(unsigned char *buf, int count)
{
    int res;

    asm(".set push\n"
        ".set noreorder\n\t"
        "lui $15,0xB30E\n\t"            // $15 = 0xB30E0000
        "ori $14,%1,0\n\t"              // $14 = buf
        "ori $12,%2,0\n"                // $12 = count

        "oloop:\n\t"
        "lui $11,0x0001\n"              // $11 = timeout = 64 * 1024

        "tloop:\n\t"
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "andi $2,$2,0x0100\n\t"         // eqv of (data>>8)&0x01
        "beq $2,$0,getsect\n\t"         // start bit detected
        "nop\n\t"
        "addiu $11,$11,-1\n\t"
        "bne $11,$0,tloop\n\t"          // not timed out
        "nop\n\t"
        "beq $11,$0,exit\n\t"           // timeout
        "ori %0,$0,0\n"                 // res = FALSE

        "getsect:\n\t"
        "ori $13,$0,128\n"              // $13 = long count

        "gsloop:\n\t"
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4 => -a-- -a--
        "lui $10,0xF000\n\t"            // $10 = mask = 0xF0000000
        "sll $2,$2,4\n\t"               // a--- a---

        "lw $3,0x6060($15)\n\t"         // rdMmcDatBit4 => -b-- -b--
        "and $2,$2,$10\n\t"             // a000 0000
        "lui $10,0x0F00\n\t"            // $10 = mask = 0x0F000000
        "and $3,$3,$10\n\t"             // 0b00 0000

        "lw $4,0x6060($15)\n\t"         // rdMmcDatBit4 => -c-- -c--
        "lui $10,0x00F0\n\t"            // $10 = mask = 0x00F00000
        "or $11,$3,$2\n\t"              // $11 = ab00 0000
        "srl $4,$4,4\n\t"               // --c- --c-

        "lw $5,0x6060($15)\n\t"         // rdMmcDatBit4 => -d-- -d--
        "and $4,$4,$10\n\t"             // 00c0 0000
        "lui $10,0x000F\n\t"            // $10 = mask = 0x000F0000
        "srl $5,$5,8\n\t"               // ---d ---d
        "or $11,$11,$4\n\t"             // $11 = abc0 0000

        "lw $6,0x6060($15)\n\t"         // rdMmcDatBit4 => -e-- -e--
        "and $5,$5,$10\n\t"             // 000d 0000
        "ori $10,$0,0xF000\n\t"         // $10 = mask = 0x0000F000
        "sll $6,$6,4\n\t"               // e--- e---
        "or $11,$11,$5\n\t"             // $11 = abcd 0000

        "lw $7,0x6060($15)\n\t"         // rdMmcDatBit4 => -f-- -f--
        "and $6,$6,$10\n\t"             // 0000 e000
        "ori $10,$0,0x0F00\n\t"         // $10 = mask = 0x00000F00
        "or $11,$11,$6\n\t"             // $11 = abcd e000
        "and $7,$7,$10\n\t"             // 0000 0f00

        "lw $8,0x6060($15)\n\t"         // rdMmcDatBit4 => -g-- -g--
        "ori $10,$0,0x00F0\n\t"         // $10 = mask = 0x000000F0
        "or $11,$11,$7\n\t"             // $11 = abcd ef00
        "srl $8,$8,4\n\t"               // --g- --g-

        "lw $9,0x6060($15)\n\t"         // rdMmcDatBit4 => -h-- -h--
        "and $8,$8,$10\n\t"             // 0000 00g0
        "ori $10,$0,0x000F\n\t"         // $10 = mask = 0x0000000F
        "or $11,$11,$8\n\t"             // $11 = abcd efg0

        "srl $9,$9,8\n\t"               // ---h ---h
        "and $9,$9,$10\n\t"             // 0000 000h
        "or $11,$11,$9\n\t"             // $11 = abcd efgh

        "sw $11,0($14)\n\t"             // save sector data
        "addiu $13,$13,-1\n\t"
        "bne $13,$0,gsloop\n\t"
        "addiu $14,$14,4\n\t"           // inc buffer pointer

        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4 - just toss checksum bytes
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4
        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4

        "lw $2,0x6060($15)\n\t"         // rdMmcDatBit4 - clock out end bit

        "addiu $12,$12,-1\n\t"          // count--
        "bne $12,$0,oloop\n\t"          // next sector
        "nop\n\t"

        "ori %0,$0,1\n"                 // res = TRUE

        "exit:\n"
        ".set pop\n"
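        : "=r" (res)                    // output
        : "r" (buf), "r" (count)        // inputs
        : "$2", "$3", "$4", "$5", "$6", "$7", "$8", "$9",
          "$10", "$11", "$12", "$13", "$14", "$15",
          "memory" );                   // clobbered: scratch registers, and *buf is written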

    return res;
}

That reads N sectors from the SD card in a row, tossing the checksum bytes for better speed. Writes DO checksum for safety. In any case, for each byte read, you have to access the rom space at the SD card interface twice. So you're doing two slow rom cycles per byte. If the Myth had a register that accumulated the data for us, the speed would be much faster. Another thing that could be done in a future core is to make the SD read nibble location map to the 0xA804xxxx range - that's SRAM space, and is far faster than rom space. Ideally, the core should map any access to a range in that space to the one address on the SD interface so that we could DMA the data from the SD card into ram, then put together all the nibbles.
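
For readers who find the assembly hard to follow, here is a minimal C sketch of what the inner loop computes for one 32-bit word. It assumes the data nibble of each raw read sits in bits 8..11, as the "-a-- -a--" comments and the 0x0100 start-bit test suggest; `sd_nibble` and `pack_word` are hypothetical names, and the sketch is not a replacement for the hand-scheduled code.

```c
#include <stdint.h>

/* Extract the 4-bit SD data nibble from one raw interface read.
 * Assumption: the nibble sits in bits 8..11 of the value read. */
static inline uint32_t sd_nibble(uint32_t raw)
{
    return (raw >> 8) & 0xFu;
}

/* Combine eight consecutive raw reads into one 32-bit word,
 * first nibble in the most significant position (abcd efgh). */
static uint32_t pack_word(const uint32_t raw[8])
{
    uint32_t word = 0;
    for (int i = 0; i < 8; i++)
        word = (word << 4) | sd_nibble(raw[i]);
    return word;
}
```

Eight reads per word is the same as two reads per byte, which is where the "two slow rom cycles per byte" cost comes from.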

So as an example, say a future core maps any access from 0xA8050000 to 0xA805FFFF to a single ASIC access of 0x000E6060. We could then quickly DMA up to 32768 nibbles into ram and decode them into the proper data afterwards. Actually, because we have to wait on the start bit between sectors, the most nibbles we would DMA at one time would be 1041 (1024 data nibbles + 16 checksum nibbles + 1 end bit nibble). While we were waiting for one sector's DMA to finish, we could be decoding the data from the previous sector.
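
A minimal C sketch of that decode step, under stated assumptions: each DMA'd nibble arrives as a 16-bit value with the data in bits 8..11, and the high nibble of each byte comes first. The core feature itself is hypothetical, and so are the names `SECTOR_NIBBLES` and `decode_sector`.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    DATA_NIBBLES   = 1024, /* 512 data bytes, two nibbles each */
    CRC_NIBBLES    = 16,   /* checksum nibbles, ignored here */
    END_NIBBLES    = 1,    /* end bit */
    SECTOR_NIBBLES = DATA_NIBBLES + CRC_NIBBLES + END_NIBBLES /* 1041 */
};

/* Decode one DMA'd sector into 512 bytes.
 * Assumption: data nibble in bits 8..11 of each 16-bit entry. */
static void decode_sector(const uint16_t raw[SECTOR_NIBBLES], uint8_t out[512])
{
    for (size_t i = 0; i < 512; i++) {
        uint8_t hi = (uint8_t)((raw[2 * i] >> 8) & 0xF);
        uint8_t lo = (uint8_t)((raw[2 * i + 1] >> 8) & 0xF);
        out[i] = (uint8_t)((hi << 4) | lo);
    }
    /* The trailing checksum and end-bit nibbles are skipped, as in the read loop. */
}
```

With two such raw buffers, one could be filled by DMA while the other is being decoded.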

« Last Edit: September 05, 2010, 10:37:57 AM by ChillyWilly »

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #2 on: September 07, 2010, 01:01:01 AM »
Ok, after seeing this, I really wanted to do some tests, and here are the results:

The C-generated code of the psram copy routine adds enough delay that burning a 32MB rom takes 1min45secs.

This code writes 256KB chunks of a 32MB rom in 1 minute (yes, ONE minute, that's 45 seconds less!) ...but the PSRAM can't keep up with this speed, so it writes garbage or writes at random offsets.

Code: [Select]
.section .text
.align 2

.set push
.set noreorder
/*.set noat*/

.global neo_xferto_psram
.ent    neo_xferto_psram
neo_xferto_psram:

/*regs.a[0] = src , regs.a[1] = pstart , regs.a[2] = len*/
la $10,0xB0000000
ori $8,$4,0
ori $9,$4,0
addu $9,$9,$6
addu $10,$10,$5
addu $10,$10,-4/*init for delay slot optimization*/

psram_copy_half_quad:
lw $11,($8)
addiu $10,$10,4 /*delay slot optimized*/
sw $11,($10)

/*cache 0x10,($10)*//*   (4 << 2) | 0*/

/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop

bltu $8,$9,psram_copy_half_quad
addiu $8,$8,4 /*delay slot optimized*/

nop

jr $ra
nop
.end neo_xferto_psram

.set pop
.set reorder
/*.set at*/

This one has the exact same settings but a bus delay of 400 nops, and takes 1min42secs (that's already 3 seconds less than the C code):
Code: [Select]
.section .text
.align 2

.set push
.set noreorder
/*.set noat*/

.global neo_xferto_psram
.ent    neo_xferto_psram
neo_xferto_psram:

/*regs.a[0] = src , regs.a[1] = pstart , regs.a[2] = len*/
la $10,0xB0000000
ori $8,$4,0
ori $9,$4,0
addu $9,$9,$6
addu $10,$10,$5
addu $10,$10,-4/*init for delay slot optimization*/

psram_copy_half_quad:
lw $11,($8)
addiu $10,$10,4 /*delay slot optimized*/
sw $11,($10)

/*cache 0x10,($10)*//*   (4 << 2) | 0*/

/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop


bltu $8,$9,psram_copy_half_quad
addiu $8,$8,4 /*delay slot optimized*/

nop

jr $ra
nop
.end neo_xferto_psram

.set pop
.set reorder
/*.set at*/

This one has the exact same settings but a bus delay of 333 nops, and takes 1min36secs (that's 9 seconds less than the C code):
Code: [Select]
.section .text
.align 2

.set push
.set noreorder
/*.set noat*/

.global neo_xferto_psram
.ent    neo_xferto_psram
neo_xferto_psram:

/*regs.a[0] = src , regs.a[1] = pstart , regs.a[2] = len*/
la $10,0xB0000000
ori $8,$4,0
ori $9,$4,0
addu $9,$9,$6
addu $10,$10,$5
addu $10,$10,-4/*init for delay slot optimization*/

psram_copy_half_quad:
lw $11,($8)
addiu $10,$10,4 /*delay slot optimized*/
sw $11,($10)

/*cache 0x10,($10)*//*   (4 << 2) | 0*/

/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*33nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop

bltu $8,$9,psram_copy_half_quad
addiu $8,$8,4 /*delay slot optimized*/

nop

jr $ra
nop
.end neo_xferto_psram

.set pop
.set reorder
/*.set at*/

We can probably shave off another second, but it's not worth it.

Again, ChillyWilly was correct, although I will commit the changes to the tracker anyway...
I have also attached the SD version with 333 nops so you can try it, but after this I don't think we can do anything else from the software side.

Maybe we could disable the SD CRC checks or make them optional... But I really think that it's time for the guys working on the core to take action.

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #3 on: September 07, 2010, 01:04:56 AM »
Also, I haven't tried to write full quads yet. I'll probably do it tomorrow and let you know if I have any success.

Edit: Nope, they don't work.
« Last Edit: September 07, 2010, 12:51:00 PM by Conle »

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #4 on: September 07, 2010, 09:38:05 PM »
Ok, I beat the previous record again.
I did some more alchemy after reading the R4400 manual, and this time the whole code got a nice speed boost, but unfortunately most of those precious cycles go to the bus delay.

In the previous version I had 333 nops for the bus delay. After this optimization, guess how many there are now...
440! I also thought about making a fast asm version of the crc7 code, but that would probably lead to another 200~300 nops for a gain of just 2-3 seconds.


Here are the results :

(Games: all .z64, so with natively-swapped binaries you should be getting even better speeds)

(Unsafe bus delay: games boot, but not always)
64Mb : 20.54s
128Mb : 43.25s
256Mb : 1m29s

(Safe bus delay, 440 nops: games always boot; current configuration)
64Mb : 23.73s
128Mb : 48.25s
256Mb : 1m36s ~ 1m37s

NEON64SD.V64 (1.2MB) : 3.54 seconds (with the C code it's 5+ seconds)
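
To compare the timings across ROM sizes, they can be turned into throughput. A small hypothetical helper, assuming "Mb" here means megabits (so 256Mb = 32MB):

```c
/* Throughput in KiB/s for a ROM of `megabits` Mbit copied in `seconds`.
 * 1 Mbit = 128 KiB. */
static double throughput_kib_per_s(double megabits, double seconds)
{
    return megabits * 128.0 / seconds;
}
```

With the safe timings this works out to roughly 338 to 345 KiB/s for all three sizes, which would be consistent with a fixed cost per word rather than a per-game overhead.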



Findings:

- 256Mb games have a good chance of booting fine even with 60~80 fewer nops.
- 256Mb games need a little extra delay when switching psram offsets, otherwise they won't boot. Probably the core hasn't received the command yet while the loop is filling the psram, so we're still writing to the 1st bank.



Summary:

Wasted 2 hours optimizing code, then filling up the freed cycles with bus delays, just to gain 10 seconds at most.

Once again, ChillyWilly was right when he pretty much said: get extra cycles, then figure out the magic bus delay constant to fix the timing... Anyway, the purpose was to prove this, but maybe it was a waste of time.
:)



Current code:
Code: [Select]

/*Interface : neo_2_asm.h*/

.section .text
.align 2 /*.align 3*/

.set push
.set noreorder
.set noat

.global neo2_recv_sd_multi /*ChillyWilly's MASTERPIECE*/
.ent    neo2_recv_sd_multi
neo2_recv_sd_multi:
        lui $15,0xB30E            /* $15 = 0xB30E0000*/
        ori $14,$4,0              /* $14 = buf*/
        ori $12,$5,0                /* $12 = count*/

        oloop:
        lui $11,0x0001              /* $11 = timeout = 64 * 1024*/

        tloop:
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        andi $2,$2,0x0100         /* eqv of (data>>8)&0x01*/
        beq $2,$0,getsect         /* start bit detected*/
        nop
        addiu $11,$11,-1
        bne $11,$0,tloop          /* not timed out*/
        nop
        beq $11,$0,___exit           /* timeout*/
        ori $2,$0,0                 /* res = FALSE*/

        getsect:
        ori $13,$0,128              /* $13 = long count*/

        gsloop:
        lw $2,0x6060($15)         /* rdMmcDatBit4 => -a-- -a--*/
        lui $10,0xF000            /* $10 = mask = 0xF0000000*/
        sll $2,$2,4               /* a--- a---*/

        lw $3,0x6060($15)         /* rdMmcDatBit4 => -b-- -b--*/
        and $2,$2,$10             /* a000 0000*/
        lui $10,0x0F00            /* $10 = mask = 0x0F000000*/
        and $3,$3,$10             /* 0b00 0000*/

        lw $4,0x6060($15)         /* rdMmcDatBit4 => -c-- -c--*/
        lui $10,0x00F0            /* $10 = mask = 0x00F00000*/
        or $11,$3,$2              /* $11 = ab00 0000*/
        srl $4,$4,4               /* --c- --c-*/

        lw $5,0x6060($15)         /* rdMmcDatBit4 => -d-- -d--*/
        and $4,$4,$10             /* 00c0 0000*/
        lui $10,0x000F            /* $10 = mask = 0x000F0000*/
        srl $5,$5,8               /* ---d ---d*/
        or $11,$11,$4             /* $11 = abc0 0000*/

        lw $6,0x6060($15)         /* rdMmcDatBit4 => -e-- -e--*/
        and $5,$5,$10             /* 000d 0000*/
        ori $10,$0,0xF000         /* $10 = mask = 0x0000F000*/
        sll $6,$6,4               /* e--- e---*/
        or $11,$11,$5             /* $11 = abcd 0000*/

        lw $7,0x6060($15)         /* rdMmcDatBit4 => -f-- -f--*/
        and $6,$6,$10             /* 0000 e000*/
        ori $10,$0,0x0F00         /* $10 = mask = 0x00000F00*/
        or $11,$11,$6             /* $11 = abcd e000*/
        and $7,$7,$10             /* 0000 0f00*/

        lw $8,0x6060($15)         /* rdMmcDatBit4 => -g-- -g--*/
        ori $10,$0,0x00F0         /* $10 = mask = 0x000000F0*/
        or $11,$11,$7             /* $11 = abcd ef00*/
        srl $8,$8,4               /* --g- --g-*/

        lw $9,0x6060($15)         /* rdMmcDatBit4 => -h-- -h--*/
        and $8,$8,$10             /* 0000 00g0*/
        ori $10,$0,0x000F         /* $10 = mask = 0x0000000F*/
        or $11,$11,$8             /* $11 = abcd efg0*/

        srl $9,$9,8               /* ---h ---h*/
        and $9,$9,$10             /* 0000 000h*/
        or $11,$11,$9             /* $11 = abcd efgh*/

        sw $11,0($14)             /* save sector data*/
        addiu $13,$13,-1
        bne $13,$0,gsloop
        addiu $14,$14,4           /* inc buffer pointer */

        lw $2,0x6060($15)         /* rdMmcDatBit4 - just toss checksum bytes */
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/
        lw $2,0x6060($15)         /* rdMmcDatBit4*/

        lw $2,0x6060($15)         /* rdMmcDatBit4 - clock out end bit*/

        addiu $12,$12,-1          /* count--*/
        bne $12,$0,oloop          /* next sector*/
        nop

        ori $2,$0,1                 /* res = TRUE*/

___exit:

jr $ra
nop

.end neo2_recv_sd_multi

.global neo_xferto_psram
.ent    neo_xferto_psram
neo_xferto_psram:

la $10,0xB0000000
ori $8,$4,0
addu $8,$8,$6
addu $10,$10,$5

psram_copy_half_quad:
lw $11,($4)
nop
sw $11,($10)
addiu $10,$10,4

/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*100nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
/*40nops*/
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop
nop


bne $4,$8,psram_copy_half_quad
addiu $4,$4,4

jr $ra
nop
.end neo_xferto_psram

.set pop
.set reorder
.set at
« Last Edit: September 07, 2010, 09:39:51 PM by Conle »

Offline mic_

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 632
Re: SD to psram performance
« Reply #5 on: September 07, 2010, 10:03:08 PM »
Holy NOPs, Batman!

How about

Code: [Select]
.rept 440
nop
.endr

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #6 on: September 07, 2010, 10:14:15 PM »
Quote
Holy NOPs, Batman!

How about

Code: [Select]
.rept 440
nop
.endr

GCC's assembler has such a nice directive? I was about to write a little loop before committing the changes.
That's nice and clean, thanks.

Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #7 on: September 08, 2010, 01:38:53 AM »
Why are you trying to optimize THIS?

Code: [Select]
void neo_xferto_psram(void *src, int pstart, int len)
{
    // copy data
    for (int ix=0; ix<len; ix+=4)
    {
        *(vu32 *)(0xB0000000+pstart+ix) = *(u32 *)(src+ix);
        bus_delay(96);
    }
}

As you demonstrated, it provides almost no change in speed. The value 96 was chosen because it works for everyone. I can lower it to 80 on my system and it still works... at a savings of 2 seconds! (on a 32MB rom)

There's no reason to change that into assembly. There's no reason to unroll that loop. You MIGHT want to play with that bus_delay() value a little. Like I said, I can run it at 80 on my system. I can go lower, but at 64 it sometimes fails on loads, so somewhere between 64 and 80 is where my system runs.
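
If someone does want to play with that value, a minimal sketch of making it tunable could look like this. Everything here is an assumption for illustration: `psram_bus_delay`, `bus_delay_sketch` and `xfer_words` are hypothetical names, the delay loop is a portable stand-in for the nop loop, and `dst` stands in for 0xB0000000 + pstart so the code can also be run on a PC.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tunable; 96 is the value that works for everyone. */
static int psram_bus_delay = 96;

/* Portable stand-in for bus_delay(): the volatile counter keeps the
 * compiler from deleting the loop. */
static void bus_delay_sketch(int cnt)
{
    for (volatile int i = 0; i < cnt; i++)
        ;
}

/* Copy `len` bytes (a multiple of 4) one 32-bit word at a time,
 * waiting after each store. */
static void xfer_words(volatile uint32_t *dst, const uint32_t *src, size_t len)
{
    for (size_t i = 0; i < len / 4; i++) {
        dst[i] = src[i];
        bus_delay_sketch(psram_bus_delay);
    }
}
```

Lowering psram_bus_delay toward 80 or 64 is then a one-line experiment, with the caveat from above that values that are too low fail on some systems.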

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #8 on: September 08, 2010, 02:03:14 AM »
Quote
Why are you trying to optimize THIS?

Code: [Select]
void neo_xferto_psram(void *src, int pstart, int len)
{
    // copy data
    for (int ix=0; ix<len; ix+=4)
    {
        *(vu32 *)(0xB0000000+pstart+ix) = *(u32 *)(src+ix);
        bus_delay(96);
    }
}


Well, didn't the OP suggest unrolling the loop and writing an assembly version?
I just proved that even with assembly code we can't do miracles.  8)

Then I noticed that gcc was saving/restoring a few registers on the stack and that we could reclaim a few extra cycles, and that's it: we got 10 seconds less for 32MB roms, 5~7 seconds less for 16MB and 8MB, and 2.5 seconds less for the ext menu bios.

And on top of that, we figured out that when switching the psram offset at those "high" speeds, an extra delay is needed.

Not bad for just a buffer copy :)


ps: One thing I want to try is to load a quad from memory and write it in 2 passes in a single loop protected by nops.
« Last Edit: September 08, 2010, 02:23:49 AM by Conle »

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #9 on: September 08, 2010, 03:04:59 AM »

Ok, what do you guys say about burning a 32MB z64 in 1min26 seconds?  :)

Here's an algorithm I came up with. I call it brute-force:
Code: [Select]
la $10,0xB0000000
ori $8,$4,0
addu $8,$8,$6
addu $10,$10,$5

psram_copy_game:

lw $11,($4)
ori $13,$0,64 /*timeout*/

psram_copy_game_2:
bltz $13,psram_copy_game_next
addiu $13,$13,-1

sw $11,($10)
lw $12,($10)
bne $11,$12,psram_copy_game_2
nop

psram_copy_game_next:
addiu $10,$10,4

bne $4,$8,psram_copy_game
addiu $4,$4,4

jr $ra
nop

Check the binary and tell me if it works for you  :D

Now with this we might do streaming off the SD for each long, and god knows if it will work and how it will perform.

Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #10 on: September 08, 2010, 03:38:49 AM »
Now THAT is an interesting optimization... comparing the data to see if it's valid yet... although you only have to write it once. The write should be outside the timeout loop. That might be microscopically faster.

More like this:

Code: [Select]
   psram_copy_game:
     
      lw $11,($4)
      ori $13,$0,64 /*timeout*/
      sw $11,($10)

         psram_copy_game_2:
           bltz $13,psram_copy_game_next

           lw $12,($10)
           bne $11,$12,psram_copy_game_2
           addiu $13,$13,-1

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #11 on: September 08, 2010, 03:58:03 AM »
Good catch, I made it perfect now. When the timeout is <1 it uses the old method with 440 nops to burn:
Edit: No, that's not correct, because then it doesn't write it if it's invalid, so I fixed it again.
Code: [Select]
la $10,0xB0000000
ori $8,$4,0
addu $8,$8,$6
addu $10,$10,$5

psram_copy_game:

lw $11,($4)
ori $13,$0,96 /*timeout*/

psram_copy_game_2:
bltz $13,psram_copy_game_full_cycle
addiu $13,$13,-1

sw $11,($10)
lw $12,($10)
bne $11,$12,psram_copy_game_2
nop

bgtz $13,psram_copy_game_next
nop

psram_copy_game_full_cycle:
.rept 440
nop
.endr
sw $11,($10)

psram_copy_game_next:
addiu $10,$10,4

bne $4,$8,psram_copy_game
addiu $4,$4,4

jr $ra
nop

I didn't notice any speedup or slowdown.

Regarding streaming modes: with 64/128KB it's awful, and with 512KB per read we get 1 second less with large games, but anything else is slower by 1 second.
So the default streaming size you picked performs very well.
 8)
« Last Edit: September 08, 2010, 04:06:36 AM by Conle »

Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #12 on: September 08, 2010, 04:30:48 AM »
Quote
Good catch, I made it perfect now. When the timeout is <1 it uses the old method with 440 nops to burn:
Edit: No, that's not correct, because then it doesn't write it if it's invalid, so I fixed it again.

Hmm - probably due to the reads. If you look at the original code, there's only one write, then a wait. Reading the value back must invalidate the write if the data wasn't yet stable.

In any case, get rid of the nop by moving the subtract into that delay slot, and move the store before the full wait:

Code: [Select]
la $10,0xB0000000
ori $8,$4,0
addu $8,$8,$6
addu $10,$10,$5

psram_copy_game:

lw $11,($4)
ori $13,$0,96 /*timeout*/

psram_copy_game_2:
bltz $13,psram_copy_game_full_cycle

sw $11,($10)
lw $12,($10)
bne $11,$12,psram_copy_game_2
addiu $13,$13,-1

bgtz $13,psram_copy_game_next
nop

sw $11,($10)
psram_copy_game_full_cycle:
.rept 440
nop
.endr

psram_copy_game_next:
addiu $10,$10,4

bne $4,$8,psram_copy_game
addiu $4,$4,4

jr $ra
nop
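
For anyone following along without MIPS experience, here is a simplified C model of the retry logic in that loop. It is a sketch of the control flow only, not a replacement for the assembly: `write_verified` and `long_wait_sketch` are hypothetical names, the wait is a portable stand-in for the 440 nops, and on real hardware `dst` would be an uncached PSRAM address.

```c
#include <stdint.h>

/* Stand-in for the fixed ".rept 440 / nop / .endr" wait. */
static void long_wait_sketch(void)
{
    for (volatile int i = 0; i < 440; i++)
        ;
}

/* Store `value`, read it back, and retry up to `retries` times.
 * If no read-back matches, store once more and fall back to the long wait.
 * Returns 1 if a read-back matched, 0 if the fallback path was taken. */
static int write_verified(volatile uint32_t *dst, uint32_t value, int retries)
{
    for (int attempt = 0; attempt < retries; attempt++) {
        *dst = value;
        if (*dst == value)
            return 1;
    }
    *dst = value;
    long_wait_sketch();
    return 0;
}
```

In the common case the first read-back matches and the long wait is never paid, which is where the speedup over a fixed delay per word would come from.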

Quote
I didn't notice any speedup or slowdown.

Because your code wasn't any different other than a mostly useless long wait on timeout.

Quote
Regarding streaming modes: with 64/128KB it's awful, and with 512KB per read we get 1 second less with large games, but anything else is slower by 1 second.
So the default streaming size you picked performs very well.
 8)

That was one of the things I tried different values for back when the menu was first made. 256KB tested out as better under more circumstances. Caching is probably part of that difference compared to different size buffers.

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #13 on: September 08, 2010, 12:41:10 PM »
Quote
Because your code wasn't any different other than a mostly useless long wait on timeout.

Exactly.  :)  I meant that it never jumped to that section even with timeout retries set to 32.

Quote
Hmm - probably due to the reads. If you look at the original code, there's only one write, then a wait. Reading the value back must invalidate the write if the data wasn't yet stable.

In any case, get rid of the nop and move the subtract and move the store before the full wait:

Here is a proper implementation:

Code: [Select]
.global neo_xferto_psram
.ent    neo_xferto_psram
neo_xferto_psram:

la $10,0xB0000000
ori $8,$4,0
addu $8,$8,$6
addu $10,$10,$5

0:
ori $13,$0,NEO_PSRAM_BRUTEFORCE_RETRIES

1:
lw $11,($4)
sw $11,($10)
lw $12,($10)
j 8f
nop

2:
sw $11,($10)
lw $12,($10)

3:
bltz $13,4f
addi $13,$13,-1

bne $11,$12,2b
nop

j 5f
nop

8:/*hack*/
bne $11,$12,2b
nop

j 5f
nop

4:
sw $11,($10)

.rept 440
nop
.endr

5:
addiu $10,$10,4

6:
bne $4,$8,0b
addiu $4,$4,4
7:

jr $ra
nop

.end neo_xferto_psram

I know, you'll complain about not using the 2 delay slots, but of course I've tried it:

Code: [Select]
.global neo_xferto_psram
.ent    neo_xferto_psram
neo_xferto_psram:

la $10,0xB0000000
ori $8,$4,0
addu $8,$8,$6
addu $10,$10,$5

0:
ori $13,$0,NEO_PSRAM_BRUTEFORCE_RETRIES

1:
lw $11,($4)
sw $11,($10)
j 8f
nop

9:/*hack*/
sw $11,($10)

2:
lw $12,($10)

3:
bltz $13,4f
addi $13,$13,-1

bne $11,$12,2b
sw $11,($10)

j 5f
nop

8:/*hack*/
bne $11,$12,9b
lw $12,($10)

j 5f
nop

4:
sw $11,($10)

.rept 440
nop
.endr

5:
addiu $10,$10,4

6:
bne $4,$8,0b
addiu $4,$4,4
7:

jr $ra
nop

.end neo_xferto_psram

And it makes the transfer slower by 3 seconds.

Anyway, the current timings are:
64Mb z64 - 0m21.28s
128Mb z64 - 0m42.64s
256Mb z64 - 1m26.54s

Now I just need to complete the integration of the lib config, and it should be on the tracker.
(If you need the code right now, let me know.)

Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #14 on: September 08, 2010, 12:52:58 PM »
You really like to overly complicate the assembly, don't you?

That's way more complex than it needs to be. The original code with my changes was better.

Also, moving the store to the delay slot was slower because the delay slot is ALWAYS executed, so you were doing the store one more time than when you had the nop there. The store and the read are the slowest parts of the xfer loop since they are accessing uncached hardware. Also, the subtract from the timeout you put in the first delay slot doesn't affect the branch - delay slot instructions cannot affect the branch, even when they operate on the same register. It probably doesn't affect that loop, but I wanted to make sure you realized that side effect.

What you should probably do is put the subtract after the load since almost any instruction there not relying on the load will be absorbed into the load time.