Topic: SD to psram performance

Offline Conle

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 2203
Re: SD to psram performance
« Reply #15 on: September 08, 2010, 01:19:44 PM »

Quote
You really like to overly complicate the assembly, don't you?

That's way more complex than it needs to be. The original code with my changes was better.

That version is slower by 1.30s ~ 1.40s for 32MB games, and by nearly 2s for 16MB and 8MB ones, compared to the one I just posted.  :)
Those jumps really speed up the transfer,
since most of the time the loop jumps straight to the next long.

Quote
Also, moving the store to the delay slot was slower because the delay slot is ALWAYS executed, so you were doing the store one more time than when you had the nop there. The store and the read are the slowest parts of the xfer loop since they access uncached hardware. Also, the subtract from the timeout you put in the first delay slot doesn't affect the branch: delay slot instructions cannot affect the branch, even when they use the same register. It probably doesn't matter for that loop, but I wanted to make sure you realized that side effect.

Thanks! That was a bit confusing. I downloaded those freely available MIPS PDF ebooks: one talked about delay slots
on load/branch instructions, while the other said that in some r43xxx models delay slots should (mostly) be used as breakpoints ONLY in
branches.

It's a conspiracy for sure.

Anyway, now (= yesterday; time is relative, probably) I got a nice manual from the website of Infrid, one of the testers here: http://infrid.com/rcp64/documents.php
I think it's the best I've found so far.
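
For anyone following along, here is a small C model of the delay-slot point quoted above. It is only a sketch: the function names are made up for illustration and this is not the firmware code. It models a countdown loop where the decrement sits either in the branch delay slot or ahead of the branch. In the first case the branch tests the value from before the decrement, so the loop body runs one extra time.

```c
#include <stdint.h>

/* Hypothetical model of:   loop: <body>
 *                                bne   $t,$0,loop
 *                                addiu $t,$t,-1     <- delay slot
 * The branch decision uses the value of $t from BEFORE the delay slot runs. */
static uint32_t passes_dec_in_delay_slot(int32_t t)
{
    uint32_t passes = 0;
    int taken;
    do {
        passes++;          /* loop body (e.g. the store) */
        taken = (t != 0);  /* bne compares the old value */
        t--;               /* delay slot: always executed */
    } while (taken);
    return passes;
}

/* Hypothetical model of:   loop: <body>
 *                                addiu $t,$t,-1
 *                                bne   $t,$0,loop
 *                                nop                <- delay slot */
static uint32_t passes_dec_before_branch(int32_t t)
{
    uint32_t passes = 0;
    int taken;
    do {
        passes++;          /* loop body */
        t--;               /* ordinary instruction ahead of the branch */
        taken = (t != 0);  /* bne sees the decremented value */
    } while (taken);
    return passes;
}
```

Starting from a count of 3, the first variant runs the body four times and the second three times, which matches the "one more time" effect described in the quote.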

Offline Conle

Re: SD to psram performance
« Reply #16 on: September 08, 2010, 01:59:41 PM »
And I think I found the secret: read the psram first, then you can write fine with just 1 call:
Code:
la $10,0xB0000000
ori $8,$4,0
addu $8,$8,$6
addu $10,$10,$5

0:
lw $11,($10)
lw $12,($4)
sw $12,($10)

addiu $10,$10,4
addiu $4,$4,4

bne $4,$8,0b
nop

jr $ra
nop

Booted Mario 64 in 21 seconds...
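
In C terms, the loop above is roughly the following. This is a sketch only: the function name and signature are assumptions for illustration, and on the real hardware `dst` would point into the uncached cart window at 0xB0000000 plus the psram offset. The dummy read before each store is the whole trick; its value is thrown away.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C equivalent of the read-then-write loop.
 * dst is volatile so the compiler keeps the dummy read and every store. */
static void xfer_to_psram(volatile uint32_t *dst, const uint32_t *src, size_t bytes)
{
    size_t n = bytes / 4;              /* the assembly steps 4 bytes at a time */
    for (size_t i = 0; i < n; i++) {
        uint32_t dummy = dst[i];       /* lw $11,($10): read psram first */
        (void)dummy;                   /* value is discarded */
        dst[i] = src[i];               /* lw $12,($4) ; sw $12,($10) */
    }
}
```

One difference from the assembly: the assembly loop always runs at least once, while this version does nothing for a length of zero.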

Offline Conle

Re: SD to psram performance
« Reply #17 on: September 08, 2010, 02:24:35 PM »

OK, I maxed it out. It can't get any faster than this.

Here's the current code. It writes 8 bytes at a time by adding a small delay to the slow psram read access.

Code:
.global neo_xferto_psram
.ent    neo_xferto_psram
neo_xferto_psram:

la $10,0xB0000000
ori $8,$4,0
addu $8,$8,$6
addu $10,$10,$5

0:
lw $12,0($4)
lw $13,4($4)

sw $12,0($10)
lw $11,($10)
sw $13,4($10)
.rept 188
nop
.endr
addiu $10,$10,8
addiu $4,$4,8

bne $4,$8,0b
nop

jr $ra
nop

.end neo_xferto_psram

64Mb = 20.60s with 180 nops (2nd pass), or 21.10s with 188 nops (2nd pass)
128Mb = 41.84s with 180 nops (2nd pass), or 42s with 188 nops (2nd pass)
256Mb = 1m24.34s with 180 nops (2nd pass), or 1m25s with 188 nops (2nd pass)
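
To put those timings in perspective, here is a small helper that turns them into throughput. It assumes "Mb" means megabits (so 64Mb is an 8 MiB ROM) and uses binary units; the function name is made up for this example.

```c
#include <stdint.h>

/* Hypothetical helper: ROM size in megabits and load time in seconds
 * to throughput in KiB/s. Assumes 1 Mbit = 1024 * 1024 bits. */
static double kib_per_second(uint32_t megabits, double seconds)
{
    double kib = (double)megabits * 1024.0 / 8.0;  /* Mbit -> KiB */
    return kib / seconds;
}
```

Under those assumptions the 180-nop figures work out to a bit under 400 KiB/s for all three sizes, so the load time scales almost linearly with ROM size.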

Offline ChillyWilly

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1751
  • Just a coding machine.
Re: SD to psram performance
« Reply #18 on: September 08, 2010, 02:45:24 PM »
So if you read first, you don't need a wait after the write... but adding the nops after the second write makes it faster? That doesn't make any sense. The code that did a long at a time with nops should be faster.

The code with the nops isn't reading before the first write... which might be why it needed the nops. Shouldn't it be

Code:
      lw $11,0($10)
      sw $12,0($10)
      lw $11,4($10)
      sw $13,4($10)

and then no nops? That would be more consistent with the previous long move routine.

Offline Conle

Re: SD to psram performance
« Reply #19 on: September 08, 2010, 02:50:59 PM »
 
OK, you're right, it works, but it burns the games about half a second slower:
Code:
0:

lw $12,0($4)
lw $13,4($4)

/*sw $12,0($10)
lw $11,($10)
sw $13,4($10)
.rept 188
nop
.endr*/

lw $11,0($10)
sw $12,0($10)
lw $11,4($10)
sw $13,4($10)

addiu $10,$10,8
addiu $4,$4,8

bne $4,$8,0b
nop

jr $ra
nop

So should this be the final version?
« Last Edit: September 08, 2010, 03:14:22 PM by Conle »

Offline ChillyWilly

Re: SD to psram performance
« Reply #20 on: September 08, 2010, 03:41:48 PM »

Quote
OK, you're right, it works, but it burns the games about half a second slower:

Why does it still have this code?

Code:
/*sw $12,0($10)
lw $11,($10)
sw $13,4($10)
.rept 188
nop
.endr*/

That's probably where the extra time is wasted.

Just do

Code:
   0:
      lw $12,0($4)
      lw $13,4($4)

      lw $11,0($10)
      sw $12,0($10)
      lw $11,4($10)
      sw $13,4($10)

      addiu $10,$10,8
      addiu $4,$4,8

      bne $4,$8,0b
      nop   


Offline Conle

Re: SD to psram performance
« Reply #21 on: September 08, 2010, 03:59:36 PM »
Quote
Why does it still have this code?

ChillyWilly, that code is in a multiline comment block  ^-^

Offline Conle

Re: SD to psram performance
« Reply #22 on: September 08, 2010, 04:04:23 PM »
The code is now on the tracker. I'll complete the libconfig integration in a few hours. No more time right now.

Offline ChillyWilly

Re: SD to psram performance
« Reply #23 on: September 09, 2010, 12:22:55 AM »
Quote
ChillyWilly, that code is in a multiline comment block  ^-^

Didn't notice... that was right before I went to bed.

Offline Conle

Re: SD to psram performance
« Reply #24 on: September 09, 2010, 02:09:17 AM »
Quote
Didn't notice... that was right before I went to bed.

Yeah, I figured that was the case  :D
I have done worse stuff when typing/coding late at night, so I always try to avoid doing stuff really late.

Offline Conle

Re: SD to psram performance
« Reply #25 on: September 09, 2010, 01:12:58 PM »
I replaced both CRC routines with the lookup tables found here: http://www.humblesoft.com/n-card/kernel-patch-2.4.21-pre4/mmc-driver.txt

And the only boost we got is about half a second at best (for the 64Mb ROM I tested).
It might be more useful for the MD though.
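
For readers who haven't seen the technique, this is what a table-driven CRC looks like next to the bit-by-bit version. It is a generic sketch, not the code from the linked driver or from the menu source: it uses the CRC-16/CCITT polynomial 0x1021 with an initial value of 0, which is what SD data blocks use, and all function names here are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC-16, polynomial 0x1021, initial value 0, MSB first. */
static uint16_t crc16_bitwise(const uint8_t *p, size_t len)
{
    uint16_t crc = 0;
    while (len--) {
        crc ^= (uint16_t)(*p++ << 8);
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

static uint16_t crc16_table[256];

/* Each table entry is the CRC of a single byte, computed once up front. */
static void crc16_init_table(void)
{
    for (unsigned v = 0; v < 256; v++) {
        uint8_t b = (uint8_t)v;
        crc16_table[v] = crc16_bitwise(&b, 1);
    }
}

/* Table-driven version: one lookup per byte instead of eight shift steps. */
static uint16_t crc16_lookup(const uint8_t *p, size_t len)
{
    uint16_t crc = 0;
    while (len--)
        crc = (uint16_t)((crc << 8) ^ crc16_table[(crc >> 8) ^ *p++]);
    return crc;
}
```

The table costs 512 bytes and has to be filled with `crc16_init_table()` before the first call. The small gain reported above is plausible if most of the load time is spent waiting on the SD and psram accesses rather than on the CRC math.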

Offline ChillyWilly

Re: SD to psram performance
« Reply #26 on: September 09, 2010, 01:51:06 PM »
Quote
I replaced both CRC routines with the lookup tables found here: http://www.humblesoft.com/n-card/kernel-patch-2.4.21-pre4/mmc-driver.txt

And the only boost we got is about half a second at best (for the 64Mb ROM I tested).
It might be more useful for the MD though.

I bet it makes almost no difference, but it's worth trying.

Offline Conle

Re: SD to psram performance
« Reply #27 on: September 09, 2010, 02:04:26 PM »
Quote
I bet it makes almost no difference, but it's worth trying.

It doesn't even give a full second.
For the MD/SNES (especially SNES!) it can help though.

Offline Conle

Re: SD to psram performance
« Reply #28 on: September 11, 2010, 01:23:35 AM »
Quote
It doesn't even give a full second

Almost got that second.
Code:
.global neo2_recv_sd_multi /*ChillyWilly's MASTERPIECE*/
.ent    neo2_recv_sd_multi
neo2_recv_sd_multi:
        la $15,0xB30E6060         /* $15 = 0xB30E6060*/
        ori $14,$4,0              /* $14 = buf*/
        ori $12,$5,0              /* $12 = count*/

        oloop:
        lui $11,0x0001            /* $11 = timeout = 64 * 1024*/

        tloop:
        lw $2,($15)               /* rdMmcDatBit4*/
        andi $2,$2,0x0100         /* eqv of (data>>8)&0x01*/
        beq $2,$0,getsect         /* start bit detected*/
        nop
        addiu $11,$11,-1
        bne $11,$0,tloop          /* not timed out*/
        nop
        beq $11,$0,___exit        /* timeout*/
        ori $2,$0,0               /* res = FALSE*/

        getsect:
        ori $13,$0,128            /* $13 = long count*/

        gsloop:
        lw $2,($15)               /* rdMmcDatBit4 => -a-- -a--*/
        lui $10,0xF000            /* $10 = mask = 0xF0000000*/
        sll $2,$2,4               /* a--- a---*/

        lw $3,($15)               /* rdMmcDatBit4 => -b-- -b--*/
        and $2,$2,$10             /* a000 0000*/
        lui $10,0x0F00            /* $10 = mask = 0x0F000000*/
        and $3,$3,$10             /* 0b00 0000*/

        lw $4,($15)               /* rdMmcDatBit4 => -c-- -c--*/
        lui $10,0x00F0            /* $10 = mask = 0x00F00000*/
        or $11,$3,$2              /* $11 = ab00 0000*/
        srl $4,$4,4               /* --c- --c-*/

        lw $5,($15)               /* rdMmcDatBit4 => -d-- -d--*/
        and $4,$4,$10             /* 00c0 0000*/
        lui $10,0x000F            /* $10 = mask = 0x000F0000*/
        srl $5,$5,8               /* ---d ---d*/
        or $11,$11,$4             /* $11 = abc0 0000*/

        lw $6,($15)               /* rdMmcDatBit4 => -e-- -e--*/
        and $5,$5,$10             /* 000d 0000*/
        ori $10,$0,0xF000         /* $10 = mask = 0x0000F000*/
        sll $6,$6,4               /* e--- e---*/
        or $11,$11,$5             /* $11 = abcd 0000*/

        lw $7,($15)               /* rdMmcDatBit4 => -f-- -f--*/
        and $6,$6,$10             /* 0000 e000*/
        ori $10,$0,0x0F00         /* $10 = mask = 0x00000F00*/
        or $11,$11,$6             /* $11 = abcd e000*/
        and $7,$7,$10             /* 0000 0f00*/

        lw $8,($15)               /* rdMmcDatBit4 => -g-- -g--*/
        ori $10,$0,0x00F0         /* $10 = mask = 0x000000F0*/
        or $11,$11,$7             /* $11 = abcd ef00*/
        srl $8,$8,4               /* --g- --g-*/

        lw $9,($15)               /* rdMmcDatBit4 => -h-- -h--*/
        and $8,$8,$10             /* 0000 00g0*/
        ori $10,$0,0x000F         /* $10 = mask = 0x0000000F*/
        or $11,$11,$8             /* $11 = abcd efg0*/

        srl $9,$9,8               /* ---h ---h*/
        and $9,$9,$10             /* 0000 000h*/
        or $11,$11,$9             /* $11 = abcd efgh*/

        sw $11,0($14)             /* save sector data*/
        addiu $13,$13,-1
        bne $13,$0,gsloop
        addiu $14,$14,4           /* inc buffer pointer */

        lw $2,($15)               /* rdMmcDatBit4 - just toss checksum bytes */
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/

        lw $2,($15)               /* rdMmcDatBit4 - clock out end bit*/

        addiu $12,$12,-1          /* count--*/
        bne $12,$0,oloop          /* next sector*/
        nop

        ori $2,$0,1                 /* res = TRUE*/

___exit:

jr $ra
nop

CW, excuse me for butchering your masterpiece  :'(  :D

This gives about 0.12 ~ 0.18s. Still looking for 0.10s  >:D
I'm lame.

--> Now I want to try moving those masks out of register $10 into some reserved (saved/restored, of course) registers. It might help a bit.
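
For anyone trying to follow the shift/mask comments in gsloop, here is the same nibble packing written as a C model. It is a sketch based only on the comments in the listing: each 32-bit read is assumed to carry its data nibble in bits 24-27 and again in bits 8-11 (the "-a-- -a--" pattern), and the function name is made up for this example.

```c
#include <stdint.h>

/* Hypothetical C model of one gsloop pass: eight reads of the SD data
 * register are combined into one 32-bit word, first nibble highest. */
static uint32_t pack_nibbles(const uint32_t r[8])
{
    uint32_t w;
    w  = (r[0] << 4) & 0xF0000000u;  /* a000 0000 */
    w |=  r[1]       & 0x0F000000u;  /* 0b00 0000 */
    w |= (r[2] >> 4) & 0x00F00000u;  /* 00c0 0000 */
    w |= (r[3] >> 8) & 0x000F0000u;  /* 000d 0000 */
    w |= (r[4] << 4) & 0x0000F000u;  /* 0000 e000 */
    w |=  r[5]       & 0x00000F00u;  /* 0000 0f00 */
    w |= (r[6] >> 4) & 0x000000F0u;  /* 0000 00g0 */
    w |= (r[7] >> 8) & 0x0000000Fu;  /* 0000 000h */
    return w;
}
```

The first four nibbles come from the upper copy of each read and the last four from the lower copy, which is why the shift amounts repeat (left 4, none, right 4, right 8) while only the mask changes.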
« Last Edit: September 11, 2010, 01:35:27 AM by Conle »

Offline Conle

Re: SD to psram performance
« Reply #29 on: September 11, 2010, 02:04:12 AM »
32MB Z64 takes 1m24.48s ~ 1m24.54s   8)

In this code I moved a few constants into registers, saving a few loads;
it uses t8-t9 and saves/restores s0/s1 to/from the stack.

Now it's over, I think (I've made some more changes, but for those look at the diff once it's on the tracker).
Code:
.global neo2_recv_sd_multi /*ChillyWilly's MASTERPIECE*/
.ent    neo2_recv_sd_multi
neo2_recv_sd_multi:

        la $15,0xB30E6060         /* $15 = 0xB30E6060*/
        ori $14,$4,0              /* $14 = buf*/
        ori $12,$5,0              /* $12 = count*/

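        /*
         * t8 ($24) = 0xF000, t9 ($25) = 0x0F00 (not saved)
         * s0 ($16) = 0x00F0, s1 ($17) = 0x000F (saved)
         */
        addiu $sp,$sp,-8          /* this block adds lots of latency to the prologue but in the end we get a few ms!*/
        sw $16,0($sp)             /* save s0*/
        sw $17,4($sp)             /* save s1*/
        lui $24,0xF000            /* $24 = mask = 0xF0000000*/
        lui $25,0x0F00            /* $25 = mask = 0x0F000000*/
        lui $16,0x00F0            /* $16 = mask = 0x00F00000*/
        lui $17,0x000F            /* $17 = mask = 0x000F0000*/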
 
        oloop:
        lui $11,0x0001            /* $11 = timeout = 64 * 1024*/

        tloop:
        lw $2,($15)               /* rdMmcDatBit4*/
        andi $2,$2,0x0100         /* eqv of (data>>8)&0x01*/
        beq $2,$0,getsect         /* start bit detected*/
        nop
        addiu $11,$11,-1
        bne $11,$0,tloop          /* not timed out*/
        nop
        beq $11,$0,___exit        /* timeout*/
        ori $2,$0,0               /* res = FALSE*/

        getsect:
        ori $13,$0,128            /* $13 = long count*/

        gsloop:
        lw $2,($15)               /* rdMmcDatBit4 => -a-- -a--*/
        /*lui $10,0xF000*/            /* $10 = mask = 0xF0000000*/
        sll $2,$2,4               /* a--- a---*/

        lw $3,($15)               /* rdMmcDatBit4 => -b-- -b--*/
        and $2,$2,$24             /* a000 0000*/
        /*lui $10,0x0F00*/            /* $10 = mask = 0x0F000000*/
        and $3,$3,$25             /* 0b00 0000*/

        lw $4,($15)               /* rdMmcDatBit4 => -c-- -c--*/
        /*lui $10,0x00F0*/            /* $10 = mask = 0x00F00000*/
        or $11,$3,$2              /* $11 = ab00 0000*/
        srl $4,$4,4               /* --c- --c-*/

        lw $5,($15)               /* rdMmcDatBit4 => -d-- -d--*/
        and $4,$4,$16             /* 00c0 0000*/
        /*lui $10,0x000F*/            /* $10 = mask = 0x000F0000*/
        srl $5,$5,8               /* ---d ---d*/
        or $11,$11,$4             /* $11 = abc0 0000*/

        lw $6,($15)               /* rdMmcDatBit4 => -e-- -e--*/
        and $5,$5,$17             /* 000d 0000*/
        ori $10,$0,0xF000         /* $10 = mask = 0x0000F000*/
        sll $6,$6,4               /* e--- e---*/
        or $11,$11,$5             /* $11 = abcd 0000*/

        lw $7,($15)               /* rdMmcDatBit4 => -f-- -f--*/
        and $6,$6,$10             /* 0000 e000*/
        ori $10,$0,0x0F00         /* $10 = mask = 0x00000F00*/
        or $11,$11,$6             /* $11 = abcd e000*/
        and $7,$7,$10             /* 0000 0f00*/

        lw $8,($15)               /* rdMmcDatBit4 => -g-- -g--*/
        ori $10,$0,0x00F0         /* $10 = mask = 0x000000F0*/
        or $11,$11,$7             /* $11 = abcd ef00*/
        srl $8,$8,4               /* --g- --g-*/

        lw $9,($15)               /* rdMmcDatBit4 => -h-- -h--*/
        and $8,$8,$10             /* 0000 00g0*/
        ori $10,$0,0x000F         /* $10 = mask = 0x0000000F*/
        or $11,$11,$8             /* $11 = abcd efg0*/

        srl $9,$9,8               /* ---h ---h*/
        and $9,$9,$10             /* 0000 000h*/
        or $11,$11,$9             /* $11 = abcd efgh*/

        sw $11,0($14)             /* save sector data*/
        addiu $13,$13,-1
        bne $13,$0,gsloop
        addiu $14,$14,4           /* inc buffer pointer */

        lw $2,($15)               /* rdMmcDatBit4 - just toss checksum bytes */
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/
        lw $2,($15)               /* rdMmcDatBit4*/

        lw $2,($15)               /* rdMmcDatBit4 - clock out end bit*/

        addiu $12,$12,-1          /* count--*/
        bne $12,$0,oloop          /* next sector*/
        nop

        ori $2,$0,1                 /* res = TRUE*/

___exit:
lw $16,0($sp)
lw $17,4($sp)
addiu $sp,$sp,8

jr $ra
nop

.end neo2_recv_sd_multi

Oh, one more thing: the bus_delays got replaced with a byte load from psram.
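
As a side note on the tloop part at the top of the routine, the start-bit wait can be modelled in C like this. It is a sketch only: the reader callback, the fake data source and all the names are assumptions made for illustration, standing in for the `lw` from the SD data register.

```c
#include <stdint.h>

/* Hypothetical reader callback standing in for the register read. */
typedef uint32_t (*read_dat_fn)(void *ctx);

/* Polls until bit 8 reads as 0 (start bit) or 'timeout' reads have been
 * made. Returns 1 on start bit, 0 on timeout, like res in the assembly. */
static int wait_start_bit(read_dat_fn rd, void *ctx, uint32_t timeout)
{
    while (timeout != 0) {
        if ((rd(ctx) & 0x0100u) == 0)  /* andi $2,$2,0x0100 ; beq $2,$0 */
            return 1;
        timeout--;                     /* addiu $11,$11,-1 */
    }
    return 0;
}

/* Fake data source for trying it out: the line stays high (idle) for
 * idle_reads reads, then goes low. */
struct fake_dat { uint32_t idle_reads; };

static uint32_t fake_read(void *ctx)
{
    struct fake_dat *f = ctx;
    if (f->idle_reads != 0) {
        f->idle_reads--;
        return 0xFFFFFFFFu;            /* bit 8 set: no start bit yet */
    }
    return 0xFFFFFEFFu;                /* bit 8 clear: start bit */
}
```

The assembly uses a timeout of 64 * 1024 reads per sector, so passing 65536 here gives the same bound.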
« Last Edit: September 11, 2010, 02:05:54 AM by Conle »