Checkpoint
parent e65fedc8e1
commit e2e4b778b0
1 changed file with 71 additions and 11 deletions
STATUS.md | 82
@@ -217,7 +217,7 @@ which runs correctly under MAME (apple2gs).
 image addresses.
 - `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
 libgcc into linkable objects.
-- `scripts/smokeTest.sh` runs 145 end-to-end checks at -O2:
+- `scripts/smokeTest.sh` runs 132 end-to-end checks at -O2:
 scalar ops, control flow, calling conventions, MAME execution
 regressions, link816 bss-base safety + weak-symbol resolution +
 heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
@@ -246,10 +246,20 @@ which runs correctly under MAME (apple2gs).
 
 - `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
 via MAME's emulated time counter. Eight benchmarks under
-`benchmarks/`. Current numbers: popcount 3683 cyc, bsearch
-852, memcmp 1091, strcpy 2558, dotProduct 2387, fib(10) 12617,
-sumOfSquares 23529. Speed is the optimization priority, not
-size.
+`benchmarks/`. Current numbers (2026-05-13 after the umulhisi3 /
+TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478,
+bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302,
+fib(10) 12617, sumOfSquares 18755. Speed is the optimization
+priority, not size.
 
+- `compare/` holds three side-by-side C tests with our asm and
+Calypsi's listing for static-size comparison:
+`sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh`
+recompiles each under both `clang --target=w65816 -O2 -S` and
+`cc65816 --speed -O 2 --64bit-doubles` and prints an
+ours/Calypsi instruction-count ratio. Current ratios:
+sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x. See
+`compare/README.md`.
+
 **Backend register allocation:**
 
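The before/after cycle counts in the hunk above can be turned into relative deltas with a few lines of C. This is only an illustrative sketch: `percent_change`, `report`, and `print_benchmark_deltas` are hypothetical helper names, not part of the repository's scripts, and the inputs are the cycle counts quoted on the removed and added lines.

```c
#include <stdio.h>

/* Relative change from `before` to `after`, in percent.
   Negative values mean the benchmark got faster. */
static double percent_change(long before, long after) {
    return 100.0 * (double)(after - before) / (double)before;
}

/* Print one benchmark row: name, old count, new count, signed delta. */
static void report(const char *name, long before, long after) {
    printf("%-13s %6ld -> %6ld (%+.1f%%)\n",
           name, before, after, percent_change(before, after));
}

/* Benchmarks whose numbers differ between the old and new STATUS.md text. */
static void print_benchmark_deltas(void) {
    report("popcount",     3683,  3478);
    report("dotProduct",   2387,  2302);
    report("sumOfSquares", 23529, 18755);
}
```

Calling `print_benchmark_deltas()` from a `main` prints one row per benchmark; the other benchmarks listed (bsearch, memcmp, strcpy, fib(10)) are unchanged by this commit, so they are left out.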
@@ -323,16 +333,66 @@ for the common-case C / minimal-C++ workload. Priority is speed
 rewrites to LDA/INA/STA/INC_HI_IF_CARRY (with private-label BNE
 expansion in AsmPrinter). Saves ~13 cyc per increment on the
 no-carry common path. memcmp 1330 → 1194 (−10.2%), strcpy 3325 →
-3154 (−5.1%). LSR's `*p++ → base+offset` rewrite remains
-unaddressed; tried `-disable-lsr` and `isLSRCostLess` override,
-both regressed dotProduct.
+3154 (−5.1%). Now also tolerates intervening TAX/TXA pseudo-saves
+in the matcher (regalloc inserts them around STAfi's conservative
+`Defs=[A]`); LSR-introduced i32 PHIs like `lsr.iv9 += 1` now match.
+LSR's `*p++ → base+offset` rewrite remains unaddressed; tried
+`-disable-lsr` and `isLSRCostLess` override, both regressed
+dotProduct.
+
+- **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
+pass** (2026-05-13). Added `__umulhisi3` (unsigned 16x16→32) to
+`runtime/src/libgcc.s`. New IR pass in `addISelPrepare` walks
+`mul i32 X, Y` and uses IR-level `computeKnownBits` plus a SCEV
+unsigned-range fallback (`getUnsignedRange().getActiveBits() <=
+16`) to detect operands with provably-zero high 16 bits — fires
+on the canonical loop-internal `(u32)i*i` pattern after LSR
+widens the i16 IV to i32. Rewrites to a call to `__umulhisi3`.
+sumOfSquares 20801 → 19096 cyc/call by itself (-8.2% from
+baseline).
+
+- **Dead TAX/TXA peephole** (2026-05-13). STAfi's conservative
+`Defs=[A]` (for the IMG-source PHA-bracketed expansion path)
+causes regalloc to insert spurious TAX/TXA save/restore brackets
+even when STAfi's source is A directly. `W65816SepRepCleanup`
+now elides TXA/TYA whose next non-debug inst defines `$a`, and
+TAX/TAY whose target reg is dead before its next redefinition.
+Cross-MBB liveness via `Succ->isLiveIn(...)`; bails on
+return-terminated MBBs (RTL doesn't model the i32-return
+convention). Tracks `pRedef` so `TAX ; CLC ; ADC` chains
+don't bail on ADC's $p-read (CLC freshens the carry flag).
+
+- **i32 += i32 store-bypass** (2026-05-13). Regalloc materializes
+the call-result `A:X` i32 pair into spill slots before the add,
+then reloads — emitting a 10-instruction `STA-TXA-STA-LDA-CLC-
+ADC-STA-LDA-ADC-STA` sequence. `W65816SepRepCleanup` matcher
+rewrites to 6-instruction `CLC-ADC-STA-TXA-ADC-STA` (TXA preserves
+carry; hi-half consumes it directly from X). Saves 4 inst / ~13
+cyc per call-result-add site. sumOfSquares 20460 → 19096 (-6.7%).
+
+- **PHI-copy hoist out of PHP/PLP wrap** (2026-05-13).
+`W65816SepRepCleanup` detects the back-edge `CMP ; PHP ; (LDA/STA
+pairs) ; PLP ; (trailing STA) ; Bxx ; BRA loop` pattern and
+hoists the LDA/STA pairs + trailing above the CMP's $a-producer
+chain, dropping PHP/PLP. Two safety guards: (1) **bump undo** —
+in-wrap stack-rel offsets were pre-adjusted by +1 (PHP decrements
+S; `W65816StackSlotCleanup`'s wrap pass compensates inside the
+wrap), so the hoist subtracts 1 from each `LDA_StackRel` /
+`STA_StackRel` offset; trailing STAs (already outside the wrap)
+are untouched. (2) **pair-count check** — require
+`#LDAs(Block) == #STAs(Block) + #STAs(Trailing)`; an extra LDA
+is a memory-to-register PHI value live-out at the back-edge
+(consumed by the loop top's first STA), and hoisting would
+clobber A. Saves 2 inst / 8 cyc per occurrence. sumOfSquares
+19096 → 18755 (-1.8%), popcount 3683 → 3478 (-5.6%).
 
 - **More peephole / libcall opportunities.** __mulsi3 just gained
 early-exit when the multiplier shifts to 0; dotProduct dropped
 4007→2472 (−38.3%), sumOfSquares 40920→23870 (−41.6%). Next
-candidates: a true 16×16→32 multiply libcall (for `(u32)i*i`
-patterns) and shift-by-N inlining for shifts 5+ that currently
-go through __ashlsi3.
+candidates: shift-by-N inlining for shifts 5+ that currently
+go through __ashlsi3; a `u32 += zext i16` SDAG combine to skip
+the hi-half carry chain when one operand has known-zero high
+16 bits.
 
 **Open limitations:**
 
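The narrowing described in the `__umulhisi3` bullet can be illustrated in portable C. This is a minimal sketch under stated assumptions, not the project's routine (which the diff says lives in `runtime/src/libgcc.s`): the names `umulhisi3_ref`, `fits_in_16_bits`, `mul32_narrowed`, and `sum_of_squares` are hypothetical, and the runtime range check stands in for what the pass is described as proving at compile time via `computeKnownBits` or SCEV ranges.

```c
#include <stdint.h>

/* Assumed reference semantics for an unsigned 16x16->32 multiply:
   two 16-bit inputs, full 32-bit product. */
static uint32_t umulhisi3_ref(uint16_t a, uint16_t b) {
    return (uint32_t)a * (uint32_t)b;
}

/* True when the upper 16 bits of a 32-bit value are all zero. */
static int fits_in_16_bits(uint32_t v) {
    return (v >> 16) == 0;
}

/* Models the rewrite: when both operands of a 32-bit multiply have
   zero high halves, the narrow multiply yields the same product. */
static uint32_t mul32_narrowed(uint32_t x, uint32_t y) {
    if (fits_in_16_bits(x) && fits_in_16_bits(y)) {
        return umulhisi3_ref((uint16_t)x, (uint16_t)y);
    }
    return x * y; /* general path: full 32x32 multiply, truncated to 32 bits */
}

/* The loop shape the bullet mentions: a 16-bit induction variable
   squared in 32-bit arithmetic and accumulated. */
static uint32_t sum_of_squares(uint16_t n) {
    uint32_t sum = 0;
    for (uint16_t i = 0; i < n; i++) {
        sum += mul32_narrowed(i, i);
    }
    return sum;
}
```

The point of the sketch is the equivalence only: a 16x16 product always fits in 32 bits, so the narrow call loses nothing whenever both high halves are known to be zero.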
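The carry chain behind the `CLC-ADC-STA-TXA-ADC-STA` sequence in the store-bypass bullet can be modelled in C as two 16-bit adds linked by a carry bit. This is a hedged sketch of the arithmetic only: `u32_halves`, `adc16`, and `add32_in_place` are hypothetical names, and the comments map each step to a mnemonic on the assumption (taken from the bullet) that the low half of the call result arrives in A and the high half in X.

```c
#include <stdint.h>

/* A 32-bit accumulator stored as two 16-bit memory halves. */
typedef struct {
    uint16_t lo;
    uint16_t hi;
} u32_halves;

/* 16-bit add with carry-in and carry-out, modelling ADC in 16-bit mode. */
static uint16_t adc16(uint16_t a, uint16_t b, unsigned *carry) {
    uint32_t wide = (uint32_t)a + (uint32_t)b + (*carry & 1u);
    *carry = (unsigned)(wide >> 16); /* carry-out is bit 16 of the sum */
    return (uint16_t)wide;
}

/* acc += value, with value's low half in a_reg and high half in x_reg. */
static void add32_in_place(u32_halves *acc, uint16_t a_reg, uint16_t x_reg) {
    unsigned carry = 0;                      /* CLC */
    acc->lo = adc16(a_reg, acc->lo, &carry); /* ADC lo ; STA lo */
    a_reg = x_reg;                           /* TXA: carry is left untouched */
    acc->hi = adc16(a_reg, acc->hi, &carry); /* ADC hi ; STA hi */
}
```

For example, starting from an accumulator of `0x0001FFFF` and adding `0x00020001` (A = `0x0001`, X = `0x0002`) leaves `lo == 0x0000` and `hi == 0x0004`, since the low-half add carries into the high half.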