Checkpoint

Scott Duensing 2026-05-13 15:57:02 -05:00
parent e65fedc8e1
commit e2e4b778b0


@ -217,7 +217,7 @@ which runs correctly under MAME (apple2gs).
image addresses.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 145 end-to-end checks at -O2:
- `scripts/smokeTest.sh` runs 132 end-to-end checks at -O2:
scalar ops, control flow, calling conventions, MAME execution
regressions, link816 bss-base safety + weak-symbol resolution +
heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
@ -246,10 +246,20 @@ which runs correctly under MAME (apple2gs).
- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
via MAME's emulated time counter. Eight benchmarks under
`benchmarks/`. Current numbers: popcount 3683 cyc, bsearch
852, memcmp 1091, strcpy 2558, dotProduct 2387, fib(10) 12617,
sumOfSquares 23529. Speed is the optimization priority, not
size.
`benchmarks/`. Current numbers (2026-05-13 after the umulhisi3 /
TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478,
bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302,
fib(10) 12617, sumOfSquares 18755. Speed is the optimization
priority, not size.
- `compare/` holds three side-by-side C tests with our asm and
Calypsi's listing for static-size comparison:
`sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh`
recompiles each under both `clang --target=w65816 -O2 -S` and
`cc65816 --speed -O 2 --64bit-doubles` and prints an
ours/Calypsi instruction-count ratio. Current ratios:
sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x. See
`compare/README.md`.
**Backend register allocation:**
@ -323,16 +333,66 @@ for the common-case C / minimal-C++ workload. Priority is speed
rewrites to LDA/INA/STA/INC_HI_IF_CARRY (with private-label BNE
expansion in AsmPrinter). Saves ~13 cyc per increment on the
no-carry common path. memcmp 1330 → 1194 (10.2%), strcpy 3325 →
3154 (5.1%). LSR's `*p++ → base+offset` rewrite remains
unaddressed; tried `-disable-lsr` and `isLSRCostLess` override,
both regressed dotProduct.
3154 (5.1%). Now also tolerates intervening TAX/TXA pseudo-saves
in the matcher (regalloc inserts them around STAfi's conservative
`Defs=[A]`); LSR-introduced i32 PHIs like `lsr.iv9 += 1` now match.
LSR's `*p++ → base+offset` rewrite remains unaddressed; tried
`-disable-lsr` and `isLSRCostLess` override, both regressed
dotProduct.
- **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
pass** (2026-05-13). Added `__umulhisi3` (unsigned 16x16→32) to
`runtime/src/libgcc.s`. New IR pass in `addISelPrepare` walks
`mul i32 X, Y` and uses IR-level `computeKnownBits` plus a SCEV
unsigned-range fallback (`getUnsignedRange().getActiveBits() <=
16`) to detect operands with provably-zero high 16 bits — fires
on the canonical loop-internal `(u32)i*i` pattern after LSR
widens the i16 IV to i32. Rewrites to a call to `__umulhisi3`.
sumOfSquares 20801 → 19096 cyc/call by itself (-8.2% from
baseline).
- **Dead TAX/TXA peephole** (2026-05-13). STAfi's conservative
`Defs=[A]` (for the IMG-source PHA-bracketed expansion path)
causes regalloc to insert spurious TAX/TXA save/restore brackets
even when STAfi's source is A directly. `W65816SepRepCleanup`
now elides TXA/TYA whose next non-debug inst defines `$a`, and
TAX/TAY whose target reg is dead before its next redefinition.
Cross-MBB liveness via `Succ->isLiveIn(...)`; bails on
return-terminated MBBs (RTL doesn't model the i32-return
convention). Tracks `pRedef` so `TAX ; CLC ; ADC` chains
don't bail on ADC's $p-read (CLC freshens the carry flag).
- **i32 += i32 store-bypass** (2026-05-13). Regalloc materializes
the call-result `A:X` i32 pair into spill slots before the add,
then reloads — emitting a 10-instruction `STA-TXA-STA-LDA-CLC-
ADC-STA-LDA-ADC-STA` sequence. `W65816SepRepCleanup` matcher
rewrites to 6-instruction `CLC-ADC-STA-TXA-ADC-STA` (TXA preserves
carry; hi-half consumes it directly from X). Saves 4 inst / ~13
cyc per call-result-add site. sumOfSquares 20460 → 19096 (-6.7%).
- **PHI-copy hoist out of PHP/PLP wrap** (2026-05-13).
`W65816SepRepCleanup` detects the back-edge `CMP ; PHP ; (LDA/STA
pairs) ; PLP ; (trailing STA) ; Bxx ; BRA loop` pattern and
hoists the LDA/STA pairs + trailing STA above the CMP's $a-producer
chain, dropping PHP/PLP. Two safety guards: (1) **bump undo**:
in-wrap stack-rel offsets were pre-adjusted by +1 (PHP decrements
S; `W65816StackSlotCleanup`'s wrap pass compensates inside the
wrap), so the hoist subtracts 1 from each `LDA_StackRel` /
`STA_StackRel` offset; trailing STAs (already outside the wrap)
are untouched. (2) **pair-count check** — require
`#LDAs(Block) == #STAs(Block) + #STAs(Trailing)`; an extra LDA
is a memory-to-register PHI value live-out at the back-edge
(consumed by the loop top's first STA), and hoisting would
clobber A. Saves 2 inst / 8 cyc per occurrence. sumOfSquares
19096 → 18755 (-1.8%), popcount 3683 → 3478 (-5.6%).
- **More peephole / libcall opportunities.** __mulsi3 just gained
early-exit when the multiplier shifts to 0; dotProduct dropped
4007→2472 (38.3%), sumOfSquares 40920→23870 (41.6%). Next
candidates: a true 16×16→32 multiply libcall (for `(u32)i*i`
patterns) and shift-by-N inlining for shifts 5+ that currently
go through __ashlsi3.
candidates: shift-by-N inlining for shifts 5+ that currently
go through __ashlsi3; a `u32 += zext i16` SDAG combine to skip
the hi-half carry chain when one operand has known-zero high
16 bits.
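
For reference, the arithmetic behind the `__umulhisi3` narrowing and the i32 += i32 store-bypass entries above can be modelled in portable C. This is only a sketch of the expected results, not the code in `runtime/src/libgcc.s` or the backend passes: the names `umulhisi3_model`, `fits_in_16_bits`, `mul32`, `add32_by_halves` and `sum_of_squares` are hypothetical, and the runtime check in `mul32` stands in for what the IR pass proves statically via known bits or SCEV ranges.

```c
#include <assert.h>
#include <stdint.h>

/* Reference result of an unsigned 16x16->32 multiply. Assumption: this is
   the contract __umulhisi3 implements; the real routine is 65816 assembly. */
static uint32_t umulhisi3_model(uint16_t a, uint16_t b) {
    return (uint32_t)a * (uint32_t)b;
}

/* Hypothetical stand-in for the narrowing condition: high 16 bits are zero.
   The pass proves this at compile time; here it is checked on a value. */
static int fits_in_16_bits(uint32_t v) {
    return (v >> 16) == 0;
}

/* 32-bit multiply that takes the narrow path only when both operands fit. */
static uint32_t mul32(uint32_t x, uint32_t y) {
    if (fits_in_16_bits(x) && fits_in_16_bits(y))
        return umulhisi3_model((uint16_t)x, (uint16_t)y);
    return x * y; /* general path, wraps modulo 2^32 */
}

/* 32-bit add done as two 16-bit adds chained through the carry, mirroring
   the CLC-ADC-STA-TXA-ADC-STA shape: a_lo plays the A half of the call
   result, x_hi plays the X half. */
static uint32_t add32_by_halves(uint32_t acc, uint16_t a_lo, uint16_t x_hi) {
    uint32_t lo = (acc & 0xFFFFu) + a_lo;        /* CLC ; ADC lo ; STA lo */
    uint32_t carry = lo >> 16;                   /* carry out of the low add */
    uint32_t hi = (acc >> 16) + x_hi + carry;    /* TXA ; ADC hi ; STA hi */
    return ((hi & 0xFFFFu) << 16) | (lo & 0xFFFFu);
}

/* Loop in the spirit of the sumOfSquares benchmark: (u32)i*i accumulated
   into a 32-bit sum, with i bounded by a 16-bit count. */
uint32_t sum_of_squares(uint16_t n) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; ++i) {
        uint32_t sq = mul32(i, i);
        acc = add32_by_halves(acc, (uint16_t)sq, (uint16_t)(sq >> 16));
    }
    return acc;
}
```

The narrow path is safe because two operands below 2^16 have a product below 2^32, so no bits are lost relative to the full `mul i32`. The split add gives the same result as a plain 32-bit add as long as the carry from the low half reaches the high half, which is why the rewritten sequence relies on TXA leaving the carry flag untouched.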
**Open limitations:**