diff --git a/STATUS.md b/STATUS.md
index a285834..3268683 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -217,7 +217,7 @@ which runs correctly under MAME (apple2gs).
   image addresses.
 - `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
   libgcc into linkable objects.
-- `scripts/smokeTest.sh` runs 145 end-to-end checks at -O2:
+- `scripts/smokeTest.sh` runs 132 end-to-end checks at -O2:
   scalar ops, control flow, calling conventions, MAME execution
   regressions, link816 bss-base safety + weak-symbol resolution +
   heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
@@ -246,10 +246,20 @@ which runs correctly under MAME (apple2gs).
 
 - `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
   via MAME's emulated time counter. Eight benchmarks under
-  `benchmarks/`. Current numbers: popcount 3683 cyc, bsearch
-  852, memcmp 1091, strcpy 2558, dotProduct 2387, fib(10) 12617,
-  sumOfSquares 23529. Speed is the optimization priority, not
-  size.
+  `benchmarks/`. Current numbers (2026-05-13 after the umulhisi3 /
+  TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478,
+  bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302,
+  fib(10) 12617, sumOfSquares 18755. Speed is the optimization
+  priority, not size.
+
+- `compare/` holds three side-by-side C tests with our asm and
+  Calypsi's listing for static-size comparison:
+  `sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh`
+  recompiles each under both `clang --target=w65816 -O2 -S` and
+  `cc65816 --speed -O 2 --64bit-doubles` and prints an
+  ours/Calypsi instruction-count ratio. Current ratios:
+  sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x. See
+  `compare/README.md`.
 
 **Backend register allocation:**
 
@@ -323,16 +333,66 @@ for the common-case C / minimal-C++ workload. Priority is speed
   rewrites to LDA/INA/STA/INC_HI_IF_CARRY (with private-label BNE
   expansion in AsmPrinter). Saves ~13 cyc per increment on the
   no-carry common path. memcmp 1330 → 1194 (−10.2%), strcpy 3325 →
-  3154 (−5.1%). LSR's `*p++ → base+offset` rewrite remains
-  unaddressed; tried `-disable-lsr` and `isLSRCostLess` override,
-  both regressed dotProduct.
+  3154 (−5.1%). Now also tolerates intervening TAX/TXA pseudo-saves
+  in the matcher (regalloc inserts them around STAfi's conservative
+  `Defs=[A]`); LSR-introduced i32 PHIs like `lsr.iv9 += 1` now match.
+  LSR's `*p++ → base+offset` rewrite remains unaddressed; tried
+  `-disable-lsr` and `isLSRCostLess` override, both regressed
+  dotProduct.
+
+- **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
+  pass** (2026-05-13). Added `__umulhisi3` (unsigned 16x16→32) to
+  `runtime/src/libgcc.s`. New IR pass in `addISelPrepare` walks
+  `mul i32 X, Y` and uses IR-level `computeKnownBits` plus a SCEV
+  unsigned-range fallback (`getUnsignedRange().getActiveBits() <=
+  16`) to detect operands with provably-zero high 16 bits — fires
+  on the canonical loop-internal `(u32)i*i` pattern after LSR
+  widens the i16 IV to i32. Rewrites to a call to `__umulhisi3`.
+  sumOfSquares 20801 → 19096 cyc/call by itself (-8.2% from
+  baseline).
+
+- **Dead TAX/TXA peephole** (2026-05-13). STAfi's conservative
+  `Defs=[A]` (for the IMG-source PHA-bracketed expansion path)
+  causes regalloc to insert spurious TAX/TXA save/restore brackets
+  even when STAfi's source is A directly. `W65816SepRepCleanup`
+  now elides TXA/TYA whose next non-debug inst defines `$a`, and
+  TAX/TAY whose target reg is dead before its next redefinition.
+  Cross-MBB liveness via `Succ->isLiveIn(...)`; bails on
+  return-terminated MBBs (RTL doesn't model the i32-return
+  convention). Tracks `pRedef` so `TAX ; CLC ; ADC` chains
+  don't bail on ADC's $p-read (CLC freshens the carry flag).
+
+- **i32 += i32 store-bypass** (2026-05-13). Regalloc materializes
+  the call-result `A:X` i32 pair into spill slots before the add,
+  then reloads — emitting a 10-instruction `STA-TXA-STA-LDA-CLC-
+  ADC-STA-LDA-ADC-STA` sequence. `W65816SepRepCleanup` matcher
+  rewrites to 6-instruction `CLC-ADC-STA-TXA-ADC-STA` (TXA preserves
+  carry; hi-half consumes it directly from X). Saves 4 inst / ~13
+  cyc per call-result-add site. sumOfSquares 20460 → 19096 (-6.7%).
+
+- **PHI-copy hoist out of PHP/PLP wrap** (2026-05-13).
+  `W65816SepRepCleanup` detects the back-edge `CMP ; PHP ; (LDA/STA
+  pairs) ; PLP ; (trailing STA) ; Bxx ; BRA loop` pattern and
+  hoists the LDA/STA pairs + trailing above the CMP's $a-producer
+  chain, dropping PHP/PLP. Two safety guards: (1) **bump undo** —
+  in-wrap stack-rel offsets were pre-adjusted by +1 (PHP decrements
+  S; `W65816StackSlotCleanup`'s wrap pass compensates inside the
+  wrap), so the hoist subtracts 1 from each `LDA_StackRel` /
+  `STA_StackRel` offset; trailing STAs (already outside the wrap)
+  are untouched. (2) **pair-count check** — require
+  `#LDAs(Block) == #STAs(Block) + #STAs(Trailing)`; an extra LDA
+  is a memory-to-register PHI value live-out at the back-edge
+  (consumed by the loop top's first STA), and hoisting would
+  clobber A. Saves 2 inst / 8 cyc per occurrence. sumOfSquares
+  19096 → 18755 (-1.8%), popcount 3683 → 3478 (-5.6%).
 
 - **More peephole / libcall opportunities.** __mulsi3 just gained
   early-exit when the multiplier shifts to 0; dotProduct dropped
   4007→2472 (−38.3%), sumOfSquares 40920→23870 (−41.6%). Next
-  candidates: a true 16×16→32 multiply libcall (for `(u32)i*i`
-  patterns) and shift-by-N inlining for shifts 5+ that currently
-  go through __ashlsi3.
+  candidates: shift-by-N inlining for shifts 5+ that currently
+  go through __ashlsi3; a `u32 += zext i16` SDAG combine to skip
+  the hi-half carry chain when one operand has known-zero high
+  16 bits.
 
 **Open limitations:**