diff --git a/STATUS.md b/STATUS.md
index a285834..3268683 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -217,7 +217,7 @@ which runs correctly under MAME (apple2gs).
   image addresses.
 - `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
   libgcc into linkable objects.
-- `scripts/smokeTest.sh` runs 145 end-to-end checks at -O2:
+- `scripts/smokeTest.sh` runs 132 end-to-end checks at -O2:
   scalar ops, control flow, calling conventions, MAME execution
   regressions, link816 bss-base safety + weak-symbol resolution +
   heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
@@ -246,10 +246,20 @@ which runs correctly under MAME (apple2gs).
 
 - `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
   via MAME's emulated time counter.  Eight benchmarks under
-  `benchmarks/`.  Current numbers: popcount 3683 cyc, bsearch
-  852, memcmp 1091, strcpy 2558, dotProduct 2387, fib(10) 12617,
-  sumOfSquares 23529.  Speed is the optimization priority, not
-  size.
+  `benchmarks/`.  Current numbers (2026-05-13 after the umulhisi3 /
+  TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478,
+  bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302,
+  fib(10) 12617, sumOfSquares 18755.  Speed is the optimization
+  priority, not size.
+
+- `compare/` holds three side-by-side C tests with our asm and
+  Calypsi's listing for static-size comparison:
+  `sumSquares`/`evalAt`/`mul16to32`.  `bash compare/regen.sh`
+  recompiles each under both `clang --target=w65816 -O2 -S` and
+  `cc65816 --speed -O 2 --64bit-doubles` and prints an
+  ours/Calypsi instruction-count ratio.  Current ratios:
+  sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x.  See
+  `compare/README.md`.
 
 **Backend register allocation:**
 
@@ -323,16 +333,66 @@ for the common-case C / minimal-C++ workload.  Priority is speed
   rewrites to LDA/INA/STA/INC_HI_IF_CARRY (with private-label BNE
   expansion in AsmPrinter).  Saves ~13 cyc per increment on the
   no-carry common path.  memcmp 1330 → 1194 (−10.2%), strcpy 3325 →
-  3154 (−5.1%).  LSR's `*p++ → base+offset` rewrite remains
-  unaddressed; tried `-disable-lsr` and `isLSRCostLess` override,
-  both regressed dotProduct.
+  3154 (−5.1%).  Now also tolerates intervening TAX/TXA pseudo-saves
+  in the matcher (regalloc inserts them around STAfi's conservative
+  `Defs=[A]`); LSR-introduced i32 PHIs like `lsr.iv9 += 1` now match.
+  LSR's `*p++ → base+offset` rewrite remains unaddressed; tried
+  `-disable-lsr` and `isLSRCostLess` override, both regressed
+  dotProduct.
+
+- **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
+  pass** (2026-05-13).  Added `__umulhisi3` (unsigned 16x16→32) to
+  `runtime/src/libgcc.s`.  New IR pass in `addISelPrepare` walks
+  `mul i32 X, Y` and uses IR-level `computeKnownBits` plus a SCEV
+  unsigned-range fallback (`getUnsignedRange().getActiveBits() <=
+  16`) to detect operands with provably-zero high 16 bits — fires
+  on the canonical loop-internal `(u32)i*i` pattern after LSR
+  widens the i16 IV to i32.  Rewrites to a call to `__umulhisi3`.
+  sumOfSquares 20801 → 19096 cyc/call by itself (−8.2% from
+  baseline).
+
+- **Dead TAX/TXA peephole** (2026-05-13).  STAfi's conservative
+  `Defs=[A]` (for the IMG-source PHA-bracketed expansion path)
+  causes regalloc to insert spurious TAX/TXA save/restore brackets
+  even when STAfi's source is A directly.  `W65816SepRepCleanup`
+  now elides TXA/TYA whose next non-debug inst defines `$a`, and
+  TAX/TAY whose target reg is dead before its next redefinition.
+  Cross-MBB liveness via `Succ->isLiveIn(...)`; bails on
+  return-terminated MBBs (RTL doesn't model the i32-return
+  convention).  Tracks `pRedef` so `TAX ; CLC ; ADC` chains
+  don't bail on ADC's $p-read (CLC redefines the carry flag).
+
+- **i32 += i32 store-bypass** (2026-05-13).  Regalloc materializes
+  the call-result `A:X` i32 pair into spill slots before the add,
+  then reloads — emitting a 10-instruction `STA-TXA-STA-LDA-CLC-
+  ADC-STA-LDA-ADC-STA` sequence.  `W65816SepRepCleanup` matcher
+  rewrites to 6-instruction `CLC-ADC-STA-TXA-ADC-STA` (TXA preserves
+  carry; hi-half consumes it directly from X).  Saves 4 inst / ~13
+  cyc per call-result-add site.  sumOfSquares 20460 → 19096 (−6.7%).
+
+- **PHI-copy hoist out of PHP/PLP wrap** (2026-05-13).
+  `W65816SepRepCleanup` detects the back-edge `CMP ; PHP ; (LDA/STA
+  pairs) ; PLP ; (trailing STA) ; Bxx ; BRA loop` pattern and
+  hoists the LDA/STA pairs + trailing STA above the CMP's $a-producer
+  chain, dropping PHP/PLP.  Two safety guards: (1) **bump undo** —
+  in-wrap stack-rel offsets were pre-adjusted by +1 (PHP decrements
+  S; `W65816StackSlotCleanup`'s wrap pass compensates inside the
+  wrap), so the hoist subtracts 1 from each `LDA_StackRel` /
+  `STA_StackRel` offset; trailing STAs (already outside the wrap)
+  are untouched.  (2) **pair-count check** — require
+  `#LDAs(Block) == #STAs(Block) + #STAs(Trailing)`; an extra LDA
+  is a memory-to-register PHI value live-out at the back-edge
+  (consumed by the loop top's first STA), and hoisting would
+  clobber A.  Saves 2 inst / 8 cyc per occurrence.  sumOfSquares
+  19096 → 18755 (−1.8%), popcount 3683 → 3478 (−5.6%).
 
 - **More peephole / libcall opportunities.**  __mulsi3 just gained
   early-exit when the multiplier shifts to 0; dotProduct dropped
   4007→2472 (−38.3%), sumOfSquares 40920→23870 (−41.6%).  Next
-  candidates: a true 16×16→32 multiply libcall (for `(u32)i*i`
-  patterns) and shift-by-N inlining for shifts 5+ that currently
-  go through __ashlsi3.
+  candidates: shift-by-N inlining for shifts 5+ that currently
+  go through __ashlsi3; a `u32 += zext i16` SDAG combine to skip
+  the hi-half carry chain when one operand has known-zero high
+  16 bits.
 
 **Open limitations:**