Checkpoint

2026-05-13 20:54:28 -05:00 · 2026-05-13 20:54:28 -05:00 · 42f0d16d07
commit 42f0d16d07
parent e2e4b778b0
19 changed files with 2008 additions and 84 deletions
--- a/STATUS.md
+++ b/STATUS.md
@@ -246,20 +246,21 @@ which runs correctly under MAME (apple2gs).

 - `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
  via MAME's emulated time counter.  Eight benchmarks under
-  `benchmarks/`.  Current numbers (2026-05-13 after the umulhisi3 /
-  TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478,
-  bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302,
-  fib(10) 12617, sumOfSquares 18755.  Speed is the optimization
-  priority, not size.
+  `benchmarks/`.  Current numbers (after W65816StackSlotMerge):
+  popcount 3376, bsearch 852, memcmp 1091, strcpy 2387,
+  dotProduct 2302, fib(10) 12617, sumOfSquares 17391.  Speed is
+  the optimization priority, not size.

 - `compare/` holds three side-by-side C tests with our asm and
  Calypsi's listing for static-size comparison:
  `sumSquares`/`evalAt`/`mul16to32`.  `bash compare/regen.sh`
  recompiles each under both `clang --target=w65816 -O2 -S` and
  `cc65816 --speed -O 2 --64bit-doubles` and prints an
-  ours/Calypsi instruction-count ratio.  Current ratios:
-  sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x.  See
-  `compare/README.md`.
+  ours/Calypsi instruction-count ratio.  Current ratios (post
+  W65816StackSlotMerge Phase 5/6 + extracted Phase 6/6a per-MBB
+  peepholes + Pass 1c PHP-wrap CMP elim for SP-rel functions):
+  sumSquares 1.81x (56 inst), evalAt 2.10x (534 inst), mul16to32
+  2.25x (9 inst).  See `compare/README.md`.

 **Backend register allocation:**

@@ -340,6 +341,46 @@ for the common-case C / minimal-C++ workload.  Priority is speed
  `-disable-lsr` and `isLSRCostLess` override, both regressed
  dotProduct.

+- **W65816StackSlotMerge — value-equivalent stack slot coalesce**
+  (2026-05-13).  Pre-emit pass that merges PHI src/dst stack-slot
+  pairs which LLVM's StackSlotColoring can't see (they're
+  simultaneously live but hold the same value).  Detects the
+  canonical loop-body `LDA X ; STA Y` PHI-copy in a self-looped
+  MBB, verifies value equivalence via bidirectional twin-pairing
+  (Case 1: same A in same MBB / Case 2: PHI-copy reload pattern /
+  Case 3: matching `LDA #const` init in different MBBs), and
+  renames slot X→Y function-wide.  Runs AFTER SepRepCleanup so the
+  PHI copies are out of their PHP/PLP wraps and offsets are stable.
+  **A-define detection is opcode-based, not operand-based** —
+  LDA_DP / LDA_Abs / LDA_Long etc. omit the `implicit-def $a`
+  annotation in tablegen but semantically write A; the
+  `semanticallyDefsA` helper falls back to an opcode whitelist.
+  sumSquares static: 65 → 61 inst (1.97x — under 2x Calypsi for
+  the first time).  sumOfSquares cyc/call: 18755 → 17391
+  (**−7.3%**).  strcpy: 2558 → 2387 (−6.7%).  See
+  W65816StackSlotMerge.cpp.
+
+- **LSR-widened i32 IV narrowing** (`W65816NarrowI32Mul` Phase 2,
+  2026-05-13).  After rewriting `mul i32 X, Y` to a `__umulhisi3`
+  call, scan for i32 PHIs whose only uses are (a) the truncs the
+  rewrite emitted and (b) a single self-feeding `add %P, const`.
+  When SCEV bounds the PHI to u16, build an i16 PHI + i16 add in
+  place, replace truncs, and erase the i32 chain.  Care needed
+  to break the PN ↔ Incr use-cycle before erasing.  sumSquares
+  frame: 14B → 12B; loop-internal `i++` shrinks from 7→3 inst.
+
+- **PHI-hoist accepts LDA_Imm16 / LDAi16imm** (2026-05-13).
+  Init blocks contain `lda #const ; sta slot,s` pairs wrapped in
+  PHP/PLP around the pre-loop CMP — same shape as a PHI-copy
+  wrap but with an immediate load instead of a memory load.
+  Matcher extended to accept both the MC opcode (`LDA_Imm16`) and
+  the surviving pseudo (`LDAi16imm`), with an added **$a-live-out
+  guard**: if any successor MBB has $a in its live-in set, bail —
+  the LDA's A-value is a fall-through register-PHI consumed by
+  the successor's first STA, and hoisting clobbers it.  Caught
+  by `sumTable` where `lda #0 ; sta 0x9,s` (wrap+trailing) ALSO
+  supplied A=0 to `bb.2`'s `sta 0x1,s`.
+
 - **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
  pass** (2026-05-13).  Added `__umulhisi3` (unsigned 16x16→32) to
  `runtime/src/libgcc.s`.  New IR pass in `addISelPrepare` walks
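Reviewer sketch, not part of the commit: the rename-then-erase step that the W65816StackSlotMerge entry above describes, modeled on a toy instruction list. `ToyInst` and `mergeSlots` are hypothetical names. The sketch assumes the value-equivalence proof (twin pairing) has already succeeded and that A and the flags are dead after each erased `LDA`; the real pass works on MachineInstrs and is not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy instruction: an opcode name plus the stack slot it touches (-1 = none).
struct ToyInst {
  std::string op;
  int slot;
};

// Rename every reference to slot `from` into slot `to`, then drop each
// adjacent `LDA s ; STA s` pair, which is a no-op copy after the rename.
// Hypothetical helper: it performs no value-equivalence or liveness checks.
std::vector<ToyInst> mergeSlots(std::vector<ToyInst> code, int from, int to) {
  for (ToyInst &inst : code)
    if (inst.slot == from) inst.slot = to;

  std::vector<ToyInst> out;
  for (std::size_t i = 0; i < code.size(); ++i) {
    bool selfCopy = i + 1 < code.size() && code[i].op == "LDA" &&
                    code[i + 1].op == "STA" && code[i].slot >= 0 &&
                    code[i].slot == code[i + 1].slot;
    if (selfCopy) {
      ++i;  // skip the STA half of the pair as well
      continue;
    }
    out.push_back(code[i]);
  }
  return out;
}
```

Feeding it the old sumSquares loop tail (`adc 0x3,s ; sta 0xb,s ; lda 0xb,s ; sta 0x3,s`) with slot 0xb renamed to 0x3 leaves `adc 0x3,s ; sta 0x3,s`, which matches the shape of the new listing in `compare/sumSquares.ours.s`.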
--- a/compare/evalAt.calypsi.lst
+++ b/compare/evalAt.calypsi.lst
@@ -1,7 +1,7 @@
 ###############################################################################
 #                                                                             #
 # Calypsi ISO C compiler for 65816                               version 5.16 #
-#                                                       13/May/2026  15:46:15 #
+#                                                       13/May/2026  20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles evalAt.c -o                      #
 #               /tmp/evalAt.calypsi.elf --list-file evalAt.calypsi.lst        #
 #                                                                             #
--- a/compare/evalAt.ours.s
+++ b/compare/evalAt.ours.s
@@ -139,9 +139,10 @@ evalAt:                                 ; @evalAt
 	lda	0x1d, s
 	sta	[0xe0 ], y
 	pea	0x4024
-	pea	0x0
-	pea	0x0
-	pea	0x0
+	lda	#0x0
+	pha
+	pha
+	pha
 	lda	0x17, s
 	pha
 	lda	0x1b, s
@@ -272,9 +273,9 @@ evalAt:                                 ; @evalAt
 	lda	0xc4
 	sta	0x15, s
 	lda	0xca
-	sta	0x11, s
-	lda	0xc8
 	sta	0x13, s
+	lda	0xc8
+	sta	0x11, s
 	lda	0x17, s
 	pha
 	lda	0x1f, s
@@ -283,9 +284,9 @@ evalAt:                                 ; @evalAt
 	pha
 	lda	0x27, s
 	pha
-	lda	0x19, s
+	lda	0x1b, s
 	pha
-	lda	0x1d, s
+	lda	0x1b, s
 	pha
 	lda	0x27, s
 	tax
@@ -518,9 +519,9 @@ evalAt:                                 ; @evalAt
 	lda	0xc4
 	sta	0x15, s
 	lda	0xca
-	sta	0x11, s
-	lda	0xc8
 	sta	0x13, s
+	lda	0xc8
+	sta	0x11, s
 	lda	0x17, s
 	pha
 	lda	0x1f, s
@@ -529,9 +530,9 @@ evalAt:                                 ; @evalAt
 	pha
 	lda	0x27, s
 	pha
-	lda	0x19, s
+	lda	0x1b, s
 	pha
-	lda	0x1d, s
+	lda	0x1b, s
 	pha
 	lda	0x27, s
 	tax
--- a/compare/mul16to32.calypsi.lst
+++ b/compare/mul16to32.calypsi.lst
@@ -1,7 +1,7 @@
 ###############################################################################
 #                                                                             #
 # Calypsi ISO C compiler for 65816                               version 5.16 #
-#                                                       13/May/2026  15:46:15 #
+#                                                       13/May/2026  20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles mul16to32.c -o                   #
 #               /tmp/mul16to32.calypsi.elf --list-file                        #
 #               mul16to32.calypsi.lst                                         #
--- a/compare/mul16to32.ours.s
+++ b/compare/mul16to32.ours.s
@@ -11,7 +11,6 @@ mul16to32:                              ; @mul16to32
 	jsl	__umulhisi3
 	ply
 	sta	0x1, s
-	lda	0x1, s
 	ply
 	rtl
 .Lfunc_end0:
--- a/compare/sumSquares.calypsi.lst
+++ b/compare/sumSquares.calypsi.lst
@@ -1,7 +1,7 @@
 ###############################################################################
 #                                                                             #
 # Calypsi ISO C compiler for 65816                               version 5.16 #
-#                                                       13/May/2026  15:46:15 #
+#                                                       13/May/2026  20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles sumSquares.c -o                  #
 #               /tmp/sumSquares.calypsi.elf --list-file                       #
 #               sumSquares.calypsi.lst                                        #
--- a/compare/sumSquares.ll
+++ b/compare/sumSquares.ll
@@ -0,0 +1,50 @@
+; ModuleID = 'sumSquares.c'
+source_filename = "sumSquares.c"
+target datalayout = "e-m:e-p:32:16-i16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S8"
+target triple = "w65816"
+
+; Function Attrs: nofree norecurse nosync nounwind memory(none)
+define dso_local i32 @sumSquares(i16 noundef zeroext %n) local_unnamed_addr #0 {
+entry:
+  %cmp.not6 = icmp eq i16 %n, 0
+  br i1 %cmp.not6, label %for.cond.cleanup, label %for.body.preheader
+
+for.body.preheader:                               ; preds = %entry
+  %0 = add i16 %n, 1
+  %umax = tail call i16 @llvm.umax.i16(i16 %0, i16 2)
+  br label %for.body
+
+for.cond.cleanup:                                 ; preds = %for.body, %entry
+  %total.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
+  ret i32 %total.0.lcssa
+
+for.body:                                         ; preds = %for.body.preheader, %for.body
+  %i.08 = phi i16 [ %inc, %for.body ], [ 1, %for.body.preheader ]
+  %total.07 = phi i32 [ %add, %for.body ], [ 0, %for.body.preheader ]
+  %conv = zext i16 %i.08 to i32
+  %mul = mul nuw i32 %conv, %conv
+  %add = add i32 %mul, %total.07
+  %inc = add nuw i16 %i.08, 1
+  %exitcond = icmp eq i16 %inc, %umax
+  br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !7
+}
+
+; Function Attrs: nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none)
+declare i16 @llvm.umax.i16(i16, i16) #1
+
+attributes #0 = { nofree norecurse nosync nounwind memory(none) "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
+attributes #1 = { nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none) }
+
+!llvm.module.flags = !{!0, !1}
+!llvm.ident = !{!2}
+!llvm.errno.tbaa = !{!3}
+
+!0 = !{i32 1, !"wchar_size", i32 2}
+!1 = !{i32 7, !"frame-pointer", i32 2}
+!2 = !{!"clang version 23.0.0git (https://github.com/llvm-mos/llvm-mos.git c798c31416f72b395c658b5502d281a162387ab1)"}
+!3 = !{!4, !4, i64 0}
+!4 = !{!"int", !5, i64 0}
+!5 = !{!"omnipotent char", !6, i64 0}
+!6 = !{!"Simple C/C++ TBAA"}
+!7 = distinct !{!7, !8}
+!8 = !{!"llvm.loop.mustprogress"}
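Reviewer sketch, not part of the commit: a C++ transliteration of the IR above, useful as a host-side oracle when checking emulator output. `compare/sumSquares.c` itself is not in this diff, so `sumSquaresRef` is a hypothetical name and the body is reconstructed from the IR only (including the `umax(n + 1, 2)` trip limit and the 16-bit wrap of `n + 1`).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Mirrors @sumSquares block by block: early exit for n == 0, then a
// do-while loop that runs i = 1 .. limit-1 and accumulates i*i in 32 bits.
uint32_t sumSquaresRef(uint16_t n) {
  if (n == 0) return 0;                                  // entry -> for.cond.cleanup
  uint16_t limit = std::max<uint16_t>(
      static_cast<uint16_t>(n + 1), 2);                  // %umax (i16 add wraps)
  uint32_t total = 0;                                    // %total.07
  uint16_t i = 1;                                        // %i.08
  do {
    total += static_cast<uint32_t>(i) * i;               // %mul, %add
    i = static_cast<uint16_t>(i + 1);                    // %inc
  } while (i != limit);                                  // %exitcond
  return total;
}
```

For small inputs this agrees with the closed form n(n+1)(2n+1)/6; for large n the 32-bit accumulator wraps, as it does in the IR.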
--- a/compare/sumSquares.ours.s
+++ b/compare/sumSquares.ours.s
@@ -8,79 +8,62 @@ sumSquares:                             ; @sumSquares
 	tay
 	tsc
 	sec
-	sbc	#0xe
+	sbc	#0xc
 	tcs
 	tya
-	sta	0x7, s
+	sta	0x5, s
 	lda	#0x0
-	sta	0xb, s
-	lda	0x7, s
-	cmp	#0x0
-	php
-	lda	#0x0
-	plp
-	sta	0x9, s
+	sta	0x3, s
+	sta	0x1, s
+	lda	0x5, s
 	bne	.LBB0_1
 ; %bb.6:                                ; %entry
 	brl	.LBB0_5
 .LBB0_1:                                ; %for.body.preheader
-	lda	0x7, s
+	lda	0x5, s
 	inc a
-	sta	0x7, s
+	sta	0x5, s
 	cmp	#0x3
 	bcs	.LBB0_3
 ; %bb.2:                                ; %for.body.preheader
 	lda	#0x2
-	sta	0x7, s
-.LBB0_3:                                ; %for.body.preheader
-	lda	#0x0
-	sta	0x3, s
-	lda	#0x1
-	sta	0xd, s
-	lda	0x7, s
-	dec a
-	sta	0x7, s
-	lda	#0x0
 	sta	0x5, s
+.LBB0_3:                                ; %for.body.preheader
+	lda	#0x1
+	sta	0x7, s
+	lda	0x5, s
+	dec a
+	sta	0x5, s
+	lda	#0x0
 	sta	0x1, s
 .LBB0_4:                                ; %for.body
                                        ; =>This Inner Loop Header: Depth=1
-	lda	0xd, s
+	lda	0x7, s
 	pha
 	jsl	__umulhisi3
 	ply
 	clc
 	adc	0x3, s
-	sta	0xb, s
+	sta	0x3, s
 	txa
 	adc	0x1, s
-	sta	0x9, s
-	lda	0xd, s
-	inc a
-	sta	0xd, s
-	bne	.Ltmp0
-	lda	0x5, s
-	inc a
-	sta	0x5, s
-.Ltmp0:
-	lda	0xb, s
-	sta	0x3, s
-	lda	0x9, s
 	sta	0x1, s
 	lda	0x7, s
-	dec a
+	inc a
 	sta	0x7, s
-	cmp	#0x0
+	lda	0x5, s
+	dec a
+	sta	0x5, s
 	beq	.LBB0_5
 	bra	.LBB0_4
 .LBB0_5:                                ; %for.cond.cleanup
-	lda	0x9, s
+	lda	0x1, s
 	tax
-	lda	0xb, s
+	lda	0x3, s
 	tay
 	tsc
 	clc
-	adc	#0xe
+	adc	#0xc
 	tcs
 	tya
 	rtl
--- a/scripts/runInMame.sh
+++ b/scripts/runInMame.sh
@@ -93,10 +93,10 @@ $LUA_CHECKS
 end)
 EOF

-OUT=$(timeout 30 mame apple2gs \
+OUT=$(SDL_VIDEODRIVER=dummy SDL_AUDIODRIVER=dummy timeout 30 mame apple2gs \
    -rompath "$PROJECT_ROOT/tools/mame/roms" \
    -plugins -autoboot_script "$LUA_PATH" \
-    -window -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-")
+    -video none -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-")

 echo "$OUT"
 # Parse all val=... and compare to expected list.
--- a/src/llvm/lib/Target/W65816/CMakeLists.txt
+++ b/src/llvm/lib/Target/W65816/CMakeLists.txt
@@ -38,6 +38,8 @@ add_llvm_target(W65816CodeGen
  W65816I32IncFold.cpp
  W65816ImgCalleeSave.cpp
  W65816NarrowI32Mul.cpp
+  W65816PromoteFiToImg.cpp
+  W65816StackSlotMerge.cpp
  W65816TargetMachine.cpp
  W65816AsmPrinter.cpp
  W65816MCInstLower.cpp
--- a/src/llvm/lib/Target/W65816/W65816.h
+++ b/src/llvm/lib/Target/W65816/W65816.h
@@ -124,6 +124,25 @@ FunctionPass *createW65816SjLjFinalize();
 // zext that a SDAG-level combine would key off.  See W65816NarrowI32Mul.cpp.
 FunctionPass *createW65816NarrowI32Mul();

+// Post-RA, pre-PEI pass: rewrite high-traffic i16 FrameIndex accesses
+// to use IMG8..15 DP slots ($C0..$CE) instead of stack-rel spills.
+// Picks K = (number of free IMG8..15) hottest FIs and rewrites their
+// STAfi/LDAfi/ADCfi/etc. pseudos to STA_DP/LDA_DP/ADC_DP/etc. with
+// the corresponding DP address.  Slots need >= 5 static accesses; the
+// per-slot save/restore in ImgCalleeSave is ~20 cyc / 12 B, so a net
+// cycle win needs ~20 executed accesses.  See W65816PromoteFiToImg.cpp.
+FunctionPass *createW65816PromoteFiToImg();
+
+// Pre-emit pass: merge value-equivalent stack slots.  LLVM's
+// StackSlotColoring merges slots with non-overlapping liveness;
+// this pass catches the case where two slots ARE simultaneously
+// live but always hold the same value — typically the PHI src/dst
+// pair PHI-elim leaves at the back-edge of a loop body.  Renames
+// X→Y function-wide when every STA X has a "twin" STA Y of the
+// same source value, and erases the resulting LDA-X-STA-Y self-
+// copy.  See W65816StackSlotMerge.cpp.
+FunctionPass *createW65816StackSlotMerge();
+
 // Pre-RA pass that lowers Wide32 register pairs into pairs of i16
 // vregs.  Without this, greedy/basic regalloc can't fit the pair-
 // pressure of i64-via-2-i32-via-Wide32 traffic in i64-heavy
@@ -163,6 +182,8 @@ void initializeW65816SjLjFinalizePass(PassRegistry &);
 void initializeW65816LowerWide32Pass(PassRegistry &);
 void initializeW65816ImgCalleeSavePass(PassRegistry &);
 void initializeW65816NarrowI32MulPass(PassRegistry &);
+void initializeW65816PromoteFiToImgPass(PassRegistry &);
+void initializeW65816StackSlotMergePass(PassRegistry &);

 } // namespace llvm

--- a/src/llvm/lib/Target/W65816/W65816NarrowI32Mul.cpp
+++ b/src/llvm/lib/Target/W65816/W65816NarrowI32Mul.cpp
@@ -132,14 +132,155 @@ bool W65816NarrowI32Mul::runOnFunction(Function &F) {
    return false;
  }

+  // When the i32 operand is `zext i16 X to i32`, use X directly instead
+  // of emitting `trunc i32 (zext i16 X) to i16` — that trunc-of-zext is
+  // semantically the identity but keeps the zext (= a fresh i32 SSA
+  // value) live, which materializes a Wide32 vreg pair at ISel and
+  // forces a 4-byte spill slot (the canonical sumSquares `conv` pattern
+  // burned slots 0xd / 0x5 this way).  Skipping the trunc lets the
+  // post-replaceAll DCE drop the zext entirely, freeing the slot.
+  auto narrowOperand = [&](Value *V, IRBuilder<> &B) -> Value * {
+    if (auto *ZE = dyn_cast<ZExtInst>(V)) {
+      if (ZE->getSrcTy() == I16) return ZE->getOperand(0);
+    }
+    if (auto *AE = dyn_cast<SExtInst>(V)) {
+      // Sext from i16 also has the right low 16 bits.
+      if (AE->getSrcTy() == I16) return AE->getOperand(0);
+    }
+    return B.CreateTrunc(V, I16);
+  };
+
  FunctionCallee Callee = getUmulhisi3(*M);
+  SmallVector<Instruction *, 8> MaybeDead;
  for (BinaryOperator *BO : Worklist) {
    IRBuilder<> B(BO);
-    Value *A = B.CreateTrunc(BO->getOperand(0), I16);
-    Value *Bv = B.CreateTrunc(BO->getOperand(1), I16);
+    Value *AOp = BO->getOperand(0);
+    Value *BOp = BO->getOperand(1);
+    Value *A = narrowOperand(AOp, B);
+    Value *Bv = narrowOperand(BOp, B);
    Value *Call = B.CreateCall(Callee, {A, Bv});
    BO->replaceAllUsesWith(Call);
    BO->eraseFromParent();
+    // Queue maybe-dead ext operands once (`mul %c, %c` repeats one).
+    for (Value *Op : {AOp, BOp})
+      if (auto *I = dyn_cast<Instruction>(Op))
+        if (!is_contained(MaybeDead, I)) MaybeDead.push_back(I);
+  }
+  // Cleanup: any extension that's now use-less can be deleted.
+  for (Instruction *I : MaybeDead) {
+    if (I->use_empty() && (isa<ZExtInst>(I) || isa<SExtInst>(I) ||
+                            isa<TruncInst>(I))) {
+      I->eraseFromParent();
+    }
+  }
+
+  // Phase 2: narrow LSR-introduced i32 PHIs whose only uses (after
+  // the mul-rewrite above) are trunc-to-i16 + a single self-feeding
+  // `add %P, const` increment.  Without this, even though the mul
+  // operates on i16, the i32 PHI still requires 4 bytes of frame +
+  // an i32 increment chain (post-PEI).  LSR widened these from i16
+  // to i32 to support a sub-expression that we've now narrowed —
+  // the i32 representation has become dead weight.
+  //
+  // Guard with SCEV: `getUnsignedRange(%P).getActiveBits() <= 16`
+  // proves the PHI never escapes u16, so the i16 add gives the same
+  // low-16 bits as the original i32 add at every observable point
+  // (the back-edge value can wrap on the exit iteration but is
+  // never observed — exit takes the trip-end branch first).
+  bool NarrowedAny = false;
+  SmallVector<PHINode *, 4> PhiWorklist;
+  for (BasicBlock &BB : F) {
+    for (PHINode &PN : BB.phis()) {
+      if (PN.getType()->isIntegerTy(32)) PhiWorklist.push_back(&PN);
+    }
+  }
+  for (PHINode *PN : PhiWorklist) {
+    // Classify every use.
+    SmallVector<TruncInst *, 4> Truncs;
+    BinaryOperator *Incr = nullptr;
+    bool ok = true;
+    for (User *U : PN->users()) {
+      if (auto *TI = dyn_cast<TruncInst>(U)) {
+        if (!TI->getDestTy()->isIntegerTy(16)) { ok = false; break; }
+        Truncs.push_back(TI);
+        continue;
+      }
+      auto *BO = dyn_cast<BinaryOperator>(U);
+      if (!BO || BO->getOpcode() != Instruction::Add) { ok = false; break; }
+      if (!isa<ConstantInt>(BO->getOperand(1))) { ok = false; break; }
+      // BO must feed back to this PHI via at least one incoming edge.
+      bool feedsBack = false;
+      for (Value *Inc : PN->incoming_values()) {
+        if (Inc == BO) { feedsBack = true; break; }
+      }
+      if (!feedsBack) { ok = false; break; }
+      if (Incr) { ok = false; break; }
+      Incr = BO;
+    }
+    // Incr is RAUW'd with undef below, so PN must be its only user.
+    if (!ok || !Incr || !Incr->hasOneUse() || Truncs.empty()) continue;
+    // Increment const must fit i16.
+    auto *IncrCI = cast<ConstantInt>(Incr->getOperand(1));
+    if (IncrCI->getValue().getActiveBits() > 16) continue;
+    // Non-back-edge incomings must be i16-representable constants.
+    for (Value *Inc : PN->incoming_values()) {
+      if (Inc == Incr) continue;
+      auto *CIv = dyn_cast<ConstantInt>(Inc);
+      if (!CIv) { ok = false; break; }
+      if (CIv->getValue().getActiveBits() > 16) { ok = false; break; }
+    }
+    if (!ok) continue;
+    // SCEV bound check.
+    if (!SE.isSCEVable(PN->getType())) continue;
+    ConstantRange R = SE.getUnsignedRange(SE.getSCEV(PN));
+    if (R.getActiveBits() > 16) continue;
+
+    // Narrow.  Build %narrow_phi in same BB, then %narrow_incr right
+    // before Incr; patch incoming values to match.
+    IRBuilder<> B(PN);
+    PHINode *NewPN = B.CreatePHI(I16, PN->getNumIncomingValues(),
+                                  PN->getName() + ".narrow");
+    // Add placeholders for the back-edge incomings; we'll patch them
+    // after building NewIncr.
+    for (unsigned i = 0; i < PN->getNumIncomingValues(); ++i) {
+      Value *Inc = PN->getIncomingValue(i);
+      BasicBlock *Pred = PN->getIncomingBlock(i);
+      if (Inc == Incr) {
+        NewPN->addIncoming(UndefValue::get(I16), Pred);
+      } else {
+        auto *CIv = cast<ConstantInt>(Inc);
+        NewPN->addIncoming(
+            ConstantInt::get(I16, CIv->getZExtValue() & 0xFFFF),
+            Pred);
+      }
+    }
+    IRBuilder<> B2(Incr);
+    Value *NewIncr = B2.CreateAdd(
+        NewPN,
+        ConstantInt::get(I16, IncrCI->getZExtValue() & 0xFFFF),
+        Incr->getName() + ".narrow");
+    if (auto *NewIncrBO = dyn_cast<BinaryOperator>(NewIncr)) {
+      NewIncrBO->setHasNoUnsignedWrap(Incr->hasNoUnsignedWrap());
+      NewIncrBO->setHasNoSignedWrap(Incr->hasNoSignedWrap());
+    }
+    for (unsigned i = 0; i < NewPN->getNumIncomingValues(); ++i) {
+      if (isa<UndefValue>(NewPN->getIncomingValue(i))) {
+        NewPN->setIncomingValue(i, NewIncr);
+      }
+    }
+    // Replace trunc uses with the new narrow PHI, then break the
+    // PHI/Incr use-cycle before erasing.
+    for (TruncInst *TI : Truncs) {
+      TI->replaceAllUsesWith(NewPN);
+      TI->eraseFromParent();
+    }
+    // Incr is `add %PN, const`; PN's back-edge incoming references Incr.
+    // Replace Incr's uses with undef so PN's back-edge becomes a dead
+    // reference, then erase Incr, then PN.
+    Incr->replaceAllUsesWith(UndefValue::get(Incr->getType()));
+    Incr->eraseFromParent();
+    PN->eraseFromParent();
+    NarrowedAny = true;
  }
  return true;
 }
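Reviewer sketch, not part of the commit: a host-side model of the two narrowing arguments used in this file. `umulhisi3Model`, `rewrittenMul`, and `narrowCounterAgrees` are hypothetical names; the first only mirrors the documented contract of `__umulhisi3` (unsigned 16x16→32), not its assembly. The sketch assumes both multiply operands already fit in 16 bits, which is the condition the pass has to establish before rewriting.

```cpp
#include <cassert>
#include <cstdint>

// Model of the runtime helper's contract: unsigned 16x16 -> 32 multiply.
uint32_t umulhisi3Model(uint16_t a, uint16_t b) {
  return static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
}

// What the rewrite computes for `mul i32 x, y`: truncate both operands
// and call the helper.  Equal to x * y only when x and y fit in 16 bits.
uint32_t rewrittenMul(uint32_t x, uint32_t y) {
  return umulhisi3Model(static_cast<uint16_t>(x), static_cast<uint16_t>(y));
}

// Phase 2 argument: run a 32-bit and a 16-bit counter in lockstep and
// report whether the 16-bit one always equals the truncated 32-bit one.
// Truncation commutes with addition modulo 2^16, so this holds even
// after the wide counter passes 0xFFFF.
bool narrowCounterAgrees(uint16_t start, uint16_t step, unsigned iters) {
  uint32_t wide = start;
  uint16_t narrow = start;
  for (unsigned i = 0; i < iters; ++i) {
    if (static_cast<uint16_t>(wide) != narrow) return false;
    wide += step;
    narrow = static_cast<uint16_t>(narrow + step);
  }
  return true;
}
```

The counter check covers only the trunc-to-i16 observations; any other use of the wide PHI or its increment would see different values, which is why the pass restricts the allowed users before narrowing.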
--- a/src/llvm/lib/Target/W65816/W65816PromoteFiToImg.cpp
+++ b/src/llvm/lib/Target/W65816/W65816PromoteFiToImg.cpp
@@ -0,0 +1,289 @@
+//===-- W65816PromoteFiToImg.cpp - Promote FrameIndex to IMG slot --------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+//
+// Post-RA, pre-PEI pass.  Counts accesses to each i16-sized FrameIndex
+// in the function and rewrites the top-K hottest ones to use IMG8..15
+// DP slots ($C0/$C2/.../$CE) instead.  K = number of free IMG8..15
+// slots (slots not already used by regalloc decisions).
+//
+// Why post-RA: at this point regalloc has decided which vregs live in
+// physical registers vs spill slots.  The spills appear as the FI
+// pseudo-opcodes (LDAfi/STAfi/ADCfi/SBCfi/ANDfi/ORAfi/EORfi/CMPfi),
+// and the MFI tells us each FI's final size.  We see all the accesses
+// and can safely rewrite — eliminateFrameIndex hasn't yet baked the
+// offsets into SP-relative immediates.
+//
+// Why before W65816ImgCalleeSave: ImgCalleeSave scans the post-PromoteFi
+// MIR for IMG8..15 usage and emits prologue PHA-bracketed saves +
+// epilogue restores for each used slot.  Our promotion introduces
+// fresh IMG8..15 references that ImgCalleeSave will then auto-cover.
+//
+// Per-access cost change:
+//   STAfi → STA_DP : 5 cyc / 2 B → 4 cyc / 2 B (saves 1 cyc)
+//   LDAfi → LDA_DP : 5 cyc / 2 B → 4 cyc / 2 B (saves 1 cyc)
+//   ADCfi → ADC_DP : 5 cyc / 2 B → 4 cyc / 2 B (saves 1 cyc)
+// Per-slot one-time overhead (added by ImgCalleeSave):
+//   prologue save  : ~10 cyc / 6 B
+//   epilogue restore: ~10 cyc / 6 B
+// Net win needs > 20 executed accesses per call; static threshold is 5.
+//
+// Restrictions:
+//   - Only i16-sized FIs (2 bytes, offset 0).  Larger slots (i32 halves,
+//     structs) are skipped.
+//   - Skips fixed/variable-sized objects.
+//   - Skips STA8fi (byte store needs SEP/REP wrap incompatible with
+//     simple STA_DP — and DP stores 16 bits in M=0).
+//   - Skips LDAfi_indY / STAfi_indY (indirect-Y form — different
+//     addressing).
+//
+//===---------------------------------------------------------------------===//
+
+#include "W65816.h"
+#include "W65816InstrInfo.h"
+#include "W65816Subtarget.h"
+#include "llvm/ADT/BitVector.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/CodeGen/MachineFrameInfo.h"
+#include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineInstrBuilder.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/Support/Debug.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "w65816-promote-fi-to-img"
+
+
+namespace {
+
+
+class W65816PromoteFiToImg : public MachineFunctionPass {
+public:
+  static char ID;
+  W65816PromoteFiToImg() : MachineFunctionPass(ID) {}
+  StringRef getPassName() const override {
+    return "W65816 promote FrameIndex to IMG8..15 DP slot";
+  }
+  bool runOnMachineFunction(MachineFunction &MF) override;
+};
+
+
+} // namespace
+
+
+char W65816PromoteFiToImg::ID = 0;
+
+INITIALIZE_PASS(W65816PromoteFiToImg, DEBUG_TYPE,
+                "W65816 promote FI to IMG", false, false)
+
+
+FunctionPass *llvm::createW65816PromoteFiToImg() {
+  return new W65816PromoteFiToImg();
+}
+
+
+// Returns the operand index of the FrameIndex for the given FI pseudo
+// opcode, or -1 if this opcode isn't a promotable FI carrier.
+static int getFiOperandIdx(unsigned Opc) {
+  switch (Opc) {
+  case W65816::LDAfi:                                   return 1;
+  case W65816::STAfi:                                   return 1;
+  case W65816::CMPfi:                                   return 1;
+  case W65816::ADCfi:
+  case W65816::SBCfi:
+  case W65816::ANDfi:
+  case W65816::ORAfi:
+  case W65816::EORfi:                                   return 2;
+  default:                                              return -1;
+  }
+}
+
+
+// Map a promotable FI pseudo to the corresponding DP MC opcode.
+static unsigned getDpOpcode(unsigned Opc) {
+  switch (Opc) {
+  case W65816::LDAfi: return W65816::LDA_DP;
+  case W65816::STAfi: return W65816::STA_DP;
+  case W65816::CMPfi: return W65816::CMP_DP;
+  case W65816::ADCfi: return W65816::ADC_DP;
+  case W65816::SBCfi: return W65816::SBC_DP;
+  case W65816::ANDfi: return W65816::AND_DP;
+  case W65816::ORAfi: return W65816::ORA_DP;
+  case W65816::EORfi: return W65816::EOR_DP;
+  default: return 0;
+  }
+}
+
+
+// IMG8..IMG15 sit at DP addresses 0xC0, 0xC2, ..., 0xCE.  IMG0..IMG7
+// are at 0xD0..0xDE.  Returns the DP byte for IMGn.
+static uint8_t dpAddrForImg(unsigned ImgIdx) {
+  assert(ImgIdx < 16 && "IMG index out of range");
+  if (ImgIdx < 8) return 0xD0 + 2 * ImgIdx;
+  return 0xC0 + 2 * (ImgIdx - 8);
+}
+
+
+bool W65816PromoteFiToImg::runOnMachineFunction(MachineFunction &MF) {
+  // DISABLED: pass produces verifier errors ("Using an undefined physical
+  // register") on the kill-flag bookkeeping when an STAfi with `killed $a`
+  // is rewritten to STA_DP — the next i16-imm ADC/ADCE sees $a as dead.
+  // Also, for the FUNCTIONS where it would land (no-call, high-traffic
+  // slots), measured static + dynamic savings were modest and didn't
+  // justify the bookkeeping complexity.  Re-enable after:
+  //   - tightening kill-flag preservation: only carry kill if the same
+  //     operand will be the last user in the new MI (which depends on
+  //     post-rewrite scheduling — needs careful liveness re-analysis).
+  //   - paired-PHI promotion: when fi#A is a PHI-input and fi#B is the
+  //     matching PHI-output, map them to the SAME IMG slot so the
+  //     PHI move collapses to a no-op (where most of the dynamic win
+  //     would come from).
+  return false;
+  if (skipFunction(MF.getFunction())) return false;
+  const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
+  const W65816InstrInfo *TII = STI.getInstrInfo();
+  MachineFrameInfo &MFI = MF.getFrameInfo();
+
+  // 1. Walk all instructions, count FI accesses for promotable opcodes.
+  DenseMap<int, unsigned> AccessCount;
+  DenseMap<int, SmallVector<MachineInstr *, 8>> AccessSites;
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB) {
+      int FiIdx = getFiOperandIdx(MI.getOpcode());
+      if (FiIdx < 0) continue;
+      const MachineOperand &MO = MI.getOperand(FiIdx);
+      if (!MO.isFI()) continue;
+      int FI = MO.getIndex();
+      // Require: 2-byte size, fixed (not variable), offset operand == 0.
+      // The offset operand sits right after the FI operand.
+      if (MFI.isVariableSizedObjectIndex(FI)) continue;
+      if (MFI.getObjectSize(FI) != 2) continue;
+      // Fixed (negative-index) slots are arg slots — leave them alone.
+      // Promotion would break LowerFormalArguments's expected layout.
+      if (FI < 0) continue;
+      const MachineOperand &OffMO = MI.getOperand(FiIdx + 1);
+      if (!OffMO.isImm() || OffMO.getImm() != 0) continue;
+      AccessCount[FI]++;
+      AccessSites[FI].push_back(&MI);
+    }
+  }
+  if (AccessCount.empty()) return false;
+
+  // 2. Determine which IMG8..15 slots are already in use.
+  BitVector UsedImg(8, false);
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB) {
+      for (const MachineOperand &MO : MI.operands()) {
+        if (!MO.isReg() || !MO.getReg().isPhysical()) continue;
+        Register R = MO.getReg();
+        // IMG8..15 are not numerically contiguous with each other in
+        // the W65816 register enum (subreg-pair regs sit between
+        // IMG indices).  Spell them out explicitly.
+        unsigned ImgIdx = 16;  // "not an IMG8..15"
+        if      (R == W65816::IMG8)  ImgIdx = 0;
+        else if (R == W65816::IMG9)  ImgIdx = 1;
+        else if (R == W65816::IMG10) ImgIdx = 2;
+        else if (R == W65816::IMG11) ImgIdx = 3;
+        else if (R == W65816::IMG12) ImgIdx = 4;
+        else if (R == W65816::IMG13) ImgIdx = 5;
+        else if (R == W65816::IMG14) ImgIdx = 6;
+        else if (R == W65816::IMG15) ImgIdx = 7;
+        if (ImgIdx < 8) UsedImg.set(ImgIdx);
+      }
+    }
+  }
+
+  // 3. Sort FIs by access count (descending).
+  SmallVector<int, 16> Ordered;
+  for (auto &P : AccessCount) Ordered.push_back(P.first);
+  std::sort(Ordered.begin(), Ordered.end(),
+            [&](int A, int B) { return AccessCount[A] > AccessCount[B]; });
+
+  // 4. Assign IMG slots greedily.  Each IMG8..15 slot used triggers
+  //    a save/restore pair in W65816ImgCalleeSave (~20 cyc + ~12 B
+  //    per slot per CALL into this function).  For recursive or
+  //    deep-call-stack functions, that overhead dominates the per-
+  //    access savings — measured: promoting 4 slots in fib(10)
+  //    regressed it 38% (12617 → 17391 cyc).  Gate on a very high
+  //    threshold + bail entirely if the function has any calls (the
+  //    save/restore cost compounds with recursion / call frequency
+  //    in ways the static access count can't capture).
+  bool HasCalls = false;
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB) {
+      if (MI.isCall()) { HasCalls = true; break; }
+    }
+    if (HasCalls) break;
+  }
+  const unsigned kAccessThreshold = HasCalls ? 999999u : 5u;
+  DenseMap<int, unsigned> FiToImgIdx;
+  unsigned NextFreeImg = 0;
+  for (int FI : Ordered) {
+    if (AccessCount[FI] < kAccessThreshold) break;
+    while (NextFreeImg < 8 && UsedImg.test(NextFreeImg)) ++NextFreeImg;
+    if (NextFreeImg >= 8) break;
+    FiToImgIdx[FI] = NextFreeImg + 8;  // Map to IMG8..15
+    ++NextFreeImg;
+  }
+  if (FiToImgIdx.empty()) return false;
+
+  // 5. Rewrite each access.  Insert the new DP MC inst before the
+  //    pseudo, then erase the pseudo.  Preserve flags and tied-def
+  //    semantics via implicit operands.
+  bool Changed = false;
+  for (auto &P : FiToImgIdx) {
+    int FI = P.first;
+    unsigned ImgIdx = P.second;
+    uint8_t DpAddr = dpAddrForImg(ImgIdx);
+    LLVM_DEBUG(dbgs() << "Promote fi#" << FI << " -> IMG"
+                      << ImgIdx << " ($" << format("%02x", DpAddr)
+                      << "), " << AccessCount[FI] << " accesses\n");
+    for (MachineInstr *MI : AccessSites[FI]) {
+      unsigned Opc = MI->getOpcode();
+      unsigned NewOpc = getDpOpcode(Opc);
+      if (!NewOpc) continue;
+      MachineBasicBlock *MBB = MI->getParent();
+      DebugLoc DL = MI->getDebugLoc();
+      MachineInstrBuilder NewMI =
+          BuildMI(*MBB, MI, DL, TII->get(NewOpc)).addImm(DpAddr);
+      // Operand handling.  LDA/ADC/SBC/AND/ORA/EOR write $a, and
+      // STA/CMP/ADC/SBC/AND/ORA/EOR read $a; ADCfi/SBCfi also read $p.
+      // BuildMI with TII->get(NewOpc) already adds every implicit
+      // operand declared in the DP opcode's tablegen Defs/Uses, so the
+      // implicit operands on the pseudo need no copying.  What remains
+      // are the pseudo's non-FI explicit operands, which for our
+      // pseudos are physical-register operands; their defs/uses must
+      // be preserved on the new instruction.
+      for (const MachineOperand &MO : MI->operands()) {
+        if (MO.isReg() && MO.getReg().isPhysical() && MO.isImplicit()) {
+          // Skip — already added by descriptor.
+          continue;
+        }
+        if (MO.isReg() && MO.getReg().isPhysical() && !MO.isImplicit()) {
+          // Explicit physreg operand (e.g., the $a in STAfi $a, fi, 0).
+          // Convert to implicit so the DP MC inst's descriptor matches.
+          RegState Flags = MO.isDef() ? RegState::ImplicitDefine
+                                       : RegState::Implicit;
+          if (MO.isKill()) Flags = Flags | RegState::Kill;
+          NewMI.addReg(MO.getReg(), Flags);
+        }
+        // FI/offset operands are skipped — replaced by the DP imm above.
+        // VReg defs/uses should be gone post-RA; if any survived, skip.
+      }
+      MI->eraseFromParent();
+      Changed = true;
+    }
+    // The promoted FI is now unreferenced.  MachineFrameInfo has
+    // RemoveStackObject() to mark a slot dead so that PEI skips it,
+    // but we deliberately leave the slot alive: an unused 2-byte slot
+    // is cheap, and removing it would expose us to frame-layout bugs
+    // in PEI.
+  }
+  return Changed;
+}
--- a/src/llvm/lib/Target/W65816/W65816SepRepCleanup.cpp
+++ b/src/llvm/lib/Target/W65816/W65816SepRepCleanup.cpp
@ -41,6 +41,7 @@
 #include "W65816InstrInfo.h"
 #include "W65816Subtarget.h"
 #include "llvm/ADT/SmallSet.h"
+#include "llvm/Support/raw_ostream.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
@ -433,8 +434,22 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
      auto isLdaSR = [](const MachineInstr &MI) {
        return MI.getOpcode() == W65816::LDA_StackRel;
      };
+      // Accept LDA_Imm16 (MC) AND LDAi16imm (pseudo) inside the wrap —
+      // both are flag-clobbering A-loads of a 16-bit immediate, with
+      // no stack-rel offset to bump-undo and no memory operand to
+      // alias-check against the gap.  Common in init blocks: `lda #0 ;
+      // sta slot,s` wrapped around the loop pre-test.  Some functions
+      // still carry the pseudo LDAi16imm at SepRepCleanup time (post-RA
+      // pseudo expansion didn't lower it), so accept both spellings.
+      auto isImmLoad = [](const MachineInstr &MI) {
+        unsigned O = MI.getOpcode();
+        return O == W65816::LDA_Imm16 || O == W65816::LDAi16imm;
+      };
      auto isFlagPreservingMem = [&](const MachineInstr &MI) {
-        return isStaLike(MI) || isLdaSR(MI);
+        return isStaLike(MI) || isLdaSR(MI) || isImmLoad(MI);
+      };
+      auto isLdaCount = [&](const MachineInstr &MI) {
+        return isLdaSR(MI) || isImmLoad(MI);
      };
      auto It = MBB.begin();
      while (It != MBB.end()) {
@ -450,8 +465,11 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
          if (Walker->isDebugInstr()) { ++Walker; continue; }
          if (Walker->getOpcode() == W65816::PLP) break;
          if (!isFlagPreservingMem(*Walker)) { ok = false; break; }
-          // Track slots so we can check the gap below.
-          if (Walker->getNumOperands() >= 1 && Walker->getOperand(0).isImm()) {
+          // Track stack-rel slots so we can check the gap below.
+          // Immediate loads have no stack-rel addr — skip.
+          if (!isImmLoad(*Walker) &&
+              Walker->getNumOperands() >= 1 &&
+              Walker->getOperand(0).isImm()) {
            int64_t off = Walker->getOperand(0).getImm();
            if (isLdaSR(*Walker)) ReadSlots.insert(off);
            else WriteSlots.insert(off);
@ -483,11 +501,23 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
        // it earlier would lose the value.
        unsigned NLda = 0, NSta = 0;
        for (MachineInstr *MI : Block) {
-          if (isLdaSR(*MI)) ++NLda;
+          if (isLdaCount(*MI)) ++NLda;
          else if (isStaLike(*MI)) ++NSta;
        }
        NSta += Trailing.size();
        if (NLda != NSta) { ++It; continue; }
+        // Even with paired LDA-STA, the LAST LDA's $a value can still
+        // be consumed downstream — by a successor's first STA — making
+        // it a fall-through register-PHI.  If $a is live-out at MBB
+        // end (any successor has $a as live-in), bail.  Caught by
+        // sumTable, where `lda #0` (wrap) feeds A into bb.2's `sta 0x1,
+        // s`, with `sta 0x9, s` (trailing) just happening to also store
+        // the same A — the pair count balances but A is still live-out.
+        bool aLiveOut = false;
+        for (MachineBasicBlock *Succ : MBB.successors()) {
+          if (Succ->isLiveIn(W65816::A)) { aLiveOut = true; break; }
+        }
+        if (aLiveOut) { ++It; continue; }
        // Walk backward from PHP to find the hoist insertion point.
        // The hoisted block clobbers $a and $p (LDA writes both).
        // Skip insts that USE $a (consumer of an earlier $a producer)
@ -880,5 +910,362 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
      ++It2;
    }
  }
+
+  // Store forwarding (disabled — CRC32 regressed and I couldn't
+  // nail down the safety hole in time).  Even with PHP-wrap guards
+  // and SP-modifier bails, the first fire (in memmove) silently
+  // miscompiles something that CRC32 later depends on.  Pattern
+  // is sound; safety analysis isn't complete.  See
+  // feedback_close_gap_attempts_round2.md for details.
+  #if 0
+  // Store forwarding for PHI memory copies.  Pattern (sumSquares
+  // loop body):
+  //
+  //   STA X,s                  ; A → slot X (some intermediate result)
+  //   [code that modifies A but doesn't touch slot X or slot Y]
+  //   LDA X,s                  ; reload A from slot X
+  //   STA Y,s                  ; A → slot Y  (the PHI copy)
+  //
+  // Transform: insert `STA Y,s` right after the first `STA X,s` (A
+  // still holds the same value at that point), then drop the LDA-
+  // STA pair.  Net: -1 inst per pattern occurrence.
+  //
+  // Safety constraints (all between STA X and the LDA-STA pair, in
+  // the same MBB, in straight-line code):
+  //   - No instruction writes slot X (else the LDA would see a
+  //     different value than the original STA).
+  //   - No instruction reads OR writes slot Y.  A read would observe
+  //     the early STA Y instead of Y's old value; a write would
+  //     overwrite the early STA Y, leaving Y with the wrong value
+  //     once the original LDA-STA pair (which would have restored
+  //     it) is gone.
+  //   - No call / inline asm / branch (conservatively: those can
+  //     touch memory we don't model).
+  {
+    auto isStackRelMC2 = [](unsigned Op) {
+      return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
+             Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
+             Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
+             Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
+    };
+    auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) -> bool {
+      if (!isStackRelMC2(MI.getOpcode())) return false;
+      if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
+      Off = MI.getOperand(0).getImm();
+      return true;
+    };
+    auto isStaSr = [](const MachineInstr &MI) {
+      return MI.getOpcode() == W65816::STA_StackRel;
+    };
+    auto isLdaSr = [](const MachineInstr &MI) {
+      return MI.getOpcode() == W65816::LDA_StackRel;
+    };
+    SmallVector<MachineInstr *, 4> ToErase;
+    SmallVector<std::tuple<MachineInstr *, int64_t>, 4> ToInsert;
+    static int g_fireLimit = -1;
+    static int g_fireCount = 0;
+    static bool initd = false;
+    if (!initd) {
+      if (const char *e = getenv("STORE_FWD_LIMIT")) g_fireLimit = atoi(e);
+      initd = true;
+    }
+    for (MachineBasicBlock &MBB : MF) {
+      for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+        if (!isStaSr(*It)) continue;
+        int64_t X;
+        if (!srAccess2(*It, X)) continue;
+        MachineInstr *StaX = &*It;
+        // Check if StaX is INSIDE an open PHP/PLP wrap.  In that case
+        // its operand offset has been pre-bumped by +1, and inserting
+        // a sibling STA Y immediately after writes at the WRONG slot
+        // (the un-bumped Y).  Walk backward: if we find a PHP without
+        // a matching PLP first, bail.
+        {
+          bool insideWrap = false;
+          int depth = 0;
+          auto B = It;
+          while (B != MBB.begin()) {
+            --B;
+            if (B->getOpcode() == W65816::PLP) depth++;
+            else if (B->getOpcode() == W65816::PHP) {
+              if (depth > 0) depth--;
+              else { insideWrap = true; break; }
+            }
+          }
+          if (insideWrap) continue;
+        }
+        // Walk forward looking for LDA X ; STA Y.  Conservative bail
+        // on any non-tracked memory op (indirect pointer access,
+        // DP/abs ops, etc.) which could alias slot Y via memory.
+        bool ok = true;
+        int64_t Y = -1;
+        MachineInstr *LdaX = nullptr;
+        MachineInstr *StaY = nullptr;
+        for (auto Walker = std::next(It); Walker != MBB.end(); ++Walker) {
+          if (Walker->isDebugInstr()) continue;
+          if (Walker->isCall() || Walker->isInlineAsm() ||
+              Walker->isBranch() || Walker->isReturn()) {
+            ok = false; break;
+          }
+          // Found LDA X?
+          int64_t Off;
+          if (isLdaSr(*Walker) && srAccess2(*Walker, Off) && Off == X) {
+            LdaX = &*Walker;
+            auto Next = std::next(Walker);
+            while (Next != MBB.end() && Next->isDebugInstr()) ++Next;
+            if (Next == MBB.end() || !isStaSr(*Next) ||
+                !srAccess2(*Next, Y) || Y == X) {
+              ok = false;
+            } else {
+              StaY = &*Next;
+            }
+            break;
+          }
+          // Stack-rel access to X (write or read): bail.
+          if (srAccess2(*Walker, Off) && Off == X) {
+            ok = false; break;
+          }
+          // Any memory-touching op that's NOT a tracked stack-rel
+          // access — bail.  Indirect pointer stores/loads (DPIndY /
+          // DPIndLong / abs / etc.) could alias slot Y via a pointer
+          // we can't trace, and the safety check below would miss it.
+          if ((Walker->mayLoad() || Walker->mayStore()) &&
+              !isStackRelMC2(Walker->getOpcode())) {
+            ok = false; break;
+          }
+          // SP-modifying ops shift the stack-rel addressing window:
+          // a later `lda X, s` reads a DIFFERENT byte than the earlier
+          // `sta X, s` (or worse, the new stack pointer points into
+          // saved P/retaddr).  Bail on TCS (direct SP write) and on
+          // the stack push/pull ops listed below (PHx/PLx/PEA/PEI).
+          // PHP/PLP are in the list because the wrap pass already
+          // bumped in-wrap stack-rel ops by +1, so an STA inserted
+          // after STA X at the un-bumped offset hits the WRONG slot.
+          {
+            unsigned WO = Walker->getOpcode();
+            if (WO == W65816::TCS  || WO == W65816::PHA ||
+                WO == W65816::PLA  || WO == W65816::PHX ||
+                WO == W65816::PLX  || WO == W65816::PHY ||
+                WO == W65816::PLY  || WO == W65816::PHP ||
+                WO == W65816::PLP  || WO == W65816::PHB ||
+                WO == W65816::PLB  || WO == W65816::PHD ||
+                WO == W65816::PLD  || WO == W65816::PHK ||
+                WO == W65816::PEA  || WO == W65816::PEI_DP) {
+              ok = false; break;
+            }
+          }
+        }
+        if (!ok || !LdaX || !StaY) continue;
+        // Verify no access to slot Y between StaX and LdaX.
+        ok = true;
+        for (auto W2 = std::next(It); W2 != LdaX->getIterator(); ++W2) {
+          if (W2->isDebugInstr()) continue;
+          int64_t Off;
+          if (srAccess2(*W2, Off) && Off == Y) { ok = false; break; }
+        }
+        if (!ok) continue;
+        // Count and log only candidates that passed every check.
+        if (g_fireLimit >= 0 && g_fireCount >= g_fireLimit) continue;
+        g_fireCount++;
+        errs() << "SF FIRE " << g_fireCount << " in " << MF.getName()
+               << " MBB " << MBB.getNumber()
+               << " X=" << X << " Y=" << StaY->getOperand(0).getImm()
+               << "\n";
+        // Safe to apply: schedule the StaY-after-StaX insert, and
+        // erase LdaX and StaY.
+        ToInsert.push_back({StaX, Y});
+        ToErase.push_back(LdaX);
+        ToErase.push_back(StaY);
+        Changed = true;
+      }
+    }
+    // Apply (insertions first; iterators stay valid through erase).
+    for (auto &P : ToInsert) {
+      MachineInstr *StaX = std::get<0>(P);
+      int64_t Y = std::get<1>(P);
+      MachineBasicBlock *MBB = StaX->getParent();
+      DebugLoc DL = StaX->getDebugLoc();
+      auto NextIt = std::next(StaX->getIterator());
+      BuildMI(*MBB, NextIt, DL, TII.get(W65816::STA_StackRel))
+          .addImm(Y);
+    }
+    for (MachineInstr *MI : ToErase) MI->eraseFromParent();
+  }
+  #endif
+  // (Redundant CMP #0 elimination — disabled, hit VLA sum_n
+  // regression.  Carry-flag bookkeeping across the CMP turned out to
+  // have more cases than my forward-walk modeled.  See
+  // feedback_cmp_zero_elim.md.)
+  #if 0
+  {
+    auto isNZSetOnA = [](unsigned Op) {
+      switch (Op) {
+      case W65816::DEA_PSEUDO: case W65816::INA_PSEUDO:
+      case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
+      case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
+      case W65816::AND_StackRel: case W65816::AND_DP: case W65816::AND_Imm16:
+      case W65816::ORA_StackRel: case W65816::ORA_DP: case W65816::ORA_Imm16:
+      case W65816::EOR_StackRel: case W65816::EOR_DP: case W65816::EOR_Imm16:
+      case W65816::LDA_StackRel: case W65816::LDA_DP:
+      case W65816::LDAi16imm: case W65816::LDA_Imm16:
+      case W65816::TXA: case W65816::TYA:
+      case W65816::ADCi16imm: case W65816::ADCEi16imm:
+      case W65816::SBCi16imm: case W65816::SBCEi16imm:
+        return true;
+      default:
+        return false;
+      }
+    };
+    auto isCmpZero = [](const MachineInstr &MI) {
+      if (MI.getOpcode() != W65816::CMPi16imm) return false;
+      // Operand layout: lhs (Acc16), imm.  Find the imm.
+      for (const MachineOperand &MO : MI.operands()) {
+        if (MO.isImm()) return MO.getImm() == 0;
+      }
+      return false;
+    };
+    auto modifiesA = [](const MachineInstr &MI) {
+      for (const MachineOperand &MO : MI.operands()) {
+        if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
+          return true;
+      }
+      return false;
+    };
+    auto readsC = [](const MachineInstr &MI) {
+      // We don't model individual flag bits; approximate with a fixed
+      // list of carry-consuming opcodes.
+      unsigned Op = MI.getOpcode();
+      switch (Op) {
+      case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
+      case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
+      case W65816::ADCEi16imm: case W65816::SBCEi16imm:
+      case W65816::BCC: case W65816::BCS:
+      case W65816::ROL_A: case W65816::ROR_A:
+        return true;
+      default:
+        return false;
+      }
+    };
+    SmallVector<MachineInstr *, 4> CmpsToErase;
+    for (MachineBasicBlock &MBB : MF) {
+      for (MachineInstr &MI : MBB) {
+        if (!isCmpZero(MI)) continue;
+        // Walk backward, skipping flag-preserving instructions.
+        bool foundProducer = false;
+        auto Back = MI.getIterator();
+        while (Back != MBB.begin()) {
+          --Back;
+          if (Back->isDebugInstr()) continue;
+          if (Back->isCall() || Back->isInlineAsm()) break;
+          if (modifiesA(*Back)) {
+            foundProducer = isNZSetOnA(Back->getOpcode());
+            break;
+          }
+          bool defsP = false;
+          for (const MachineOperand &MO : Back->operands()) {
+            if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) {
+              defsP = true; break;
+            }
+          }
+          if (defsP) break;
+        }
+        if (!foundProducer) continue;
+        // Walk FORWARD from the CMP: if any MI reads C before the next
+        // def of $p, the CMP's carry result is live and it must stay.
+        bool cConsumed = false;
+        for (auto Fwd = std::next(MI.getIterator()); Fwd != MBB.end(); ++Fwd) {
+          if (Fwd->isDebugInstr()) continue;
+          if (readsC(*Fwd)) { cConsumed = true; break; }
+          // Next def of $p: subsequent reads aren't ours.
+          bool defsP = false;
+          for (const MachineOperand &MO : Fwd->operands()) {
+            if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) {
+              defsP = true; break;
+            }
+          }
+          if (defsP) break;
+        }
+        if (cConsumed) continue;
+        CmpsToErase.push_back(&MI);
+      }
+    }
+    for (MachineInstr *MI : CmpsToErase) MI->eraseFromParent();
+    if (!CmpsToErase.empty()) Changed = true;
+  }
+  #endif
+  // (Narrow PHI-copy slot collapse — disabled, qsort regression.)
+  #if 0
+  {
+    auto isStackRelMC2 = [](unsigned Op) {
+      return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
+             Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
+             Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
+             Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
+    };
+    auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) {
+      if (!isStackRelMC2(MI.getOpcode())) return false;
+      if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
+      Off = MI.getOperand(0).getImm();
+      return true;
+    };
+    DenseMap<int64_t, unsigned> Refs;
+    DenseMap<int64_t, MachineInstr *> StaInst, LdaInst;
+    DenseMap<int64_t, unsigned> NSta, NLda;
+    for (MachineBasicBlock &MBB : MF) {
+      for (MachineInstr &MI : MBB) {
+        int64_t Off;
+        if (!srAccess2(MI, Off)) continue;
+        Refs[Off]++;
+        if (MI.getOpcode() == W65816::STA_StackRel) {
+          NSta[Off]++; StaInst[Off] = &MI;
+        } else if (MI.getOpcode() == W65816::LDA_StackRel) {
+          NLda[Off]++; LdaInst[Off] = &MI;
+        }
+      }
+    }
+    SmallVector<MachineInstr *, 4> ToErase;
+    for (auto &P : Refs) {
+      int64_t X = P.first;
+      if (P.second != 2) continue;          // exactly 2 references
+      if (NSta[X] != 1 || NLda[X] != 1) continue;
+      MachineInstr *Sta = StaInst[X];
+      MachineInstr *Lda = LdaInst[X];
+      if (Sta->getParent() != Lda->getParent()) continue;
+      MachineBasicBlock *MBB = Sta->getParent();
+      // Sta must be before Lda.
+      bool staBefore = false;
+      for (auto It = MBB->begin(); It != MBB->end(); ++It) {
+        if (&*It == Sta) { staBefore = true; break; }
+        if (&*It == Lda) break;
+      }
+      if (!staBefore) continue;
+      // Next after Lda must be STA Y where Y != X.
+      auto NextIt = std::next(Lda->getIterator());
+      while (NextIt != MBB->end() && NextIt->isDebugInstr()) ++NextIt;
+      if (NextIt == MBB->end()) continue;
+      int64_t Y;
+      if (NextIt->getOpcode() != W65816::STA_StackRel ||
+          !srAccess2(*NextIt, Y) || Y == X) continue;
+      // Between Sta and Lda: no read/write of slot Y, no call, and no
+      // inline asm (either could change slot Y's value mid-flight).
+      bool ok = true;
+      for (auto It = std::next(Sta->getIterator()); It != Lda->getIterator();
+           ++It) {
+        if (It->isDebugInstr()) continue;
+        if (It->isCall() || It->isInlineAsm()) { ok = false; break; }
+        int64_t Off;
+        if (srAccess2(*It, Off) && Off == Y) { ok = false; break; }
+      }
+      if (!ok) continue;
+      // Redirect the original STA to write to Y; delete the LDA-STA pair.
+      Sta->getOperand(0).setImm(Y);
+      ToErase.push_back(Lda);
+      ToErase.push_back(&*NextIt);
+      Changed = true;
+    }
+    for (MachineInstr *MI : ToErase) MI->eraseFromParent();
+  }
+  #endif
+
  return Changed;
 }
--- a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
+++ b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
@ -1492,6 +1492,14 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
    }
    return false;
  };
+  // Pass 1c can only eliminate CMPi16imm $a, 0 if the preceding
+  // A-modifier reliably sets N/Z to reflect A's final value.  LDAfi
+  // under FP-rel expansion (`sty $fa ; ldy #imm ; lda [$f6],y ; ldy $fa`)
+  // ends with `ldy` that clobbers N/Z based on OLD Y, not loaded A — so
+  // in FP-rel functions (VLA / huge frame), the CMP is load-bearing.
+  // Only enable the PHP-wrap-aware path in SP-rel functions (avoids
+  // the sum_n VLA regression that variant tripped under FP-rel).
+  bool ssCleanupSPRelOnly = !UsesFPRel;
  for (MachineBasicBlock &MBB : MF) {
    SmallVector<MachineInstr *, 8> Cmps;
    for (MachineInstr &MI : MBB)
@ -1516,10 +1524,27 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
      // condition).  Caused __adddf3's renormalize while-loop to
      // skip its body even though `mr & ~mask` was non-zero.
      bool SafeToErase = true;
+      bool insidePHPWrap = false;
      for (auto It = std::next(Cmp->getIterator());
           It != Cmp->getParent()->end(); ++It) {
        if (It->isDebugInstr()) continue;
        if (It->isBranch() || It->isReturn()) break;
+        // PHP/PLP-wrap-aware: only safe when LDAfi-expansion sets N/Z
+        // reliably (SP-rel functions, not FP-rel).
+        if (ssCleanupSPRelOnly && It->getOpcode() == W65816::PHP) {
+          // PHP must be IMMEDIATELY after CMP to capture CMP's flags.
+          if (&*It != &*std::next(Cmp->getIterator())) {
+            SafeToErase = false;
+            break;
+          }
+          insidePHPWrap = true;
+          continue;
+        }
+        if (It->getOpcode() == W65816::PLP) {
+          insidePHPWrap = false;
+          continue;
+        }
+        if (insidePHPWrap) continue;
        if (It->getOpcode() == TargetOpcode::COPY) {
          SafeToErase = false;
          break;
--- a/src/llvm/lib/Target/W65816/W65816StackSlotMerge.cpp
+++ b/src/llvm/lib/Target/W65816/W65816StackSlotMerge.cpp
@ -0,0 +1,733 @@
+//===-- W65816StackSlotMerge.cpp - Merge value-equivalent stack slots ----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===---------------------------------------------------------------------===//
+//
+// Pre-emit pass that runs after PEI (eliminateFrameIndex) and merges
+// pairs of stack-rel slots that hold the same value at every observable
+// program point — typically the PHI src/dst pair PHI-elim leaves at
+// the back-edge of a loop body.
+//
+// LLVM's StackSlotColoring merges slots with non-overlapping liveness.
+// It can't merge slots that are simultaneously live but happen to hold
+// the same value (which is what a PHI memory-copy creates).  This pass
+// catches that case via a stricter "value equivalence" check.
+//
+// Canonical pattern (sumSquares loop body):
+//
+//   .LBB0_4:
+//     LDA 0x7, s ; PHA ; JSL __umulhisi3 ; PLY
+//     CLC ; ADC 0x3, s ; STA 0xb, s     ; new total.lo (write X)
+//     TXA ; ADC 0x1, s ; STA 0x9, s
+//     LDA 0x7, s ; INC A ; STA 0x7, s
+//     LDA 0xb, s ; STA 0x3, s           ; PHI copy: load X, store Y
+//     LDA 0x9, s ; STA 0x1, s
+//     ...
+//
+// The pair (0xb, 0x3) is the lo-half PHI memory copy.  Slots 0xb and
+// 0x3 always hold the same value at every read site:
+//   - Function entry: both initialized to 0 (`lda #0; sta 0xb, s` in
+//     entry, `lda #0; sta 0x3, s` in preheader).
+//   - Loop iteration: the PHI copy moves the new total.lo from 0xb to
+//     0x3 at the end of every iteration.
+//   - Exit: only 0xb is read (return value), but its value equals 0x3's.
+//
+// Rename 0xb → 0x3 function-wide; the now self-copy `lda 0x3; sta 0x3`
+// is dead and we erase it.  Saves 2 inst per PHI copy occurrence (the
+// memory copy round-trip).  sumSquares loop body shrinks from 21 to
+// 17 inst per iter.
+//
+// Safety check (sufficient condition for value equivalence):
+//   1. Both slots have ≥1 STA in the function (skips arg slots passed
+//      by the caller — those have only LDA reads, no STAs, and renaming
+//      would change where we read the arg from).
+//   2. For every STA X in the function, find a "twin" STA Y at a
+//      program point where the values match.  Matching = either:
+//      (a) Same MBB, same A-source value: either no A-define sits
+//          between the two stores, or STA Y is fed by a reload
+//          `LDA X ; STA Y` (the loop-body iter-end pattern).  Also
+//          covers entry's `lda #N ; sta X` if the same MBB has `sta Y`.
+//      (b) Different MBBs, both preceded by `LDA #const` of the same
+//          constant.  Covers entry-block STA X=0 paired with
+//          preheader STA Y=0.
+//   3. Symmetric: for every STA Y, find a twin STA X.
+//   4. No "orphan" STAs.  If a STA X or STA Y has no twin, bail.
+//
+// When all checks pass, the rename function-wide preserves semantics:
+// every read of slot X at program point P sees the same value that
+// slot Y holds at P (and vice versa).
+//
+//===---------------------------------------------------------------------===//
+
+#include "W65816.h"
+#include "W65816InstrInfo.h"
+#include "W65816Subtarget.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/CodeGen/MachineDominators.h"
+#include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineInstrBuilder.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Support/Debug.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "w65816-stack-slot-merge"
+
+
+namespace {
+
+
+class W65816StackSlotMerge : public MachineFunctionPass {
+public:
+  static char ID;
+  W65816StackSlotMerge() : MachineFunctionPass(ID) {}
+  StringRef getPassName() const override {
+    return "W65816 merge value-equivalent stack slots (PHI-copy collapse)";
+  }
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<MachineDominatorTreeWrapperPass>();
+    AU.setPreservesCFG();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+  bool runOnMachineFunction(MachineFunction &MF) override;
+};
+
+
+} // namespace
+
+
+char W65816StackSlotMerge::ID = 0;
+
+INITIALIZE_PASS_BEGIN(W65816StackSlotMerge, DEBUG_TYPE,
+                      "W65816 stack slot merge", false, false)
+INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
+INITIALIZE_PASS_END(W65816StackSlotMerge, DEBUG_TYPE,
+                    "W65816 stack slot merge", false, false)
+
+
+FunctionPass *llvm::createW65816StackSlotMerge() {
+  return new W65816StackSlotMerge();
+}
+
+
+// Stack-relative MC opcodes — the ops that survive eliminateFrameIndex
+// and reference a slot via an 8-bit SP-relative offset.
+static bool isStackRelOp(unsigned Op) {
+  return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
+         Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
+         Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
+         Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
+}
+
+
+// Returns true if MI is a stack-rel op; out-param Off receives the slot
+// offset (operand 0).
+static bool srAccess(const MachineInstr &MI, int64_t &Off) {
+  if (!isStackRelOp(MI.getOpcode())) return false;
+  if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
+  Off = MI.getOperand(0).getImm();
+  return true;
+}
+
+
+// True if the MI semantically defines A.  Covers both the explicit
+// case (operand has reg=A,isDef) AND the implicit case where the
+// tablegen InstDP / InstAbs / etc. base classes omit the A-Def
+// annotation despite LDA semantically writing A (a backend modelling
+// gap — many `LDA_DP`, `LDA_Abs`, `LDA_LongX`, etc. are missing the
+// implicit-def in the MIR even though they load into A).  Opcode-
+// based fallback catches all of them.
+static bool semanticallyDefsA(const MachineInstr &MI) {
+  for (const MachineOperand &MO : MI.operands()) {
+    if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
+      return true;
+  }
+  unsigned Op = MI.getOpcode();
+  switch (Op) {
+  case W65816::LDA_DP:    case W65816::LDA_DPX:
+  case W65816::LDA_DPInd: case W65816::LDA_DPIndY:
+  case W65816::LDA_DPIndX:
+  case W65816::LDA_Abs:   case W65816::LDA_AbsX:
+  case W65816::LDA_AbsY:  case W65816::LDA_Long:
+  case W65816::LDA_LongX:
+  case W65816::PLA:
+    return true;
+  default:
+    return false;
+  }
+}
+
+
+// Walk backward from MI in its MBB looking for the most recent A-define.
+// Returns the MI that defines A, or nullptr if a call or inline asm is
+// reached first or the MBB start is hit without finding one.  Skips
+// debug instructions.
+static MachineInstr *findPriorADef(MachineInstr *MI) {
+  MachineBasicBlock *MBB = MI->getParent();
+  auto It = MI->getIterator();
+  while (It != MBB->begin()) {
+    --It;
+    if (It->isDebugInstr()) continue;
+    if (It->isCall() || It->isInlineAsm()) return nullptr;
+    if (semanticallyDefsA(*It)) return &*It;
+  }
+  return nullptr;
+}
+
+
+// Walk forward from `Start` (exclusive) up to (but not including) `End`
+// in the same MBB, tracking whether slot `WatchSlot` is written.
+// Returns true if slot `WatchSlot` is NOT written in the interval.
+static bool slotNotWrittenBetween(MachineBasicBlock::iterator Start,
+                                  MachineBasicBlock::iterator End,
+                                  int64_t WatchSlot) {
+  for (auto It = std::next(Start); It != End; ++It) {
+    if (It->isDebugInstr()) continue;
+    int64_t Off;
+    if (It->getOpcode() == W65816::STA_StackRel && srAccess(*It, Off) &&
+        Off == WatchSlot) {
+      return false;
+    }
+  }
+  return true;
+}
+
+
+// Returns true if MI clobbers P (N/Z/C/V flags).  Mirrors LLVM's
+// operand-based check + an opcode whitelist for tablegen entries that
+// omit `Defs = [P]` (InstImplied, InstStackRel, etc.).
+static bool clobbersFlagsP(const MachineInstr &MI) {
+  for (const MachineOperand &MO : MI.operands()) {
+    if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
+      return true;
+  }
+  if (MI.isCall() || MI.isInlineAsm()) return true;
+  unsigned Op = MI.getOpcode();
+  switch (Op) {
+  case W65816::PLA: case W65816::PLY: case W65816::PLX:
+  case W65816::PLP:
+  case W65816::INA: case W65816::DEA:
+  case W65816::INX: case W65816::DEX:
+  case W65816::INY: case W65816::DEY:
+  case W65816::TAX: case W65816::TAY:
+  case W65816::TYA: case W65816::TXA:
+  case W65816::TYX: case W65816::TXY:
+  case W65816::LDA_StackRel: case W65816::LDA_DP:
+  case W65816::LDA_DPX: case W65816::LDA_DPInd:
+  case W65816::LDA_DPIndY: case W65816::LDA_DPIndX:
+  case W65816::LDA_Abs: case W65816::LDA_AbsX:
+  case W65816::LDA_AbsY: case W65816::LDA_Long:
+  case W65816::LDA_LongX:
+  case W65816::ADC_StackRel: case W65816::SBC_StackRel:
+  case W65816::CMP_StackRel: case W65816::AND_StackRel:
+  case W65816::ORA_StackRel: case W65816::EOR_StackRel:
+  case W65816::ADC_DP: case W65816::ADC_Abs:
+  case W65816::SBC_DP: case W65816::SBC_Abs:
+  case W65816::CMP_DP: case W65816::CMP_Abs:
+  case W65816::AND_DP: case W65816::AND_Abs:
+  case W65816::ORA_DP: case W65816::ORA_Abs:
+  case W65816::EOR_DP: case W65816::EOR_Abs:
+    return true;
+  default:
+    return false;
+  }
+}
+
+
+// Returns true if MI reads P flags (conditional branches, PHP, etc.).
+static bool usesFlagsP(const MachineInstr &MI) {
+  if (MI.isConditionalBranch()) return true;
+  for (const MachineOperand &MO : MI.operands()) {
+    if (MO.isReg() && MO.getReg() == W65816::P && MO.isUse() &&
+        !MO.isDef())
+      return true;
+  }
+  return false;
+}
+
+
+// Returns the MOST RECENT A-defining MI strictly before MI in its MBB,
+// skipping debug instructions.  Returns nullptr if none in the same MBB.
+static MachineInstr *findMostRecentADef(MachineInstr *MI) {
+  MachineBasicBlock *MBB = MI->getParent();
+  auto It = MI->getIterator();
+  while (It != MBB->begin()) {
+    --It;
+    if (It->isDebugInstr()) continue;
+    if (semanticallyDefsA(*It)) return &*It;
+  }
+  return nullptr;
+}
+
+
+// "Twin" check.  Given a STA X at position StaX and a candidate slot Y,
+// scan the function's STA Y instances and return one that's value-
+// equivalent under the rules described in the header comment.
+//
+// Source-value equivalence cases:
+//   (1) Same-MBB twin store: no A-define between StaX and the candidate
+//       StaY → both store the same A value.  Pure twin pattern.
+//   (2) Same-MBB PHI-copy: the candidate StaY is preceded by
+//       `LDA_StackRel slotX` (PHI-copy reload).  Even if many A-defines
+//       sit between StaX and StaY, the LDA X re-establishes A =
+//       slot[X] = value StaX wrote (assuming slot X wasn't re-written
+//       in the gap).
+//   (3) Different MBBs, both preceded by LDA_Imm16 / LDAi16imm of the
+//       same constant.  Covers entry/preheader init parallel pair.
+static MachineInstr *findTwin(MachineInstr *StaX,
+                              ArrayRef<MachineInstr *> StasY) {
+  MachineBasicBlock *MBBStaX = StaX->getParent();
+  int64_t XOff = StaX->getOperand(0).getImm();
+  // Cases (1) + (2): same MBB.
+  for (MachineInstr *StaY : StasY) {
+    if (StaY->getParent() != MBBStaX) continue;
+    // Determine ordering.
+    MachineInstr *Earlier = nullptr;
+    MachineInstr *Later = nullptr;
+    for (auto It = MBBStaX->begin(); It != MBBStaX->end(); ++It) {
+      if (&*It == StaX) { Earlier = StaX; Later = StaY; break; }
+      if (&*It == StaY) { Earlier = StaY; Later = StaX; break; }
+    }
+    if (!Earlier || !Later) continue;
+    int64_t EOff = Earlier->getOperand(0).getImm();
+    // Case (2): if Later is preceded by `LDA_StackRel <Earlier's slot>`
+    // (the PHI-copy reload), it's a PHI twin.  Also require slot
+    // Earlier-slot wasn't re-written between Earlier and Later.
+    MachineInstr *PriorOfLater = findMostRecentADef(Later);
+    if (PriorOfLater) {
+      int64_t Off;
+      if (PriorOfLater->getOpcode() == W65816::LDA_StackRel &&
+          srAccess(*PriorOfLater, Off) && Off == EOff &&
+          slotNotWrittenBetween(Earlier->getIterator(),
+                                 PriorOfLater->getIterator(), EOff)) {
+        return StaY;
+      }
+    }
+    // Case (1): no A-define between Earlier and Later — same A value.
+    {
+      bool noADefs = true;
+      for (auto It = std::next(Earlier->getIterator());
+           It != Later->getIterator(); ++It) {
+        if (It->isDebugInstr()) continue;
+        if (semanticallyDefsA(*It)) { noADefs = false; break; }
+      }
+      if (noADefs) return StaY;
+    }
+  }
+  // Case (3): different MBBs, both preceded by LDA_Imm16 / LDAi16imm
+  // with the same constant.
+  MachineInstr *PriorX = findPriorADef(StaX);
+  if (!PriorX) return nullptr;
+  unsigned PriorXOp = PriorX->getOpcode();
+  if (PriorXOp != W65816::LDA_Imm16 && PriorXOp != W65816::LDAi16imm)
+    return nullptr;
+  int64_t XConst = 0;
+  for (const MachineOperand &MO : PriorX->operands()) {
+    if (MO.isImm()) { XConst = MO.getImm(); break; }
+  }
+  for (MachineInstr *StaY : StasY) {
+    if (StaY->getParent() == MBBStaX) continue;
+    MachineInstr *PriorY = findPriorADef(StaY);
+    if (!PriorY) continue;
+    if (PriorY->getOpcode() != PriorXOp) continue;
+    int64_t YConst = 0;
+    for (const MachineOperand &MO : PriorY->operands()) {
+      if (MO.isImm()) { YConst = MO.getImm(); break; }
+    }
+    if (XConst == YConst) return StaY;
+  }
+  (void)XOff;
+  return nullptr;
+}
+
+
+// Run Phase 6a + Phase 6 (per-MBB peepholes) — independent of rename
+// logic, so they fire on every function.  Returns true if anything
+// changed.
+static bool runPerMBBPeepholes(MachineFunction &MF) {
+  bool Changed = false;
+
+  // Phase 6a: redundant `LDA Y, s` immediately after `STA Y, s`.
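+  // Illustrative sketch (slot numbers are made up; assumes LDA_StackRel
+  // lists A only as a def):
+  //   STA 3, s   ; slot 3 := A
+  //   LDA 3, s   ; A already equals slot 3, so this reload is erased
+  //   LDA 5, s   ; flags are clobbered again before any flag or A use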
+  for (MachineBasicBlock &MBB : MF) {
+    SmallVector<MachineInstr *, 4> Dead;
+    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+      if (It->isDebugInstr()) continue;
+      if (It->getOpcode() != W65816::STA_StackRel) continue;
+      int64_t StaSlot;
+      if (!srAccess(*It, StaSlot)) continue;
+      auto NextIt = std::next(It);
+      while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
+      if (NextIt == MBB.end()) continue;
+      if (NextIt->getOpcode() != W65816::LDA_StackRel) continue;
+      int64_t LdaSlot;
+      if (!srAccess(*NextIt, LdaSlot)) continue;
+      if (StaSlot != LdaSlot) continue;
+      bool flagsSafe = false;
+      bool aIsUsedBeforeClobber = false;
+      for (auto Fwd = std::next(NextIt); Fwd != MBB.end(); ++Fwd) {
+        if (Fwd->isDebugInstr()) continue;
+        // Calls/JSLs that take A as arg — even though clobbersFlagsP
+        // returns true for them, the elimination could mis-track A's
+        // live-in to the call.  Bail.
+        if (Fwd->isCall()) break;
+        // Generic: any instr that has `implicit $a` as a USE — A is
+        // live going in.  Bail to avoid live-range trouble.
+        for (const MachineOperand &MO : Fwd->operands()) {
+          if (MO.isReg() && MO.getReg() == W65816::A && MO.isUse() &&
+              !MO.isDef()) {
+            aIsUsedBeforeClobber = true;
+            break;
+          }
+        }
+        if (aIsUsedBeforeClobber) break;
+        if (usesFlagsP(*Fwd)) break;
+        if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
+          flagsSafe = true; break;
+        }
+        if (clobbersFlagsP(*Fwd)) { flagsSafe = true; break; }
+      }
+      if (!flagsSafe) continue;
+      Dead.push_back(&*NextIt);
+    }
+    for (MachineInstr *MI : Dead) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+
+  // Phase 6: per-MBB redundant `LDA #K` elimination.
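+  // Illustrative sketch (constants and slot numbers are made up):
+  //   LDA #0
+  //   STA 1, s
+  //   LDA #0     ; erased: only A/P-preserving stores since the last LDA #0
+  //   STA 3, s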
+  auto isAandPPreserving = [](const MachineInstr &MI) -> bool {
+    unsigned Op = MI.getOpcode();
+    switch (Op) {
+    case W65816::STA_StackRel:
+    case W65816::STA_DP: case W65816::STA_DPX:
+    case W65816::STA_DPInd: case W65816::STA_DPIndY:
+    case W65816::STA_DPIndX:
+    case W65816::STA_Abs: case W65816::STA_AbsX:
+    case W65816::STA_AbsY: case W65816::STA_Long:
+    case W65816::STA_LongX:
+    case W65816::STX_DP: case W65816::STX_Abs:
+    case W65816::STY_DP: case W65816::STY_Abs: case W65816::STY_DPX:
+    case W65816::STZ_DP: case W65816::STZ_Abs:
+    case W65816::STZ_DPX: case W65816::STZ_AbsX:
+      return true;
+    default:
+      break;
+    }
+    for (const MachineOperand &MO : MI.operands()) {
+      if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
+        return false;
+    }
+    if (MI.mayStore() && !MI.mayLoad() && !semanticallyDefsA(MI))
+      return true;
+    return false;
+  };
+  auto isLdaImmK = [](const MachineInstr &MI, int64_t &K) -> bool {
+    unsigned Op = MI.getOpcode();
+    if (Op != W65816::LDA_Imm16 && Op != W65816::LDAi16imm) return false;
+    for (const MachineOperand &MO : MI.operands()) {
+      if (MO.isImm()) { K = MO.getImm(); return true; }
+    }
+    return false;
+  };
+  for (MachineBasicBlock &MBB : MF) {
+    std::optional<int64_t> KnownK;
+    SmallVector<MachineInstr *, 4> Dead;
+    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+      if (It->isDebugInstr()) continue;
+      int64_t K;
+      if (isLdaImmK(*It, K)) {
+        if (KnownK && *KnownK == K) {
+          Dead.push_back(&*It);
+          continue;
+        }
+        KnownK = K;
+        continue;
+      }
+      if (isAandPPreserving(*It)) continue;
+      KnownK.reset();
+    }
+    for (MachineInstr *MI : Dead) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+
+  return Changed;
+}
+
+
+bool W65816StackSlotMerge::runOnMachineFunction(MachineFunction &MF) {
+  if (skipFunction(MF.getFunction())) return false;
+  if (MF.getFunction().hasOptNone()) return false;
+
+  // Run per-MBB peepholes first — independent of rename logic.
+  bool peepChanged = runPerMBBPeepholes(MF);
+
+  // Phase 1: index all stack-rel STA/LDA grouped by slot offset.
+  DenseMap<int64_t, SmallVector<MachineInstr *, 4>> Stas;
+  DenseMap<int64_t, SmallVector<MachineInstr *, 4>> Ldas;
+  DenseMap<int64_t, unsigned> AllRefs;  // STA + LDA + ADC + ... count
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB) {
+      int64_t Off;
+      if (!srAccess(MI, Off)) continue;
+      AllRefs[Off]++;
+      if (MI.getOpcode() == W65816::STA_StackRel) {
+        Stas[Off].push_back(&MI);
+      } else if (MI.getOpcode() == W65816::LDA_StackRel) {
+        Ldas[Off].push_back(&MI);
+      }
+    }
+  }
+
+  // Phase 2: find PHI-copy site candidates.  Pattern: LDA X ; STA Y
+  // in a LOOP BODY MBB (= the MBB has itself as a predecessor, i.e.
+  // a self-loop back-edge).  Restricting to loop bodies distinguishes
+  // genuine PHI-cycle copies from one-shot temp transfers (where
+  // slot X is just a scratch register dropped on the way to slot Y
+  // for an unrelated purpose, like qsortIter's pointer-construction
+  // pattern `STA 5; ...; LDA 5; STA 39` followed by `LDA 39; STA dp`).
+  DenseMap<int64_t, int64_t> PhiCopyPair;  // X -> Y
+  for (MachineBasicBlock &MBB : MF) {
+    // Self-loop check: MBB must have itself as a predecessor.
+    bool selfLoop = false;
+    for (MachineBasicBlock *Pred : MBB.predecessors()) {
+      if (Pred == &MBB) { selfLoop = true; break; }
+    }
+    if (!selfLoop) continue;
+    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+      if (It->getOpcode() != W65816::LDA_StackRel) continue;
+      int64_t X;
+      if (!srAccess(*It, X)) continue;
+      auto NextIt = std::next(It);
+      while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
+      if (NextIt == MBB.end()) continue;
+      if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
+      int64_t Y;
+      if (!srAccess(*NextIt, Y) || Y == X) continue;
+      if (PhiCopyPair.count(X)) continue;
+      PhiCopyPair[X] = Y;
+    }
+  }
+
+  // Phase 3: validate each pair and apply rename if safe.
+  // Track which slots have already been merged so we don't double-merge.
+  DenseMap<int64_t, int64_t> Renames;  // X -> Y
+  for (auto &P : PhiCopyPair) {
+    int64_t X = P.first, Y = P.second;
+    // Don't re-merge an already-processed slot.
+    if (Renames.count(X) || Renames.count(Y)) continue;
+    // Arg-slot guard: skip slots with no STAs (caller-passed args).
+    if (Stas[X].empty() || Stas[Y].empty()) continue;
+
+    // Validate that every STA X has a twin STA Y.
+    bool allPaired = true;
+    for (MachineInstr *StaX : Stas[X]) {
+      if (!findTwin(StaX, Stas[Y])) { allPaired = false; break; }
+    }
+    if (!allPaired) continue;
+
+    // Symmetric: every STA Y must have a twin STA X.
+    for (MachineInstr *StaY : Stas[Y]) {
+      if (!findTwin(StaY, Stas[X])) { allPaired = false; break; }
+    }
+    if (!allPaired) continue;
+
+    LLVM_DEBUG(dbgs() << "StackSlotMerge: rename slot " << X
+                      << " -> " << Y << " in " << MF.getName() << "\n");
+    Renames[X] = Y;
+  }
+  if (Renames.empty()) return peepChanged;
+
+  // Phase 4: apply rename.
+  bool Changed = false;
+  for (MachineBasicBlock &MBB : MF) {
+    SmallVector<MachineInstr *, 4> ToErase;
+    for (MachineInstr &MI : MBB) {
+      int64_t Off;
+      if (!srAccess(MI, Off)) continue;
+      auto It = Renames.find(Off);
+      if (It == Renames.end()) continue;
+      MI.getOperand(0).setImm(It->second);
+      Changed = true;
+    }
+    // After rename, look for now-redundant LDA-STA pairs to the same
+    // slot (the PHI-copy self-copy).  Erase them.
+    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+      if (It->getOpcode() != W65816::LDA_StackRel) continue;
+      int64_t LdaOff;
+      if (!srAccess(*It, LdaOff)) continue;
+      auto NextIt = std::next(It);
+      while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
+      if (NextIt == MBB.end()) continue;
+      if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
+      int64_t StaOff;
+      if (!srAccess(*NextIt, StaOff)) continue;
+      if (LdaOff != StaOff) continue;
+      ToErase.push_back(&*It);
+      ToErase.push_back(&*NextIt);
+    }
+    for (MachineInstr *MI : ToErase) MI->eraseFromParent();
+    if (!ToErase.empty()) Changed = true;
+  }
+
+  // Phase 5: redundant constant-init elimination.  After rename, the
+  // Case (3) twin pairings leave us with TWO sites writing the same
+  // constant to the same slot (one renamed from X to Y, the other was
+  // already targeting Y).  The dominated one is redundant — its slot
+  // already holds the constant from the dominating write.
+  //
+  // Generalize: scan post-rename for ALL `LDA_Imm16 K ; STA_StackRel Y`
+  // pairs (or LDAi16imm K; STA Y).  For each pair, look for another
+  // such pair with the same (K, Y) where one DOMINATES the other AND
+  // no slot-Y write (or call) exists on any path between them.  Erase
+  // the dominated STA + its preceding LDA (if A isn't otherwise consumed).
+  {
+    auto isLdaImm = [](const MachineInstr &MI) {
+      unsigned Op = MI.getOpcode();
+      return Op == W65816::LDA_Imm16 || Op == W65816::LDAi16imm;
+    };
+    auto immValue = [](const MachineInstr &MI) -> int64_t {
+      for (const MachineOperand &MO : MI.operands()) {
+        if (MO.isImm()) return MO.getImm();
+      }
+      return 0;
+    };
+    // Collect `LDA #K ; STA_StackRel Y` pairs, grouped by Y.
+    DenseMap<int64_t, SmallVector<std::pair<MachineInstr *, int64_t>, 4>>
+        ConstStas;
+    for (MachineBasicBlock &MBB : MF) {
+      for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+        if (!isLdaImm(*It)) continue;
+        int64_t K = immValue(*It);
+        auto NextIt = std::next(It);
+        while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
+        if (NextIt == MBB.end()) continue;
+        if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
+        int64_t Y;
+        if (!srAccess(*NextIt, Y)) continue;
+        ConstStas[Y].push_back({&*NextIt, K});
+      }
+    }
+    // For each slot Y with at least two const-init STAs, check for
+    // dominator redundancy.
+    auto &MDT = getAnalysis<MachineDominatorTreeWrapperPass>().getDomTree();
+    // Check that no instruction WRITES slot Y on any path between
+    // From and To.  Reads are fine because both From and To write
+    // the same constant K — any intermediate read would see K either
+    // way (since From dominates, From has already executed).  Calls
+    // are bailout conditions: a call might write to the stack via
+    // address-taken locals or other side effects we don't model.
+    auto noSlotWriteOnPath = [&](MachineInstr *From, MachineInstr *To,
+                                  int64_t Y) -> bool {
+      MachineBasicBlock *FromMBB = From->getParent();
+      MachineBasicBlock *ToMBB = To->getParent();
+      auto opWritesY = [&](MachineInstr &MI) {
+        if (MI.isCall() || MI.isInlineAsm()) return true;
+        int64_t Off;
+        if (MI.getOpcode() == W65816::STA_StackRel &&
+            srAccess(MI, Off) && Off == Y) {
+          return true;
+        }
+        return false;
+      };
+      // (a) After From in its MBB.
+      for (auto It = std::next(From->getIterator()); It != FromMBB->end();
+           ++It) {
+        if (It->isDebugInstr()) continue;
+        if (opWritesY(*It)) return false;
+      }
+      // (b) DFS forward from FromMBB's successors, stopping at ToMBB.
+      SmallPtrSet<MachineBasicBlock *, 8> Visited;
+      SmallVector<MachineBasicBlock *, 8> Stack;
+      for (auto *Succ : FromMBB->successors()) Stack.push_back(Succ);
+      while (!Stack.empty()) {
+        auto *MBB = Stack.pop_back_val();
+        if (MBB == ToMBB) continue;  // checked separately in (c)
+        if (!Visited.insert(MBB).second) continue;
+        for (auto &MI : *MBB) {
+          if (MI.isDebugInstr()) continue;
+          if (opWritesY(MI)) return false;
+        }
+        for (auto *Succ : MBB->successors()) Stack.push_back(Succ);
+      }
+      // (c) In ToMBB, before To, any write of Y?
+      for (auto It = ToMBB->begin(); It != To->getIterator(); ++It) {
+        if (It->isDebugInstr()) continue;
+        if (opWritesY(*It)) return false;
+      }
+      return true;
+    };
+    SmallVector<MachineInstr *, 8> ToErase;
+    LLVM_DEBUG({
+      dbgs() << "Phase 5 in " << MF.getName() << ":\n";
+      for (auto &P : ConstStas) {
+        dbgs() << "  slot " << P.first << " has " << P.second.size()
+               << " const STAs\n";
+      }
+    });
+    for (auto &P : ConstStas) {
+      int64_t Y = P.first;
+      auto &stas = P.second;
+      if (stas.size() < 2) continue;
+      // For each pair (i, j) where i dominates j with same constant K:
+      for (auto &Sj : stas) {
+        MachineInstr *DominatedSta = Sj.first;
+        int64_t Kj = Sj.second;
+        for (auto &Si : stas) {
+          if (&Si == &Sj) continue;
+          if (Si.second != Kj) continue;  // different K
+          MachineInstr *DominatorSta = Si.first;
+          if (!MDT.dominates(DominatorSta, DominatedSta)) continue;
+          if (!noSlotWriteOnPath(DominatorSta, DominatedSta, Y)) continue;
+          // Flag safety: erasing `LDA #K; STA Y` removes a flag-setting
+          // op (the LDA).  Walk forward from the STA looking for next
+          // flag-clobber or unconditional terminator (safe) vs.
+          // flag-use (unsafe).
+          MachineBasicBlock *MBB = DominatedSta->getParent();
+          bool flagsSafeP5 = false;
+          for (auto Fwd = std::next(DominatedSta->getIterator());
+               Fwd != MBB->end(); ++Fwd) {
+            if (Fwd->isDebugInstr()) continue;
+            if (usesFlagsP(*Fwd)) break;
+            if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
+              flagsSafeP5 = true; break;
+            }
+            if (clobbersFlagsP(*Fwd)) { flagsSafeP5 = true; break; }
+          }
+          if (!flagsSafeP5) continue;
+          // Erase DominatedSta and its preceding LDA #K.
+          auto Prev = DominatedSta->getIterator();
+          while (Prev != MBB->begin()) {
+            --Prev;
+            if (!Prev->isDebugInstr()) break;
+          }
+          if (Prev != DominatedSta->getIterator() && isLdaImm(*Prev) &&
+              immValue(*Prev) == Kj) {
+            // The LDA and STA are adjacent (modulo debug instrs), so
+            // nothing consumes A between them.  Erase both.
+            ToErase.push_back(&*Prev);
+          }
+          ToErase.push_back(DominatedSta);
+          break;
+        }
+      }
+    }
+    // De-dup ToErase before erasing.
+    SmallPtrSet<MachineInstr *, 8> ErasedSet;
+    for (MachineInstr *MI : ToErase) {
+      if (ErasedSet.insert(MI).second) {
+        MI->eraseFromParent();
+        Changed = true;
+      }
+    }
+  }
+
+  return Changed || peepChanged;
+}
--- a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
+++ b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
@ -56,6 +56,8 @@ LLVMInitializeW65816Target() {
  initializeW65816I32IncFoldPass(PR);
  initializeW65816ImgCalleeSavePass(PR);
  initializeW65816NarrowI32MulPass(PR);
+  initializeW65816PromoteFiToImgPass(PR);
+  initializeW65816StackSlotMergePass(PR);

  // Default IndVarSimplify's exit-value rewriter to "never".  The
  // closed-form replacement frequently widens an i16 induction var
@ -195,14 +197,19 @@ void W65816PassConfig::addPreRegAlloc() {
 }

 void W65816PassConfig::addPostRegAlloc() {
-  // ImgCalleeSave runs FIRST so its STAfi/LDAfi pseudos go through the
-  // rest of the post-RA pipeline (SpillToX, StackSlotCleanup) normally.
-  // It detects IMG8..IMG15 usage post-regalloc and inserts prologue
-  // save + epilogue restore so those slots act as callee-saved at the
-  // asm level.  Fixes picol's `expr 1+2 == 4` bug: high-pressure
-  // recursive double fns use IMG8..IMG15 as scratch but, without this
-  // pass, expected them preserved across calls — and callees were
-  // happy to clobber them.  See W65816ImgCalleeSave.cpp.
+  // FI→IMG promotion runs FIRST.  It scans for high-traffic i16
+  // FrameIndex slots (LDAfi/STAfi/ADCfi/etc.) and rewrites them to
+  // STA_DP/LDA_DP/ADC_DP/... pointed at free IMG8..IMG15 DP slots.
+  // The introduced IMG8..15 references are then picked up by
+  // ImgCalleeSave to get prologue save + epilogue restore.  See
+  // W65816PromoteFiToImg.cpp.
+  addPass(createW65816PromoteFiToImg());
+  // ImgCalleeSave detects IMG8..IMG15 usage post-regalloc and inserts
+  // prologue save + epilogue restore so those slots act as callee-
+  // saved at the asm level.  Fixes picol's `expr 1+2 == 4` bug:
+  // high-pressure recursive double fns use IMG8..IMG15 as scratch but,
+  // without this pass, expected them preserved across calls — and
+  // callees were happy to clobber them.  See W65816ImgCalleeSave.cpp.
  addPass(createW65816ImgCalleeSave());
  // SpillToX converts STA/LDA pairs to TAX/TXA bridges; StackSlotCleanup
  // then deletes still-adjacent redundant spills.  A second SpillToX
@ -264,6 +271,14 @@ void W65816PassConfig::addPreEmitPass() {
  addPass(createW65816I32IncFold());
  addPass(createW65816BranchExpand());
  addPass(createW65816SepRepCleanup());
+  // Merge value-equivalent stack slots last.  Runs AFTER SepRepCleanup's
+  // PHI-copy hoist so the LDA-X ; STA-Y pair has been pulled out of
+  // any PHP/PLP wrap — that way the stack-rel offsets on both ops are
+  // the unbumped values and offset-based slot matching is stable.
+  // Saves 2 inst per PHI-copy occurrence (the memory copy round-trip
+  // collapses when X and Y are renamed to the same slot).  See
+  // W65816StackSlotMerge.cpp.
+  addPass(createW65816StackSlotMerge());
 }

 MachineFunctionInfo *W65816TargetMachine::createMachineFunctionInfo(
--- a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
+++ b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
@ -64,13 +64,43 @@ FunctionPass *llvm::createW65816WidenAcc16() {
  return new W65816WidenAcc16();
 }

-// Returns true if the vreg has any physreg-COPY use (e.g., return-value
-// or arg-passing setup that pins the value to a specific physreg).
-static bool flowsToPhysReg(Register VReg, const MachineRegisterInfo &MRI) {
+// Returns true if the vreg has any physreg-COPY use that would conflict
+// with Wide16 class assignment.  $a is a member of Wide16 (Wide16 = A +
+// IMG0..15), so a COPY to $a is fine — the vreg can be Wide16 and
+// regalloc will pick $a to coalesce.  $x / $y are in Idx16, NOT in
+// Wide16, so a COPY to those forces the vreg to NOT be in Wide16
+// (verifier would reject).
+static bool flowsToIncompatiblePhysReg(Register VReg,
+                                       const MachineRegisterInfo &MRI) {
  for (auto &U : MRI.use_nodbg_instructions(VReg)) {
    if (!U.isCopy()) continue;
    const MachineOperand &Dst = U.getOperand(0);
-    if (Dst.isReg() && Dst.getReg().isPhysical()) return true;
+    if (!Dst.isReg() || !Dst.getReg().isPhysical()) continue;
+    Register P = Dst.getReg();
+    if (P == W65816::A) continue;
+    if (P >= W65816::IMG0 && P <= W65816::IMG15) continue;
+    return true;
+  }
+  return false;
+}
+
+// Returns true if VReg's def is a COPY from a physreg whose class is not
+// Wide16-compatible.  copyPhysReg only handles a fixed set of source/dest
+// pairs; an incompatible source physreg (e.g., DPF0, the i64-return
+// high-half carrier) lowered to an IMG dest would crash with an
+// "unhandled copyPhysReg" assertion at AsmPrinter time.  (Currently
+// only the Phase-2 PHI widening uses this; that's disabled, so mark
+// unused.)
+[[maybe_unused]] static bool comesFromIncompatiblePhysReg(Register VReg,
+                                         const MachineRegisterInfo &MRI) {
+  for (auto &D : MRI.def_instructions(VReg)) {
+    if (!D.isCopy()) continue;
+    const MachineOperand &Src = D.getOperand(1);
+    if (!Src.isReg() || !Src.getReg().isPhysical()) continue;
+    Register P = Src.getReg();
+    if (P == W65816::A) continue;
+    if (P >= W65816::IMG0 && P <= W65816::IMG15) continue;
+    return true;
  }
  return false;
 }
@ -145,7 +175,7 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
    Register VReg = Register::index2VirtReg(i);
    if (MRI.def_empty(VReg)) continue;
    if (MRI.getRegClass(VReg) != &W65816::Acc16RegClass) continue;
-    if (flowsToPhysReg(VReg, MRI)) continue;
+    if (flowsToIncompatiblePhysReg(VReg, MRI)) continue;
    if (usedByPhi(VReg, MRI)) continue;
    if (!MRI.hasOneDef(VReg)) continue;  // require single SSA def
    if (!allUsesAcceptWide(VReg, MRI, *TRI, *TII)) continue;
@ -181,5 +211,212 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
    }
    Changed = true;
  }
+
+  // Phase 2: PHI cycle widening.  EXPERIMENTAL, currently disabled —
+  // see end of pass for explanation.
+  #if 0
+  // PHIs whose def class is Acc16 keep
+  // the value pinned to $a across iterations, forcing stack spills
+  // when the PHI is live across calls or other A-clobbering ops.
+  // For sumSquares-style loops with an i32 accumulator, this manifests
+  // as per-iter `LDA slot ; ADC ; STA slot ; LDA slot ; STA slot` (the
+  // last LDA/STA pair is the PHI-back-edge copy).  If we widen the
+  // PHI's def to Wide16, regalloc can keep it in an IMG slot and the
+  // back-edge PHI copy collapses to a register coalesce.
+  //
+  // To widen a PHI:
+  //   1. Compute the SCC of Acc16 vregs connected by PHI edges (PHI
+  //      def ↔ PHI incoming vreg).  This catches mutually-recursive
+  //      PHIs in nested loops.
+  //   2. For every member: verify all non-PHI uses accept Wide16, no
+  //      flow to a physreg, single def.
+  //   3. For each PHI in the SCC, walk its incoming list.  Each
+  //      incoming vreg is either ALREADY in the SCC (another PHI, no
+  //      bridge needed) or an external Acc16 vreg whose value flows
+  //      into the SCC — bridge it by inserting `WWide = COPY W` at
+  //      the end of the predecessor block and pointing the PHI's
+  //      incoming at WWide.
+  //   4. Change every SCC member's register class to the widened
+  //      class (Img16 in the code below, since Wide16 gets coalesced
+  //      back to Acc16).
+  auto worklistInsertIfAcc16 = [&MRI](Register V,
+                                      DenseSet<Register> &Seen,
+                                      SmallVectorImpl<Register> &WL) {
+    if (!V.isVirtual()) return;
+    if (MRI.getRegClass(V) != &W65816::Acc16RegClass) return;
+    if (!Seen.insert(V).second) return;
+    WL.push_back(V);
+  };
+
+  SmallVector<MachineInstr *, 16> AcctPhis;
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB.phis()) {
+      Register DefV = MI.getOperand(0).getReg();
+      if (MRI.getRegClass(DefV) == &W65816::Acc16RegClass) {
+        AcctPhis.push_back(&MI);
+      }
+    }
+  }
+  DenseSet<Register> ProcessedPhiVregs;
+  for (MachineInstr *Seed : AcctPhis) {
+    Register SeedDef = Seed->getOperand(0).getReg();
+    if (ProcessedPhiVregs.count(SeedDef)) continue;
+    // Build SCC by following PHI edges in both directions.
+    DenseSet<Register> Comp;
+    SmallVector<Register, 8> Stack;
+    worklistInsertIfAcc16(SeedDef, Comp, Stack);
+    while (!Stack.empty()) {
+      Register V = Stack.pop_back_val();
+      // Forward: V flows into other PHIs as an incoming → include those PHI defs.
+      for (auto &U : MRI.use_nodbg_instructions(V)) {
+        if (!U.isPHI()) continue;
+        Register PhiDef = U.getOperand(0).getReg();
+        worklistInsertIfAcc16(PhiDef, Comp, Stack);
+      }
+      // Backward: if V is itself a PHI def, include the incoming vregs.
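+      // Guard against a vreg with no defs before dereferencing.
+      auto Defs = MRI.def_instructions(V);
+      MachineInstr *DM = Defs.empty() ? nullptr : &*Defs.begin();
+      if (!DM || !DM->isPHI()) continue;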
+      for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
+        MachineOperand &MO = DM->getOperand(i);
+        if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
+        worklistInsertIfAcc16(MO.getReg(), Comp, Stack);
+      }
+    }
+    for (Register V : Comp) ProcessedPhiVregs.insert(V);
+
+    // Validate every member.  PHI uses are ACCEPTED when the consumer
+    // PHI is itself in the SCC (those PHIs are being widened in
+    // lock-step).  Narrow-class uses (e.g., INA_PSEUDO's tied-def
+    // input requires Acc16) are ALSO accepted — we'll insert a
+    // Wide16→Acc16 COPY at the use site after widening.  The only
+    // unrecoverable cases are: PHI uses where the consumer PHI is
+    // outside the SCC (forcing cross-SCC class merging), and physreg
+    // flow to $x/$y/etc. (handled separately above).
+    auto usesAcceptInSCC = [&](Register V,
+                               SmallVectorImpl<MachineOperand *> *NarrowSites)
+        -> bool {
+      for (auto &MO : MRI.use_nodbg_operands(V)) {
+        MachineInstr *UMI = MO.getParent();
+        if (UMI->isCopy()) continue;
+        if (UMI->isPHI()) {
+          Register PhiDef = UMI->getOperand(0).getReg();
+          if (Comp.count(PhiDef)) continue;  // co-widened
+          return false;
+        }
+        unsigned OpIdx = UMI->getOperandNo(&MO);
+        const TargetRegisterClass *Expected =
+            TII->getRegClass(UMI->getDesc(), OpIdx);
+        if (!Expected) continue;
+        if (Expected == &W65816::Wide16RegClass) continue;
+        if (Expected->hasSubClassEq(&W65816::Wide16RegClass)) continue;
+        // Expected is narrower than Wide16 (e.g., Acc16-only tied
+        // input).  Mark the operand for narrowing; a COPY is inserted
+        // at apply time.
+        if (NarrowSites) NarrowSites->push_back(&MO);
+      }
+      return true;
+    };
+    bool ok = true;
+    SmallVector<MachineOperand *, 8> NarrowSites;
+    for (Register V : Comp) {
+      if (!MRI.hasOneDef(V)) { ok = false; break; }
+      if (flowsToIncompatiblePhysReg(V, MRI)) { ok = false; break; }
+      if (comesFromIncompatiblePhysReg(V, MRI)) { ok = false; break; }
+      if (!usesAcceptInSCC(V, &NarrowSites)) { ok = false; break; }
+    }
+    if (!ok) continue;
+
+    // Apply widening.  First insert bridge COPYs at predecessor edges
+    // for external (non-Comp) Acc16 incomings to each PHI in Comp.
+    SmallVector<std::pair<MachineInstr *, unsigned>, 16> BridgeSites;
+    for (Register V : Comp) {
+      MachineInstr *DM = &*MRI.def_instructions(V).begin();
+      if (!DM->isPHI()) continue;
+      for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
+        MachineOperand &MO = DM->getOperand(i);
+        if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
+        Register Inc = MO.getReg();
+        if (Comp.count(Inc)) continue;  // in-SCC, no bridge needed
+        // External incoming: ensure it's currently Acc16 or Wide16; if
+        // so, record the site so a COPY can be inserted at the end of
+        // the predecessor block.
+        if (MRI.getRegClass(Inc) != &W65816::Acc16RegClass &&
+            MRI.getRegClass(Inc) != &W65816::Wide16RegClass) {
+          ok = false;
+          break;
+        }
+        BridgeSites.push_back({DM, i});
+      }
+      if (!ok) break;
+    }
+    if (!ok) continue;
+
+    // Insert bridges.
+    for (auto &Site : BridgeSites) {
+      MachineInstr *PhiMI = Site.first;
+      unsigned OpIdx = Site.second;
+      Register Inc = PhiMI->getOperand(OpIdx).getReg();
+      MachineBasicBlock *PredMBB = PhiMI->getOperand(OpIdx + 1).getMBB();
+      // If Inc is already Wide16 (e.g., an earlier candidate widened
+      // it), the PHI incoming can use Inc directly, so no bridge is
+      // needed.
+      if (MRI.getRegClass(Inc) == &W65816::Wide16RegClass) {
+        continue;
+      }
+      // Insert COPY before the predecessor's terminator(s).
+      auto InsertPos = PredMBB->getFirstTerminator();
+      DebugLoc DL = (InsertPos == PredMBB->end())
+                        ? PredMBB->findBranchDebugLoc()
+                        : InsertPos->getDebugLoc();
+      Register WideInc = MRI.createVirtualRegister(&W65816::Wide16RegClass);
+      BuildMI(*PredMBB, InsertPos, DL, TII->get(TargetOpcode::COPY),
+              WideInc)
+          .addReg(Inc);
+      PhiMI->getOperand(OpIdx).setReg(WideInc);
+      PhiMI->getOperand(OpIdx).setIsKill(false);
+    }
+
+    // Force every SCC member to Img16 (IMG-only, no A).  Using Wide16
+    // (A + IMG) doesn't work here: the Register Coalescer joins our
+    // Wide16 vregs with adjacent Acc16 vregs (intersection = Acc16)
+    // and narrows them back to A-only, defeating the widening.  Img16
+    // intersects Acc16 to ∅, so the coalescer can't merge — the PHI
+    // stays in IMG.  This is correct anyway for the common case (PHI
+    // live across a call): A is JSL-clobbered, so it can't carry the
+    // value through, and IMG8..15 is the right home.
+    for (Register V : Comp) {
+      MRI.setRegClass(V, &W65816::Img16RegClass);
+    }
+    // Insert narrowing COPYs at each narrow-class use site.  Each site
+    // is `... = OP V, ...` where the operand requires Acc16 but V is
+    // now Img16.  Replace with `%Vacc = COPY V (Acc16); ... = OP %Vacc, ...`.
+    for (MachineOperand *MO : NarrowSites) {
+      MachineInstr *UMI = MO->getParent();
+      Register OldReg = MO->getReg();
+      Register NarrowReg =
+          MRI.createVirtualRegister(&W65816::Acc16RegClass);
+      DebugLoc DL = UMI->getDebugLoc();
+      BuildMI(*UMI->getParent(), UMI, DL, TII->get(TargetOpcode::COPY),
+              NarrowReg)
+          .addReg(OldReg);
+      MO->setReg(NarrowReg);
+      MO->setIsKill(false);
+    }
+    Changed = true;
+  }
+  #endif
+  // Why disabled (2026-05-13 attempt):
+  // - Widening PHI cycles to Wide16 (= A + IMG0..15) is undone by the
+  //   Register Coalescer: it joins our Wide16 vregs with adjacent
+  //   Acc16 vregs via the bridge COPYs we insert, and the resulting
+  //   joint class is `intersect(Wide16, Acc16) = Acc16`.  Net effect:
+  //   no value ends up in IMG; the bridge COPYs only add coalescer work.
+  // - Switching to Img16 (= IMG0..15, no A) defeats the coalescer
+  //   (intersection with Acc16 is ∅) but forces ALL widened PHIs into
+  //   IMG slots even when A would be better, AND triggers cascading
+  //   copyPhysReg paths that aren't all implemented (e.g., DPF0 → IMG
+  //   for i64 libcall return values), aborting clang on runtime builds.
+  // - A targeted fix needs either (a) a class that the coalescer
+  //   refuses to join with Acc16 yet that still allows A as a member,
+  //   (b) a post-coalescer pass that re-widens specific high-traffic
+  //   vregs back to Img16, or (c) regalloc cost-model tuning so it
+  //   prefers IMG8..15 over stack for loop-live values.
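+  //
+  // Worked sketch of the class-intersection argument above.  The
+  // register sets are abbreviated from the descriptions in this
+  // comment, not copied from the TableGen definitions:
+  //   Acc16  = { A }
+  //   Wide16 = { A, IMG0..15 }   Wide16 ∩ Acc16 = { A }  -> joinable
+  //   Img16  = { IMG0..15 }      Img16  ∩ Acc16 = ∅      -> not joinable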
  return Changed;
 }
--- a/src/llvm/test/CodeGen/W65816/extract-wide32-regseq.s
+++ b/src/llvm/test/CodeGen/W65816/extract-wide32-regseq.s