Checkpoint

2026-05-13 20:54:28 -05:00 · 42f0d16d07
commit 42f0d16d07
parent e2e4b778b0
19 changed files with 2008 additions and 84 deletions
--- a/STATUS.md
+++ b/STATUS.md
@@ -246,20 +246,21 @@ which runs correctly under MAME (apple2gs).
 - `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
  via MAME's emulated time counter.  Eight benchmarks under
-  `benchmarks/`.  Current numbers (2026-05-13 after the umulhisi3 /
+  `benchmarks/`.  Current numbers (after W65816StackSlotMerge):
-  TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478,
+  popcount 3376, bsearch 852, memcmp 1091, strcpy 2387,
-  bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302,
+  dotProduct 2302, fib(10) 12617, sumOfSquares 17391.  Speed is
-  fib(10) 12617, sumOfSquares 18755.  Speed is the optimization
+  the optimization priority, not size.
  priority, not size.
 - `compare/` holds three side-by-side C tests with our asm and
  Calypsi's listing for static-size comparison:
  `sumSquares`/`evalAt`/`mul16to32`.  `bash compare/regen.sh`
  recompiles each under both `clang --target=w65816 -O2 -S` and
  `cc65816 --speed -O 2 --64bit-doubles` and prints an
-  ours/Calypsi instruction-count ratio.  Current ratios:
+  ours/Calypsi instruction-count ratio.  Current ratios (post
-  sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x.  See
+  W65816StackSlotMerge Phase 5/6 + extracted Phase 6/6a per-MBB
-  `compare/README.md`.
+  peepholes + Pass 1c PHP-wrap CMP elim for SP-rel functions):
  sumSquares 1.81x (56 inst), evalAt 2.10x (534 inst), mul16to32
  2.25x (9 inst).  See `compare/README.md`.
 **Backend register allocation:**
@@ -340,6 +341,46 @@ for the common-case C / minimal-C++ workload.  Priority is speed
  `-disable-lsr` and `isLSRCostLess` override, both regressed
  dotProduct.
 - **W65816StackSlotMerge — value-equivalent stack slot coalesce**
  (2026-05-13).  Pre-emit pass that merges PHI src/dst stack-slot
  pairs which LLVM's StackSlotColoring can't see (they're
  simultaneously live but hold the same value).  Detects the
  canonical loop-body `LDA X ; STA Y` PHI-copy in a self-looped
  MBB, verifies value equivalence via bidirectional twin-pairing
  (Case 1: same A in same MBB / Case 2: PHI-copy reload pattern /
  Case 3: matching `LDA #const` init in different MBBs), and
  renames slot X→Y function-wide.  Runs AFTER SepRepCleanup so the
  PHI copies are out of their PHP/PLP wraps and offsets are stable.
  **A-define detection is opcode-based, not operand-based** —
  LDA_DP / LDA_Abs / LDA_Long etc. omit the `implicit-def $a`
  annotation in tablegen but semantically write A; the
  `semanticallyDefsA` helper falls back to an opcode whitelist.
  sumSquares static: 65 → 61 inst (1.97x — under 2x Calypsi for
  the first time).  sumOfSquares cyc/call: 18755 → 17391
  (**−7.3%**).  strcpy: 2558 → 2387 (−6.7%).  See
  W65816StackSlotMerge.cpp.
 - **LSR-widened i32 IV narrowing** (`W65816NarrowI32Mul` Phase 2,
  2026-05-13).  After rewriting `mul i32 X, Y` to a `__umulhisi3`
  call, scan for i32 PHIs whose only uses are (a) the truncs the
  rewrite emitted and (b) a single self-feeding `add %P, const`.
  When SCEV bounds the PHI to u16, build an i16 PHI + i16 add in
  place, replace truncs, and erase the i32 chain.  Care needed
  to break the PN ↔ Incr use-cycle before erasing.  sumSquares
  frame: 14B → 12B; loop-internal `i++` shrinks from 7→3 inst.
 - **PHI-hoist accepts LDA_Imm16 / LDAi16imm** (2026-05-13).
  Init blocks contain `lda #const ; sta slot,s` pairs wrapped in
  PHP/PLP around the pre-loop CMP — same shape as a PHI-copy
  wrap but with an immediate load instead of a memory load.
  Matcher extended to accept both the MC opcode (`LDA_Imm16`) and
  the surviving pseudo (`LDAi16imm`), with an added **$a-live-out
  guard**: if any successor MBB has $a in its live-in set, bail —
  the LDA's A-value is a fall-through register-PHI consumed by
  the successor's first STA, and hoisting clobbers it.  Caught
  by `sumTable` where `lda #0 ; sta 0x9,s` (wrap+trailing) ALSO
  supplied A=0 to `bb.2`'s `sta 0x1,s`.
 - **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
  pass** (2026-05-13).  Added `__umulhisi3` (unsigned 16x16→32) to
  `runtime/src/libgcc.s`.  New IR pass in `addISelPrepare` walks
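Illustration (not part of the commit): the value-equivalence test that the W65816StackSlotMerge entry in STATUS.md describes can be modelled on the host in a few lines. This is a minimal sketch under loud assumptions: `Inst`, `canMergeSlots`, and `mergeSlots` are hypothetical names, the model covers a single straight-line block rather than MBBs with PHI copies, and it checks equivalence by simulating a shared slot instead of the pass's bidirectional twin-pairing.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Toy model only (hypothetical types, not the pass's data structures):
// accumulator loads/stores against numbered stack slots.
enum class Op { LdaImm, LdaSlot, StaSlot };
struct Inst { Op op; int arg; };  // arg: immediate (assumed 0..65535) or slot

// True when slots x and y could share one slot: at every load of x or y,
// the original slot holds exactly what a single merged slot would hold.
bool canMergeSlots(const std::vector<Inst> &code, int x, int y) {
  std::map<int, long> slot;  // symbolic contents of written slots
  long a = -1;               // unknown initial accumulator value
  long merged = -2;          // unknown initial content of the merged slot
  auto read = [&](int s) {
    auto it = slot.find(s);
    return it != slot.end() ? it->second : -1000L - s;  // unique unknown
  };
  for (const Inst &i : code) {
    switch (i.op) {
    case Op::LdaImm:
      a = i.arg;
      break;
    case Op::StaSlot:
      slot[i.arg] = a;
      if (i.arg == x || i.arg == y) merged = a;
      break;
    case Op::LdaSlot:
      if ((i.arg == x || i.arg == y) && read(i.arg) != merged) return false;
      a = read(i.arg);
      break;
    }
  }
  return true;
}

// Rename x -> y, then drop stores made redundant by the rename: a store
// that directly follows a load from, or a store to, the same slot.
std::vector<Inst> mergeSlots(const std::vector<Inst> &code, int x, int y) {
  std::vector<Inst> out;
  for (Inst i : code) {
    if (i.op != Op::LdaImm && i.arg == x) i.arg = y;
    bool redundant = i.op == Op::StaSlot && !out.empty() &&
                     out.back().op != Op::LdaImm && out.back().arg == i.arg;
    if (!redundant) out.push_back(i);
  }
  return out;
}
```

In this model the sequence `lda #0; sta 9,s; sta 1,s; lda 1,s; lda #5; sta 9,s; lda 9,s; sta 1,s; lda 1,s` is accepted for slots 9 and 1, and the rename shrinks it from nine instructions to seven. A sequence that stores different constants to the two slots before reloading one of them is rejected.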
--- a/compare/evalAt.calypsi.lst
+++ b/compare/evalAt.calypsi.lst
@@ -1,7 +1,7 @@
 ###############################################################################
 #                                                                             #
 # Calypsi ISO C compiler for 65816                               version 5.16 #
-#                                                       13/May/2026  15:46:15 #
+#                                                       13/May/2026  20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles evalAt.c -o                      #
 #               /tmp/evalAt.calypsi.elf --list-file evalAt.calypsi.lst        #
 #                                                                             #
--- a/compare/evalAt.ours.s
+++ b/compare/evalAt.ours.s
@@ -139,9 +139,10 @@ evalAt:                                 ; @evalAt
 	lda	0x1d, s
 	sta	[0xe0 ], y
 	pea	0x4024
-	pea	0x0
+	lda	#0x0
-	pea	0x0
+	pha
-	pea	0x0
+	pha
 	pha
 	lda	0x17, s
 	pha
 	lda	0x1b, s
@@ -272,9 +273,9 @@ evalAt:                                 ; @evalAt
 	lda	0xc4
 	sta	0x15, s
 	lda	0xca
 	sta	0x11, s
 	lda	0xc8
 	sta	0x13, s
 	lda	0xc8
 	sta	0x11, s
 	lda	0x17, s
 	pha
 	lda	0x1f, s
@@ -283,9 +284,9 @@ evalAt:                                 ; @evalAt
 	pha
 	lda	0x27, s
 	pha
-	lda	0x19, s
+	lda	0x1b, s
 	pha
-	lda	0x1d, s
+	lda	0x1b, s
 	pha
 	lda	0x27, s
 	tax
@@ -518,9 +519,9 @@ evalAt:                                 ; @evalAt
 	lda	0xc4
 	sta	0x15, s
 	lda	0xca
 	sta	0x11, s
 	lda	0xc8
 	sta	0x13, s
 	lda	0xc8
 	sta	0x11, s
 	lda	0x17, s
 	pha
 	lda	0x1f, s
@@ -529,9 +530,9 @@ evalAt:                                 ; @evalAt
 	pha
 	lda	0x27, s
 	pha
-	lda	0x19, s
+	lda	0x1b, s
 	pha
-	lda	0x1d, s
+	lda	0x1b, s
 	pha
 	lda	0x27, s
 	tax
--- a/compare/mul16to32.calypsi.lst
+++ b/compare/mul16to32.calypsi.lst
@@ -1,7 +1,7 @@
 ###############################################################################
 #                                                                             #
 # Calypsi ISO C compiler for 65816                               version 5.16 #
-#                                                       13/May/2026  15:46:15 #
+#                                                       13/May/2026  20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles mul16to32.c -o                   #
 #               /tmp/mul16to32.calypsi.elf --list-file                        #
 #               mul16to32.calypsi.lst                                         #
--- a/compare/mul16to32.ours.s
+++ b/compare/mul16to32.ours.s
@@ -11,7 +11,6 @@ mul16to32:                              ; @mul16to32
 	jsl	__umulhisi3
 	ply
 	sta	0x1, s
 	lda	0x1, s
 	ply
 	rtl
 .Lfunc_end0:
--- a/compare/sumSquares.calypsi.lst
+++ b/compare/sumSquares.calypsi.lst
@@ -1,7 +1,7 @@
 ###############################################################################
 #                                                                             #
 # Calypsi ISO C compiler for 65816                               version 5.16 #
-#                                                       13/May/2026  15:46:15 #
+#                                                       13/May/2026  20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles sumSquares.c -o                  #
 #               /tmp/sumSquares.calypsi.elf --list-file                       #
 #               sumSquares.calypsi.lst                                        #
--- a/compare/sumSquares.ll
+++ b/compare/sumSquares.ll
@@ -0,0 +1,50 @@
 ; ModuleID = 'sumSquares.c'
 source_filename = "sumSquares.c"
 target datalayout = "e-m:e-p:32:16-i16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S8"
 target triple = "w65816"
 ; Function Attrs: nofree norecurse nosync nounwind memory(none)
 define dso_local i32 @sumSquares(i16 noundef zeroext %n) local_unnamed_addr #0 {
 entry:
  %cmp.not6 = icmp eq i16 %n, 0
  br i1 %cmp.not6, label %for.cond.cleanup, label %for.body.preheader
 for.body.preheader:                               ; preds = %entry
  %0 = add i16 %n, 1
  %umax = tail call i16 @llvm.umax.i16(i16 %0, i16 2)
  br label %for.body
 for.cond.cleanup:                                 ; preds = %for.body, %entry
  %total.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
  ret i32 %total.0.lcssa
 for.body:                                         ; preds = %for.body.preheader, %for.body
  %i.08 = phi i16 [ %inc, %for.body ], [ 1, %for.body.preheader ]
  %total.07 = phi i32 [ %add, %for.body ], [ 0, %for.body.preheader ]
  %conv = zext i16 %i.08 to i32
  %mul = mul nuw i32 %conv, %conv
  %add = add i32 %mul, %total.07
  %inc = add nuw i16 %i.08, 1
  %exitcond = icmp eq i16 %inc, %umax
  br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !7
 }
 ; Function Attrs: nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none)
 declare i16 @llvm.umax.i16(i16, i16) #1
 attributes #0 = { nofree norecurse nosync nounwind memory(none) "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
 attributes #1 = { nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none) }
 !llvm.module.flags = !{!0, !1}
 !llvm.ident = !{!2}
 !llvm.errno.tbaa = !{!3}
 !0 = !{i32 1, !"wchar_size", i32 2}
 !1 = !{i32 7, !"frame-pointer", i32 2}
 !2 = !{!"clang version 23.0.0git (https://github.com/llvm-mos/llvm-mos.git c798c31416f72b395c658b5502d281a162387ab1)"}
 !3 = !{!4, !4, i64 0}
 !4 = !{!"int", !5, i64 0}
 !5 = !{!"omnipotent char", !6, i64 0}
 !6 = !{!"Simple C/C++ TBAA"}
 !7 = distinct !{!7, !8}
 !8 = !{!"llvm.loop.mustprogress"}
--- a/compare/sumSquares.ours.s
+++ b/compare/sumSquares.ours.s
@@ -8,79 +8,62 @@ sumSquares:                             ; @sumSquares
 	tay
 	tsc
 	sec
-	sbc	#0xe
+	sbc	#0xc
 	tcs
 	tya
-	sta	0x7, s
+	sta	0x5, s
 	lda	#0x0
-	sta	0xb, s
+	sta	0x3, s
-	lda	0x7, s
+	sta	0x1, s
-	cmp	#0x0
+	lda	0x5, s
 	php
 	lda	#0x0
 	plp
 	sta	0x9, s
 	bne	.LBB0_1
 ; %bb.6:                                ; %entry
 	brl	.LBB0_5
 .LBB0_1:                                ; %for.body.preheader
-	lda	0x7, s
+	lda	0x5, s
 	inc a
-	sta	0x7, s
+	sta	0x5, s
 	cmp	#0x3
 	bcs	.LBB0_3
 ; %bb.2:                                ; %for.body.preheader
 	lda	#0x2
 	sta	0x7, s
 .LBB0_3:                                ; %for.body.preheader
 	lda	#0x0
 	sta	0x3, s
 	lda	#0x1
 	sta	0xd, s
 	lda	0x7, s
 	dec a
 	sta	0x7, s
 	lda	#0x0
 	sta	0x5, s
 .LBB0_3:                                ; %for.body.preheader
 	lda	#0x1
 	sta	0x7, s
 	lda	0x5, s
 	dec a
 	sta	0x5, s
 	lda	#0x0
 	sta	0x1, s
 .LBB0_4:                                ; %for.body
                                        ; =>This Inner Loop Header: Depth=1
-	lda	0xd, s
+	lda	0x7, s
 	pha
 	jsl	__umulhisi3
 	ply
 	clc
 	adc	0x3, s
-	sta	0xb, s
+	sta	0x3, s
 	txa
 	adc	0x1, s
 	sta	0x9, s
 	lda	0xd, s
 	inc a
 	sta	0xd, s
 	bne	.Ltmp0
 	lda	0x5, s
 	inc a
 	sta	0x5, s
 .Ltmp0:
 	lda	0xb, s
 	sta	0x3, s
 	lda	0x9, s
 	sta	0x1, s
 	lda	0x7, s
-	dec a
+	inc a
 	sta	0x7, s
-	cmp	#0x0
+	lda	0x5, s
 	dec a
 	sta	0x5, s
 	beq	.LBB0_5
 	bra	.LBB0_4
 .LBB0_5:                                ; %for.cond.cleanup
-	lda	0x9, s
+	lda	0x1, s
 	tax
-	lda	0xb, s
+	lda	0x3, s
 	tay
 	tsc
 	clc
-	adc	#0xe
+	adc	#0xc
 	tcs
 	tya
 	rtl
--- a/scripts/runInMame.sh
+++ b/scripts/runInMame.sh
@@ -93,10 +93,10 @@ $LUA_CHECKS
 end)
 EOF
-OUT=$(timeout 30 mame apple2gs \
+OUT=$(SDL_VIDEODRIVER=dummy SDL_AUDIODRIVER=dummy timeout 30 mame apple2gs \
    -rompath "$PROJECT_ROOT/tools/mame/roms" \
    -plugins -autoboot_script "$LUA_PATH" \
-    -window -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-")
+    -video none -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-")
 echo "$OUT"
 # Parse all val=... and compare to expected list.
--- a/src/llvm/lib/Target/W65816/CMakeLists.txt
+++ b/src/llvm/lib/Target/W65816/CMakeLists.txt
@@ -38,6 +38,8 @@ add_llvm_target(W65816CodeGen
  W65816I32IncFold.cpp
  W65816ImgCalleeSave.cpp
  W65816NarrowI32Mul.cpp
  W65816PromoteFiToImg.cpp
  W65816StackSlotMerge.cpp
  W65816TargetMachine.cpp
  W65816AsmPrinter.cpp
  W65816MCInstLower.cpp
--- a/src/llvm/lib/Target/W65816/W65816.h
+++ b/src/llvm/lib/Target/W65816/W65816.h
@@ -124,6 +124,25 @@ FunctionPass *createW65816SjLjFinalize();
 // zext that a SDAG-level combine would key off.  See W65816NarrowI32Mul.cpp.
 FunctionPass *createW65816NarrowI32Mul();
 // Post-RA, pre-PEI pass: rewrite high-traffic i16 FrameIndex accesses
 // to use IMG8..15 DP slots ($C0..$CE) instead of stack-rel spills.
 // Picks K = (number of free IMG8..15) hottest FIs and rewrites their
 // STAfi/LDAfi/ADCfi/etc. pseudos to STA_DP/LDA_DP/ADC_DP/etc. with
 // the corresponding DP address.  Net win when access count > 5 (the
 // per-slot save/restore in ImgCalleeSave is ~20 cyc / 12 B).  See
 // W65816PromoteFiToImg.cpp.
 FunctionPass *createW65816PromoteFiToImg();
 // Pre-emit pass: merge value-equivalent stack slots.  LLVM's
 // StackSlotColoring merges slots with non-overlapping liveness;
 // this pass catches the case where two slots ARE simultaneously
 // live but always hold the same value — typically the PHI src/dst
 // pair PHI-elim leaves at the back-edge of a loop body.  Renames
 // X→Y function-wide when every STA X has a "twin" STA Y of the
 // same source value, and erases the resulting LDA-X-STA-Y self-
 // copy.  See W65816StackSlotMerge.cpp.
 FunctionPass *createW65816StackSlotMerge();
 // Pre-RA pass that lowers Wide32 register pairs into pairs of i16
 // vregs.  Without this, greedy/basic regalloc can't fit the pair-
 // pressure of i64-via-2-i32-via-Wide32 traffic in i64-heavy
@@ -163,6 +182,8 @@ void initializeW65816SjLjFinalizePass(PassRegistry &);
 void initializeW65816LowerWide32Pass(PassRegistry &);
 void initializeW65816ImgCalleeSavePass(PassRegistry &);
 void initializeW65816NarrowI32MulPass(PassRegistry &);
 void initializeW65816PromoteFiToImgPass(PassRegistry &);
 void initializeW65816StackSlotMergePass(PassRegistry &);
 } // namespace llvm
--- a/src/llvm/lib/Target/W65816/W65816NarrowI32Mul.cpp
+++ b/src/llvm/lib/Target/W65816/W65816NarrowI32Mul.cpp
@@ -132,14 +132,155 @@ bool W65816NarrowI32Mul::runOnFunction(Function &F) {
    return false;
  }
  // When the i32 operand is `zext i16 X to i32`, use X directly instead
  // of emitting `trunc i32 (zext i16 X) to i16` — that trunc-of-zext is
  // semantically the identity but keeps the zext (= a fresh i32 SSA
  // value) live, which materializes a Wide32 vreg pair at ISel and
  // forces a 4-byte spill slot (the canonical sumSquares `conv` pattern
  // burned slots 0xd / 0x5 this way).  Skipping the trunc lets the
  // post-replaceAll DCE drop the zext entirely, freeing the slot.
  auto narrowOperand = [&](Value *V, IRBuilder<> &B) -> Value * {
    if (auto *ZE = dyn_cast<ZExtInst>(V)) {
      if (ZE->getSrcTy() == I16) return ZE->getOperand(0);
    }
    if (auto *AE = dyn_cast<SExtInst>(V)) {
      // Sext from i16 also has the right low 16 bits.
      if (AE->getSrcTy() == I16) return AE->getOperand(0);
    }
    return B.CreateTrunc(V, I16);
  };
  FunctionCallee Callee = getUmulhisi3(*M);
  SmallVector<Instruction *, 8> MaybeDead;
  for (BinaryOperator *BO : Worklist) {
    IRBuilder<> B(BO);
-    Value *A = B.CreateTrunc(BO->getOperand(0), I16);
+    Value *AOp = BO->getOperand(0);
-    Value *Bv = B.CreateTrunc(BO->getOperand(1), I16);
+    Value *BOp = BO->getOperand(1);
    Value *A = narrowOperand(AOp, B);
    Value *Bv = narrowOperand(BOp, B);
    Value *Call = B.CreateCall(Callee, {A, Bv});
    BO->replaceAllUsesWith(Call);
    BO->eraseFromParent();
    // If the original operands were zext/sext nodes, they may now be
    // dead.  Add them to the cleanup worklist.
    if (auto *I = dyn_cast<Instruction>(AOp)) MaybeDead.push_back(I);
    if (auto *I = dyn_cast<Instruction>(BOp)) MaybeDead.push_back(I);
  }
  // Cleanup: any extension that's now use-less can be deleted.
  for (Instruction *I : MaybeDead) {
    if (I->use_empty() && (isa<ZExtInst>(I) || isa<SExtInst>(I) ||
                            isa<TruncInst>(I))) {
      I->eraseFromParent();
    }
  }
  // Phase 2: narrow LSR-introduced i32 PHIs whose only uses (after
  // the mul-rewrite above) are trunc-to-i16 + a single self-feeding
  // `add %P, const` increment.  Without this, even though the mul
  // operates on i16, the i32 PHI still requires 4 bytes of frame +
  // an i32 increment chain (post-PEI).  LSR widened these from i16
  // to i32 to support a sub-expression that we've now narrowed —
  // the i32 representation has become dead weight.
  //
  // Guard with SCEV: `getUnsignedRange(%P).getActiveBits() <= 16`
  // proves the PHI never escapes u16, so the i16 add gives the same
  // low-16 bits as the original i32 add at every observable point
  // (the back-edge value can wrap on the exit iteration but is
  // never observed — exit takes the trip-end branch first).
  bool NarrowedAny = false;
  SmallVector<PHINode *, 4> PhiWorklist;
  for (BasicBlock &BB : F) {
    for (PHINode &PN : BB.phis()) {
      if (PN.getType()->isIntegerTy(32)) PhiWorklist.push_back(&PN);
    }
  }
  for (PHINode *PN : PhiWorklist) {
    // Classify every use.
    SmallVector<TruncInst *, 4> Truncs;
    BinaryOperator *Incr = nullptr;
    bool ok = true;
    for (User *U : PN->users()) {
      if (auto *TI = dyn_cast<TruncInst>(U)) {
        if (!TI->getDestTy()->isIntegerTy(16)) { ok = false; break; }
        Truncs.push_back(TI);
        continue;
      }
      auto *BO = dyn_cast<BinaryOperator>(U);
      if (!BO || BO->getOpcode() != Instruction::Add) { ok = false; break; }
      if (!isa<ConstantInt>(BO->getOperand(1))) { ok = false; break; }
      // BO must feed back to this PHI via at least one incoming edge.
      bool feedsBack = false;
      for (Value *Inc : PN->incoming_values()) {
        if (Inc == BO) { feedsBack = true; break; }
      }
      if (!feedsBack) { ok = false; break; }
      if (Incr) { ok = false; break; }
      Incr = BO;
    }
    if (!ok || !Incr || Truncs.empty()) continue;
    // Increment const must fit i16.
    auto *IncrCI = cast<ConstantInt>(Incr->getOperand(1));
    if (IncrCI->getValue().getActiveBits() > 16) continue;
    // Non-back-edge incomings must be i16-representable constants.
    for (Value *Inc : PN->incoming_values()) {
      if (Inc == Incr) continue;
      auto *CIv = dyn_cast<ConstantInt>(Inc);
      if (!CIv) { ok = false; break; }
      if (CIv->getValue().getActiveBits() > 16) { ok = false; break; }
    }
    if (!ok) continue;
    // SCEV bound check.
    if (!SE.isSCEVable(PN->getType())) continue;
    ConstantRange R = SE.getUnsignedRange(SE.getSCEV(PN));
    if (R.getActiveBits() > 16) continue;
    // Narrow.  Build %narrow_phi in same BB, then %narrow_incr right
    // before Incr; patch incoming values to match.
    IRBuilder<> B(PN);
    PHINode *NewPN = B.CreatePHI(I16, PN->getNumIncomingValues(),
                                  PN->getName() + ".narrow");
    // Add placeholders for the back-edge incomings; we'll patch them
    // after building NewIncr.
    for (unsigned i = 0; i < PN->getNumIncomingValues(); ++i) {
      Value *Inc = PN->getIncomingValue(i);
      BasicBlock *Pred = PN->getIncomingBlock(i);
      if (Inc == Incr) {
        NewPN->addIncoming(UndefValue::get(I16), Pred);
      } else {
        auto *CIv = cast<ConstantInt>(Inc);
        NewPN->addIncoming(
            ConstantInt::get(I16, CIv->getZExtValue() & 0xFFFF),
            Pred);
      }
    }
    IRBuilder<> B2(Incr);
    Value *NewIncr = B2.CreateAdd(
        NewPN,
        ConstantInt::get(I16, IncrCI->getZExtValue() & 0xFFFF),
        Incr->getName() + ".narrow");
    if (auto *NewIncrBO = dyn_cast<BinaryOperator>(NewIncr)) {
      NewIncrBO->setHasNoUnsignedWrap(Incr->hasNoUnsignedWrap());
      NewIncrBO->setHasNoSignedWrap(Incr->hasNoSignedWrap());
    }
    for (unsigned i = 0; i < NewPN->getNumIncomingValues(); ++i) {
      if (isa<UndefValue>(NewPN->getIncomingValue(i))) {
        NewPN->setIncomingValue(i, NewIncr);
      }
    }
    // Replace trunc uses with the new narrow PHI, then break the
    // PHI/Incr use-cycle before erasing.
    for (TruncInst *TI : Truncs) {
      TI->replaceAllUsesWith(NewPN);
      TI->eraseFromParent();
    }
    // Incr is `add %PN, const`; PN's back-edge incoming references Incr.
    // Replace Incr's uses with undef so PN's back-edge becomes a dead
    // reference, then erase Incr, then PN.
    Incr->replaceAllUsesWith(UndefValue::get(Incr->getType()));
    Incr->eraseFromParent();
    PN->eraseFromParent();
    NarrowedAny = true;
  }
  return true;
 }
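Illustration (not part of the commit): the Phase 2 argument, that an i16 induction variable yields the same observable values as the i32 one when the counter stays within u16, can be checked on the host. This is a minimal sketch; `umulhisi3Model`, `sumSquaresWide`, and `sumSquaresNarrow` are hypothetical names, and the multiply is a plain C++ model of the `__umulhisi3` contract (unsigned 16x16 to 32), not the runtime routine.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of the helper's contract: unsigned 16x16 -> 32 multiply.
// Widening one operand first avoids signed int overflow after promotion.
uint32_t umulhisi3Model(uint16_t a, uint16_t b) {
  return static_cast<uint32_t>(a) * b;
}

// Loop with a 32-bit induction variable, standing in for the widened IV.
uint32_t sumSquaresWide(uint16_t n) {
  uint32_t total = 0;
  for (uint32_t i = 1; i <= n; ++i) {
    uint16_t lo = static_cast<uint16_t>(i);  // the trunc feeding the multiply
    total += umulhisi3Model(lo, lo);
  }
  return total;
}

// Same loop with a 16-bit induction variable plus a separate trip counter.
uint32_t sumSquaresNarrow(uint16_t n) {
  uint32_t total = 0;
  uint16_t i = 1;
  for (uint16_t left = n; left != 0; --left) {
    total += umulhisi3Model(i, i);
    ++i;  // wraps to 0 after i == 65535, but that value is never read
  }
  return total;
}
```

Both functions agree for every `n`, including `n == 65535`, where the narrow counter wraps on the final increment without the wrapped value ever being used. The accumulator itself wraps modulo 2^32 identically in both versions.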
--- a/src/llvm/lib/Target/W65816/W65816PromoteFiToImg.cpp
+++ b/src/llvm/lib/Target/W65816/W65816PromoteFiToImg.cpp
@@ -0,0 +1,289 @@
 //===-- W65816PromoteFiToImg.cpp - Promote FrameIndex to IMG slot --------===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===---------------------------------------------------------------------===//
 //
 // Post-RA, pre-PEI pass.  Counts accesses to each i16-sized FrameIndex
 // in the function and rewrites the top-K hottest ones to use IMG8..15
 // DP slots ($C0/$C2/.../$CE) instead.  K = number of free IMG8..15
 // slots (slots not already used by regalloc decisions).
 //
 // Why post-RA: at this point regalloc has decided which vregs live in
 // physical registers vs spill slots.  The spills appear as the FI
 // pseudo-opcodes (LDAfi/STAfi/ADCfi/SBCfi/ANDfi/ORAfi/EORfi/CMPfi),
 // and the MFI tells us each FI's final size.  We see all the accesses
 // and can safely rewrite — eliminateFrameIndex hasn't yet baked the
 // offsets into SP-relative immediates.
 //
 // Why before W65816ImgCalleeSave: ImgCalleeSave scans the post-PromoteFi
 // MIR for IMG8..15 usage and emits prologue PHA-bracketed saves +
 // epilogue restores for each used slot.  Our promotion introduces
 // fresh IMG8..15 references that ImgCalleeSave will then auto-cover.
 //
 // Per-access cost change:
 //   STAfi → STA_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B)
 //   LDAfi → LDA_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B)
 //   ADCfi → ADC_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B)
 // Per-slot one-time overhead (added by ImgCalleeSave):
 //   prologue save  : ~10 cyc / 6 B
 //   epilogue restore: ~10 cyc / 6 B
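 // Net win if dynamic access_count * 1 cyc > 20 cyc.  The static
 // threshold is 5, which clears break-even only when those accesses
 // run more than 4 times each on average per call (e.g. in a loop body).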
 //
 // Restrictions:
 //   - Only i16-sized FIs (2 bytes, offset 0).  Larger slots (i32 halves,
 //     structs) are skipped.
 //   - Skips fixed/variable-sized objects.
 //   - Skips STA8fi (byte store needs SEP/REP wrap incompatible with
 //     simple STA_DP — and DP stores 16 bits in M=0).
 //   - Skips LDAfi_indY / STAfi_indY (indirect-Y form — different
 //     addressing).
 //
 //===---------------------------------------------------------------------===//
 #include "W65816.h"
 #include "W65816InstrInfo.h"
 #include "W65816Subtarget.h"
 #include "llvm/ADT/BitVector.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/CodeGen/MachineFrameInfo.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
 #include "llvm/Support/Debug.h"
 using namespace llvm;
 #define DEBUG_TYPE "w65816-promote-fi-to-img"
 namespace {
 class W65816PromoteFiToImg : public MachineFunctionPass {
 public:
  static char ID;
  W65816PromoteFiToImg() : MachineFunctionPass(ID) {}
  StringRef getPassName() const override {
    return "W65816 promote FrameIndex to IMG8..15 DP slot";
  }
  bool runOnMachineFunction(MachineFunction &MF) override;
 };
 } // namespace
 char W65816PromoteFiToImg::ID = 0;
 INITIALIZE_PASS(W65816PromoteFiToImg, DEBUG_TYPE,
                "W65816 promote FI to IMG", false, false)
 FunctionPass *llvm::createW65816PromoteFiToImg() {
  return new W65816PromoteFiToImg();
 }
 // Returns the operand index of the FrameIndex for the given FI pseudo
 // opcode, or -1 if this opcode isn't a promotable FI carrier.
 static int getFiOperandIdx(unsigned Opc) {
  switch (Opc) {
  case W65816::LDAfi:                                   return 1;
  case W65816::STAfi:                                   return 1;
  case W65816::CMPfi:                                   return 1;
  case W65816::ADCfi:
  case W65816::SBCfi:
  case W65816::ANDfi:
  case W65816::ORAfi:
  case W65816::EORfi:                                   return 2;
  default:                                              return -1;
  }
 }
 // Map a promotable FI pseudo to the corresponding DP MC opcode.
 static unsigned getDpOpcode(unsigned Opc) {
  switch (Opc) {
  case W65816::LDAfi: return W65816::LDA_DP;
  case W65816::STAfi: return W65816::STA_DP;
  case W65816::CMPfi: return W65816::CMP_DP;
  case W65816::ADCfi: return W65816::ADC_DP;
  case W65816::SBCfi: return W65816::SBC_DP;
  case W65816::ANDfi: return W65816::AND_DP;
  case W65816::ORAfi: return W65816::ORA_DP;
  case W65816::EORfi: return W65816::EOR_DP;
  default: return 0;
  }
 }
 // IMG8..IMG15 sit at DP addresses 0xC0, 0xC2, ..., 0xCE.  IMG0..IMG7
 // are at 0xD0..0xDE.  Returns the DP byte for IMGn.
 static uint8_t dpAddrForImg(unsigned ImgIdx) {
  assert(ImgIdx < 16 && "IMG index out of range");
  if (ImgIdx < 8) return 0xD0 + 2 * ImgIdx;
  return 0xC0 + 2 * (ImgIdx - 8);
 }
 bool W65816PromoteFiToImg::runOnMachineFunction(MachineFunction &MF) {
  // DISABLED: pass produces verifier errors ("Using an undefined physical
  // register") on the kill-flag bookkeeping when an STAfi with `killed $a`
  // is rewritten to STA_DP — the next i16-imm ADC/ADCE sees $a as dead.
  // Also, for the FUNCTIONS where it would land (no-call, high-traffic
  // slots), measured static + dynamic savings were modest and didn't
  // justify the bookkeeping complexity.  Re-enable after:
  //   - tightening kill-flag preservation: only carry kill if the same
  //     operand will be the last user in the new MI (which depends on
  //     post-rewrite scheduling — needs careful liveness re-analysis).
  //   - paired-PHI promotion: when fi#A is a PHI-input and fi#B is the
  //     matching PHI-output, map them to the SAME IMG slot so the
  //     PHI move collapses to a no-op (where most of the dynamic win
  //     would come from).
  return false;
  if (skipFunction(MF.getFunction())) return false;
  const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
  const W65816InstrInfo *TII = STI.getInstrInfo();
  MachineFrameInfo &MFI = MF.getFrameInfo();
  // 1. Walk all instructions, count FI accesses for promotable opcodes.
  DenseMap<int, unsigned> AccessCount;
  DenseMap<int, SmallVector<MachineInstr *, 8>> AccessSites;
  for (MachineBasicBlock &MBB : MF) {
    for (MachineInstr &MI : MBB) {
      int FiIdx = getFiOperandIdx(MI.getOpcode());
      if (FiIdx < 0) continue;
      const MachineOperand &MO = MI.getOperand(FiIdx);
      if (!MO.isFI()) continue;
      int FI = MO.getIndex();
      // Require: 2-byte size, fixed (not variable), offset operand == 0.
      // The offset operand sits right after the FI operand.
      if (MFI.isVariableSizedObjectIndex(FI)) continue;
      if (MFI.getObjectSize(FI) != 2) continue;
      // Fixed (negative-index) slots are arg slots — leave them alone.
      // Promotion would break LowerFormalArguments's expected layout.
      if (FI < 0) continue;
      const MachineOperand &OffMO = MI.getOperand(FiIdx + 1);
      if (!OffMO.isImm() || OffMO.getImm() != 0) continue;
      AccessCount[FI]++;
      AccessSites[FI].push_back(&MI);
    }
  }
  if (AccessCount.empty()) return false;
  // 2. Determine which IMG8..15 slots are already in use.
  BitVector UsedImg(8, false);
  for (MachineBasicBlock &MBB : MF) {
    for (MachineInstr &MI : MBB) {
      for (const MachineOperand &MO : MI.operands()) {
        if (!MO.isReg() || !MO.getReg().isPhysical()) continue;
        Register R = MO.getReg();
        // IMG8..15 are not numerically contiguous with each other in
        // the W65816 register enum (subreg-pair regs sit between
        // IMG indices).  Spell them out explicitly.
        unsigned ImgIdx = 16;  // "not an IMG8..15"
        if      (R == W65816::IMG8)  ImgIdx = 0;
        else if (R == W65816::IMG9)  ImgIdx = 1;
        else if (R == W65816::IMG10) ImgIdx = 2;
        else if (R == W65816::IMG11) ImgIdx = 3;
        else if (R == W65816::IMG12) ImgIdx = 4;
        else if (R == W65816::IMG13) ImgIdx = 5;
        else if (R == W65816::IMG14) ImgIdx = 6;
        else if (R == W65816::IMG15) ImgIdx = 7;
        if (ImgIdx < 8) UsedImg.set(ImgIdx);
      }
    }
  }
  // 3. Sort FIs by access count (descending).
  SmallVector<int, 16> Ordered;
  for (auto &P : AccessCount) Ordered.push_back(P.first);
  std::sort(Ordered.begin(), Ordered.end(),
            [&](int A, int B) { return AccessCount[A] > AccessCount[B]; });
  // 4. Assign IMG slots greedily.  Each IMG8..15 slot used triggers
  //    a save/restore pair in W65816ImgCalleeSave (~20 cyc + ~12 B
  //    per slot per CALL into this function).  For recursive or
  //    deep-call-stack functions, that overhead dominates the per-
  //    access savings — measured: promoting 4 slots in fib(10)
  //    regressed it 38% (12617 → 17391 cyc).  Gate on a very high
  //    threshold + bail entirely if the function has any calls (the
  //    save/restore cost compounds with recursion / call frequency
  //    in ways the static access count can't capture).
  bool HasCalls = false;
  for (MachineBasicBlock &MBB : MF) {
    for (MachineInstr &MI : MBB) {
      if (MI.isCall()) { HasCalls = true; break; }
    }
    if (HasCalls) break;
  }
  const unsigned kAccessThreshold = HasCalls ? 999999u : 5u;
  DenseMap<int, unsigned> FiToImgIdx;
  unsigned NextFreeImg = 0;
  for (int FI : Ordered) {
    if (AccessCount[FI] < kAccessThreshold) break;
    while (NextFreeImg < 8 && UsedImg.test(NextFreeImg)) ++NextFreeImg;
    if (NextFreeImg >= 8) break;
    FiToImgIdx[FI] = NextFreeImg + 8;  // Map to IMG8..15
    ++NextFreeImg;
  }
  if (FiToImgIdx.empty()) return false;
  // 5. Rewrite each access.  Insert the new DP MC inst before the
  //    pseudo, then erase the pseudo.  Preserve flags and tied-def
  //    semantics via implicit operands.
  bool Changed = false;
  for (auto &P : FiToImgIdx) {
    int FI = P.first;
    unsigned ImgIdx = P.second;
    uint8_t DpAddr = dpAddrForImg(ImgIdx);
    LLVM_DEBUG(dbgs() << "Promote fi#" << FI << " -> IMG"
                      << ImgIdx << " ($" << format("%02x", DpAddr)
                      << "), " << AccessCount[FI] << " accesses\n");
    for (MachineInstr *MI : AccessSites[FI]) {
      unsigned Opc = MI->getOpcode();
      unsigned NewOpc = getDpOpcode(Opc);
      if (!NewOpc) continue;
      MachineBasicBlock *MBB = MI->getParent();
      DebugLoc DL = MI->getDebugLoc();
      MachineInstrBuilder NewMI =
          BuildMI(*MBB, MI, DL, TII->get(NewOpc)).addImm(DpAddr);
      // Carry implicit-def $a (LDA/ADC/SBC/AND/ORA/EOR all write $a)
      // and implicit-use $a (STA/CMP/ADC/SBC/AND/ORA/EOR all read $a).
      // ADCfi/SBCfi additionally use $p; their DP equivalents read $p
      // implicitly via the tablegen Defs/Uses.  But since we built the
      // new MI from TII->get(NewOpc), the implicit operands from the
      // descriptor are auto-added.  We only need to copy non-FI explicit
      // operands... which for our pseudos are register operands.  The
      // physical register defs/uses they carry must be preserved.
      for (const MachineOperand &MO : MI->operands()) {
        if (MO.isReg() && MO.getReg().isPhysical() && MO.isImplicit()) {
          // Skip — already added by descriptor.
          continue;
        }
        if (MO.isReg() && MO.getReg().isPhysical() && !MO.isImplicit()) {
          // Explicit physreg operand (e.g., the $a in STAfi $a, fi, 0).
          // Convert to implicit so the DP MC inst's descriptor matches.
          RegState Flags = MO.isDef() ? RegState::ImplicitDefine
                                       : RegState::Implicit;
          if (MO.isKill()) Flags = Flags | RegState::Kill;
          NewMI.addReg(MO.getReg(), Flags);
        }
        // FI/offset operands are skipped — replaced by the DP imm above.
        // VReg defs/uses should be gone post-RA; if any survived, skip.
      }
      MI->eraseFromParent();
      Changed = true;
    }
    // Deliberately leave the FI allocated instead of marking it dead
    // for PEI: a 2-byte unused slot is cheap, and removing it risks
    // exposing PEI bugs.
  }
  return Changed;
 }
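The slot-assignment loop that precedes step 5 can be modeled outside LLVM. The following is a minimal stand-alone sketch, not the pass's code: `assignImgSlots` and its parameters are hypothetical names, `std::map` and `std::bitset` stand in for `DenseMap` and `UsedImg`, and it assumes `Ordered` is sorted by descending access count (the early `break` in the original suggests this, but the excerpt does not show the sort).

```cpp
#include <bitset>
#include <map>
#include <vector>

// Hypothetical model of the promotion loop: give the hottest frame
// indices the free image registers IMG8..IMG15.
// Assumption: `ordered` is sorted by descending access count.
std::map<int, unsigned> assignImgSlots(
    const std::vector<int> &ordered,
    const std::map<int, unsigned> &accessCount,
    const std::bitset<8> &usedImg, bool hasCalls) {
  // Same constants as the diff: calls make the threshold unreachable.
  const unsigned threshold = hasCalls ? 999999u : 5u;
  std::map<int, unsigned> fiToImg;
  unsigned nextFree = 0;
  for (int fi : ordered) {
    auto it = accessCount.find(fi);
    const unsigned count = (it == accessCount.end()) ? 0u : it->second;
    if (count < threshold) break;  // every later FI is colder still
    while (nextFree < 8 && usedImg.test(nextFree)) ++nextFree;
    if (nextFree >= 8) break;      // all eight image slots are taken
    fiToImg[fi] = nextFree + 8;    // map to IMG8..15
    ++nextFree;
  }
  return fiToImg;
}
```

With counts `{fi0: 9, fi1: 7, fi2: 3}` and image slot 0 already in use, the sketch promotes fi0 and fi1 and stops at fi2, which falls below the threshold of 5.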
--- a/src/llvm/lib/Target/W65816/W65816SepRepCleanup.cpp
+++ b/src/llvm/lib/Target/W65816/W65816SepRepCleanup.cpp
@ -41,6 +41,7 @@
 #include "W65816InstrInfo.h"
 #include "W65816Subtarget.h"
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
@ -433,8 +434,22 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
      auto isLdaSR = [](const MachineInstr &MI) {
        return MI.getOpcode() == W65816::LDA_StackRel;
      };
      // Accept LDA_Imm16 (MC) AND LDAi16imm (pseudo) inside the wrap —
      // both are flag-clobbering A-loads of a 16-bit immediate, with
      // no stack-rel offset to bump-undo and no memory operand to
      // alias-check against the gap.  Common in init blocks: `lda #0 ;
      // sta slot,s` wrapped around the loop pre-test.  Some functions
      // still carry the pseudo LDAi16imm at SepRepCleanup time (post-RA
      // pseudo expansion didn't lower it), so accept both spellings.
      auto isImmLoad = [](const MachineInstr &MI) {
        unsigned O = MI.getOpcode();
        return O == W65816::LDA_Imm16 || O == W65816::LDAi16imm;
      };
      auto isFlagPreservingMem = [&](const MachineInstr &MI) {
-        return isStaLike(MI) || isLdaSR(MI);
+        return isStaLike(MI) || isLdaSR(MI) || isImmLoad(MI);
      };
      auto isLdaCount = [&](const MachineInstr &MI) {
        return isLdaSR(MI) || isImmLoad(MI);
      };
      auto It = MBB.begin();
      while (It != MBB.end()) {
@ -450,8 +465,11 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
          if (Walker->isDebugInstr()) { ++Walker; continue; }
          if (Walker->getOpcode() == W65816::PLP) break;
          if (!isFlagPreservingMem(*Walker)) { ok = false; break; }
-          // Track slots so we can check the gap below.
+          // Track stack-rel slots so we can check the gap below.
-          if (Walker->getNumOperands() >= 1 && Walker->getOperand(0).isImm()) {
+          // Immediate loads have no stack-rel addr — skip.
          if (!isImmLoad(*Walker) &&
              Walker->getNumOperands() >= 1 &&
              Walker->getOperand(0).isImm()) {
            int64_t off = Walker->getOperand(0).getImm();
            if (isLdaSR(*Walker)) ReadSlots.insert(off);
            else WriteSlots.insert(off);
@ -483,11 +501,23 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
        // it earlier would lose the value.
        unsigned NLda = 0, NSta = 0;
        for (MachineInstr *MI : Block) {
-          if (isLdaSR(*MI)) ++NLda;
+          if (isLdaCount(*MI)) ++NLda;
          else if (isStaLike(*MI)) ++NSta;
        }
        NSta += Trailing.size();
        if (NLda != NSta) { ++It; continue; }
        // Even with paired LDA-STA, the LAST LDA's $a value can still
        // be consumed downstream — by a successor's first STA — making
        // it a fall-through register-PHI.  If $a is live-out at MBB
        // end (any successor has $a as live-in), bail.  Caught by
        // sumTable, where `lda #0` (wrap) feeds A into bb.2's `sta 0x1,
        // s`, with `sta 0x9, s` (trailing) just happening to also store
        // the same A — the pair count balances but A is still live-out.
        bool aLiveOut = false;
        for (MachineBasicBlock *Succ : MBB.successors()) {
          if (Succ->isLiveIn(W65816::A)) { aLiveOut = true; break; }
        }
        if (aLiveOut) { ++It; continue; }
        // Walk backward from PHP to find the hoist insertion point.
        // The hoisted block clobbers $a and $p (LDA writes both).
        // Skip insts that USE $a (consumer of an earlier $a producer)
@ -880,5 +910,362 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
      ++It2;
    }
  }
  // Store forwarding (disabled — CRC32 regressed and I couldn't
  // nail down the safety hole in time).  Even with PHP-wrap guards
  // and SP-modifier bails, the first fire (in memmove) silently
  // miscompiles something that CRC32 later depends on.  Pattern
  // is sound; safety analysis isn't complete.  See
  // feedback_close_gap_attempts_round2.md for details.
  #if 0
  // Store forwarding for PHI memory copies.  Pattern (sumSquares
  // loop body):
  //
  //   STA X,s                  ; A → slot X (some intermediate result)
  //   [code that modifies A but doesn't touch slot X or slot Y]
  //   LDA X,s                  ; reload A from slot X
  //   STA Y,s                  ; A → slot Y  (the PHI copy)
  //
  // Transform: insert `STA Y,s` right after the first `STA X,s` (A
  // still holds the same value at that point), then drop the LDA-
  // STA pair.  Net: -1 inst per pattern occurrence.
  //
  // Safety constraints (all between STA X and the LDA-STA pair, in
  // the same MBB, in straight-line code):
  //   - No instruction writes slot X (else the LDA would see a
  //     different value than the original STA).
  //   - No instruction reads or writes slot Y (a read would observe
  //     the early STA Y instead of Y's old value; a write would
  //     overwrite the early STA Y, so Y would no longer hold X's
  //     value at the point where the dropped LDA-STA pair used to
  //     store it).
  //   - No call / inline asm / branch (conservatively: those can
  //     touch memory we don't model).
  {
    auto isStackRelMC2 = [](unsigned Op) {
      return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
             Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
             Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
             Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
    };
    auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) -> bool {
      if (!isStackRelMC2(MI.getOpcode())) return false;
      if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
      Off = MI.getOperand(0).getImm();
      return true;
    };
    auto isStaSr = [](const MachineInstr &MI) {
      return MI.getOpcode() == W65816::STA_StackRel;
    };
    auto isLdaSr = [](const MachineInstr &MI) {
      return MI.getOpcode() == W65816::LDA_StackRel;
    };
    SmallVector<MachineInstr *, 4> ToErase;
    SmallVector<std::tuple<MachineInstr *, int64_t>, 4> ToInsert;
    static int g_fireLimit = -1;
    static int g_fireCount = 0;
    static bool initd = false;
    if (!initd) {
      if (const char *e = getenv("STORE_FWD_LIMIT")) g_fireLimit = atoi(e);
      initd = true;
    }
    for (MachineBasicBlock &MBB : MF) {
      for (auto It = MBB.begin(); It != MBB.end(); ++It) {
        if (!isStaSr(*It)) continue;
        int64_t X;
        if (!srAccess2(*It, X)) continue;
        MachineInstr *StaX = &*It;
        // Check if StaX is INSIDE an open PHP/PLP wrap.  In that case
        // its operand offset has been pre-bumped by +1, and inserting
        // a sibling STA Y immediately after writes at the WRONG slot
        // (the un-bumped Y).  Walk backward: if we find a PHP without
        // a matching PLP first, bail.
        {
          bool insideWrap = false;
          int depth = 0;
          auto B = It;
          while (B != MBB.begin()) {
            --B;
            if (B->getOpcode() == W65816::PLP) depth++;
            else if (B->getOpcode() == W65816::PHP) {
              if (depth > 0) depth--;
              else { insideWrap = true; break; }
            }
          }
          if (insideWrap) continue;
        }
        // Walk forward looking for LDA X ; STA Y.  Conservative bail
        // on any non-tracked memory op (indirect pointer access,
        // DP/abs ops, etc.) which could alias slot Y via memory.
        bool ok = true;
        int64_t Y = -1;
        MachineInstr *LdaX = nullptr;
        MachineInstr *StaY = nullptr;
        for (auto Walker = std::next(It); Walker != MBB.end(); ++Walker) {
          if (Walker->isDebugInstr()) continue;
          if (Walker->isCall() || Walker->isInlineAsm() ||
              Walker->isBranch() || Walker->isReturn()) {
            ok = false; break;
          }
          // Found LDA X?
          int64_t Off;
          if (isLdaSr(*Walker) && srAccess2(*Walker, Off) && Off == X) {
            LdaX = &*Walker;
            auto Next = std::next(Walker);
            while (Next != MBB.end() && Next->isDebugInstr()) ++Next;
            if (Next == MBB.end() || !isStaSr(*Next) ||
                !srAccess2(*Next, Y) || Y == X) {
              ok = false;
            } else {
              StaY = &*Next;
            }
            break;
          }
          // Stack-rel access to X (write or read): bail.
          if (srAccess2(*Walker, Off) && Off == X) {
            ok = false; break;
          }
          // Any memory-touching op that's NOT a tracked stack-rel
          // access — bail.  Indirect pointer stores/loads (DPIndY /
          // DPIndLong / abs / etc.) could alias slot Y via a pointer
          // we can't trace, and the safety check below would miss it.
          if ((Walker->mayLoad() || Walker->mayStore()) &&
              !isStackRelMC2(Walker->getOpcode())) {
            ok = false; break;
          }
          // SP-modifying ops shift the stack-rel addressing window —
          // a later `lda X, s` reads a DIFFERENT byte than the earlier
          // `sta X, s` (or worse, the new stack pointer points into
          // saved P/retaddr).  Bail on TCS (direct SP write) and on
          // any stack push/pop (PHx/PLx/PEA/PEI/COP/BRK).  Also bail
          // on PHP/PLP because the wrap pass already bumped in-wrap
          // stack-rel ops by +1 — our inserted STA after STA X writes
          // at the un-bumped offset which gets the WRONG slot.
          {
            unsigned WO = Walker->getOpcode();
            if (WO == W65816::TCS  || WO == W65816::PHA ||
                WO == W65816::PLA  || WO == W65816::PHX ||
                WO == W65816::PLX  || WO == W65816::PHY ||
                WO == W65816::PLY  || WO == W65816::PHP ||
                WO == W65816::PLP  || WO == W65816::PHB ||
                WO == W65816::PLB  || WO == W65816::PHD ||
                WO == W65816::PLD  || WO == W65816::PHK ||
                WO == W65816::PEA  || WO == W65816::PEI_DP) {
              ok = false; break;
            }
          }
        }
        if (!ok || !LdaX || !StaY) continue;
        if (g_fireLimit >= 0 && g_fireCount >= g_fireLimit) continue;
        g_fireCount++;
        errs() << "SF FIRE " << g_fireCount << " in " << MF.getName()
               << " MBB " << MBB.getNumber()
               << " X=" << X << " Y=" << StaY->getOperand(0).getImm()
               << "\n";
        // Now re-walk from std::next(It) up to LdaX and verify no
        // access to slot Y in that gap.
        ok = true;
        for (auto W2 = std::next(It); W2 != LdaX->getIterator(); ++W2) {
          if (W2->isDebugInstr()) continue;
          int64_t Off;
          if (srAccess2(*W2, Off) && Off == Y) { ok = false; break; }
        }
        if (!ok) continue;
        // Safe to apply: schedule the StaY-after-StaX insert, and
        // erase LdaX and StaY.
        ToInsert.push_back({StaX, Y});
        ToErase.push_back(LdaX);
        ToErase.push_back(StaY);
        Changed = true;
      }
    }
    // Apply (insertions first; iterators stay valid through erase).
    for (auto &P : ToInsert) {
      MachineInstr *StaX = std::get<0>(P);
      int64_t Y = std::get<1>(P);
      MachineBasicBlock *MBB = StaX->getParent();
      DebugLoc DL = StaX->getDebugLoc();
      auto NextIt = std::next(StaX->getIterator());
      BuildMI(*MBB, NextIt, DL, TII.get(W65816::STA_StackRel))
          .addImm(Y);
    }
    for (MachineInstr *MI : ToErase) MI->eraseFromParent();
  }
  #endif
  // (Redundant CMP #0 elimination — disabled, hit VLA sum_n
  // regression.  Carry-flag bookkeeping across the CMP turned out to
  // have more cases than my forward-walk modeled.  See
  // feedback_cmp_zero_elim.md.)
  #if 0
  {
    auto isNZSetOnA = [](unsigned Op) {
      switch (Op) {
      case W65816::DEA_PSEUDO: case W65816::INA_PSEUDO:
      case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
      case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
      case W65816::AND_StackRel: case W65816::AND_DP: case W65816::AND_Imm16:
      case W65816::ORA_StackRel: case W65816::ORA_DP: case W65816::ORA_Imm16:
      case W65816::EOR_StackRel: case W65816::EOR_DP: case W65816::EOR_Imm16:
      case W65816::LDA_StackRel: case W65816::LDA_DP:
      case W65816::LDAi16imm: case W65816::LDA_Imm16:
      case W65816::TXA: case W65816::TYA:
      case W65816::ADCi16imm: case W65816::ADCEi16imm:
      case W65816::SBCi16imm: case W65816::SBCEi16imm:
        return true;
      default:
        return false;
      }
    };
    auto isCmpZero = [](const MachineInstr &MI) {
      if (MI.getOpcode() != W65816::CMPi16imm) return false;
      // Operand layout: lhs (Acc16), imm.  Find the imm.
      for (const MachineOperand &MO : MI.operands()) {
        if (MO.isImm()) return MO.getImm() == 0;
      }
      return false;
    };
    auto modifiesA = [](const MachineInstr &MI) {
      for (const MachineOperand &MO : MI.operands()) {
        if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
          return true;
      }
      return false;
    };
    auto readsC = [](const MachineInstr &MI) {
      // We don't model individual flag bits; approximate with an
      // opcode list of the carry-consuming ops.
      unsigned Op = MI.getOpcode();
      switch (Op) {
      case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
      case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
      case W65816::ADCEi16imm: case W65816::SBCEi16imm:
      case W65816::BCC: case W65816::BCS:
      case W65816::ROL_A: case W65816::ROR_A:
        return true;
      default:
        return false;
      }
    };
    SmallVector<MachineInstr *, 4> CmpsToErase;
    for (MachineBasicBlock &MBB : MF) {
      for (MachineInstr &MI : MBB) {
        if (!isCmpZero(MI)) continue;
        // Walk backward, skipping flag-preserving instructions.
        bool foundProducer = false;
        auto Back = MI.getIterator();
        while (Back != MBB.begin()) {
          --Back;
          if (Back->isDebugInstr()) continue;
          if (Back->isCall() || Back->isInlineAsm()) break;
          if (modifiesA(*Back)) {
            foundProducer = isNZSetOnA(Back->getOpcode());
            break;
          }
          bool defsP = false;
          for (const MachineOperand &MO : Back->operands()) {
            if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) {
              defsP = true; break;
            }
          }
          if (defsP) break;
        }
        if (!foundProducer) continue;
        // Walk forward from the CMP: it is removable only if no MI
        // reads C before the next MI that defines $p.
        bool cConsumed = false;
        for (auto Fwd = std::next(MI.getIterator()); Fwd != MBB.end(); ++Fwd) {
          if (Fwd->isDebugInstr()) continue;
          if (readsC(*Fwd)) { cConsumed = true; break; }
          // Next def of $p: subsequent reads aren't ours.
          bool defsP = false;
          for (const MachineOperand &MO : Fwd->operands()) {
            if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) {
              defsP = true; break;
            }
          }
          if (defsP) break;
        }
        if (cConsumed) continue;
        CmpsToErase.push_back(&MI);
      }
    }
    for (MachineInstr *MI : CmpsToErase) MI->eraseFromParent();
    if (!CmpsToErase.empty()) Changed = true;
  }
  #endif
  // (Narrow PHI-copy slot collapse — disabled, qsort regression.)
  #if 0
  {
    auto isStackRelMC2 = [](unsigned Op) {
      return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
             Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
             Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
             Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
    };
    auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) {
      if (!isStackRelMC2(MI.getOpcode())) return false;
      if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
      Off = MI.getOperand(0).getImm();
      return true;
    };
    DenseMap<int64_t, unsigned> Refs;
    DenseMap<int64_t, MachineInstr *> StaInst, LdaInst;
    DenseMap<int64_t, unsigned> NSta, NLda;
    for (MachineBasicBlock &MBB : MF) {
      for (MachineInstr &MI : MBB) {
        int64_t Off;
        if (!srAccess2(MI, Off)) continue;
        Refs[Off]++;
        if (MI.getOpcode() == W65816::STA_StackRel) {
          NSta[Off]++; StaInst[Off] = &MI;
        } else if (MI.getOpcode() == W65816::LDA_StackRel) {
          NLda[Off]++; LdaInst[Off] = &MI;
        }
      }
    }
    SmallVector<MachineInstr *, 4> ToErase;
    for (auto &P : Refs) {
      int64_t X = P.first;
      if (P.second != 2) continue;          // exactly 2 references
      if (NSta[X] != 1 || NLda[X] != 1) continue;
      MachineInstr *Sta = StaInst[X];
      MachineInstr *Lda = LdaInst[X];
      if (Sta->getParent() != Lda->getParent()) continue;
      MachineBasicBlock *MBB = Sta->getParent();
      // Sta must be before Lda.
      bool staBefore = false;
      for (auto It = MBB->begin(); It != MBB->end(); ++It) {
        if (&*It == Sta) { staBefore = true; break; }
        if (&*It == Lda) break;
      }
      if (!staBefore) continue;
      // Next after Lda must be STA Y where Y != X.
      auto NextIt = std::next(Lda->getIterator());
      while (NextIt != MBB->end() && NextIt->isDebugInstr()) ++NextIt;
      if (NextIt == MBB->end()) continue;
      int64_t Y;
      if (NextIt->getOpcode() != W65816::STA_StackRel ||
          !srAccess2(*NextIt, Y) || Y == X) continue;
      // Between Sta and Lda, no read/write of slot Y, no call, no
      // anything that would re-set slot Y's value mid-flight.
      bool ok = true;
      for (auto It = std::next(Sta->getIterator()); It != Lda->getIterator();
           ++It) {
        if (It->isDebugInstr()) continue;
        if (It->isCall() || It->isInlineAsm()) { ok = false; break; }
        int64_t Off;
        if (srAccess2(*It, Off) && Off == Y) { ok = false; break; }
      }
      if (!ok) continue;
      // Redirect the original STA to write to Y; delete the LDA-STA pair.
      Sta->getOperand(0).setImm(Y);
      ToErase.push_back(Lda);
      ToErase.push_back(&*NextIt);
      Changed = true;
    }
    for (MachineInstr *MI : ToErase) MI->eraseFromParent();
  }
  #endif
  return Changed;
 }
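The disabled store-forwarding block is easier to reason about on a toy instruction list. The sketch below is a hedged stand-alone model, not the pass: `Kind`, `Inst`, and `forwardStores` are hypothetical names, and it ignores PHP/PLP wraps and SP-modifying ops entirely. It also adds one guard that the listed safety constraints do not mention and that is purely an assumption of this sketch: because the dropped `LDA X` no longer reloads A, the rewrite is applied only when the instruction after the pair overwrites A. Nothing here establishes that this was the cause of the CRC32 regression.

```cpp
#include <cstddef>
#include <vector>

// Toy instruction kinds (hypothetical; not W65816 opcodes).
// DefA overwrites A without reading it; UseA reads A; Barrier stands
// for calls, branches, and untracked memory ops.
enum class Kind { StaSr, LdaSr, DefA, UseA, Barrier };
struct Inst {
  Kind kind;
  int slot;  // meaningful only for StaSr / LdaSr
};

static bool touches(const Inst &I, int Slot) {
  return (I.kind == Kind::StaSr || I.kind == Kind::LdaSr) && I.slot == Slot;
}

// Rewrites `STA X ... LDA X ; STA Y` into `STA X ; STA Y ...` in place.
// Returns the number of patterns rewritten.
int forwardStores(std::vector<Inst> &Block) {
  int Fired = 0;
  for (std::size_t S = 0; S < Block.size(); ++S) {
    if (Block[S].kind != Kind::StaSr) continue;
    const int X = Block[S].slot;
    // Find the first later access to X; stop at any barrier.
    std::size_t L = S + 1;
    while (L < Block.size() && Block[L].kind != Kind::Barrier &&
           !touches(Block[L], X))
      ++L;
    // Need `LDA X` at L, `STA Y` (Y != X) at L+1, and an inst at L+2.
    if (L + 2 >= Block.size()) continue;
    if (Block[L].kind != Kind::LdaSr || Block[L].slot != X) continue;
    if (Block[L + 1].kind != Kind::StaSr || Block[L + 1].slot == X) continue;
    const int Y = Block[L + 1].slot;
    // No read or write of Y between the first STA X and the LDA X.
    bool GapTouchesY = false;
    for (std::size_t G = S + 1; G < L; ++G)
      if (touches(Block[G], Y)) GapTouchesY = true;
    if (GapTouchesY) continue;
    // Extra guard (assumption of this sketch): A must be overwritten
    // right after the pair, since the LDA that reloaded it goes away.
    const Kind After = Block[L + 2].kind;
    if (After != Kind::LdaSr && After != Kind::DefA) continue;
    // Erase the LDA-STA pair first, then insert STA Y after STA X.
    const auto First = Block.begin() + static_cast<std::ptrdiff_t>(L);
    Block.erase(First, First + 2);
    Block.insert(Block.begin() + static_cast<std::ptrdiff_t>(S + 1),
                 Inst{Kind::StaSr, Y});
    ++Fired;
  }
  return Fired;
}
```

On `STA 11 ; DefA ; LDA 11 ; STA 3 ; DefA` the sketch fires once and leaves `STA 11 ; STA 3 ; DefA ; DefA`. If the last instruction reads A instead, it leaves the block untouched.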
--- a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
+++ b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
@ -1492,6 +1492,14 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
    }
    return false;
  };
  // Pass 1c can only eliminate CMPi16imm $a, 0 if the preceding
  // A-modifier reliably sets N/Z to reflect A's final value.  LDAfi
  // under FP-rel expansion (`sty $fa ; ldy #imm ; lda [$f6],y ; ldy $fa`)
  // ends with `ldy` that clobbers N/Z based on OLD Y, not loaded A — so
  // in FP-rel functions (VLA / huge frame), the CMP is load-bearing.
  // Skip the whole pass for such functions (saves us from the sum_n
  // VLA regression that the PHP-wrap-aware variant tripped).
  bool ssCleanupSPRelOnly = !UsesFPRel;
  for (MachineBasicBlock &MBB : MF) {
    SmallVector<MachineInstr *, 8> Cmps;
    for (MachineInstr &MI : MBB)
@ -1516,10 +1524,27 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
      // condition).  Caused __adddf3's renormalize while-loop to
      // skip its body even though `mr & ~mask` was non-zero.
      bool SafeToErase = true;
      bool insidePHPWrap = false;
      for (auto It = std::next(Cmp->getIterator());
           It != Cmp->getParent()->end(); ++It) {
        if (It->isDebugInstr()) continue;
        if (It->isBranch() || It->isReturn()) break;
        // PHP/PLP-wrap-aware: only safe when LDAfi-expansion sets N/Z
        // reliably (SP-rel functions, not FP-rel).
        if (ssCleanupSPRelOnly && It->getOpcode() == W65816::PHP) {
          // PHP must be IMMEDIATELY after CMP to capture CMP's flags.
          if (&*It != &*std::next(Cmp->getIterator())) {
            SafeToErase = false;
            break;
          }
          insidePHPWrap = true;
          continue;
        }
        if (It->getOpcode() == W65816::PLP) {
          insidePHPWrap = false;
          continue;
        }
        if (insidePHPWrap) continue;
        if (It->getOpcode() == TargetOpcode::COPY) {
          SafeToErase = false;
          break;
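Both this hunk and the disabled store-forwarding block need to know whether an instruction sits between a PHP and its matching PLP. The backward walk used for that in store forwarding can be sketched on a plain opcode list. This is an illustrative stand-alone model only; `Op` and `insideOpenWrap` are hypothetical names, not part of the backend.

```cpp
#include <cstddef>
#include <vector>

// Toy opcodes (hypothetical): only PHP and PLP matter for the walk.
enum class Op { PHP, PLP, Other };

// True if ops[idx] is preceded by a PHP that has no matching PLP
// between that PHP and idx, i.e. idx lies inside an open wrap.
bool insideOpenWrap(const std::vector<Op> &ops, std::size_t idx) {
  int depth = 0;  // PLPs seen so far that still need their PHP
  for (std::size_t i = idx; i-- > 0;) {
    if (ops[i] == Op::PLP) {
      ++depth;
    } else if (ops[i] == Op::PHP) {
      if (depth > 0) --depth;  // this PHP closes a later PLP
      else return true;        // unmatched PHP: idx is inside a wrap
    }
  }
  return false;
}
```

For the sequence `PHP, Other, PLP, Other, PHP, Other`, indices 1 and 5 are inside a wrap and index 3 is not.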
--- a/src/llvm/lib/Target/W65816/W65816StackSlotMerge.cpp
+++ b/src/llvm/lib/Target/W65816/W65816StackSlotMerge.cpp
@ -0,0 +1,733 @@
 //===-- W65816StackSlotMerge.cpp - Merge value-equivalent stack slots ----===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===---------------------------------------------------------------------===//
 //
 // Pre-emit pass that runs after PEI (eliminateFrameIndex) and merges
 // pairs of stack-rel slots that hold the same value at every observable
 // program point — typically the PHI src/dst pair PHI-elim leaves at
 // the back-edge of a loop body.
 //
 // LLVM's StackSlotColoring merges slots with non-overlapping liveness.
 // It can't merge slots that are simultaneously live but happen to hold
 // the same value (which is what a PHI memory-copy creates).  This pass
 // catches that case via a stricter "value equivalence" check.
 //
 // Canonical pattern (sumSquares loop body):
 //
 //   .LBB0_4:
 //     LDA 0x7, s ; PHA ; JSL __umulhisi3 ; PLY
 //     CLC ; ADC 0x3, s ; STA 0xb, s     ; new total.lo (write X)
 //     TXA ; ADC 0x1, s ; STA 0x9, s
 //     LDA 0x7, s ; INC A ; STA 0x7, s
 //     LDA 0xb, s ; STA 0x3, s           ; PHI copy: load X, store Y
 //     LDA 0x9, s ; STA 0x1, s
 //     ...
 //
 // The pair (0xb, 0x3) is the lo-half PHI memory copy.  Slots 0xb and
 // 0x3 always hold the same value at every read site:
 //   - Function entry: both initialized to 0 (`lda #0; sta 0xb, s` in
 //     entry, `lda #0; sta 0x3, s` in preheader).
 //   - Loop iteration: the PHI copy moves the new total.lo from 0xb to
 //     0x3 at the end of every iteration.
 //   - Exit: only 0xb is read (return value), but its value equals 0x3's.
 //
 // Rename 0xb → 0x3 function-wide; the now self-copy `lda 0x3; sta 0x3`
 // is dead and we erase it.  Saves 2 inst per PHI copy occurrence (the
 // memory copy round-trip).  sumSquares loop body shrinks from 21 to
 // 17 inst per iter.
 //
 // Safety check (sufficient condition for value equivalence):
 //   1. Both slots have ≥1 STA in the function (skips arg slots passed
 //      by the caller — those have only LDA reads, no STAs, and renaming
 //      would change where we read the arg from).
 //   2. For every STA X in the function, find a "twin" STA Y at a
 //      program point where the values match.  Matching = either:
 //      (a) Same MBB, same A-source value (no intervening A-define).
 //          Covers the loop-body iter-end pattern: STA X then later
 //          LDA X ; STA Y.  Also covers entry's `lda #N ; sta X` if
 //          the same MBB also has `sta Y`.
 //      (b) Different MBBs, both preceded by `LDA #const` of the same
 //          constant.  Covers entry-block STA X=0 paired with
 //          preheader STA Y=0.
 //   3. Symmetric: for every STA Y, find a twin STA X.
 //   4. No "orphan" STAs.  If a STA X or STA Y has no twin, bail.
 //
 // When all checks pass, the rename function-wide preserves semantics:
 // every read of slot X at program point P sees the same value that
 // slot Y holds at P (and vice versa).
 //
 //===---------------------------------------------------------------------===//
 #include "W65816.h"
 #include "W65816InstrInfo.h"
 #include "W65816Subtarget.h"
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/CodeGen/MachineDominators.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstrBuilder.h"
 #include "llvm/InitializePasses.h"
 #include "llvm/Support/Debug.h"
 using namespace llvm;
 #define DEBUG_TYPE "w65816-stack-slot-merge"
 namespace {
 class W65816StackSlotMerge : public MachineFunctionPass {
 public:
  static char ID;
  W65816StackSlotMerge() : MachineFunctionPass(ID) {}
  StringRef getPassName() const override {
    return "W65816 merge value-equivalent stack slots (PHI-copy collapse)";
  }
  void getAnalysisUsage(AnalysisUsage &AU) const override {
    AU.addRequired<MachineDominatorTreeWrapperPass>();
    AU.setPreservesCFG();
    MachineFunctionPass::getAnalysisUsage(AU);
  }
  bool runOnMachineFunction(MachineFunction &MF) override;
 };
 } // namespace
 char W65816StackSlotMerge::ID = 0;
 INITIALIZE_PASS_BEGIN(W65816StackSlotMerge, DEBUG_TYPE,
                      "W65816 stack slot merge", false, false)
 INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
 INITIALIZE_PASS_END(W65816StackSlotMerge, DEBUG_TYPE,
                    "W65816 stack slot merge", false, false)
 FunctionPass *llvm::createW65816StackSlotMerge() {
  return new W65816StackSlotMerge();
 }
 // Stack-relative MC opcodes — the ops that survive eliminateFrameIndex
 // and reference a slot via an 8-bit SP-relative offset.
 static bool isStackRelOp(unsigned Op) {
  return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
         Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
         Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
         Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
 }
 // Returns true if MI is a stack-rel op; out-param Off receives the slot
 // offset (operand 0).
 static bool srAccess(const MachineInstr &MI, int64_t &Off) {
  if (!isStackRelOp(MI.getOpcode())) return false;
  if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
  Off = MI.getOperand(0).getImm();
  return true;
 }
 // True if the MI semantically defines A.  Covers both the explicit
 // case (operand has reg=A,isDef) AND the implicit case where the
 // tablegen InstDP / InstAbs / etc. base classes omit the A-Def
 // annotation despite LDA semantically writing A (a backend modelling
 // gap — many `LDA_DP`, `LDA_Abs`, `LDA_LongX`, etc. are missing the
 // implicit-def in the MIR even though they load into A).  Opcode-
 // based fallback catches all of them.
 static bool semanticallyDefsA(const MachineInstr &MI) {
  for (const MachineOperand &MO : MI.operands()) {
    if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
      return true;
  }
  unsigned Op = MI.getOpcode();
  switch (Op) {
  case W65816::LDA_DP:    case W65816::LDA_DPX:
  case W65816::LDA_DPInd: case W65816::LDA_DPIndY:
  case W65816::LDA_DPIndX:
  case W65816::LDA_Abs:   case W65816::LDA_AbsX:
  case W65816::LDA_AbsY:  case W65816::LDA_Long:
  case W65816::LDA_LongX:
  case W65816::PLA:
    return true;
  default:
    return false;
  }
 }
 // Walk backward from MI in its MBB looking for the most recent A-define.
 // Returns the MI that defines A, or nullptr if none in the same MBB.
 // Skips debug instructions.  Stops at the MBB boundary, calls, and
 // inline asm.
 static MachineInstr *findPriorADef(MachineInstr *MI) {
  MachineBasicBlock *MBB = MI->getParent();
  auto It = MI->getIterator();
  while (It != MBB->begin()) {
    --It;
    if (It->isDebugInstr()) continue;
    if (It->isCall() || It->isInlineAsm()) return nullptr;
    if (semanticallyDefsA(*It)) return &*It;
  }
  return nullptr;
 }
 // Walk forward from `Start` (exclusive) up to (but not including) `End`
 // in the same MBB, tracking whether slot `WatchSlot` is written.
 // Returns true if slot `WatchSlot` is NOT written in the interval.
 static bool slotNotWrittenBetween(MachineBasicBlock::iterator Start,
                                  MachineBasicBlock::iterator End,
                                  int64_t WatchSlot) {
  for (auto It = std::next(Start); It != End; ++It) {
    if (It->isDebugInstr()) continue;
    int64_t Off;
    if (It->getOpcode() == W65816::STA_StackRel && srAccess(*It, Off) &&
        Off == WatchSlot) {
      return false;
    }
  }
  return true;
 }
 // Returns true if MI clobbers P (N/Z/C/V flags).  Mirrors LLVM's
 // operand-based check + an opcode whitelist for tablegen entries that
 // omit `Defs = [P]` (InstImplied, InstStackRel, etc.).
 static bool clobbersFlagsP(const MachineInstr &MI) {
  for (const MachineOperand &MO : MI.operands()) {
    if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
      return true;
  }
  if (MI.isCall() || MI.isInlineAsm()) return true;
  unsigned Op = MI.getOpcode();
  switch (Op) {
  case W65816::PLA: case W65816::PLY: case W65816::PLX:
  case W65816::PLP:
  case W65816::INA: case W65816::DEA:
  case W65816::INX: case W65816::DEX:
  case W65816::INY: case W65816::DEY:
  case W65816::TAX: case W65816::TAY:
  case W65816::TYA: case W65816::TXA:
  case W65816::TYX: case W65816::TXY:
  case W65816::LDA_StackRel: case W65816::LDA_DP:
  case W65816::LDA_DPX: case W65816::LDA_DPInd:
  case W65816::LDA_DPIndY: case W65816::LDA_DPIndX:
  case W65816::LDA_Abs: case W65816::LDA_AbsX:
  case W65816::LDA_AbsY: case W65816::LDA_Long:
  case W65816::LDA_LongX:
  case W65816::ADC_StackRel: case W65816::SBC_StackRel:
  case W65816::CMP_StackRel: case W65816::AND_StackRel:
  case W65816::ORA_StackRel: case W65816::EOR_StackRel:
  case W65816::ADC_DP: case W65816::ADC_Abs:
  case W65816::SBC_DP: case W65816::SBC_Abs:
  case W65816::CMP_DP: case W65816::CMP_Abs:
  case W65816::AND_DP: case W65816::AND_Abs:
  case W65816::ORA_DP: case W65816::ORA_Abs:
  case W65816::EOR_DP: case W65816::EOR_Abs:
    return true;
  default:
    return false;
  }
 }
 // Returns true if MI reads P flags (conditional branches, PHP, etc.).
 static bool usesFlagsP(const MachineInstr &MI) {
  if (MI.isConditionalBranch()) return true;
  for (const MachineOperand &MO : MI.operands()) {
    if (MO.isReg() && MO.getReg() == W65816::P && MO.isUse() &&
        !MO.isDef())
      return true;
  }
  return false;
 }
 // Returns the MOST RECENT A-defining MI strictly before MI in its MBB,
 // skipping debug instructions.  Returns nullptr if none in the same MBB.
 static MachineInstr *findMostRecentADef(MachineInstr *MI) {
  MachineBasicBlock *MBB = MI->getParent();
  auto It = MI->getIterator();
  while (It != MBB->begin()) {
    --It;
    if (It->isDebugInstr()) continue;
    if (semanticallyDefsA(*It)) return &*It;
  }
  return nullptr;
 }
 // "Twin" check.  Given a STA X at position StaX and a candidate slot Y,
 // scan the function's STA Y instances and return one that's value-
 // equivalent under the rules described in the header comment.
 //
 // Source-value equivalence cases:
 //   (1) Same-MBB twin store: no A-define between StaX and the candidate
 //       StaY → both store the same A value.  Pure twin pattern.
 //   (2) Same-MBB PHI-copy: the candidate StaY is preceded by
 //       `LDA_StackRel slotX` (PHI-copy reload).  Even if many A-defines
 //       sit between StaX and StaY, the LDA X re-establishes A =
 //       slot[X] = value StaX wrote (assuming slot X wasn't re-written
 //       in the gap).
 //   (3) Different MBBs, both preceded by LDA_Imm16 / LDAi16imm of the
 //       same constant.  Covers entry/preheader init parallel pair.
 static MachineInstr *findTwin(MachineInstr *StaX,
                              ArrayRef<MachineInstr *> StasY) {
  MachineBasicBlock *MBBStaX = StaX->getParent();
  int64_t XOff = StaX->getOperand(0).getImm();
  // Cases (1) + (2): same MBB.
  for (MachineInstr *StaY : StasY) {
    if (StaY->getParent() != MBBStaX) continue;
    // Determine ordering.
    MachineInstr *Earlier = nullptr;
    MachineInstr *Later = nullptr;
    for (auto It = MBBStaX->begin(); It != MBBStaX->end(); ++It) {
      if (&*It == StaX) { Earlier = StaX; Later = StaY; break; }
      if (&*It == StaY) { Earlier = StaY; Later = StaX; break; }
    }
    if (!Earlier || !Later) continue;
    int64_t EOff = Earlier->getOperand(0).getImm();
    // Case (2): if Later is preceded by `LDA_StackRel <Earlier's slot>`
    // (the PHI-copy reload), it's a PHI twin.  Also require slot
    // Earlier-slot wasn't re-written between Earlier and Later.
    MachineInstr *PriorOfLater = findMostRecentADef(Later);
    if (PriorOfLater) {
      int64_t Off;
      if (PriorOfLater->getOpcode() == W65816::LDA_StackRel &&
          srAccess(*PriorOfLater, Off) && Off == EOff &&
          slotNotWrittenBetween(Earlier->getIterator(),
                                 PriorOfLater->getIterator(), EOff)) {
        return StaY;
      }
    }
    // Case (1): no A-define between Earlier and Later — same A value.
    {
      bool noADefs = true;
      for (auto It = std::next(Earlier->getIterator());
           It != Later->getIterator(); ++It) {
        if (It->isDebugInstr()) continue;
        if (semanticallyDefsA(*It)) { noADefs = false; break; }
      }
      if (noADefs) return StaY;
    }
  }
  // Case (3): different MBBs, both preceded by LDA_Imm16 / LDAi16imm
  // with the same constant.
  MachineInstr *PriorX = findPriorADef(StaX);
  if (!PriorX) return nullptr;
  unsigned PriorXOp = PriorX->getOpcode();
  if (PriorXOp != W65816::LDA_Imm16 && PriorXOp != W65816::LDAi16imm)
    return nullptr;
  int64_t XConst = 0;
  for (const MachineOperand &MO : PriorX->operands()) {
    if (MO.isImm()) { XConst = MO.getImm(); break; }
  }
  for (MachineInstr *StaY : StasY) {
    if (StaY->getParent() == MBBStaX) continue;
    MachineInstr *PriorY = findPriorADef(StaY);
    if (!PriorY) continue;
    if (PriorY->getOpcode() != PriorXOp) continue;
    int64_t YConst = 0;
    for (const MachineOperand &MO : PriorY->operands()) {
      if (MO.isImm()) { YConst = MO.getImm(); break; }
    }
    if (XConst == YConst) return StaY;
  }
  (void)XOff;
  return nullptr;
 }
 // Run Phase 6a + Phase 6 (per-MBB peepholes) — independent of rename
 // logic, so they fire on every function.  Returns true if anything
 // changed.
 static bool runPerMBBPeepholes(MachineFunction &MF) {
  bool Changed = false;
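  // Phase 6a: `STA Y, s` immediately followed by `LDA Y, s`.  A already
  // holds slot Y's value, so the LDA is erased when the forward scan below
  // shows its flag result is dead.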
  for (MachineBasicBlock &MBB : MF) {
    SmallVector<MachineInstr *, 4> Dead;
    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
      if (It->isDebugInstr()) continue;
      if (It->getOpcode() != W65816::STA_StackRel) continue;
      int64_t StaSlot;
      if (!srAccess(*It, StaSlot)) continue;
      auto NextIt = std::next(It);
      while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
      if (NextIt == MBB.end()) continue;
      if (NextIt->getOpcode() != W65816::LDA_StackRel) continue;
      int64_t LdaSlot;
      if (!srAccess(*NextIt, LdaSlot)) continue;
      if (StaSlot != LdaSlot) continue;
      bool flagsSafe = false;
      bool aIsUsedBeforeClobber = false;
      for (auto Fwd = std::next(NextIt); Fwd != MBB.end(); ++Fwd) {
        if (Fwd->isDebugInstr()) continue;
        // Calls/JSLs that take A as arg — even though clobbersFlagsP
        // returns true for them, the elimination could mis-track A's
        // live-in to the call.  Bail.
        if (Fwd->isCall()) break;
        // Generic: any instr that has `implicit $a` as a USE — A is
        // live going in.  Bail to avoid live-range trouble.
        for (const MachineOperand &MO : Fwd->operands()) {
          if (MO.isReg() && MO.getReg() == W65816::A && MO.isUse() &&
              !MO.isDef()) {
            aIsUsedBeforeClobber = true;
            break;
          }
        }
        if (aIsUsedBeforeClobber) break;
        if (usesFlagsP(*Fwd)) break;
        if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
          flagsSafe = true; break;
        }
        if (clobbersFlagsP(*Fwd)) { flagsSafe = true; break; }
      }
      if (!flagsSafe) continue;
      Dead.push_back(&*NextIt);
    }
    for (MachineInstr *MI : Dead) {
      MI->eraseFromParent();
      Changed = true;
    }
  }
  // Phase 6: per-MBB redundant `LDA #K` elimination.
  auto isAandPPreserving = [](const MachineInstr &MI) -> bool {
    unsigned Op = MI.getOpcode();
    switch (Op) {
    case W65816::STA_StackRel:
    case W65816::STA_DP: case W65816::STA_DPX:
    case W65816::STA_DPInd: case W65816::STA_DPIndY:
    case W65816::STA_DPIndX:
    case W65816::STA_Abs: case W65816::STA_AbsX:
    case W65816::STA_AbsY: case W65816::STA_Long:
    case W65816::STA_LongX:
    case W65816::STX_DP: case W65816::STX_Abs:
    case W65816::STY_DP: case W65816::STY_Abs: case W65816::STY_DPX:
    case W65816::STZ_DP: case W65816::STZ_Abs:
    case W65816::STZ_DPX: case W65816::STZ_AbsX:
      return true;
    default:
      break;
    }
    for (const MachineOperand &MO : MI.operands()) {
      if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
        return false;
    }
    if (MI.mayStore() && !MI.mayLoad() && !semanticallyDefsA(MI))
      return true;
    return false;
  };
  auto isLdaImmK = [](const MachineInstr &MI, int64_t &K) -> bool {
    unsigned Op = MI.getOpcode();
    if (Op != W65816::LDA_Imm16 && Op != W65816::LDAi16imm) return false;
    for (const MachineOperand &MO : MI.operands()) {
      if (MO.isImm()) { K = MO.getImm(); return true; }
    }
    return false;
  };
  for (MachineBasicBlock &MBB : MF) {
    std::optional<int64_t> KnownK;
    SmallVector<MachineInstr *, 4> Dead;
    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
      if (It->isDebugInstr()) continue;
      int64_t K;
      if (isLdaImmK(*It, K)) {
        if (KnownK && *KnownK == K) {
          Dead.push_back(&*It);
          continue;
        }
        KnownK = K;
        continue;
      }
      if (isAandPPreserving(*It)) continue;
      KnownK.reset();
    }
    for (MachineInstr *MI : Dead) {
      MI->eraseFromParent();
      Changed = true;
    }
  }
  return Changed;
 }
 bool W65816StackSlotMerge::runOnMachineFunction(MachineFunction &MF) {
  if (skipFunction(MF.getFunction())) return false;
  if (MF.getFunction().hasOptNone()) return false;
  // Run per-MBB peepholes first — independent of rename logic.
  bool peepChanged = runPerMBBPeepholes(MF);
  // Phase 1: index all stack-rel STA/LDA grouped by slot offset.
  DenseMap<int64_t, SmallVector<MachineInstr *, 4>> Stas;
  DenseMap<int64_t, SmallVector<MachineInstr *, 4>> Ldas;
  DenseMap<int64_t, unsigned> AllRefs;  // STA + LDA + ADC + ... count
  for (MachineBasicBlock &MBB : MF) {
    for (MachineInstr &MI : MBB) {
      int64_t Off;
      if (!srAccess(MI, Off)) continue;
      AllRefs[Off]++;
      if (MI.getOpcode() == W65816::STA_StackRel) {
        Stas[Off].push_back(&MI);
      } else if (MI.getOpcode() == W65816::LDA_StackRel) {
        Ldas[Off].push_back(&MI);
      }
    }
  }
  // Phase 2: find PHI-copy site candidates.  Pattern: LDA X ; STA Y
  // in a LOOP BODY MBB (= the MBB has itself as a predecessor, i.e.
  // a self-loop back-edge).  Restricting to loop bodies distinguishes
  // genuine PHI-cycle copies from one-shot temp transfers (where
  // slot X is just a scratch register dropped on the way to slot Y
  // for an unrelated purpose, like qsortIter's pointer-construction
  // pattern `STA 5; ...; LDA 5; STA 39` followed by `LDA 39; STA dp`).
  DenseMap<int64_t, int64_t> PhiCopyPair;  // X -> Y
  for (MachineBasicBlock &MBB : MF) {
    // Self-loop check: MBB must have itself as a predecessor.
    bool selfLoop = false;
    for (MachineBasicBlock *Pred : MBB.predecessors()) {
      if (Pred == &MBB) { selfLoop = true; break; }
    }
    if (!selfLoop) continue;
    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
      if (It->getOpcode() != W65816::LDA_StackRel) continue;
      int64_t X;
      if (!srAccess(*It, X)) continue;
      auto NextIt = std::next(It);
      while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
      if (NextIt == MBB.end()) continue;
      if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
      int64_t Y;
      if (!srAccess(*NextIt, Y) || Y == X) continue;
      if (PhiCopyPair.count(X)) continue;
      PhiCopyPair[X] = Y;
    }
  }
  // Phase 3: validate each pair and apply rename if safe.
  // Track which slots have already been merged so we don't double-merge.
  DenseMap<int64_t, int64_t> Renames;  // X -> Y
  for (auto &P : PhiCopyPair) {
    int64_t X = P.first, Y = P.second;
    // Don't re-merge an already-processed slot.
    if (Renames.count(X) || Renames.count(Y)) continue;
    // Arg-slot guard: skip slots with no STAs (caller-passed args).
    if (Stas[X].empty() || Stas[Y].empty()) continue;
    // Validate that every STA X has a twin STA Y.
    bool allPaired = true;
    for (MachineInstr *StaX : Stas[X]) {
      if (!findTwin(StaX, Stas[Y])) { allPaired = false; break; }
    }
    if (!allPaired) continue;
    // Symmetric: every STA Y must have a twin STA X.
    for (MachineInstr *StaY : Stas[Y]) {
      if (!findTwin(StaY, Stas[X])) { allPaired = false; break; }
    }
    if (!allPaired) continue;
    LLVM_DEBUG(dbgs() << "StackSlotMerge: rename slot " << X
                      << " -> " << Y << " in " << MF.getName() << "\n");
    Renames[X] = Y;
  }
  // No renames to apply, but the peepholes above may have changed MF.
  if (Renames.empty()) return peepChanged;
  // Phase 4: apply rename.
  bool Changed = false;
  for (MachineBasicBlock &MBB : MF) {
    SmallVector<MachineInstr *, 4> ToErase;
    for (MachineInstr &MI : MBB) {
      int64_t Off;
      if (!srAccess(MI, Off)) continue;
      auto It = Renames.find(Off);
      if (It == Renames.end()) continue;
      MI.getOperand(0).setImm(It->second);
      Changed = true;
    }
    // After rename, look for now-redundant LDA-STA pairs to the same
    // slot (the PHI-copy self-copy).  Erase them.
    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
      if (It->getOpcode() != W65816::LDA_StackRel) continue;
      int64_t LdaOff;
      if (!srAccess(*It, LdaOff)) continue;
      auto NextIt = std::next(It);
      while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
      if (NextIt == MBB.end()) continue;
      if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
      int64_t StaOff;
      if (!srAccess(*NextIt, StaOff)) continue;
      if (LdaOff != StaOff) continue;
      ToErase.push_back(&*It);
      ToErase.push_back(&*NextIt);
    }
    for (MachineInstr *MI : ToErase) MI->eraseFromParent();
    if (!ToErase.empty()) Changed = true;
  }
  // Phase 5: redundant constant-init elimination.  After rename, the
  // Case (3) twin pairings leave us with TWO sites writing the same
  // constant to the same slot (one renamed from X to Y, the other was
  // already targeting Y).  The dominated one is redundant — its slot
  // already holds the constant from the dominating write.
  //
  // Generalize: scan post-rename for ALL `LDA_Imm16 K ; STA_StackRel Y`
  // pairs (or LDAi16imm K; STA Y).  For each pair, look for another
  // such pair with the same (K, Y) where one DOMINATES the other AND
  // no slot-Y write (or call) exists on any path between them.  Erase the
  // dominated STA + its preceding LDA (if A isn't otherwise consumed).
  {
    auto isLdaImm = [](const MachineInstr &MI) {
      unsigned Op = MI.getOpcode();
      return Op == W65816::LDA_Imm16 || Op == W65816::LDAi16imm;
    };
    auto immValue = [](const MachineInstr &MI) -> int64_t {
      for (const MachineOperand &MO : MI.operands()) {
        if (MO.isImm()) return MO.getImm();
      }
      return 0;
    };
    // Collect `LDA #K ; STA_StackRel Y` pairs, grouped by Y.
    DenseMap<int64_t, SmallVector<std::pair<MachineInstr *, int64_t>, 4>>
        ConstStas;
    for (MachineBasicBlock &MBB : MF) {
      for (auto It = MBB.begin(); It != MBB.end(); ++It) {
        if (!isLdaImm(*It)) continue;
        int64_t K = immValue(*It);
        auto NextIt = std::next(It);
        while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
        if (NextIt == MBB.end()) continue;
        if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
        int64_t Y;
        if (!srAccess(*NextIt, Y)) continue;
        ConstStas[Y].push_back({&*NextIt, K});
      }
    }
    // For each slot Y with at least two const-init STAs, check for
    // dominator redundancy.
    auto &MDT = getAnalysis<MachineDominatorTreeWrapperPass>().getDomTree();
    // Check that no instruction WRITES slot Y on any path between
    // From and To.  Reads are fine because both From and To write
    // the same constant K — any intermediate read would see K either
    // way (since From dominates, From has already executed).  Calls
    // are bailout conditions: a call might write to the stack via
    // address-taken locals or other side effects we don't model.
    auto noSlotWriteOnPath = [&](MachineInstr *From, MachineInstr *To,
                                  int64_t Y) -> bool {
      MachineBasicBlock *FromMBB = From->getParent();
      MachineBasicBlock *ToMBB = To->getParent();
      auto opWritesY = [&](MachineInstr &MI) {
        if (MI.isCall() || MI.isInlineAsm()) return true;
        int64_t Off;
        if (MI.getOpcode() == W65816::STA_StackRel &&
            srAccess(MI, Off) && Off == Y) {
          return true;
        }
        return false;
      };
      // (a) After From in its MBB.
      for (auto It = std::next(From->getIterator()); It != FromMBB->end();
           ++It) {
        if (It->isDebugInstr()) continue;
        if (opWritesY(*It)) return false;
      }
      // (b) Worklist walk (DFS order) forward from FromMBB's successors,
      // stopping at ToMBB.
      SmallPtrSet<MachineBasicBlock *, 8> Visited;
      SmallVector<MachineBasicBlock *, 8> Stack;
      for (auto *Succ : FromMBB->successors()) Stack.push_back(Succ);
      while (!Stack.empty()) {
        auto *MBB = Stack.pop_back_val();
        if (MBB == ToMBB) continue;  // checked separately in (c)
        if (!Visited.insert(MBB).second) continue;
        for (auto &MI : *MBB) {
          if (MI.isDebugInstr()) continue;
          if (opWritesY(MI)) return false;
        }
        for (auto *Succ : MBB->successors()) Stack.push_back(Succ);
      }
      // (c) In ToMBB, before To, any write of Y?
      for (auto It = ToMBB->begin(); It != To->getIterator(); ++It) {
        if (It->isDebugInstr()) continue;
        if (opWritesY(*It)) return false;
      }
      return true;
    };
    SmallVector<MachineInstr *, 8> ToErase;
    LLVM_DEBUG({
      dbgs() << "Phase 5 in " << MF.getName() << ":\n";
      for (auto &P : ConstStas) {
        dbgs() << "  slot " << P.first << " has " << P.second.size()
               << " const STAs\n";
      }
    });
    for (auto &P : ConstStas) {
      int64_t Y = P.first;
      auto &stas = P.second;
      if (stas.size() < 2) continue;
      // For each pair (i, j) where i dominates j with same constant K:
      for (auto &Sj : stas) {
        MachineInstr *DominatedSta = Sj.first;
        int64_t Kj = Sj.second;
        for (auto &Si : stas) {
          if (&Si == &Sj) continue;
          if (Si.second != Kj) continue;  // different K
          MachineInstr *DominatorSta = Si.first;
          if (!MDT.dominates(DominatorSta, DominatedSta)) continue;
          if (!noSlotWriteOnPath(DominatorSta, DominatedSta, Y)) continue;
          // Flag safety: erasing `LDA #K; STA Y` removes a flag-setting
          // op (the LDA).  Walk forward from the STA looking for next
          // flag-clobber or unconditional terminator (safe) vs.
          // flag-use (unsafe).
          MachineBasicBlock *MBB = DominatedSta->getParent();
          bool flagsSafeP5 = false;
          for (auto Fwd = std::next(DominatedSta->getIterator());
               Fwd != MBB->end(); ++Fwd) {
            if (Fwd->isDebugInstr()) continue;
            if (usesFlagsP(*Fwd)) break;
            if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
              flagsSafeP5 = true; break;
            }
            if (clobbersFlagsP(*Fwd)) { flagsSafeP5 = true; break; }
          }
          if (!flagsSafeP5) continue;
          // Erase DominatedSta and its preceding LDA #K.
          auto Prev = DominatedSta->getIterator();
          while (Prev != MBB->begin()) {
            --Prev;
            if (!Prev->isDebugInstr()) break;
          }
          if (Prev != DominatedSta->getIterator() && isLdaImm(*Prev) &&
              immValue(*Prev) == Kj) {
            // The LDA and STA are adjacent (debug instrs aside), so nothing
            // consumes A between them.  Erase both.
            ToErase.push_back(&*Prev);
          }
          ToErase.push_back(DominatedSta);
          break;
        }
      }
    }
    // De-dup ToErase before erasing.
    SmallPtrSet<MachineInstr *, 8> ErasedSet;
    for (MachineInstr *MI : ToErase) {
      if (ErasedSet.insert(MI).second) {
        MI->eraseFromParent();
        Changed = true;
      }
    }
  }
  return Changed || peepChanged;
 }
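The Phase 6 `LDA #K` elimination above can be sketched on a toy instruction model. This is only an illustration under stated assumptions: `Kind`, `Inst`, and `findRedundantLdaImm` are hypothetical names, not part of the backend, and the three-way classification stands in for `isLdaImmK` / `isAandPPreserving`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical toy instruction kinds; not the real W65816 opcodes.
enum class Kind { LdaImm, Store, Other };

struct Inst {
  Kind kind;
  int64_t imm = 0;  // only meaningful for LdaImm
};

// Returns the indices of LdaImm instructions that reload a constant A
// already holds.  Stores leave A and P untouched, so the known constant
// survives them; any other instruction conservatively forgets it.
std::vector<std::size_t> findRedundantLdaImm(const std::vector<Inst> &block) {
  std::optional<int64_t> knownK;
  std::vector<std::size_t> dead;
  for (std::size_t i = 0; i < block.size(); ++i) {
    const Inst &in = block[i];
    if (in.kind == Kind::LdaImm) {
      if (knownK && *knownK == in.imm) {
        dead.push_back(i);  // same constant already in A
        continue;
      }
      knownK = in.imm;
      continue;
    }
    if (in.kind == Kind::Store) continue;  // A and P preserved
    knownK.reset();                        // unknown effect on A or P
  }
  return dead;
}
```

As in the pass, the state is per block: nothing is assumed about A on entry, so the first `LDA #K` in a block is always kept.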
--- a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
+++ b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp
@ -56,6 +56,8 @@ LLVMInitializeW65816Target() {
  initializeW65816I32IncFoldPass(PR);
  initializeW65816ImgCalleeSavePass(PR);
  initializeW65816NarrowI32MulPass(PR);
  initializeW65816PromoteFiToImgPass(PR);
  initializeW65816StackSlotMergePass(PR);
  // Default IndVarSimplify's exit-value rewriter to "never".  The
  // closed-form replacement frequently widens an i16 induction var
@ -195,14 +197,19 @@ void W65816PassConfig::addPreRegAlloc() {
 }
 void W65816PassConfig::addPostRegAlloc() {
-  // ImgCalleeSave runs FIRST so its STAfi/LDAfi pseudos go through the
-  // rest of the post-RA pipeline (SpillToX, StackSlotCleanup) normally.
-  // It detects IMG8..IMG15 usage post-regalloc and inserts prologue
-  // save + epilogue restore so those slots act as callee-saved at the
-  // asm level.  Fixes picol's `expr 1+2 == 4` bug: high-pressure
-  // recursive double fns use IMG8..IMG15 as scratch but, without this
-  // pass, expected them preserved across calls — and callees were
-  // happy to clobber them.  See W65816ImgCalleeSave.cpp.
+  // FI→IMG promotion runs FIRST.  It scans for high-traffic i16
+  // FrameIndex slots (LDAfi/STAfi/ADCfi/etc.) and rewrites them to
+  // STA_DP/LDA_DP/ADC_DP/... pointed at free IMG8..IMG15 DP slots.
+  // The introduced IMG8..15 references are then picked up by
+  // ImgCalleeSave to get prologue save + epilogue restore.  See
+  // W65816PromoteFiToImg.cpp.
+  addPass(createW65816PromoteFiToImg());
+  // ImgCalleeSave detects IMG8..IMG15 usage post-regalloc and inserts
  // prologue save + epilogue restore so those slots act as callee-
  // saved at the asm level.  Fixes picol's `expr 1+2 == 4` bug:
  // high-pressure recursive double fns use IMG8..IMG15 as scratch but,
  // without this pass, expected them preserved across calls — and
  // callees were happy to clobber them.  See W65816ImgCalleeSave.cpp.
  addPass(createW65816ImgCalleeSave());
  // SpillToX converts STA/LDA pairs to TAX/TXA bridges; StackSlotCleanup
  // then deletes still-adjacent redundant spills.  A second SpillToX
@ -264,6 +271,14 @@ void W65816PassConfig::addPreEmitPass() {
  addPass(createW65816I32IncFold());
  addPass(createW65816BranchExpand());
  addPass(createW65816SepRepCleanup());
  // Merge value-equivalent stack slots last.  Runs AFTER SepRepCleanup's
  // PHI-copy hoist so the LDA-X ; STA-Y pair has been pulled out of
  // any PHP/PLP wrap — that way the stack-rel offsets on both ops are
  // the unbumped values and offset-based slot matching is stable.
  // Saves 2 inst per PHI-copy occurrence (the memory copy round-trip
  // collapses when X and Y are renamed to the same slot).  See
  // W65816StackSlotMerge.cpp.
  addPass(createW65816StackSlotMerge());
 }
 MachineFunctionInfo *W65816TargetMachine::createMachineFunctionInfo(
--- a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
+++ b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp
@ -64,13 +64,43 @@ FunctionPass *llvm::createW65816WidenAcc16() {
  return new W65816WidenAcc16();
 }
-// Returns true if the vreg has any physreg-COPY use (e.g., return-value
-// or arg-passing setup that pins the value to a specific physreg).
-static bool flowsToPhysReg(Register VReg, const MachineRegisterInfo &MRI) {
+// Returns true if the vreg has any physreg-COPY use that would conflict
+// with Wide16 class assignment.  $a is a member of Wide16 (Wide16 = A +
+// IMG0..15), so a COPY to $a is fine — the vreg can be Wide16 and
 // regalloc will pick $a to coalesce.  $x / $y are in Idx16, NOT in
 // Wide16, so a COPY to those forces the vreg to NOT be in Wide16
 // (verifier would reject).
 static bool flowsToIncompatiblePhysReg(Register VReg,
                                       const MachineRegisterInfo &MRI) {
  for (auto &U : MRI.use_nodbg_instructions(VReg)) {
    if (!U.isCopy()) continue;
    const MachineOperand &Dst = U.getOperand(0);
-    if (Dst.isReg() && Dst.getReg().isPhysical()) return true;
+    if (!Dst.isReg() || !Dst.getReg().isPhysical()) continue;
    Register P = Dst.getReg();
    if (P == W65816::A) continue;
    if (P >= W65816::IMG0 && P <= W65816::IMG15) continue;
    return true;
  }
  return false;
 }
 // Returns true if VReg's def is a COPY from a physreg whose class is not
 // Wide16-compatible.  copyPhysReg only handles a fixed set of source/dest
 // pairs; an incompatible source physreg (e.g., DPF0, the i64-return
 // high-half carrier) lowered to an IMG dest would crash with an
 // "unhandled copyPhysReg" assertion at AsmPrinter time.  (Currently
 // only the Phase-2 PHI widening uses this; that's disabled, so mark
 // unused.)
 [[maybe_unused]] static bool comesFromIncompatiblePhysReg(Register VReg,
                                         const MachineRegisterInfo &MRI) {
  for (auto &D : MRI.def_instructions(VReg)) {
    if (!D.isCopy()) continue;
    const MachineOperand &Src = D.getOperand(1);
    if (!Src.isReg() || !Src.getReg().isPhysical()) continue;
    Register P = Src.getReg();
    if (P == W65816::A) continue;
    if (P >= W65816::IMG0 && P <= W65816::IMG15) continue;
    return true;
  }
  return false;
 }
@ -145,7 +175,7 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
    Register VReg = Register::index2VirtReg(i);
    if (MRI.def_empty(VReg)) continue;
    if (MRI.getRegClass(VReg) != &W65816::Acc16RegClass) continue;
-    if (flowsToPhysReg(VReg, MRI)) continue;
+    if (flowsToIncompatiblePhysReg(VReg, MRI)) continue;
    if (usedByPhi(VReg, MRI)) continue;
    if (!MRI.hasOneDef(VReg)) continue;  // require single SSA def
    if (!allUsesAcceptWide(VReg, MRI, *TRI, *TII)) continue;
@ -181,5 +211,212 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
    }
    Changed = true;
  }
  // Phase 2: PHI cycle widening.  EXPERIMENTAL, currently disabled —
  // see end of pass for explanation.
  #if 0
  // PHIs whose def class is Acc16 keep
  // the value pinned to $a across iterations, forcing stack spills
  // when the PHI is live across calls or other A-clobbering ops.
  // For sumSquares-style loops with an i32 accumulator, this manifests
  // as per-iter `LDA slot ; ADC ; STA slot ; LDA slot ; STA slot` (the
  // last LDA/STA pair is the PHI-back-edge copy).  If we widen the
  // PHI's def to Wide16, regalloc can keep it in an IMG slot and the
  // back-edge PHI copy collapses to a register coalesce.
  //
  // To widen a PHI:
  //   1. Compute the SCC of Acc16 vregs connected by PHI edges (PHI
  //      def ↔ PHI incoming vreg).  This catches mutually-recursive
  //      PHIs in nested loops.
  //   2. For every member: verify all non-PHI uses accept Wide16, no
  //      flow to a physreg, single def.
  //   3. For each PHI in the SCC, walk its incoming list.  Each
  //      incoming vreg is either ALREADY in the SCC (another PHI, no
  //      bridge needed) or an external Acc16 vreg whose value flows
  //      into the SCC — bridge it by inserting `WWide = COPY W` at
  //      the end of the predecessor block and pointing the PHI's
  //      incoming at WWide.
  //   4. Change every SCC member's register class to Wide16.
  auto worklistInsertIfAcc16 = [&MRI](Register V,
                                      DenseSet<Register> &Seen,
                                      SmallVectorImpl<Register> &WL) {
    if (!V.isVirtual()) return;
    if (MRI.getRegClass(V) != &W65816::Acc16RegClass) return;
    if (!Seen.insert(V).second) return;
    WL.push_back(V);
  };
  SmallVector<MachineInstr *, 16> AcctPhis;
  for (MachineBasicBlock &MBB : MF) {
    for (MachineInstr &MI : MBB.phis()) {
      Register DefV = MI.getOperand(0).getReg();
      if (MRI.getRegClass(DefV) == &W65816::Acc16RegClass) {
        AcctPhis.push_back(&MI);
      }
    }
  }
  DenseSet<Register> ProcessedPhiVregs;
  for (MachineInstr *Seed : AcctPhis) {
    Register SeedDef = Seed->getOperand(0).getReg();
    if (ProcessedPhiVregs.count(SeedDef)) continue;
    // Build SCC by following PHI edges in both directions.
    DenseSet<Register> Comp;
    SmallVector<Register, 8> Stack;
    worklistInsertIfAcc16(SeedDef, Comp, Stack);
    while (!Stack.empty()) {
      Register V = Stack.pop_back_val();
      // Forward: V flows into other PHIs as an incoming → include those PHI defs.
      for (auto &U : MRI.use_nodbg_instructions(V)) {
        if (!U.isPHI()) continue;
        Register PhiDef = U.getOperand(0).getReg();
        worklistInsertIfAcc16(PhiDef, Comp, Stack);
      }
      // Backward: if V is itself a PHI def, include the incoming vregs.
      MachineInstr *DM = &*MRI.def_instructions(V).begin();
      if (!DM || !DM->isPHI()) continue;
      for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
        MachineOperand &MO = DM->getOperand(i);
        if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
        worklistInsertIfAcc16(MO.getReg(), Comp, Stack);
      }
    }
    for (Register V : Comp) ProcessedPhiVregs.insert(V);
    // Validate every member.  PHI uses are ACCEPTED when the consumer
    // PHI is itself in the SCC (those PHIs are being widened in
    // lock-step).  Narrow-class uses (e.g., INA_PSEUDO's tied-def
    // input requires Acc16) are ALSO accepted — we'll insert a
    // Wide16→Acc16 COPY at the use site after widening.  The only
    // unrecoverable cases are: PHI uses where the consumer PHI is
    // outside the SCC (forcing cross-SCC class merging), and physreg
    // flow to $x/$y/etc. (handled separately above).
    auto usesAcceptInSCC = [&](Register V,
                               SmallVectorImpl<MachineOperand *> *NarrowSites)
        -> bool {
      for (auto &MO : MRI.use_nodbg_operands(V)) {
        MachineInstr *UMI = MO.getParent();
        if (UMI->isCopy()) continue;
        if (UMI->isPHI()) {
          Register PhiDef = UMI->getOperand(0).getReg();
          if (Comp.count(PhiDef)) continue;  // co-widened
          return false;
        }
        unsigned OpIdx = UMI->getOperandNo(&MO);
        const TargetRegisterClass *Expected =
            TII->getRegClass(UMI->getDesc(), OpIdx);
        if (!Expected) continue;
        if (Expected == &W65816::Wide16RegClass) continue;
        if (Expected->hasSubClassEq(&W65816::Wide16RegClass)) continue;
        // Expected is narrower than Wide16 (e.g., Acc16-only tied
        // input).  Mark for runtime narrowing — we'll insert a COPY
        // at apply time.
        if (NarrowSites) NarrowSites->push_back(&MO);
      }
      return true;
    };
    bool ok = true;
    SmallVector<MachineOperand *, 8> NarrowSites;
    for (Register V : Comp) {
      if (!MRI.hasOneDef(V)) { ok = false; break; }
      if (flowsToIncompatiblePhysReg(V, MRI)) { ok = false; break; }
      if (comesFromIncompatiblePhysReg(V, MRI)) { ok = false; break; }
      if (!usesAcceptInSCC(V, &NarrowSites)) { ok = false; break; }
    }
    if (!ok) continue;
    // Apply widening.  First insert bridge COPYs at predecessor edges
    // for external (non-Comp) Acc16 incomings to each PHI in Comp.
    SmallVector<std::pair<MachineInstr *, unsigned>, 16> BridgeSites;
    for (Register V : Comp) {
      MachineInstr *DM = &*MRI.def_instructions(V).begin();
      if (!DM->isPHI()) continue;
      for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
        MachineOperand &MO = DM->getOperand(i);
        if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
        Register Inc = MO.getReg();
        if (Comp.count(Inc)) continue;  // in-SCC, no bridge needed
        // External incoming: ensure it's currently Acc16; if so, we'll
        // insert a COPY at the predecessor block's end.
        if (MRI.getRegClass(Inc) != &W65816::Acc16RegClass &&
            MRI.getRegClass(Inc) != &W65816::Wide16RegClass) {
          ok = false;
          break;
        }
        BridgeSites.push_back({DM, i});
      }
      if (!ok) break;
    }
    if (!ok) continue;
    // Insert bridges.
    for (auto &Site : BridgeSites) {
      MachineInstr *PhiMI = Site.first;
      unsigned OpIdx = Site.second;
      Register Inc = PhiMI->getOperand(OpIdx).getReg();
      MachineBasicBlock *PredMBB = PhiMI->getOperand(OpIdx + 1).getMBB();
      // If already Wide16 (e.g., another candidate widened it already),
      // no bridge needed — but we still need the PHI incoming to use
      // a Wide16 vreg.  Use Inc directly.
      if (MRI.getRegClass(Inc) == &W65816::Wide16RegClass) {
        continue;
      }
      // Insert COPY before the predecessor's terminator(s).
      auto InsertPos = PredMBB->getFirstTerminator();
      DebugLoc DL = (InsertPos == PredMBB->end())
                        ? PredMBB->findBranchDebugLoc()
                        : InsertPos->getDebugLoc();
      Register WideInc = MRI.createVirtualRegister(&W65816::Wide16RegClass);
      BuildMI(*PredMBB, InsertPos, DL, TII->get(TargetOpcode::COPY),
              WideInc)
          .addReg(Inc);
      PhiMI->getOperand(OpIdx).setReg(WideInc);
      PhiMI->getOperand(OpIdx).setIsKill(false);
    }
    // Force every SCC member to Img16 (IMG-only, no A).  Using Wide16
    // (A + IMG) doesn't work here: the Register Coalescer joins our
    // Wide16 vregs with adjacent Acc16 vregs (intersection = Acc16)
    // and narrows them back to A-only, defeating the widening.  Img16
    // intersects Acc16 to ∅, so the coalescer can't merge — the PHI
    // stays in IMG.  This is correct anyway for the common case (PHI
    // live across a call): A is JSL-clobbered, so it can't carry the
    // value through, and IMG8..15 is the right home.
    for (Register V : Comp) {
      MRI.setRegClass(V, &W65816::Img16RegClass);
    }
    // Insert narrowing COPYs at each narrow-class use site.  Each site
    // is `... = OP V, ...` where the operand requires Acc16 but V is
    // now Img16.  Replace with `%Vacc = COPY V (Acc16); ... = OP %Vacc, ...`.
    for (MachineOperand *MO : NarrowSites) {
      MachineInstr *UMI = MO->getParent();
      Register OldReg = MO->getReg();
      Register NarrowReg =
          MRI.createVirtualRegister(&W65816::Acc16RegClass);
      DebugLoc DL = UMI->getDebugLoc();
      BuildMI(*UMI->getParent(), UMI, DL, TII->get(TargetOpcode::COPY),
              NarrowReg)
          .addReg(OldReg);
      MO->setReg(NarrowReg);
      MO->setIsKill(false);
    }
    Changed = true;
  }
  #endif
  // Why disabled (2026-05-13 attempt):
  // - Widening PHI cycles to Wide16 (= A + IMG0..15) is undone by the
  //   Register Coalescer: it joins our Wide16 vregs with adjacent
  //   Acc16 vregs via the bridge COPYs we insert, and the resulting
  //   joint class is `intersect(Wide16, Acc16) = Acc16`.  Net effect:
  //   no IMG, just more code through the coalescer.
  // - Switching to Img16 (= IMG0..15, no A) defeats the coalescer
  //   (intersection with Acc16 is ∅) but forces ALL widened PHIs into
  //   IMG slots even when A would be better, AND triggers cascading
  //   copyPhysReg paths that aren't all implemented (e.g., DPF0 → IMG
  //   for i64 libcall return values), aborting clang on runtime builds.
  // - A targeted fix needs either (a) a class that the coalescer
  //   refuses to join with Acc16 yet that still allows A as a member,
  //   (b) a post-coalescer pass that re-widens specific high-traffic
  //   vregs back to Img16, or (c) regalloc cost-model tuning so it
  //   prefers IMG8..15 over stack for loop-live values.
  return Changed;
 }
--- a/src/llvm/test/CodeGen/W65816/extract-wide32-regseq.s
+++ b/src/llvm/test/CodeGen/W65816/extract-wide32-regseq.s