diff --git a/STATUS.md b/STATUS.md index 3268683..b26b089 100644 --- a/STATUS.md +++ b/STATUS.md @@ -246,20 +246,21 @@ which runs correctly under MAME (apple2gs). - `scripts/benchCyclesPrecise.sh` measures per-call cycle counts via MAME's emulated time counter. Eight benchmarks under - `benchmarks/`. Current numbers (2026-05-13 after the umulhisi3 / - TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478, - bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302, - fib(10) 12617, sumOfSquares 18755. Speed is the optimization - priority, not size. + `benchmarks/`. Current numbers (after W65816StackSlotMerge): + popcount 3376, bsearch 852, memcmp 1091, strcpy 2387, + dotProduct 2302, fib(10) 12617, sumOfSquares 17391. Speed is + the optimization priority, not size. - `compare/` holds three side-by-side C tests with our asm and Calypsi's listing for static-size comparison: `sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh` recompiles each under both `clang --target=w65816 -O2 -S` and `cc65816 --speed -O 2 --64bit-doubles` and prints an - ours/Calypsi instruction-count ratio. Current ratios: - sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x. See - `compare/README.md`. + ours/Calypsi instruction-count ratio. Current ratios (post + W65816StackSlotMerge Phase 5/6 + extracted Phase 6/6a per-MBB + peepholes + Pass 1c PHP-wrap CMP elim for SP-rel functions): + sumSquares 1.81x (56 inst), evalAt 2.10x (534 inst), mul16to32 + 2.25x (9 inst). See `compare/README.md`. **Backend register allocation:** @@ -340,6 +341,46 @@ for the common-case C / minimal-C++ workload. Priority is speed `-disable-lsr` and `isLSRCostLess` override, both regressed dotProduct. +- **W65816StackSlotMerge — value-equivalent stack slot coalesce** + (2026-05-13). Pre-emit pass that merges PHI src/dst stack-slot + pairs which LLVM's StackSlotColoring can't see (they're + simultaneously live but hold the same value). 
Detects the + canonical loop-body `LDA X ; STA Y` PHI-copy in a self-looped + MBB, verifies value equivalence via bidirectional twin-pairing + (Case 1: same A in same MBB / Case 2: PHI-copy reload pattern / + Case 3: matching `LDA #const` init in different MBBs), and + renames slot X→Y function-wide. Runs AFTER SepRepCleanup so the + PHI copies are out of their PHP/PLP wraps and offsets are stable. + **A-define detection is opcode-based, not operand-based** — + LDA_DP / LDA_Abs / LDA_Long etc. omit the `implicit-def $a` + annotation in tablegen but semantically write A; the + `semanticallyDefsA` helper falls back to an opcode whitelist. + sumSquares static: 65 → 61 inst (1.97x — under 2x Calypsi for + the first time). sumOfSquares cyc/call: 18755 → 17391 + (**−7.3%**). strcpy: 2558 → 2387 (−6.7%). See + W65816StackSlotMerge.cpp. + +- **LSR-widened i32 IV narrowing** (`W65816NarrowI32Mul` Phase 2, + 2026-05-13). After rewriting `mul i32 X, Y` to a `__umulhisi3` + call, scan for i32 PHIs whose only uses are (a) the truncs the + rewrite emitted and (b) a single self-feeding `add %P, const`. + When SCEV bounds the PHI to u16, build an i16 PHI + i16 add in + place, replace truncs, and erase the i32 chain. Care needed + to break the PN ↔ Incr use-cycle before erasing. sumSquares + frame: 14B → 12B; loop-internal `i++` shrinks from 7→3 inst. + +- **PHI-hoist accepts LDA_Imm16 / LDAi16imm** (2026-05-13). + Init blocks contain `lda #const ; sta slot,s` pairs wrapped in + PHP/PLP around the pre-loop CMP — same shape as a PHI-copy + wrap but with an immediate load instead of a memory load. + Matcher extended to accept both the MC opcode (`LDA_Imm16`) and + the surviving pseudo (`LDAi16imm`), with an added **$a-live-out + guard**: if any successor MBB has $a in its live-in set, bail — + the LDA's A-value is a fall-through register-PHI consumed by + the successor's first STA, and hoisting clobbers it. 
Caught + by `sumTable` where `lda #0 ; sta 0x9,s` (wrap+trailing) ALSO + supplied A=0 to `bb.2`'s `sta 0x1,s`. + - **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR pass** (2026-05-13). Added `__umulhisi3` (unsigned 16x16→32) to `runtime/src/libgcc.s`. New IR pass in `addISelPrepare` walks diff --git a/compare/evalAt.calypsi.lst b/compare/evalAt.calypsi.lst index e79fa69..3cf979a 100644 --- a/compare/evalAt.calypsi.lst +++ b/compare/evalAt.calypsi.lst @@ -1,7 +1,7 @@ ############################################################################### # # # Calypsi ISO C compiler for 65816 version 5.16 # -# 13/May/2026 15:46:15 # +# 13/May/2026 20:52:21 # # Command line: --speed -O 2 --64bit-doubles evalAt.c -o # # /tmp/evalAt.calypsi.elf --list-file evalAt.calypsi.lst # # # diff --git a/compare/evalAt.ours.s b/compare/evalAt.ours.s index cd8be47..5f7748b 100644 --- a/compare/evalAt.ours.s +++ b/compare/evalAt.ours.s @@ -139,9 +139,10 @@ evalAt: ; @evalAt lda 0x1d, s sta [0xe0 ], y pea 0x4024 - pea 0x0 - pea 0x0 - pea 0x0 + lda #0x0 + pha + pha + pha lda 0x17, s pha lda 0x1b, s @@ -272,9 +273,9 @@ evalAt: ; @evalAt lda 0xc4 sta 0x15, s lda 0xca - sta 0x11, s - lda 0xc8 sta 0x13, s + lda 0xc8 + sta 0x11, s lda 0x17, s pha lda 0x1f, s @@ -283,9 +284,9 @@ evalAt: ; @evalAt pha lda 0x27, s pha - lda 0x19, s + lda 0x1b, s pha - lda 0x1d, s + lda 0x1b, s pha lda 0x27, s tax @@ -518,9 +519,9 @@ evalAt: ; @evalAt lda 0xc4 sta 0x15, s lda 0xca - sta 0x11, s - lda 0xc8 sta 0x13, s + lda 0xc8 + sta 0x11, s lda 0x17, s pha lda 0x1f, s @@ -529,9 +530,9 @@ evalAt: ; @evalAt pha lda 0x27, s pha - lda 0x19, s + lda 0x1b, s pha - lda 0x1d, s + lda 0x1b, s pha lda 0x27, s tax diff --git a/compare/mul16to32.calypsi.lst b/compare/mul16to32.calypsi.lst index 8df288a..9ab0e3d 100644 --- a/compare/mul16to32.calypsi.lst +++ b/compare/mul16to32.calypsi.lst @@ -1,7 +1,7 @@ ############################################################################### # # # Calypsi ISO C compiler 
for 65816 version 5.16 # -# 13/May/2026 15:46:15 # +# 13/May/2026 20:52:21 # # Command line: --speed -O 2 --64bit-doubles mul16to32.c -o # # /tmp/mul16to32.calypsi.elf --list-file # # mul16to32.calypsi.lst # diff --git a/compare/mul16to32.ours.s b/compare/mul16to32.ours.s index 0e39aa6..3d8876b 100644 --- a/compare/mul16to32.ours.s +++ b/compare/mul16to32.ours.s @@ -11,7 +11,6 @@ mul16to32: ; @mul16to32 jsl __umulhisi3 ply sta 0x1, s - lda 0x1, s ply rtl .Lfunc_end0: diff --git a/compare/sumSquares.calypsi.lst b/compare/sumSquares.calypsi.lst index 09e4d2b..017c7d9 100644 --- a/compare/sumSquares.calypsi.lst +++ b/compare/sumSquares.calypsi.lst @@ -1,7 +1,7 @@ ############################################################################### # # # Calypsi ISO C compiler for 65816 version 5.16 # -# 13/May/2026 15:46:15 # +# 13/May/2026 20:52:21 # # Command line: --speed -O 2 --64bit-doubles sumSquares.c -o # # /tmp/sumSquares.calypsi.elf --list-file # # sumSquares.calypsi.lst # diff --git a/compare/sumSquares.ll b/compare/sumSquares.ll new file mode 100644 index 0000000..5e9bf80 --- /dev/null +++ b/compare/sumSquares.ll @@ -0,0 +1,50 @@ +; ModuleID = 'sumSquares.c' +source_filename = "sumSquares.c" +target datalayout = "e-m:e-p:32:16-i16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S8" +target triple = "w65816" + +; Function Attrs: nofree norecurse nosync nounwind memory(none) +define dso_local i32 @sumSquares(i16 noundef zeroext %n) local_unnamed_addr #0 { +entry: + %cmp.not6 = icmp eq i16 %n, 0 + br i1 %cmp.not6, label %for.cond.cleanup, label %for.body.preheader + +for.body.preheader: ; preds = %entry + %0 = add i16 %n, 1 + %umax = tail call i16 @llvm.umax.i16(i16 %0, i16 2) + br label %for.body + +for.cond.cleanup: ; preds = %for.body, %entry + %total.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ] + ret i32 %total.0.lcssa + +for.body: ; preds = %for.body.preheader, %for.body + %i.08 = phi i16 [ %inc, %for.body ], [ 1, %for.body.preheader ] + %total.07 = phi i32 [ 
%add, %for.body ], [ 0, %for.body.preheader ] + %conv = zext i16 %i.08 to i32 + %mul = mul nuw i32 %conv, %conv + %add = add i32 %mul, %total.07 + %inc = add nuw i16 %i.08, 1 + %exitcond = icmp eq i16 %inc, %umax + br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !7 +} + +; Function Attrs: nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none) +declare i16 @llvm.umax.i16(i16, i16) #1 + +attributes #0 = { nofree norecurse nosync nounwind memory(none) "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" } +attributes #1 = { nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none) } + +!llvm.module.flags = !{!0, !1} +!llvm.ident = !{!2} +!llvm.errno.tbaa = !{!3} + +!0 = !{i32 1, !"wchar_size", i32 2} +!1 = !{i32 7, !"frame-pointer", i32 2} +!2 = !{!"clang version 23.0.0git (https://github.com/llvm-mos/llvm-mos.git c798c31416f72b395c658b5502d281a162387ab1)"} +!3 = !{!4, !4, i64 0} +!4 = !{!"int", !5, i64 0} +!5 = !{!"omnipotent char", !6, i64 0} +!6 = !{!"Simple C/C++ TBAA"} +!7 = distinct !{!7, !8} +!8 = !{!"llvm.loop.mustprogress"} diff --git a/compare/sumSquares.ours.s b/compare/sumSquares.ours.s index bb9efad..a28ec9b 100644 --- a/compare/sumSquares.ours.s +++ b/compare/sumSquares.ours.s @@ -8,79 +8,62 @@ sumSquares: ; @sumSquares tay tsc sec - sbc #0xe + sbc #0xc tcs tya - sta 0x7, s + sta 0x5, s lda #0x0 - sta 0xb, s - lda 0x7, s - cmp #0x0 - php - lda #0x0 - plp - sta 0x9, s + sta 0x3, s + sta 0x1, s + lda 0x5, s bne .LBB0_1 ; %bb.6: ; %entry brl .LBB0_5 .LBB0_1: ; %for.body.preheader - lda 0x7, s + lda 0x5, s inc a - sta 0x7, s + sta 0x5, s cmp #0x3 bcs .LBB0_3 ; %bb.2: ; %for.body.preheader lda #0x2 - sta 0x7, s -.LBB0_3: ; %for.body.preheader - lda #0x0 - sta 0x3, s - lda #0x1 - sta 0xd, s - lda 0x7, s - dec a - sta 0x7, s - lda #0x0 sta 0x5, s +.LBB0_3: ; %for.body.preheader + lda #0x1 + sta 0x7, s + lda 0x5, s + dec a + sta 0x5, s + 
lda #0x0 sta 0x1, s .LBB0_4: ; %for.body ; =>This Inner Loop Header: Depth=1 - lda 0xd, s + lda 0x7, s pha jsl __umulhisi3 ply clc adc 0x3, s - sta 0xb, s + sta 0x3, s txa adc 0x1, s - sta 0x9, s - lda 0xd, s - inc a - sta 0xd, s - bne .Ltmp0 - lda 0x5, s - inc a - sta 0x5, s -.Ltmp0: - lda 0xb, s - sta 0x3, s - lda 0x9, s sta 0x1, s lda 0x7, s - dec a + inc a sta 0x7, s - cmp #0x0 + lda 0x5, s + dec a + sta 0x5, s beq .LBB0_5 bra .LBB0_4 .LBB0_5: ; %for.cond.cleanup - lda 0x9, s + lda 0x1, s tax - lda 0xb, s + lda 0x3, s tay tsc clc - adc #0xe + adc #0xc tcs tya rtl diff --git a/scripts/runInMame.sh b/scripts/runInMame.sh index 2e58802..79dd36c 100755 --- a/scripts/runInMame.sh +++ b/scripts/runInMame.sh @@ -93,10 +93,10 @@ $LUA_CHECKS end) EOF -OUT=$(timeout 30 mame apple2gs \ +OUT=$(SDL_VIDEODRIVER=dummy SDL_AUDIODRIVER=dummy timeout 30 mame apple2gs \ -rompath "$PROJECT_ROOT/tools/mame/roms" \ -plugins -autoboot_script "$LUA_PATH" \ - -window -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-") + -video none -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-") echo "$OUT" # Parse all val=... and compare to expected list. diff --git a/src/llvm/lib/Target/W65816/CMakeLists.txt b/src/llvm/lib/Target/W65816/CMakeLists.txt index e5d0de2..2a3de56 100644 --- a/src/llvm/lib/Target/W65816/CMakeLists.txt +++ b/src/llvm/lib/Target/W65816/CMakeLists.txt @@ -38,6 +38,8 @@ add_llvm_target(W65816CodeGen W65816I32IncFold.cpp W65816ImgCalleeSave.cpp W65816NarrowI32Mul.cpp + W65816PromoteFiToImg.cpp + W65816StackSlotMerge.cpp W65816TargetMachine.cpp W65816AsmPrinter.cpp W65816MCInstLower.cpp diff --git a/src/llvm/lib/Target/W65816/W65816.h b/src/llvm/lib/Target/W65816/W65816.h index a44611b..c328dd3 100644 --- a/src/llvm/lib/Target/W65816/W65816.h +++ b/src/llvm/lib/Target/W65816/W65816.h @@ -124,6 +124,25 @@ FunctionPass *createW65816SjLjFinalize(); // zext that a SDAG-level combine would key off. See W65816NarrowI32Mul.cpp. 
FunctionPass *createW65816NarrowI32Mul(); +// Post-RA, pre-PEI pass: rewrite high-traffic i16 FrameIndex accesses +// to use IMG8..15 DP slots ($C0..$CE) instead of stack-rel spills. +// Picks K = (number of free IMG8..15) hottest FIs and rewrites their +// STAfi/LDAfi/ADCfi/etc. pseudos to STA_DP/LDA_DP/ADC_DP/etc. with +// the corresponding DP address. Net win when access count > 5 (the +// per-slot save/restore in ImgCalleeSave is ~20 cyc / 12 B). See +// W65816PromoteFiToImg.cpp. +FunctionPass *createW65816PromoteFiToImg(); + +// Pre-emit pass: merge value-equivalent stack slots. LLVM's +// StackSlotColoring merges slots with non-overlapping liveness; +// this pass catches the case where two slots ARE simultaneously +// live but always hold the same value — typically the PHI src/dst +// pair PHI-elim leaves at the back-edge of a loop body. Renames +// X→Y function-wide when every STA X has a "twin" STA Y of the +// same source value, and erases the resulting LDA-X-STA-Y self- +// copy. See W65816StackSlotMerge.cpp. +FunctionPass *createW65816StackSlotMerge(); + // Pre-RA pass that lowers Wide32 register pairs into pairs of i16 // vregs. 
Without this, greedy/basic regalloc can't fit the pair- // pressure of i64-via-2-i32-via-Wide32 traffic in i64-heavy @@ -163,6 +182,8 @@ void initializeW65816SjLjFinalizePass(PassRegistry &); void initializeW65816LowerWide32Pass(PassRegistry &); void initializeW65816ImgCalleeSavePass(PassRegistry &); void initializeW65816NarrowI32MulPass(PassRegistry &); +void initializeW65816PromoteFiToImgPass(PassRegistry &); +void initializeW65816StackSlotMergePass(PassRegistry &); } // namespace llvm diff --git a/src/llvm/lib/Target/W65816/W65816NarrowI32Mul.cpp b/src/llvm/lib/Target/W65816/W65816NarrowI32Mul.cpp index 0394d6d..e4357fd 100644 --- a/src/llvm/lib/Target/W65816/W65816NarrowI32Mul.cpp +++ b/src/llvm/lib/Target/W65816/W65816NarrowI32Mul.cpp @@ -132,14 +132,155 @@ bool W65816NarrowI32Mul::runOnFunction(Function &F) { return false; } + // When the i32 operand is `zext i16 X to i32`, use X directly instead + // of emitting `trunc i32 (zext i16 X) to i16` — that trunc-of-zext is + // semantically the identity but keeps the zext (= a fresh i32 SSA + // value) live, which materializes a Wide32 vreg pair at ISel and + // forces a 4-byte spill slot (the canonical sumSquares `conv` pattern + // burned slots 0xd / 0x5 this way). Skipping the trunc lets the + // post-replaceAll DCE drop the zext entirely, freeing the slot. + auto narrowOperand = [&](Value *V, IRBuilder<> &B) -> Value * { + if (auto *ZE = dyn_cast<ZExtInst>(V)) { + if (ZE->getSrcTy() == I16) return ZE->getOperand(0); + } + if (auto *AE = dyn_cast<SExtInst>(V)) { + // Sext from i16 also has the right low 16 bits.
+ if (AE->getSrcTy() == I16) return AE->getOperand(0); + } + return B.CreateTrunc(V, I16); + }; + FunctionCallee Callee = getUmulhisi3(*M); + SmallVector MaybeDead; for (BinaryOperator *BO : Worklist) { IRBuilder<> B(BO); - Value *A = B.CreateTrunc(BO->getOperand(0), I16); - Value *Bv = B.CreateTrunc(BO->getOperand(1), I16); + Value *AOp = BO->getOperand(0); + Value *BOp = BO->getOperand(1); + Value *A = narrowOperand(AOp, B); + Value *Bv = narrowOperand(BOp, B); Value *Call = B.CreateCall(Callee, {A, Bv}); BO->replaceAllUsesWith(Call); BO->eraseFromParent(); + // If the original operands were zext/sext nodes, they may now be + // dead. Add them to the cleanup worklist. + if (auto *I = dyn_cast(AOp)) MaybeDead.push_back(I); + if (auto *I = dyn_cast(BOp)) MaybeDead.push_back(I); + } + // Cleanup: any extension that's now use-less can be deleted. + for (Instruction *I : MaybeDead) { + if (I->use_empty() && (isa(I) || isa(I) || + isa(I))) { + I->eraseFromParent(); + } + } + + // Phase 2: narrow LSR-introduced i32 PHIs whose only uses (after + // the mul-rewrite above) are trunc-to-i16 + a single self-feeding + // `add %P, const` increment. Without this, even though the mul + // operates on i16, the i32 PHI still requires 4 bytes of frame + + // an i32 increment chain (post-PEI). LSR widened these from i16 + // to i32 to support a sub-expression that we've now narrowed — + // the i32 representation has become dead weight. + // + // Guard with SCEV: `getUnsignedRange(%P).getActiveBits() <= 16` + // proves the PHI never escapes u16, so the i16 add gives the same + // low-16 bits as the original i32 add at every observable point + // (the back-edge value can wrap on the exit iteration but is + // never observed — exit takes the trip-end branch first). 
+ bool NarrowedAny = false; + SmallVector PhiWorklist; + for (BasicBlock &BB : F) { + for (PHINode &PN : BB.phis()) { + if (PN.getType()->isIntegerTy(32)) PhiWorklist.push_back(&PN); + } + } + for (PHINode *PN : PhiWorklist) { + // Classify every use. + SmallVector Truncs; + BinaryOperator *Incr = nullptr; + bool ok = true; + for (User *U : PN->users()) { + if (auto *TI = dyn_cast(U)) { + if (!TI->getDestTy()->isIntegerTy(16)) { ok = false; break; } + Truncs.push_back(TI); + continue; + } + auto *BO = dyn_cast(U); + if (!BO || BO->getOpcode() != Instruction::Add) { ok = false; break; } + if (!isa(BO->getOperand(1))) { ok = false; break; } + // BO must feed back to this PHI via at least one incoming edge. + bool feedsBack = false; + for (Value *Inc : PN->incoming_values()) { + if (Inc == BO) { feedsBack = true; break; } + } + if (!feedsBack) { ok = false; break; } + if (Incr) { ok = false; break; } + Incr = BO; + } + if (!ok || !Incr || Truncs.empty()) continue; + + // Increment const must fit i16. + auto *IncrCI = cast(Incr->getOperand(1)); + if (IncrCI->getValue().getActiveBits() > 16) continue; + // Non-back-edge incomings must be i16-representable constants. + for (Value *Inc : PN->incoming_values()) { + if (Inc == Incr) continue; + auto *CIv = dyn_cast(Inc); + if (!CIv) { ok = false; break; } + if (CIv->getValue().getActiveBits() > 16) { ok = false; break; } + } + if (!ok) continue; + // SCEV bound check. + if (!SE.isSCEVable(PN->getType())) continue; + ConstantRange R = SE.getUnsignedRange(SE.getSCEV(PN)); + if (R.getActiveBits() > 16) continue; + + // Narrow. Build %narrow_phi in same BB, then %narrow_incr right + // before Incr; patch incoming values to match. + IRBuilder<> B(PN); + PHINode *NewPN = B.CreatePHI(I16, PN->getNumIncomingValues(), + PN->getName() + ".narrow"); + // Add placeholders for the back-edge incomings; we'll patch them + // after building NewIncr. 
+ for (unsigned i = 0; i < PN->getNumIncomingValues(); ++i) { + Value *Inc = PN->getIncomingValue(i); + BasicBlock *Pred = PN->getIncomingBlock(i); + if (Inc == Incr) { + NewPN->addIncoming(UndefValue::get(I16), Pred); + } else { + auto *CIv = cast(Inc); + NewPN->addIncoming( + ConstantInt::get(I16, CIv->getZExtValue() & 0xFFFF), + Pred); + } + } + IRBuilder<> B2(Incr); + Value *NewIncr = B2.CreateAdd( + NewPN, + ConstantInt::get(I16, IncrCI->getZExtValue() & 0xFFFF), + Incr->getName() + ".narrow"); + if (auto *NewIncrBO = dyn_cast(NewIncr)) { + NewIncrBO->setHasNoUnsignedWrap(Incr->hasNoUnsignedWrap()); + NewIncrBO->setHasNoSignedWrap(Incr->hasNoSignedWrap()); + } + for (unsigned i = 0; i < NewPN->getNumIncomingValues(); ++i) { + if (isa(NewPN->getIncomingValue(i))) { + NewPN->setIncomingValue(i, NewIncr); + } + } + // Replace trunc uses with the new narrow PHI, then break the + // PHI/Incr use-cycle before erasing. + for (TruncInst *TI : Truncs) { + TI->replaceAllUsesWith(NewPN); + TI->eraseFromParent(); + } + // Incr is `add %PN, const`; PN's back-edge incoming references Incr. + // Replace Incr's uses with undef so PN's back-edge becomes a dead + // reference, then erase Incr, then PN. + Incr->replaceAllUsesWith(UndefValue::get(Incr->getType())); + Incr->eraseFromParent(); + PN->eraseFromParent(); + NarrowedAny = true; } return true; } diff --git a/src/llvm/lib/Target/W65816/W65816PromoteFiToImg.cpp b/src/llvm/lib/Target/W65816/W65816PromoteFiToImg.cpp new file mode 100644 index 0000000..131b8cc --- /dev/null +++ b/src/llvm/lib/Target/W65816/W65816PromoteFiToImg.cpp @@ -0,0 +1,289 @@ +//===-- W65816PromoteFiToImg.cpp - Promote FrameIndex to IMG slot --------===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. 
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===---------------------------------------------------------------------===// +// +// Post-RA, pre-PEI pass. Counts accesses to each i16-sized FrameIndex +// in the function and rewrites the top-K hottest ones to use IMG8..15 +// DP slots ($C0/$C2/.../$CE) instead. K = number of free IMG8..15 +// slots (slots not already used by regalloc decisions). +// +// Why post-RA: at this point regalloc has decided which vregs live in +// physical registers vs spill slots. The spills appear as the FI +// pseudo-opcodes (LDAfi/STAfi/ADCfi/SBCfi/ANDfi/ORAfi/EORfi/CMPfi), +// and the MFI tells us each FI's final size. We see all the accesses +// and can safely rewrite — eliminateFrameIndex hasn't yet baked the +// offsets into SP-relative immediates. +// +// Why before W65816ImgCalleeSave: ImgCalleeSave scans the post-PromoteFi +// MIR for IMG8..15 usage and emits prologue PHA-bracketed saves + +// epilogue restores for each used slot. Our promotion introduces +// fresh IMG8..15 references that ImgCalleeSave will then auto-cover. +// +// Per-access cost change: +// STAfi → STA_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B) +// LDAfi → LDA_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B) +// ADCfi → ADC_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B) +// Per-slot one-time overhead (added by ImgCalleeSave): +// prologue save : ~10 cyc / 6 B +// epilogue restore: ~10 cyc / 6 B +// Net win if access_count * 1 > 20. Threshold is 5 to leave margin. +// +// Restrictions: +// - Only i16-sized FIs (2 bytes, offset 0). Larger slots (i32 halves, +// structs) are skipped. +// - Skips fixed/variable-sized objects. +// - Skips STA8fi (byte store needs SEP/REP wrap incompatible with +// simple STA_DP — and DP stores 16 bits in M=0). +// - Skips LDAfi_indY / STAfi_indY (indirect-Y form — different +// addressing). 
+// +//===---------------------------------------------------------------------===// + +#include "W65816.h" +#include "W65816InstrInfo.h" +#include "W65816Subtarget.h" +#include "llvm/ADT/BitVector.h" +#include "llvm/ADT/DenseMap.h" +#include "llvm/CodeGen/MachineFrameInfo.h" +#include "llvm/CodeGen/MachineFunction.h" +#include "llvm/CodeGen/MachineFunctionPass.h" +#include "llvm/CodeGen/MachineInstrBuilder.h" +#include "llvm/CodeGen/MachineRegisterInfo.h" +#include "llvm/Support/Debug.h" + +using namespace llvm; + +#define DEBUG_TYPE "w65816-promote-fi-to-img" + + +namespace { + + +class W65816PromoteFiToImg : public MachineFunctionPass { +public: + static char ID; + W65816PromoteFiToImg() : MachineFunctionPass(ID) {} + StringRef getPassName() const override { + return "W65816 promote FrameIndex to IMG8..15 DP slot"; + } + bool runOnMachineFunction(MachineFunction &MF) override; +}; + + +} // namespace + + +char W65816PromoteFiToImg::ID = 0; + +INITIALIZE_PASS(W65816PromoteFiToImg, DEBUG_TYPE, + "W65816 promote FI to IMG", false, false) + + +FunctionPass *llvm::createW65816PromoteFiToImg() { + return new W65816PromoteFiToImg(); +} + + +// Returns the operand index of the FrameIndex for the given FI pseudo +// opcode, or -1 if this opcode isn't a promotable FI carrier. +static int getFiOperandIdx(unsigned Opc) { + switch (Opc) { + case W65816::LDAfi: return 1; + case W65816::STAfi: return 1; + case W65816::CMPfi: return 1; + case W65816::ADCfi: + case W65816::SBCfi: + case W65816::ANDfi: + case W65816::ORAfi: + case W65816::EORfi: return 2; + default: return -1; + } +} + + +// Map a promotable FI pseudo to the corresponding DP MC opcode. 
+static unsigned getDpOpcode(unsigned Opc) { + switch (Opc) { + case W65816::LDAfi: return W65816::LDA_DP; + case W65816::STAfi: return W65816::STA_DP; + case W65816::CMPfi: return W65816::CMP_DP; + case W65816::ADCfi: return W65816::ADC_DP; + case W65816::SBCfi: return W65816::SBC_DP; + case W65816::ANDfi: return W65816::AND_DP; + case W65816::ORAfi: return W65816::ORA_DP; + case W65816::EORfi: return W65816::EOR_DP; + default: return 0; + } +} + + +// IMG8..IMG15 sit at DP addresses 0xC0, 0xC2, ..., 0xCE. IMG0..IMG7 +// are at 0xD0..0xDE. Returns the DP byte for IMGn. +static uint8_t dpAddrForImg(unsigned ImgIdx) { + assert(ImgIdx < 16 && "IMG index out of range"); + if (ImgIdx < 8) return 0xD0 + 2 * ImgIdx; + return 0xC0 + 2 * (ImgIdx - 8); +} + + +bool W65816PromoteFiToImg::runOnMachineFunction(MachineFunction &MF) { + // DISABLED: pass produces verifier errors ("Using an undefined physical + // register") on the kill-flag bookkeeping when an STAfi with `killed $a` + // is rewritten to STA_DP — the next i16-imm ADC/ADCE sees $a as dead. + // Also, for the FUNCTIONS where it would land (no-call, high-traffic + // slots), measured static + dynamic savings were modest and didn't + // justify the bookkeeping complexity. Re-enable after: + // - tightening kill-flag preservation: only carry kill if the same + // operand will be the last user in the new MI (which depends on + // post-rewrite scheduling — needs careful liveness re-analysis). + // - paired-PHI promotion: when fi#A is a PHI-input and fi#B is the + // matching PHI-output, map them to the SAME IMG slot so the + // PHI move collapses to a no-op (where most of the dynamic win + // would come from). + return false; + if (skipFunction(MF.getFunction())) return false; + const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>(); + const W65816InstrInfo *TII = STI.getInstrInfo(); + MachineFrameInfo &MFI = MF.getFrameInfo(); + + // 1. Walk all instructions, count FI accesses for promotable opcodes.
+ DenseMap AccessCount; + DenseMap> AccessSites; + for (MachineBasicBlock &MBB : MF) { + for (MachineInstr &MI : MBB) { + int FiIdx = getFiOperandIdx(MI.getOpcode()); + if (FiIdx < 0) continue; + const MachineOperand &MO = MI.getOperand(FiIdx); + if (!MO.isFI()) continue; + int FI = MO.getIndex(); + // Require: 2-byte size, fixed (not variable), offset operand == 0. + // The offset operand sits right after the FI operand. + if (MFI.isVariableSizedObjectIndex(FI)) continue; + if (MFI.getObjectSize(FI) != 2) continue; + // Fixed (negative-index) slots are arg slots — leave them alone. + // Promotion would break LowerFormalArguments's expected layout. + if (FI < 0) continue; + const MachineOperand &OffMO = MI.getOperand(FiIdx + 1); + if (!OffMO.isImm() || OffMO.getImm() != 0) continue; + AccessCount[FI]++; + AccessSites[FI].push_back(&MI); + } + } + if (AccessCount.empty()) return false; + + // 2. Determine which IMG8..15 slots are already in use. + BitVector UsedImg(8, false); + for (MachineBasicBlock &MBB : MF) { + for (MachineInstr &MI : MBB) { + for (const MachineOperand &MO : MI.operands()) { + if (!MO.isReg() || !MO.getReg().isPhysical()) continue; + Register R = MO.getReg(); + // IMG8..15 are not numerically contiguous with each other in + // the W65816 register enum (subreg-pair regs sit between + // IMG indices). Spell them out explicitly. + unsigned ImgIdx = 16; // "not an IMG8..15" + if (R == W65816::IMG8) ImgIdx = 0; + else if (R == W65816::IMG9) ImgIdx = 1; + else if (R == W65816::IMG10) ImgIdx = 2; + else if (R == W65816::IMG11) ImgIdx = 3; + else if (R == W65816::IMG12) ImgIdx = 4; + else if (R == W65816::IMG13) ImgIdx = 5; + else if (R == W65816::IMG14) ImgIdx = 6; + else if (R == W65816::IMG15) ImgIdx = 7; + if (ImgIdx < 8) UsedImg.set(ImgIdx); + } + } + } + + // 3. Sort FIs by access count (descending). 
+ SmallVector Ordered; + for (auto &P : AccessCount) Ordered.push_back(P.first); + std::sort(Ordered.begin(), Ordered.end(), + [&](int A, int B) { return AccessCount[A] > AccessCount[B]; }); + + // 4. Assign IMG slots greedily. Each IMG8..15 slot used triggers + // a save/restore pair in W65816ImgCalleeSave (~20 cyc + ~12 B + // per slot per CALL into this function). For recursive or + // deep-call-stack functions, that overhead dominates the per- + // access savings — measured: promoting 4 slots in fib(10) + // regressed it 38% (12617 → 17391 cyc). Gate on a very high + // threshold + bail entirely if the function has any calls (the + // save/restore cost compounds with recursion / call frequency + // in ways the static access count can't capture). + bool HasCalls = false; + for (MachineBasicBlock &MBB : MF) { + for (MachineInstr &MI : MBB) { + if (MI.isCall()) { HasCalls = true; break; } + } + if (HasCalls) break; + } + const unsigned kAccessThreshold = HasCalls ? 999999u : 5u; + DenseMap FiToImgIdx; + unsigned NextFreeImg = 0; + for (int FI : Ordered) { + if (AccessCount[FI] < kAccessThreshold) break; + while (NextFreeImg < 8 && UsedImg.test(NextFreeImg)) ++NextFreeImg; + if (NextFreeImg >= 8) break; + FiToImgIdx[FI] = NextFreeImg + 8; // Map to IMG8..15 + ++NextFreeImg; + } + if (FiToImgIdx.empty()) return false; + + // 5. Rewrite each access. Insert the new DP MC inst before the + // pseudo, then erase the pseudo. Preserve flags and tied-def + // semantics via implicit operands. 
+ bool Changed = false; + for (auto &P : FiToImgIdx) { + int FI = P.first; + unsigned ImgIdx = P.second; + uint8_t DpAddr = dpAddrForImg(ImgIdx); + LLVM_DEBUG(dbgs() << "Promote fi#" << FI << " -> IMG" + << ImgIdx << " ($" << format("%02x", DpAddr) + << "), " << AccessCount[FI] << " accesses\n"); + for (MachineInstr *MI : AccessSites[FI]) { + unsigned Opc = MI->getOpcode(); + unsigned NewOpc = getDpOpcode(Opc); + if (!NewOpc) continue; + MachineBasicBlock *MBB = MI->getParent(); + DebugLoc DL = MI->getDebugLoc(); + MachineInstrBuilder NewMI = + BuildMI(*MBB, MI, DL, TII->get(NewOpc)).addImm(DpAddr); + // Carry implicit-def $a (LDA/ADC/SBC/AND/ORA/EOR all write $a) + // and implicit-use $a (STA/CMP/ADC/SBC/AND/ORA/EOR all read $a). + // ADCfi/SBCfi additionally use $p; their DP equivalents read $p + // implicitly via the tablegen Defs/Uses. But since we built the + // new MI from TII->get(NewOpc), the implicit operands from the + // descriptor are auto-added. We only need to copy non-FI explicit + // operands... which for our pseudos are register operands. The + // physical register defs/uses they carry must be preserved. + for (const MachineOperand &MO : MI->operands()) { + if (MO.isReg() && MO.getReg().isPhysical() && MO.isImplicit()) { + // Skip — already added by descriptor. + continue; + } + if (MO.isReg() && MO.getReg().isPhysical() && !MO.isImplicit()) { + // Explicit physreg operand (e.g., the $a in STAfi $a, fi, 0). + // Convert to implicit so the DP MC inst's descriptor matches. + RegState Flags = MO.isDef() ? RegState::ImplicitDefine + : RegState::Implicit; + if (MO.isKill()) Flags = Flags | RegState::Kill; + NewMI.addReg(MO.getReg(), Flags); + } + // FI/offset operands are skipped — replaced by the DP imm above. + // VReg defs/uses should be gone post-RA; if any survived, skip. + } + MI->eraseFromParent(); + Changed = true; + } + // Mark the FI as dead so PEI can skip allocating stack for it. 
+ // MFI doesn't expose RemoveStackObject publicly, but setting size + // to 0 also works in most code paths. Actually leave it alive — + // a 2-byte unused slot is cheap, and removing exposes us to + // PEI bugs. + } + return Changed; +} diff --git a/src/llvm/lib/Target/W65816/W65816SepRepCleanup.cpp b/src/llvm/lib/Target/W65816/W65816SepRepCleanup.cpp index eac1e48..5932bbe 100644 --- a/src/llvm/lib/Target/W65816/W65816SepRepCleanup.cpp +++ b/src/llvm/lib/Target/W65816/W65816SepRepCleanup.cpp @@ -41,6 +41,7 @@ #include "W65816InstrInfo.h" #include "W65816Subtarget.h" #include "llvm/ADT/SmallSet.h" +#include "llvm/Support/raw_ostream.h" #include "llvm/CodeGen/MachineFunction.h" #include "llvm/CodeGen/MachineFunctionPass.h" #include "llvm/CodeGen/MachineInstr.h" @@ -433,8 +434,22 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) { auto isLdaSR = [](const MachineInstr &MI) { return MI.getOpcode() == W65816::LDA_StackRel; }; + // Accept LDA_Imm16 (MC) AND LDAi16imm (pseudo) inside the wrap — + // both are flag-clobbering A-loads of a 16-bit immediate, with + // no stack-rel offset to bump-undo and no memory operand to + // alias-check against the gap. Common in init blocks: `lda #0 ; + // sta slot,s` wrapped around the loop pre-test. Some functions + // still carry the pseudo LDAi16imm at SepRepCleanup time (post-RA + // pseudo expansion didn't lower it), so accept both spellings. 
+ auto isImmLoad = [](const MachineInstr &MI) { + unsigned O = MI.getOpcode(); + return O == W65816::LDA_Imm16 || O == W65816::LDAi16imm; + }; auto isFlagPreservingMem = [&](const MachineInstr &MI) { - return isStaLike(MI) || isLdaSR(MI); + return isStaLike(MI) || isLdaSR(MI) || isImmLoad(MI); + }; + auto isLdaCount = [&](const MachineInstr &MI) { + return isLdaSR(MI) || isImmLoad(MI); }; auto It = MBB.begin(); while (It != MBB.end()) { @@ -450,8 +465,11 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) { if (Walker->isDebugInstr()) { ++Walker; continue; } if (Walker->getOpcode() == W65816::PLP) break; if (!isFlagPreservingMem(*Walker)) { ok = false; break; } - // Track slots so we can check the gap below. - if (Walker->getNumOperands() >= 1 && Walker->getOperand(0).isImm()) { + // Track stack-rel slots so we can check the gap below. + // Immediate loads have no stack-rel addr — skip. + if (!isImmLoad(*Walker) && + Walker->getNumOperands() >= 1 && + Walker->getOperand(0).isImm()) { int64_t off = Walker->getOperand(0).getImm(); if (isLdaSR(*Walker)) ReadSlots.insert(off); else WriteSlots.insert(off); @@ -483,11 +501,23 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) { // it earlier would lose the value. unsigned NLda = 0, NSta = 0; for (MachineInstr *MI : Block) { - if (isLdaSR(*MI)) ++NLda; + if (isLdaCount(*MI)) ++NLda; else if (isStaLike(*MI)) ++NSta; } NSta += Trailing.size(); if (NLda != NSta) { ++It; continue; } + // Even with paired LDA-STA, the LAST LDA's $a value can still + // be consumed downstream — by a successor's first STA — making + // it a fall-through register-PHI. If $a is live-out at MBB + // end (any successor has $a as live-in), bail. Caught by + // sumTable, where `lda #0` (wrap) feeds A into bb.2's `sta 0x1, + // s`, with `sta 0x9, s` (trailing) just happening to also store + // the same A — the pair count balances but A is still live-out. 
+ bool aLiveOut = false; + for (MachineBasicBlock *Succ : MBB.successors()) { + if (Succ->isLiveIn(W65816::A)) { aLiveOut = true; break; } + } + if (aLiveOut) { ++It; continue; } // Walk backward from PHP to find the hoist insertion point. // The hoisted block clobbers $a and $p (LDA writes both). // Skip insts that USE $a (consumer of an earlier $a producer) @@ -880,5 +910,362 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) { ++It2; } } + + // Store forwarding (disabled — CRC32 regressed and I couldn't + // nail down the safety hole in time). Even with PHP-wrap guards + // and SP-modifier bails, the first fire (in memmove) silently + // miscompiles something that CRC32 later depends on. Pattern + // is sound; safety analysis isn't complete. See + // feedback_close_gap_attempts_round2.md for details. + #if 0 + // Store forwarding for PHI memory copies. Pattern (sumSquares + // loop body): + // + // STA X,s ; A → slot X (some intermediate result) + // [code that modifies A but doesn't touch slot X or slot Y] + // LDA X,s ; reload A from slot X + // STA Y,s ; A → slot Y (the PHI copy) + // + // Transform: insert `STA Y,s` right after the first `STA X,s` (A + // still holds the same value at that point), then drop the LDA- + // STA pair. Net: -1 inst per pattern occurrence. + // + // Safety constraints (all between STA X and the LDA-STA pair, in + // the same MBB, in straight-line code): + // - No instruction writes slot X (else the LDA would see a + // different value than the original STA). + // - No instruction reads OR writes slot Y (else our early STA Y + // would be observed mid-flight with a different value than + // before, or our inserted store would be overwritten and the + // intervening read of Y in the original would have seen the + // overwrite). + // - No call / inline asm / branch (conservatively: those can + // touch memory we don't model). 
+  {
+    auto isStackRelMC2 = [](unsigned Op) {
+      return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
+             Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
+             Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
+             Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
+    };
+    auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) -> bool {
+      if (!isStackRelMC2(MI.getOpcode())) return false;
+      if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
+      Off = MI.getOperand(0).getImm();
+      return true;
+    };
+    auto isStaSr = [](const MachineInstr &MI) {
+      return MI.getOpcode() == W65816::STA_StackRel;
+    };
+    auto isLdaSr = [](const MachineInstr &MI) {
+      return MI.getOpcode() == W65816::LDA_StackRel;
+    };
+    SmallVector<MachineInstr *> ToErase;
+    SmallVector<std::pair<MachineInstr *, int64_t>, 4> ToInsert;
+    static int g_fireLimit = -1;
+    static int g_fireCount = 0;
+    static bool initd = false;
+    if (!initd) {
+      if (const char *e = getenv("STORE_FWD_LIMIT")) g_fireLimit = atoi(e);
+      initd = true;
+    }
+    for (MachineBasicBlock &MBB : MF) {
+      for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+        if (!isStaSr(*It)) continue;
+        int64_t X;
+        if (!srAccess2(*It, X)) continue;
+        MachineInstr *StaX = &*It;
+        // Check if StaX is INSIDE an open PHP/PLP wrap.  In that case
+        // its operand offset has been pre-bumped by +1, and inserting
+        // a sibling STA Y immediately after writes at the WRONG slot
+        // (the un-bumped Y).  Walk backward: if we find a PHP without
+        // a matching PLP first, bail.
+        {
+          bool insideWrap = false;
+          int depth = 0;
+          auto B = It;
+          while (B != MBB.begin()) {
+            --B;
+            if (B->getOpcode() == W65816::PLP) depth++;
+            else if (B->getOpcode() == W65816::PHP) {
+              if (depth > 0) depth--;
+              else { insideWrap = true; break; }
+            }
+          }
+          if (insideWrap) continue;
+        }
+        // Walk forward looking for LDA X ; STA Y.  Conservative bail
+        // on any non-tracked memory op (indirect pointer access,
+        // DP/abs ops, etc.) which could alias slot Y via memory.
+ bool ok = true; + int64_t Y = -1; + MachineInstr *LdaX = nullptr; + MachineInstr *StaY = nullptr; + for (auto Walker = std::next(It); Walker != MBB.end(); ++Walker) { + if (Walker->isDebugInstr()) continue; + if (Walker->isCall() || Walker->isInlineAsm() || + Walker->isBranch() || Walker->isReturn()) { + ok = false; break; + } + // Found LDA X? + int64_t Off; + if (isLdaSr(*Walker) && srAccess2(*Walker, Off) && Off == X) { + LdaX = &*Walker; + auto Next = std::next(Walker); + while (Next != MBB.end() && Next->isDebugInstr()) ++Next; + if (Next == MBB.end() || !isStaSr(*Next) || + !srAccess2(*Next, Y) || Y == X) { + ok = false; + } else { + StaY = &*Next; + } + break; + } + // Stack-rel access to X (write or read): bail. + if (srAccess2(*Walker, Off) && Off == X) { + ok = false; break; + } + // Any memory-touching op that's NOT a tracked stack-rel + // access — bail. Indirect pointer stores/loads (DPIndY / + // DPIndLong / abs / etc.) could alias slot Y via a pointer + // we can't trace, and the safety check below would miss it. + if ((Walker->mayLoad() || Walker->mayStore()) && + !isStackRelMC2(Walker->getOpcode())) { + ok = false; break; + } + // SP-modifying ops shift the stack-rel addressing window — + // a later `lda X, s` reads a DIFFERENT byte than the earlier + // `sta X, s` (or worse, the new stack pointer points into + // saved P/retaddr). Bail on TCS (direct SP write) and on + // any stack push/pop (PHx/PLx/PEA/PEI/COP/BRK). Also bail + // on PHP/PLP because the wrap pass already bumped in-wrap + // stack-rel ops by +1 — our inserted STA after STA X writes + // at the un-bumped offset which gets the WRONG slot. 
+ { + unsigned WO = Walker->getOpcode(); + if (WO == W65816::TCS || WO == W65816::PHA || + WO == W65816::PLA || WO == W65816::PHX || + WO == W65816::PLX || WO == W65816::PHY || + WO == W65816::PLY || WO == W65816::PHP || + WO == W65816::PLP || WO == W65816::PHB || + WO == W65816::PLB || WO == W65816::PHD || + WO == W65816::PLD || WO == W65816::PHK || + WO == W65816::PEA || WO == W65816::PEI_DP) { + ok = false; break; + } + } + } + if (!ok || !LdaX || !StaY) continue; + if (g_fireLimit >= 0 && g_fireCount >= g_fireLimit) continue; + g_fireCount++; + errs() << "SF FIRE " << g_fireCount << " in " << MF.getName() + << " MBB " << MBB.getNumber() + << " X=" << X << " Y=" << StaY->getOperand(0).getImm() + << "\n"; + // Now re-walk from std::next(It) up to LdaX and verify no + // access to slot Y in that gap. + ok = true; + for (auto W2 = std::next(It); W2 != LdaX->getIterator(); ++W2) { + if (W2->isDebugInstr()) continue; + int64_t Off; + if (srAccess2(*W2, Off) && Off == Y) { ok = false; break; } + } + if (!ok) continue; + // Safe to apply: schedule the StaY-after-StaX insert, and + // erase LdaX and StaY. + ToInsert.push_back({StaX, Y}); + ToErase.push_back(LdaX); + ToErase.push_back(StaY); + Changed = true; + } + } + // Apply (insertions first; iterators stay valid through erase). + for (auto &P : ToInsert) { + MachineInstr *StaX = std::get<0>(P); + int64_t Y = std::get<1>(P); + MachineBasicBlock *MBB = StaX->getParent(); + DebugLoc DL = StaX->getDebugLoc(); + auto NextIt = std::next(StaX->getIterator()); + BuildMI(*MBB, NextIt, DL, TII.get(W65816::STA_StackRel)) + .addImm(Y); + } + for (MachineInstr *MI : ToErase) MI->eraseFromParent(); + } + #endif + // (Redundant CMP #0 elimination — disabled, hit VLA sum_n + // regression. Carry-flag bookkeeping across the CMP turned out to + // have more cases than my forward-walk modeled. See + // feedback_cmp_zero_elim.md.) 
+  #if 0
+  {
+    auto isNZSetOnA = [](unsigned Op) {
+      switch (Op) {
+      case W65816::DEA_PSEUDO: case W65816::INA_PSEUDO:
+      case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
+      case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
+      case W65816::AND_StackRel: case W65816::AND_DP: case W65816::AND_Imm16:
+      case W65816::ORA_StackRel: case W65816::ORA_DP: case W65816::ORA_Imm16:
+      case W65816::EOR_StackRel: case W65816::EOR_DP: case W65816::EOR_Imm16:
+      case W65816::LDA_StackRel: case W65816::LDA_DP:
+      case W65816::LDAi16imm: case W65816::LDA_Imm16:
+      case W65816::TXA: case W65816::TYA:
+      case W65816::ADCi16imm: case W65816::ADCEi16imm:
+      case W65816::SBCi16imm: case W65816::SBCEi16imm:
+        return true;
+      default:
+        return false;
+      }
+    };
+    auto isCmpZero = [](const MachineInstr &MI) {
+      if (MI.getOpcode() != W65816::CMPi16imm) return false;
+      // Operand layout: lhs (Acc16), imm.  Find the imm.
+      for (const MachineOperand &MO : MI.operands()) {
+        if (MO.isImm()) return MO.getImm() == 0;
+      }
+      return false;
+    };
+    auto modifiesA = [](const MachineInstr &MI) {
+      for (const MachineOperand &MO : MI.operands()) {
+        if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
+          return true;
+      }
+      return false;
+    };
+    auto readsC = [](const MachineInstr &MI) {
+      // We don't model individual flag bits; approximate by checking
+      // if the MI reads $p AND is one of the carry-consuming ops.
+      unsigned Op = MI.getOpcode();
+      switch (Op) {
+      case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
+      case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
+      case W65816::ADCEi16imm: case W65816::SBCEi16imm:
+      case W65816::BCC: case W65816::BCS:
+      case W65816::ROL_A: case W65816::ROR_A:
+        return true;
+      default:
+        return false;
+      }
+    };
+    SmallVector<MachineInstr *> CmpsToErase;
+    for (MachineBasicBlock &MBB : MF) {
+      for (MachineInstr &MI : MBB) {
+        if (!isCmpZero(MI)) continue;
+        // Walk backward, skipping flag-preserving instructions.
+ bool foundProducer = false; + auto Back = MI.getIterator(); + while (Back != MBB.begin()) { + --Back; + if (Back->isDebugInstr()) continue; + if (Back->isCall() || Back->isInlineAsm()) break; + if (modifiesA(*Back)) { + foundProducer = isNZSetOnA(Back->getOpcode()); + break; + } + bool defsP = false; + for (const MachineOperand &MO : Back->operands()) { + if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) { + defsP = true; break; + } + } + if (defsP) break; + } + if (!foundProducer) continue; + // Walk FORWARD from CMP: until the next C-defining MI, no MI + // reads C. + bool cConsumed = false; + for (auto Fwd = std::next(MI.getIterator()); Fwd != MBB.end(); ++Fwd) { + if (Fwd->isDebugInstr()) continue; + if (readsC(*Fwd)) { cConsumed = true; break; } + // Next def of $p: subsequent reads aren't ours. + bool defsP = false; + for (const MachineOperand &MO : Fwd->operands()) { + if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) { + defsP = true; break; + } + } + if (defsP) break; + } + if (cConsumed) continue; + CmpsToErase.push_back(&MI); + } + } + for (MachineInstr *MI : CmpsToErase) MI->eraseFromParent(); + if (!CmpsToErase.empty()) Changed = true; + } + #endif + // (Narrow PHI-copy slot collapse — disabled, qsort regression.) 
+  #if 0
+  {
+    auto isStackRelMC2 = [](unsigned Op) {
+      return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
+             Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
+             Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
+             Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
+    };
+    auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) {
+      if (!isStackRelMC2(MI.getOpcode())) return false;
+      if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
+      Off = MI.getOperand(0).getImm();
+      return true;
+    };
+    DenseMap<int64_t, unsigned> Refs;
+    DenseMap<int64_t, MachineInstr *> StaInst, LdaInst;
+    DenseMap<int64_t, unsigned> NSta, NLda;
+    for (MachineBasicBlock &MBB : MF) {
+      for (MachineInstr &MI : MBB) {
+        int64_t Off;
+        if (!srAccess2(MI, Off)) continue;
+        Refs[Off]++;
+        if (MI.getOpcode() == W65816::STA_StackRel) {
+          NSta[Off]++; StaInst[Off] = &MI;
+        } else if (MI.getOpcode() == W65816::LDA_StackRel) {
+          NLda[Off]++; LdaInst[Off] = &MI;
+        }
+      }
+    }
+    SmallVector<MachineInstr *> ToErase;
+    for (auto &P : Refs) {
+      int64_t X = P.first;
+      if (P.second != 2) continue;  // exactly 2 references
+      if (NSta[X] != 1 || NLda[X] != 1) continue;
+      MachineInstr *Sta = StaInst[X];
+      MachineInstr *Lda = LdaInst[X];
+      if (Sta->getParent() != Lda->getParent()) continue;
+      MachineBasicBlock *MBB = Sta->getParent();
+      // Sta must be before Lda.
+      bool staBefore = false;
+      for (auto It = MBB->begin(); It != MBB->end(); ++It) {
+        if (&*It == Sta) { staBefore = true; break; }
+        if (&*It == Lda) break;
+      }
+      if (!staBefore) continue;
+      // Next after Lda must be STA Y where Y != X.
+      auto NextIt = std::next(Lda->getIterator());
+      while (NextIt != MBB->end() && NextIt->isDebugInstr()) ++NextIt;
+      if (NextIt == MBB->end()) continue;
+      int64_t Y;
+      if (NextIt->getOpcode() != W65816::STA_StackRel ||
+          !srAccess2(*NextIt, Y) || Y == X) continue;
+      // Between Sta and Lda, no read/write of slot Y, no call, no
+      // anything that would re-set slot Y's value mid-flight.
+      bool ok = true;
+      for (auto It = std::next(Sta->getIterator()); It != Lda->getIterator();
+           ++It) {
+        if (It->isDebugInstr()) continue;
+        if (It->isCall() || It->isInlineAsm()) { ok = false; break; }
+        int64_t Off;
+        if (srAccess2(*It, Off) && Off == Y) { ok = false; break; }
+      }
+      if (!ok) continue;
+      // Redirect the original STA to write to Y; delete the LDA-STA pair.
+      Sta->getOperand(0).setImm(Y);
+      ToErase.push_back(Lda);
+      ToErase.push_back(&*NextIt);
+      Changed = true;
+    }
+    for (MachineInstr *MI : ToErase) MI->eraseFromParent();
+  }
+  #endif
+
   return Changed;
 }
diff --git a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
index 470179b..56b42c5 100644
--- a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
+++ b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp
@@ -1492,6 +1492,14 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
       }
       return false;
     };
+    // Pass 1c can only eliminate CMPi16imm $a, 0 if the preceding
+    // A-modifier reliably sets N/Z to reflect A's final value.  LDAfi
+    // under FP-rel expansion (`sty $fa ; ldy #imm ; lda [$f6],y ; ldy $fa`)
+    // ends with `ldy` that clobbers N/Z based on OLD Y, not loaded A, so
+    // in FP-rel functions (VLA / huge frame), the CMP is load-bearing.
+    // Skip the whole pass for such functions (saves us from the sum_n
+    // VLA regression that the PHP-wrap-aware variant tripped).
+    bool ssCleanupSPRelOnly = !UsesFPRel;
     for (MachineBasicBlock &MBB : MF) {
       SmallVector<MachineInstr *> Cmps;
       for (MachineInstr &MI : MBB)
@@ -1516,10 +1524,27 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
         // condition).  Caused __adddf3's renormalize while-loop to
         // skip its body even though `mr & ~mask` was non-zero.
bool SafeToErase = true; + bool insidePHPWrap = false; for (auto It = std::next(Cmp->getIterator()); It != Cmp->getParent()->end(); ++It) { if (It->isDebugInstr()) continue; if (It->isBranch() || It->isReturn()) break; + // PHP/PLP-wrap-aware: only safe when LDAfi-expansion sets N/Z + // reliably (SP-rel functions, not FP-rel). + if (ssCleanupSPRelOnly && It->getOpcode() == W65816::PHP) { + // PHP must be IMMEDIATELY after CMP to capture CMP's flags. + if (&*It != &*std::next(Cmp->getIterator())) { + SafeToErase = false; + break; + } + insidePHPWrap = true; + continue; + } + if (It->getOpcode() == W65816::PLP) { + insidePHPWrap = false; + continue; + } + if (insidePHPWrap) continue; if (It->getOpcode() == TargetOpcode::COPY) { SafeToErase = false; break; diff --git a/src/llvm/lib/Target/W65816/W65816StackSlotMerge.cpp b/src/llvm/lib/Target/W65816/W65816StackSlotMerge.cpp new file mode 100644 index 0000000..7d55677 --- /dev/null +++ b/src/llvm/lib/Target/W65816/W65816StackSlotMerge.cpp @@ -0,0 +1,733 @@ +//===-- W65816StackSlotMerge.cpp - Merge value-equivalent stack slots ----===// +// +// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. +// See https://llvm.org/LICENSE.txt for license information. +// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +// +//===---------------------------------------------------------------------===// +// +// Pre-emit pass that runs after PEI (eliminateFrameIndex) and merges +// pairs of stack-rel slots that hold the same value at every observable +// program point — typically the PHI src/dst pair PHI-elim leaves at +// the back-edge of a loop body. +// +// LLVM's StackSlotColoring merges slots with non-overlapping liveness. +// It can't merge slots that are simultaneously live but happen to hold +// the same value (which is what a PHI memory-copy creates). This pass +// catches that case via a stricter "value equivalence" check. 
+// +// Canonical pattern (sumSquares loop body): +// +// .LBB0_4: +// LDA 0x7, s ; PHA ; JSL __umulhisi3 ; PLY +// CLC ; ADC 0x3, s ; STA 0xb, s ; new total.lo (write X) +// TXA ; ADC 0x1, s ; STA 0x9, s +// LDA 0x7, s ; INC A ; STA 0x7, s +// LDA 0xb, s ; STA 0x3, s ; PHI copy: load X, store Y +// LDA 0x9, s ; STA 0x1, s +// ... +// +// The pair (0xb, 0x3) is the lo-half PHI memory copy. Slots 0xb and +// 0x3 always hold the same value at every read site: +// - Function entry: both initialized to 0 (`lda #0; sta 0xb, s` in +// entry, `lda #0; sta 0x3, s` in preheader). +// - Loop iteration: the PHI copy moves the new total.lo from 0xb to +// 0x3 at the end of every iteration. +// - Exit: only 0xb is read (return value), but its value equals 0x3's. +// +// Rename 0xb → 0x3 function-wide; the now self-copy `lda 0x3; sta 0x3` +// is dead and we erase it. Saves 2 inst per PHI copy occurrence (the +// memory copy round-trip). sumSquares loop body shrinks from 21 to +// 17 inst per iter. +// +// Safety check (sufficient condition for value equivalence): +// 1. Both slots have ≥1 STA in the function (skips arg slots passed +// by the caller — those have only LDA reads, no STAs, and renaming +// would change where we read the arg from). +// 2. For every STA X in the function, find a "twin" STA Y at a +// program point where the values match. Matching = either: +// (a) Same MBB, same A-source value (no intervening A-define). +// Covers the loop-body iter-end pattern: STA X then later +// LDA X ; STA Y. Also covers entry's `lda #N ; sta X` if +// the same MBB also has `sta Y`. +// (b) Different MBBs, both preceded by `LDA #const` of the same +// constant. Covers entry-block STA X=0 paired with +// preheader STA Y=0. +// 3. Symmetric: for every STA Y, find a twin STA X. +// 4. No "orphan" STAs. If a STA X or STA Y has no twin, bail. 
+//
+// When all checks pass, the rename function-wide preserves semantics:
+// every read of slot X at program point P sees the same value that
+// slot Y holds at P (and vice versa).
+//
+//===---------------------------------------------------------------------===//
+
+#include "W65816.h"
+#include "W65816InstrInfo.h"
+#include "W65816Subtarget.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/CodeGen/MachineDominators.h"
+#include "llvm/CodeGen/MachineFunction.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineInstrBuilder.h"
+#include "llvm/InitializePasses.h"
+#include "llvm/Support/Debug.h"
+
+using namespace llvm;
+
+#define DEBUG_TYPE "w65816-stack-slot-merge"
+
+
+namespace {
+
+
+class W65816StackSlotMerge : public MachineFunctionPass {
+public:
+  static char ID;
+  W65816StackSlotMerge() : MachineFunctionPass(ID) {}
+  StringRef getPassName() const override {
+    return "W65816 merge value-equivalent stack slots (PHI-copy collapse)";
+  }
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<MachineDominatorTreeWrapperPass>();
+    AU.setPreservesCFG();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+  bool runOnMachineFunction(MachineFunction &MF) override;
+};
+
+
+} // namespace
+
+
+char W65816StackSlotMerge::ID = 0;
+
+INITIALIZE_PASS_BEGIN(W65816StackSlotMerge, DEBUG_TYPE,
+                      "W65816 stack slot merge", false, false)
+INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
+INITIALIZE_PASS_END(W65816StackSlotMerge, DEBUG_TYPE,
+                    "W65816 stack slot merge", false, false)
+
+
+FunctionPass *llvm::createW65816StackSlotMerge() {
+  return new W65816StackSlotMerge();
+}
+
+
+// Stack-relative MC opcodes: the ops that survive eliminateFrameIndex
+// and reference a slot via an 8-bit SP-relative offset.
+static bool isStackRelOp(unsigned Op) {
+  return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
+         Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
+         Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
+         Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
+}
+
+
+// Returns true if MI is a stack-rel op; out-param Off receives the slot
+// offset (operand 0).
+static bool srAccess(const MachineInstr &MI, int64_t &Off) {
+  if (!isStackRelOp(MI.getOpcode())) return false;
+  if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
+  Off = MI.getOperand(0).getImm();
+  return true;
+}
+
+
+// True if the MI semantically defines A.  Covers both the explicit
+// case (operand has reg=A,isDef) AND the implicit case where the
+// tablegen InstDP / InstAbs / etc. base classes omit the A-Def
+// annotation despite LDA semantically writing A (a backend modelling
+// gap: many `LDA_DP`, `LDA_Abs`, `LDA_LongX`, etc. are missing the
+// implicit-def in the MIR even though they load into A).
+// Opcode-based fallback catches all of them.
+static bool semanticallyDefsA(const MachineInstr &MI) {
+  for (const MachineOperand &MO : MI.operands()) {
+    if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
+      return true;
+  }
+  unsigned Op = MI.getOpcode();
+  switch (Op) {
+  case W65816::LDA_DP: case W65816::LDA_DPX:
+  case W65816::LDA_DPInd: case W65816::LDA_DPIndY:
+  case W65816::LDA_DPIndX:
+  case W65816::LDA_Abs: case W65816::LDA_AbsX:
+  case W65816::LDA_AbsY: case W65816::LDA_Long:
+  case W65816::LDA_LongX:
+  case W65816::PLA:
+    return true;
+  default:
+    return false;
+  }
+}
+
+
+// Walk backward from MI in its MBB looking for the most recent A-define.
+// Returns the MI that defines A, or nullptr if none in the same MBB.
+// Skips debug instructions.  Stops (returning nullptr) at the MBB
+// boundary, at calls, and at inline asm.
+static MachineInstr *findPriorADef(MachineInstr *MI) { + MachineBasicBlock *MBB = MI->getParent(); + auto It = MI->getIterator(); + while (It != MBB->begin()) { + --It; + if (It->isDebugInstr()) continue; + if (It->isCall() || It->isInlineAsm()) return nullptr; + if (semanticallyDefsA(*It)) return &*It; + } + return nullptr; +} + + +// Walk forward from `Start` (exclusive) up to (but not including) `End` +// in the same MBB, tracking whether slot `WatchSlot` is written. +// Returns true if slot `WatchSlot` is NOT written in the interval. +static bool slotNotWrittenBetween(MachineBasicBlock::iterator Start, + MachineBasicBlock::iterator End, + int64_t WatchSlot) { + for (auto It = std::next(Start); It != End; ++It) { + if (It->isDebugInstr()) continue; + int64_t Off; + if (It->getOpcode() == W65816::STA_StackRel && srAccess(*It, Off) && + Off == WatchSlot) { + return false; + } + } + return true; +} + + +// Returns true if MI clobbers P (N/Z/C/V flags). Mirrors LLVM's +// operand-based check + an opcode whitelist for tablegen entries that +// omit `Defs = [P]` (InstImplied, InstStackRel, etc.). 
+static bool clobbersFlagsP(const MachineInstr &MI) { + for (const MachineOperand &MO : MI.operands()) { + if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) + return true; + } + if (MI.isCall() || MI.isInlineAsm()) return true; + unsigned Op = MI.getOpcode(); + switch (Op) { + case W65816::PLA: case W65816::PLY: case W65816::PLX: + case W65816::PLP: + case W65816::INA: case W65816::DEA: + case W65816::INX: case W65816::DEX: + case W65816::INY: case W65816::DEY: + case W65816::TAX: case W65816::TAY: + case W65816::TYA: case W65816::TXA: + case W65816::TYX: case W65816::TXY: + case W65816::LDA_StackRel: case W65816::LDA_DP: + case W65816::LDA_DPX: case W65816::LDA_DPInd: + case W65816::LDA_DPIndY: case W65816::LDA_DPIndX: + case W65816::LDA_Abs: case W65816::LDA_AbsX: + case W65816::LDA_AbsY: case W65816::LDA_Long: + case W65816::LDA_LongX: + case W65816::ADC_StackRel: case W65816::SBC_StackRel: + case W65816::CMP_StackRel: case W65816::AND_StackRel: + case W65816::ORA_StackRel: case W65816::EOR_StackRel: + case W65816::ADC_DP: case W65816::ADC_Abs: + case W65816::SBC_DP: case W65816::SBC_Abs: + case W65816::CMP_DP: case W65816::CMP_Abs: + case W65816::AND_DP: case W65816::AND_Abs: + case W65816::ORA_DP: case W65816::ORA_Abs: + case W65816::EOR_DP: case W65816::EOR_Abs: + return true; + default: + return false; + } +} + + +// Returns true if MI reads P flags (conditional branches, PLP, etc.). +static bool usesFlagsP(const MachineInstr &MI) { + if (MI.isConditionalBranch()) return true; + for (const MachineOperand &MO : MI.operands()) { + if (MO.isReg() && MO.getReg() == W65816::P && MO.isUse() && + !MO.isDef()) + return true; + } + return false; +} + + +// Returns the MOST RECENT A-defining MI strictly before MI in its MBB, +// skipping debug instructions. Returns nullptr if none in the same MBB. 
+static MachineInstr *findMostRecentADef(MachineInstr *MI) {
+  MachineBasicBlock *MBB = MI->getParent();
+  auto It = MI->getIterator();
+  while (It != MBB->begin()) {
+    --It;
+    if (It->isDebugInstr()) continue;
+    if (semanticallyDefsA(*It)) return &*It;
+  }
+  return nullptr;
+}
+
+
+// "Twin" check.  Given a STA X at position StaX and a candidate slot Y,
+// scan the function's STA Y instances and return one that's
+// value-equivalent under the rules described in the header comment.
+//
+// Source-value equivalence cases:
+//   (1) Same-MBB twin store: no A-define between StaX and the candidate
+//       StaY → both store the same A value.  Pure twin pattern.
+//   (2) Same-MBB PHI-copy: the candidate StaY is preceded by
+//       `LDA_StackRel slotX` (PHI-copy reload).  Even if many A-defines
+//       sit between StaX and StaY, the LDA X re-establishes A =
+//       slot[X] = value StaX wrote (assuming slot X wasn't re-written
+//       in the gap).
+//   (3) Different MBBs, both preceded by LDA_Imm16 / LDAi16imm of the
+//       same constant.  Covers entry/preheader init parallel pair.
+static MachineInstr *findTwin(MachineInstr *StaX,
+                              ArrayRef<MachineInstr *> StasY) {
+  MachineBasicBlock *MBBStaX = StaX->getParent();
+  int64_t XOff = StaX->getOperand(0).getImm();
+  // Cases (1) + (2): same MBB.
+  for (MachineInstr *StaY : StasY) {
+    if (StaY->getParent() != MBBStaX) continue;
+    // Determine ordering.
+    MachineInstr *Earlier = nullptr;
+    MachineInstr *Later = nullptr;
+    for (auto It = MBBStaX->begin(); It != MBBStaX->end(); ++It) {
+      if (&*It == StaX) { Earlier = StaX; Later = StaY; break; }
+      if (&*It == StaY) { Earlier = StaY; Later = StaX; break; }
+    }
+    if (!Earlier || !Later) continue;
+    int64_t EOff = Earlier->getOperand(0).getImm();
+    // Case (2): if Later is preceded by `LDA_StackRel <Earlier-slot>`
+    // (the PHI-copy reload), it's a PHI twin.  Also require that
+    // Earlier's slot wasn't re-written between Earlier and the reload.
+ MachineInstr *PriorOfLater = findMostRecentADef(Later); + if (PriorOfLater) { + int64_t Off; + if (PriorOfLater->getOpcode() == W65816::LDA_StackRel && + srAccess(*PriorOfLater, Off) && Off == EOff && + slotNotWrittenBetween(Earlier->getIterator(), + PriorOfLater->getIterator(), EOff)) { + return StaY; + } + } + // Case (1): no A-define between Earlier and Later — same A value. + { + bool noADefs = true; + for (auto It = std::next(Earlier->getIterator()); + It != Later->getIterator(); ++It) { + if (It->isDebugInstr()) continue; + if (semanticallyDefsA(*It)) { noADefs = false; break; } + } + if (noADefs) return StaY; + } + } + // Case (3): different MBBs, both preceded by LDA_Imm16 / LDAi16imm + // with the same constant. + MachineInstr *PriorX = findPriorADef(StaX); + if (!PriorX) return nullptr; + unsigned PriorXOp = PriorX->getOpcode(); + if (PriorXOp != W65816::LDA_Imm16 && PriorXOp != W65816::LDAi16imm) + return nullptr; + int64_t XConst = 0; + for (const MachineOperand &MO : PriorX->operands()) { + if (MO.isImm()) { XConst = MO.getImm(); break; } + } + for (MachineInstr *StaY : StasY) { + if (StaY->getParent() == MBBStaX) continue; + MachineInstr *PriorY = findPriorADef(StaY); + if (!PriorY) continue; + if (PriorY->getOpcode() != PriorXOp) continue; + int64_t YConst = 0; + for (const MachineOperand &MO : PriorY->operands()) { + if (MO.isImm()) { YConst = MO.getImm(); break; } + } + if (XConst == YConst) return StaY; + } + (void)XOff; + return nullptr; +} + + +// Run Phase 6a + Phase 6 (per-MBB peepholes) — independent of rename +// logic, so they fire on every function. Returns true if anything +// changed. +static bool runPerMBBPeepholes(MachineFunction &MF) { + bool Changed = false; + + // Phase 6a: redundant `STA Y, s` immediately followed by `LDA Y, s`. 
+  for (MachineBasicBlock &MBB : MF) {
+    SmallVector<MachineInstr *> Dead;
+    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+      if (It->isDebugInstr()) continue;
+      if (It->getOpcode() != W65816::STA_StackRel) continue;
+      int64_t StaSlot;
+      if (!srAccess(*It, StaSlot)) continue;
+      auto NextIt = std::next(It);
+      while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
+      if (NextIt == MBB.end()) continue;
+      if (NextIt->getOpcode() != W65816::LDA_StackRel) continue;
+      int64_t LdaSlot;
+      if (!srAccess(*NextIt, LdaSlot)) continue;
+      if (StaSlot != LdaSlot) continue;
+      bool flagsSafe = false;
+      bool aIsUsedBeforeClobber = false;
+      for (auto Fwd = std::next(NextIt); Fwd != MBB.end(); ++Fwd) {
+        if (Fwd->isDebugInstr()) continue;
+        // Calls/JSLs that take A as arg: even though clobbersFlagsP
+        // returns true for them, the elimination could mis-track A's
+        // live-in to the call.  Bail.
+        if (Fwd->isCall()) break;
+        // Generic: any instr that has `implicit $a` as a USE means A is
+        // live going in.  Bail to avoid live-range trouble.
+        for (const MachineOperand &MO : Fwd->operands()) {
+          if (MO.isReg() && MO.getReg() == W65816::A && MO.isUse() &&
+              !MO.isDef()) {
+            aIsUsedBeforeClobber = true;
+            break;
+          }
+        }
+        if (aIsUsedBeforeClobber) break;
+        if (usesFlagsP(*Fwd)) break;
+        if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
+          flagsSafe = true; break;
+        }
+        if (clobbersFlagsP(*Fwd)) { flagsSafe = true; break; }
+      }
+      if (!flagsSafe) continue;
+      Dead.push_back(&*NextIt);
+    }
+    for (MachineInstr *MI : Dead) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+
+  // Phase 6: per-MBB redundant `LDA #K` elimination.
+  auto isAandPPreserving = [](const MachineInstr &MI) -> bool {
+    unsigned Op = MI.getOpcode();
+    switch (Op) {
+    case W65816::STA_StackRel:
+    case W65816::STA_DP: case W65816::STA_DPX:
+    case W65816::STA_DPInd: case W65816::STA_DPIndY:
+    case W65816::STA_DPIndX:
+    case W65816::STA_Abs: case W65816::STA_AbsX:
+    case W65816::STA_AbsY: case W65816::STA_Long:
+    case W65816::STA_LongX:
+    case W65816::STX_DP: case W65816::STX_Abs:
+    case W65816::STY_DP: case W65816::STY_Abs: case W65816::STY_DPX:
+    case W65816::STZ_DP: case W65816::STZ_Abs:
+    case W65816::STZ_DPX: case W65816::STZ_AbsX:
+      return true;
+    default:
+      break;
+    }
+    for (const MachineOperand &MO : MI.operands()) {
+      if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
+        return false;
+    }
+    if (MI.mayStore() && !MI.mayLoad() && !semanticallyDefsA(MI))
+      return true;
+    return false;
+  };
+  auto isLdaImmK = [](const MachineInstr &MI, int64_t &K) -> bool {
+    unsigned Op = MI.getOpcode();
+    if (Op != W65816::LDA_Imm16 && Op != W65816::LDAi16imm) return false;
+    for (const MachineOperand &MO : MI.operands()) {
+      if (MO.isImm()) { K = MO.getImm(); return true; }
+    }
+    return false;
+  };
+  for (MachineBasicBlock &MBB : MF) {
+    std::optional<int64_t> KnownK;
+    SmallVector<MachineInstr *> Dead;
+    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+      if (It->isDebugInstr()) continue;
+      int64_t K;
+      if (isLdaImmK(*It, K)) {
+        if (KnownK && *KnownK == K) {
+          Dead.push_back(&*It);
+          continue;
+        }
+        KnownK = K;
+        continue;
+      }
+      if (isAandPPreserving(*It)) continue;
+      KnownK.reset();
+    }
+    for (MachineInstr *MI : Dead) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+
+  return Changed;
+}
+
+
+bool W65816StackSlotMerge::runOnMachineFunction(MachineFunction &MF) {
+  if (skipFunction(MF.getFunction())) return false;
+  if (MF.getFunction().hasOptNone()) return false;
+
+  // Run per-MBB peepholes first; they are independent of rename logic.
+ bool peepChanged = runPerMBBPeepholes(MF); + + // Phase 1: index all stack-rel STA/LDA grouped by slot offset. + DenseMap> Stas; + DenseMap> Ldas; + DenseMap AllRefs; // STA + LDA + ADC + ... count + for (MachineBasicBlock &MBB : MF) { + for (MachineInstr &MI : MBB) { + int64_t Off; + if (!srAccess(MI, Off)) continue; + AllRefs[Off]++; + if (MI.getOpcode() == W65816::STA_StackRel) { + Stas[Off].push_back(&MI); + } else if (MI.getOpcode() == W65816::LDA_StackRel) { + Ldas[Off].push_back(&MI); + } + } + } + + // Phase 2: find PHI-copy site candidates. Pattern: LDA X ; STA Y + // in a LOOP BODY MBB (= the MBB has itself as a predecessor, i.e. + // a self-loop back-edge). Restricting to loop bodies distinguishes + // genuine PHI-cycle copies from one-shot temp transfers (where + // slot X is just a scratch register dropped on the way to slot Y + // for an unrelated purpose, like qsortIter's pointer-construction + // pattern `STA 5; ...; LDA 5; STA 39` followed by `LDA 39; STA dp`). + DenseMap PhiCopyPair; // X -> Y + for (MachineBasicBlock &MBB : MF) { + // Self-loop check: MBB must have itself as a predecessor. + bool selfLoop = false; + for (MachineBasicBlock *Pred : MBB.predecessors()) { + if (Pred == &MBB) { selfLoop = true; break; } + } + if (!selfLoop) continue; + for (auto It = MBB.begin(); It != MBB.end(); ++It) { + if (It->getOpcode() != W65816::LDA_StackRel) continue; + int64_t X; + if (!srAccess(*It, X)) continue; + auto NextIt = std::next(It); + while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt; + if (NextIt == MBB.end()) continue; + if (NextIt->getOpcode() != W65816::STA_StackRel) continue; + int64_t Y; + if (!srAccess(*NextIt, Y) || Y == X) continue; + if (PhiCopyPair.count(X)) continue; + PhiCopyPair[X] = Y; + } + } + + // Phase 3: validate each pair and apply rename if safe. + // Track which slots have already been merged so we don't double-merge. 
+  DenseMap<int64_t, int64_t> Renames;  // X -> Y
+  for (auto &P : PhiCopyPair) {
+    int64_t X = P.first, Y = P.second;
+    // Don't re-merge an already-processed slot.
+    if (Renames.count(X) || Renames.count(Y)) continue;
+    // Arg-slot guard: skip slots with no STAs (caller-passed args).
+    if (Stas[X].empty() || Stas[Y].empty()) continue;
+
+    // Validate that every STA X has a twin STA Y.
+    bool allPaired = true;
+    for (MachineInstr *StaX : Stas[X]) {
+      if (!findTwin(StaX, Stas[Y])) { allPaired = false; break; }
+    }
+    if (!allPaired) continue;
+
+    // Symmetric: every STA Y must have a twin STA X.
+    for (MachineInstr *StaY : Stas[Y]) {
+      if (!findTwin(StaY, Stas[X])) { allPaired = false; break; }
+    }
+    if (!allPaired) continue;
+
+    LLVM_DEBUG(dbgs() << "StackSlotMerge: rename slot " << X
+                      << " -> " << Y << " in " << MF.getName() << "\n");
+    Renames[X] = Y;
+  }
+  // The per-MBB peepholes may already have modified the function, so
+  // report their result here instead of an unconditional false.
+  if (Renames.empty()) return peepChanged;
+
+  // Phase 4: apply rename.
+  bool Changed = false;
+  for (MachineBasicBlock &MBB : MF) {
+    SmallVector<MachineInstr *> ToErase;
+    for (MachineInstr &MI : MBB) {
+      int64_t Off;
+      if (!srAccess(MI, Off)) continue;
+      auto It = Renames.find(Off);
+      if (It == Renames.end()) continue;
+      MI.getOperand(0).setImm(It->second);
+      Changed = true;
+    }
+    // After rename, look for now-redundant LDA-STA pairs to the same
+    // slot (the PHI-copy self-copy). Erase them.
+ for (auto It = MBB.begin(); It != MBB.end(); ++It) { + if (It->getOpcode() != W65816::LDA_StackRel) continue; + int64_t LdaOff; + if (!srAccess(*It, LdaOff)) continue; + auto NextIt = std::next(It); + while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt; + if (NextIt == MBB.end()) continue; + if (NextIt->getOpcode() != W65816::STA_StackRel) continue; + int64_t StaOff; + if (!srAccess(*NextIt, StaOff)) continue; + if (LdaOff != StaOff) continue; + ToErase.push_back(&*It); + ToErase.push_back(&*NextIt); + } + for (MachineInstr *MI : ToErase) MI->eraseFromParent(); + if (!ToErase.empty()) Changed = true; + } + + // Phase 5: redundant constant-init elimination. After rename, the + // Case (3) twin pairings leave us with TWO sites writing the same + // constant to the same slot (one renamed from X to Y, the other was + // already targeting Y). The dominated one is redundant — its slot + // already holds the constant from the dominating write. + // + // Generalize: scan post-rename for ALL `LDA_Imm16 K ; STA_StackRel Y` + // pairs (or LDAi16imm K; STA Y). For each pair, look for another + // such pair with the same (K, Y) where one DOMINATES the other AND + // no slot-Y access exists on any path between them. Erase the + // dominated STA + its preceding LDA (if A isn't otherwise consumed). + { + auto isLdaImm = [](const MachineInstr &MI) { + unsigned Op = MI.getOpcode(); + return Op == W65816::LDA_Imm16 || Op == W65816::LDAi16imm; + }; + auto immValue = [](const MachineInstr &MI) -> int64_t { + for (const MachineOperand &MO : MI.operands()) { + if (MO.isImm()) return MO.getImm(); + } + return 0; + }; + // Collect `LDA #K ; STA_StackRel Y` pairs, grouped by Y. 
+    DenseMap<int64_t, SmallVector<std::pair<MachineInstr *, int64_t>, 4>>
+        ConstStas;
+    for (MachineBasicBlock &MBB : MF) {
+      for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+        if (!isLdaImm(*It)) continue;
+        int64_t K = immValue(*It);
+        auto NextIt = std::next(It);
+        while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
+        if (NextIt == MBB.end()) continue;
+        if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
+        int64_t Y;
+        if (!srAccess(*NextIt, Y)) continue;
+        ConstStas[Y].push_back({&*NextIt, K});
+      }
+    }
+    // For each slot Y with at least two const-init STAs, check for
+    // dominator redundancy.
+    auto &MDT = getAnalysis<MachineDominatorTreeWrapperPass>().getDomTree();
+    // Check that no instruction WRITES slot Y on any path between
+    // From and To. Reads are fine because both From and To write
+    // the same constant K: any intermediate read would see K either
+    // way (since From dominates, From has already executed). Calls
+    // are bailout conditions: a call might write to the stack via
+    // address-taken locals or other side effects we don't model.
+    auto noSlotWriteOnPath = [&](MachineInstr *From, MachineInstr *To,
+                                 int64_t Y) -> bool {
+      MachineBasicBlock *FromMBB = From->getParent();
+      MachineBasicBlock *ToMBB = To->getParent();
+      auto opWritesY = [&](MachineInstr &MI) {
+        if (MI.isCall() || MI.isInlineAsm()) return true;
+        int64_t Off;
+        if (MI.getOpcode() == W65816::STA_StackRel &&
+            srAccess(MI, Off) && Off == Y) {
+          return true;
+        }
+        return false;
+      };
+      // (a) After From in its MBB.
+      for (auto It = std::next(From->getIterator()); It != FromMBB->end();
+           ++It) {
+        if (It->isDebugInstr()) continue;
+        if (opWritesY(*It)) return false;
+      }
+      // (b) Worklist traversal forward from FromMBB's successors (the
+      // worklist is LIFO, so blocks are visited depth-first), stopping
+      // at ToMBB.
+ SmallPtrSet Visited; + SmallVector Stack; + for (auto *Succ : FromMBB->successors()) Stack.push_back(Succ); + while (!Stack.empty()) { + auto *MBB = Stack.pop_back_val(); + if (MBB == ToMBB) continue; // checked separately in (c) + if (!Visited.insert(MBB).second) continue; + for (auto &MI : *MBB) { + if (MI.isDebugInstr()) continue; + if (opWritesY(MI)) return false; + } + for (auto *Succ : MBB->successors()) Stack.push_back(Succ); + } + // (c) In ToMBB, before To, any write of Y? + for (auto It = ToMBB->begin(); It != To->getIterator(); ++It) { + if (It->isDebugInstr()) continue; + if (opWritesY(*It)) return false; + } + return true; + }; + SmallVector ToErase; + LLVM_DEBUG({ + dbgs() << "Phase 5 in " << MF.getName() << ":\n"; + for (auto &P : ConstStas) { + dbgs() << " slot " << P.first << " has " << P.second.size() + << " const STAs\n"; + } + }); + for (auto &P : ConstStas) { + int64_t Y = P.first; + auto &stas = P.second; + if (stas.size() < 2) continue; + // For each pair (i, j) where i dominates j with same constant K: + for (auto &Sj : stas) { + MachineInstr *DominatedSta = Sj.first; + int64_t Kj = Sj.second; + for (auto &Si : stas) { + if (&Si == &Sj) continue; + if (Si.second != Kj) continue; // different K + MachineInstr *DominatorSta = Si.first; + if (!MDT.dominates(DominatorSta, DominatedSta)) continue; + if (!noSlotWriteOnPath(DominatorSta, DominatedSta, Y)) continue; + // Flag safety: erasing `LDA #K; STA Y` removes a flag-setting + // op (the LDA). Walk forward from the STA looking for next + // flag-clobber or unconditional terminator (safe) vs. + // flag-use (unsafe). 
+ MachineBasicBlock *MBB = DominatedSta->getParent(); + bool flagsSafeP5 = false; + for (auto Fwd = std::next(DominatedSta->getIterator()); + Fwd != MBB->end(); ++Fwd) { + if (Fwd->isDebugInstr()) continue; + if (usesFlagsP(*Fwd)) break; + if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) { + flagsSafeP5 = true; break; + } + if (clobbersFlagsP(*Fwd)) { flagsSafeP5 = true; break; } + } + if (!flagsSafeP5) continue; + // Erase DominatedSta and its preceding LDA #K. + auto Prev = DominatedSta->getIterator(); + while (Prev != MBB->begin()) { + --Prev; + if (!Prev->isDebugInstr()) break; + } + if (Prev != DominatedSta->getIterator() && isLdaImm(*Prev) && + immValue(*Prev) == Kj) { + // Verify A isn't consumed between LDA and STA — they're + // adjacent so no consumers exist; safe. Erase both. + ToErase.push_back(&*Prev); + } + ToErase.push_back(DominatedSta); + break; + } + } + } + // De-dup ToErase before erasing. + SmallPtrSet ErasedSet; + for (MachineInstr *MI : ToErase) { + if (ErasedSet.insert(MI).second) { + MI->eraseFromParent(); + Changed = true; + } + } + } + + return Changed || peepChanged; +} diff --git a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp index 031a699..433319b 100644 --- a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp +++ b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp @@ -56,6 +56,8 @@ LLVMInitializeW65816Target() { initializeW65816I32IncFoldPass(PR); initializeW65816ImgCalleeSavePass(PR); initializeW65816NarrowI32MulPass(PR); + initializeW65816PromoteFiToImgPass(PR); + initializeW65816StackSlotMergePass(PR); // Default IndVarSimplify's exit-value rewriter to "never". 
The // closed-form replacement frequently widens an i16 induction var @@ -195,14 +197,19 @@ void W65816PassConfig::addPreRegAlloc() { } void W65816PassConfig::addPostRegAlloc() { - // ImgCalleeSave runs FIRST so its STAfi/LDAfi pseudos go through the - // rest of the post-RA pipeline (SpillToX, StackSlotCleanup) normally. - // It detects IMG8..IMG15 usage post-regalloc and inserts prologue - // save + epilogue restore so those slots act as callee-saved at the - // asm level. Fixes picol's `expr 1+2 == 4` bug: high-pressure - // recursive double fns use IMG8..IMG15 as scratch but, without this - // pass, expected them preserved across calls — and callees were - // happy to clobber them. See W65816ImgCalleeSave.cpp. + // FI→IMG promotion runs FIRST. It scans for high-traffic i16 + // FrameIndex slots (LDAfi/STAfi/ADCfi/etc.) and rewrites them to + // STA_DP/LDA_DP/ADC_DP/... pointed at free IMG8..IMG15 DP slots. + // The introduced IMG8..15 references are then picked up by + // ImgCalleeSave to get prologue save + epilogue restore. See + // W65816PromoteFiToImg.cpp. + addPass(createW65816PromoteFiToImg()); + // ImgCalleeSave detects IMG8..IMG15 usage post-regalloc and inserts + // prologue save + epilogue restore so those slots act as callee- + // saved at the asm level. Fixes picol's `expr 1+2 == 4` bug: + // high-pressure recursive double fns use IMG8..IMG15 as scratch but, + // without this pass, expected them preserved across calls — and + // callees were happy to clobber them. See W65816ImgCalleeSave.cpp. addPass(createW65816ImgCalleeSave()); // SpillToX converts STA/LDA pairs to TAX/TXA bridges; StackSlotCleanup // then deletes still-adjacent redundant spills. A second SpillToX @@ -264,6 +271,14 @@ void W65816PassConfig::addPreEmitPass() { addPass(createW65816I32IncFold()); addPass(createW65816BranchExpand()); addPass(createW65816SepRepCleanup()); + // Merge value-equivalent stack slots last. 
Runs AFTER SepRepCleanup's + // PHI-copy hoist so the LDA-X ; STA-Y pair has been pulled out of + // any PHP/PLP wrap — that way the stack-rel offsets on both ops are + // the unbumped values and offset-based slot matching is stable. + // Saves 2 inst per PHI-copy occurrence (the memory copy round-trip + // collapses when X and Y are renamed to the same slot). See + // W65816StackSlotMerge.cpp. + addPass(createW65816StackSlotMerge()); } MachineFunctionInfo *W65816TargetMachine::createMachineFunctionInfo( diff --git a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp index 5e11bb6..226e1dc 100644 --- a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp +++ b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp @@ -64,13 +64,43 @@ FunctionPass *llvm::createW65816WidenAcc16() { return new W65816WidenAcc16(); } -// Returns true if the vreg has any physreg-COPY use (e.g., return-value -// or arg-passing setup that pins the value to a specific physreg). -static bool flowsToPhysReg(Register VReg, const MachineRegisterInfo &MRI) { +// Returns true if the vreg has any physreg-COPY use that would conflict +// with Wide16 class assignment. $a is a member of Wide16 (Wide16 = A + +// IMG0..15), so a COPY to $a is fine — the vreg can be Wide16 and +// regalloc will pick $a to coalesce. $x / $y are in Idx16, NOT in +// Wide16, so a COPY to those forces the vreg to NOT be in Wide16 +// (verifier would reject). 
+static bool flowsToIncompatiblePhysReg(Register VReg, + const MachineRegisterInfo &MRI) { for (auto &U : MRI.use_nodbg_instructions(VReg)) { if (!U.isCopy()) continue; const MachineOperand &Dst = U.getOperand(0); - if (Dst.isReg() && Dst.getReg().isPhysical()) return true; + if (!Dst.isReg() || !Dst.getReg().isPhysical()) continue; + Register P = Dst.getReg(); + if (P == W65816::A) continue; + if (P >= W65816::IMG0 && P <= W65816::IMG15) continue; + return true; + } + return false; +} + +// Returns true if VReg's def is a COPY from a physreg whose class is not +// Wide16-compatible. copyPhysReg only handles a fixed set of source/dest +// pairs; an incompatible source physreg (e.g., DPF0, the i64-return +// high-half carrier) lowered to an IMG dest would crash with an +// "unhandled copyPhysReg" assertion at AsmPrinter time. (Currently +// only the Phase-2 PHI widening uses this; that's disabled, so mark +// unused.) +[[maybe_unused]] static bool comesFromIncompatiblePhysReg(Register VReg, + const MachineRegisterInfo &MRI) { + for (auto &D : MRI.def_instructions(VReg)) { + if (!D.isCopy()) continue; + const MachineOperand &Src = D.getOperand(1); + if (!Src.isReg() || !Src.getReg().isPhysical()) continue; + Register P = Src.getReg(); + if (P == W65816::A) continue; + if (P >= W65816::IMG0 && P <= W65816::IMG15) continue; + return true; } return false; } @@ -145,7 +175,7 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) { Register VReg = Register::index2VirtReg(i); if (MRI.def_empty(VReg)) continue; if (MRI.getRegClass(VReg) != &W65816::Acc16RegClass) continue; - if (flowsToPhysReg(VReg, MRI)) continue; + if (flowsToIncompatiblePhysReg(VReg, MRI)) continue; if (usedByPhi(VReg, MRI)) continue; if (!MRI.hasOneDef(VReg)) continue; // require single SSA def if (!allUsesAcceptWide(VReg, MRI, *TRI, *TII)) continue; @@ -181,5 +211,212 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) { } Changed = true; } + + // Phase 2: PHI cycle 
widening. EXPERIMENTAL, currently disabled — + // see end of pass for explanation. + #if 0 + // PHIs whose def class is Acc16 keep + // the value pinned to $a across iterations, forcing stack spills + // when the PHI is live across calls or other A-clobbering ops. + // For sumSquares-style loops with an i32 accumulator, this manifests + // as per-iter `LDA slot ; ADC ; STA slot ; LDA slot ; STA slot` (the + // last LDA/STA pair is the PHI-back-edge copy). If we widen the + // PHI's def to Wide16, regalloc can keep it in an IMG slot and the + // back-edge PHI copy collapses to a register coalesce. + // + // To widen a PHI: + // 1. Compute the SCC of Acc16 vregs connected by PHI edges (PHI + // def ↔ PHI incoming vreg). This catches mutually-recursive + // PHIs in nested loops. + // 2. For every member: verify all non-PHI uses accept Wide16, no + // flow to a physreg, single def. + // 3. For each PHI in the SCC, walk its incoming list. Each + // incoming vreg is either ALREADY in the SCC (another PHI, no + // bridge needed) or an external Acc16 vreg whose value flows + // into the SCC — bridge it by inserting `WWide = COPY W` at + // the end of the predecessor block and pointing the PHI's + // incoming at WWide. + // 4. Change every SCC member's register class to Wide16. 
+  auto worklistInsertIfAcc16 = [&MRI](Register V,
+                                      DenseSet<Register> &Seen,
+                                      SmallVectorImpl<Register> &WL) {
+    if (!V.isVirtual()) return;
+    if (MRI.getRegClass(V) != &W65816::Acc16RegClass) return;
+    if (!Seen.insert(V).second) return;
+    WL.push_back(V);
+  };
+
+  SmallVector<MachineInstr *> AcctPhis;
+  for (MachineBasicBlock &MBB : MF) {
+    for (MachineInstr &MI : MBB.phis()) {
+      Register DefV = MI.getOperand(0).getReg();
+      if (MRI.getRegClass(DefV) == &W65816::Acc16RegClass) {
+        AcctPhis.push_back(&MI);
+      }
+    }
+  }
+  DenseSet<Register> ProcessedPhiVregs;
+  for (MachineInstr *Seed : AcctPhis) {
+    Register SeedDef = Seed->getOperand(0).getReg();
+    if (ProcessedPhiVregs.count(SeedDef)) continue;
+    // Build SCC by following PHI edges in both directions.
+    DenseSet<Register> Comp;
+    SmallVector<Register> Stack;
+    worklistInsertIfAcc16(SeedDef, Comp, Stack);
+    while (!Stack.empty()) {
+      Register V = Stack.pop_back_val();
+      // Forward: V flows into other PHIs as an incoming → include those PHI defs.
+      for (auto &U : MRI.use_nodbg_instructions(V)) {
+        if (!U.isPHI()) continue;
+        Register PhiDef = U.getOperand(0).getReg();
+        worklistInsertIfAcc16(PhiDef, Comp, Stack);
+      }
+      // Backward: if V is itself a PHI def, include the incoming vregs.
+      MachineInstr *DM = &*MRI.def_instructions(V).begin();
+      if (!DM || !DM->isPHI()) continue;
+      for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
+        MachineOperand &MO = DM->getOperand(i);
+        if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
+        worklistInsertIfAcc16(MO.getReg(), Comp, Stack);
+      }
+    }
+    for (Register V : Comp) ProcessedPhiVregs.insert(V);
+
+    // Validate every member. PHI uses are ACCEPTED when the consumer
+    // PHI is itself in the SCC (those PHIs are being widened in
+    // lock-step). Narrow-class uses (e.g., INA_PSEUDO's tied-def
+    // input requires Acc16) are ALSO accepted; we'll insert a
+    // Wide16→Acc16 COPY at the use site after widening.
+    // The only unrecoverable cases are: PHI uses where the consumer PHI
+    // is outside the SCC (forcing cross-SCC class merging), and physreg
+    // flow to $x/$y/etc. (handled separately above).
+    auto usesAcceptInSCC = [&](Register V,
+                               SmallVectorImpl<MachineOperand *> *NarrowSites)
+        -> bool {
+      for (auto &MO : MRI.use_nodbg_operands(V)) {
+        MachineInstr *UMI = MO.getParent();
+        if (UMI->isCopy()) continue;
+        if (UMI->isPHI()) {
+          Register PhiDef = UMI->getOperand(0).getReg();
+          if (Comp.count(PhiDef)) continue;  // co-widened
+          return false;
+        }
+        unsigned OpIdx = UMI->getOperandNo(&MO);
+        const TargetRegisterClass *Expected =
+            TII->getRegClass(UMI->getDesc(), OpIdx);
+        if (!Expected) continue;
+        if (Expected == &W65816::Wide16RegClass) continue;
+        if (Expected->hasSubClassEq(&W65816::Wide16RegClass)) continue;
+        // Expected is narrower than Wide16 (e.g., Acc16-only tied
+        // input). Mark the site for narrowing; a COPY is inserted
+        // at apply time.
+        if (NarrowSites) NarrowSites->push_back(&MO);
+      }
+      return true;
+    };
+    bool ok = true;
+    SmallVector<MachineOperand *> NarrowSites;
+    for (Register V : Comp) {
+      if (!MRI.hasOneDef(V)) { ok = false; break; }
+      if (flowsToIncompatiblePhysReg(V, MRI)) { ok = false; break; }
+      if (comesFromIncompatiblePhysReg(V, MRI)) { ok = false; break; }
+      if (!usesAcceptInSCC(V, &NarrowSites)) { ok = false; break; }
+    }
+    if (!ok) continue;
+
+    // Apply widening. First insert bridge COPYs at predecessor edges
+    // for external (non-Comp) Acc16 incomings to each PHI in Comp.
+    SmallVector<std::pair<MachineInstr *, unsigned>, 16> BridgeSites;
+    for (Register V : Comp) {
+      MachineInstr *DM = &*MRI.def_instructions(V).begin();
+      if (!DM->isPHI()) continue;
+      for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
+        MachineOperand &MO = DM->getOperand(i);
+        if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
+        Register Inc = MO.getReg();
+        if (Comp.count(Inc)) continue;  // in-SCC, no bridge needed
+        // External incoming: it must currently be Acc16 or Wide16;
+        // for Acc16 we'll insert a COPY at the predecessor block's end.
+        if (MRI.getRegClass(Inc) != &W65816::Acc16RegClass &&
+            MRI.getRegClass(Inc) != &W65816::Wide16RegClass) {
+          ok = false;
+          break;
+        }
+        BridgeSites.push_back({DM, i});
+      }
+      if (!ok) break;
+    }
+    if (!ok) continue;
+
+    // Insert bridges.
+    for (auto &Site : BridgeSites) {
+      MachineInstr *PhiMI = Site.first;
+      unsigned OpIdx = Site.second;
+      Register Inc = PhiMI->getOperand(OpIdx).getReg();
+      MachineBasicBlock *PredMBB = PhiMI->getOperand(OpIdx + 1).getMBB();
+      // If already Wide16 (e.g., another candidate widened it already),
+      // no bridge is needed: the PHI incoming must be a Wide16 vreg,
+      // and Inc already is one. Use Inc directly.
+      if (MRI.getRegClass(Inc) == &W65816::Wide16RegClass) {
+        continue;
+      }
+      // Insert COPY before the predecessor's terminator(s).
+      auto InsertPos = PredMBB->getFirstTerminator();
+      DebugLoc DL = (InsertPos == PredMBB->end())
+                        ? PredMBB->findBranchDebugLoc()
+                        : InsertPos->getDebugLoc();
+      Register WideInc = MRI.createVirtualRegister(&W65816::Wide16RegClass);
+      BuildMI(*PredMBB, InsertPos, DL, TII->get(TargetOpcode::COPY),
+              WideInc)
+          .addReg(Inc);
+      PhiMI->getOperand(OpIdx).setReg(WideInc);
+      PhiMI->getOperand(OpIdx).setIsKill(false);
+    }
+
+    // Force every SCC member to Img16 (IMG-only, no A). Using Wide16
+    // (A + IMG) doesn't work here: the Register Coalescer joins our
+    // Wide16 vregs with adjacent Acc16 vregs (intersection = Acc16)
+    // and narrows them back to A-only, defeating the widening.
Img16 + // intersects Acc16 to ∅, so the coalescer can't merge — the PHI + // stays in IMG. This is correct anyway for the common case (PHI + // live across a call): A is JSL-clobbered, so it can't carry the + // value through, and IMG8..15 is the right home. + for (Register V : Comp) { + MRI.setRegClass(V, &W65816::Img16RegClass); + } + // Insert narrowing COPYs at each narrow-class use site. Each site + // is `... = OP V, ...` where the operand requires Acc16 but V is + // now Wide16. Replace with `%Vacc = COPY V (Acc16); ... = OP %Vacc, ...`. + for (MachineOperand *MO : NarrowSites) { + MachineInstr *UMI = MO->getParent(); + Register OldReg = MO->getReg(); + Register NarrowReg = + MRI.createVirtualRegister(&W65816::Acc16RegClass); + DebugLoc DL = UMI->getDebugLoc(); + BuildMI(*UMI->getParent(), UMI, DL, TII->get(TargetOpcode::COPY), + NarrowReg) + .addReg(OldReg); + MO->setReg(NarrowReg); + MO->setIsKill(false); + } + Changed = true; + } + #endif + // Why disabled (2026-05-13 attempt): + // - Widening PHI cycles to Wide16 (= A + IMG0..15) is undone by the + // Register Coalescer: it joins our Wide16 vregs with adjacent + // Acc16 vregs via the bridge COPYs we insert, and the resulting + // joint class is `intersect(Wide16, Acc16) = Acc16`. Net effect: + // no IMG, just more code through the coalescer. + // - Switching to Img16 (= IMG0..15, no A) defeats the coalescer + // (intersection with Acc16 is ∅) but forces ALL widened PHIs into + // IMG slots even when A would be better, AND triggers cascading + // copyPhysReg paths that aren't all implemented (e.g., DPF0 → IMG + // for i64 libcall return values), aborting clang on runtime builds. + // - A targeted fix needs either (a) a class that the coalescer + // refuses to join with Acc16 yet that still allows A as a member, + // (b) a post-coalescer pass that re-widens specific high-traffic + // vregs back to Img16, or (c) regalloc cost-model tuning so it + // prefers IMG8..15 over stack for loop-live values. 
return Changed; } diff --git a/src/llvm/test/CodeGen/W65816/extract-wide32-regseq.s b/src/llvm/test/CodeGen/W65816/extract-wide32-regseq.s new file mode 100644 index 0000000..e69de29
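Reviewer note (not part of the patch): the Phase 6 `LDA #K` peephole in `runPerMBBPeepholes` is easier to reason about on a toy model. The sketch below is a minimal standalone illustration, assuming a hypothetical three-opcode instruction type (`Op`, `Instr`) and a hypothetical helper `dropRedundantLdaImm`; none of these names exist in the backend. Stores stand in for the A- and P-preserving opcodes, and every other opcode is treated as invalidating the known value of A.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical toy instruction model; this is NOT the backend's MachineInstr.
enum class Op { LdaImm, Store, Other };

struct Instr {
  Op op;
  int64_t imm = 0;  // only meaningful for Op::LdaImm
};

// Removes every `LDA #K` reached while A is already known to hold K.
// Stores keep that knowledge alive (they leave A and P untouched);
// any other opcode forgets it. Returns the number of removed instructions.
std::size_t dropRedundantLdaImm(std::vector<Instr> &block) {
  std::optional<int64_t> knownK;
  std::vector<Instr> kept;
  kept.reserve(block.size());
  for (const Instr &in : block) {
    if (in.op == Op::LdaImm) {
      if (knownK && *knownK == in.imm) continue;  // redundant reload of K
      knownK = in.imm;
    } else if (in.op != Op::Store) {
      knownK.reset();  // unknown effect on A or P: forget the constant
    }
    kept.push_back(in);
  }
  std::size_t removed = block.size() - kept.size();
  block = std::move(kept);
  return removed;
}
```

In the sequence `LDA #0; STA; LDA #0; STA; <other>; LDA #0`, only the second load is dropped; the third survives because the intervening opcode resets the tracked constant, mirroring `KnownK.reset()` in the pass.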