Checkpoint

Scott Duensing 2026-05-13 20:54:28 -05:00
parent e2e4b778b0
commit 42f0d16d07
19 changed files with 2008 additions and 84 deletions

@@ -246,20 +246,21 @@ which runs correctly under MAME (apple2gs).
- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
via MAME's emulated time counter. Eight benchmarks under
`benchmarks/`. Current numbers (2026-05-13 after the umulhisi3 /
TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478,
bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302,
fib(10) 12617, sumOfSquares 18755. Speed is the optimization
priority, not size.
`benchmarks/`. Current numbers (after W65816StackSlotMerge):
popcount 3376, bsearch 852, memcmp 1091, strcpy 2387,
dotProduct 2302, fib(10) 12617, sumOfSquares 17391. Speed is
the optimization priority, not size.
- `compare/` holds three side-by-side C tests with our asm and
Calypsi's listing for static-size comparison:
`sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh`
recompiles each under both `clang --target=w65816 -O2 -S` and
`cc65816 --speed -O 2 --64bit-doubles` and prints an
ours/Calypsi instruction-count ratio. Current ratios:
sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x. See
`compare/README.md`.
ours/Calypsi instruction-count ratio. Current ratios (post
W65816StackSlotMerge Phase 5/6 + extracted Phase 6/6a per-MBB
peepholes + Pass 1c PHP-wrap CMP elim for SP-rel functions):
sumSquares 1.81x (56 inst), evalAt 2.10x (534 inst), mul16to32
2.25x (9 inst). See `compare/README.md`.
**Backend register allocation:**
@@ -340,6 +341,46 @@ for the common-case C / minimal-C++ workload. Priority is speed
`-disable-lsr` and `isLSRCostLess` override, both regressed
dotProduct.
- **W65816StackSlotMerge — value-equivalent stack slot coalesce**
(2026-05-13). Pre-emit pass that merges PHI src/dst stack-slot
pairs which LLVM's StackSlotColoring can't see (they're
simultaneously live but hold the same value). Detects the
canonical loop-body `LDA X ; STA Y` PHI-copy in a self-looped
MBB, verifies value equivalence via bidirectional twin-pairing
(Case 1: same A in same MBB / Case 2: PHI-copy reload pattern /
Case 3: matching `LDA #const` init in different MBBs), and
renames slot X→Y function-wide. Runs AFTER SepRepCleanup so the
PHI copies are out of their PHP/PLP wraps and offsets are stable.
**A-define detection is opcode-based, not operand-based**:
LDA_DP / LDA_Abs / LDA_Long etc. omit the `implicit-def $a`
annotation in tablegen but semantically write A; the
`semanticallyDefsA` helper falls back to an opcode whitelist.
sumSquares static: 65 → 61 inst (1.97x — under 2x Calypsi for
the first time). sumOfSquares cyc/call: 18755 → 17391
(**7.3%**). strcpy: 2558 → 2387 (6.7%). See
W65816StackSlotMerge.cpp.
- **LSR-widened i32 IV narrowing** (`W65816NarrowI32Mul` Phase 2,
2026-05-13). After rewriting `mul i32 X, Y` to a `__umulhisi3`
call, scan for i32 PHIs whose only uses are (a) the truncs the
rewrite emitted and (b) a single self-feeding `add %P, const`.
When SCEV bounds the PHI to u16, build an i16 PHI + i16 add in
place, replace truncs, and erase the i32 chain. Care needed
to break the PN ↔ Incr use-cycle before erasing. sumSquares
frame: 14B → 12B; loop-internal `i++` shrinks from 7→3 inst.
- **PHI-hoist accepts LDA_Imm16 / LDAi16imm** (2026-05-13).
Init blocks contain `lda #const ; sta slot,s` pairs wrapped in
PHP/PLP around the pre-loop CMP — same shape as a PHI-copy
wrap but with an immediate load instead of a memory load.
Matcher extended to accept both the MC opcode (`LDA_Imm16`) and
the surviving pseudo (`LDAi16imm`), with an added **$a-live-out
guard**: if any successor MBB has $a in its live-in set, bail —
the LDA's A-value is a fall-through register-PHI consumed by
the successor's first STA, and hoisting clobbers it. Caught
by `sumTable` where `lda #0 ; sta 0x9,s` (wrap+trailing) ALSO
supplied A=0 to `bb.2`'s `sta 0x1,s`.
- **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
pass** (2026-05-13). Added `__umulhisi3` (unsigned 16x16→32) to
`runtime/src/libgcc.s`. New IR pass in `addISelPrepare` walks
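
A minimal sketch of the Case 1 twin-pairing check described in the W65816StackSlotMerge entry above, on a toy instruction list rather than real MachineInstrs. The `Op`, `Inst`, and `everyStoreHasTwin` names are hypothetical and are not taken from W65816StackSlotMerge.cpp. Cases 2 and 3, and any ordering hazard between the paired stores, are not modeled.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model (assumption): an instruction either redefines A or stores
// A to a stack slot.
enum class Op { LdaSlot, LdaImm, StaSlot, ClobberA };

struct Inst {
  Op op;
  int operand;  // slot number for LdaSlot/StaSlot, constant for LdaImm
};

// Returns true when every store to slot x in this straight-line block
// is matched by a store to slot y of the same A value, i.e. no
// A-defining instruction sits between the two stores.
bool everyStoreHasTwin(const std::vector<Inst> &block, int x, int y) {
  assert(x != y && "merging a slot with itself is meaningless");
  int version = 0;  // bumped whenever A is redefined
  std::set<int> storedToX, storedToY;
  for (const Inst &inst : block) {
    switch (inst.op) {
    case Op::LdaSlot:
    case Op::LdaImm:
    case Op::ClobberA:
      ++version;
      break;
    case Op::StaSlot:
      if (inst.operand == x) storedToX.insert(version);
      if (inst.operand == y) storedToY.insert(version);
      break;
    }
  }
  if (storedToX.empty()) return false;  // nothing to merge
  for (int v : storedToX) {
    if (storedToY.count(v) == 0) return false;  // store to x with no twin
  }
  return true;
}
```

With `lda #0 ; sta 9,s ; sta 3,s` both slots receive the same A and the check passes; an intervening `lda #1` before the second store makes it fail.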

@@ -1,7 +1,7 @@
###############################################################################
# #
# Calypsi ISO C compiler for 65816 version 5.16 #
# 13/May/2026 15:46:15 #
# 13/May/2026 20:52:21 #
# Command line: --speed -O 2 --64bit-doubles evalAt.c -o #
# /tmp/evalAt.calypsi.elf --list-file evalAt.calypsi.lst #
# #

@@ -139,9 +139,10 @@ evalAt: ; @evalAt
lda 0x1d, s
sta [0xe0 ], y
pea 0x4024
pea 0x0
pea 0x0
pea 0x0
lda #0x0
pha
pha
pha
lda 0x17, s
pha
lda 0x1b, s
@@ -272,9 +273,9 @@ evalAt: ; @evalAt
lda 0xc4
sta 0x15, s
lda 0xca
sta 0x11, s
lda 0xc8
sta 0x13, s
lda 0xc8
sta 0x11, s
lda 0x17, s
pha
lda 0x1f, s
@@ -283,9 +284,9 @@ evalAt: ; @evalAt
pha
lda 0x27, s
pha
lda 0x19, s
lda 0x1b, s
pha
lda 0x1d, s
lda 0x1b, s
pha
lda 0x27, s
tax
@@ -518,9 +519,9 @@ evalAt: ; @evalAt
lda 0xc4
sta 0x15, s
lda 0xca
sta 0x11, s
lda 0xc8
sta 0x13, s
lda 0xc8
sta 0x11, s
lda 0x17, s
pha
lda 0x1f, s
@@ -529,9 +530,9 @@ evalAt: ; @evalAt
pha
lda 0x27, s
pha
lda 0x19, s
lda 0x1b, s
pha
lda 0x1d, s
lda 0x1b, s
pha
lda 0x27, s
tax

@@ -1,7 +1,7 @@
###############################################################################
# #
# Calypsi ISO C compiler for 65816 version 5.16 #
# 13/May/2026 15:46:15 #
# 13/May/2026 20:52:21 #
# Command line: --speed -O 2 --64bit-doubles mul16to32.c -o #
# /tmp/mul16to32.calypsi.elf --list-file #
# mul16to32.calypsi.lst #

@@ -11,7 +11,6 @@ mul16to32: ; @mul16to32
jsl __umulhisi3
ply
sta 0x1, s
lda 0x1, s
ply
rtl
.Lfunc_end0:

@@ -1,7 +1,7 @@
###############################################################################
# #
# Calypsi ISO C compiler for 65816 version 5.16 #
# 13/May/2026 15:46:15 #
# 13/May/2026 20:52:21 #
# Command line: --speed -O 2 --64bit-doubles sumSquares.c -o #
# /tmp/sumSquares.calypsi.elf --list-file #
# sumSquares.calypsi.lst #

compare/sumSquares.ll

@@ -0,0 +1,50 @@
; ModuleID = 'sumSquares.c'
source_filename = "sumSquares.c"
target datalayout = "e-m:e-p:32:16-i16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S8"
target triple = "w65816"
; Function Attrs: nofree norecurse nosync nounwind memory(none)
define dso_local i32 @sumSquares(i16 noundef zeroext %n) local_unnamed_addr #0 {
entry:
%cmp.not6 = icmp eq i16 %n, 0
br i1 %cmp.not6, label %for.cond.cleanup, label %for.body.preheader
for.body.preheader: ; preds = %entry
%0 = add i16 %n, 1
%umax = tail call i16 @llvm.umax.i16(i16 %0, i16 2)
br label %for.body
for.cond.cleanup: ; preds = %for.body, %entry
%total.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
ret i32 %total.0.lcssa
for.body: ; preds = %for.body.preheader, %for.body
%i.08 = phi i16 [ %inc, %for.body ], [ 1, %for.body.preheader ]
%total.07 = phi i32 [ %add, %for.body ], [ 0, %for.body.preheader ]
%conv = zext i16 %i.08 to i32
%mul = mul nuw i32 %conv, %conv
%add = add i32 %mul, %total.07
%inc = add nuw i16 %i.08, 1
%exitcond = icmp eq i16 %inc, %umax
br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !7
}
; Function Attrs: nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none)
declare i16 @llvm.umax.i16(i16, i16) #1
attributes #0 = { nofree norecurse nosync nounwind memory(none) "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
attributes #1 = { nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none) }
!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}
!llvm.errno.tbaa = !{!3}
!0 = !{i32 1, !"wchar_size", i32 2}
!1 = !{i32 7, !"frame-pointer", i32 2}
!2 = !{!"clang version 23.0.0git (https://github.com/llvm-mos/llvm-mos.git c798c31416f72b395c658b5502d281a162387ab1)"}
!3 = !{!4, !4, i64 0}
!4 = !{!"int", !5, i64 0}
!5 = !{!"omnipotent char", !6, i64 0}
!6 = !{!"Simple C/C++ TBAA"}
!7 = distinct !{!7, !8}
!8 = !{!"llvm.loop.mustprogress"}
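
For orientation, the IR above is consistent with a source function along the following lines. This is a reconstruction under assumptions, not the contents of `sumSquares.c`: the loop shape is inferred from the PHIs and the exit compare, and it is written as C++ with fixed-width types so that it also builds on hosts where `int` is wider than 16 bits.

```cpp
#include <cassert>
#include <cstdint>

// Sum of i*i for i in 1..n, accumulated in 32 bits.
uint32_t sumSquares(uint16_t n) {
  // With n == 65535 the `i <= n` test can never fail for a 16-bit i,
  // so this sketch excludes that input.
  assert(n != UINT16_MAX);
  uint32_t total = 0;
  for (uint16_t i = 1; i <= n; ++i) {
    // Widen before multiplying: a 16x16 -> 32 product, which matches
    // the `zext` + `mul i32` pair in the IR.
    total += static_cast<uint32_t>(i) * i;
  }
  return total;
}
```

The early `icmp eq i16 %n, 0` branch in the IR corresponds to the loop body never running for `n == 0`.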

@@ -8,79 +8,62 @@ sumSquares: ; @sumSquares
tay
tsc
sec
sbc #0xe
sbc #0xc
tcs
tya
sta 0x7, s
sta 0x5, s
lda #0x0
sta 0xb, s
lda 0x7, s
cmp #0x0
php
lda #0x0
plp
sta 0x9, s
sta 0x3, s
sta 0x1, s
lda 0x5, s
bne .LBB0_1
; %bb.6: ; %entry
brl .LBB0_5
.LBB0_1: ; %for.body.preheader
lda 0x7, s
lda 0x5, s
inc a
sta 0x7, s
sta 0x5, s
cmp #0x3
bcs .LBB0_3
; %bb.2: ; %for.body.preheader
lda #0x2
sta 0x7, s
.LBB0_3: ; %for.body.preheader
lda #0x0
sta 0x3, s
lda #0x1
sta 0xd, s
lda 0x7, s
dec a
sta 0x7, s
lda #0x0
sta 0x5, s
.LBB0_3: ; %for.body.preheader
lda #0x1
sta 0x7, s
lda 0x5, s
dec a
sta 0x5, s
lda #0x0
sta 0x1, s
.LBB0_4: ; %for.body
; =>This Inner Loop Header: Depth=1
lda 0xd, s
lda 0x7, s
pha
jsl __umulhisi3
ply
clc
adc 0x3, s
sta 0xb, s
sta 0x3, s
txa
adc 0x1, s
sta 0x9, s
lda 0xd, s
inc a
sta 0xd, s
bne .Ltmp0
lda 0x5, s
inc a
sta 0x5, s
.Ltmp0:
lda 0xb, s
sta 0x3, s
lda 0x9, s
sta 0x1, s
lda 0x7, s
dec a
inc a
sta 0x7, s
cmp #0x0
lda 0x5, s
dec a
sta 0x5, s
beq .LBB0_5
bra .LBB0_4
.LBB0_5: ; %for.cond.cleanup
lda 0x9, s
lda 0x1, s
tax
lda 0xb, s
lda 0x3, s
tay
tsc
clc
adc #0xe
adc #0xc
tcs
tya
rtl

@@ -93,10 +93,10 @@ $LUA_CHECKS
end)
EOF
OUT=$(timeout 30 mame apple2gs \
OUT=$(SDL_VIDEODRIVER=dummy SDL_AUDIODRIVER=dummy timeout 30 mame apple2gs \
-rompath "$PROJECT_ROOT/tools/mame/roms" \
-plugins -autoboot_script "$LUA_PATH" \
-window -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-")
-video none -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-")
echo "$OUT"
# Parse all val=... and compare to expected list.

@@ -38,6 +38,8 @@ add_llvm_target(W65816CodeGen
W65816I32IncFold.cpp
W65816ImgCalleeSave.cpp
W65816NarrowI32Mul.cpp
W65816PromoteFiToImg.cpp
W65816StackSlotMerge.cpp
W65816TargetMachine.cpp
W65816AsmPrinter.cpp
W65816MCInstLower.cpp

@@ -124,6 +124,25 @@ FunctionPass *createW65816SjLjFinalize();
// zext that a SDAG-level combine would key off. See W65816NarrowI32Mul.cpp.
FunctionPass *createW65816NarrowI32Mul();
// Post-RA, pre-PEI pass: rewrite high-traffic i16 FrameIndex accesses
// to use IMG8..15 DP slots ($C0..$CE) instead of stack-rel spills.
// Picks K = (number of free IMG8..15) hottest FIs and rewrites their
// STAfi/LDAfi/ADCfi/etc. pseudos to STA_DP/LDA_DP/ADC_DP/etc. with
// the corresponding DP address. Net win when access count > 5 (the
// per-slot save/restore in ImgCalleeSave is ~20 cyc / 12 B). See
// W65816PromoteFiToImg.cpp.
FunctionPass *createW65816PromoteFiToImg();
// Pre-emit pass: merge value-equivalent stack slots. LLVM's
// StackSlotColoring merges slots with non-overlapping liveness;
// this pass catches the case where two slots ARE simultaneously
// live but always hold the same value — typically the PHI src/dst
// pair PHI-elim leaves at the back-edge of a loop body. Renames
// X→Y function-wide when every STA X has a "twin" STA Y of the
// same source value, and erases the resulting LDA-X-STA-Y self-
// copy. See W65816StackSlotMerge.cpp.
FunctionPass *createW65816StackSlotMerge();
// Pre-RA pass that lowers Wide32 register pairs into pairs of i16
// vregs. Without this, greedy/basic regalloc can't fit the pair-
// pressure of i64-via-2-i32-via-Wide32 traffic in i64-heavy
@@ -163,6 +182,8 @@ void initializeW65816SjLjFinalizePass(PassRegistry &);
void initializeW65816LowerWide32Pass(PassRegistry &);
void initializeW65816ImgCalleeSavePass(PassRegistry &);
void initializeW65816NarrowI32MulPass(PassRegistry &);
void initializeW65816PromoteFiToImgPass(PassRegistry &);
void initializeW65816StackSlotMergePass(PassRegistry &);
} // namespace llvm

@@ -132,14 +132,155 @@ bool W65816NarrowI32Mul::runOnFunction(Function &F) {
return false;
}
// When the i32 operand is `zext i16 X to i32`, use X directly instead
// of emitting `trunc i32 (zext i16 X) to i16` — that trunc-of-zext is
// semantically the identity but keeps the zext (= a fresh i32 SSA
// value) live, which materializes a Wide32 vreg pair at ISel and
// forces a 4-byte spill slot (the canonical sumSquares `conv` pattern
// burned slots 0xd / 0x5 this way). Skipping the trunc lets the
// post-replaceAll DCE drop the zext entirely, freeing the slot.
auto narrowOperand = [&](Value *V, IRBuilder<> &B) -> Value * {
if (auto *ZE = dyn_cast<ZExtInst>(V)) {
if (ZE->getSrcTy() == I16) return ZE->getOperand(0);
}
if (auto *AE = dyn_cast<SExtInst>(V)) {
// Sext from i16 also has the right low 16 bits.
if (AE->getSrcTy() == I16) return AE->getOperand(0);
}
return B.CreateTrunc(V, I16);
};
FunctionCallee Callee = getUmulhisi3(*M);
SmallVector<Instruction *, 8> MaybeDead;
for (BinaryOperator *BO : Worklist) {
IRBuilder<> B(BO);
Value *A = B.CreateTrunc(BO->getOperand(0), I16);
Value *Bv = B.CreateTrunc(BO->getOperand(1), I16);
Value *AOp = BO->getOperand(0);
Value *BOp = BO->getOperand(1);
Value *A = narrowOperand(AOp, B);
Value *Bv = narrowOperand(BOp, B);
Value *Call = B.CreateCall(Callee, {A, Bv});
BO->replaceAllUsesWith(Call);
BO->eraseFromParent();
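// If the original operands were zext/sext nodes, they may now be
// dead. Add them to the cleanup worklist, once each: `mul %conv,
// %conv` names the same instruction twice, and erasing it twice in
// the cleanup loop below would be a use-after-free.
for (Value *Op : {AOp, BOp}) {
auto *I = dyn_cast<Instruction>(Op);
if (I && !llvm::is_contained(MaybeDead, I)) MaybeDead.push_back(I);
}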
}
// Cleanup: any extension that's now use-less can be deleted.
for (Instruction *I : MaybeDead) {
if (I->use_empty() && (isa<ZExtInst>(I) || isa<SExtInst>(I) ||
isa<TruncInst>(I))) {
I->eraseFromParent();
}
}
// Phase 2: narrow LSR-introduced i32 PHIs whose only uses (after
// the mul-rewrite above) are trunc-to-i16 + a single self-feeding
// `add %P, const` increment. Without this, even though the mul
// operates on i16, the i32 PHI still requires 4 bytes of frame +
// an i32 increment chain (post-PEI). LSR widened these from i16
// to i32 to support a sub-expression that we've now narrowed —
// the i32 representation has become dead weight.
//
// Guard with SCEV: `getUnsignedRange(%P).getActiveBits() <= 16`
// proves the PHI never escapes u16, so the i16 add gives the same
// low-16 bits as the original i32 add at every observable point
// (the back-edge value can wrap on the exit iteration but is
// never observed — exit takes the trip-end branch first).
bool NarrowedAny = false;
SmallVector<PHINode *, 4> PhiWorklist;
for (BasicBlock &BB : F) {
for (PHINode &PN : BB.phis()) {
if (PN.getType()->isIntegerTy(32)) PhiWorklist.push_back(&PN);
}
}
for (PHINode *PN : PhiWorklist) {
// Classify every use.
SmallVector<TruncInst *, 4> Truncs;
BinaryOperator *Incr = nullptr;
bool ok = true;
for (User *U : PN->users()) {
if (auto *TI = dyn_cast<TruncInst>(U)) {
if (!TI->getDestTy()->isIntegerTy(16)) { ok = false; break; }
Truncs.push_back(TI);
continue;
}
auto *BO = dyn_cast<BinaryOperator>(U);
if (!BO || BO->getOpcode() != Instruction::Add) { ok = false; break; }
if (!isa<ConstantInt>(BO->getOperand(1))) { ok = false; break; }
// BO must feed back to this PHI via at least one incoming edge.
bool feedsBack = false;
for (Value *Inc : PN->incoming_values()) {
if (Inc == BO) { feedsBack = true; break; }
}
if (!feedsBack) { ok = false; break; }
if (Incr) { ok = false; break; }
Incr = BO;
}
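if (!ok || !Incr || Truncs.empty()) continue;
// Incr is RAUW'd with undef and erased below, so it must have no
// user other than this PHI; an exit compare on Incr, for example,
// would otherwise end up reading undef.
if (llvm::any_of(Incr->users(), [&](User *U) { return U != PN; })) continue;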
// Increment const must fit i16.
auto *IncrCI = cast<ConstantInt>(Incr->getOperand(1));
if (IncrCI->getValue().getActiveBits() > 16) continue;
// Non-back-edge incomings must be i16-representable constants.
for (Value *Inc : PN->incoming_values()) {
if (Inc == Incr) continue;
auto *CIv = dyn_cast<ConstantInt>(Inc);
if (!CIv) { ok = false; break; }
if (CIv->getValue().getActiveBits() > 16) { ok = false; break; }
}
if (!ok) continue;
// SCEV bound check.
if (!SE.isSCEVable(PN->getType())) continue;
ConstantRange R = SE.getUnsignedRange(SE.getSCEV(PN));
if (R.getActiveBits() > 16) continue;
// Narrow. Build %narrow_phi in same BB, then %narrow_incr right
// before Incr; patch incoming values to match.
IRBuilder<> B(PN);
PHINode *NewPN = B.CreatePHI(I16, PN->getNumIncomingValues(),
PN->getName() + ".narrow");
// Add placeholders for the back-edge incomings; we'll patch them
// after building NewIncr.
for (unsigned i = 0; i < PN->getNumIncomingValues(); ++i) {
Value *Inc = PN->getIncomingValue(i);
BasicBlock *Pred = PN->getIncomingBlock(i);
if (Inc == Incr) {
NewPN->addIncoming(UndefValue::get(I16), Pred);
} else {
auto *CIv = cast<ConstantInt>(Inc);
NewPN->addIncoming(
ConstantInt::get(I16, CIv->getZExtValue() & 0xFFFF),
Pred);
}
}
IRBuilder<> B2(Incr);
Value *NewIncr = B2.CreateAdd(
NewPN,
ConstantInt::get(I16, IncrCI->getZExtValue() & 0xFFFF),
Incr->getName() + ".narrow");
if (auto *NewIncrBO = dyn_cast<BinaryOperator>(NewIncr)) {
NewIncrBO->setHasNoUnsignedWrap(Incr->hasNoUnsignedWrap());
NewIncrBO->setHasNoSignedWrap(Incr->hasNoSignedWrap());
}
for (unsigned i = 0; i < NewPN->getNumIncomingValues(); ++i) {
if (isa<UndefValue>(NewPN->getIncomingValue(i))) {
NewPN->setIncomingValue(i, NewIncr);
}
}
// Replace trunc uses with the new narrow PHI, then break the
// PHI/Incr use-cycle before erasing.
for (TruncInst *TI : Truncs) {
TI->replaceAllUsesWith(NewPN);
TI->eraseFromParent();
}
// Incr is `add %PN, const`; PN's back-edge incoming references Incr.
// Replace Incr's uses with undef so PN's back-edge becomes a dead
// reference, then erase Incr, then PN.
Incr->replaceAllUsesWith(UndefValue::get(Incr->getType()));
Incr->eraseFromParent();
PN->eraseFromParent();
NarrowedAny = true;
}
return true;
}
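
The Phase 2 argument above rests on truncation commuting with addition: the low 16 bits of the i32 step equal the i16 step on truncated inputs. The sketch below checks that on plain integers. All names in it are hypothetical, and it models only the arithmetic, not the SCEV query or the IR rewrite.

```cpp
#include <cassert>
#include <cstdint>

// True when v needs at most 16 bits, the question the pass asks of the
// PHI's unsigned range.
bool fitsU16(uint32_t v) { return (v >> 16) == 0; }

// One step of the original i32 induction variable, as seen through its
// trunc-to-i16 users.
uint16_t wideStepTruncated(uint32_t iv, uint32_t inc) {
  return static_cast<uint16_t>(iv + inc);
}

// One step of the narrowed i16 induction variable.
uint16_t narrowStep(uint16_t iv, uint16_t inc) {
  return static_cast<uint16_t>(iv + inc);
}

// Compares both forms for a PHI value and increment that passed the
// 16-bit guards.
bool stepsAgree(uint32_t iv, uint32_t inc) {
  assert(fitsU16(iv) && fitsU16(inc));
  return wideStepTruncated(iv, inc) ==
         narrowStep(static_cast<uint16_t>(iv), static_cast<uint16_t>(inc));
}
```

The two forms agree even at the wrap point (`0xFFFF + 1`), where both produce 0 in the low 16 bits.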

@@ -0,0 +1,289 @@
//===-- W65816PromoteFiToImg.cpp - Promote FrameIndex to IMG slot --------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===---------------------------------------------------------------------===//
//
// Post-RA, pre-PEI pass. Counts accesses to each i16-sized FrameIndex
// in the function and rewrites the top-K hottest ones to use IMG8..15
// DP slots ($C0/$C2/.../$CE) instead. K = number of free IMG8..15
// slots (slots not already used by regalloc decisions).
//
// Why post-RA: at this point regalloc has decided which vregs live in
// physical registers vs spill slots. The spills appear as the FI
// pseudo-opcodes (LDAfi/STAfi/ADCfi/SBCfi/ANDfi/ORAfi/EORfi/CMPfi),
// and the MFI tells us each FI's final size. We see all the accesses
// and can safely rewrite — eliminateFrameIndex hasn't yet baked the
// offsets into SP-relative immediates.
//
// Why before W65816ImgCalleeSave: ImgCalleeSave scans the post-PromoteFi
// MIR for IMG8..15 usage and emits prologue PHA-bracketed saves +
// epilogue restores for each used slot. Our promotion introduces
// fresh IMG8..15 references that ImgCalleeSave will then auto-cover.
//
// Per-access cost change (16-bit accumulator; stack-relative and DP
// are both 2-byte encodings, so the saving is cycles only):
// STAfi → STA_DP : 5 cyc → 4 cyc (saves 1 cyc)
// LDAfi → LDA_DP : 5 cyc → 4 cyc (saves 1 cyc)
// ADCfi → ADC_DP : 5 cyc → 4 cyc (saves 1 cyc)
// Per-slot one-time overhead (added by ImgCalleeSave):
// prologue save : ~10 cyc / 6 B
// epilogue restore: ~10 cyc / 6 B
// Net win per call once more than 20 promoted accesses execute. The
// static threshold used below is 5 accesses, which is under that
// break-even unless the accesses repeat inside a loop.
//
// Restrictions:
// - Only i16-sized FIs (2 bytes, offset 0). Larger slots (i32 halves,
// structs) are skipped.
// - Skips fixed/variable-sized objects.
// - Skips STA8fi (byte store needs SEP/REP wrap incompatible with
// simple STA_DP — and DP stores 16 bits in M=0).
// - Skips LDAfi_indY / STAfi_indY (indirect-Y form — different
// addressing).
//
//===---------------------------------------------------------------------===//
#include "W65816.h"
#include "W65816InstrInfo.h"
#include "W65816Subtarget.h"
#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/Support/Debug.h"
#include "llvm/Support/Format.h"
#include <algorithm>
using namespace llvm;
#define DEBUG_TYPE "w65816-promote-fi-to-img"
namespace {
class W65816PromoteFiToImg : public MachineFunctionPass {
public:
static char ID;
W65816PromoteFiToImg() : MachineFunctionPass(ID) {}
StringRef getPassName() const override {
return "W65816 promote FrameIndex to IMG8..15 DP slot";
}
bool runOnMachineFunction(MachineFunction &MF) override;
};
} // namespace
char W65816PromoteFiToImg::ID = 0;
INITIALIZE_PASS(W65816PromoteFiToImg, DEBUG_TYPE,
"W65816 promote FI to IMG", false, false)
FunctionPass *llvm::createW65816PromoteFiToImg() {
return new W65816PromoteFiToImg();
}
// Returns the operand index of the FrameIndex for the given FI pseudo
// opcode, or -1 if this opcode isn't a promotable FI carrier.
static int getFiOperandIdx(unsigned Opc) {
switch (Opc) {
case W65816::LDAfi: return 1;
case W65816::STAfi: return 1;
case W65816::CMPfi: return 1;
case W65816::ADCfi:
case W65816::SBCfi:
case W65816::ANDfi:
case W65816::ORAfi:
case W65816::EORfi: return 2;
default: return -1;
}
}
// Map a promotable FI pseudo to the corresponding DP MC opcode.
static unsigned getDpOpcode(unsigned Opc) {
switch (Opc) {
case W65816::LDAfi: return W65816::LDA_DP;
case W65816::STAfi: return W65816::STA_DP;
case W65816::CMPfi: return W65816::CMP_DP;
case W65816::ADCfi: return W65816::ADC_DP;
case W65816::SBCfi: return W65816::SBC_DP;
case W65816::ANDfi: return W65816::AND_DP;
case W65816::ORAfi: return W65816::ORA_DP;
case W65816::EORfi: return W65816::EOR_DP;
default: return 0;
}
}
// IMG8..IMG15 sit at DP addresses 0xC0, 0xC2, ..., 0xCE. IMG0..IMG7
// are at 0xD0..0xDE. Returns the DP byte for IMGn.
static uint8_t dpAddrForImg(unsigned ImgIdx) {
assert(ImgIdx < 16 && "IMG index out of range");
if (ImgIdx < 8) return 0xD0 + 2 * ImgIdx;
return 0xC0 + 2 * (ImgIdx - 8);
}
bool W65816PromoteFiToImg::runOnMachineFunction(MachineFunction &MF) {
// DISABLED: pass produces verifier errors ("Using an undefined physical
// register") on the kill-flag bookkeeping when an STAfi with `killed $a`
// is rewritten to STA_DP — the next i16-imm ADC/ADCE sees $a as dead.
// Also, for the FUNCTIONS where it would land (no-call, high-traffic
// slots), measured static + dynamic savings were modest and didn't
// justify the bookkeeping complexity. Re-enable after:
// - tightening kill-flag preservation: only carry kill if the same
// operand will be the last user in the new MI (which depends on
// post-rewrite scheduling — needs careful liveness re-analysis).
// - paired-PHI promotion: when fi#A is a PHI-input and fi#B is the
// matching PHI-output, map them to the SAME IMG slot so the
// PHI move collapses to a no-op (where most of the dynamic win
// would come from).
return false;
if (skipFunction(MF.getFunction())) return false;
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
const W65816InstrInfo *TII = STI.getInstrInfo();
MachineFrameInfo &MFI = MF.getFrameInfo();
// 1. Walk all instructions, count FI accesses for promotable opcodes.
DenseMap<int, unsigned> AccessCount;
DenseMap<int, SmallVector<MachineInstr *, 8>> AccessSites;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
int FiIdx = getFiOperandIdx(MI.getOpcode());
if (FiIdx < 0) continue;
const MachineOperand &MO = MI.getOperand(FiIdx);
if (!MO.isFI()) continue;
int FI = MO.getIndex();
// Require: 2-byte size, fixed (not variable), offset operand == 0.
// The offset operand sits right after the FI operand.
if (MFI.isVariableSizedObjectIndex(FI)) continue;
if (MFI.getObjectSize(FI) != 2) continue;
// Fixed (negative-index) slots are arg slots — leave them alone.
// Promotion would break LowerFormalArguments's expected layout.
if (FI < 0) continue;
const MachineOperand &OffMO = MI.getOperand(FiIdx + 1);
if (!OffMO.isImm() || OffMO.getImm() != 0) continue;
AccessCount[FI]++;
AccessSites[FI].push_back(&MI);
}
}
if (AccessCount.empty()) return false;
// 2. Determine which IMG8..15 slots are already in use.
BitVector UsedImg(8, false);
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
for (const MachineOperand &MO : MI.operands()) {
if (!MO.isReg() || !MO.getReg().isPhysical()) continue;
Register R = MO.getReg();
// IMG8..15 are not numerically contiguous with each other in
// the W65816 register enum (subreg-pair regs sit between
// IMG indices). Spell them out explicitly.
unsigned ImgIdx = 16; // "not an IMG8..15"
if (R == W65816::IMG8) ImgIdx = 0;
else if (R == W65816::IMG9) ImgIdx = 1;
else if (R == W65816::IMG10) ImgIdx = 2;
else if (R == W65816::IMG11) ImgIdx = 3;
else if (R == W65816::IMG12) ImgIdx = 4;
else if (R == W65816::IMG13) ImgIdx = 5;
else if (R == W65816::IMG14) ImgIdx = 6;
else if (R == W65816::IMG15) ImgIdx = 7;
if (ImgIdx < 8) UsedImg.set(ImgIdx);
}
}
}
// 3. Sort FIs by access count (descending).
SmallVector<int, 16> Ordered;
for (auto &P : AccessCount) Ordered.push_back(P.first);
std::sort(Ordered.begin(), Ordered.end(),
[&](int A, int B) { return AccessCount[A] > AccessCount[B]; });
// 4. Assign IMG slots greedily. Each IMG8..15 slot used triggers
// a save/restore pair in W65816ImgCalleeSave (~20 cyc + ~12 B
// per slot per CALL into this function). For recursive or
// deep-call-stack functions, that overhead dominates the per-
// access savings — measured: promoting 4 slots in fib(10)
// regressed it 38% (12617 → 17391 cyc). Gate on a very high
// threshold + bail entirely if the function has any calls (the
// save/restore cost compounds with recursion / call frequency
// in ways the static access count can't capture).
bool HasCalls = false;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
if (MI.isCall()) { HasCalls = true; break; }
}
if (HasCalls) break;
}
const unsigned kAccessThreshold = HasCalls ? 999999u : 5u;
DenseMap<int, unsigned> FiToImgIdx;
unsigned NextFreeImg = 0;
for (int FI : Ordered) {
if (AccessCount[FI] < kAccessThreshold) break;
while (NextFreeImg < 8 && UsedImg.test(NextFreeImg)) ++NextFreeImg;
if (NextFreeImg >= 8) break;
FiToImgIdx[FI] = NextFreeImg + 8; // Map to IMG8..15
++NextFreeImg;
}
if (FiToImgIdx.empty()) return false;
// 5. Rewrite each access. Insert the new DP MC inst before the
// pseudo, then erase the pseudo. Preserve flags and tied-def
// semantics via implicit operands.
bool Changed = false;
for (auto &P : FiToImgIdx) {
int FI = P.first;
unsigned ImgIdx = P.second;
uint8_t DpAddr = dpAddrForImg(ImgIdx);
LLVM_DEBUG(dbgs() << "Promote fi#" << FI << " -> IMG"
<< ImgIdx << " ($" << format("%02x", DpAddr)
<< "), " << AccessCount[FI] << " accesses\n");
for (MachineInstr *MI : AccessSites[FI]) {
unsigned Opc = MI->getOpcode();
unsigned NewOpc = getDpOpcode(Opc);
if (!NewOpc) continue;
MachineBasicBlock *MBB = MI->getParent();
DebugLoc DL = MI->getDebugLoc();
MachineInstrBuilder NewMI =
BuildMI(*MBB, MI, DL, TII->get(NewOpc)).addImm(DpAddr);
// Carry implicit-def $a (LDA/ADC/SBC/AND/ORA/EOR all write $a)
// and implicit-use $a (STA/CMP/ADC/SBC/AND/ORA/EOR all read $a).
// ADCfi/SBCfi additionally use $p; their DP equivalents read $p
// implicitly via the tablegen Defs/Uses. But since we built the
// new MI from TII->get(NewOpc), the implicit operands from the
// descriptor are auto-added. We only need to copy non-FI explicit
// operands... which for our pseudos are register operands. The
// physical register defs/uses they carry must be preserved.
for (const MachineOperand &MO : MI->operands()) {
if (MO.isReg() && MO.getReg().isPhysical() && MO.isImplicit()) {
// Skip — already added by descriptor.
continue;
}
if (MO.isReg() && MO.getReg().isPhysical() && !MO.isImplicit()) {
// Explicit physreg operand (e.g., the $a in STAfi $a, fi, 0).
// Convert to implicit so the DP MC inst's descriptor matches.
RegState Flags = MO.isDef() ? RegState::ImplicitDefine
: RegState::Implicit;
if (MO.isKill()) Flags = Flags | RegState::Kill;
NewMI.addReg(MO.getReg(), Flags);
}
// FI/offset operands are skipped — replaced by the DP imm above.
// VReg defs/uses should be gone post-RA; if any survived, skip.
}
MI->eraseFromParent();
Changed = true;
}
// The FI is deliberately left allocated. A 2-byte unused slot is
// cheap, and marking it dead this late (for example via
// MachineFrameInfo::RemoveStackObject) risks exposing PEI bugs.
}
return Changed;
}
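
A rough sketch of the break-even arithmetic behind the access threshold, using the cycle figures quoted in the file header as assumptions (1 cycle saved per promoted access, about 20 cycles of save/restore per promoted slot per call). `netCyclesSaved` and the two constants are hypothetical names; `dpAddrForImg` mirrors the helper in the file so the slot layout can be checked in isolation.

```cpp
#include <cassert>
#include <cstdint>

// Assumed figures, taken from the comments in the pass header.
constexpr int kCyclesSavedPerAccess = 1;  // stack-rel (5 cyc) -> DP (4 cyc)
constexpr int kSaveRestoreCycles = 20;    // ~10 prologue + ~10 epilogue

// Net cycles saved per call for one promoted slot, given how many
// promoted accesses actually execute during that call.
int netCyclesSaved(int dynamicAccesses) {
  return dynamicAccesses * kCyclesSavedPerAccess - kSaveRestoreCycles;
}

// DP address of IMGn: IMG0..7 at 0xD0..0xDE, IMG8..15 at 0xC0..0xCE.
uint8_t dpAddrForImg(unsigned n) {
  assert(n < 16 && "IMG index out of range");
  return static_cast<uint8_t>(n < 8 ? 0xD0 + 2 * n : 0xC0 + 2 * (n - 8));
}
```

Under these assumptions, five accesses executed once each lose 15 cycles per call, and the promotion breaks even at 20 executed accesses, which is why the static count only approximates the real payoff.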

@@ -41,6 +41,7 @@
#include "W65816InstrInfo.h"
#include "W65816Subtarget.h"
#include "llvm/ADT/SmallSet.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
@@ -433,8 +434,22 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
auto isLdaSR = [](const MachineInstr &MI) {
return MI.getOpcode() == W65816::LDA_StackRel;
};
// Accept LDA_Imm16 (MC) AND LDAi16imm (pseudo) inside the wrap —
// both are flag-clobbering A-loads of a 16-bit immediate, with
// no stack-rel offset to bump-undo and no memory operand to
// alias-check against the gap. Common in init blocks: `lda #0 ;
// sta slot,s` wrapped around the loop pre-test. Some functions
// still carry the pseudo LDAi16imm at SepRepCleanup time (post-RA
// pseudo expansion didn't lower it), so accept both spellings.
auto isImmLoad = [](const MachineInstr &MI) {
unsigned O = MI.getOpcode();
return O == W65816::LDA_Imm16 || O == W65816::LDAi16imm;
};
auto isFlagPreservingMem = [&](const MachineInstr &MI) {
return isStaLike(MI) || isLdaSR(MI);
return isStaLike(MI) || isLdaSR(MI) || isImmLoad(MI);
};
auto isLdaCount = [&](const MachineInstr &MI) {
return isLdaSR(MI) || isImmLoad(MI);
};
auto It = MBB.begin();
while (It != MBB.end()) {
@@ -450,8 +465,11 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
if (Walker->isDebugInstr()) { ++Walker; continue; }
if (Walker->getOpcode() == W65816::PLP) break;
if (!isFlagPreservingMem(*Walker)) { ok = false; break; }
// Track slots so we can check the gap below.
if (Walker->getNumOperands() >= 1 && Walker->getOperand(0).isImm()) {
// Track stack-rel slots so we can check the gap below.
// Immediate loads have no stack-rel addr — skip.
if (!isImmLoad(*Walker) &&
Walker->getNumOperands() >= 1 &&
Walker->getOperand(0).isImm()) {
int64_t off = Walker->getOperand(0).getImm();
if (isLdaSR(*Walker)) ReadSlots.insert(off);
else WriteSlots.insert(off);
@@ -483,11 +501,23 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
// it earlier would lose the value.
unsigned NLda = 0, NSta = 0;
for (MachineInstr *MI : Block) {
if (isLdaSR(*MI)) ++NLda;
if (isLdaCount(*MI)) ++NLda;
else if (isStaLike(*MI)) ++NSta;
}
NSta += Trailing.size();
if (NLda != NSta) { ++It; continue; }
// Even with paired LDA-STA, the LAST LDA's $a value can still
// be consumed downstream — by a successor's first STA — making
// it a fall-through register-PHI. If $a is live-out at MBB
// end (any successor has $a as live-in), bail. Caught by
// sumTable, where `lda #0` (wrap) feeds A into bb.2's `sta 0x1,
// s`, with `sta 0x9, s` (trailing) just happening to also store
// the same A — the pair count balances but A is still live-out.
bool aLiveOut = false;
for (MachineBasicBlock *Succ : MBB.successors()) {
if (Succ->isLiveIn(W65816::A)) { aLiveOut = true; break; }
}
if (aLiveOut) { ++It; continue; }
// Walk backward from PHP to find the hoist insertion point.
// The hoisted block clobbers $a and $p (LDA writes both).
// Skip insts that USE $a (consumer of an earlier $a producer)
@@ -880,5 +910,362 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
++It2;
}
}
// Store forwarding (disabled — CRC32 regressed and I couldn't
// nail down the safety hole in time). Even with PHP-wrap guards
// and SP-modifier bails, the first fire (in memmove) silently
// miscompiles something that CRC32 later depends on. Pattern
// is sound; safety analysis isn't complete. See
// feedback_close_gap_attempts_round2.md for details.
#if 0
// Store forwarding for PHI memory copies. Pattern (sumSquares
// loop body):
//
// STA X,s ; A → slot X (some intermediate result)
// [code that modifies A but doesn't touch slot X or slot Y]
// LDA X,s ; reload A from slot X
// STA Y,s ; A → slot Y (the PHI copy)
//
// Transform: insert `STA Y,s` right after the first `STA X,s` (A
// still holds the same value at that point), then drop the LDA-
// STA pair. Net: -1 inst per pattern occurrence.
//
// Safety constraints (all between STA X and the LDA-STA pair, in
// the same MBB, in straight-line code):
// - No instruction writes slot X (else the LDA would see a
// different value than the original STA).
// - No instruction reads OR writes slot Y. A read would observe
// the early STA Y instead of Y's old value; a write would
// clobber the early STA Y, whereas in the original the later
// STA Y lands after that write and wins.
// - No call / inline asm / branch (conservatively: those can
// touch memory we don't model).
{
auto isStackRelMC2 = [](unsigned Op) {
return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
};
auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) -> bool {
if (!isStackRelMC2(MI.getOpcode())) return false;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
Off = MI.getOperand(0).getImm();
return true;
};
auto isStaSr = [](const MachineInstr &MI) {
return MI.getOpcode() == W65816::STA_StackRel;
};
auto isLdaSr = [](const MachineInstr &MI) {
return MI.getOpcode() == W65816::LDA_StackRel;
};
SmallVector<MachineInstr *, 4> ToErase;
SmallVector<std::tuple<MachineInstr *, int64_t>, 4> ToInsert;
static int g_fireLimit = -1;
static int g_fireCount = 0;
static bool initd = false;
if (!initd) {
if (const char *e = getenv("STORE_FWD_LIMIT")) g_fireLimit = atoi(e);
initd = true;
}
for (MachineBasicBlock &MBB : MF) {
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (!isStaSr(*It)) continue;
int64_t X;
if (!srAccess2(*It, X)) continue;
MachineInstr *StaX = &*It;
// Check if StaX is INSIDE an open PHP/PLP wrap. In that case
// its operand offset has been pre-bumped by +1, and inserting
// a sibling STA Y immediately after writes at the WRONG slot
// (the un-bumped Y). Walk backward: if we find a PHP without
// a matching PLP first, bail.
{
bool insideWrap = false;
int depth = 0;
auto B = It;
while (B != MBB.begin()) {
--B;
if (B->getOpcode() == W65816::PLP) depth++;
else if (B->getOpcode() == W65816::PHP) {
if (depth > 0) depth--;
else { insideWrap = true; break; }
}
}
if (insideWrap) continue;
}
// Walk forward looking for LDA X ; STA Y. Conservative bail
// on any non-tracked memory op (indirect pointer access,
// DP/abs ops, etc.) which could alias slot Y via memory.
bool ok = true;
int64_t Y = -1;
MachineInstr *LdaX = nullptr;
MachineInstr *StaY = nullptr;
for (auto Walker = std::next(It); Walker != MBB.end(); ++Walker) {
if (Walker->isDebugInstr()) continue;
if (Walker->isCall() || Walker->isInlineAsm() ||
Walker->isBranch() || Walker->isReturn()) {
ok = false; break;
}
// Found LDA X?
int64_t Off;
if (isLdaSr(*Walker) && srAccess2(*Walker, Off) && Off == X) {
LdaX = &*Walker;
auto Next = std::next(Walker);
while (Next != MBB.end() && Next->isDebugInstr()) ++Next;
if (Next == MBB.end() || !isStaSr(*Next) ||
!srAccess2(*Next, Y) || Y == X) {
ok = false;
} else {
StaY = &*Next;
}
break;
}
// Stack-rel access to X (write or read): bail.
if (srAccess2(*Walker, Off) && Off == X) {
ok = false; break;
}
// Any memory-touching op that's NOT a tracked stack-rel
// access — bail. Indirect pointer stores/loads (DPIndY /
// DPIndLong / abs / etc.) could alias slot Y via a pointer
// we can't trace, and the safety check below would miss it.
if ((Walker->mayLoad() || Walker->mayStore()) &&
!isStackRelMC2(Walker->getOpcode())) {
ok = false; break;
}
// SP-modifying ops shift the stack-rel addressing window —
// a later `lda X, s` reads a DIFFERENT byte than the earlier
// `sta X, s` (or worse, the new stack pointer points into
// saved P/retaddr). Bail on TCS (direct SP write) and on
// any stack push/pop (PHx/PLx/PEA/PEI). Also bail
// on PHP/PLP because the wrap pass already bumped in-wrap
// stack-rel ops by +1 — our inserted STA after STA X writes
// at the un-bumped offset which gets the WRONG slot.
{
unsigned WO = Walker->getOpcode();
if (WO == W65816::TCS || WO == W65816::PHA ||
WO == W65816::PLA || WO == W65816::PHX ||
WO == W65816::PLX || WO == W65816::PHY ||
WO == W65816::PLY || WO == W65816::PHP ||
WO == W65816::PLP || WO == W65816::PHB ||
WO == W65816::PLB || WO == W65816::PHD ||
WO == W65816::PLD || WO == W65816::PHK ||
WO == W65816::PEA || WO == W65816::PEI_DP) {
ok = false; break;
}
}
}
if (!ok || !LdaX || !StaY) continue;
// Re-walk from std::next(It) up to LdaX and verify no access to
// slot Y in that gap. Do this BEFORE counting the fire, so that
// STORE_FWD_LIMIT bisection only counts transforms that are
// actually applied.
for (auto W2 = std::next(It); W2 != LdaX->getIterator(); ++W2) {
if (W2->isDebugInstr()) continue;
int64_t Off;
if (srAccess2(*W2, Off) && Off == Y) { ok = false; break; }
}
if (!ok) continue;
if (g_fireLimit >= 0 && g_fireCount >= g_fireLimit) continue;
g_fireCount++;
errs() << "SF FIRE " << g_fireCount << " in " << MF.getName()
<< " MBB " << MBB.getNumber()
<< " X=" << X << " Y=" << StaY->getOperand(0).getImm()
<< "\n";
// Safe to apply: schedule the StaY-after-StaX insert, and
// erase LdaX and StaY.
ToInsert.push_back({StaX, Y});
ToErase.push_back(LdaX);
ToErase.push_back(StaY);
Changed = true;
}
}
// Apply (insertions first; iterators stay valid through erase).
for (auto &P : ToInsert) {
MachineInstr *StaX = std::get<0>(P);
int64_t Y = std::get<1>(P);
MachineBasicBlock *MBB = StaX->getParent();
DebugLoc DL = StaX->getDebugLoc();
auto NextIt = std::next(StaX->getIterator());
BuildMI(*MBB, NextIt, DL, TII.get(W65816::STA_StackRel))
.addImm(Y);
}
for (MachineInstr *MI : ToErase) MI->eraseFromParent();
}
#endif
// (Redundant CMP #0 elimination — disabled, hit VLA sum_n
// regression. Carry-flag bookkeeping across the CMP turned out to
// have more cases than my forward-walk modeled. See
// feedback_cmp_zero_elim.md.)
#if 0
{
auto isNZSetOnA = [](unsigned Op) {
switch (Op) {
case W65816::DEA_PSEUDO: case W65816::INA_PSEUDO:
case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
case W65816::AND_StackRel: case W65816::AND_DP: case W65816::AND_Imm16:
case W65816::ORA_StackRel: case W65816::ORA_DP: case W65816::ORA_Imm16:
case W65816::EOR_StackRel: case W65816::EOR_DP: case W65816::EOR_Imm16:
case W65816::LDA_StackRel: case W65816::LDA_DP:
case W65816::LDAi16imm: case W65816::LDA_Imm16:
case W65816::TXA: case W65816::TYA:
case W65816::ADCi16imm: case W65816::ADCEi16imm:
case W65816::SBCi16imm: case W65816::SBCEi16imm:
return true;
default:
return false;
}
};
auto isCmpZero = [](const MachineInstr &MI) {
if (MI.getOpcode() != W65816::CMPi16imm) return false;
// Operand layout: lhs (Acc16), imm. Find the imm.
for (const MachineOperand &MO : MI.operands()) {
if (MO.isImm()) return MO.getImm() == 0;
}
return false;
};
auto modifiesA = [](const MachineInstr &MI) {
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
return true;
}
return false;
};
auto readsC = [](const MachineInstr &MI) {
// We don't model individual flag bits; approximate with an opcode
// list of carry-consuming ops (the $p use operand is not checked).
unsigned Op = MI.getOpcode();
switch (Op) {
case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
case W65816::ADCEi16imm: case W65816::SBCEi16imm:
case W65816::BCC: case W65816::BCS:
case W65816::ROL_A: case W65816::ROR_A:
return true;
default:
return false;
}
};
SmallVector<MachineInstr *, 4> CmpsToErase;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
if (!isCmpZero(MI)) continue;
// Walk backward, skipping flag-preserving instructions.
bool foundProducer = false;
auto Back = MI.getIterator();
while (Back != MBB.begin()) {
--Back;
if (Back->isDebugInstr()) continue;
if (Back->isCall() || Back->isInlineAsm()) break;
if (modifiesA(*Back)) {
foundProducer = isNZSetOnA(Back->getOpcode());
break;
}
bool defsP = false;
for (const MachineOperand &MO : Back->operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) {
defsP = true; break;
}
}
if (defsP) break;
}
if (!foundProducer) continue;
// Walk FORWARD from CMP: until the next C-defining MI, no MI
// reads C.
bool cConsumed = false;
for (auto Fwd = std::next(MI.getIterator()); Fwd != MBB.end(); ++Fwd) {
if (Fwd->isDebugInstr()) continue;
if (readsC(*Fwd)) { cConsumed = true; break; }
// Next def of $p: subsequent reads aren't ours.
bool defsP = false;
for (const MachineOperand &MO : Fwd->operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) {
defsP = true; break;
}
}
if (defsP) break;
}
if (cConsumed) continue;
CmpsToErase.push_back(&MI);
}
}
for (MachineInstr *MI : CmpsToErase) MI->eraseFromParent();
if (!CmpsToErase.empty()) Changed = true;
}
#endif
// (Narrow PHI-copy slot collapse — disabled, qsort regression.)
#if 0
{
auto isStackRelMC2 = [](unsigned Op) {
return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
};
auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) {
if (!isStackRelMC2(MI.getOpcode())) return false;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
Off = MI.getOperand(0).getImm();
return true;
};
DenseMap<int64_t, unsigned> Refs;
DenseMap<int64_t, MachineInstr *> StaInst, LdaInst;
DenseMap<int64_t, unsigned> NSta, NLda;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
int64_t Off;
if (!srAccess2(MI, Off)) continue;
Refs[Off]++;
if (MI.getOpcode() == W65816::STA_StackRel) {
NSta[Off]++; StaInst[Off] = &MI;
} else if (MI.getOpcode() == W65816::LDA_StackRel) {
NLda[Off]++; LdaInst[Off] = &MI;
}
}
}
SmallVector<MachineInstr *, 4> ToErase;
for (auto &P : Refs) {
int64_t X = P.first;
if (P.second != 2) continue; // exactly 2 references
if (NSta[X] != 1 || NLda[X] != 1) continue;
MachineInstr *Sta = StaInst[X];
MachineInstr *Lda = LdaInst[X];
if (Sta->getParent() != Lda->getParent()) continue;
MachineBasicBlock *MBB = Sta->getParent();
// Sta must be before Lda.
bool staBefore = false;
for (auto It = MBB->begin(); It != MBB->end(); ++It) {
if (&*It == Sta) { staBefore = true; break; }
if (&*It == Lda) break;
}
if (!staBefore) continue;
// Next after Lda must be STA Y where Y != X.
auto NextIt = std::next(Lda->getIterator());
while (NextIt != MBB->end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB->end()) continue;
int64_t Y;
if (NextIt->getOpcode() != W65816::STA_StackRel ||
!srAccess2(*NextIt, Y) || Y == X) continue;
// Between Sta and Lda, no read/write of slot Y, no call, no
// anything that would re-set slot Y's value mid-flight.
bool ok = true;
for (auto It = std::next(Sta->getIterator()); It != Lda->getIterator();
++It) {
if (It->isDebugInstr()) continue;
if (It->isCall() || It->isInlineAsm()) { ok = false; break; }
int64_t Off;
if (srAccess2(*It, Off) && Off == Y) { ok = false; break; }
}
if (!ok) continue;
// Redirect the original STA to write to Y; delete the LDA-STA pair.
Sta->getOperand(0).setImm(Y);
ToErase.push_back(Lda);
ToErase.push_back(&*NextIt);
Changed = true;
}
for (MachineInstr *MI : ToErase) MI->eraseFromParent();
}
#endif
return Changed;
}
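
The disabled store-forwarding block above can be pictured with a toy model. The sketch below is not the pass itself: `Op`, `Inst`, `run` and `forwardStore` are hypothetical names, the instruction set has three made-up ops, and flags, calls and SP changes are not modelled at all. It only shows the slot-level claim from the comment: moving `STA Y` up next to `STA X` and dropping `LDA X ; STA Y` leaves the slots unchanged when nothing in the gap touches X or Y.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy model only: three hypothetical ops, no flags, no real W65816 opcodes.
enum class Op { LdaImm, LdaSlot, StaSlot };
struct Inst {
  Op op;
  int64_t arg; // immediate for LdaImm, slot offset otherwise
};
using Slots = std::map<int64_t, int64_t>;

// Interpret a straight-line sequence and return the final slot contents.
Slots run(const std::vector<Inst> &code) {
  int64_t a = 0;
  Slots slots;
  for (const Inst &i : code) {
    switch (i.op) {
    case Op::LdaImm:
      a = i.arg;
      break;
    case Op::LdaSlot: {
      auto it = slots.find(i.arg); // unwritten slots read as 0
      a = (it == slots.end()) ? 0 : it->second;
      break;
    }
    case Op::StaSlot:
      slots[i.arg] = a;
      break;
    }
  }
  return slots;
}

// True if the instruction reads or writes the given slot.
static bool touches(const Inst &i, int64_t slot) {
  return i.op != Op::LdaImm && i.arg == slot;
}

// Apply the first legal forwarding; return false if none is found.
bool forwardStore(std::vector<Inst> &code) {
  auto at = [&](std::size_t i) {
    return code.begin() + static_cast<std::ptrdiff_t>(i);
  };
  for (std::size_t s = 0; s < code.size(); ++s) {
    if (code[s].op != Op::StaSlot) continue;
    const int64_t x = code[s].arg;
    for (std::size_t l = s + 1; l + 1 < code.size(); ++l) {
      if (!touches(code[l], x)) continue;
      // First access to X after the store: it must be `LDA X ; STA Y`.
      if (code[l].op != Op::LdaSlot || code[l + 1].op != Op::StaSlot ||
          code[l + 1].arg == x)
        break;
      const int64_t y = code[l + 1].arg;
      bool gapOk = true; // no read or write of Y between STA X and LDA X
      for (std::size_t g = s + 1; g < l; ++g)
        if (touches(code[g], y)) gapOk = false;
      if (!gapOk) break;
      code.erase(at(l), at(l + 2));                  // drop LDA X ; STA Y
      code.insert(at(s + 1), Inst{Op::StaSlot, y});  // early STA Y
      return true;
    }
  }
  return false;
}
```

Note that `run` compares slot contents only. After the rewrite the accumulator left behind at the end of the sequence is whatever the gap produced, not the reloaded slot X value, so a real pass would also have to account for later readers of A and of the flags.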

View file

@ -1492,6 +1492,14 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
}
return false;
};
// Pass 1c can only eliminate CMPi16imm $a, 0 if the preceding
// A-modifier reliably sets N/Z to reflect A's final value. LDAfi
// under FP-rel expansion (`sty $fa ; ldy #imm ; lda [$f6],y ; ldy $fa`)
// ends with `ldy` that clobbers N/Z based on OLD Y, not loaded A — so
// in FP-rel functions (VLA / huge frame), the CMP is load-bearing.
// Skip the whole pass for such functions (saves us from the sum_n
// VLA regression that the PHP-wrap-aware variant tripped).
bool ssCleanupSPRelOnly = !UsesFPRel;
for (MachineBasicBlock &MBB : MF) {
SmallVector<MachineInstr *, 8> Cmps;
for (MachineInstr &MI : MBB)
@ -1516,10 +1524,27 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
// condition). Caused __adddf3's renormalize while-loop to
// skip its body even though `mr & ~mask` was non-zero.
bool SafeToErase = true;
bool insidePHPWrap = false;
for (auto It = std::next(Cmp->getIterator());
It != Cmp->getParent()->end(); ++It) {
if (It->isDebugInstr()) continue;
if (It->isBranch() || It->isReturn()) break;
// PHP/PLP-wrap-aware: only safe when LDAfi-expansion sets N/Z
// reliably (SP-rel functions, not FP-rel).
if (ssCleanupSPRelOnly && It->getOpcode() == W65816::PHP) {
// PHP must be IMMEDIATELY after CMP to capture CMP's flags.
if (&*It != &*std::next(Cmp->getIterator())) {
SafeToErase = false;
break;
}
insidePHPWrap = true;
continue;
}
if (It->getOpcode() == W65816::PLP) {
insidePHPWrap = false;
continue;
}
if (insidePHPWrap) continue;
if (It->getOpcode() == TargetOpcode::COPY) {
SafeToErase = false;
break;
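
The FP-rel restriction described in the Pass 1c comment above comes down to which register the N/Z flags describe after the load sequence. The following is a minimal sketch under stated assumptions: `Cpu`, `zAfterSpRelLoad` and `zAfterFpRelLoad` are hypothetical names, only N/Z are modelled (carry is ignored), and the `ldy #4` offset is an arbitrary placeholder.

```cpp
#include <cassert>
#include <cstdint>

// Toy 16-bit register model: only A, Y and the N/Z flags.
struct Cpu {
  uint16_t a = 0, y = 0;
  bool n = false, z = false;
  void setNZ(uint16_t v) {
    n = (v & 0x8000) != 0;
    z = (v == 0);
  }
  void lda(uint16_t v) { a = v; setNZ(a); }  // LDA sets N/Z from A
  void ldy(uint16_t v) { y = v; setNZ(y); }  // LDY sets N/Z from Y
  void cmpImm(uint16_t v) { setNZ(static_cast<uint16_t>(a - v)); }
};

// SP-rel style load: a single LDA, so Z describes the loaded value.
bool zAfterSpRelLoad(uint16_t loaded) {
  Cpu c;
  c.lda(loaded);
  return c.z;
}

// FP-rel style load: sty tmp ; ldy #imm ; lda [ptr],y ; ldy tmp.
// The trailing LDY leaves N/Z describing the restored Y, not A.
bool zAfterFpRelLoad(uint16_t loaded, uint16_t savedY, bool keepCmp) {
  Cpu c;
  c.y = savedY;
  uint16_t tmp = c.y;        // sty tmp (STY does not touch flags)
  c.ldy(4);                  // ldy #imm (placeholder offset)
  c.lda(loaded);             // lda [ptr],y
  c.ldy(tmp);                // ldy tmp: N/Z now reflect old Y
  if (keepCmp) c.cmpImm(0);  // the CMP #0 that Pass 1c wants to erase
  return c.z;
}
```

With `keepCmp` false the Z flag follows `savedY`, which is why the comparison has to stay in functions that use the FP-rel expansion.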

View file
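
The new file in the next hunk merges two stack slots that are live at the same time but always hold the same value. As a rough illustration of why that is semantics-preserving for the sumSquares shape, here is a source-level sketch; `sumSquaresTwoSlots` and `sumSquaresMerged` are hypothetical names, the slots are modelled as plain 32-bit variables rather than 16-bit halves, and the loop bounds are an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Before the merge: X receives the new total, Y is the PHI destination.
uint32_t sumSquaresTwoSlots(uint32_t n) {
  uint32_t slotX = 0;  // entry:     lda #0 ; sta X
  uint32_t slotY = 0;  // preheader: lda #0 ; sta Y
  for (uint32_t i = 1; i <= n; ++i) {
    slotX = slotY + i * i;  // ... ADC Y ; STA X
    slotY = slotX;          // PHI copy: LDA X ; STA Y
  }
  return slotX;  // exit reads X only
}

// After renaming X to Y: the PHI copy becomes a self-copy and disappears.
uint32_t sumSquaresMerged(uint32_t n) {
  uint32_t slot = 0;  // both constant inits now write the same slot
  for (uint32_t i = 1; i <= n; ++i) {
    slot = slot + i * i;
  }
  return slot;
}
```

Both functions return the same value for every `n` because X and Y are equal at every point where either is read, which is the property the pass's twin check tries to establish conservatively.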

@ -0,0 +1,733 @@
//===-- W65816StackSlotMerge.cpp - Merge value-equivalent stack slots ----===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===---------------------------------------------------------------------===//
//
// Pre-emit pass that runs after PEI (eliminateFrameIndex) and merges
// pairs of stack-rel slots that hold the same value at every observable
// program point — typically the PHI src/dst pair PHI-elim leaves at
// the back-edge of a loop body.
//
// LLVM's StackSlotColoring merges slots with non-overlapping liveness.
// It can't merge slots that are simultaneously live but happen to hold
// the same value (which is what a PHI memory-copy creates). This pass
// catches that case via a stricter "value equivalence" check.
//
// Canonical pattern (sumSquares loop body):
//
// .LBB0_4:
// LDA 0x7, s ; PHA ; JSL __umulhisi3 ; PLY
// CLC ; ADC 0x3, s ; STA 0xb, s ; new total.lo (write X)
// TXA ; ADC 0x1, s ; STA 0x9, s
// LDA 0x7, s ; INC A ; STA 0x7, s
// LDA 0xb, s ; STA 0x3, s ; PHI copy: load X, store Y
// LDA 0x9, s ; STA 0x1, s
// ...
//
// The pair (0xb, 0x3) is the lo-half PHI memory copy. Slots 0xb and
// 0x3 always hold the same value at every read site:
// - Function entry: both initialized to 0 (`lda #0; sta 0xb, s` in
// entry, `lda #0; sta 0x3, s` in preheader).
// - Loop iteration: the PHI copy moves the new total.lo from 0xb to
// 0x3 at the end of every iteration.
// - Exit: only 0xb is read (return value), but its value equals 0x3's.
//
// Rename 0xb → 0x3 function-wide; the now self-copy `lda 0x3; sta 0x3`
// is dead and we erase it. Saves 2 inst per PHI copy occurrence (the
// memory copy round-trip). sumSquares loop body shrinks from 21 to
// 17 inst per iter.
//
// Safety check (sufficient condition for value equivalence):
// 1. Both slots have ≥1 STA in the function (skips arg slots passed
// by the caller — those have only LDA reads, no STAs, and renaming
// would change where we read the arg from).
// 2. For every STA X in the function, find a "twin" STA Y at a
// program point where the values match (cases (1)-(3) in
// findTwin below). Matching = one of:
// (a) Same MBB, no A-define between the two stores, so both
// store the same A. Covers entry's `lda #N ; sta X` if the
// same MBB also has `sta Y`.
// (b) Same MBB, the later store is fed by a reload of the
// earlier slot: STA X then later LDA X ; STA Y, with no
// write to X in between. Covers the loop-body iter-end
// pattern.
// (c) Different MBBs, both preceded by `LDA #const` of the same
// constant. Covers entry-block STA X=0 paired with
// preheader STA Y=0.
// 3. Symmetric: for every STA Y, find a twin STA X.
// 4. No "orphan" STAs. If a STA X or STA Y has no twin, bail.
//
// When all checks pass, the rename function-wide preserves semantics:
// every read of slot X at program point P sees the same value that
// slot Y holds at P (and vice versa).
//
//===---------------------------------------------------------------------===//
#include "W65816.h"
#include "W65816InstrInfo.h"
#include "W65816Subtarget.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/InitializePasses.h"
#include "llvm/Support/Debug.h"
#include <optional>
using namespace llvm;
#define DEBUG_TYPE "w65816-stack-slot-merge"
namespace {
class W65816StackSlotMerge : public MachineFunctionPass {
public:
static char ID;
W65816StackSlotMerge() : MachineFunctionPass(ID) {}
StringRef getPassName() const override {
return "W65816 merge value-equivalent stack slots (PHI-copy collapse)";
}
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<MachineDominatorTreeWrapperPass>();
AU.setPreservesCFG();
MachineFunctionPass::getAnalysisUsage(AU);
}
bool runOnMachineFunction(MachineFunction &MF) override;
};
} // namespace
char W65816StackSlotMerge::ID = 0;
INITIALIZE_PASS_BEGIN(W65816StackSlotMerge, DEBUG_TYPE,
"W65816 stack slot merge", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
INITIALIZE_PASS_END(W65816StackSlotMerge, DEBUG_TYPE,
"W65816 stack slot merge", false, false)
FunctionPass *llvm::createW65816StackSlotMerge() {
return new W65816StackSlotMerge();
}
// Stack-relative MC opcodes — the ops that survive eliminateFrameIndex
// and reference a slot via an 8-bit SP-relative offset.
static bool isStackRelOp(unsigned Op) {
return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
}
// Returns true if MI is a stack-rel op; out-param Off receives the slot
// offset (operand 0).
static bool srAccess(const MachineInstr &MI, int64_t &Off) {
if (!isStackRelOp(MI.getOpcode())) return false;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
Off = MI.getOperand(0).getImm();
return true;
}
// True if the MI semantically defines A. Covers both the explicit
// case (operand has reg=A,isDef) AND the implicit case where the
// tablegen InstDP / InstAbs / etc. base classes omit the A-Def
// annotation despite LDA semantically writing A (a backend modelling
// gap — many `LDA_DP`, `LDA_Abs`, `LDA_LongX`, etc. are missing the
// implicit-def in the MIR even though they load into A). Opcode-
// based fallback catches all of them.
static bool semanticallyDefsA(const MachineInstr &MI) {
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
return true;
}
unsigned Op = MI.getOpcode();
switch (Op) {
case W65816::LDA_DP: case W65816::LDA_DPX:
case W65816::LDA_DPInd: case W65816::LDA_DPIndY:
case W65816::LDA_DPIndX:
case W65816::LDA_Abs: case W65816::LDA_AbsX:
case W65816::LDA_AbsY: case W65816::LDA_Long:
case W65816::LDA_LongX:
case W65816::PLA:
return true;
default:
return false;
}
}
// Walk backward from MI in its MBB looking for the most recent A-define.
// Returns the MI that defines A, or nullptr if none in the same MBB.
// Skips debug instructions. Stops at the MBB boundary, calls, and
// inline asm.
static MachineInstr *findPriorADef(MachineInstr *MI) {
MachineBasicBlock *MBB = MI->getParent();
auto It = MI->getIterator();
while (It != MBB->begin()) {
--It;
if (It->isDebugInstr()) continue;
if (It->isCall() || It->isInlineAsm()) return nullptr;
if (semanticallyDefsA(*It)) return &*It;
}
return nullptr;
}
// Walk forward from `Start` (exclusive) up to (but not including) `End`
// in the same MBB, tracking whether slot `WatchSlot` is written.
// Returns true if slot `WatchSlot` is NOT written in the interval.
static bool slotNotWrittenBetween(MachineBasicBlock::iterator Start,
MachineBasicBlock::iterator End,
int64_t WatchSlot) {
for (auto It = std::next(Start); It != End; ++It) {
if (It->isDebugInstr()) continue;
int64_t Off;
if (It->getOpcode() == W65816::STA_StackRel && srAccess(*It, Off) &&
Off == WatchSlot) {
return false;
}
}
return true;
}
// Returns true if MI clobbers P (N/Z/C/V flags). Mirrors LLVM's
// operand-based check + an opcode whitelist for tablegen entries that
// omit `Defs = [P]` (InstImplied, InstStackRel, etc.).
static bool clobbersFlagsP(const MachineInstr &MI) {
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
return true;
}
if (MI.isCall() || MI.isInlineAsm()) return true;
unsigned Op = MI.getOpcode();
switch (Op) {
case W65816::PLA: case W65816::PLY: case W65816::PLX:
case W65816::PLP:
case W65816::INA: case W65816::DEA:
case W65816::INX: case W65816::DEX:
case W65816::INY: case W65816::DEY:
case W65816::TAX: case W65816::TAY:
case W65816::TYA: case W65816::TXA:
case W65816::TYX: case W65816::TXY:
case W65816::LDA_StackRel: case W65816::LDA_DP:
case W65816::LDA_DPX: case W65816::LDA_DPInd:
case W65816::LDA_DPIndY: case W65816::LDA_DPIndX:
case W65816::LDA_Abs: case W65816::LDA_AbsX:
case W65816::LDA_AbsY: case W65816::LDA_Long:
case W65816::LDA_LongX:
case W65816::ADC_StackRel: case W65816::SBC_StackRel:
case W65816::CMP_StackRel: case W65816::AND_StackRel:
case W65816::ORA_StackRel: case W65816::EOR_StackRel:
case W65816::ADC_DP: case W65816::ADC_Abs:
case W65816::SBC_DP: case W65816::SBC_Abs:
case W65816::CMP_DP: case W65816::CMP_Abs:
case W65816::AND_DP: case W65816::AND_Abs:
case W65816::ORA_DP: case W65816::ORA_Abs:
case W65816::EOR_DP: case W65816::EOR_Abs:
return true;
default:
return false;
}
}
// Returns true if MI reads P flags: conditional branches, plus any
// instruction that carries a non-def use operand on P.
static bool usesFlagsP(const MachineInstr &MI) {
if (MI.isConditionalBranch()) return true;
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isUse() &&
!MO.isDef())
return true;
}
return false;
}
// Returns the MOST RECENT A-defining MI strictly before MI in its MBB,
// skipping debug instructions. Returns nullptr if none in the same MBB.
static MachineInstr *findMostRecentADef(MachineInstr *MI) {
MachineBasicBlock *MBB = MI->getParent();
auto It = MI->getIterator();
while (It != MBB->begin()) {
--It;
if (It->isDebugInstr()) continue;
if (semanticallyDefsA(*It)) return &*It;
}
return nullptr;
}
// "Twin" check. Given a STA X at position StaX and a candidate slot Y,
// scan the function's STA Y instances and return one that's value-
// equivalent under the rules described in the header comment.
//
// Source-value equivalence cases:
// (1) Same-MBB twin store: no A-define between StaX and the candidate
// StaY → both store the same A value. Pure twin pattern.
// (2) Same-MBB PHI-copy: the candidate StaY is preceded by
// `LDA_StackRel slotX` (PHI-copy reload). Even if many A-defines
// sit between StaX and StaY, the LDA X re-establishes A =
// slot[X] = value StaX wrote (assuming slot X wasn't re-written
// in the gap).
// (3) Different MBBs, both preceded by LDA_Imm16 / LDAi16imm of the
// same constant. Covers entry/preheader init parallel pair.
static MachineInstr *findTwin(MachineInstr *StaX,
ArrayRef<MachineInstr *> StasY) {
MachineBasicBlock *MBBStaX = StaX->getParent();
int64_t XOff = StaX->getOperand(0).getImm();
// Cases (1) + (2): same MBB.
for (MachineInstr *StaY : StasY) {
if (StaY->getParent() != MBBStaX) continue;
// Determine ordering.
MachineInstr *Earlier = nullptr;
MachineInstr *Later = nullptr;
for (auto It = MBBStaX->begin(); It != MBBStaX->end(); ++It) {
if (&*It == StaX) { Earlier = StaX; Later = StaY; break; }
if (&*It == StaY) { Earlier = StaY; Later = StaX; break; }
}
if (!Earlier || !Later) continue;
int64_t EOff = Earlier->getOperand(0).getImm();
// Case (2): if Later is preceded by `LDA_StackRel <Earlier's slot>`
// (the PHI-copy reload), it's a PHI twin. Also require slot
// Earlier-slot wasn't re-written between Earlier and Later.
MachineInstr *PriorOfLater = findMostRecentADef(Later);
if (PriorOfLater) {
int64_t Off;
if (PriorOfLater->getOpcode() == W65816::LDA_StackRel &&
srAccess(*PriorOfLater, Off) && Off == EOff &&
slotNotWrittenBetween(Earlier->getIterator(),
PriorOfLater->getIterator(), EOff)) {
return StaY;
}
}
// Case (1): no A-define between Earlier and Later — same A value.
{
bool noADefs = true;
for (auto It = std::next(Earlier->getIterator());
It != Later->getIterator(); ++It) {
if (It->isDebugInstr()) continue;
if (semanticallyDefsA(*It)) { noADefs = false; break; }
}
if (noADefs) return StaY;
}
}
// Case (3): different MBBs, both preceded by LDA_Imm16 / LDAi16imm
// with the same constant.
MachineInstr *PriorX = findPriorADef(StaX);
if (!PriorX) return nullptr;
unsigned PriorXOp = PriorX->getOpcode();
if (PriorXOp != W65816::LDA_Imm16 && PriorXOp != W65816::LDAi16imm)
return nullptr;
int64_t XConst = 0;
for (const MachineOperand &MO : PriorX->operands()) {
if (MO.isImm()) { XConst = MO.getImm(); break; }
}
for (MachineInstr *StaY : StasY) {
if (StaY->getParent() == MBBStaX) continue;
MachineInstr *PriorY = findPriorADef(StaY);
if (!PriorY) continue;
if (PriorY->getOpcode() != PriorXOp) continue;
int64_t YConst = 0;
for (const MachineOperand &MO : PriorY->operands()) {
if (MO.isImm()) { YConst = MO.getImm(); break; }
}
if (XConst == YConst) return StaY;
}
(void)XOff;
return nullptr;
}
// Run Phase 6a + Phase 6 (per-MBB peepholes) — independent of rename
// logic, so they fire on every function. Returns true if anything
// changed.
static bool runPerMBBPeepholes(MachineFunction &MF) {
bool Changed = false;
// Phase 6a: redundant reload. `LDA Y, s` immediately after `STA Y, s`
// reloads the value A already holds; erase the LDA when its N/Z
// result is provably dead.
for (MachineBasicBlock &MBB : MF) {
SmallVector<MachineInstr *, 4> Dead;
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (It->isDebugInstr()) continue;
if (It->getOpcode() != W65816::STA_StackRel) continue;
int64_t StaSlot;
if (!srAccess(*It, StaSlot)) continue;
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::LDA_StackRel) continue;
int64_t LdaSlot;
if (!srAccess(*NextIt, LdaSlot)) continue;
if (StaSlot != LdaSlot) continue;
bool flagsSafe = false;
bool aIsUsedBeforeClobber = false;
for (auto Fwd = std::next(NextIt); Fwd != MBB.end(); ++Fwd) {
if (Fwd->isDebugInstr()) continue;
// Calls/JSLs that take A as arg — even though clobbersFlagsP
// returns true for them, the elimination could mis-track A's
// live-in to the call. Bail.
if (Fwd->isCall()) break;
// Generic: any instr that has `implicit $a` as a USE — A is
// live going in. Bail to avoid live-range trouble.
for (const MachineOperand &MO : Fwd->operands()) {
if (MO.isReg() && MO.getReg() == W65816::A && MO.isUse() &&
!MO.isDef()) {
aIsUsedBeforeClobber = true;
break;
}
}
if (aIsUsedBeforeClobber) break;
if (usesFlagsP(*Fwd)) break;
if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
flagsSafe = true; break;
}
if (clobbersFlagsP(*Fwd)) { flagsSafe = true; break; }
}
if (!flagsSafe) continue;
Dead.push_back(&*NextIt);
}
for (MachineInstr *MI : Dead) {
MI->eraseFromParent();
Changed = true;
}
}
// Phase 6: per-MBB redundant `LDA #K` elimination.
auto isAandPPreserving = [](const MachineInstr &MI) -> bool {
unsigned Op = MI.getOpcode();
switch (Op) {
case W65816::STA_StackRel:
case W65816::STA_DP: case W65816::STA_DPX:
case W65816::STA_DPInd: case W65816::STA_DPIndY:
case W65816::STA_DPIndX:
case W65816::STA_Abs: case W65816::STA_AbsX:
case W65816::STA_AbsY: case W65816::STA_Long:
case W65816::STA_LongX:
case W65816::STX_DP: case W65816::STX_Abs:
case W65816::STY_DP: case W65816::STY_Abs: case W65816::STY_DPX:
case W65816::STZ_DP: case W65816::STZ_Abs:
case W65816::STZ_DPX: case W65816::STZ_AbsX:
return true;
default:
break;
}
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
return false;
}
if (MI.mayStore() && !MI.mayLoad() && !semanticallyDefsA(MI))
return true;
return false;
};
auto isLdaImmK = [](const MachineInstr &MI, int64_t &K) -> bool {
unsigned Op = MI.getOpcode();
if (Op != W65816::LDA_Imm16 && Op != W65816::LDAi16imm) return false;
for (const MachineOperand &MO : MI.operands()) {
if (MO.isImm()) { K = MO.getImm(); return true; }
}
return false;
};
for (MachineBasicBlock &MBB : MF) {
std::optional<int64_t> KnownK;
SmallVector<MachineInstr *, 4> Dead;
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (It->isDebugInstr()) continue;
int64_t K;
if (isLdaImmK(*It, K)) {
if (KnownK && *KnownK == K) {
Dead.push_back(&*It);
continue;
}
KnownK = K;
continue;
}
if (isAandPPreserving(*It)) continue;
KnownK.reset();
}
for (MachineInstr *MI : Dead) {
MI->eraseFromParent();
Changed = true;
}
}
return Changed;
}
bool W65816StackSlotMerge::runOnMachineFunction(MachineFunction &MF) {
if (skipFunction(MF.getFunction())) return false;
if (MF.getFunction().hasOptNone()) return false;
// Run per-MBB peepholes first — independent of rename logic.
bool peepChanged = runPerMBBPeepholes(MF);
// Phase 1: index all stack-rel STA/LDA grouped by slot offset.
DenseMap<int64_t, SmallVector<MachineInstr *, 4>> Stas;
DenseMap<int64_t, SmallVector<MachineInstr *, 4>> Ldas;
DenseMap<int64_t, unsigned> AllRefs; // STA + LDA + ADC + ... count
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
int64_t Off;
if (!srAccess(MI, Off)) continue;
AllRefs[Off]++;
if (MI.getOpcode() == W65816::STA_StackRel) {
Stas[Off].push_back(&MI);
} else if (MI.getOpcode() == W65816::LDA_StackRel) {
Ldas[Off].push_back(&MI);
}
}
}
// Phase 2: find PHI-copy site candidates. Pattern: LDA X ; STA Y
// in a LOOP BODY MBB (= the MBB has itself as a predecessor, i.e.
// a self-loop back-edge). Restricting to loop bodies distinguishes
// genuine PHI-cycle copies from one-shot temp transfers (where
// slot X is just a scratch register dropped on the way to slot Y
// for an unrelated purpose, like qsortIter's pointer-construction
// pattern `STA 5; ...; LDA 5; STA 39` followed by `LDA 39; STA dp`).
DenseMap<int64_t, int64_t> PhiCopyPair; // X -> Y
for (MachineBasicBlock &MBB : MF) {
// Self-loop check: MBB must have itself as a predecessor.
bool selfLoop = false;
for (MachineBasicBlock *Pred : MBB.predecessors()) {
if (Pred == &MBB) { selfLoop = true; break; }
}
if (!selfLoop) continue;
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (It->getOpcode() != W65816::LDA_StackRel) continue;
int64_t X;
if (!srAccess(*It, X)) continue;
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
int64_t Y;
if (!srAccess(*NextIt, Y) || Y == X) continue;
if (PhiCopyPair.count(X)) continue;
PhiCopyPair[X] = Y;
}
}
// Phase 3: validate each pair and apply rename if safe.
// Track which slots have already been merged so we don't double-merge.
DenseMap<int64_t, int64_t> Renames; // X -> Y
for (auto &P : PhiCopyPair) {
int64_t X = P.first, Y = P.second;
// Don't re-merge an already-processed slot.
if (Renames.count(X) || Renames.count(Y)) continue;
// Arg-slot guard: skip slots with no STAs (caller-passed args).
if (Stas[X].empty() || Stas[Y].empty()) continue;
// Validate that every STA X has a twin STA Y.
bool allPaired = true;
for (MachineInstr *StaX : Stas[X]) {
if (!findTwin(StaX, Stas[Y])) { allPaired = false; break; }
}
if (!allPaired) continue;
// Symmetric: every STA Y must have a twin STA X.
for (MachineInstr *StaY : Stas[Y]) {
if (!findTwin(StaY, Stas[X])) { allPaired = false; break; }
}
if (!allPaired) continue;
LLVM_DEBUG(dbgs() << "StackSlotMerge: rename slot " << X
<< " -> " << Y << " in " << MF.getName() << "\n");
Renames[X] = Y;
}
if (Renames.empty()) return false;
// Phase 4: apply rename.
bool Changed = false;
for (MachineBasicBlock &MBB : MF) {
SmallVector<MachineInstr *, 4> ToErase;
for (MachineInstr &MI : MBB) {
int64_t Off;
if (!srAccess(MI, Off)) continue;
auto It = Renames.find(Off);
if (It == Renames.end()) continue;
MI.getOperand(0).setImm(It->second);
Changed = true;
}
// After rename, look for now-redundant LDA-STA pairs to the same
// slot (the PHI-copy self-copy). Erase them.
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (It->getOpcode() != W65816::LDA_StackRel) continue;
int64_t LdaOff;
if (!srAccess(*It, LdaOff)) continue;
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
int64_t StaOff;
if (!srAccess(*NextIt, StaOff)) continue;
if (LdaOff != StaOff) continue;
ToErase.push_back(&*It);
ToErase.push_back(&*NextIt);
}
for (MachineInstr *MI : ToErase) MI->eraseFromParent();
if (!ToErase.empty()) Changed = true;
}
// Phase 5: redundant constant-init elimination. After rename, the
// Case (3) twin pairings leave us with TWO sites writing the same
// constant to the same slot (one renamed from X to Y, the other was
// already targeting Y). The dominated one is redundant — its slot
// already holds the constant from the dominating write.
//
// Generalize: scan post-rename for ALL `LDA_Imm16 K ; STA_StackRel Y`
// pairs (or LDAi16imm K; STA Y). For each pair, look for another
// such pair with the same (K, Y) where one DOMINATES the other AND
// no slot-Y access exists on any path between them. Erase the
// dominated STA + its preceding LDA (if A isn't otherwise consumed).
{
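auto isLdaImm = [](const MachineInstr &MI) {
unsigned Op = MI.getOpcode();
if (Op != W65816::LDA_Imm16 && Op != W65816::LDAi16imm) return false;
// Require a literal immediate. A symbolic operand (e.g. a global
// address) has no comparable K, and immValue would report it as 0.
for (const MachineOperand &MO : MI.operands()) {
if (MO.isImm()) return true;
}
return false;
};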
auto immValue = [](const MachineInstr &MI) -> int64_t {
for (const MachineOperand &MO : MI.operands()) {
if (MO.isImm()) return MO.getImm();
}
return 0;
};
// Collect `LDA #K ; STA_StackRel Y` pairs, grouped by Y.
DenseMap<int64_t, SmallVector<std::pair<MachineInstr *, int64_t>, 4>>
ConstStas;
for (MachineBasicBlock &MBB : MF) {
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (!isLdaImm(*It)) continue;
int64_t K = immValue(*It);
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
int64_t Y;
if (!srAccess(*NextIt, Y)) continue;
ConstStas[Y].push_back({&*NextIt, K});
}
}
// For each slot Y with at least two const-init STAs, check for
// dominator redundancy.
auto &MDT = getAnalysis<MachineDominatorTreeWrapperPass>().getDomTree();
// Check that no instruction WRITES slot Y on any path between
// From and To. Reads are fine because both From and To write
// the same constant K — any intermediate read would see K either
// way (since From dominates, From has already executed). Calls
// are bailout conditions: a call might write to the stack via
// address-taken locals or other side effects we don't model.
auto noSlotWriteOnPath = [&](MachineInstr *From, MachineInstr *To,
int64_t Y) -> bool {
MachineBasicBlock *FromMBB = From->getParent();
MachineBasicBlock *ToMBB = To->getParent();
auto opWritesY = [&](MachineInstr &MI) {
if (MI.isCall() || MI.isInlineAsm()) return true;
int64_t Off;
if (MI.getOpcode() == W65816::STA_StackRel &&
srAccess(MI, Off) && Off == Y) {
return true;
}
return false;
};
// (a) After From in its MBB.
for (auto It = std::next(From->getIterator()); It != FromMBB->end();
++It) {
if (It->isDebugInstr()) continue;
if (opWritesY(*It)) return false;
}
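// (b) Worklist walk forward from FromMBB's successors. Only blocks
// that can still reach ToMBB lie on a From -> To path, so collect
// those first by walking predecessors back from ToMBB.
SmallPtrSet<MachineBasicBlock *, 8> ReachesTo;
SmallVector<MachineBasicBlock *, 8> Stack;
Stack.push_back(ToMBB);
while (!Stack.empty()) {
auto *MBB = Stack.pop_back_val();
if (!ReachesTo.insert(MBB).second) continue;
for (auto *Pred : MBB->predecessors()) Stack.push_back(Pred);
}
SmallPtrSet<MachineBasicBlock *, 8> Visited;
for (auto *Succ : FromMBB->successors()) Stack.push_back(Succ);
while (!Stack.empty()) {
auto *MBB = Stack.pop_back_val();
// Re-entering FromMBB runs From again, which rewrites K.
if (MBB == FromMBB) continue;
if (!ReachesTo.count(MBB)) continue; // cannot lead to To
if (!Visited.insert(MBB).second) continue;
if (MBB == ToMBB) {
// The head of ToMBB is checked in (c). The tail matters only
// when ToMBB sits on a cycle: a write after To would then be
// live when control comes back around to the erased To.
bool onCycle = false;
for (auto *Succ : MBB->successors()) {
if (ReachesTo.count(Succ)) { onCycle = true; break; }
}
if (!onCycle) continue;
for (auto It = std::next(To->getIterator()); It != MBB->end(); ++It) {
if (It->isDebugInstr()) continue;
if (opWritesY(*It)) return false;
}
} else {
for (auto &MI : *MBB) {
if (MI.isDebugInstr()) continue;
if (opWritesY(MI)) return false;
}
}
for (auto *Succ : MBB->successors()) Stack.push_back(Succ);
}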
// (c) In ToMBB, before To, any write of Y?
for (auto It = ToMBB->begin(); It != To->getIterator(); ++It) {
if (It->isDebugInstr()) continue;
if (opWritesY(*It)) return false;
}
return true;
};
SmallVector<MachineInstr *, 8> ToErase;
LLVM_DEBUG({
dbgs() << "Phase 5 in " << MF.getName() << ":\n";
for (auto &P : ConstStas) {
dbgs() << " slot " << P.first << " has " << P.second.size()
<< " const STAs\n";
}
});
for (auto &P : ConstStas) {
int64_t Y = P.first;
auto &stas = P.second;
if (stas.size() < 2) continue;
// For each pair (i, j) where i dominates j with same constant K:
for (auto &Sj : stas) {
MachineInstr *DominatedSta = Sj.first;
int64_t Kj = Sj.second;
for (auto &Si : stas) {
if (&Si == &Sj) continue;
if (Si.second != Kj) continue; // different K
MachineInstr *DominatorSta = Si.first;
if (!MDT.dominates(DominatorSta, DominatedSta)) continue;
if (!noSlotWriteOnPath(DominatorSta, DominatedSta, Y)) continue;
// Flag safety: erasing `LDA #K; STA Y` removes a flag-setting
// op (the LDA). Walk forward from the STA looking for next
// flag-clobber or unconditional terminator (safe) vs.
// flag-use (unsafe).
MachineBasicBlock *MBB = DominatedSta->getParent();
bool flagsSafeP5 = false;
for (auto Fwd = std::next(DominatedSta->getIterator());
Fwd != MBB->end(); ++Fwd) {
if (Fwd->isDebugInstr()) continue;
if (usesFlagsP(*Fwd)) break;
if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
flagsSafeP5 = true; break;
}
if (clobbersFlagsP(*Fwd)) { flagsSafeP5 = true; break; }
}
if (!flagsSafeP5) continue;
// Erase DominatedSta and its preceding LDA #K.
auto Prev = DominatedSta->getIterator();
while (Prev != MBB->begin()) {
--Prev;
if (!Prev->isDebugInstr()) break;
}
if (Prev != DominatedSta->getIterator() && isLdaImm(*Prev) &&
immValue(*Prev) == Kj) {
// The LDA and STA are adjacent, so nothing reads A between
// them. Reads of A after the STA are not checked here.
ToErase.push_back(&*Prev);
}
ToErase.push_back(DominatedSta);
break;
}
}
}
// De-dup ToErase before erasing.
SmallPtrSet<MachineInstr *, 8> ErasedSet;
for (MachineInstr *MI : ToErase) {
if (ErasedSet.insert(MI).second) {
MI->eraseFromParent();
Changed = true;
}
}
}
return Changed || peepChanged;
}
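
The Phase 5 path check asks whether any write to slot Y can execute between the dominating store and the dominated one. The following is a minimal standalone sketch of that reachability question on a toy CFG, not the pass's own code. The `Block` struct and the `blocksReaching` and `noWriteOnPath` functions are hypothetical names introduced for illustration, and the model works at block granularity: the two endpoint blocks are treated as pure endpoints and their own stores are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy block: successor indices plus a flag saying whether the
// block contains a write to the slot of interest.
struct Block {
  std::vector<std::size_t> succs;
  bool writesSlot = false;
};

// Marks every block that has a path (possibly empty) to `to`.
std::vector<bool> blocksReaching(const std::vector<Block> &cfg, std::size_t to) {
  std::vector<bool> reaches(cfg.size(), false);
  reaches[to] = true;
  bool changed = true;
  while (changed) {  // simple fixed point; adequate for a toy graph
    changed = false;
    for (std::size_t b = 0; b < cfg.size(); ++b) {
      if (reaches[b]) continue;
      for (std::size_t s : cfg[b].succs) {
        if (reaches[s]) { reaches[b] = true; changed = true; break; }
      }
    }
  }
  return reaches;
}

// True when no intermediate block on any from -> to path writes the slot.
// The walk continues through `to`, so a write on a cycle that leads back
// to `to` is reported as well.
bool noWriteOnPath(const std::vector<Block> &cfg, std::size_t from,
                   std::size_t to) {
  assert(from < cfg.size() && to < cfg.size());
  const std::vector<bool> reaches = blocksReaching(cfg, to);
  std::vector<bool> visited(cfg.size(), false);
  std::vector<std::size_t> worklist(cfg[from].succs.begin(),
                                    cfg[from].succs.end());
  while (!worklist.empty()) {
    std::size_t b = worklist.back();
    worklist.pop_back();
    // Skip the source block, blocks that cannot lead to `to`, and repeats.
    if (b == from || !reaches[b] || visited[b]) continue;
    visited[b] = true;
    if (b != to && cfg[b].writesSlot) return false;
    for (std::size_t s : cfg[b].succs) worklist.push_back(s);
  }
  return true;
}
```

With a preheader (block 0) feeding a loop header (block 1) whose latch (block 2) branches back, a write in the exit block is ignored because the exit cannot reach the header, while a write in the latch is reported because it flows around the back-edge.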


@ -56,6 +56,8 @@ LLVMInitializeW65816Target() {
initializeW65816I32IncFoldPass(PR);
initializeW65816ImgCalleeSavePass(PR);
initializeW65816NarrowI32MulPass(PR);
initializeW65816PromoteFiToImgPass(PR);
initializeW65816StackSlotMergePass(PR);
// Default IndVarSimplify's exit-value rewriter to "never". The
// closed-form replacement frequently widens an i16 induction var
@ -195,14 +197,19 @@ void W65816PassConfig::addPreRegAlloc() {
}
void W65816PassConfig::addPostRegAlloc() {
// ImgCalleeSave runs FIRST so its STAfi/LDAfi pseudos go through the
// rest of the post-RA pipeline (SpillToX, StackSlotCleanup) normally.
// It detects IMG8..IMG15 usage post-regalloc and inserts prologue
// save + epilogue restore so those slots act as callee-saved at the
// asm level. Fixes picol's `expr 1+2 == 4` bug: high-pressure
// recursive double fns use IMG8..IMG15 as scratch but, without this
// pass, expected them preserved across calls — and callees were
// happy to clobber them. See W65816ImgCalleeSave.cpp.
// FI→IMG promotion runs FIRST. It scans for high-traffic i16
// FrameIndex slots (LDAfi/STAfi/ADCfi/etc.) and rewrites them to
// STA_DP/LDA_DP/ADC_DP/... pointed at free IMG8..IMG15 DP slots.
// The introduced IMG8..15 references are then picked up by
// ImgCalleeSave to get prologue save + epilogue restore. See
// W65816PromoteFiToImg.cpp.
addPass(createW65816PromoteFiToImg());
// ImgCalleeSave detects IMG8..IMG15 usage post-regalloc and inserts
// prologue save + epilogue restore so those slots act as callee-
// saved at the asm level. Fixes picol's `expr 1+2 == 4` bug:
// high-pressure recursive double fns use IMG8..IMG15 as scratch but,
// without this pass, expected them preserved across calls — and
// callees were happy to clobber them. See W65816ImgCalleeSave.cpp.
addPass(createW65816ImgCalleeSave());
// SpillToX converts STA/LDA pairs to TAX/TXA bridges; StackSlotCleanup
// then deletes still-adjacent redundant spills. A second SpillToX
@ -264,6 +271,14 @@ void W65816PassConfig::addPreEmitPass() {
addPass(createW65816I32IncFold());
addPass(createW65816BranchExpand());
addPass(createW65816SepRepCleanup());
// Merge value-equivalent stack slots last. Runs AFTER SepRepCleanup's
// PHI-copy hoist so the LDA-X ; STA-Y pair has been pulled out of
// any PHP/PLP wrap — that way the stack-rel offsets on both ops are
// the unbumped values and offset-based slot matching is stable.
// Saves 2 inst per PHI-copy occurrence (the memory copy round-trip
// collapses when X and Y are renamed to the same slot). See
// W65816StackSlotMerge.cpp.
addPass(createW65816StackSlotMerge());
}
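
The two-instruction saving described in the comment above comes from renaming slot X to slot Y and then deleting the `LDA Y ; STA Y` self-copy that the rename leaves behind. A minimal sketch of that effect on a toy instruction list is shown here. The `Op`, `Inst`, and `mergeSlots` names are hypothetical and exist only for this illustration; the toy ignores flag and accumulator liveness, which a real pass has to consider.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy instruction: an opcode and a stack-relative offset.
enum class Op { Lda, Sta, Other };
struct Inst {
  Op op;
  std::int64_t slot;
};

// Retarget every access to slot `x` at slot `y`, then drop each adjacent
// `LDA s ; STA s` pair, which has become a self-copy.
// Returns the number of instructions removed.
std::size_t mergeSlots(std::vector<Inst> &code, std::int64_t x,
                       std::int64_t y) {
  assert(x != y && "merging a slot with itself changes nothing");
  for (Inst &inst : code) {
    if (inst.op != Op::Other && inst.slot == x) inst.slot = y;
  }
  std::vector<Inst> kept;
  std::size_t removed = 0;
  for (std::size_t i = 0; i < code.size(); ++i) {
    const bool selfCopy = i + 1 < code.size() && code[i].op == Op::Lda &&
                          code[i + 1].op == Op::Sta &&
                          code[i].slot == code[i + 1].slot;
    if (selfCopy) {
      removed += 2;
      ++i;  // skip the STA half of the pair as well
      continue;
    }
    kept.push_back(code[i]);
  }
  code = kept;
  return removed;
}
```

Running `mergeSlots` with `x = 3`, `y = 7` on the sequence `STA 3 ; other ; LDA 3 ; STA 7 ; LDA 7` rewrites the middle pair to `LDA 7 ; STA 7` and removes it, leaving three instructions.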
MachineFunctionInfo *W65816TargetMachine::createMachineFunctionInfo(


@ -64,13 +64,43 @@ FunctionPass *llvm::createW65816WidenAcc16() {
return new W65816WidenAcc16();
}
// Returns true if the vreg has any physreg-COPY use (e.g., return-value
// or arg-passing setup that pins the value to a specific physreg).
static bool flowsToPhysReg(Register VReg, const MachineRegisterInfo &MRI) {
// Returns true if the vreg has any physreg-COPY use that would conflict
// with Wide16 class assignment. $a is a member of Wide16 (Wide16 = A +
// IMG0..15), so a COPY to $a is fine — the vreg can be Wide16 and
// regalloc will pick $a to coalesce. $x / $y are in Idx16, NOT in
// Wide16, so a COPY to those forces the vreg to NOT be in Wide16
// (verifier would reject).
static bool flowsToIncompatiblePhysReg(Register VReg,
const MachineRegisterInfo &MRI) {
for (auto &U : MRI.use_nodbg_instructions(VReg)) {
if (!U.isCopy()) continue;
const MachineOperand &Dst = U.getOperand(0);
if (Dst.isReg() && Dst.getReg().isPhysical()) return true;
if (!Dst.isReg() || !Dst.getReg().isPhysical()) continue;
Register P = Dst.getReg();
if (P == W65816::A) continue;
if (P >= W65816::IMG0 && P <= W65816::IMG15) continue;
return true;
}
return false;
}
// Returns true if VReg's def is a COPY from a physreg whose class is not
// Wide16-compatible. copyPhysReg only handles a fixed set of source/dest
// pairs; an incompatible source physreg (e.g., DPF0, the i64-return
// high-half carrier) lowered to an IMG dest would crash with an
// "unhandled copyPhysReg" assertion at AsmPrinter time. (Currently
// only the Phase-2 PHI widening uses this; that's disabled, so mark
// unused.)
[[maybe_unused]] static bool comesFromIncompatiblePhysReg(Register VReg,
const MachineRegisterInfo &MRI) {
for (auto &D : MRI.def_instructions(VReg)) {
if (!D.isCopy()) continue;
const MachineOperand &Src = D.getOperand(1);
if (!Src.isReg() || !Src.getReg().isPhysical()) continue;
Register P = Src.getReg();
if (P == W65816::A) continue;
if (P >= W65816::IMG0 && P <= W65816::IMG15) continue;
return true;
}
return false;
}
@ -145,7 +175,7 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
Register VReg = Register::index2VirtReg(i);
if (MRI.def_empty(VReg)) continue;
if (MRI.getRegClass(VReg) != &W65816::Acc16RegClass) continue;
if (flowsToPhysReg(VReg, MRI)) continue;
if (flowsToIncompatiblePhysReg(VReg, MRI)) continue;
if (usedByPhi(VReg, MRI)) continue;
if (!MRI.hasOneDef(VReg)) continue; // require single SSA def
if (!allUsesAcceptWide(VReg, MRI, *TRI, *TII)) continue;
@ -181,5 +211,212 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
}
Changed = true;
}
// Phase 2: PHI cycle widening. EXPERIMENTAL, currently disabled —
// see end of pass for explanation.
#if 0
// PHIs whose def class is Acc16 keep
// the value pinned to $a across iterations, forcing stack spills
// when the PHI is live across calls or other A-clobbering ops.
// For sumSquares-style loops with an i32 accumulator, this manifests
// as per-iter `LDA slot ; ADC ; STA slot ; LDA slot ; STA slot` (the
// last LDA/STA pair is the PHI-back-edge copy). If we widen the
// PHI's def to Wide16, regalloc can keep it in an IMG slot and the
// back-edge PHI copy collapses to a register coalesce.
//
// To widen a PHI:
// 1. Compute the SCC of Acc16 vregs connected by PHI edges (PHI
// def ↔ PHI incoming vreg). This catches mutually-recursive
// PHIs in nested loops.
// 2. For every member: verify all non-PHI uses accept Wide16, no
// flow to a physreg, single def.
// 3. For each PHI in the SCC, walk its incoming list. Each
// incoming vreg is either ALREADY in the SCC (another PHI, no
// bridge needed) or an external Acc16 vreg whose value flows
// into the SCC — bridge it by inserting `WWide = COPY W` at
// the end of the predecessor block and pointing the PHI's
// incoming at WWide.
// 4. Change every SCC member's register class to Wide16.
auto worklistInsertIfAcc16 = [&MRI](Register V,
DenseSet<Register> &Seen,
SmallVectorImpl<Register> &WL) {
if (!V.isVirtual()) return;
if (MRI.getRegClass(V) != &W65816::Acc16RegClass) return;
if (!Seen.insert(V).second) return;
WL.push_back(V);
};
SmallVector<MachineInstr *, 16> AcctPhis;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB.phis()) {
Register DefV = MI.getOperand(0).getReg();
if (MRI.getRegClass(DefV) == &W65816::Acc16RegClass) {
AcctPhis.push_back(&MI);
}
}
}
DenseSet<Register> ProcessedPhiVregs;
for (MachineInstr *Seed : AcctPhis) {
Register SeedDef = Seed->getOperand(0).getReg();
if (ProcessedPhiVregs.count(SeedDef)) continue;
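// Build the component by following PHI edges in both directions.
// Strictly this is a weakly connected component of the PHI graph;
// the comments below keep calling it the SCC.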
DenseSet<Register> Comp;
SmallVector<Register, 8> Stack;
worklistInsertIfAcc16(SeedDef, Comp, Stack);
while (!Stack.empty()) {
Register V = Stack.pop_back_val();
// Forward: V flows into other PHIs as an incoming → include those PHI defs.
for (auto &U : MRI.use_nodbg_instructions(V)) {
if (!U.isPHI()) continue;
Register PhiDef = U.getOperand(0).getReg();
worklistInsertIfAcc16(PhiDef, Comp, Stack);
}
// Backward: if V is itself a PHI def, include the incoming vregs.
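// getUniqueVRegDef returns null for zero or multiple defs, so the
// null check below is meaningful (&*begin() never yields null).
MachineInstr *DM = MRI.getUniqueVRegDef(V);
if (!DM || !DM->isPHI()) continue;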
for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
MachineOperand &MO = DM->getOperand(i);
if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
worklistInsertIfAcc16(MO.getReg(), Comp, Stack);
}
}
for (Register V : Comp) ProcessedPhiVregs.insert(V);
// Validate every member. PHI uses are ACCEPTED when the consumer
// PHI is itself in the SCC (those PHIs are being widened in
// lock-step). Narrow-class uses (e.g., INA_PSEUDO's tied-def
// input requires Acc16) are ALSO accepted — we'll insert a
// Wide16→Acc16 COPY at the use site after widening. The only
// unrecoverable cases are: PHI uses where the consumer PHI is
// outside the SCC (forcing cross-SCC class merging), and physreg
// flow to $x/$y/etc. (handled separately above).
auto usesAcceptInSCC = [&](Register V,
SmallVectorImpl<MachineOperand *> *NarrowSites)
-> bool {
for (auto &MO : MRI.use_nodbg_operands(V)) {
MachineInstr *UMI = MO.getParent();
if (UMI->isCopy()) continue;
if (UMI->isPHI()) {
Register PhiDef = UMI->getOperand(0).getReg();
if (Comp.count(PhiDef)) continue; // co-widened
return false;
}
unsigned OpIdx = UMI->getOperandNo(&MO);
const TargetRegisterClass *Expected =
TII->getRegClass(UMI->getDesc(), OpIdx);
if (!Expected) continue;
if (Expected == &W65816::Wide16RegClass) continue;
if (Expected->hasSubClassEq(&W65816::Wide16RegClass)) continue;
// Expected is narrower than Wide16 (e.g., Acc16-only tied
// input). Mark for runtime narrowing — we'll insert a COPY
// at apply time.
if (NarrowSites) NarrowSites->push_back(&MO);
}
return true;
};
bool ok = true;
SmallVector<MachineOperand *, 8> NarrowSites;
for (Register V : Comp) {
if (!MRI.hasOneDef(V)) { ok = false; break; }
if (flowsToIncompatiblePhysReg(V, MRI)) { ok = false; break; }
if (comesFromIncompatiblePhysReg(V, MRI)) { ok = false; break; }
if (!usesAcceptInSCC(V, &NarrowSites)) { ok = false; break; }
}
if (!ok) continue;
// Apply widening. First insert bridge COPYs at predecessor edges
// for external (non-Comp) Acc16 incomings to each PHI in Comp.
SmallVector<std::pair<MachineInstr *, unsigned>, 16> BridgeSites;
for (Register V : Comp) {
MachineInstr *DM = &*MRI.def_instructions(V).begin();
if (!DM->isPHI()) continue;
for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
MachineOperand &MO = DM->getOperand(i);
if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
Register Inc = MO.getReg();
if (Comp.count(Inc)) continue; // in-SCC, no bridge needed
// External incoming: ensure it's currently Acc16; if so, we'll
// insert a COPY at the predecessor block's end.
if (MRI.getRegClass(Inc) != &W65816::Acc16RegClass &&
MRI.getRegClass(Inc) != &W65816::Wide16RegClass) {
ok = false;
break;
}
BridgeSites.push_back({DM, i});
}
if (!ok) break;
}
if (!ok) continue;
// Insert bridges.
for (auto &Site : BridgeSites) {
MachineInstr *PhiMI = Site.first;
unsigned OpIdx = Site.second;
Register Inc = PhiMI->getOperand(OpIdx).getReg();
MachineBasicBlock *PredMBB = PhiMI->getOperand(OpIdx + 1).getMBB();
// If already Wide16 (e.g., another candidate widened it already),
// no bridge needed — but we still need the PHI incoming to use
// a Wide16 vreg. Use Inc directly.
if (MRI.getRegClass(Inc) == &W65816::Wide16RegClass) {
continue;
}
// Insert COPY before the predecessor's terminator(s).
auto InsertPos = PredMBB->getFirstTerminator();
DebugLoc DL = (InsertPos == PredMBB->end())
? PredMBB->findBranchDebugLoc()
: InsertPos->getDebugLoc();
Register WideInc = MRI.createVirtualRegister(&W65816::Wide16RegClass);
BuildMI(*PredMBB, InsertPos, DL, TII->get(TargetOpcode::COPY),
WideInc)
.addReg(Inc);
PhiMI->getOperand(OpIdx).setReg(WideInc);
PhiMI->getOperand(OpIdx).setIsKill(false);
}
// Force every SCC member to Img16 (IMG-only, no A). Using Wide16
// (A + IMG) doesn't work here: the Register Coalescer joins our
// Wide16 vregs with adjacent Acc16 vregs (intersection = Acc16)
// and narrows them back to A-only, defeating the widening. Img16
// intersects Acc16 to ∅, so the coalescer can't merge — the PHI
// stays in IMG. This is correct anyway for the common case (PHI
// live across a call): A is JSL-clobbered, so it can't carry the
// value through, and IMG8..15 is the right home.
for (Register V : Comp) {
MRI.setRegClass(V, &W65816::Img16RegClass);
}
// Insert narrowing COPYs at each narrow-class use site. Each site
// is `... = OP V, ...` where the operand requires Acc16 but V is
// now Img16. Replace with `%Vacc = COPY V (Acc16); ... = OP %Vacc, ...`.
for (MachineOperand *MO : NarrowSites) {
MachineInstr *UMI = MO->getParent();
Register OldReg = MO->getReg();
Register NarrowReg =
MRI.createVirtualRegister(&W65816::Acc16RegClass);
DebugLoc DL = UMI->getDebugLoc();
BuildMI(*UMI->getParent(), UMI, DL, TII->get(TargetOpcode::COPY),
NarrowReg)
.addReg(OldReg);
MO->setReg(NarrowReg);
MO->setIsKill(false);
}
Changed = true;
}
#endif
// Why disabled (2026-05-13 attempt):
// - Widening PHI cycles to Wide16 (= A + IMG0..15) is undone by the
// Register Coalescer: it joins our Wide16 vregs with adjacent
// Acc16 vregs via the bridge COPYs we insert, and the resulting
// joint class is `intersect(Wide16, Acc16) = Acc16`. Net effect:
// no IMG, just more code through the coalescer.
// - Switching to Img16 (= IMG0..15, no A) defeats the coalescer
// (intersection with Acc16 is ∅) but forces ALL widened PHIs into
// IMG slots even when A would be better, AND triggers cascading
// copyPhysReg paths that aren't all implemented (e.g., DPF0 → IMG
// for i64 libcall return values), aborting clang on runtime builds.
// - A targeted fix needs either (a) a class that the coalescer
// refuses to join with Acc16 yet that still allows A as a member,
// (b) a post-coalescer pass that re-widens specific high-traffic
// vregs back to Img16, or (c) regalloc cost-model tuning so it
// prefers IMG8..15 over stack for loop-live values.
return Changed;
}
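
The coalescer behaviour described in the closing comment can be pictured with register classes modelled as bitmasks. This is a sketch under stated assumptions, not the TableGen definitions: bit 0 stands for A, bits 1..16 for IMG0..IMG15, and Acc16 is assumed to contain only A (consistent with the "A-only" wording above). The names `RegMask`, `joinedClass`, and `canJoin` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical bitmask model: bit 0 is A, bits 1..16 are IMG0..IMG15.
using RegMask = std::uint32_t;
constexpr RegMask kA = 1u << 0;
constexpr RegMask kImgAll = 0xFFFFu << 1;  // IMG0..IMG15
constexpr RegMask kAcc16 = kA;             // assumed: A only
constexpr RegMask kWide16 = kA | kImgAll;  // A + IMG0..15
constexpr RegMask kImg16 = kImgAll;        // IMG0..15, no A

// Joining two vregs constrains the result to registers legal for both.
constexpr RegMask joinedClass(RegMask lhs, RegMask rhs) { return lhs & rhs; }

// A join is only possible when some register satisfies both classes.
constexpr bool canJoin(RegMask lhs, RegMask rhs) {
  return joinedClass(lhs, rhs) != 0;
}

// Wide16 joined with Acc16 narrows back to Acc16, undoing the widening.
static_assert(joinedClass(kWide16, kAcc16) == kAcc16, "narrows to A-only");
// Img16 and Acc16 share no register, so the join is refused.
static_assert(!canJoin(kImg16, kAcc16), "empty intersection");
```

In this model, option (a) from the comment would need a class that still contains A yet is never joined with Acc16, which a plain intersection rule cannot express; that is why the comment lists it as requiring separate coalescer cooperation.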