Checkpoint

Scott Duensing 2026-05-13 20:54:28 -05:00
parent e2e4b778b0
commit 42f0d16d07
19 changed files with 2008 additions and 84 deletions


@@ -246,20 +246,21 @@ which runs correctly under MAME (apple2gs).
 - `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
 via MAME's emulated time counter. Eight benchmarks under
-`benchmarks/`. Current numbers (2026-05-13 after the umulhisi3 /
-TAX-TXA / store-bypass / PHI-hoist landings): popcount 3478,
-bsearch 852, memcmp 1091, strcpy 2558, dotProduct 2302,
-fib(10) 12617, sumOfSquares 18755. Speed is the optimization
-priority, not size.
+`benchmarks/`. Current numbers (after W65816StackSlotMerge):
+popcount 3376, bsearch 852, memcmp 1091, strcpy 2387,
+dotProduct 2302, fib(10) 12617, sumOfSquares 17391. Speed is
+the optimization priority, not size.
 - `compare/` holds three side-by-side C tests with our asm and
 Calypsi's listing for static-size comparison:
 `sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh`
 recompiles each under both `clang --target=w65816 -O2 -S` and
 `cc65816 --speed -O 2 --64bit-doubles` and prints an
-ours/Calypsi instruction-count ratio. Current ratios:
-sumSquares 2.32x, evalAt 2.10x, mul16to32 2.50x. See
-`compare/README.md`.
+ours/Calypsi instruction-count ratio. Current ratios (post
+W65816StackSlotMerge Phase 5/6 + extracted Phase 6/6a per-MBB
+peepholes + Pass 1c PHP-wrap CMP elim for SP-rel functions):
+sumSquares 1.81x (56 inst), evalAt 2.10x (534 inst), mul16to32
+2.25x (9 inst). See `compare/README.md`.
 **Backend register allocation:**
@@ -340,6 +341,46 @@ for the common-case C / minimal-C++ workload. Priority is speed
 `-disable-lsr` and `isLSRCostLess` override, both regressed
 dotProduct.
+- **W65816StackSlotMerge — value-equivalent stack slot coalesce**
+(2026-05-13). Pre-emit pass that merges PHI src/dst stack-slot
+pairs which LLVM's StackSlotColoring can't see (they're
+simultaneously live but hold the same value). Detects the
+canonical loop-body `LDA X ; STA Y` PHI-copy in a self-looped
+MBB, verifies value equivalence via bidirectional twin-pairing
+(Case 1: same A in same MBB / Case 2: PHI-copy reload pattern /
+Case 3: matching `LDA #const` init in different MBBs), and
+renames slot X→Y function-wide. Runs AFTER SepRepCleanup so the
+PHI copies are out of their PHP/PLP wraps and offsets are stable.
+**A-define detection is opcode-based, not operand-based**
+LDA_DP / LDA_Abs / LDA_Long etc. omit the `implicit-def $a`
+annotation in tablegen but semantically write A; the
+`semanticallyDefsA` helper falls back to an opcode whitelist.
+sumSquares static: 65 → 61 inst (1.97x — under 2x Calypsi for
+the first time). sumOfSquares cyc/call: 18755 → 17391
+(**7.3%**). strcpy: 2558 → 2387 (6.7%). See
+W65816StackSlotMerge.cpp.
+- **LSR-widened i32 IV narrowing** (`W65816NarrowI32Mul` Phase 2,
+2026-05-13). After rewriting `mul i32 X, Y` to a `__umulhisi3`
+call, scan for i32 PHIs whose only uses are (a) the truncs the
+rewrite emitted and (b) a single self-feeding `add %P, const`.
+When SCEV bounds the PHI to u16, build an i16 PHI + i16 add in
+place, replace truncs, and erase the i32 chain. Care needed
+to break the PN ↔ Incr use-cycle before erasing. sumSquares
+frame: 14B → 12B; loop-internal `i++` shrinks from 7→3 inst.
+- **PHI-hoist accepts LDA_Imm16 / LDAi16imm** (2026-05-13).
+Init blocks contain `lda #const ; sta slot,s` pairs wrapped in
+PHP/PLP around the pre-loop CMP — same shape as a PHI-copy
+wrap but with an immediate load instead of a memory load.
+Matcher extended to accept both the MC opcode (`LDA_Imm16`) and
+the surviving pseudo (`LDAi16imm`), with an added **$a-live-out
+guard**: if any successor MBB has $a in its live-in set, bail —
+the LDA's A-value is a fall-through register-PHI consumed by
+the successor's first STA, and hoisting clobbers it. Caught
+by `sumTable` where `lda #0 ; sta 0x9,s` (wrap+trailing) ALSO
+supplied A=0 to `bb.2`'s `sta 0x1,s`.
 - **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
 pass** (2026-05-13). Added `__umulhisi3` (unsigned 16x16→32) to
 `runtime/src/libgcc.s`. New IR pass in `addISelPrepare` walks
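
The twin-pairing check described in the W65816StackSlotMerge entry can be illustrated outside LLVM. What follows is a minimal standalone sketch under a simplified model, not the pass itself: each store is reduced to a (block, slot, value id) record, and only the "same A in same MBB" case (Case 1) is modelled. The names `Store`, `hasTwins`, and `canMergeSlots` are hypothetical and do not appear in the commit.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one STA: the basic block it sits in, the stack
// slot it writes, and an id naming the value held in A at that point.
struct Store {
  int block;
  int slot;
  int valueId;
};

// True if every store to `from` has a twin store to `to` carrying the
// same value in the same block.
static bool hasTwins(const std::vector<Store> &stores, int from, int to) {
  for (const Store &s : stores) {
    if (s.slot != from) continue;
    bool found = false;
    for (const Store &t : stores) {
      if (t.slot == to && t.block == s.block && t.valueId == s.valueId) {
        found = true;
        break;
      }
    }
    if (!found) return false;
  }
  return true;
}

// Bidirectional check: X can be renamed to Y only if the twin relation
// holds in both directions.
bool canMergeSlots(const std::vector<Store> &stores, int x, int y) {
  return hasTwins(stores, x, y) && hasTwins(stores, y, x);
}
```

In this model, two slots that receive the same value in every block where either is written are reported as mergeable, and a single unmatched store on either side blocks the rename.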


@@ -1,7 +1,7 @@
 ###############################################################################
 # #
 # Calypsi ISO C compiler for 65816 version 5.16 #
-# 13/May/2026 15:46:15 #
+# 13/May/2026 20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles evalAt.c -o #
 # /tmp/evalAt.calypsi.elf --list-file evalAt.calypsi.lst #
 # #


@@ -139,9 +139,10 @@ evalAt: ; @evalAt
 lda 0x1d, s
 sta [0xe0 ], y
 pea 0x4024
-pea 0x0
-pea 0x0
-pea 0x0
+lda #0x0
+pha
+pha
+pha
 lda 0x17, s
 pha
 lda 0x1b, s
@@ -272,9 +273,9 @@ evalAt: ; @evalAt
 lda 0xc4
 sta 0x15, s
 lda 0xca
-sta 0x11, s
-lda 0xc8
 sta 0x13, s
+lda 0xc8
+sta 0x11, s
 lda 0x17, s
 pha
 lda 0x1f, s
@@ -283,9 +284,9 @@ evalAt: ; @evalAt
 pha
 lda 0x27, s
 pha
-lda 0x19, s
+lda 0x1b, s
 pha
-lda 0x1d, s
+lda 0x1b, s
 pha
 lda 0x27, s
 tax
@@ -518,9 +519,9 @@ evalAt: ; @evalAt
 lda 0xc4
 sta 0x15, s
 lda 0xca
-sta 0x11, s
-lda 0xc8
 sta 0x13, s
+lda 0xc8
+sta 0x11, s
 lda 0x17, s
 pha
 lda 0x1f, s
@@ -529,9 +530,9 @@ evalAt: ; @evalAt
 pha
 lda 0x27, s
 pha
-lda 0x19, s
+lda 0x1b, s
 pha
-lda 0x1d, s
+lda 0x1b, s
 pha
 lda 0x27, s
 tax


@@ -1,7 +1,7 @@
 ###############################################################################
 # #
 # Calypsi ISO C compiler for 65816 version 5.16 #
-# 13/May/2026 15:46:15 #
+# 13/May/2026 20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles mul16to32.c -o #
 # /tmp/mul16to32.calypsi.elf --list-file #
 # mul16to32.calypsi.lst #


@@ -11,7 +11,6 @@ mul16to32: ; @mul16to32
 jsl __umulhisi3
 ply
 sta 0x1, s
-lda 0x1, s
 ply
 rtl
 .Lfunc_end0:


@@ -1,7 +1,7 @@
 ###############################################################################
 # #
 # Calypsi ISO C compiler for 65816 version 5.16 #
-# 13/May/2026 15:46:15 #
+# 13/May/2026 20:52:21 #
 # Command line: --speed -O 2 --64bit-doubles sumSquares.c -o #
 # /tmp/sumSquares.calypsi.elf --list-file #
 # sumSquares.calypsi.lst #

compare/sumSquares.ll

@@ -0,0 +1,50 @@
; ModuleID = 'sumSquares.c'
source_filename = "sumSquares.c"
target datalayout = "e-m:e-p:32:16-i16:16-i32:16-i64:16-f32:16-f64:16-a:8-n8:16-S8"
target triple = "w65816"
; Function Attrs: nofree norecurse nosync nounwind memory(none)
define dso_local i32 @sumSquares(i16 noundef zeroext %n) local_unnamed_addr #0 {
entry:
%cmp.not6 = icmp eq i16 %n, 0
br i1 %cmp.not6, label %for.cond.cleanup, label %for.body.preheader
for.body.preheader: ; preds = %entry
%0 = add i16 %n, 1
%umax = tail call i16 @llvm.umax.i16(i16 %0, i16 2)
br label %for.body
for.cond.cleanup: ; preds = %for.body, %entry
%total.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
ret i32 %total.0.lcssa
for.body: ; preds = %for.body.preheader, %for.body
%i.08 = phi i16 [ %inc, %for.body ], [ 1, %for.body.preheader ]
%total.07 = phi i32 [ %add, %for.body ], [ 0, %for.body.preheader ]
%conv = zext i16 %i.08 to i32
%mul = mul nuw i32 %conv, %conv
%add = add i32 %mul, %total.07
%inc = add nuw i16 %i.08, 1
%exitcond = icmp eq i16 %inc, %umax
br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !7
}
; Function Attrs: nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none)
declare i16 @llvm.umax.i16(i16, i16) #1
attributes #0 = { nofree norecurse nosync nounwind memory(none) "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" }
attributes #1 = { nocallback nocreateundeforpoison nofree nosync nounwind speculatable willreturn memory(none) }
!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}
!llvm.errno.tbaa = !{!3}
!0 = !{i32 1, !"wchar_size", i32 2}
!1 = !{i32 7, !"frame-pointer", i32 2}
!2 = !{!"clang version 23.0.0git (https://github.com/llvm-mos/llvm-mos.git c798c31416f72b395c658b5502d281a162387ab1)"}
!3 = !{!4, !4, i64 0}
!4 = !{!"int", !5, i64 0}
!5 = !{!"omnipotent char", !6, i64 0}
!6 = !{!"Simple C/C++ TBAA"}
!7 = distinct !{!7, !8}
!8 = !{!"llvm.loop.mustprogress"}
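
The IR above is consistent with a source loop of roughly the following shape. This is a reconstruction from the IR (an i16 counter starting at 1, an i32 accumulator, `zext` before the multiply) and an assumption, not the contents of the repository's `sumSquares.c`.

```cpp
#include <cassert>
#include <cstdint>

// Assumed source shape: sum of i*i for i in 1..n, with a 16-bit counter
// (the i16 PHI %i.08) and a 32-bit accumulator (the i32 PHI %total.07).
// n == 0xFFFF is left out of scope here: a 16-bit `i <= n` test never
// fails for that input.
uint32_t sumSquares(uint16_t n) {
  uint32_t total = 0;
  for (uint16_t i = 1; i <= n; ++i) {
    // Widen before multiplying, matching `zext i16 ... to i32` + `mul nuw i32`.
    total += static_cast<uint32_t>(i) * i;
  }
  return total;
}
```

The widened multiply is the `mul i32` that `W65816NarrowI32Mul` rewrites into a `__umulhisi3` call, which is why the loop body in the generated assembly contains a single `jsl __umulhisi3`.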


@@ -8,79 +8,62 @@ sumSquares: ; @sumSquares
 tay
 tsc
 sec
-sbc #0xe
+sbc #0xc
 tcs
 tya
-sta 0x7, s
+sta 0x5, s
 lda #0x0
-sta 0xb, s
-lda 0x7, s
-cmp #0x0
-php
-lda #0x0
-plp
-sta 0x9, s
+sta 0x3, s
+sta 0x1, s
+lda 0x5, s
 bne .LBB0_1
 ; %bb.6: ; %entry
 brl .LBB0_5
 .LBB0_1: ; %for.body.preheader
-lda 0x7, s
+lda 0x5, s
 inc a
-sta 0x7, s
+sta 0x5, s
 cmp #0x3
 bcs .LBB0_3
 ; %bb.2: ; %for.body.preheader
 lda #0x2
-sta 0x7, s
-.LBB0_3: ; %for.body.preheader
-lda #0x0
-sta 0x3, s
-lda #0x1
-sta 0xd, s
-lda 0x7, s
-dec a
-sta 0x7, s
-lda #0x0
 sta 0x5, s
+.LBB0_3: ; %for.body.preheader
+lda #0x1
+sta 0x7, s
+lda 0x5, s
+dec a
+sta 0x5, s
+lda #0x0
 sta 0x1, s
 .LBB0_4: ; %for.body
 ; =>This Inner Loop Header: Depth=1
-lda 0xd, s
+lda 0x7, s
 pha
 jsl __umulhisi3
 ply
 clc
 adc 0x3, s
-sta 0xb, s
+sta 0x3, s
 txa
 adc 0x1, s
-sta 0x9, s
-lda 0xd, s
-inc a
-sta 0xd, s
-bne .Ltmp0
-lda 0x5, s
-inc a
-sta 0x5, s
-.Ltmp0:
-lda 0xb, s
-sta 0x3, s
-lda 0x9, s
 sta 0x1, s
 lda 0x7, s
-dec a
+inc a
 sta 0x7, s
-cmp #0x0
+lda 0x5, s
+dec a
+sta 0x5, s
 beq .LBB0_5
 bra .LBB0_4
 .LBB0_5: ; %for.cond.cleanup
-lda 0x9, s
+lda 0x1, s
 tax
-lda 0xb, s
+lda 0x3, s
 tay
 tsc
 clc
-adc #0xe
+adc #0xc
 tcs
 tya
 rtl


@@ -93,10 +93,10 @@ $LUA_CHECKS
 end)
 EOF
-OUT=$(timeout 30 mame apple2gs \
+OUT=$(SDL_VIDEODRIVER=dummy SDL_AUDIODRIVER=dummy timeout 30 mame apple2gs \
 -rompath "$PROJECT_ROOT/tools/mame/roms" \
 -plugins -autoboot_script "$LUA_PATH" \
--window -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-")
+-video none -sound none -nothrottle -seconds_to_run "$SECS" 2>&1 | grep "^MAME-")
 echo "$OUT"
 # Parse all val=... and compare to expected list.


@@ -38,6 +38,8 @@ add_llvm_target(W65816CodeGen
 W65816I32IncFold.cpp
 W65816ImgCalleeSave.cpp
 W65816NarrowI32Mul.cpp
+W65816PromoteFiToImg.cpp
+W65816StackSlotMerge.cpp
 W65816TargetMachine.cpp
 W65816AsmPrinter.cpp
 W65816MCInstLower.cpp


@@ -124,6 +124,25 @@ FunctionPass *createW65816SjLjFinalize();
 // zext that a SDAG-level combine would key off. See W65816NarrowI32Mul.cpp.
 FunctionPass *createW65816NarrowI32Mul();
+// Post-RA, pre-PEI pass: rewrite high-traffic i16 FrameIndex accesses
+// to use IMG8..15 DP slots ($C0..$CE) instead of stack-rel spills.
+// Picks K = (number of free IMG8..15) hottest FIs and rewrites their
+// STAfi/LDAfi/ADCfi/etc. pseudos to STA_DP/LDA_DP/ADC_DP/etc. with
+// the corresponding DP address. Net win when access count > 5 (the
+// per-slot save/restore in ImgCalleeSave is ~20 cyc / 12 B). See
+// W65816PromoteFiToImg.cpp.
+FunctionPass *createW65816PromoteFiToImg();
+// Pre-emit pass: merge value-equivalent stack slots. LLVM's
+// StackSlotColoring merges slots with non-overlapping liveness;
+// this pass catches the case where two slots ARE simultaneously
+// live but always hold the same value — typically the PHI src/dst
+// pair PHI-elim leaves at the back-edge of a loop body. Renames
+// X→Y function-wide when every STA X has a "twin" STA Y of the
+// same source value, and erases the resulting LDA-X-STA-Y self-
+// copy. See W65816StackSlotMerge.cpp.
+FunctionPass *createW65816StackSlotMerge();
 // Pre-RA pass that lowers Wide32 register pairs into pairs of i16
 // vregs. Without this, greedy/basic regalloc can't fit the pair-
 // pressure of i64-via-2-i32-via-Wide32 traffic in i64-heavy
@@ -163,6 +182,8 @@ void initializeW65816SjLjFinalizePass(PassRegistry &);
 void initializeW65816LowerWide32Pass(PassRegistry &);
 void initializeW65816ImgCalleeSavePass(PassRegistry &);
 void initializeW65816NarrowI32MulPass(PassRegistry &);
+void initializeW65816PromoteFiToImgPass(PassRegistry &);
+void initializeW65816StackSlotMergePass(PassRegistry &);
 } // namespace llvm
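
The cost figures quoted in the `createW65816PromoteFiToImg` comment (about 1 cycle and 1 byte saved per access, about 20 cycles and 12 bytes of save/restore per promoted slot) can be turned into a small break-even calculation. The sketch below is an illustration only. The split between dynamic accesses per call and static access sites is an assumption made here to show why a static count and a cycle count can disagree; `PromotionCost` and `estimatePromotion` are hypothetical names, not part of the backend.

```cpp
#include <cassert>

// Net effect of promoting one frame slot to an IMG DP slot, using the
// approximate figures from the pass comments.
struct PromotionCost {
  long cyclesSaved;  // net cycles per call (negative means a regression)
  long bytesSaved;   // net code bytes (negative means growth)
};

PromotionCost estimatePromotion(long dynamicAccessesPerCall,
                                long staticAccessSites) {
  const long kCyclesPerAccess = 1;    // stack-rel 5 cyc -> DP 4 cyc
  const long kBytesPerAccess = 1;     // stack-rel 3 B  -> DP 2 B
  const long kSaveRestoreCycles = 20; // ~10 prologue + ~10 epilogue
  const long kSaveRestoreBytes = 12;  // ~6 prologue + ~6 epilogue
  return {dynamicAccessesPerCall * kCyclesPerAccess - kSaveRestoreCycles,
          staticAccessSites * kBytesPerAccess - kSaveRestoreBytes};
}
```

Under this model a slot with five access sites pays off in cycles only if those sites execute more than 20 times per call in total, for example inside a loop, and it still grows the code until there are more than 12 sites.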


@@ -132,14 +132,155 @@ bool W65816NarrowI32Mul::runOnFunction(Function &F) {
 return false;
 }
+// When the i32 operand is `zext i16 X to i32`, use X directly instead
+// of emitting `trunc i32 (zext i16 X) to i16` — that trunc-of-zext is
+// semantically the identity but keeps the zext (= a fresh i32 SSA
+// value) live, which materializes a Wide32 vreg pair at ISel and
+// forces a 4-byte spill slot (the canonical sumSquares `conv` pattern
+// burned slots 0xd / 0x5 this way). Skipping the trunc lets the
+// post-replaceAll DCE drop the zext entirely, freeing the slot.
+auto narrowOperand = [&](Value *V, IRBuilder<> &B) -> Value * {
+if (auto *ZE = dyn_cast<ZExtInst>(V)) {
+if (ZE->getSrcTy() == I16) return ZE->getOperand(0);
+}
+if (auto *AE = dyn_cast<SExtInst>(V)) {
+// Sext from i16 also has the right low 16 bits.
+if (AE->getSrcTy() == I16) return AE->getOperand(0);
+}
+return B.CreateTrunc(V, I16);
+};
 FunctionCallee Callee = getUmulhisi3(*M);
+SmallVector<Instruction *, 8> MaybeDead;
 for (BinaryOperator *BO : Worklist) {
 IRBuilder<> B(BO);
-Value *A = B.CreateTrunc(BO->getOperand(0), I16);
-Value *Bv = B.CreateTrunc(BO->getOperand(1), I16);
+Value *AOp = BO->getOperand(0);
+Value *BOp = BO->getOperand(1);
+Value *A = narrowOperand(AOp, B);
+Value *Bv = narrowOperand(BOp, B);
 Value *Call = B.CreateCall(Callee, {A, Bv});
 BO->replaceAllUsesWith(Call);
 BO->eraseFromParent();
+// If the original operands were zext/sext nodes, they may now be
+// dead. Add them to the cleanup worklist.
+if (auto *I = dyn_cast<Instruction>(AOp)) MaybeDead.push_back(I);
+if (auto *I = dyn_cast<Instruction>(BOp)) MaybeDead.push_back(I);
+}
+// Cleanup: any extension that's now use-less can be deleted.
+for (Instruction *I : MaybeDead) {
+if (I->use_empty() && (isa<ZExtInst>(I) || isa<SExtInst>(I) ||
+isa<TruncInst>(I))) {
+I->eraseFromParent();
+}
+}
+// Phase 2: narrow LSR-introduced i32 PHIs whose only uses (after
+// the mul-rewrite above) are trunc-to-i16 + a single self-feeding
+// `add %P, const` increment. Without this, even though the mul
+// operates on i16, the i32 PHI still requires 4 bytes of frame +
+// an i32 increment chain (post-PEI). LSR widened these from i16
+// to i32 to support a sub-expression that we've now narrowed —
+// the i32 representation has become dead weight.
+//
+// Guard with SCEV: `getUnsignedRange(%P).getActiveBits() <= 16`
+// proves the PHI never escapes u16, so the i16 add gives the same
+// low-16 bits as the original i32 add at every observable point
+// (the back-edge value can wrap on the exit iteration but is
+// never observed — exit takes the trip-end branch first).
+bool NarrowedAny = false;
+SmallVector<PHINode *, 4> PhiWorklist;
+for (BasicBlock &BB : F) {
+for (PHINode &PN : BB.phis()) {
+if (PN.getType()->isIntegerTy(32)) PhiWorklist.push_back(&PN);
+}
+}
+for (PHINode *PN : PhiWorklist) {
+// Classify every use.
+SmallVector<TruncInst *, 4> Truncs;
+BinaryOperator *Incr = nullptr;
+bool ok = true;
+for (User *U : PN->users()) {
+if (auto *TI = dyn_cast<TruncInst>(U)) {
+if (!TI->getDestTy()->isIntegerTy(16)) { ok = false; break; }
+Truncs.push_back(TI);
+continue;
+}
+auto *BO = dyn_cast<BinaryOperator>(U);
+if (!BO || BO->getOpcode() != Instruction::Add) { ok = false; break; }
+if (!isa<ConstantInt>(BO->getOperand(1))) { ok = false; break; }
+// BO must feed back to this PHI via at least one incoming edge.
+bool feedsBack = false;
+for (Value *Inc : PN->incoming_values()) {
+if (Inc == BO) { feedsBack = true; break; }
+}
+if (!feedsBack) { ok = false; break; }
+if (Incr) { ok = false; break; }
+Incr = BO;
+}
+if (!ok || !Incr || Truncs.empty()) continue;
+// Increment const must fit i16.
+auto *IncrCI = cast<ConstantInt>(Incr->getOperand(1));
+if (IncrCI->getValue().getActiveBits() > 16) continue;
+// Non-back-edge incomings must be i16-representable constants.
+for (Value *Inc : PN->incoming_values()) {
+if (Inc == Incr) continue;
+auto *CIv = dyn_cast<ConstantInt>(Inc);
+if (!CIv) { ok = false; break; }
+if (CIv->getValue().getActiveBits() > 16) { ok = false; break; }
+}
+if (!ok) continue;
+// SCEV bound check.
+if (!SE.isSCEVable(PN->getType())) continue;
+ConstantRange R = SE.getUnsignedRange(SE.getSCEV(PN));
+if (R.getActiveBits() > 16) continue;
+// Narrow. Build %narrow_phi in same BB, then %narrow_incr right
+// before Incr; patch incoming values to match.
+IRBuilder<> B(PN);
+PHINode *NewPN = B.CreatePHI(I16, PN->getNumIncomingValues(),
+PN->getName() + ".narrow");
+// Add placeholders for the back-edge incomings; we'll patch them
+// after building NewIncr.
+for (unsigned i = 0; i < PN->getNumIncomingValues(); ++i) {
+Value *Inc = PN->getIncomingValue(i);
+BasicBlock *Pred = PN->getIncomingBlock(i);
+if (Inc == Incr) {
+NewPN->addIncoming(UndefValue::get(I16), Pred);
+} else {
+auto *CIv = cast<ConstantInt>(Inc);
+NewPN->addIncoming(
+ConstantInt::get(I16, CIv->getZExtValue() & 0xFFFF),
+Pred);
+}
+}
+IRBuilder<> B2(Incr);
+Value *NewIncr = B2.CreateAdd(
+NewPN,
+ConstantInt::get(I16, IncrCI->getZExtValue() & 0xFFFF),
+Incr->getName() + ".narrow");
+if (auto *NewIncrBO = dyn_cast<BinaryOperator>(NewIncr)) {
+NewIncrBO->setHasNoUnsignedWrap(Incr->hasNoUnsignedWrap());
+NewIncrBO->setHasNoSignedWrap(Incr->hasNoSignedWrap());
+}
+for (unsigned i = 0; i < NewPN->getNumIncomingValues(); ++i) {
+if (isa<UndefValue>(NewPN->getIncomingValue(i))) {
+NewPN->setIncomingValue(i, NewIncr);
+}
+}
+// Replace trunc uses with the new narrow PHI, then break the
+// PHI/Incr use-cycle before erasing.
+for (TruncInst *TI : Truncs) {
+TI->replaceAllUsesWith(NewPN);
+TI->eraseFromParent();
+}
+// Incr is `add %PN, const`; PN's back-edge incoming references Incr.
+// Replace Incr's uses with undef so PN's back-edge becomes a dead
+// reference, then erase Incr, then PN.
+Incr->replaceAllUsesWith(UndefValue::get(Incr->getType()));
+Incr->eraseFromParent();
+PN->eraseFromParent();
+NarrowedAny = true;
 }
 return true;
 }
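
The Phase 2 rewrite replaces `trunc(i32 PHI)` with an i16 PHI that is incremented by the same constant. A minimal standalone sketch of why the observed values agree: truncation to 16 bits commutes with addition, so a 16-bit counter tracks the low half of a 32-bit counter step for step. The function name `narrowIvMatches` is hypothetical, and the sketch does not model the SCEV range guard or the constant-incoming checks the pass also applies.

```cpp
#include <cassert>
#include <cstdint>

// Runs the same counted loop with a 32-bit induction variable that is
// truncated at each use (pre-narrowing shape) and with a 16-bit
// induction variable (post-narrowing shape). Returns true if every
// value observed through the trunc matches the narrow counter.
bool narrowIvMatches(uint16_t start, uint16_t step, uint32_t trips) {
  uint32_t wide = start;
  uint16_t narrow = start;
  for (uint32_t t = 0; t < trips; ++t) {
    // The "use": what a `trunc i32 %P to i16` would observe.
    if (static_cast<uint16_t>(wide) != narrow) return false;
    wide += step;                                  // i32 add
    narrow = static_cast<uint16_t>(narrow + step); // i16 add
  }
  return true;
}
```

Calling `narrowIvMatches(1, 1, 1000)` walks the same sequence as the `i++` in sumSquares; the second counter is the form that shrinks the frame from 14 to 12 bytes in the listing.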


@@ -0,0 +1,289 @@
//===-- W65816PromoteFiToImg.cpp - Promote FrameIndex to IMG slot --------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===---------------------------------------------------------------------===//
//
// Post-RA, pre-PEI pass. Counts accesses to each i16-sized FrameIndex
// in the function and rewrites the top-K hottest ones to use IMG8..15
// DP slots ($C0/$C2/.../$CE) instead. K = number of free IMG8..15
// slots (slots not already used by regalloc decisions).
//
// Why post-RA: at this point regalloc has decided which vregs live in
// physical registers vs spill slots. The spills appear as the FI
// pseudo-opcodes (LDAfi/STAfi/ADCfi/SBCfi/ANDfi/ORAfi/EORfi/CMPfi),
// and the MFI tells us each FI's final size. We see all the accesses
// and can safely rewrite — eliminateFrameIndex hasn't yet baked the
// offsets into SP-relative immediates.
//
// Why before W65816ImgCalleeSave: ImgCalleeSave scans the post-PromoteFi
// MIR for IMG8..15 usage and emits prologue PHA-bracketed saves +
// epilogue restores for each used slot. Our promotion introduces
// fresh IMG8..15 references that ImgCalleeSave will then auto-cover.
//
// Per-access cost change:
// STAfi → STA_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B)
// LDAfi → LDA_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B)
// ADCfi → ADC_DP : 5 cyc / 3 B → 4 cyc / 2 B (saves 1 cyc/1 B)
// Per-slot one-time overhead (added by ImgCalleeSave):
// prologue save : ~10 cyc / 6 B
// epilogue restore: ~10 cyc / 6 B
// Net win if access_count * 1 > 20. Threshold is 5 to leave margin.
//
// Restrictions:
// - Only i16-sized FIs (2 bytes, offset 0). Larger slots (i32 halves,
// structs) are skipped.
// - Skips fixed/variable-sized objects.
// - Skips STA8fi (byte store needs SEP/REP wrap incompatible with
// simple STA_DP — and DP stores 16 bits in M=0).
// - Skips LDAfi_indY / STAfi_indY (indirect-Y form — different
// addressing).
//
//===---------------------------------------------------------------------===//
#include "W65816.h"
#include "W65816InstrInfo.h"
#include "W65816Subtarget.h"
#include "llvm/ADT/BitVector.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/Support/Debug.h"
using namespace llvm;
#define DEBUG_TYPE "w65816-promote-fi-to-img"
namespace {
class W65816PromoteFiToImg : public MachineFunctionPass {
public:
static char ID;
W65816PromoteFiToImg() : MachineFunctionPass(ID) {}
StringRef getPassName() const override {
return "W65816 promote FrameIndex to IMG8..15 DP slot";
}
bool runOnMachineFunction(MachineFunction &MF) override;
};
} // namespace
char W65816PromoteFiToImg::ID = 0;
INITIALIZE_PASS(W65816PromoteFiToImg, DEBUG_TYPE,
"W65816 promote FI to IMG", false, false)
FunctionPass *llvm::createW65816PromoteFiToImg() {
return new W65816PromoteFiToImg();
}
// Returns the operand index of the FrameIndex for the given FI pseudo
// opcode, or -1 if this opcode isn't a promotable FI carrier.
static int getFiOperandIdx(unsigned Opc) {
switch (Opc) {
case W65816::LDAfi: return 1;
case W65816::STAfi: return 1;
case W65816::CMPfi: return 1;
case W65816::ADCfi:
case W65816::SBCfi:
case W65816::ANDfi:
case W65816::ORAfi:
case W65816::EORfi: return 2;
default: return -1;
}
}
// Map a promotable FI pseudo to the corresponding DP MC opcode.
static unsigned getDpOpcode(unsigned Opc) {
switch (Opc) {
case W65816::LDAfi: return W65816::LDA_DP;
case W65816::STAfi: return W65816::STA_DP;
case W65816::CMPfi: return W65816::CMP_DP;
case W65816::ADCfi: return W65816::ADC_DP;
case W65816::SBCfi: return W65816::SBC_DP;
case W65816::ANDfi: return W65816::AND_DP;
case W65816::ORAfi: return W65816::ORA_DP;
case W65816::EORfi: return W65816::EOR_DP;
default: return 0;
}
}
// IMG8..IMG15 sit at DP addresses 0xC0, 0xC2, ..., 0xCE. IMG0..IMG7
// are at 0xD0..0xDE. Returns the DP byte for IMGn.
static uint8_t dpAddrForImg(unsigned ImgIdx) {
assert(ImgIdx < 16 && "IMG index out of range");
if (ImgIdx < 8) return 0xD0 + 2 * ImgIdx;
return 0xC0 + 2 * (ImgIdx - 8);
}
bool W65816PromoteFiToImg::runOnMachineFunction(MachineFunction &MF) {
// DISABLED: pass produces verifier errors ("Using an undefined physical
// register") on the kill-flag bookkeeping when an STAfi with `killed $a`
// is rewritten to STA_DP — the next i16-imm ADC/ADCE sees $a as dead.
// Also, for the FUNCTIONS where it would land (no-call, high-traffic
// slots), measured static + dynamic savings were modest and didn't
// justify the bookkeeping complexity. Re-enable after:
// - tightening kill-flag preservation: only carry kill if the same
// operand will be the last user in the new MI (which depends on
// post-rewrite scheduling — needs careful liveness re-analysis).
// - paired-PHI promotion: when fi#A is a PHI-input and fi#B is the
// matching PHI-output, map them to the SAME IMG slot so the
// PHI move collapses to a no-op (where most of the dynamic win
// would come from).
return false;
if (skipFunction(MF.getFunction())) return false;
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
const W65816InstrInfo *TII = STI.getInstrInfo();
MachineFrameInfo &MFI = MF.getFrameInfo();
// 1. Walk all instructions, count FI accesses for promotable opcodes.
DenseMap<int, unsigned> AccessCount;
DenseMap<int, SmallVector<MachineInstr *, 8>> AccessSites;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
int FiIdx = getFiOperandIdx(MI.getOpcode());
if (FiIdx < 0) continue;
const MachineOperand &MO = MI.getOperand(FiIdx);
if (!MO.isFI()) continue;
int FI = MO.getIndex();
// Require: 2-byte size, fixed (not variable), offset operand == 0.
// The offset operand sits right after the FI operand.
if (MFI.isVariableSizedObjectIndex(FI)) continue;
if (MFI.getObjectSize(FI) != 2) continue;
// Fixed (negative-index) slots are arg slots — leave them alone.
// Promotion would break LowerFormalArguments's expected layout.
if (FI < 0) continue;
const MachineOperand &OffMO = MI.getOperand(FiIdx + 1);
if (!OffMO.isImm() || OffMO.getImm() != 0) continue;
AccessCount[FI]++;
AccessSites[FI].push_back(&MI);
}
}
if (AccessCount.empty()) return false;
// 2. Determine which IMG8..15 slots are already in use.
BitVector UsedImg(8, false);
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
for (const MachineOperand &MO : MI.operands()) {
if (!MO.isReg() || !MO.getReg().isPhysical()) continue;
Register R = MO.getReg();
// IMG8..15 are not numerically contiguous with each other in
// the W65816 register enum (subreg-pair regs sit between
// IMG indices). Spell them out explicitly.
unsigned ImgIdx = 16; // "not an IMG8..15"
if (R == W65816::IMG8) ImgIdx = 0;
else if (R == W65816::IMG9) ImgIdx = 1;
else if (R == W65816::IMG10) ImgIdx = 2;
else if (R == W65816::IMG11) ImgIdx = 3;
else if (R == W65816::IMG12) ImgIdx = 4;
else if (R == W65816::IMG13) ImgIdx = 5;
else if (R == W65816::IMG14) ImgIdx = 6;
else if (R == W65816::IMG15) ImgIdx = 7;
if (ImgIdx < 8) UsedImg.set(ImgIdx);
}
}
}
// 3. Sort FIs by access count (descending).
SmallVector<int, 16> Ordered;
for (auto &P : AccessCount) Ordered.push_back(P.first);
std::sort(Ordered.begin(), Ordered.end(),
[&](int A, int B) { return AccessCount[A] > AccessCount[B]; });
// 4. Assign IMG slots greedily. Each IMG8..15 slot used triggers
// a save/restore pair in W65816ImgCalleeSave (~20 cyc + ~12 B
// per slot per CALL into this function). For recursive or
// deep-call-stack functions, that overhead dominates the per-
// access savings — measured: promoting 4 slots in fib(10)
// regressed it 38% (12617 → 17391 cyc). Gate on a very high
// threshold + bail entirely if the function has any calls (the
// save/restore cost compounds with recursion / call frequency
// in ways the static access count can't capture).
bool HasCalls = false;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
if (MI.isCall()) { HasCalls = true; break; }
}
if (HasCalls) break;
}
const unsigned kAccessThreshold = HasCalls ? 999999u : 5u;
DenseMap<int, unsigned> FiToImgIdx;
unsigned NextFreeImg = 0;
for (int FI : Ordered) {
if (AccessCount[FI] < kAccessThreshold) break;
while (NextFreeImg < 8 && UsedImg.test(NextFreeImg)) ++NextFreeImg;
if (NextFreeImg >= 8) break;
FiToImgIdx[FI] = NextFreeImg + 8; // Map to IMG8..15
++NextFreeImg;
}
if (FiToImgIdx.empty()) return false;
// 5. Rewrite each access. Insert the new DP MC inst before the
// pseudo, then erase the pseudo. Preserve flags and tied-def
// semantics via implicit operands.
bool Changed = false;
for (auto &P : FiToImgIdx) {
int FI = P.first;
unsigned ImgIdx = P.second;
uint8_t DpAddr = dpAddrForImg(ImgIdx);
LLVM_DEBUG(dbgs() << "Promote fi#" << FI << " -> IMG"
<< ImgIdx << " ($" << format("%02x", DpAddr)
<< "), " << AccessCount[FI] << " accesses\n");
for (MachineInstr *MI : AccessSites[FI]) {
unsigned Opc = MI->getOpcode();
unsigned NewOpc = getDpOpcode(Opc);
if (!NewOpc) continue;
MachineBasicBlock *MBB = MI->getParent();
DebugLoc DL = MI->getDebugLoc();
MachineInstrBuilder NewMI =
BuildMI(*MBB, MI, DL, TII->get(NewOpc)).addImm(DpAddr);
// Carry implicit-def $a (LDA/ADC/SBC/AND/ORA/EOR all write $a)
// and implicit-use $a (STA/CMP/ADC/SBC/AND/ORA/EOR all read $a).
// ADCfi/SBCfi additionally use $p; their DP equivalents read $p
// implicitly via the tablegen Defs/Uses. But since we built the
// new MI from TII->get(NewOpc), the implicit operands from the
// descriptor are auto-added. We only need to copy non-FI explicit
// operands... which for our pseudos are register operands. The
// physical register defs/uses they carry must be preserved.
for (const MachineOperand &MO : MI->operands()) {
if (MO.isReg() && MO.getReg().isPhysical() && MO.isImplicit()) {
// Skip — already added by descriptor.
continue;
}
if (MO.isReg() && MO.getReg().isPhysical() && !MO.isImplicit()) {
// Explicit physreg operand (e.g., the $a in STAfi $a, fi, 0).
// Convert to implicit so the DP MC inst's descriptor matches.
RegState Flags = MO.isDef() ? RegState::ImplicitDefine
: RegState::Implicit;
if (MO.isKill()) Flags = Flags | RegState::Kill;
NewMI.addReg(MO.getReg(), Flags);
}
// FI/offset operands are skipped — replaced by the DP imm above.
// VReg defs/uses should be gone post-RA; if any survived, skip.
}
MI->eraseFromParent();
Changed = true;
}
// Mark the FI as dead so PEI can skip allocating stack for it.
// MFI doesn't expose RemoveStackObject publicly, but setting size
// to 0 also works in most code paths. Actually leave it alive —
// a 2-byte unused slot is cheap, and removing exposes us to
// PEI bugs.
}
return Changed;
}
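
Steps 3 and 4 of the pass (sort frame indices by access count, then hand out free IMG8..15 slots greedily) can be exercised without LLVM. The sketch below is a simplified standalone model under stated assumptions: plain standard containers stand in for `DenseMap` and `BitVector`, `assignImgSlots` is a hypothetical name, and `std::stable_sort` is used so that ties keep input order, which is a choice made for this sketch rather than something the pass does. `dpAddrForImg` mirrors the address mapping shown in the file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// DP address of IMGn: IMG0..7 at 0xD0..0xDE, IMG8..15 at 0xC0..0xCE.
uint8_t dpAddrForImg(unsigned imgIdx) {
  assert(imgIdx < 16 && "IMG index out of range");
  if (imgIdx < 8) return static_cast<uint8_t>(0xD0 + 2 * imgIdx);
  return static_cast<uint8_t>(0xC0 + 2 * (imgIdx - 8));
}

// accessCount holds (frame index, access count) pairs; usedImg[i] is
// true when IMG(8+i) is already taken. Returns frame index -> IMG index.
std::map<int, unsigned> assignImgSlots(
    std::vector<std::pair<int, unsigned>> accessCount,
    const std::vector<bool> &usedImg, unsigned threshold) {
  // Hottest frame index first.
  std::stable_sort(accessCount.begin(), accessCount.end(),
                   [](const std::pair<int, unsigned> &a,
                      const std::pair<int, unsigned> &b) {
                     return a.second > b.second;
                   });
  std::map<int, unsigned> fiToImg;
  unsigned next = 0;
  for (const std::pair<int, unsigned> &entry : accessCount) {
    if (entry.second < threshold) break;  // everything after is colder
    while (next < usedImg.size() && usedImg[next]) ++next;
    if (next >= usedImg.size()) break;    // no free IMG8..15 slot left
    fiToImg[entry.first] = next + 8;      // map to IMG8..15
    ++next;
  }
  return fiToImg;
}
```

With a threshold of 5 and IMG8 already in use, a frame index with 9 accesses lands in IMG9, one with 7 accesses in IMG10, and one with 3 accesses is left on the stack.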


@@ -41,6 +41,7 @@
 #include "W65816InstrInfo.h"
 #include "W65816Subtarget.h"
 #include "llvm/ADT/SmallSet.h"
+#include "llvm/Support/raw_ostream.h"
 #include "llvm/CodeGen/MachineFunction.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/MachineInstr.h"
@@ -433,8 +434,22 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
 auto isLdaSR = [](const MachineInstr &MI) {
 return MI.getOpcode() == W65816::LDA_StackRel;
 };
+// Accept LDA_Imm16 (MC) AND LDAi16imm (pseudo) inside the wrap —
+// both are flag-clobbering A-loads of a 16-bit immediate, with
+// no stack-rel offset to bump-undo and no memory operand to
+// alias-check against the gap. Common in init blocks: `lda #0 ;
+// sta slot,s` wrapped around the loop pre-test. Some functions
+// still carry the pseudo LDAi16imm at SepRepCleanup time (post-RA
+// pseudo expansion didn't lower it), so accept both spellings.
+auto isImmLoad = [](const MachineInstr &MI) {
+unsigned O = MI.getOpcode();
+return O == W65816::LDA_Imm16 || O == W65816::LDAi16imm;
+};
 auto isFlagPreservingMem = [&](const MachineInstr &MI) {
-return isStaLike(MI) || isLdaSR(MI);
+return isStaLike(MI) || isLdaSR(MI) || isImmLoad(MI);
+};
+auto isLdaCount = [&](const MachineInstr &MI) {
+return isLdaSR(MI) || isImmLoad(MI);
 };
 auto It = MBB.begin();
 while (It != MBB.end()) {
@@ -450,8 +465,11 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
 if (Walker->isDebugInstr()) { ++Walker; continue; }
 if (Walker->getOpcode() == W65816::PLP) break;
 if (!isFlagPreservingMem(*Walker)) { ok = false; break; }
-// Track slots so we can check the gap below.
-if (Walker->getNumOperands() >= 1 && Walker->getOperand(0).isImm()) {
+// Track stack-rel slots so we can check the gap below.
+// Immediate loads have no stack-rel addr — skip.
+if (!isImmLoad(*Walker) &&
+Walker->getNumOperands() >= 1 &&
Walker->getOperand(0).isImm()) {
int64_t off = Walker->getOperand(0).getImm(); int64_t off = Walker->getOperand(0).getImm();
if (isLdaSR(*Walker)) ReadSlots.insert(off); if (isLdaSR(*Walker)) ReadSlots.insert(off);
else WriteSlots.insert(off); else WriteSlots.insert(off);
@ -483,11 +501,23 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
// it earlier would lose the value. // it earlier would lose the value.
unsigned NLda = 0, NSta = 0; unsigned NLda = 0, NSta = 0;
for (MachineInstr *MI : Block) { for (MachineInstr *MI : Block) {
if (isLdaSR(*MI)) ++NLda; if (isLdaCount(*MI)) ++NLda;
else if (isStaLike(*MI)) ++NSta; else if (isStaLike(*MI)) ++NSta;
} }
NSta += Trailing.size(); NSta += Trailing.size();
if (NLda != NSta) { ++It; continue; } if (NLda != NSta) { ++It; continue; }
// Even with paired LDA-STA, the LAST LDA's $a value can still
// be consumed downstream — by a successor's first STA — making
// it a fall-through register-PHI. If $a is live-out at MBB
// end (any successor has $a as live-in), bail. Caught by
// sumTable, where `lda #0` (wrap) feeds A into bb.2's `sta 0x1,
// s`, with `sta 0x9, s` (trailing) just happening to also store
// the same A — the pair count balances but A is still live-out.
bool aLiveOut = false;
for (MachineBasicBlock *Succ : MBB.successors()) {
if (Succ->isLiveIn(W65816::A)) { aLiveOut = true; break; }
}
if (aLiveOut) { ++It; continue; }
// Walk backward from PHP to find the hoist insertion point. // Walk backward from PHP to find the hoist insertion point.
// The hoisted block clobbers $a and $p (LDA writes both). // The hoisted block clobbers $a and $p (LDA writes both).
// Skip insts that USE $a (consumer of an earlier $a producer) // Skip insts that USE $a (consumer of an earlier $a producer)
@ -880,5 +910,362 @@ bool W65816SepRepCleanup::runOnMachineFunction(MachineFunction &MF) {
++It2; ++It2;
} }
} }
// Store forwarding (disabled — CRC32 regressed and I couldn't
// nail down the safety hole in time). Even with PHP-wrap guards
// and SP-modifier bails, the first fire (in memmove) silently
// miscompiles something that CRC32 later depends on. Pattern
// is sound; safety analysis isn't complete. See
// feedback_close_gap_attempts_round2.md for details.
#if 0
// Store forwarding for PHI memory copies. Pattern (sumSquares
// loop body):
//
// STA X,s ; A → slot X (some intermediate result)
// [code that modifies A but doesn't touch slot X or slot Y]
// LDA X,s ; reload A from slot X
// STA Y,s ; A → slot Y (the PHI copy)
//
// Transform: insert `STA Y,s` right after the first `STA X,s` (A
// still holds the same value at that point), then drop the LDA-
// STA pair. Net: -1 inst per pattern occurrence.
//
// Safety constraints (all between STA X and the LDA-STA pair, in
// the same MBB, in straight-line code):
// - No instruction writes slot X (else the LDA would see a
// different value than the original STA).
// - No instruction reads OR writes slot Y (else our early STA Y
// would be observed mid-flight with a different value than
// before, or our inserted store would be overwritten and the
// intervening read of Y in the original would have seen the
// overwrite).
// - No call / inline asm / branch (conservatively: those can
// touch memory we don't model).
{
auto isStackRelMC2 = [](unsigned Op) {
return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
};
auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) -> bool {
if (!isStackRelMC2(MI.getOpcode())) return false;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
Off = MI.getOperand(0).getImm();
return true;
};
auto isStaSr = [](const MachineInstr &MI) {
return MI.getOpcode() == W65816::STA_StackRel;
};
auto isLdaSr = [](const MachineInstr &MI) {
return MI.getOpcode() == W65816::LDA_StackRel;
};
SmallVector<MachineInstr *, 4> ToErase;
SmallVector<std::tuple<MachineInstr *, int64_t>, 4> ToInsert;
static int g_fireLimit = -1;
static int g_fireCount = 0;
static bool initd = false;
if (!initd) {
if (const char *e = getenv("STORE_FWD_LIMIT")) g_fireLimit = atoi(e);
initd = true;
}
for (MachineBasicBlock &MBB : MF) {
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (!isStaSr(*It)) continue;
int64_t X;
if (!srAccess2(*It, X)) continue;
MachineInstr *StaX = &*It;
// Check if StaX is INSIDE an open PHP/PLP wrap. In that case
// its operand offset has been pre-bumped by +1, and inserting
// a sibling STA Y immediately after writes at the WRONG slot
// (the un-bumped Y). Walk backward: if we find a PHP without
// a matching PLP first, bail.
{
bool insideWrap = false;
int depth = 0;
auto B = It;
while (B != MBB.begin()) {
--B;
if (B->getOpcode() == W65816::PLP) depth++;
else if (B->getOpcode() == W65816::PHP) {
if (depth > 0) depth--;
else { insideWrap = true; break; }
}
}
if (insideWrap) continue;
}
// Walk forward looking for LDA X ; STA Y. Conservative bail
// on any non-tracked memory op (indirect pointer access,
// DP/abs ops, etc.) which could alias slot Y via memory.
bool ok = true;
int64_t Y = -1;
MachineInstr *LdaX = nullptr;
MachineInstr *StaY = nullptr;
for (auto Walker = std::next(It); Walker != MBB.end(); ++Walker) {
if (Walker->isDebugInstr()) continue;
if (Walker->isCall() || Walker->isInlineAsm() ||
Walker->isBranch() || Walker->isReturn()) {
ok = false; break;
}
// Found LDA X?
int64_t Off;
if (isLdaSr(*Walker) && srAccess2(*Walker, Off) && Off == X) {
LdaX = &*Walker;
auto Next = std::next(Walker);
while (Next != MBB.end() && Next->isDebugInstr()) ++Next;
if (Next == MBB.end() || !isStaSr(*Next) ||
!srAccess2(*Next, Y) || Y == X) {
ok = false;
} else {
StaY = &*Next;
}
break;
}
// Stack-rel access to X (write or read): bail.
if (srAccess2(*Walker, Off) && Off == X) {
ok = false; break;
}
// Any memory-touching op that's NOT a tracked stack-rel
// access — bail. Indirect pointer stores/loads (DPIndY /
// DPIndLong / abs / etc.) could alias slot Y via a pointer
// we can't trace, and the safety check below would miss it.
if ((Walker->mayLoad() || Walker->mayStore()) &&
!isStackRelMC2(Walker->getOpcode())) {
ok = false; break;
}
// SP-modifying ops shift the stack-rel addressing window —
// a later `lda X, s` reads a DIFFERENT byte than the earlier
// `sta X, s` (or worse, the new stack pointer points into
// saved P/retaddr). Bail on TCS (direct SP write) and on
// any stack push/pop (PHx/PLx/PEA/PEI/COP/BRK). Also bail
// on PHP/PLP because the wrap pass already bumped in-wrap
// stack-rel ops by +1 — our inserted STA after STA X writes
// at the un-bumped offset which gets the WRONG slot.
{
unsigned WO = Walker->getOpcode();
if (WO == W65816::TCS || WO == W65816::PHA ||
WO == W65816::PLA || WO == W65816::PHX ||
WO == W65816::PLX || WO == W65816::PHY ||
WO == W65816::PLY || WO == W65816::PHP ||
WO == W65816::PLP || WO == W65816::PHB ||
WO == W65816::PLB || WO == W65816::PHD ||
WO == W65816::PLD || WO == W65816::PHK ||
WO == W65816::PEA || WO == W65816::PEI_DP) {
ok = false; break;
}
}
}
if (!ok || !LdaX || !StaY) continue;
if (g_fireLimit >= 0 && g_fireCount >= g_fireLimit) continue;
g_fireCount++;
errs() << "SF FIRE " << g_fireCount << " in " << MF.getName()
<< " MBB " << MBB.getNumber()
<< " X=" << X << " Y=" << StaY->getOperand(0).getImm()
<< "\n";
// Now re-walk from std::next(It) up to LdaX and verify no
// access to slot Y in that gap.
ok = true;
for (auto W2 = std::next(It); W2 != LdaX->getIterator(); ++W2) {
if (W2->isDebugInstr()) continue;
int64_t Off;
if (srAccess2(*W2, Off) && Off == Y) { ok = false; break; }
}
if (!ok) continue;
// Safe to apply: schedule the StaY-after-StaX insert, and
// erase LdaX and StaY.
ToInsert.push_back({StaX, Y});
ToErase.push_back(LdaX);
ToErase.push_back(StaY);
Changed = true;
}
}
// Apply (insertions first; iterators stay valid through erase).
for (auto &P : ToInsert) {
MachineInstr *StaX = std::get<0>(P);
int64_t Y = std::get<1>(P);
MachineBasicBlock *MBB = StaX->getParent();
DebugLoc DL = StaX->getDebugLoc();
auto NextIt = std::next(StaX->getIterator());
BuildMI(*MBB, NextIt, DL, TII.get(W65816::STA_StackRel))
.addImm(Y);
}
for (MachineInstr *MI : ToErase) MI->eraseFromParent();
}
#endif
// (Redundant CMP #0 elimination — disabled, hit VLA sum_n
// regression. Carry-flag bookkeeping across the CMP turned out to
// have more cases than my forward-walk modeled. See
// feedback_cmp_zero_elim.md.)
#if 0
{
auto isNZSetOnA = [](unsigned Op) {
switch (Op) {
case W65816::DEA_PSEUDO: case W65816::INA_PSEUDO:
case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
case W65816::AND_StackRel: case W65816::AND_DP: case W65816::AND_Imm16:
case W65816::ORA_StackRel: case W65816::ORA_DP: case W65816::ORA_Imm16:
case W65816::EOR_StackRel: case W65816::EOR_DP: case W65816::EOR_Imm16:
case W65816::LDA_StackRel: case W65816::LDA_DP:
case W65816::LDAi16imm: case W65816::LDA_Imm16:
case W65816::TXA: case W65816::TYA:
case W65816::ADCi16imm: case W65816::ADCEi16imm:
case W65816::SBCi16imm: case W65816::SBCEi16imm:
return true;
default:
return false;
}
};
auto isCmpZero = [](const MachineInstr &MI) {
if (MI.getOpcode() != W65816::CMPi16imm) return false;
// Operand layout: lhs (Acc16), imm. Find the imm.
for (const MachineOperand &MO : MI.operands()) {
if (MO.isImm()) return MO.getImm() == 0;
}
return false;
};
auto modifiesA = [](const MachineInstr &MI) {
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
return true;
}
return false;
};
auto readsC = [](const MachineInstr &MI) {
// We don't model individual flag bits; approximate with a fixed
// list of carry-consuming opcodes.
unsigned Op = MI.getOpcode();
switch (Op) {
case W65816::ADC_StackRel: case W65816::ADC_DP: case W65816::ADC_Imm16:
case W65816::SBC_StackRel: case W65816::SBC_DP: case W65816::SBC_Imm16:
case W65816::ADCEi16imm: case W65816::SBCEi16imm:
case W65816::BCC: case W65816::BCS:
case W65816::ROL_A: case W65816::ROR_A:
return true;
default:
return false;
}
};
SmallVector<MachineInstr *, 4> CmpsToErase;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
if (!isCmpZero(MI)) continue;
// Walk backward, skipping flag-preserving instructions.
bool foundProducer = false;
auto Back = MI.getIterator();
while (Back != MBB.begin()) {
--Back;
if (Back->isDebugInstr()) continue;
if (Back->isCall() || Back->isInlineAsm()) break;
if (modifiesA(*Back)) {
foundProducer = isNZSetOnA(Back->getOpcode());
break;
}
bool defsP = false;
for (const MachineOperand &MO : Back->operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) {
defsP = true; break;
}
}
if (defsP) break;
}
if (!foundProducer) continue;
// Walk FORWARD from CMP: until the next C-defining MI, no MI
// reads C.
bool cConsumed = false;
for (auto Fwd = std::next(MI.getIterator()); Fwd != MBB.end(); ++Fwd) {
if (Fwd->isDebugInstr()) continue;
if (readsC(*Fwd)) { cConsumed = true; break; }
// Next def of $p: subsequent reads aren't ours.
bool defsP = false;
for (const MachineOperand &MO : Fwd->operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef()) {
defsP = true; break;
}
}
if (defsP) break;
}
if (cConsumed) continue;
CmpsToErase.push_back(&MI);
}
}
for (MachineInstr *MI : CmpsToErase) MI->eraseFromParent();
if (!CmpsToErase.empty()) Changed = true;
}
#endif
// (Narrow PHI-copy slot collapse — disabled, qsort regression.)
#if 0
{
auto isStackRelMC2 = [](unsigned Op) {
return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
};
auto srAccess2 = [&](const MachineInstr &MI, int64_t &Off) {
if (!isStackRelMC2(MI.getOpcode())) return false;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
Off = MI.getOperand(0).getImm();
return true;
};
DenseMap<int64_t, unsigned> Refs;
DenseMap<int64_t, MachineInstr *> StaInst, LdaInst;
DenseMap<int64_t, unsigned> NSta, NLda;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
int64_t Off;
if (!srAccess2(MI, Off)) continue;
Refs[Off]++;
if (MI.getOpcode() == W65816::STA_StackRel) {
NSta[Off]++; StaInst[Off] = &MI;
} else if (MI.getOpcode() == W65816::LDA_StackRel) {
NLda[Off]++; LdaInst[Off] = &MI;
}
}
}
SmallVector<MachineInstr *, 4> ToErase;
for (auto &P : Refs) {
int64_t X = P.first;
if (P.second != 2) continue; // exactly 2 references
if (NSta[X] != 1 || NLda[X] != 1) continue;
MachineInstr *Sta = StaInst[X];
MachineInstr *Lda = LdaInst[X];
if (Sta->getParent() != Lda->getParent()) continue;
MachineBasicBlock *MBB = Sta->getParent();
// Sta must be before Lda.
bool staBefore = false;
for (auto It = MBB->begin(); It != MBB->end(); ++It) {
if (&*It == Sta) { staBefore = true; break; }
if (&*It == Lda) break;
}
if (!staBefore) continue;
// Next after Lda must be STA Y where Y != X.
auto NextIt = std::next(Lda->getIterator());
while (NextIt != MBB->end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB->end()) continue;
int64_t Y;
if (NextIt->getOpcode() != W65816::STA_StackRel ||
!srAccess2(*NextIt, Y) || Y == X) continue;
// Between Sta and Lda, no read/write of slot Y, no call, no
// anything that would re-set slot Y's value mid-flight.
bool ok = true;
for (auto It = std::next(Sta->getIterator()); It != Lda->getIterator();
++It) {
if (It->isDebugInstr()) continue;
if (It->isCall() || It->isInlineAsm()) { ok = false; break; }
int64_t Off;
if (srAccess2(*It, Off) && Off == Y) { ok = false; break; }
}
if (!ok) continue;
// Redirect the original STA to write to Y; delete the LDA-STA pair.
Sta->getOperand(0).setImm(Y);
ToErase.push_back(Lda);
ToErase.push_back(&*NextIt);
Changed = true;
}
for (MachineInstr *MI : ToErase) MI->eraseFromParent();
}
#endif
return Changed; return Changed;
} }

View file

@ -1492,6 +1492,14 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
} }
return false; return false;
}; };
// Pass 1c can only eliminate CMPi16imm $a, 0 if the preceding
// A-modifier reliably sets N/Z to reflect A's final value. LDAfi
// under FP-rel expansion (`sty $fa ; ldy #imm ; lda [$f6],y ; ldy $fa`)
// ends with `ldy` that clobbers N/Z based on OLD Y, not loaded A — so
// in FP-rel functions (VLA / huge frame), the CMP is load-bearing.
// Skip the whole pass for such functions (saves us from the sum_n
// VLA regression that the PHP-wrap-aware variant tripped).
bool ssCleanupSPRelOnly = !UsesFPRel;
for (MachineBasicBlock &MBB : MF) { for (MachineBasicBlock &MBB : MF) {
SmallVector<MachineInstr *, 8> Cmps; SmallVector<MachineInstr *, 8> Cmps;
for (MachineInstr &MI : MBB) for (MachineInstr &MI : MBB)
@ -1516,10 +1524,27 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
// condition). Caused __adddf3's renormalize while-loop to // condition). Caused __adddf3's renormalize while-loop to
// skip its body even though `mr & ~mask` was non-zero. // skip its body even though `mr & ~mask` was non-zero.
bool SafeToErase = true; bool SafeToErase = true;
bool insidePHPWrap = false;
for (auto It = std::next(Cmp->getIterator()); for (auto It = std::next(Cmp->getIterator());
It != Cmp->getParent()->end(); ++It) { It != Cmp->getParent()->end(); ++It) {
if (It->isDebugInstr()) continue; if (It->isDebugInstr()) continue;
if (It->isBranch() || It->isReturn()) break; if (It->isBranch() || It->isReturn()) break;
// PHP/PLP-wrap-aware: only safe when LDAfi-expansion sets N/Z
// reliably (SP-rel functions, not FP-rel).
if (ssCleanupSPRelOnly && It->getOpcode() == W65816::PHP) {
// PHP must be IMMEDIATELY after CMP to capture CMP's flags.
if (&*It != &*std::next(Cmp->getIterator())) {
SafeToErase = false;
break;
}
insidePHPWrap = true;
continue;
}
if (It->getOpcode() == W65816::PLP) {
insidePHPWrap = false;
continue;
}
if (insidePHPWrap) continue;
if (It->getOpcode() == TargetOpcode::COPY) { if (It->getOpcode() == TargetOpcode::COPY) {
SafeToErase = false; SafeToErase = false;
break; break;

View file

@ -0,0 +1,733 @@
//===-- W65816StackSlotMerge.cpp - Merge value-equivalent stack slots ----===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===---------------------------------------------------------------------===//
//
// Pre-emit pass that runs after PEI (eliminateFrameIndex) and merges
// pairs of stack-rel slots that hold the same value at every observable
// program point — typically the PHI src/dst pair PHI-elim leaves at
// the back-edge of a loop body.
//
// LLVM's StackSlotColoring merges slots with non-overlapping liveness.
// It can't merge slots that are simultaneously live but happen to hold
// the same value (which is what a PHI memory-copy creates). This pass
// catches that case via a stricter "value equivalence" check.
//
// Canonical pattern (sumSquares loop body):
//
// .LBB0_4:
// LDA 0x7, s ; PHA ; JSL __umulhisi3 ; PLY
// CLC ; ADC 0x3, s ; STA 0xb, s ; new total.lo (write X)
// TXA ; ADC 0x1, s ; STA 0x9, s
// LDA 0x7, s ; INC A ; STA 0x7, s
// LDA 0xb, s ; STA 0x3, s ; PHI copy: load X, store Y
// LDA 0x9, s ; STA 0x1, s
// ...
//
// The pair (0xb, 0x3) is the lo-half PHI memory copy. Slots 0xb and
// 0x3 always hold the same value at every read site:
// - Function entry: both initialized to 0 (`lda #0; sta 0xb, s` in
// entry, `lda #0; sta 0x3, s` in preheader).
// - Loop iteration: the PHI copy moves the new total.lo from 0xb to
// 0x3 at the end of every iteration.
// - Exit: only 0xb is read (return value), but its value equals 0x3's.
//
// Rename 0xb → 0x3 function-wide; the now self-copy `lda 0x3; sta 0x3`
// is dead and we erase it. Saves 2 inst per PHI copy occurrence (the
// memory copy round-trip). sumSquares loop body shrinks from 21 to
// 17 inst per iter.
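//
// As a sketch of the result (derived by hand from the listing above,
// not captured compiler output, and assuming the hi-half pair
// (0x9, 0x1) merges the same way as (0xb, 0x3)), the loop body after
// the rename and self-copy erasure would read roughly:
//
//   .LBB0_4:
//     LDA 0x7, s ; PHA ; JSL __umulhisi3 ; PLY
//     CLC ; ADC 0x3, s ; STA 0x3, s      ; total.lo updated in place
//     TXA ; ADC 0x1, s ; STA 0x1, s      ; total.hi updated in place
//     LDA 0x7, s ; INC A ; STA 0x7, s
//     ...
//
// The two LDA/STA copy pairs are gone, which accounts for the 4
// instructions saved per iteration.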
//
// Safety check (sufficient condition for value equivalence):
// 1. Both slots have ≥1 STA in the function (skips arg slots passed
// by the caller — those have only LDA reads, no STAs, and renaming
// would change where we read the arg from).
// 2. For every STA X in the function, find a "twin" STA Y at a
// program point where the values match. Matching = either:
// (a) Same MBB, same A-source value (no intervening A-define).
// Covers the loop-body iter-end pattern: STA X then later
// LDA X ; STA Y. Also covers entry's `lda #N ; sta X` if
// the same MBB also has `sta Y`.
// (b) Different MBBs, both preceded by `LDA #const` of the same
// constant. Covers entry-block STA X=0 paired with
// preheader STA Y=0.
// 3. Symmetric: for every STA Y, find a twin STA X.
// 4. No "orphan" STAs. If a STA X or STA Y has no twin, bail.
//
// When all checks pass, the function-wide rename preserves semantics:
// every read of slot X at program point P sees the same value that
// slot Y holds at P (and vice versa).
//
//===---------------------------------------------------------------------===//
#include "W65816.h"
#include "W65816InstrInfo.h"
#include "W65816Subtarget.h"
#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineDominators.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/InitializePasses.h"
#include "llvm/Support/Debug.h"
#include <optional>
using namespace llvm;
#define DEBUG_TYPE "w65816-stack-slot-merge"
namespace {
class W65816StackSlotMerge : public MachineFunctionPass {
public:
static char ID;
W65816StackSlotMerge() : MachineFunctionPass(ID) {}
StringRef getPassName() const override {
return "W65816 merge value-equivalent stack slots (PHI-copy collapse)";
}
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<MachineDominatorTreeWrapperPass>();
AU.setPreservesCFG();
MachineFunctionPass::getAnalysisUsage(AU);
}
bool runOnMachineFunction(MachineFunction &MF) override;
};
} // namespace
char W65816StackSlotMerge::ID = 0;
INITIALIZE_PASS_BEGIN(W65816StackSlotMerge, DEBUG_TYPE,
"W65816 stack slot merge", false, false)
INITIALIZE_PASS_DEPENDENCY(MachineDominatorTreeWrapperPass)
INITIALIZE_PASS_END(W65816StackSlotMerge, DEBUG_TYPE,
"W65816 stack slot merge", false, false)
FunctionPass *llvm::createW65816StackSlotMerge() {
return new W65816StackSlotMerge();
}
// Stack-relative MC opcodes — the ops that survive eliminateFrameIndex
// and reference a slot via an 8-bit SP-relative offset.
static bool isStackRelOp(unsigned Op) {
return Op == W65816::LDA_StackRel || Op == W65816::STA_StackRel ||
Op == W65816::ADC_StackRel || Op == W65816::SBC_StackRel ||
Op == W65816::AND_StackRel || Op == W65816::ORA_StackRel ||
Op == W65816::EOR_StackRel || Op == W65816::CMP_StackRel;
}
// Returns true if MI is a stack-rel op; out-param Off receives the slot
// offset (operand 0).
static bool srAccess(const MachineInstr &MI, int64_t &Off) {
if (!isStackRelOp(MI.getOpcode())) return false;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) return false;
Off = MI.getOperand(0).getImm();
return true;
}
// True if the MI semantically defines A. Covers both the explicit
// case (operand has reg=A,isDef) AND the implicit case where the
// tablegen InstDP / InstAbs / etc. base classes omit the A-Def
// annotation despite LDA semantically writing A (a backend modelling
// gap — many `LDA_DP`, `LDA_Abs`, `LDA_LongX`, etc. are missing the
// implicit-def in the MIR even though they load into A). Opcode-
// based fallback catches all of them.
static bool semanticallyDefsA(const MachineInstr &MI) {
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::A && MO.isDef())
return true;
}
unsigned Op = MI.getOpcode();
switch (Op) {
case W65816::LDA_DP: case W65816::LDA_DPX:
case W65816::LDA_DPInd: case W65816::LDA_DPIndY:
case W65816::LDA_DPIndX:
case W65816::LDA_Abs: case W65816::LDA_AbsX:
case W65816::LDA_AbsY: case W65816::LDA_Long:
case W65816::LDA_LongX:
case W65816::PLA:
return true;
default:
return false;
}
}
// Walk backward from MI in its MBB looking for the most recent A-define.
// Returns the MI that defines A, or nullptr if none in the same MBB.
// Skips debug instructions. Gives up (returns nullptr) at the MBB
// boundary, at a call, or at inline asm.
static MachineInstr *findPriorADef(MachineInstr *MI) {
MachineBasicBlock *MBB = MI->getParent();
auto It = MI->getIterator();
while (It != MBB->begin()) {
--It;
if (It->isDebugInstr()) continue;
if (It->isCall() || It->isInlineAsm()) return nullptr;
if (semanticallyDefsA(*It)) return &*It;
}
return nullptr;
}
// Walk forward from `Start` (exclusive) up to (but not including) `End`
// in the same MBB, tracking whether slot `WatchSlot` is written.
// Returns true if slot `WatchSlot` is NOT written in the interval.
static bool slotNotWrittenBetween(MachineBasicBlock::iterator Start,
MachineBasicBlock::iterator End,
int64_t WatchSlot) {
for (auto It = std::next(Start); It != End; ++It) {
if (It->isDebugInstr()) continue;
int64_t Off;
if (It->getOpcode() == W65816::STA_StackRel && srAccess(*It, Off) &&
Off == WatchSlot) {
return false;
}
}
return true;
}
// Returns true if MI clobbers P (N/Z/C/V flags). Mirrors LLVM's
// operand-based check + an opcode whitelist for tablegen entries that
// omit `Defs = [P]` (InstImplied, InstStackRel, etc.).
static bool clobbersFlagsP(const MachineInstr &MI) {
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
return true;
}
if (MI.isCall() || MI.isInlineAsm()) return true;
unsigned Op = MI.getOpcode();
switch (Op) {
case W65816::PLA: case W65816::PLY: case W65816::PLX:
case W65816::PLP:
case W65816::INA: case W65816::DEA:
case W65816::INX: case W65816::DEX:
case W65816::INY: case W65816::DEY:
case W65816::TAX: case W65816::TAY:
case W65816::TYA: case W65816::TXA:
case W65816::TYX: case W65816::TXY:
case W65816::LDA_StackRel: case W65816::LDA_DP:
case W65816::LDA_DPX: case W65816::LDA_DPInd:
case W65816::LDA_DPIndY: case W65816::LDA_DPIndX:
case W65816::LDA_Abs: case W65816::LDA_AbsX:
case W65816::LDA_AbsY: case W65816::LDA_Long:
case W65816::LDA_LongX:
case W65816::ADC_StackRel: case W65816::SBC_StackRel:
case W65816::CMP_StackRel: case W65816::AND_StackRel:
case W65816::ORA_StackRel: case W65816::EOR_StackRel:
case W65816::ADC_DP: case W65816::ADC_Abs:
case W65816::SBC_DP: case W65816::SBC_Abs:
case W65816::CMP_DP: case W65816::CMP_Abs:
case W65816::AND_DP: case W65816::AND_Abs:
case W65816::ORA_DP: case W65816::ORA_Abs:
case W65816::EOR_DP: case W65816::EOR_Abs:
return true;
default:
return false;
}
}
// Returns true if MI reads P flags (conditional branches, PHP, etc.).
static bool usesFlagsP(const MachineInstr &MI) {
if (MI.isConditionalBranch()) return true;
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isUse() &&
!MO.isDef())
return true;
}
return false;
}
// Returns the MOST RECENT A-defining MI strictly before MI in its MBB,
// skipping debug instructions. Returns nullptr if none in the same MBB.
// Unlike findPriorADef, this does not stop at calls or inline asm.
static MachineInstr *findMostRecentADef(MachineInstr *MI) {
MachineBasicBlock *MBB = MI->getParent();
auto It = MI->getIterator();
while (It != MBB->begin()) {
--It;
if (It->isDebugInstr()) continue;
if (semanticallyDefsA(*It)) return &*It;
}
return nullptr;
}
// "Twin" check. Given a STA X at position StaX and a candidate slot Y,
// scan the function's STA Y instances and return one that's value-
// equivalent under the rules described in the header comment.
//
// Source-value equivalence cases:
// (1) Same-MBB twin store: no A-define between StaX and the candidate
// StaY → both store the same A value. Pure twin pattern.
// (2) Same-MBB PHI-copy: the candidate StaY is preceded by
// `LDA_StackRel slotX` (PHI-copy reload). Even if many A-defines
// sit between StaX and StaY, the LDA X re-establishes A =
// slot[X] = value StaX wrote (assuming slot X wasn't re-written
// in the gap).
// (3) Different MBBs, both preceded by LDA_Imm16 / LDAi16imm of the
// same constant. Covers entry/preheader init parallel pair.
static MachineInstr *findTwin(MachineInstr *StaX,
ArrayRef<MachineInstr *> StasY) {
MachineBasicBlock *MBBStaX = StaX->getParent();
int64_t XOff = StaX->getOperand(0).getImm();
// Cases (1) + (2): same MBB.
for (MachineInstr *StaY : StasY) {
if (StaY->getParent() != MBBStaX) continue;
// Determine ordering.
MachineInstr *Earlier = nullptr;
MachineInstr *Later = nullptr;
for (auto It = MBBStaX->begin(); It != MBBStaX->end(); ++It) {
if (&*It == StaX) { Earlier = StaX; Later = StaY; break; }
if (&*It == StaY) { Earlier = StaY; Later = StaX; break; }
}
if (!Earlier || !Later) continue;
int64_t EOff = Earlier->getOperand(0).getImm();
// Case (2): if Later is preceded by `LDA_StackRel <Earlier's slot>`
// (the PHI-copy reload), it's a PHI twin. Also require that
// Earlier's slot wasn't rewritten between Earlier and that reload.
MachineInstr *PriorOfLater = findMostRecentADef(Later);
if (PriorOfLater) {
int64_t Off;
if (PriorOfLater->getOpcode() == W65816::LDA_StackRel &&
srAccess(*PriorOfLater, Off) && Off == EOff &&
slotNotWrittenBetween(Earlier->getIterator(),
PriorOfLater->getIterator(), EOff)) {
return StaY;
}
}
// Case (1): no A-define between Earlier and Later — same A value.
{
bool noADefs = true;
for (auto It = std::next(Earlier->getIterator());
It != Later->getIterator(); ++It) {
if (It->isDebugInstr()) continue;
if (semanticallyDefsA(*It)) { noADefs = false; break; }
}
if (noADefs) return StaY;
}
}
// Case (3): different MBBs, both preceded by LDA_Imm16 / LDAi16imm
// with the same constant.
MachineInstr *PriorX = findPriorADef(StaX);
if (!PriorX) return nullptr;
unsigned PriorXOp = PriorX->getOpcode();
if (PriorXOp != W65816::LDA_Imm16 && PriorXOp != W65816::LDAi16imm)
return nullptr;
int64_t XConst = 0;
for (const MachineOperand &MO : PriorX->operands()) {
if (MO.isImm()) { XConst = MO.getImm(); break; }
}
for (MachineInstr *StaY : StasY) {
if (StaY->getParent() == MBBStaX) continue;
MachineInstr *PriorY = findPriorADef(StaY);
if (!PriorY) continue;
if (PriorY->getOpcode() != PriorXOp) continue;
int64_t YConst = 0;
for (const MachineOperand &MO : PriorY->operands()) {
if (MO.isImm()) { YConst = MO.getImm(); break; }
}
if (XConst == YConst) return StaY;
}
(void)XOff;
return nullptr;
}
// Run Phase 6a + Phase 6 (per-MBB peepholes) — independent of rename
// logic, so they fire on every function. Returns true if anything
// changed.
static bool runPerMBBPeepholes(MachineFunction &MF) {
bool Changed = false;
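// Phase 6a: `STA Y, s` immediately followed by `LDA Y, s`. A already
// holds the stored value, so the reload is redundant. Erase it only
// if the forward scan reaches a flag clobber or a non-conditional
// terminator without first seeing a call or a read of $a or $p.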
for (MachineBasicBlock &MBB : MF) {
SmallVector<MachineInstr *, 4> Dead;
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (It->isDebugInstr()) continue;
if (It->getOpcode() != W65816::STA_StackRel) continue;
int64_t StaSlot;
if (!srAccess(*It, StaSlot)) continue;
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::LDA_StackRel) continue;
int64_t LdaSlot;
if (!srAccess(*NextIt, LdaSlot)) continue;
if (StaSlot != LdaSlot) continue;
bool flagsSafe = false;
bool aIsUsedBeforeClobber = false;
for (auto Fwd = std::next(NextIt); Fwd != MBB.end(); ++Fwd) {
if (Fwd->isDebugInstr()) continue;
// Calls/JSLs that take A as arg — even though clobbersFlagsP
// returns true for them, the elimination could mis-track A's
// live-in to the call. Bail.
if (Fwd->isCall()) break;
// Generic: any instr that has `implicit $a` as a USE — A is
// live going in. Bail to avoid live-range trouble.
for (const MachineOperand &MO : Fwd->operands()) {
if (MO.isReg() && MO.getReg() == W65816::A && MO.isUse() &&
!MO.isDef()) {
aIsUsedBeforeClobber = true;
break;
}
}
if (aIsUsedBeforeClobber) break;
if (usesFlagsP(*Fwd)) break;
if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
flagsSafe = true; break;
}
if (clobbersFlagsP(*Fwd)) { flagsSafe = true; break; }
}
if (!flagsSafe) continue;
Dead.push_back(&*NextIt);
}
for (MachineInstr *MI : Dead) {
MI->eraseFromParent();
Changed = true;
}
}
// Phase 6: per-MBB redundant `LDA #K` elimination.
auto isAandPPreserving = [](const MachineInstr &MI) -> bool {
unsigned Op = MI.getOpcode();
switch (Op) {
case W65816::STA_StackRel:
case W65816::STA_DP: case W65816::STA_DPX:
case W65816::STA_DPInd: case W65816::STA_DPIndY:
case W65816::STA_DPIndX:
case W65816::STA_Abs: case W65816::STA_AbsX:
case W65816::STA_AbsY: case W65816::STA_Long:
case W65816::STA_LongX:
case W65816::STX_DP: case W65816::STX_Abs:
case W65816::STY_DP: case W65816::STY_Abs: case W65816::STY_DPX:
case W65816::STZ_DP: case W65816::STZ_Abs:
case W65816::STZ_DPX: case W65816::STZ_AbsX:
return true;
default:
break;
}
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::P && MO.isDef())
return false;
}
if (MI.mayStore() && !MI.mayLoad() && !semanticallyDefsA(MI))
return true;
return false;
};
auto isLdaImmK = [](const MachineInstr &MI, int64_t &K) -> bool {
unsigned Op = MI.getOpcode();
if (Op != W65816::LDA_Imm16 && Op != W65816::LDAi16imm) return false;
for (const MachineOperand &MO : MI.operands()) {
if (MO.isImm()) { K = MO.getImm(); return true; }
}
return false;
};
for (MachineBasicBlock &MBB : MF) {
std::optional<int64_t> KnownK;
SmallVector<MachineInstr *, 4> Dead;
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (It->isDebugInstr()) continue;
int64_t K;
if (isLdaImmK(*It, K)) {
if (KnownK && *KnownK == K) {
Dead.push_back(&*It);
continue;
}
KnownK = K;
continue;
}
if (isAandPPreserving(*It)) continue;
KnownK.reset();
}
for (MachineInstr *MI : Dead) {
MI->eraseFromParent();
Changed = true;
}
}
return Changed;
}
bool W65816StackSlotMerge::runOnMachineFunction(MachineFunction &MF) {
if (skipFunction(MF.getFunction())) return false;
if (MF.getFunction().hasOptNone()) return false;
// Run per-MBB peepholes first — independent of rename logic.
bool peepChanged = runPerMBBPeepholes(MF);
// Phase 1: index all stack-rel STA/LDA grouped by slot offset.
DenseMap<int64_t, SmallVector<MachineInstr *, 4>> Stas;
DenseMap<int64_t, SmallVector<MachineInstr *, 4>> Ldas;
DenseMap<int64_t, unsigned> AllRefs; // STA + LDA + ADC + ... count
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
int64_t Off;
if (!srAccess(MI, Off)) continue;
AllRefs[Off]++;
if (MI.getOpcode() == W65816::STA_StackRel) {
Stas[Off].push_back(&MI);
} else if (MI.getOpcode() == W65816::LDA_StackRel) {
Ldas[Off].push_back(&MI);
}
}
}
// Phase 2: find PHI-copy site candidates. Pattern: LDA X ; STA Y
// in a LOOP BODY MBB (= the MBB has itself as a predecessor, i.e.
// a self-loop back-edge). Restricting to loop bodies distinguishes
// genuine PHI-cycle copies from one-shot temp transfers (where
// slot X is just a scratch register dropped on the way to slot Y
// for an unrelated purpose, like qsortIter's pointer-construction
// pattern `STA 5; ...; LDA 5; STA 39` followed by `LDA 39; STA dp`).
DenseMap<int64_t, int64_t> PhiCopyPair; // X -> Y
for (MachineBasicBlock &MBB : MF) {
// Self-loop check: MBB must have itself as a predecessor.
bool selfLoop = false;
for (MachineBasicBlock *Pred : MBB.predecessors()) {
if (Pred == &MBB) { selfLoop = true; break; }
}
if (!selfLoop) continue;
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (It->getOpcode() != W65816::LDA_StackRel) continue;
int64_t X;
if (!srAccess(*It, X)) continue;
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
int64_t Y;
if (!srAccess(*NextIt, Y) || Y == X) continue;
if (PhiCopyPair.count(X)) continue;
PhiCopyPair[X] = Y;
}
}
// Phase 3: validate each pair and apply rename if safe.
// Track which slots have already been merged so we don't double-merge.
DenseMap<int64_t, int64_t> Renames; // X -> Y
for (auto &P : PhiCopyPair) {
int64_t X = P.first, Y = P.second;
// Don't re-merge an already-processed slot.
if (Renames.count(X) || Renames.count(Y)) continue;
// Arg-slot guard: skip slots with no STAs (caller-passed args).
if (Stas[X].empty() || Stas[Y].empty()) continue;
// Validate that every STA X has a twin STA Y.
bool allPaired = true;
for (MachineInstr *StaX : Stas[X]) {
if (!findTwin(StaX, Stas[Y])) { allPaired = false; break; }
}
if (!allPaired) continue;
// Symmetric: every STA Y must have a twin STA X.
for (MachineInstr *StaY : Stas[Y]) {
if (!findTwin(StaY, Stas[X])) { allPaired = false; break; }
}
if (!allPaired) continue;
LLVM_DEBUG(dbgs() << "StackSlotMerge: rename slot " << X
<< " -> " << Y << " in " << MF.getName() << "\n");
Renames[X] = Y;
}
// The per-MBB peepholes may already have changed the function.
if (Renames.empty()) return peepChanged;
// Phase 4: apply rename.
bool Changed = false;
for (MachineBasicBlock &MBB : MF) {
SmallVector<MachineInstr *, 4> ToErase;
for (MachineInstr &MI : MBB) {
int64_t Off;
if (!srAccess(MI, Off)) continue;
auto It = Renames.find(Off);
if (It == Renames.end()) continue;
MI.getOperand(0).setImm(It->second);
Changed = true;
}
// After rename, look for now-redundant LDA-STA pairs to the same
// slot (the PHI-copy self-copy). Erase them.
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (It->getOpcode() != W65816::LDA_StackRel) continue;
int64_t LdaOff;
if (!srAccess(*It, LdaOff)) continue;
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
int64_t StaOff;
if (!srAccess(*NextIt, StaOff)) continue;
if (LdaOff != StaOff) continue;
ToErase.push_back(&*It);
ToErase.push_back(&*NextIt);
}
for (MachineInstr *MI : ToErase) MI->eraseFromParent();
if (!ToErase.empty()) Changed = true;
}
// Phase 5: redundant constant-init elimination. After rename, the
// Case (3) twin pairings leave us with TWO sites writing the same
// constant to the same slot (one renamed from X to Y, the other was
// already targeting Y). The dominated one is redundant — its slot
// already holds the constant from the dominating write.
//
// Generalize: scan post-rename for ALL `LDA_Imm16 K ; STA_StackRel Y`
// pairs (or LDAi16imm K; STA Y). For each pair, look for another
// such pair with the same (K, Y) where one DOMINATES the other AND
// no write to slot Y (STA, call, or inline asm) exists on any path
// between them. Erase the dominated STA and, when it is immediately
// preceded by the matching `LDA #K`, that LDA too. Only flag liveness
// is re-checked before erasing; uses of A after the pair are not.
{
auto isLdaImm = [](const MachineInstr &MI) {
unsigned Op = MI.getOpcode();
return Op == W65816::LDA_Imm16 || Op == W65816::LDAi16imm;
};
auto immValue = [](const MachineInstr &MI) -> int64_t {
for (const MachineOperand &MO : MI.operands()) {
if (MO.isImm()) return MO.getImm();
}
return 0;
};
// Collect `LDA #K ; STA_StackRel Y` pairs, grouped by Y.
DenseMap<int64_t, SmallVector<std::pair<MachineInstr *, int64_t>, 4>>
ConstStas;
for (MachineBasicBlock &MBB : MF) {
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (!isLdaImm(*It)) continue;
int64_t K = immValue(*It);
auto NextIt = std::next(It);
while (NextIt != MBB.end() && NextIt->isDebugInstr()) ++NextIt;
if (NextIt == MBB.end()) continue;
if (NextIt->getOpcode() != W65816::STA_StackRel) continue;
int64_t Y;
if (!srAccess(*NextIt, Y)) continue;
ConstStas[Y].push_back({&*NextIt, K});
}
}
// For each slot Y with at least two const-init STAs, check for
// dominator redundancy.
auto &MDT = getAnalysis<MachineDominatorTreeWrapperPass>().getDomTree();
// Check that no instruction WRITES slot Y on any path between
// From and To. Reads are fine because both From and To write
// the same constant K — any intermediate read would see K either
// way (since From dominates, From has already executed). Calls
// are bailout conditions: a call might write to the stack via
// address-taken locals or other side effects we don't model.
auto noSlotWriteOnPath = [&](MachineInstr *From, MachineInstr *To,
int64_t Y) -> bool {
MachineBasicBlock *FromMBB = From->getParent();
MachineBasicBlock *ToMBB = To->getParent();
auto opWritesY = [&](MachineInstr &MI) {
if (MI.isCall() || MI.isInlineAsm()) return true;
int64_t Off;
if (MI.getOpcode() == W65816::STA_StackRel &&
srAccess(MI, Off) && Off == Y) {
return true;
}
return false;
};
// (a) After From in its MBB.
for (auto It = std::next(From->getIterator()); It != FromMBB->end();
++It) {
if (It->isDebugInstr()) continue;
if (opWritesY(*It)) return false;
}
// (b) DFS forward (LIFO worklist) from FromMBB's successors, stopping
// at ToMBB.
SmallPtrSet<MachineBasicBlock *, 8> Visited;
SmallVector<MachineBasicBlock *, 8> Stack;
for (auto *Succ : FromMBB->successors()) Stack.push_back(Succ);
while (!Stack.empty()) {
auto *MBB = Stack.pop_back_val();
if (MBB == ToMBB) continue; // checked separately in (c)
if (!Visited.insert(MBB).second) continue;
for (auto &MI : *MBB) {
if (MI.isDebugInstr()) continue;
if (opWritesY(MI)) return false;
}
for (auto *Succ : MBB->successors()) Stack.push_back(Succ);
}
// (c) In ToMBB, before To, any write of Y?
for (auto It = ToMBB->begin(); It != To->getIterator(); ++It) {
if (It->isDebugInstr()) continue;
if (opWritesY(*It)) return false;
}
return true;
};
SmallVector<MachineInstr *, 8> ToErase;
LLVM_DEBUG({
dbgs() << "Phase 5 in " << MF.getName() << ":\n";
for (auto &P : ConstStas) {
dbgs() << " slot " << P.first << " has " << P.second.size()
<< " const STAs\n";
}
});
for (auto &P : ConstStas) {
int64_t Y = P.first;
auto &stas = P.second;
if (stas.size() < 2) continue;
// For each pair (i, j) where i dominates j with same constant K:
for (auto &Sj : stas) {
MachineInstr *DominatedSta = Sj.first;
int64_t Kj = Sj.second;
for (auto &Si : stas) {
if (&Si == &Sj) continue;
if (Si.second != Kj) continue; // different K
MachineInstr *DominatorSta = Si.first;
if (!MDT.dominates(DominatorSta, DominatedSta)) continue;
if (!noSlotWriteOnPath(DominatorSta, DominatedSta, Y)) continue;
// Flag safety: erasing `LDA #K; STA Y` removes a flag-setting
// op (the LDA). Walk forward from the STA looking for next
// flag-clobber or unconditional terminator (safe) vs.
// flag-use (unsafe).
MachineBasicBlock *MBB = DominatedSta->getParent();
bool flagsSafeP5 = false;
for (auto Fwd = std::next(DominatedSta->getIterator());
Fwd != MBB->end(); ++Fwd) {
if (Fwd->isDebugInstr()) continue;
if (usesFlagsP(*Fwd)) break;
if (Fwd->isTerminator() && !Fwd->isConditionalBranch()) {
flagsSafeP5 = true; break;
}
if (clobbersFlagsP(*Fwd)) { flagsSafeP5 = true; break; }
}
if (!flagsSafeP5) continue;
// Erase DominatedSta and its preceding LDA #K.
auto Prev = DominatedSta->getIterator();
while (Prev != MBB->begin()) {
--Prev;
if (!Prev->isDebugInstr()) break;
}
if (Prev != DominatedSta->getIterator() && isLdaImm(*Prev) &&
immValue(*Prev) == Kj) {
// Verify A isn't consumed between LDA and STA — they're
// adjacent so no consumers exist; safe. Erase both.
ToErase.push_back(&*Prev);
}
ToErase.push_back(DominatedSta);
break;
}
}
}
// De-dup ToErase before erasing.
SmallPtrSet<MachineInstr *, 8> ErasedSet;
for (MachineInstr *MI : ToErase) {
if (ErasedSet.insert(MI).second) {
MI->eraseFromParent();
Changed = true;
}
}
}
return Changed || peepChanged;
}
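
The Phase 6 scan in `runPerMBBPeepholes` can be modeled outside LLVM. The following is a minimal sketch, not the pass itself: `Kind`, `Inst`, and `dropRedundantLdaImm` are hypothetical names, and the toy assumes that only plain stores preserve A and the flags, which simplifies what `isAandPPreserving` checks.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical toy instruction kinds; these are not the real W65816 opcodes.
enum class Kind { LdaImm, Store, Other };

struct Inst {
  Kind K;
  int64_t Imm = 0; // only meaningful for Kind::LdaImm
};

// Returns a copy of the block with redundant `LDA #K` reloads dropped.
std::vector<Inst> dropRedundantLdaImm(const std::vector<Inst> &Block) {
  std::vector<Inst> Out;
  std::optional<int64_t> KnownK; // constant currently known to be in A
  for (const Inst &I : Block) {
    if (I.K == Kind::LdaImm) {
      if (KnownK && *KnownK == I.Imm)
        continue; // A already holds this constant: drop the reload
      KnownK = I.Imm;
    } else if (I.K != Kind::Store) {
      KnownK.reset(); // anything else may change A or the flags
    }
    Out.push_back(I);
  }
  return Out;
}
```

In this model, `LDA #0 ; STA ; LDA #0` loses its second load, while a load that follows any other instruction is kept because the tracked constant was reset.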

View file

@@ -56,6 +56,8 @@ LLVMInitializeW65816Target() {
 initializeW65816I32IncFoldPass(PR);
 initializeW65816ImgCalleeSavePass(PR);
 initializeW65816NarrowI32MulPass(PR);
+initializeW65816PromoteFiToImgPass(PR);
+initializeW65816StackSlotMergePass(PR);
 // Default IndVarSimplify's exit-value rewriter to "never". The
 // closed-form replacement frequently widens an i16 induction var
@@ -195,14 +197,19 @@ void W65816PassConfig::addPreRegAlloc() {
 }
 void W65816PassConfig::addPostRegAlloc() {
-// ImgCalleeSave runs FIRST so its STAfi/LDAfi pseudos go through the
-// rest of the post-RA pipeline (SpillToX, StackSlotCleanup) normally.
-// It detects IMG8..IMG15 usage post-regalloc and inserts prologue
-// save + epilogue restore so those slots act as callee-saved at the
-// asm level. Fixes picol's `expr 1+2 == 4` bug: high-pressure
-// recursive double fns use IMG8..IMG15 as scratch but, without this
-// pass, expected them preserved across calls — and callees were
-// happy to clobber them. See W65816ImgCalleeSave.cpp.
+// FI→IMG promotion runs FIRST. It scans for high-traffic i16
+// FrameIndex slots (LDAfi/STAfi/ADCfi/etc.) and rewrites them to
+// STA_DP/LDA_DP/ADC_DP/... pointed at free IMG8..IMG15 DP slots.
+// The introduced IMG8..15 references are then picked up by
+// ImgCalleeSave to get prologue save + epilogue restore. See
+// W65816PromoteFiToImg.cpp.
+addPass(createW65816PromoteFiToImg());
+// ImgCalleeSave detects IMG8..IMG15 usage post-regalloc and inserts
+// prologue save + epilogue restore so those slots act as callee-
+// saved at the asm level. Fixes picol's `expr 1+2 == 4` bug:
+// high-pressure recursive double fns use IMG8..IMG15 as scratch but,
+// without this pass, expected them preserved across calls — and
+// callees were happy to clobber them. See W65816ImgCalleeSave.cpp.
 addPass(createW65816ImgCalleeSave());
 // SpillToX converts STA/LDA pairs to TAX/TXA bridges; StackSlotCleanup
 // then deletes still-adjacent redundant spills. A second SpillToX
@@ -264,6 +271,14 @@ void W65816PassConfig::addPreEmitPass() {
 addPass(createW65816I32IncFold());
 addPass(createW65816BranchExpand());
 addPass(createW65816SepRepCleanup());
+// Merge value-equivalent stack slots last. Runs AFTER SepRepCleanup's
+// PHI-copy hoist so the LDA-X ; STA-Y pair has been pulled out of
+// any PHP/PLP wrap — that way the stack-rel offsets on both ops are
+// the unbumped values and offset-based slot matching is stable.
+// Saves 2 inst per PHI-copy occurrence (the memory copy round-trip
+// collapses when X and Y are renamed to the same slot). See
+// W65816StackSlotMerge.cpp.
+addPass(createW65816StackSlotMerge());
 }
 MachineFunctionInfo *W65816TargetMachine::createMachineFunctionInfo(

View file

@@ -64,13 +64,43 @@ FunctionPass *llvm::createW65816WidenAcc16() {
 return new W65816WidenAcc16();
 }
-// Returns true if the vreg has any physreg-COPY use (e.g., return-value
-// or arg-passing setup that pins the value to a specific physreg).
-static bool flowsToPhysReg(Register VReg, const MachineRegisterInfo &MRI) {
+// Returns true if the vreg has any physreg-COPY use that would conflict
+// with Wide16 class assignment. $a is a member of Wide16 (Wide16 = A +
+// IMG0..15), so a COPY to $a is fine — the vreg can be Wide16 and
+// regalloc will pick $a to coalesce. $x / $y are in Idx16, NOT in
+// Wide16, so a COPY to those forces the vreg to NOT be in Wide16
+// (verifier would reject).
+static bool flowsToIncompatiblePhysReg(Register VReg,
+const MachineRegisterInfo &MRI) {
 for (auto &U : MRI.use_nodbg_instructions(VReg)) {
 if (!U.isCopy()) continue;
 const MachineOperand &Dst = U.getOperand(0);
-if (Dst.isReg() && Dst.getReg().isPhysical()) return true;
+if (!Dst.isReg() || !Dst.getReg().isPhysical()) continue;
+Register P = Dst.getReg();
+if (P == W65816::A) continue;
+if (P >= W65816::IMG0 && P <= W65816::IMG15) continue;
+return true;
+}
+return false;
+}
+// Returns true if VReg's def is a COPY from a physreg whose class is not
+// Wide16-compatible. copyPhysReg only handles a fixed set of source/dest
+// pairs; an incompatible source physreg (e.g., DPF0, the i64-return
+// high-half carrier) lowered to an IMG dest would crash with an
+// "unhandled copyPhysReg" assertion at AsmPrinter time. (Currently
+// only the Phase-2 PHI widening uses this; that's disabled, so mark
+// unused.)
+[[maybe_unused]] static bool comesFromIncompatiblePhysReg(Register VReg,
+const MachineRegisterInfo &MRI) {
+for (auto &D : MRI.def_instructions(VReg)) {
+if (!D.isCopy()) continue;
+const MachineOperand &Src = D.getOperand(1);
+if (!Src.isReg() || !Src.getReg().isPhysical()) continue;
+Register P = Src.getReg();
+if (P == W65816::A) continue;
+if (P >= W65816::IMG0 && P <= W65816::IMG15) continue;
+return true;
 }
 return false;
 }
@@ -145,7 +175,7 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
 Register VReg = Register::index2VirtReg(i);
 if (MRI.def_empty(VReg)) continue;
 if (MRI.getRegClass(VReg) != &W65816::Acc16RegClass) continue;
-if (flowsToPhysReg(VReg, MRI)) continue;
+if (flowsToIncompatiblePhysReg(VReg, MRI)) continue;
 if (usedByPhi(VReg, MRI)) continue;
 if (!MRI.hasOneDef(VReg)) continue; // require single SSA def
 if (!allUsesAcceptWide(VReg, MRI, *TRI, *TII)) continue;
@@ -181,5 +211,212 @@ bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
 }
 Changed = true;
 }
// Phase 2: PHI cycle widening. EXPERIMENTAL, currently disabled —
// see end of pass for explanation.
#if 0
// PHIs whose def class is Acc16 keep
// the value pinned to $a across iterations, forcing stack spills
// when the PHI is live across calls or other A-clobbering ops.
// For sumSquares-style loops with an i32 accumulator, this manifests
// as per-iter `LDA slot ; ADC ; STA slot ; LDA slot ; STA slot` (the
// last LDA/STA pair is the PHI-back-edge copy). If we widen the
// PHI's def to Wide16, regalloc can keep it in an IMG slot and the
// back-edge PHI copy collapses to a register coalesce.
//
// To widen a PHI:
// 1. Compute the SCC of Acc16 vregs connected by PHI edges (PHI
// def ↔ PHI incoming vreg). This catches mutually-recursive
// PHIs in nested loops.
// 2. For every member: verify all non-PHI uses accept Wide16, no
// flow to a physreg, single def.
// 3. For each PHI in the SCC, walk its incoming list. Each
// incoming vreg is either ALREADY in the SCC (another PHI, no
// bridge needed) or an external Acc16 vreg whose value flows
// into the SCC — bridge it by inserting `WWide = COPY W` at
// the end of the predecessor block and pointing the PHI's
// incoming at WWide.
// 4. Change every SCC member's register class to Wide16.
auto worklistInsertIfAcc16 = [&MRI](Register V,
DenseSet<Register> &Seen,
SmallVectorImpl<Register> &WL) {
if (!V.isVirtual()) return;
if (MRI.getRegClass(V) != &W65816::Acc16RegClass) return;
if (!Seen.insert(V).second) return;
WL.push_back(V);
};
SmallVector<MachineInstr *, 16> AcctPhis;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB.phis()) {
Register DefV = MI.getOperand(0).getReg();
if (MRI.getRegClass(DefV) == &W65816::Acc16RegClass) {
AcctPhis.push_back(&MI);
}
}
}
DenseSet<Register> ProcessedPhiVregs;
for (MachineInstr *Seed : AcctPhis) {
Register SeedDef = Seed->getOperand(0).getReg();
if (ProcessedPhiVregs.count(SeedDef)) continue;
// Build SCC by following PHI edges in both directions.
DenseSet<Register> Comp;
SmallVector<Register, 8> Stack;
worklistInsertIfAcc16(SeedDef, Comp, Stack);
while (!Stack.empty()) {
Register V = Stack.pop_back_val();
// Forward: V flows into other PHIs as an incoming → include those PHI defs.
for (auto &U : MRI.use_nodbg_instructions(V)) {
if (!U.isPHI()) continue;
Register PhiDef = U.getOperand(0).getReg();
worklistInsertIfAcc16(PhiDef, Comp, Stack);
}
// Backward: if V is itself a PHI def, include the incoming vregs.
MachineInstr *DM = &*MRI.def_instructions(V).begin();
if (!DM || !DM->isPHI()) continue;
for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
MachineOperand &MO = DM->getOperand(i);
if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
worklistInsertIfAcc16(MO.getReg(), Comp, Stack);
}
}
for (Register V : Comp) ProcessedPhiVregs.insert(V);
// Validate every member. PHI uses are ACCEPTED when the consumer
// PHI is itself in the SCC (those PHIs are being widened in
// lock-step). Narrow-class uses (e.g., INA_PSEUDO's tied-def
// input requires Acc16) are ALSO accepted — we'll insert a
// Wide16→Acc16 COPY at the use site after widening. The only
// unrecoverable cases are: PHI uses where the consumer PHI is
// outside the SCC (forcing cross-SCC class merging), and physreg
// flow to $x/$y/etc. (handled separately above).
auto usesAcceptInSCC = [&](Register V,
SmallVectorImpl<MachineOperand *> *NarrowSites)
-> bool {
for (auto &MO : MRI.use_nodbg_operands(V)) {
MachineInstr *UMI = MO.getParent();
if (UMI->isCopy()) continue;
if (UMI->isPHI()) {
Register PhiDef = UMI->getOperand(0).getReg();
if (Comp.count(PhiDef)) continue; // co-widened
return false;
}
unsigned OpIdx = UMI->getOperandNo(&MO);
const TargetRegisterClass *Expected =
TII->getRegClass(UMI->getDesc(), OpIdx);
if (!Expected) continue;
if (Expected == &W65816::Wide16RegClass) continue;
if (Expected->hasSubClassEq(&W65816::Wide16RegClass)) continue;
// Expected is narrower than Wide16 (e.g., Acc16-only tied
// input). Mark for runtime narrowing — we'll insert a COPY
// at apply time.
if (NarrowSites) NarrowSites->push_back(&MO);
}
return true;
};
bool ok = true;
SmallVector<MachineOperand *, 8> NarrowSites;
for (Register V : Comp) {
if (!MRI.hasOneDef(V)) { ok = false; break; }
if (flowsToIncompatiblePhysReg(V, MRI)) { ok = false; break; }
if (comesFromIncompatiblePhysReg(V, MRI)) { ok = false; break; }
if (!usesAcceptInSCC(V, &NarrowSites)) { ok = false; break; }
}
if (!ok) continue;
// Apply widening. First insert bridge COPYs at predecessor edges
// for external (non-Comp) Acc16 incomings to each PHI in Comp.
SmallVector<std::pair<MachineInstr *, unsigned>, 16> BridgeSites;
for (Register V : Comp) {
MachineInstr *DM = &*MRI.def_instructions(V).begin();
if (!DM->isPHI()) continue;
for (unsigned i = 1, e = DM->getNumOperands(); i < e; i += 2) {
MachineOperand &MO = DM->getOperand(i);
if (!MO.isReg() || !MO.getReg().isVirtual()) continue;
Register Inc = MO.getReg();
if (Comp.count(Inc)) continue; // in-SCC, no bridge needed
// External incoming: ensure it's currently Acc16; if so, we'll
// insert a COPY at the predecessor block's end.
if (MRI.getRegClass(Inc) != &W65816::Acc16RegClass &&
MRI.getRegClass(Inc) != &W65816::Wide16RegClass) {
ok = false;
break;
}
BridgeSites.push_back({DM, i});
}
if (!ok) break;
}
if (!ok) continue;
// Insert bridges.
for (auto &Site : BridgeSites) {
MachineInstr *PhiMI = Site.first;
unsigned OpIdx = Site.second;
Register Inc = PhiMI->getOperand(OpIdx).getReg();
MachineBasicBlock *PredMBB = PhiMI->getOperand(OpIdx + 1).getMBB();
// If already Wide16 (e.g., another candidate widened it already),
// no bridge needed — but we still need the PHI incoming to use
// a Wide16 vreg. Use Inc directly.
if (MRI.getRegClass(Inc) == &W65816::Wide16RegClass) {
continue;
}
// Insert COPY before the predecessor's terminator(s).
auto InsertPos = PredMBB->getFirstTerminator();
DebugLoc DL = (InsertPos == PredMBB->end())
? PredMBB->findBranchDebugLoc()
: InsertPos->getDebugLoc();
Register WideInc = MRI.createVirtualRegister(&W65816::Wide16RegClass);
BuildMI(*PredMBB, InsertPos, DL, TII->get(TargetOpcode::COPY),
WideInc)
.addReg(Inc);
PhiMI->getOperand(OpIdx).setReg(WideInc);
PhiMI->getOperand(OpIdx).setIsKill(false);
}
// Force every SCC member to Img16 (IMG-only, no A). Using Wide16
// (A + IMG) doesn't work here: the Register Coalescer joins our
// Wide16 vregs with adjacent Acc16 vregs (intersection = Acc16)
// and narrows them back to A-only, defeating the widening. Img16
// intersects Acc16 to ∅, so the coalescer can't merge — the PHI
// stays in IMG. This is correct anyway for the common case (PHI
// live across a call): A is JSL-clobbered, so it can't carry the
// value through, and IMG8..15 is the right home.
for (Register V : Comp) {
MRI.setRegClass(V, &W65816::Img16RegClass);
}
// Insert narrowing COPYs at each narrow-class use site. Each site
// is `... = OP V, ...` where the operand requires Acc16 but V is
// now Img16. Replace with `%Vacc = COPY V (Acc16); ... = OP %Vacc, ...`.
for (MachineOperand *MO : NarrowSites) {
MachineInstr *UMI = MO->getParent();
Register OldReg = MO->getReg();
Register NarrowReg =
MRI.createVirtualRegister(&W65816::Acc16RegClass);
DebugLoc DL = UMI->getDebugLoc();
BuildMI(*UMI->getParent(), UMI, DL, TII->get(TargetOpcode::COPY),
NarrowReg)
.addReg(OldReg);
MO->setReg(NarrowReg);
MO->setIsKill(false);
}
Changed = true;
}
#endif
// Why disabled (2026-05-13 attempt):
// - Widening PHI cycles to Wide16 (= A + IMG0..15) is undone by the
// Register Coalescer: it joins our Wide16 vregs with adjacent
// Acc16 vregs via the bridge COPYs we insert, and the resulting
// joint class is `intersect(Wide16, Acc16) = Acc16`. Net effect:
// no IMG, just more code through the coalescer.
// - Switching to Img16 (= IMG0..15, no A) defeats the coalescer
// (intersection with Acc16 is ∅) but forces ALL widened PHIs into
// IMG slots even when A would be better, AND triggers cascading
// copyPhysReg paths that aren't all implemented (e.g., DPF0 → IMG
// for i64 libcall return values), aborting clang on runtime builds.
// - A targeted fix needs either (a) a class that the coalescer
// refuses to join with Acc16 yet that still allows A as a member,
// (b) a post-coalescer pass that re-widens specific high-traffic
// vregs back to Img16, or (c) regalloc cost-model tuning so it
// prefers IMG8..15 over stack for loop-live values.
 return Changed;
 }
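
The coalescer argument in the "Why disabled" note comes down to set intersection over register classes. The sketch below models each class as a bitmask of physical registers. It is an illustration under assumptions: the bit layout and the `joinedClass` helper are hypothetical and do not use LLVM's `TargetRegisterClass` API.

```cpp
#include <cstdint>

// Hypothetical bitmask model: bit 0 = A, bits 1..16 = IMG0..IMG15.
constexpr uint32_t A_BIT = 1u << 0;
constexpr uint32_t IMG_BITS = 0xFFFFu << 1;

constexpr uint32_t Acc16 = A_BIT;             // A only
constexpr uint32_t Img16 = IMG_BITS;          // IMG0..15, no A
constexpr uint32_t Wide16 = A_BIT | IMG_BITS; // A + IMG0..15

// Two vregs can only be joined if some physreg satisfies both classes;
// the joined vreg is then constrained to the intersection.
constexpr uint32_t joinedClass(uint32_t X, uint32_t Y) { return X & Y; }

// Wide16 joined with Acc16 narrows back to A-only, undoing the widening.
static_assert(joinedClass(Wide16, Acc16) == Acc16, "narrows to Acc16");
// Img16 and Acc16 share no register, so no join is possible.
static_assert(joinedClass(Img16, Acc16) == 0, "empty intersection");
```

Option (a) in the note asks for a class that contains A yet still refuses to join with Acc16. In this pure set model any class containing A has a non-empty intersection with Acc16, which suggests such a class would need coalescer-side support rather than a different register list.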