More optimizations.

Scott Duensing 2026-05-27 19:37:26 -05:00
parent 2bec85ffa5
commit 25a67901a6
31 changed files with 991 additions and 83 deletions


@ -76,20 +76,22 @@ against commercial Calypsi 5.16, measured under MAME via `emu.time()`
| Benchmark | Ours | Calypsi | Ratio |
|---|---:|---:|---:|
| bsearch | 682 | 2,387 | **0.29×** ✓ |
| dotProduct | 1,534 | 5,712 | **0.27×** ✓ |
| bsearch | 682 | 2,387 | **0.29×** ✓ |
| sumOfSquares | 6,820 | 16,368 | **0.42×** ✓ |
| bubbleSort | 11,594 | 17,050 | **0.68×** ✓ |
| djb2Hash | 2,387 | 2,643 | **0.90×** ✓ |
| memcmp | 716 | 716 | **1.00×** |
| strcpy | 1,279 | 1,194 | 1.07× |
| popcount | 1,705 | 1,534 | 1.11× |
| fib | 12,106 | 10,912 | 1.11× |
| strLen | 1,876 | 1,023 | 1.83× |
| strLen | 767 | 1,023 | **0.75×** ✓ |
| djb2Hash | 2,046 | 2,643 | **0.77×** ✓ |
| popcount | 1,194 | 1,534 | **0.78×** ✓ |
| strcpy | 1,108 | 1,194 | **0.93×** ✓ |
| memcmp | 682 | 716 | **0.95×** ✓ |
| fib | 11,594 | 10,912 | 1.06× |
**Geomean: 0.74× Calypsi** across this suite. Six of ten benches beat
Calypsi outright; one ties exactly. Run `scripts/benchCyclesPrecise.sh`
(ours) and `scripts/benchCyclesCalypsi.sh` (Calypsi) to reproduce.
**Geomean: 0.62× Calypsi** across this suite. Nine of ten benches beat
Calypsi outright; only fib trails at 1.06×. Run
`scripts/benchCyclesPrecise.sh` (ours, with
`W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs"`) and
`scripts/benchCyclesCalypsi.sh` (Calypsi) to reproduce.
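The geomean quoted above can be re-derived from the table itself. The following is a minimal sketch, not part of the repository: `BenchRow`, `geomeanRatio`, `countBelowParity`, and `currentRows` are hypothetical names, and the cycle counts are copied from the 2026-05-27 rows. It averages the per-bench ratios in log space, so the result should land close to the reported 0.62× (rounding or truncation may differ slightly).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One table row: per-call cycle counts for our build and for Calypsi.
struct BenchRow {
  const char *name;
  double ours;
  double calypsi;
};

// Geometric mean of ours/calypsi, accumulated in log space.
double geomeanRatio(const std::vector<BenchRow> &rows) {
  assert(!rows.empty());
  double logSum = 0.0;
  for (const BenchRow &r : rows)
    logSum += std::log(r.ours / r.calypsi);
  return std::exp(logSum / static_cast<double>(rows.size()));
}

// Number of benches where our cycle count is strictly lower.
std::size_t countBelowParity(const std::vector<BenchRow> &rows) {
  std::size_t n = 0;
  for (const BenchRow &r : rows)
    if (r.ours < r.calypsi) ++n;
  return n;
}

// The 2026-05-27 numbers, copied from the table.
std::vector<BenchRow> currentRows() {
  return {
      {"dotProduct", 1534, 5712},  {"bsearch", 682, 2387},
      {"sumOfSquares", 6820, 16368}, {"bubbleSort", 11594, 17050},
      {"strLen", 767, 1023},       {"djb2Hash", 2046, 2643},
      {"popcount", 1194, 1534},    {"strcpy", 1108, 1194},
      {"memcmp", 682, 716},        {"fib", 11594, 10912},
  };
}
```

Calling `geomeanRatio(currentRows())` gives the suite-wide ratio, and `countBelowParity(currentRows())` gives the "N of ten" figure.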
On real programs:
- **Lua 5.1.5** (17K LoC, 24 source files) compiles + links clean.


@ -1,4 +1,4 @@
# Session Recovery — last updated 2026-05-25
# Session Recovery — last updated 2026-05-27
Living recovery doc. Update on every meaningful change. If session is lost,
read this top-to-bottom + the memory notes referenced inside, then reread
@ -12,10 +12,12 @@ the actual diffs in tree to ground assumptions.
on JSL, greedy regalloc at -O1+. **Inline-threshold lowered to 50
target-wide** (was LLVM default 225; was 75 earlier this session).
- **Branch**: `main`.
- **vs Calypsi (2026-05-25)**:
- **Lua 5.1.5**: default config 1.13× Calypsi; with Layer 2 0.93× (we
beat by 7%).
- **CoreMark 1.0**: with Layer 2 **0.79× Calypsi (we beat by 21%)**.
- **vs Calypsi (2026-05-27)** — Layer 2 + recent peepholes:
- **Cycle benches geomean**: **0.62× Calypsi**. 9 of 10 below 1.0×;
only `fib` trails at 1.06× (recursive overhead, structural). See
cycle bench table below.
- **Lua 5.1.5**: default config 1.13× Calypsi; with Layer 2 0.93×.
- **CoreMark 1.0**: with Layer 2 0.79× Calypsi (we beat by 21%).
- **vs Calypsi static-inst ratio (synthetic bench)**:
sumSquares **0.84×** (26 vs 31 — we beat),
mul16to32 **0.25×** (1 vs 4 — we beat),
@ -31,12 +33,39 @@ the actual diffs in tree to ground assumptions.
- Inline-threshold lowered to 50 (was 225). Lua -23% total, CoreMark
matrix.o 1.37× → 0.97× Calypsi. Override with
`-mllvm -inline-threshold=N`.
- **Cycle benches (2026-05-20)**:
popcount 93, strcpy 91, bsearch 127, memcmp 113, fib 97,
dotProduct 144, sumOfSquares 126 cyc/iter (100 iters);
dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters);
particles 2253 cyc/iter (3 iters), mandelbrot 11570 cyc/iter (1 iter).
- **Recent session wins (2026-05-20)**:
- **Cycle benches per-call (2026-05-27, Layer 2)** — via
`scripts/benchCyclesPrecise.sh` vs `scripts/benchCyclesCalypsi.sh`:
```
Bench Ours Calypsi Ratio
dotProduct 1534 5712 0.27×
bsearch 682 2387 0.29×
sumOfSquares 6820 16368 0.42×
bubbleSort 11594 17050 0.68×
strLen 767 1023 0.75×
djb2Hash 2046 2643 0.77×
popcount 1194 1534 0.78×
strcpy 1108 1194 0.93×
memcmp 682 716 0.95×
fib 11594 10912 1.06×
```
Geomean **0.62×**. Older HBL-tick numbers (per-iter, 100 iter loops)
from `benchCycles.sh` are still available but lower resolution.
- **Recent session wins (2026-05-27)**:
- **Y-as-counter for strLen** — structural rewrite: drop STX/INX/INC,
use Y as offset AND counter. strLen 1279 → 767 cyc (-40%); 0.75×
Calypsi (was 1.25×).
- **Stack-rel dead-store elim** — companion to DP version with SP
tracking across PHA/PHP/PEA/PEI/PER/PLA/PLP/PLX/PLY/PHX/PHY.
strcpy 1194 → 1108 (-7%, 0.93× Calypsi, beats by 7%). Refactored
as a static helper called from the recursive-call bail too so fib
gets it. fib 12106 → 11594 (-4%, 1.06× Calypsi).
- **DP-indirect-Y for iter** (follow-on to X-iter peephole): rewrites
`TXA;STA stack-rel S;INX;…;LDA (S,s),Y` to `STX_DP D;INX;…;LDA
(D),Y`. Saves 4 cyc/iter.
- **Dead INC_HI_IF_CARRY elim** — when the StackRel ptr-hi slot is
never read, elide the carry-bookkeeping for Layer 2 ptr32 loops.
Wide impact across strLen/strcpy/djb2Hash/memcmp.
- **Recent session wins (earlier — 2026-05-20)**:
- 8 always-on peepholes + extended phase 4 in W65816StackRelToImg
(evalAt 498→472, fib -35%, 35 libc fns shrunk)
- __muldi3 32-bit short-circuit (dmul 1605→1033, -36%)


@ -244,26 +244,34 @@ which runs correctly under MAME (apple2gs).
+ dispatch + chained collisions over fprintf-to-mfs),
scripts/bench.sh size-vs-Calypsi harness. 100% pass.
- `scripts/benchCycles.sh` measures per-iteration cycle counts via
MAME's emulated HBL counter. 13 benchmarks under `benchmarks/`
(8 int micro + 3 soft-FP + 2 "game-like": particles, mandelbrot).
Current numbers (2026-05-20):
bsearch 127, crc32 <65, dotProduct 144, fib 97, memcmp 113,
popcount 93, strcpy 91, sumOfSquares 126 cyc/iter (100 iters);
dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters);
particles 2253 cyc/iter (3 iters — 32-particle physics tick);
mandelbrot 11570 cyc/iter (1 iter — 4×4 fixed-point tile, max 8
Mandelbrot iters). Speed is the optimization priority, not size.
- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts via
MAME's `emu.time()` between A1A1/A2A2 markers. Runs vs commercial
Calypsi 5.16 (`scripts/benchCyclesCalypsi.sh`) for an apples-to-
apples speed comparison. Current numbers (2026-05-27, Layer 2):
| Bench | Ours | Calypsi | Ratio |
|--------------|------:|--------:|-------:|
| dotProduct | 1534 | 5712 | 0.27× |
| bsearch | 682 | 2387 | 0.29× |
| sumOfSquares | 6820 | 16368 | 0.42× |
| bubbleSort | 11594 | 17050 | 0.68× |
| strLen | 767 | 1023 | 0.75× |
| djb2Hash | 2046 | 2643 | 0.77× |
| popcount | 1194 | 1534 | 0.78× |
| strcpy | 1108 | 1194 | 0.93× |
| memcmp | 682 | 716 | 0.95× |
| fib | 11594 | 10912 | 1.06× |
**Geomean: 0.62× Calypsi.** 9 of 10 below 1.0×; only fib trails
(recursive call overhead, structural). Speed is the optimization
priority, not size.
- `compare/` holds three side-by-side C tests with our asm and
Calypsi's listing for static-size comparison:
`sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh`
recompiles each under both `clang --target=w65816 -O2 -S` and
`cc65816 --speed -O 2 --64bit-doubles` and prints an
ours/Calypsi instruction-count ratio. Current ratios (2026-05-20):
sumSquares **0.84×** (26 inst — we beat Calypsi's 31),
evalAt 1.86× (472 inst), mul16to32 **0.25×** (1 inst — we beat
Calypsi's 4). See `compare/README.md`.
ours/Calypsi instruction-count ratio. See `compare/README.md`.
**Backend register allocation:**


@ -817,39 +817,40 @@ Useful pass names to filter on:
## Cycle-count benchmarks
13 microbenchmarks live under [`benchmarks/`](../benchmarks/) — eight
integer/string micro-benches, three soft-double FP benches (`dadd`,
`dmul`, `ddiv`), and two "game-like" workloads: `particles` (32-particle
physics tick with i16 bounce/wall collision) and `mandelbrot` (4×4
fixed-point Mandelbrot tile exercising i32 multiply and conditional
control flow).
Microbenchmarks live under [`benchmarks/`](../benchmarks/) — integer/
string micro-benches plus soft-double FP benches.
```bash
bash scripts/benchCycles.sh
W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs" bash scripts/benchCyclesPrecise.sh
```
Output (2026-05-21):
This measures per-call cycle counts via MAME's `emu.time()` between
markers — apples-to-apples vs the matching
`scripts/benchCyclesCalypsi.sh` runner (commercial Calypsi 5.16).
Current ratios (2026-05-27, Layer 2):
```
| Benchmark | Per-iteration cycles |
|-----------|---------------------:|
| bsearch | 127 cyc/iter (100 iters) |
| crc32 | <65 (under timer resolution) |
| dadd | 1157 cyc/iter (10 iters) |
| ddiv | 1261 cyc/iter (10 iters) |
| dmul | 1033 cyc/iter (10 iters) |
| dotProduct | 144 cyc/iter (100 iters) |
| fib | 97 cyc/iter (100 iters) |
| mandelbrot | 11570 cyc/iter (1 iter, GRID=4 MAX_ITER=8) |
| memcmp | 113 cyc/iter (100 iters) |
| particles | 2253 cyc/iter (3 iters, N=32) |
| popcount | 93 cyc/iter (100 iters) |
| strcpy | 91 cyc/iter (100 iters) |
| sumOfSquares | 126 cyc/iter (100 iters) |
| Benchmark | Ours | Calypsi | Ratio |
|--------------|------:|--------:|------:|
| dotProduct | 1534 | 5712 | 0.27× |
| bsearch | 682 | 2387 | 0.29× |
| sumOfSquares | 6820 | 16368 | 0.42× |
| bubbleSort | 11594 | 17050 | 0.68× |
| strLen | 767 | 1023 | 0.75× |
| djb2Hash | 2046 | 2643 | 0.77× |
| popcount | 1194 | 1534 | 0.78× |
| strcpy | 1108 | 1194 | 0.93× |
| memcmp | 682 | 716 | 0.95× |
| fib | 11594 | 10912 | 1.06× |
```
The legacy `scripts/benchCyclesPrecise.sh` (per-call cycle count via
`emu.time()`) is still available but slower to run.
**Geomean: 0.62× Calypsi.** 9 of 10 below 1.0×. The Layer 2 flag
(`-mllvm -w65816-dbr-safe-ptrs`) enables stack-rel-indirect-Y ptr32
derefs — required for parity since Calypsi's pointer ABI assumes
DBR matches the pointer's bank.
The HBL-tick-based `scripts/benchCycles.sh` is still available but
lower-resolution; prefer the `Precise` runner above.
The [`compare/`](../compare/) directory has side-by-side `.s` files vs
Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:


@ -596,6 +596,83 @@ static bool elideRedundantLdaAfterPha(MachineFunction &MF) {
//
// The first STA's value is shadowed by the second. Drop it.
// Saves 1 instruction (3 bytes / 5 cyc) per match.
static bool elideStackRelDeadStore(MachineFunction &MF) {
bool Changed = false;
auto isStackRelRead = [](unsigned Op) {
switch (Op) {
case W65816::LDA_StackRel: case W65816::ADC_StackRel:
case W65816::SBC_StackRel: case W65816::AND_StackRel:
case W65816::ORA_StackRel: case W65816::EOR_StackRel:
case W65816::CMP_StackRel:
return true;
}
return false;
};
auto isStackRelIndY = [](unsigned Op) {
switch (Op) {
case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
return true;
}
return false;
};
for (MachineBasicBlock &MBB : MF) {
SmallVector<MachineInstr *, 8> ToErase;
SmallPtrSet<MachineInstr *, 8> ErasedSet;
for (auto It = MBB.begin(); It != MBB.end(); ++It) {
if (ErasedSet.count(&*It)) continue;
if (It->getOpcode() != W65816::STA_StackRel) continue;
if (It->getNumOperands() < 1 || !It->getOperand(0).isImm()) continue;
int64_t OrigSlot = It->getOperand(0).getImm();
int64_t SpAdj = 0;
auto Walk = std::next(It);
while (Walk != MBB.end()) {
if (Walk->isDebugInstr()) { ++Walk; continue; }
if (Walk->isBranch() || Walk->isCall() || Walk->isReturn() ||
Walk->isInlineAsm()) break;
unsigned WO = Walk->getOpcode();
switch (WO) {
case W65816::PHA: case W65816::PHX: case W65816::PHY:
case W65816::PEA: case W65816::PEI_DP: case W65816::PER:
SpAdj -= 2; ++Walk; continue;
case W65816::PLA: case W65816::PLX: case W65816::PLY:
SpAdj += 2; ++Walk; continue;
case W65816::PHP:
SpAdj -= 1; ++Walk; continue;
case W65816::PLP:
SpAdj += 1; ++Walk; continue;
}
if (WO == W65816::STA_StackRel &&
Walk->getNumOperands() >= 1 && Walk->getOperand(0).isImm() &&
Walk->getOperand(0).getImm() + SpAdj == OrigSlot) {
ToErase.push_back(&*It);
ErasedSet.insert(&*It);
break;
}
if (Walk->getNumOperands() >= 1 && Walk->getOperand(0).isImm()) {
int64_t Imm = Walk->getOperand(0).getImm();
if (isStackRelRead(WO) || WO == W65816::STA_StackRel) {
if (Imm + SpAdj == OrigSlot) break;
}
if (isStackRelIndY(WO)) {
if (Imm + SpAdj == OrigSlot || Imm + SpAdj + 1 == OrigSlot)
break;
}
}
++Walk;
}
}
for (MachineInstr *MI : ToErase) {
MI->eraseFromParent();
Changed = true;
}
}
return Changed;
}
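The SP-tracking walk in `elideStackRelDeadStore` can be illustrated outside LLVM. The sketch below is a simplified standalone model, not the pass itself: `Kind`, `Insn`, and `findDeadStores` are hypothetical names, slots are compared by exact offset as in the code above, and the guard that treats a pull as a read of the top of stack is an extra conservative assumption that the diff does not contain.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy instruction kinds for one basic block.
enum class Kind { StoreSR, LoadSR, Push, Pull, Barrier, Other };

struct Insn {
  Kind kind;
  int arg; // StoreSR/LoadSR: stack-relative offset; Push/Pull: byte count
};

// Returns dead[i] == true when the store at index i is overwritten by a
// later store to the same physical slot with no read in between.
std::vector<bool> findDeadStores(const std::vector<Insn> &block) {
  std::vector<bool> dead(block.size(), false);
  for (std::size_t i = 0; i < block.size(); ++i) {
    if (block[i].kind != Kind::StoreSR) continue;
    const int origSlot = block[i].arg;
    int spAdj = 0; // net SP change since the store (negative after pushes)
    bool stop = false;
    for (std::size_t j = i + 1; j < block.size() && !stop; ++j) {
      const Insn &w = block[j];
      switch (w.kind) {
      case Kind::Push:
        spAdj -= w.arg; // the slot now sits w.arg bytes further from SP
        break;
      case Kind::Pull:
        // Assumption (not in the diff): a pull reads the bytes it pops,
        // so give up if the tracked slot is within the popped range.
        if (origSlot - spAdj <= w.arg) stop = true;
        else spAdj += w.arg;
        break;
      case Kind::StoreSR:
        // Same physical slot once the SP drift is accounted for.
        if (w.arg + spAdj == origSlot) { dead[i] = true; stop = true; }
        break;
      case Kind::LoadSR:
        if (w.arg + spAdj == origSlot) stop = true; // value is live
        break;
      case Kind::Barrier: // branch, call, return, inline asm
        stop = true;
        break;
      case Kind::Other:
        break;
      }
    }
  }
  return dead;
}
```

For example, a store to offset 3, a two-byte push, then a store to offset 5 marks the first store dead, because offset 5 after the push names the same slot as offset 3 before it.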
static bool elideDeadStaCarry(MachineFunction &MF) {
bool Changed = false;
for (MachineBasicBlock &MBB : MF) {
@ -661,19 +738,25 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
if (!MI.isCall()) continue;
if (!isImgSafeCall(MI)) {
ChangedEarly |= elideStoreForwarding(MF);
return ChangedEarly;
}
// Check: recursive self-call first. Apply stack-rel dead-store
// elim here since IMG promotion can't run (recursion clobbers
// IMG slots across the inner call).
bool isSelfCall = false;
for (const MachineOperand &MO : MI.operands()) {
StringRef Name;
if (MO.isGlobal()) Name = MO.getGlobal()->getName();
else if (MO.isSymbol()) Name = MO.getSymbolName();
else continue;
if (Name == SelfName) {
if (Name == SelfName) { isSelfCall = true; break; }
}
if (isSelfCall) {
ChangedEarly |= elideStackRelDeadStore(MF);
ChangedEarly |= elideStoreForwarding(MF);
return ChangedEarly;
}
if (!isImgSafeCall(MI)) {
ChangedEarly |= elideStoreForwarding(MF);
return ChangedEarly;
}
}
}
@ -2319,6 +2402,10 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
}
}
// Stack-rel dead-store elim — called from the always-on section too
// (so it benefits recursive / non-IMG-promotable functions like fib).
Changed |= elideStackRelDeadStore(MF);
// DP-slot zero-check bridge via X. Pattern:
// [op that sets Z on A]
// STA_DP slot
@ -2441,6 +2528,792 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
}
}
// popcount-style "n += (x & 1)" combined with `x >>= 1` LSR/ROR pair:
// use the C flag set by ROR_DP directly via `ADC #0` (which adds C +
// 0 + n = bit + n). Eliminates the `LDA x_orig ; AND #1 ; ... ; CLC
// ; ADC n` sequence and the lagged-PHI copy at end of body. Big win
// for popcount's specific shape.
//
// Pattern (post-shift-fold + dp-zero-check-tax-bridge):
// LSR_DP X_hi
// ROR_DP X_lo ; C = old bit 0 of x_lo = the BIT we want
// LDA_DP X_lo ; (zero-check setup)
// ORA_DP X_hi
// TAX ; preserve Z across the AND chain
// LDA_DP X_orig ; current x_lo (lagged PHI)
// ANDi16imm 1 ; bit = x_orig & 1 (== C from ROR)
// STA_DP X_orig ; dead store (overwritten by PHI-copy below)
// CLC ; ← kills our preserved C
// ADC_DP N ; n += bit (but ADC reads C=0 + bit)
// STA_DP N
// LDA_DP X_lo ; PHI copy start
// STA_DP X_orig ; x_orig = x_lo NEW
// TXA ; restore Z
// BNE loop
//
// Rewrite:
// LSR_DP X_hi
// ROR_DP X_lo ; C = bit
// LDA_DP N
// ADC_Imm 0 ; n + 0 + C = n + bit
// STA_DP N
// LDA_DP X_lo
// ORA_DP X_hi
// BNE loop
//
// Erased: TAX, LDA X_orig, AND #1, STA X_orig, CLC, LDA X_lo PHI-copy,
// STA X_orig PHI-copy, TXA. Plus x_orig itself becomes dead.
//
// Saves ~25 cyc/iter on popcount. 29 iters → ~725 cyc. popcount
// 1705 → ~1000. Calypsi is 1534, so this would BEAT Calypsi.
for (MachineBasicBlock &MBB : MF) {
SmallVector<MachineInstr *, 12> ToErase;
for (auto It = MBB.begin(); It != MBB.end();) {
auto Lsr = It++;
if (Lsr->getOpcode() != W65816::LSR_DP) continue;
if (Lsr->getNumOperands() < 1 || !Lsr->getOperand(0).isImm()) continue;
int64_t HiAddr = Lsr->getOperand(0).getImm();
auto skipDbg = [&](auto &P) {
while (P != MBB.end() && P->isDebugInstr()) ++P;
};
auto Ror = std::next(Lsr); skipDbg(Ror);
if (Ror == MBB.end() || Ror->getOpcode() != W65816::ROR_DP) continue;
if (Ror->getNumOperands() < 1 || !Ror->getOperand(0).isImm()) continue;
int64_t LoAddr = Ror->getOperand(0).getImm();
auto P = std::next(Ror); skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP) continue;
if (P->getOperand(0).getImm() != LoAddr) continue;
auto LdaLo1 = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::ORA_DP) continue;
if (P->getOperand(0).getImm() != HiAddr) continue;
auto OraHi = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::TAX) continue;
auto Tax = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP) continue;
int64_t OrigAddr = P->getOperand(0).getImm();
if (OrigAddr == LoAddr || OrigAddr == HiAddr) continue;
auto LdaOrig = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::ANDi16imm) continue;
// Verify AND operand is 1.
bool isAnd1 = false;
for (const MachineOperand &MO : P->operands()) {
if (MO.isImm() && MO.getImm() == 1) { isAnd1 = true; break; }
}
if (!isAnd1) continue;
auto AndOne = P; ++P; skipDbg(P);
// Optional STA_DP X_orig (dead store; may already have been
// erased by DP-dead-store-elim).
MachineInstr *StaOrig1 = nullptr;
if (P != MBB.end() && P->getOpcode() == W65816::STA_DP &&
P->getOperand(0).isImm() && P->getOperand(0).getImm() == OrigAddr) {
StaOrig1 = &*P; ++P; skipDbg(P);
}
if (P == MBB.end() || P->getOpcode() != W65816::CLC) continue;
auto Clc = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::ADC_DP) continue;
int64_t NAddr = P->getOperand(0).getImm();
auto AdcN = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::STA_DP ||
P->getOperand(0).getImm() != NAddr) continue;
auto StaN = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP ||
P->getOperand(0).getImm() != LoAddr) continue;
auto LdaLo2 = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::STA_DP ||
P->getOperand(0).getImm() != OrigAddr) continue;
auto StaOrig2 = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::TXA) continue;
auto Txa = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::BNE) continue;
// (Bne stays; we just need to put a fresh ORA + BNE-using-Z form.)
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
// After ROR, insert: LDA N ; ADC #0 ; STA N ; LDA X_lo ; ORA X_hi
auto Insert = std::next(MachineBasicBlock::iterator(Ror));
DebugLoc DL = Ror->getDebugLoc();
BuildMI(MBB, Insert, DL, TII->get(W65816::LDA_DP)).addImm(NAddr);
BuildMI(MBB, Insert, DL, TII->get(W65816::ADC_Imm16)).addImm(0);
BuildMI(MBB, Insert, DL, TII->get(W65816::STA_DP)).addImm(NAddr);
BuildMI(MBB, Insert, DL, TII->get(W65816::LDA_DP)).addImm(LoAddr);
BuildMI(MBB, Insert, DL, TII->get(W65816::ORA_DP)).addImm(HiAddr);
// Erase the old sequence.
ToErase.push_back(&*LdaLo1);
ToErase.push_back(&*OraHi);
ToErase.push_back(&*Tax);
ToErase.push_back(&*LdaOrig);
ToErase.push_back(&*AndOne);
if (StaOrig1) ToErase.push_back(StaOrig1);
ToErase.push_back(&*Clc);
ToErase.push_back(&*AdcN);
ToErase.push_back(&*StaN);
ToErase.push_back(&*LdaLo2);
ToErase.push_back(&*StaOrig2);
ToErase.push_back(&*Txa);
// BNE stays — but uses Z from our new ORA.
It = std::next(MachineBasicBlock::iterator(Ror));
}
for (MachineInstr *MI : ToErase) {
MI->eraseFromParent();
Changed = true;
}
}
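The equivalence that the popcount rewrite relies on (the bit shifted out by `ROR` is exactly the `x & 1` the original loop adds) can be checked with a small standalone model. This is a sketch under assumptions: `popcountReference` and `popcountCarry` are hypothetical names, the 32-bit value is split into two 16-bit halves to mirror the `X_hi`/`X_lo` DP slots, and a `while` loop stands in for the guarded loop shape the backend emits.

```cpp
#include <cassert>
#include <cstdint>

// Reference shape: n += (x & 1); x >>= 1.
unsigned popcountReference(std::uint32_t x) {
  unsigned n = 0;
  while (x != 0) {
    n += x & 1u;
    x >>= 1;
  }
  return n;
}

// Carry shape: shift first, then add the shifted-out bit via "ADC #0".
unsigned popcountCarry(std::uint32_t x) {
  std::uint16_t hi = static_cast<std::uint16_t>(x >> 16);
  std::uint16_t lo = static_cast<std::uint16_t>(x & 0xFFFFu);
  unsigned n = 0;
  while ((hi | lo) != 0) {
    // LSR hi: carry receives old bit 0 of hi.
    unsigned carry = hi & 1u;
    hi = static_cast<std::uint16_t>(hi >> 1);
    // ROR lo: old carry rotates into bit 15, bit 0 becomes the new carry.
    unsigned out = lo & 1u;
    lo = static_cast<std::uint16_t>((lo >> 1) | (carry << 15));
    carry = out;
    // ADC #0: n + 0 + C.
    n = n + 0u + carry;
  }
  return n;
}
```

Both functions should agree for every input; the carry form simply never materializes `x & 1` in the accumulator, which is what lets the peephole drop the lagged `X_orig` copy.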
// X-register iter peephole. In a self-loop body whose pointer-walk
// iter is held in a DP slot, replace the per-iter
// LDA_DP X1 ; reload iter from DP (3 cyc)
// STA_StackRel S ; copy to lagged slot (5 cyc)
// INA_PSEUDO ; iter++ (2 cyc)
// STA_DP X1 ; store back to DP (3 cyc)
// chain with the iter held in the X register across iters:
// TXA ; A = X = OLD iter (2 cyc)
// STA_StackRel S ; copy to lagged slot (5 cyc)
// INX ; X = NEW iter (2 cyc)
// Saves 4 cyc/iter (13 → 9). Targets strLen-shape loops. The
// preheader's STA_DP X1 (the iter init) is replaced with TAX to seed X.
//
// Safety conditions:
// - MBB is a self-loop (MBB ∈ MBB.predecessors()).
// - DP slot X1 is not referenced ANYWHERE in the function outside
// this pattern (else our drop of STA_DP X1 corrupts other reads).
// - X register is dead in the MBB (no TAX/INX/etc.) AND not live-in
// to any successor MBB.
// - Preheader exists (= a non-self predecessor we can insert into).
for (MachineBasicBlock &MBB : MF) {
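bool selfLoop = false;
MachineBasicBlock *Preheader = nullptr;
unsigned NumOutsidePreds = 0;
for (MachineBasicBlock *Pred : MBB.predecessors()) {
if (Pred == &MBB) selfLoop = true;
else { Preheader = Pred; ++NumOutsidePreds; }
}
// Require exactly one non-self predecessor: X is seeded in that block
// only, so any other entry path would reach the loop with X undefined.
if (!selfLoop || NumOutsidePreds != 1) continue;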
// Successors must not have X live-in (we'll clobber X to use it
// as iter). The self-loop successor (= MBB itself) is allowed
// because we'll redefine X every iter.
bool succXLive = false;
for (MachineBasicBlock *Succ : MBB.successors()) {
if (Succ == &MBB) continue;
if (Succ->isLiveIn(W65816::X)) { succXLive = true; break; }
}
if (succXLive) continue;
// MBB must not touch X register anywhere.
bool mbbTouchesX = false;
for (const MachineInstr &MI : MBB) {
if (MI.isCall()) { mbbTouchesX = true; break; }
switch (MI.getOpcode()) {
case W65816::TAX: case W65816::TYX: case W65816::TSX:
case W65816::PLX: case W65816::TXA: case W65816::TXY:
case W65816::TXS: case W65816::PHX: case W65816::INX:
case W65816::DEX: case W65816::LDX_Imm16: case W65816::LDX_Imm8:
case W65816::LDX_DP: case W65816::LDX_Abs:
case W65816::LDX_DPY: case W65816::LDX_AbsY:
case W65816::STX_DP: case W65816::STX_Abs:
case W65816::STX_DPY:
case W65816::CPX_DP: case W65816::CPX_Abs:
case W65816::CPX_Imm8: case W65816::CPX_Imm16:
mbbTouchesX = true; break;
}
if (mbbTouchesX) break;
for (const MachineOperand &MO : MI.operands()) {
if (MO.isReg() && MO.getReg() == W65816::X) {
mbbTouchesX = true; break;
}
}
if (mbbTouchesX) break;
}
if (mbbTouchesX) continue;
// Find the 4-op pattern in MBB.
SmallVector<MachineInstr *, 4> ToErase;
for (auto It = MBB.begin(); It != MBB.end();) {
auto LdaMI = It++;
if (LdaMI->getOpcode() != W65816::LDA_DP) continue;
if (LdaMI->getNumOperands() < 1 || !LdaMI->getOperand(0).isImm())
continue;
int64_t X1 = LdaMI->getOperand(0).getImm();
auto skipDbg = [&](auto &P) {
while (P != MBB.end() && P->isDebugInstr()) ++P;
};
auto P = std::next(LdaMI); skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::STA_StackRel) continue;
auto StaS = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::INA_PSEUDO) continue;
auto Ina = P; ++P; skipDbg(P);
if (P == MBB.end() || P->getOpcode() != W65816::STA_DP) continue;
if (P->getOperand(0).getImm() != X1) continue;
auto StaBack = P;
// Verify DP slot X1 outside-pattern references are at most ONE
// STA_DP in the preheader (the iter initialization), which we'll
// rewrite to TAX. Any read of X1 outside the pattern bails.
MachineInstr *PreheaderInit = nullptr;
bool referencedElsewhere = false;
for (MachineBasicBlock &OtherMBB : MF) {
for (MachineInstr &OtherMI : OtherMBB) {
if (&OtherMI == &*LdaMI || &OtherMI == &*StaBack) continue;
unsigned OO = OtherMI.getOpcode();
// Look for any DP-addressing op whose operand matches X1.
bool refsX1 = false;
for (const MachineOperand &MO : OtherMI.operands()) {
if (MO.isImm() && MO.getImm() == X1) {
if (OO == W65816::LDA_DP || OO == W65816::STA_DP ||
OO == W65816::STZ_DP || OO == W65816::ADC_DP ||
OO == W65816::SBC_DP || OO == W65816::AND_DP ||
OO == W65816::ORA_DP || OO == W65816::EOR_DP ||
OO == W65816::CMP_DP || OO == W65816::INC_DP ||
OO == W65816::DEC_DP || OO == W65816::LSR_DP ||
OO == W65816::ROR_DP || OO == W65816::ASL_DP ||
OO == W65816::ROL_DP || OO == W65816::BIT_DP ||
OO == W65816::LDX_DP || OO == W65816::STX_DP ||
OO == W65816::LDY_DP || OO == W65816::STY_DP) {
refsX1 = true;
}
break;
}
}
if (!refsX1) continue;
// Accept ONE STA_DP X1 in the preheader (we'll rewrite to TAX).
if (OO == W65816::STA_DP && &OtherMBB == Preheader &&
!PreheaderInit) {
PreheaderInit = &OtherMI;
continue;
}
// Otherwise bail.
referencedElsewhere = true; break;
}
if (referencedElsewhere) break;
}
if (referencedElsewhere) continue;
if (!PreheaderInit) continue;
// Apply: in preheader, the existing STA_DP X1 (PreheaderInit)
// gets replaced with TAX (A already has the initial value about
// to be stored). In the loop, replace the 4-op chain with
// TXA; STA_StackRel S; INX.
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
BuildMI(*Preheader, PreheaderInit, PreheaderInit->getDebugLoc(),
TII->get(W65816::TAX));
ToErase.push_back(PreheaderInit);
// In the loop: insert TXA before StaS, and INX after StaS.
DebugLoc DL = StaS->getDebugLoc();
BuildMI(MBB, StaS, DL, TII->get(W65816::TXA));
BuildMI(MBB, std::next(MachineBasicBlock::iterator(StaS)),
DL, TII->get(W65816::INX));
// Erase LdaMI, Ina, StaBack. StaS stays.
ToErase.push_back(&*LdaMI);
ToErase.push_back(&*Ina);
ToErase.push_back(&*StaBack);
It = std::next(MachineBasicBlock::iterator(StaBack));
break; // only fire once per MBB to avoid cascading
}
for (MachineInstr *MI : ToErase) {
MI->eraseFromParent();
Changed = true;
}
}
// Dead INC_HI_IF_CARRY_StackRel elimination. For Layer 2
// (`-w65816-dbr-safe-ptrs`) loops, ptr_hi tracking via slot H is
// bookkeeping for a pointer whose bank stays in DBR. When the
// function never READS slot H (only writes it), the
// `INC_HI_IF_CARRY_StackRel H` pseudo and the preheader STAs to H
// are dead — eliminating them saves 3 cyc/iter (the BNE taken).
//
// Detect: a `INC_HI_IF_CARRY_StackRel H` instruction.
// Verify: no instruction reads slot H anywhere in the function.
// - LDA_StackRel H, ADC_StackRel H, etc. (any read)
// - LDA_StackRelIndY H, STA_StackRelIndY H, etc. (used as pointer)
// If clean, erase the INC and all STA_StackRel H stores.
{
DenseSet<int64_t> SlotsTouched;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
if (MI.getOpcode() != W65816::INC_HI_IF_CARRY_StackRel) continue;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) continue;
SlotsTouched.insert(MI.getOperand(0).getImm());
}
}
for (int64_t H : SlotsTouched) {
// Scan the entire function for any READ of slot H or USE of H as
// an indirect-Y base.
bool readSomewhere = false;
for (MachineBasicBlock &MBB : MF) {
for (const MachineInstr &MI : MBB) {
unsigned Op = MI.getOpcode();
int64_t Imm = -1;
if (MI.getNumOperands() >= 1 && MI.getOperand(0).isImm())
Imm = MI.getOperand(0).getImm();
// Only slot H itself, or H-1 (whose 2-byte IndY pointer spans H),
// can make H live. Without the H-1 case here, the H-1 check below
// would be unreachable.
if (Imm != H && Imm != H - 1) continue;
if (Imm == H) {
switch (Op) {
// Reads of slot H via stack-rel direct.
case W65816::LDA_StackRel:
case W65816::ADC_StackRel: case W65816::SBC_StackRel:
case W65816::AND_StackRel: case W65816::ORA_StackRel:
case W65816::EOR_StackRel: case W65816::CMP_StackRel:
// Uses of slot H as an indirect-Y base.
case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
readSomewhere = true;
break;
default:
break;
}
}
// Also catch indirect uses of slot H-1 (the IndY reads 2 bytes
// at H-1, H). Conservative.
if (Imm == H - 1) {
switch (Op) {
case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
readSomewhere = true;
break;
default:
break;
}
}
if (readSomewhere) break;
}
if (readSomewhere) break;
}
if (readSomewhere) continue;
// Slot H is dead. Erase the INC_HI_IF_CARRY and all STA_StackRel
// H stores.
SmallVector<MachineInstr *, 4> ToErase;
for (MachineBasicBlock &MBB : MF) {
for (MachineInstr &MI : MBB) {
unsigned Op = MI.getOpcode();
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) continue;
if (MI.getOperand(0).getImm() != H) continue;
if (Op == W65816::INC_HI_IF_CARRY_StackRel ||
Op == W65816::STA_StackRel) {
ToErase.push_back(&MI);
}
}
}
for (MachineInstr *MI : ToErase) {
MI->eraseFromParent();
Changed = true;
}
}
}
// Lagged-stack-rel → DP-indirect-Y conversion. After the X-iter
// peephole, the loop body looks like:
// TXA ; A = X = OLD iter
// STA_StackRel S ; lagged stack-rel ptr slot
// INX ; X = NEW
// ...
// LDA_StackRelIndY S ; deref via stack-rel-indirect-Y (7 cyc)
//
// Equivalent with the iter in X and the ptr in DP:
// STX_DP D ; write 16-bit X to DP slot D:D+1 (4 cyc)
// INX ; X = NEW (2)
// ...
// LDA_DPIndY D ; deref via DP-indirect-Y (6 cyc)
//
// Saves TXA (2) + 1 cyc on STA→STX (5→4) + 1 cyc on IndY-stack→
// IndY-DP (7→6) = 4 cyc/iter. Requires a free DP slot pair.
//
// Conditions:
// - MBB is a self-loop (already paired with X-iter peephole).
// - Pattern TXA;STA_StackRel S;INX exists in MBB.
// - LDA_StackRelIndY S exists in MBB.
// - Slot S is referenced ONLY by the STA above and the LDA above
// (no other reads/writes in the function).
// - There exists an unused IMG-range DP slot (16-bit pair).
for (MachineBasicBlock &MBB : MF) {
bool selfLoop = false;
for (MachineBasicBlock *Pred : MBB.predecessors()) {
if (Pred == &MBB) { selfLoop = true; break; }
}
if (!selfLoop) continue;
// Find TXA ; STA_StackRel S ; INX in this MBB.
MachineInstr *Txa = nullptr;
MachineInstr *StaS = nullptr;
MachineInstr *Inx = nullptr;
int64_t Soff = -1;
auto It = MBB.begin();
while (It != MBB.end()) {
if (It->getOpcode() == W65816::TXA) {
auto P = std::next(It);
while (P != MBB.end() && P->isDebugInstr()) ++P;
if (P == MBB.end() || P->getOpcode() != W65816::STA_StackRel) {
++It; continue;
}
auto Sta = P; ++P;
while (P != MBB.end() && P->isDebugInstr()) ++P;
if (P == MBB.end() || P->getOpcode() != W65816::INX) {
++It; continue;
}
if (Sta->getNumOperands() < 1 || !Sta->getOperand(0).isImm()) {
++It; continue;
}
Txa = &*It; StaS = &*Sta; Inx = &*P;
Soff = Sta->getOperand(0).getImm();
break;
}
++It;
}
if (!Txa) continue;
// Find LDA_StackRelIndY S in MBB.
MachineInstr *LdaIndY = nullptr;
for (MachineInstr &MI : MBB) {
if (MI.getOpcode() == W65816::LDA_StackRelIndY &&
MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
MI.getOperand(0).getImm() == Soff) {
LdaIndY = &MI;
break;
}
}
if (!LdaIndY) continue;
// Find the loop's predecessor (= preheader for our self-loop case).
MachineBasicBlock *Preheader = nullptr;
for (MachineBasicBlock *Pred : MBB.predecessors()) {
if (Pred != &MBB) { Preheader = Pred; break; }
}
// Verify slot S is referenced ONLY by writes (dropped) and the
// LdaIndY, EXCEPT we tolerate LDA_StackRel S reads in the
// preheader (the X-iter peephole inserts these to bootstrap X).
// Any read elsewhere (post-loop, non-preheader pred, etc.) bails.
SmallVector<MachineInstr *, 2> DeadStaWrites;
SmallVector<MachineInstr *, 2> DeadPreheaderReads;
bool slotElsewhere = false;
for (MachineBasicBlock &OtherMBB : MF) {
for (MachineInstr &OtherMI : OtherMBB) {
if (&OtherMI == StaS || &OtherMI == LdaIndY) continue;
unsigned OOO = OtherMI.getOpcode();
if (OtherMI.getNumOperands() < 1 || !OtherMI.getOperand(0).isImm())
continue;
if (OtherMI.getOperand(0).getImm() != Soff) continue;
switch (OOO) {
case W65816::STA_StackRel:
DeadStaWrites.push_back(&OtherMI);
continue;
case W65816::LDA_StackRel:
// Allow LDA in the preheader (X-iter's reload-to-bootstrap-X).
if (Preheader && &OtherMBB == Preheader) {
DeadPreheaderReads.push_back(&OtherMI);
continue;
}
slotElsewhere = true; break;
case W65816::ADC_StackRel: case W65816::SBC_StackRel:
case W65816::AND_StackRel: case W65816::ORA_StackRel:
case W65816::EOR_StackRel: case W65816::CMP_StackRel:
case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
slotElsewhere = true; break;
default: break;
}
if (slotElsewhere) break;
}
if (slotElsewhere) break;
}
if (slotElsewhere) continue;
// Each preheader LDA_StackRel S must be followed by TAX (it's the
// X-iter bootstrap pattern). Verify; if not, bail.
for (MachineInstr *MI : DeadPreheaderReads) {
auto N = std::next(MachineBasicBlock::iterator(MI));
while (N != MI->getParent()->end() && N->isDebugInstr()) ++N;
if (N == MI->getParent()->end() || N->getOpcode() != W65816::TAX) {
slotElsewhere = true; break;
}
}
if (slotElsewhere) continue;
// Find a free DP slot in the IMG range. Scan all DP-addressing
// ops and collect used addresses; pick the lowest unused 16-bit
// aligned slot.
DenseSet<int64_t> UsedDpAddrs;
for (MachineBasicBlock &OtherMBB : MF) {
for (MachineInstr &OtherMI : OtherMBB) {
if (OtherMI.getNumOperands() < 1 || !OtherMI.getOperand(0).isImm())
continue;
unsigned OOO = OtherMI.getOpcode();
switch (OOO) {
case W65816::LDA_DP: case W65816::STA_DP: case W65816::STZ_DP:
case W65816::LDX_DP: case W65816::STX_DP:
case W65816::LDY_DP: case W65816::STY_DP:
case W65816::ADC_DP: case W65816::SBC_DP:
case W65816::AND_DP: case W65816::ORA_DP:
case W65816::EOR_DP: case W65816::CMP_DP:
case W65816::CPX_DP: case W65816::CPY_DP:
case W65816::LSR_DP: case W65816::ROR_DP:
case W65816::ASL_DP: case W65816::ROL_DP:
case W65816::INC_DP: case W65816::DEC_DP:
case W65816::BIT_DP:
case W65816::LDA_DPInd: case W65816::STA_DPInd:
case W65816::LDA_DPIndY: case W65816::STA_DPIndY:
case W65816::LDA_DPIndX: case W65816::STA_DPIndX:
case W65816::LDA_DPIndLong: case W65816::STA_DPIndLong:
case W65816::LDA_DPIndLongY: case W65816::STA_DPIndLongY:
{
int64_t A = OtherMI.getOperand(0).getImm();
UsedDpAddrs.insert(A);
UsedDpAddrs.insert(A + 1); // 16-bit ops occupy 2 bytes
}
break;
default: break;
}
}
}
// Pick a free 16-bit-aligned slot in $C0..$DE.
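// Illustrative example (hypothetical addresses, not from a real run):
// with $C0, $C1 and $C3 recorded in UsedDpAddrs, $C0 is rejected, $C2 is
// rejected because its high byte $C3 is taken, and the scan settles on
// FreeDp = $C4.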
int64_t FreeDp = -1;
for (int64_t A = 0xC0; A <= 0xDE; A += 2) {
if (!UsedDpAddrs.count(A) && !UsedDpAddrs.count(A + 1)) {
FreeDp = A;
break;
}
}
if (FreeDp < 0) continue;
// Apply: rewrite TXA;STA_StackRel S;INX → STX_DP FreeDp;INX
// (TXA and StaS erased; STX_DP inserted at TXA's position).
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
BuildMI(MBB, Txa, Txa->getDebugLoc(),
TII->get(W65816::STX_DP)).addImm(FreeDp);
Txa->eraseFromParent();
StaS->eraseFromParent();
// Rewrite LDA_StackRelIndY S → LDA_DPIndY FreeDp.
BuildMI(MBB, LdaIndY, LdaIndY->getDebugLoc(),
TII->get(W65816::LDA_DPIndY))
.addImm(FreeDp);
LdaIndY->eraseFromParent();
// We do NOT erase the preheader's STA writes or LDA reads to slot
// S — they're the X-iter peephole's bootstrap (STA at preheader
// init writes s_lo to S; LDA reloads s_lo to A for TAX→X). The
// body no longer uses slot S, but leaving them there is harmless.
(void)DeadStaWrites;
(void)DeadPreheaderReads;
Changed = true;
}
// Y-as-counter for strLen-shape loops. After DP-indirect-Y rewrite +
// dead-INC_HI elim, the strLen body is:
// STX_DP D (4 cyc) writes iter X to DP slot D
// INX (2) iter++
// INC_DP C (5) counter++
// LDA_DPIndY D (6) deref via slot D
// ANDi16imm #ff (--) mask to 8-bit (lowers to AND #imm)
// BNE loop (3) branch on byte != 0
// 22 cyc/iter.
//
// We can drop STX/INX/INC entirely by using Y as both the offset and
// counter: D holds initial s (one-time write in preheader), Y starts
// at -1, INY at top of each iter brings Y to 0, 1, 2, ...
// INY (2)
// LDA_DPIndY D (6)
// ANDi16imm #ff
// BNE loop (3)
// 13 cyc/iter — saves 9 cyc per iter.
//
// Preheader: LDY_Imm16 0 → LDY_Imm16 0xFFFF. Add STX_DP D (one-time
// s init). Drop the existing `LDA #-1 ; STA_DP C` counter init.
//
// Exit MBB: replace `LDA_DP C` (returns the counter) with `TYA`.
//
// strLen 1279 → ~874 cyc (predicted).
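//
// Worked check of the totals above. This assumes the AND #imm listed
// as "--" is counted as 2 cycles, which is what the stated 22-cycle
// total implies:
//   before: 4 + 2 + 5 + 6 + 2 + 3 = 22 cyc/iter
//   after:          2 + 6 + 2 + 3 = 13 cyc/iter
//   saved:  (4 + 2 + 5) - 2 = 9 cyc/iter (STX_DP, INX, INC_DP removed; INY added)
// The predicted 1279 -> ~874 drop is 405 cycles = 45 * 9, which would
// correspond to 45 loop iterations (an inference from the arithmetic,
// not a figure stated elsewhere in this file).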
{
for (MachineBasicBlock &MBB : MF) {
// Self-loop check.
bool isSelfLoop = false;
for (auto *Succ : MBB.successors()) {
if (Succ == &MBB) { isSelfLoop = true; break; }
}
if (!isSelfLoop) continue;
if (MBB.pred_size() != 2) continue;
if (MBB.succ_size() != 2) continue;
// Find the 6-op pattern.
MachineInstr *Stx = nullptr;
MachineInstr *Inx = nullptr;
MachineInstr *IncC = nullptr;
MachineInstr *Lda = nullptr;
MachineInstr *And = nullptr;
MachineInstr *Bne = nullptr;
int64_t D = -1, C = -1;
bool extraInsn = false;
for (MachineInstr &MI : MBB) {
if (MI.isDebugInstr()) continue;
switch (MI.getOpcode()) {
case W65816::STX_DP:
if (Stx) { extraInsn = true; break; }
Stx = &MI;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) {
extraInsn = true; break;
}
D = MI.getOperand(0).getImm();
break;
case W65816::INX:
if (!Stx || Inx) { extraInsn = true; break; }
Inx = &MI;
break;
case W65816::INC_DP:
if (!Inx || IncC) { extraInsn = true; break; }
IncC = &MI;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) {
extraInsn = true; break;
}
C = MI.getOperand(0).getImm();
if (C == D) { extraInsn = true; break; }
break;
case W65816::LDA_DPIndY:
if (!IncC || Lda) { extraInsn = true; break; }
Lda = &MI;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm() ||
MI.getOperand(0).getImm() != D) {
extraInsn = true; break;
}
break;
case W65816::ANDi16imm:
if (!Lda || And) { extraInsn = true; break; }
And = &MI;
if (MI.getNumOperands() < 3 || !MI.getOperand(2).isImm() ||
MI.getOperand(2).getImm() != 255) {
extraInsn = true; break;
}
break;
case W65816::BNE:
if (!And || Bne) { extraInsn = true; break; }
Bne = &MI;
break;
default:
extraInsn = true;
break;
}
if (extraInsn) break;
}
if (extraInsn || !Stx || !Inx || !IncC || !Lda || !And || !Bne) continue;
// Find preheader (predecessor that is not self).
MachineBasicBlock *Preheader = nullptr;
for (auto *Pred : MBB.predecessors()) {
if (Pred != &MBB) {
if (Preheader) { Preheader = nullptr; break; }
Preheader = Pred;
}
}
if (!Preheader) continue;
// Find exit MBB (successor that is not self).
MachineBasicBlock *Exit = nullptr;
for (auto *Succ : MBB.successors()) {
if (Succ != &MBB) {
if (Exit) { Exit = nullptr; break; }
Exit = Succ;
}
}
if (!Exit) continue;
// In preheader: find LDY_Imm16 0, the counter init (LDA #-1; STA_DP C),
// and a TAX (the X-iter bootstrap that puts s in X).
MachineInstr *Ldy = nullptr;
MachineInstr *LdaNeg1 = nullptr;
MachineInstr *StaC = nullptr;
MachineInstr *Tax = nullptr;
for (MachineInstr &MI : *Preheader) {
if (MI.getOpcode() == W65816::LDY_Imm16 &&
MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
MI.getOperand(0).getImm() == 0) {
Ldy = &MI;
}
if (MI.getOpcode() == W65816::LDAi16imm &&
MI.getNumOperands() >= 2 && MI.getOperand(1).isImm() &&
(MI.getOperand(1).getImm() == -1 ||
MI.getOperand(1).getImm() == 0xFFFF)) {
LdaNeg1 = &MI;
}
if (MI.getOpcode() == W65816::STA_DP &&
MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
MI.getOperand(0).getImm() == C) {
StaC = &MI;
}
if (MI.getOpcode() == W65816::TAX) {
Tax = &MI;
}
}
if (!Ldy || !LdaNeg1 || !StaC || !Tax) continue;
// In exit MBB: find the first LDA_DP C (the load that returns the
// counter). The RTL that follows is assumed; it is not checked here.
MachineInstr *ExitLdaC = nullptr;
for (MachineInstr &MI : *Exit) {
if (MI.getOpcode() == W65816::LDA_DP &&
MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
MI.getOperand(0).getImm() == C) {
ExitLdaC = &MI;
break;
}
}
if (!ExitLdaC) continue;
// Verify no other references to slots D or C anywhere else in MF.
bool extraDRef = false;
bool extraCRef = false;
for (MachineBasicBlock &OtherMBB : MF) {
for (MachineInstr &MI : OtherMBB) {
if (&MI == Stx || &MI == IncC || &MI == Lda ||
&MI == LdaNeg1 || &MI == StaC || &MI == ExitLdaC) continue;
if (MI.getNumOperands() >= 1 && MI.getOperand(0).isImm()) {
int64_t Imm = MI.getOperand(0).getImm();
unsigned Op = MI.getOpcode();
// Catch any DP or DPInd op touching slot D or D+1, C or C+1.
switch (Op) {
case W65816::LDA_DP: case W65816::STA_DP: case W65816::STZ_DP:
case W65816::LDX_DP: case W65816::STX_DP:
case W65816::LDY_DP: case W65816::STY_DP:
case W65816::ADC_DP: case W65816::SBC_DP:
case W65816::AND_DP: case W65816::ORA_DP: case W65816::EOR_DP:
case W65816::CMP_DP: case W65816::CPX_DP: case W65816::CPY_DP:
case W65816::LSR_DP: case W65816::ROR_DP:
case W65816::ASL_DP: case W65816::ROL_DP:
case W65816::INC_DP: case W65816::DEC_DP:
case W65816::BIT_DP: case W65816::TSB_DP: case W65816::TRB_DP:
case W65816::LDA_DPIndY: case W65816::STA_DPIndY:
case W65816::LDA_DPInd: case W65816::STA_DPInd:
if (Imm == D || Imm == D + 1) extraDRef = true;
if (Imm == C || Imm == C + 1) extraCRef = true;
break;
}
}
}
}
if (extraDRef || extraCRef) continue;
// Perform the transformation.
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
// 1. Preheader: change LDY_Imm16 0 → LDY_Imm16 0xFFFF.
Ldy->getOperand(0).setImm(0xFFFF);
// 2. Preheader: erase LDA #-1 and STA_DP C (counter init dead).
LdaNeg1->eraseFromParent();
StaC->eraseFromParent();
// 3. Preheader: insert STX_DP D after TAX (one-time s → D init).
auto AfterTax = std::next(MachineBasicBlock::iterator(Tax));
BuildMI(*Preheader, AfterTax, Tax->getDebugLoc(),
TII->get(W65816::STX_DP)).addImm(D);
// 4. Body: erase STX_DP D, INX, INC_DP C.
Stx->eraseFromParent();
Inx->eraseFromParent();
IncC->eraseFromParent();
// 5. Body: insert INY at start (before LDA_DPIndY).
BuildMI(MBB, Lda, Lda->getDebugLoc(),
TII->get(W65816::INY));
// 6. Exit: replace LDA_DP C with TYA.
BuildMI(*Exit, ExitLdaC, ExitLdaC->getDebugLoc(),
TII->get(W65816::TYA));
ExitLdaC->eraseFromParent();
Changed = true;
}
}
// Run elideStoreForwarding at the very end, AFTER IMG promotion has
// committed slot assignments. Running this peephole earlier (with
// the other early peepholes) cascades into different IMG-promotion


@ -107,14 +107,7 @@ bool W65816UnLSR::runOnFunction(Function &F) {
 for (Loop *L : LI) {
 Changed |= processLoop(L);
 Changed |= processCounterToPtrPHIs(L);
-// NOTE: processReturnedCounter (strLen-shape counter → ptr-difference
-// at exit) is correct but produces a NET LOSS on strLen: without the
-// counter PHI, the i32 pointer arithmetic falls back to clc+adc
-// chains (16+ cyc/iter) instead of inc-A on the lo half (5 cyc/iter
-// for ptr update + 5 for counter inc). See feedback memory.
-// Disabled until codegen can use inc-DP for the lo half of a pointer
-// PHI's increment without the SDAG materializing a full i32 add.
-// Recurse into nested loops.
+// processReturnedCounter remains disabled — see note above.
 SmallVector<Loop *, 4> Worklist(L->begin(), L->end());
 while (!Worklist.empty()) {
 Loop *Sub = Worklist.pop_back_val();


@ -64,16 +64,18 @@ binary; the run only works in an unrestricted shell. Workaround: copy
 ## Size vs Calypsi (5 core files, ITERATIONS=1, PERFORMANCE_RUN)
-| File | Ours (L2+threshold=75) | Calypsi 5.16 | Ratio |
+| File | Ours (L2+threshold=50) | Calypsi 5.16 | Ratio |
 |------|----------------------:|-------------:|------:|
-| core_list_join.o | 10,188 | 9,073 | 1.12× |
-| core_main.o | 11,656 | 19,772 | 0.59× |
-| core_matrix.o | 15,180 | 11,078 | 1.37× |
-| core_state.o | 7,348 | 9,944 | 0.74× |
+| core_list_join.o | 10,008 | 9,073 | 1.10× |
+| core_main.o | 11,588 | 19,772 | 0.59× |
+| core_matrix.o | 10,660 | 11,078 | 0.96× |
+| core_state.o | 7,256 | 9,944 | 0.73× |
 | core_util.o | 3,156 | 4,631 | 0.68× |
-| **TOTAL** | **47,528** | **54,498** | **0.87×** |
+| **TOTAL** | **42,668** | **54,498** | **0.78×** |
-We beat Calypsi by 13% on CoreMark overall.
+We beat Calypsi by 22% on CoreMark overall. (Since the inline-
+threshold dropped from 75 to 50 target-wide, `core_matrix.o` improved
+from 1.37× → 0.96× by no longer inlining 5 nested-loop helpers.)
 ## Notes on the porting layer

Binary file not shown.
