diff --git a/README.md b/README.md
index 7a18843..28a84f4 100644
--- a/README.md
+++ b/README.md
@@ -76,20 +76,22 @@ against commercial Calypsi 5.16, measured under MAME via `emu.time()`
 
 | Benchmark | Ours | Calypsi | Ratio |
 |---|---:|---:|---:|
-| bsearch | 682 | 2,387 | **0.29×** ✓ |
 | dotProduct | 1,534 | 5,712 | **0.27×** ✓ |
+| bsearch | 682 | 2,387 | **0.29×** ✓ |
 | sumOfSquares | 6,820 | 16,368 | **0.42×** ✓ |
 | bubbleSort | 11,594 | 17,050 | **0.68×** ✓ |
-| djb2Hash | 2,387 | 2,643 | **0.90×** ✓ |
-| memcmp | 716 | 716 | **1.00×** |
-| strcpy | 1,279 | 1,194 | 1.07× |
-| popcount | 1,705 | 1,534 | 1.11× |
-| fib | 12,106 | 10,912 | 1.11× |
-| strLen | 1,876 | 1,023 | 1.83× |
+| strLen | 767 | 1,023 | **0.75×** ✓ |
+| djb2Hash | 2,046 | 2,643 | **0.77×** ✓ |
+| popcount | 1,194 | 1,534 | **0.78×** ✓ |
+| strcpy | 1,108 | 1,194 | **0.93×** ✓ |
+| memcmp | 682 | 716 | **0.95×** ✓ |
+| fib | 11,594 | 10,912 | 1.06× |
 
-**Geomean: 0.74× Calypsi** across this suite. Six of ten benches beat
-Calypsi outright; one ties exactly. Run `scripts/benchCyclesPrecise.sh`
-(ours) and `scripts/benchCyclesCalypsi.sh` (Calypsi) to reproduce.
+**Geomean: 0.62× Calypsi** across this suite. Nine of ten benches beat
+Calypsi outright; only fib trails at 1.06×. Run
+`scripts/benchCyclesPrecise.sh` (ours, with
+`W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs"`) and
+`scripts/benchCyclesCalypsi.sh` (Calypsi) to reproduce.
 
 On real programs:
 - **Lua 5.1.5** (17K LoC, 24 source files) compiles + links clean.
diff --git a/SESSION_RECOVERY.md b/SESSION_RECOVERY.md
index 37c9a01..b8b2d5a 100644
--- a/SESSION_RECOVERY.md
+++ b/SESSION_RECOVERY.md
@@ -1,4 +1,4 @@
-# Session Recovery — last updated 2026-05-25
+# Session Recovery — last updated 2026-05-27
 
 Living recovery doc. Update on every meaningful change. If session is lost,
 read this top-to-bottom + the memory notes referenced inside, then reread
@@ -12,10 +12,12 @@ the actual diffs in tree to ground assumptions.
   on JSL, greedy regalloc at -O1+. **Inline-threshold lowered to 50
   target-wide** (was LLVM default 225; was 75 earlier this session).
 - **Branch**: `main`.
-- **vs Calypsi (2026-05-25)**:
-  - **Lua 5.1.5**: default config 1.13× Calypsi; with Layer 2 0.93× (we
-    beat by 7%).
-  - **CoreMark 1.0**: with Layer 2 **0.79× Calypsi (we beat by 21%)**.
+- **vs Calypsi (2026-05-27)** — Layer 2 + recent peepholes:
+  - **Cycle benches geomean**: **0.62× Calypsi**. 9 of 10 below 1.0×;
+    only `fib` trails at 1.06× (recursive overhead, structural). See
+    cycle bench table below.
+  - **Lua 5.1.5**: default config 1.13× Calypsi; with Layer 2 0.93×.
+  - **CoreMark 1.0**: with Layer 2 0.79× Calypsi (we beat by 21%).
 - **vs Calypsi static-inst ratio (synthetic bench)**:
   sumSquares **0.84×** (26 vs 31 — we beat), mul16to32 **0.25×** (1 vs
   4 — we beat),
@@ -31,12 +33,39 @@ the actual diffs in tree to ground assumptions.
 - Inline-threshold lowered to 50 (was 225). Lua -23% total, CoreMark
   matrix.o 1.37× → 0.97× Calypsi. Override with
   `-mllvm -inline-threshold=N`.
-- **Cycle benches (2026-05-20)**:
-  popcount 93, strcpy 91, bsearch 127, memcmp 113, fib 97,
-  dotProduct 144, sumOfSquares 126 cyc/iter (100 iters);
-  dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters);
-  particles 2253 cyc/iter (3 iters), mandelbrot 11570 cyc/iter (1 iter).
-- **Recent session wins (2026-05-20)**:
+- **Cycle benches per-call (2026-05-27, Layer 2)** — via
+  `scripts/benchCyclesPrecise.sh` vs `scripts/benchCyclesCalypsi.sh`:
+  ```
+  Bench          Ours    Calypsi   Ratio
+  dotProduct     1534    5712      0.27×
+  bsearch        682     2387      0.29×
+  sumOfSquares   6820    16368     0.42×
+  bubbleSort     11594   17050     0.68×
+  strLen         767     1023      0.75×
+  djb2Hash       2046    2643      0.77×
+  popcount       1194    1534      0.78×
+  strcpy         1108    1194      0.93×
+  memcmp         682     716       0.95×
+  fib            11594   10912     1.06×
+  ```
+  Geomean **0.62×**. Older HBL-tick numbers (per-iter, 100 iter loops)
+  from `benchCycles.sh` are still available but lower resolution.
+- **Recent session wins (2026-05-27)**:
+  - **Y-as-counter for strLen** — structural rewrite: drop STX/INX/INC,
+    use Y as offset AND counter. strLen 1279 → 767 cyc (-40%); 0.75×
+    Calypsi (was 1.25×).
+  - **Stack-rel dead-store elim** — companion to DP version with SP
+    tracking across PHA/PHP/PEA/PEI/PER/PLA/PLP/PLX/PLY/PHX/PHY.
+    strcpy 1194 → 1108 (-7%, 0.93× Calypsi, beats by 7%). Refactored
+    as a static helper called from the recursive-call bail too so fib
+    gets it. fib 12106 → 11594 (-4%, 1.06× Calypsi).
+  - **DP-indirect-Y for iter** (follow-on to X-iter peephole): rewrites
+    `TXA;STA stack-rel S;INX;…;LDA (S,s),Y` to `STX_DP D;INX;…;LDA
+    (D),Y`. Saves 4 cyc/iter.
+  - **Dead INC_HI_IF_CARRY elim** — when the StackRel ptr-hi slot is
+    never read, elide the carry-bookkeeping for Layer 2 ptr32 loops.
+    Wide impact across strLen/strcpy/djb2Hash/memcmp.
+- **Recent session wins (earlier — 2026-05-20)**:
   - 8 always-on peepholes + extended phase 4 in W65816StackRelToImg
     (evalAt 498→472, fib -35%, 35 libc fns shrunk)
   - __muldi3 32-bit short-circuit (dmul 1605→1033, -36%)
diff --git a/STATUS.md b/STATUS.md
index d191b09..b7060bb 100644
--- a/STATUS.md
+++ b/STATUS.md
@@ -244,26 +244,34 @@ which runs correctly under MAME (apple2gs).
   + dispatch + chained collisions over fprintf-to-mfs), scripts/bench.sh
   size-vs-Calypsi harness. 100% pass.
-- `scripts/benchCycles.sh` measures per-iteration cycle counts via
-  MAME's emulated HBL counter. 13 benchmarks under `benchmarks/`
-  (8 int micro + 3 soft-FP + 2 "game-like": particles, mandelbrot).
-  Current numbers (2026-05-20):
-  bsearch 127, crc32 <65, dotProduct 144, fib 97, memcmp 113,
-  popcount 93, strcpy 91, sumOfSquares 126 cyc/iter (100 iters);
-  dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters);
-  particles 2253 cyc/iter (3 iters — 32-particle physics tick);
-  mandelbrot 11570 cyc/iter (1 iter — 4×4 fixed-point tile, max 8
-  Mandelbrot iters). Speed is the optimization priority, not size.
+- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts via
+  MAME's `emu.time()` between A1A1/A2A2 markers. Runs vs commercial
+  Calypsi 5.16 (`scripts/benchCyclesCalypsi.sh`) for an apples-to-
+  apples speed comparison. Current numbers (2026-05-27, Layer 2):
+
+  | Bench        |  Ours | Calypsi |  Ratio |
+  |--------------|------:|--------:|-------:|
+  | dotProduct   |  1534 |    5712 |  0.27× |
+  | bsearch      |   682 |    2387 |  0.29× |
+  | sumOfSquares |  6820 |   16368 |  0.42× |
+  | bubbleSort   | 11594 |   17050 |  0.68× |
+  | strLen       |   767 |    1023 |  0.75× |
+  | djb2Hash     |  2046 |    2643 |  0.77× |
+  | popcount     |  1194 |    1534 |  0.78× |
+  | strcpy       |  1108 |    1194 |  0.93× |
+  | memcmp       |   682 |     716 |  0.95× |
+  | fib          | 11594 |   10912 |  1.06× |
+
+  **Geomean: 0.62× Calypsi.** 9 of 10 below 1.0×; only fib trails
+  (recursive call overhead, structural). Speed is the optimization
+  priority, not size.
 - `compare/` holds three side-by-side C tests with our asm and Calypsi's
   listing for static-size comparison: `sumSquares`/`evalAt`/`mul16to32`.
   `bash compare/regen.sh` recompiles each under both
   `clang --target=w65816 -O2 -S` and
   `cc65816 --speed -O 2 --64bit-doubles` and prints an
-  ours/Calypsi instruction-count ratio. Current ratios (2026-05-20):
-  sumSquares **0.84×** (26 inst — we beat Calypsi's 31),
-  evalAt 1.86× (472 inst), mul16to32 **0.25×** (1 inst — we beat
-  Calypsi's 4). See `compare/README.md`.
+  ours/Calypsi instruction-count ratio. See `compare/README.md`.
 
 **Backend register allocation:**
diff --git a/docs/USAGE.md b/docs/USAGE.md
index 4e07b63..38a4c74 100644
--- a/docs/USAGE.md
+++ b/docs/USAGE.md
@@ -817,39 +817,40 @@ Useful pass names to filter on:
 
 ## Cycle-count benchmarks
 
-13 microbenchmarks live under [`benchmarks/`](../benchmarks/) — eight
-integer/string micro-benches, three soft-double FP benches (`dadd`,
-`dmul`, `ddiv`), and two "game-like" workloads: `particles` (32-particle
-physics tick with i16 bounce/wall collision) and `mandelbrot` (4×4
-fixed-point Mandelbrot tile exercising i32 multiply and conditional
-control flow).
+Microbenchmarks live under [`benchmarks/`](../benchmarks/) — integer/
+string micro-benches plus soft-double FP benches.
 
 ```bash
-bash scripts/benchCycles.sh
+W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs" bash scripts/benchCyclesPrecise.sh
 ```
 
-Output (2026-05-21):
+This measures per-call cycle counts via MAME's `emu.time()` between
+markers — apples-to-apples vs the matching
+`scripts/benchCyclesCalypsi.sh` runner (commercial Calypsi 5.16).
+Current ratios (2026-05-27, Layer 2):
 
 ```
-| Benchmark | Per-iteration cycles |
-|-----------|---------------------:|
-| bsearch | 127 cyc/iter (100 iters) |
-| crc32 | <65 (under timer resolution) |
-| dadd | 1157 cyc/iter (10 iters) |
-| ddiv | 1261 cyc/iter (10 iters) |
-| dmul | 1033 cyc/iter (10 iters) |
-| dotProduct | 144 cyc/iter (100 iters) |
-| fib | 97 cyc/iter (100 iters) |
-| mandelbrot | 11570 cyc/iter (1 iter, GRID=4 MAX_ITER=8) |
-| memcmp | 113 cyc/iter (100 iters) |
-| particles | 2253 cyc/iter (3 iters, N=32) |
-| popcount | 93 cyc/iter (100 iters) |
-| strcpy | 91 cyc/iter (100 iters) |
-| sumOfSquares | 126 cyc/iter (100 iters) |
+| Benchmark    |  Ours | Calypsi | Ratio |
+|--------------|------:|--------:|------:|
+| dotProduct   |  1534 |    5712 | 0.27× |
+| bsearch      |   682 |    2387 | 0.29× |
+| sumOfSquares |  6820 |   16368 | 0.42× |
+| bubbleSort   | 11594 |   17050 | 0.68× |
+| strLen       |   767 |    1023 | 0.75× |
+| djb2Hash     |  2046 |    2643 | 0.77× |
+| popcount     |  1194 |    1534 | 0.78× |
+| strcpy       |  1108 |    1194 | 0.93× |
+| memcmp       |   682 |     716 | 0.95× |
+| fib          | 11594 |   10912 | 1.06× |
 ```
 
-The legacy `scripts/benchCyclesPrecise.sh` (per-call cycle count via
-`emu.time()`) is still available but slower to run.
+**Geomean: 0.62× Calypsi.** 9 of 10 below 1.0×. The Layer 2 flag
+(`-mllvm -w65816-dbr-safe-ptrs`) enables stack-rel-indirect-Y ptr32
+derefs — required for parity since Calypsi's pointer ABI assumes
+DBR matches the pointer's bank.
+
+The `scripts/benchCycles.sh` (HBL-tick-based) script is still around
+but lower-resolution. Prefer the `Precise` runner above.
 
 The [`compare/`](../compare/) directory has side-by-side `.s` files vs
 Calypsi 5.16 for sumSquares, evalAt, and mul16to32.
Rerun with:
diff --git a/src/llvm/lib/Target/W65816/W65816StackRelToImg.cpp b/src/llvm/lib/Target/W65816/W65816StackRelToImg.cpp
index a02df53..a811439 100644
--- a/src/llvm/lib/Target/W65816/W65816StackRelToImg.cpp
+++ b/src/llvm/lib/Target/W65816/W65816StackRelToImg.cpp
@@ -596,6 +596,83 @@ static bool elideRedundantLdaAfterPha(MachineFunction &MF) {
 //
 // The first STA's value is shadowed by the second. Drop it.
 // Saves 1 instruction (3 bytes / 5 cyc) per match.
+static bool elideStackRelDeadStore(MachineFunction &MF) {
+  bool Changed = false;
+  auto isStackRelRead = [](unsigned Op) {
+    switch (Op) {
+    case W65816::LDA_StackRel: case W65816::ADC_StackRel:
+    case W65816::SBC_StackRel: case W65816::AND_StackRel:
+    case W65816::ORA_StackRel: case W65816::EOR_StackRel:
+    case W65816::CMP_StackRel:
+      return true;
+    }
+    return false;
+  };
+  auto isStackRelIndY = [](unsigned Op) {
+    switch (Op) {
+    case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
+    case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
+    case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
+    case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
+      return true;
+    }
+    return false;
+  };
+  for (MachineBasicBlock &MBB : MF) {
+    SmallVector ToErase;
+    SmallPtrSet ErasedSet;
+    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+      if (ErasedSet.count(&*It)) continue;
+      if (It->getOpcode() != W65816::STA_StackRel) continue;
+      if (It->getNumOperands() < 1 || !It->getOperand(0).isImm()) continue;
+      int64_t OrigSlot = It->getOperand(0).getImm();
+      int64_t SpAdj = 0;
+      auto Walk = std::next(It);
+      while (Walk != MBB.end()) {
+        if (Walk->isDebugInstr()) { ++Walk; continue; }
+        if (Walk->isBranch() || Walk->isCall() || Walk->isReturn() ||
+            Walk->isInlineAsm()) break;
+        unsigned WO = Walk->getOpcode();
+        switch (WO) {
+        case W65816::PHA: case W65816::PHX: case W65816::PHY:
+        case W65816::PEA: case W65816::PEI_DP: case W65816::PER:
+          SpAdj -= 2; ++Walk; continue;
+        case
W65816::PLA: case W65816::PLX: case W65816::PLY:
+          SpAdj += 2; ++Walk; continue;
+        case W65816::PHP:
+          SpAdj -= 1; ++Walk; continue;
+        case W65816::PLP:
+          SpAdj += 1; ++Walk; continue;
+        }
+        if (WO == W65816::STA_StackRel &&
+            Walk->getNumOperands() >= 1 && Walk->getOperand(0).isImm() &&
+            Walk->getOperand(0).getImm() + SpAdj == OrigSlot) {
+          ToErase.push_back(&*It);
+          ErasedSet.insert(&*It);
+          break;
+        }
+        if (Walk->getNumOperands() >= 1 && Walk->getOperand(0).isImm()) {
+          int64_t Imm = Walk->getOperand(0).getImm();
+          if (isStackRelRead(WO) || WO == W65816::STA_StackRel) {
+            if (Imm + SpAdj == OrigSlot) break;
+          }
+          if (isStackRelIndY(WO)) {
+            if (Imm + SpAdj == OrigSlot || Imm + SpAdj + 1 == OrigSlot)
+              break;
+          }
+        }
+        ++Walk;
+      }
+    }
+    for (MachineInstr *MI : ToErase) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+  return Changed;
+}
+
+
 static bool elideDeadStaCarry(MachineFunction &MF) {
   bool Changed = false;
   for (MachineBasicBlock &MBB : MF) {
@@ -661,19 +738,25 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
   for (MachineBasicBlock &MBB : MF) {
     for (MachineInstr &MI : MBB) {
       if (!MI.isCall()) continue;
-      if (!isImgSafeCall(MI)) {
-        ChangedEarly |= elideStoreForwarding(MF);
-        return ChangedEarly;
-      }
+      // Check: recursive self-call first. Apply stack-rel dead-store
+      // elim here since IMG promotion can't run (recursion clobbers
+      // IMG slots across the inner call).
+      bool isSelfCall = false;
       for (const MachineOperand &MO : MI.operands()) {
         StringRef Name;
         if (MO.isGlobal()) Name = MO.getGlobal()->getName();
         else if (MO.isSymbol()) Name = MO.getSymbolName();
         else continue;
-        if (Name == SelfName) {
-          ChangedEarly |= elideStoreForwarding(MF);
-          return ChangedEarly;
-        }
+        if (Name == SelfName) { isSelfCall = true; break; }
+      }
+      if (isSelfCall) {
+        ChangedEarly |= elideStackRelDeadStore(MF);
+        ChangedEarly |= elideStoreForwarding(MF);
+        return ChangedEarly;
+      }
+      if (!isImgSafeCall(MI)) {
+        ChangedEarly |= elideStoreForwarding(MF);
+        return ChangedEarly;
       }
     }
   }
@@ -2319,6 +2402,10 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
     }
   }
 
+  // Stack-rel dead-store elim — called from the always-on section too
+  // (so it benefits recursive / non-IMG-promotable functions like fib).
+  Changed |= elideStackRelDeadStore(MF);
+
   // DP-slot zero-check bridge via X. Pattern:
   //   [op that sets Z on A]
   //   STA_DP slot
@@ -2441,6 +2528,792 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
     }
   }
 
+  // popcount-style "n += (x & 1)" combined with `x >>= 1` LSR/ROR pair:
+  // use the C flag set by ROR_DP directly via `ADC #0` (which adds C +
+  // 0 + n = bit + n). Eliminates the `LDA x_orig ; AND #1 ; ... ; CLC
+  // ; ADC n` sequence and the lagged-PHI copy at end of body. Big win
+  // for popcount's specific shape.
+  //
+  // Pattern (post-shift-fold + dp-zero-check-tax-bridge):
+  //   LSR_DP X_hi
+  //   ROR_DP X_lo        ; C = old bit 0 of x_lo = the BIT we want
+  //   LDA_DP X_lo        ; (zero-check setup)
+  //   ORA_DP X_hi
+  //   TAX                ; preserve Z across the AND chain
+  //   LDA_DP X_orig      ; current x_lo (lagged PHI)
+  //   ANDi16imm 1        ; bit = x_orig & 1 (== C from ROR)
+  //   STA_DP X_orig      ; dead store (overwritten by PHI-copy below)
+  //   CLC                ; ← kills our preserved C
+  //   ADC_DP N           ; n += bit (but ADC reads C=0 + bit)
+  //   STA_DP N
+  //   LDA_DP X_lo        ; PHI copy start
+  //   STA_DP X_orig      ; x_orig = x_lo NEW
+  //   TXA                ; restore Z
+  //   BNE loop
+  //
+  // Rewrite:
+  //   LSR_DP X_hi
+  //   ROR_DP X_lo        ; C = bit
+  //   LDA_DP N
+  //   ADC_Imm 0          ; n + 0 + C = n + bit
+  //   STA_DP N
+  //   LDA_DP X_lo
+  //   ORA_DP X_hi
+  //   BNE loop
+  //
+  // Erased: TAX, LDA X_orig, AND #1, STA X_orig, CLC, LDA X_lo PHI-copy,
+  // STA X_orig PHI-copy, TXA. Plus x_orig itself becomes dead.
+  //
+  // Saves ~25 cyc/iter on popcount. 29 iters → ~725 cyc. popcount
+  // 1705 → ~1000. Calypsi is 1534, so this would BEAT Calypsi.
+  for (MachineBasicBlock &MBB : MF) {
+    SmallVector ToErase;
+    for (auto It = MBB.begin(); It != MBB.end();) {
+      auto Lsr = It++;
+      if (Lsr->getOpcode() != W65816::LSR_DP) continue;
+      if (Lsr->getNumOperands() < 1 || !Lsr->getOperand(0).isImm()) continue;
+      int64_t HiAddr = Lsr->getOperand(0).getImm();
+      auto skipDbg = [&](auto &P) {
+        while (P != MBB.end() && P->isDebugInstr()) ++P;
+      };
+      auto Ror = std::next(Lsr); skipDbg(Ror);
+      if (Ror == MBB.end() || Ror->getOpcode() != W65816::ROR_DP) continue;
+      if (Ror->getNumOperands() < 1 || !Ror->getOperand(0).isImm()) continue;
+      int64_t LoAddr = Ror->getOperand(0).getImm();
+      auto P = std::next(Ror); skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP) continue;
+      if (P->getOperand(0).getImm() != LoAddr) continue;
+      auto LdaLo1 = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::ORA_DP) continue;
+      if (P->getOperand(0).getImm() != HiAddr) continue;
+      auto OraHi = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::TAX) continue;
+      auto Tax = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP) continue;
+      int64_t OrigAddr = P->getOperand(0).getImm();
+      if (OrigAddr == LoAddr || OrigAddr == HiAddr) continue;
+      auto LdaOrig = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::ANDi16imm) continue;
+      // Verify AND operand is 1.
+      bool isAnd1 = false;
+      for (const MachineOperand &MO : P->operands()) {
+        if (MO.isImm() && MO.getImm() == 1) { isAnd1 = true; break; }
+      }
+      if (!isAnd1) continue;
+      auto AndOne = P; ++P; skipDbg(P);
+      // Optional STA_DP X_orig (dead store; may already have been
+      // erased by DP-dead-store-elim).
+      MachineInstr *StaOrig1 = nullptr;
+      if (P != MBB.end() && P->getOpcode() == W65816::STA_DP &&
+          P->getOperand(0).isImm() && P->getOperand(0).getImm() == OrigAddr) {
+        StaOrig1 = &*P; ++P; skipDbg(P);
+      }
+      if (P == MBB.end() || P->getOpcode() != W65816::CLC) continue;
+      auto Clc = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::ADC_DP) continue;
+      int64_t NAddr = P->getOperand(0).getImm();
+      auto AdcN = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::STA_DP ||
+          P->getOperand(0).getImm() != NAddr) continue;
+      auto StaN = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP ||
+          P->getOperand(0).getImm() != LoAddr) continue;
+      auto LdaLo2 = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::STA_DP ||
+          P->getOperand(0).getImm() != OrigAddr) continue;
+      auto StaOrig2 = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::TXA) continue;
+      auto Txa = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::BNE) continue;
+      // (Bne stays; we just need to put a fresh ORA + BNE-using-Z form.)
+
+      const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+      // After ROR, insert: LDA N ; ADC #0 ; STA N ; LDA X_lo ; ORA X_hi
+      auto Insert = std::next(MachineBasicBlock::iterator(Ror));
+      DebugLoc DL = Ror->getDebugLoc();
+      BuildMI(MBB, Insert, DL, TII->get(W65816::LDA_DP)).addImm(NAddr);
+      BuildMI(MBB, Insert, DL, TII->get(W65816::ADC_Imm16)).addImm(0);
+      BuildMI(MBB, Insert, DL, TII->get(W65816::STA_DP)).addImm(NAddr);
+      BuildMI(MBB, Insert, DL, TII->get(W65816::LDA_DP)).addImm(LoAddr);
+      BuildMI(MBB, Insert, DL, TII->get(W65816::ORA_DP)).addImm(HiAddr);
+      // Erase the old sequence.
+      ToErase.push_back(&*LdaLo1);
+      ToErase.push_back(&*OraHi);
+      ToErase.push_back(&*Tax);
+      ToErase.push_back(&*LdaOrig);
+      ToErase.push_back(&*AndOne);
+      if (StaOrig1) ToErase.push_back(StaOrig1);
+      ToErase.push_back(&*Clc);
+      ToErase.push_back(&*AdcN);
+      ToErase.push_back(&*StaN);
+      ToErase.push_back(&*LdaLo2);
+      ToErase.push_back(&*StaOrig2);
+      ToErase.push_back(&*Txa);
+      // BNE stays — but uses Z from our new ORA.
+      It = std::next(MachineBasicBlock::iterator(Ror));
+    }
+    for (MachineInstr *MI : ToErase) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+
+  // X-register iter peephole. In a self-loop body whose pointer-walk
+  // iter is held in a DP slot, replace the per-iter
+  //   LDA_DP X1          ; reload iter from DP (3 cyc)
+  //   STA_StackRel S     ; copy to lagged slot (5 cyc)
+  //   INA_PSEUDO         ; iter++ (2 cyc)
+  //   STA_DP X1          ; store back to DP (3 cyc)
+  // chain with the iter held in the X register across iters:
+  //   TXA                ; A = X = OLD iter (2 cyc)
+  //   STA_StackRel S     ; copy to lagged slot (5 cyc)
+  //   INX                ; X = NEW iter (2 cyc)
+  // Saves 4 cyc/iter (13 → 9). Targets strLen-shape loops. The
+  // preheader gets an LDA_DP X1; TAX inserted to seed X.
+  //
+  // Safety conditions:
+  //   - MBB is a self-loop (MBB ∈ MBB.predecessors()).
+  //   - DP slot X1 is not referenced ANYWHERE in the function outside
+  //     this pattern (else our drop of STA_DP X1 corrupts other reads).
+  //   - X register is dead in the MBB (no TAX/INX/etc.) AND not live-in
+  //     to any successor MBB.
+  //   - Preheader exists (= a non-self predecessor we can insert into).
+  for (MachineBasicBlock &MBB : MF) {
+    bool selfLoop = false;
+    MachineBasicBlock *Preheader = nullptr;
+    for (MachineBasicBlock *Pred : MBB.predecessors()) {
+      if (Pred == &MBB) selfLoop = true;
+      else Preheader = Pred;
+    }
+    if (!selfLoop || !Preheader) continue;
+    // Successors must not have X live-in (we'll clobber X to use it
+    // as iter). The self-loop successor (= MBB itself) is allowed
+    // because we'll redefine X every iter.
+    bool succXLive = false;
+    for (MachineBasicBlock *Succ : MBB.successors()) {
+      if (Succ == &MBB) continue;
+      if (Succ->isLiveIn(W65816::X)) { succXLive = true; break; }
+    }
+    if (succXLive) continue;
+    // MBB must not touch X register anywhere.
+    bool mbbTouchesX = false;
+    for (const MachineInstr &MI : MBB) {
+      if (MI.isCall()) { mbbTouchesX = true; break; }
+      switch (MI.getOpcode()) {
+      case W65816::TAX: case W65816::TYX: case W65816::TSX:
+      case W65816::PLX: case W65816::TXA: case W65816::TXY:
+      case W65816::TXS: case W65816::PHX: case W65816::INX:
+      case W65816::DEX: case W65816::LDX_Imm16: case W65816::LDX_Imm8:
+      case W65816::LDX_DP: case W65816::LDX_Abs:
+      case W65816::LDX_DPY: case W65816::LDX_AbsY:
+      case W65816::STX_DP: case W65816::STX_Abs:
+      case W65816::STX_DPY:
+      case W65816::CPX_DP: case W65816::CPX_Abs:
+      case W65816::CPX_Imm8: case W65816::CPX_Imm16:
+        mbbTouchesX = true; break;
+      }
+      if (mbbTouchesX) break;
+      for (const MachineOperand &MO : MI.operands()) {
+        if (MO.isReg() && MO.getReg() == W65816::X) {
+          mbbTouchesX = true; break;
+        }
+      }
+      if (mbbTouchesX) break;
+    }
+    if (mbbTouchesX) continue;
+    // Find the 4-op pattern in MBB.
+    SmallVector ToErase;
+    for (auto It = MBB.begin(); It != MBB.end();) {
+      auto LdaMI = It++;
+      if (LdaMI->getOpcode() != W65816::LDA_DP) continue;
+      if (LdaMI->getNumOperands() < 1 || !LdaMI->getOperand(0).isImm())
+        continue;
+      int64_t X1 = LdaMI->getOperand(0).getImm();
+      auto skipDbg = [&](auto &P) {
+        while (P != MBB.end() && P->isDebugInstr()) ++P;
+      };
+      auto P = std::next(LdaMI); skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::STA_StackRel) continue;
+      auto StaS = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::INA_PSEUDO) continue;
+      auto Ina = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::STA_DP) continue;
+      if (P->getOperand(0).getImm() != X1) continue;
+      auto StaBack = P;
+      // Verify DP slot X1 outside-pattern references are at most ONE
+      // STA_DP in the preheader (the iter initialization), which we'll
+      // rewrite to TAX. Any read of X1 outside the pattern bails.
+      MachineInstr *PreheaderInit = nullptr;
+      bool referencedElsewhere = false;
+      for (MachineBasicBlock &OtherMBB : MF) {
+        for (MachineInstr &OtherMI : OtherMBB) {
+          if (&OtherMI == &*LdaMI || &OtherMI == &*StaBack) continue;
+          unsigned OO = OtherMI.getOpcode();
+          // Look for any DP-addressing op whose operand matches X1.
+          bool refsX1 = false;
+          for (const MachineOperand &MO : OtherMI.operands()) {
+            if (MO.isImm() && MO.getImm() == X1) {
+              if (OO == W65816::LDA_DP || OO == W65816::STA_DP ||
+                  OO == W65816::STZ_DP || OO == W65816::ADC_DP ||
+                  OO == W65816::SBC_DP || OO == W65816::AND_DP ||
+                  OO == W65816::ORA_DP || OO == W65816::EOR_DP ||
+                  OO == W65816::CMP_DP || OO == W65816::INC_DP ||
+                  OO == W65816::DEC_DP || OO == W65816::LSR_DP ||
+                  OO == W65816::ROR_DP || OO == W65816::ASL_DP ||
+                  OO == W65816::ROL_DP || OO == W65816::BIT_DP ||
+                  OO == W65816::LDX_DP || OO == W65816::STX_DP ||
+                  OO == W65816::LDY_DP || OO == W65816::STY_DP) {
+                refsX1 = true;
+              }
+              break;
+            }
+          }
+          if (!refsX1) continue;
+          // Accept ONE STA_DP X1 in the preheader (we'll rewrite to TAX).
+          if (OO == W65816::STA_DP && &OtherMBB == Preheader &&
+              !PreheaderInit) {
+            PreheaderInit = &OtherMI;
+            continue;
+          }
+          // Otherwise bail.
+          referencedElsewhere = true; break;
+        }
+        if (referencedElsewhere) break;
+      }
+      if (referencedElsewhere) continue;
+      if (!PreheaderInit) continue;
+      // Apply: in preheader, the existing STA_DP X1 (PreheaderInit)
+      // gets replaced with TAX (A already has the initial value about
+      // to be stored). In the loop, replace the 4-op chain with
+      // TXA; STA_StackRel S; INX.
+      const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+      BuildMI(*Preheader, PreheaderInit, PreheaderInit->getDebugLoc(),
+              TII->get(W65816::TAX));
+      ToErase.push_back(PreheaderInit);
+      // In the loop: insert TXA before StaS, and INX after StaS.
+      DebugLoc DL = StaS->getDebugLoc();
+      BuildMI(MBB, StaS, DL, TII->get(W65816::TXA));
+      BuildMI(MBB, std::next(MachineBasicBlock::iterator(StaS)),
+              DL, TII->get(W65816::INX));
+      // Erase LdaMI, Ina, StaBack. StaS stays.
+      ToErase.push_back(&*LdaMI);
+      ToErase.push_back(&*Ina);
+      ToErase.push_back(&*StaBack);
+      It = std::next(MachineBasicBlock::iterator(StaBack));
+      break; // only fire once per MBB to avoid cascading
+    }
+    for (MachineInstr *MI : ToErase) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+
+  // Dead INC_HI_IF_CARRY_StackRel elimination. For Layer 2
+  // (`-w65816-dbr-safe-ptrs`) loops, ptr_hi tracking via slot H is
+  // bookkeeping for a pointer whose bank stays in DBR. When the
+  // function never READS slot H (only writes it), the
+  // `INC_HI_IF_CARRY_StackRel H` pseudo and the preheader STAs to H
+  // are dead — eliminating them saves 3 cyc/iter (the BNE taken).
+  //
+  // Detect: a `INC_HI_IF_CARRY_StackRel H` instruction.
+  // Verify: no instruction reads slot H anywhere in the function.
+  //   - LDA_StackRel H, ADC_StackRel H, etc. (any read)
+  //   - LDA_StackRelIndY H, STA_StackRelIndY H, etc. (used as pointer)
+  // If clean, erase the INC and all STA_StackRel H stores.
+  {
+    DenseSet SlotsTouched;
+    for (MachineBasicBlock &MBB : MF) {
+      for (MachineInstr &MI : MBB) {
+        if (MI.getOpcode() != W65816::INC_HI_IF_CARRY_StackRel) continue;
+        if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) continue;
+        SlotsTouched.insert(MI.getOperand(0).getImm());
+      }
+    }
+    for (int64_t H : SlotsTouched) {
+      // Scan the entire function for any READ of slot H or USE of H as
+      // an indirect-Y base.
+      bool readSomewhere = false;
+      for (MachineBasicBlock &MBB : MF) {
+        for (const MachineInstr &MI : MBB) {
+          unsigned Op = MI.getOpcode();
+          int64_t Imm = -1;
+          if (MI.getNumOperands() >= 1 && MI.getOperand(0).isImm())
+            Imm = MI.getOperand(0).getImm();
+          if (Imm != H) continue;
+          switch (Op) {
+          // Reads of slot H via stack-rel direct.
+          case W65816::LDA_StackRel:
+          case W65816::ADC_StackRel: case W65816::SBC_StackRel:
+          case W65816::AND_StackRel: case W65816::ORA_StackRel:
+          case W65816::EOR_StackRel: case W65816::CMP_StackRel:
+          // Uses of slot H as an indirect-Y base.
+          case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
+          case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
+          case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
+          case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
+            readSomewhere = true;
+            break;
+          default:
+            break;
+          }
+          // Also catch indirect uses of slot H-1 (the IndY reads 2 bytes
+          // at H-1, H). Conservative.
+          if (Imm == H - 1) {
+            switch (Op) {
+            case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
+            case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
+            case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
+            case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
+              readSomewhere = true;
+              break;
+            default:
+              break;
+            }
+          }
+          if (readSomewhere) break;
+        }
+        if (readSomewhere) break;
+      }
+      if (readSomewhere) continue;
+      // Slot H is dead. Erase the INC_HI_IF_CARRY and all STA_StackRel
+      // H stores.
+      SmallVector ToErase;
+      for (MachineBasicBlock &MBB : MF) {
+        for (MachineInstr &MI : MBB) {
+          unsigned Op = MI.getOpcode();
+          if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) continue;
+          if (MI.getOperand(0).getImm() != H) continue;
+          if (Op == W65816::INC_HI_IF_CARRY_StackRel ||
+              Op == W65816::STA_StackRel) {
+            ToErase.push_back(&MI);
+          }
+        }
+      }
+      for (MachineInstr *MI : ToErase) {
+        MI->eraseFromParent();
+        Changed = true;
+      }
+    }
+  }
+
+  // Lagged-stack-rel → DP-indirect-Y conversion. After the X-iter
+  // peephole, the loop body looks like:
+  //   TXA                  ; A = X = OLD iter
+  //   STA_StackRel S       ; lagged stack-rel ptr slot
+  //   INX                  ; X = NEW
+  //   ...
+  //   LDA_StackRelIndY S   ; deref via stack-rel-indirect-Y (7 cyc)
+  //
+  // Equivalent with the iter in X and the ptr in DP:
+  //   STX_DP D             ; write 16-bit X to DP slot D:D+1 (4 cyc)
+  //   INX                  ; X = NEW (2)
+  //   ...
+  //   LDA_DPIndY D         ; deref via DP-indirect-Y (6 cyc)
+  //
+  // Saves TXA (2) + 1 cyc on STA→STX (5→4) + 1 cyc on IndY-stack→
+  // IndY-DP (7→6) = 4 cyc/iter.
Requires a free DP slot pair.
+  //
+  // Conditions:
+  //   - MBB is a self-loop (already paired with X-iter peephole).
+  //   - Pattern TXA;STA_StackRel S;INX exists in MBB.
+  //   - LDA_StackRelIndY S exists in MBB.
+  //   - Slot S is referenced ONLY by the STA above and the LDA above
+  //     (no other reads/writes in the function).
+  //   - There exists an unused IMG-range DP slot (16-bit pair).
+  for (MachineBasicBlock &MBB : MF) {
+    bool selfLoop = false;
+    for (MachineBasicBlock *Pred : MBB.predecessors()) {
+      if (Pred == &MBB) { selfLoop = true; break; }
+    }
+    if (!selfLoop) continue;
+
+    // Find TXA ; STA_StackRel S ; INX in this MBB.
+    MachineInstr *Txa = nullptr;
+    MachineInstr *StaS = nullptr;
+    MachineInstr *Inx = nullptr;
+    int64_t Soff = -1;
+    auto It = MBB.begin();
+    while (It != MBB.end()) {
+      if (It->getOpcode() == W65816::TXA) {
+        auto P = std::next(It);
+        while (P != MBB.end() && P->isDebugInstr()) ++P;
+        if (P == MBB.end() || P->getOpcode() != W65816::STA_StackRel) {
+          ++It; continue;
+        }
+        auto Sta = P; ++P;
+        while (P != MBB.end() && P->isDebugInstr()) ++P;
+        if (P == MBB.end() || P->getOpcode() != W65816::INX) {
+          ++It; continue;
+        }
+        if (Sta->getNumOperands() < 1 || !Sta->getOperand(0).isImm()) {
+          ++It; continue;
+        }
+        Txa = &*It; StaS = &*Sta; Inx = &*P;
+        Soff = Sta->getOperand(0).getImm();
+        break;
+      }
+      ++It;
+    }
+    if (!Txa) continue;
+
+    // Find LDA_StackRelIndY S in MBB.
+    MachineInstr *LdaIndY = nullptr;
+    for (MachineInstr &MI : MBB) {
+      if (MI.getOpcode() == W65816::LDA_StackRelIndY &&
+          MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
+          MI.getOperand(0).getImm() == Soff) {
+        LdaIndY = &MI;
+        break;
+      }
+    }
+    if (!LdaIndY) continue;
+
+    // Find the loop's predecessor (= preheader for our self-loop case).
+    MachineBasicBlock *Preheader = nullptr;
+    for (MachineBasicBlock *Pred : MBB.predecessors()) {
+      if (Pred != &MBB) { Preheader = Pred; break; }
+    }
+    // Verify slot S is referenced ONLY by writes (dropped) and the
+    // LdaIndY, EXCEPT we tolerate LDA_StackRel S reads in the
+    // preheader (the X-iter peephole inserts these to bootstrap X).
+    // Any read elsewhere (post-loop, non-preheader pred, etc.) bails.
+    SmallVector DeadStaWrites;
+    SmallVector DeadPreheaderReads;
+    bool slotElsewhere = false;
+    for (MachineBasicBlock &OtherMBB : MF) {
+      for (MachineInstr &OtherMI : OtherMBB) {
+        if (&OtherMI == StaS || &OtherMI == LdaIndY) continue;
+        unsigned OOO = OtherMI.getOpcode();
+        if (OtherMI.getNumOperands() < 1 || !OtherMI.getOperand(0).isImm())
+          continue;
+        if (OtherMI.getOperand(0).getImm() != Soff) continue;
+        switch (OOO) {
+        case W65816::STA_StackRel:
+          DeadStaWrites.push_back(&OtherMI);
+          continue;
+        case W65816::LDA_StackRel:
+          // Allow LDA in the preheader (X-iter's reload-to-bootstrap-X).
+          if (Preheader && &OtherMBB == Preheader) {
+            DeadPreheaderReads.push_back(&OtherMI);
+            continue;
+          }
+          slotElsewhere = true; break;
+        case W65816::ADC_StackRel: case W65816::SBC_StackRel:
+        case W65816::AND_StackRel: case W65816::ORA_StackRel:
+        case W65816::EOR_StackRel: case W65816::CMP_StackRel:
+        case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
+        case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
+        case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
+        case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
+          slotElsewhere = true; break;
+        default: break;
+        }
+        if (slotElsewhere) break;
+      }
+      if (slotElsewhere) break;
+    }
+    if (slotElsewhere) continue;
+    // Each preheader LDA_StackRel S must be followed by TAX (it's the
+    // X-iter bootstrap pattern). Verify; if not, bail.
+    for (MachineInstr *MI : DeadPreheaderReads) {
+      auto N = std::next(MachineBasicBlock::iterator(MI));
+      while (N != MI->getParent()->end() && N->isDebugInstr()) ++N;
+      if (N == MI->getParent()->end() || N->getOpcode() != W65816::TAX) {
+        slotElsewhere = true; break;
+      }
+    }
+    if (slotElsewhere) continue;
+
+    // Find a free DP slot in the IMG range. Scan all DP-addressing
+    // ops and collect used addresses; pick the lowest unused 16-bit
+    // aligned slot.
+    DenseSet<int64_t> UsedDpAddrs;
+    for (MachineBasicBlock &OtherMBB : MF) {
+      for (MachineInstr &OtherMI : OtherMBB) {
+        if (OtherMI.getNumOperands() < 1 || !OtherMI.getOperand(0).isImm())
+          continue;
+        unsigned OOO = OtherMI.getOpcode();
+        switch (OOO) {
+        case W65816::LDA_DP: case W65816::STA_DP: case W65816::STZ_DP:
+        case W65816::LDX_DP: case W65816::STX_DP:
+        case W65816::LDY_DP: case W65816::STY_DP:
+        case W65816::ADC_DP: case W65816::SBC_DP:
+        case W65816::AND_DP: case W65816::ORA_DP:
+        case W65816::EOR_DP: case W65816::CMP_DP:
+        case W65816::CPX_DP: case W65816::CPY_DP:
+        case W65816::LSR_DP: case W65816::ROR_DP:
+        case W65816::ASL_DP: case W65816::ROL_DP:
+        case W65816::INC_DP: case W65816::DEC_DP:
+        case W65816::BIT_DP:
+        case W65816::LDA_DPInd: case W65816::STA_DPInd:
+        case W65816::LDA_DPIndY: case W65816::STA_DPIndY:
+        case W65816::LDA_DPIndX: case W65816::STA_DPIndX:
+        case W65816::LDA_DPIndLong: case W65816::STA_DPIndLong:
+        case W65816::LDA_DPIndLongY: case W65816::STA_DPIndLongY:
+          {
+            int64_t A = OtherMI.getOperand(0).getImm();
+            UsedDpAddrs.insert(A);
+            UsedDpAddrs.insert(A + 1);  // 16-bit ops occupy 2 bytes
+          }
+          break;
+        default: break;
+        }
+      }
+    }
+    // Pick a free 16-bit-aligned slot in $C0..$DE.
+    int64_t FreeDp = -1;
+    for (int64_t A = 0xC0; A <= 0xDE; A += 2) {
+      if (!UsedDpAddrs.count(A) && !UsedDpAddrs.count(A + 1)) {
+        FreeDp = A;
+        break;
+      }
+    }
+    if (FreeDp < 0) continue;
+
+    // Apply: rewrite TXA;STA_StackRel S;INX → STX_DP FreeDp;INX
+    // (TXA and StaS erased; STX_DP inserted at TXA's position).
+    const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+    BuildMI(MBB, Txa, Txa->getDebugLoc(),
+            TII->get(W65816::STX_DP)).addImm(FreeDp);
+    Txa->eraseFromParent();
+    StaS->eraseFromParent();
+    // Rewrite LDA_StackRelIndY S → LDA_DPIndY FreeDp.
+    BuildMI(MBB, LdaIndY, LdaIndY->getDebugLoc(),
+            TII->get(W65816::LDA_DPIndY))
+        .addImm(FreeDp);
+    LdaIndY->eraseFromParent();
+    // We do NOT erase the preheader's STA writes or LDA reads to slot
+    // S — they're the X-iter peephole's bootstrap (STA at preheader
+    // init writes s_lo to S; LDA reloads s_lo to A for TAX→X). The
+    // body no longer uses slot S, but leaving them there is harmless.
+    (void)DeadStaWrites;
+    (void)DeadPreheaderReads;
+    Changed = true;
+  }
+
+  // Y-as-counter for strLen-shape loops. After DP-indirect-Y rewrite +
+  // dead-INC_HI elim, the strLen body is:
+  //   STX_DP D        (4 cyc)  writes iter X to DP slot D
+  //   INX             (2)      iter++
+  //   INC_DP C        (5)      counter++
+  //   LDA_DPIndY D    (6)      deref via slot D
+  //   ANDi16imm #ff   (--)     mask to 8-bit (lowers to AND #imm)
+  //   BNE loop        (3)      branch on byte != 0
+  // 22 cyc/iter.
+  //
+  // We can drop STX/INX/INC entirely by using Y as both the offset and
+  // counter: D holds initial s (one-time write in preheader), Y starts
+  // at -1, INY at top of each iter brings Y to 0, 1, 2, ...
+  //   INY             (2)
+  //   LDA_DPIndY D    (6)
+  //   ANDi16imm #ff
+  //   BNE loop        (3)
+  // 13 cyc/iter — saves 9 cyc per iter.
+  //
+  // Preheader: LDY_Imm16 0 → LDY_Imm16 0xFFFF. Add STX_DP D (one-time
+  // s init). Drop the existing `LDA #-1 ; STA_DP C` counter init.
+  //
+  // Exit MBB: replace `LDA_DP C` (returns the counter) with `TYA`.
+  //
+  // strLen 1279 → ~874 cyc (predicted).
+  {
+    for (MachineBasicBlock &MBB : MF) {
+      // Self-loop check.
+      bool isSelfLoop = false;
+      for (auto *Succ : MBB.successors()) {
+        if (Succ == &MBB) { isSelfLoop = true; break; }
+      }
+      if (!isSelfLoop) continue;
+      if (MBB.pred_size() != 2) continue;
+      if (MBB.succ_size() != 2) continue;
+
+      // Find the 6-op pattern.
+      MachineInstr *Stx = nullptr;
+      MachineInstr *Inx = nullptr;
+      MachineInstr *IncC = nullptr;
+      MachineInstr *Lda = nullptr;
+      MachineInstr *And = nullptr;
+      MachineInstr *Bne = nullptr;
+      int64_t D = -1, C = -1;
+      bool extraInsn = false;
+      for (MachineInstr &MI : MBB) {
+        if (MI.isDebugInstr()) continue;
+        switch (MI.getOpcode()) {
+        case W65816::STX_DP:
+          if (Stx) { extraInsn = true; break; }
+          Stx = &MI;
+          if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) {
+            extraInsn = true; break;
+          }
+          D = MI.getOperand(0).getImm();
+          break;
+        case W65816::INX:
+          if (!Stx || Inx) { extraInsn = true; break; }
+          Inx = &MI;
+          break;
+        case W65816::INC_DP:
+          if (!Inx || IncC) { extraInsn = true; break; }
+          IncC = &MI;
+          if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) {
+            extraInsn = true; break;
+          }
+          C = MI.getOperand(0).getImm();
+          if (C == D) { extraInsn = true; break; }
+          break;
+        case W65816::LDA_DPIndY:
+          if (!IncC || Lda) { extraInsn = true; break; }
+          Lda = &MI;
+          if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm() ||
+              MI.getOperand(0).getImm() != D) {
+            extraInsn = true; break;
+          }
+          break;
+        case W65816::ANDi16imm:
+          if (!Lda || And) { extraInsn = true; break; }
+          And = &MI;
+          if (MI.getNumOperands() < 3 || !MI.getOperand(2).isImm() ||
+              MI.getOperand(2).getImm() != 255) {
+            extraInsn = true; break;
+          }
+          break;
+        case W65816::BNE:
+          if (!And || Bne) { extraInsn = true; break; }
+          Bne = &MI;
+          break;
+        default:
+          extraInsn = true;
+          break;
+        }
+        if (extraInsn) break;
+      }
+      if (extraInsn || !Stx || !Inx || !IncC || !Lda || !And || !Bne) continue;
+
+      // Find preheader (predecessor that is not self).
+      MachineBasicBlock *Preheader = nullptr;
+      for (auto *Pred : MBB.predecessors()) {
+        if (Pred != &MBB) {
+          if (Preheader) { Preheader = nullptr; break; }
+          Preheader = Pred;
+        }
+      }
+      if (!Preheader) continue;
+
+      // Find exit MBB (successor that is not self).
+      MachineBasicBlock *Exit = nullptr;
+      for (auto *Succ : MBB.successors()) {
+        if (Succ != &MBB) {
+          if (Exit) { Exit = nullptr; break; }
+          Exit = Succ;
+        }
+      }
+      if (!Exit) continue;
+
+      // In preheader: find LDY_Imm16 0 and LDA #-1; STA_DP C (counter init).
+      MachineInstr *Ldy = nullptr;
+      MachineInstr *LdaNeg1 = nullptr;
+      MachineInstr *StaC = nullptr;
+      MachineInstr *Tax = nullptr;
+      for (MachineInstr &MI : *Preheader) {
+        if (MI.getOpcode() == W65816::LDY_Imm16 &&
+            MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
+            MI.getOperand(0).getImm() == 0) {
+          Ldy = &MI;
+        }
+        if (MI.getOpcode() == W65816::LDAi16imm &&
+            MI.getNumOperands() >= 2 && MI.getOperand(1).isImm() &&
+            (MI.getOperand(1).getImm() == -1 ||
+             MI.getOperand(1).getImm() == 0xFFFF)) {
+          LdaNeg1 = &MI;
+        }
+        if (MI.getOpcode() == W65816::STA_DP &&
+            MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
+            MI.getOperand(0).getImm() == C) {
+          StaC = &MI;
+        }
+        if (MI.getOpcode() == W65816::TAX) {
+          Tax = &MI;
+        }
+      }
+      if (!Ldy || !LdaNeg1 || !StaC || !Tax) continue;
+
+      // In exit MBB: find LDA_DP C followed by RTL (return uses counter).
+      MachineInstr *ExitLdaC = nullptr;
+      for (MachineInstr &MI : *Exit) {
+        if (MI.getOpcode() == W65816::LDA_DP &&
+            MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
+            MI.getOperand(0).getImm() == C) {
+          ExitLdaC = &MI;
+          break;
+        }
+      }
+      if (!ExitLdaC) continue;
+
+      // Verify no other references to slots D or C anywhere else in MF.
+      bool extraDRef = false;
+      bool extraCRef = false;
+      for (MachineBasicBlock &OtherMBB : MF) {
+        for (MachineInstr &MI : OtherMBB) {
+          if (&MI == Stx || &MI == IncC || &MI == Lda ||
+              &MI == LdaNeg1 || &MI == StaC || &MI == ExitLdaC) continue;
+          if (MI.getNumOperands() >= 1 && MI.getOperand(0).isImm()) {
+            int64_t Imm = MI.getOperand(0).getImm();
+            unsigned Op = MI.getOpcode();
+            // Catch any DP or DPInd op touching slot D or D+1, C or C+1.
+            switch (Op) {
+            case W65816::LDA_DP: case W65816::STA_DP: case W65816::STZ_DP:
+            case W65816::LDX_DP: case W65816::STX_DP:
+            case W65816::LDY_DP: case W65816::STY_DP:
+            case W65816::ADC_DP: case W65816::SBC_DP:
+            case W65816::AND_DP: case W65816::ORA_DP: case W65816::EOR_DP:
+            case W65816::CMP_DP: case W65816::CPX_DP: case W65816::CPY_DP:
+            case W65816::LSR_DP: case W65816::ROR_DP:
+            case W65816::ASL_DP: case W65816::ROL_DP:
+            case W65816::INC_DP: case W65816::DEC_DP:
+            case W65816::BIT_DP: case W65816::TSB_DP: case W65816::TRB_DP:
+            case W65816::LDA_DPIndY: case W65816::STA_DPIndY:
+            case W65816::LDA_DPInd: case W65816::STA_DPInd:
+              if (Imm == D || Imm == D + 1) extraDRef = true;
+              if (Imm == C || Imm == C + 1) extraCRef = true;
+              break;
+            }
+          }
+        }
+      }
+      if (extraDRef || extraCRef) continue;
+
+      // Perform the transformation.
+      const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+
+      // 1. Preheader: change LDY_Imm16 0 → LDY_Imm16 0xFFFF.
+      Ldy->getOperand(0).setImm(0xFFFF);
+
+      // 2. Preheader: erase LDA #-1 and STA_DP C (counter init dead).
+      LdaNeg1->eraseFromParent();
+      StaC->eraseFromParent();
+
+      // 3. Preheader: insert STX_DP D after TAX (one-time s → D init).
+      auto AfterTax = std::next(MachineBasicBlock::iterator(Tax));
+      BuildMI(*Preheader, AfterTax, Tax->getDebugLoc(),
+              TII->get(W65816::STX_DP)).addImm(D);
+
+      // 4. Body: erase STX_DP D, INX, INC_DP C.
+      Stx->eraseFromParent();
+      Inx->eraseFromParent();
+      IncC->eraseFromParent();
+
+      // 5. Body: insert INY at start (before LDA_DPIndY).
+      BuildMI(MBB, Lda, Lda->getDebugLoc(),
+              TII->get(W65816::INY));
+
+      // 6. Exit: replace LDA_DP C with TYA.
+      BuildMI(*Exit, ExitLdaC, ExitLdaC->getDebugLoc(),
+              TII->get(W65816::TYA));
+      ExitLdaC->eraseFromParent();
+
+      Changed = true;
+    }
+  }
+
   // Run elideStoreForwarding at the very end, AFTER IMG promotion has
   // committed slot assignments. Running this peephole earlier (with
   // the other early peepholes) cascades into different IMG-promotion
diff --git a/src/llvm/lib/Target/W65816/W65816UnLSR.cpp b/src/llvm/lib/Target/W65816/W65816UnLSR.cpp
index f283748..425c334 100644
--- a/src/llvm/lib/Target/W65816/W65816UnLSR.cpp
+++ b/src/llvm/lib/Target/W65816/W65816UnLSR.cpp
@@ -107,14 +107,7 @@ bool W65816UnLSR::runOnFunction(Function &F) {
   for (Loop *L : LI) {
     Changed |= processLoop(L);
     Changed |= processCounterToPtrPHIs(L);
-    // NOTE: processReturnedCounter (strLen-shape counter → ptr-difference
-    // at exit) is correct but produces a NET LOSS on strLen: without the
-    // counter PHI, the i32 pointer arithmetic falls back to clc+adc
-    // chains (16+ cyc/iter) instead of inc-A on the lo half (5 cyc/iter
-    // for ptr update + 5 for counter inc). See feedback memory.
-    // Disabled until codegen can use inc-DP for the lo half of a pointer
-    // PHI's increment without the SDAG materializing a full i32 add.
-    // Recurse into nested loops.
+    // processReturnedCounter remains disabled — see note above.
     SmallVector<Loop *> Worklist(L->begin(), L->end());
     while (!Worklist.empty()) {
       Loop *Sub = Worklist.pop_back_val();
diff --git a/tests/coremark/README.md b/tests/coremark/README.md
index aa01ccc..c84092d 100644
--- a/tests/coremark/README.md
+++ b/tests/coremark/README.md
@@ -64,16 +64,18 @@ binary; the run only works in an unrestricted shell.
Workaround: copy

 ## Size vs Calypsi (5 core files, ITERATIONS=1, PERFORMANCE_RUN)

-| File | Ours (L2+threshold=75) | Calypsi 5.16 | Ratio |
+| File | Ours (L2+threshold=50) | Calypsi 5.16 | Ratio |
 |------|----------------------:|-------------:|------:|
-| core_list_join.o | 10,188 | 9,073 | 1.12× |
-| core_main.o | 11,656 | 19,772 | 0.59× |
-| core_matrix.o | 15,180 | 11,078 | 1.37× |
-| core_state.o | 7,348 | 9,944 | 0.74× |
+| core_list_join.o | 10,008 | 9,073 | 1.10× |
+| core_main.o | 11,588 | 19,772 | 0.59× |
+| core_matrix.o | 10,660 | 11,078 | 0.96× |
+| core_state.o | 7,256 | 9,944 | 0.73× |
 | core_util.o | 3,156 | 4,631 | 0.68× |
-| **TOTAL** | **47,528** | **54,498** | **0.87×** |
+| **TOTAL** | **42,668** | **54,498** | **0.78×** |

-We beat Calypsi by 13% on CoreMark overall.
+We beat Calypsi by 22% on CoreMark overall. (Since the inline-
+threshold dropped from 75 to 50 target-wide, `core_matrix.o` improved
+from 1.37× → 0.96× by no longer inlining 5 nested-loop helpers.)

 ## Notes on the porting layer

diff --git a/tests/lua/build/lapi.o b/tests/lua/build/lapi.o
index d29d349..d65f81d 100644
Binary files a/tests/lua/build/lapi.o and b/tests/lua/build/lapi.o differ
diff --git a/tests/lua/build/lauxlib.o b/tests/lua/build/lauxlib.o
index 4f38c39..4d6f5dd 100644
Binary files a/tests/lua/build/lauxlib.o and b/tests/lua/build/lauxlib.o differ
diff --git a/tests/lua/build/lbaselib.o b/tests/lua/build/lbaselib.o
index 784c99b..90ad192 100644
Binary files a/tests/lua/build/lbaselib.o and b/tests/lua/build/lbaselib.o differ
diff --git a/tests/lua/build/lcode.o b/tests/lua/build/lcode.o
index 80b7c34..6de4e7e 100644
Binary files a/tests/lua/build/lcode.o and b/tests/lua/build/lcode.o differ
diff --git a/tests/lua/build/ldebug.o b/tests/lua/build/ldebug.o
index fb4cf74..3c8f224 100644
Binary files a/tests/lua/build/ldebug.o and b/tests/lua/build/ldebug.o differ
diff --git a/tests/lua/build/ldo.o b/tests/lua/build/ldo.o
index 271bbc2..e7c384e 100644
Binary files a/tests/lua/build/ldo.o and b/tests/lua/build/ldo.o differ
diff --git a/tests/lua/build/ldump.o b/tests/lua/build/ldump.o
index 89722fd..56fde3b 100644
Binary files a/tests/lua/build/ldump.o and b/tests/lua/build/ldump.o differ
diff --git a/tests/lua/build/lfunc.o b/tests/lua/build/lfunc.o
index 6fadc2c..455589a 100644
Binary files a/tests/lua/build/lfunc.o and b/tests/lua/build/lfunc.o differ
diff --git a/tests/lua/build/lgc.o b/tests/lua/build/lgc.o
index 9ad5893..2427abf 100644
Binary files a/tests/lua/build/lgc.o and b/tests/lua/build/lgc.o differ
diff --git a/tests/lua/build/llex.o b/tests/lua/build/llex.o
index 7f75882..82cab69 100644
Binary files a/tests/lua/build/llex.o and b/tests/lua/build/llex.o differ
diff --git a/tests/lua/build/lmathlib.o b/tests/lua/build/lmathlib.o
index d2d40de..d243120 100644
Binary files a/tests/lua/build/lmathlib.o and b/tests/lua/build/lmathlib.o differ
diff --git a/tests/lua/build/lmem.o b/tests/lua/build/lmem.o
index 2b1c88e..5ab5667 100644
Binary files a/tests/lua/build/lmem.o and b/tests/lua/build/lmem.o differ
diff --git a/tests/lua/build/lobject.o b/tests/lua/build/lobject.o
index 485a1f1..eeb98ea 100644
Binary files a/tests/lua/build/lobject.o and b/tests/lua/build/lobject.o differ
diff --git a/tests/lua/build/lparser.o b/tests/lua/build/lparser.o
index 0c4b465..77ecdcf 100644
Binary files a/tests/lua/build/lparser.o and b/tests/lua/build/lparser.o differ
diff --git a/tests/lua/build/lstate.o b/tests/lua/build/lstate.o
index 5d47fa8..15d9699 100644
Binary files a/tests/lua/build/lstate.o and b/tests/lua/build/lstate.o differ
diff --git a/tests/lua/build/lstring.o b/tests/lua/build/lstring.o
index 7b41d58..02d7fac 100644
Binary files a/tests/lua/build/lstring.o and b/tests/lua/build/lstring.o differ
diff --git a/tests/lua/build/lstrlib.o b/tests/lua/build/lstrlib.o
index 1acae5b..2a704f2 100644
Binary files a/tests/lua/build/lstrlib.o and b/tests/lua/build/lstrlib.o differ
diff --git a/tests/lua/build/ltable.o b/tests/lua/build/ltable.o
index 9e7a76d..a132314 100644
Binary files a/tests/lua/build/ltable.o and b/tests/lua/build/ltable.o differ
diff --git a/tests/lua/build/ltablib.o b/tests/lua/build/ltablib.o
index 2145cb5..487b4c5 100644
Binary files a/tests/lua/build/ltablib.o and b/tests/lua/build/ltablib.o differ
diff --git a/tests/lua/build/ltm.o b/tests/lua/build/ltm.o
index 7be0167..6a4740f 100644
Binary files a/tests/lua/build/ltm.o and b/tests/lua/build/ltm.o differ
diff --git a/tests/lua/build/luaStubs.o b/tests/lua/build/luaStubs.o
index 3db5494..4e6cef9 100644
Binary files a/tests/lua/build/luaStubs.o and b/tests/lua/build/luaStubs.o differ
diff --git a/tests/lua/build/lundump.o b/tests/lua/build/lundump.o
index 519f5c0..ad94a74 100644
Binary files a/tests/lua/build/lundump.o and b/tests/lua/build/lundump.o differ
diff --git a/tests/lua/build/lvm.o b/tests/lua/build/lvm.o
index cb63580..f1bc93a 100644
Binary files a/tests/lua/build/lvm.o and b/tests/lua/build/lvm.o differ
diff --git a/tests/lua/build/lzio.o b/tests/lua/build/lzio.o
index f6e082d..04c2ec1 100644
Binary files a/tests/lua/build/lzio.o and b/tests/lua/build/lzio.o differ
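
Reviewer note: the Y-as-counter rewrite in the peephole hunk earlier in this diff changes both the loop body and the exit value, so it may help to sanity-check the equivalence on the host. The following is a minimal sketch, not code from the tree: `strLenBefore`, `strLenAfter`, and `makeMem` are hypothetical names, memory is modeled as a flat 64 KiB byte array, and bank/DBR effects and the high byte of the 16-bit load (masked off by `AND #$ff`) are ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: 64 KiB zero-filled memory with `text` copied to `addr`.
// Assumes addr + strlen(text) stays below 65536.
static std::vector<uint8_t> makeMem(const char *text, uint16_t addr) {
  std::vector<uint8_t> mem(65536, 0);
  std::memcpy(mem.data() + addr, text, std::strlen(text));
  return mem;
}

// Model of the loop BEFORE the rewrite: X iterates, C counts, Y stays 0.
static uint16_t strLenBefore(const uint8_t *mem, uint16_t s) {
  uint16_t x = s;       // TAX
  uint16_t y = 0;       // LDY #0
  uint16_t c = 0xFFFF;  // LDA #-1 ; STA_DP C
  uint16_t d = 0;
  uint8_t a = 0;
  do {
    d = x;                                           // STX_DP D
    ++x;                                             // INX
    ++c;                                             // INC_DP C (wraps 0xFFFF -> 0)
    a = mem[static_cast<uint16_t>(d + y)];           // LDA (D),Y ; AND #$ff
  } while (a != 0);                                  // BNE loop
  return c;                                          // exit: LDA_DP C
}

// Model of the loop AFTER the rewrite: D is written once, Y is offset and counter.
static uint16_t strLenAfter(const uint8_t *mem, uint16_t s) {
  uint16_t d = s;       // TAX ; STX_DP D (one-time init in preheader)
  uint16_t y = 0xFFFF;  // LDY #$FFFF
  uint8_t a = 0;
  do {
    ++y;                                             // INY (wraps 0xFFFF -> 0)
    a = mem[static_cast<uint16_t>(d + y)];           // LDA (D),Y ; AND #$ff
  } while (a != 0);                                  // BNE loop
  return y;                                          // exit: TYA
}
```

Calling both functions on the same buffer and start address should return the same length, including for an empty string, where each loop runs exactly once and the wrapped counter ends at 0.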