More optimizations.
parent 2bec85ffa5
commit 25a67901a6
31 changed files with 991 additions and 83 deletions
22
README.md
@@ -76,20 +76,22 @@ against commercial Calypsi 5.16, measured under MAME via `emu.time()`
 
 | Benchmark | Ours | Calypsi | Ratio |
 |---|---:|---:|---:|
-| bsearch | 682 | 2,387 | **0.29×** ✓ |
 | dotProduct | 1,534 | 5,712 | **0.27×** ✓ |
+| bsearch | 682 | 2,387 | **0.29×** ✓ |
 | sumOfSquares | 6,820 | 16,368 | **0.42×** ✓ |
 | bubbleSort | 11,594 | 17,050 | **0.68×** ✓ |
-| djb2Hash | 2,387 | 2,643 | **0.90×** ✓ |
-| memcmp | 716 | 716 | **1.00×** |
-| strcpy | 1,279 | 1,194 | 1.07× |
-| popcount | 1,705 | 1,534 | 1.11× |
-| fib | 12,106 | 10,912 | 1.11× |
-| strLen | 1,876 | 1,023 | 1.83× |
+| strLen | 767 | 1,023 | **0.75×** ✓ |
+| djb2Hash | 2,046 | 2,643 | **0.77×** ✓ |
+| popcount | 1,194 | 1,534 | **0.78×** ✓ |
+| strcpy | 1,108 | 1,194 | **0.93×** ✓ |
+| memcmp | 682 | 716 | **0.95×** ✓ |
+| fib | 11,594 | 10,912 | 1.06× |
 
-**Geomean: 0.74× Calypsi** across this suite. Six of ten benches beat
-Calypsi outright; one ties exactly. Run `scripts/benchCyclesPrecise.sh`
-(ours) and `scripts/benchCyclesCalypsi.sh` (Calypsi) to reproduce.
+**Geomean: 0.62× Calypsi** across this suite. Nine of ten benches beat
+Calypsi outright; only fib trails at 1.06×. Run
+`scripts/benchCyclesPrecise.sh` (ours, with
+`W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs"`) and
+`scripts/benchCyclesCalypsi.sh` (Calypsi) to reproduce.
 
 On real programs:
 - **Lua 5.1.5** (17K LoC, 24 source files) compiles + links clean.
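Reviewer note (not part of the commit): the geomean quoted in the new README text can be re-derived from the ten Ours/Calypsi pairs in the updated table. The sketch below is illustrative only; `kBench` and `geomeanRatio` are hypothetical names, and the cycle counts are copied from the table. With these inputs the value comes out at roughly 0.626, in line with the quoted 0.62× up to rounding.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Per-call cycle counts (ours, Calypsi) copied from the updated README table.
static const std::vector<std::pair<double, double>> kBench = {
    {1534, 5712},   // dotProduct
    {682, 2387},    // bsearch
    {6820, 16368},  // sumOfSquares
    {11594, 17050}, // bubbleSort
    {767, 1023},    // strLen
    {2046, 2643},   // djb2Hash
    {1194, 1534},   // popcount
    {1108, 1194},   // strcpy
    {682, 716},     // memcmp
    {11594, 10912}, // fib
};

// Geometric mean of the ours/Calypsi ratios, accumulated in log space.
double geomeanRatio(const std::vector<std::pair<double, double>> &rows) {
  double logSum = 0.0;
  for (const auto &row : rows)
    logSum += std::log(row.first / row.second);
  return std::exp(logSum / static_cast<double>(rows.size()));
}
```

Calling `geomeanRatio(kBench)` gives the suite-wide ratio; values below 1.0 mean fewer cycles than Calypsi.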
@@ -1,4 +1,4 @@
-# Session Recovery — last updated 2026-05-25
+# Session Recovery — last updated 2026-05-27
 
 Living recovery doc. Update on every meaningful change. If session is lost,
 read this top-to-bottom + the memory notes referenced inside, then reread
@@ -12,10 +12,12 @@ the actual diffs in tree to ground assumptions.
 on JSL, greedy regalloc at -O1+. **Inline-threshold lowered to 50
 target-wide** (was LLVM default 225; was 75 earlier this session).
 - **Branch**: `main`.
-- **vs Calypsi (2026-05-25)**:
-- **Lua 5.1.5**: default config 1.13× Calypsi; with Layer 2 0.93× (we
-beat by 7%).
-- **CoreMark 1.0**: with Layer 2 **0.79× Calypsi (we beat by 21%)**.
+- **vs Calypsi (2026-05-27)** — Layer 2 + recent peepholes:
+- **Cycle benches geomean**: **0.62× Calypsi**. 9 of 10 below 1.0×;
+only `fib` trails at 1.06× (recursive overhead, structural). See
+cycle bench table below.
+- **Lua 5.1.5**: default config 1.13× Calypsi; with Layer 2 0.93×.
+- **CoreMark 1.0**: with Layer 2 0.79× Calypsi (we beat by 21%).
 - **vs Calypsi static-inst ratio (synthetic bench)**:
 sumSquares **0.84×** (26 vs 31 — we beat),
 mul16to32 **0.25×** (1 vs 4 — we beat),
@@ -31,12 +33,39 @@ the actual diffs in tree to ground assumptions.
 - Inline-threshold lowered to 50 (was 225). Lua -23% total, CoreMark
 matrix.o 1.37× → 0.97× Calypsi. Override with
 `-mllvm -inline-threshold=N`.
-- **Cycle benches (2026-05-20)**:
-popcount 93, strcpy 91, bsearch 127, memcmp 113, fib 97,
-dotProduct 144, sumOfSquares 126 cyc/iter (100 iters);
-dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters);
-particles 2253 cyc/iter (3 iters), mandelbrot 11570 cyc/iter (1 iter).
-- **Recent session wins (2026-05-20)**:
+- **Cycle benches per-call (2026-05-27, Layer 2)** — via
+`scripts/benchCyclesPrecise.sh` vs `scripts/benchCyclesCalypsi.sh`:
+```
+Bench          Ours   Calypsi   Ratio
+dotProduct     1534      5712   0.27×
+bsearch         682      2387   0.29×
+sumOfSquares   6820     16368   0.42×
+bubbleSort    11594     17050   0.68×
+strLen          767      1023   0.75×
+djb2Hash       2046      2643   0.77×
+popcount       1194      1534   0.78×
+strcpy         1108      1194   0.93×
+memcmp          682       716   0.95×
+fib           11594     10912   1.06×
+```
+Geomean **0.62×**. Older HBL-tick numbers (per-iter, 100 iter loops)
+from `benchCycles.sh` are still available but lower resolution.
+- **Recent session wins (2026-05-27)**:
+- **Y-as-counter for strLen** — structural rewrite: drop STX/INX/INC,
+use Y as offset AND counter. strLen 1279 → 767 cyc (-40%); 0.75×
+Calypsi (was 1.25×).
+- **Stack-rel dead-store elim** — companion to DP version with SP
+tracking across PHA/PHP/PEA/PEI/PER/PLA/PLP/PLX/PLY/PHX/PHY.
+strcpy 1194 → 1108 (-7%, 0.93× Calypsi, beats by 7%). Refactored
+as a static helper called from the recursive-call bail too so fib
+gets it. fib 12106 → 11594 (-4%, 1.06× Calypsi).
+- **DP-indirect-Y for iter** (follow-on to X-iter peephole): rewrites
+`TXA;STA stack-rel S;INX;…;LDA (S,s),Y` to `STX_DP D;INX;…;LDA
+(D),Y`. Saves 4 cyc/iter.
+- **Dead INC_HI_IF_CARRY elim** — when the StackRel ptr-hi slot is
+never read, elide the carry-bookkeeping for Layer 2 ptr32 loops.
+Wide impact across strLen/strcpy/djb2Hash/memcmp.
+- **Recent session wins (earlier — 2026-05-20)**:
 - 8 always-on peepholes + extended phase 4 in W65816StackRelToImg
 (evalAt 498→472, fib -35%, 35 libc fns shrunk)
 - __muldi3 32-bit short-circuit (dmul 1605→1033, -36%)
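Reviewer note (not part of the commit): the "Y-as-counter for strLen" entry describes using a single register as both the byte offset and the running length. At the source level this roughly corresponds to the difference between the two loop shapes sketched below. The function names are hypothetical and the code is plain portable C++, not the backend's output; it only illustrates why one induction variable can replace two.

```cpp
#include <cassert>
#include <cstddef>

// Pointer walk with a separate counter: two values are updated per iteration.
std::size_t strLenSeparate(const char *s) {
  std::size_t n = 0;
  while (*s != '\0') {
    ++s; // advance the pointer
    ++n; // advance the count
  }
  return n;
}

// One index serves as both the offset into the string and the result,
// analogous to keeping it in the Y register for an indexed load.
std::size_t strLenShared(const char *s) {
  std::size_t y = 0;
  while (s[y] != '\0')
    ++y;
  return y;
}
```

Both functions return the same length for any NUL-terminated string; the second form simply has less per-iteration bookkeeping to lower.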
36
STATUS.md
@@ -244,26 +244,34 @@ which runs correctly under MAME (apple2gs).
 + dispatch + chained collisions over fprintf-to-mfs),
 scripts/bench.sh size-vs-Calypsi harness. 100% pass.
 
-- `scripts/benchCycles.sh` measures per-iteration cycle counts via
-MAME's emulated HBL counter. 13 benchmarks under `benchmarks/`
-(8 int micro + 3 soft-FP + 2 "game-like": particles, mandelbrot).
-Current numbers (2026-05-20):
-bsearch 127, crc32 <65, dotProduct 144, fib 97, memcmp 113,
-popcount 93, strcpy 91, sumOfSquares 126 cyc/iter (100 iters);
-dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters);
-particles 2253 cyc/iter (3 iters — 32-particle physics tick);
-mandelbrot 11570 cyc/iter (1 iter — 4×4 fixed-point tile, max 8
-Mandelbrot iters). Speed is the optimization priority, not size.
+- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts via
+MAME's `emu.time()` between A1A1/A2A2 markers. Runs vs commercial
+Calypsi 5.16 (`scripts/benchCyclesCalypsi.sh`) for an apples-to-
+apples speed comparison. Current numbers (2026-05-27, Layer 2):
+
+| Bench        |  Ours | Calypsi |  Ratio |
+|--------------|------:|--------:|-------:|
+| dotProduct   |  1534 |    5712 |  0.27× |
+| bsearch      |   682 |    2387 |  0.29× |
+| sumOfSquares |  6820 |   16368 |  0.42× |
+| bubbleSort   | 11594 |   17050 |  0.68× |
+| strLen       |   767 |    1023 |  0.75× |
+| djb2Hash     |  2046 |    2643 |  0.77× |
+| popcount     |  1194 |    1534 |  0.78× |
+| strcpy       |  1108 |    1194 |  0.93× |
+| memcmp       |   682 |     716 |  0.95× |
+| fib          | 11594 |   10912 |  1.06× |
+
+**Geomean: 0.62× Calypsi.** 9 of 10 below 1.0×; only fib trails
+(recursive call overhead, structural). Speed is the optimization
+priority, not size.
 
 - `compare/` holds three side-by-side C tests with our asm and
 Calypsi's listing for static-size comparison:
 `sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh`
 recompiles each under both `clang --target=w65816 -O2 -S` and
 `cc65816 --speed -O 2 --64bit-doubles` and prints an
-ours/Calypsi instruction-count ratio. Current ratios (2026-05-20):
-sumSquares **0.84×** (26 inst — we beat Calypsi's 31),
-evalAt 1.86× (472 inst), mul16to32 **0.25×** (1 inst — we beat
-Calypsi's 4). See `compare/README.md`.
+ours/Calypsi instruction-count ratio. See `compare/README.md`.
 
 **Backend register allocation:**
 
@@ -817,39 +817,40 @@ Useful pass names to filter on:
 
 ## Cycle-count benchmarks
 
-13 microbenchmarks live under [`benchmarks/`](../benchmarks/) — eight
-integer/string micro-benches, three soft-double FP benches (`dadd`,
-`dmul`, `ddiv`), and two "game-like" workloads: `particles` (32-particle
-physics tick with i16 bounce/wall collision) and `mandelbrot` (4×4
-fixed-point Mandelbrot tile exercising i32 multiply and conditional
-control flow).
+Microbenchmarks live under [`benchmarks/`](../benchmarks/) — integer/
+string micro-benches plus soft-double FP benches.
 
 ```bash
-bash scripts/benchCycles.sh
+W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs" bash scripts/benchCyclesPrecise.sh
 ```
 
-Output (2026-05-21):
+This measures per-call cycle counts via MAME's `emu.time()` between
+markers — apples-to-apples vs the matching
+`scripts/benchCyclesCalypsi.sh` runner (commercial Calypsi 5.16).
+Current ratios (2026-05-27, Layer 2):
 
 ```
-| Benchmark | Per-iteration cycles |
-|-----------|---------------------:|
-| bsearch | 127 cyc/iter (100 iters) |
-| crc32 | <65 (under timer resolution) |
-| dadd | 1157 cyc/iter (10 iters) |
-| ddiv | 1261 cyc/iter (10 iters) |
-| dmul | 1033 cyc/iter (10 iters) |
-| dotProduct | 144 cyc/iter (100 iters) |
-| fib | 97 cyc/iter (100 iters) |
-| mandelbrot | 11570 cyc/iter (1 iter, GRID=4 MAX_ITER=8) |
-| memcmp | 113 cyc/iter (100 iters) |
-| particles | 2253 cyc/iter (3 iters, N=32) |
-| popcount | 93 cyc/iter (100 iters) |
-| strcpy | 91 cyc/iter (100 iters) |
-| sumOfSquares | 126 cyc/iter (100 iters) |
+| Benchmark    |  Ours | Calypsi | Ratio |
+|--------------|------:|--------:|------:|
+| dotProduct   |  1534 |    5712 | 0.27× |
+| bsearch      |   682 |    2387 | 0.29× |
+| sumOfSquares |  6820 |   16368 | 0.42× |
+| bubbleSort   | 11594 |   17050 | 0.68× |
+| strLen       |   767 |    1023 | 0.75× |
+| djb2Hash     |  2046 |    2643 | 0.77× |
+| popcount     |  1194 |    1534 | 0.78× |
+| strcpy       |  1108 |    1194 | 0.93× |
+| memcmp       |   682 |     716 | 0.95× |
+| fib          | 11594 |   10912 | 1.06× |
 ```
 
-The legacy `scripts/benchCyclesPrecise.sh` (per-call cycle count via
-`emu.time()`) is still available but slower to run.
+**Geomean: 0.62× Calypsi.** 9 of 10 below 1.0×. The Layer 2 flag
+(`-mllvm -w65816-dbr-safe-ptrs`) enables stack-rel-indirect-Y ptr32
+derefs — required for parity since Calypsi's pointer ABI assumes
+DBR matches the pointer's bank.
+
+The `scripts/benchCycles.sh` (HBL-tick-based) script is still around
+but lower-resolution. Prefer the `Precise` runner above.
 
 The [`compare/`](../compare/) directory has side-by-side `.s` files vs
 Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:
@@ -596,6 +596,83 @@ static bool elideRedundantLdaAfterPha(MachineFunction &MF) {
 //
 // The first STA's value is shadowed by the second. Drop it.
 // Saves 1 instruction (3 bytes / 5 cyc) per match.
+static bool elideStackRelDeadStore(MachineFunction &MF) {
+  bool Changed = false;
+  auto isStackRelRead = [](unsigned Op) {
+    switch (Op) {
+    case W65816::LDA_StackRel: case W65816::ADC_StackRel:
+    case W65816::SBC_StackRel: case W65816::AND_StackRel:
+    case W65816::ORA_StackRel: case W65816::EOR_StackRel:
+    case W65816::CMP_StackRel:
+      return true;
+    }
+    return false;
+  };
+  auto isStackRelIndY = [](unsigned Op) {
+    switch (Op) {
+    case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
+    case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
+    case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
+    case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
+      return true;
+    }
+    return false;
+  };
+  for (MachineBasicBlock &MBB : MF) {
+    SmallVector<MachineInstr *, 8> ToErase;
+    SmallPtrSet<MachineInstr *, 8> ErasedSet;
+    for (auto It = MBB.begin(); It != MBB.end(); ++It) {
+      if (ErasedSet.count(&*It)) continue;
+      if (It->getOpcode() != W65816::STA_StackRel) continue;
+      if (It->getNumOperands() < 1 || !It->getOperand(0).isImm()) continue;
+      int64_t OrigSlot = It->getOperand(0).getImm();
+      int64_t SpAdj = 0;
+      auto Walk = std::next(It);
+      while (Walk != MBB.end()) {
+        if (Walk->isDebugInstr()) { ++Walk; continue; }
+        if (Walk->isBranch() || Walk->isCall() || Walk->isReturn() ||
+            Walk->isInlineAsm()) break;
+        unsigned WO = Walk->getOpcode();
+        switch (WO) {
+        case W65816::PHA: case W65816::PHX: case W65816::PHY:
+        case W65816::PEA: case W65816::PEI_DP: case W65816::PER:
+          SpAdj -= 2; ++Walk; continue;
+        case W65816::PLA: case W65816::PLX: case W65816::PLY:
+          SpAdj += 2; ++Walk; continue;
+        case W65816::PHP:
+          SpAdj -= 1; ++Walk; continue;
+        case W65816::PLP:
+          SpAdj += 1; ++Walk; continue;
+        }
+        if (WO == W65816::STA_StackRel &&
+            Walk->getNumOperands() >= 1 && Walk->getOperand(0).isImm() &&
+            Walk->getOperand(0).getImm() + SpAdj == OrigSlot) {
+          ToErase.push_back(&*It);
+          ErasedSet.insert(&*It);
+          break;
+        }
+        if (Walk->getNumOperands() >= 1 && Walk->getOperand(0).isImm()) {
+          int64_t Imm = Walk->getOperand(0).getImm();
+          if (isStackRelRead(WO) || WO == W65816::STA_StackRel) {
+            if (Imm + SpAdj == OrigSlot) break;
+          }
+          if (isStackRelIndY(WO)) {
+            if (Imm + SpAdj == OrigSlot || Imm + SpAdj + 1 == OrigSlot)
+              break;
+          }
+        }
+        ++Walk;
+      }
+    }
+    for (MachineInstr *MI : ToErase) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+  return Changed;
+}
+
+
 static bool elideDeadStaCarry(MachineFunction &MF) {
   bool Changed = false;
   for (MachineBasicBlock &MBB : MF) {
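Reviewer note (not part of the commit): the SP-tracking idea in `elideStackRelDeadStore` can be exercised without LLVM. The sketch below models a basic block as a vector of toy instructions and applies the same rule: a stack-relative store is dead if a later store hits the same physical slot before any read of it, where "same slot" is judged after correcting each later offset by the bytes pushed or pulled in between. The `Op` enum, `Inst` struct, and `elideDeadStackStores` are hypothetical stand-ins for the real opcodes and pass; indirect-Y accesses are omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy opcodes (assumed for illustration; not the W65816 backend's enum).
enum class Op { StaStackRel, LdaStackRel, Push16, Pull16, Push8, Pull8, Barrier, Other };

struct Inst {
  Op op;
  int slot; // stack-relative offset; only meaningful for Sta/Lda
};

// Returns the block with dead stack-relative stores removed.
std::vector<Inst> elideDeadStackStores(const std::vector<Inst> &block) {
  std::vector<bool> dead(block.size(), false);
  for (std::size_t i = 0; i < block.size(); ++i) {
    if (block[i].op != Op::StaStackRel) continue;
    const int origSlot = block[i].slot;
    int spAdj = 0; // bytes SP has moved since the store (negative after a push)
    for (std::size_t j = i + 1; j < block.size(); ++j) {
      const Inst &w = block[j];
      bool stop = false;
      switch (w.op) {
      case Op::Push16: spAdj -= 2; break;
      case Op::Pull16: spAdj += 2; break;
      case Op::Push8:  spAdj -= 1; break;
      case Op::Pull8:  spAdj += 1; break;
      case Op::Barrier: stop = true; break; // branch/call/return: give up
      case Op::LdaStackRel:
        if (w.slot + spAdj == origSlot) stop = true; // value is read: keep store
        break;
      case Op::StaStackRel:
        if (w.slot + spAdj == origSlot) { dead[i] = true; stop = true; }
        break;
      case Op::Other: break;
      }
      if (stop) break;
    }
  }
  std::vector<Inst> out;
  for (std::size_t i = 0; i < block.size(); ++i)
    if (!dead[i]) out.push_back(block[i]);
  return out;
}
```

For example, a store to offset 1 followed by a 16-bit push and a store to offset 3 targets the same physical bytes, so the first store is dropped; a store to offset 1 after the same push is a different slot and leaves the first store in place.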
@@ -661,19 +738,25 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
   for (MachineBasicBlock &MBB : MF) {
     for (MachineInstr &MI : MBB) {
       if (!MI.isCall()) continue;
-      if (!isImgSafeCall(MI)) {
-        ChangedEarly |= elideStoreForwarding(MF);
-        return ChangedEarly;
-      }
+      // Check: recursive self-call first. Apply stack-rel dead-store
+      // elim here since IMG promotion can't run (recursion clobbers
+      // IMG slots across the inner call).
+      bool isSelfCall = false;
       for (const MachineOperand &MO : MI.operands()) {
         StringRef Name;
         if (MO.isGlobal()) Name = MO.getGlobal()->getName();
         else if (MO.isSymbol()) Name = MO.getSymbolName();
         else continue;
-        if (Name == SelfName) {
-          ChangedEarly |= elideStoreForwarding(MF);
-          return ChangedEarly;
-        }
+        if (Name == SelfName) { isSelfCall = true; break; }
+      }
+      if (isSelfCall) {
+        ChangedEarly |= elideStackRelDeadStore(MF);
+        ChangedEarly |= elideStoreForwarding(MF);
+        return ChangedEarly;
+      }
+      if (!isImgSafeCall(MI)) {
+        ChangedEarly |= elideStoreForwarding(MF);
+        return ChangedEarly;
       }
     }
   }
@@ -2319,6 +2402,10 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
     }
   }
 
+  // Stack-rel dead-store elim — called from the always-on section too
+  // (so it benefits recursive / non-IMG-promotable functions like fib).
+  Changed |= elideStackRelDeadStore(MF);
+
   // DP-slot zero-check bridge via X. Pattern:
   //   [op that sets Z on A]
   //   STA_DP slot
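Reviewer note (not part of the commit): the popcount peephole in the next hunk relies on the bit shifted out by the `LSR hi` / `ROR lo` pair being exactly `x & 1`, so `ADC #0` can accumulate it straight from the carry flag. The sketch below models that on two 16-bit halves and compares it against the plain mask-and-add loop. Function names are hypothetical, and the carry flag is modelled as an ordinary variable.

```cpp
#include <cassert>
#include <cstdint>

// Reference shape: n += x & 1; x >>= 1.
unsigned popcountAndMask(std::uint32_t x) {
  unsigned n = 0;
  while (x != 0) {
    n += x & 1u;
    x >>= 1;
  }
  return n;
}

// Carry-based shape: models LSR hi / ROR lo / ADC #0 on 16-bit halves.
unsigned popcountViaCarry(std::uint32_t x) {
  std::uint16_t hi = static_cast<std::uint16_t>(x >> 16);
  std::uint16_t lo = static_cast<std::uint16_t>(x & 0xFFFFu);
  unsigned n = 0;
  while ((lo | hi) != 0) {
    unsigned carry = hi & 1u;                    // LSR hi: old bit 0 -> C
    hi = static_cast<std::uint16_t>(hi >> 1);
    unsigned shiftedOut = lo & 1u;               // ROR lo: old bit 0 -> C
    lo = static_cast<std::uint16_t>((lo >> 1) | (carry << 15));
    carry = shiftedOut;
    n += carry;                                  // ADC #0: n + 0 + C
  }
  return n;
}
```

The two functions agree for every 32-bit input, which is the property the rewrite depends on when it drops the separate `AND #1` and the lagged copy of `x`.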
@@ -2441,6 +2528,792 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) {
     }
   }
 
+  // popcount-style "n += (x & 1)" combined with `x >>= 1` LSR/ROR pair:
+  // use the C flag set by ROR_DP directly via `ADC #0` (which adds C +
+  // 0 + n = bit + n). Eliminates the `LDA x_orig ; AND #1 ; ... ; CLC
+  // ; ADC n` sequence and the lagged-PHI copy at end of body. Big win
+  // for popcount's specific shape.
+  //
+  // Pattern (post-shift-fold + dp-zero-check-tax-bridge):
+  //   LSR_DP X_hi
+  //   ROR_DP X_lo      ; C = old bit 0 of x_lo = the BIT we want
+  //   LDA_DP X_lo      ; (zero-check setup)
+  //   ORA_DP X_hi
+  //   TAX              ; preserve Z across the AND chain
+  //   LDA_DP X_orig    ; current x_lo (lagged PHI)
+  //   ANDi16imm 1      ; bit = x_orig & 1 (== C from ROR)
+  //   STA_DP X_orig    ; dead store (overwritten by PHI-copy below)
+  //   CLC              ; ← kills our preserved C
+  //   ADC_DP N         ; n += bit (but ADC reads C=0 + bit)
+  //   STA_DP N
+  //   LDA_DP X_lo      ; PHI copy start
+  //   STA_DP X_orig    ; x_orig = x_lo NEW
+  //   TXA              ; restore Z
+  //   BNE loop
+  //
+  // Rewrite:
+  //   LSR_DP X_hi
+  //   ROR_DP X_lo      ; C = bit
+  //   LDA_DP N
+  //   ADC_Imm 0        ; n + 0 + C = n + bit
+  //   STA_DP N
+  //   LDA_DP X_lo
+  //   ORA_DP X_hi
+  //   BNE loop
+  //
+  // Erased: TAX, LDA X_orig, AND #1, STA X_orig, CLC, LDA X_lo PHI-copy,
+  // STA X_orig PHI-copy, TXA. Plus x_orig itself becomes dead.
+  //
+  // Saves ~25 cyc/iter on popcount. 29 iters → ~725 cyc. popcount
+  // 1705 → ~1000. Calypsi is 1534, so this would BEAT Calypsi.
+  for (MachineBasicBlock &MBB : MF) {
+    SmallVector<MachineInstr *, 12> ToErase;
+    for (auto It = MBB.begin(); It != MBB.end();) {
+      auto Lsr = It++;
+      if (Lsr->getOpcode() != W65816::LSR_DP) continue;
+      if (Lsr->getNumOperands() < 1 || !Lsr->getOperand(0).isImm()) continue;
+      int64_t HiAddr = Lsr->getOperand(0).getImm();
+      auto skipDbg = [&](auto &P) {
+        while (P != MBB.end() && P->isDebugInstr()) ++P;
+      };
+      auto Ror = std::next(Lsr); skipDbg(Ror);
+      if (Ror == MBB.end() || Ror->getOpcode() != W65816::ROR_DP) continue;
+      if (Ror->getNumOperands() < 1 || !Ror->getOperand(0).isImm()) continue;
+      int64_t LoAddr = Ror->getOperand(0).getImm();
+      auto P = std::next(Ror); skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP) continue;
+      if (P->getOperand(0).getImm() != LoAddr) continue;
+      auto LdaLo1 = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::ORA_DP) continue;
+      if (P->getOperand(0).getImm() != HiAddr) continue;
+      auto OraHi = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::TAX) continue;
+      auto Tax = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP) continue;
+      int64_t OrigAddr = P->getOperand(0).getImm();
+      if (OrigAddr == LoAddr || OrigAddr == HiAddr) continue;
+      auto LdaOrig = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::ANDi16imm) continue;
+      // Verify AND operand is 1.
+      bool isAnd1 = false;
+      for (const MachineOperand &MO : P->operands()) {
+        if (MO.isImm() && MO.getImm() == 1) { isAnd1 = true; break; }
+      }
+      if (!isAnd1) continue;
+      auto AndOne = P; ++P; skipDbg(P);
+      // Optional STA_DP X_orig (dead store; may already have been
+      // erased by DP-dead-store-elim).
+      MachineInstr *StaOrig1 = nullptr;
+      if (P != MBB.end() && P->getOpcode() == W65816::STA_DP &&
+          P->getOperand(0).isImm() && P->getOperand(0).getImm() == OrigAddr) {
+        StaOrig1 = &*P; ++P; skipDbg(P);
+      }
+      if (P == MBB.end() || P->getOpcode() != W65816::CLC) continue;
+      auto Clc = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::ADC_DP) continue;
+      int64_t NAddr = P->getOperand(0).getImm();
+      auto AdcN = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::STA_DP ||
+          P->getOperand(0).getImm() != NAddr) continue;
+      auto StaN = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP ||
+          P->getOperand(0).getImm() != LoAddr) continue;
+      auto LdaLo2 = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::STA_DP ||
+          P->getOperand(0).getImm() != OrigAddr) continue;
+      auto StaOrig2 = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::TXA) continue;
+      auto Txa = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::BNE) continue;
+      // (Bne stays; we just need to put a fresh ORA + BNE-using-Z form.)
+
+      const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+      // After ROR, insert: LDA N ; ADC #0 ; STA N ; LDA X_lo ; ORA X_hi
+      auto Insert = std::next(MachineBasicBlock::iterator(Ror));
+      DebugLoc DL = Ror->getDebugLoc();
+      BuildMI(MBB, Insert, DL, TII->get(W65816::LDA_DP)).addImm(NAddr);
+      BuildMI(MBB, Insert, DL, TII->get(W65816::ADC_Imm16)).addImm(0);
+      BuildMI(MBB, Insert, DL, TII->get(W65816::STA_DP)).addImm(NAddr);
+      BuildMI(MBB, Insert, DL, TII->get(W65816::LDA_DP)).addImm(LoAddr);
+      BuildMI(MBB, Insert, DL, TII->get(W65816::ORA_DP)).addImm(HiAddr);
+      // Erase the old sequence.
+      ToErase.push_back(&*LdaLo1);
+      ToErase.push_back(&*OraHi);
+      ToErase.push_back(&*Tax);
+      ToErase.push_back(&*LdaOrig);
+      ToErase.push_back(&*AndOne);
+      if (StaOrig1) ToErase.push_back(StaOrig1);
+      ToErase.push_back(&*Clc);
+      ToErase.push_back(&*AdcN);
+      ToErase.push_back(&*StaN);
+      ToErase.push_back(&*LdaLo2);
+      ToErase.push_back(&*StaOrig2);
+      ToErase.push_back(&*Txa);
+      // BNE stays — but uses Z from our new ORA.
+      It = std::next(MachineBasicBlock::iterator(Ror));
+    }
+    for (MachineInstr *MI : ToErase) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+
+  // X-register iter peephole. In a self-loop body whose pointer-walk
+  // iter is held in a DP slot, replace the per-iter
+  //   LDA_DP X1         ; reload iter from DP (3 cyc)
+  //   STA_StackRel S    ; copy to lagged slot (5 cyc)
+  //   INA_PSEUDO        ; iter++ (2 cyc)
+  //   STA_DP X1         ; store back to DP (3 cyc)
+  // chain with the iter held in the X register across iters:
+  //   TXA               ; A = X = OLD iter (2 cyc)
+  //   STA_StackRel S    ; copy to lagged slot (5 cyc)
+  //   INX               ; X = NEW iter (2 cyc)
+  // Saves 4 cyc/iter (13 → 9). Targets strLen-shape loops. The
+  // preheader gets an LDA_DP X1; TAX inserted to seed X.
+  //
+  // Safety conditions:
+  //   - MBB is a self-loop (MBB ∈ MBB.predecessors()).
+  //   - DP slot X1 is not referenced ANYWHERE in the function outside
+  //     this pattern (else our drop of STA_DP X1 corrupts other reads).
+  //   - X register is dead in the MBB (no TAX/INX/etc.) AND not live-in
+  //     to any successor MBB.
+  //   - Preheader exists (= a non-self predecessor we can insert into).
+  for (MachineBasicBlock &MBB : MF) {
+    bool selfLoop = false;
+    MachineBasicBlock *Preheader = nullptr;
+    for (MachineBasicBlock *Pred : MBB.predecessors()) {
+      if (Pred == &MBB) selfLoop = true;
+      else Preheader = Pred;
+    }
+    if (!selfLoop || !Preheader) continue;
+    // Successors must not have X live-in (we'll clobber X to use it
+    // as iter). The self-loop successor (= MBB itself) is allowed
+    // because we'll redefine X every iter.
+    bool succXLive = false;
+    for (MachineBasicBlock *Succ : MBB.successors()) {
+      if (Succ == &MBB) continue;
+      if (Succ->isLiveIn(W65816::X)) { succXLive = true; break; }
+    }
+    if (succXLive) continue;
+    // MBB must not touch X register anywhere.
+    bool mbbTouchesX = false;
+    for (const MachineInstr &MI : MBB) {
+      if (MI.isCall()) { mbbTouchesX = true; break; }
+      switch (MI.getOpcode()) {
+      case W65816::TAX: case W65816::TYX: case W65816::TSX:
+      case W65816::PLX: case W65816::TXA: case W65816::TXY:
+      case W65816::TXS: case W65816::PHX: case W65816::INX:
+      case W65816::DEX: case W65816::LDX_Imm16: case W65816::LDX_Imm8:
+      case W65816::LDX_DP: case W65816::LDX_Abs:
+      case W65816::LDX_DPY: case W65816::LDX_AbsY:
+      case W65816::STX_DP: case W65816::STX_Abs:
+      case W65816::STX_DPY:
+      case W65816::CPX_DP: case W65816::CPX_Abs:
+      case W65816::CPX_Imm8: case W65816::CPX_Imm16:
+        mbbTouchesX = true; break;
+      }
+      if (mbbTouchesX) break;
+      for (const MachineOperand &MO : MI.operands()) {
+        if (MO.isReg() && MO.getReg() == W65816::X) {
+          mbbTouchesX = true; break;
+        }
+      }
+      if (mbbTouchesX) break;
+    }
+    if (mbbTouchesX) continue;
+    // Find the 4-op pattern in MBB.
+    SmallVector<MachineInstr *, 4> ToErase;
+    for (auto It = MBB.begin(); It != MBB.end();) {
+      auto LdaMI = It++;
+      if (LdaMI->getOpcode() != W65816::LDA_DP) continue;
+      if (LdaMI->getNumOperands() < 1 || !LdaMI->getOperand(0).isImm())
+        continue;
+      int64_t X1 = LdaMI->getOperand(0).getImm();
+      auto skipDbg = [&](auto &P) {
+        while (P != MBB.end() && P->isDebugInstr()) ++P;
+      };
+      auto P = std::next(LdaMI); skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::STA_StackRel) continue;
+      auto StaS = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::INA_PSEUDO) continue;
+      auto Ina = P; ++P; skipDbg(P);
+      if (P == MBB.end() || P->getOpcode() != W65816::STA_DP) continue;
+      if (P->getOperand(0).getImm() != X1) continue;
+      auto StaBack = P;
+      // Verify DP slot X1 outside-pattern references are at most ONE
+      // STA_DP in the preheader (the iter initialization), which we'll
+      // rewrite to TAX. Any read of X1 outside the pattern bails.
+      MachineInstr *PreheaderInit = nullptr;
+      bool referencedElsewhere = false;
+      for (MachineBasicBlock &OtherMBB : MF) {
+        for (MachineInstr &OtherMI : OtherMBB) {
+          if (&OtherMI == &*LdaMI || &OtherMI == &*StaBack) continue;
+          unsigned OO = OtherMI.getOpcode();
+          // Look for any DP-addressing op whose operand matches X1.
+          bool refsX1 = false;
+          for (const MachineOperand &MO : OtherMI.operands()) {
+            if (MO.isImm() && MO.getImm() == X1) {
+              if (OO == W65816::LDA_DP || OO == W65816::STA_DP ||
+                  OO == W65816::STZ_DP || OO == W65816::ADC_DP ||
+                  OO == W65816::SBC_DP || OO == W65816::AND_DP ||
+                  OO == W65816::ORA_DP || OO == W65816::EOR_DP ||
+                  OO == W65816::CMP_DP || OO == W65816::INC_DP ||
+                  OO == W65816::DEC_DP || OO == W65816::LSR_DP ||
+                  OO == W65816::ROR_DP || OO == W65816::ASL_DP ||
+                  OO == W65816::ROL_DP || OO == W65816::BIT_DP ||
+                  OO == W65816::LDX_DP || OO == W65816::STX_DP ||
+                  OO == W65816::LDY_DP || OO == W65816::STY_DP) {
+                refsX1 = true;
+              }
+              break;
+            }
+          }
+          if (!refsX1) continue;
+          // Accept ONE STA_DP X1 in the preheader (we'll rewrite to TAX).
+          if (OO == W65816::STA_DP && &OtherMBB == Preheader &&
+              !PreheaderInit) {
+            PreheaderInit = &OtherMI;
+            continue;
+          }
+          // Otherwise bail.
+          referencedElsewhere = true; break;
+        }
+        if (referencedElsewhere) break;
+      }
+      if (referencedElsewhere) continue;
+      if (!PreheaderInit) continue;
+      // Apply: in preheader, the existing STA_DP X1 (PreheaderInit)
+      // gets replaced with TAX (A already has the initial value about
+      // to be stored). In the loop, replace the 4-op chain with
+      // TXA; STA_StackRel S; INX.
+      const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
+      BuildMI(*Preheader, PreheaderInit, PreheaderInit->getDebugLoc(),
+              TII->get(W65816::TAX));
+      ToErase.push_back(PreheaderInit);
+      // In the loop: insert TXA before StaS, and INX after StaS.
+      DebugLoc DL = StaS->getDebugLoc();
+      BuildMI(MBB, StaS, DL, TII->get(W65816::TXA));
+      BuildMI(MBB, std::next(MachineBasicBlock::iterator(StaS)),
+              DL, TII->get(W65816::INX));
+      // Erase LdaMI, Ina, StaBack. StaS stays.
+      ToErase.push_back(&*LdaMI);
+      ToErase.push_back(&*Ina);
+      ToErase.push_back(&*StaBack);
+      It = std::next(MachineBasicBlock::iterator(StaBack));
+      break; // only fire once per MBB to avoid cascading
+    }
+    for (MachineInstr *MI : ToErase) {
+      MI->eraseFromParent();
+      Changed = true;
+    }
+  }
+
||||||
|
// Dead INC_HI_IF_CARRY_StackRel elimination. For Layer 2
|
||||||
|
// (`-w65816-dbr-safe-ptrs`) loops, ptr_hi tracking via slot H is
|
||||||
|
// bookkeeping for a pointer whose bank stays in DBR. When the
|
||||||
|
// function never READS slot H (only writes it), the
|
||||||
|
// `INC_HI_IF_CARRY_StackRel H` pseudo and the preheader STAs to H
|
||||||
|
// are dead — eliminating them saves 3 cyc/iter (the BNE taken).
|
||||||
|
//
|
||||||
|
// Detect: a `INC_HI_IF_CARRY_StackRel H` instruction.
|
||||||
|
// Verify: no instruction reads slot H anywhere in the function.
|
||||||
|
// - LDA_StackRel H, ADC_StackRel H, etc. (any read)
|
||||||
|
// - LDA_StackRelIndY H, STA_StackRelIndY H, etc. (used as pointer)
|
||||||
|
// If clean, erase the INC and all STA_StackRel H stores.
|
||||||
|
{
|
||||||
|
DenseSet<int64_t> SlotsTouched;
|
||||||
|
for (MachineBasicBlock &MBB : MF) {
|
||||||
|
for (MachineInstr &MI : MBB) {
|
||||||
|
if (MI.getOpcode() != W65816::INC_HI_IF_CARRY_StackRel) continue;
|
||||||
|
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) continue;
|
||||||
|
SlotsTouched.insert(MI.getOperand(0).getImm());
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for (int64_t H : SlotsTouched) {
|
||||||
|
// Scan the entire function for any READ of slot H or USE of H as
|
||||||
|
// an indirect-Y base.
|
||||||
|
bool readSomewhere = false;
|
||||||
|
for (MachineBasicBlock &MBB : MF) {
|
||||||
|
for (const MachineInstr &MI : MBB) {
|
||||||
|
unsigned Op = MI.getOpcode();
|
||||||
|
int64_t Imm = -1;
|
||||||
|
if (MI.getNumOperands() >= 1 && MI.getOperand(0).isImm())
|
||||||
|
Imm = MI.getOperand(0).getImm();
|
||||||
|
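// Keep Imm == H - 1 as well, so the H-1 check below is reachable.
// Direct reads at H-1 then also count as reads of H (conservative).
if (Imm != H && Imm != H - 1) continue;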
|
||||||
|
switch (Op) {
|
||||||
|
// Reads of slot H via stack-rel direct.
|
||||||
|
case W65816::LDA_StackRel:
|
||||||
|
case W65816::ADC_StackRel: case W65816::SBC_StackRel:
|
||||||
|
case W65816::AND_StackRel: case W65816::ORA_StackRel:
|
||||||
|
case W65816::EOR_StackRel: case W65816::CMP_StackRel:
|
||||||
|
// Uses of slot H as an indirect-Y base.
|
||||||
|
case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
|
||||||
|
case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
|
||||||
|
case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
|
||||||
|
case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
|
||||||
|
readSomewhere = true;
|
||||||
|
break;
|
||||||
|
default:
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
// Also catch indirect uses of slot H-1 (the IndY reads 2 bytes
|
||||||
|
// at H-1, H). Conservative.
|
||||||
|
if (Imm == H - 1) {
|
||||||
|
switch (Op) {
|
||||||
|
case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
|
||||||
|
case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
|
||||||
|
case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
|
||||||
|
case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
|
||||||
|
readSomewhere = true;
|
||||||
|
break;
|
||||||
|
default:
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (readSomewhere) break;
|
||||||
|
}
|
||||||
|
if (readSomewhere) break;
|
||||||
|
}
|
||||||
|
if (readSomewhere) continue;
|
||||||
|
// Slot H is dead. Erase the INC_HI_IF_CARRY and all STA_StackRel
|
||||||
|
// H stores.
|
||||||
|
SmallVector<MachineInstr *, 4> ToErase;
|
||||||
|
for (MachineBasicBlock &MBB : MF) {
|
||||||
|
for (MachineInstr &MI : MBB) {
|
||||||
|
unsigned Op = MI.getOpcode();
|
||||||
|
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) continue;
|
||||||
|
if (MI.getOperand(0).getImm() != H) continue;
|
||||||
|
if (Op == W65816::INC_HI_IF_CARRY_StackRel ||
|
||||||
|
Op == W65816::STA_StackRel) {
|
||||||
|
ToErase.push_back(&MI);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
for (MachineInstr *MI : ToErase) {
|
||||||
|
MI->eraseFromParent();
|
||||||
|
Changed = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Lagged-stack-rel → DP-indirect-Y conversion. After the X-iter
|
||||||
|
// peephole, the loop body looks like:
|
||||||
|
// TXA ; A = X = OLD iter
|
||||||
|
// STA_StackRel S ; lagged stack-rel ptr slot
|
||||||
|
// INX ; X = NEW
|
||||||
|
// ...
|
||||||
|
// LDA_StackRelIndY S ; deref via stack-rel-indirect-Y (7 cyc)
|
||||||
|
//
|
||||||
|
// Equivalent with the iter in X and the ptr in DP:
|
||||||
|
// STX_DP D ; write 16-bit X to DP slot D:D+1 (4 cyc)
|
||||||
|
// INX ; X = NEW (2)
|
||||||
|
// ...
|
||||||
|
// LDA_DPIndY D ; deref via DP-indirect-Y (6 cyc)
|
||||||
|
//
|
||||||
|
// Saves TXA (2) + 1 cyc on STA→STX (5→4) + 1 cyc on IndY-stack→
|
||||||
|
// IndY-DP (7→6) = 4 cyc/iter. Requires a free DP slot pair.
|
||||||
|
//
|
||||||
|
// Conditions:
|
||||||
|
// - MBB is a self-loop (already paired with X-iter peephole).
|
||||||
|
// - Pattern TXA;STA_StackRel S;INX exists in MBB.
|
||||||
|
// - LDA_StackRelIndY S exists in MBB.
|
||||||
|
// - Slot S is referenced ONLY by the STA above and the LDA above
|
||||||
|
// (no other reads/writes in the function).
|
||||||
|
// - There exists an unused IMG-range DP slot (16-bit pair).
|
||||||
|
for (MachineBasicBlock &MBB : MF) {
|
||||||
|
bool selfLoop = false;
|
||||||
|
for (MachineBasicBlock *Pred : MBB.predecessors()) {
|
||||||
|
if (Pred == &MBB) { selfLoop = true; break; }
|
||||||
|
}
|
||||||
|
if (!selfLoop) continue;
|
||||||
|
|
||||||
|
// Find TXA ; STA_StackRel S ; INX in this MBB.
|
||||||
|
MachineInstr *Txa = nullptr;
|
||||||
|
MachineInstr *StaS = nullptr;
|
||||||
|
MachineInstr *Inx = nullptr;
|
||||||
|
int64_t Soff = -1;
|
||||||
|
auto It = MBB.begin();
|
||||||
|
while (It != MBB.end()) {
|
||||||
|
if (It->getOpcode() == W65816::TXA) {
|
||||||
|
auto P = std::next(It);
|
||||||
|
while (P != MBB.end() && P->isDebugInstr()) ++P;
|
||||||
|
if (P == MBB.end() || P->getOpcode() != W65816::STA_StackRel) {
|
||||||
|
++It; continue;
|
||||||
|
}
|
||||||
|
auto Sta = P; ++P;
|
||||||
|
while (P != MBB.end() && P->isDebugInstr()) ++P;
|
||||||
|
if (P == MBB.end() || P->getOpcode() != W65816::INX) {
|
||||||
|
++It; continue;
|
||||||
|
}
|
||||||
|
if (Sta->getNumOperands() < 1 || !Sta->getOperand(0).isImm()) {
|
||||||
|
++It; continue;
|
||||||
|
}
|
||||||
|
Txa = &*It; StaS = &*Sta; Inx = &*P;
|
||||||
|
Soff = Sta->getOperand(0).getImm();
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
++It;
|
||||||
|
}
|
||||||
|
if (!Txa) continue;
|
||||||
|
|
||||||
|
// Find LDA_StackRelIndY S in MBB.
|
||||||
|
MachineInstr *LdaIndY = nullptr;
|
||||||
|
for (MachineInstr &MI : MBB) {
|
||||||
|
if (MI.getOpcode() == W65816::LDA_StackRelIndY &&
|
||||||
|
MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
|
||||||
|
MI.getOperand(0).getImm() == Soff) {
|
||||||
|
LdaIndY = &MI;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (!LdaIndY) continue;
|
||||||
|
|
||||||
|
// Find the loop's predecessor (= preheader for our self-loop case).
|
||||||
|
MachineBasicBlock *Preheader = nullptr;
|
||||||
|
for (MachineBasicBlock *Pred : MBB.predecessors()) {
|
||||||
|
if (Pred != &MBB) { Preheader = Pred; break; }
|
||||||
|
}
|
||||||
|
// Verify slot S is referenced ONLY by writes (tolerated, left in place) and the
|
||||||
|
// LdaIndY, EXCEPT we tolerate LDA_StackRel S reads in the
|
||||||
|
// preheader (the X-iter peephole inserts these to bootstrap X).
|
||||||
|
// Any read elsewhere (post-loop, non-preheader pred, etc.) bails.
|
||||||
|
SmallVector<MachineInstr *, 2> DeadStaWrites;
|
||||||
|
SmallVector<MachineInstr *, 2> DeadPreheaderReads;
|
||||||
|
bool slotElsewhere = false;
|
||||||
|
for (MachineBasicBlock &OtherMBB : MF) {
|
||||||
|
for (MachineInstr &OtherMI : OtherMBB) {
|
||||||
|
if (&OtherMI == StaS || &OtherMI == LdaIndY) continue;
|
||||||
|
unsigned OOO = OtherMI.getOpcode();
|
||||||
|
if (OtherMI.getNumOperands() < 1 || !OtherMI.getOperand(0).isImm())
|
||||||
|
continue;
|
||||||
|
if (OtherMI.getOperand(0).getImm() != Soff) continue;
|
||||||
|
switch (OOO) {
|
||||||
|
case W65816::STA_StackRel:
|
||||||
|
DeadStaWrites.push_back(&OtherMI);
|
||||||
|
continue;
|
||||||
|
case W65816::LDA_StackRel:
|
||||||
|
// Allow LDA in the preheader (X-iter's reload-to-bootstrap-X).
|
||||||
|
if (Preheader && &OtherMBB == Preheader) {
|
||||||
|
DeadPreheaderReads.push_back(&OtherMI);
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
slotElsewhere = true; break;
|
||||||
|
case W65816::ADC_StackRel: case W65816::SBC_StackRel:
|
||||||
|
case W65816::AND_StackRel: case W65816::ORA_StackRel:
|
||||||
|
case W65816::EOR_StackRel: case W65816::CMP_StackRel:
|
||||||
|
case W65816::LDA_StackRelIndY: case W65816::STA_StackRelIndY:
|
||||||
|
case W65816::ADC_StackRelIndY: case W65816::SBC_StackRelIndY:
|
||||||
|
case W65816::AND_StackRelIndY: case W65816::ORA_StackRelIndY:
|
||||||
|
case W65816::EOR_StackRelIndY: case W65816::CMP_StackRelIndY:
|
||||||
|
slotElsewhere = true; break;
|
||||||
|
default: break;
|
||||||
|
}
|
||||||
|
if (slotElsewhere) break;
|
||||||
|
}
|
||||||
|
if (slotElsewhere) break;
|
||||||
|
}
|
||||||
|
if (slotElsewhere) continue;
|
||||||
|
// Each preheader LDA_StackRel S must be followed by TAX (it's the
|
||||||
|
// X-iter bootstrap pattern). Verify; if not, bail.
|
||||||
|
for (MachineInstr *MI : DeadPreheaderReads) {
|
||||||
|
auto N = std::next(MachineBasicBlock::iterator(MI));
|
||||||
|
while (N != MI->getParent()->end() && N->isDebugInstr()) ++N;
|
||||||
|
if (N == MI->getParent()->end() || N->getOpcode() != W65816::TAX) {
|
||||||
|
slotElsewhere = true; break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (slotElsewhere) continue;
|
||||||
|
|
||||||
|
// Find a free DP slot in the IMG range. Scan all DP-addressing
|
||||||
|
// ops and collect used addresses; pick the lowest unused 16-bit
|
||||||
|
// aligned slot.
|
||||||
|
DenseSet<int64_t> UsedDpAddrs;
|
||||||
|
for (MachineBasicBlock &OtherMBB : MF) {
|
||||||
|
for (MachineInstr &OtherMI : OtherMBB) {
|
||||||
|
if (OtherMI.getNumOperands() < 1 || !OtherMI.getOperand(0).isImm())
|
||||||
|
continue;
|
||||||
|
unsigned OOO = OtherMI.getOpcode();
|
||||||
|
switch (OOO) {
|
||||||
|
case W65816::LDA_DP: case W65816::STA_DP: case W65816::STZ_DP:
|
||||||
|
case W65816::LDX_DP: case W65816::STX_DP:
|
||||||
|
case W65816::LDY_DP: case W65816::STY_DP:
|
||||||
|
case W65816::ADC_DP: case W65816::SBC_DP:
|
||||||
|
case W65816::AND_DP: case W65816::ORA_DP:
|
||||||
|
case W65816::EOR_DP: case W65816::CMP_DP:
|
||||||
|
case W65816::CPX_DP: case W65816::CPY_DP:
|
||||||
|
case W65816::LSR_DP: case W65816::ROR_DP:
|
||||||
|
case W65816::ASL_DP: case W65816::ROL_DP:
|
||||||
|
case W65816::INC_DP: case W65816::DEC_DP:
|
||||||
|
case W65816::BIT_DP:
|
||||||
|
case W65816::LDA_DPInd: case W65816::STA_DPInd:
|
||||||
|
case W65816::LDA_DPIndY: case W65816::STA_DPIndY:
|
||||||
|
case W65816::LDA_DPIndX: case W65816::STA_DPIndX:
|
||||||
|
case W65816::LDA_DPIndLong: case W65816::STA_DPIndLong:
|
||||||
|
case W65816::LDA_DPIndLongY: case W65816::STA_DPIndLongY:
|
||||||
|
{
|
||||||
|
int64_t A = OtherMI.getOperand(0).getImm();
|
||||||
|
UsedDpAddrs.insert(A);
|
||||||
|
UsedDpAddrs.insert(A + 1); // 16-bit ops occupy 2 bytes
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
default: break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
// Pick a free 16-bit-aligned slot in $C0..$DE.
|
||||||
|
int64_t FreeDp = -1;
|
||||||
|
for (int64_t A = 0xC0; A <= 0xDE; A += 2) {
|
||||||
|
if (!UsedDpAddrs.count(A) && !UsedDpAddrs.count(A + 1)) {
|
||||||
|
FreeDp = A;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (FreeDp < 0) continue;
|
||||||
|
|
||||||
|
// Apply: rewrite TXA;STA_StackRel S;INX → STX_DP FreeDp;INX
|
||||||
|
// (TXA and StaS erased; STX_DP inserted at TXA's position).
|
||||||
|
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
|
||||||
|
BuildMI(MBB, Txa, Txa->getDebugLoc(),
|
||||||
|
TII->get(W65816::STX_DP)).addImm(FreeDp);
|
||||||
|
Txa->eraseFromParent();
|
||||||
|
StaS->eraseFromParent();
|
||||||
|
// Rewrite LDA_StackRelIndY S → LDA_DPIndY FreeDp.
|
||||||
|
BuildMI(MBB, LdaIndY, LdaIndY->getDebugLoc(),
|
||||||
|
TII->get(W65816::LDA_DPIndY))
|
||||||
|
.addImm(FreeDp);
|
||||||
|
LdaIndY->eraseFromParent();
|
||||||
|
// We do NOT erase the preheader's STA writes or LDA reads to slot
|
||||||
|
// S — they're the X-iter peephole's bootstrap (STA at preheader
|
||||||
|
// init writes s_lo to S; LDA reloads s_lo to A for TAX→X). The
|
||||||
|
// body no longer uses slot S, but leaving them there is harmless.
|
||||||
|
(void)DeadStaWrites;
|
||||||
|
(void)DeadPreheaderReads;
|
||||||
|
Changed = true;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Y-as-counter for strLen-shape loops. After DP-indirect-Y rewrite +
|
||||||
|
// dead-INC_HI elim, the strLen body is:
|
||||||
|
// STX_DP D (4 cyc) writes iter X to DP slot D
|
||||||
|
// INX (2) iter++
|
||||||
|
// INC_DP C (5) counter++
|
||||||
|
// LDA_DPIndY D (6) deref via slot D
|
||||||
|
// ANDi16imm #ff (--) mask to 8-bit (lowers to AND #imm)
|
||||||
|
// BNE loop (3) branch on byte != 0
|
||||||
|
// 22 cyc/iter.
|
||||||
|
//
|
||||||
|
// We can drop STX/INX/INC entirely by using Y as both the offset and
|
||||||
|
// counter: D holds initial s (one-time write in preheader), Y starts
|
||||||
|
// at -1, INY at top of each iter brings Y to 0, 1, 2, ...
|
||||||
|
// INY (2)
|
||||||
|
// LDA_DPIndY D (6)
|
||||||
|
// ANDi16imm #ff
|
||||||
|
// BNE loop (3)
|
||||||
|
// 13 cyc/iter — saves 9 cyc per iter.
|
||||||
|
//
|
||||||
|
// Preheader: LDY_Imm16 0 → LDY_Imm16 0xFFFF. Add STX_DP D (one-time
|
||||||
|
// s init). Drop the existing `LDA #-1 ; STA_DP C` counter init.
|
||||||
|
//
|
||||||
|
// Exit MBB: replace `LDA_DP C` (returns the counter) with `TYA`.
|
||||||
|
//
|
||||||
|
// strLen 1279 → ~874 cyc (predicted).
|
||||||
|
{
|
||||||
|
for (MachineBasicBlock &MBB : MF) {
|
||||||
|
// Self-loop check.
|
||||||
|
bool isSelfLoop = false;
|
||||||
|
for (auto *Succ : MBB.successors()) {
|
||||||
|
if (Succ == &MBB) { isSelfLoop = true; break; }
|
||||||
|
}
|
||||||
|
if (!isSelfLoop) continue;
|
||||||
|
if (MBB.pred_size() != 2) continue;
|
||||||
|
if (MBB.succ_size() != 2) continue;
|
||||||
|
|
||||||
|
// Find the 6-op pattern.
|
||||||
|
MachineInstr *Stx = nullptr;
|
||||||
|
MachineInstr *Inx = nullptr;
|
||||||
|
MachineInstr *IncC = nullptr;
|
||||||
|
MachineInstr *Lda = nullptr;
|
||||||
|
MachineInstr *And = nullptr;
|
||||||
|
MachineInstr *Bne = nullptr;
|
||||||
|
int64_t D = -1, C = -1;
|
||||||
|
bool extraInsn = false;
|
||||||
|
for (MachineInstr &MI : MBB) {
|
||||||
|
if (MI.isDebugInstr()) continue;
|
||||||
|
switch (MI.getOpcode()) {
|
||||||
|
case W65816::STX_DP:
|
||||||
|
if (Stx) { extraInsn = true; break; }
|
||||||
|
Stx = &MI;
|
||||||
|
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) {
|
||||||
|
extraInsn = true; break;
|
||||||
|
}
|
||||||
|
D = MI.getOperand(0).getImm();
|
||||||
|
break;
|
||||||
|
case W65816::INX:
|
||||||
|
if (!Stx || Inx) { extraInsn = true; break; }
|
||||||
|
Inx = &MI;
|
||||||
|
break;
|
||||||
|
case W65816::INC_DP:
|
||||||
|
if (!Inx || IncC) { extraInsn = true; break; }
|
||||||
|
IncC = &MI;
|
||||||
|
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm()) {
|
||||||
|
extraInsn = true; break;
|
||||||
|
}
|
||||||
|
C = MI.getOperand(0).getImm();
|
||||||
|
if (C == D) { extraInsn = true; break; }
|
||||||
|
break;
|
||||||
|
case W65816::LDA_DPIndY:
|
||||||
|
if (!IncC || Lda) { extraInsn = true; break; }
|
||||||
|
Lda = &MI;
|
||||||
|
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isImm() ||
|
||||||
|
MI.getOperand(0).getImm() != D) {
|
||||||
|
extraInsn = true; break;
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
case W65816::ANDi16imm:
|
||||||
|
if (!Lda || And) { extraInsn = true; break; }
|
||||||
|
And = &MI;
|
||||||
|
if (MI.getNumOperands() < 3 || !MI.getOperand(2).isImm() ||
|
||||||
|
MI.getOperand(2).getImm() != 255) {
|
||||||
|
extraInsn = true; break;
|
||||||
|
}
|
||||||
|
break;
|
||||||
|
case W65816::BNE:
|
||||||
|
if (!And || Bne) { extraInsn = true; break; }
|
||||||
|
Bne = &MI;
|
||||||
|
break;
|
||||||
|
default:
|
||||||
|
extraInsn = true;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
if (extraInsn) break;
|
||||||
|
}
|
||||||
|
if (extraInsn || !Stx || !Inx || !IncC || !Lda || !And || !Bne) continue;
|
||||||
|
|
||||||
|
// Find preheader (predecessor that is not self).
|
||||||
|
MachineBasicBlock *Preheader = nullptr;
|
||||||
|
for (auto *Pred : MBB.predecessors()) {
|
||||||
|
if (Pred != &MBB) {
|
||||||
|
if (Preheader) { Preheader = nullptr; break; }
|
||||||
|
Preheader = Pred;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (!Preheader) continue;
|
||||||
|
|
||||||
|
// Find exit MBB (successor that is not self).
|
||||||
|
MachineBasicBlock *Exit = nullptr;
|
||||||
|
for (auto *Succ : MBB.successors()) {
|
||||||
|
if (Succ != &MBB) {
|
||||||
|
if (Exit) { Exit = nullptr; break; }
|
||||||
|
Exit = Succ;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (!Exit) continue;
|
||||||
|
|
||||||
|
// In preheader: find LDY_Imm16 0, the LDA #-1; STA_DP C counter init,
// and the TAX that seeds X with s.
|
||||||
|
MachineInstr *Ldy = nullptr;
|
||||||
|
MachineInstr *LdaNeg1 = nullptr;
|
||||||
|
MachineInstr *StaC = nullptr;
|
||||||
|
MachineInstr *Tax = nullptr;
|
||||||
|
for (MachineInstr &MI : *Preheader) {
|
||||||
|
if (MI.getOpcode() == W65816::LDY_Imm16 &&
|
||||||
|
MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
|
||||||
|
MI.getOperand(0).getImm() == 0) {
|
||||||
|
Ldy = &MI;
|
||||||
|
}
|
||||||
|
if (MI.getOpcode() == W65816::LDAi16imm &&
|
||||||
|
MI.getNumOperands() >= 2 && MI.getOperand(1).isImm() &&
|
||||||
|
(MI.getOperand(1).getImm() == -1 ||
|
||||||
|
MI.getOperand(1).getImm() == 0xFFFF)) {
|
||||||
|
LdaNeg1 = &MI;
|
||||||
|
}
|
||||||
|
if (MI.getOpcode() == W65816::STA_DP &&
|
||||||
|
MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
|
||||||
|
MI.getOperand(0).getImm() == C) {
|
||||||
|
StaC = &MI;
|
||||||
|
}
|
||||||
|
if (MI.getOpcode() == W65816::TAX) {
|
||||||
|
Tax = &MI;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (!Ldy || !LdaNeg1 || !StaC || !Tax) continue;
|
||||||
|
|
||||||
|
// In exit MBB: find the first LDA_DP C, the counter reload that feeds the
// return value. The RTL after it is assumed, not checked here.
|
||||||
|
MachineInstr *ExitLdaC = nullptr;
|
||||||
|
for (MachineInstr &MI : *Exit) {
|
||||||
|
if (MI.getOpcode() == W65816::LDA_DP &&
|
||||||
|
MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() &&
|
||||||
|
MI.getOperand(0).getImm() == C) {
|
||||||
|
ExitLdaC = &MI;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (!ExitLdaC) continue;
|
||||||
|
|
||||||
|
// Verify no other references to slots D or C anywhere else in MF.
|
||||||
|
bool extraDRef = false;
|
||||||
|
bool extraCRef = false;
|
||||||
|
for (MachineBasicBlock &OtherMBB : MF) {
|
||||||
|
for (MachineInstr &MI : OtherMBB) {
|
||||||
|
if (&MI == Stx || &MI == IncC || &MI == Lda ||
|
||||||
|
&MI == LdaNeg1 || &MI == StaC || &MI == ExitLdaC) continue;
|
||||||
|
if (MI.getNumOperands() >= 1 && MI.getOperand(0).isImm()) {
|
||||||
|
int64_t Imm = MI.getOperand(0).getImm();
|
||||||
|
unsigned Op = MI.getOpcode();
|
||||||
|
// Catch any DP or DPInd op touching slot D or D+1, C or C+1.
|
||||||
|
switch (Op) {
|
||||||
|
case W65816::LDA_DP: case W65816::STA_DP: case W65816::STZ_DP:
|
||||||
|
case W65816::LDX_DP: case W65816::STX_DP:
|
||||||
|
case W65816::LDY_DP: case W65816::STY_DP:
|
||||||
|
case W65816::ADC_DP: case W65816::SBC_DP:
|
||||||
|
case W65816::AND_DP: case W65816::ORA_DP: case W65816::EOR_DP:
|
||||||
|
case W65816::CMP_DP: case W65816::CPX_DP: case W65816::CPY_DP:
|
||||||
|
case W65816::LSR_DP: case W65816::ROR_DP:
|
||||||
|
case W65816::ASL_DP: case W65816::ROL_DP:
|
||||||
|
case W65816::INC_DP: case W65816::DEC_DP:
|
||||||
|
case W65816::BIT_DP: case W65816::TSB_DP: case W65816::TRB_DP:
|
||||||
|
case W65816::LDA_DPIndY: case W65816::STA_DPIndY:
|
||||||
|
case W65816::LDA_DPInd: case W65816::STA_DPInd:
|
||||||
|
if (Imm == D || Imm == D + 1) extraDRef = true;
|
||||||
|
if (Imm == C || Imm == C + 1) extraCRef = true;
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (extraDRef || extraCRef) continue;
|
||||||
|
|
||||||
|
// Perform the transformation.
|
||||||
|
const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();
|
||||||
|
|
||||||
|
// 1. Preheader: change LDY_Imm16 0 → LDY_Imm16 0xFFFF.
|
||||||
|
Ldy->getOperand(0).setImm(0xFFFF);
|
||||||
|
|
||||||
|
// 2. Preheader: erase LDA #-1 and STA_DP C (counter init dead).
|
||||||
|
LdaNeg1->eraseFromParent();
|
||||||
|
StaC->eraseFromParent();
|
||||||
|
|
||||||
|
// 3. Preheader: insert STX_DP D after TAX (one-time s → D init).
|
||||||
|
auto AfterTax = std::next(MachineBasicBlock::iterator(Tax));
|
||||||
|
BuildMI(*Preheader, AfterTax, Tax->getDebugLoc(),
|
||||||
|
TII->get(W65816::STX_DP)).addImm(D);
|
||||||
|
|
||||||
|
// 4. Body: erase STX_DP D, INX, INC_DP C.
|
||||||
|
Stx->eraseFromParent();
|
||||||
|
Inx->eraseFromParent();
|
||||||
|
IncC->eraseFromParent();
|
||||||
|
|
||||||
|
// 5. Body: insert INY at start (before LDA_DPIndY).
|
||||||
|
BuildMI(MBB, Lda, Lda->getDebugLoc(),
|
||||||
|
TII->get(W65816::INY));
|
||||||
|
|
||||||
|
// 6. Exit: replace LDA_DP C with TYA.
|
||||||
|
BuildMI(*Exit, ExitLdaC, ExitLdaC->getDebugLoc(),
|
||||||
|
TII->get(W65816::TYA));
|
||||||
|
ExitLdaC->eraseFromParent();
|
||||||
|
|
||||||
|
Changed = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// Run elideStoreForwarding at the very end, AFTER IMG promotion has
|
// Run elideStoreForwarding at the very end, AFTER IMG promotion has
|
||||||
// committed slot assignments. Running this peephole earlier (with
|
// committed slot assignments. Running this peephole earlier (with
|
||||||
// the other early peepholes) cascades into different IMG-promotion
|
// the other early peepholes) cascades into different IMG-promotion
|
||||||
|
|
|
||||||
|
|
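The Y-as-counter rewrite above relies on the old and new loop shapes returning the same length. A minimal sketch of that equivalence in plain C++ follows. It is a model, not the pass itself: `strLenBefore`, `strLenAfter`, and the byte-vector memory are hypothetical names introduced here, and the 16-bit load plus `AND #$ff` is simplified to a single byte read.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the loop BEFORE the rewrite: X walks the pointer,
// DP slot C counts iterations starting at -1, and Y stays 0.
uint16_t strLenBefore(const std::vector<uint8_t> &mem, uint16_t s) {
  assert(s < mem.size());
  uint16_t x = s;        // TAX in the preheader
  uint16_t c = 0xFFFF;   // LDA #-1 ; STA_DP C
  const uint16_t y = 0;  // LDY_Imm16 0
  uint16_t a;
  do {
    uint16_t d = x;                                    // STX_DP D
    x = static_cast<uint16_t>(x + 1);                  // INX
    c = static_cast<uint16_t>(c + 1);                  // INC_DP C
    a = mem.at(static_cast<uint16_t>(d + y)) & 0xFF;   // LDA_DPIndY D ; AND #$ff
  } while (a != 0);                                    // BNE loop
  return c;                                            // LDA_DP C in the exit block
}

// Hypothetical model of the loop AFTER the rewrite: D is written once,
// and Y is both the offset and the counter.
uint16_t strLenAfter(const std::vector<uint8_t> &mem, uint16_t s) {
  assert(s < mem.size());
  const uint16_t d = s;  // TAX ; STX_DP D, once in the preheader
  uint16_t y = 0xFFFF;   // LDY_Imm16 0xFFFF
  uint16_t a;
  do {
    y = static_cast<uint16_t>(y + 1);                  // INY (0xFFFF wraps to 0)
    a = mem.at(static_cast<uint16_t>(d + y)) & 0xFF;   // LDA_DPIndY D ; AND #$ff
  } while (a != 0);                                    // BNE loop
  return y;                                            // TYA in the exit block
}
```

In both models the returned value is the index of the first zero byte relative to `s`, which is why the exit block can swap `LDA_DP C` for `TYA`. The model does not cover the case where Y is clobbered between the loop and the exit-block load.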
@@ -107,14 +107,7 @@ bool W65816UnLSR::runOnFunction(Function &F) {
   for (Loop *L : LI) {
     Changed |= processLoop(L);
     Changed |= processCounterToPtrPHIs(L);
-    // NOTE: processReturnedCounter (strLen-shape counter → ptr-difference
-    // at exit) is correct but produces a NET LOSS on strLen: without the
-    // counter PHI, the i32 pointer arithmetic falls back to clc+adc
-    // chains (16+ cyc/iter) instead of inc-A on the lo half (5 cyc/iter
-    // for ptr update + 5 for counter inc). See feedback memory.
-    // Disabled until codegen can use inc-DP for the lo half of a pointer
-    // PHI's increment without the SDAG materializing a full i32 add.
-    // Recurse into nested loops.
+    // processReturnedCounter remains disabled — see note above.
     SmallVector<Loop *, 4> Worklist(L->begin(), L->end());
     while (!Worklist.empty()) {
       Loop *Sub = Worklist.pop_back_val();
@@ -64,16 +64,18 @@ binary; the run only works in an unrestricted shell. Workaround: copy
 
 ## Size vs Calypsi (5 core files, ITERATIONS=1, PERFORMANCE_RUN)
 
-| File | Ours (L2+threshold=75) | Calypsi 5.16 | Ratio |
+| File | Ours (L2+threshold=50) | Calypsi 5.16 | Ratio |
 |------|----------------------:|-------------:|------:|
-| core_list_join.o | 10,188 | 9,073 | 1.12× |
-| core_main.o | 11,656 | 19,772 | 0.59× |
-| core_matrix.o | 15,180 | 11,078 | 1.37× |
-| core_state.o | 7,348 | 9,944 | 0.74× |
+| core_list_join.o | 10,008 | 9,073 | 1.10× |
+| core_main.o | 11,588 | 19,772 | 0.59× |
+| core_matrix.o | 10,660 | 11,078 | 0.96× |
+| core_state.o | 7,256 | 9,944 | 0.73× |
 | core_util.o | 3,156 | 4,631 | 0.68× |
-| **TOTAL** | **47,528** | **54,498** | **0.87×** |
+| **TOTAL** | **42,668** | **54,498** | **0.78×** |
 
-We beat Calypsi by 13% on CoreMark overall.
+We beat Calypsi by 22% on CoreMark overall. (Since the inline-
+threshold dropped from 75 to 50 target-wide, `core_matrix.o` improved
+from 1.37× → 0.96× by no longer inlining 5 nested-loop helpers.)
 
 ## Notes on the porting layer
 
||||||
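The ratios and the 22% figure in the updated CoreMark size table can be rechecked from the byte counts alone. The sketch below does that arithmetic. The `SizeRow` struct and the helper names are hypothetical and exist only for this illustration; the numbers are copied from the threshold=50 column.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// One row of the size table: object file name and sizes in bytes.
struct SizeRow {
  const char *file;
  long ours;     // "Ours (L2+threshold=50)" column
  long calypsi;  // "Calypsi 5.16" column
};

// Values copied from the updated table.
static const SizeRow kRows[] = {
    {"core_list_join.o", 10008, 9073},
    {"core_main.o", 11588, 19772},
    {"core_matrix.o", 10660, 11078},
    {"core_state.o", 7256, 9944},
    {"core_util.o", 3156, 4631},
};
static const std::size_t kNumRows = sizeof(kRows) / sizeof(kRows[0]);

// Sum of our object sizes across the five files.
long totalOurs() {
  long total = 0;
  for (std::size_t i = 0; i < kNumRows; ++i) total += kRows[i].ours;
  return total;
}

// Sum of Calypsi's object sizes across the five files.
long totalCalypsi() {
  long total = 0;
  for (std::size_t i = 0; i < kNumRows; ++i) total += kRows[i].calypsi;
  return total;
}

// Size ratio as a whole percentage (96 means 0.96x), rounded to nearest.
long ratioPercent(long ours, long calypsi) {
  assert(calypsi > 0);
  return std::lround(100.0 * static_cast<double>(ours) /
                     static_cast<double>(calypsi));
}
```

`ratioPercent(totalOurs(), totalCalypsi())` gives 78, so the total is about 22% smaller than Calypsi's, matching the TOTAL row.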
Binary file not shown.