diff --git a/README.md b/README.md index 09ae2d2..7a18843 100644 --- a/README.md +++ b/README.md @@ -26,21 +26,20 @@ tight loops in benchmarks like sumOfSquares, popcount, and strcpy. After installation (see [docs/INSTALL.md](docs/INSTALL.md)): ```bash -# Compile a C file +# Write a tiny C program that computes 1+2+...+10 = 55 and stores it. cat > hello.c <<'EOF' -__attribute__((noinline)) void switchToBank2(void) { - __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); -} int main(void) { unsigned short x = 0; - for (int i = 1; i <= 10; i++) x += i; // x = 55 - switchToBank2(); - *(volatile unsigned short *)0x5000 = x; + for (int i = 1; i <= 10; i++) x += i; // x = 55 = 0x37 + // Write to a known 24-bit absolute address. The compiler lowers + // this to `sta long $025000` — no bank switching needed. The MAME + // test harness reads this cell to verify the program ran. + *(volatile unsigned short *)0x025000 = x; while (1) {} } EOF -# Build + run under MAME (writes 0x0037 to $025000, MAME displays it) +# Compile, link, run under MAME, check the result. ./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c hello.c -o hello.o ./tools/link816 -o hello.bin --text-base 0x1000 \ runtime/crt0.o runtime/libc.o runtime/libgcc.o hello.o @@ -71,21 +70,32 @@ docs/ this directory — INSTALL.md, USAGE.md, design notes ## Status -Stable enough to build real programs. Static instruction-count -ratio against commercial Calypsi 5.16 (lower is better): +Stable enough to build real programs. 
Per-call cycle measurements +against commercial Calypsi 5.16, measured under MAME via `emu.time()` +(IIgs slow-mode 1.023 MHz, `-mllvm -w65816-dbr-safe-ptrs` enabled): -| Benchmark | Ours (inst) | Calypsi (inst) | Ratio | +| Benchmark | Ours | Calypsi | Ratio | |---|---:|---:|---:| -| sumSquares | 26 | 31 | **0.84×** ✓ | -| evalAt | 472 | 254 | 1.86× | -| mul16to32 | 1 | 4 | **0.25×** ✓ | +| bsearch | 682 | 2,387 | **0.29×** ✓ | +| dotProduct | 1,534 | 5,712 | **0.27×** ✓ | +| sumOfSquares | 6,820 | 16,368 | **0.42×** ✓ | +| bubbleSort | 11,594 | 17,050 | **0.68×** ✓ | +| djb2Hash | 2,387 | 2,643 | **0.90×** ✓ | +| memcmp | 716 | 716 | **1.00×** | +| strcpy | 1,279 | 1,194 | 1.07× | +| popcount | 1,705 | 1,534 | 1.11× | +| fib | 12,106 | 10,912 | 1.11× | +| strLen | 1,876 | 1,023 | 1.83× | -Per-iteration cycle measurements (via MAME's HBL counter, 2026-05-20): -bsearch 127, dotProduct 144, fib 97, memcmp 113, popcount 93, -strcpy 91, sumOfSquares 126 cyc/iter (100 iters); -dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters); -particles 2253 (3 iters — 32-particle physics tick); -mandelbrot 11570 (1 iter — 4×4 fixed-point tile). +**Geomean: 0.74× Calypsi** across this suite. Five of ten benches beat +Calypsi outright; one ties exactly. Run `scripts/benchCyclesPrecise.sh` +(ours) and `scripts/benchCyclesCalypsi.sh` (Calypsi) to reproduce. + +On real programs: +- **Lua 5.1.5** (17K LoC, 24 source files) compiles + links clean. + Object total 0.93× Calypsi. +- **CoreMark 1.0** (EEMBC standard benchmark) compiles + links clean. + Object total 0.80× Calypsi. See [STATUS.md](STATUS.md) for full language and runtime feature coverage, and [LLVM_65816_DESIGN.md](LLVM_65816_DESIGN.md) for diff --git a/docs/INSTALL.md b/docs/INSTALL.md index 9f826e2..3c0fdc8 100644 --- a/docs/INSTALL.md +++ b/docs/INSTALL.md @@ -46,14 +46,15 @@ everywhere. distro-neutral. - **CPU:** any 64-bit x86 or ARM Linux machine. We're cross-compiling, so the host CPU only matters for build speed.
-- **Disk:** ~10 GB free total (~5 GB during build, ~7 GB after install - with all reference compilers). If you skip Calypsi (`--skip-calypsi`), - knock 580 MB off. -- **RAM:** 8 GB minimum for the default install (downloads a prebuilt - llvm-mos SDK). 16 GB recommended if you use `--build-llvm` (compiles - LLVM from source). -- **Time:** ~5 minutes for the default (prebuilt) path; 30-60 minutes - for `--build-llvm` on a modern laptop (depends on core count). +- **Disk:** ~20 GB free during install (~12 GB peak for LLVM's cmake + intermediates, ~7 GB resident after the install + delete the + intermediates). If you skip Calypsi (`--skip-calypsi`), knock + 580 MB off the resident size. +- **RAM:** 16 GB recommended (LLVM's link step is the memory-heavy + one). 8 GB works but the linker may swap, doubling build time. +- **Time:** 30-60 minutes end-to-end (LLVM is the long pole). After + the first build, incremental edits to the W65816 backend rebuild + in ~30 seconds. - **Network:** the install pulls ~500 MB of binaries from GitHub, archive.org, and the Calypsi releases page. No proxy support baked in — set `http_proxy` / `https_proxy` if you need one. @@ -89,7 +90,8 @@ running, see `scripts/installDeps.sh` for the exact list. ## One-command install ```bash -git clone llvm816 +# Replace with the actual git URL for this repository. +git clone llvm816 cd llvm816 ./setup.sh ``` @@ -99,33 +101,40 @@ That's it. `setup.sh` runs five stages in order: | Stage | Script | What it does | Time | |---|---|---|---| | 1/5 | `installDeps.sh` | `sudo apt-get install` the packages listed above. | ~1 min | -| 2/5 | `installLlvmMos.sh` | Clone `llvm-mos` source (5 GB), download prebuilt llvm-mos SDK (400 MB), build our W65816 clang under `tools/llvm-mos-build/`. Without `--build-llvm`, downloads the prebuilt SDK only; clang for our target is then built incrementally. 
| ~5 min (no source build) or 30-60 min (with `--build-llvm`) | +| 2/5 | `installLlvmMos.sh` | Clone `llvm-mos` source (5 GB), download prebuilt llvm-mos SDK (400 MB), **apply our W65816 backend** (symlinks + patches), **build clang/llc/llvm-mc with W65816 target enabled**, **build `link816` + `omfEmit`**, and **build the runtime libraries** (`libc.o`, `crt0.o`, `libgcc.o`, soft-float, etc.). After this stage you have a working W65816 toolchain end-to-end. | ~30-60 min (first time; LLVM build is the long pole) | | 3/5 | `installMame.sh` | Install MAME via apt, download `apple2gs.zip` (ROM 03) and `apple2gsr1.zip` (ROM 01) into `tools/mame/roms/`. | ~30 s | | 4/5 | `installCalypsi.sh` | Download Calypsi 5.16 .deb, extract its payload into `tools/calypsi/` (no system-wide install). | ~30 s | | 5/5 | `installOrcaC.sh` | Shallow clone of byteworksinc's ORCA/C repo into `tools/orca-c/` for toolbox header reference. | ~15 s | After each stage, the script prints `=== N/5 stage-name ===` so you can follow progress. At the end it runs `verify.sh` which sanity- -checks every tool was installed. +checks every tool was installed AND end-to-end compiles a tiny C +program to confirm `clang` actually produces W65816 machine code. A successful install ends with: ``` +[llvm816] all checks passed [llvm816] setup complete ``` +If `verify.sh` reports failures, the most common cause is that the +LLVM build didn't include the W65816 target. Re-run +`scripts/applyBackend.sh` followed by +`ninja -C tools/llvm-mos-build clang llc llvm-mc llvm-objdump`. + ### `setup.sh` flags | Flag | Effect | |---|---| -| `--build-llvm` | Build clang from source (30-60 min) instead of using the prebuilt SDK. Required if you plan to modify the W65816 backend. | | `--skip-deps` | Don't run apt (use if you've already installed the system packages). | -| `--skip-llvm` | Skip the LLVM clone + build. Useful for iterating on other parts. | +| `--skip-llvm` | Skip the LLVM clone + build + runtime. 
Useful for iterating on other parts. | | `--skip-mame` | Skip MAME + ROM download. | | `--skip-calypsi` | Skip Calypsi (saves 580 MB if you don't need the comparison benchmarks). | | `--skip-orca` | Skip ORCA/C (saves ~10 MB; only needed if you regenerate `iigs/toolbox.h`). | | `--skip-verify` | Don't run the post-install verification check. | | `--verify-only` | Just run the verification check, don't install anything. | +| `--build-llvm` | Deprecated alias — the LLVM build is now always part of stage 2/5 (without it we wouldn't have a usable W65816 compiler). | --- @@ -240,19 +249,11 @@ If you only want to *build* C programs (no benchmarks, no comparisons), ### Building W65816 clang from source -The default install pulls a *prebuilt* llvm-mos SDK but builds our -W65816 backend incrementally on top. If you want to build everything -from source (recommended for backend development): - -```bash -./setup.sh --build-llvm -``` - -This adds about 30-60 minutes to install time but means you can edit -files under `src/llvm/lib/Target/W65816/` and rebuild quickly. - -After the initial source build, incremental rebuilds after editing -backend code take ~30 seconds: +`setup.sh` always builds clang from source — that's the only way to +get a `clang` that actually targets W65816 (the prebuilt llvm-mos SDK +in `tools/llvm-mos-sdk/` only knows about the 6502 MOS target). The +initial build takes 30-60 minutes depending on core count; after that +incremental rebuilds are ~30 seconds: ```bash ninja -C tools/llvm-mos-build llc clang @@ -314,7 +315,7 @@ If you want a fully clean rebuild (e.g., to chase a "stale .o" bug): ```bash rm -rf tools/llvm-mos-build -./setup.sh --build-llvm +./setup.sh --skip-deps --skip-mame --skip-calypsi --skip-orca ``` --- @@ -417,11 +418,11 @@ bash scripts/updateLlvmMos.sh That script handles the symlinks safely. 
-### Disk fills up during `--build-llvm` +### Disk fills up during the LLVM build A full LLVM build needs ~12 GB of temporary build artifacts (cmake's intermediate `.o` files, .a archives, etc.) on top of the 5 GB source -tree. Free ~15 GB before running `--build-llvm`. +tree. Free ~15 GB before running `setup.sh`. Once the build completes, the *intermediate* artifacts under `tools/llvm-mos-build/CMakeFiles/` can be deleted — the binaries @@ -431,7 +432,7 @@ under `tools/llvm-mos-build/bin/` are self-contained: rm -rf tools/llvm-mos-build/CMakeFiles tools/llvm-mos-build/lib ``` -But this disables incremental rebuilds. Re-running `--build-llvm` +But this disables incremental rebuilds. Re-running `setup.sh` recreates everything. ### Calypsi install fails / I don't want it diff --git a/scripts/installLlvmMos.sh b/scripts/installLlvmMos.sh index 4be384c..aaae4da 100755 --- a/scripts/installLlvmMos.sh +++ b/scripts/installLlvmMos.sh @@ -1,17 +1,21 @@ #!/usr/bin/env bash -# Install llvm-mos: clone source tree for backend development, plus -# download prebuilt SDK for reference/smoke-testing existing 6502 targets. +# Install + build the W65816 toolchain on top of llvm-mos: +# 1. Clone llvm-mos source. +# 2. Download the prebuilt llvm-mos-sdk (reference baseline). +# 3. Apply our W65816 backend INTO the clone (symlinks + patches). +# 4. Configure + build clang + llc + llvm-mc with the W65816 target. +# 5. Build link816 + omfEmit (the linker). +# 6. Build the runtime (libc.o, crt0.o, libgcc.o, etc.). # # Flags: -# --build also build the source tree with cmake/ninja (slow, ~30-60 min) +# --build (no-op; retained for backward compat — we always build) set -euo pipefail source "$(dirname "$0")/common.sh" -doBuild=0 for arg in "$@"; do case "$arg" in - --build) doBuild=1 ;; + --build) ;; # no-op; we always build now (see step 4) *) die "unknown flag: $arg" ;; esac done @@ -44,7 +48,9 @@ else git clone --depth=1 https://github.com/llvm-mos/llvm-mos.git "$LLVM_SRC" fi -# 2. 
Prebuilt SDK for testing/reference. +# 2. Prebuilt SDK for testing/reference (smoke tests against the +# vanilla 6502 MOS target; mostly unused once you have a W65816 +# build). if [ -x "$LLVM_SDK/bin/mos-common-clang" ] || [ -x "$LLVM_SDK/bin/clang" ]; then log "llvm-mos-sdk already extracted" else @@ -54,30 +60,57 @@ else tar -xJf "$archive" -C "$LLVM_SDK" --strip-components=1 fi -# 3. Optional source build. -if [ "$doBuild" -eq 1 ]; then - needCmd cmake - needCmd ninja - log "configuring llvm-mos build (this takes a while)" +# 3. Apply our W65816 backend INTO the clone (symlinks + patches). +# Must run BEFORE cmake configure so the W65816 target dir + cmake +# patch are present. +log "applying W65816 backend (symlinks + patches)" +bash "$(dirname "$0")/applyBackend.sh" + +# 4. Configure + build LLVM with W65816 enabled. We always build — +# without a built clang the rest of the toolchain (runtime, link816) +# can't produce any usable output. --build is kept as a no-op flag +# for backward compat. 
+needCmd cmake +needCmd ninja +if [ -x "$LLVM_BUILD/bin/clang" ] && \ + [ -x "$LLVM_BUILD/bin/llc" ] && \ + "$LLVM_BUILD/bin/llc" --version 2>/dev/null | grep -q "^[[:space:]]*w65816[[:space:]]"; then + log "llvm-mos-build/bin/clang already exists and supports w65816" +else + log "configuring llvm-mos build (LLVM + clang + lld; ~5 min after the first cmake)" install -d "$LLVM_BUILD" cmake -S "$LLVM_SRC/llvm" -B "$LLVM_BUILD" -G Ninja \ -DCMAKE_BUILD_TYPE=Release \ -DLLVM_TARGETS_TO_BUILD="" \ - -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="MOS" \ + -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="MOS;W65816" \ -DLLVM_ENABLE_PROJECTS="clang;lld" \ -DLLVM_PARALLEL_LINK_JOBS=1 \ -DLLVM_USE_LINKER=lld \ -DLLVM_INCLUDE_TESTS=OFF \ -DLLVM_INCLUDE_EXAMPLES=OFF \ -DLLVM_INCLUDE_BENCHMARKS=OFF - log "building llvm-mos (background-friendly: use --build only when you have time)" - ninja -C "$LLVM_BUILD" - log "llvm-mos build complete: $LLVM_BUILD/bin/clang" -else - log "skipped source build; rerun with --build when ready (cmake+ninja)" + log "building clang, llc, llvm-mc, llvm-objdump (the tools we actually use)" + ninja -C "$LLVM_BUILD" clang llc llvm-mc llvm-objdump llvm-readobj + log "llvm build done: $LLVM_BUILD/bin/clang" fi +# Sanity check: llc must list w65816 as a registered target. +if ! "$LLVM_BUILD/bin/llc" --version 2>/dev/null | grep -q "^[[:space:]]*w65816[[:space:]]"; then + "$LLVM_BUILD/bin/llc" --version 2>/dev/null | head -20 + warn "llc built but does NOT list w65816 as a target. Backend symlinks/patches may have failed. Re-run scripts/applyBackend.sh and ninja -C tools/llvm-mos-build." +fi + +# 5. Build link816 + omfEmit. +log "building link816 + omfEmit" +make -C "$PROJECT_ROOT/src/link816" all + +# 6. Build the runtime (libc.o, crt0.o, libgcc.o, etc.). 
+log "building runtime" +bash "$PROJECT_ROOT/runtime/build.sh" log "llvm-mos install done" log " source: $LLVM_SRC" log " sdk: $LLVM_SDK" -[ "$doBuild" -eq 1 ] && log " build: $LLVM_BUILD" +log " build: $LLVM_BUILD" +log " clang: $LLVM_BUILD/bin/clang" +log " link816: $PROJECT_ROOT/tools/link816" +log " omfEmit: $PROJECT_ROOT/tools/omfEmit" diff --git a/scripts/verify.sh b/scripts/verify.sh index e209893..2ba89b5 100755 --- a/scripts/verify.sh +++ b/scripts/verify.sh @@ -28,17 +28,61 @@ check "git" git --version check "llvm-mos source tree" test -d "$TOOLS_DIR/llvm-mos/.git" check "llvm-mos-sdk extracted" test -x "$TOOLS_DIR/llvm-mos-sdk/bin/mos-common-clang" +# W65816 backend integration +check "W65816 source symlinked into llvm-mos" \ + test -L "$TOOLS_DIR/llvm-mos/llvm/lib/Target/W65816/W65816ISelLowering.cpp" +check "W65816 clang built" test -x "$TOOLS_DIR/llvm-mos-build/bin/clang" +check "W65816 llc built" test -x "$TOOLS_DIR/llvm-mos-build/bin/llc" +check "W65816 llvm-mc built" test -x "$TOOLS_DIR/llvm-mos-build/bin/llvm-mc" +check "llc lists w65816 target" \ + bash -c "'$TOOLS_DIR/llvm-mos-build/bin/llc' --version 2>/dev/null | grep -q '^[[:space:]]*w65816[[:space:]]'" + +# link816 + omfEmit +check "link816 binary" test -x "$TOOLS_DIR/link816" +check "omfEmit binary" test -x "$TOOLS_DIR/omfEmit" + +# Runtime libraries (built objects we link into every program) +check "runtime/crt0.o" test -s "$PROJECT_ROOT/runtime/crt0.o" +check "runtime/libc.o" test -s "$PROJECT_ROOT/runtime/libc.o" +check "runtime/libgcc.o" test -s "$PROJECT_ROOT/runtime/libgcc.o" +check "runtime/softFloat.o" test -s "$PROJECT_ROOT/runtime/softFloat.o" +check "runtime/softDouble.o" test -s "$PROJECT_ROOT/runtime/softDouble.o" + # MAME + ROMs check "mame binary" command -v mame check "mame Lua console support" bash -c "mame -showusage 2>&1 | grep -q -- '-console'" check "rom: apple2gs.zip" test -s "$TOOLS_DIR/mame/roms/apple2gs.zip" check "rom: apple2gsr1.zip" test -s 
"$TOOLS_DIR/mame/roms/apple2gsr1.zip" -# Calypsi benchmark -check "calypsi cc65816" test -x "$TOOLS_DIR/calypsi/bin/cc65816" +# Calypsi benchmark (optional; --skip-calypsi at install skips this). +if [ -d "$TOOLS_DIR/calypsi" ]; then + check "calypsi cc65816" \ + test -x "$TOOLS_DIR/calypsi/usr/local/lib/calypsi-65816-5.16/bin/cc65816" +fi -# ORCA/C reference -check "orca-c source" test -d "$TOOLS_DIR/orca-c/.git" +# ORCA/C reference (optional; --skip-orca skips this). +if [ -d "$TOOLS_DIR/orca-c" ]; then + check "orca-c source" test -d "$TOOLS_DIR/orca-c/.git" +fi + +# End-to-end smoke: compile a tiny C program for W65816 to prove the +# toolchain actually produces machine code. +echo +log "end-to-end compile check" +tmp=$(mktemp -d -t llvm816-verify.XXXXXX) +trap "rm -rf '$tmp'" EXIT +cat > "$tmp/hello.c" <<'EOF' +int answer(void) { return 42; } +EOF +if "$TOOLS_DIR/llvm-mos-build/bin/clang" --target=w65816 -O2 -c "$tmp/hello.c" \ + -o "$tmp/hello.o" 2>/dev/null \ + && "$TOOLS_DIR/llvm-mos-build/bin/llvm-objdump" --triple=w65816 -d "$tmp/hello.o" 2>/dev/null \ + | grep -q "lda"; then + printf ' [ OK ] C -> w65816 .o compile produces lda instruction\n' +else + printf ' [FAIL] end-to-end compile failed\n' + fails=$((fails + 1)) +fi echo if [ "$fails" -eq 0 ]; then diff --git a/setup.sh b/setup.sh index c8bac22..e11c329 100755 --- a/setup.sh +++ b/setup.sh @@ -2,15 +2,24 @@ # Top-level installer for the llvm816 project. Installs everything into # ./tools/ so the tree is self-contained and deletable. # +# Stages (5/5): +# 1. apt dependencies +# 2. llvm-mos clone + W65816 backend apply + cmake/ninja build of +# clang/llc/llvm-mc + build link816/omfEmit + build runtime (.o) +# 3. MAME + apple2gs ROMs +# 4. Calypsi 5.16 (reference compiler — optional) +# 5. 
ORCA/C source (header reference — optional) +# # Usage: -# ./setup.sh # install everything (no llvm-mos source build) -# ./setup.sh --build-llvm # also cmake+ninja build llvm-mos (slow) +# ./setup.sh # install + build everything (~30-60 min) # ./setup.sh --skip-deps # skip apt packages -# ./setup.sh --skip-llvm # skip llvm-mos +# ./setup.sh --skip-llvm # skip stage 2 (clang/runtime/link816) # ./setup.sh --skip-mame # skip MAME + ROM fetch # ./setup.sh --skip-calypsi # skip Calypsi # ./setup.sh --skip-orca # skip ORCA/C reference +# ./setup.sh --skip-verify # skip the post-install verification # ./setup.sh --verify-only # run verification only +# ./setup.sh --build-llvm # deprecated alias for the default behavior set -euo pipefail diff --git a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp index 1b85780..d7777ba 100644 --- a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp +++ b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp @@ -1731,28 +1731,28 @@ SDValue W65816TargetLowering::LowerShift(SDValue Op, SelectionDAG &DAG) const { SDValue X = Op.getOperand(0); SDValue Lo = extractWide32Lo(DAG, DL, X); SDValue Hi = extractWide32Hi(DAG, DL, X); - SDValue One = DAG.getConstant(1, DL, MVT::i16); - SDValue Fifteen = DAG.getConstant(15, DL, MVT::i16); - for (unsigned i = 0; i < N; i++) { - if (Op0 == ISD::SHL) { - // (Hi:Lo) << 1: carry = Lo bit15 → into Hi bit0. - SDValue NewLo = DAG.getNode(ISD::SHL, DL, MVT::i16, Lo, One); - SDValue HiBit0 = DAG.getNode(ISD::SRL, DL, MVT::i16, Lo, Fifteen); - SDValue HiShl = DAG.getNode(ISD::SHL, DL, MVT::i16, Hi, One); - SDValue NewHi = DAG.getNode(ISD::OR, DL, MVT::i16, HiShl, HiBit0); - Lo = NewLo; Hi = NewHi; - } else { - // SRL/SRA: Hi shifts (logical or arithmetic), Lo gets the - // low bit of pre-shift Hi inserted at bit 15. 
- SDValue NewHi = DAG.getNode(Op0, DL, MVT::i16, Hi, One); - SDValue HiLow = DAG.getNode(ISD::AND, DL, MVT::i16, Hi, One); - SDValue LoTop = DAG.getNode(ISD::SHL, DL, MVT::i16, HiLow, Fifteen); - SDValue LoSrl = DAG.getNode(ISD::SRL, DL, MVT::i16, Lo, One); - SDValue NewLo = DAG.getNode(ISD::OR, DL, MVT::i16, LoSrl, LoTop); - Lo = NewLo; Hi = NewHi; - } + SDValue ShN = DAG.getConstant(N, DL, MVT::i16); + SDValue ShCo = DAG.getConstant(16 - N, DL, MVT::i16); + if (Op0 == ISD::SHL) { + // (Hi:Lo) << N == ((Hi << N) | (Lo >> (16-N))) : (Lo << N) + // 4 SDAG ops instead of N iterations of 4 ops. Lets the + // combiner / isel produce ASLA16-cascade + SRL8A+LSRA16- + // cascade + single OR, avoiding the bit-by-bit OR cascade + // that the unrolled form produced. + SDValue NewLo = DAG.getNode(ISD::SHL, DL, MVT::i16, Lo, ShN); + SDValue HiTop = DAG.getNode(ISD::SRL, DL, MVT::i16, Lo, ShCo); + SDValue HiShl = DAG.getNode(ISD::SHL, DL, MVT::i16, Hi, ShN); + SDValue NewHi = DAG.getNode(ISD::OR, DL, MVT::i16, HiShl, HiTop); + return buildWide32(DAG, DL, NewLo, NewHi); + } else { + // SRL/SRA by N: NewHi = Hi >> N (logical or arithmetic); + // NewLo = (Lo >> N) | (Hi << (16-N)). + SDValue NewHi = DAG.getNode(Op0, DL, MVT::i16, Hi, ShN); + SDValue LoTop = DAG.getNode(ISD::SHL, DL, MVT::i16, Hi, ShCo); + SDValue LoSrl = DAG.getNode(ISD::SRL, DL, MVT::i16, Lo, ShN); + SDValue NewLo = DAG.getNode(ISD::OR, DL, MVT::i16, LoSrl, LoTop); + return buildWide32(DAG, DL, NewLo, NewHi); } - return buildWide32(DAG, DL, Lo, Hi); } } } @@ -2663,6 +2663,32 @@ SDValue W65816TargetLowering::LowerMUL_I32(SDValue Op, return SDValue(); }; + // Mul-by-constant strength reduction: (X * K) where K-1 or K+1 is + // a small power of 2 (shift count 1..5, matching our inlined i32 + // SHL range) expands to (X << N) +/- X — saves a __mulsi3 libcall + // (~250 cyc) for ~70 cyc of inlined shift+add. Catches djb2Hash's + // `h * 33` = (h << 5) + h. 
+ // + // Patterns covered: + // K = 2^N + 1 in {3,5,9,17,33} → (X << N) + X + // K = 2^N - 1 in {7,15,31} → (X << N) - X + // Larger N hits the i32 SHL libcall path (no longer profitable). + if (auto *CN = dyn_cast<ConstantSDNode>(Rhs)) { + int64_t K = CN->getSExtValue(); + for (unsigned N = 1; N <= 5; N++) { + int64_t Pow = int64_t{1} << N; + SDValue ShAmt = DAG.getConstant(N, DL, MVT::i16); + if (K == Pow + 1) { + SDValue Shl = DAG.getNode(ISD::SHL, DL, MVT::i32, Lhs, ShAmt); + return DAG.getNode(ISD::ADD, DL, MVT::i32, Shl, Lhs); + } + if (K == Pow - 1) { + SDValue Shl = DAG.getNode(ISD::SHL, DL, MVT::i32, Lhs, ShAmt); + return DAG.getNode(ISD::SUB, DL, MVT::i32, Shl, Lhs); + } + } + } + SDValue A = narrowToI16(Lhs); SDValue B = narrowToI16(Rhs); if (A && B) { diff --git a/src/llvm/lib/Target/W65816/W65816InstrInfo.td b/src/llvm/lib/Target/W65816/W65816InstrInfo.td index 647b7cd..298daed 100644 --- a/src/llvm/lib/Target/W65816/W65816InstrInfo.td +++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.td @@ -1414,8 +1414,10 @@ def PEI_DP : InstDP<0xD4, "pei">; // AsmParser has no way to know the current M/X bits, so it always // reaches for the _Imm16 form. Codegen can still select _Imm8 // explicitly once we have 8-bit patterns. +let hasSideEffects = 0, mayLoad = 0, mayStore = 0, isReMaterializable = 1, isAsCheapAsAMove = 1 in { def LDA_Imm8 : InstImm8<0xA9, "lda"> { let MHigh = 1; let DecoderNamespace = "W65816MHigh"; let isCodeGenOnly = 1; let Defs = [A]; } def LDA_Imm16 : InstImm16<0xA9, "lda"> { let MLow = 1; let Defs = [A]; } +} // Same opcode/encoding as LDA_Imm16, but the operand emits a fixup_bank16 // so the linker / OMF Loader fills in (bankByte, 0) of the symbol.
// Used by the LDAi16imm_bank pseudo for materializing the high half of @@ -1455,8 +1457,10 @@ def STA_DPIndLongY : InstDPIndLongY<0x97, "sta"> { let Uses = [A, Y]; } def STA_LongX : InstAbsLongX<0x9F, "sta">; //---------------------------------------------------------------- LDX (load X) +let hasSideEffects = 0, mayLoad = 0, mayStore = 0, isReMaterializable = 1, isAsCheapAsAMove = 1 in { def LDX_Imm8 : InstImm8<0xA2, "ldx"> { let XHigh = 1; let DecoderNamespace = "W65816XHigh"; let isCodeGenOnly = 1; let Defs = [X]; } def LDX_Imm16 : InstImm16<0xA2, "ldx"> { let XLow = 1; let Defs = [X]; } +} def LDX_DP : InstDP<0xA6, "ldx">; def LDX_Abs : InstAbs<0xAE, "ldx">; def LDX_DPY : InstDPY<0xB6, "ldx">; @@ -1468,8 +1472,10 @@ def STX_Abs : InstAbs<0x8E, "stx">; def STX_DPY : InstDPY<0x96, "stx">; //---------------------------------------------------------------- LDY (load Y) +let hasSideEffects = 0, mayLoad = 0, mayStore = 0, isReMaterializable = 1, isAsCheapAsAMove = 1 in { def LDY_Imm8 : InstImm8<0xA0, "ldy"> { let XHigh = 1; let DecoderNamespace = "W65816XHigh"; let isCodeGenOnly = 1; let Defs = [Y]; } def LDY_Imm16 : InstImm16<0xA0, "ldy"> { let XLow = 1; let Defs = [Y]; } +} def LDY_DP : InstDP<0xA4, "ldy">; def LDY_Abs : InstAbs<0xAC, "ldy">; def LDY_DPX : InstDPX<0xB4, "ldy">; diff --git a/src/llvm/lib/Target/W65816/W65816StackRelToImg.cpp b/src/llvm/lib/Target/W65816/W65816StackRelToImg.cpp index 832c6b9..a02df53 100644 --- a/src/llvm/lib/Target/W65816/W65816StackRelToImg.cpp +++ b/src/llvm/lib/Target/W65816/W65816StackRelToImg.cpp @@ -2051,6 +2051,396 @@ bool W65816StackRelToImg::runOnMachineFunction(MachineFunction &MF) { } } + // Shift-cascade dead-store elimination. Greedy regalloc sometimes + // emits `LDA dp; (ASLA16; STA dp){×N}; ...` where each intermediate + // STA_DP is dead — the next ASLA16 reads $a (still holding the value) + // and shifts again, then stores again. Only the final STA matters. 
+ // + // Pattern: LDA_DP X ; (ASLA16; STA_DP X){×N+1} + // where every STA writes to the same DP slot the LDA read + // from, and nothing in between reads $a or DP[X] except + // the cascade itself. + // Rewrite: LDA_DP X ; ASLA16{×N+1} ; STA_DP X + // (final STA only; intermediate STAs erased.) + // + // For N=4 (i.e. 5 shifts), saves 4 STA_DPs = 12 cyc. Hits djb2Hash's + // `Hi << 5` cascade (where greedy spills the intermediate vregs). + for (MachineBasicBlock &MBB : MF) { + SmallVector ToErase; + for (auto It = MBB.begin(); It != MBB.end(); ++It) { + if (It->getOpcode() != W65816::LDA_DP) continue; + if (It->getNumOperands() < 1 || !It->getOperand(0).isImm()) continue; + int64_t Slot = It->getOperand(0).getImm(); + auto Cur = std::next(It); + // Track intermediate STAs in this cascade. + SmallVector Intermediates; + bool sawAny = false; + while (Cur != MBB.end()) { + if (Cur->isDebugInstr()) { ++Cur; continue; } + unsigned Op = Cur->getOpcode(); + if (Op != W65816::ASLA16 && Op != W65816::LSRA16) break; + auto Sta = std::next(Cur); + while (Sta != MBB.end() && Sta->isDebugInstr()) ++Sta; + if (Sta == MBB.end()) break; + if (Sta->getOpcode() != W65816::STA_DP) break; + if (Sta->getNumOperands() < 1 || !Sta->getOperand(0).isImm()) break; + if (Sta->getOperand(0).getImm() != Slot) break; + sawAny = true; + Intermediates.push_back(&*Sta); + Cur = std::next(Sta); + } + if (!sawAny || Intermediates.size() < 2) continue; + // The LAST STA is the real one; mark everything before it for erase. + Intermediates.pop_back(); + for (MachineInstr *MI : Intermediates) + ToErase.push_back(MI); + } + for (MachineInstr *MI : ToErase) { + MI->eraseFromParent(); + Changed = true; + } + } + + // i32-SRL-by-1 fold: detects the SDAG expansion for `(SRL i32 X, 1)` + // and rewrites to the two-instruction `LSR Hi ; ROR Lo` pair when + // both halves live in DP. 
+ // + // Input pattern (after the DP-shift-fold above runs on the trailing + // `LDA Hi ; LSRA16 ; STA Hi` triplet): + // LDA_DP Hi + // SHL15A ; A = (Hi & 1) << 15 + // STA_DP Yc ; carry slot + // LDA_DP Lo + // LSRA16 ; Lo >>= 1 + // STA_DP Lo + // ORA_DP Yc ; combine with carry-at-bit-15 + // STA_DP Lo + // LSR_DP Hi ; (folded from the trailing triplet) + // + // Output: + // LSR_DP Hi ; Hi >>= 1, C = old bit 0 of Hi + // ROR_DP Lo ; Lo = (C, Lo >> 1) + // + // Same semantics, ~30 cyc saved per iter. popcount measured at 2728 + // cyc; expected post-fold ~1858 cyc (-32%) due to 29 iters of i32 + // SRL by 1. + for (MachineBasicBlock &MBB : MF) { + SmallVector ToErase; + for (auto It = MBB.begin(); It != MBB.end();) { + auto LdaHi = It++; + if (LdaHi->getOpcode() != W65816::LDA_DP) continue; + if (LdaHi->getNumOperands() < 1 || !LdaHi->getOperand(0).isImm()) + continue; + int64_t HiAddr = LdaHi->getOperand(0).getImm(); + auto P = std::next(LdaHi); + auto skipDbg = [&](auto &P) { + while (P != MBB.end() && P->isDebugInstr()) ++P; + }; + skipDbg(P); + if (P == MBB.end() || P->getOpcode() != W65816::SHL15A) continue; + auto Shl = P; ++P; skipDbg(P); + if (P == MBB.end() || P->getOpcode() != W65816::STA_DP) continue; + if (P->getNumOperands() < 1 || !P->getOperand(0).isImm()) continue; + int64_t YcAddr = P->getOperand(0).getImm(); + auto StaYc = P; ++P; skipDbg(P); + if (P == MBB.end() || P->getOpcode() != W65816::LDA_DP) continue; + if (P->getNumOperands() < 1 || !P->getOperand(0).isImm()) continue; + int64_t LoAddr = P->getOperand(0).getImm(); + auto LdaLo = P; ++P; skipDbg(P); + if (P == MBB.end() || P->getOpcode() != W65816::LSRA16) continue; + auto LsrA1 = P; ++P; skipDbg(P); + if (P == MBB.end() || P->getOpcode() != W65816::STA_DP) continue; + if (P->getOperand(0).getImm() != LoAddr) continue; + auto StaLo1 = P; ++P; skipDbg(P); + if (P == MBB.end() || P->getOpcode() != W65816::ORA_DP) continue; + if (P->getOperand(0).getImm() != YcAddr) continue; + auto OraYc 
= P; ++P; skipDbg(P); + if (P == MBB.end() || P->getOpcode() != W65816::STA_DP) continue; + if (P->getOperand(0).getImm() != LoAddr) continue; + auto StaLo2 = P; ++P; skipDbg(P); + if (P == MBB.end() || P->getOpcode() != W65816::LSR_DP) continue; + if (P->getOperand(0).getImm() != HiAddr) continue; + auto LsrHi = P; + // Check that YcAddr is not READ after LsrHi before being + // overwritten. If the next op touching YcAddr is a STA (write), + // the carry-slot value is dead — safe to drop our STA Yc. If + // it's a load-style op (LDA/ORA/AND/etc.) before any STA, then + // some downstream code is consuming our stored carry — bail. + bool YcReadBeforeWrite = false; + for (auto Q = std::next(LsrHi); Q != MBB.end(); ++Q) { + if (Q->isDebugInstr()) continue; + bool touchesYc = false; + bool isWriteOfYc = false; + for (const MachineOperand &MO : Q->operands()) { + if (MO.isImm() && MO.getImm() == YcAddr) { + unsigned QO = Q->getOpcode(); + if (QO == W65816::STA_DP || QO == W65816::STZ_DP || + QO == W65816::STX_DP || QO == W65816::STY_DP) { + touchesYc = true; isWriteOfYc = true; + } else if (QO == W65816::LDA_DP || QO == W65816::ORA_DP || + QO == W65816::AND_DP || QO == W65816::EOR_DP || + QO == W65816::ADC_DP || QO == W65816::SBC_DP || + QO == W65816::CMP_DP || QO == W65816::LSR_DP || + QO == W65816::ROR_DP || QO == W65816::ASL_DP || + QO == W65816::ROL_DP || QO == W65816::INC_DP || + QO == W65816::DEC_DP) { + touchesYc = true; // any of these is a read or RMW + } + break; + } + } + if (touchesYc) { + if (!isWriteOfYc) YcReadBeforeWrite = true; + break; // first touch decides + } + } + if (YcReadBeforeWrite) continue; + // Apply: replace the 8-op sequence with LSR_DP Hi ; ROR_DP Lo. + // The LSR_DP Hi already exists (at LsrHi); just insert ROR_DP Lo + // immediately after it and erase the rest. 
+ BuildMI(MBB, std::next(MachineBasicBlock::iterator(LsrHi)), + LsrHi->getDebugLoc(), TII->get(W65816::ROR_DP)) + .addImm(LoAddr); + ToErase.push_back(&*LdaHi); + ToErase.push_back(&*Shl); + ToErase.push_back(&*StaYc); + ToErase.push_back(&*LdaLo); + ToErase.push_back(&*LsrA1); + ToErase.push_back(&*StaLo1); + ToErase.push_back(&*OraYc); + ToErase.push_back(&*StaLo2); + It = std::next(MachineBasicBlock::iterator(LsrHi)); + } + for (MachineInstr *MI : ToErase) { + MI->eraseFromParent(); + Changed = true; + } + } + + // DP dead-store elimination — runs LAST (after the i32-SRL-by-1 + // fold, which depends on the `STA_DP Lo` between LSRA16 and ORA_DP + // staying intact). When two STA_DP X stores write the same DP slot + // with no intervening read or write of that slot, the first is dead. + // Emerges from popcount's `bit = x & 1` pattern at end of body. + // Saves 5 cyc per match. + // + // Conservative: branches / calls / inline asm and all DP-indirect + // addressing modes (which use the slot AS a pointer) block the + // elim — those reads must see the intended store value. + { + auto touchesDpSlot = [](const MachineInstr &MI, int64_t Addr) { + unsigned Op = MI.getOpcode(); + switch (Op) { + // Direct DP ops. + case W65816::LDA_DP: case W65816::STA_DP: case W65816::STZ_DP: + case W65816::LDX_DP: case W65816::STX_DP: + case W65816::LDY_DP: case W65816::STY_DP: + case W65816::ADC_DP: case W65816::SBC_DP: + case W65816::AND_DP: case W65816::ORA_DP: + case W65816::EOR_DP: case W65816::CMP_DP: + case W65816::CPX_DP: case W65816::CPY_DP: + case W65816::LSR_DP: case W65816::ROR_DP: + case W65816::ASL_DP: case W65816::ROL_DP: + case W65816::INC_DP: case W65816::DEC_DP: + case W65816::BIT_DP: case W65816::TSB_DP: + case W65816::TRB_DP: + // DP-indexed ops. 
+ case W65816::LDA_DPX: case W65816::STA_DPX: case W65816::STZ_DPX: + case W65816::LDY_DPX: case W65816::STY_DPX: + case W65816::ADC_DPX: case W65816::SBC_DPX: + case W65816::AND_DPX: case W65816::ORA_DPX: + case W65816::EOR_DPX: case W65816::CMP_DPX: + case W65816::LDX_DPY: case W65816::STX_DPY: + case W65816::LSR_DPX: case W65816::ROR_DPX: + case W65816::ASL_DPX: case W65816::ROL_DPX: + case W65816::BIT_DPX: + // DP-indirect ops — read the slot AS a pointer. + case W65816::LDA_DPInd: case W65816::STA_DPInd: + case W65816::LDA_DPIndY: case W65816::STA_DPIndY: + case W65816::LDA_DPIndX: case W65816::STA_DPIndX: + case W65816::LDA_DPIndLong: case W65816::STA_DPIndLong: + case W65816::LDA_DPIndLongY: case W65816::STA_DPIndLongY: + case W65816::ADC_DPInd: case W65816::SBC_DPInd: + case W65816::AND_DPInd: case W65816::ORA_DPInd: + case W65816::EOR_DPInd: case W65816::CMP_DPInd: + case W65816::ADC_DPIndY: case W65816::SBC_DPIndY: + case W65816::AND_DPIndY: case W65816::ORA_DPIndY: + case W65816::EOR_DPIndY: case W65816::CMP_DPIndY: + case W65816::ADC_DPIndLong: case W65816::SBC_DPIndLong: + case W65816::AND_DPIndLong: case W65816::ORA_DPIndLong: + case W65816::EOR_DPIndLong: case W65816::CMP_DPIndLong: + case W65816::ADC_DPIndLongY: case W65816::SBC_DPIndLongY: + case W65816::AND_DPIndLongY: case W65816::ORA_DPIndLongY: + case W65816::EOR_DPIndLongY: case W65816::CMP_DPIndLongY: + if (MI.getNumOperands() >= 1 && MI.getOperand(0).isImm() && + MI.getOperand(0).getImm() == Addr) + return true; + // Indirect ops read addr AND addr+1 (16-bit ptr) or addr+1,2 + // (24-bit ptr). Bail when the candidate dead-store target is + // within those bytes. 
+ if (MI.getNumOperands() >= 1 && MI.getOperand(0).isImm()) { + int64_t IndAddr = MI.getOperand(0).getImm(); + if (IndAddr == Addr - 1 || IndAddr == Addr - 2) + return true; + } + break; + } + return false; + }; + for (MachineBasicBlock &MBB : MF) { + SmallVector ToErase; + SmallPtrSet ErasedSet; + for (auto It = MBB.begin(); It != MBB.end(); ++It) { + if (ErasedSet.count(&*It)) continue; + if (It->getOpcode() != W65816::STA_DP) continue; + if (It->getNumOperands() < 1 || !It->getOperand(0).isImm()) + continue; + int64_t Sta1Addr = It->getOperand(0).getImm(); + auto Walk = std::next(It); + while (Walk != MBB.end()) { + if (Walk->isDebugInstr()) { ++Walk; continue; } + if (Walk->isBranch() || Walk->isCall() || Walk->isReturn() || + Walk->isInlineAsm()) break; + if (Walk->getOpcode() == W65816::STA_DP && + Walk->getNumOperands() >= 1 && Walk->getOperand(0).isImm() && + Walk->getOperand(0).getImm() == Sta1Addr) { + ToErase.push_back(&*It); + ErasedSet.insert(&*It); + break; + } + if (touchesDpSlot(*Walk, Sta1Addr)) break; + ++Walk; + } + } + for (MachineInstr *MI : ToErase) { + MI->eraseFromParent(); + Changed = true; + } + } + } + + // DP-slot zero-check bridge via X. Pattern: + // [op that sets Z on A] + // STA_DP slot + // [ops that don't read/write slot, don't touch X, don't branch/call] + // LDA_DP slot + // Bcond (BNE/BEQ) + // + // The STA/LDA round-trip exists purely to preserve A's value across + // the clobbers. TAX/TXA does the same job in 4 cyc instead of 8. + // Saves 4 cyc/match. Hits popcount's `x_lo | x_hi ; (work) ; bne`. + // + // Safety: X register must be dead. Conservative check — fires only + // when the entire MBB doesn't reference X register except as our new + // TAX/TXA, AND all MBB successors don't have X as live-in. + for (MachineBasicBlock &MBB : MF) { + // First: does this MBB reference X at all? If yes, bail. This is + // conservative — refining would need full liveness. 
+ bool MbbTouchesX = false; + for (const MachineInstr &MI : MBB) { + for (const MachineOperand &MO : MI.operands()) { + if (MO.isReg() && MO.getReg() == W65816::X) { + MbbTouchesX = true; break; + } + } + if (MbbTouchesX) break; + // Also handle InstImplied ops that don't list X explicitly. + switch (MI.getOpcode()) { + case W65816::TAX: case W65816::TYX: case W65816::TSX: + case W65816::PLX: case W65816::TXA: case W65816::TXY: + case W65816::TXS: case W65816::PHX: case W65816::INX: + case W65816::DEX: + MbbTouchesX = true; break; + } + if (MI.isCall()) { MbbTouchesX = true; break; } + if (MbbTouchesX) break; + } + if (MbbTouchesX) continue; + // Successors with X as live-in? + bool SuccUsesX = false; + for (MachineBasicBlock *Succ : MBB.successors()) { + if (Succ->isLiveIn(W65816::X)) { SuccUsesX = true; break; } + } + if (SuccUsesX) continue; + // Walk forward looking for STA_DP / ... / LDA_DP / Bcond patterns. + SmallVector ToErase; + for (auto It = MBB.begin(); It != MBB.end(); ++It) { + if (It->getOpcode() != W65816::STA_DP) continue; + if (It->getNumOperands() < 1 || !It->getOperand(0).isImm()) continue; + int64_t StaAddr = It->getOperand(0).getImm(); + auto Walk = std::next(It); + MachineInstr *LdaMI = nullptr; + MachineInstr *BcondMI = nullptr; + bool blocked = false; + while (Walk != MBB.end()) { + if (Walk->isDebugInstr()) { ++Walk; continue; } + unsigned WO = Walk->getOpcode(); + if (Walk->isBranch() || Walk->isCall() || Walk->isReturn() || + Walk->isInlineAsm()) { + // If this is the Bcond AFTER our LDA, capture it. + if (LdaMI && (WO == W65816::BNE || WO == W65816::BEQ)) { + BcondMI = &*Walk; + } + break; + } + if (WO == W65816::LDA_DP && + Walk->getNumOperands() >= 1 && Walk->getOperand(0).isImm() && + Walk->getOperand(0).getImm() == StaAddr) { + LdaMI = &*Walk; + // Next non-debug must be BNE/BEQ. 
+ auto Next = std::next(Walk); + while (Next != MBB.end() && Next->isDebugInstr()) ++Next; + if (Next != MBB.end()) { + unsigned NO = Next->getOpcode(); + if (NO == W65816::BNE || NO == W65816::BEQ) BcondMI = &*Next; + } + break; + } + // Anything touching our DP slot bails. + for (const MachineOperand &MO : Walk->operands()) { + if (MO.isImm() && MO.getImm() == StaAddr) { + // Conservatively assume DP-op refs StaAddr unless we know + // it's a different opcode entirely. The dead-store-elim + // has similar logic but more refined; here we keep it + // simple: bail on any imm-matching op. + blocked = true; break; + } + } + if (blocked) break; + ++Walk; + } + if (blocked || !LdaMI || !BcondMI) continue; + // Global-use check: if any OTHER MBB references the DP slot, the + // STA we'd erase may be initializing it for a later use. Bail. + // Caught by sumOfSquares' counter at $D0 — entry-BB's STA_DP 208 + // initializes the countdown counter that bb.4 reads via DEC_DP. + bool referencedElsewhere = false; + for (MachineBasicBlock &OtherMBB : MF) { + if (&OtherMBB == &MBB) continue; + for (const MachineInstr &OtherMI : OtherMBB) { + for (const MachineOperand &MO : OtherMI.operands()) { + if (MO.isImm() && MO.getImm() == StaAddr) { + referencedElsewhere = true; break; + } + } + if (referencedElsewhere) break; + } + if (referencedElsewhere) break; + } + if (referencedElsewhere) continue; + // Replace STA_DP with TAX, LDA_DP with TXA. + const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo(); + BuildMI(MBB, It, It->getDebugLoc(), TII->get(W65816::TAX)); + BuildMI(MBB, LdaMI, LdaMI->getDebugLoc(), TII->get(W65816::TXA)); + ToErase.push_back(&*It); + ToErase.push_back(LdaMI); + } + for (MachineInstr *MI : ToErase) { + MI->eraseFromParent(); + Changed = true; + } + } + // Run elideStoreForwarding at the very end, AFTER IMG promotion has // committed slot assignments. 
Running this peephole earlier (with // the other early peepholes) cascades into different IMG-promotion diff --git a/src/llvm/lib/Target/W65816/W65816UnLSR.cpp b/src/llvm/lib/Target/W65816/W65816UnLSR.cpp index 8451bec..f283748 100644 --- a/src/llvm/lib/Target/W65816/W65816UnLSR.cpp +++ b/src/llvm/lib/Target/W65816/W65816UnLSR.cpp @@ -40,6 +40,7 @@ //===---------------------------------------------------------------------===// #include "W65816.h" +#include "llvm/ADT/DenseMap.h" #include "llvm/ADT/SmallPtrSet.h" #include "llvm/ADT/SmallVector.h" #include "llvm/Analysis/LoopInfo.h" @@ -82,6 +83,8 @@ public: private: bool processLoop(Loop *L); + bool processCounterToPtrPHIs(Loop *L); + bool processReturnedCounter(Loop *L); }; } // namespace @@ -103,17 +106,409 @@ bool W65816UnLSR::runOnFunction(Function &F) { bool Changed = false; for (Loop *L : LI) { Changed |= processLoop(L); + Changed |= processCounterToPtrPHIs(L); + // NOTE: processReturnedCounter (strLen-shape counter → ptr-difference + // at exit) is correct but produces a NET LOSS on strLen: without the + // counter PHI, the i32 pointer arithmetic falls back to clc+adc + // chains (16+ cyc/iter) instead of inc-A on the lo half (5 cyc/iter + // for ptr update + 5 for counter inc). See feedback memory. + // Disabled until codegen can use inc-DP for the lo half of a pointer + // PHI's increment without the SDAG materializing a full i32 add. // Recurse into nested loops. 
SmallVector Worklist(L->begin(), L->end()); while (!Worklist.empty()) { Loop *Sub = Worklist.pop_back_val(); Changed |= processLoop(Sub); + Changed |= processCounterToPtrPHIs(Sub); Worklist.append(Sub->begin(), Sub->end()); } } return Changed; } + +// strLen-style undo: LSR converts `return p - s` into a counter PHI +// `%lsr.iv` that increments per iter and is returned directly: +// %lsr.iv = phi i16 [-1, %entry], [%lsr.iv.next, %latch] +// %p.0 = phi ptr [%s, %entry], [%incdec.ptr, %latch] +// %incdec.ptr = getelementptr i8, %p.0, i32 1 +// %lsr.iv.next = add i16 %lsr.iv, 1 +// br ..., %exit, %loop +// %exit: +// ret i16 %lsr.iv.next +// +// LSR's reasoning: cheaper to maintain a counter than compute (p - s) +// at exit. On W65816 the opposite is true: counter inc per iter costs +// 5 cyc/iter * N iters; one-time sub at exit costs ~10 cyc total. +// +// This undo finds the counter PHI, verifies its only out-of-loop use +// is via LCSSA → return, finds the sibling pointer PHI with the same +// stride, and replaces the return value with +// `(i16)(p_lcssa - base) + (K_init + 1)`. Erases the counter PHI. +// +// Saves ~5 cyc/iter on strLen-shape loops with a returned counter. +bool W65816UnLSR::processReturnedCounter(Loop *L) { + BasicBlock *Header = L->getHeader(); + BasicBlock *Latch = L->getLoopLatch(); + BasicBlock *Preheader = L->getLoopPreheader(); + if (!Latch || !Preheader) return false; + + // Single-exit loop. + SmallVector ExitBlocks; + L->getExitBlocks(ExitBlocks); + if (ExitBlocks.size() != 1) return false; + BasicBlock *Exit = ExitBlocks[0]; + + // Find a candidate counter PHI: integer, init=ConstantInt, step=+1. 
+ PHINode *CounterPHI = nullptr; + ConstantInt *KInit = nullptr; + BinaryOperator *CounterStep = nullptr; + for (PHINode &PN : Header->phis()) { + if (!PN.getType()->isIntegerTy()) continue; + if (PN.getNumIncomingValues() != 2) continue; + Value *Init = nullptr, *Step = nullptr; + for (unsigned i = 0; i < PN.getNumIncomingValues(); ++i) { + BasicBlock *Pred = PN.getIncomingBlock(i); + if (L->contains(Pred)) Step = PN.getIncomingValue(i); + else Init = PN.getIncomingValue(i); + } + if (!Init || !Step) continue; + auto *InitC = dyn_cast<ConstantInt>(Init); + if (!InitC) continue; + auto *StepBO = dyn_cast<BinaryOperator>(Step); + if (!StepBO || StepBO->getOpcode() != Instruction::Add) continue; + Value *Other = nullptr; + if (StepBO->getOperand(0) == &PN) Other = StepBO->getOperand(1); + else if (StepBO->getOperand(1) == &PN) Other = StepBO->getOperand(0); + if (!Other) continue; + auto *StepCI = dyn_cast<ConstantInt>(Other); + if (!StepCI || !StepCI->isOne()) continue; + CounterPHI = &PN; + KInit = InitC; + CounterStep = StepBO; + break; + } + if (!CounterPHI) return false; + + // The counter PHI must be used INSIDE the loop only by its increment + // and OUTSIDE the loop only via an LCSSA PHI in the exit block that + // feeds a return. Same for the increment. + auto isOnlyInLoopUseTheStep = [&](Value *V) { + for (User *U : V->users()) { + auto *UI = dyn_cast<Instruction>(U); + if (!UI) return false; + if (!L->contains(UI)) continue; // out-of-loop is handled separately + if (UI == CounterStep) continue; + // The PHI itself is allowed (V might be CounterStep, used by + // CounterPHI's back-edge incoming). + if (UI == CounterPHI) continue; + return false; + } + return true; + }; + if (!isOnlyInLoopUseTheStep(CounterPHI)) return false; + if (!isOnlyInLoopUseTheStep(CounterStep)) return false; + + // Find a use of CounterPHI or CounterStep that's a ReturnInst. + // The use might be DIRECT (no LCSSA, which is common after LCSSA cleanup) + // or via an LCSSA PHI in the exit block.
+ ReturnInst *Ret = nullptr; + Value *RetSource = nullptr; // the value the ret reads + PHINode *ExitLCSSA = nullptr; // optional LCSSA PHI to erase + bool fromNext = false; // true if return source is CounterStep + auto findRet = [&](Value *V, bool isNext) -> bool { + for (User *U : V->users()) { + auto *UI = dyn_cast<Instruction>(U); + if (!UI) continue; + // Skip in-loop uses (those are the counter increment chain). + if (L->contains(UI->getParent())) continue; + if (auto *R = dyn_cast<ReturnInst>(UI)) { + if (R->getReturnValue() != V) continue; + Ret = R; RetSource = V; fromNext = isNext; return true; + } + // LCSSA PHI in the exit block? + if (auto *PN = dyn_cast<PHINode>(UI)) { + if (PN->getParent() != Exit) continue; + if (PN->getNumIncomingValues() != 1) continue; + if (PN->getIncomingValue(0) != V) continue; + if (!PN->hasOneUse()) continue; + auto *R = dyn_cast<ReturnInst>(PN->user_back()); + if (!R || R->getReturnValue() != PN) continue; + Ret = R; RetSource = V; fromNext = isNext; ExitLCSSA = PN; + return true; + } + } + return false; + }; + if (!findRet(CounterStep, true) && !findRet(CounterPHI, false)) + return false; + + // Find a sibling pointer PHI: init=Base, latch incoming is a + // `getelementptr i8, %ptr, 1` of itself.
+ PHINode *PtrPHI = nullptr; + Value *Base = nullptr; + GetElementPtrInst *PtrStep = nullptr; + for (PHINode &PN : Header->phis()) { + if (!PN.getType()->isPointerTy()) continue; + if (PN.getNumIncomingValues() != 2) continue; + Value *Init = nullptr, *Step = nullptr; + for (unsigned i = 0; i < PN.getNumIncomingValues(); ++i) { + BasicBlock *Pred = PN.getIncomingBlock(i); + if (L->contains(Pred)) Step = PN.getIncomingValue(i); + else Init = PN.getIncomingValue(i); + } + if (!Init || !Step) continue; + auto *StepGEP = dyn_cast<GetElementPtrInst>(Step); + if (!StepGEP) continue; + if (StepGEP->getPointerOperand() != &PN) continue; + if (StepGEP->getNumIndices() != 1) continue; + if (!StepGEP->getSourceElementType()->isIntegerTy(8)) continue; + auto *StrideCI = dyn_cast<ConstantInt>(StepGEP->getOperand(1)); + if (!StrideCI || !StrideCI->isOne()) continue; + PtrPHI = &PN; + Base = Init; + PtrStep = StepGEP; + break; + } + if (!PtrPHI) return false; + + // The pointer-PHI must have an LCSSA in the exit (so we can compute + // p_lcssa - base). Find it or create one. + PHINode *PtrLCSSA = nullptr; + for (PHINode &EPN : Exit->phis()) { + if (EPN.getNumIncomingValues() != 1) continue; + if (EPN.getIncomingValue(0) == PtrPHI) { + PtrLCSSA = &EPN; break; + } + } + if (!PtrLCSSA) { + // Create LCSSA for PtrPHI. + IRBuilder<> B(&Exit->front()); + PtrLCSSA = B.CreatePHI(PtrPHI->getType(), 1, "unlsr.p.lcssa"); + PtrLCSSA->addIncoming(PtrPHI, Latch); + } + + // Build the replacement value: + // (i16)(p_lcssa - base) + K_init + (fromNext ? 1 : 0) + // Derivation: let n >= 1 be the number of iterations entered, counting + // the exiting one. In the exiting iteration the counter PHI holds + // K_init + (n - 1), counter.next holds K_init + n, and the pointer PHI + // holds base + (n - 1) (stride 1 byte), so p_lcssa - base = n - 1. Hence + // counter = K_init + (p_lcssa - base) + // counter.next = K_init + (p_lcssa - base) + 1
+ // Example, strLen("a") with K_init = -1: + // iter 1: p.0 = s, *p = 'a' != 0, counter = -1, counter.next = 0. + // iter 2: p.0 = s+1, *p = 0, exit. counter = 0, counter.next = 1. + // p_lcssa - base = 1, so counter.next = -1 + 1 + 1 = 1. + + // The "return value type" we'll cast to. + Type *RetTy = Ret->getReturnValue()->getType(); + if (!RetTy->isIntegerTy()) return false; + Instruction *InsertPt = ExitLCSSA ? ExitLCSSA->getNextNode() : Ret; + IRBuilder<> B(InsertPt); + // (p_lcssa - base) as integer. + Value *PLcssaInt = B.CreatePtrToInt(PtrLCSSA, Type::getInt32Ty(Header->getContext()), "unlsr.plcssa.i"); + Value *BaseInt = B.CreatePtrToInt(Base, Type::getInt32Ty(Header->getContext()), "unlsr.base.i"); + Value *Diff = B.CreateSub(PLcssaInt, BaseInt, "unlsr.diff"); + // Truncate to counter type. + Value *DiffI = B.CreateTrunc(Diff, CounterPHI->getType(), "unlsr.diff.trunc");
+ // Constant offset from the derivation above: K_init when the counter + // PHI itself is returned, K_init + 1 when counter.next is returned. + int64_t Adjust = KInit->getSExtValue() + (fromNext ? 1 : 0); + Value *Result = DiffI; + if (Adjust != 0) { + Result = B.CreateAdd(DiffI, + ConstantInt::get(CounterPHI->getType(), Adjust), + "unlsr.result"); + } + // Cast to return type if different. + if (Result->getType() != RetTy) { + if (CounterPHI->getType()->getIntegerBitWidth() < + RetTy->getIntegerBitWidth()) + Result = B.CreateZExt(Result, RetTy); + else + Result = B.CreateTrunc(Result, RetTy); + } + // Replace the return. If there's an LCSSA PHI, replace it. Otherwise + // replace the direct use in `ret`. + if (ExitLCSSA) { + ExitLCSSA->replaceAllUsesWith(Result); + ExitLCSSA->eraseFromParent(); + } else { + Ret->setOperand(0, Result); + } + + // Erase the counter PHI and its increment. + CounterStep->replaceAllUsesWith(UndefValue::get(CounterPHI->getType())); + CounterPHI->replaceAllUsesWith(UndefValue::get(CounterPHI->getType())); + CounterStep->eraseFromParent(); + CounterPHI->eraseFromParent(); + return true; +} + + +// strcpy-style undo: LSR converts two pointer PHIs (`src.addr.0` and +// `d.0` each stepping by 1) into a single counter PHI (`lsr.iv`) plus +// GEPs `(base, counter)` per iter. On 65816 the counter+GEP form +// does an i32 (base + counter) add for each pointer on every iter, +// which is much more expensive than just incrementing two i16 pointer PHIs. +// +// Pattern (post-LSR): +// %lsr.iv = phi i32 [0, %entry], [%lsr.iv.next, %latch] +// %scevgep_i = getelementptr i8, ptr %base_i, i32 %lsr.iv (for each base_i) +// ... loads/stores via %scevgep_i ... +// %lsr.iv.next = add i32 %lsr.iv, 1 +// +// Where each %base_i is loop-invariant (typically a function arg). +// +// Rewrite: for each base_i, introduce a pointer PHI that strides by 1 +// per iter. Replace %scevgep_i with the new pointer PHI. If counter +// has no other uses, eliminate it.
+bool W65816UnLSR::processCounterToPtrPHIs(Loop *L) { + BasicBlock *Header = L->getHeader(); + BasicBlock *Latch = L->getLoopLatch(); + BasicBlock *Preheader = L->getLoopPreheader(); + if (!Latch || !Preheader) return false; + + // Find an integer counter PHI starting at 0 with step +1. + PHINode *Counter = nullptr; + Value *CounterNext = nullptr; + for (PHINode &PN : Header->phis()) { + if (!PN.getType()->isIntegerTy()) continue; + if (PN.getNumIncomingValues() != 2) continue; + Value *Init = nullptr, *Step = nullptr; + for (unsigned i = 0; i < PN.getNumIncomingValues(); ++i) { + BasicBlock *Pred = PN.getIncomingBlock(i); + if (L->contains(Pred)) Step = PN.getIncomingValue(i); + else Init = PN.getIncomingValue(i); + } + if (!Init || !Step) continue; + auto *InitCI = dyn_cast(Init); + if (!InitCI || !InitCI->isZero()) continue; + auto *StepBO = dyn_cast(Step); + if (!StepBO || StepBO->getOpcode() != Instruction::Add) continue; + Value *Other = nullptr; + if (StepBO->getOperand(0) == &PN) Other = StepBO->getOperand(1); + else if (StepBO->getOperand(1) == &PN) Other = StepBO->getOperand(0); + if (!Other) continue; + auto *StepCI = dyn_cast(Other); + if (!StepCI || !StepCI->isOne()) continue; + Counter = &PN; + CounterNext = StepBO; + break; + } + if (!Counter) return false; + + // Find GEPs `getelementptr i8, %base, %counter` (or %counter.next) + // where base is loop-invariant. Collect them and verify the counter + // has no OTHER uses outside this pattern. + SmallVector GEPs; + for (User *U : Counter->users()) { + if (U == CounterNext) continue; + auto *GEP = dyn_cast(U); + if (!GEP) return false; + if (GEP->getNumIndices() != 1) return false; + if (GEP->getOperand(1) != Counter) return false; + Value *Base = GEP->getPointerOperand(); + // base must be loop-invariant. Instructions inside the loop fail; + // arguments and globals are always invariant. 
+ if (auto *BaseI = dyn_cast(Base)) + if (L->contains(BaseI)) return false; + if (!Base->getType()->isPointerTy()) return false; + // Only handle the i8 element type (byte stride). Other strides + // would need different ptr-PHI step values. + if (!GEP->getSourceElementType()->isIntegerTy(8)) return false; + GEPs.push_back(GEP); + } + // Also accept if CounterNext is used as a GEP index (sometimes LSR + // uses the post-increment value). Walk those too. + for (User *U : CounterNext->users()) { + if (U == Counter) continue; + auto *GEP = dyn_cast(U); + if (GEP) { + // Bail if CounterNext is used as a GEP index — we'd need to add + // a +1 offset to the new pointer PHI to match. Keep this simple + // for now: only handle uses of Counter, not CounterNext. + if (GEP->getNumIndices() == 1 && GEP->getOperand(1) == CounterNext) + return false; + } + // Allow icmp / branch / other non-GEP uses of CounterNext — those + // are the loop's exit test, fine to leave alone. + } + if (GEPs.empty()) return false; + + // For each unique base, build a pointer PHI. + LLVMContext &Ctx = Header->getContext(); + Type *I8 = Type::getInt8Ty(Ctx); + DenseMap BasePhis; + for (GetElementPtrInst *GEP : GEPs) { + Value *Base = GEP->getPointerOperand(); + if (BasePhis.count(Base)) continue; + IRBuilder<> B(&Header->front()); + PHINode *PtrPHI = B.CreatePHI(Base->getType(), 2, "unlsr.ptr"); + PtrPHI->addIncoming(Base, Preheader); + // Build the step GEP in the latch (just before terminator). + IRBuilder<> BL(Latch->getTerminator()); + Value *PtrNext = BL.CreateGEP(I8, PtrPHI, + ConstantInt::get(Type::getInt16Ty(Ctx), 1), + "unlsr.ptr.next"); + PtrPHI->addIncoming(PtrNext, Latch); + BasePhis[Base] = PtrPHI; + } + + // Replace each GEP's uses with the corresponding pointer PHI. + for (GetElementPtrInst *GEP : GEPs) { + GEP->replaceAllUsesWith(BasePhis[GEP->getPointerOperand()]); + } + // Erase the now-dead GEPs. 
+ for (GetElementPtrInst *GEP : GEPs) { + if (GEP->use_empty()) GEP->eraseFromParent(); + } + + // If counter has no other uses (besides CounterNext and the latch + // incoming), eliminate it. CounterNext might still be used by the + // exit test — leave that alone. + bool counterDead = true; + for (User *U : Counter->users()) { + if (U == CounterNext) continue; + counterDead = false; + break; + } + if (counterDead) { + // CounterNext might be used by other PHIs / icmp. Don't erase if so. + bool counterNextHasOtherUses = false; + for (User *U : CounterNext->users()) { + if (U == Counter) continue; + counterNextHasOtherUses = true; + break; + } + if (!counterNextHasOtherUses) { + Type *IntT = Counter->getType(); + cast(CounterNext)->replaceAllUsesWith( + UndefValue::get(IntT)); + Counter->replaceAllUsesWith(UndefValue::get(IntT)); + cast(CounterNext)->eraseFromParent(); + Counter->eraseFromParent(); + } + } + return true; +} + bool W65816UnLSR::processLoop(Loop *L) { BasicBlock *Header = L->getHeader(); BasicBlock *Latch = L->getLoopLatch(); diff --git a/tests/lua/build/lapi.o b/tests/lua/build/lapi.o index 5d5029c..d29d349 100644 Binary files a/tests/lua/build/lapi.o and b/tests/lua/build/lapi.o differ diff --git a/tests/lua/build/lauxlib.o b/tests/lua/build/lauxlib.o index c0a87e9..4f38c39 100644 Binary files a/tests/lua/build/lauxlib.o and b/tests/lua/build/lauxlib.o differ diff --git a/tests/lua/build/lbaselib.o b/tests/lua/build/lbaselib.o index 546b04a..784c99b 100644 Binary files a/tests/lua/build/lbaselib.o and b/tests/lua/build/lbaselib.o differ diff --git a/tests/lua/build/lcode.o b/tests/lua/build/lcode.o index 93bf77b..80b7c34 100644 Binary files a/tests/lua/build/lcode.o and b/tests/lua/build/lcode.o differ diff --git a/tests/lua/build/ldebug.o b/tests/lua/build/ldebug.o index db21e4e..fb4cf74 100644 Binary files a/tests/lua/build/ldebug.o and b/tests/lua/build/ldebug.o differ diff --git a/tests/lua/build/ldo.o b/tests/lua/build/ldo.o index 
45b9e1e..271bbc2 100644 Binary files a/tests/lua/build/ldo.o and b/tests/lua/build/ldo.o differ diff --git a/tests/lua/build/lfunc.o b/tests/lua/build/lfunc.o index 781c16e..6fadc2c 100644 Binary files a/tests/lua/build/lfunc.o and b/tests/lua/build/lfunc.o differ diff --git a/tests/lua/build/lgc.o b/tests/lua/build/lgc.o index 4536466..9ad5893 100644 Binary files a/tests/lua/build/lgc.o and b/tests/lua/build/lgc.o differ diff --git a/tests/lua/build/llex.o b/tests/lua/build/llex.o index 54eef2a..7f75882 100644 Binary files a/tests/lua/build/llex.o and b/tests/lua/build/llex.o differ diff --git a/tests/lua/build/lparser.o b/tests/lua/build/lparser.o index 8563b60..0c4b465 100644 Binary files a/tests/lua/build/lparser.o and b/tests/lua/build/lparser.o differ diff --git a/tests/lua/build/lstate.o b/tests/lua/build/lstate.o index 1b5e978..5d47fa8 100644 Binary files a/tests/lua/build/lstate.o and b/tests/lua/build/lstate.o differ diff --git a/tests/lua/build/lstring.o b/tests/lua/build/lstring.o index baa30d7..7b41d58 100644 Binary files a/tests/lua/build/lstring.o and b/tests/lua/build/lstring.o differ diff --git a/tests/lua/build/lstrlib.o b/tests/lua/build/lstrlib.o index 1c0a3bc..1acae5b 100644 Binary files a/tests/lua/build/lstrlib.o and b/tests/lua/build/lstrlib.o differ diff --git a/tests/lua/build/ltable.o b/tests/lua/build/ltable.o index 25dae50..9e7a76d 100644 Binary files a/tests/lua/build/ltable.o and b/tests/lua/build/ltable.o differ diff --git a/tests/lua/build/ltm.o b/tests/lua/build/ltm.o index 108d774..7be0167 100644 Binary files a/tests/lua/build/ltm.o and b/tests/lua/build/ltm.o differ diff --git a/tests/lua/build/lundump.o b/tests/lua/build/lundump.o index 6b1ddfc..519f5c0 100644 Binary files a/tests/lua/build/lundump.o and b/tests/lua/build/lundump.o differ diff --git a/tests/lua/build/lvm.o b/tests/lua/build/lvm.o index 153d5f5..cb63580 100644 Binary files a/tests/lua/build/lvm.o and b/tests/lua/build/lvm.o differ
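The off-by-one reasoning behind `Adjust` in `processReturnedCounter` can be checked on the host with a small model of the strLen-shape loop. The following is a minimal sketch, not part of the pass: `LoopExit`, `runCounterLoop`, `fromPointerDiff`, and `identityHolds` are hypothetical names introduced only for illustration, and the model assumes the exit test runs after `counter.next` has been computed, as in the IR pattern quoted in the pass comments.

```cpp
// Host-side model of the strLen-shape loop; all names are hypothetical.
#include <cassert>
#include <cstdint>

struct LoopExit {
  int16_t counter;      // counter PHI value in the exiting iteration
  int16_t counterNext;  // counter + 1 in the exiting iteration
  const char *pLcssa;   // pointer PHI value in the exiting iteration
};

// Both PHIs advance once per iteration; the exit test runs after
// counter.next has been computed, mirroring the IR shape.
static LoopExit runCounterLoop(const char *s, int16_t kInit) {
  int16_t counter = kInit;
  const char *p = s;
  for (;;) {
    int16_t counterNext = static_cast<int16_t>(counter + 1);
    if (*p == '\0') return {counter, counterNext, p};
    ++p;
    counter = counterNext;
  }
}

// The replacement expression: (p_lcssa - base) + K_init, plus 1 when the
// returned value is counter.next rather than the counter PHI.
static int16_t fromPointerDiff(const char *pLcssa, const char *base,
                               int16_t kInit, bool fromNext) {
  auto diff = static_cast<int16_t>(pLcssa - base);
  return static_cast<int16_t>(diff + kInit + (fromNext ? 1 : 0));
}

// True when the pointer-difference form reproduces both counter values.
static bool identityHolds(const char *s, int16_t kInit) {
  LoopExit e = runCounterLoop(s, kInit);
  return e.counterNext == fromPointerDiff(e.pLcssa, s, kInit, true) &&
         e.counter == fromPointerDiff(e.pLcssa, s, kInit, false);
}
```

Calling `assert(identityHolds("a", -1));` from a test driver exercises the same strLen("a") case worked through in the pass comments; other initial values and string lengths can be checked the same way.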