diff --git a/STATUS.md b/STATUS.md new file mode 100644 index 0000000..3d65a76 --- /dev/null +++ b/STATUS.md @@ -0,0 +1,144 @@ +# llvm816 — Current Status + +LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from +llvm-mos as a separate `W65816` target. + +## What works + +End-to-end C-to-binary toolchain that produces 65816 machine code +which runs correctly under MAME (apple2gs). + +**Language coverage at -O2 (no extra flags):** + +- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod + (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos + + ASLA16 / shift libcalls. +- Comparisons and signed/unsigned widening (sext, zext, trunc) for all + the above sizes. +- Pointer arithmetic, array indexing, struct field access, struct + return-by-value (up to 8 bytes — Pair, Vec4, double). +- Bitfields, switch statements (verified up to ~12 cases + default), + function pointers, function-pointer tables, indirect calls via + `__jsl_indir` trampoline. +- Recursion: factorial, Fibonacci, depth-3 binary-tree + insert/sum/min/max, simple recursive quicksort. +- Loops with goto / break / continue, nested loops, state machines. +- `<stdarg.h>` varargs with int / long / unsigned long long mixed args. +- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list + reverse with `cons` works. +- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa + roundtrip. +- Soft-float (single): all four ops + comparisons, MAME-verified. +- Soft-double: add, sub, mul, div all return correct bit patterns; + 3-iter Newton sqrt converges. Long-running iterations may hit MAME's + 1-second sim-time budget (test config issue, not a compiler bug). +- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and + arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom). +- C++ minimal: clang++ compiles a class with virtual + non-trivial + ctor (vtable + RTTI omitted; no exceptions). 
+- printf with `%d %x %s %c %p` and width/precision specifiers. +- `setjmp` / `longjmp` from libgcc.s. +- Static constructors via crt0's init_array walk. + +**Toolchain:** + +- `clang` / `llc` produce W65816 assembly + ELF object files. +- `tools/link816` resolves cross-translation-unit refs, lays out + text/rodata/bss, emits a flat binary the IIgs ROM can load. +- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's + native object format) for round-tripping with classic dev tools. +- `runtime/build.sh` builds crt0, libc, soft-float, soft-double, + libgcc into linkable objects. +- `scripts/smokeTest.sh` runs ~80 end-to-end checks (scalar ops, + control flow, calling conventions, MAME execution, regressions). + Currently 100% pass. + +**ABI:** + +- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL + on the system stack with PHA. Caller deallocates via `tsc;clc;adc + #N;tcs` or `PLY*N/2`. +- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for + the highest 16 bits. +- Frame is empty-descending (S points to next-free); offsets account + for the +1 skew vs LLVM's full-descending model. + +## In flight (build-system level) + +- **DWARF sidecar emission in link816** (#51): The link should produce + a separate sidecar file with line-number / variable-location info + that an IDE or post-mortem dumper can consume. Skeleton not yet + written; deferred until other correctness work is done. + +## Known issues / workarounds + +- **Greedy register allocator mis-orders spills** in two patterns + (#69, #70): + 1. Functions where both `$a` and `$x` are live-in (i64-first-arg + with a stack-output pointer, e.g. `udivmod(i64, i64, ptr)`). + The TAX bridging `$x` to A clobbers `$a`'s value before the + second STA can save it. + 2. Iterative quicksort with `if/else` recursion choice: complex + live-ranges across two `swap()` calls produce wrong arg values. + + Both reproduce only at `-O1`/`-O2` with greedy. 
Workaround: + `-mllvm -regalloc=fast` for the affected translation unit. + `softDouble.c` already requires this flag for `__muldf3` (build.sh + applies it automatically). + + Real fix is a pre-RA pass that pre-spills critical pointer + arguments to memory, or a targeted fix in greedy's spill-ordering + heuristic. Material work; deferred. + +- **(d,s),y / (sr,s),y addressing wraps the bank** when Y is + negative as 16-bit unsigned. Worked around by `W65816NegYIndY` + rewriting the affected ops to `TAX ; LDA/STA $0000,X`. Stays + correct for negative offsets like `arr[i-1]`. + +- **(d,s),y for stack-local pointer dereferences uses DBR**, so + user code that switches DBR (e.g. `pha;plb` to bank 2 to reach + IIgs hardware) must not call into a function that takes the + address of one of its locals — the callee's `*p = v` will write + to the wrong bank. Documented; no compiler-side mitigation + beyond the existing DPF0 fake-physreg routing for the i64-return + high half. + +## What's still needed for a "ship-ready" toolchain + +- **Greedy regalloc spill-ordering fix** — see above. Removes the + need for the per-file `-regalloc=fast` workaround on + `softDouble.c` and unblocks pattern-rich code that currently + must be compiled at `-O0` for correctness. + +- **Round-to-nearest-even in `__divdf3`** — currently + truncate-toward-zero, which differs from gcc by ±1 ULP in + several test cases. Acceptable today (Newton iterations still + converge); revisit when an exact-match test suite lands. + +- **DWARF sidecar** (#51) for source-level debugging. + +- **More of the C standard library**: `<math.h>` transcendental + functions (sin, cos, exp, log, pow), `<string.h>` beyond what's + hand-coded, `<stdio.h>` file I/O (`fopen`, `fread`, `fwrite`, + `fseek`). + +- **C++ runtime support**: vtable layout for multiple inheritance, + RTTI, exceptions (or a documented `-fno-exceptions` requirement). 
+ +- **REP/SEP scheduling pass** (design doc §3.3): the current + prologue picks one M-mode for the whole function based on + whether any 8-bit accumulator value is used. A per-region + scheduler would reduce the SEP/REP wrap overhead on i8 stores. + +- **Toolbox / IIgs system call bindings**: header files declaring + the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`, + `DrawString`, …) with the right inline-asm dispatch glue. + +- **Real-world program coverage**: the smoke tests are + microbenchmarks. A few known-good Apple IIgs C programs (e.g. + a textfile pager, a small game) compiled and run end-to-end + would catch issues no synthetic test currently exercises. + +- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1 + says the goal is to "match or exceed" Calypsi. We have neither + baseline numbers nor a comparison harness yet. diff --git a/runtime/build.sh b/runtime/build.sh index dff9a7a..db71310 100755 --- a/runtime/build.sh +++ b/runtime/build.sh @@ -23,8 +23,10 @@ asm() { cc() { local c="$1" local o="$OUT/$(basename "${c%.c}").o" + local extra=("${@:2}") echo " CC $(basename "$c")" "$CLANG" -target w65816 -O2 -ffunction-sections \ + "${extra[@]}" \ -I"$PROJECT_ROOT/runtime/include" \ -c "$c" -o "$o" } @@ -33,6 +35,9 @@ asm "$SRC/crt0.s" asm "$SRC/libgcc.s" cc "$SRC/libc.c" cc "$SRC/softFloat.c" -cc "$SRC/softDouble.c" +# softDouble.c needs -regalloc=fast: __muldf3's 64x64 -> 128 mul + +# inlined alignment shifts overflows the greedy allocator on the +# single-A target. +cc "$SRC/softDouble.c" -mllvm -regalloc=fast echo "runtime built: $(ls -1 "$OUT"/*.o | wc -l) objects" diff --git a/runtime/src/libgcc.s b/runtime/src/libgcc.s index a96977b..f34b22e 100644 --- a/runtime/src/libgcc.s +++ b/runtime/src/libgcc.s @@ -673,19 +673,30 @@ __divmodsi_setup: ; setup; signed variants flip signs around it. 
; -------------------------------------------------------------------- __divmoddi4_stash: + ; Called via JSR from another libgcc helper that was itself + ; called via JSL. Stack layout inside this routine: + ; slot 1..2 = JSR return address (2 bytes, same-bank) + ; slot 3..5 = JSL return address (3 bytes, long) + ; slot 6..7 = first 16-bit stack arg (caller's first push) + ; slot 8..9 = second + ; ... etc. + ; Earlier code read slots 4, 6, 8, 10, 12, 14 — which lands on + ; the JSL ret address bytes, treating them as args. Caught by + ; `u64mul(0x12, 0x12)` returning the result at $E2 (mid-low) + ; instead of $E0 (lo) plus 0x678-shaped garbage at $E6. sta 0xe0 ; a_lo_lo stx 0xe2 ; a_lo_hi - lda 0x4, s - sta 0xe4 ; a_hi_lo lda 0x6, s - sta 0xe6 ; a_hi_hi + sta 0xe4 ; a_hi_lo lda 0x8, s - sta 0xe8 ; b_lo_lo + sta 0xe6 ; a_hi_hi lda 0xa, s - sta 0xea ; b_lo_hi + sta 0xe8 ; b_lo_lo lda 0xc, s - sta 0xec ; b_hi_lo + sta 0xea ; b_lo_hi lda 0xe, s + sta 0xec ; b_hi_lo + lda 0x10, s sta 0xee ; b_hi_hi rts @@ -805,19 +816,28 @@ __muldi3: ; Loop 64 times on a's bits. ldy #0x40 .Lmuldi_loop: - ; Test bit 0 of a (= LSR a; C = old bit 0). - lda 0xe0 + ; Right-shift the 64-bit `a` by 1. $E0=lo..$E6=hi (matches the + ; stash + __retdi convention). Must shift HI first (LSR loses + ; bit 63 of $E6) so each ROR carries the previous half's bit 0 + ; INTO the top of the next-LOWER half — that's the actual + ; right-shift direction in a $E0=lo layout. After the chain, + ; C = orig $E0_b0 = bit 0 of the 64-bit value, which drops out + ; and is what we want to BCC on. The earlier code shifted lo + ; first which ran the shift in the WRONG direction (lo → hi) + ; and tested $E6_b0 (bit 48) instead of bit 0 — every multiply + ; involving bits 16+ came back garbage. 
+ lda 0xe6 lsr a - sta 0xe0 - lda 0xe2 - ror a - sta 0xe2 + sta 0xe6 lda 0xe4 ror a sta 0xe4 - lda 0xe6 + lda 0xe2 ror a - sta 0xe6 + sta 0xe2 + lda 0xe0 + ror a + sta 0xe0 bcc .Lmuldi_noadd ; Add b ($E8..$EE) to product ($F2..$F8). clc diff --git a/runtime/src/softDouble.c b/runtime/src/softDouble.c index 88af25d..97cc8e5 100644 --- a/runtime/src/softDouble.c +++ b/runtime/src/softDouble.c @@ -111,16 +111,25 @@ u64 __negdf2(u64 a) { return a ^ DSIGN_BIT; } -u64 __muldf3(u64 a, u64 b) { - u64 sa, sb, ma, mb; - s16 ea, eb; - u16 ca = dclass(a, &sa, &ea, &ma); - u16 cb = dclass(b, &sb, &eb, &mb); - u64 sr = sa ^ sb; - if (ca == 0 || cb == 0) return sr; - // Truncated 64*64 → high-64 product via 32*32 partials. We only - // need the upper bits of the 106-bit product because the mantissas - // are 53 bits each. +// Aligned product mantissa plus a carry flag. `carry` is set when the +// leading bit of the 106-bit product was at bit 105 (caller must +// increment the exponent). +typedef struct { + u64 mantissa; + u16 carry; +} MantCarryT; + +// 64x64 -> 128-bit product of two 53-bit mantissas. Returning the +// full product as a packed u64 pair is not possible, so we shift +// in-line and return the mantissa aligned to bit 52 directly, with +// the carry flag in *out_carry. Splitting keeps register pressure low +// enough for greedy regalloc on the single-A W65816. +// +// Inlinable on purpose: passing a pointer to a stack local across a +// noinline boundary lowers to `sta (d,s),y` which uses DBR-relative +// addressing, which is broken under DBR != 0 (e.g. after a bank switch). +// Keeping these inline keeps the stores within the caller's frame. 
+static inline u64 mulhi64Aligned(u64 ma, u64 mb, u16 *out_carry) { u32 alo = (u32)ma; u32 ahi = (u32)(ma >> 32); u32 blo = (u32)mb; @@ -131,16 +140,26 @@ u64 __muldf3(u64 a, u64 b) { u64 hh = (u64)ahi * (u64)bhi; u64 mid = lh + hl + (ll >> 32); u64 prod_hi = hh + (mid >> 32); - s16 er = ea + eb; - while (prod_hi & ~(DMANT_LEAD | DMANT_MASK)) { - prod_hi >>= 1; - er++; + u64 prod_lo = (ll & 0xFFFFFFFFULL) | ((mid & 0xFFFFFFFFULL) << 32); + if (prod_hi & (1ULL << 41)) { + *out_carry = 1; + return (prod_hi << 11) | (prod_lo >> 53); } - while ((prod_hi & DMANT_LEAD) == 0 && prod_hi != 0) { - prod_hi <<= 1; - er--; - } - return dpack(sr, er, prod_hi); + *out_carry = 0; + return (prod_hi << 12) | (prod_lo >> 52); +} + +u64 __muldf3(u64 a, u64 b) { + u64 sa, sb, ma, mb; + s16 ea, eb; + u16 ca = dclass(a, &sa, &ea, &ma); + u16 cb = dclass(b, &sb, &eb, &mb); + u64 sr = sa ^ sb; + if (ca == 0 || cb == 0) return sr; + u16 carry; + u64 mr = mulhi64Aligned(ma, mb, &carry); + s16 er = ea + eb + (s16)carry; + return dpack(sr, er, mr); } u64 __divdf3(u64 a, u64 b) { @@ -151,26 +170,29 @@ u64 __divdf3(u64 a, u64 b) { u64 sr = sa ^ sb; if (ca == 0) return sr; if (cb == 0) return sr | DEXP_MASK; // div-by-zero → inf - // Long division: shift a left by 11 to make room for quotient bits. - u64 q = 0; - u64 r = ma; - for (int i = 0; i < 53; i++) { + // Long division: handle the leading quotient bit explicitly (since + // we need to "consume" the dividend's leading 1 by subtracting), + // then generate 52 more fractional bits by shifting r left and + // testing. The previous shift-and-test-only loop over-counted + // when r == mb after subtraction (e.g. 2.0/1.0 returned ~4.0). + s16 er = ea - eb; + // Normalize so the dividend is in [mb, 2*mb). This ensures the + // leading quotient bit will land at position 52 below. + if (ma < mb) { + ma <<= 1; + er--; + } + // Handle the leading quotient bit explicitly. 
+ u64 q = DMANT_LEAD; + u64 r = ma - mb; + // Compute 52 more fractional bits via standard shift-test-subtract. + for (int i = 51; i >= 0; i--) { r <<= 1; - q <<= 1; if (r >= mb) { r -= mb; - q |= 1; + q |= (1ULL << i); } } - s16 er = ea - eb; - while (q & ~(DMANT_LEAD | DMANT_MASK)) { - q >>= 1; - er++; - } - while ((q & DMANT_LEAD) == 0 && q != 0) { - q <<= 1; - er--; - } return dpack(sr, er, q); } diff --git a/scripts/smokeTest.sh b/scripts/smokeTest.sh index 935dd26..b862f9f 100755 --- a/scripts/smokeTest.sh +++ b/scripts/smokeTest.sh @@ -1104,7 +1104,10 @@ int toInt(double x) { return (int)x; } double fromInt(int n) { return (double)n; } EOF "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblFile" -o "$oDblFile" - "$CLANG" --target=w65816 -O2 -ffunction-sections \ + # softDouble.c uses -regalloc=fast because __muldf3's 64x64 -> 128 + # multiply with the inlined alignment shifts overflows the greedy + # allocator's spill heuristics on the single-A target. + "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \ -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdFile" "$PROJECT_ROOT/tools/link816" -o "$binDblFile" \ --text-base 0x8000 --map "$mapDblFile" \ @@ -1281,7 +1284,7 @@ int main(void) { } EOF "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblMame" -o "$oDblMame" - "$CLANG" --target=w65816 -O2 -ffunction-sections \ + "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \ -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdMame" "$PROJECT_ROOT/tools/link816" -o "$binDblMame" \ --text-base 0x1000 \ @@ -1402,7 +1405,7 @@ EOF -c "$PROJECT_ROOT/runtime/src/libc.c" -o "$oLibcF" "$CLANG" --target=w65816 -O2 -ffunction-sections \ -c "$PROJECT_ROOT/runtime/src/softFloat.c" -o "$oSfF" - "$CLANG" --target=w65816 -O2 -ffunction-sections \ + "$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \ -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdF" oCrt0F="$(mktemp --suffix=.o)" 
"$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc" -arch=w65816 \ @@ -1708,9 +1711,10 @@ EOF fi rm -f "$cP2File" "$oP2File" "$binP2File" - # Bubble sort with the loop form that compiles correctly - # (i=1..n; inner j+1 arr[j+1]) { unsigned short t = arr[j]; arr[j] = arr[j+1]; @@ -1752,8 +1756,507 @@ EOF 0x025004=0003 0x025006=0004 >/dev/null 2>&1; then die "MAME: bubbleSort([4,1,3,2]) != [1,2,3,4]" fi - rm -f "$cBsFile" "$oBsFile" "$binBsFile" \ - "$oLibcF" "$oSfF" "$oSdF" "$oCrt0F" + rm -f "$cBsFile" "$oBsFile" "$binBsFile" + + # printf("ABCDE") returns 5. Canary for the BranchExpand + # leftover-BRA-Skip bug: without removing the original BRA + # after rewriting Bxx to INV_Bxx, the inserted Bridge MBB + # becomes unreachable and the conditional flow is lost. Also + # exercises vprintf's main loop end-to-end (no varargs). + log "check: MAME runs printf('ABCDE') → 5 (BranchExpand bridge regression)" + cPfFile="$(mktemp --suffix=.c)" + oPfFile="$(mktemp --suffix=.o)" + binPfFile="$(mktemp --suffix=.bin)" + cat > "$cPfFile" <<'EOF' +#include <stdio.h> +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +int main(void) { + int r = printf("ABCDE"); + switchToBank2(); + *(volatile unsigned short *)0x5000 = (unsigned short)r; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections \ + -I"$PROJECT_ROOT/runtime/include" -c "$cPfFile" -o "$oPfFile" + "$PROJECT_ROOT/tools/link816" -o "$binPfFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPfFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \ + "$binPfFile" 0x025000 0005 >/dev/null 2>&1; then + die "MAME: printf('ABCDE') != 5 (BranchExpand bridge regression)" + fi + rm -f "$cPfFile" "$oPfFile" "$binPfFile" + + # parse('BCDE') with switch-on-spec — used to fail to link with + # PCREL8-out-of-range because long unconditional BRA didn't + # auto-relax to BRL. 
W65816BranchExpand now force-promotes + # long BRA to BRL. + log "check: MAME runs nested-loop+multiply f(4) → 120 (regalloc + BRA-relax)" + cFnFile="$(mktemp --suffix=.c)" + oFnFile="$(mktemp --suffix=.o)" + binFnFile="$(mktemp --suffix=.bin)" + cat > "$cFnFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) unsigned short f(unsigned short n) { + unsigned short s = 0; + for (unsigned short i = 0; i < n; i++) + for (unsigned short j = 0; j < n; j++) + s += i*n+j; + return s; +} +int main(void) { + unsigned short r = f(4); + switchToBank2(); + *(volatile unsigned short *)0x5000 = r; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cFnFile" -o "$oFnFile" + "$PROJECT_ROOT/tools/link816" -o "$binFnFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFnFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \ + "$binFnFile" 0x025000 0078 >/dev/null 2>&1; then + die "MAME: f(4) != 120 (regalloc + BRA-relax regression)" + fi + rm -f "$cFnFile" "$oFnFile" "$binFnFile" + + # u64add through a noinline boundary — exercises the + # ADJCALLSTACKUP teardown's STA $E0 / LDA $E0 path that + # preserves Y across the SP-restore. The earlier PLY*N/2 + # implementation clobbered Y, so any i64 return came back + # with the last popped arg in Y instead of the sum's mid-high. + # Recursive u64 factorial — exercises __muldi3 + i64 ABI through + # a recursive noinline boundary. 20! = 0x21c3_677c_82b4_0000. + # Used to come back as garbage because __divmoddi4_stash read + # caller args from slot 4 when it was actually JSR-called from + # __muldi3 (so slot 4 was the JSL ret address byte, not a_mh). + # dadd through a noinline boundary — exercises __adddf3 + the + # full i64-return ABI through a real call. 
The earlier soft- + # double smoke test ran `c = 1.5 + 2.5` inline, which clang + # constant-folds to a literal 0x4010... bit pattern — never + # actually executed __adddf3. This one calls a noinline + # `dadd` so the libcall and the i64 ABI run end-to-end. + # printf("%d", n) — used to crash MAME entirely because MachineCSE + # eliminated the `if (isLong)` re-test of *fmt as a "redundant" + # CMP (it had matched an earlier identical CMP), and the + # surviving BNE then read whatever leftover P-flag state happened + # to be in P from the last spec-dispatch CMP. Backend now + # disables MachineCSE entirely. + log "check: MAME runs printf('%%d %%d', 42, 99) chain (MachineCSE disable)" + cPdFile="$(mktemp --suffix=.c)" + oPdFile="$(mktemp --suffix=.o)" + binPdFile="$(mktemp --suffix=.bin)" + cat > "$cPdFile" <<'EOF' +#include <stdio.h> +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) int give42(void) { return 42; } +int main(void) { + // vprintf returns the increment count: 1 per format spec, 1 per + // non-spec char. "Hi %d ok\n" → H,i,' ',%d,' ',o,k,'\n' = 8. + int n = printf("Hi %d ok\n", give42()); + switchToBank2(); + *(volatile unsigned short *)0x5000 = (unsigned short)n; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections \ + -I"$PROJECT_ROOT/runtime/include" -c \ + "$cPdFile" -o "$oPdFile" + "$PROJECT_ROOT/tools/link816" -o "$binPdFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPdFile" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binPdFile" 0x025000 0008 \ + >/dev/null 2>&1; then + die "MAME: printf('Hi %d ok\\n', 42) != 8 (vprintf isLong / MachineCSE)" + fi + rm -f "$cPdFile" "$oPdFile" "$binPdFile" + + log "check: MAME runs noinline dadd(1.5,2.5) → 4.0 (__adddf3 + i64 ABI)" + cDdFile="$(mktemp --suffix=.c)" + oDdFile="$(mktemp --suffix=.o)" + binDdFile="$(mktemp --suffix=.bin)" + cat > "$cDdFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) double dadd(double a, double b) { return a + b; } +int main(void) { + union { double d; unsigned short w[4]; } u; + u.d = dadd(1.5, 2.5); + switchToBank2(); + *(volatile unsigned short *)0x5000 = u.w[0]; + *(volatile unsigned short *)0x5002 = u.w[1]; + *(volatile unsigned short *)0x5004 = u.w[2]; + *(volatile unsigned short *)0x5006 = u.w[3]; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cDdFile" -o "$oDdFile" + "$PROJECT_ROOT/tools/link816" -o "$binDdFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdFile" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDdFile" --check \ + 0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4010 \ + >/dev/null 2>&1; then + die "MAME: noinline dadd(1.5,2.5) != 4.0 (i64-ABI through libcall)" + fi + rm -f "$cDdFile" "$oDdFile" "$binDdFile" + + log "check: MAME runs fact_u64(20) → 0x21c3677c82b40000 (__muldi3 stash slots)" + cFkFile="$(mktemp --suffix=.c)" + oFkFile="$(mktemp --suffix=.o)" + binFkFile="$(mktemp --suffix=.bin)" + cat > "$cFkFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) unsigned long long fact_u64(unsigned int n) { + if (n <= 1) return 1ULL; + return (unsigned long long)n * fact_u64(n - 1); +} +int main(void) { + unsigned long long r = fact_u64(20); + union { unsigned long long u; unsigned short w[4]; } u; + u.u = r; + switchToBank2(); + *(volatile unsigned short *)0x5000 = u.w[0]; + *(volatile unsigned short *)0x5002 = u.w[1]; + *(volatile unsigned short *)0x5004 = u.w[2]; + *(volatile unsigned short *)0x5006 = u.w[3]; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cFkFile" -o "$oFkFile" + "$PROJECT_ROOT/tools/link816" -o "$binFkFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFkFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binFkFile" --check \ + 0x025000=0000 0x025002=82b4 0x025004=677c 0x025006=21c3 \ + >/dev/null 2>&1; then + die "MAME: fact_u64(20) returned wrong bits (__muldi3 / stash slots)" + fi + rm -f "$cFkFile" "$oFkFile" "$binFkFile" + + log "check: MAME runs u64add(0x3FF8...,0x4004...) → 0x7FFC... 
(call-up Y-preserve)" + cU64File="$(mktemp --suffix=.c)" + oU64File="$(mktemp --suffix=.o)" + binU64File="$(mktemp --suffix=.bin)" + cat > "$cU64File" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) unsigned long long u64add(unsigned long long a, unsigned long long b) { + return a + b; +} +int main(void) { + unsigned long long c = u64add(0x3FF8000000000000ULL, 0x4004000000000000ULL); + union { unsigned long long u; unsigned short w[4]; } u; + u.u = c; + switchToBank2(); + *(volatile unsigned short *)0x5000 = u.w[0]; + *(volatile unsigned short *)0x5002 = u.w[1]; + *(volatile unsigned short *)0x5004 = u.w[2]; + *(volatile unsigned short *)0x5006 = u.w[3]; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cU64File" -o "$oU64File" + "$PROJECT_ROOT/tools/link816" -o "$binU64File" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oU64File" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binU64File" --check \ + 0x025000=0000 0x025002=0000 0x025004=0000 0x025006=7ffc \ + >/dev/null 2>&1; then + die "MAME: u64add through noinline returned wrong middle halves (call-up Y-clobber)" + fi + rm -f "$cU64File" "$oU64File" "$binU64File" + + log "check: MAME runs addOff(p,1) p[0]+=p[1] → 12 (StackSlotCleanup killed-Y respect)" + cAofFile="$(mktemp --suffix=.c)" + oAofFile="$(mktemp --suffix=.o)" + binAofFile="$(mktemp --suffix=.bin)" + cat > "$cAofFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) short addOff(short *p, short i) { + short b = p[i]; + p[i-1] = p[i-1] + b; + return p[i-1]; +} +int main(void) { + short stk[2] = { 5, 7 }; + short r = addOff(stk, 1); + short s0 = stk[0]; + switchToBank2(); + *(volatile unsigned short *)0x5000 = (unsigned short)r; + *(volatile unsigned short *)0x5002 = (unsigned short)s0; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cAofFile" -o "$oAofFile" + "$PROJECT_ROOT/tools/link816" -o "$binAofFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oAofFile" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" \ + "$binAofFile" --check 0x025000=000c 0x025002=000c \ + >/dev/null 2>&1; then + die "MAME: addOff p[i-1]+=p[i] returned wrong store (NegYIndY/X-clobber or LDY-erase)" + fi + rm -f "$cAofFile" "$oAofFile" "$binAofFile" + + log "check: MAME runs sqr(10) → 100 (frame-less ADJCALLSTACKUP must emit PLY)" + cSqrFile="$(mktemp --suffix=.c)" + oSqrFile="$(mktemp --suffix=.o)" + binSqrFile="$(mktemp --suffix=.bin)" + cat > "$cSqrFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) unsigned short sqr(unsigned short x) { return x * x; } +int main(void) { + unsigned short r = sqr(10); + switchToBank2(); + *(volatile unsigned short *)0x5000 = r; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cSqrFile" -o "$oSqrFile" + "$PROJECT_ROOT/tools/link816" -o "$binSqrFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqrFile" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" \ + "$binSqrFile" --check 0x025000=0064 >/dev/null 2>&1; then + die "MAME: sqr(10) crashed or != 100 (ADJCALLSTACKUP not emitting PLY for frame-less)" + fi + rm -f "$cSqrFile" "$oSqrFile" "$binSqrFile" + + log "check: MAME runs ddiv(8.0,4.0) → 2.0 (__divdf3 algorithm fix)" + cDdvFile="$(mktemp --suffix=.c)" + oDdvFile="$(mktemp --suffix=.o)" + binDdvFile="$(mktemp --suffix=.bin)" + cat > "$cDdvFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) double ddiv(double a, double b) { return a / b; } +int main(void) { + union { double d; unsigned short w[4]; } u; + u.d = ddiv(8.0, 4.0); + switchToBank2(); + *(volatile unsigned short *)0x5000 = u.w[0]; + *(volatile unsigned short *)0x5002 = u.w[1]; + *(volatile unsigned short *)0x5004 = u.w[2]; + *(volatile unsigned short *)0x5006 = u.w[3]; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cDdvFile" -o "$oDdvFile" + "$PROJECT_ROOT/tools/link816" -o "$binDdvFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdvFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \ + "$binDdvFile" --check 0x025000=0000 0x025002=0000 \ + 0x025004=0000 0x025006=4000 >/dev/null 2>&1; then + die "MAME: ddiv(8,4) != 2.0 (__divdf3 long-division bug)" + fi + rm -f "$cDdvFile" "$oDdvFile" "$binDdvFile" + + log "check: MAME runs Newton-iter loop → high-half ~1.41 (BranchExpand self-loop BRA fix)" + cSqFile="$(mktemp --suffix=.c)" + oSqFile="$(mktemp --suffix=.o)" + binSqFile="$(mktemp --suffix=.bin)" + # 3-iter Newton-method sqrt with a counted for-loop (the loop-back + # BRA is a self-loop, which the BranchExpand distance estimator + # used to report as 0 bytes, so it never promoted to BRL even + # when the loop body grew well past +/-128 bytes). 
+ cat > "$cSqFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) double sqrt3(double x) { + double g = x * 0.5; + for (unsigned short i = 0; i < 3; i++) + g = (g + x / g) * 0.5; + return g; +} +int main(void) { + union { double d; unsigned short w[4]; } u; + u.d = sqrt3(2.0); + switchToBank2(); + // Only the high half is precision-stable (low halves vary slightly + // due to truncation vs round-to-nearest in __divdf3). Verify just + // the high half — that's enough to prove the self-loop BRA was + // promoted (the link would have failed otherwise) and __divdf3 is + // converging to the right magnitude. + *(volatile unsigned short *)0x5006 = u.w[3]; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cSqFile" -o "$oSqFile" + "$PROJECT_ROOT/tools/link816" -o "$binSqFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqFile" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" \ + "$binSqFile" --check 0x025006=3ff6 >/dev/null 2>&1; then + die "MAME: sqrt3(2.0) high half wrong (self-loop BRA / __divdf3)" + fi + rm -f "$cSqFile" "$oSqFile" "$binSqFile" + + log "check: MAME runs -O0 addOne(7) → 8 (lda-overwrite-immediate fix; fast regalloc)" + cO0File="$(mktemp --suffix=.c)" + oO0File="$(mktemp --suffix=.o)" + binO0File="$(mktemp --suffix=.bin)" + cat > "$cO0File" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +unsigned short addOne(unsigned short a) { return a + 1; } +int main(void) { + unsigned short r = addOne(7); + switchToBank2(); + *(volatile unsigned short *)0x5000 = r; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O0 -ffunction-sections -c \ + "$cO0File" -o "$oO0File" + "$PROJECT_ROOT/tools/link816" -o "$binO0File" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oO0File" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" \ + "$binO0File" --check 0x025000=0008 >/dev/null 2>&1; then + die "MAME: -O0 addOne(7) != 8 (lda overwrite immediate / regalloc choice)" + fi + rm -f "$cO0File" "$oO0File" "$binO0File" + + log "check: MAME runs bubble sort with mySwap helper [4,1,3,2] → [1,2,3,4] (greedy across helper-call)" + cBshFile="$(mktemp --suffix=.c)" + oBshFile="$(mktemp --suffix=.o)" + binBshFile="$(mktemp --suffix=.bin)" + cat > "$cBshFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +unsigned short bsdata[4] = { 4, 1, 3, 2 }; +__attribute__((noinline)) void mySwap(unsigned short *a, unsigned short *b) { + unsigned short t = *a; *a = *b; *b = t; +} +__attribute__((noinline)) void mySort(unsigned short *arr, unsigned short n) { + for (unsigned short i = 0; i < n - 1; i++) + for (unsigned short j = 0; j < n - i - 1; j++) + if (arr[j] > arr[j+1]) + mySwap(&arr[j], &arr[j+1]); +} +int main(void) { + mySort(bsdata, 4); + unsigned short d0 = bsdata[0], d1 = bsdata[1], d2 = bsdata[2], d3 = bsdata[3]; + switchToBank2(); + *(volatile unsigned short *)0x5000 = d0; + *(volatile unsigned short *)0x5002 = d1; + *(volatile unsigned short *)0x5004 = d2; + *(volatile unsigned short *)0x5006 = d3; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cBshFile" -o "$oBshFile" + "$PROJECT_ROOT/tools/link816" -o "$binBshFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oBshFile" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" \ + "$binBshFile" --check 0x025000=0001 0x025002=0002 \ + 0x025004=0003 0x025006=0004 >/dev/null 2>&1; then + die "MAME: mySort with mySwap helper miscompiled (greedy regalloc across call)" + fi + rm -f "$cBshFile" "$oBshFile" "$binBshFile" + + log "check: MAME runs dmul(8.0,2.0) AFTER bank-switch → 16.0 (DPF0 store + __muldf3)" + cDmFile="$(mktemp --suffix=.c)" + oDmFile="$(mktemp --suffix=.o)" + binDmFile="$(mktemp --suffix=.bin)" + cat > "$cDmFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) double dmul(double a, double b) { return a * b; } +int main(void) { + union { double d; unsigned short w[4]; } u; + switchToBank2(); + u.d = dmul(8.0, 2.0); + *(volatile unsigned short *)0x5000 = u.w[0]; + *(volatile unsigned short *)0x5002 = u.w[1]; + *(volatile unsigned short *)0x5004 = u.w[2]; + *(volatile unsigned short *)0x5006 = u.w[3]; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cDmFile" -o "$oDmFile" + "$PROJECT_ROOT/tools/link816" -o "$binDmFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmFile" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmFile" --check \ + 0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \ + >/dev/null 2>&1; then + die "MAME: dmul(8,2) under DBR=2 produced wrong bits (DPF0 store / __muldf3)" + fi + rm -f "$cDmFile" "$oDmFile" "$binDmFile" + + log "check: MAME runs dmath = (a+b)*(a-b), 5,3 → 16.0 (chained libcall ABI)" + cDmaFile="$(mktemp --suffix=.c)" + oDmaFile="$(mktemp --suffix=.o)" + binDmaFile="$(mktemp --suffix=.bin)" + cat > "$cDmaFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) double dadd(double a, double b) { return a + b; } +__attribute__((noinline)) double dsub(double a, double b) { return a - b; } +__attribute__((noinline)) double dmul(double a, double b) { return a * b; } +__attribute__((noinline)) double dmath(double a, double b) { + return dmul(dadd(a, b), dsub(a, b)); +} +int main(void) { + union { double d; unsigned short w[4]; } u; + u.d = dmath(5.0, 3.0); + switchToBank2(); + *(volatile unsigned short *)0x5000 = u.w[0]; + *(volatile unsigned short *)0x5002 = u.w[1]; + *(volatile unsigned short *)0x5004 = u.w[2]; + *(volatile unsigned short *)0x5006 = u.w[3]; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cDmaFile" -o "$oDmaFile" + "$PROJECT_ROOT/tools/link816" -o "$binDmaFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmaFile" \ + >/dev/null 2>&1 + if ! 
bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmaFile" --check \ + 0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \ + >/dev/null 2>&1; then + die "MAME: dmath(5,3) returned wrong high half (DP[\$F0] CSE across libcalls)" + fi + rm -f "$cDmaFile" "$oDmaFile" "$binDmaFile" + + rm -f "$oLibcF" "$oSfF" "$oSdF" "$oCrt0F" else warn "MAME or apple2gs ROMs not installed; skipping end-to-end test" fi diff --git a/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp b/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp index 17c6dcf..562af1d 100644 --- a/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp +++ b/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp @@ -131,6 +131,7 @@ static bool clobbersImg(const MachineInstr &MI, bool W65816ABridgeViaX::runOnMachineFunction(MachineFunction &MF) { if (!MF.getRegInfo().getNumVirtRegs()) return false; + if (MF.getFunction().hasOptNone()) return false; MachineRegisterInfo &MRI = MF.getRegInfo(); const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>(); const W65816InstrInfo *TII = STI.getInstrInfo(); diff --git a/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp b/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp index 7ba68b3..043f08b 100644 --- a/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp +++ b/src/llvm/lib/Target/W65816/W65816AsmPrinter.cpp @@ -83,21 +83,71 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) { switch (MI->getOpcode()) { default: break; - case W65816::ADJCALLSTACKDOWN: + case W65816::ADJCALLSTACKDOWN: { + // DOWN is a no-op in our scheme — the PUSH16 sequence in LowerCall + // already shifted SP incrementally as args were pushed. Nothing + // to emit; PEI may or may not have processed it, either is fine. + return; + } case W65816::ADJCALLSTACKUP: { - // PEI's eliminateCallFramePseudoInstr removes these *only* when the - // function has frame work (StackSize > 0 or any FrameIndex use). - // Functions that just tail-call into a libcall (e.g. 
`int toInt(float - // x) { return (int)x; }` lowers to a single jsl __fixsfsi) have - // neither; PEI skips its call-frame phase and the pseudo survives - // to MC. AsmStreamer renders the pseudo's "# ADJCALLSTACK..." - // string as a comment, but MCObjectStreamer asks the encoder to - // emit bytes — which fails ("Unsupported instruction MCInst 337"). - // Dropping it here is correct: when amt is zero (the "no frame" - // path) the call sequence is a no-op anyway; when non-zero, PEI - // would have replaced it with PLA-loop / TSC-ADC sequence already. - // If we ever see a non-zero amount slip through, that's a real - // bug — emit nothing and trust the comment-stripped path. + // PEI's eliminateCallFramePseudoInstr handles UP whenever the + // function has any frame work (StackSize > 0 or any FI use). + // Frame-less functions — e.g. `unsigned short sqr(unsigned short + // x) { return x*x; }` lowers to PUSH16 + jsl __mulhi3 + RTL with + // no locals — get skipped by PEI's call-frame phase, leaving + // ADJCALLSTACKUP as a pseudo all the way to here. Previously we + // silently dropped it, which left SP off by N bytes after the + // call and corrupted the caller's stack frame (caught by sqr(x) + // segfaulting MAME). Emit the SP fixup ourselves: PLY*N/2 for + // small even N, otherwise the TAY/TSC-ADC/TYA bracket. + int N = MI->getOperand(0).getImm(); + if (N == 0) return; + // A holds the callee's return value; preserve it. Walk forward + // looking for X/Y uses (i64-return halves) — same logic as + // eliminateCallFramePseudoInstr. 
+ bool YLive = false; + for (auto J = std::next(MI->getIterator()); J != MI->getParent()->end(); + ++J) { + if (J->isCall()) break; + bool yDef = false; + for (const MachineOperand &MO : J->operands()) { + if (!MO.isReg()) continue; + if (MO.getReg() == W65816::Y) { + if (MO.isUse()) { YLive = true; break; } + if (MO.isDef()) yDef = true; + } + } + if (YLive || yDef) break; + } + if (YLive) { + // Route through DP $E0 to preserve both A and Y. + MCInst Sta; Sta.setOpcode(W65816::STA_DP); + Sta.addOperand(MCOperand::createImm(0xE0)); + EmitToStreamer(*OutStreamer, Sta); + MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc); + MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc); + MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16); + Adc.addOperand(MCOperand::createImm(N)); + EmitToStreamer(*OutStreamer, Adc); + MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs); + MCInst Lda; Lda.setOpcode(W65816::LDA_DP); + Lda.addOperand(MCOperand::createImm(0xE0)); + EmitToStreamer(*OutStreamer, Lda); + } else if (N <= 14 && (N % 2) == 0) { + for (int i = 0; i < N / 2; ++i) { + MCInst Ply; Ply.setOpcode(W65816::PLY); + EmitToStreamer(*OutStreamer, Ply); + } + } else { + MCInst Tay; Tay.setOpcode(W65816::TAY); EmitToStreamer(*OutStreamer, Tay); + MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc); + MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc); + MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16); + Adc.addOperand(MCOperand::createImm(N)); + EmitToStreamer(*OutStreamer, Adc); + MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs); + MCInst Tya; Tya.setOpcode(W65816::TYA); EmitToStreamer(*OutStreamer, Tya); + } return; } case W65816::LDXi16imm: { diff --git a/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp b/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp index 3c69b9d..7fc390b 100644 --- a/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp +++ 
b/src/llvm/lib/Target/W65816/W65816BranchExpand.cpp @@ -46,6 +46,7 @@ #include "llvm/CodeGen/MachineFunctionPass.h" #include "llvm/CodeGen/MachineInstr.h" #include "llvm/CodeGen/MachineInstrBuilder.h" +#include "llvm/Support/raw_ostream.h" using namespace llvm; @@ -100,7 +101,17 @@ static unsigned estimateDistance(MachineFunction &MF, const MachineInstr &Br, MachineBasicBlock *To) { const MachineBasicBlock *From = Br.getParent(); - if (From == To) return 0; + // Self-loop branch: target is the start of From, branch is somewhere + // inside From. Distance is the bytes from start of From to the + // branch instruction (i.e., everything before Br in From). + if (From == To) { + unsigned Bytes = 0; + for (const auto &MI : *From) { + if (&MI == &Br) break; + Bytes += TII->getInstSizeInBytes(MI); + } + return Bytes; + } // Two cases by layout direction: // forward: bytes after Br in From, plus all of MBBs strictly @@ -276,11 +287,30 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) { // Step 2: iterate to fixed-point. Each expansion adds 3 bytes // (bridge BRA), which may push another previously-OK branch over // the threshold. Cap at MAX_ITERS to avoid pathological cases. - const unsigned EXPAND_DIST_THRESHOLD = 100; // safe under +/-128 + const unsigned EXPAND_DIST_THRESHOLD = 90; // tighter margin under +/-128 const unsigned MAX_ITERS = 10; for (unsigned iter = 0; iter < MAX_ITERS; ++iter) { bool Changed = false; + // Promote long BRA to BRL. The assembler's BRA→BRL relaxation + // sometimes fails to fire when the target symbol resolves early + // in MC layout — the linker then sees a PCREL8 reloc that's out + // of range. Force the BRL ourselves when the estimate exceeds + // the safe threshold; costs one byte if BRA would have fit, but + // beats a hard link error. 
+ for (auto &MBB : MF) { + for (auto &MI : MBB.terminators()) { + if (MI.getOpcode() != W65816::BRA) continue; + if (MI.getNumOperands() < 1 || !MI.getOperand(0).isMBB()) continue; + MachineBasicBlock *Target = MI.getOperand(0).getMBB(); + unsigned Dist = estimateDistance(MF, TII, MI, Target); + if (Dist > EXPAND_DIST_THRESHOLD) { + MI.setDesc(TII->get(W65816::BRL)); + Changed = true; + } + } + } + // Collect candidates. After step 1, each MBB has at most one // conditional terminator, so we walk terminators(). SmallVector, 8> Candidates; @@ -337,6 +367,27 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) { // fall-through marker after stays after. auto insertPt = MBB->getFirstTerminator(); BuildMI(*MBB, insertPt, DL, TII->get(InvOpc)).addMBB(Skip); + // After the rewrite, MBB falls through to Bridge (which now sits + // immediately after MBB in layout). Any unconditional BRA/BRL + // already at the end of MBB used to direct the fall-through to + // Skip — but with Bridge interposed, that BRA would skip past + // Bridge entirely and Bridge becomes unreachable. Remove it. + // (Skip is still reachable via INV_Bxx; Target is reachable via + // fall-through-to-Bridge then BRL.) Caught by vprintf crashing + // because dropDeadConditionalsToBRATarget then dropped the + // INV_Bxx as redundant with the leftover BRA Skip. + while (insertPt != MBB->end()) { + unsigned NextOpc = insertPt->getOpcode(); + if (NextOpc == W65816::BRA || NextOpc == W65816::BRL) { + if (insertPt->getNumOperands() >= 1 && + insertPt->getOperand(0).isMBB() && + insertPt->getOperand(0).getMBB() == Skip) { + insertPt = insertPt->eraseFromParent(); + continue; + } + } + ++insertPt; + } // Bridge: BRL Target. 
Always emit the long form rather than // relying on the assembler to relax BRA→BRL — the relaxation diff --git a/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp b/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp index 8a2df0b..4f5f6f6 100644 --- a/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp +++ b/src/llvm/lib/Target/W65816/W65816FrameLowering.cpp @@ -162,15 +162,39 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF, // Insert before the terminator (the return). DebugLoc DL = MBBI != MBB.end() ? MBBI->getDebugLoc() : DebugLoc(); + // Detect whether the return live-out includes Y or X — for i64 returns + // (Outs[0..2] -> A,X,Y), Y holds bits 32-47 and X holds bits 16-31, so + // any TAY/PLY/TAX in the SP-restore would corrupt the return value. + // The RTL terminator carries implicit-uses for every live-out return + // register; scan them to decide which scratch we can use safely. + bool YLive = false; + bool XLive = false; + if (MBBI != MBB.end() && MBBI->isReturn()) { + for (const MachineOperand &MO : MBBI->operands()) { + if (!MO.isReg() || !MO.isImplicit() || !MO.isUse()) continue; + if (MO.getReg() == W65816::Y) YLive = true; + else if (MO.getReg() == W65816::X) XLive = true; + } + } + // VLA cleanup: restore entry SP from DP $F4 (saved in prologue). // This subsumes BOTH the static frame and any dynamic_stackalloc // bytes — we can skip the per-byte PLY/PLA loop entirely. Preserve - // A through TAY/TYA since it holds the return value. + // A through TAY/TYA since it holds the return value. For i64 + // returns where Y is also live, route the save through DP $E0 + // ($E0..$EF is libcall scratch — guaranteed dead by epilogue time). 
if (HasVLA) { - BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY)); - BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4); - BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS)); - BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA)); + if (YLive) { + BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0); + BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4); + BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS)); + BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0); + } else { + BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY)); + BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4); + BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS)); + BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA)); + } return; } @@ -182,11 +206,26 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF, // N/2 PLY (pop into Y, discard); larger frames use // TAY/TSC/CLC/ADC #N/TCS/TYA. // Mirror the prologue threshold (see comment there). - if (StackSize <= 6 && (StackSize % 2) == 0) { + if (StackSize <= 6 && (StackSize % 2) == 0 && !YLive) { + // PLY clobbers Y, which is fine when Y isn't a return reg. for (uint64_t i = 0; i < StackSize / 2; ++i) BuildMI(MBB, MBBI, DL, TII.get(W65816::PLY)); return; } + if (YLive) { + // Y is a return register (i64 / double). Save A via DP $E0 + // instead of TAY so Y survives. 4 cyc slower than TAY/TYA but + // correct. X is allowed to be live too — none of these touch X. 
+ BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0); + BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC)); + BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC)); + BuildMI(MBB, MBBI, DL, TII.get(W65816::ADC_Imm16)) + .addImm(StackSize); + BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS)); + BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0); + (void)XLive; + return; + } BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY)); BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC)); BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC)); @@ -207,15 +246,56 @@ MachineBasicBlock::iterator W65816FrameLowering::eliminateCallFramePseudoInstr( // ADJCALLSTACKUP releases all the pushed bytes after a call. // // Critical: A holds the callee's return value here, so this MUST NOT - // clobber A. The naive `tsc;clc;adc #N;tcs` does (TSC overwrites A), - // which silently corrupts every call's return value. Same fix as the - // epilogue: small N via PLY (clobbers Y, preserves A); larger N via - // TAY/.../TYA bracket. + // clobber A. PLY (small-N path) clobbers Y; TAY/.../TYA bracket + // (large-N path) also clobbers Y. Both are fine for i8/i16/i32 + // returns but DESTROY the return for i64/double (where X and Y hold + // mid halves). Detect i64-return calls by walking forward from the + // call and checking for later uses of $x/$y; in that case, save A via DP $E0 + // (libcall scratch, dead by call-up time) so X and Y survive. + // Caught by `unsigned long long u64add(a,b)` through a noinline + // boundary returning Y = b_hi (the last popped) instead of the + // sum's mid-high. if (I->getOpcode() == W65816::ADJCALLSTACKUP) { int N = I->getOperand(0).getImm(); if (N > 0) { DebugLoc DL = I->getDebugLoc(); - if (N <= 14 && (N % 2) == 0) { + bool YLive = false; + bool XLive = false; + // Walk forward looking for COPY %vreg = $x / $y — LowerCall's + // pattern for materializing return halves. JSLpseudo's tablegen + // declares only `Defs=[A]`, so implicit-defs of X/Y aren't on + // the call op itself. 
We have to read what comes after. + // Stop at the next call (re-clobbers everything) or at any def + // of X/Y (cancels their post-call value). + bool Stopped = false; + for (auto J = std::next(I); J != MBB.end() && !Stopped; ++J) { + if (J->isCall()) break; + for (const MachineOperand &MO : J->operands()) { + if (!MO.isReg()) continue; + Register R = MO.getReg(); + if (R == W65816::Y) { + if (MO.isUse()) YLive = true; + else if (MO.isDef() && !YLive) Stopped = true; + } else if (R == W65816::X) { + if (MO.isUse()) XLive = true; + else if (MO.isDef() && !XLive) Stopped = true; + } + } + if (YLive && XLive) break; + } + if (YLive) { + // i64 return: PLY would eat Y. Route through DP $E0. Worth + // ~4 cyc more than PLY*N/2 but correctness wins. X is not + // touched by any of these insns either way, so XLive doesn't + // change anything here — track it for symmetry. + BuildMI(MBB, I, DL, TII.get(W65816::STA_DP)).addImm(0xE0); + BuildMI(MBB, I, DL, TII.get(W65816::TSC)); + BuildMI(MBB, I, DL, TII.get(W65816::CLC)); + BuildMI(MBB, I, DL, TII.get(W65816::ADC_Imm16)).addImm(N); + BuildMI(MBB, I, DL, TII.get(W65816::TCS)); + BuildMI(MBB, I, DL, TII.get(W65816::LDA_DP)).addImm(0xE0); + (void)XLive; + } else if (N <= 14 && (N % 2) == 0) { for (int i = 0; i < N / 2; ++i) BuildMI(MBB, I, DL, TII.get(W65816::PLY)); } else { diff --git a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp index 1d0865e..bf398d8 100644 --- a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp +++ b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp @@ -861,10 +861,17 @@ W65816TargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI, Glue = V.getValue(2); InVals.push_back(V); } else { - // 4th half: load from DP $F0. - SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16); - SDValue V = DAG.getLoad(VT, DL, Chain, DPAddr, MachinePointerInfo()); + // 4th half: read DP[$F0..$F1] via CopyFromReg(DPF0). 
DPF0 is a + // pseudo-physreg modeled as JSLpseudo's implicit-def, so each + // call's CopyFromReg has Glue tied to the corresponding call — + // the SDAG combiner can't merge them and the scheduler can't + // reorder them past the next call. copyPhysReg lowers DPF0 → + // A as `LDA $F0`. Without this, plain `getLoad(0xF0)` was + // being CSE'd / reordered across i64-returning calls, causing + // `dmath = (a+b)*(a-b)` to return 4 instead of 16. + SDValue V = DAG.getCopyFromReg(Chain, DL, W65816::DPF0, VT, Glue); Chain = V.getValue(1); + Glue = V.getValue(2); InVals.push_back(V); } } @@ -900,11 +907,17 @@ SDValue W65816TargetLowering::LowerReturn( SDValue Glue; SmallVector RetOps(1, Chain); - // Outs[3] -> store to DP $F0 (only for i64 returns). Done first so - // its computation can use A freely before A holds the low result. + // Outs[3] -> DP $F0 via CopyToReg(DPF0). Using the DPF0 fake physreg + // (lowered to `STA $F0` by copyPhysReg) is critical: a generic + // ISD::STORE with addr=0xF0 lowered to `sta (d,s),y`, an indirect + // through the DBR, which silently misbehaved when DBR != 0. STA dp + // uses D + dp directly and is unaffected by DBR. Done first so its + // computation can use A freely before A holds the low result. Glued + // to RET_GLUE via the RetOps Register entry below so DCE doesn't + // strip the COPY. if (Outs.size() >= 4) { - SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16); - Chain = DAG.getStore(Chain, DL, OutVals[3], DPAddr, MachinePointerInfo()); + Chain = DAG.getCopyToReg(Chain, DL, W65816::DPF0, OutVals[3], Glue); + Glue = Chain.getValue(1); } // Outs[2] -> Y. 
if (Outs.size() >= 3) { @@ -926,6 +939,8 @@ SDValue W65816TargetLowering::LowerReturn( RetOps.push_back(DAG.getRegister(W65816::X, Outs[1].VT)); if (Outs.size() >= 3) RetOps.push_back(DAG.getRegister(W65816::Y, Outs[2].VT)); + if (Outs.size() >= 4) + RetOps.push_back(DAG.getRegister(W65816::DPF0, Outs[3].VT)); RetOps[0] = Chain; if (Glue.getNode()) diff --git a/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp b/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp index 702d8ad..81226fa 100644 --- a/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp +++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp @@ -92,6 +92,44 @@ void W65816InstrInfo::copyPhysReg(MachineBasicBlock &MBB, BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(dstImg); return; } + // X → IMGn / IMGn → X: STX dp / LDX dp. Avoids the A-bridge that + // TAX/TXA would impose; critical for i32-first-arg signatures + // (live-in $a + $x) where bridging X via A clobbers $a's value + // before it can be saved. Caught by udivmod and iterative qsort. + if (dstImg >= 0 && SrcReg == W65816::X) { + BuildMI(MBB, I, DL, get(W65816::STX_DP)).addImm(dstImg); + return; + } + if (DestReg == W65816::X && srcImg >= 0) { + BuildMI(MBB, I, DL, get(W65816::LDX_DP)).addImm(srcImg); + return; + } + // Y → IMGn / IMGn → Y: STY dp / LDY dp — symmetric. + if (dstImg >= 0 && SrcReg == W65816::Y) { + BuildMI(MBB, I, DL, get(W65816::STY_DP)).addImm(dstImg); + return; + } + if (DestReg == W65816::Y && srcImg >= 0) { + BuildMI(MBB, I, DL, get(W65816::LDY_DP)).addImm(srcImg); + return; + } + // DPF0 → A: emit `LDA $F0`. DPF0 is the pseudo-physreg carrier + // for an i64-returning call's high 16 bits; LowerCall builds a + // CopyFromReg(DPF0) glued to the call so the SDAG combiner / + // scheduler can't merge or reorder reads across calls. + if (DestReg == W65816::A && SrcReg == W65816::DPF0) { + BuildMI(MBB, I, DL, get(W65816::LDA_DP)).addImm(0xF0); + return; + } + // A → DPF0: emit `STA $F0`. 
Used by LowerReturn for the i64 high + // half; using a true direct-page store is critical because plain + // ISD::STORE with addr=0xF0 was lowering to `(d,s),y` indirect via + // DBR — which silently broke under DBR != 0 (e.g. after a bank + // switch). STA dp uses D + dp directly, ignoring DBR. + if (DestReg == W65816::DPF0 && SrcReg == W65816::A) { + BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(0xF0); + return; + } llvm_unreachable("W65816: cross-class copyPhysReg not yet implemented"); } @@ -101,8 +139,14 @@ void W65816InstrInfo::storeRegToStackSlot( MachineInstr::MIFlag Flags) const { // STAfi gets eliminated by W65816RegisterInfo::eliminateFrameIndex into // a real STA d,S. Source is implicit A; emit the pseudo with the FI - // and zero offset. + // and zero offset. When regalloc hands us a spill from X or Y, bridge + // through A (TXA / TYA) — same rationale as loadRegFromStackSlot. DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc(); + if (SrcReg == W65816::X || SrcReg == W65816::Y) { + unsigned XferOp = (SrcReg == W65816::X) ? W65816::TXA : W65816::TYA; + BuildMI(MBB, MI, DL, get(XferOp)); + SrcReg = W65816::A; + } BuildMI(MBB, MI, DL, get(W65816::STAfi)) .addReg(SrcReg, getKillRegState(isKill)) .addFrameIndex(FrameIdx) @@ -115,9 +159,30 @@ void W65816InstrInfo::loadRegFromStackSlot(MachineBasicBlock &MBB, const TargetRegisterClass *RC, Register VReg, unsigned SubReg, MachineInstr::MIFlag Flags) const { - // Mirror image of storeRegToStackSlot: emit LDAfi, which the frame - // index pass turns into LDA d,S. + // LDAfi only knows how to put the value in A. If regalloc asks for + // a spill into X or Y, we have to bridge through A: LDA d,S then + // TAX / TAY. Without this, the MIR has `$x = LDAfi` but the asm + // printer emits just `LDA d,S` (which writes A, not X) — a silent + // miscompile that surfaced as i64 subtract chains using stale X + // values for the second word (caught by udivmod's `a - q*b` mod + // computation). 
DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc(); + if (DestReg == W65816::A) { + BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg) + .addFrameIndex(FrameIdx) + .addImm(0); + return; + } + if (DestReg == W65816::X || DestReg == W65816::Y) { + // Load via A, then transfer. A is implicitly clobbered. + BuildMI(MBB, MI, DL, get(W65816::LDAfi), W65816::A) + .addFrameIndex(FrameIdx) + .addImm(0); + unsigned XferOp = (DestReg == W65816::X) ? W65816::TAX : W65816::TAY; + BuildMI(MBB, MI, DL, get(XferOp)); + return; + } + // Fallback: assume A path (covers Acc16 / Wide16 vregs by class). BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg) .addFrameIndex(FrameIdx) .addImm(0); diff --git a/src/llvm/lib/Target/W65816/W65816InstrInfo.td b/src/llvm/lib/Target/W65816/W65816InstrInfo.td index 01518df..641664b 100644 --- a/src/llvm/lib/Target/W65816/W65816InstrInfo.td +++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.td @@ -70,6 +70,7 @@ def W65816pushx : SDNode<"W65816ISD::PUSH_X", SDTNone, [SDNPHasChain, SDNPInGlue, SDNPOutGlue, SDNPSideEffect, SDNPMayStore]>; + // SELECT_CC: takes (TVal, FVal, CC) plus a glue value carrying the // flags from a preceding W65816cmp. Lowered by EmitInstrWithCustomInserter // into a CMP (already in the BB) + Bxx + diamond CFG + PHI. @@ -1356,10 +1357,18 @@ def : Pat<(store // function doesn't have to know how it was called to choose its // return instruction. A pseudo bridges the i16 symbol operand // to JSL_Long's 24-bit operand class. +// Defs include DPF0 — every i64-returning libcall clobbers DP[$F0] +// (it's the carrier for the highest 16 bits of the return). The +// LowerCall side captures the pre-call DPF0 via CopyFromReg(DPF0) +// glued to the call so the SDAG combiner / scheduler can't merge +// or reorder reads across calls. Without DPF0 in Defs, plain +// `getLoad(0xF0)` was being CSE'd across calls, leading to +// `dmath = (a+b)*(a-b)` returning 4 instead of 16. 
let isCall = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0, - Defs = [A] in { + Defs = [A, DPF0] in { def JSLpseudo : W65816Pseudo<(outs), (ins i16imm:$dst), "# JSLpseudo $dst", []>; } + def : Pat<(W65816call (i16 tglobaladdr:$dst)), (JSLpseudo tglobaladdr:$dst)>; def : Pat<(W65816call (i16 texternalsym:$dst)), (JSLpseudo texternalsym:$dst)>; diff --git a/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h b/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h index 88c02b2..f6a4d78 100644 --- a/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h +++ b/src/llvm/lib/Target/W65816/W65816MachineFunctionInfo.h @@ -40,6 +40,7 @@ class W65816MachineFunctionInfo : public MachineFunctionInfo { /// STA8abs needs an SEP/REP wrap in M=0 to avoid a 2-byte store). bool UsesAcc8 = false; + public: W65816MachineFunctionInfo() = default; diff --git a/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp b/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp index e6f3a7f..dd7fc82 100644 --- a/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp +++ b/src/llvm/lib/Target/W65816/W65816NegYIndY.cpp @@ -89,6 +89,31 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) { continue; unsigned Disp = MI.getOperand(0).getImm() & 0xFF; DebugLoc DL = MI.getDebugLoc(); + // X-liveness check: SpillToX may have stashed a value in X + // that's used after this rewrite. If so, save X to DP $E2 + // (libcall scratch high half — $E0 is reserved for the A-save + // dance in eliminateCallFramePseudoInstr) and restore after. + // Walk forward from MI looking for an X use without a prior + // X def; if found, X is live and we must preserve it. 
+ bool XLive = false; + for (auto Scan = std::next(MachineBasicBlock::iterator(&MI)); + Scan != MBB.end(); ++Scan) { + if (Scan->isDebugInstr()) continue; + bool xDef = false; + for (const MachineOperand &MO : Scan->operands()) { + if (!MO.isReg()) continue; + if (MO.getReg() == W65816::X) { + if (MO.isUse()) { XLive = true; break; } + if (MO.isDef()) xDef = true; + } + } + if (XLive || xDef) break; + } + if (XLive) { + // Save X to DP $E2 (don't use $E0 — that's the A-preserve + // slot in call-frame teardown and may be live). + BuildMI(MBB, MI, DL, TII->get(W65816::STX_DP)).addImm(0xE2); + } if (IsLDA) { // LDA disp,S ; CLC ; ADC #neg ; TAX ; LDA $0000,X BuildMI(MBB, MI, DL, TII->get(W65816::LDA_StackRel)) @@ -127,6 +152,10 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) { .addImm(0) .addReg(W65816::A, RegState::Implicit); } + if (XLive) { + // Restore X from DP $E2. + BuildMI(MBB, MI, DL, TII->get(W65816::LDX_DP)).addImm(0xE2); + } // Erase original LDY and the (sr,s),Y op. if (LastLDY) { LastLDY->eraseFromParent(); LastLDY = nullptr; } MI.eraseFromParent(); diff --git a/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp b/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp index aa1752b..7d5715b 100644 --- a/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp +++ b/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp @@ -73,7 +73,30 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II, bool NeedsCarryPrefix = false; bool IsSub = false; switch (Opc) { - case W65816::LDAfi: NewOpc = W65816::LDA_StackRel; break; + case W65816::LDAfi: { + // LDAfi targets A. If the regalloc parked the dest in X or Y + // (which can happen via Idx16 vreg coalescing), bridge through A + // by appending a TAX / TAY. 
+ Register Dst = MI.getOperand(0).getReg(); + int FI = MI.getOperand(FIOperandNum).getIndex(); + int FrameOffset = MFI.getObjectOffset(FI); + int ImmOffset = MI.getOperand(FIOperandNum + 1).getImm(); + int Offset = FrameOffset + ImmOffset + (int)MFI.getStackSize() + SPAdj; + if (FrameOffset < 0) Offset += 1; + if (Offset < 0 || Offset > 0xFF) + report_fatal_error("W65816: frame offset out of stack-relative range"); + BuildMI(*MI.getParent(), II, MI.getDebugLoc(), + TII.get(W65816::LDA_StackRel)) + .addImm(Offset) + .addReg(W65816::A, RegState::ImplicitDefine); + if (Dst == W65816::X) { + BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAX)); + } else if (Dst == W65816::Y) { + BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAY)); + } + MI.eraseFromParent(); + return true; + } case W65816::STAfi: { // Wide16-source STAfi: if the source ended up in IMGn (DP-backed), // prepend LDA dp so the value reaches A before the actual store. @@ -108,6 +131,12 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II, BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::LDA_DP)).addImm(srcDP); } + // Note: STAfi with X or Y source is NOT supported here — adding a + // TXA/TYA pre-bracket would clobber A which a downstream STAfi $a + // may still need (the prologue stashes arg0_lo from A and arg0_ml + // from X via two adjacent STAfi, and putting A's STA *before* X's + // is the caller's responsibility). storeRegToStackSlot already + // bridges X/Y → A for spills it generates. 
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::STA_StackRel)) .addImm(Offset) diff --git a/src/llvm/lib/Target/W65816/W65816RegisterInfo.td b/src/llvm/lib/Target/W65816/W65816RegisterInfo.td index d703239..574cefe 100644 --- a/src/llvm/lib/Target/W65816/W65816RegisterInfo.td +++ b/src/llvm/lib/Target/W65816/W65816RegisterInfo.td @@ -55,6 +55,15 @@ def IMG5 : W65816Reg<21, "img5">, DwarfRegNum<[21]>; def IMG6 : W65816Reg<22, "img6">, DwarfRegNum<[22]>; def IMG7 : W65816Reg<23, "img7">, DwarfRegNum<[23]>; +// DPF0 — pseudo-physreg modeling the i16 storage at DP $F0..$F1. +// Used as the carrier for the highest 16 bits of an i64/double +// return. JSLpseudo Defs DPF0 so the SDAG combiner / scheduler +// can't merge or reorder reads of it across calls; we plumb the +// 4th return half via CopyFromReg(DPF0) in LowerCall, which lowers +// to `LDA $F0` via copyPhysReg. Never allocated to a vreg — +// always a transient bridge from DP[$F0] to A. +def DPF0 : W65816Reg<24, "dpf0">, DwarfRegNum<[24]>; + //===----------------------------------------------------------------------===// // Register Classes //===----------------------------------------------------------------------===// @@ -90,6 +99,13 @@ def Wide16 : RegisterClass<"W65816", [i16], 16, def PtrRegs : RegisterClass<"W65816", [i16], 16, (add SP)>; +// Single-register class for DPF0, the i64-return high-half carrier. +// Not allocatable — only used as a CopyFromReg source in LowerCall; +// copyPhysReg lowers DPF0 → A by emitting `LDA $F0`. +def DPF0Reg : RegisterClass<"W65816", [i16], 16, (add DPF0)> { + let isAllocatable = 0; +} + // Single-register class for the processor status register, used for condition // code modeling. Not currently allocatable. 
def StatusReg : RegisterClass<"W65816", [i8], 8, (add P)> { diff --git a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp index 11ebd30..a7966db 100644 --- a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp +++ b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp @@ -1217,6 +1217,13 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) { } if (MI.isCall()) break; if (MI.modifiesRegister(W65816::Y, TRI)) break; + // killsRegister: an instruction with `implicit killed $y` USES Y + // and that's the LAST use — Y is dead after. We must NOT treat + // a subsequent LDY_Imm16 #N as redundant after a kill, because + // the held value is conceptually gone. Caught by `addOff(p,i) + // { p[i-1] += p[i]; }` where LDY -2 ; LDA_indY (kills Y) ; ... ; + // LDY -2 ; STA_indY needs the second LDY to reinitialize Y. + if (MI.killsRegister(W65816::Y, TRI)) break; if (MI.isInlineAsm() || MI.isBranch() || MI.isReturn()) break; ++It; } diff --git a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp index e86633b..6ca79fa 100644 --- a/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp +++ b/src/llvm/lib/Target/W65816/W65816TargetMachine.cpp @@ -14,6 +14,7 @@ #include "W65816.h" #include "W65816MachineFunctionInfo.h" #include "TargetInfo/W65816TargetInfo.h" +#include "llvm/CodeGen/MachineCSE.h" #include "llvm/CodeGen/Passes.h" #include "llvm/CodeGen/TargetLoweringObjectFileImpl.h" #include "llvm/CodeGen/TargetPassConfig.h" @@ -82,16 +83,19 @@ public: void addPreRegAlloc() override; void addPostRegAlloc() override; void addPreEmitPass() override; + void addMachineSSAOptimization() override; - // W65816's only 16-bit ALU register is A. We use fast regalloc by - // default — always succeeds, ~30-50% bigger code than greedy in - // pathological cases but correctness is paramount. 
Greedy fails - // outright on functions with 4+ simultaneously live i16 vregs (heap - // sift etc.). TiedDefSpill (pre-RA) handles the tied-def-multi-use - // hazard for the sub-pattern that's frequent enough to matter. + // W65816's only 16-bit ALU register is A. Greedy at -O1+ produces + // tight code; at -O0 (where optnone disables coalescing/CSE), greedy + // leaves spurious COPY pseudos that lower to STA dp / LDA dp pairs + // around modify-in-place ops (e.g. INA), miscompiling a + 1. Use + // fast regalloc when the target framework signals unoptimized. + // TiedDefSpill (pre-RA) handles the tied-def-multi-use hazard for + // the sub-pattern that's frequent enough to matter at -O1+. // - FunctionPass *createTargetRegisterAllocator(bool /*Optimized*/) override { - return createGreedyRegisterAllocator(); + FunctionPass *createTargetRegisterAllocator(bool Optimized) override { + return Optimized ? createGreedyRegisterAllocator() + : createFastRegisterAllocator(); } }; @@ -101,6 +105,24 @@ TargetPassConfig *W65816TargetMachine::createPassConfig(PassManagerBase &PM) { return new W65816PassConfig(*this, PM); } +void W65816PassConfig::addMachineSSAOptimization() { + // MachineCSE incorrectly eliminates "redundant" CMP instructions when + // it sees an earlier identical CMP elsewhere in the function — the + // P (status) flag is considered "available", but on this target P is + // clobbered by every intervening LDA/STA/ADC, so the surviving Bxx + // ends up dispatching on stale flags. We don't model `Uses=[P]` on + // Bxx because doing so causes regalloc/layout shifts that uncovered + // a different latent bug in vprintf. Disabling the pass entirely + // is the lower-cost workaround until the Bxx-Uses=[P] regression is + // root-caused. Caught by `printf("%d", n)` returning 0. + // + // Other SSA opts (early-tailduplication, opt-phis, dead-mi-elim, + // licm, machine-sink, peephole-opt, etc.) 
still run by chaining + // through the default impl — we just skip MachineCSE. + disablePass(&MachineCSELegacyID); + TargetPassConfig::addMachineSSAOptimization(); +} + void W65816PassConfig::addPreRegAlloc() { addPass(createW65816ABridgeViaX()); addPass(createW65816TiedDefSpill()); @@ -125,7 +147,11 @@ void W65816PassConfig::addPreEmitPass() { addPass(createW65816SpillToX()); // Rewrite negative-Y indirect-Y stack-rel ops. Must run BEFORE // BranchExpand because the rewrite expands one instruction into - // several and shifts branch distances. + // several and shifts branch distances. The pass internally checks + // X-liveness and saves/restores X via DP $E0 when SpillToX has + // a value parked there; without that check, the rewrite's TAX + // would clobber spill-bridged values (caught by `addOff(p,i) { + // p[i-1] += p[i]; }` returning p[i-1] + &p[i-1] instead of +b). addPass(createW65816NegYIndY()); // Branch expansion runs after that so the BRA introduced for long // conditional branches gets seen by SepRepCleanup (which can diff --git a/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp b/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp index 00d4ccb..ca63345 100644 --- a/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp +++ b/src/llvm/lib/Target/W65816/W65816TiedDefSpill.cpp @@ -118,6 +118,11 @@ bool W65816TiedDefSpill::runOnMachineFunction(MachineFunction &MF) { // Only pre-RA: skip if vregs are already gone. if (!MF.getRegInfo().getNumVirtRegs()) return false; + // At -O0/optnone, the spill+reload pattern this pass introduces + // doesn't get coalesced and ends up wasting frame space without + // helping greedy. Same skip rationale as WidenAcc16. 
+ if (MF.getFunction().hasOptNone()) + return false; MachineRegisterInfo &MRI = MF.getRegInfo(); const W65816Subtarget &STI = MF.getSubtarget(); diff --git a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp index 9e3fdce..5e11bb6 100644 --- a/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp +++ b/src/llvm/lib/Target/W65816/W65816WidenAcc16.cpp @@ -119,6 +119,13 @@ static bool allUsesAcceptWide(Register VReg, bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) { if (!MF.getRegInfo().getNumVirtRegs()) return false; + // At -O0 / optnone, register coalescing doesn't run, so the COPY we + // insert to bridge Acc16 → Wide16 doesn't get folded; instead it + // forces wide16 spills through DP-mapped slots that collide and + // produce miscompiles around modify-in-place ops (lda dp; inc a; + // sta dp; lda dp reads pre-inc value). The promotion is purely a + // performance optimization, so skip it for optnone functions. + if (MF.getFunction().hasOptNone()) return false; MachineRegisterInfo &MRI = MF.getRegInfo(); const W65816Subtarget &STI = MF.getSubtarget(); const W65816InstrInfo *TII = STI.getInstrInfo();
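Reviewer note on the `eliminateFrameIndex` hunk for the frame-index load earlier in this patch: the displacement arithmetic can be checked in isolation. The block below is a minimal standalone sketch of it, not the backend code. `FrameRef` and `stackRelOffset` are hypothetical names introduced only for illustration. The fields stand in for `MFI.getObjectOffset(FI)`, the immediate operand after the frame index, `MFI.getStackSize()`, and `SPAdj`. The sketch returns an empty optional where the patch calls `report_fatal_error`.

```cpp
#include <optional>

// Hypothetical stand-in for the inputs eliminateFrameIndex reads.
// None of these names exist in the W65816 backend.
struct FrameRef {
  int frameOffset; // models MFI.getObjectOffset(FI)
  int immOffset;   // models the immediate operand after the frame index
  int stackSize;   // models MFI.getStackSize()
  int spAdj;       // models the SPAdj argument
};

// Computes the stack-relative displacement the way the hunk does.
// Returns std::nullopt where the patch would call report_fatal_error.
std::optional<int> stackRelOffset(const FrameRef &r) {
  int offset = r.frameOffset + r.immOffset + r.stackSize + r.spAdj;

  // The patch adds 1 only for negative object offsets. STATUS.md
  // attributes a +1 skew to the empty-descending stack (S points at
  // the next free byte) versus LLVM's full-descending model.
  if (r.frameOffset < 0)
    offset += 1;

  // Stack-relative addressing takes an 8-bit unsigned displacement.
  if (offset < 0 || offset > 0xFF)
    return std::nullopt;
  return offset;
}
```

With these assumptions, an object at offset -2 in a 4-byte frame resolves to displacement 3, and any frame that pushes the sum past 0xFF is rejected rather than silently truncated.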