Added STATUS.md

Scott Duensing 2026-04-30 18:49:00 -05:00
parent 6d7eae0356
commit 91ac5476a5
20 changed files with 1192 additions and 107 deletions

STATUS.md (new file, 144 lines)

@ -0,0 +1,144 @@
# llvm816 — Current Status
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.
## What works
End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).
**Language coverage at -O2 (no extra flags):**
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
+ ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
the above sizes.
- Pointer arithmetic, array indexing, struct field access, struct
return-by-value (up to 8 bytes — Pair, Vec4, double).
- Bitfields, switch statements (verified up to ~12 cases + default),
function pointers, function-pointer tables, indirect calls via
`__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
reverse with `cons` works.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all return correct bit patterns;
3-iter Newton sqrt converges. Long-running iterations may hit MAME's
1-second sim-time budget (test config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
**Toolchain:**
- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
text/rodata/bss, emits a flat binary the IIgs ROM can load.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
native object format) for round-tripping with classic dev tools.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
libgcc into linkable objects.
- `scripts/smokeTest.sh` runs ~80 end-to-end checks (scalar ops,
control flow, calling conventions, MAME execution, regressions).
Currently 100% pass.
**ABI:**
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
on the system stack with PHA. Caller deallocates via `tsc;clc;adc
#N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
for the +1 skew vs LLVM's full-descending model.
## In flight (build-system level)
- **DWARF sidecar emission in link816** (#51): The linker should produce
a separate sidecar file with line-number / variable-location info
that an IDE or post-mortem dumper can consume. Skeleton not yet
written; deferred until other correctness work is done.
## Known issues / workarounds
- **Greedy register allocator mis-orders spills** in two patterns
(#69, #70):
1. Functions where both `$a` and `$x` are live-in (i64-first-arg
with a stack-output pointer, e.g. `udivmod(i64, i64, ptr)`).
The TAX bridging `$x` to A clobbers `$a`'s value before the
second STA can save it.
2. Iterative quicksort with `if/else` recursion choice: complex
live-ranges across two `swap()` calls produce wrong arg values.
Both reproduce only at `-O1`/`-O2` with greedy. Workaround:
`-mllvm -regalloc=fast` for the affected translation unit.
`softDouble.c` already requires this flag for `__muldf3` (build.sh
applies it automatically).
Real fix is a pre-RA pass that pre-spills critical pointer
arguments to memory, or a targeted fix in greedy's spill-ordering
heuristic. Material work; deferred.
- **(d,s),y / (sr,s),y addressing crosses into the next bank** when Y
holds a negative offset, because Y is added as an unsigned 16-bit
value. Worked around by `W65816NegYIndY`, which rewrites the affected
ops to `TAX ; LDA/STA $0000,X`, so negative offsets like `arr[i-1]`
stay correct.
- **(d,s),y for stack-local pointer dereferences uses DBR**, so
user code that switches DBR (e.g. `pha;plb` to bank 2 to reach
IIgs hardware) must not call into a function that takes the
address of one of its locals — the callee's `*p = v` will write
to the wrong bank. Documented; no compiler-side mitigation
beyond the existing DPF0 fake-physreg routing for the i64-return
high half.
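
The bank-crossing issue above can be illustrated with a small host-side model. This is a sketch only, assuming the usual 65816 behaviour in which the indexed sum is formed on the full 24-bit address; the function names (`eaIndirectIndexed`, `eaViaX`, `checkBankModel`) are hypothetical and are not part of the backend.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `lda (d,s),y` (assumed semantics): the 16-bit pointer is
   extended with DBR to a 24-bit address, then Y is added as an
   unsigned 16-bit value, so the sum can carry into the next bank. */
static uint32_t eaIndirectIndexed(uint8_t dbr, uint16_t ptr, uint16_t y) {
    uint32_t base = ((uint32_t)dbr << 16) | ptr;
    return (base + y) & 0xFFFFFFu;
}

/* Model of the rewritten form: the sum is formed in a 16-bit register
   first (wrapping modulo 65536), then used as X in `lda $0000,x`. */
static uint32_t eaViaX(uint8_t dbr, uint16_t ptr, uint16_t y) {
    uint16_t x = (uint16_t)(ptr + y);
    return (((uint32_t)dbr << 16) + x) & 0xFFFFFFu;
}

/* Illustrative check: byte offset -2 (Y = 0xFFFE) from pointer $1002. */
void checkBankModel(void) {
    assert(eaIndirectIndexed(0x00, 0x1002, 0xFFFE) == 0x011000u); /* next bank */
    assert(eaViaX(0x00, 0x1002, 0xFFFE) == 0x001000u);            /* same bank */
}
```

For non-negative offsets that do not overflow 16 bits, both models give the same address, which is why only the negative-Y case needs the rewrite.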
## What's still needed for a "ship-ready" toolchain
- **Greedy regalloc spill-ordering fix** — see above. Removes the
need for the per-file `-regalloc=fast` workaround on
`softDouble.c` and unblocks pattern-rich code that currently
must be compiled at `-O0` for correctness.
- **Round-to-nearest-even in `__divdf3`** — currently
truncate-toward-zero, which differs from gcc by ±1 ULP in
several test cases. Acceptable today (Newton iterations still
converge); revisit when an exact-match test suite lands.
- **DWARF sidecar** (#51) for source-level debugging.
- **More of the C standard library**: `<math.h>` transcendental
functions (sin, cos, exp, log, pow), `<string.h>` beyond what's
hand-coded, `<stdio.h>` file I/O (`fopen`, `fread`, `fwrite`,
`fseek`).
- **C++ runtime support**: vtable layout for multiple inheritance,
RTTI, exceptions (or a documented `-fno-exceptions` requirement).
- **REP/SEP scheduling pass** (design doc §3.3): the current
prologue picks one M-mode for the whole function based on
whether any 8-bit accumulator value is used. A per-region
scheduler would reduce the SEP/REP wrap overhead on i8 stores.
- **Toolbox / IIgs system call bindings**: header files declaring
the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`,
`DrawString`, …) with the right inline-asm dispatch glue.
- **Real-world program coverage**: the smoke tests are
microbenchmarks. A few known-good Apple IIgs C programs (e.g.
a textfile pager, a small game) compiled and run end-to-end
would catch issues no synthetic test currently exercises.
- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
says the goal is to "match or exceed" Calypsi. We have neither
baseline numbers nor a comparison harness yet.


@ -23,8 +23,10 @@ asm() {
cc() {
local c="$1"
local o="$OUT/$(basename "${c%.c}").o"
local extra=("${@:2}")
echo " CC $(basename "$c")"
"$CLANG" -target w65816 -O2 -ffunction-sections \
"${extra[@]}" \
-I"$PROJECT_ROOT/runtime/include" \
-c "$c" -o "$o"
}
@ -33,6 +35,9 @@ asm "$SRC/crt0.s"
asm "$SRC/libgcc.s"
cc "$SRC/libc.c"
cc "$SRC/softFloat.c"
cc "$SRC/softDouble.c"
# softDouble.c needs -regalloc=fast: __muldf3's 64x64 -> 128 mul +
# inlined alignment shifts overflows the greedy allocator on the
# single-A target.
cc "$SRC/softDouble.c" -mllvm -regalloc=fast
echo "runtime built: $(ls -1 "$OUT"/*.o | wc -l) objects"


@ -673,19 +673,30 @@ __divmodsi_setup:
; setup; signed variants flip signs around it.
; --------------------------------------------------------------------
__divmoddi4_stash:
; Called via JSR from another libgcc helper that was itself
; called via JSL. Stack layout inside this routine:
; slot 1..2 = JSR return address (2 bytes, same-bank)
; slot 3..5 = JSL return address (3 bytes, long)
; slot 6..7 = first 16-bit stack arg (caller's first push)
; slot 8..9 = second
; ... etc.
; Earlier code read slots 4, 6, 8, 10, 12, 14 — which lands on
; the JSL ret address bytes, treating them as args. Caught by
; `u64mul(0x12, 0x12)` returning the result at $E2 (mid-low)
; instead of $E0 (lo) plus 0x678-shaped garbage at $E6.
sta 0xe0 ; a_lo_lo
stx 0xe2 ; a_lo_hi
lda 0x4, s
sta 0xe4 ; a_hi_lo
lda 0x6, s
sta 0xe6 ; a_hi_hi
sta 0xe4 ; a_hi_lo
lda 0x8, s
sta 0xe8 ; b_lo_lo
sta 0xe6 ; a_hi_hi
lda 0xa, s
sta 0xea ; b_lo_hi
sta 0xe8 ; b_lo_lo
lda 0xc, s
sta 0xec ; b_hi_lo
sta 0xea ; b_lo_hi
lda 0xe, s
sta 0xec ; b_hi_lo
lda 0x10, s
sta 0xee ; b_hi_hi
rts
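
The slot arithmetic described in the comment above can be sketched in C. This is an illustrative model, not code from libgcc.s: `stackArgSlot` and `checkStackSlots` are hypothetical names, and the `+1` assumes the empty-descending stack (S points at the next free byte) stated in STATUS.md.

```c
#include <assert.h>

enum { JSR_RET_BYTES = 2, JSL_RET_BYTES = 3 };

/* Stack-relative offset of the argIndex-th 16-bit stack argument,
   given the total size of the return addresses pushed above it.
   Offset 1 is the most recently pushed byte. */
static unsigned stackArgSlot(unsigned retBytes, unsigned argIndex) {
    assert(retBytes >= JSR_RET_BYTES); /* at least one return address */
    return 1 + retBytes + 2 * argIndex;
}

void checkStackSlots(void) {
    /* JSR inside a JSL-called helper: 2 + 3 return bytes. */
    assert(stackArgSlot(JSR_RET_BYTES + JSL_RET_BYTES, 0) == 0x6);
    assert(stackArgSlot(JSR_RET_BYTES + JSL_RET_BYTES, 5) == 0x10);
    /* A direct JSL only: offsets 4..14, the values the earlier code used. */
    assert(stackArgSlot(JSL_RET_BYTES, 0) == 0x4);
    assert(stackArgSlot(JSL_RET_BYTES, 5) == 0xe);
}
```

Under this model the earlier offsets (4 through 14) are correct only when the routine is entered by a single JSL, which matches the failure described in the comment.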
@ -805,19 +816,28 @@ __muldi3:
; Loop 64 times on a's bits.
ldy #0x40
.Lmuldi_loop:
; Test bit 0 of a (= LSR a; C = old bit 0).
lda 0xe0
; Right-shift the 64-bit `a` by 1. $E0=lo..$E6=hi (matches the
; stash + __retdi convention). Shift the HI word first: LSR puts 0
; into bit 63 and moves $E6's bit 0 into C, then each ROR feeds C
; into the top of the next-LOWER word. That is the right-shift
; direction in a $E0=lo layout. After the chain, C = original bit 0
; of $E0 = bit 0 of the 64-bit value, which is what the BCC tests.
; The earlier code shifted lo first, which ran the chain in the
; WRONG direction (lo -> hi) and tested bit 0 of $E6 (bit 48)
; instead of bit 0, so every multiply involving bits 16+ came back
; garbage.
lda 0xe6
lsr a
sta 0xe0
lda 0xe2
ror a
sta 0xe2
sta 0xe6
lda 0xe4
ror a
sta 0xe4
lda 0xe6
lda 0xe2
ror a
sta 0xe6
sta 0xe2
lda 0xe0
ror a
sta 0xe0
bcc .Lmuldi_noadd
; Add b ($E8..$EE) to product ($F2..$F8).
clc
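
The LSR/ROR chain above can be modelled on the host to confirm the shift direction. This is a sketch under the layout stated in the comment (`w[0]` = $E0, the low word, through `w[3]` = $E6, the high word); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* LSR: shift right, 0 enters bit 15, old bit 0 becomes the carry. */
static unsigned lsr16(uint16_t *v) {
    unsigned carryOut = *v & 1u;
    *v = (uint16_t)(*v >> 1);
    return carryOut;
}

/* ROR: shift right, carry-in enters bit 15, old bit 0 becomes the carry. */
static unsigned ror16(uint16_t *v, unsigned carryIn) {
    unsigned carryOut = *v & 1u;
    *v = (uint16_t)((*v >> 1) | (carryIn << 15));
    return carryOut;
}

/* 64-bit right shift by one, high word first; returns the final carry,
   which is the original bit 0 of the 64-bit value. */
static unsigned shiftRight64(uint16_t w[4]) {
    unsigned c = lsr16(&w[3]);
    c = ror16(&w[2], c);
    c = ror16(&w[1], c);
    c = ror16(&w[0], c);
    return c;
}

void checkShiftChain(void) {
    /* 0x0001000000000003 >> 1 == 0x0000800000000001, bit 0 was 1. */
    uint16_t w[4] = { 0x0003, 0x0000, 0x0000, 0x0001 };
    unsigned c = shiftRight64(w);
    assert(c == 1);
    assert(w[0] == 0x0001 && w[1] == 0x0000 && w[2] == 0x8000 && w[3] == 0x0000);
}
```

Running the chain low word first would instead push each word's bit 0 into the top of the next-higher word, which is the failure the comment describes.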


@ -111,16 +111,25 @@ u64 __negdf2(u64 a) {
return a ^ DSIGN_BIT;
}
u64 __muldf3(u64 a, u64 b) {
u64 sa, sb, ma, mb;
s16 ea, eb;
u16 ca = dclass(a, &sa, &ea, &ma);
u16 cb = dclass(b, &sb, &eb, &mb);
u64 sr = sa ^ sb;
if (ca == 0 || cb == 0) return sr;
// Truncated 64*64 → high-64 product via 32*32 partials. We only
// need the upper bits of the 106-bit product because the mantissas
// are 53 bits each.
// Carry the high 64 bits of a 128-bit product in `hi` and the low 64
// in `lo`. Carry bit indicates whether the leading bit was at 105
// (caller must increment exponent).
typedef struct {
u64 mantissa;
u16 carry;
} MantCarryT;
// 64x64 -> 128-bit product via 32x32 partials. Returning both halves
// as a packed pair is not possible here, so we align in-line and
// return the 53-bit mantissa directly, with the exponent carry in
// *out_carry. Splitting this out of __muldf3 keeps register pressure
// low enough for greedy regalloc on the single-A W65816.
//
// Inlinable on purpose: passing a pointer to a stack local across a
// noinline boundary lowers to `sta (d,s),y` which uses DBR-relative
// addressing — broken under DBR != 0 (e.g. after a bank switch).
// Keeping these inline keeps the stores within the caller's frame.
static inline u64 mulhi64Aligned(u64 ma, u64 mb, u16 *out_carry) {
u32 alo = (u32)ma;
u32 ahi = (u32)(ma >> 32);
u32 blo = (u32)mb;
@ -131,16 +140,26 @@ u64 __muldf3(u64 a, u64 b) {
u64 hh = (u64)ahi * (u64)bhi;
u64 mid = lh + hl + (ll >> 32);
u64 prod_hi = hh + (mid >> 32);
s16 er = ea + eb;
while (prod_hi & ~(DMANT_LEAD | DMANT_MASK)) {
prod_hi >>= 1;
er++;
u64 prod_lo = (ll & 0xFFFFFFFFULL) | ((mid & 0xFFFFFFFFULL) << 32);
if (prod_hi & (1ULL << 41)) {
*out_carry = 1;
return (prod_hi << 11) | (prod_lo >> 53);
}
while ((prod_hi & DMANT_LEAD) == 0 && prod_hi != 0) {
prod_hi <<= 1;
er--;
*out_carry = 0;
return (prod_hi << 12) | (prod_lo >> 52);
}
return dpack(sr, er, prod_hi);
u64 __muldf3(u64 a, u64 b) {
u64 sa, sb, ma, mb;
s16 ea, eb;
u16 ca = dclass(a, &sa, &ea, &ma);
u16 cb = dclass(b, &sb, &eb, &mb);
u64 sr = sa ^ sb;
if (ca == 0 || cb == 0) return sr;
u16 carry;
u64 mr = mulhi64Aligned(ma, mb, &carry);
s16 er = ea + eb + (s16)carry;
return dpack(sr, er, mr);
}
u64 __divdf3(u64 a, u64 b) {
@ -151,26 +170,29 @@ u64 __divdf3(u64 a, u64 b) {
u64 sr = sa ^ sb;
if (ca == 0) return sr;
if (cb == 0) return sr | DEXP_MASK; // div-by-zero → inf
// Long division: shift a left by 11 to make room for quotient bits.
u64 q = 0;
u64 r = ma;
for (int i = 0; i < 53; i++) {
// Long division: handle the leading quotient bit explicitly (since
// we need to "consume" the dividend's leading 1 by subtracting),
// then generate 52 more fractional bits by shifting r left and
// testing. The previous shift-and-test-only loop over-counted
// when r == mb after subtraction (e.g. 2.0/1.0 returned ~4.0).
s16 er = ea - eb;
// Normalize so the dividend is in [mb, 2*mb). This ensures the
// leading quotient bit will land at position 52 below.
if (ma < mb) {
ma <<= 1;
er--;
}
// Handle the leading quotient bit explicitly.
u64 q = DMANT_LEAD;
u64 r = ma - mb;
// Compute 52 more fractional bits via standard shift-test-subtract.
for (int i = 51; i >= 0; i--) {
r <<= 1;
q <<= 1;
if (r >= mb) {
r -= mb;
q |= 1;
q |= (1ULL << i);
}
}
s16 er = ea - eb;
while (q & ~(DMANT_LEAD | DMANT_MASK)) {
q >>= 1;
er++;
}
while ((q & DMANT_LEAD) == 0 && q != 0) {
q <<= 1;
er--;
}
return dpack(sr, er, q);
}
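
The revised mantissa division can be exercised in isolation on a host compiler. The following is a minimal sketch, assuming `DMANT_LEAD` is bit 52 (the implicit leading bit of an IEEE 754 double) and that both mantissas arrive normalized in [2^52, 2^53); `divMantissa` and `checkDivMantissa` are hypothetical names, and the sketch truncates just as the patched code does.

```c
#include <assert.h>
#include <stdint.h>

#define DMANT_LEAD (1ULL << 52) /* assumed value of the source's constant */

/* Returns the 53-bit quotient mantissa of ma / mb (truncated) and
   adjusts *er when the dividend has to be pre-shifted. */
static uint64_t divMantissa(uint64_t ma, uint64_t mb, int16_t *er) {
    assert((ma & DMANT_LEAD) && (mb & DMANT_LEAD)); /* normalized inputs */
    if (ma < mb) {            /* bring the dividend into [mb, 2*mb) */
        ma <<= 1;
        *er -= 1;
    }
    uint64_t q = DMANT_LEAD;  /* leading quotient bit is always 1 */
    uint64_t r = ma - mb;
    for (int i = 51; i >= 0; i--) { /* 52 fractional bits */
        r <<= 1;
        if (r >= mb) {
            r -= mb;
            q |= 1ULL << i;
        }
    }
    return q;
}

void checkDivMantissa(void) {
    int16_t er = 0;
    /* 1.0 / 1.0: quotient mantissa is exactly 1.0, exponent unchanged. */
    assert(divMantissa(1ULL << 52, 1ULL << 52, &er) == (1ULL << 52) && er == 0);
    /* 1.5 / 1.0 = 1.5 */
    assert(divMantissa(3ULL << 51, 1ULL << 52, &er) == (3ULL << 51) && er == 0);
    /* 1.0 / 1.5 = 1.333... * 2^-1, truncated */
    assert(divMantissa(1ULL << 52, 3ULL << 51, &er) == 0x15555555555555ULL && er == -1);
}
```

Because the dividend is pre-normalized into [mb, 2*mb), the quotient always has its leading bit at position 52, so no post-normalization loop is needed.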


@ -1104,7 +1104,10 @@ int toInt(double x) { return (int)x; }
double fromInt(int n) { return (double)n; }
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblFile" -o "$oDblFile"
"$CLANG" --target=w65816 -O2 -ffunction-sections \
# softDouble.c uses -regalloc=fast because __muldf3's 64x64 -> 128
# multiply with the inlined alignment shifts overflows the greedy
# allocator's spill heuristics on the single-A target.
"$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdFile"
"$PROJECT_ROOT/tools/link816" -o "$binDblFile" \
--text-base 0x8000 --map "$mapDblFile" \
@ -1281,7 +1284,7 @@ int main(void) {
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblMame" -o "$oDblMame"
"$CLANG" --target=w65816 -O2 -ffunction-sections \
"$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdMame"
"$PROJECT_ROOT/tools/link816" -o "$binDblMame" \
--text-base 0x1000 \
@ -1402,7 +1405,7 @@ EOF
-c "$PROJECT_ROOT/runtime/src/libc.c" -o "$oLibcF"
"$CLANG" --target=w65816 -O2 -ffunction-sections \
-c "$PROJECT_ROOT/runtime/src/softFloat.c" -o "$oSfF"
"$CLANG" --target=w65816 -O2 -ffunction-sections \
"$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdF"
oCrt0F="$(mktemp --suffix=.o)"
"$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc" -arch=w65816 \
@ -1708,9 +1711,10 @@ EOF
fi
rm -f "$cP2File" "$oP2File" "$binP2File"
# Bubble sort with the loop form that compiles correctly
# (i=1..n; inner j+1<n-i+1). The other form `i<n-1; j<n-i-1`
# has an outstanding compiler bug (#65); use this canary form.
# Canonical bubble sort. Both this form (`i < n-1; j < n-i-1`)
# and the alternate form work after the BranchExpand bridge
# fix. Catches a regression in either BranchExpand or
# TiedDefSpill if the conditional flow gets miscompiled.
log "check: MAME runs bubble sort [4,1,3,2] → [1,2,3,4]"
cBsFile="$(mktemp --suffix=.c)"
oBsFile="$(mktemp --suffix=.o)"
@ -1721,8 +1725,8 @@ __attribute__((noinline)) void switchToBank2(void) {
}
unsigned short data[4] = { 4, 1, 3, 2 };
__attribute__((noinline)) void bubbleSort(unsigned short *arr, unsigned short n) {
for (unsigned short i = 1; i < n; i++) {
for (unsigned short j = 0; j + 1 < n - i + 1; j++) {
for (unsigned short i = 0; i < n - 1; i++) {
for (unsigned short j = 0; j < n - i - 1; j++) {
if (arr[j] > arr[j+1]) {
unsigned short t = arr[j];
arr[j] = arr[j+1];
@ -1752,8 +1756,507 @@ EOF
0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
die "MAME: bubbleSort([4,1,3,2]) != [1,2,3,4]"
fi
rm -f "$cBsFile" "$oBsFile" "$binBsFile" \
"$oLibcF" "$oSfF" "$oSdF" "$oCrt0F"
rm -f "$cBsFile" "$oBsFile" "$binBsFile"
# printf("ABCDE") returns 5. Canary for the BranchExpand
# leftover-BRA-Skip bug: without removing the original BRA
# after rewriting Bxx to INV_Bxx, the inserted Bridge MBB
# becomes unreachable and the conditional flow is lost. Also
# exercises vprintf's main loop end-to-end (no varargs).
log "check: MAME runs printf('ABCDE') → 5 (BranchExpand bridge regression)"
cPfFile="$(mktemp --suffix=.c)"
oPfFile="$(mktemp --suffix=.o)"
binPfFile="$(mktemp --suffix=.bin)"
cat > "$cPfFile" <<'EOF'
#include <stdio.h>
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
int main(void) {
int r = printf("ABCDE");
switchToBank2();
*(volatile unsigned short *)0x5000 = (unsigned short)r;
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections \
-I"$PROJECT_ROOT/runtime/include" -c "$cPfFile" -o "$oPfFile"
"$PROJECT_ROOT/tools/link816" -o "$binPfFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPfFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
"$binPfFile" 0x025000 0005 >/dev/null 2>&1; then
die "MAME: printf('ABCDE') != 5 (BranchExpand bridge regression)"
fi
rm -f "$cPfFile" "$oPfFile" "$binPfFile"
# parse('BCDE') with switch-on-spec — used to fail to link with
# PCREL8-out-of-range because long unconditional BRA didn't
# auto-relax to BRL. W65816BranchExpand now force-promotes
# long BRA to BRL.
log "check: MAME runs nested-loop+multiply f(4) → 120 (regalloc + BRA-relax)"
cFnFile="$(mktemp --suffix=.c)"
oFnFile="$(mktemp --suffix=.o)"
binFnFile="$(mktemp --suffix=.bin)"
cat > "$cFnFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) unsigned short f(unsigned short n) {
unsigned short s = 0;
for (unsigned short i = 0; i < n; i++)
for (unsigned short j = 0; j < n; j++)
s += i*n+j;
return s;
}
int main(void) {
unsigned short r = f(4);
switchToBank2();
*(volatile unsigned short *)0x5000 = r;
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cFnFile" -o "$oFnFile"
"$PROJECT_ROOT/tools/link816" -o "$binFnFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFnFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
"$binFnFile" 0x025000 0078 >/dev/null 2>&1; then
die "MAME: f(4) != 120 (regalloc + BRA-relax regression)"
fi
rm -f "$cFnFile" "$oFnFile" "$binFnFile"
# u64add through a noinline boundary — exercises the
# ADJCALLSTACKUP teardown's STA $E0 / LDA $E0 path that
# preserves Y across the SP-restore. The earlier PLY*N/2
# implementation clobbered Y, so any i64 return came back
# with the last popped arg in Y instead of the sum's mid-high.
# Recursive u64 factorial — exercises __muldi3 + i64 ABI through
# a recursive noinline boundary. 20! = 0x21c3_677c_82b4_0000.
# Used to come back as garbage because __divmoddi4_stash read
# caller args from slot 4 when it was actually JSR-called from
# __muldi3 (so slot 4 was the JSL ret address byte, not a_mh).
# dadd through a noinline boundary — exercises __adddf3 + the
# full i64-return ABI through a real call. The earlier soft-
# double smoke test ran `c = 1.5 + 2.5` inline, which clang
# constant-folds to a literal 0x4010... bit pattern — never
# actually executed __adddf3. This one calls a noinline
# `dadd` so the libcall and the i64 ABI run end-to-end.
# printf("%d", n) — used to crash MAME entirely because MachineCSE
# eliminated the `if (isLong)` re-test of *fmt as a "redundant"
# CMP (it had matched an earlier identical CMP), and the
# surviving BNE then read whatever leftover P-flag state happened
# to be in P from the last spec-dispatch CMP. Backend now
# disables MachineCSE entirely.
log "check: MAME runs printf('%%d %%d', 42, 99) chain (MachineCSE disable)"
cPdFile="$(mktemp --suffix=.c)"
oPdFile="$(mktemp --suffix=.o)"
binPdFile="$(mktemp --suffix=.bin)"
cat > "$cPdFile" <<'EOF'
#include <stdio.h>
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) int give42(void) { return 42; }
int main(void) {
// vprintf returns the increment count: 1 per format spec, 1 per
// non-spec char. "Hi %d ok\n" → H,i,' ',%d,' ',o,k,'\n' = 8.
int n = printf("Hi %d ok\n", give42());
switchToBank2();
*(volatile unsigned short *)0x5000 = (unsigned short)n;
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections \
-I"$PROJECT_ROOT/runtime/include" -c \
"$cPdFile" -o "$oPdFile"
"$PROJECT_ROOT/tools/link816" -o "$binPdFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPdFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binPdFile" 0x025000 0008 \
>/dev/null 2>&1; then
die "MAME: printf('Hi %d ok\\n', 42) != 8 (vprintf isLong / MachineCSE)"
fi
rm -f "$cPdFile" "$oPdFile" "$binPdFile"
log "check: MAME runs noinline dadd(1.5,2.5) → 4.0 (__adddf3 + i64 ABI)"
cDdFile="$(mktemp --suffix=.c)"
oDdFile="$(mktemp --suffix=.o)"
binDdFile="$(mktemp --suffix=.bin)"
cat > "$cDdFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) double dadd(double a, double b) { return a + b; }
int main(void) {
union { double d; unsigned short w[4]; } u;
u.d = dadd(1.5, 2.5);
switchToBank2();
*(volatile unsigned short *)0x5000 = u.w[0];
*(volatile unsigned short *)0x5002 = u.w[1];
*(volatile unsigned short *)0x5004 = u.w[2];
*(volatile unsigned short *)0x5006 = u.w[3];
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cDdFile" -o "$oDdFile"
"$PROJECT_ROOT/tools/link816" -o "$binDdFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDdFile" --check \
0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4010 \
>/dev/null 2>&1; then
die "MAME: noinline dadd(1.5,2.5) != 4.0 (i64-ABI through libcall)"
fi
rm -f "$cDdFile" "$oDdFile" "$binDdFile"
log "check: MAME runs fact_u64(20) → 0x21c3677c82b40000 (__muldi3 stash slots)"
cFkFile="$(mktemp --suffix=.c)"
oFkFile="$(mktemp --suffix=.o)"
binFkFile="$(mktemp --suffix=.bin)"
cat > "$cFkFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) unsigned long long fact_u64(unsigned int n) {
if (n <= 1) return 1ULL;
return (unsigned long long)n * fact_u64(n - 1);
}
int main(void) {
unsigned long long r = fact_u64(20);
union { unsigned long long u; unsigned short w[4]; } u;
u.u = r;
switchToBank2();
*(volatile unsigned short *)0x5000 = u.w[0];
*(volatile unsigned short *)0x5002 = u.w[1];
*(volatile unsigned short *)0x5004 = u.w[2];
*(volatile unsigned short *)0x5006 = u.w[3];
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cFkFile" -o "$oFkFile"
"$PROJECT_ROOT/tools/link816" -o "$binFkFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFkFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binFkFile" --check \
0x025000=0000 0x025002=82b4 0x025004=677c 0x025006=21c3 \
>/dev/null 2>&1; then
die "MAME: fact_u64(20) returned wrong bits (__muldi3 / stash slots)"
fi
rm -f "$cFkFile" "$oFkFile" "$binFkFile"
log "check: MAME runs u64add(0x3FF8...,0x4004...) → 0x7FFC... (call-up Y-preserve)"
cU64File="$(mktemp --suffix=.c)"
oU64File="$(mktemp --suffix=.o)"
binU64File="$(mktemp --suffix=.bin)"
cat > "$cU64File" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) unsigned long long u64add(unsigned long long a, unsigned long long b) {
return a + b;
}
int main(void) {
unsigned long long c = u64add(0x3FF8000000000000ULL, 0x4004000000000000ULL);
union { unsigned long long u; unsigned short w[4]; } u;
u.u = c;
switchToBank2();
*(volatile unsigned short *)0x5000 = u.w[0];
*(volatile unsigned short *)0x5002 = u.w[1];
*(volatile unsigned short *)0x5004 = u.w[2];
*(volatile unsigned short *)0x5006 = u.w[3];
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cU64File" -o "$oU64File"
"$PROJECT_ROOT/tools/link816" -o "$binU64File" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oU64File" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binU64File" --check \
0x025000=0000 0x025002=0000 0x025004=0000 0x025006=7ffc \
>/dev/null 2>&1; then
die "MAME: u64add through noinline returned wrong middle halves (call-up Y-clobber)"
fi
rm -f "$cU64File" "$oU64File" "$binU64File"
log "check: MAME runs addOff(p,1) p[0]+=p[1] → 12 (StackSlotCleanup killed-Y respect)"
cAofFile="$(mktemp --suffix=.c)"
oAofFile="$(mktemp --suffix=.o)"
binAofFile="$(mktemp --suffix=.bin)"
cat > "$cAofFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) short addOff(short *p, short i) {
short b = p[i];
p[i-1] = p[i-1] + b;
return p[i-1];
}
int main(void) {
short stk[2] = { 5, 7 };
short r = addOff(stk, 1);
short s0 = stk[0];
switchToBank2();
*(volatile unsigned short *)0x5000 = (unsigned short)r;
*(volatile unsigned short *)0x5002 = (unsigned short)s0;
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cAofFile" -o "$oAofFile"
"$PROJECT_ROOT/tools/link816" -o "$binAofFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oAofFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
"$binAofFile" --check 0x025000=000c 0x025002=000c \
>/dev/null 2>&1; then
die "MAME: addOff p[i-1]+=p[i] returned wrong store (NegYIndY/X-clobber or LDY-erase)"
fi
rm -f "$cAofFile" "$oAofFile" "$binAofFile"
log "check: MAME runs sqr(10) → 100 (frame-less ADJCALLSTACKUP must emit PLY)"
cSqrFile="$(mktemp --suffix=.c)"
oSqrFile="$(mktemp --suffix=.o)"
binSqrFile="$(mktemp --suffix=.bin)"
cat > "$cSqrFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) unsigned short sqr(unsigned short x) { return x * x; }
int main(void) {
unsigned short r = sqr(10);
switchToBank2();
*(volatile unsigned short *)0x5000 = r;
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cSqrFile" -o "$oSqrFile"
"$PROJECT_ROOT/tools/link816" -o "$binSqrFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqrFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
"$binSqrFile" --check 0x025000=0064 >/dev/null 2>&1; then
die "MAME: sqr(10) crashed or != 100 (ADJCALLSTACKUP not emitting PLY for frame-less)"
fi
rm -f "$cSqrFile" "$oSqrFile" "$binSqrFile"
log "check: MAME runs ddiv(8.0,4.0) → 2.0 (__divdf3 algorithm fix)"
cDdvFile="$(mktemp --suffix=.c)"
oDdvFile="$(mktemp --suffix=.o)"
binDdvFile="$(mktemp --suffix=.bin)"
cat > "$cDdvFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) double ddiv(double a, double b) { return a / b; }
int main(void) {
union { double d; unsigned short w[4]; } u;
u.d = ddiv(8.0, 4.0);
switchToBank2();
*(volatile unsigned short *)0x5000 = u.w[0];
*(volatile unsigned short *)0x5002 = u.w[1];
*(volatile unsigned short *)0x5004 = u.w[2];
*(volatile unsigned short *)0x5006 = u.w[3];
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cDdvFile" -o "$oDdvFile"
"$PROJECT_ROOT/tools/link816" -o "$binDdvFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdvFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
"$binDdvFile" --check 0x025000=0000 0x025002=0000 \
0x025004=0000 0x025006=4000 >/dev/null 2>&1; then
die "MAME: ddiv(8,4) != 2.0 (__divdf3 long-division bug)"
fi
rm -f "$cDdvFile" "$oDdvFile" "$binDdvFile"
log "check: MAME runs Newton-iter loop → high-half ~1.41 (BranchExpand self-loop BRA fix)"
cSqFile="$(mktemp --suffix=.c)"
oSqFile="$(mktemp --suffix=.o)"
binSqFile="$(mktemp --suffix=.bin)"
# 3-iter Newton-method sqrt with a counted for-loop (the loop-back
# BRA is a self-loop, which the BranchExpand distance estimator
# used to report as 0 bytes, so it never promoted to BRL even
# when the loop body grew well past +/-128 bytes).
cat > "$cSqFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) double sqrt3(double x) {
double g = x * 0.5;
for (unsigned short i = 0; i < 3; i++)
g = (g + x / g) * 0.5;
return g;
}
int main(void) {
union { double d; unsigned short w[4]; } u;
u.d = sqrt3(2.0);
switchToBank2();
// Only the high half is precision-stable (low halves vary slightly
// due to truncation vs round-to-nearest in __divdf3). Verify just
// the high half — that's enough to prove the self-loop BRA was
// promoted (the link would have failed otherwise) and __divdf3 is
// converging to the right magnitude.
*(volatile unsigned short *)0x5006 = u.w[3];
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cSqFile" -o "$oSqFile"
"$PROJECT_ROOT/tools/link816" -o "$binSqFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
"$binSqFile" --check 0x025006=3ff6 >/dev/null 2>&1; then
die "MAME: sqrt3(2.0) high half wrong (self-loop BRA / __divdf3)"
fi
rm -f "$cSqFile" "$oSqFile" "$binSqFile"
log "check: MAME runs -O0 addOne(7) → 8 (lda-overwrite-immediate fix; fast regalloc)"
cO0File="$(mktemp --suffix=.c)"
oO0File="$(mktemp --suffix=.o)"
binO0File="$(mktemp --suffix=.bin)"
cat > "$cO0File" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
unsigned short addOne(unsigned short a) { return a + 1; }
int main(void) {
unsigned short r = addOne(7);
switchToBank2();
*(volatile unsigned short *)0x5000 = r;
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O0 -ffunction-sections -c \
"$cO0File" -o "$oO0File"
"$PROJECT_ROOT/tools/link816" -o "$binO0File" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oO0File" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
"$binO0File" --check 0x025000=0008 >/dev/null 2>&1; then
die "MAME: -O0 addOne(7) != 8 (lda overwrite immediate / regalloc choice)"
fi
rm -f "$cO0File" "$oO0File" "$binO0File"
log "check: MAME runs bubble sort with mySwap helper [4,1,3,2] → [1,2,3,4] (greedy across helper-call)"
cBshFile="$(mktemp --suffix=.c)"
oBshFile="$(mktemp --suffix=.o)"
binBshFile="$(mktemp --suffix=.bin)"
cat > "$cBshFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
unsigned short bsdata[4] = { 4, 1, 3, 2 };
__attribute__((noinline)) void mySwap(unsigned short *a, unsigned short *b) {
unsigned short t = *a; *a = *b; *b = t;
}
__attribute__((noinline)) void mySort(unsigned short *arr, unsigned short n) {
for (unsigned short i = 0; i < n - 1; i++)
for (unsigned short j = 0; j < n - i - 1; j++)
if (arr[j] > arr[j+1])
mySwap(&arr[j], &arr[j+1]);
}
int main(void) {
mySort(bsdata, 4);
unsigned short d0 = bsdata[0], d1 = bsdata[1], d2 = bsdata[2], d3 = bsdata[3];
switchToBank2();
*(volatile unsigned short *)0x5000 = d0;
*(volatile unsigned short *)0x5002 = d1;
*(volatile unsigned short *)0x5004 = d2;
*(volatile unsigned short *)0x5006 = d3;
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cBshFile" -o "$oBshFile"
"$PROJECT_ROOT/tools/link816" -o "$binBshFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oBshFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
"$binBshFile" --check 0x025000=0001 0x025002=0002 \
0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
die "MAME: mySort with mySwap helper miscompiled (greedy regalloc across call)"
fi
rm -f "$cBshFile" "$oBshFile" "$binBshFile"
log "check: MAME runs dmul(8.0,2.0) AFTER bank-switch → 16.0 (DPF0 store + __muldf3)"
cDmFile="$(mktemp --suffix=.c)"
oDmFile="$(mktemp --suffix=.o)"
binDmFile="$(mktemp --suffix=.bin)"
cat > "$cDmFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) double dmul(double a, double b) { return a * b; }
int main(void) {
union { double d; unsigned short w[4]; } u;
switchToBank2();
u.d = dmul(8.0, 2.0);
*(volatile unsigned short *)0x5000 = u.w[0];
*(volatile unsigned short *)0x5002 = u.w[1];
*(volatile unsigned short *)0x5004 = u.w[2];
*(volatile unsigned short *)0x5006 = u.w[3];
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cDmFile" -o "$oDmFile"
"$PROJECT_ROOT/tools/link816" -o "$binDmFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmFile" --check \
0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \
>/dev/null 2>&1; then
die "MAME: dmul(8,2) under DBR=2 produced wrong bits (DPF0 store / __muldf3)"
fi
rm -f "$cDmFile" "$oDmFile" "$binDmFile"
log "check: MAME runs dmath = (a+b)*(a-b), 5,3 → 16.0 (chained libcall ABI)"
cDmaFile="$(mktemp --suffix=.c)"
oDmaFile="$(mktemp --suffix=.o)"
binDmaFile="$(mktemp --suffix=.bin)"
cat > "$cDmaFile" <<'EOF'
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
__attribute__((noinline)) double dadd(double a, double b) { return a + b; }
__attribute__((noinline)) double dsub(double a, double b) { return a - b; }
__attribute__((noinline)) double dmul(double a, double b) { return a * b; }
__attribute__((noinline)) double dmath(double a, double b) {
return dmul(dadd(a, b), dsub(a, b));
}
int main(void) {
union { double d; unsigned short w[4]; } u;
u.d = dmath(5.0, 3.0);
switchToBank2();
*(volatile unsigned short *)0x5000 = u.w[0];
*(volatile unsigned short *)0x5002 = u.w[1];
*(volatile unsigned short *)0x5004 = u.w[2];
*(volatile unsigned short *)0x5006 = u.w[3];
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
"$cDmaFile" -o "$oDmaFile"
"$PROJECT_ROOT/tools/link816" -o "$binDmaFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmaFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmaFile" --check \
0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \
>/dev/null 2>&1; then
die "MAME: dmath(5,3) returned wrong high half (DP[\$F0] CSE across libcalls)"
fi
rm -f "$cDmaFile" "$oDmaFile" "$binDmaFile"
rm -f "$oLibcF" "$oSfF" "$oSdF" "$oCrt0F"
else
warn "MAME or apple2gs ROMs not installed; skipping end-to-end test"
fi
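
Both soft-double checks above expect `0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030`. As a rough illustration of where those words come from, the sketch below splits a double into four 16-bit words, least-significant first, the way the tests' `union { double d; unsigned short w[4]; }` would lay them out on a little-endian target. `doubleWords` is a hypothetical helper, not part of the test suite, and it assumes the host `double` is IEEE-754 binary64.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not in the repo): split a double into four 16-bit
// words, least-significant word first. Assumes an IEEE-754 binary64 double.
std::array<std::uint16_t, 4> doubleWords(double d) {
    std::uint64_t bits = 0;
    std::memcpy(&bits, &d, sizeof bits);  // well-defined way to read the bits
    std::array<std::uint16_t, 4> w{};
    for (int i = 0; i < 4; ++i)
        w[i] = static_cast<std::uint16_t>(bits >> (16 * i));  // word i = bits 16i..16i+15
    return w;
}
```

For 16.0 the bit pattern is 0x4030000000000000, so only the top word is non-zero; a stale 4.0 would show 4010 in that slot instead.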


@ -131,6 +131,7 @@ static bool clobbersImg(const MachineInstr &MI,
bool W65816ABridgeViaX::runOnMachineFunction(MachineFunction &MF) {
if (!MF.getRegInfo().getNumVirtRegs()) return false;
if (MF.getFunction().hasOptNone()) return false;
MachineRegisterInfo &MRI = MF.getRegInfo();
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
const W65816InstrInfo *TII = STI.getInstrInfo();


@ -83,21 +83,71 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
switch (MI->getOpcode()) {
default:
break;
case W65816::ADJCALLSTACKDOWN:
case W65816::ADJCALLSTACKDOWN: {
// DOWN is a no-op in our scheme — the PUSH16 sequence in LowerCall
// already shifted SP incrementally as args were pushed. Nothing
// to emit; PEI may or may not have processed it, either is fine.
return;
}
case W65816::ADJCALLSTACKUP: {
// PEI's eliminateCallFramePseudoInstr removes these *only* when the
// function has frame work (StackSize > 0 or any FrameIndex use).
// Functions that just tail-call into a libcall (e.g. `int toInt(float
// x) { return (int)x; }` lowers to a single jsl __fixsfsi) have
// neither; PEI skips its call-frame phase and the pseudo survives
// to MC. AsmStreamer renders the pseudo's "# ADJCALLSTACK..."
// string as a comment, but MCObjectStreamer asks the encoder to
// emit bytes — which fails ("Unsupported instruction MCInst 337").
// Dropping it here is correct: when amt is zero (the "no frame"
// path) the call sequence is a no-op anyway; when non-zero, PEI
// would have replaced it with PLA-loop / TSC-ADC sequence already.
// If we ever see a non-zero amount slip through, that's a real
// bug — emit nothing and trust the comment-stripped path.
// PEI's eliminateCallFramePseudoInstr handles UP whenever the
// function has any frame work (StackSize > 0 or any FI use).
// Frame-less functions — e.g. `unsigned short sqr(unsigned short
// x) { return x*x; }` lowers to PUSH16 + jsl __mulhi3 + RTL with
// no locals — get skipped by PEI's call-frame phase, leaving
// ADJCALLSTACKUP as a pseudo all the way to here. Previously we
// silently dropped it, which left SP off by N bytes after the
// call and corrupted the caller's stack frame (caught by sqr(x)
// segfaulting MAME). Emit the SP fixup ourselves: PLY*N/2 for
// small even N, otherwise the TAY/TSC-ADC/TYA bracket.
int N = MI->getOperand(0).getImm();
if (N == 0) return;
// A holds the callee's return value; preserve it. Walk forward
// looking for a Y use (an i64-return half); none of the fixup paths
// below touch X, so only Y needs checking. Same scan as
// eliminateCallFramePseudoInstr.
bool YLive = false;
for (auto J = std::next(MI->getIterator()); J != MI->getParent()->end();
++J) {
if (J->isCall()) break;
bool yDef = false;
for (const MachineOperand &MO : J->operands()) {
if (!MO.isReg()) continue;
if (MO.getReg() == W65816::Y) {
if (MO.isUse()) { YLive = true; break; }
if (MO.isDef()) yDef = true;
}
}
if (YLive || yDef) break;
}
if (YLive) {
// Route through DP $E0 to preserve both A and Y.
MCInst Sta; Sta.setOpcode(W65816::STA_DP);
Sta.addOperand(MCOperand::createImm(0xE0));
EmitToStreamer(*OutStreamer, Sta);
MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc);
MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc);
MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16);
Adc.addOperand(MCOperand::createImm(N));
EmitToStreamer(*OutStreamer, Adc);
MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs);
MCInst Lda; Lda.setOpcode(W65816::LDA_DP);
Lda.addOperand(MCOperand::createImm(0xE0));
EmitToStreamer(*OutStreamer, Lda);
} else if (N <= 14 && (N % 2) == 0) {
for (int i = 0; i < N / 2; ++i) {
MCInst Ply; Ply.setOpcode(W65816::PLY);
EmitToStreamer(*OutStreamer, Ply);
}
} else {
MCInst Tay; Tay.setOpcode(W65816::TAY); EmitToStreamer(*OutStreamer, Tay);
MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc);
MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc);
MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16);
Adc.addOperand(MCOperand::createImm(N));
EmitToStreamer(*OutStreamer, Adc);
MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs);
MCInst Tya; Tya.setOpcode(W65816::TYA); EmitToStreamer(*OutStreamer, Tya);
}
return;
}
case W65816::LDXi16imm: {
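
The ADJCALLSTACKUP case above picks one of three SP-fixup sequences. A minimal sketch of that decision, with mnemonics modelled as plain strings, might look like the following. `spFixup` and the string spellings are hypothetical and exist only for illustration; the real code emits `MCInst`s through `EmitToStreamer`.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of the fixup choice: n is the number of argument bytes
// to release, yLive says whether Y carries part of the return value.
std::vector<std::string> spFixup(int n, bool yLive) {
    std::vector<std::string> seq;
    if (n == 0) return seq;  // nothing was pushed, nothing to release
    if (yLive) {
        // Y must survive: park A in DP $E0 while A is used to adjust SP.
        seq = {"sta $e0", "tsc", "clc", "adc #" + std::to_string(n), "tcs", "lda $e0"};
    } else if (n <= 14 && n % 2 == 0) {
        // Small even amount: pop two bytes at a time into the dead Y.
        seq.assign(static_cast<std::size_t>(n / 2), std::string("ply"));
    } else {
        // General case: save A in Y around the TSC/ADC/TCS adjustment.
        seq = {"tay", "tsc", "clc", "adc #" + std::to_string(n), "tcs", "tya"};
    }
    return seq;
}
```

Under this model `spFixup(4, false)` yields two `ply` entries, while `spFixup(4, true)` takes the DP `$E0` route.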


@ -46,6 +46,7 @@
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;
@ -100,7 +101,17 @@ static unsigned estimateDistance(MachineFunction &MF,
const MachineInstr &Br,
MachineBasicBlock *To) {
const MachineBasicBlock *From = Br.getParent();
if (From == To) return 0;
// Self-loop branch: target is the start of From, branch is somewhere
// inside From. Distance is the bytes from start of From to the
// branch instruction (i.e., everything before Br in From).
if (From == To) {
unsigned Bytes = 0;
for (const auto &MI : *From) {
if (&MI == &Br) break;
Bytes += TII->getInstSizeInBytes(MI);
}
return Bytes;
}
// Two cases by layout direction:
// forward: bytes after Br in From, plus all of MBBs strictly
@ -276,11 +287,30 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) {
// Step 2: iterate to fixed-point. Each expansion adds 3 bytes
// (bridge BRL), which may push another previously-OK branch over
// the threshold. Cap at MAX_ITERS to avoid pathological cases.
const unsigned EXPAND_DIST_THRESHOLD = 100; // safe under +/-128
const unsigned EXPAND_DIST_THRESHOLD = 90; // tighter margin under +/-128
const unsigned MAX_ITERS = 10;
for (unsigned iter = 0; iter < MAX_ITERS; ++iter) {
bool Changed = false;
// Promote long BRA to BRL. The assembler's BRA→BRL relaxation
// sometimes fails to fire when the target symbol resolves early
// in MC layout; the linker then sees a PCREL8 reloc that's out
// of range. Force the BRL ourselves when the estimate exceeds
// the safe threshold; this costs one byte if BRA would have fit,
// but beats a hard link error.
for (auto &MBB : MF) {
for (auto &MI : MBB.terminators()) {
if (MI.getOpcode() != W65816::BRA) continue;
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isMBB()) continue;
MachineBasicBlock *Target = MI.getOperand(0).getMBB();
unsigned Dist = estimateDistance(MF, TII, MI, Target);
if (Dist > EXPAND_DIST_THRESHOLD) {
MI.setDesc(TII->get(W65816::BRL));
Changed = true;
}
}
}
// Collect candidates. After step 1, each MBB has at most one
// conditional terminator, so we walk terminators().
SmallVector<std::pair<MachineBasicBlock *, MachineInstr *>, 8> Candidates;
@ -337,6 +367,27 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) {
// fall-through marker after stays after.
auto insertPt = MBB->getFirstTerminator();
BuildMI(*MBB, insertPt, DL, TII->get(InvOpc)).addMBB(Skip);
// After the rewrite, MBB falls through to Bridge (which now sits
// immediately after MBB in layout). Any unconditional BRA/BRL
// already at the end of MBB used to direct the fall-through to
// Skip — but with Bridge interposed, that BRA would skip past
// Bridge entirely and Bridge becomes unreachable. Remove it.
// (Skip is still reachable via INV_Bxx; Target is reachable via
// fall-through-to-Bridge then BRL.) Caught by vprintf crashing
// because dropDeadConditionalsToBRATarget then dropped the
// INV_Bxx as redundant with the leftover BRA Skip.
while (insertPt != MBB->end()) {
unsigned NextOpc = insertPt->getOpcode();
if (NextOpc == W65816::BRA || NextOpc == W65816::BRL) {
if (insertPt->getNumOperands() >= 1 &&
insertPt->getOperand(0).isMBB() &&
insertPt->getOperand(0).getMBB() == Skip) {
insertPt = insertPt->eraseFromParent();
continue;
}
}
++insertPt;
}
// Bridge: BRL Target. Always emit the long form rather than
// relying on the assembler to relax BRA→BRL — the relaxation
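
The self-loop case in `estimateDistance` and the BRA promotion rule can be sketched in isolation. The version below models a block as a list of instruction byte sizes; `selfLoopDistance` and `needsLongBranch` are hypothetical names, and the default threshold of 90 is simply the value of `EXPAND_DIST_THRESHOLD` in this hunk.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model: distance for a branch back to the start of its own
// block is the total size of the instructions that precede the branch.
unsigned selfLoopDistance(const std::vector<unsigned> &sizes, std::size_t brIndex) {
    unsigned bytes = 0;
    for (std::size_t i = 0; i < brIndex && i < sizes.size(); ++i)
        bytes += sizes[i];
    return bytes;
}

// A branch is promoted to the long form once the estimate exceeds the threshold.
bool needsLongBranch(unsigned dist, unsigned threshold = 90) {
    return dist > threshold;
}
```

The point of the stricter threshold is headroom: the estimate is taken before later expansions add bytes, so it stays well below the 8-bit displacement limit.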


@ -162,15 +162,39 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
// Insert before the terminator (the return).
DebugLoc DL = MBBI != MBB.end() ? MBBI->getDebugLoc() : DebugLoc();
// Detect whether the return live-out includes Y or X — for i64 returns
// (Outs[0..2] -> A,X,Y), Y holds bits 32-47 and X holds bits 16-31, so
// any TAY/PLY/TAX in the SP-restore would corrupt the return value.
// The RTL terminator carries implicit-uses for every live-out return
// register; scan them to decide which scratch we can use safely.
bool YLive = false;
bool XLive = false;
if (MBBI != MBB.end() && MBBI->isReturn()) {
for (const MachineOperand &MO : MBBI->operands()) {
if (!MO.isReg() || !MO.isImplicit() || !MO.isUse()) continue;
if (MO.getReg() == W65816::Y) YLive = true;
else if (MO.getReg() == W65816::X) XLive = true;
}
}
// VLA cleanup: restore entry SP from DP $F4 (saved in prologue).
// This subsumes BOTH the static frame and any dynamic_stackalloc
// bytes — we can skip the per-byte PLY/PLA loop entirely. Preserve
// A through TAY/TYA since it holds the return value.
// A through TAY/TYA since it holds the return value. For i64
// returns where Y is also live, route the save through DP $E0
// ($E0..$EF is libcall scratch — guaranteed dead by epilogue time).
if (HasVLA) {
if (YLive) {
BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
} else {
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA));
}
return;
}
@ -182,11 +206,26 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
// N/2 PLY (pop into Y, discard); larger frames use
// TAY/TSC/CLC/ADC #N/TCS/TYA.
// Mirror the prologue threshold (see comment there).
if (StackSize <= 6 && (StackSize % 2) == 0) {
if (StackSize <= 6 && (StackSize % 2) == 0 && !YLive) {
// PLY clobbers Y, which is fine when Y isn't a return reg.
for (uint64_t i = 0; i < StackSize / 2; ++i)
BuildMI(MBB, MBBI, DL, TII.get(W65816::PLY));
return;
}
if (YLive) {
// Y is a return register (i64 / double). Save A via DP $E0
// instead of TAY so Y survives. 4 cyc slower than TAY/TYA but
// correct. X is allowed to be live too — none of these touch X.
BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
BuildMI(MBB, MBBI, DL, TII.get(W65816::ADC_Imm16))
.addImm(StackSize);
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
(void)XLive;
return;
}
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
@ -207,15 +246,56 @@ MachineBasicBlock::iterator W65816FrameLowering::eliminateCallFramePseudoInstr(
// ADJCALLSTACKUP releases all the pushed bytes after a call.
//
// Critical: A holds the callee's return value here, so this MUST NOT
// clobber A. The naive `tsc;clc;adc #N;tcs` does (TSC overwrites A),
// which silently corrupts every call's return value. Same fix as the
// epilogue: small N via PLY (clobbers Y, preserves A); larger N via
// TAY/.../TYA bracket.
// clobber A. PLY (small-N path) clobbers Y; TAY/.../TYA bracket
// (large-N path) also clobbers Y. Both are fine for i8/i16/i32
// returns but DESTROY the return for i64/double (where X and Y hold
// the mid halves). Detect i64-return calls by walking forward from
// the ADJCALLSTACKUP and checking for uses of $x/$y before the next
// call or redefinition; in that case, save A via DP $E0 (libcall
// scratch, dead by call-up time) so X and Y survive.
// Caught by `unsigned long long u64add(a,b)` through a noinline
// boundary returning Y = b_hi (the last popped) instead of the
// sum's mid-high.
if (I->getOpcode() == W65816::ADJCALLSTACKUP) {
int N = I->getOperand(0).getImm();
if (N > 0) {
DebugLoc DL = I->getDebugLoc();
if (N <= 14 && (N % 2) == 0) {
bool YLive = false;
bool XLive = false;
// Walk forward looking for COPY %vreg = $x / $y — LowerCall's
// pattern for materializing return halves. JSLpseudo's tablegen
// declares only `Defs=[A]`, so implicit-defs of X/Y aren't on
// the call op itself. We have to read what comes after.
// Stop at the next call (re-clobbers everything) or at any def
// of X/Y (cancels their post-call value).
bool Stopped = false;
for (auto J = std::next(I); J != MBB.end() && !Stopped; ++J) {
if (J->isCall()) break;
for (const MachineOperand &MO : J->operands()) {
if (!MO.isReg()) continue;
Register R = MO.getReg();
if (R == W65816::Y) {
if (MO.isUse()) YLive = true;
else if (MO.isDef() && !YLive) Stopped = true;
} else if (R == W65816::X) {
if (MO.isUse()) XLive = true;
else if (MO.isDef() && !XLive) Stopped = true;
}
}
if (YLive && XLive) break;
}
if (YLive) {
// i64 return: PLY would eat Y. Route through DP $E0. Costs
// ~4 cyc more than PLY*N/2, but correctness wins. X is not
// touched by any of these insns either way, so XLive doesn't
// change anything here; track it for symmetry.
BuildMI(MBB, I, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
BuildMI(MBB, I, DL, TII.get(W65816::TSC));
BuildMI(MBB, I, DL, TII.get(W65816::CLC));
BuildMI(MBB, I, DL, TII.get(W65816::ADC_Imm16)).addImm(N);
BuildMI(MBB, I, DL, TII.get(W65816::TCS));
BuildMI(MBB, I, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
(void)XLive;
} else if (N <= 14 && (N % 2) == 0) {
for (int i = 0; i < N / 2; ++i)
BuildMI(MBB, I, DL, TII.get(W65816::PLY));
} else {
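
The forward liveness scan used in this hunk (and mirrored in the AsmPrinter) reduces to a small rule: a register is live after the call if it is read before the next call or redefinition. A simplified sketch follows; the `Inst` struct and `yLiveAfter` are hypothetical stand-ins for `MachineInstr` operand walking and track only Y.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a MachineInstr: only the facts the scan needs.
struct Inst {
    bool isCall;
    bool usesY;
    bool defsY;
};

// True if Y is read at or after `start` before a call or a redefinition.
bool yLiveAfter(const std::vector<Inst> &block, std::size_t start) {
    for (std::size_t i = start; i < block.size(); ++i) {
        const Inst &in = block[i];
        if (in.isCall) return false;  // the next call re-clobbers everything
        if (in.usesY) return true;    // a read before any redefinition: live
        if (in.defsY) return false;   // overwritten first: the old value is dead
    }
    return false;  // reached the end of the block without a read
}
```

A use wins over a def on the same instruction, matching the operand loop in the source, which reports a Y use as soon as it sees one.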


@ -861,10 +861,17 @@ W65816TargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
Glue = V.getValue(2);
InVals.push_back(V);
} else {
// 4th half: load from DP $F0.
SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16);
SDValue V = DAG.getLoad(VT, DL, Chain, DPAddr, MachinePointerInfo());
// 4th half: read DP[$F0..$F1] via CopyFromReg(DPF0). DPF0 is a
// pseudo-physreg modeled as JSLpseudo's implicit-def, so each
// call's CopyFromReg has Glue tied to the corresponding call —
// the SDAG combiner can't merge them and the scheduler can't
// reorder them past the next call. copyPhysReg lowers DPF0 →
// A as `LDA $F0`. Without this, plain `getLoad(0xF0)` was
// being CSE'd / reordered across i64-returning calls, causing
// `dmath = (a+b)*(a-b)` to return 4 instead of 16.
SDValue V = DAG.getCopyFromReg(Chain, DL, W65816::DPF0, VT, Glue);
Chain = V.getValue(1);
Glue = V.getValue(2);
InVals.push_back(V);
}
}
@ -900,11 +907,17 @@ SDValue W65816TargetLowering::LowerReturn(
SDValue Glue;
SmallVector<SDValue, 8> RetOps(1, Chain);
// Outs[3] -> store to DP $F0 (only for i64 returns). Done first so
// its computation can use A freely before A holds the low result.
// Outs[3] -> DP $F0 via CopyToReg(DPF0). Using the DPF0 fake physreg
// (lowered to `STA $F0` by copyPhysReg) is critical: a generic
// ISD::STORE with addr=0xF0 lowered to `sta (d,s),y`, an indirect
// through the DBR, which silently misbehaved when DBR != 0. STA dp
// uses D + dp directly and is unaffected by DBR. Done first so its
// computation can use A freely before A holds the low result. Glued
// to RET_GLUE via the RetOps Register entry below so DCE doesn't
// strip the COPY.
if (Outs.size() >= 4) {
SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16);
Chain = DAG.getStore(Chain, DL, OutVals[3], DPAddr, MachinePointerInfo());
Chain = DAG.getCopyToReg(Chain, DL, W65816::DPF0, OutVals[3], Glue);
Glue = Chain.getValue(1);
}
// Outs[2] -> Y.
if (Outs.size() >= 3) {
@ -926,6 +939,8 @@ SDValue W65816TargetLowering::LowerReturn(
RetOps.push_back(DAG.getRegister(W65816::X, Outs[1].VT));
if (Outs.size() >= 3)
RetOps.push_back(DAG.getRegister(W65816::Y, Outs[2].VT));
if (Outs.size() >= 4)
RetOps.push_back(DAG.getRegister(W65816::DPF0, Outs[3].VT));
RetOps[0] = Chain;
if (Glue.getNode())
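
The DBR remark in the LowerReturn comment comes down to how the 65816 forms effective addresses: a direct-page operand is always resolved in bank 0 relative to D, while an absolute data operand takes its bank from DBR. A small sketch of the two calculations, assuming native mode and using hypothetical function names:

```cpp
#include <cstdint>

// Direct-page operand: bank 0, offset from the D register, wrapping at 64 KiB.
std::uint32_t directPageAddr(std::uint16_t d, std::uint8_t dp) {
    return static_cast<std::uint16_t>(d + dp);
}

// Absolute data operand: the bank byte comes from DBR.
std::uint32_t absoluteAddr(std::uint8_t dbr, std::uint16_t addr) {
    return (static_cast<std::uint32_t>(dbr) << 16) | addr;
}
```

This is also why the MAME tests check `0x025000` and neighbours: after `switchToBank2()` sets DBR to 2, the store to `0x5000` lands in bank 2, whereas `STA $F0` still hits the direct page in bank 0.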


@ -92,6 +92,44 @@ void W65816InstrInfo::copyPhysReg(MachineBasicBlock &MBB,
BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(dstImg);
return;
}
// X → IMGn / IMGn → X: STX dp / LDX dp. Avoids the A-bridge that
// TAX/TXA would impose; critical for i32-first-arg signatures
// (live-in $a + $x) where bridging X via A clobbers $a's value
// before it can be saved. Caught by udivmod and iterative qsort.
if (dstImg >= 0 && SrcReg == W65816::X) {
BuildMI(MBB, I, DL, get(W65816::STX_DP)).addImm(dstImg);
return;
}
if (DestReg == W65816::X && srcImg >= 0) {
BuildMI(MBB, I, DL, get(W65816::LDX_DP)).addImm(srcImg);
return;
}
// Y → IMGn / IMGn → Y: STY dp / LDY dp — symmetric.
if (dstImg >= 0 && SrcReg == W65816::Y) {
BuildMI(MBB, I, DL, get(W65816::STY_DP)).addImm(dstImg);
return;
}
if (DestReg == W65816::Y && srcImg >= 0) {
BuildMI(MBB, I, DL, get(W65816::LDY_DP)).addImm(srcImg);
return;
}
// DPF0 → A: emit `LDA $F0`. DPF0 is the pseudo-physreg carrier
// for an i64-returning call's high 16 bits; LowerCall builds a
// CopyFromReg(DPF0) glued to the call so the SDAG combiner /
// scheduler can't merge or reorder reads across calls.
if (DestReg == W65816::A && SrcReg == W65816::DPF0) {
BuildMI(MBB, I, DL, get(W65816::LDA_DP)).addImm(0xF0);
return;
}
// A → DPF0: emit `STA $F0`. Used by LowerReturn for the i64 high
// half; using a true direct-page store is critical because plain
// ISD::STORE with addr=0xF0 was lowering to `(d,s),y` indirect via
// DBR — which silently broke under DBR != 0 (e.g. after a bank
// switch). STA dp uses D + dp directly, ignoring DBR.
if (DestReg == W65816::DPF0 && SrcReg == W65816::A) {
BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(0xF0);
return;
}
llvm_unreachable("W65816: cross-class copyPhysReg not yet implemented");
}
@ -101,8 +139,14 @@ void W65816InstrInfo::storeRegToStackSlot(
MachineInstr::MIFlag Flags) const {
// STAfi gets eliminated by W65816RegisterInfo::eliminateFrameIndex into
// a real STA d,S. Source is implicit A; emit the pseudo with the FI
// and zero offset.
// and zero offset. When regalloc hands us a spill from X or Y, bridge
// through A (TXA / TYA) — same rationale as loadRegFromStackSlot.
DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
if (SrcReg == W65816::X || SrcReg == W65816::Y) {
unsigned XferOp = (SrcReg == W65816::X) ? W65816::TXA : W65816::TYA;
BuildMI(MBB, MI, DL, get(XferOp));
SrcReg = W65816::A;
}
BuildMI(MBB, MI, DL, get(W65816::STAfi))
.addReg(SrcReg, getKillRegState(isKill))
.addFrameIndex(FrameIdx)
@ -115,9 +159,30 @@ void W65816InstrInfo::loadRegFromStackSlot(MachineBasicBlock &MBB,
const TargetRegisterClass *RC,
Register VReg, unsigned SubReg,
MachineInstr::MIFlag Flags) const {
// Mirror image of storeRegToStackSlot: emit LDAfi, which the frame
// index pass turns into LDA d,S.
// LDAfi only knows how to put the value in A. If regalloc asks for
// a reload into X or Y, we have to bridge through A: LDA d,S then
// TAX / TAY. Without this, the MIR has `$x = LDAfi` but the asm
// printer emits just `LDA d,S` (which writes A, not X), a silent
// miscompile that surfaced as i64 subtract chains using stale X
// values for the second word (caught by udivmod's `a - q*b` mod
// computation).
DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
if (DestReg == W65816::A) {
BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
.addFrameIndex(FrameIdx)
.addImm(0);
return;
}
if (DestReg == W65816::X || DestReg == W65816::Y) {
// Load via A, then transfer. A is implicitly clobbered.
BuildMI(MBB, MI, DL, get(W65816::LDAfi), W65816::A)
.addFrameIndex(FrameIdx)
.addImm(0);
unsigned XferOp = (DestReg == W65816::X) ? W65816::TAX : W65816::TAY;
BuildMI(MBB, MI, DL, get(XferOp));
return;
}
// Fallback: assume A path (covers Acc16 / Wide16 vregs by class).
BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
.addFrameIndex(FrameIdx)
.addImm(0);
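
The spill and reload bridging described in the two stack-slot hooks can be summarised as a pair of instruction sequences per register. The sketch below models them as strings; `Reg`, `spillSequence` and `reloadSequence` are hypothetical names for illustration, not the backend's API.

```cpp
#include <string>
#include <vector>

enum class Reg { A, X, Y };

// Spill: only A can be stored stack-relative, so X/Y go through A first.
std::vector<std::string> spillSequence(Reg src) {
    std::vector<std::string> seq;
    if (src == Reg::X) seq.push_back("txa");
    else if (src == Reg::Y) seq.push_back("tya");
    seq.push_back("sta d,s");
    return seq;
}

// Reload: the load always lands in A, then transfers to X/Y if requested.
std::vector<std::string> reloadSequence(Reg dest) {
    std::vector<std::string> seq{"lda d,s"};
    if (dest == Reg::X) seq.push_back("tax");
    else if (dest == Reg::Y) seq.push_back("tay");
    return seq;
}
```

Both bridged forms overwrite A, which is the cost the source comments accept in exchange for not emitting a bare `LDA d,S` for an X or Y destination.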


@ -70,6 +70,7 @@ def W65816pushx : SDNode<"W65816ISD::PUSH_X", SDTNone,
[SDNPHasChain, SDNPInGlue, SDNPOutGlue,
SDNPSideEffect, SDNPMayStore]>;
// SELECT_CC: takes (TVal, FVal, CC) plus a glue value carrying the
// flags from a preceding W65816cmp. Lowered by EmitInstrWithCustomInserter
// into a CMP (already in the BB) + Bxx + diamond CFG + PHI.
@ -1356,10 +1357,18 @@ def : Pat<(store
// function doesn't have to know how it was called to choose its
// return instruction. A pseudo bridges the i16 symbol operand
// to JSL_Long's 24-bit operand class.
// Defs include DPF0: every i64-returning libcall clobbers DP[$F0]
// (it's the carrier for the highest 16 bits of the return). The
// LowerCall side captures the post-call DPF0 via CopyFromReg(DPF0)
// glued to the call so the SDAG combiner / scheduler can't merge
// or reorder reads across calls. Without DPF0 in Defs, plain
// `getLoad(0xF0)` was being CSE'd across calls, leading to
// `dmath = (a+b)*(a-b)` returning 4 instead of 16.
let isCall = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0,
Defs = [A] in {
Defs = [A, DPF0] in {
def JSLpseudo : W65816Pseudo<(outs), (ins i16imm:$dst),
"# JSLpseudo $dst", []>;
}
def : Pat<(W65816call (i16 tglobaladdr:$dst)), (JSLpseudo tglobaladdr:$dst)>;
def : Pat<(W65816call (i16 texternalsym:$dst)), (JSLpseudo texternalsym:$dst)>;


@ -40,6 +40,7 @@ class W65816MachineFunctionInfo : public MachineFunctionInfo {
/// STA8abs needs an SEP/REP wrap in M=0 to avoid a 2-byte store).
bool UsesAcc8 = false;
public:
W65816MachineFunctionInfo() = default;


@ -89,6 +89,31 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) {
continue;
unsigned Disp = MI.getOperand(0).getImm() & 0xFF;
DebugLoc DL = MI.getDebugLoc();
// X-liveness check: SpillToX may have stashed a value in X
// that's used after this rewrite. If so, save X to DP $E2
// (libcall scratch; $E0 is reserved for the A-save dance in
// eliminateCallFramePseudoInstr) and restore after.
// Walk forward from MI looking for an X use without a prior
// X def; if found, X is live and we must preserve it.
bool XLive = false;
for (auto Scan = std::next(MachineBasicBlock::iterator(&MI));
Scan != MBB.end(); ++Scan) {
if (Scan->isDebugInstr()) continue;
bool xDef = false;
for (const MachineOperand &MO : Scan->operands()) {
if (!MO.isReg()) continue;
if (MO.getReg() == W65816::X) {
if (MO.isUse()) { XLive = true; break; }
if (MO.isDef()) xDef = true;
}
}
if (XLive || xDef) break;
}
if (XLive) {
// Save X to DP $E2 (don't use $E0 — that's the A-preserve
// slot in call-frame teardown and may be live).
BuildMI(MBB, MI, DL, TII->get(W65816::STX_DP)).addImm(0xE2);
}
if (IsLDA) {
// LDA disp,S ; CLC ; ADC #neg ; TAX ; LDA $0000,X
BuildMI(MBB, MI, DL, TII->get(W65816::LDA_StackRel))
@ -127,6 +152,10 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) {
.addImm(0)
.addReg(W65816::A, RegState::Implicit);
}
if (XLive) {
// Restore X from DP $E2.
BuildMI(MBB, MI, DL, TII->get(W65816::LDX_DP)).addImm(0xE2);
}
// Erase original LDY and the (sr,s),Y op.
if (LastLDY) { LastLDY->eraseFromParent(); LastLDY = nullptr; }
MI.eraseFromParent();


@ -73,7 +73,30 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
bool NeedsCarryPrefix = false;
bool IsSub = false;
switch (Opc) {
case W65816::LDAfi: NewOpc = W65816::LDA_StackRel; break;
case W65816::LDAfi: {
// LDAfi targets A. If the regalloc parked the dest in X or Y
// (which can happen via Idx16 vreg coalescing), bridge through A
// by appending a TAX / TAY.
Register Dst = MI.getOperand(0).getReg();
int FI = MI.getOperand(FIOperandNum).getIndex();
int FrameOffset = MFI.getObjectOffset(FI);
int ImmOffset = MI.getOperand(FIOperandNum + 1).getImm();
int Offset = FrameOffset + ImmOffset + (int)MFI.getStackSize() + SPAdj;
if (FrameOffset < 0) Offset += 1;
if (Offset < 0 || Offset > 0xFF)
report_fatal_error("W65816: frame offset out of stack-relative range");
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
TII.get(W65816::LDA_StackRel))
.addImm(Offset)
.addReg(W65816::A, RegState::ImplicitDefine);
if (Dst == W65816::X) {
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAX));
} else if (Dst == W65816::Y) {
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAY));
}
MI.eraseFromParent();
return true;
}
case W65816::STAfi: {
// Wide16-source STAfi: if the source ended up in IMGn (DP-backed),
// prepend LDA dp so the value reaches A before the actual store.
@ -108,6 +131,12 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
TII.get(W65816::LDA_DP)).addImm(srcDP);
}
// Note: STAfi with X or Y source is NOT supported here — adding a
// TXA/TYA pre-bracket would clobber A which a downstream STAfi $a
// may still need (the prologue stashes arg0_lo from A and arg0_ml
// from X via two adjacent STAfi, and putting A's STA *before* X's
// is the caller's responsibility). storeRegToStackSlot already
// bridges X/Y → A for spills it generates.
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
TII.get(W65816::STA_StackRel))
.addImm(Offset)


@ -55,6 +55,15 @@ def IMG5 : W65816Reg<21, "img5">, DwarfRegNum<[21]>;
def IMG6 : W65816Reg<22, "img6">, DwarfRegNum<[22]>;
def IMG7 : W65816Reg<23, "img7">, DwarfRegNum<[23]>;
// DPF0: pseudo-physreg modeling the i16 storage at DP $F0..$F1.
// Used as the carrier for the highest 16 bits of an i64/double
// return. JSLpseudo Defs DPF0 so the SDAG combiner / scheduler
// can't merge or reorder reads of it across calls; we plumb the
// 4th return half via CopyFromReg(DPF0) in LowerCall, which lowers
// to `LDA $F0` via copyPhysReg. Never allocated to a vreg; it is
// always a transient bridge from DP[$F0] to A.
def DPF0 : W65816Reg<24, "dpf0">, DwarfRegNum<[24]>;
//===----------------------------------------------------------------------===//
// Register Classes
//===----------------------------------------------------------------------===//
@ -90,6 +99,13 @@ def Wide16 : RegisterClass<"W65816", [i16], 16,
def PtrRegs : RegisterClass<"W65816", [i16], 16, (add SP)>;
// Single-register class for DPF0, the i64-return high-half carrier.
// Not allocatable; only used as a CopyFromReg source in LowerCall.
// copyPhysReg lowers DPF0 → A by emitting `LDA $F0`.
def DPF0Reg : RegisterClass<"W65816", [i16], 16, (add DPF0)> {
let isAllocatable = 0;
}
// Single-register class for the processor status register, used for condition
// code modeling. Not currently allocatable.
def StatusReg : RegisterClass<"W65816", [i8], 8, (add P)> {


@ -1217,6 +1217,13 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
}
if (MI.isCall()) break;
if (MI.modifiesRegister(W65816::Y, TRI)) break;
// killsRegister: an instruction with `implicit killed $y` USES Y
// and that's the LAST use — Y is dead after. We must NOT treat
// a subsequent LDY_Imm16 #N as redundant after a kill, because
// the held value is conceptually gone. Caught by `addOff(p,i)
// { p[i-1] += p[i]; }` where LDY -2 ; LDA_indY (kills Y) ; ... ;
// LDY -2 ; STA_indY needs the second LDY to reinitialize Y.
if (MI.killsRegister(W65816::Y, TRI)) break;
if (MI.isInlineAsm() || MI.isBranch() || MI.isReturn()) break;
++It;
}


@ -14,6 +14,7 @@
#include "W65816.h"
#include "W65816MachineFunctionInfo.h"
#include "TargetInfo/W65816TargetInfo.h"
#include "llvm/CodeGen/MachineCSE.h"
#include "llvm/CodeGen/Passes.h"
#include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
#include "llvm/CodeGen/TargetPassConfig.h"
@ -82,16 +83,19 @@ public:
void addPreRegAlloc() override;
void addPostRegAlloc() override;
void addPreEmitPass() override;
void addMachineSSAOptimization() override;
// W65816's only 16-bit ALU register is A. We use fast regalloc by
// default — always succeeds, ~30-50% bigger code than greedy in
// pathological cases but correctness is paramount. Greedy fails
// outright on functions with 4+ simultaneously live i16 vregs (heap
// sift etc.). TiedDefSpill (pre-RA) handles the tied-def-multi-use
// hazard for the sub-pattern that's frequent enough to matter.
// W65816's only 16-bit ALU register is A. Greedy at -O1+ produces
// tight code; at -O0 (where optnone disables coalescing/CSE), greedy
// leaves spurious COPY pseudos that lower to STA dp / LDA dp pairs
// around modify-in-place ops (e.g. INA), miscompiling a + 1. Use
// fast regalloc when the target framework signals unoptimized.
// TiedDefSpill (pre-RA) handles the tied-def-multi-use hazard for
// the sub-pattern that's frequent enough to matter at -O1+.
//
FunctionPass *createTargetRegisterAllocator(bool /*Optimized*/) override {
return createGreedyRegisterAllocator();
FunctionPass *createTargetRegisterAllocator(bool Optimized) override {
return Optimized ? createGreedyRegisterAllocator()
: createFastRegisterAllocator();
}
};
@ -101,6 +105,24 @@ TargetPassConfig *W65816TargetMachine::createPassConfig(PassManagerBase &PM) {
return new W65816PassConfig(*this, PM);
}
void W65816PassConfig::addMachineSSAOptimization() {
// MachineCSE incorrectly eliminates "redundant" CMP instructions when
// it sees an earlier identical CMP elsewhere in the function — the
// P (status) flag is considered "available", but on this target P is
// clobbered by every intervening LDA/STA/ADC, so the surviving Bxx
// ends up dispatching on stale flags. We don't model `Uses=[P]` on
// Bxx because doing so causes regalloc/layout shifts that uncovered
// a different latent bug in vprintf. Disabling the pass entirely
// is the lower-cost workaround until the Bxx-Uses=[P] regression is
// root-caused. Caught by `printf("%d", n)` returning 0.
//
// Other SSA opts (early-tailduplication, opt-phis, dead-mi-elim,
// licm, machine-sink, peephole-opt, etc.) still run by chaining
// through the default impl — we just skip MachineCSE.
disablePass(&MachineCSELegacyID);
TargetPassConfig::addMachineSSAOptimization();
}
void W65816PassConfig::addPreRegAlloc() {
addPass(createW65816ABridgeViaX());
addPass(createW65816TiedDefSpill());
@ -125,7 +147,11 @@ void W65816PassConfig::addPreEmitPass() {
addPass(createW65816SpillToX());
// Rewrite negative-Y indirect-Y stack-rel ops. Must run BEFORE
// BranchExpand because the rewrite expands one instruction into
// several and shifts branch distances.
// several and shifts branch distances. The pass internally checks
// X-liveness and saves/restores X via DP $E0 when SpillToX has
// a value parked there; without that check, the rewrite's TAX
// would clobber spill-bridged values (caught by `addOff(p,i) {
// p[i-1] += p[i]; }` returning p[i-1] + &p[i-1] instead of +b).
addPass(createW65816NegYIndY());
// Branch expansion runs after that so the BRA introduced for long
// conditional branches gets seen by SepRepCleanup (which can

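Note on the MachineCSE workaround in `addMachineSSAOptimization` above: the stale-flag hazard can be illustrated with a toy model. This is a sketch under stated assumptions, not the backend's code. The `Cpu` struct and `second_branch` helper are hypothetical names, and only the Z flag is modelled, updated by LDA and CMP and tested by BEQ. The sketch shows that once an intervening load has rewritten the flags, dropping the second "identical" CMP makes the branch dispatch on whatever the last load left behind.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical toy CPU: models only the accumulator and the Z flag.
struct Cpu {
    uint16_t a = 0;
    bool z = false;
    void lda(uint16_t v) { a = v; z = (a == 0); }  // LDA sets Z from the loaded value
    void cmp(uint16_t v) { z = (a == v); }         // CMP sets Z from the comparison
    bool beq() const { return z; }                 // BEQ is taken when Z is set
};

// Runs: LDA n; CMP #5; LDA #0; LDA n; [CMP #5]; BEQ.
// keepSecondCmp == false mimics a CSE pass that deleted the second CMP
// because an identical one appeared earlier.
bool second_branch(uint16_t n, bool keepSecondCmp) {
    Cpu c;
    c.lda(n);
    c.cmp(5);   // first compare (consumed by an earlier branch, not shown)
    c.lda(0);   // intervening load overwrites Z
    c.lda(n);   // reload n: Z now means "n == 0", not "n == 5"
    if (keepSecondCmp)
        c.cmp(5);
    return c.beq();
}
```

With the second compare kept, the branch is taken exactly when `n == 5`. With it removed, `n == 5` falls through and `n == 0` takes the branch, which is the kind of wrong dispatch the comment attributes to stale flags.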
@ -118,6 +118,11 @@ bool W65816TiedDefSpill::runOnMachineFunction(MachineFunction &MF) {
// Only pre-RA: skip if vregs are already gone.
if (!MF.getRegInfo().getNumVirtRegs())
return false;
// At -O0/optnone, the spill+reload pattern this pass introduces
// doesn't get coalesced and ends up wasting frame space without
// helping greedy. Same skip rationale as WidenAcc16.
if (MF.getFunction().hasOptNone())
return false;
MachineRegisterInfo &MRI = MF.getRegInfo();
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();

@ -119,6 +119,13 @@ static bool allUsesAcceptWide(Register VReg,
bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
if (!MF.getRegInfo().getNumVirtRegs()) return false;
// At -O0 / optnone, register coalescing doesn't run, so the COPY we
// insert to bridge Acc16 → Wide16 doesn't get folded; instead it
// forces wide16 spills through DP-mapped slots that collide and
// produce miscompiles around modify-in-place ops (lda dp; inc a;
// sta dp; lda dp reads pre-inc value). The promotion is purely a
// performance optimization, so skip it for optnone functions.
if (MF.getFunction().hasOptNone()) return false;
MachineRegisterInfo &MRI = MF.getRegInfo();
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
const W65816InstrInfo *TII = STI.getInstrInfo();