Added STATUS.md
parent 6d7eae0356
commit 91ac5476a5
20 changed files with 1192 additions and 107 deletions
STATUS.md: normal file, 144 lines
|
|
@@ -0,0 +1,144 @@
|
||||||
|
# llvm816 — Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.

## What works

End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).

**Language coverage at -O2 (no extra flags):**

- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
  (signed and unsigned). Carry-chained multi-word ops via ADC/SBC
  pseudos plus ASLA16 / shift libcalls.
- Comparisons and width conversions (sext, zext, trunc) for all
  the above sizes.
- Pointer arithmetic, array indexing, struct field access, struct
  return-by-value (up to 8 bytes: Pair, Vec4, double).
- Bitfields, switch statements (verified up to ~12 cases + default),
  function pointers, function-pointer tables, indirect calls via
  `__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
  insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator); linked-list
  reverse with `cons` works.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
  roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all return correct bit patterns;
  3-iter Newton sqrt converges. Long-running iterations may hit MAME's
  1-second sim-time budget (test config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
  arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
  ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.

**Toolchain:**

- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
  text/rodata/bss, emits a flat binary the IIgs ROM can load.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
  native object format) for round-tripping with classic dev tools.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
  libgcc into linkable objects.
- `scripts/smokeTest.sh` runs ~80 end-to-end checks (scalar ops,
  control flow, calling conventions, MAME execution, regressions).
  All checks currently pass.

**ABI:**

- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
  on the system stack with PHA. Caller deallocates via
  `tsc;clc;adc #N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
  the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
  for the +1 skew vs LLVM's full-descending model.
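
To make the i64 return convention concrete, here is a minimal host-side C sketch that splits a 64-bit value into the four 16-bit pieces named in the list. The word order (A lowest, then X, then Y, then the direct-page word at $F0) is an assumption inferred from the list, and `I64Return`, `splitI64Return`, and `joinI64Return` are hypothetical names used only for illustration; they are not part of the toolchain.

```c
#include <stdint.h>

/* Hypothetical model of the i64 return convention as four 16-bit words.
   Assumption: A holds bits 0..15, X bits 16..31, Y bits 32..47, and the
   direct-page word at $F0..$F1 holds bits 48..63. */
typedef struct {
    uint16_t a;
    uint16_t x;
    uint16_t y;
    uint16_t dpF0;
} I64Return;

/* Split a 64-bit value into the four return words. */
static I64Return splitI64Return(uint64_t v) {
    I64Return r;
    r.a    = (uint16_t)(v & 0xFFFFu);
    r.x    = (uint16_t)((v >> 16) & 0xFFFFu);
    r.y    = (uint16_t)((v >> 32) & 0xFFFFu);
    r.dpF0 = (uint16_t)((v >> 48) & 0xFFFFu);
    return r;
}

/* Reassemble the 64-bit value the caller would see. */
static uint64_t joinI64Return(I64Return r) {
    return (uint64_t)r.a
         | ((uint64_t)r.x << 16)
         | ((uint64_t)r.y << 32)
         | ((uint64_t)r.dpF0 << 48);
}
```

Under that assumed ordering, a bug that clobbers Y on the way out of a call would corrupt only bits 32..47 of the result while leaving the other three words intact.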

## In flight (build-system level)

- **DWARF sidecar emission in link816** (#51): The link step should
  produce a separate sidecar file with line-number / variable-location
  info that an IDE or post-mortem dumper can consume. Skeleton not yet
  written; deferred until other correctness work is done.

## Known issues / workarounds

- **Greedy register allocator mis-orders spills** in two patterns
  (#69, #70):
  1. Functions where both `$a` and `$x` are live-in (i64-first-arg
     with a stack-output pointer, e.g. `udivmod(i64, i64, ptr)`).
     The TAX bridging `$x` to A clobbers `$a`'s value before the
     second STA can save it.
  2. Iterative quicksort with `if/else` recursion choice: complex
     live-ranges across two `swap()` calls produce wrong arg values.

  Both reproduce only at `-O1`/`-O2` with greedy. Workaround:
  `-mllvm -regalloc=fast` for the affected translation unit.
  `softDouble.c` already requires this flag for `__muldf3` (build.sh
  applies it automatically).

  Real fix is a pre-RA pass that pre-spills critical pointer
  arguments to memory, or a targeted fix in greedy's spill-ordering
  heuristic. Material work; deferred.

- **(d,s),y / (sr,s),y addressing wraps the bank** when Y holds a
  negative offset, which the CPU treats as a large unsigned 16-bit
  value. Worked around by `W65816NegYIndY`, which rewrites the
  affected ops to `TAX ; LDA/STA $0000,X`. This stays correct for
  negative offsets like `arr[i-1]`.

- **(d,s),y for stack-local pointer dereferences uses DBR**, so
  user code that switches DBR (e.g. `pha;plb` to bank 2 to reach
  IIgs hardware) must not call into a function that takes the
  address of one of its locals: the callee's `*p = v` will write
  to the wrong bank. Documented; no compiler-side mitigation
  beyond the existing DPF0 fake-physreg routing for the i64-return
  high half.

## What's still needed for a "ship-ready" toolchain

- **Greedy regalloc spill-ordering fix**: see above. Removes the
  need for the per-file `-regalloc=fast` workaround on
  `softDouble.c` and unblocks pattern-rich code that currently
  must be compiled at `-O0` for correctness.

- **Round-to-nearest-even in `__divdf3`**: currently
  truncate-toward-zero, which differs from gcc by ±1 ULP in
  several test cases. Acceptable today (Newton iterations still
  converge); revisit when an exact-match test suite lands.

- **DWARF sidecar** (#51) for source-level debugging.

- **More of the C standard library**: `<math.h>` transcendental
  functions (sin, cos, exp, log, pow), `<string.h>` beyond what's
  hand-coded, `<stdio.h>` file I/O (`fopen`, `fread`, `fwrite`,
  `fseek`).

- **C++ runtime support**: vtable layout for multiple inheritance,
  RTTI, exceptions (or a documented `-fno-exceptions` requirement).

- **REP/SEP scheduling pass** (design doc §3.3): the current
  prologue picks one M-mode for the whole function based on
  whether any 8-bit accumulator value is used. A per-region
  scheduler would reduce the SEP/REP wrap overhead on i8 stores.

- **Toolbox / IIgs system call bindings**: header files declaring
  the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`,
  `DrawString`, …) with the right inline-asm dispatch glue.

- **Real-world program coverage**: the smoke tests are
  microbenchmarks. A few known-good Apple IIgs C programs (e.g.
  a textfile pager, a small game) compiled and run end-to-end
  would catch issues no synthetic test currently exercises.

- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
  says the goal is to "match or exceed" Calypsi. We have neither
  baseline numbers nor a comparison harness yet.
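
For the round-to-nearest-even item in the list above, one possible shape of the missing step is sketched here. It is not the project's implementation: `roundQuotientNearestEven` is a hypothetical helper, and it assumes the division loop hands over a truncated 53-bit quotient `q` and its remainder `r` with `0 <= r < mb`. One extra quotient bit (the guard bit) is generated, and any nonzero remainder after that acts as the sticky bit.

```c
#include <stdint.h>

/* Hypothetical rounding step for a truncated 53-bit quotient q with
   remainder r (0 <= r < mb). Rounds to nearest, ties to even.
   *expBump is set to 1 when rounding carries out of bit 52, in which
   case the caller would need to increment the exponent. */
static uint64_t roundQuotientNearestEven(uint64_t q, uint64_t r,
                                         uint64_t mb, int *expBump) {
    *expBump = 0;
    r <<= 1;                        /* generate one extra quotient bit */
    int guard = 0;
    if (r >= mb) {
        r -= mb;
        guard = 1;
    }
    int sticky = (r != 0);          /* anything left below the guard bit */
    if (guard && (sticky || (q & 1))) {
        q++;                        /* round up; exact ties go to even */
        if (q >> 53) {              /* carried out of the 53-bit mantissa */
            q >>= 1;
            *expBump = 1;
        }
    }
    return q;
}
```

As a worked case, 5.0 / 3.0 truncates to mantissa `0x1AAAAAAAAAAAAA`; the guard and sticky bits are both set, so the sketch rounds up to `0x1AAAAAAAAAAAAB`.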
|
||||||
|
|
@@ -23,8 +23,10 @@ asm() {
|
||||||
cc() {
|
cc() {
|
||||||
local c="$1"
|
local c="$1"
|
||||||
local o="$OUT/$(basename "${c%.c}").o"
|
local o="$OUT/$(basename "${c%.c}").o"
|
||||||
|
local extra=("${@:2}")
|
||||||
echo " CC $(basename "$c")"
|
echo " CC $(basename "$c")"
|
||||||
"$CLANG" -target w65816 -O2 -ffunction-sections \
|
"$CLANG" -target w65816 -O2 -ffunction-sections \
|
||||||
|
"${extra[@]}" \
|
||||||
-I"$PROJECT_ROOT/runtime/include" \
|
-I"$PROJECT_ROOT/runtime/include" \
|
||||||
-c "$c" -o "$o"
|
-c "$c" -o "$o"
|
||||||
}
|
}
|
||||||
|
|
@@ -33,6 +35,9 @@ asm "$SRC/crt0.s"
|
||||||
asm "$SRC/libgcc.s"
|
asm "$SRC/libgcc.s"
|
||||||
cc "$SRC/libc.c"
|
cc "$SRC/libc.c"
|
||||||
cc "$SRC/softFloat.c"
|
cc "$SRC/softFloat.c"
|
||||||
cc "$SRC/softDouble.c"
|
# softDouble.c needs -regalloc=fast: __muldf3's 64x64 -> 128 mul +
|
||||||
|
# inlined alignment shifts overflows the greedy allocator on the
|
||||||
|
# single-A target.
|
||||||
|
cc "$SRC/softDouble.c" -mllvm -regalloc=fast
|
||||||
|
|
||||||
echo "runtime built: $(ls -1 "$OUT"/*.o | wc -l) objects"
|
echo "runtime built: $(ls -1 "$OUT"/*.o | wc -l) objects"
|
||||||
|
|
|
||||||
|
|
@@ -673,19 +673,30 @@ __divmodsi_setup:
|
||||||
; setup; signed variants flip signs around it.
|
; setup; signed variants flip signs around it.
|
||||||
; --------------------------------------------------------------------
|
; --------------------------------------------------------------------
|
||||||
__divmoddi4_stash:
|
__divmoddi4_stash:
|
||||||
|
; Called via JSR from another libgcc helper that was itself
|
||||||
|
; called via JSL. Stack layout inside this routine:
|
||||||
|
; slot 1..2 = JSR return address (2 bytes, same-bank)
|
||||||
|
; slot 3..5 = JSL return address (3 bytes, long)
|
||||||
|
; slot 6..7 = first 16-bit stack arg (caller's first push)
|
||||||
|
; slot 8..9 = second
|
||||||
|
; ... etc.
|
||||||
|
; Earlier code read slots 4, 6, 8, 10, 12, 14 — which lands on
|
||||||
|
; the JSL ret address bytes, treating them as args. Caught by
|
||||||
|
; `u64mul(0x12, 0x12)` returning the result at $E2 (mid-low)
|
||||||
|
; instead of $E0 (lo) plus 0x678-shaped garbage at $E6.
|
||||||
sta 0xe0 ; a_lo_lo
|
sta 0xe0 ; a_lo_lo
|
||||||
stx 0xe2 ; a_lo_hi
|
stx 0xe2 ; a_lo_hi
|
||||||
lda 0x4, s
|
|
||||||
sta 0xe4 ; a_hi_lo
|
|
||||||
lda 0x6, s
|
lda 0x6, s
|
||||||
sta 0xe6 ; a_hi_hi
|
sta 0xe4 ; a_hi_lo
|
||||||
lda 0x8, s
|
lda 0x8, s
|
||||||
sta 0xe8 ; b_lo_lo
|
sta 0xe6 ; a_hi_hi
|
||||||
lda 0xa, s
|
lda 0xa, s
|
||||||
sta 0xea ; b_lo_hi
|
sta 0xe8 ; b_lo_lo
|
||||||
lda 0xc, s
|
lda 0xc, s
|
||||||
sta 0xec ; b_hi_lo
|
sta 0xea ; b_lo_hi
|
||||||
lda 0xe, s
|
lda 0xe, s
|
||||||
|
sta 0xec ; b_hi_lo
|
||||||
|
lda 0x10, s
|
||||||
sta 0xee ; b_hi_hi
|
sta 0xee ; b_hi_hi
|
||||||
rts
|
rts
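
The slot arithmetic in the stash comment can be checked with a small host-side C sketch. `stackArgOffset` is a hypothetical helper, not code from libgcc.s; it assumes the empty-descending stack described in STATUS.md, where S points at the next free byte and the most recently pushed byte is at offset 1.

```c
#include <stdint.h>

enum {
    JSR_RETURN_BYTES = 2,  /* same-bank return address pushed by JSR */
    JSL_RETURN_BYTES = 3   /* long return address pushed by JSL */
};

/* Stack-relative offset (the N in `lda N, s`) of the k-th 16-bit stack
   argument, with k = 0 for the word at the lowest offset. Offset 1 is
   the most recently pushed byte, so the arguments start right after
   all return-address bytes. */
static unsigned stackArgOffset(unsigned k, unsigned returnBytes) {
    return 1u + returnBytes + 2u * k;
}
```

With both return addresses on the stack, `stackArgOffset(0, JSR_RETURN_BYTES + JSL_RETURN_BYTES)` gives 6 and the sixth word lands at 16 (0x10), matching the corrected loads. Passing only `JSL_RETURN_BYTES` reproduces the old starting offset of 4.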
|
||||||
|
|
||||||
|
|
@@ -805,19 +816,28 @@ __muldi3:
|
||||||
; Loop 64 times on a's bits.
|
; Loop 64 times on a's bits.
|
||||||
ldy #0x40
|
ldy #0x40
|
||||||
.Lmuldi_loop:
|
.Lmuldi_loop:
|
||||||
; Test bit 0 of a (= LSR a; C = old bit 0).
|
; Right-shift the 64-bit `a` by 1. $E0=lo..$E6=hi (matches the
|
||||||
lda 0xe0
|
; stash + __retdi convention). Must shift HI first (LSR loses
|
||||||
|
; bit 63 of $E6) so each ROR carries the previous half's bit 0
|
||||||
|
; INTO the top of the next-LOWER half — that's the actual
|
||||||
|
; right-shift direction in a $E0=lo layout. After the chain,
|
||||||
|
; C = orig $E0_b0 = bit 0 of the 64-bit value, which drops out
|
||||||
|
; and is what we want to BCC on. The earlier code shifted lo
|
||||||
|
; first which ran the shift in the WRONG direction (lo → hi)
|
||||||
|
; and tested $E6_b0 (bit 48) instead of bit 0 — every multiply
|
||||||
|
; involving bits 16+ came back garbage.
|
||||||
|
lda 0xe6
|
||||||
lsr a
|
lsr a
|
||||||
sta 0xe0
|
sta 0xe6
|
||||||
lda 0xe2
|
|
||||||
ror a
|
|
||||||
sta 0xe2
|
|
||||||
lda 0xe4
|
lda 0xe4
|
||||||
ror a
|
ror a
|
||||||
sta 0xe4
|
sta 0xe4
|
||||||
lda 0xe6
|
lda 0xe2
|
||||||
ror a
|
ror a
|
||||||
sta 0xe6
|
sta 0xe2
|
||||||
|
lda 0xe0
|
||||||
|
ror a
|
||||||
|
sta 0xe0
|
||||||
bcc .Lmuldi_noadd
|
bcc .Lmuldi_noadd
|
||||||
; Add b ($E8..$EE) to product ($F2..$F8).
|
; Add b ($E8..$EE) to product ($F2..$F8).
|
||||||
clc
|
clc
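
The hi-first shift order described in the loop comment can be modeled on the host. This is a sketch under the comment's stated layout ($E0 = lowest word, $E6 = highest); `lsr16`, `ror16`, and `shiftRight64HiFirst` are hypothetical names that imitate the 16-bit LSR/ROR semantics, not code from libgcc.s.

```c
#include <stdint.h>

/* 16-bit LSR: shift right, 0 enters bit 15, old bit 0 becomes carry. */
static uint16_t lsr16(uint16_t v, unsigned *carry) {
    *carry = v & 1u;
    return (uint16_t)(v >> 1);
}

/* 16-bit ROR: shift right, old carry enters bit 15, old bit 0 becomes carry. */
static uint16_t ror16(uint16_t v, unsigned *carry) {
    unsigned out = v & 1u;
    uint16_t r = (uint16_t)((v >> 1) | (*carry << 15));
    *carry = out;
    return r;
}

/* w[0] models $E0 (lowest word), w[3] models $E6 (highest word).
   Shifts the 64-bit value right by one and returns the final carry,
   which is the original bit 0 of the whole value. */
static unsigned shiftRight64HiFirst(uint16_t w[4]) {
    unsigned carry = 0;
    w[3] = lsr16(w[3], &carry);
    w[2] = ror16(w[2], &carry);
    w[1] = ror16(w[1], &carry);
    w[0] = ror16(w[0], &carry);
    return carry;
}
```

Each ROR passes the bit that fell out of the higher word into the top of the next lower word, so the carry left after the last ROR is the bit the multiply loop needs to test.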
|
||||||
|
|
|
||||||
|
|
@@ -111,16 +111,25 @@ u64 __negdf2(u64 a) {
|
||||||
return a ^ DSIGN_BIT;
|
return a ^ DSIGN_BIT;
|
||||||
}
|
}
|
||||||
|
|
||||||
u64 __muldf3(u64 a, u64 b) {
|
// Carry the high 64 bits of a 128-bit product in `hi` and the low 64
|
||||||
u64 sa, sb, ma, mb;
|
// in `lo`. Carry bit indicates whether the leading bit was at 105
|
||||||
s16 ea, eb;
|
// (caller must increment exponent).
|
||||||
u16 ca = dclass(a, &sa, &ea, &ma);
|
typedef struct {
|
||||||
u16 cb = dclass(b, &sb, &eb, &mb);
|
u64 mantissa;
|
||||||
u64 sr = sa ^ sb;
|
u16 carry;
|
||||||
if (ca == 0 || cb == 0) return sr;
|
} MantCarryT;
|
||||||
// Truncated 64*64 → high-64 product via 32*32 partials. We only
|
|
||||||
// need the upper bits of the 106-bit product because the mantissas
|
// 64x64 -> 128-bit product, returned as a packed u64 pair. Returns
|
||||||
// are 53 bits each.
|
// the high 64 bits in the high u64 of the .mantissa lane is not
|
||||||
|
// possible — instead, we shift in-line and return the aligned mantissa
|
||||||
|
// directly. Splitting keeps register pressure low enough for greedy
|
||||||
|
// regalloc on the single-A W65816.
|
||||||
|
//
|
||||||
|
// Inlinable on purpose: passing a pointer to a stack local across a
|
||||||
|
// noinline boundary lowers to `sta (d,s),y` which uses DBR-relative
|
||||||
|
// addressing — broken under DBR != 0 (e.g. after a bank switch).
|
||||||
|
// Keeping these inline keeps the stores within the caller's frame.
|
||||||
|
static inline u64 mulhi64Aligned(u64 ma, u64 mb, u16 *out_carry) {
|
||||||
u32 alo = (u32)ma;
|
u32 alo = (u32)ma;
|
||||||
u32 ahi = (u32)(ma >> 32);
|
u32 ahi = (u32)(ma >> 32);
|
||||||
u32 blo = (u32)mb;
|
u32 blo = (u32)mb;
|
||||||
|
|
@@ -131,16 +140,26 @@ u64 __muldf3(u64 a, u64 b) {
|
||||||
u64 hh = (u64)ahi * (u64)bhi;
|
u64 hh = (u64)ahi * (u64)bhi;
|
||||||
u64 mid = lh + hl + (ll >> 32);
|
u64 mid = lh + hl + (ll >> 32);
|
||||||
u64 prod_hi = hh + (mid >> 32);
|
u64 prod_hi = hh + (mid >> 32);
|
||||||
s16 er = ea + eb;
|
u64 prod_lo = (ll & 0xFFFFFFFFULL) | ((mid & 0xFFFFFFFFULL) << 32);
|
||||||
while (prod_hi & ~(DMANT_LEAD | DMANT_MASK)) {
|
if (prod_hi & (1ULL << 41)) {
|
||||||
prod_hi >>= 1;
|
*out_carry = 1;
|
||||||
er++;
|
return (prod_hi << 11) | (prod_lo >> 53);
|
||||||
}
|
}
|
||||||
while ((prod_hi & DMANT_LEAD) == 0 && prod_hi != 0) {
|
*out_carry = 0;
|
||||||
prod_hi <<= 1;
|
return (prod_hi << 12) | (prod_lo >> 52);
|
||||||
er--;
|
}
|
||||||
}
|
|
||||||
return dpack(sr, er, prod_hi);
|
u64 __muldf3(u64 a, u64 b) {
|
||||||
|
u64 sa, sb, ma, mb;
|
||||||
|
s16 ea, eb;
|
||||||
|
u16 ca = dclass(a, &sa, &ea, &ma);
|
||||||
|
u16 cb = dclass(b, &sb, &eb, &mb);
|
||||||
|
u64 sr = sa ^ sb;
|
||||||
|
if (ca == 0 || cb == 0) return sr;
|
||||||
|
u16 carry;
|
||||||
|
u64 mr = mulhi64Aligned(ma, mb, &carry);
|
||||||
|
s16 er = ea + eb + (s16)carry;
|
||||||
|
return dpack(sr, er, mr);
|
||||||
}
|
}
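
The partial-product scheme and the alignment step can be exercised on a host with the following standalone sketch. It is not the code from softDouble.c: `U128Parts`, `mul64x64`, and `alignMantissaProduct` are hypothetical names, and the multiply here adds extra carry handling so it is also correct for full 64-bit inputs, not only for 53-bit mantissas.

```c
#include <stdint.h>

typedef struct {
    uint64_t hi;  /* bits 64..127 of the product */
    uint64_t lo;  /* bits 0..63 of the product */
} U128Parts;

/* 64x64 -> 128-bit multiply built from four 32x32 -> 64 partials. */
static U128Parts mul64x64(uint64_t a, uint64_t b) {
    uint64_t alo = (uint32_t)a, ahi = a >> 32;
    uint64_t blo = (uint32_t)b, bhi = b >> 32;
    uint64_t ll = alo * blo;
    uint64_t lh = alo * bhi;
    uint64_t hl = ahi * blo;
    uint64_t hh = ahi * bhi;
    /* Sum the three contributions to bits 32..63; cannot overflow 64 bits. */
    uint64_t mid = (ll >> 32) + (lh & 0xFFFFFFFFULL) + (hl & 0xFFFFFFFFULL);
    U128Parts p;
    p.lo = (mid << 32) | (ll & 0xFFFFFFFFULL);
    p.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return p;
}

/* Align the product of two normalized 53-bit mantissas (bit 52 set) so
   that the leading 1 lands at bit 52 again. *carry is 1 when the
   product's leading bit was at bit 105 (the exponent needs +1). */
static uint64_t alignMantissaProduct(U128Parts p, unsigned *carry) {
    if (p.hi & (1ULL << 41)) {          /* product bit 105 = hi bit 41 */
        *carry = 1;
        return (p.hi << 11) | (p.lo >> 53);
    }
    *carry = 0;                         /* leading bit at product bit 104 */
    return (p.hi << 12) | (p.lo >> 52);
}
```

For example, 1.5 * 1.5 uses mantissa `3ULL << 51` on both sides; the product's leading bit is at position 105, so the aligned result is `9ULL << 49` (1.125 with the leading bit at 52) with carry 1, which corresponds to 2.25.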
|
||||||
|
|
||||||
u64 __divdf3(u64 a, u64 b) {
|
u64 __divdf3(u64 a, u64 b) {
|
||||||
|
|
@@ -151,26 +170,29 @@ u64 __divdf3(u64 a, u64 b) {
|
||||||
u64 sr = sa ^ sb;
|
u64 sr = sa ^ sb;
|
||||||
if (ca == 0) return sr;
|
if (ca == 0) return sr;
|
||||||
if (cb == 0) return sr | DEXP_MASK; // div-by-zero → inf
|
if (cb == 0) return sr | DEXP_MASK; // div-by-zero → inf
|
||||||
// Long division: shift a left by 11 to make room for quotient bits.
|
// Long division: handle the leading quotient bit explicitly (since
|
||||||
u64 q = 0;
|
// we need to "consume" the dividend's leading 1 by subtracting),
|
||||||
u64 r = ma;
|
// then generate 52 more fractional bits by shifting r left and
|
||||||
for (int i = 0; i < 53; i++) {
|
// testing. The previous shift-and-test-only loop over-counted
|
||||||
|
// when r == mb after subtraction (e.g. 2.0/1.0 returned ~4.0).
|
||||||
|
s16 er = ea - eb;
|
||||||
|
// Normalize so the dividend is in [mb, 2*mb). This ensures the
|
||||||
|
// leading quotient bit will land at position 52 below.
|
||||||
|
if (ma < mb) {
|
||||||
|
ma <<= 1;
|
||||||
|
er--;
|
||||||
|
}
|
||||||
|
// Handle the leading quotient bit explicitly.
|
||||||
|
u64 q = DMANT_LEAD;
|
||||||
|
u64 r = ma - mb;
|
||||||
|
// Compute 52 more fractional bits via standard shift-test-subtract.
|
||||||
|
for (int i = 51; i >= 0; i--) {
|
||||||
r <<= 1;
|
r <<= 1;
|
||||||
q <<= 1;
|
|
||||||
if (r >= mb) {
|
if (r >= mb) {
|
||||||
r -= mb;
|
r -= mb;
|
||||||
q |= 1;
|
q |= (1ULL << i);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
s16 er = ea - eb;
|
|
||||||
while (q & ~(DMANT_LEAD | DMANT_MASK)) {
|
|
||||||
q >>= 1;
|
|
||||||
er++;
|
|
||||||
}
|
|
||||||
while ((q & DMANT_LEAD) == 0 && q != 0) {
|
|
||||||
q <<= 1;
|
|
||||||
er--;
|
|
||||||
}
|
|
||||||
return dpack(sr, er, q);
|
return dpack(sr, er, q);
|
||||||
}
|
}
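
The revised long division can be tried in isolation with the host-side sketch here. It mirrors the structure of the new loop but is a standalone illustration: `divMantissaTrunc` and `MANT_LEAD` are hypothetical names, both inputs are assumed to be normalized 53-bit mantissas (bit 52 set), and the result is truncated rather than rounded.

```c
#include <stdint.h>

#define MANT_LEAD (1ULL << 52)  /* leading 1 of a normalized mantissa */

/* Truncating division of two normalized 53-bit mantissas.
   Returns a 53-bit quotient with its leading bit at position 52.
   *expAdjust is -1 when the dividend had to be doubled first. */
static uint64_t divMantissaTrunc(uint64_t ma, uint64_t mb, int *expAdjust) {
    *expAdjust = 0;
    if (ma < mb) {              /* bring the dividend into [mb, 2*mb) */
        ma <<= 1;
        *expAdjust = -1;
    }
    uint64_t q = MANT_LEAD;     /* leading quotient bit is always 1 here */
    uint64_t r = ma - mb;       /* consume it by subtracting once */
    for (int i = 51; i >= 0; i--) {
        r <<= 1;                /* shift-test-subtract for each lower bit */
        if (r >= mb) {
            r -= mb;
            q |= (1ULL << i);
        }
    }
    return q;
}
```

For equal mantissas, as in 2.0 / 1.0, the remainder is zero after the first subtraction and the quotient stays at `MANT_LEAD`, so the result's magnitude is set by the exponent difference alone.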
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@@ -1104,7 +1104,10 @@ int toInt(double x) { return (int)x; }
|
||||||
double fromInt(int n) { return (double)n; }
|
double fromInt(int n) { return (double)n; }
|
||||||
EOF
|
EOF
|
||||||
"$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblFile" -o "$oDblFile"
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblFile" -o "$oDblFile"
|
||||||
"$CLANG" --target=w65816 -O2 -ffunction-sections \
|
# softDouble.c uses -regalloc=fast because __muldf3's 64x64 -> 128
|
||||||
|
# multiply with the inlined alignment shifts overflows the greedy
|
||||||
|
# allocator's spill heuristics on the single-A target.
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
|
||||||
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdFile"
|
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdFile"
|
||||||
"$PROJECT_ROOT/tools/link816" -o "$binDblFile" \
|
"$PROJECT_ROOT/tools/link816" -o "$binDblFile" \
|
||||||
--text-base 0x8000 --map "$mapDblFile" \
|
--text-base 0x8000 --map "$mapDblFile" \
|
||||||
|
|
@@ -1281,7 +1284,7 @@ int main(void) {
|
||||||
}
|
}
|
||||||
EOF
|
EOF
|
||||||
"$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblMame" -o "$oDblMame"
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblMame" -o "$oDblMame"
|
||||||
"$CLANG" --target=w65816 -O2 -ffunction-sections \
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
|
||||||
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdMame"
|
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdMame"
|
||||||
"$PROJECT_ROOT/tools/link816" -o "$binDblMame" \
|
"$PROJECT_ROOT/tools/link816" -o "$binDblMame" \
|
||||||
--text-base 0x1000 \
|
--text-base 0x1000 \
|
||||||
|
|
@@ -1402,7 +1405,7 @@ EOF
|
||||||
-c "$PROJECT_ROOT/runtime/src/libc.c" -o "$oLibcF"
|
-c "$PROJECT_ROOT/runtime/src/libc.c" -o "$oLibcF"
|
||||||
"$CLANG" --target=w65816 -O2 -ffunction-sections \
|
"$CLANG" --target=w65816 -O2 -ffunction-sections \
|
||||||
-c "$PROJECT_ROOT/runtime/src/softFloat.c" -o "$oSfF"
|
-c "$PROJECT_ROOT/runtime/src/softFloat.c" -o "$oSfF"
|
||||||
"$CLANG" --target=w65816 -O2 -ffunction-sections \
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -mllvm -regalloc=fast \
|
||||||
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdF"
|
-c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdF"
|
||||||
oCrt0F="$(mktemp --suffix=.o)"
|
oCrt0F="$(mktemp --suffix=.o)"
|
||||||
"$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc" -arch=w65816 \
|
"$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc" -arch=w65816 \
|
||||||
|
|
@@ -1708,9 +1711,10 @@ EOF
|
||||||
fi
|
fi
|
||||||
rm -f "$cP2File" "$oP2File" "$binP2File"
|
rm -f "$cP2File" "$oP2File" "$binP2File"
|
||||||
|
|
||||||
# Bubble sort with the loop form that compiles correctly
|
# Canonical bubble sort. Both this form (`i < n-1; j < n-i-1`)
|
||||||
# (i=1..n; inner j+1<n-i+1). The other form `i<n-1; j<n-i-1`
|
# and the alternate form work after the BranchExpand bridge
|
||||||
# has an outstanding compiler bug (#65); use this canary form.
|
# fix. Catches a regression in either BranchExpand or
|
||||||
|
# TiedDefSpill if the conditional flow gets miscompiled.
|
||||||
log "check: MAME runs bubble sort [4,1,3,2] → [1,2,3,4]"
|
log "check: MAME runs bubble sort [4,1,3,2] → [1,2,3,4]"
|
||||||
cBsFile="$(mktemp --suffix=.c)"
|
cBsFile="$(mktemp --suffix=.c)"
|
||||||
oBsFile="$(mktemp --suffix=.o)"
|
oBsFile="$(mktemp --suffix=.o)"
|
||||||
|
|
@@ -1721,8 +1725,8 @@ __attribute__((noinline)) void switchToBank2(void) {
|
||||||
}
|
}
|
||||||
unsigned short data[4] = { 4, 1, 3, 2 };
|
unsigned short data[4] = { 4, 1, 3, 2 };
|
||||||
__attribute__((noinline)) void bubbleSort(unsigned short *arr, unsigned short n) {
|
__attribute__((noinline)) void bubbleSort(unsigned short *arr, unsigned short n) {
|
||||||
for (unsigned short i = 1; i < n; i++) {
|
for (unsigned short i = 0; i < n - 1; i++) {
|
||||||
for (unsigned short j = 0; j + 1 < n - i + 1; j++) {
|
for (unsigned short j = 0; j < n - i - 1; j++) {
|
||||||
if (arr[j] > arr[j+1]) {
|
if (arr[j] > arr[j+1]) {
|
||||||
unsigned short t = arr[j];
|
unsigned short t = arr[j];
|
||||||
arr[j] = arr[j+1];
|
arr[j] = arr[j+1];
|
||||||
|
|
@@ -1752,8 +1756,507 @@ EOF
|
||||||
0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
|
0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
|
||||||
die "MAME: bubbleSort([4,1,3,2]) != [1,2,3,4]"
|
die "MAME: bubbleSort([4,1,3,2]) != [1,2,3,4]"
|
||||||
fi
|
fi
|
||||||
rm -f "$cBsFile" "$oBsFile" "$binBsFile" \
|
rm -f "$cBsFile" "$oBsFile" "$binBsFile"
|
||||||
"$oLibcF" "$oSfF" "$oSdF" "$oCrt0F"
|
|
||||||
|
# printf("ABCDE") returns 5. Canary for the BranchExpand
|
||||||
|
# leftover-BRA-Skip bug: without removing the original BRA
|
||||||
|
# after rewriting Bxx to INV_Bxx, the inserted Bridge MBB
|
||||||
|
# becomes unreachable and the conditional flow is lost. Also
|
||||||
|
# exercises vprintf's main loop end-to-end (no varargs).
|
||||||
|
log "check: MAME runs printf('ABCDE') → 5 (BranchExpand bridge regression)"
|
||||||
|
cPfFile="$(mktemp --suffix=.c)"
|
||||||
|
oPfFile="$(mktemp --suffix=.o)"
|
||||||
|
binPfFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cPfFile" <<'EOF'
|
||||||
|
#include <stdio.h>
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
int main(void) {
|
||||||
|
int r = printf("ABCDE");
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = (unsigned short)r;
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections \
|
||||||
|
-I"$PROJECT_ROOT/runtime/include" -c "$cPfFile" -o "$oPfFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binPfFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPfFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
|
||||||
|
"$binPfFile" 0x025000 0005 >/dev/null 2>&1; then
|
||||||
|
die "MAME: printf('ABCDE') != 5 (BranchExpand bridge regression)"
|
||||||
|
fi
|
||||||
|
rm -f "$cPfFile" "$oPfFile" "$binPfFile"
|
||||||
|
|
||||||
|
# parse('BCDE') with switch-on-spec — used to fail to link with
|
||||||
|
# PCREL8-out-of-range because long unconditional BRA didn't
|
||||||
|
# auto-relax to BRL. W65816BranchExpand now force-promotes
|
||||||
|
# long BRA to BRL.
|
||||||
|
log "check: MAME runs nested-loop+multiply f(4) → 120 (regalloc + BRA-relax)"
|
||||||
|
cFnFile="$(mktemp --suffix=.c)"
|
||||||
|
oFnFile="$(mktemp --suffix=.o)"
|
||||||
|
binFnFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cFnFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) unsigned short f(unsigned short n) {
|
||||||
|
unsigned short s = 0;
|
||||||
|
for (unsigned short i = 0; i < n; i++)
|
||||||
|
for (unsigned short j = 0; j < n; j++)
|
||||||
|
s += i*n+j;
|
||||||
|
return s;
|
||||||
|
}
|
||||||
|
int main(void) {
|
||||||
|
unsigned short r = f(4);
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = r;
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cFnFile" -o "$oFnFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binFnFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFnFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
|
||||||
|
"$binFnFile" 0x025000 0078 >/dev/null 2>&1; then
|
||||||
|
die "MAME: f(4) != 120 (regalloc + BRA-relax regression)"
|
||||||
|
fi
|
||||||
|
rm -f "$cFnFile" "$oFnFile" "$binFnFile"
|
||||||
|
|
||||||
|
# u64add through a noinline boundary — exercises the
|
||||||
|
# ADJCALLSTACKUP teardown's STA $E0 / LDA $E0 path that
|
||||||
|
# preserves Y across the SP-restore. The earlier PLY*N/2
|
||||||
|
# implementation clobbered Y, so any i64 return came back
|
||||||
|
# with the last popped arg in Y instead of the sum's mid-high.
|
||||||
|
# Recursive u64 factorial — exercises __muldi3 + i64 ABI through
|
||||||
|
# a recursive noinline boundary. 20! = 0x21c3_677c_82b4_0000.
|
||||||
|
# Used to come back as garbage because __divmoddi4_stash read
|
||||||
|
# caller args from slot 4 when it was actually JSR-called from
|
||||||
|
# __muldi3 (so slot 4 was the JSL ret address byte, not a_mh).
|
||||||
|
# dadd through a noinline boundary — exercises __adddf3 + the
|
||||||
|
# full i64-return ABI through a real call. The earlier soft-
|
||||||
|
# double smoke test ran `c = 1.5 + 2.5` inline, which clang
|
||||||
|
# constant-folds to a literal 0x4010... bit pattern — never
|
||||||
|
# actually executed __adddf3. This one calls a noinline
|
||||||
|
# `dadd` so the libcall and the i64 ABI run end-to-end.
|
||||||
|
# printf("%d", n) — used to crash MAME entirely because MachineCSE
|
||||||
|
# eliminated the `if (isLong)` re-test of *fmt as a "redundant"
|
||||||
|
# CMP (it had matched an earlier identical CMP), and the
|
||||||
|
# surviving BNE then read whatever leftover P-flag state happened
|
||||||
|
# to be in P from the last spec-dispatch CMP. Backend now
|
||||||
|
# disables MachineCSE entirely.
|
||||||
|
log "check: MAME runs printf('%%d %%d', 42, 99) chain (MachineCSE disable)"
|
||||||
|
cPdFile="$(mktemp --suffix=.c)"
|
||||||
|
oPdFile="$(mktemp --suffix=.o)"
|
||||||
|
binPdFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cPdFile" <<'EOF'
|
||||||
|
#include <stdio.h>
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) int give42(void) { return 42; }
|
||||||
|
int main(void) {
|
||||||
|
// vprintf returns the increment count: 1 per format spec, 1 per
|
||||||
|
// non-spec char. "Hi %d ok\n" → H,i,' ',%d,' ',o,k,'\n' = 8.
|
||||||
|
int n = printf("Hi %d ok\n", give42());
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = (unsigned short)n;
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections \
|
||||||
|
-I"$PROJECT_ROOT/runtime/include" -c \
|
||||||
|
"$cPdFile" -o "$oPdFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binPdFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oPdFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binPdFile" 0x025000 0008 \
|
||||||
|
>/dev/null 2>&1; then
|
||||||
|
die "MAME: printf('Hi %d ok\\n', 42) != 8 (vprintf isLong / MachineCSE)"
|
||||||
|
fi
|
||||||
|
rm -f "$cPdFile" "$oPdFile" "$binPdFile"
|
||||||
|
|
||||||
|
log "check: MAME runs noinline dadd(1.5,2.5) → 4.0 (__adddf3 + i64 ABI)"
|
||||||
|
cDdFile="$(mktemp --suffix=.c)"
|
||||||
|
oDdFile="$(mktemp --suffix=.o)"
|
||||||
|
binDdFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cDdFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) double dadd(double a, double b) { return a + b; }
|
||||||
|
int main(void) {
|
||||||
|
union { double d; unsigned short w[4]; } u;
|
||||||
|
u.d = dadd(1.5, 2.5);
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = u.w[0];
|
||||||
|
*(volatile unsigned short *)0x5002 = u.w[1];
|
||||||
|
*(volatile unsigned short *)0x5004 = u.w[2];
|
||||||
|
*(volatile unsigned short *)0x5006 = u.w[3];
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cDdFile" -o "$oDdFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binDdFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDdFile" --check \
|
||||||
|
0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4010 \
|
||||||
|
>/dev/null 2>&1; then
|
||||||
|
die "MAME: noinline dadd(1.5,2.5) != 4.0 (i64-ABI through libcall)"
|
||||||
|
fi
|
||||||
|
rm -f "$cDdFile" "$oDdFile" "$binDdFile"
|
||||||
|
|
||||||
|
log "check: MAME runs fact_u64(20) → 0x21c3677c82b40000 (__muldi3 stash slots)"
|
||||||
|
cFkFile="$(mktemp --suffix=.c)"
|
||||||
|
oFkFile="$(mktemp --suffix=.o)"
|
||||||
|
binFkFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cFkFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) unsigned long long fact_u64(unsigned int n) {
|
||||||
|
if (n <= 1) return 1ULL;
|
||||||
|
return (unsigned long long)n * fact_u64(n - 1);
|
||||||
|
}
|
||||||
|
int main(void) {
|
||||||
|
unsigned long long r = fact_u64(20);
|
||||||
|
union { unsigned long long u; unsigned short w[4]; } u;
|
||||||
|
u.u = r;
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = u.w[0];
|
||||||
|
*(volatile unsigned short *)0x5002 = u.w[1];
|
||||||
|
*(volatile unsigned short *)0x5004 = u.w[2];
|
||||||
|
*(volatile unsigned short *)0x5006 = u.w[3];
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cFkFile" -o "$oFkFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binFkFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oFkFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binFkFile" --check \
|
||||||
|
0x025000=0000 0x025002=82b4 0x025004=677c 0x025006=21c3 \
|
||||||
|
>/dev/null 2>&1; then
|
||||||
|
die "MAME: fact_u64(20) returned wrong bits (__muldi3 / stash slots)"
|
||||||
|
fi
|
||||||
|
rm -f "$cFkFile" "$oFkFile" "$binFkFile"
|
||||||
|
|
||||||
|
log "check: MAME runs u64add(0x3FF8...,0x4004...) → 0x7FFC... (call-up Y-preserve)"
|
||||||
|
cU64File="$(mktemp --suffix=.c)"
|
||||||
|
oU64File="$(mktemp --suffix=.o)"
|
||||||
|
binU64File="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cU64File" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) unsigned long long u64add(unsigned long long a, unsigned long long b) {
|
||||||
|
return a + b;
|
||||||
|
}
|
||||||
|
int main(void) {
|
||||||
|
unsigned long long c = u64add(0x3FF8000000000000ULL, 0x4004000000000000ULL);
|
||||||
|
union { unsigned long long u; unsigned short w[4]; } u;
|
||||||
|
u.u = c;
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = u.w[0];
|
||||||
|
*(volatile unsigned short *)0x5002 = u.w[1];
|
||||||
|
*(volatile unsigned short *)0x5004 = u.w[2];
|
||||||
|
*(volatile unsigned short *)0x5006 = u.w[3];
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cU64File" -o "$oU64File"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binU64File" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oU64File" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binU64File" --check \
|
||||||
|
0x025000=0000 0x025002=0000 0x025004=0000 0x025006=7ffc \
|
||||||
|
>/dev/null 2>&1; then
|
||||||
|
die "MAME: u64add through noinline returned wrong middle halves (call-up Y-clobber)"
|
||||||
|
fi
|
||||||
|
rm -f "$cU64File" "$oU64File" "$binU64File"
|
||||||
|
|
||||||
|
log "check: MAME runs addOff(p,1) p[0]+=p[1] → 12 (StackSlotCleanup killed-Y respect)"
|
||||||
|
cAofFile="$(mktemp --suffix=.c)"
|
||||||
|
oAofFile="$(mktemp --suffix=.o)"
|
||||||
|
binAofFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cAofFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) short addOff(short *p, short i) {
|
||||||
|
short b = p[i];
|
||||||
|
p[i-1] = p[i-1] + b;
|
||||||
|
return p[i-1];
|
||||||
|
}
|
||||||
|
int main(void) {
|
||||||
|
short stk[2] = { 5, 7 };
|
||||||
|
short r = addOff(stk, 1);
|
||||||
|
short s0 = stk[0];
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = (unsigned short)r;
|
||||||
|
*(volatile unsigned short *)0x5002 = (unsigned short)s0;
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cAofFile" -o "$oAofFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binAofFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oAofFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
|
||||||
|
"$binAofFile" --check 0x025000=000c 0x025002=000c \
|
||||||
|
>/dev/null 2>&1; then
|
||||||
|
die "MAME: addOff p[i-1]+=p[i] returned wrong store (NegYIndY/X-clobber or LDY-erase)"
|
||||||
|
fi
|
||||||
|
rm -f "$cAofFile" "$oAofFile" "$binAofFile"
|
||||||
|
|
||||||
|
log "check: MAME runs sqr(10) → 100 (frame-less ADJCALLSTACKUP must emit PLY)"
|
||||||
|
cSqrFile="$(mktemp --suffix=.c)"
|
||||||
|
oSqrFile="$(mktemp --suffix=.o)"
|
||||||
|
binSqrFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cSqrFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) unsigned short sqr(unsigned short x) { return x * x; }
|
||||||
|
int main(void) {
|
||||||
|
unsigned short r = sqr(10);
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = r;
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cSqrFile" -o "$oSqrFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binSqrFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqrFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
|
||||||
|
"$binSqrFile" --check 0x025000=0064 >/dev/null 2>&1; then
|
||||||
|
die "MAME: sqr(10) crashed or != 100 (ADJCALLSTACKUP not emitting PLY for frame-less)"
|
||||||
|
fi
|
||||||
|
rm -f "$cSqrFile" "$oSqrFile" "$binSqrFile"
|
||||||
|
|
||||||
|
log "check: MAME runs ddiv(8.0,4.0) → 2.0 (__divdf3 algorithm fix)"
|
||||||
|
cDdvFile="$(mktemp --suffix=.c)"
|
||||||
|
oDdvFile="$(mktemp --suffix=.o)"
|
||||||
|
binDdvFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cDdvFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) double ddiv(double a, double b) { return a / b; }
|
||||||
|
int main(void) {
|
||||||
|
union { double d; unsigned short w[4]; } u;
|
||||||
|
u.d = ddiv(8.0, 4.0);
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = u.w[0];
|
||||||
|
*(volatile unsigned short *)0x5002 = u.w[1];
|
||||||
|
*(volatile unsigned short *)0x5004 = u.w[2];
|
||||||
|
*(volatile unsigned short *)0x5006 = u.w[3];
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cDdvFile" -o "$oDdvFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binDdvFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDdvFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
|
||||||
|
"$binDdvFile" --check 0x025000=0000 0x025002=0000 \
|
||||||
|
0x025004=0000 0x025006=4000 >/dev/null 2>&1; then
|
||||||
|
die "MAME: ddiv(8,4) != 2.0 (__divdf3 long-division bug)"
|
||||||
|
fi
|
||||||
|
rm -f "$cDdvFile" "$oDdvFile" "$binDdvFile"
|
||||||
|
|
||||||
|
log "check: MAME runs Newton-iter loop → high-half ~1.41 (BranchExpand self-loop BRA fix)"
|
||||||
|
cSqFile="$(mktemp --suffix=.c)"
|
||||||
|
oSqFile="$(mktemp --suffix=.o)"
|
||||||
|
binSqFile="$(mktemp --suffix=.bin)"
|
||||||
|
# 3-iter Newton-method sqrt with a counted for-loop (the loop-back
|
||||||
|
# BRA is a self-loop, which the BranchExpand distance estimator
|
||||||
|
# used to report as 0 bytes, so it never promoted to BRL even
|
||||||
|
# when the loop body grew well past +/-128 bytes).
|
||||||
|
cat > "$cSqFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) double sqrt3(double x) {
|
||||||
|
double g = x * 0.5;
|
||||||
|
for (unsigned short i = 0; i < 3; i++)
|
||||||
|
g = (g + x / g) * 0.5;
|
||||||
|
return g;
|
||||||
|
}
|
||||||
|
int main(void) {
|
||||||
|
union { double d; unsigned short w[4]; } u;
|
||||||
|
u.d = sqrt3(2.0);
|
||||||
|
switchToBank2();
|
||||||
|
// Only the high half is precision-stable (low halves vary slightly
|
||||||
|
// due to truncation vs round-to-nearest in __divdf3). Verify just
|
||||||
|
// the high half — that's enough to prove the self-loop BRA was
|
||||||
|
// promoted (the link would have failed otherwise) and __divdf3 is
|
||||||
|
// converging to the right magnitude.
|
||||||
|
*(volatile unsigned short *)0x5006 = u.w[3];
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cSqFile" -o "$oSqFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binSqFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oSqFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
|
||||||
|
"$binSqFile" --check 0x025006=3ff6 >/dev/null 2>&1; then
|
||||||
|
die "MAME: sqrt3(2.0) high half wrong (self-loop BRA / __divdf3)"
|
||||||
|
fi
|
||||||
|
rm -f "$cSqFile" "$oSqFile" "$binSqFile"
|
||||||
|
|
||||||
|
log "check: MAME runs -O0 addOne(7) → 8 (lda-overwrite-immediate fix; fast regalloc)"
|
||||||
|
cO0File="$(mktemp --suffix=.c)"
|
||||||
|
oO0File="$(mktemp --suffix=.o)"
|
||||||
|
binO0File="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cO0File" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
unsigned short addOne(unsigned short a) { return a + 1; }
|
||||||
|
int main(void) {
|
||||||
|
unsigned short r = addOne(7);
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = r;
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O0 -ffunction-sections -c \
|
||||||
|
"$cO0File" -o "$oO0File"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binO0File" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oO0File" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
|
||||||
|
"$binO0File" --check 0x025000=0008 >/dev/null 2>&1; then
|
||||||
|
die "MAME: -O0 addOne(7) != 8 (lda overwrite immediate / regalloc choice)"
|
||||||
|
fi
|
||||||
|
rm -f "$cO0File" "$oO0File" "$binO0File"
|
||||||
|
|
||||||
|
log "check: MAME runs bubble sort with mySwap helper [4,1,3,2] → [1,2,3,4] (greedy across helper-call)"
|
||||||
|
cBshFile="$(mktemp --suffix=.c)"
|
||||||
|
oBshFile="$(mktemp --suffix=.o)"
|
||||||
|
binBshFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cBshFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
unsigned short bsdata[4] = { 4, 1, 3, 2 };
|
||||||
|
__attribute__((noinline)) void mySwap(unsigned short *a, unsigned short *b) {
|
||||||
|
unsigned short t = *a; *a = *b; *b = t;
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) void mySort(unsigned short *arr, unsigned short n) {
|
||||||
|
for (unsigned short i = 0; i < n - 1; i++)
|
||||||
|
for (unsigned short j = 0; j < n - i - 1; j++)
|
||||||
|
if (arr[j] > arr[j+1])
|
||||||
|
mySwap(&arr[j], &arr[j+1]);
|
||||||
|
}
|
||||||
|
int main(void) {
|
||||||
|
mySort(bsdata, 4);
|
||||||
|
unsigned short d0 = bsdata[0], d1 = bsdata[1], d2 = bsdata[2], d3 = bsdata[3];
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = d0;
|
||||||
|
*(volatile unsigned short *)0x5002 = d1;
|
||||||
|
*(volatile unsigned short *)0x5004 = d2;
|
||||||
|
*(volatile unsigned short *)0x5006 = d3;
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cBshFile" -o "$oBshFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binBshFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oBshFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" \
|
||||||
|
"$binBshFile" --check 0x025000=0001 0x025002=0002 \
|
||||||
|
0x025004=0003 0x025006=0004 >/dev/null 2>&1; then
|
||||||
|
die "MAME: mySort with mySwap helper miscompiled (greedy regalloc across call)"
|
||||||
|
fi
|
||||||
|
rm -f "$cBshFile" "$oBshFile" "$binBshFile"
|
||||||
|
|
||||||
|
log "check: MAME runs dmul(8.0,2.0) AFTER bank-switch → 16.0 (DPF0 store + __muldf3)"
|
||||||
|
cDmFile="$(mktemp --suffix=.c)"
|
||||||
|
oDmFile="$(mktemp --suffix=.o)"
|
||||||
|
binDmFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cDmFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) double dmul(double a, double b) { return a * b; }
|
||||||
|
int main(void) {
|
||||||
|
union { double d; unsigned short w[4]; } u;
|
||||||
|
switchToBank2();
|
||||||
|
u.d = dmul(8.0, 2.0);
|
||||||
|
*(volatile unsigned short *)0x5000 = u.w[0];
|
||||||
|
*(volatile unsigned short *)0x5002 = u.w[1];
|
||||||
|
*(volatile unsigned short *)0x5004 = u.w[2];
|
||||||
|
*(volatile unsigned short *)0x5006 = u.w[3];
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cDmFile" -o "$oDmFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binDmFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmFile" --check \
|
||||||
|
0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \
|
||||||
|
>/dev/null 2>&1; then
|
||||||
|
die "MAME: dmul(8,2) under DBR=2 produced wrong bits (DPF0 store / __muldf3)"
|
||||||
|
fi
|
||||||
|
rm -f "$cDmFile" "$oDmFile" "$binDmFile"
|
||||||
|
|
||||||
|
log "check: MAME runs dmath = (a+b)*(a-b), 5,3 → 16.0 (chained libcall ABI)"
|
||||||
|
cDmaFile="$(mktemp --suffix=.c)"
|
||||||
|
oDmaFile="$(mktemp --suffix=.o)"
|
||||||
|
binDmaFile="$(mktemp --suffix=.bin)"
|
||||||
|
cat > "$cDmaFile" <<'EOF'
|
||||||
|
__attribute__((noinline)) void switchToBank2(void) {
|
||||||
|
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
|
||||||
|
}
|
||||||
|
__attribute__((noinline)) double dadd(double a, double b) { return a + b; }
|
||||||
|
__attribute__((noinline)) double dsub(double a, double b) { return a - b; }
|
||||||
|
__attribute__((noinline)) double dmul(double a, double b) { return a * b; }
|
||||||
|
__attribute__((noinline)) double dmath(double a, double b) {
|
||||||
|
return dmul(dadd(a, b), dsub(a, b));
|
||||||
|
}
|
||||||
|
int main(void) {
|
||||||
|
union { double d; unsigned short w[4]; } u;
|
||||||
|
u.d = dmath(5.0, 3.0);
|
||||||
|
switchToBank2();
|
||||||
|
*(volatile unsigned short *)0x5000 = u.w[0];
|
||||||
|
*(volatile unsigned short *)0x5002 = u.w[1];
|
||||||
|
*(volatile unsigned short *)0x5004 = u.w[2];
|
||||||
|
*(volatile unsigned short *)0x5006 = u.w[3];
|
||||||
|
while (1) {}
|
||||||
|
}
|
||||||
|
EOF
|
||||||
|
"$CLANG" --target=w65816 -O2 -ffunction-sections -c \
|
||||||
|
"$cDmaFile" -o "$oDmaFile"
|
||||||
|
"$PROJECT_ROOT/tools/link816" -o "$binDmaFile" --text-base 0x1000 \
|
||||||
|
"$oCrt0F" "$oLibcF" "$oSfF" "$oSdF" "$oLibgccFile" "$oDmaFile" \
|
||||||
|
>/dev/null 2>&1
|
||||||
|
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDmaFile" --check \
|
||||||
|
0x025000=0000 0x025002=0000 0x025004=0000 0x025006=4030 \
|
||||||
|
>/dev/null 2>&1; then
|
||||||
|
die "MAME: dmath(5,3) returned wrong high half (DP[\$F0] CSE across libcalls)"
|
||||||
|
fi
|
||||||
|
rm -f "$cDmaFile" "$oDmaFile" "$binDmaFile"
|
||||||
|
|
||||||
|
rm -f "$oLibcF" "$oSfF" "$oSdF" "$oCrt0F"
|
||||||
else
|
else
|
||||||
warn "MAME or apple2gs ROMs not installed; skipping end-to-end test"
|
warn "MAME or apple2gs ROMs not installed; skipping end-to-end test"
|
||||||
fi
|
fi
|
||||||
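The `--check` arguments in the 64-bit tests above are the expected result split into four 16-bit words, least significant word at `0x025000`. A minimal host-side sketch of how those words can be derived follows. The helper names (`splitWords`, `doubleBits`, `factU64`) are hypothetical and not part of the smoke script, and the sketch assumes the host uses IEEE-754 doubles and that the target stores `u.w[0]` as the low word.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Split a 64-bit value into four 16-bit words, least significant first,
// matching how the tests read u.w[0..3] on the little-endian 65816.
std::array<uint16_t, 4> splitWords(uint64_t v) {
  std::array<uint16_t, 4> w{};
  for (int i = 0; i < 4; ++i)
    w[i] = static_cast<uint16_t>(v >> (16 * i));
  return w;
}

// Reinterpret a double as its 64-bit pattern (assumes an IEEE-754 host).
uint64_t doubleBits(double d) {
  static_assert(sizeof(uint64_t) == sizeof(double), "double must be 64 bits");
  uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  return bits;
}

// Reference factorial computed on the host; 20! still fits in 64 bits.
uint64_t factU64(unsigned n) {
  uint64_t r = 1;
  for (unsigned i = 2; i <= n; ++i)
    r *= i;
  return r;
}
```

For example, `splitWords(factU64(20))` yields `0000 82b4 677c 21c3`, the four values checked by the `fact_u64(20)` test, and `splitWords(doubleBits(16.0))[3]` yields `4030`, the high word checked by the `dmul` and `dmath` tests.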
|
|
|
||||||
|
|
@ -131,6 +131,7 @@ static bool clobbersImg(const MachineInstr &MI,
|
||||||
|
|
||||||
bool W65816ABridgeViaX::runOnMachineFunction(MachineFunction &MF) {
|
bool W65816ABridgeViaX::runOnMachineFunction(MachineFunction &MF) {
|
||||||
if (!MF.getRegInfo().getNumVirtRegs()) return false;
|
if (!MF.getRegInfo().getNumVirtRegs()) return false;
|
||||||
|
if (MF.getFunction().hasOptNone()) return false;
|
||||||
MachineRegisterInfo &MRI = MF.getRegInfo();
|
MachineRegisterInfo &MRI = MF.getRegInfo();
|
||||||
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
|
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
|
||||||
const W65816InstrInfo *TII = STI.getInstrInfo();
|
const W65816InstrInfo *TII = STI.getInstrInfo();
|
||||||
|
|
|
||||||
|
|
@ -83,21 +83,71 @@ void W65816AsmPrinter::emitInstruction(const MachineInstr *MI) {
|
||||||
switch (MI->getOpcode()) {
|
switch (MI->getOpcode()) {
|
||||||
default:
|
default:
|
||||||
break;
|
break;
|
||||||
case W65816::ADJCALLSTACKDOWN:
|
case W65816::ADJCALLSTACKDOWN: {
|
||||||
|
// DOWN is a no-op in our scheme — the PUSH16 sequence in LowerCall
|
||||||
|
// already shifted SP incrementally as args were pushed. Nothing
|
||||||
|
// to emit; PEI may or may not have processed it, either is fine.
|
||||||
|
return;
|
||||||
|
}
|
||||||
case W65816::ADJCALLSTACKUP: {
|
case W65816::ADJCALLSTACKUP: {
|
||||||
// PEI's eliminateCallFramePseudoInstr removes these *only* when the
|
// PEI's eliminateCallFramePseudoInstr handles UP whenever the
|
||||||
// function has frame work (StackSize > 0 or any FrameIndex use).
|
// function has any frame work (StackSize > 0 or any FI use).
|
||||||
// Functions that just tail-call into a libcall (e.g. `int toInt(float
|
// Frame-less functions — e.g. `unsigned short sqr(unsigned short
|
||||||
// x) { return (int)x; }` lowers to a single jsl __fixsfsi) have
|
// x) { return x*x; }` lowers to PUSH16 + jsl __mulhi3 + RTL with
|
||||||
// neither; PEI skips its call-frame phase and the pseudo survives
|
// no locals — get skipped by PEI's call-frame phase, leaving
|
||||||
// to MC. AsmStreamer renders the pseudo's "# ADJCALLSTACK..."
|
// ADJCALLSTACKUP as a pseudo all the way to here. Previously we
|
||||||
// string as a comment, but MCObjectStreamer asks the encoder to
|
// silently dropped it, which left SP off by N bytes after the
|
||||||
// emit bytes — which fails ("Unsupported instruction MCInst 337").
|
// call and corrupted the caller's stack frame (caught by sqr(x)
|
||||||
// Dropping it here is correct: when amt is zero (the "no frame"
|
// segfaulting MAME). Emit the SP fixup ourselves: PLY*N/2 for
|
||||||
// path) the call sequence is a no-op anyway; when non-zero, PEI
|
// small even N, otherwise the TAY/TSC-ADC/TYA bracket.
|
||||||
// would have replaced it with PLA-loop / TSC-ADC sequence already.
|
int N = MI->getOperand(0).getImm();
|
||||||
// If we ever see a non-zero amount slip through, that's a real
|
if (N == 0) return;
|
||||||
// bug — emit nothing and trust the comment-stripped path.
|
// A holds the callee's return value; preserve it. Walk forward
|
||||||
|
// looking for X/Y uses (i64-return halves) — same logic as
|
||||||
|
// eliminateCallFramePseudoInstr.
|
||||||
|
bool YLive = false;
|
||||||
|
for (auto J = std::next(MI->getIterator()); J != MI->getParent()->end();
|
||||||
|
++J) {
|
||||||
|
if (J->isCall()) break;
|
||||||
|
bool yDef = false;
|
||||||
|
for (const MachineOperand &MO : J->operands()) {
|
||||||
|
if (!MO.isReg()) continue;
|
||||||
|
if (MO.getReg() == W65816::Y) {
|
||||||
|
if (MO.isUse()) { YLive = true; break; }
|
||||||
|
if (MO.isDef()) yDef = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (YLive || yDef) break;
|
||||||
|
}
|
||||||
|
if (YLive) {
|
||||||
|
// Route through DP $E0 to preserve both A and Y.
|
||||||
|
MCInst Sta; Sta.setOpcode(W65816::STA_DP);
|
||||||
|
Sta.addOperand(MCOperand::createImm(0xE0));
|
||||||
|
EmitToStreamer(*OutStreamer, Sta);
|
||||||
|
MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc);
|
||||||
|
MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc);
|
||||||
|
MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16);
|
||||||
|
Adc.addOperand(MCOperand::createImm(N));
|
||||||
|
EmitToStreamer(*OutStreamer, Adc);
|
||||||
|
MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs);
|
||||||
|
MCInst Lda; Lda.setOpcode(W65816::LDA_DP);
|
||||||
|
Lda.addOperand(MCOperand::createImm(0xE0));
|
||||||
|
EmitToStreamer(*OutStreamer, Lda);
|
||||||
|
} else if (N <= 14 && (N % 2) == 0) {
|
||||||
|
for (int i = 0; i < N / 2; ++i) {
|
||||||
|
MCInst Ply; Ply.setOpcode(W65816::PLY);
|
||||||
|
EmitToStreamer(*OutStreamer, Ply);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
MCInst Tay; Tay.setOpcode(W65816::TAY); EmitToStreamer(*OutStreamer, Tay);
|
||||||
|
MCInst Tsc; Tsc.setOpcode(W65816::TSC); EmitToStreamer(*OutStreamer, Tsc);
|
||||||
|
MCInst Clc; Clc.setOpcode(W65816::CLC); EmitToStreamer(*OutStreamer, Clc);
|
||||||
|
MCInst Adc; Adc.setOpcode(W65816::ADC_Imm16);
|
||||||
|
Adc.addOperand(MCOperand::createImm(N));
|
||||||
|
EmitToStreamer(*OutStreamer, Adc);
|
||||||
|
MCInst Tcs; Tcs.setOpcode(W65816::TCS); EmitToStreamer(*OutStreamer, Tcs);
|
||||||
|
MCInst Tya; Tya.setOpcode(W65816::TYA); EmitToStreamer(*OutStreamer, Tya);
|
||||||
|
}
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
case W65816::LDXi16imm: {
|
case W65816::LDXi16imm: {
|
||||||
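The `ADJCALLSTACKUP` case above makes two decisions: whether Y is still needed after the call, and which SP-fixup sequence to emit. A simplified, LLVM-free sketch of that logic follows. The `Insn` struct, `yLiveAfter`, and `chooseSpFixup` are hypothetical stand-ins for illustration only; the real code walks `MachineInstr` operands and emits `MCInst`s.

```cpp
#include <vector>

// Hypothetical model of one instruction following the call-up pseudo.
struct Insn {
  bool isCall;  // a further call clobbers everything, so the scan stops
  bool usesY;   // reads Y (e.g. a COPY of a 64-bit return half)
  bool defsY;   // overwrites Y, so the post-call value is dead
};

// Forward scan: Y is live if it is read before the next call or redefinition.
bool yLiveAfter(const std::vector<Insn>& following) {
  for (const Insn& in : following) {
    if (in.isCall) break;
    if (in.usesY) return true;  // a use wins even if the same insn also defs Y
    if (in.defsY) break;
  }
  return false;
}

enum class SpFixup { None, ViaDpScratch, PlyPops, TayBracket };

// Pick the SP-restore sequence for N pushed argument bytes.
SpFixup chooseSpFixup(int n, bool yLive) {
  if (n == 0) return SpFixup::None;                    // nothing was pushed
  if (yLive) return SpFixup::ViaDpScratch;             // STA $E0 / TSC..TCS / LDA $E0
  if (n <= 14 && n % 2 == 0) return SpFixup::PlyPops;  // N/2 PLY instructions
  return SpFixup::TayBracket;                          // TAY / TSC..TCS / TYA
}
```

In this model `sqr(10)` pushes one 16-bit argument and returns in A only, so `chooseSpFixup(2, false)` selects a single `PLY`; a call whose 64-bit result is read from Y selects the DP-scratch route regardless of N.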
|
|
|
||||||
|
|
@ -46,6 +46,7 @@
|
||||||
#include "llvm/CodeGen/MachineFunctionPass.h"
|
#include "llvm/CodeGen/MachineFunctionPass.h"
|
||||||
#include "llvm/CodeGen/MachineInstr.h"
|
#include "llvm/CodeGen/MachineInstr.h"
|
||||||
#include "llvm/CodeGen/MachineInstrBuilder.h"
|
#include "llvm/CodeGen/MachineInstrBuilder.h"
|
||||||
|
#include "llvm/Support/raw_ostream.h"
|
||||||
|
|
||||||
using namespace llvm;
|
using namespace llvm;
|
||||||
|
|
||||||
|
|
@ -100,7 +101,17 @@ static unsigned estimateDistance(MachineFunction &MF,
|
||||||
const MachineInstr &Br,
|
const MachineInstr &Br,
|
||||||
MachineBasicBlock *To) {
|
MachineBasicBlock *To) {
|
||||||
const MachineBasicBlock *From = Br.getParent();
|
const MachineBasicBlock *From = Br.getParent();
|
||||||
if (From == To) return 0;
|
// Self-loop branch: target is the start of From, branch is somewhere
|
||||||
|
// inside From. Distance is the bytes from start of From to the
|
||||||
|
// branch instruction (i.e., everything before Br in From).
|
||||||
|
if (From == To) {
|
||||||
|
unsigned Bytes = 0;
|
||||||
|
for (const auto &MI : *From) {
|
||||||
|
if (&MI == &Br) break;
|
||||||
|
Bytes += TII->getInstSizeInBytes(MI);
|
||||||
|
}
|
||||||
|
return Bytes;
|
||||||
|
}
|
||||||
|
|
||||||
// Two cases by layout direction:
|
// Two cases by layout direction:
|
||||||
// forward: bytes after Br in From, plus all of MBBs strictly
|
// forward: bytes after Br in From, plus all of MBBs strictly
|
||||||
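The self-loop case in `estimateDistance` sums the sizes of every instruction that precedes the branch in its own block. A minimal standalone sketch of that estimate follows, using a plain vector of instruction sizes in place of a `MachineBasicBlock`. The names `selfLoopDistance` and `needsLongBranch` are hypothetical, and the default threshold of 90 is taken from `EXPAND_DIST_THRESHOLD` in the same patch.

```cpp
#include <cstddef>
#include <vector>

// Bytes from the start of the block up to (not including) the branch at
// index brIndex: the estimated backward distance for a self-loop branch.
unsigned selfLoopDistance(const std::vector<unsigned>& sizes,
                          std::size_t brIndex) {
  unsigned bytes = 0;
  for (std::size_t i = 0; i < brIndex && i < sizes.size(); ++i)
    bytes += sizes[i];
  return bytes;
}

// Mirrors the pass's promotion test: estimates above the threshold become BRL.
bool needsLongBranch(unsigned dist, unsigned threshold = 90) {
  return dist > threshold;
}
```

Note that the 65816 measures a relative branch from the address of the next instruction, so the true backward displacement is this estimate plus the size of the branch itself. The threshold sits well under the 128-byte limit, which absorbs that difference.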
|
|
@ -276,11 +287,30 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) {
|
||||||
// Step 2: iterate to fixed-point. Each expansion adds 3 bytes
|
// Step 2: iterate to fixed-point. Each expansion adds 3 bytes
|
||||||
// (bridge BRA), which may push another previously-OK branch over
|
// (bridge BRA), which may push another previously-OK branch over
|
||||||
// the threshold. Cap at MAX_ITERS to avoid pathological cases.
|
// the threshold. Cap at MAX_ITERS to avoid pathological cases.
|
||||||
const unsigned EXPAND_DIST_THRESHOLD = 100; // safe under +/-128
|
const unsigned EXPAND_DIST_THRESHOLD = 90; // tighter margin under +/-128
|
||||||
const unsigned MAX_ITERS = 10;
|
const unsigned MAX_ITERS = 10;
|
||||||
for (unsigned iter = 0; iter < MAX_ITERS; ++iter) {
|
for (unsigned iter = 0; iter < MAX_ITERS; ++iter) {
|
||||||
bool Changed = false;
|
bool Changed = false;
|
||||||
|
|
||||||
|
// Promote long BRA to BRL. The assembler's BRA→BRL relaxation
|
||||||
|
// sometimes fails to fire when the target symbol resolves early
|
||||||
|
// in MC layout — the linker then sees a PCREL8 reloc that's out
|
||||||
|
// of range. Force the BRL ourselves when the estimate exceeds
|
||||||
|
// the safe threshold; costs one extra byte if BRA would have fit, but
|
||||||
|
// beats a hard link error.
|
||||||
|
for (auto &MBB : MF) {
|
||||||
|
for (auto &MI : MBB.terminators()) {
|
||||||
|
if (MI.getOpcode() != W65816::BRA) continue;
|
||||||
|
if (MI.getNumOperands() < 1 || !MI.getOperand(0).isMBB()) continue;
|
||||||
|
MachineBasicBlock *Target = MI.getOperand(0).getMBB();
|
||||||
|
unsigned Dist = estimateDistance(MF, TII, MI, Target);
|
||||||
|
if (Dist > EXPAND_DIST_THRESHOLD) {
|
||||||
|
MI.setDesc(TII->get(W65816::BRL));
|
||||||
|
Changed = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// Collect candidates. After step 1, each MBB has at most one
|
// Collect candidates. After step 1, each MBB has at most one
|
||||||
// conditional terminator, so we walk terminators().
|
// conditional terminator, so we walk terminators().
|
||||||
SmallVector<std::pair<MachineBasicBlock *, MachineInstr *>, 8> Candidates;
|
SmallVector<std::pair<MachineBasicBlock *, MachineInstr *>, 8> Candidates;
|
||||||
|
|
@ -337,6 +367,27 @@ bool W65816BranchExpand::runOnMachineFunction(MachineFunction &MF) {
|
||||||
// fall-through marker after stays after.
|
// fall-through marker after stays after.
|
||||||
auto insertPt = MBB->getFirstTerminator();
|
auto insertPt = MBB->getFirstTerminator();
|
||||||
BuildMI(*MBB, insertPt, DL, TII->get(InvOpc)).addMBB(Skip);
|
BuildMI(*MBB, insertPt, DL, TII->get(InvOpc)).addMBB(Skip);
|
||||||
|
// After the rewrite, MBB falls through to Bridge (which now sits
|
||||||
|
// immediately after MBB in layout). Any unconditional BRA/BRL
|
||||||
|
// already at the end of MBB used to direct the fall-through to
|
||||||
|
// Skip — but with Bridge interposed, that BRA would skip past
|
||||||
|
// Bridge entirely and Bridge becomes unreachable. Remove it.
|
||||||
|
// (Skip is still reachable via INV_Bxx; Target is reachable via
|
||||||
|
// fall-through-to-Bridge then BRL.) Caught by vprintf crashing
|
||||||
|
// because dropDeadConditionalsToBRATarget then dropped the
|
||||||
|
// INV_Bxx as redundant with the leftover BRA Skip.
|
||||||
|
while (insertPt != MBB->end()) {
|
||||||
|
unsigned NextOpc = insertPt->getOpcode();
|
||||||
|
if (NextOpc == W65816::BRA || NextOpc == W65816::BRL) {
|
||||||
|
if (insertPt->getNumOperands() >= 1 &&
|
||||||
|
insertPt->getOperand(0).isMBB() &&
|
||||||
|
insertPt->getOperand(0).getMBB() == Skip) {
|
||||||
|
insertPt = insertPt->eraseFromParent();
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
++insertPt;
|
||||||
|
}
|
||||||
|
|
||||||
// Bridge: BRL Target. Always emit the long form rather than
|
// Bridge: BRL Target. Always emit the long form rather than
|
||||||
// relying on the assembler to relax BRA→BRL — the relaxation
|
// relying on the assembler to relax BRA→BRL — the relaxation
|
||||||
|
|
|
||||||
|
|
@ -162,15 +162,39 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
|
||||||
// Insert before the terminator (the return).
|
// Insert before the terminator (the return).
|
||||||
DebugLoc DL = MBBI != MBB.end() ? MBBI->getDebugLoc() : DebugLoc();
|
DebugLoc DL = MBBI != MBB.end() ? MBBI->getDebugLoc() : DebugLoc();
|
||||||
|
|
||||||
|
// Detect whether the return live-out includes Y or X — for i64 returns
|
||||||
|
// (Outs[0..2] -> A,X,Y), Y holds bits 32-47 and X holds bits 16-31, so
|
||||||
|
// any TAY/PLY/TAX in the SP-restore would corrupt the return value.
|
||||||
|
// The RTL terminator carries implicit-uses for every live-out return
|
||||||
|
// register; scan them to decide which scratch we can use safely.
|
||||||
|
bool YLive = false;
|
||||||
|
bool XLive = false;
|
||||||
|
if (MBBI != MBB.end() && MBBI->isReturn()) {
|
||||||
|
for (const MachineOperand &MO : MBBI->operands()) {
|
||||||
|
if (!MO.isReg() || !MO.isImplicit() || !MO.isUse()) continue;
|
||||||
|
if (MO.getReg() == W65816::Y) YLive = true;
|
||||||
|
else if (MO.getReg() == W65816::X) XLive = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// VLA cleanup: restore entry SP from DP $F4 (saved in prologue).
|
// VLA cleanup: restore entry SP from DP $F4 (saved in prologue).
|
||||||
// This subsumes BOTH the static frame and any dynamic_stackalloc
|
// This subsumes BOTH the static frame and any dynamic_stackalloc
|
||||||
// bytes — we can skip the per-byte PLY/PLA loop entirely. Preserve
|
// bytes — we can skip the per-byte PLY/PLA loop entirely. Preserve
|
||||||
// A through TAY/TYA since it holds the return value.
|
// A through TAY/TYA since it holds the return value. For i64
|
||||||
|
// returns where Y is also live, route the save through DP $E0
|
||||||
|
// ($E0..$EF is libcall scratch — guaranteed dead by epilogue time).
|
||||||
if (HasVLA) {
|
if (HasVLA) {
|
||||||
|
if (YLive) {
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
|
||||||
|
} else {
|
||||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
|
||||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xF4);
|
||||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
|
||||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA));
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::TYA));
|
||||||
|
}
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -182,11 +206,26 @@ void W65816FrameLowering::emitEpilogue(MachineFunction &MF,
|
||||||
// N/2 PLY (pop into Y, discard); larger frames use
|
// N/2 PLY (pop into Y, discard); larger frames use
|
||||||
// TAY/TSC/CLC/ADC #N/TCS/TYA.
|
// TAY/TSC/CLC/ADC #N/TCS/TYA.
|
||||||
// Mirror the prologue threshold (see comment there).
|
// Mirror the prologue threshold (see comment there).
|
||||||
if (StackSize <= 6 && (StackSize % 2) == 0) {
|
if (StackSize <= 6 && (StackSize % 2) == 0 && !YLive) {
|
||||||
|
// PLY clobbers Y, which is fine when Y isn't a return reg.
|
||||||
for (uint64_t i = 0; i < StackSize / 2; ++i)
|
for (uint64_t i = 0; i < StackSize / 2; ++i)
|
||||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::PLY));
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::PLY));
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
if (YLive) {
|
||||||
|
// Y is a return register (i64 / double). Save A via DP $E0
|
||||||
|
// instead of TAY so Y survives. 4 cyc slower than TAY/TYA but
|
||||||
|
// correct. X is allowed to be live too — none of these touch X.
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::ADC_Imm16))
|
||||||
|
.addImm(StackSize);
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::TCS));
|
||||||
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
|
||||||
|
(void)XLive;
|
||||||
|
return;
|
||||||
|
}
|
||||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::TAY));
|
||||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::TSC));
|
||||||
BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
|
BuildMI(MBB, MBBI, DL, TII.get(W65816::CLC));
|
||||||
|
|
@ -207,15 +246,56 @@ MachineBasicBlock::iterator W65816FrameLowering::eliminateCallFramePseudoInstr(
|
||||||
// ADJCALLSTACKUP releases all the pushed bytes after a call.
|
// ADJCALLSTACKUP releases all the pushed bytes after a call.
|
||||||
//
|
//
|
||||||
// Critical: A holds the callee's return value here, so this MUST NOT
|
// Critical: A holds the callee's return value here, so this MUST NOT
|
||||||
// clobber A. The naive `tsc;clc;adc #N;tcs` does (TSC overwrites A),
|
// clobber A. PLY (small-N path) clobbers Y; TAY/.../TYA bracket
|
||||||
// which silently corrupts every call's return value. Same fix as the
|
// (large-N path) also clobbers Y. Both are fine for i8/i16/i32
|
||||||
// epilogue: small N via PLY (clobbers Y, preserves A); larger N via
|
// returns but DESTROY the return for i64/double (where X and Y hold
|
||||||
// TAY/.../TYA bracket.
|
// mid halves). Detect i64-return calls by walking forward from the
|
||||||
|
// call-up and checking for uses of $x/$y; in that case, save A via DP $E0
|
||||||
|
// (libcall scratch, dead by call-up time) so X and Y survive.
|
||||||
|
// Caught by `unsigned long long u64add(a,b)` through a noinline
|
||||||
|
// boundary returning Y = b_hi (the last popped) instead of the
|
||||||
|
// sum's mid-high.
|
||||||
if (I->getOpcode() == W65816::ADJCALLSTACKUP) {
|
if (I->getOpcode() == W65816::ADJCALLSTACKUP) {
|
||||||
int N = I->getOperand(0).getImm();
|
int N = I->getOperand(0).getImm();
|
||||||
if (N > 0) {
|
if (N > 0) {
|
||||||
DebugLoc DL = I->getDebugLoc();
|
DebugLoc DL = I->getDebugLoc();
|
||||||
if (N <= 14 && (N % 2) == 0) {
|
bool YLive = false;
|
||||||
|
bool XLive = false;
|
||||||
|
// Walk forward looking for COPY %vreg = $x / $y — LowerCall's
|
||||||
|
// pattern for materializing return halves. JSLpseudo's tablegen
|
||||||
|
// Defs list has no X or Y, so implicit-defs of X/Y aren't on
|
||||||
|
// the call op itself. We have to read what comes after.
|
||||||
|
// Stop at the next call (re-clobbers everything) or at any def
|
||||||
|
// of X/Y (cancels their post-call value).
|
||||||
|
bool Stopped = false;
|
||||||
|
for (auto J = std::next(I); J != MBB.end() && !Stopped; ++J) {
|
||||||
|
if (J->isCall()) break;
|
||||||
|
for (const MachineOperand &MO : J->operands()) {
|
||||||
|
if (!MO.isReg()) continue;
|
||||||
|
Register R = MO.getReg();
|
||||||
|
if (R == W65816::Y) {
|
||||||
|
if (MO.isUse()) YLive = true;
|
||||||
|
else if (MO.isDef() && !YLive) Stopped = true;
|
||||||
|
} else if (R == W65816::X) {
|
||||||
|
if (MO.isUse()) XLive = true;
|
||||||
|
else if (MO.isDef() && !XLive) Stopped = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (YLive && XLive) break;
|
||||||
|
}
|
||||||
|
if (YLive) {
|
||||||
|
// i64 return: PLY would eat Y. Route through DP $E0. Worth
|
||||||
|
// ~4 cyc more than PLY*N/2 but correctness wins. X is not
|
||||||
|
// touched by any of these insns either way, so XLive doesn't
|
||||||
|
// change anything here — track it for symmetry.
|
||||||
|
BuildMI(MBB, I, DL, TII.get(W65816::STA_DP)).addImm(0xE0);
|
||||||
|
BuildMI(MBB, I, DL, TII.get(W65816::TSC));
|
||||||
|
BuildMI(MBB, I, DL, TII.get(W65816::CLC));
|
||||||
|
BuildMI(MBB, I, DL, TII.get(W65816::ADC_Imm16)).addImm(N);
|
||||||
|
BuildMI(MBB, I, DL, TII.get(W65816::TCS));
|
||||||
|
BuildMI(MBB, I, DL, TII.get(W65816::LDA_DP)).addImm(0xE0);
|
||||||
|
(void)XLive;
|
||||||
|
} else if (N <= 14 && (N % 2) == 0) {
|
||||||
for (int i = 0; i < N / 2; ++i)
|
for (int i = 0; i < N / 2; ++i)
|
||||||
BuildMI(MBB, I, DL, TII.get(W65816::PLY));
|
BuildMI(MBB, I, DL, TII.get(W65816::PLY));
|
||||||
} else {
|
} else {
|
||||||
|
|
|
||||||
|
|
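The teardown selection above can be sketched as a small standalone model. This is only an illustration under stated assumptions: `Reg`, `Operand`, `Instr`, `Liveness`, `Teardown`, `scanAfterCall`, `chooseTeardown` and `checkTeardownModel` are hypothetical names that stand in for the real `MachineInstr` / `MachineOperand` machinery, and the model ignores everything except X/Y uses and defs.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for MachineOperand / MachineInstr.
enum class Reg { A, X, Y };
struct Operand { Reg reg; bool isDef; };  // isDef == false means a use
struct Instr { bool isCall; std::vector<Operand> operands; };

struct Liveness { bool xLive = false; bool yLive = false; };

// Forward scan over the instructions that follow ADJCALLSTACKUP.
// A use of X/Y marks it live; a def seen before any use stops the scan,
// as does the next call.
Liveness scanAfterCall(const std::vector<Instr> &after) {
  Liveness live;
  bool stopped = false;
  for (const Instr &instr : after) {
    if (stopped || instr.isCall) break;
    for (const Operand &op : instr.operands) {
      if (op.reg == Reg::Y) {
        if (!op.isDef) live.yLive = true;
        else if (!live.yLive) stopped = true;
      } else if (op.reg == Reg::X) {
        if (!op.isDef) live.xLive = true;
        else if (!live.xLive) stopped = true;
      }
    }
    if (live.xLive && live.yLive) break;
  }
  return live;
}

// Which teardown sequence the hunk would pick for N pushed bytes.
enum class Teardown { None, SaveAViaDP, PlyLoop, TayBracket };

Teardown chooseTeardown(int n, bool yLive) {
  if (n <= 0) return Teardown::None;
  if (yLive) return Teardown::SaveAViaDP;            // STA $E0 ... LDA $E0
  if (n <= 14 && n % 2 == 0) return Teardown::PlyLoop;  // N/2 x PLY
  return Teardown::TayBracket;                       // TAY ... TYA
}

// Usage sketch: an i64 return copies out of X and Y after the call.
void checkTeardownModel() {
  std::vector<Instr> i64Return = {
      Instr{false, {Operand{Reg::X, false}}},
      Instr{false, {Operand{Reg::Y, false}}}};
  assert(chooseTeardown(8, scanAfterCall(i64Return).yLive) ==
         Teardown::SaveAViaDP);

  std::vector<Instr> i16Return = {Instr{false, {Operand{Reg::Y, true}}}};
  assert(chooseTeardown(8, scanAfterCall(i16Return).yLive) ==
         Teardown::PlyLoop);
}
```

The model keeps the same ordering as the hunk: the Y-liveness check comes first, so the cheaper PLY path is only taken when nothing after the call reads Y.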
@ -861,10 +861,17 @@ W65816TargetLowering::LowerCall(TargetLowering::CallLoweringInfo &CLI,
|
||||||
Glue = V.getValue(2);
|
Glue = V.getValue(2);
|
||||||
InVals.push_back(V);
|
InVals.push_back(V);
|
||||||
} else {
|
} else {
|
||||||
// 4th half: load from DP $F0.
|
// 4th half: read DP[$F0..$F1] via CopyFromReg(DPF0). DPF0 is a
|
||||||
SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16);
|
// pseudo-physreg modeled as JSLpseudo's implicit-def, so each
|
||||||
SDValue V = DAG.getLoad(VT, DL, Chain, DPAddr, MachinePointerInfo());
|
// call's CopyFromReg has Glue tied to the corresponding call —
|
||||||
|
// the SDAG combiner can't merge them and the scheduler can't
|
||||||
|
// reorder them past the next call. copyPhysReg lowers DPF0 →
|
||||||
|
// A as `LDA $F0`. Without this, plain `getLoad(0xF0)` was
|
||||||
|
// being CSE'd / reordered across i64-returning calls, causing
|
||||||
|
// `dmath = (a+b)*(a-b)` to return 4 instead of 16.
|
||||||
|
SDValue V = DAG.getCopyFromReg(Chain, DL, W65816::DPF0, VT, Glue);
|
||||||
Chain = V.getValue(1);
|
Chain = V.getValue(1);
|
||||||
|
Glue = V.getValue(2);
|
||||||
InVals.push_back(V);
|
InVals.push_back(V);
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
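A minimal reproducer for the `dmath` case named in the comment might look like the following. The operand values 5.0 and 3.0 are an assumption chosen so that the correct result is 16; the comment gives only the expression and the wrong result (4), not the inputs. On the target, each soft-double operation is a libcall returning its top 16 bits through DP `$F0`, so the expression needs three reads of that slot to stay in call order.

```cpp
#include <cassert>

// Assumed shape of the test: three back-to-back soft-double operations,
// each of which returns an i64 bit pattern on the W65816 target.
__attribute__((noinline)) double dmath(double a, double b) {
  return (a + b) * (a - b);
}

// Usage sketch with assumed inputs: (5 + 3) * (5 - 3) == 16.
void checkDmath() {
  assert(dmath(5.0, 3.0) == 16.0);
}
```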
@ -900,11 +907,17 @@ SDValue W65816TargetLowering::LowerReturn(
|
||||||
SDValue Glue;
|
SDValue Glue;
|
||||||
SmallVector<SDValue, 8> RetOps(1, Chain);
|
SmallVector<SDValue, 8> RetOps(1, Chain);
|
||||||
|
|
||||||
// Outs[3] -> store to DP $F0 (only for i64 returns). Done first so
|
// Outs[3] -> DP $F0 via CopyToReg(DPF0). Using the DPF0 fake physreg
|
||||||
// its computation can use A freely before A holds the low result.
|
// (lowered to `STA $F0` by copyPhysReg) is critical: a generic
|
||||||
|
// ISD::STORE with addr=0xF0 lowered to `sta (d,s),y`, an indirect
|
||||||
|
// through the DBR, which silently misbehaved when DBR != 0. STA dp
|
||||||
|
// uses D + dp directly and is unaffected by DBR. Done first so its
|
||||||
|
// computation can use A freely before A holds the low result. Glued
|
||||||
|
// to RET_GLUE via the RetOps Register entry below so DCE doesn't
|
||||||
|
// strip the COPY.
|
||||||
if (Outs.size() >= 4) {
|
if (Outs.size() >= 4) {
|
||||||
SDValue DPAddr = DAG.getConstant(0xF0, DL, MVT::i16);
|
Chain = DAG.getCopyToReg(Chain, DL, W65816::DPF0, OutVals[3], Glue);
|
||||||
Chain = DAG.getStore(Chain, DL, OutVals[3], DPAddr, MachinePointerInfo());
|
Glue = Chain.getValue(1);
|
||||||
}
|
}
|
||||||
// Outs[2] -> Y.
|
// Outs[2] -> Y.
|
||||||
if (Outs.size() >= 3) {
|
if (Outs.size() >= 3) {
|
||||||
|
|
@ -926,6 +939,8 @@ SDValue W65816TargetLowering::LowerReturn(
|
||||||
RetOps.push_back(DAG.getRegister(W65816::X, Outs[1].VT));
|
RetOps.push_back(DAG.getRegister(W65816::X, Outs[1].VT));
|
||||||
if (Outs.size() >= 3)
|
if (Outs.size() >= 3)
|
||||||
RetOps.push_back(DAG.getRegister(W65816::Y, Outs[2].VT));
|
RetOps.push_back(DAG.getRegister(W65816::Y, Outs[2].VT));
|
||||||
|
if (Outs.size() >= 4)
|
||||||
|
RetOps.push_back(DAG.getRegister(W65816::DPF0, Outs[3].VT));
|
||||||
|
|
||||||
RetOps[0] = Chain;
|
RetOps[0] = Chain;
|
||||||
if (Glue.getNode())
|
if (Glue.getNode())
|
||||||
|
|
|
||||||
|
|
@ -92,6 +92,44 @@ void W65816InstrInfo::copyPhysReg(MachineBasicBlock &MBB,
|
||||||
BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(dstImg);
|
BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(dstImg);
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
// X → IMGn / IMGn → X: STX dp / LDX dp. Avoids the A-bridge that
|
||||||
|
// TAX/TXA would impose; critical for i32-first-arg signatures
|
||||||
|
// (live-in $a + $x) where bridging X via A clobbers $a's value
|
||||||
|
// before it can be saved. Caught by udivmod and iterative qsort.
|
||||||
|
if (dstImg >= 0 && SrcReg == W65816::X) {
|
||||||
|
BuildMI(MBB, I, DL, get(W65816::STX_DP)).addImm(dstImg);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (DestReg == W65816::X && srcImg >= 0) {
|
||||||
|
BuildMI(MBB, I, DL, get(W65816::LDX_DP)).addImm(srcImg);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
// Y → IMGn / IMGn → Y: STY dp / LDY dp — symmetric.
|
||||||
|
if (dstImg >= 0 && SrcReg == W65816::Y) {
|
||||||
|
BuildMI(MBB, I, DL, get(W65816::STY_DP)).addImm(dstImg);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (DestReg == W65816::Y && srcImg >= 0) {
|
||||||
|
BuildMI(MBB, I, DL, get(W65816::LDY_DP)).addImm(srcImg);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
// DPF0 → A: emit `LDA $F0`. DPF0 is the pseudo-physreg carrier
|
||||||
|
// for an i64-returning call's high 16 bits; LowerCall builds a
|
||||||
|
// CopyFromReg(DPF0) glued to the call so the SDAG combiner /
|
||||||
|
// scheduler can't merge or reorder reads across calls.
|
||||||
|
if (DestReg == W65816::A && SrcReg == W65816::DPF0) {
|
||||||
|
BuildMI(MBB, I, DL, get(W65816::LDA_DP)).addImm(0xF0);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
// A → DPF0: emit `STA $F0`. Used by LowerReturn for the i64 high
|
||||||
|
// half; using a true direct-page store is critical because plain
|
||||||
|
// ISD::STORE with addr=0xF0 was lowering to `(d,s),y` indirect via
|
||||||
|
// DBR — which silently broke under DBR != 0 (e.g. after a bank
|
||||||
|
// switch). STA dp uses D + dp directly, ignoring DBR.
|
||||||
|
if (DestReg == W65816::DPF0 && SrcReg == W65816::A) {
|
||||||
|
BuildMI(MBB, I, DL, get(W65816::STA_DP)).addImm(0xF0);
|
||||||
|
return;
|
||||||
|
}
|
||||||
llvm_unreachable("W65816: cross-class copyPhysReg not yet implemented");
|
llvm_unreachable("W65816: cross-class copyPhysReg not yet implemented");
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
@ -101,8 +139,14 @@ void W65816InstrInfo::storeRegToStackSlot(
|
||||||
MachineInstr::MIFlag Flags) const {
|
MachineInstr::MIFlag Flags) const {
|
||||||
// STAfi gets eliminated by W65816RegisterInfo::eliminateFrameIndex into
|
// STAfi gets eliminated by W65816RegisterInfo::eliminateFrameIndex into
|
||||||
// a real STA d,S. Source is implicit A; emit the pseudo with the FI
|
// a real STA d,S. Source is implicit A; emit the pseudo with the FI
|
||||||
// and zero offset.
|
// and zero offset. When regalloc hands us a spill from X or Y, bridge
|
||||||
|
// through A (TXA / TYA) — same rationale as loadRegFromStackSlot.
|
||||||
DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
|
DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
|
||||||
|
if (SrcReg == W65816::X || SrcReg == W65816::Y) {
|
||||||
|
unsigned XferOp = (SrcReg == W65816::X) ? W65816::TXA : W65816::TYA;
|
||||||
|
BuildMI(MBB, MI, DL, get(XferOp));
|
||||||
|
SrcReg = W65816::A;
|
||||||
|
}
|
||||||
BuildMI(MBB, MI, DL, get(W65816::STAfi))
|
BuildMI(MBB, MI, DL, get(W65816::STAfi))
|
||||||
.addReg(SrcReg, getKillRegState(isKill))
|
.addReg(SrcReg, getKillRegState(isKill))
|
||||||
.addFrameIndex(FrameIdx)
|
.addFrameIndex(FrameIdx)
|
||||||
|
|
@ -115,9 +159,30 @@ void W65816InstrInfo::loadRegFromStackSlot(MachineBasicBlock &MBB,
|
||||||
const TargetRegisterClass *RC,
|
const TargetRegisterClass *RC,
|
||||||
Register VReg, unsigned SubReg,
|
Register VReg, unsigned SubReg,
|
||||||
MachineInstr::MIFlag Flags) const {
|
MachineInstr::MIFlag Flags) const {
|
||||||
// Mirror image of storeRegToStackSlot: emit LDAfi, which the frame
|
// LDAfi only knows how to put the value in A. If regalloc asks for
|
||||||
// index pass turns into LDA d,S.
|
// a reload into X or Y, we have to bridge through A: LDA d,S then
|
||||||
|
// TAX / TAY. Without this, the MIR has `$x = LDAfi` but the asm
|
||||||
|
// printer emits just `LDA d,S` (which writes A, not X) — a silent
|
||||||
|
// miscompile that surfaced as i64 subtract chains using stale X
|
||||||
|
// values for the second word (caught by udivmod's `a - q*b` mod
|
||||||
|
// computation).
|
||||||
DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
|
DebugLoc DL = MI != MBB.end() ? MI->getDebugLoc() : DebugLoc();
|
||||||
|
if (DestReg == W65816::A) {
|
||||||
|
BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
|
||||||
|
.addFrameIndex(FrameIdx)
|
||||||
|
.addImm(0);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
if (DestReg == W65816::X || DestReg == W65816::Y) {
|
||||||
|
// Load via A, then transfer. A is implicitly clobbered.
|
||||||
|
BuildMI(MBB, MI, DL, get(W65816::LDAfi), W65816::A)
|
||||||
|
.addFrameIndex(FrameIdx)
|
||||||
|
.addImm(0);
|
||||||
|
unsigned XferOp = (DestReg == W65816::X) ? W65816::TAX : W65816::TAY;
|
||||||
|
BuildMI(MBB, MI, DL, get(XferOp));
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
// Fallback: assume A path (covers Acc16 / Wide16 vregs by class).
|
||||||
BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
|
BuildMI(MBB, MI, DL, get(W65816::LDAfi), DestReg)
|
||||||
.addFrameIndex(FrameIdx)
|
.addFrameIndex(FrameIdx)
|
||||||
.addImm(0);
|
.addImm(0);
|
||||||
|
|
|
||||||
|
|
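The reload bug above was reportedly caught by udivmod's `a - q*b` computation. A sketch of such a test follows; the 64-bit operand width, the out-parameter signature and the name `checkUdivmod` are assumptions, since the source names only the function and the expression.

```cpp
#include <cassert>
#include <cstdint>

// Assumed signature: quotient returned, remainder via out-parameter.
// The remainder is computed as a - q*b, the multi-word subtract chain
// that exposed stale X values after a reload.
__attribute__((noinline)) std::uint64_t udivmod(std::uint64_t a,
                                                std::uint64_t b,
                                                std::uint64_t *rem) {
  std::uint64_t q = a / b;
  *rem = a - q * b;
  return q;
}

// Usage sketch: operands chosen so more than one 16-bit word is non-zero.
void checkUdivmod() {
  std::uint64_t rem = 0;
  std::uint64_t q = udivmod(0x100000005ull, 0x10000ull, &rem);
  assert(q == 0x10000ull);
  assert(rem == 5);
}
```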
@ -70,6 +70,7 @@ def W65816pushx : SDNode<"W65816ISD::PUSH_X", SDTNone,
|
||||||
[SDNPHasChain, SDNPInGlue, SDNPOutGlue,
|
[SDNPHasChain, SDNPInGlue, SDNPOutGlue,
|
||||||
SDNPSideEffect, SDNPMayStore]>;
|
SDNPSideEffect, SDNPMayStore]>;
|
||||||
|
|
||||||
|
|
||||||
// SELECT_CC: takes (TVal, FVal, CC) plus a glue value carrying the
|
// SELECT_CC: takes (TVal, FVal, CC) plus a glue value carrying the
|
||||||
// flags from a preceding W65816cmp. Lowered by EmitInstrWithCustomInserter
|
// flags from a preceding W65816cmp. Lowered by EmitInstrWithCustomInserter
|
||||||
// into a CMP (already in the BB) + Bxx + diamond CFG + PHI.
|
// into a CMP (already in the BB) + Bxx + diamond CFG + PHI.
|
||||||
|
|
@ -1356,10 +1357,18 @@ def : Pat<(store
|
||||||
// function doesn't have to know how it was called to choose its
|
// function doesn't have to know how it was called to choose its
|
||||||
// return instruction. A pseudo bridges the i16 symbol operand
|
// return instruction. A pseudo bridges the i16 symbol operand
|
||||||
// to JSL_Long's 24-bit operand class.
|
// to JSL_Long's 24-bit operand class.
|
||||||
|
// Defs include DPF0 — every i64-returning libcall clobbers DP[$F0]
|
||||||
|
// (it's the carrier for the highest 16 bits of the return). The
|
||||||
|
// LowerCall side reads the post-call DPF0 via CopyFromReg(DPF0)
|
||||||
|
// glued to the call so the SDAG combiner / scheduler can't merge
|
||||||
|
// or reorder reads across calls. Without DPF0 in Defs, plain
|
||||||
|
// `getLoad(0xF0)` was being CSE'd across calls, leading to
|
||||||
|
// `dmath = (a+b)*(a-b)` returning 4 instead of 16.
|
||||||
let isCall = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0,
|
let isCall = 1, hasSideEffects = 0, mayLoad = 0, mayStore = 0,
|
||||||
Defs = [A] in {
|
Defs = [A, DPF0] in {
|
||||||
def JSLpseudo : W65816Pseudo<(outs), (ins i16imm:$dst),
|
def JSLpseudo : W65816Pseudo<(outs), (ins i16imm:$dst),
|
||||||
"# JSLpseudo $dst", []>;
|
"# JSLpseudo $dst", []>;
|
||||||
}
|
}
|
||||||
|
|
||||||
def : Pat<(W65816call (i16 tglobaladdr:$dst)), (JSLpseudo tglobaladdr:$dst)>;
|
def : Pat<(W65816call (i16 tglobaladdr:$dst)), (JSLpseudo tglobaladdr:$dst)>;
|
||||||
def : Pat<(W65816call (i16 texternalsym:$dst)), (JSLpseudo texternalsym:$dst)>;
|
def : Pat<(W65816call (i16 texternalsym:$dst)), (JSLpseudo texternalsym:$dst)>;
|
||||||
|
|
|
||||||
|
|
@ -40,6 +40,7 @@ class W65816MachineFunctionInfo : public MachineFunctionInfo {
|
||||||
/// STA8abs needs an SEP/REP wrap in M=0 to avoid a 2-byte store).
|
/// STA8abs needs an SEP/REP wrap in M=0 to avoid a 2-byte store).
|
||||||
bool UsesAcc8 = false;
|
bool UsesAcc8 = false;
|
||||||
|
|
||||||
|
|
||||||
public:
|
public:
|
||||||
W65816MachineFunctionInfo() = default;
|
W65816MachineFunctionInfo() = default;
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -89,6 +89,31 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) {
|
||||||
continue;
|
continue;
|
||||||
unsigned Disp = MI.getOperand(0).getImm() & 0xFF;
|
unsigned Disp = MI.getOperand(0).getImm() & 0xFF;
|
||||||
DebugLoc DL = MI.getDebugLoc();
|
DebugLoc DL = MI.getDebugLoc();
|
||||||
|
// X-liveness check: SpillToX may have stashed a value in X
|
||||||
|
// that's used after this rewrite. If so, save X to DP $E2
|
||||||
|
// (libcall scratch high half — $E0 is reserved for the A-save
|
||||||
|
// dance in eliminateCallFramePseudoInstr) and restore after.
|
||||||
|
// Walk forward from MI looking for an X use without a prior
|
||||||
|
// X def; if found, X is live and we must preserve it.
|
||||||
|
bool XLive = false;
|
||||||
|
for (auto Scan = std::next(MachineBasicBlock::iterator(&MI));
|
||||||
|
Scan != MBB.end(); ++Scan) {
|
||||||
|
if (Scan->isDebugInstr()) continue;
|
||||||
|
bool xDef = false;
|
||||||
|
for (const MachineOperand &MO : Scan->operands()) {
|
||||||
|
if (!MO.isReg()) continue;
|
||||||
|
if (MO.getReg() == W65816::X) {
|
||||||
|
if (MO.isUse()) { XLive = true; break; }
|
||||||
|
if (MO.isDef()) xDef = true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (XLive || xDef) break;
|
||||||
|
}
|
||||||
|
if (XLive) {
|
||||||
|
// Save X to DP $E2 (don't use $E0 — that's the A-preserve
|
||||||
|
// slot in call-frame teardown and may be live).
|
||||||
|
BuildMI(MBB, MI, DL, TII->get(W65816::STX_DP)).addImm(0xE2);
|
||||||
|
}
|
||||||
if (IsLDA) {
|
if (IsLDA) {
|
||||||
// LDA disp,S ; CLC ; ADC #neg ; TAX ; LDA $0000,X
|
// LDA disp,S ; CLC ; ADC #neg ; TAX ; LDA $0000,X
|
||||||
BuildMI(MBB, MI, DL, TII->get(W65816::LDA_StackRel))
|
BuildMI(MBB, MI, DL, TII->get(W65816::LDA_StackRel))
|
||||||
|
|
@ -127,6 +152,10 @@ bool W65816NegYIndY::runOnMachineFunction(MachineFunction &MF) {
|
||||||
.addImm(0)
|
.addImm(0)
|
||||||
.addReg(W65816::A, RegState::Implicit);
|
.addReg(W65816::A, RegState::Implicit);
|
||||||
}
|
}
|
||||||
|
if (XLive) {
|
||||||
|
// Restore X from DP $E2.
|
||||||
|
BuildMI(MBB, MI, DL, TII->get(W65816::LDX_DP)).addImm(0xE2);
|
||||||
|
}
|
||||||
// Erase original LDY and the (sr,s),Y op.
|
// Erase original LDY and the (sr,s),Y op.
|
||||||
if (LastLDY) { LastLDY->eraseFromParent(); LastLDY = nullptr; }
|
if (LastLDY) { LastLDY->eraseFromParent(); LastLDY = nullptr; }
|
||||||
MI.eraseFromParent();
|
MI.eraseFromParent();
|
||||||
|
|
|
||||||
|
|
@ -73,7 +73,30 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
|
||||||
bool NeedsCarryPrefix = false;
|
bool NeedsCarryPrefix = false;
|
||||||
bool IsSub = false;
|
bool IsSub = false;
|
||||||
switch (Opc) {
|
switch (Opc) {
|
||||||
case W65816::LDAfi: NewOpc = W65816::LDA_StackRel; break;
|
case W65816::LDAfi: {
|
||||||
|
// LDAfi targets A. If the regalloc parked the dest in X or Y
|
||||||
|
// (which can happen via Idx16 vreg coalescing), bridge through A
|
||||||
|
// by appending a TAX / TAY.
|
||||||
|
Register Dst = MI.getOperand(0).getReg();
|
||||||
|
int FI = MI.getOperand(FIOperandNum).getIndex();
|
||||||
|
int FrameOffset = MFI.getObjectOffset(FI);
|
||||||
|
int ImmOffset = MI.getOperand(FIOperandNum + 1).getImm();
|
||||||
|
int Offset = FrameOffset + ImmOffset + (int)MFI.getStackSize() + SPAdj;
|
||||||
|
if (FrameOffset < 0) Offset += 1;
|
||||||
|
if (Offset < 0 || Offset > 0xFF)
|
||||||
|
report_fatal_error("W65816: frame offset out of stack-relative range");
|
||||||
|
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
|
||||||
|
TII.get(W65816::LDA_StackRel))
|
||||||
|
.addImm(Offset)
|
||||||
|
.addReg(W65816::A, RegState::ImplicitDefine);
|
||||||
|
if (Dst == W65816::X) {
|
||||||
|
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAX));
|
||||||
|
} else if (Dst == W65816::Y) {
|
||||||
|
BuildMI(*MI.getParent(), II, MI.getDebugLoc(), TII.get(W65816::TAY));
|
||||||
|
}
|
||||||
|
MI.eraseFromParent();
|
||||||
|
return true;
|
||||||
|
}
|
||||||
case W65816::STAfi: {
|
case W65816::STAfi: {
|
||||||
// Wide16-source STAfi: if the source ended up in IMGn (DP-backed),
|
// Wide16-source STAfi: if the source ended up in IMGn (DP-backed),
|
||||||
// prepend LDA dp so the value reaches A before the actual store.
|
// prepend LDA dp so the value reaches A before the actual store.
|
||||||
|
|
@ -108,6 +131,12 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II,
|
||||||
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
|
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
|
||||||
TII.get(W65816::LDA_DP)).addImm(srcDP);
|
TII.get(W65816::LDA_DP)).addImm(srcDP);
|
||||||
}
|
}
|
||||||
|
// Note: STAfi with X or Y source is NOT supported here — adding a
|
||||||
|
// TXA/TYA pre-bracket would clobber A which a downstream STAfi $a
|
||||||
|
// may still need (the prologue stashes arg0_lo from A and arg0_ml
|
||||||
|
// from X via two adjacent STAfi, and putting A's STA *before* X's
|
||||||
|
// is the caller's responsibility). storeRegToStackSlot already
|
||||||
|
// bridges X/Y → A for spills it generates.
|
||||||
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
|
BuildMI(*MI.getParent(), II, MI.getDebugLoc(),
|
||||||
TII.get(W65816::STA_StackRel))
|
TII.get(W65816::STA_StackRel))
|
||||||
.addImm(Offset)
|
.addImm(Offset)
|
||||||
|
|
|
||||||
|
|
@ -55,6 +55,15 @@ def IMG5 : W65816Reg<21, "img5">, DwarfRegNum<[21]>;
|
||||||
def IMG6 : W65816Reg<22, "img6">, DwarfRegNum<[22]>;
|
def IMG6 : W65816Reg<22, "img6">, DwarfRegNum<[22]>;
|
||||||
def IMG7 : W65816Reg<23, "img7">, DwarfRegNum<[23]>;
|
def IMG7 : W65816Reg<23, "img7">, DwarfRegNum<[23]>;
|
||||||
|
|
||||||
|
// DPF0 — pseudo-physreg modeling the i16 storage at DP $F0..$F1.
|
||||||
|
// Used as the carrier for the highest 16 bits of an i64/double
|
||||||
|
// return. JSLpseudo Defs DPF0 so the SDAG combiner / scheduler
|
||||||
|
// can't merge or reorder reads of it across calls; we plumb the
|
||||||
|
// 4th return half via CopyFromReg(DPF0) in LowerCall, which lowers
|
||||||
|
// to `LDA $F0` via copyPhysReg. Never allocated to a vreg —
|
||||||
|
// always a transient bridge between DP[$F0] and A.
|
||||||
|
def DPF0 : W65816Reg<24, "dpf0">, DwarfRegNum<[24]>;
|
||||||
|
|
||||||
//===----------------------------------------------------------------------===//
|
//===----------------------------------------------------------------------===//
|
||||||
// Register Classes
|
// Register Classes
|
||||||
//===----------------------------------------------------------------------===//
|
//===----------------------------------------------------------------------===//
|
||||||
|
|
@ -90,6 +99,13 @@ def Wide16 : RegisterClass<"W65816", [i16], 16,
|
||||||
|
|
||||||
def PtrRegs : RegisterClass<"W65816", [i16], 16, (add SP)>;
|
def PtrRegs : RegisterClass<"W65816", [i16], 16, (add SP)>;
|
||||||
|
|
||||||
|
// Single-register class for DPF0, the i64-return high-half carrier.
|
||||||
|
// Not allocatable: only used as a CopyFromReg source in LowerCall and
|
||||||
|
// a CopyToReg destination in LowerReturn (`LDA $F0` / `STA $F0`).
|
||||||
|
def DPF0Reg : RegisterClass<"W65816", [i16], 16, (add DPF0)> {
|
||||||
|
let isAllocatable = 0;
|
||||||
|
}
|
||||||
|
|
||||||
// Single-register class for the processor status register, used for condition
|
// Single-register class for the processor status register, used for condition
|
||||||
// code modeling. Not currently allocatable.
|
// code modeling. Not currently allocatable.
|
||||||
def StatusReg : RegisterClass<"W65816", [i8], 8, (add P)> {
|
def StatusReg : RegisterClass<"W65816", [i8], 8, (add P)> {
|
||||||
|
|
|
||||||
|
|
@ -1217,6 +1217,13 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) {
|
||||||
}
|
}
|
||||||
if (MI.isCall()) break;
|
if (MI.isCall()) break;
|
||||||
if (MI.modifiesRegister(W65816::Y, TRI)) break;
|
if (MI.modifiesRegister(W65816::Y, TRI)) break;
|
||||||
|
// killsRegister: an instruction with `implicit killed $y` USES Y
|
||||||
|
// and that's the LAST use — Y is dead after. We must NOT treat
|
||||||
|
// a subsequent LDY_Imm16 #N as redundant after a kill, because
|
||||||
|
// the held value is conceptually gone. Caught by `addOff(p,i)
|
||||||
|
// { p[i-1] += p[i]; }` where LDY -2 ; LDA_indY (kills Y) ; ... ;
|
||||||
|
// LDY -2 ; STA_indY needs the second LDY to reinitialize Y.
|
||||||
|
if (MI.killsRegister(W65816::Y, TRI)) break;
|
||||||
if (MI.isInlineAsm() || MI.isBranch() || MI.isReturn()) break;
|
if (MI.isInlineAsm() || MI.isBranch() || MI.isReturn()) break;
|
||||||
++It;
|
++It;
|
||||||
}
|
}
|
||||||
|
|
|
||||||
|
|
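The `addOff` case quoted in the comment can be written out as a small test. The `int` element type, the `void` return and the buffer contents are assumptions; the source gives only `addOff(p,i) { p[i-1] += p[i]; }`. The read-modify-write of `p[i-1]` is what makes the load and the store both need the same negative Y offset.

```cpp
#include <cassert>

// Assumed types: the comment does not state them.
__attribute__((noinline)) void addOff(int *p, int i) {
  p[i - 1] += p[i];
}

// Usage sketch: only p[i-1] should change.
void checkAddOff() {
  int buf[3] = {10, 20, 30};
  addOff(buf, 2);
  assert(buf[0] == 10);
  assert(buf[1] == 50);
  assert(buf[2] == 30);
}
```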
@ -14,6 +14,7 @@
|
||||||
#include "W65816.h"
|
#include "W65816.h"
|
||||||
#include "W65816MachineFunctionInfo.h"
|
#include "W65816MachineFunctionInfo.h"
|
||||||
#include "TargetInfo/W65816TargetInfo.h"
|
#include "TargetInfo/W65816TargetInfo.h"
|
||||||
|
#include "llvm/CodeGen/MachineCSE.h"
|
||||||
#include "llvm/CodeGen/Passes.h"
|
#include "llvm/CodeGen/Passes.h"
|
||||||
#include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
|
#include "llvm/CodeGen/TargetLoweringObjectFileImpl.h"
|
||||||
#include "llvm/CodeGen/TargetPassConfig.h"
|
#include "llvm/CodeGen/TargetPassConfig.h"
|
||||||
|
|
@ -82,16 +83,19 @@ public:
|
||||||
void addPreRegAlloc() override;
|
void addPreRegAlloc() override;
|
||||||
void addPostRegAlloc() override;
|
void addPostRegAlloc() override;
|
||||||
void addPreEmitPass() override;
|
void addPreEmitPass() override;
|
||||||
|
void addMachineSSAOptimization() override;
|
||||||
|
|
||||||
// W65816's only 16-bit ALU register is A. We use fast regalloc by
|
// W65816's only 16-bit ALU register is A. Greedy at -O1+ produces
|
||||||
// default — always succeeds, ~30-50% bigger code than greedy in
|
// tight code; at -O0 (where optnone disables coalescing/CSE), greedy
|
||||||
// pathological cases but correctness is paramount. Greedy fails
|
// leaves spurious COPY pseudos that lower to STA dp / LDA dp pairs
|
||||||
// outright on functions with 4+ simultaneously live i16 vregs (heap
|
// around modify-in-place ops (e.g. INA), miscompiling a + 1. Use
|
||||||
// sift etc.). TiedDefSpill (pre-RA) handles the tied-def-multi-use
|
// fast regalloc when the target framework signals unoptimized.
|
||||||
// hazard for the sub-pattern that's frequent enough to matter.
|
// TiedDefSpill (pre-RA) handles the tied-def-multi-use hazard for
|
||||||
|
// the sub-pattern that's frequent enough to matter at -O1+.
|
||||||
//
|
//
|
||||||
FunctionPass *createTargetRegisterAllocator(bool /*Optimized*/) override {
|
FunctionPass *createTargetRegisterAllocator(bool Optimized) override {
|
||||||
return createGreedyRegisterAllocator();
|
return Optimized ? createGreedyRegisterAllocator()
|
||||||
|
: createFastRegisterAllocator();
|
||||||
}
|
}
|
||||||
};
|
};
|
||||||
|
|
||||||
|
|
@ -101,6 +105,24 @@ TargetPassConfig *W65816TargetMachine::createPassConfig(PassManagerBase &PM) {
|
||||||
return new W65816PassConfig(*this, PM);
|
return new W65816PassConfig(*this, PM);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
void W65816PassConfig::addMachineSSAOptimization() {
|
||||||
|
// MachineCSE incorrectly eliminates "redundant" CMP instructions when
|
||||||
|
// it sees an earlier identical CMP elsewhere in the function — the
|
||||||
|
// P (status) flag is considered "available", but on this target P is
|
||||||
|
// clobbered by every intervening LDA/STA/ADC, so the surviving Bxx
|
||||||
|
// ends up dispatching on stale flags. We don't model `Uses=[P]` on
|
||||||
|
// Bxx because doing so causes regalloc/layout shifts that uncovered
|
||||||
|
// a different latent bug in vprintf. Disabling the pass entirely
|
||||||
|
// is the lower-cost workaround until the Bxx-Uses=[P] regression is
|
||||||
|
// root-caused. Caught by `printf("%d", n)` returning 0.
|
||||||
|
//
|
||||||
|
// Other SSA opts (early-tailduplication, opt-phis, dead-mi-elim,
|
||||||
|
// licm, machine-sink, peephole-opt, etc.) still run by chaining
|
||||||
|
// through the default impl — we just skip MachineCSE.
|
||||||
|
disablePass(&MachineCSELegacyID);
|
||||||
|
TargetPassConfig::addMachineSSAOptimization();
|
||||||
|
}
|
||||||
|
|
||||||
void W65816PassConfig::addPreRegAlloc() {
|
void W65816PassConfig::addPreRegAlloc() {
|
||||||
addPass(createW65816ABridgeViaX());
|
addPass(createW65816ABridgeViaX());
|
||||||
addPass(createW65816TiedDefSpill());
|
addPass(createW65816TiedDefSpill());
|
||||||
|
|
@ -125,7 +147,11 @@ void W65816PassConfig::addPreEmitPass() {
|
||||||
addPass(createW65816SpillToX());
|
addPass(createW65816SpillToX());
|
||||||
// Rewrite negative-Y indirect-Y stack-rel ops. Must run BEFORE
|
// Rewrite negative-Y indirect-Y stack-rel ops. Must run BEFORE
|
||||||
// BranchExpand because the rewrite expands one instruction into
|
// BranchExpand because the rewrite expands one instruction into
|
||||||
// several and shifts branch distances.
|
// several and shifts branch distances. The pass internally checks
|
||||||
|
// X-liveness and saves/restores X via DP $E2 when SpillToX has
|
||||||
|
// a value parked there; without that check, the rewrite's TAX
|
||||||
|
// would clobber spill-bridged values (caught by `addOff(p,i) {
|
||||||
|
// p[i-1] += p[i]; }` returning p[i-1] + &p[i-1] instead of +b).
|
||||||
addPass(createW65816NegYIndY());
|
addPass(createW65816NegYIndY());
|
||||||
// Branch expansion runs after that so the BRA introduced for long
|
// Branch expansion runs after that so the BRA introduced for long
|
||||||
// conditional branches gets seen by SepRepCleanup (which can
|
// conditional branches gets seen by SepRepCleanup (which can
|
||||||
|
|
|
||||||
|
|
@ -118,6 +118,11 @@ bool W65816TiedDefSpill::runOnMachineFunction(MachineFunction &MF) {
|
||||||
// Only pre-RA: skip if vregs are already gone.
|
// Only pre-RA: skip if vregs are already gone.
|
||||||
if (!MF.getRegInfo().getNumVirtRegs())
|
if (!MF.getRegInfo().getNumVirtRegs())
|
||||||
return false;
|
return false;
|
||||||
|
// At -O0/optnone, the spill+reload pattern this pass introduces
|
||||||
|
// doesn't get coalesced and ends up wasting frame space without
|
||||||
|
// helping greedy. Same skip rationale as WidenAcc16.
|
||||||
|
if (MF.getFunction().hasOptNone())
|
||||||
|
return false;
|
||||||
|
|
||||||
MachineRegisterInfo &MRI = MF.getRegInfo();
|
MachineRegisterInfo &MRI = MF.getRegInfo();
|
||||||
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
|
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
|
||||||
|
|
|
||||||
|
|
@ -119,6 +119,13 @@ static bool allUsesAcceptWide(Register VReg,
|
||||||
|
|
||||||
bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
|
bool W65816WidenAcc16::runOnMachineFunction(MachineFunction &MF) {
|
||||||
if (!MF.getRegInfo().getNumVirtRegs()) return false;
|
if (!MF.getRegInfo().getNumVirtRegs()) return false;
|
||||||
|
// At -O0 / optnone, register coalescing doesn't run, so the COPY we
|
||||||
|
// insert to bridge Acc16 → Wide16 doesn't get folded; instead it
|
||||||
|
// forces wide16 spills through DP-mapped slots that collide and
|
||||||
|
// produce miscompiles around modify-in-place ops (lda dp; inc a;
|
||||||
|
// sta dp; lda dp reads pre-inc value). The promotion is purely a
|
||||||
|
// performance optimization, so skip it for optnone functions.
|
||||||
|
if (MF.getFunction().hasOptNone()) return false;
|
||||||
MachineRegisterInfo &MRI = MF.getRegInfo();
|
MachineRegisterInfo &MRI = MF.getRegInfo();
|
||||||
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
|
const W65816Subtarget &STI = MF.getSubtarget<W65816Subtarget>();
|
||||||
const W65816InstrInfo *TII = STI.getInstrInfo();
|
const W65816InstrInfo *TII = STI.getInstrInfo();
|
||||||
|
|
|
||||||