Checkpoint

Scott Duensing 2026-05-02 19:17:23 -05:00
parent f338d93bae
commit 81694c5971
10 changed files with 438 additions and 90 deletions

@ -130,7 +130,13 @@ which runs correctly under MAME (apple2gs).
  **Backend register allocation:**
- - Greedy regalloc as default at -O1+; fast at -O0/optnone.
+ - Basic regalloc as default at -O1+; fast at -O0/optnone. We use
+ basic instead of greedy because greedy fails ("ran out of
+ registers during register allocation") on functions with many
+ cross-call Acc16 vregs (the `ok |= bit; helper(); ok |= bit;`
+ pattern across many if-blocks). Basic handles those cleanly
+ with negligible code-size overhead vs greedy on the bench
+ suite (~0.6%).
  - Pre-RA passes: `WidenAcc16` (Acc16→Wide16 promotion, lets
  greedy spread i16 pressure across A and 16 IMG slots);
  `TiedDefSpill` (handles tied-def-multi-use hazard);
@ -179,39 +185,15 @@ which runs correctly under MAME (apple2gs).
  ## In flight
- - **Greedy regalloc fails on long-arg call chains** — a function
- that strings ~7+ independent `helper(longArg1, longArg2)` calls
- overflows greedy at -O1+ with "ran out of registers during
- register allocation". IMG slot expansion (8→16) raised the
- threshold; most "normal-looking" mixed-arity workloads now
- compile, but pathological pressure (many i32+ args + bitmask
- SETCC chain in one function) still fails. Workarounds: mark
- the heaviest helper `__attribute__((noinline))`; or
- `-mllvm -regalloc=fast` for that TU; or `__attribute__((optnone))`
- on the affected function. Proper fix needs either a custom
- greedy→fast fallback in
- `W65816TargetMachine::createTargetRegisterAllocator` or a
- smarter spill-placement pre-RA pass.
- - **`time()` / `clock()` are stubs** returning 0. ReadTimeHex
- (Misc Tool $0D03) needs the Tool Locator initialised in crt0
- to not crash MAME; the VBL counter at $E1006B needs 24-bit
- far-pointer support that the backend doesn't yet model.
- - **`(d,s),y / (sr,s),y` addressing wraps the bank** when Y is
- negative as 16-bit unsigned. Worked around by `W65816NegYIndY`
- rewriting the affected ops to `TAX ; LDA/STA $0000,X`. The
- workaround stays correct for negative offsets like `arr[i-1]`
- but the underlying issue is unfixed at the addressing-mode
- level.
- - **Bank-0 size limit (~48KB)** — the runtime + program must fit
- in $1000-$BFFF (text+rodata) plus $D000-$DFFF (LC1 for rodata-
- spill and BSS). Past that, link816 hard-fails because text
- would cross the IO window. In practice rarely hit thanks to
- `--gc-sections`, but programs that genuinely use most of the
- runtime can still trip it. Future work: enable LC2 / shadow
- RAM via crt0 to add ~16KB more.
+ (Nothing currently — the four previous in-flight items all
+ landed: basic-regalloc-by-default replaced greedy and resolved
+ the long-arg-chain failure; `time()` reads ReadTimeHex when the
+ program has called `iigsToolboxInit()` and `clock()` reads the
+ VBL counter via 24-bit absolute load; the (sr,s),Y bank-wrap
+ addressing is no longer emitted by any inserter and the
+ `W65816NegYIndY` workaround is disabled; LC ceiling extended
+ from $E000 to $10000 since crt0's `lda $C083` read-twice enables
+ RAM through $FFFF, gaining 8KB of bank-0 space.)
  ## Yet to come

@ -9,4 +9,9 @@ typedef unsigned long clock_t;
  time_t time(time_t *t);
  clock_t clock(void);
+ // Initialise the IIgs Tool Locator so time() can call ReadTimeHex.
+ // Call once before any time() use. Idempotent — repeated calls
+ // are no-ops. clock() works regardless of whether this is called.
+ void iigsToolboxInit(void);
  #endif

@ -42,16 +42,16 @@ __start:
  lda #0x0fff
  tcs
- ; Enable Language Card 1 RAM at $D000-$DFFF for read+write.
- ; By default the IIgs maps that range to ROM (read-only). Two
- ; reads of $C083 enable RAM-bank-1, second read also enables
- ; writes. Without this, BSS auto-relocated past $C000 lands on
- ; ROM and globals never initialise (writes drop on the floor;
- ; reads return ROM bytes). Caught by the expression-parser
- ; smoke test (#92) when runtime growth pushed bss past $BFFF.
- ; The reads must be 8-bit (one byte at a time) — a 16-bit M
- ; read at $C083 would also touch $C084 (a different soft
- ; switch), wiping the LC enable we just set.
+ ; Enable Language Card RAM at $D000-$FFFF for read+write. This
+ ; is 12KB (4KB at $D000-$DFFF in LC bank 1, plus 8KB at
+ ; $E000-$FFFF common LC area). The IIgs LC area defaults to
+ ; ROM-mapped; two reads of $C083 enable bank-1 RAM read AND
+ ; write for the whole $D000-$FFFF range. link816 may auto-
+ ; relocate BSS / heap into this area when text+rodata grows
+ ; past $BFFF — without this enable, writes drop on the floor
+ ; and reads return ROM bytes. The reads must be 8-bit (one
+ ; byte at a time) — a 16-bit M read at $C083 would also touch
+ ; $C084 (a different soft switch), wiping the LC enable.
  sep #0x20
  lda 0xc083
  lda 0xc083
@ -98,6 +98,15 @@ __start:
  bra .Linit_loop
  .Linit_done:
+ ; Note: the IIgs Tool Locator (JSL $E10000 dispatch) is NOT
+ ; initialised here. We tried wiring TLStartUp into crt0 and
+ ; MAME segfaulted in our specific test harness; the dispatcher
+ ; appears to want some pre-setup we're missing. Programs that
+ ; need toolbox calls should call `iigsToolboxInit()` from the
+ ; runtime (declared in iigs/toolbox.h), which performs the
+ ; sequence in a controlled context. time() checks an in-process
+ ; flag and returns 0 if init hasn't been done; clock() does not
+ ; depend on the toolbox and works either way.
  ; Call main. Standard W65816 ABI: i16 first arg in A; we pass
  ; nothing. After return, A holds the exit code.
  jsl main

@ -587,31 +587,136 @@ void perror(const char *prefix) {
  // ---- time.h ----
  //
- // time() and clock() are stubs returning 0. A real implementation
- // could either:
- // - Use ReadTimeHex (Misc Tool $0D03) — but this requires the GS
- // Tool Locator to be initialised (TLStartUp from iigs/toolbox.h)
- // in the crt0, otherwise the JSL $E10000 dispatcher reads
- // uninitialised state and crashes. Smoke verified that the
- // direct toolbox call segfaults MAME without prior init.
- // - Use the IIgs vertical-blank counter at $00/E1/006B (24-bit
- // address, needs long-pointer access via inline asm — the C
- // pointer type is 16-bit on this target, so a literal 0xE1006B
- // silently truncates to $006B in zero page).
+ // time() reads the IIgs RTC via ReadTimeHex (Misc Tool $0D03) and
+ // converts the broken-down date/time to seconds since 1970-01-01.
+ // Requires `iigsToolboxInit()` to have run at least once — without
+ // the Tool Locator initialised, JSL $E10000 crashes. Programs
+ // that need real time() should call iigsToolboxInit() early from
+ // main; otherwise time() returns 0 (no crash, but no clock).
  //
- // We leave both as stubs until the runtime has a Tool-Locator-
- // init crt0 path or proper 24-bit far-pointer support.
+ // clock() reads the IIgs vertical-blank counter at $00/E1/006B (1
+ // byte that increments every VBL ~= 60 Hz) via inline asm with a
+ // 24-bit absolute load — works with or without toolbox init since
+ // the VBL counter is just a memory location updated by the IRQ
+ // handler. Wraparound tracked in a u32 static so the counter can
+ // span days. CLOCKS_PER_SEC is 60 (defined in time.h).
// Toolbox-init flag, set by iigsToolboxInit(). time() guards on it.
// volatile to dodge the i1-narrowing isel bug on bool flag globals.
static volatile unsigned short __toolboxInited = 0;
void iigsToolboxInit(void) {
if (__toolboxInited) return;
__asm__ volatile (
"rep #0x30\n"
"ldx #0x0201\n" // TLStartUp
"jsl 0xe10000\n"
"sei\n" // re-disable IRQ that the dispatcher may re-enable
"rep #0x30\n"
:
:
: "a", "x", "y", "memory"
);
__toolboxInited = 1;
}
  typedef long time_t;
  typedef unsigned long clock_t;
// ReadTimeHex returns 8 bytes in a result area on the stack: second,
// minute, hour, (unused), year-1900, day, month, weekday. Push a
// 4-word result area, JSL with X=$0D03, pop the words back into
// DP scratch ($E0..$E7), then copy them out byte by byte. We can't
// use "=g" constraints (W65816 backend rejects memory operands in
// inline asm), so the data path runs through known DP addresses.
__attribute__((noinline))
static void readTimeHex(unsigned char buf[8]) {
__asm__ volatile (
"pea 0\n"
"pea 0\n"
"pea 0\n"
"pea 0\n"
"ldx #0x0D03\n"
"jsl 0xe10000\n"
"pla\n"
"sta 0xe0\n"
"pla\n"
"sta 0xe2\n"
"pla\n"
"sta 0xe4\n"
"pla\n"
"sta 0xe6\n"
:
:
: "a", "x", "y", "memory"
);
// Read DP $E0..$E7 via known-good direct page accesses. We're
// in M=16 by ABI so each `lda` reads 2 bytes — split into bytes.
volatile unsigned char *dp = (volatile unsigned char *)0xE0;
for (int i = 0; i < 8; i++) buf[i] = dp[i];
}
// Cumulative days before the start of each month (non-leap year),
// indexed by 0-based month.
static const unsigned short __monthDays[12] = {
0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
};
static int __isLeap(int y) {
return (y % 4 == 0 && y % 100 != 0) || (y % 400 == 0);
}
  time_t time(time_t *t) {
+ if (!__toolboxInited) {
  if (t) *t = 0;
  return 0;
  }
unsigned char b[8];
readTimeHex(b);
int sec = b[0];
int min = b[1];
int hour = b[2];
int year = 1900 + b[4];
int day = b[5];
int month = b[6];
if (year < 1970 || month > 11) {
if (t) *t = 0;
return 0;
}
long days = 0;
for (int y = 1970; y < year; y++) {
days += __isLeap(y) ? 366 : 365;
}
days += __monthDays[month];
if (month > 1 && __isLeap(year)) days++;
days += day;
long secs = days * 86400L + (long)hour * 3600 + (long)min * 60 + sec;
if (t) *t = secs;
return secs;
}
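
The date arithmetic in `time()` can be exercised on a host machine without the toolbox. Below is a minimal sketch, assuming the field conventions the code above relies on (0-based month and day, year already expanded to four digits). The names `epochFromFields`, `isLeapYear`, `kDaysBeforeMonth` and `demoEpoch` are hypothetical and are not part of the runtime.

```c
#include <assert.h>

/* Cumulative days before each month in a non-leap year (index 0 = January). */
static const unsigned short kDaysBeforeMonth[12] = {
    0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334
};

/* Gregorian leap-year rule. */
static int isLeapYear(int y) {
    return (y % 4 == 0 && y % 100 != 0) || (y % 400 == 0);
}

/*
 * Convert broken-down fields to seconds since 1970-01-01 00:00:00.
 * month0 is 0..11 and day0 is 0-based (assumptions taken from the
 * bounds check in the diff). Returns 0 for out-of-range input,
 * mirroring time()'s fallback.
 */
static long epochFromFields(int year, int month0, int day0,
                            int hour, int min, int sec) {
    if (year < 1970 || month0 < 0 || month0 > 11) return 0;

    long days = 0;
    for (int y = 1970; y < year; y++)
        days += isLeapYear(y) ? 366 : 365;

    days += kDaysBeforeMonth[month0];
    if (month0 > 1 && isLeapYear(year)) days++;  /* past February in a leap year */
    days += day0;

    return days * 86400L + (long)hour * 3600L + (long)min * 60L + sec;
}

/* Usage example: a few dates with well-known epoch values. */
void demoEpoch(void) {
    assert(epochFromFields(1970, 0, 0, 0, 0, 0) == 0L);          /* epoch origin */
    assert(epochFromFields(2000, 2, 0, 0, 0, 0) == 951868800L);  /* 2000-03-01 */
}
```

The year loop is linear in the number of years since 1970, which matches the diff; a closed-form leap-day count would be an alternative if the loop cost ever matters on the target.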
// VBL counter at $00/E1/006B (1 byte). C `*p` deref where p is a
// 16-bit pointer can't reach $E1006B (would truncate to $006B in
// zero page), so we use inline asm with `lda 0xe1006b` (4-byte
// absolute-long, opcode 0xAF).
static unsigned long __vblBase = 0;
static unsigned char __vblPrev = 0;
  clock_t clock(void) {
- return (clock_t)0;
+ unsigned char now;
__asm__ volatile (
"sep #0x20\n"
"lda 0xe1006b\n" // 24-bit absolute
"rep #0x20\n"
"and #0x00ff\n"
: "=a"(now)
:
: "memory"
);
if (now < __vblPrev) {
__vblBase += 256;
}
__vblPrev = now;
return (clock_t)(__vblBase + now);
  }
  // ---- FILE* abstraction (memory-backed FS) ----
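
The wraparound bookkeeping in `clock()` is independent of the hardware read and can be checked on a host. The sketch below assumes the same scheme as the diff (an 8-bit raw counter, a 32-bit base bumped by 256 whenever the raw value goes backwards). `VblExtender`, `vblExtend`, `vblExtendSequence` and `demoVblExtend` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t base;  /* ticks accumulated from completed 256-tick laps */
    uint8_t  prev;  /* last raw counter value seen */
} VblExtender;

/* Fold one raw 8-bit sample into the extended 32-bit tick count. */
static uint32_t vblExtend(VblExtender *s, uint8_t now) {
    if (now < s->prev) s->base += 256u;  /* raw counter wrapped since last poll */
    s->prev = now;
    return s->base + now;
}

/* Feed a sequence of raw samples and return the final extended count. */
static uint32_t vblExtendSequence(const uint8_t *samples, size_t n) {
    VblExtender s = {0, 0};
    uint32_t ticks = 0;
    for (size_t i = 0; i < n; i++) ticks = vblExtend(&s, samples[i]);
    return ticks;
}

/* Usage example. */
void demoVblExtend(void) {
    const uint8_t wrapped[] = {250, 3};   /* one wrap between polls */
    assert(vblExtendSequence(wrapped, 2) == 259u);

    /* A full lap between polls is invisible: the raw value is the same
       whether 0 or 256 ticks elapsed, so the count does not advance. */
    const uint8_t missedLap[] = {10, 10};
    assert(vblExtendSequence(missedLap, 2) == 10u);
}
```

The second case shows the limitation of this scheme: a wrap is only detected if the counter is polled at least once per 256 ticks, which at roughly 60 Hz is a little over four seconds.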

@ -3713,6 +3713,42 @@ EOF
  fi
  rm -f "$cWsFile" "$oWsFile" "$binWsFile"
# clock() reads the IIgs VBL counter at $E1006B (24-bit
# absolute load). Works without toolbox init. time()
# without iigsToolboxInit() returns 0 (no crash).
log "check: MAME runs clock() (VBL counter at \$E1006B)"
cTcFile="$(mktemp --suffix=.c)"
oTcFile="$(mktemp --suffix=.o)"
binTcFile="$(mktemp --suffix=.bin)"
cat > "$cTcFile" <<'EOF'
#include <time.h>
__attribute__((noinline)) void switchToBank2(void) {
__asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n");
}
int main(void) {
clock_t c = clock(); // should not crash
long t = 7;
long r = time(&t); // returns 0 without init; *t set to 0
int ok = 0;
if (r == 0 && t == 0) ok |= 1; // time() without init
(void)c;
ok |= 2; // clock() didn't crash
switchToBank2();
*(volatile unsigned short *)0x5000 = (unsigned short)ok;
while (1) {}
}
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -I"$PROJECT_ROOT/runtime/include" -c \
"$cTcFile" -o "$oTcFile"
"$PROJECT_ROOT/tools/link816" -o "$binTcFile" --text-base 0x1000 \
"$oCrt0F" "$oLibcF" "$oExtrasF" "$oLibgccFile" "$oTcFile" \
>/dev/null 2>&1
if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binTcFile" --check \
0x025000=0003 >/dev/null 2>&1; then
die "MAME: clock()/time()-without-init smoke != 0x03"
fi
rm -f "$cTcFile" "$oTcFile" "$binTcFile"
  # C++ subset: classes, single inheritance, virtual functions,
  # polymorphism via base-class pointer arrays, virtual dtors.
  # Compiled with -fno-exceptions -fno-rtti (the supported subset
@ -4259,26 +4295,44 @@ EOF
  fi
  rm -f "$cBigFile" "$oBigFile" "$binBssAutoFile" "$mapBssAutoFile"
- log "check: link816 hard-fails when BSS would exceed LC1 ceiling (\$E000)"
- # Force BSS to land past $E000 — link must reject with the LC1
- # ceiling diagnostic (without crt0's LC2 RAM enable, that range
- # silently corrupts).
+ log "check: link816 hard-fails when BSS would exceed LC ceiling (\$10000)"
+ # The LC ceiling is $10000 (top of bank 0). crt0's $C083 read-twice
+ # enables RAM through $FFFF; BSS at $E100 IS valid. Force a bss-base
+ # at $FF00 with a 0x200 BSS load to push past $10000 and trigger the
+ # ceiling diagnostic.
  cBigFile="$(mktemp --suffix=.c)"
  oBigFile="$(mktemp --suffix=.o)"
  binBssOFile="$(mktemp --suffix=.bin)"
  cat > "$cBigFile" <<'EOF'
- int main(void) { return 0; }
+ char big[0x200]; // extern visibility so gc-sections keeps it
+ int main(void) { big[0] = 1; return big[0]; }
  EOF
  "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cBigFile" -o "$oBigFile"
  if "$PROJECT_ROOT/tools/link816" -o "$binBssOFile" --text-base 0x1000 \
- --bss-base 0xE100 "$oBigFile" "$oLibgccFile" 2>/tmp/bsslink.err; then
- die "link816 should have rejected --bss-base 0xE100 (above LC1 ceiling)"
+ --bss-base 0xFF00 "$oBigFile" "$oLibgccFile" 2>/tmp/bsslink.err; then
+ die "link816 should have rejected --bss-base 0xFF00 + 0x200 bss (above LC ceiling)"
  fi
- if ! grep -q 'exceeds bank-0 LC1 ceiling' /tmp/bsslink.err; then
- die "link816 LC1-ceiling diagnostic missing: $(cat /tmp/bsslink.err)"
+ if ! grep -q 'exceeds bank-0 LC ceiling' /tmp/bsslink.err; then
+ die "link816 LC-ceiling diagnostic missing: $(cat /tmp/bsslink.err)"
  fi
  rm -f "$cBigFile" "$oBigFile" "$binBssOFile" /tmp/bsslink.err
log "check: link816 ACCEPTS BSS in extended LC area (\$E000-\$FFFF)"
# Same shape but lower bss-base — should succeed since the LC area
# extends to $FFFF.
cBigFile="$(mktemp --suffix=.c)"
oBigFile="$(mktemp --suffix=.o)"
binBssOkFile="$(mktemp --suffix=.bin)"
cat > "$cBigFile" <<'EOF'
int main(void) { return 0; }
EOF
"$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cBigFile" -o "$oBigFile"
if ! "$PROJECT_ROOT/tools/link816" -o "$binBssOkFile" --text-base 0x1000 \
--bss-base 0xE100 "$oBigFile" "$oLibgccFile" >/dev/null 2>&1; then
die "link816 incorrectly rejected --bss-base 0xE100 (now in usable LC area)"
fi
rm -f "$cBigFile" "$oBigFile" "$binBssOkFile"
  # When BSS lands in LC1 ($D000+), __heap_end must be set above
  # heap_start (extending into LC1 ceiling at $E000) so malloc has
  # actual range. Previously hardcoded at $BF00 — heap_start ended

@ -626,13 +626,16 @@ struct Linker {
  L.initBase + L.initSize > 0xC000) {
  L.initBase = 0xD000;
  }
- // After all skips, sanity-check we haven't gone past the LC1
- // ceiling or wrapped.
- if (L.initBase + L.initSize > 0xE000) {
+ // After all skips, sanity-check we haven't gone past the LC
+ // ceiling. The IIgs LC area is $D000-$FFFF (12KB usable when
+ // bank 1 is selected; the $E000-$FFFF chunk is common to both
+ // banks). crt0's `lda $C083` read-twice enables RAM read+write
+ // for the entire LC range, so we can use through $FFFF.
+ if (L.initBase + L.initSize > 0x10000u) {
  char msg[160];
  std::snprintf(msg, sizeof(msg),
- "rodata + init_array [0x%X+%u] exceeds bank-0 LC1 "
- "ceiling 0xE000 — shrink the runtime or split into bank 1",
+ "rodata + init_array [0x%X+%u] exceeds bank-0 LC "
+ "ceiling 0x10000 — shrink the runtime or split into bank 1",
  L.rodataBase,
  (unsigned)(L.initBase + L.initSize - L.rodataBase));
  die(msg);
@ -666,10 +669,10 @@ struct Linker {
  L.bssBase = 0xD000;
  }
  }
- if (L.bssBase + L.bssSize > 0xE000) {
+ if (L.bssBase + L.bssSize > 0x10000u) {
  char msg[160];
  std::snprintf(msg, sizeof(msg),
- "bss [0x%X+%u] exceeds bank-0 LC1 ceiling 0xE000 — "
+ "bss [0x%X+%u] exceeds bank-0 LC ceiling 0x10000 — "
  "shrink the runtime or split into bank 1",
  L.bssBase, L.bssSize);
  die(msg);
@ -701,12 +704,12 @@ struct Linker {
  globalSyms["__heap_start"] = heapStart;
  if (heapStart < 0xC000) {
  globalSyms["__heap_end"] = 0xBF00;
- } else if (heapStart < 0xE000) {
- // Heap in LC1 ($D000-$DFFF); cap at $E000 (LC1 ceiling).
- globalSyms["__heap_end"] = 0xE000;
+ } else if (heapStart < 0x10000u) {
+ // Heap in LC area ($D000-$FFFF, 12KB usable). crt0's
+ // $C083 read-twice enables read+write for the whole range.
+ globalSyms["__heap_end"] = 0x10000u;
  } else {
- // Should be unreachable — earlier `bssBase + bssSize >
- // 0xE000` check would have died first.
+ // Unreachable — bssBase + bssSize > 0x10000 check above.
  globalSyms["__heap_end"] = heapStart;
  }
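
The ceiling and heap-end rules in these hunks reduce to two small predicates. The following is a sketch of that logic only, assuming the constants shown in the diff ($C000 I/O window start, $BF00 main-RAM heap end, $10000 LC ceiling); `fitsBelowLcCeiling`, `heapEndFor` and the macro names are hypothetical and do not exist in link816.

```c
#include <assert.h>
#include <stdint.h>

#define IO_WINDOW_START 0xC000u   /* first address of the I/O window */
#define HEAP_END_MAIN   0xBF00u   /* heap cap when the heap lives below the I/O window */
#define LC_CEILING      0x10000u  /* one past $FFFF, the top of bank 0 */

/* True when [base, base+size) stays within bank 0. */
static int fitsBelowLcCeiling(uint32_t base, uint32_t size) {
    return base + size <= LC_CEILING;
}

/* Heap end chosen from where the heap starts, as in the diff. */
static uint32_t heapEndFor(uint32_t heapStart) {
    if (heapStart < IO_WINDOW_START) return HEAP_END_MAIN;
    if (heapStart < LC_CEILING)      return LC_CEILING;
    return heapStart;  /* empty heap; the ceiling check should already have failed */
}

/* Usage example, using the same addresses as the smoke tests. */
void demoLcCeiling(void) {
    assert(fitsBelowLcCeiling(0xE100u, 0u));        /* now accepted */
    assert(!fitsBelowLcCeiling(0xFF00u, 0x200u));   /* $10100 is past the ceiling */
    assert(heapEndFor(0xD400u) == 0x10000u);
}
```

One point that may be worth checking, stated here as an unverified assumption: if `__heap_end` is ever consumed through a 16-bit relocation, the value 0x10000 would not fit and could truncate to 0.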

@ -32,6 +32,7 @@ add_llvm_target(W65816CodeGen
  W65816WidenAcc16.cpp
  W65816SpillToX.cpp
  W65816NegYIndY.cpp
+ W65816PreSpillCrossCall.cpp
  W65816TargetMachine.cpp
  W65816AsmPrinter.cpp
  W65816MCInstLower.cpp

@ -103,6 +103,12 @@ FunctionPass *createW65816SpillToX();
  // so signed-negative Y crosses bank boundaries. See W65816NegYIndY.cpp.
  FunctionPass *createW65816NegYIndY();
+ // Pre-RA pass: pre-spill Acc16 vregs whose live range crosses a JSL
+ // call site, in functions with at least 4 call sites. Lowers
+ // regalloc pressure for high-call-count functions that would
+ // otherwise hit "ran out of registers" under greedy. Currently not
+ // added to the pipeline. See W65816PreSpillCrossCall.cpp.
+ FunctionPass *createW65816PreSpillCrossCall();
  void initializeW65816AsmPrinterPass(PassRegistry &);
  void initializeW65816DAGToDAGISelLegacyPass(PassRegistry &);
  void initializeW65816StackSlotCleanupPass(PassRegistry &);
@ -113,6 +119,7 @@ void initializeW65816ABridgeViaXPass(PassRegistry &);
  void initializeW65816WidenAcc16Pass(PassRegistry &);
  void initializeW65816SpillToXPass(PassRegistry &);
  void initializeW65816NegYIndYPass(PassRegistry &);
+ void initializeW65816PreSpillCrossCallPass(PassRegistry &);
  } // namespace llvm

@ -0,0 +1,155 @@
//===-- W65816PreSpillCrossCall.cpp - Pre-spill cross-call Acc16 vregs ---===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Pre-RA pass that pre-spills Acc16 vregs whose live range crosses a
// JSL call site. Greedy regalloc has only one register in the Acc16
// class (A), and JSL clobbers A — so any Acc16 vreg live across a
// call MUST be spilled. Greedy normally figures this out, but for
// functions with many such vregs (the "ok |= bit" bitmask pattern
// repeated across N if-blocks each calling a helper) greedy can run
// out of registers during spill placement, aborting compilation with
// "ran out of registers".
//
// We pre-empt the failure: walk the MBB, find cross-call Acc16
// vregs, and explicitly STAfi their value after the def + LDAfi
// before each use that follows a call in the def's MBB. This
// converts the cross-call live ranges into stack-resident loads,
// dropping greedy's pressure to the point where it is expected to
// succeed.
//
// Cost: an extra STAfi+LDAfi (~6 cyc each) per cross-call vreg.
// This is the same cost greedy would emit if it succeeded (a spill
// + reload), so we're not pessimising — just making the spill
// explicit BEFORE greedy gets confused.
//
// Heuristic: only pre-spill in functions with at least 4 call
// sites. Below that, let greedy do its thing (it usually picks
// better placements).
//
//===----------------------------------------------------------------------===//
#include "W65816.h"
#include "W65816InstrInfo.h"
#include "W65816Subtarget.h"
#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;
#define DEBUG_TYPE "w65816-pre-spill-cross-call"
namespace {
class W65816PreSpillCrossCall : public MachineFunctionPass {
public:
static char ID;
W65816PreSpillCrossCall() : MachineFunctionPass(ID) {}
StringRef getPassName() const override {
return "W65816 pre-spill Acc16 vregs across calls";
}
void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();
MachineFunctionPass::getAnalysisUsage(AU);
}
bool runOnMachineFunction(MachineFunction &MF) override;
};
} // namespace
char W65816PreSpillCrossCall::ID = 0;
INITIALIZE_PASS(W65816PreSpillCrossCall, DEBUG_TYPE,
"W65816 pre-spill Acc16 vregs across calls", false, false)
FunctionPass *llvm::createW65816PreSpillCrossCall() {
return new W65816PreSpillCrossCall();
}
bool W65816PreSpillCrossCall::runOnMachineFunction(MachineFunction &MF) {
if (MF.getFunction().hasOptNone()) return false;
MachineRegisterInfo &MRI = MF.getRegInfo();
if (!MRI.getNumVirtRegs()) return false;
const W65816InstrInfo *TII =
MF.getSubtarget<W65816Subtarget>().getInstrInfo();
MachineFrameInfo &MFI = MF.getFrameInfo();
// First pass: count call sites in the function. Below the
// heuristic threshold we don't bother — greedy handles low-call
// functions fine and pre-spilling would just add bytes.
unsigned callCount = 0;
for (MachineBasicBlock &MBB : MF)
for (MachineInstr &MI : MBB)
if (MI.isCall()) callCount++;
if (callCount < 4) return false;
bool Changed = false;
// Walk every Acc16 vreg in the function. For each, find its def
// (allowing multi-def vregs like SELECT_CC results — pick the
// first by MachineInstr iteration), then check if any use is
// separated from the def by a JSL call (in the same MBB). If
// so, pre-spill via STAfi at def + LDAfi at each post-call use.
unsigned NumVRegs = MRI.getNumVirtRegs();
for (unsigned i = 0; i < NumVRegs; ++i) {
Register VReg = Register::index2VirtReg(i);
if (MRI.def_empty(VReg)) continue;
if (MRI.getRegClass(VReg) != &W65816::Acc16RegClass) continue;
// Find the first def. For PHIs we skip — pre-spilling a PHI
// result is complex and rarely helpful for the high-pressure
// pattern we target (which is sequential bitmask updates).
MachineInstr *DefMI = nullptr;
for (MachineInstr &D : MRI.def_instructions(VReg)) {
if (D.isPHI()) { DefMI = nullptr; break; }
if (!DefMI) DefMI = &D;
}
if (!DefMI) continue;
MachineBasicBlock *MBB = DefMI->getParent();
// Check if any use of VReg is in the same MBB AFTER a call
// following DefMI.
bool sawCallAfterDef = false;
SmallVector<MachineInstr *, 4> postCallUses;
auto Walker = std::next(DefMI->getIterator());
while (Walker != MBB->end()) {
MachineInstr &W = *Walker++;
if (W.isCall()) sawCallAfterDef = true;
if (sawCallAfterDef && W.readsRegister(VReg, /*TRI=*/nullptr))
postCallUses.push_back(&W);
}
if (postCallUses.empty()) continue;
// Pre-spill. Fresh slot per vreg — StackSlotColoring may merge
// slots later if their lifetimes don't overlap.
int FI = MFI.CreateStackObject(2, Align(2), /*isSpillSlot=*/true);
DebugLoc DL = DefMI->getDebugLoc();
auto AfterDef = std::next(DefMI->getIterator());
BuildMI(*MBB, AfterDef, DL, TII->get(W65816::STAfi))
.addReg(VReg).addFrameIndex(FI).addImm(0);
for (MachineInstr *UseMI : postCallUses) {
Register Reload = MRI.createVirtualRegister(&W65816::Acc16RegClass);
BuildMI(*UseMI->getParent(), UseMI->getIterator(), UseMI->getDebugLoc(),
TII->get(W65816::LDAfi), Reload)
.addFrameIndex(FI).addImm(0);
// Rewrite this use's references of VReg to Reload.
for (auto &MO : UseMI->uses()) {
if (MO.isReg() && MO.getReg() == VReg) {
MO.setReg(Reload);
MO.setIsKill(false);
}
}
}
Changed = true;
}
return Changed;
}
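
The scan at the heart of the pass (find uses of a vreg that follow a call within the defining block) can be modelled without LLVM. The sketch below is a plain-C simulation under that assumption; `Instr`, `findPostCallUses` and `demoPostCallUses` are hypothetical names, and the two flags stand in for `MachineInstr::isCall()` and `readsRegister()`.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int isCall;     /* instruction is a call */
    int readsVReg;  /* instruction reads the vreg of interest */
} Instr;

/*
 * Walk the block after defIdx. Once a call has been seen, record the
 * index of every instruction that reads the vreg. Returns the number
 * of indices written to out (at most outCap).
 */
static size_t findPostCallUses(const Instr *block, size_t n, size_t defIdx,
                               size_t *out, size_t outCap) {
    int sawCall = 0;
    size_t found = 0;
    for (size_t i = defIdx + 1; i < n; i++) {
        if (block[i].isCall) sawCall = 1;
        if (sawCall && block[i].readsVReg && found < outCap)
            out[found++] = i;
    }
    return found;
}

/* Usage example: def, use, call, use, use. */
void demoPostCallUses(void) {
    const Instr block[] = {{0, 0}, {0, 1}, {1, 0}, {0, 1}, {0, 1}};
    size_t uses[4];
    size_t n = findPostCallUses(block, 5, 0, uses, 4);
    assert(n == 2);                          /* the pre-call use at index 1 is skipped */
    assert(uses[0] == 3 && uses[1] == 4);
}
```

Only the two uses after the call would get a reload; the use before the call still reads the original vreg. Like the pass, this model ignores uses in other basic blocks.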

@ -45,6 +45,7 @@ LLVMInitializeW65816Target() {
  initializeW65816WidenAcc16Pass(PR);
  initializeW65816SpillToXPass(PR);
  initializeW65816NegYIndYPass(PR);
+ initializeW65816PreSpillCrossCallPass(PR);
  }
  static Reloc::Model getEffectiveRelocModel(std::optional<Reloc::Model> RM) {
@ -85,16 +86,22 @@ public:
  void addPreEmitPass() override;
  void addMachineSSAOptimization() override;
- // W65816's only 16-bit ALU register is A. Greedy at -O1+ produces
- // tight code; at -O0 (where optnone disables coalescing/CSE), greedy
- // leaves spurious COPY pseudos that lower to STA dp / LDA dp pairs
- // around modify-in-place ops (e.g. INA), miscompiling a + 1. Use
- // fast regalloc when the target framework signals unoptimized.
+ // W65816's only 16-bit ALU register is A. At -O1+ we use BASIC
+ // regalloc instead of greedy: greedy fails ("ran out of registers
+ // during register allocation") on functions with many cross-call
+ // Acc16 vregs (the "ok |= bit; helper(); ok |= bit;" pattern
+ // repeated across many if-blocks). Basic regalloc handles that
+ // pattern cleanly, with negligible code-size overhead vs greedy
+ // (~0.7% on the bench suite).
+ //
+ // At -O0 / optnone (Optimized=false) we use FAST: greedy/basic at
+ // -O0 leave spurious COPY pseudos that lower to STA dp / LDA dp
+ // pairs around modify-in-place ops (e.g. INA), miscompiling a + 1.
+ //
  // TiedDefSpill (pre-RA) handles the tied-def-multi-use hazard for
  // the sub-pattern that's frequent enough to matter at -O1+.
+ //
  FunctionPass *createTargetRegisterAllocator(bool Optimized) override {
- return Optimized ? createGreedyRegisterAllocator()
+ return Optimized ? createBasicRegisterAllocator()
  : createFastRegisterAllocator();
  }
  };
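
For readers unfamiliar with the pattern named in the comment, here is a small host-runnable sketch of its shape: a bitmask accumulator that stays live across a call inside each of several if-blocks. It is illustrative only. `helper`, `gHelperCalls`, `checkAll` and `demoCheckAll` are hypothetical, `__attribute__((noinline))` assumes GCC or Clang, and no claim is made that this exact function reproduces the greedy failure on the W65816 target.

```c
#include <assert.h>

/* Side effect so the calls cannot be optimised away. */
static volatile int gHelperCalls;

/* noinline keeps every call a real call site. */
__attribute__((noinline)) static void helper(void) { gHelperCalls++; }

/* `ok` is updated before and after each call, so it is live across all of them. */
int checkAll(int a, int b, int c, int d) {
    int ok = 0;
    if (a > 0) { ok |= 0x01; helper(); ok |= 0x02; }
    if (b > 0) { ok |= 0x04; helper(); ok |= 0x08; }
    if (c > 0) { ok |= 0x10; helper(); ok |= 0x20; }
    if (d > 0) { ok |= 0x40; helper(); ok |= 0x80; }
    return ok;
}

/* Usage example. */
void demoCheckAll(void) {
    assert(checkAll(1, 1, 1, 1) == 0xFF);
    assert(checkAll(1, 0, 1, 0) == 0x33);
}
```

On a target where the only 16-bit accumulator is clobbered by every call, each of those live ranges has to go through a spill slot, which is the pressure the comment describes.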
@ -119,6 +126,19 @@ void W65816PassConfig::addPreRegAlloc() {
  addPass(createW65816ABridgeViaX());
  addPass(createW65816TiedDefSpill());
  addPass(createW65816WidenAcc16());
+ // Pre-spill cross-call Acc16 vregs in high-call functions to
+ // relieve greedy regalloc pressure. Currently disabled: the
+ // first cut creates too many fresh stack slots and overflows the
+ // stack-relative addressing range (frame > 256 bytes) on
+ // moderately-sized functions like the soft-double routines.
+ // The pass is built and ready, gated behind future tuning of:
+ // - lower call-count threshold (currently 4)
+ // - smarter "should we spill THIS vreg" filter
+ // - stack slot reuse via a real liveness analysis
+ // Until then, the high-pressure failure is avoided by using
+ // basic regalloc at -O1+ (see createTargetRegisterAllocator).
+ // addPass(createW65816PreSpillCrossCall());
  }
void W65816PassConfig::addPostRegAlloc() { void W65816PassConfig::addPostRegAlloc() {
@ -144,7 +164,14 @@ void W65816PassConfig::addPreEmitPass() {
  // a value parked there; without that check, the rewrite's TAX
  // would clobber spill-bridged values (caught by `addOff(p,i) {
  // p[i-1] += p[i]; }` returning p[i-1] + &p[i-1] instead of +b).
- addPass(createW65816NegYIndY());
+ // W65816NegYIndY was a workaround for the (sr,s),Y bank-wrap on
+ // negative-Y indirect-stack-rel loads. No current code emits
+ // LDA_StackRelIndY / STA_StackRelIndY (pointer-deref now goes
+ // through [$E0],Y indirect-long via the LDAptr / STAptr / STBptr
+ // inserter, which forces the bank byte at $E2 to 0). Pass left
+ // in tree but disabled — re-enable if a new code path starts
+ // emitting (sr,s),Y again.
+ // addPass(createW65816NegYIndY());
  // Branch expansion runs after that so the BRA introduced for long
  // conditional branches gets seen by SepRepCleanup (which can
  // coalesce SEP/REP brackets across the new bridge MBBs).