diff --git a/STATUS.md b/STATUS.md index 08a87cf..8a63ee4 100644 --- a/STATUS.md +++ b/STATUS.md @@ -72,11 +72,13 @@ which runs correctly under MAME (apple2gs). native object format) for round-tripping with classic dev tools. - `runtime/build.sh` builds crt0, libc, soft-float, soft-double, libgcc into linkable objects. -- `scripts/smokeTest.sh` runs 99 end-to-end checks (scalar ops, +- `scripts/smokeTest.sh` runs 102 end-to-end checks (scalar ops, control flow, calling conventions, MAME execution, regressions, - link816 bss-base safety, iigs/toolbox.h compile-check, standalone - runtime headers, AsmPrinter peepholes for STZ / PEA / PEI — - single-STA, shared-LDA-multi-STA, and DPF0-forwarding cases). + link816 bss-base safety + weak-symbol resolution + + heap_end-vs-heap_start sanity, iigs/toolbox.h compile-check, + standalone runtime headers, AsmPrinter peepholes for STZ / + PEA / PEI — single-STA, shared-LDA-multi-STA, and DPF0- + forwarding cases — malloc/free coalesce ordering). Currently 100% pass at -O2 throughout. **ABI:** @@ -131,11 +133,10 @@ Two open bugs tracked: both pass. Workaround comments in build.sh / smokeTest.sh removed. - The `__attribute__((noinline,optnone))` markers on iterative - qsort, RPN `runAll`, and expression-parser `runAll` are kept - for now as defense; with the new backend fixes they may no - longer be required, but removing them needs case-by-case - verification. + The `__attribute__((noinline,optnone))` defenses on iterative + qsort / RPN `runAll` / expression-parser `runAll` were + subsequently dropped; the smoke now compiles them at plain + `-O2` without escape hatches. The W65816 backend assembler now supports all common indirect addressing modes (`(dp)`, `(dp),Y`, `(dp,X)`, `(d,s),Y`, @@ -208,18 +209,45 @@ sidecar bytes. rewriting the affected ops to `TAX ; LDA/STA $0000,X`. Stays correct for negative offsets like `arr[i-1]`. -- **(d,s),y for stack-local pointer dereferences uses DBR**, so - user code that switches DBR (e.g. `pha;plb` to bank 2 to reach - IIgs hardware) must not call into a function that takes the - address of one of its locals — the callee's `*p = v` will write - to the wrong bank. Documented; no compiler-side mitigation - beyond the existing DPF0 fake-physreg routing for the i64-return - high half. Workaround: inline pointer-arg helpers so the writes - stay in the caller's frame using stack-rel direct stores. The - W65816 only has three DBR-independent addressing modes - (abs_long, abs_long,X, [dp],Y) — none cheap to retrofit into - the current pointer-deref lowering (+5 bytes minimum per access). - Real fix needs PHB/PLB at noinline-pointer-callee entry/exit. +- **Pointer-deref bank policy is now split-by-syntax** (FIXED): + `*p` (where `p` is a runtime pointer / local-or-arg vreg) lowers + via `LDAptr / STAptr / STBptr` to `[$E0],Y` indirect-LONG with + the bank byte at `$E2` forced to 0 — DBR-independent. The + `*(volatile uint16 *)0x5000 = v` MMIO idiom (const-int pointer) + is matched by a separate TableGen pattern that lowers straight + to `STAabs` (DBR-relative) so the smoke tests' bank-2 write + path still works. Two tracked issues this resolved: + (a) PHI-elim was eliding the inserter's `COPY $a = ptr_vreg` + when the loop body had multiple Acc16 PHIs competing for A — + the inserter now spills the pointer to a fresh stack slot and + reloads via LDAfi to keep RA honest; sumTable now correct. + (b) pointer staging through `[$E0]` is bank-0 only, so + switchToBank2 + helper-with-local-ptr no longer corrupts data + in the wrong bank. See `feedback_dbr_ptr_deref_spill.md`. + +- **Greedy regalloc fails on long-arg call chains** — a function + that strings ~7+ independent `helper(longArg1, longArg2)` calls + overflows greedy at -O1+ ("ran out of registers during register + allocation"). Same root issue as softDouble's old -O2 hold-out. + Threshold raised somewhat by expanding IMG slots from 8 to 16 + (now backed by DP $C0..$DE) — most "normal-looking" mixed-arity + workloads now compile, but pathological pressure (many i32+ args + + bitmask SETCC chain) still fails. Workarounds (in order of + preference): mark the heaviest helper `__attribute__((noinline))` + to reduce caller pressure; `-mllvm -regalloc=fast` for that TU; + or `__attribute__((optnone))` on the affected function. A proper + fix needs either a custom greedy→fast fallback in + `W65816TargetMachine::createTargetRegisterAllocator` or a smarter + spill-placement pre-RA pass. + +- **Bank-0 size limit (~48KB)** — the runtime + program must fit in + $1000-$BFFF (text+rodata) plus $D000-$DFFF (LC1 for rodata-spill + and BSS). Past that, link816 hard-fails because text would + cross the IO window. In practice this is rarely hit now that + link816 has `--gc-sections` (default ON, see Recently Fixed) + which drops unreachable functions: a minimal program shrinks + from ~43KB (whole runtime) to ~1.5KB. Programs that genuinely + use most of the runtime can still hit the limit. ## Recently fixed @@ -288,24 +316,173 @@ sidecar bytes. also removes two PHA/PLA save-restore wraps around the LDA #0 (STZ doesn't touch A, so the wraps are unnecessary). +- **libgcc.s `lda dp; pha` -> `pei dp`** — 2 sites in __divhi3 / + __modhi3 where the loaded A is dead after the push. PEI + doesn't touch A, saves 1 byte each. + +- **W65816StackSlotCleanup Pass 1c skip-list extended** — added + STAabs / STA8abs / STAptr / STBptr / STAptrOff / STBptrOff and + ADJCALLSTACKDOWN to the A-transparent list. Lets the redundant- + CMP-after-A-modifier elimination see through more pseudo + stores and the call-stack-down pseudo. Saves 8 bytes in math.o. + (ADJCALLSTACKUP is NOT transparent — when PEI doesn't process + it, AsmPrinter emits a TSC/CLC/ADC/TCS that clobbers A.) + +- **crt0.s `lda #0; sta` -> `stz`** — IRQ-disable block and the + BSS-zero loop both used `.byte 0xa9, 0x00 ; sta` raw-byte + workarounds for `lda #0` (the assembler emits a 16-bit immediate + in M=8, mis-encoding it). `stz` works in M=8 (stores 1 byte) and + doesn't touch A — both `.byte` workarounds removed; saves 4 bytes + in crt0.o. + +- **Runtime correctness pass — five real bugs fixed:** + - `free()` coalesce: when a freed block was absorbed into a + lower-address neighbour (`bEnd == a` path), the absorbed entry + was left in the free list overlapping the extended one. A + follow-on malloc could hand out the same memory to two + callers. Fix: track outer-loop predecessor and excise the + absorbed entry. Smoke #100 added. + - `sqrt(-0.0)` returned NaN; should return -0.0 per IEEE-754. + The sign-bit check fired before the zero check. Fix: mask + sign bit when testing for zero. + - `log(0)` returned NaN; should return -Infinity (pole error). + Same sign-bit-vs-zero ordering issue; both ±0 now return + `-1.0/0.0`. + - `snprintf(buf, 0, ...)` wrote `'\0'` to `buf[-1]` (one byte + BEFORE the buffer). C99 says n=0 must not touch the buffer. + Fix: set `gEnd = NULL` for n=0 so neither the normal nor the + truncation NUL-write path fires. Smoke #76 extended. + - `malloc(>~32KB)` and `calloc(n, m)` had silent integer overflow + on size_t (16-bit), wrapping to small values and handing out + tiny allocations claiming huge sizes. Bumped malloc to bail + above 0x7FF0 (heap is at most ~32KB anyway) and made calloc + overflow-check before multiplying. + +- **Removed** dead `runtime/src/softDouble.s` (a stub from before + `softDouble.c` was implemented; the build script doesn't reference + it but it was confusing to leave around). + +- **inttypes.h PRId64 / PRIu64 / PRIx64** documented as + unsupported in the runtime's printf — the macros expand to + `"lld"`/`"llu"`/`"llx"` but the formatter only knows the `l` + length modifier, not `ll`, so the format prints literally and + the va_list misaligns. Use `PRId32` etc. for now. + +- **More runtime fixes (round 2):** + - `fputs(s, stream)` was forwarding to `puts(s)`, which appends a + newline. C says fputs MUST NOT add one. Direct char-by-char + write now. + - `exit(code)` never invoked the registered `atexit` handler. + C99 7.20.4.3 requires it. Now runs the single-slot handler + (with re-entry guard) before the BRK. + - `printf("%f", -0.0)` printed `0.000000` instead of `-0.000000` + because `if (v < 0)` (a `__ltdf2` call) returns false for + negative zero. Switched to the IEEE-754 sign-bit test that + snprintf already uses. + - `vfprintf` was missing entirely (declared neither in stdio.h + nor implemented). Added a thin wrapper around vprintf. + +- **link816 weak-symbol resolution:** the linker previously used + "last def wins" with no regard for STB_GLOBAL vs STB_WEAK. When + a user provided a strong override of a weak libc stub (e.g. + `putchar`), it worked only by link-order luck — reversing the + order let the weak stub silently overwrite the strong def. + Now properly: strong over weak (any order), strong + strong + errors out, weak + weak picks the first. Smoke #100 added. + +- **More runtime fixes (round 3):** + - `writeHex` / `emitHex` had a stack-overflow buffer overrun + (`char buf[5]` but `printf("%08x", ...)` would write 8 bytes). + On 16-bit `unsigned int`, max useful width is 4 — buf shrunk + to 4 and width is now capped. + - `writeDec` / `writeSignedLong` / `emitDec` / `emitSignedLong` + used `-n` on signed input, which overflows for INT_MIN / + LONG_MIN (UB). All four switched to unsigned-negation + (`0u - (unsigned)n`) for correctness and to keep an + optimizer-aware compiler from exploiting the UB. + - `atoi` / `atol` / `strtol` / `strtoul` likewise built the + parsed magnitude in a signed accumulator and negated at the + end — same UB on the boundary value. All switched to + unsigned magnitude + unsigned-negation cast. + - `link816 parseInt` / `omfEmit parseInt` silently truncated + addresses > 24 bits to `uint32_t` low bits — `--text-base + 0x100000000` would silently wrap to 0. Both now reject + out-of-range addresses with a clear error. + +- **More runtime fixes (round 4):** + - `pow(x, y)` computed `n = -n` for the integer-y branch when + yi was INT_MIN (-32768); same signed-overflow UB pattern as + the print functions. Switched to unsigned magnitude. + - Added `perror(prefix)` — was missing from the runtime; common + pattern in portable code that reports I/O failure via + `errno + strerror`. Declared in stdio.h, implemented as + char-by-char emit through putchar (no fprintf dependency). + +- **link816 `__heap_end` was hardcoded at $BF00**, ignoring where + `__heap_start` actually ended up. When BSS got auto-relocated + into LC1 ($D000+), heap_start ended up > heap_end and malloc + immediately returned NULL on every call — silently bricking any + program that allocated dynamic memory after the runtime grew + past the default-bss threshold. Heap_end now picks + $BF00 / $E000 based on where heap_start lands (and skips the IO + window if heap_start would have landed in $C000-$CFFF). + Smoke #102 added. + +- **link816 rodata auto-skips IIgs IO window** ($C000-$CFFF). When + text+rodata grew past 0xC000 the rodata bytes silently corrupted + at runtime — string literals in the IO range read back as + hardware register values, breaking strcmp / strstr / printf / etc. + Now: rodata that would land in or cross $C000-$CFFF auto-skips + to $D000. Init_array gets the same treatment. Text that would + cross IO is hard-rejected at link time (no auto-fix possible — + PC fetches in IO would read hardware registers). This was the + root cause of the "tan/tanf triggers layout-sensitive failure" + symptom listed in older STATUS notes. + +- **runInMame skips writes to IO window** during the binary load. + Without this, the zero-padding in the rodata-skip gap would + clobber soft switches (e.g. the LC1 RAM enable that crt0 sets + via $C083) when the loader naively wrote the entire image + byte-by-byte to memory. + +- **link816 `--gc-sections` (default ON)** — discards sections not + reachable from the entry point (`__start` / `_start` / `main` + for the canonical crt0 setup) plus all `.init_array` sections. + Built on `-ffunction-sections` so each function is in its own + section. A minimal program with full runtime linked shrinks + from ~43KB to ~1.5KB. Adding `tan/tanf` to math.c (which + caused the latent layout-sensitive failure described above) + no longer pushes any test past the bank-0 limit. Tests that + intentionally check unreachable symbols pass `--no-gc-sections` + to opt out. + +- **`fwrite(stdout, ...)` was a stub returning 0** even though + `stdout` has a working `putchar` route. Now actually writes + through `putchar` for stdout/stderr (only). Also gained the + same `size * nmemb` overflow guard as `calloc`. + ## What's still needed for a "ship-ready" toolchain -- **softDouble.c -O1 hold-out** — `__muldf3`'s u64 lifetime pressure - overflows the greedy register allocator at -O2 ("ran out of - registers during register allocation"). Builds correctly at - -O1. Investigated: marking dpack noinline reduces pressure but - isn't enough; making dclass noinline would unblock -O2 (verified) - but the (d,s),y-uses-DBR bug then corrupts dclass's pointer-arg - writes when a caller has switched DBR (caught by smoke's - dmul-after-bank-switch test). Real fix is gated on the broader - DBR-pointer-deref limitation listed above. +- **softDouble.c -O2 — FIXED.** Marking `dclass` noinline (in + addition to `dpack`) drops register pressure in `__muldf3`/ + `__divdf3`/`__adddf3` enough that greedy regalloc no longer + runs out. The previous blocker was that noinline-dclass would + write through pointer args via the DBR-relative `(d,s),y` mode + and corrupt caller data after a bank switch — that path now + goes through `STAptr/STBptr` which use `[$E0],Y` indirect-long + with the bank byte forced to 0, so DBR is irrelevant. All + three smoke build sites moved to `-O2`. - **More of the C standard library**: real `` file I/O (`fopen`, `fread`, `fwrite`, `fseek` are currently stubs returning success/zero) — would need a memory-backed FS or a - MAME hook. `` / `` are stubbed (compile and - return safe defaults); `` / `` mostly absent. + MAME hook. `` / `` / `` are stubbed + (compile and return safe defaults). `` mostly absent. + A `time()` impl wired to ReadTimeHex (Misc Tool $0D03) was + attempted but crashes MAME without the Tool Locator initialised + in crt0; `clock()` via VBL counter at $E1006B needs 24-bit + far-pointer support that the backend doesn't yet model. - **C++ runtime support**: vtable layout for multiple inheritance, RTTI, exceptions (or a documented `-fno-exceptions` requirement). @@ -315,9 +492,15 @@ sidecar bytes. whether any 8-bit accumulator value is used. A per-region scheduler would reduce the SEP/REP wrap overhead on i8 stores. -- **Toolbox / IIgs system call bindings**: header files declaring - the Apple IIgs system calls (`SystemTask`, `WaitMouseUp`, - `DrawString`, …) with the right inline-asm dispatch glue. +- **Toolbox / IIgs system call bindings**: `iigs/toolbox.h` covers + the common entry points across Tool Locator, Memory Manager, + Misc Tools, QuickDraw II, Event Manager, Window Manager, plus + GS/OS Quit. Multi-arg wrappers (NewHandle, QDStartUp, MoveTo, + EMStartUp, GetNextEvent, NewWindow, CloseWindow) live in + `runtime/src/iigsToolbox.s` because the backend's inline-asm + constraints can't take memory operands. Single-arg / no-arg + wrappers stay inline. More routines (Menu Manager, Dialog + Manager, Standard File, Sound) still TBD. - **Real-world program coverage**: the smoke tests are microbenchmarks. A few known-good Apple IIgs C programs (e.g. diff --git a/runtime/include/iigs/toolbox.h b/runtime/include/iigs/toolbox.h index 778e933..66b6f62 100644 --- a/runtime/include/iigs/toolbox.h +++ b/runtime/include/iigs/toolbox.h @@ -1,25 +1,27 @@ -// IIgs toolbox helpers — minimal inline-asm wrappers for the most -// commonly-used Apple IIgs system calls. +// IIgs toolbox helpers — wrappers for commonly-used Apple IIgs system +// calls. // // Toolbox dispatch on the IIgs goes through the Tool Locator at // $E10000. Each routine is identified by a 16-bit "tool number" -// (low byte = tool set, high byte = function within set), loaded +// (high byte = function within set, low byte = tool set), loaded // into X, and called via JSL $E10000. // -// Args go on the stack (push order: rightmost first), then the -// caller pushes a result-space slot if the routine returns something -// non-i16-or-pointer, then JSL. +// GS/OS dispatch goes through $E100A8 with X holding the call +// number and a parameter-block pointer pushed on the stack. // -// This header keeps things simple: each function inlines a tiny -// asm block specific to that call. No #include guards on bigger -// abstractions; users that want full toolbox coverage should write -// their own wrappers using the same pattern. +// Calling convention: +// - Args go on the stack (push order: rightmost first), then the +// caller pushes a result-space slot (16 or 32 bits) BEFORE +// the args if the routine returns something non-void. +// - The result is read off the same stack slot AFTER JSL. +// - Tool number lives in X immediately before JSL. +// - Tools clobber A, X, Y, P; the runtime spills around the call. // -// LIMITATIONS: -// - Only a handful of routines wrapped. Calypsi has full toolbox. -// - No error-handling — caller checks the return. -// - Single-bank only. Cross-bank toolbox calls need different -// dispatch logic. +// Single-arg / no-arg wrappers are `static inline`. Multi-arg +// wrappers are declared `extern` here and implemented in +// runtime/src/iigsToolbox.s — backend constraints don't allow +// memory-operand inline asm so the multi-arg pushes need real +// .s code. #ifndef IIGS_TOOLBOX_H #define IIGS_TOOLBOX_H @@ -28,81 +30,284 @@ extern "C" { #endif -// Tool number convention: high byte = function, low byte = tool set. -// Common tool sets: 04 = Misc, 0E = QuickDraw II, 18 = Window Mgr. +// ===== Tool numbers (high byte = function, low byte = tool set) ===== +// Tool sets: +// 01 = Tool Locator 02 = Memory Manager 03 = Misc Tools +// 04 = QuickDraw II 06 = Event Manager 0E = Window Manager +// 1B = Menu Manager 29 = Standard File -// Misc Tool Set --------------------------------------------------- - -// WriteCString (Misc Tool $290B) — write a NUL-terminated string to -// the text screen. Arg: 16-bit pointer pushed before the call. -// Returns nothing. -static inline void TBoxWriteCString(const char *s) { +// ===================================================================== +// Tool Locator (Set $01) +// ===================================================================== +static inline void TBoxTLStartUp(void) { __asm__ volatile ( - "pha\n" // push C-string pointer - "ldx #0x290B\n" // tool number (function 0x29, set 0x0B) - "jsl 0xe10000\n" // tool dispatcher + "ldx #0x0201\n" + "jsl 0xe10000\n" : - : "a"(s) + : + : "a", "x", "y", "memory" + ); +} + +static inline void TBoxTLShutDown(void) { + __asm__ volatile ( + "ldx #0x0301\n" + "jsl 0xe10000\n" + : + : + : "a", "x", "y", "memory" + ); +} + +// ===================================================================== +// Memory Manager (Set $02) +// ===================================================================== + +// MMStartUp — call as the first MM routine. Returns the caller's +// 16-bit userId; save it for later DisposeAll calls. +static inline unsigned short TBoxMMStartUp(void) { + unsigned short id; + __asm__ volatile ( + "pha\n" // result space + "ldx #0x0202\n" + "jsl 0xe10000\n" + "pla\n" + : "=a"(id) + : + : "x", "y", "memory" + ); + return id; +} + +// MMShutDown — releases all MM resources owned by `userId`. +static inline void TBoxMMShutDown(unsigned short userId) { + __asm__ volatile ( + "pha\n" + "ldx #0x0302\n" + "jsl 0xe10000\n" + : + : "a"(userId) : "x", "y", "memory" ); } -// SysBeep (Misc Tool $0303) — short beep through the speaker. +// NewHandle / DisposeHandle live in iigsToolbox.s — the parameter +// blocks are 4-arg with mixed widths and need explicit asm. +extern unsigned long TBoxNewHandle(unsigned long size, + unsigned short userId, + unsigned short attr, + unsigned long addr); +extern void TBoxDisposeHandle(unsigned long handle); + +// ===================================================================== +// Misc Tools (Set $03) +// ===================================================================== + +// SysBeep — short beep through the speaker. static inline void TBoxBeep(void) { __asm__ volatile ( "ldx #0x0303\n" "jsl 0xe10000\n" : : - : "x", "y", "memory" + : "a", "x", "y", "memory" ); } -// ReadKey (Event Mgr; simplified — actually KeyTrans/etc). Returns -// the next pending key in A, or 0 if none. This wraps GetNextEvent -// internally on a real GS; for the simple console harness it polls -// the keyboard buffer. -static inline char TBoxReadKey(void) { - char r; +// WriteCString — Misc Tool $0B; writes a NUL-terminated string to +// the text screen. Note: actual GS uses Text Tools or stdio; +// this is the legacy entry point. +static inline void TBoxWriteCString(const char *s) { __asm__ volatile ( - "ldx #0x250A\n" // GetEvent (placeholder; refine in real port) + "pha\n" + "ldx #0x290B\n" "jsl 0xe10000\n" - : "=a"(r) : + : "a"(s) : "x", "y", "memory" ); - return r; } -// ConsoleQuit — clean program shutdown via GS/OS Quit. Pushes a -// pConditionTbl pointer (here, 0 for no condition) before JSL. +// ReadAsciiTime — fills a 20-byte buffer with the current time +// formatted as "DDD MMM dd hh:mm:ss yyyy". +static inline void TBoxReadAsciiTime(char *buf20) { + __asm__ volatile ( + "pha\n" + "ldx #0x0F03\n" + "jsl 0xe10000\n" + : + : "a"(buf20) + : "x", "y", "memory" + ); +} + +// ===================================================================== +// QuickDraw II (Set $04) +// ===================================================================== + +// QDStartUp / QDShutDown. Multi-arg startup lives in iigsToolbox.s. +extern void TBoxQDStartUp(unsigned short masterSCB, + unsigned short pageSize, + unsigned short userId); + +static inline void TBoxQDShutDown(void) { + __asm__ volatile ( + "ldx #0x0304\n" + "jsl 0xe10000\n" + : + : + : "a", "x", "y", "memory" + ); +} + +// MoveTo — move the pen to absolute (h, v). +extern void TBoxMoveTo(short h, short v); + +// DrawString — draw a Pascal-style length-prefixed string at the +// current pen position. First byte of `pstr` must be the length. +static inline void TBoxDrawString(const char *pstr) { + __asm__ volatile ( + "pha\n" + "ldx #0x2C04\n" + "jsl 0xe10000\n" + : + : "a"(pstr) + : "x", "y", "memory" + ); +} + +// PaintRect / FrameRect / EraseRect — rect is a 16-bit pointer to a +// 4-word Rect (top, left, bottom, right). +static inline void TBoxPaintRect(const short *rect) { + __asm__ volatile ( + "pha\n" + "ldx #0x5104\n" + "jsl 0xe10000\n" + : + : "a"(rect) + : "x", "y", "memory" + ); +} + +static inline void TBoxFrameRect(const short *rect) { + __asm__ volatile ( + "pha\n" + "ldx #0x4F04\n" + "jsl 0xe10000\n" + : + : "a"(rect) + : "x", "y", "memory" + ); +} + +static inline void TBoxEraseRect(const short *rect) { + __asm__ volatile ( + "pha\n" + "ldx #0x5004\n" + "jsl 0xe10000\n" + : + : "a"(rect) + : "x", "y", "memory" + ); +} + +// ===================================================================== +// Event Manager (Set $06) +// ===================================================================== + +// EMStartUp — initialises Event Manager with default queue and +// 640x200 mouse clamp. Args other than userId are hardcoded; if +// you need custom clamp, write your own wrapper. +extern void TBoxEMStartUp(unsigned short userId); + +static inline void TBoxEMShutDown(void) { + __asm__ volatile ( + "ldx #0x0306\n" + "jsl 0xe10000\n" + : + : + : "a", "x", "y", "memory" + ); +} + +// SystemTask — gives time to background tasks. Call regularly in +// event loops. +static inline void TBoxSystemTask(void) { + __asm__ volatile ( + "ldx #0x0306\n" + "jsl 0xe10000\n" + : + : + : "a", "x", "y", "memory" + ); +} + +// GetNextEvent — fills the EventRecord pointed at by `theEvent` +// with the next event matching `eventMask`. Returns nonzero if an +// event was returned. +// +// EventRecord layout (16 bytes): what(2) message(4) when(4) where(4) +// modifiers(2). +extern unsigned short TBoxGetNextEvent(unsigned short eventMask, void *theEvent); + +// ===================================================================== +// Window Manager (Set $0E) +// ===================================================================== + +// NewWindow — allocate and display a new window. paramList points +// to a NewWindow parameter block (in-bank 16-bit pointer). Returns +// a 32-bit window pointer. +extern void *TBoxNewWindow(const void *paramList); + +// CloseWindow — tear down a window. Takes a 32-bit window pointer. +extern void TBoxCloseWindow(void *winPtr); + +// ===================================================================== +// GS/OS (dispatcher at $E100A8) +// ===================================================================== + +// Quit — clean program shutdown via GS/OS. pConditionTbl = 0 +// (no resume condition). Does not return. static inline void TBoxQuit(void) { __asm__ volatile ( - "pea 0\n" // pConditionTbl = NULL - "pea 0\n" // pParm - "ldx #0x2029\n" // GS/OS Quit - "jsl 0xe100a8\n" // GS/OS dispatcher (different addr) + "pea 0\n" // pConditionTbl + "pea 0\n" // pParm + "ldx #0x2029\n" // GS/OS Quit + "jsl 0xe100a8\n" : : - : "x", "y", "memory" + : "a", "x", "y", "memory" ); - while (1) {} // unreachable + while (1) {} // unreachable } -// QuickDraw II ---------------------------------------------------- +// ===================================================================== +// Helpers — direct hardware polling (no toolbox) +// ===================================================================== -// QDStartUp / QDShutDown (sketches — real ones take more args). -// Real apps typically use QuickDraw II via the "shell" startup -// sequence; this is for educational/sim scenarios. -static inline void TBoxQDStartUp(void) { +// ReadKey — poll the IIgs keyboard latch at $C000 directly. +// Returns the ASCII byte (0 if no key ready). Strobes $C010 to +// clear the latch. Does NOT use Event Manager — for a real GS +// app, use TBoxGetNextEvent and pull from the queue instead. +static inline char TBoxReadKey(void) { + char r = 0; __asm__ volatile ( - "pea 0\n" "pea 0\n" "pea 0\n" // dummy direct-page handle - "ldx #0x0204\n" - "jsl 0xe10000\n" + "sep #0x20\n" // 8-bit A + "lda 0xc000\n" + "bpl 1f\n" + "sta 0xc010\n" // strobe + "and #0x7f\n" + "bra 2f\n" + "1:\n" + "lda #0\n" + "2:\n" + "rep #0x20\n" + "and #0x00ff\n" + : "=a"(r) : - : - : "x", "y", "memory" + : "memory" ); + return r; } #ifdef __cplusplus diff --git a/runtime/include/inttypes.h b/runtime/include/inttypes.h index a3c95a9..d47f348 100644 --- a/runtime/include/inttypes.h +++ b/runtime/include/inttypes.h @@ -10,9 +10,14 @@ // (strtoimax / strtoumax not implemented — runtime has strtol / // strtoul for the 32-bit forms which cover the common needs.) - -// PRIxN format macros. `int` is 16-bit on W65816, `long` is 32, -// `long long` is 64. +// +// **WARNING — limited printf support.** The runtime's printf / +// snprintf understand the `l` length modifier (long, 32-bit) but +// NOT `ll` (long long, 64-bit). Using PRId64 / PRIu64 / PRIx64 +// will compile but the runtime treats the format as a literal +// "%lld" rather than reading 8 bytes off the va_list — wrong output +// AND a stack misalignment for any subsequent args. For 32-bit +// values, PRId32 / PRIu32 / PRIx32 work correctly. #define PRId8 "d" #define PRIi8 "i" diff --git a/runtime/include/math.h b/runtime/include/math.h index d9edded..53c65c5 100644 --- a/runtime/include/math.h +++ b/runtime/include/math.h @@ -19,6 +19,8 @@ double sin (double x); float sinf (float x); double cos (double x); float cosf (float x); +double tan (double x); +float tanf (float x); double exp (double x); float expf (float x); double log (double x); diff --git a/runtime/include/stdio.h b/runtime/include/stdio.h index e85b31e..95b8c9f 100644 --- a/runtime/include/stdio.h +++ b/runtime/include/stdio.h @@ -19,6 +19,8 @@ int snprintf(char *buf, size_t n, const char *fmt, ...); int vsprintf(char *buf, const char *fmt, va_list ap); int vsnprintf(char *buf, size_t n, const char *fmt, va_list ap); int fprintf(FILE *stream, const char *fmt, ...); +int vfprintf(FILE *stream, const char *fmt, va_list ap); +void perror(const char *prefix); int fputc(int c, FILE *stream); int fputs(const char *s, FILE *stream); int fflush(FILE *stream); diff --git a/runtime/src/crt0.s b/runtime/src/crt0.s index c743dc2..78db880 100644 --- a/runtime/src/crt0.s +++ b/runtime/src/crt0.s @@ -24,12 +24,13 @@ __start: rep #0x30 ; Disable IIgs peripheral interrupt sources at the chip level — ; SEI alone leaves the hardware lines asserted, and the IRQ trap - ; in ROM keeps re-firing if the source isn't quiesced. + ; in ROM keeps re-firing if the source isn't quiesced. STZ + ; stores zero without going through A; in M=8 it stores 1 byte + ; (matching the 8-bit registers), so no LDA #0 prelude is needed. sep #0x20 - .byte 0xa9, 0x00 ; lda #$00 (8-bit M) - sta 0xc041 ; INTEN = 0 (clear AN3/mouse/0.25s/VBL/mouse-IRQ enables) - sta 0xc023 ; VGCINT = 0 (clear external/1-sec/scan-line IRQ enables) - sta 0xc032 ; SCANINT clear + stz 0xc041 ; INTEN = 0 (clear AN3/mouse/0.25s/VBL/mouse-IRQ enables) + stz 0xc023 ; VGCINT = 0 (clear external/1-sec/scan-line IRQ enables) + stz 0xc032 ; SCANINT clear rep #0x20 ; Top-of-stack at $0FFF. Native-mode S is 16-bit, so we don't need @@ -58,20 +59,15 @@ __start: ; Zero BSS. X iterates from __bss_start to __bss_end; each ; iteration writes one byte of zero at addr X (via DP=0 + - ; offset 0 — which is just X). Wraps in 8-bit M for the - ; byte-store. + ; offset 0 — which is just X). STZ in M=8 stores 1 byte and + ; doesn't touch A, so we don't need the LDA #0 prelude. rep #0x10 ; ensure X is 16-bit ldx #__bss_start .Lbss_loop: cpx #__bss_end bcs .Lbss_done ; X >= end -> done sep #0x20 ; 8-bit M for 1-byte store - ; llvm-mc doesn't track SEP/REP — `lda #$0` after SEP gets - ; encoded as a 3-byte 16-bit immediate, so the CPU reads - ; `a9 00 00` = LDA #$00 then BRK. Force the 1-byte form - ; with raw bytes. - .byte 0xa9, 0x00 ; lda #$00 (8-bit M imm) - sta 0x0, x ; *(uint8_t *)X = 0 (DP=0) + stz 0x0, x ; *(uint8_t *)X = 0 (DP=0) rep #0x20 inx bra .Lbss_loop diff --git a/runtime/src/extras.c b/runtime/src/extras.c index 1d4089a..eabbb90 100644 --- a/runtime/src/extras.c +++ b/runtime/src/extras.c @@ -53,12 +53,14 @@ long atol(const char *s) { } else if (*s == '+') { s++; } - long n = 0; + // Parse magnitude as unsigned to avoid signed-overflow UB (e.g. + // "-2147483648" — the magnitude 2147483648 doesn't fit in long). + unsigned long u = 0; while (*s >= '0' && *s <= '9') { - n = n * 10 + (*s - '0'); + u = u * 10 + (unsigned long)(*s - '0'); s++; } - return sign < 0 ? -n : n; + return sign < 0 ? (long)(0ul - u) : (long)u; } diff --git a/runtime/src/iigsToolbox.s b/runtime/src/iigsToolbox.s new file mode 100644 index 0000000..90a1305 --- /dev/null +++ b/runtime/src/iigsToolbox.s @@ -0,0 +1,223 @@ +; iigsToolbox.s — multi-arg toolbox wrappers that can't be done as +; inline asm because the W65816 backend's inline-asm constraints +; can't take memory operands. +; +; C ABI on this target: +; - Arg 0 (i16): in A +; - Arg 0 (i32): low half in A, high half in X +; - Arg N>0 (i16):in stack at (4 + 2*(N-1)), S — args pushed +; rightmost-first, JSL adds 3 bytes of retaddr +; (4,S = arg1 lo) +; - i16 return: A +; - i32 return: A (low) + X (high) +; +; Toolbox calls expect: +; - Args on stack in toolbox order (rightmost pushed first), then +; a result slot of appropriate width pushed BEFORE the args (so +; the result ends up at the highest stack address after pushes). +; - Tool number in X. +; - JSL $E10000. +; - After JSL, pop result then args in reverse. +; +; All wrappers preserve nothing (toolbox clobbers A, X, Y, P). + + .text + .globl TBoxNewHandle + .globl TBoxDisposeHandle + .globl TBoxQDStartUp + .globl TBoxMoveTo + .globl TBoxEMStartUp + .globl TBoxGetNextEvent + .globl TBoxNewWindow + .globl TBoxCloseWindow + +; ===================================================================== +; unsigned long TBoxNewHandle(u32 size, u16 userId, u16 attr, u32 addr) +; Entry: A = size lo, X = size hi +; 4,S = userId, 6,S = attr, 8,S = addr lo, 10,S = addr hi +; Tool layout (push order, leftmost=outermost on stack): +; [result lo][result hi][size lo][size hi][userId][attr][addr lo][addr hi] +; Wait: NewHandle args per Apple GS docs are +; (Long blockSize, Word userId, Word attributes, Long memAttr) +; pushed leftmost-first, so: +; PEA result hi, PEA result lo +; PUSH blockSize hi, PUSH blockSize lo (long, lo first then hi? no — let me check) +; +; Actually GS toolbox push order: each parameter is pushed in +; declaration order, low word first then high word for longs. +; Result space is pushed FIRST (and is read LAST after the pop +; sequence reverses everything). So: +; PEA 0 ; result hi +; PEA 0 ; result lo +; PHA size lo +; PHB? no: +; per https://www.brutaldeluxe.fr/products/crossdevtools/cadius/ +; Push order: parameters in order, longs as lo then hi. +; For NewHandle(blockSize=Long, userId=Word, attr=Word, memLoc=Long): +; pea 0 ; result lo +; pea 0 ; result hi +; pha ; blockSize lo +; phx ; blockSize hi (since size hi is in X) +; pha userId +; pha attr +; pha addrLo +; pha addrHi +; ldx #$0902 ; jsl $E10000 +; ; result is now on stack: pop hi then lo into A:X return +; +; Note: the IIgs toolbox actually expects result space to be HIGHER +; on stack (pushed first) so that pops in reverse give result last. +; ===================================================================== +TBoxNewHandle: + ; Stash size lo (in A) and size hi (in X) before we use the + ; stack — both must be pushed AFTER the result slot. + sta 0xe0 ; size lo to scratch + stx 0xe2 ; size hi to scratch + + ; Push 4-byte result space (will be popped at end). + pea 0 ; result lo + pea 0 ; result hi + + ; Push blockSize: lo first then hi. + lda 0xe0 ; size lo + pha + lda 0xe2 ; size hi + pha + + ; Push userId (was at 4,S originally; pushes since added: 4 result + 4 size = 8; +4 for JSL retaddr offset baseline) + ; Original 4,S; we've pha'd 8 bytes (result+size) on top of retaddr + ; So userId is now at 4 + 8 = 12,S. + lda 12, s ; userId + pha + + ; attr was at 6,S originally; now at 6 + 8 + 2 (one more pha) = 16,S. + lda 16, s ; attr + pha + + ; addr lo was at 8,S originally; with all our pushes (4 result + 4 + ; size + 2 user + 2 attr = 12), now at 8 + 12 = 20,S. + lda 20, s ; addr lo + pha + + ; addr hi was at 10,S originally; +14 = 24,S. + lda 24, s ; addr hi + pha + + ldx #0x0902 + jsl 0xe10000 + + ; Pop result: hi then lo. Returns u32 in A:X (low in A, hi in X). + pla ; result hi + tax + pla ; result lo → A + rtl + + +; ===================================================================== +; void TBoxDisposeHandle(unsigned long handle) +; Entry: A = handle lo, X = handle hi +; ===================================================================== +TBoxDisposeHandle: + pha ; handle lo + phx ; handle hi + ldx #0x1002 + jsl 0xe10000 + rtl + + +; ===================================================================== +; void TBoxQDStartUp(u16 masterSCB, u16 pageSize, u16 userId) +; Entry: A = masterSCB, 4,S = pageSize, 6,S = userId +; Tool: PEA userId, PEA pageSize, PHA masterSCB, JSL X=$0204 +; ===================================================================== +TBoxQDStartUp: + sta 0xe0 ; stash masterSCB + lda 6, s ; userId (originally 6,S, no pushes yet) + pha ; userId pushed; subsequent loads need +2 + lda 6, s ; pageSize was at 4,S; +2 = 6,S + pha + lda 0xe0 ; masterSCB + pha + ldx #0x0204 + jsl 0xe10000 + rtl + + +; ===================================================================== +; void TBoxMoveTo(short h, short v) +; Entry: A = h, 4,S = v +; ===================================================================== +TBoxMoveTo: + pha ; h + lda 6, s ; v (originally 4,S; +2 after pha) + pha + ldx #0x3A04 + jsl 0xe10000 + rtl + + +; ===================================================================== +; void TBoxEMStartUp(u16 userId) +; Entry: A = userId +; Default queueSize=0, mouse clamp 0..639 / 0..199 +; Tool: PEA queueSize, PEA xMin, PEA xMax, PEA yMin, PEA yMax, PHA userId +; ===================================================================== +TBoxEMStartUp: + pea 0 ; queueSize = use default + pea 0 ; xMin + pea 0x27F ; xMax = 639 + pea 0 ; yMin + pea 0xC7 ; yMax = 199 + pha ; userId (still in A from entry) + ldx #0x0206 + jsl 0xe10000 + rtl + + +; ===================================================================== +; unsigned short TBoxGetNextEvent(u16 eventMask, void *theEvent) +; Entry: A = eventMask, 4,S = theEvent +; Tool: PHA result(word), PHA eventMask, PHA theEvent, JSL X=$0A06 +; ===================================================================== +TBoxGetNextEvent: + sta 0xe0 ; stash eventMask + pea 0 ; result space (16-bit) + lda 0xe0 ; eventMask + pha + lda 8, s ; theEvent (originally 4,S; +4 after pea+pha) + pha + ldx #0x0A06 + jsl 0xe10000 + pla ; result → A + rtl + + +; ===================================================================== +; void *TBoxNewWindow(const void *paramList) +; Entry: A = paramList +; Tool: PEA result hi, PEA result lo, PHA paramList, JSL X=$090E +; Returns 32-bit window ptr in A:X (low in A, hi in X). +; ===================================================================== +TBoxNewWindow: + sta 0xe0 ; stash paramList + pea 0 ; result hi + pea 0 ; result lo + lda 0xe0 ; paramList + pha + ldx #0x090E + jsl 0xe10000 + pla ; result lo → A + plx ; result hi → X + rtl + + +; ===================================================================== +; void TBoxCloseWindow(void *winPtr) +; Entry: A = winPtr lo, X = winPtr hi +; ===================================================================== +TBoxCloseWindow: + pha ; winPtr lo + phx ; winPtr hi + ldx #0x0B0E + jsl 0xe10000 + rtl diff --git a/runtime/src/libc.c b/runtime/src/libc.c index 2933595..9c73117 100644 --- a/runtime/src/libc.c +++ b/runtime/src/libc.c @@ -133,15 +133,17 @@ long labs(long n) { return n < 0 ? -n : n; } int atoi(const char *s) { int sign = 1; - int n = 0; while (isspace(*s)) s++; if (*s == '-') { sign = -1; s++; } else if (*s == '+') { s++; } + // Parse magnitude as unsigned to dodge signed-overflow UB on + // values like "32768" (parsing INT_MAX+1 as signed int). + unsigned int u = 0; while (isdigit(*s)) { - n = n * 10 + (*s - '0'); + u = u * 10 + (unsigned int)(*s - '0'); s++; } - return sign * n; + return sign < 0 ? (int)(0u - u) : (int)u; } @@ -197,7 +199,10 @@ static void writeUDec(unsigned int n) { } static void writeDec(int n) { - if (n < 0) { putchar('-'); writeUDec((unsigned int)(-n)); } + // For INT_MIN, `-n` overflows signed int (UB). Negate as unsigned + // — well-defined (two's-complement wrap), and the magnitude is + // identical for the print path. + if (n < 0) { putchar('-'); writeUDec((unsigned int)(0u - (unsigned int)n)); } else writeUDec((unsigned int)n); } @@ -211,10 +216,14 @@ static void writeULong(unsigned long n) { static void writeHex(unsigned int n, int width) { static const char digits[] = "0123456789abcdef"; - char buf[5]; + // unsigned int is 16-bit on this target -> at most 4 hex digits. + // Cap width to that; without it `printf("%08x", ...)` blew past + // the buf[] tail and corrupted the stack. + char buf[4]; + if (width > 4) width = 4; int i = 0; if (n == 0) { buf[i++] = '0'; } - while (n > 0) { buf[i++] = digits[n & 0xF]; n >>= 4; } + while (n > 0 && i < 4) { buf[i++] = digits[n & 0xF]; n >>= 4; } while (i < width) buf[i++] = '0'; while (i > 0) putchar(buf[--i]); } @@ -229,7 +238,8 @@ static void writeStr(const char *s) { // reliably promotes Bxx to BRL when needed, so the inliner is free to // merge them when it wants. static void writeSignedLong(long n) { - if (n < 0) { putchar('-'); writeULong((unsigned long)(-n)); } + // See writeDec: avoid the signed-overflow UB on LONG_MIN. + if (n < 0) { putchar('-'); writeULong(0ul - (unsigned long)n); } else writeULong((unsigned long)n); } @@ -242,7 +252,17 @@ static void writeSignedLong(long n) { static void writeDouble(double v, int prec) { if (prec < 0) prec = 6; if (prec > 9) prec = 9; - if (v < 0) { putchar('-'); v = -v; } + // Test the IEEE-754 sign bit (so -0.0 prints with the sign per + // C99) and avoid the soft-float __ltdf2 comparison, which has + // historically miscompiled for negative inputs (see snprintf.c + // banner for the same workaround). + unsigned long long vbits; + __builtin_memcpy(&vbits, &v, 8); + if (vbits & ((unsigned long long)1 << 63)) { + putchar('-'); + vbits &= ~((unsigned long long)1 << 63); + __builtin_memcpy(&v, &vbits, 8); + } long ipart = (long)v; writeULong((unsigned long)ipart); if (prec == 0) return; @@ -398,6 +418,12 @@ static void mallocInitOnce(void) { void *malloc(size_t n) { mallocInitOnce(); if (n == 0) n = 1; + // Overflow guard: size_t is 16-bit on this target. Without this, + // malloc(65535) rounds up to 65536 -> wraps to 0 -> allocates 2 + // bytes (wrong size); even shorter values can wrap the bumpPtr + // sum below. The heap ceiling is ~32KB so anything > 0x7FF0 is + // unsatisfiable regardless. + if (n > (size_t)0x7FF0) return (void *)0; n = (n + 1) & ~(size_t)1; // round up to 2 bytes if (n < FREE_NODE_SZ - HDR_SZ) n = FREE_NODE_SZ - HDR_SZ; // ensure freed block can hold next-ptr @@ -435,38 +461,57 @@ void free(void *p) { FreeBlk *blk = (FreeBlk *)((char *)p - HDR_SZ); blk->next = freeList; freeList = blk; - // Coalesce: walk the free list and merge adjacent blocks. O(n^2) - // in the worst case but n is small in practice. - FreeBlk *a = freeList; + // Coalesce: walk the free list and merge adjacent blocks. Outer + // loop tracks a's predecessor (a_link) so we can excise `a` when + // it gets absorbed into a lower-address neighbour. Without that, + // an `aEnd == b` from b's perspective (i.e. b precedes a in + // memory) would extend b but leave a in the list — a future malloc + // could then hand out a's range as a "free" block while the + // expanded b overlaps it. O(n^2) in the worst case; n is small. + FreeBlk **a_link = &freeList; + FreeBlk *a = freeList; while (a) { + int a_absorbed = 0; FreeBlk **link = &a->next; FreeBlk *b = a->next; while (b) { char *aEnd = (char *)a + HDR_SZ + a->size; char *bEnd = (char *)b + HDR_SZ + b->size; if (aEnd == (char *)b) { + // a immediately precedes b — extend a, drop b. a->size += HDR_SZ + b->size; *link = b->next; b = *link; continue; } if (bEnd == (char *)a) { + // b immediately precedes a — extend b, drop a from + // the outer list. We can't continue the inner walk + // (a is gone), so break out and let the outer loop + // restart from a's successor. b->size += HDR_SZ + a->size; - // Remove `a` from the list (a is freeList head if first). - // Simpler: relink b in place of a, but a is at top. - // For correctness, just skip — coalesce on next pass. - link = &b->next; - b = b->next; - continue; + *a_link = a->next; + a_absorbed = 1; + break; } link = &b->next; b = b->next; } - a = a->next; + if (a_absorbed) { + a = *a_link; // already advanced by the excise + } else { + a_link = &a->next; + a = a->next; + } } } void *calloc(size_t nmemb, size_t size) { + // size_t is 16-bit on this target; nmemb*size can overflow and + // wrap to a small value (e.g. calloc(65536, 1) -> 0 -> 2-byte + // alloc), then the caller writes way past the returned region. + // Bail when the multiplication would overflow. + if (size != 0 && nmemb > (size_t)0xFFFF / size) return (void *)0; size_t total = nmemb * size; void *p = malloc(total); if (p) memset(p, 0, total); @@ -485,14 +530,25 @@ void *realloc(void *ptr, size_t n) { return q; } -// ---- exit ---- +// ---- atexit / exit ---- // -// Standard exit() halts via BRK. Programs running under the IIgs -// runtime typically would call back into GS/OS Quit; here we just -// wedge the CPU. +// Standard exit() halts via BRK after running any registered atexit +// handler. Programs running under the IIgs runtime typically would +// call back into GS/OS Quit; here we just wedge the CPU. Single-slot +// atexit (the storage and registration function are below). + +typedef void (*AtexitFn)(void); +static AtexitFn __atexitFn = (AtexitFn)0; void exit(int code) { (void)code; + // C99 7.20.4.3: exit() must invoke registered atexit handlers in + // reverse-registration order before terminating. + if (__atexitFn) { + AtexitFn fn = __atexitFn; + __atexitFn = (AtexitFn)0; // prevent re-entry if fn calls exit + fn(); + } // BRK $00 — halts a 65816 in BRK, MAME's debugger catches. __asm__ volatile (".byte 0x00, 0x00"); while (1) {} // unreachable @@ -522,14 +578,38 @@ char *strerror(int err) { } } +// perror — write `prefix: errno-string\n` to stderr. Common pattern in +// portable programs that report I/O failures. +void perror(const char *prefix) { + if (prefix && *prefix) { + const char *p = prefix; + while (*p) { putchar(*p); p++; } + putchar(':'); + putchar(' '); + } + const char *m = strerror(errno); + while (*m) { putchar(*m); m++; } + putchar('\n'); +} + // ---- time.h ---- // -// W65816/IIgs has no standard clock from C's perspective. Provide -// stubs that return 0 / -1 so code that calls time() at least links. -// A real implementation would call ReadTimeHex (GS/OS toolbox) or -// poll the IIgs real-time clock. +// time() and clock() are stubs returning 0. A real implementation +// could either: +// - Use ReadTimeHex (Misc Tool $0D03) — but this requires the GS +// Tool Locator to be initialised (TLStartUp from iigs/toolbox.h) +// in the crt0, otherwise the JSL $E10000 dispatcher reads +// uninitialised state and crashes. Smoke verified that the +// direct toolbox call segfaults MAME without prior init. +// - Use the IIgs vertical-blank counter at $00/E1/006B (24-bit +// address, needs long-pointer access via inline asm — the C +// pointer type is 16-bit on this target, so a literal 0xE1006B +// silently truncates to $006B in zero page). +// +// We leave both as stubs until the runtime has a Tool-Locator- +// init crt0 path or proper 24-bit far-pointer support. -typedef long time_t; +typedef long time_t; typedef unsigned long clock_t; time_t time(time_t *t) { @@ -559,7 +639,14 @@ FILE *stdout = &__stdout_obj; FILE *stderr = &__stderr_obj; int fputc(int c, FILE *stream) { (void)stream; return putchar(c); } -int fputs(const char *s, FILE *stream) { (void)stream; return puts(s); } +// fputs writes the string WITHOUT appending a newline (puts does append). +// Forwarding to puts() was a real bug — `fputs("hi", stdout)` was +// printing "hi\n" instead of "hi". +int fputs(const char *s, FILE *stream) { + (void)stream; + while (*s) { putchar(*s); s++; } + return 0; +} int fflush(FILE *stream) { (void)stream; return 0; } int fclose(FILE *stream) { (void)stream; return 0; } @@ -572,6 +659,11 @@ int fprintf(FILE *stream, const char *fmt, ...) { return r; } +int vfprintf(FILE *stream, const char *fmt, va_list ap) { + (void)stream; + return vprintf(fmt, ap); +} + // ---- assert ---- // // __assert_fail is what most assert() macros call. Print a message @@ -589,9 +681,7 @@ void abort(void) { exit(127); } -// ---- atexit (stub — single slot) ---- -typedef void (*AtexitFn)(void); -static AtexitFn __atexitFn = (AtexitFn)0; +// ---- atexit (single slot; storage + exit() invocation above) ---- int atexit(AtexitFn fn) { if (__atexitFn) return -1; __atexitFn = fn; @@ -618,7 +708,20 @@ size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream) { } size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream) { - (void)ptr; (void)size; (void)nmemb; (void)stream; + // For stdout/stderr, route through putchar so programs that use + // fwrite for binary output ("write %d bytes to stdout") actually + // produce output instead of silently dropping it. For other + // streams (real file handles), still a stub returning 0. + if (stream == stdout || stream == stderr) { + // size * nmemb can overflow size_t (16-bit on this target); + // bail rather than silently truncate the byte count. + if (size != 0 && nmemb > (size_t)0xFFFF / size) return 0; + const u8 *p = (const u8 *)ptr; + size_t total = size * nmemb; + for (size_t i = 0; i < total; i++) putchar(p[i]); + return nmemb; + } + (void)ptr; (void)size; (void)nmemb; return 0; } diff --git a/runtime/src/libgcc.s b/runtime/src/libgcc.s index 2b04658..43d413f 100644 --- a/runtime/src/libgcc.s +++ b/runtime/src/libgcc.s @@ -179,8 +179,7 @@ __divhi3: jsr __divmod_setup jsr __udivmod_core ; Quotient is in $ea. Negate if bit 1 of $ee is set. - lda 0xea - pha + pei 0xea lda 0xee and #0x2 beq .Ldiv_pos @@ -199,8 +198,7 @@ __modhi3: jsr __udivmod_core ; Remainder is in $ec. Negate if bit 0 of $ee is set (dividend ; was negative). - lda 0xec - pha + pei 0xec lda 0xee and #0x1 beq .Lmod_pos @@ -1131,10 +1129,9 @@ __negdi_b: ; setjmp returned 0 with all-callee-savable regs already preserved by ; setjmp's caller. ; -------------------------------------------------------------------- -; NOTE: llvm-mc misencodes `sta (dp), y` and `lda (dp), y` as the -; absolute-,Y opcodes (0x99 / 0xb9) instead of the DP-indirect-Y -; opcodes (0x91 / 0xb1). Use raw `.byte` for those. Y is supplied -; via LDY before each indirect access. +; setjmp / longjmp use the (dp),y indirect mode (opcodes 0x91/0xb1) +; to write through the jmp_buf pointer in $E0. Y is set explicitly +; before each indirect access; M=0 except where noted. .globl setjmp setjmp: sta 0xe0 ; jmp_buf addr -> DP scratch diff --git a/runtime/src/math.c b/runtime/src/math.c index 4151305..d0979e2 100644 --- a/runtime/src/math.c +++ b/runtime/src/math.c @@ -142,11 +142,13 @@ float fmodf(float x, float y) { double sqrt(double x) { uint64_t b; __builtin_memcpy(&b, &x, sizeof(b)); - if (b & ((uint64_t)1 << 63)) { - return 0.0 / 0.0; // NaN for negatives (well, -0.0 returns 0) + // Check zero first (positive or negative) — IEEE-754 says + // sqrt(+0)=+0 and sqrt(-0)=-0; both lower 63 bits are zero. + if ((b & ~((uint64_t)1 << 63)) == 0) { + return x; } - if (b == 0) { - return 0.0; + if (b & ((uint64_t)1 << 63)) { + return 0.0 / 0.0; // NaN for negatives } // Initial guess: halve the exponent. IEEE-754 trick gives a // surprisingly good starting point — within 2x of the true value. @@ -188,12 +190,16 @@ double pow(double x, double y) { return 0.0; // non-integer, non-0.5 y not supported yet } // y is a whole number; convert via __fixdfsi. Range -32768..32767 - // covers any practical exponent. - int n = (int)yi; + // covers any practical exponent. Use unsigned for the magnitude + // to avoid signed-overflow UB on INT_MIN. + int sn = (int)yi; int neg = 0; - if (n < 0) { + unsigned int n; + if (sn < 0) { neg = 1; - n = -n; + n = 0u - (unsigned int)sn; + } else { + n = (unsigned int)sn; } double r = 1.0; double base = x; @@ -268,6 +274,15 @@ double cos(double x) { } +// tan(x) = sin(x) / cos(x). No special handling for poles at pi/2 +// + n*pi (where cos(x) == 0): the soft-double divide returns +/-Inf, +// which is the IEEE-754-correct answer. Accuracy follows sin/cos +// (~1e-6) but degrades fast as |x| approaches a pole. +double tan(double x) { + return sin(x) / cos(x); +} + + float sinf(float x) { return (float)sin((double)x); } @@ -278,6 +293,11 @@ float cosf(float x) { } +float tanf(float x) { + return (float)tan((double)x); +} + + // exp via 2^k * e^r where x = k*ln2 + r, |r| < ln2/2. Then Taylor // series for e^r converges in ~10 terms. k * 2 multiplication uses // the IEEE-754 layout (add k to exponent field). @@ -321,8 +341,13 @@ float expf(float x) { double log(double x) { uint64_t b; __builtin_memcpy(&b, &x, sizeof(b)); - if (b == 0 || (b & ((uint64_t)1 << 63))) { - return 0.0 / 0.0; // log(0) = -inf, log(neg) = NaN; return NaN + // log(±0) = -Infinity (pole error). Mask off the sign bit when + // testing for zero so -0.0 lands here instead of the negative path. + if ((b & ~((uint64_t)1 << 63)) == 0) { + return -1.0 / 0.0; + } + if (b & ((uint64_t)1 << 63)) { + return 0.0 / 0.0; // log(negative) = NaN (domain error) } int e = (int)((b >> 52) & 0x7FF) - 1023; // Force the exponent field to 1023 so m lands in [1, 2). diff --git a/runtime/src/qsort.c b/runtime/src/qsort.c index 22f4a7d..f2c70e6 100644 --- a/runtime/src/qsort.c +++ b/runtime/src/qsort.c @@ -2,11 +2,11 @@ // and the byte-swap inner loop don't perturb other libc code. // // qsort uses insertion sort (O(n^2)) rather than recursion-driven -// quicksort; the W65816 backend's greedy regalloc still mis-orders -// spills in iterative quicksort with if/else recursion (#70), and -// for the small arrays this runtime targets (typical IIgs C -// program: dozens of items, not thousands) the constant-factor win -// of insertion sort over recursive quicksort is meaningful. +// quicksort. Originally chosen because the W65816 greedy regalloc +// mis-ordered spills in iterative quicksort (#70 — since fixed by a +// W65816StackSlotCleanup safety check), but kept because the typical +// IIgs C program sorts dozens of items, not thousands, and the +// constant-factor win of insertion sort dominates at that scale. typedef unsigned int size_t; typedef int (*CmpFnT)(const void *, const void *); diff --git a/runtime/src/snprintf.c b/runtime/src/snprintf.c index ee16452..6633870 100644 --- a/runtime/src/snprintf.c +++ b/runtime/src/snprintf.c @@ -92,9 +92,10 @@ static void emitUDec(unsigned int n) { __attribute__((noinline)) static void emitDec(int n) { + // -n on INT_MIN is signed-overflow UB; negate as unsigned. if (n < 0) { emit('-'); - emitUDec((unsigned int)(-n)); + emitUDec(0u - (unsigned int)n); } else { emitUDec((unsigned int)n); } @@ -123,9 +124,10 @@ static void emitULong(unsigned long n) { __attribute__((noinline)) static void emitSignedLong(long n) { + // See emitDec: avoid the signed-overflow UB on LONG_MIN. if (n < 0) { emit('-'); - emitULong((unsigned long)(-n)); + emitULong(0ul - (unsigned long)n); } else { emitULong((unsigned long)n); } @@ -135,12 +137,16 @@ static void emitSignedLong(long n) { __attribute__((noinline)) static void emitHex(unsigned int n, int width) { static const char digits[] = "0123456789abcdef"; - char buf[5]; + // unsigned int is 16-bit on this target -> at most 4 hex digits. + // Cap width to that; without it `snprintf("%08x", ...)` blew past + // the buf[] tail and corrupted the stack. + char buf[4]; + if (width > 4) width = 4; int i = 0; if (n == 0) { buf[i++] = '0'; } - while (n > 0) { + while (n > 0 && i < 4) { buf[i++] = digits[n & 0xF]; n >>= 4; } @@ -278,6 +284,11 @@ static int format(const char *fmt, va_list ap) { if (gCur < gEnd) { *gCur = '\0'; } else if (gEnd > (char *)0) { + // Truncated, but n > 0: overwrite the last byte with NUL so + // the result is a valid C string. snprintf with n=0 sets + // gEnd = NULL up front so this branch correctly skips — + // previously it wrote `gEnd[-1]` to `buf[-1]`, clobbering + // memory before the buffer. gEnd[-1] = '\0'; } return (int)gTotal; @@ -286,7 +297,10 @@ static int format(const char *fmt, va_list ap) { int snprintf(char *buf, size_t n, const char *fmt, ...) { gCur = buf; - gEnd = buf + (n ? n : 0); + // n == 0 must NOT touch the buffer (C99 7.19.6.5). Setting + // gEnd = NULL here makes both `gCur < gEnd` and `gEnd > 0` + // false, so no NUL terminator gets written. + gEnd = n ? buf + n : (char *)0; gTotal = 0; va_list ap; va_start(ap, fmt); @@ -315,7 +329,7 @@ int sprintf(char *buf, const char *fmt, ...) { int vsnprintf(char *buf, size_t n, const char *fmt, va_list ap) { gCur = buf; - gEnd = buf + (n ? n : 0); + gEnd = n ? buf + n : (char *)0; gTotal = 0; return format(fmt, ap); } diff --git a/runtime/src/softDouble.c b/runtime/src/softDouble.c index 7df5e0b..d0c609a 100644 --- a/runtime/src/softDouble.c +++ b/runtime/src/softDouble.c @@ -43,11 +43,12 @@ __attribute__((noinline)) static u64 dpack(u64 sign, s16 exp, u64 mant) { // Decompose `x` into sign / unbiased-exp / mantissa-with-leading-bit. // Returns the class: 0=zero, 1=normal, 2=infinity, 3=NaN. -// Inlinable on purpose — out_sign/out_exp/out_mant point at caller -// stack locals; if dclass were noinline the writes would lower to -// `sta (d,s),y` which uses DBR for the bank, silently corrupting -// data when the caller has switched DBR. Caught by smoke's -// dmul-after-bank-switch test (#dmul-bank-switch). +// noinline reduces register pressure in __muldf3/__divdf3/__adddf3 +// — without it, greedy regalloc runs out of registers in __muldf3 +// at -O2. Now safe because pointer-arg writes lower to STBptr/STAptr +// which use [$E0],Y indirect-long with the bank byte forced to 0 +// (DBR-independent). See `feedback_dbr_ptr_deref_spill.md`. +__attribute__((noinline)) static u16 dclass(u64 x, u64 *out_sign, s16 *out_exp, u64 *out_mant) { *out_sign = x & DSIGN_BIT; s16 e = (s16)((x >> DEXP_SHIFT) & 0x7FF); diff --git a/runtime/src/softDouble.s b/runtime/src/softDouble.s deleted file mode 100644 index 7ac2305..0000000 --- a/runtime/src/softDouble.s +++ /dev/null @@ -1,91 +0,0 @@ -; Stub double-precision soft-float — every routine returns 0. -; -; The C-based softDouble.c hit two compiler issues simultaneously: -; (1) Register Coalescer crash on the multi-tied-def-with-i64 pattern; -; (2) PEI "frame offset out of stack-relative range" because the -; spilled u64s push the local frame past the 8-bit ,S addressing -; limit. Both are real compiler bugs that require non-trivial -; backend work to fix. Until then, these stubs let programs that -; reference but don't actually evaluate `double` link cleanly; -; programs that DO use double get zero values back. -; -; Symbol set matches what clang's i64-routed double libcalls expect. -; ABI: i64 result returned via A:X:Y:DP[$F0] (matches LowerReturn). - - .text - -; Helper macro idiom: stub returning 64-bit zero. -.macro RET_ZERO64 - lda #0 - tax - tay - sta 0xf0 - rtl -.endm - - .globl __adddf3 -__adddf3: RET_ZERO64 - - .globl __subdf3 -__subdf3: RET_ZERO64 - - .globl __muldf3 -__muldf3: RET_ZERO64 - - .globl __divdf3 -__divdf3: RET_ZERO64 - - .globl __negdf2 -__negdf2: RET_ZERO64 - - .globl __cmpdf2 -__cmpdf2: lda #0 - rtl - - .globl __eqdf2 -__eqdf2: lda #0 - rtl - - .globl __nedf2 -__nedf2: lda #0 - rtl - - .globl __ltdf2 -__ltdf2: lda #0 - rtl - - .globl __gtdf2 -__gtdf2: lda #0 - rtl - - .globl __ledf2 -__ledf2: lda #0 - rtl - - .globl __gedf2 -__gedf2: lda #0 - rtl - - .globl __floatsidf -__floatsidf: RET_ZERO64 - - .globl __floatunsidf -__floatunsidf: RET_ZERO64 - - .globl __fixdfsi -__fixdfsi: lda #0 - tax - rtl - - .globl __fixunsdfsi -__fixunsdfsi: lda #0 - tax - rtl - - .globl __extendsfdf2 -__extendsfdf2: RET_ZERO64 - - .globl __truncdfsf2 -__truncdfsf2: lda #0 - tax - rtl diff --git a/runtime/src/strtol.c b/runtime/src/strtol.c index d3bcfaa..40fa1b2 100644 --- a/runtime/src/strtol.c +++ b/runtime/src/strtol.c @@ -40,7 +40,8 @@ unsigned long strtoul(const char *nptr, char **endptr, int base) { s++; } if (endptr) *endptr = (char *)(saw_digit ? s : nptr); - return neg ? (unsigned long)-(long)n : n; + // Negate in unsigned arithmetic to avoid signed-overflow UB. + return neg ? (0ul - n) : n; } long strtol(const char *nptr, char **endptr, int base) { @@ -55,5 +56,7 @@ long strtol(const char *nptr, char **endptr, int base) { return 0; } if (endptr) *endptr = ep; - return neg ? -(long)n : (long)n; + // Negate as unsigned to avoid signed-overflow UB on LONG_MIN + // ("-2147483648" — the magnitude doesn't fit in long). + return neg ? (long)(0ul - n) : (long)n; } diff --git a/scripts/runInMame.sh b/scripts/runInMame.sh index 2e84331..2e58802 100755 --- a/scripts/runInMame.sh +++ b/scripts/runInMame.sh @@ -63,7 +63,17 @@ emu.register_frame_done(function() -- apple2gs CPU model doesn't honor a Lua-side PB!=0 set. -- The user's code can switch DBR to bank 2+ for safe data -- writes (bank 2 is clear of IIgs ROM IRQ scribbling). - for i = 1, #data do mem:write_u8(0x001000 + i - 1, data:byte(i)) end + -- Skip writes that would land in the IIgs IO window + -- (\$C000-\$CFFF). link816 may pad this range with zeros + -- when rodata auto-skips it, and writing zeros into soft + -- switches could clobber IO state (e.g., the LC1 RAM enable + -- that crt0 sets up). + for i = 1, #data do + local addr = 0x001000 + i - 1 + if not (addr >= 0x00C000 and addr < 0x00D000) then + mem:write_u8(addr, data:byte(i)) + end + end loaded = true cpu.state["PC"].value = 0x1000 cpu.state["PB"].value = 0x00 diff --git a/scripts/smokeTest.sh b/scripts/smokeTest.sh index 73b5981..10daf1c 100755 --- a/scripts/smokeTest.sh +++ b/scripts/smokeTest.sh @@ -294,11 +294,14 @@ EOF fi fi -# 11a. SETCC via clang: a > b returns 0/1. Exercises the multi-branch -# CC path (BEQ + BPL diamond, since SETGT can't be a single Bxx). +# 11a. SETCC via clang: a > b returns 0/1. Signed compares now go +# through the EOR-with-sign-bit transform: each operand XORs $8000 +# to convert signed-int ordering to unsigned-int ordering, then +# uses BCC/BCS — avoids BMI/BPL's V-flag-overflow bug for values +# near INT16_MIN/MAX. CLANG="$BUILD_DIR/bin/clang" if [ -x "$CLANG" ]; then - log "check: clang compiles a > b via multi-branch SETCC" + log "check: clang compiles a > b via EOR-sign-bit + unsigned compare" cFile="$(mktemp --suffix=.c)" sCmpFile="$(mktemp --suffix=.s)" trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile"' EXIT @@ -306,18 +309,20 @@ if [ -x "$CLANG" ]; then int gt(int a, int b) { return a > b; } EOF "$CLANG" --target=w65816 -O2 -S "$cFile" -o "$sCmpFile" - # Expect a stack-relative CMP (offset depends on current spill - # behaviour — fast regalloc adds 2 PHA prologue bytes vs greedy - # which had no frame; either is acceptable as long as we cmp - # against b through a stack-relative slot), then BEQ + BPL forming - # the multi-branch diamond. - for expect in "lda #0x1" "beq" "bpl" "lda #0x0"; do + # Expect: EOR #$8000 on each operand, CMP, then BCC/BCS on the + # carry from the unsigned compare. The 0/1 result is materialised + # via lda #0/lda #1 in the diamond. + for expect in "eor #0x8000" "lda #0x1" "lda #0x0"; do if ! grep -qF "$expect" "$sCmpFile"; then warn "setcc gt test missing: $expect" cat "$sCmpFile" >&2 die "setcc gt test failed" fi done + if ! grep -qE '^\s*(bcc|bcs)\b' "$sCmpFile"; then + cat "$sCmpFile" >&2 + die "setcc gt test missing: bcc/bcs (carry-based unsigned branch)" + fi if ! grep -qE '^\s*cmp\s+0x[0-9a-f]+,\s*s\s*$' "$sCmpFile"; then cat "$sCmpFile" >&2 die "setcc gt test missing: cmp ,s (stack-relative compare to arg b)" @@ -411,24 +416,38 @@ EOF fi fi -# 11f. Pointer deref: *p loads via stack-relative-indirect-Y. +# 11f. Pointer deref: *p uses [dp],Y indirect-long (`LDA [$E0],Y`) +# which is DBR-independent. The previous lowering used (slot,S),Y +# indirect which silently wrote to DBR's bank — a real miscompile +# when the caller had switched DBR via `pha;plb`. The new lowering +# stages the pointer in DP scratch $E0..$E2 with the bank byte +# forced to 0, then loads/stores via [dp],Y — always bank 0. +# Const-int pointers (MMIO style) keep DBR-relative addressing via +# STAabs (separate TableGen pattern). if [ -x "$CLANG" ]; then - log "check: clang compiles *p via LDA (slot,s),y" + log "check: clang compiles *p via [dp],Y indirect-long (DBR-independent)" cFile6="$(mktemp --suffix=.c)" sPtrFile="$(mktemp --suffix=.s)" - trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile"' EXIT + oPtrFile="$(mktemp --suffix=.o)" + trap 'rm -f "$irFile" "$sFile" "$irCallFile" "$sCallFile" "$irMaFile" "$sMaFile" "$irI8File" "$sI8File" "$cFile" "$sCmpFile" "$cFile2" "$sSelFile" "$cFile3" "$sChainFile" "$cFile4" "$sMulFile" "$cFile5" "$sShfFile" "$cFile6" "$sPtrFile" "$oPtrFile"' EXIT cat > "$cFile6" <<'EOF' int load_ptr(const int *p) { return *p; } void store_ptr(int *p, int v) { *p = v; } EOF - "$CLANG" --target=w65816 -O2 -S "$cFile6" -o "$sPtrFile" - for expect in "ldy #0x0" "lda (0x" "sta (0x"; do - if ! grep -qF "$expect" "$sPtrFile"; then - warn "ptr-deref test missing: $expect" - cat "$sPtrFile" >&2 - die "ptr-deref test failed" - fi - done + "$CLANG" --target=w65816 -O2 -c "$cFile6" -o "$oPtrFile" + # LDA [dp],Y = 0xB7; STA [dp],Y = 0x97 (followed by the dp byte 0xE0). + if ! "$OBJDUMP" --triple=w65816 -d "$oPtrFile" 2>/dev/null \ + | grep -qE '\b97 e0\b'; then + warn "ptr-deref test: STA [dp],Y (0x97 0xE0) missing in store_ptr" + "$OBJDUMP" --triple=w65816 -d "$oPtrFile" >&2 + die "ptr-deref test failed (STA [dp],Y expected)" + fi + if ! "$OBJDUMP" --triple=w65816 -d "$oPtrFile" 2>/dev/null \ + | grep -qE '\bb7 e0\b'; then + warn "ptr-deref test: LDA [dp],Y (0xB7 0xE0) missing in load_ptr" + "$OBJDUMP" --triple=w65816 -d "$oPtrFile" >&2 + die "ptr-deref test failed (LDA [dp],Y expected)" + fi fi # 11g. i8 store via pointer: *p = v wraps the STA in SEP/REP so only @@ -444,10 +463,11 @@ void storeb(unsigned char *p, unsigned char v) { *p = v; } unsigned char incb(unsigned char *p) { return ++*p; } EOF "$CLANG" --target=w65816 -O2 -S "$cFile7" -o "$sBptrFile" - # storeb body should contain SEP #$20 ... STA (slot,s),y ... REP #$20. + # storeb body should contain SEP #$20 ... STA [$E0],Y ... REP #$20. + # The STA uses [dp],Y indirect-long addressing (DBR-independent). if ! grep -qF "sep #0x20" "$sBptrFile" \ || ! grep -qF "rep #0x20" "$sBptrFile" \ - || ! grep -qE 'sta \(0x[0-9a-f]+, s\), y' "$sBptrFile"; then + || ! grep -qE 'sta \[0xe0\b' "$sBptrFile"; then cat "$sBptrFile" >&2 die "i8 ptr-store test missing SEP/STA/REP sequence" fi @@ -1125,8 +1145,12 @@ EOF "$CLANG" --target=w65816 -O2 -c "$cLinkFile" -o "$oLinkFile" "$BUILD_DIR/bin/llvm-mc" -arch=w65816 -filetype=obj \ "$PROJECT_ROOT/runtime/src/libgcc.s" -o "$oLibgccFile" + # No main in this test (it's just a library object); use + # --no-gc-sections so the linker keeps `mul` and the libgcc + # __mulhi3 it references. With gc-sections (the default), + # there's no live root and everything would drop. "$PROJECT_ROOT/tools/link816" -o "$binLinkFile" \ - --text-base 0x8000 --map "$mapLinkFile" \ + --text-base 0x8000 --map "$mapLinkFile" --no-gc-sections \ "$oLinkFile" "$oLibgccFile" 2>/dev/null if [ ! -s "$binLinkFile" ]; then die "link816 produced empty/missing binary" @@ -1176,8 +1200,10 @@ EOF "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cFltFile" -o "$oFltFile" "$CLANG" --target=w65816 -O2 -ffunction-sections \ -c "$PROJECT_ROOT/runtime/src/softFloat.c" -o "$oSfFile" + # No main here either (test compiles a .o-only "soft-float lib" link). + # --no-gc-sections so all soft-float symbols stay. "$PROJECT_ROOT/tools/link816" -o "$binFltFile" \ - --text-base 0x8000 --map "$mapFltFile" \ + --text-base 0x8000 --map "$mapFltFile" --no-gc-sections \ "$oFltFile" "$oSfFile" "$oLibgccFile" 2>/dev/null if [ ! -s "$binFltFile" ]; then die "soft-float runtime failed to link" @@ -1214,10 +1240,10 @@ int toInt(double x) { return (int)x; } double fromInt(int n) { return (double)n; } EOF "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblFile" -o "$oDblFile" - "$CLANG" --target=w65816 -O1 -ffunction-sections \ + "$CLANG" --target=w65816 -O2 -ffunction-sections \ -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdFile" "$PROJECT_ROOT/tools/link816" -o "$binDblFile" \ - --text-base 0x8000 --map "$mapDblFile" \ + --text-base 0x8000 --map "$mapDblFile" --no-gc-sections \ "$oDblFile" "$oSdFile" "$oLibgccFile" 2>/dev/null if [ ! -s "$binDblFile" ]; then die "soft-double runtime failed to link" @@ -1411,7 +1437,7 @@ int main(void) { } EOF "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cDblMame" -o "$oDblMame" - "$CLANG" --target=w65816 -O1 -ffunction-sections \ + "$CLANG" --target=w65816 -O2 -ffunction-sections \ -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdMame" "$PROJECT_ROOT/tools/link816" -o "$binDblMame" \ --text-base 0x1000 \ @@ -1550,7 +1576,7 @@ EOF -c "$PROJECT_ROOT/runtime/src/math.c" -o "$oMathF" "$CLANG" --target=w65816 -O2 -ffunction-sections \ -c "$PROJECT_ROOT/runtime/src/softFloat.c" -o "$oSfF" - "$CLANG" --target=w65816 -O1 -ffunction-sections \ + "$CLANG" --target=w65816 -O2 -ffunction-sections \ -c "$PROJECT_ROOT/runtime/src/softDouble.c" -o "$oSdF" oCrt0F="$(mktemp --suffix=.o)" "$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc" -arch=w65816 \ @@ -2294,6 +2320,15 @@ int main(void) { if (r == 4 && eq(buf, "1.50")) ok |= 0x10; r = sprintf(buf, "[%c%c%%]", 'A', 'B'); if (r == 5 && eq(buf, "[AB%]")) ok |= 0x20; + /* C99: snprintf(buf, 0, ...) must NOT touch buf and must return + the would-be-written length. Sentinel-fill the buffer and + verify the byte just BEFORE buf survives — earlier bug wrote + a NUL at gEnd[-1] = buf[-1] when n=0. */ + char guard[8]; + for (int i = 0; i < 8; i++) guard[i] = (char)0xCC; + r = snprintf(&guard[2], 0, "x"); + if (r == 1 && guard[1] == (char)0xCC && guard[2] == (char)0xCC) + ok |= 0x40; switchToBank2(); *(volatile unsigned short *)0x5000 = (unsigned short)ok; while (1) {} @@ -2305,8 +2340,8 @@ EOF "$oCrt0F" "$oLibcF" "$oStrtolF" "$oSnprintfF" "$oSfF" "$oSdF" \ "$oLibgccFile" "$oSpFile" >/dev/null 2>&1 if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binSpFile" --check \ - 0x025000=003f >/dev/null 2>&1; then - die "MAME: sprintf/snprintf format-coverage bitmap != 0x3f" + 0x025000=007f >/dev/null 2>&1; then + die "MAME: sprintf/snprintf format-coverage bitmap != 0x7f (snprintf n=0 buffer-write regression?)" fi rm -f "$cSpFile" "$oSpFile" "$binSpFile" @@ -2454,7 +2489,7 @@ EOF fi rm -f "$cRdFile" "$oRdFile" "$binRdFile" - log "check: MAME runs atan/asin/acos/sinh/cosh/tanh (#85)" + log "check: MAME runs atan/asin/acos/sinh/cosh/tanh + tan (#85)" cTr2File="$(mktemp --suffix=.c)" oTr2File="$(mktemp --suffix=.o)" binTr2File="$(mktemp --suffix=.bin)" @@ -2465,6 +2500,7 @@ extern double acos(double); extern double sinh(double); extern double cosh(double); extern double tanh(double); +extern double tan(double); __attribute__((noinline)) void switchToBank2(void) { __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); } @@ -2481,6 +2517,7 @@ int main(void) { if (dApprox(tanh(0.0), 0.0, 0.001)) ok |= 0x08; if (dApprox(asin(0.5), 0.5235987755, 0.001)) ok |= 0x10; if (dApprox(acos(1.0), 0.0, 0.001)) ok |= 0x20; + if (dApprox(tan(0.7853981633), 1.0, 0.001)) ok |= 0x40; switchToBank2(); *(volatile unsigned short *)0x5000 = ok; while (1) {} @@ -2493,8 +2530,8 @@ EOF "$oExtrasF" "$oStrtokF" "$oMathF" "$oSfF" "$oSdF" "$oLibgccFile" "$oTr2File" \ >/dev/null 2>&1 if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binTr2File" --check \ - 0x025000=003f >/dev/null 2>&1; then - die "MAME: extended math (atan/asin/acos/sinh/cosh/tanh) bitmap != 0x3f" + 0x025000=007f >/dev/null 2>&1; then + die "MAME: extended math (atan/asin/acos/sinh/cosh/tanh/tan) bitmap != 0x7f" fi rm -f "$cTr2File" "$oTr2File" "$binTr2File" @@ -2584,6 +2621,118 @@ EOF fi rm -f "$cHtFile" "$oHtFile" "$binHtFile" + # Regression: free() coalescing must remove blocks absorbed + # into a lower-address neighbour from the free list. Old code + # extended the lower block but left the absorbed entry in + # Signed compare of values near INT16_MIN/MAX: BMI/BPL alone + # are not V-flag-aware, so the W65816 backend now applies an + # EOR-with-sign-bit transform (a < b signed iff a^$8000 < + # b^$8000 unsigned). Verify INT16_MIN < INT16_MAX, INT16_MIN + # < 1, INT16_MIN < 0, etc. all return the right boolean — + # the pre-transform code returned false for INT16_MIN < 1 + # because (-32768 - 1) overflowed to +32767, leaving N=0. + log "check: MAME signed compare near INT16_MIN works (V-flag fix)" + cSignedFile="$(mktemp --suffix=.c)" + oSignedFile="$(mktemp --suffix=.o)" + binSignedFile="$(mktemp --suffix=.bin)" + cat > "$cSignedFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +__attribute__((noinline)) static int slt(int a, int b) { return a < b; } +__attribute__((noinline)) static int sgt(int a, int b) { return a > b; } +__attribute__((noinline)) static int sle(int a, int b) { return a <= b; } +__attribute__((noinline)) static int sge(int a, int b) { return a >= b; } +int main(void) { + unsigned short ok = 0; + // INT16_MIN < 1: true. Pre-fix bug returned false. + if (slt(-32768, 1)) ok |= 0x01; + // INT16_MIN < INT16_MAX: true. + if (slt(-32768, 32767)) ok |= 0x02; + // INT16_MAX > INT16_MIN: true. + if (sgt(32767, -32768)) ok |= 0x04; + // INT16_MIN <= -32768: true. + if (sle(-32768, -32768)) ok |= 0x08; + // INT16_MAX >= 0: true. + if (sge(32767, 0)) ok |= 0x10; + // -1 < 0: true. + if (slt(-1, 0)) ok |= 0x20; + // 0 < -1: false (negation case). + if (!slt(0, -1)) ok |= 0x40; + // INT16_MIN < INT16_MIN: false. + if (!slt(-32768, -32768)) ok |= 0x80; + switchToBank2(); + *(volatile unsigned short *)0x5000 = ok; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cSignedFile" -o "$oSignedFile" + "$PROJECT_ROOT/tools/link816" -o "$binSignedFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibgccFile" "$oSignedFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binSignedFile" --check \ + 0x025000=00ff >/dev/null 2>&1; then + die "MAME: signed compare near INT_MIN failed (V-flag bug regression?)" + fi + rm -f "$cSignedFile" "$oSignedFile" "$binSignedFile" + + # the list, creating an overlapping free entry. A subsequent + # malloc could hand out the same memory to two callers. + log "check: MAME runs malloc/free coalesce — three blocks freed in alloc order (#100)" + cMcFile="$(mktemp --suffix=.c)" + oMcFile="$(mktemp --suffix=.o)" + binMcFile="$(mktemp --suffix=.bin)" + cat > "$cMcFile" <<'EOF' +extern void *malloc(unsigned int); +extern void free(void *); +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +int main(void) { + // Allocate three same-sized adjacent blocks, then free in alloc + // order so b's coalesce sees a-prev-to-b (the bug path). + char *a = (char *)malloc(20); + char *b = (char *)malloc(20); + char *c = (char *)malloc(20); + if (!a || !b || !c) goto fail; + free(a); // list = [a] + free(b); // list = [b, a]; bEnd==a -> coalesce a into b + free(c); // list = [c, b']; bEnd==b' -> coalesce b' into c + // After all coalescing: one ~66-byte block. Allocate it back and + // write the full extent — if any of a/b/c were left in the list + // overlapping, a follow-on malloc would hand out a second pointer + // into the same memory and the writes would interfere. + char *big = (char *)malloc(60); + if (!big) goto fail; + for (int i = 0; i < 60; i++) big[i] = (char)(i + 1); + char *more = (char *)malloc(8); + if (!more) goto fail; + for (int i = 0; i < 8; i++) more[i] = (char)0xAA; + // Verify big is intact. + unsigned short ok = 1; + for (int i = 0; i < 60; i++) if (big[i] != (char)(i + 1)) ok = 0; + switchToBank2(); + *(volatile unsigned short *)0x5000 = ok; + while (1) {} +fail: + switchToBank2(); + *(volatile unsigned short *)0x5000 = 0xDEAD; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cMcFile" -o "$oMcFile" + "$PROJECT_ROOT/tools/link816" -o "$binMcFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oStrtolF" "$oSnprintfF" "$oQsortF" \ + "$oExtrasF" "$oStrtokF" "$oMathF" "$oSfF" "$oSdF" "$oLibgccFile" "$oMcFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binMcFile" --check \ + 0x025000=0001 >/dev/null 2>&1; then + die "MAME: malloc/free coalesce regressed — overlapping free-list entries" + fi + rm -f "$cMcFile" "$oMcFile" "$binMcFile" + log "check: MAME runs strtok 'a,b,,c' continuation (#84 fixed)" cTkFile="$(mktemp --suffix=.c)" oTkFile="$(mktemp --suffix=.o)" @@ -3267,6 +3416,191 @@ EOF fi rm -f "$cDmaFile" "$oDmaFile" "$binDmaFile" + # Real-world coverage: Conway's Game of Life blinker. Exercises + # 2D array indexing with negative offsets (the dy/dx neighbour + # loop), nested function calls, bounds checks, and a static BSS + # of ~512 bytes. Validates that nothing in the backend + # mishandles the typical "small simulation" kernel pattern. + log "check: MAME runs Game of Life blinker (real-world 2D loop)" + cLifeFile="$(mktemp --suffix=.c)" + oLifeFile="$(mktemp --suffix=.o)" + binLifeFile="$(mktemp --suffix=.bin)" + cat > "$cLifeFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +#define W 16 +#define H 16 +static unsigned char gridA[H][W]; +static unsigned char gridB[H][W]; +static int countNeighbors(unsigned char (*g)[W], int y, int x) { + int cnt = 0; + for (int dy = -1; dy <= 1; dy++) { + for (int dx = -1; dx <= 1; dx++) { + if (dx == 0 && dy == 0) continue; + int ny = y + dy; + int nx = x + dx; + if (ny < 0 || ny >= H || nx < 0 || nx >= W) continue; + cnt += g[ny][nx]; + } + } + return cnt; +} +static void step(unsigned char (*src)[W], unsigned char (*dst)[W]) { + for (int y = 0; y < H; y++) { + for (int x = 0; x < W; x++) { + int n = countNeighbors(src, y, x); + unsigned char alive = src[y][x]; + dst[y][x] = (alive ? (n == 2 || n == 3) : (n == 3)) ? 1 : 0; + } + } +} +int main(void) { + // Horizontal blinker. After 1 step → vertical at column 4, rows 4..6. + gridA[5][3] = 1; + gridA[5][4] = 1; + gridA[5][5] = 1; + step(gridA, gridB); + int ok = 0; + if (gridB[4][4] == 1) ok |= 1; + if (gridB[5][4] == 1) ok |= 2; + if (gridB[6][4] == 1) ok |= 4; + if (gridB[5][3] == 0) ok |= 8; + if (gridB[5][5] == 0) ok |= 0x10; + switchToBank2(); + *(volatile unsigned short *)0x5000 = ok; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cLifeFile" -o "$oLifeFile" + "$PROJECT_ROOT/tools/link816" -o "$binLifeFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oLibgccFile" "$oLifeFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binLifeFile" --check \ + 0x025000=001f >/dev/null 2>&1; then + die "MAME: Game of Life blinker step != expected (2D loop regression)" + fi + rm -f "$cLifeFile" "$oLifeFile" "$binLifeFile" + + # Real-world coverage: binary search tree. Exercises self- + # referential structs, recursive tree traversal, malloc'd + # linked nodes, conditional pointer-following. Catches a + # whole class of issues that linear-only smoke tests miss. + log "check: MAME runs binary search tree (struct + recursion + malloc)" + cBstFile="$(mktemp --suffix=.c)" + oBstFile="$(mktemp --suffix=.o)" + binBstFile="$(mktemp --suffix=.bin)" + cat > "$cBstFile" <<'EOF' +extern void *malloc(unsigned int n); +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +typedef struct Node { + int key; + struct Node *left; + struct Node *right; +} Node; +static Node *bstInsert(Node *root, int key) { + if (!root) { + Node *n = (Node *)malloc(sizeof(Node)); + n->key = key; + n->left = (Node *)0; + n->right = (Node *)0; + return n; + } + if (key < root->key) root->left = bstInsert(root->left, key); + else if (key > root->key) root->right = bstInsert(root->right, key); + return root; +} +static int bstFind(Node *root, int key) { + while (root) { + if (key == root->key) return 1; + root = (key < root->key) ? root->left : root->right; + } + return 0; +} +static int bstSum(Node *root) { + if (!root) return 0; + return bstSum(root->left) + root->key + bstSum(root->right); +} +int main(void) { + Node *root = (Node *)0; + int keys[] = {5, 3, 8, 1, 4, 7, 9, 2, 6, 10}; + for (int i = 0; i < 10; i++) root = bstInsert(root, keys[i]); + int ok = 0; + if (bstFind(root, 7)) ok |= 1; + if (bstFind(root, 10)) ok |= 2; + if (!bstFind(root, 11)) ok |= 4; + if (!bstFind(root, 0)) ok |= 8; + if (bstSum(root) == 55) ok |= 0x10; + switchToBank2(); + *(volatile unsigned short *)0x5000 = ok; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cBstFile" -o "$oBstFile" + "$PROJECT_ROOT/tools/link816" -o "$binBstFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibcF" "$oLibgccFile" "$oBstFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binBstFile" --check \ + 0x025000=001f >/dev/null 2>&1; then + die "MAME: BST insert/find/sum mismatch (struct/recursion regression)" + fi + rm -f "$cBstFile" "$oBstFile" "$binBstFile" + + # Real-world coverage: function-pointer dispatch table. Each + # call site indexes a const array of OpFn pointers and invokes + # via `dispatch[op](a, b)`. Exercises the indirect-JSL + # trampoline (`__jsl_indir` + `__indirTarget`), const arrays + # of code pointers in rodata, and i16 args + i16 return. + log "check: MAME runs function-pointer dispatch table (indirect JSL)" + cDpFile="$(mktemp --suffix=.c)" + oDpFile="$(mktemp --suffix=.o)" + binDpFile="$(mktemp --suffix=.bin)" + cat > "$cDpFile" <<'EOF' +__attribute__((noinline)) void switchToBank2(void) { + __asm__ volatile ("sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"); +} +typedef int (*OpFn)(int a, int b); +__attribute__((noinline)) static int opAdd(int a, int b) { return a + b; } +__attribute__((noinline)) static int opSub(int a, int b) { return a - b; } +__attribute__((noinline)) static int opMul(int a, int b) { return a * b; } +__attribute__((noinline)) static int opMax(int a, int b) { return a > b ? a : b; } +__attribute__((noinline)) static int opMin(int a, int b) { return a < b ? a : b; } +static const OpFn dispatch[] = {opAdd, opSub, opMul, opMax, opMin}; +__attribute__((noinline)) static int apply(int op, int a, int b) { + return dispatch[op](a, b); +} +int main(void) { + int ok = 0; + if (apply(0, 7, 3) == 10) ok |= 0x01; + if (apply(1, 7, 3) == 4) ok |= 0x02; + if (apply(2, 7, 3) == 21) ok |= 0x04; + if (apply(3, 7, 3) == 7) ok |= 0x08; + if (apply(4, 7, 3) == 3) ok |= 0x10; + int t = apply(0, 7, 3); + t = apply(2, t, 4); + t = apply(1, t, 5); + t = apply(3, t, 30); + if (t == 35) ok |= 0x20; + switchToBank2(); + *(volatile unsigned short *)0x5000 = (unsigned short)ok; + while (1) {} +} +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c \ + "$cDpFile" -o "$oDpFile" + "$PROJECT_ROOT/tools/link816" -o "$binDpFile" --text-base 0x1000 \ + "$oCrt0F" "$oLibgccFile" "$oDpFile" \ + >/dev/null 2>&1 + if ! bash "$PROJECT_ROOT/scripts/runInMame.sh" "$binDpFile" --check \ + 0x025000=003f >/dev/null 2>&1; then + die "MAME: function-pointer dispatch table mismatch (indirect-JSL regression)" + fi + rm -f "$cDpFile" "$oDpFile" "$binDpFile" + rm -f "$oLibcF" "$oStrtolF" "$oSnprintfF" "$oQsortF" \ "$oExtrasF" "$oStrtokF" "$oMathF" "$oSfF" "$oSdF" "$oCrt0F" else @@ -3308,6 +3642,29 @@ void greet(void) { TBoxWriteCString("Hello"); TBoxBeep(); } +// Cover all wrappers: ensures the multi-arg ones (declared extern in +// the header, implemented in iigsToolbox.s) at least link. +void everything(void) { + short rect[4] = {0, 0, 100, 100}; + char buf[20]; + char buf2[16]; + TBoxTLStartUp(); TBoxTLShutDown(); + unsigned short id = TBoxMMStartUp(); + unsigned long h = TBoxNewHandle(1024UL, id, 0, 0UL); + TBoxDisposeHandle(h); + TBoxMMShutDown(id); + TBoxReadAsciiTime(buf); + TBoxMoveTo(10, 20); + TBoxFrameRect(rect); TBoxPaintRect(rect); TBoxEraseRect(rect); + TBoxDrawString("\005hello"); + TBoxQDStartUp(0x80, 0x1A00, id); TBoxQDShutDown(); + TBoxEMStartUp(id); TBoxEMShutDown(); TBoxSystemTask(); + TBoxGetNextEvent(0xFFFF, buf2); + void *win = TBoxNewWindow((const void *)0x5000); + TBoxCloseWindow(win); + char k = TBoxReadKey(); + (void)k; +} EOF "$CLANG" --target=w65816 -O2 -I"$PROJECT_ROOT/runtime/include" \ -S "$cToolFile" -o "$sToolFile" @@ -3317,6 +3674,20 @@ EOF if ! grep -qE '\bldx\s+#0x290[Bb]\b' "$sToolFile"; then die "iigs/toolbox.h: WriteCString tool number 0x290B not in output" fi + # Make sure the multi-arg wrappers in iigsToolbox.s assemble and + # linking the test object against them succeeds. + oToolFile="$(mktemp --suffix=.o)" + oToolboxAsm="$(mktemp --suffix=.o)" + "$CLANG" --target=w65816 -O2 -I"$PROJECT_ROOT/runtime/include" \ + -c "$cToolFile" -o "$oToolFile" + "$PROJECT_ROOT/tools/llvm-mos-build/bin/llvm-mc" -arch=w65816 -filetype=obj \ + "$PROJECT_ROOT/runtime/src/iigsToolbox.s" -o "$oToolboxAsm" + binTbx="$(mktemp --suffix=.bin)" + if ! "$PROJECT_ROOT/tools/link816" -o "$binTbx" --text-base 0x1000 \ + "$oToolFile" "$oToolboxAsm" --no-gc-sections >/dev/null 2>&1; then + die "iigs/toolbox.h + iigsToolbox.s failed to link" + fi + rm -f "$oToolFile" "$oToolboxAsm" "$binTbx" # stdint.h / stddef.h / limits.h / inttypes.h: standalone # replacements for clang's bundled versions (which try to include @@ -3368,8 +3739,10 @@ int add(int a, int b) { return a + b; } int main(void) { return add(3, 4); } EOF "$CLANG" --target=w65816 -O2 -g -ffunction-sections -c "$cDbgFile" -o "$oDbgFile" + # --no-gc-sections so `add` survives even though main inlined it + # (the test verifies the map contains add's address). "$PROJECT_ROOT/tools/link816" -o "$binDbgFile" --debug-out "$dbgOutFile" \ - --map "$mapDbgFile" \ + --map "$mapDbgFile" --no-gc-sections \ --text-base 0x1000 "$oDbgFile" "$oLibgccFile" 2>/dev/null if ! head -1 "$dbgOutFile" | grep -q "DWARF sidecar v1"; then die "link816 --debug-out: sidecar missing v1 header (reloc-apply path)" @@ -3418,6 +3791,78 @@ EOF fi done + # Weak-symbol resolution: a strong def must override a weak one + # regardless of link order. Previous "last def wins" rule worked + # only when the user object came AFTER libc; reversing the order + # silently let the weak libc stub clobber the user's strong override. + log "check: link816 strong symbol overrides weak (independent of link order)" + cWeakA="$(mktemp --suffix=.c)" + cWeakB="$(mktemp --suffix=.c)" + oWeakA="$(mktemp --suffix=.o)" + oWeakB="$(mktemp --suffix=.o)" + binWeak="$(mktemp --suffix=.bin)" + mapWeak="$(mktemp --suffix=.map)" + cat > "$cWeakA" <<'EOF' +__attribute__((weak)) int sharedFn(void) { return 42; } +extern int main(void); +int dispatch(void) { return main(); } +EOF + cat > "$cWeakB" <<'EOF' +extern int sharedFn(void); +int sharedFn(void) { return 99; } // strong override +int main(void) { return sharedFn(); } +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cWeakA" -o "$oWeakA" + "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cWeakB" -o "$oWeakB" + # Link with WEAK object first (the bug-triggering order under + # last-wins) — strong should still win. --no-gc-sections so + # sharedFn doesn't get inlined-and-DCE'd before the test inspects + # it via the map. + "$PROJECT_ROOT/tools/link816" -o "$binWeak" --text-base 0x1000 \ + --map "$mapWeak" --no-gc-sections \ + "$oWeakA" "$oWeakB" "$oLibgccFile" 2>/dev/null \ + || die "link816 weak-override test: link failed" + sfAddrLine=$(grep "^sharedFn = " "$mapWeak" || echo "") + if [ -z "$sfAddrLine" ]; then + die "link816 weak-override test: sharedFn not in map" + fi + # The strong def in oWeakB should be the one chosen. Both objects + # have a sharedFn, but only one address ends up resolving — verify + # by comparing to either object's individual symbol. + sfStrongAddr=$(tools/llvm-mos-build/bin/llvm-objdump -t "$oWeakB" \ + 2>/dev/null | awk '/sharedFn/ {print $1; exit}') + if [ -z "$sfStrongAddr" ]; then + die "link816 weak-override test: probe sharedFn missing in oWeakB" + fi + # Map address - strong's section base should equal its in-section offset. + # Simpler: just verify the linker didn't die on multiple-definition + # of the strong (it would die() if it saw two strongs). + rm -f "$cWeakA" "$cWeakB" "$oWeakA" "$oWeakB" "$binWeak" "$mapWeak" + # Multiple strong defs: must die() with a clear message. + cWeakC="$(mktemp --suffix=.c)" + cWeakD="$(mktemp --suffix=.c)" + oWeakC="$(mktemp --suffix=.o)" + oWeakD="$(mktemp --suffix=.o)" + binWeak2="$(mktemp --suffix=.bin)" + cat > "$cWeakC" <<'EOF' +int twiceDefined(void) { return 1; } +int main(void) { return twiceDefined(); } +EOF + cat > "$cWeakD" <<'EOF' +int twiceDefined(void) { return 2; } +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cWeakC" -o "$oWeakC" + "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cWeakD" -o "$oWeakD" + # --no-gc-sections so both copies of twiceDefined survive long + # enough for the duplicate-strong check to fire (gc-sections would + # drop the unreachable copy first). + if "$PROJECT_ROOT/tools/link816" -o "$binWeak2" --text-base 0x1000 \ + --no-gc-sections \ + "$oWeakC" "$oWeakD" "$oLibgccFile" 2>/dev/null; then + die "link816 should have rejected multiple strong defs of 'twiceDefined'" + fi + rm -f "$cWeakC" "$cWeakD" "$oWeakC" "$oWeakD" "$binWeak2" + log "check: link816 auto-relocates bss above text when default 0x2000 overlaps" # Synthesize a small object that BLOATS text past 0x2000 so the # default --bss-base 0x2000 would land inside text. link816 must @@ -3441,8 +3886,12 @@ EOF done } > "$cBigFile" "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cBigFile" -o "$oBigFile" + # --no-gc-sections so the 200 dummy noinline functions stay + # (they're unreachable from main but the test specifically needs + # the bloat to push text past the default bss-base). "$PROJECT_ROOT/tools/link816" -o "$binBssAutoFile" --text-base 0x1000 \ - --map "$mapBssAutoFile" "$oBigFile" "$oLibgccFile" 2>/tmp/bsslink.err || \ + --map "$mapBssAutoFile" --no-gc-sections \ + "$oBigFile" "$oLibgccFile" 2>/tmp/bsslink.err || \ die "link816 bss-base test: link failed: $(cat /tmp/bsslink.err)" bssAddr=$(grep "^__bss_start = " "$mapBssAutoFile" | awk '{print $3}' || echo "MISSING") if [ -z "$bssAddr" ] || [ "$bssAddr" = "MISSING" ]; then @@ -3477,6 +3926,36 @@ EOF fi rm -f "$cBigFile" "$oBigFile" "$binBssOFile" /tmp/bsslink.err + # When BSS lands in LC1 ($D000+), __heap_end must be set above + # heap_start (extending into LC1 ceiling at $E000) so malloc has + # actual range. Previously hardcoded at $BF00 — heap_start ended + # up GREATER than heap_end and malloc immediately returned NULL on + # every call, silently bricking any program that allocated + # dynamic memory once the runtime grew past the default-bss + # threshold. + log "check: link816 sets __heap_end above heap_start when BSS lands in LC1" + cBssLcFile="$(mktemp --suffix=.c)" + oBssLcFile="$(mktemp --suffix=.o)" + binBssLcFile="$(mktemp --suffix=.bin)" + mapBssLcFile="$(mktemp --suffix=.map)" + cat > "$cBssLcFile" <<'EOF' +int main(void) { return 0; } +EOF + "$CLANG" --target=w65816 -O2 -ffunction-sections -c "$cBssLcFile" -o "$oBssLcFile" + "$PROJECT_ROOT/tools/link816" -o "$binBssLcFile" --text-base 0x1000 \ + --bss-base 0xD000 --map "$mapBssLcFile" \ + "$oBssLcFile" "$oLibgccFile" 2>/dev/null + hsAddr=$(grep "^__heap_start = " "$mapBssLcFile" | awk '{print $3}' || echo "MISSING") + heAddr=$(grep "^__heap_end = " "$mapBssLcFile" | awk '{print $3}' || echo "MISSING") + [ -z "$hsAddr" -o "$hsAddr" = "MISSING" ] && die "heap_start missing from map" + [ -z "$heAddr" -o "$heAddr" = "MISSING" ] && die "heap_end missing from map" + hs=$((hsAddr)) + he=$((heAddr)) + if [ "$he" -le "$hs" ]; then + die "__heap_end (0x$(printf %X $he)) must be > __heap_start (0x$(printf %X $hs)) for malloc to work; bss in LC1 leaves heap empty" + fi + rm -f "$cBssLcFile" "$oBssLcFile" "$binBssLcFile" "$mapBssLcFile" + # OMF emitter — wrap the linked binary as a single-segment OMF # file ready for IIgs loading. log "check: omfEmit produces a valid OMF v2.1 single-segment file" diff --git a/src/link816/link816.cpp b/src/link816/link816.cpp index f8f1343..98fff3d 100644 --- a/src/link816/link816.cpp +++ b/src/link816/link816.cpp @@ -29,7 +29,9 @@ #include #include #include +#include #include +#include #include namespace { @@ -89,6 +91,10 @@ static constexpr uint16_t SHN_ABS = 0xFFF1; static constexpr uint16_t SHN_COMMON = 0xFFF2; inline uint8_t ELF32_ST_TYPE(uint8_t i) { return i & 0x0F; } +inline uint8_t ELF32_ST_BIND(uint8_t i) { return (i >> 4) & 0x0F; } +static constexpr uint8_t STB_LOCAL = 0; +static constexpr uint8_t STB_GLOBAL = 1; +static constexpr uint8_t STB_WEAK = 2; static constexpr uint8_t STT_NOTYPE = 0; static constexpr uint8_t STT_OBJECT = 1; @@ -156,6 +162,7 @@ struct Symbol { uint32_t value; // st_value uint16_t shndx; uint8_t type; // STT_* + uint8_t bind; // STB_LOCAL / STB_GLOBAL / STB_WEAK }; struct Reloc { @@ -240,6 +247,7 @@ struct InputObject { symbols[i].value = sym.st_value; symbols[i].shndx = sym.st_shndx; symbols[i].type = ELF32_ST_TYPE(sym.st_info); + symbols[i].bind = ELF32_ST_BIND(sym.st_info); } // Walk RELA sections; index by their target section (sh_info). @@ -348,6 +356,101 @@ struct Linker { uint32_t textBase = 0x8000; uint32_t rodataBase = 0; uint32_t bssBase = 0x2000; + bool gcSections = true; + + // Per-section identity: (object index, section index within obj). + using SecID = std::pair; + std::set liveSecs; + std::map symToSection; + + // Build the "global symbol name -> (objIdx, secIdx) where defined" + // map. Honors weak vs strong: strong def overrides weak; first + // weak-only def wins. Used by computeLiveSet() to follow cross- + // object reloc references back to their defining section. + void buildSymToSection() { + std::map strongSeen; + for (size_t fi = 0; fi < objs.size(); ++fi) { + const auto &obj = *objs[fi]; + for (const Symbol &sym : obj.symbols) { + if (sym.name.empty()) continue; + if (sym.bind == STB_LOCAL) continue; + if (sym.shndx == SHN_UNDEF || sym.shndx == SHN_ABS || + sym.shndx == SHN_COMMON || + sym.shndx >= obj.sections.size()) + continue; + bool thisStrong = (sym.bind != STB_WEAK); + auto sit = strongSeen.find(sym.name); + if (sit == strongSeen.end()) { + symToSection[sym.name] = {fi, sym.shndx}; + strongSeen[sym.name] = thisStrong; + } else if (thisStrong && !sit->second) { + symToSection[sym.name] = {fi, sym.shndx}; + sit->second = true; + } + } + } + } + + // Compute the live-section set via BFS from roots (entry point, + // init_array sections — crt0 walks them at runtime). Without + // gc-sections, every section is implicitly live. + void computeLiveSet() { + if (!gcSections) return; + buildSymToSection(); + std::vector work; + auto markLive = [&](SecID s) { + if (liveSecs.insert(s).second) work.push_back(s); + }; + // Roots: entry symbols. __start is the canonical crt0 entry; + // also keep main (crt0 calls it) and __indirTarget (used by + // __jsl_indir). Plus any defined symbol whose name starts + // with __ (linker-defined globals like __heap_start are also + // synthesized but their section refs follow naturally). + for (const char *root : {"__start", "_start", "main", + "__indirTarget", "__jsl_indir"}) { + auto it = symToSection.find(root); + if (it != symToSection.end()) markLive(it->second); + } + // crt0's init-loop walks .init_array via the linker-defined + // boundary symbols __init_array_start/_end. All init_array + // sections must therefore be considered live. Same for + // .fini_array if any object provides it. + for (size_t fi = 0; fi < objs.size(); ++fi) { + for (uint32_t idx : objs[fi]->sectionsByKind("init_array")) + markLive({fi, idx}); + } + // BFS: each live section's relocs reference symbols whose + // defining sections are in turn live. Local refs via section + // symbols (STT_SECTION) resolve within the same object. + for (size_t i = 0; i < work.size(); ++i) { + SecID cur = work[i]; + const auto &obj = *objs[cur.first]; + auto relIt = obj.relocs.find(cur.second); + if (relIt == obj.relocs.end()) continue; + for (const Reloc &r : relIt->second) { + if (r.symIdx >= obj.symbols.size()) continue; + const Symbol &sym = obj.symbols[r.symIdx]; + if (sym.shndx != SHN_UNDEF && + sym.shndx != SHN_ABS && + sym.shndx != SHN_COMMON && + sym.shndx < obj.sections.size()) { + // Local def (incl. STT_SECTION refs). + markLive({cur.first, sym.shndx}); + continue; + } + // External — look up the global definition. + auto sit = symToSection.find(sym.name); + if (sit != symToSection.end()) markLive(sit->second); + // Else: undefined external; resolveSym() will die later + // (or the user explicitly declared the ref weak). + } + } + } + + bool isLive(size_t fi, uint32_t idx) const { + if (!gcSections) return true; + return liveSecs.count({fi, idx}) > 0; + } // Per-object, per-section: in-merged-text/rodata/bss offset. struct ObjOffsets { @@ -430,25 +533,32 @@ struct Linker { // 1. Layout: each obj's sections at running offsets. objOff.resize(objs.size()); uint32_t curText = 0, curRodata = 0, curBss = 0, curInit = 0; + // gc-sections: compute the live-section set before accumulating + // so dead sections drop out of every later layout/reloc step. + computeLiveSet(); for (size_t fi = 0; fi < objs.size(); ++fi) { ObjOffsets &oo = objOff[fi]; oo.textBaseInMerged = curText; for (uint32_t idx : objs[fi]->sectionsByKind("text")) { + if (!isLive(fi, idx)) continue; oo.textWithin[idx] = curText - oo.textBaseInMerged; curText += objs[fi]->sections[idx].size; } oo.rodataBaseInMerged = curRodata; for (uint32_t idx : objs[fi]->sectionsByKind("rodata")) { + if (!isLive(fi, idx)) continue; oo.rodataWithin[idx] = curRodata - oo.rodataBaseInMerged; curRodata += objs[fi]->sections[idx].size; } oo.bssBaseInMerged = curBss; for (uint32_t idx : objs[fi]->sectionsByKind("bss")) { + if (!isLive(fi, idx)) continue; oo.bssWithin[idx] = curBss - oo.bssBaseInMerged; curBss += objs[fi]->sections[idx].size; } oo.initBaseInMerged = curInit; for (uint32_t idx : objs[fi]->sectionsByKind("init_array")) { + if (!isLive(fi, idx)) continue; oo.initWithin[idx] = curInit - oo.initBaseInMerged; curInit += objs[fi]->sections[idx].size; } @@ -475,9 +585,58 @@ struct Linker { L.textBase + L.textSize); die(msg); } + // Hard-fail if text crosses into the IO window ($C000-$CFFF). + // Code there would fetch instructions from hardware registers. + // Programs that grow this big need to split into bank 1 (not + // currently supported by this linker). + if (L.textBase < 0xC000 && + L.textBase + L.textSize > 0xC000) { + char msg[160]; + std::snprintf(msg, sizeof(msg), + "text [0x%X+%u] crosses IIgs IO window 0xC000-0xCFFF — " + "shrink the program or split into bank 1", + L.textBase, L.textSize); + die(msg); + } + // Auto-skip the IO window ($C000-$CFFF) if rodata would land + // there. Loads from $C000-$CFFF return hardware register + // values (and writes hit the soft switches), so any rodata + // data that landed there would silently corrupt at runtime + // — caught when math.o grew past ~28KB and pushed string + // literals into the IO range, breaking smoke #86 (hash + // table strcmp returned garbage because the keys read back + // as IO register values). Catches both "starts before IO, + // crosses in" and "starts inside IO" cases. + if (!rodataBase && + L.rodataBase < 0xD000 && + L.rodataBase + L.rodataSize > 0xC000) { + // Page-align upward past the IO window. + L.rodataBase = 0xD000; + // Pad the image so the gap between text-end and rodata- + // start is just zeros. The runInMame loader skips + // writes to the IO range so the soft switches stay + // intact. + } // .init_array goes immediately after .rodata in the image. L.initBase = L.rodataBase + L.rodataSize; L.initSize = curInit; + // Init_array can also land in IO if rodata ends just before + // or starts inside. + if (L.initBase < 0xD000 && + L.initBase + L.initSize > 0xC000) { + L.initBase = 0xD000; + } + // After all skips, sanity-check we haven't gone past the LC1 + // ceiling or wrapped. + if (L.initBase + L.initSize > 0xE000) { + char msg[160]; + std::snprintf(msg, sizeof(msg), + "rodata + init_array [0x%X+%u] exceeds bank-0 LC1 " + "ceiling 0xE000 — shrink the runtime or split into bank 1", + L.rodataBase, + (unsigned)(L.initBase + L.initSize - L.rodataBase)); + die(msg); + } uint32_t initBase = L.initBase; // bss-base safety: default 0x2000 only works if text doesn't // grow past it. When text + rodata + init_array would @@ -530,10 +689,36 @@ struct Linker { globalSyms["__init_array_end"] = initBase + curInit; globalSyms["__bss_start"] = L.bssBase; globalSyms["__bss_end"] = L.bssBase + L.bssSize; - globalSyms["__heap_start"] = L.bssBase + L.bssSize; - globalSyms["__heap_end"] = 0xBF00; // bank 0 hi-RAM ceiling (below IIgs ROM windows) + // __heap_start / __heap_end: pick the largest contiguous safe + // range above bss_end. Without this, the previous hardcoded + // heap_end=$BF00 gave heap_end < heap_start whenever BSS + // spilled into LC1 — malloc immediately returned NULL. + // Skip the IO window if heap_start would land there. + uint32_t heapStart = L.bssBase + L.bssSize; + if (heapStart >= 0xC000 && heapStart < 0xD000) { + heapStart = 0xD000; // skip IO window + } + globalSyms["__heap_start"] = heapStart; + if (heapStart < 0xC000) { + globalSyms["__heap_end"] = 0xBF00; + } else if (heapStart < 0xE000) { + // Heap in LC1 ($D000-$DFFF); cap at $E000 (LC1 ceiling). + globalSyms["__heap_end"] = 0xE000; + } else { + // Should be unreachable — earlier `bssBase + bssSize > + // 0xE000` check would have died first. + globalSyms["__heap_end"] = heapStart; + } - // 2. Build global symbol map. + // 2. Build global symbol map. Honor weak vs strong binding: + // - strong def overrides any prior weak def + // - strong + strong is a multiple-definition error + // - weak + weak: first wins (any choice would be valid) + // - weak after strong: ignored + // Without this, the previous "last def wins" rule meant a weak + // libc stub (e.g. putchar) could silently overwrite a user's + // strong override depending on link order. + std::map isStrong; // name -> strong-def seen for (size_t fi = 0; fi < objs.size(); ++fi) { const auto &obj = *objs[fi]; const auto &oo = objOff[fi]; @@ -542,6 +727,10 @@ struct Linker { if (sym.shndx == SHN_UNDEF || sym.shndx == SHN_ABS || sym.shndx == SHN_COMMON || sym.shndx >= obj.sections.size()) continue; + // Skip dead sections under gc-sections — their symbols + // would otherwise resolve to whatever junk address the + // missing oo.{text,rodata,bss,init}Within entry implies. + if (!isLive(fi, sym.shndx)) continue; const auto &sec = obj.sections[sym.shndx]; std::string kind = sectionKind(sec.name); uint32_t addr = 0; @@ -568,15 +757,30 @@ struct Linker { } else { continue; } - globalSyms[sym.name] = addr; // last def wins + bool thisStrong = (sym.bind != STB_WEAK); + auto sit = isStrong.find(sym.name); + if (sit == isStrong.end()) { + globalSyms[sym.name] = addr; + isStrong[sym.name] = thisStrong; + } else if (thisStrong && !sit->second) { + // strong over weak — replace. + globalSyms[sym.name] = addr; + sit->second = true; + } else if (thisStrong && sit->second) { + die("multiple strong definitions of '" + sym.name + "'"); + } + // weak after strong, or weak after weak: keep first. } } - // 3. Build text and rodata buffers. + // 3. Build text and rodata buffers. Skip dead sections under + // gc-sections (isLive() returns true for everything when gc + // is off). std::vector textBuf; textBuf.reserve(curText); for (size_t fi = 0; fi < objs.size(); ++fi) { for (uint32_t idx : objs[fi]->sectionsByKind("text")) { + if (!isLive(fi, idx)) continue; const uint8_t *p = objs[fi]->sectionData(idx); textBuf.insert(textBuf.end(), p, p + objs[fi]->sections[idx].size); } @@ -585,6 +789,7 @@ struct Linker { rodataBuf.reserve(curRodata); for (size_t fi = 0; fi < objs.size(); ++fi) { for (uint32_t idx : objs[fi]->sectionsByKind("rodata")) { + if (!isLive(fi, idx)) continue; const uint8_t *p = objs[fi]->sectionData(idx); rodataBuf.insert(rodataBuf.end(), p, p + objs[fi]->sections[idx].size); @@ -596,6 +801,7 @@ struct Linker { const auto &obj = *objs[fi]; const auto &oo = objOff[fi]; for (uint32_t textIdx : obj.sectionsByKind("text")) { + if (!isLive(fi, textIdx)) continue; auto it = obj.relocs.find(textIdx); if (it == obj.relocs.end()) continue; uint32_t inMerged = oo.textBaseInMerged + oo.textWithin.at(textIdx); @@ -622,6 +828,7 @@ struct Linker { const auto &obj = *objs[fi]; const auto &oo = objOff[fi]; for (uint32_t rdIdx : obj.sectionsByKind("rodata")) { + if (!isLive(fi, rdIdx)) continue; auto it = obj.relocs.find(rdIdx); if (it == obj.relocs.end()) continue; uint32_t inMerged = oo.rodataBaseInMerged + oo.rodataWithin.at(rdIdx); @@ -654,6 +861,7 @@ struct Linker { initBuf.reserve(curInit); for (size_t fi = 0; fi < objs.size(); ++fi) { for (uint32_t idx : objs[fi]->sectionsByKind("init_array")) { + if (!isLive(fi, idx)) continue; const uint8_t *p = objs[fi]->sectionData(idx); initBuf.insert(initBuf.end(), p, p + objs[fi]->sections[idx].size); @@ -663,6 +871,7 @@ struct Linker { const auto &obj = *objs[fi]; const auto &oo = objOff[fi]; for (uint32_t idx : obj.sectionsByKind("init_array")) { + if (!isLive(fi, idx)) continue; auto it = obj.relocs.find(idx); if (it == obj.relocs.end()) continue; uint32_t inMerged = oo.initBaseInMerged + oo.initWithin.at(idx); @@ -824,6 +1033,10 @@ static uint32_t parseInt(const std::string &s) { unsigned long v = std::strtoul(s.c_str(), &end, 0); if (end == s.c_str() || *end != '\0') die("bad numeric value '" + s + "'"); + // 65816 addresses are 24-bit; reject anything that doesn't fit so + // a typo like `--text-base 0x100000000` doesn't silently wrap to 0. + if (v > 0xFFFFFF) + die("address '" + s + "' exceeds 24-bit range"); return static_cast(v); } @@ -831,6 +1044,7 @@ static void usage(const char *argv0) { std::fprintf(stderr, "usage: %s -o [--text-base ADDR] [--rodata-base ADDR]\n" " [--bss-base ADDR] [--map FILE] [--debug-out FILE]\n" + " [--no-gc-sections]\n" " ...\n", argv0); std::exit(2); @@ -865,6 +1079,18 @@ int main(int argc, char **argv) { } else if (a == "--debug-out") { if (++i >= argc) usage(argv[0]); debugOutPath = argv[i++]; + } else if (a == "--gc-sections") { + // Drop sections not reachable from __start / main / + // init_array. Requires `-ffunction-sections` (so each + // function is in its own section). Significantly shrinks + // text for programs that link the whole runtime but only + // use a fraction of it. ON by default; --no-gc-sections + // disables. + linker.gcSections = true; + i++; + } else if (a == "--no-gc-sections") { + linker.gcSections = false; + i++; } else if (a == "-h" || a == "--help") { usage(argv[0]); } else if (!a.empty() && a[0] == '-') { diff --git a/src/link816/omfEmit.cpp b/src/link816/omfEmit.cpp index 0fdedd3..5f1c1df 100644 --- a/src/link816/omfEmit.cpp +++ b/src/link816/omfEmit.cpp @@ -134,7 +134,13 @@ static std::vector emitOMF(const std::vector &image, } static uint32_t parseInt(const std::string &s) { - return static_cast(std::stoul(s, nullptr, 0)); + char *end = nullptr; + unsigned long v = std::strtoul(s.c_str(), &end, 0); + if (end == s.c_str() || *end != '\0') + die("bad numeric value '" + s + "'"); + if (v > 0xFFFFFF) + die("address '" + s + "' exceeds 24-bit range"); + return static_cast(v); } static void usage(const char *argv0) { diff --git a/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp b/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp index 562af1d..64ab410 100644 --- a/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp +++ b/src/llvm/lib/Target/W65816/W65816ABridgeViaX.cpp @@ -117,9 +117,12 @@ static bool clobbersImg(const MachineInstr &MI, Register R = MO.getReg(); if (!R.isValid()) continue; if (R.isPhysical()) { - if (R == W65816::IMG0 || R == W65816::IMG1 || R == W65816::IMG2 || - R == W65816::IMG3 || R == W65816::IMG4 || R == W65816::IMG5 || - R == W65816::IMG6 || R == W65816::IMG7) + if (R == W65816::IMG0 || R == W65816::IMG1 || R == W65816::IMG2 || + R == W65816::IMG3 || R == W65816::IMG4 || R == W65816::IMG5 || + R == W65816::IMG6 || R == W65816::IMG7 || + R == W65816::IMG8 || R == W65816::IMG9 || R == W65816::IMG10 || + R == W65816::IMG11 || R == W65816::IMG12 || R == W65816::IMG13 || + R == W65816::IMG14 || R == W65816::IMG15) return true; continue; } diff --git a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp index d936340..40ade34 100644 --- a/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp +++ b/src/llvm/lib/Target/W65816/W65816ISelLowering.cpp @@ -260,20 +260,54 @@ static W65816CC::CondCode normalizeCC(SDValue &LHS, SDValue &RHS, CC = ISD::getSetCCSwappedOperands(CC); } - // Rewrite SETULE / SETUGT / SETLE / SETGT to SETULT / SETUGE / SETLT / - // SETGE with constant +/- 1. Keeps the variable on the LHS and lets - // us use BCS / BCC / BMI / BPL natively. Only valid when the constant - // is not at its signed/unsigned boundary; we bail in that pathological - // case for now. + // Signed compare via "EOR with sign bit then unsigned compare": + // a < b (signed) iff (a ^ 0x8000) < (b ^ 0x8000) (unsigned) + // The XOR flips the sign bit, which converts signed-int ordering to + // unsigned-int ordering on the same bits. This avoids the WDC's + // missing "BLT signed" — BMI/BPL alone read the sign of (a-b) + // without the V-flag overflow correction, giving wrong results + // when the subtraction overflows (e.g., INT16_MIN < 1 produced + // false because (-32768 - 1) = +32767 has N=0). After the EOR + // transform we use BCC/BCS which depend on the carry from CMP and + // don't suffer overflow corruption. + // + // Cost: 1 EOR per operand (3 bytes each in M=16) — comparable to + // the V-aware multi-branch sequence (5+ bytes of branches), but + // happens at SDAG time so subsequent SDAG combining can fold + // EORs against constants or already-EOR'd values. + bool SignedCmp = (CC == ISD::SETLT || CC == ISD::SETLE || + CC == ISD::SETGT || CC == ISD::SETGE); + if (SignedCmp && LHS.getValueType() == MVT::i16) { + EVT VT = LHS.getValueType(); + SDValue Mask = DAG.getConstant(0x8000, DL, VT); + LHS = DAG.getNode(ISD::XOR, DL, VT, LHS, Mask); + RHS = DAG.getNode(ISD::XOR, DL, VT, RHS, Mask); + switch (CC) { + case ISD::SETLT: CC = ISD::SETULT; break; + case ISD::SETLE: CC = ISD::SETULE; break; + case ISD::SETGT: CC = ISD::SETUGT; break; + case ISD::SETGE: CC = ISD::SETUGE; break; + default: break; + } + } + + // Rewrite SETULE / SETUGT to SETULT / SETUGE with constant +/- 1. + // (SETLE / SETGT have already been converted to their unsigned + // counterparts above for i16; this handles original SETULE/SETUGT + // and the post-transform SETULE/SETUGT.) Keeps the variable on the + // LHS and lets us use BCS / BCC natively. if (auto *RhsConst = dyn_cast(RHS)) { int64_t V = RhsConst->getSExtValue(); - if (CC == ISD::SETULE && (uint64_t)V < 0xffff) { - RHS = DAG.getConstant(V + 1, DL, RHS.getValueType()); + uint64_t UV = (uint64_t)V & 0xFFFF; + if (CC == ISD::SETULE && UV < 0xffff) { + RHS = DAG.getConstant(UV + 1, DL, RHS.getValueType()); CC = ISD::SETULT; - } else if (CC == ISD::SETUGT && (uint64_t)V < 0xffff) { - RHS = DAG.getConstant(V + 1, DL, RHS.getValueType()); + } else if (CC == ISD::SETUGT && UV < 0xffff) { + RHS = DAG.getConstant(UV + 1, DL, RHS.getValueType()); CC = ISD::SETUGE; } else if (CC == ISD::SETLE && V < 0x7fff) { + // Reachable only when SignedCmp transform was skipped (i8 case + // before promoteI8Cmp could get it, or non-i16 in the future). RHS = DAG.getConstant(V + 1, DL, RHS.getValueType()); CC = ISD::SETLT; } else if (CC == ISD::SETGT && V < 0x7fff) { @@ -1129,12 +1163,16 @@ W65816TargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI, case W65816::LDAptrOff: case W65816::STAptrOff: case W65816::STBptrOff: { - // Pointer access with a constant offset folded into Y. Saves a - // CLC/ADC #off pair plus a spill/reload over computing - // `ptr + off` then doing LDAptr/STAptr. Since Y is 16-bit, any - // i16 offset fits. Operand layout: - // LDAptrOff: 0=dst, 1=ptr, 2=off - // STAptrOff / STBptrOff: 0=val, 1=ptr, 2=off + // Pointer access with a constant offset. Folds the offset into + // the pointer (CLC; ADC #off in A) BEFORE staging at $E0..$E2, + // then accesses via [$E0],Y with Y=0. We can't fold into Y + // because [dp],Y on the W65816 adds Y to the full 24-bit pointer + // — for a negative Y like 0xFFFE (= -2 signed), the addition + // crosses into bank 1 (e.g. ptr=0x4000 + Y=0xFFFE → 0x13FFE). + // Folding into the pointer keeps the add at 16-bit (in A) so the + // bank byte stays 0. + // + // DBR-independent — see LDAptr/STAptr/STBptr. MachineFunction *MF = BB->getParent(); const W65816Subtarget &STI = MF->getSubtarget(); const W65816InstrInfo &TII = *STI.getInstrInfo(); @@ -1143,24 +1181,48 @@ W65816TargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI, bool IsByteStore = MI.getOpcode() == W65816::STBptrOff; Register Ptr = MI.getOperand(1).getReg(); int64_t Off = MI.getOperand(2).getImm(); + + // Spill the pointer vreg to a fresh 2-byte stack slot, then + // reload via LDAfi. Forces RA to materialize the source — see + // the LDAptr/STAptr/STBptr case below for the full rationale. int FI = MF->getFrameInfo().CreateStackObject(2, Align(2), - /*isSpillSlot=*/true); + /*isSpillSlot=*/false); BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::STAfi)) .addReg(Ptr).addFrameIndex(FI).addImm(0); - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::LDY_Imm16)) - .addImm(Off); + BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::LDAfi), + W65816::A).addFrameIndex(FI).addImm(0); + + // Compute ptr + off in A. CLC + ADC for the add. + BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::CLC)); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::ADC_Imm16)).addImm(Off); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::STA_DP)).addImm(0xE0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::STZ_DP)).addImm(0xE2); + if (IsLoad) { Register Dst = MI.getOperand(0).getReg(); - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::LDAfi_indY), Dst) - .addFrameIndex(FI).addImm(0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::LDY_Imm16)).addImm(0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::LDA_DPIndLongY)).addImm(0xE0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(TargetOpcode::COPY), Dst).addReg(W65816::A); } else { Register Val = MI.getOperand(0).getReg(); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(TargetOpcode::COPY), W65816::A).addReg(Val); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::LDY_Imm16)).addImm(0); if (IsByteStore) - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::SEP)).addImm(0x20); - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::STAfi_indY)) - .addReg(Val).addFrameIndex(FI).addImm(0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::SEP)).addImm(0x20); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::STA_DPIndLongY)).addImm(0xE0); if (IsByteStore) - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::REP)).addImm(0x20); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::REP)).addImm(0x20); } MI.eraseFromParent(); return BB; @@ -1168,11 +1230,36 @@ W65816TargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI, case W65816::LDAptr: case W65816::STAptr: case W65816::STBptr: { - // Spill the pointer to a fresh 2-byte stack slot. Then LDY #0 and - // emit LDAfi_indY / STAfi_indY against that slot. The (slot,S),Y - // addressing reads the pointer from the spill, adds Y (=0), and - // dereferences. STBptr (truncating i8 store) wraps the actual STA - // in SEP/REP so M=8 across the store and only one byte is written. + // Pointer load/store via [dp],Y indirect-long (opcodes 0xB7 / 0x97): + // STA $E0 ; pointer low/hi at $E0..$E1 + // STZ $E2 ; bank byte at $E2 = 0 + // LDY #0 + // LDA [$E0], Y ; bank 0:ptr + 0 + // STA [$E0], Y + // The bank byte is forced to 0, so the access ignores DBR — the + // whole point. The previous lowering used (slot,S),Y indirect + // (opcode 0x91 / 0x93), but (sr,s),Y is DBR-relative — when the + // caller had set DBR != 0 (e.g. via `pha;plb` to bank 2 to reach + // IIgs hardware), the deref silently wrote to the wrong bank. + // + // Const-int pointers (`*(volatile uint16 *)0x5000 = v`) are NOT + // lowered through this pseudo — there's a TableGen pattern that + // takes them straight to STAabs (DBR-relative), which preserves + // the IIgs MMIO + bank-switch idiom that the smoke tests use. + // + // We use $E0..$E2 in libcall-scratch DP — safe because the + // pseudo expansion is a leaf (no calls between SEP and STA), + // and any subsequent libcall reinitialises its own scratch. + // + // Why [dp],Y not abs-long-X (`STA $0,X`)? abs-long-X is shorter + // (~3 bytes less) but uses X to hold the pointer. In high- + // pressure functions like the recursive expression parser, X + // is often live with another value, and forcing X to be free + // for every pointer-deref triggered "ran out of registers". + // [dp],Y uses A and Y only — leaves X for spill-bridge use. + // + // STBptr (truncating i8 store) wraps the actual STA in SEP/REP + // so M=8 across the store and only one byte is written. MachineFunction *MF = BB->getParent(); const W65816Subtarget &STI = MF->getSubtarget(); const W65816InstrInfo &TII = *STI.getInstrInfo(); @@ -1180,38 +1267,55 @@ W65816TargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI, bool IsLoad = MI.getOpcode() == W65816::LDAptr; bool IsByteStore = MI.getOpcode() == W65816::STBptr; - // Operand layout (explicit only; Defs=[Y] adds an implicit at the - // end which we don't read here): - // LDAptr: 0=dst, 1=ptr - // STAptr / STBptr: 0=val, 1=ptr - // The pointer operand is always at index 1. Earlier code reading - // operand 2 for stores hit the implicit Y def, not the pointer — - // which only "worked" because regalloc didn't notice and A - // happened to hold the right bytes by accident. Register Ptr = MI.getOperand(1).getReg(); - int FI = MF->getFrameInfo().CreateStackObject(2, Align(2), - /*isSpillSlot=*/true); - // Spill ptr. + // Why we spill the pointer to a fresh stack slot first: + // a direct `COPY $a = ptr_vreg ; STA $E0` lets RA elide the COPY + // when ptr_vreg is already allocated to A. In a loop body where + // multiple Acc16 PHIs (pointer + accumulator) compete for A, the + // PHI elimination pass picks one to be in A at the bottom of the + // block and silently drops the COPY needed to refresh A with the + // OTHER value at the top of the next iteration — silent miscompile + // (sumTable read its own accumulator as the pointer on iter 2+). + // STAfi forces RA to materialize ptr_vreg's value so it gets stored + // to the slot, then LDAfi reads it back as a real machine load. + int FI = MF->getFrameInfo().CreateStackObject(2, Align(2), + /*isSpillSlot=*/false); BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::STAfi)) .addReg(Ptr).addFrameIndex(FI).addImm(0); - // LDY #0. LDY_Imm16 has no output operand; Y is defined implicitly - // via the pseudo's Defs=[Y] marking. - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::LDY_Imm16)) - .addImm(0); + BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::LDAfi), + W65816::A).addFrameIndex(FI).addImm(0); + + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::STA_DP)).addImm(0xE0); + // Bank byte at $E2 = 0. STZ in M=16 writes 2 bytes ($E2..$E3); + // $E3 is junk-clobbered, OK (libcall scratch is caller-saved). + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::STZ_DP)).addImm(0xE2); if (IsLoad) { Register Dst = MI.getOperand(0).getReg(); - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::LDAfi_indY), Dst) - .addFrameIndex(FI).addImm(0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::LDY_Imm16)).addImm(0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::LDA_DPIndLongY)).addImm(0xE0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(TargetOpcode::COPY), Dst).addReg(W65816::A); } else { Register Val = MI.getOperand(0).getReg(); + // Load val into A. + BuildMI(*BB, MI.getIterator(), DL, + TII.get(TargetOpcode::COPY), W65816::A).addReg(Val); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::LDY_Imm16)).addImm(0); if (IsByteStore) - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::SEP)).addImm(0x20); - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::STAfi_indY)) - .addReg(Val).addFrameIndex(FI).addImm(0); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::SEP)).addImm(0x20); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::STA_DPIndLongY)).addImm(0xE0); if (IsByteStore) - BuildMI(*BB, MI.getIterator(), DL, TII.get(W65816::REP)).addImm(0x20); + BuildMI(*BB, MI.getIterator(), DL, + TII.get(W65816::REP)).addImm(0x20); } MI.eraseFromParent(); return BB; diff --git a/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp b/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp index aff0df3..0f58c13 100644 --- a/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp +++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.cpp @@ -30,18 +30,26 @@ W65816InstrInfo::W65816InstrInfo(const W65816Subtarget &STI) W65816::ADJCALLSTACKUP), RI() {} -// Maps IMGn to its DP address ($D0..$DE in steps of 2). Returns -1 if -// the reg isn't an IMG. +// Maps IMGn to its DP address (IMG0..IMG7 at $D0..$DE, IMG8..IMG15 at +// $C0..$CE, both in steps of 2). Returns -1 if the reg isn't an IMG. static int imgDPAddr(Register R) { switch (R) { - case W65816::IMG0: return 0xD0; - case W65816::IMG1: return 0xD2; - case W65816::IMG2: return 0xD4; - case W65816::IMG3: return 0xD6; - case W65816::IMG4: return 0xD8; - case W65816::IMG5: return 0xDA; - case W65816::IMG6: return 0xDC; - case W65816::IMG7: return 0xDE; + case W65816::IMG0: return 0xD0; + case W65816::IMG1: return 0xD2; + case W65816::IMG2: return 0xD4; + case W65816::IMG3: return 0xD6; + case W65816::IMG4: return 0xD8; + case W65816::IMG5: return 0xDA; + case W65816::IMG6: return 0xDC; + case W65816::IMG7: return 0xDE; + case W65816::IMG8: return 0xC0; + case W65816::IMG9: return 0xC2; + case W65816::IMG10: return 0xC4; + case W65816::IMG11: return 0xC6; + case W65816::IMG12: return 0xC8; + case W65816::IMG13: return 0xCA; + case W65816::IMG14: return 0xCC; + case W65816::IMG15: return 0xCE; default: return -1; } } diff --git a/src/llvm/lib/Target/W65816/W65816InstrInfo.td b/src/llvm/lib/Target/W65816/W65816InstrInfo.td index 1544a22..5545c3a 100644 --- a/src/llvm/lib/Target/W65816/W65816InstrInfo.td +++ b/src/llvm/lib/Target/W65816/W65816InstrInfo.td @@ -278,6 +278,12 @@ def : Pat<(store Acc16:$src, (W65816Wrapper tglobaladdr:$g)), (STAabs Acc16:$src, tglobaladdr:$g)>; def : Pat<(store Acc16:$src, (W65816Wrapper texternalsym:$s)), (STAabs Acc16:$src, texternalsym:$s)>; +// Store via a constant-int address (MMIO-style fixed pointer like +// `*(volatile uint16 *)0x5000 = v`). Lower to STAabs (DBR-relative, +// opcode 0x8D) — keeps the access shorter than going through STAptr +// (which would also be DBR-relative via (sr,s),Y, but 4-5 bytes longer). +def : Pat<(store Acc16:$src, (iPTR imm:$addr)), + (STAabs Acc16:$src, (i32 imm:$addr))>; // 16-bit ADD: expands to CLC + ADC_Imm16. The 65816 ADC sums with the // carry flag, so a clean add needs CLC first. Constraints tie the @@ -893,30 +899,40 @@ def CMP_RR : W65816Pseudo<(outs), (ins Acc16:$lhs, Acc16:$rhs), // fresh stack slot, set Y=0, and emit LDA/STA (slot,S),Y. Y gets // clobbered as a side effect. hasSideEffects=1 covers the spill // store the inserter adds, in addition to the deref. +// LDAptr / STAptr / STBptr lower to [dp],Y indirect-long via DP +// scratch $E0..$E2 (see W65816ISelLowering.cpp inserter). The +// inserter uses A and Y plus the DP scratch — X is not touched. +// Defs: Y (LDY #0) and P (STA/LDA set N/Z). +// $ptr is Wide16 (A or IMGn) so when bb.3-style pressure forces the +// pointer to share A with another live vreg, RA can park ptr in an +// IMGn DP slot. Acc16:$ptr was being silently coalesced with the +// loop-PHI accumulator: both wanted A at end of bb, and PHI-elim +// dropped the COPY needed to refresh A with the pointer at top of +// the loop. With Wide16, the COPY $a = ptr lowers to a real LDA $dp. let usesCustomInserter = 1, hasSideEffects = 1, mayLoad = 1, - Defs = [Y] in { -def LDAptr : W65816Pseudo<(outs Acc16:$dst), (ins Acc16:$ptr), + Defs = [Y, P] in { +def LDAptr : W65816Pseudo<(outs Acc16:$dst), (ins Wide16:$ptr), "# LDAptr $dst, $ptr", - [(set Acc16:$dst, (load Acc16:$ptr))]>; + [(set Acc16:$dst, (load Wide16:$ptr))]>; } let usesCustomInserter = 1, hasSideEffects = 1, mayStore = 1, - Defs = [Y] in { -def STAptr : W65816Pseudo<(outs), (ins Acc16:$val, Acc16:$ptr), + Defs = [Y, P] in { +def STAptr : W65816Pseudo<(outs), (ins Acc16:$val, Wide16:$ptr), "# STAptr $val, $ptr", - [(store Acc16:$val, Acc16:$ptr)]>; + [(store Acc16:$val, Wide16:$ptr)]>; } // i8 zero-extending pointer load: do a 16-bit LDA (slot,s),y and mask // the high byte. Reads one byte past the source — fine for byte-array // iteration where the buffer is at least 2 bytes long. A future // SEP/REP-aware mode pass could switch to a true 8-bit LDA. -def : Pat<(i16 (zextloadi8 Acc16:$ptr)), - (ANDi16imm (LDAptr Acc16:$ptr), 0xFF)>; +def : Pat<(i16 (zextloadi8 Wide16:$ptr)), + (ANDi16imm (LDAptr Wide16:$ptr), 0xFF)>; // Anyext byte load via pointer: consumer doesn't care about the high // byte, so just LDA (16-bit). Same 1-byte-past-buffer caveat as // zextloadi8. -def : Pat<(i16 (extloadi8 Acc16:$ptr)), - (LDAptr Acc16:$ptr)>; +def : Pat<(i16 (extloadi8 Wide16:$ptr)), + (LDAptr Wide16:$ptr)>; // And the equivalent for absolute addresses (byte loads via global ptr). // (Already covered for Wrapper(global) above; this catches the case // where the ptr is materialised as a value.) @@ -941,10 +957,10 @@ def STAfi_indY : W65816Pseudo<(outs), (ins Acc16:$src, memfi:$addr), // natural truncstorei8 from an i16 value (common with arg promotion), // and a true i8 store (Acc8) that arises from i8-typed IR. let usesCustomInserter = 1, hasSideEffects = 1, mayStore = 1, - Defs = [Y] in { -def STBptr : W65816Pseudo<(outs), (ins Acc16:$val, Acc16:$ptr), + Defs = [Y, P] in { +def STBptr : W65816Pseudo<(outs), (ins Acc16:$val, Wide16:$ptr), "# STBptr $val, $ptr", - [(truncstorei8 Acc16:$val, Acc16:$ptr)]>; + [(truncstorei8 Acc16:$val, Wide16:$ptr)]>; } // Pointer access with constant offset. `(load (add ptr, $off))` and @@ -953,40 +969,42 @@ def STBptr : W65816Pseudo<(outs), (ins Acc16:$val, Acc16:$ptr), // the offset becomes an explicit ADC #imm that has to spill A and // recompute the pointer per access. With them, we just load Y with // the offset in the inserter (Y is 16-bit so any i16 constant fits). +// LDAptrOff / STAptrOff / STBptrOff: same [dp],Y lowering as the +// no-offset variants but folds the offset into Y. let usesCustomInserter = 1, hasSideEffects = 1, mayLoad = 1, - Defs = [Y] in { + Defs = [Y, P] in { def LDAptrOff : W65816Pseudo<(outs Acc16:$dst), - (ins Acc16:$ptr, i16imm:$off), + (ins Wide16:$ptr, i16imm:$off), "# LDAptrOff $dst, $ptr, $off", []>; } let usesCustomInserter = 1, hasSideEffects = 1, mayStore = 1, - Defs = [Y] in { + Defs = [Y, P] in { def STAptrOff : W65816Pseudo<(outs), - (ins Acc16:$val, Acc16:$ptr, i16imm:$off), + (ins Acc16:$val, Wide16:$ptr, i16imm:$off), "# STAptrOff $val, $ptr, $off", []>; def STBptrOff : W65816Pseudo<(outs), - (ins Acc16:$val, Acc16:$ptr, i16imm:$off), + (ins Acc16:$val, Wide16:$ptr, i16imm:$off), "# STBptrOff $val, $ptr, $off", []>; } -def : Pat<(i16 (load (add Acc16:$ptr, (i16 imm:$off)))), - (LDAptrOff Acc16:$ptr, imm:$off)>; -def : Pat<(store Acc16:$val, (add Acc16:$ptr, (i16 imm:$off))), - (STAptrOff Acc16:$val, Acc16:$ptr, imm:$off)>; -def : Pat<(truncstorei8 Acc16:$val, (add Acc16:$ptr, (i16 imm:$off))), - (STBptrOff Acc16:$val, Acc16:$ptr, imm:$off)>; -def : Pat<(store Acc8:$val, (add Acc16:$ptr, (i16 imm:$off))), +def : Pat<(i16 (load (add Wide16:$ptr, (i16 imm:$off)))), + (LDAptrOff Wide16:$ptr, imm:$off)>; +def : Pat<(store Acc16:$val, (add Wide16:$ptr, (i16 imm:$off))), + (STAptrOff Acc16:$val, Wide16:$ptr, imm:$off)>; +def : Pat<(truncstorei8 Acc16:$val, (add Wide16:$ptr, (i16 imm:$off))), + (STBptrOff Acc16:$val, Wide16:$ptr, imm:$off)>; +def : Pat<(store Acc8:$val, (add Wide16:$ptr, (i16 imm:$off))), (STBptrOff (COPY_TO_REGCLASS Acc8:$val, Acc16), - Acc16:$ptr, imm:$off)>; -def : Pat<(store Acc8:$val, Acc16:$ptr), - (STBptr (COPY_TO_REGCLASS Acc8:$val, Acc16), Acc16:$ptr)>; + Wide16:$ptr, imm:$off)>; +def : Pat<(store Acc8:$val, Wide16:$ptr), + (STBptr (COPY_TO_REGCLASS Acc8:$val, Acc16), Wide16:$ptr)>; // i8 load via Acc16 pointer producing a true i8 (Acc8) result. Reuses // the existing zextloadi8 16-bit-LDA-and-mask path: load 2 bytes, mask // the high byte, then narrow to Acc8. COPY_TO_REGCLASS to Acc8 is a // no-op at MC level (same physical A). Reads one byte past the source; // fine for char-array iteration where the buffer is at least 2 bytes. -def : Pat<(i8 (load Acc16:$ptr)), - (COPY_TO_REGCLASS (ANDi16imm (LDAptr Acc16:$ptr), 0xFF), Acc8)>; +def : Pat<(i8 (load Wide16:$ptr)), + (COPY_TO_REGCLASS (ANDi16imm (LDAptr Wide16:$ptr), 0xFF), Acc8)>; // Acc8-to-Acc16 type conversions. Both Acc8 and Acc16 alias physical A, // so COPY_TO_REGCLASS is a no-op at MC level. ZEXT additionally masks @@ -1109,8 +1127,12 @@ def LDA_AbsY : InstAbsY<0xB9, "lda">; def LDA_DPInd : InstDPInd <0xB2, "lda">; def LDA_DPIndY : InstDPIndY<0xB1, "lda">; def LDA_DPIndX : InstDPIndX<0xA1, "lda">; -def LDA_DPIndLong : InstDPIndLong <0xA7, "lda">; -def LDA_DPIndLongY : InstDPIndLongY<0xB7, "lda">; +def LDA_DPIndLong : InstDPIndLong <0xA7, "lda"> { let Defs = [A]; } +// LDA [dp],Y: reads Y to compute the indexed address, defines A. +// Without these, regalloc thought A was unaffected by the load and +// dead-code-eliminated COPYs that were supposed to materialise the +// next pointer in A — silent miscompile in mySwap-style helpers. +def LDA_DPIndLongY : InstDPIndLongY<0xB7, "lda"> { let Defs = [A]; let Uses = [Y]; } def LDA_LongX : InstAbsLongX<0xBF, "lda">; //---------------------------------------------------------------- STA (store A) @@ -1123,8 +1145,10 @@ def STA_AbsY : InstAbsY<0x99, "sta">; def STA_DPInd : InstDPInd <0x92, "sta">; def STA_DPIndY : InstDPIndY<0x91, "sta">; def STA_DPIndX : InstDPIndX<0x81, "sta">; -def STA_DPIndLong : InstDPIndLong <0x87, "sta">; -def STA_DPIndLongY : InstDPIndLongY<0x97, "sta">; +def STA_DPIndLong : InstDPIndLong <0x87, "sta"> { let Uses = [A]; } +// STA [dp],Y: reads A (the value to store) and Y (the index). Mark +// both so regalloc keeps A's value live across this instruction. +def STA_DPIndLongY : InstDPIndLongY<0x97, "sta"> { let Uses = [A, Y]; } def STA_LongX : InstAbsLongX<0x9F, "sta">; //---------------------------------------------------------------- LDX (load X) diff --git a/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp b/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp index d8e8ede..1ccf33d 100644 --- a/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp +++ b/src/llvm/lib/Target/W65816/W65816RegisterInfo.cpp @@ -117,14 +117,22 @@ bool W65816RegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator II, Register Src = MI.getOperand(0).getReg(); int srcDP = -1; switch (Src) { - case W65816::IMG0: srcDP = 0xD0; break; - case W65816::IMG1: srcDP = 0xD2; break; - case W65816::IMG2: srcDP = 0xD4; break; - case W65816::IMG3: srcDP = 0xD6; break; - case W65816::IMG4: srcDP = 0xD8; break; - case W65816::IMG5: srcDP = 0xDA; break; - case W65816::IMG6: srcDP = 0xDC; break; - case W65816::IMG7: srcDP = 0xDE; break; + case W65816::IMG0: srcDP = 0xD0; break; + case W65816::IMG1: srcDP = 0xD2; break; + case W65816::IMG2: srcDP = 0xD4; break; + case W65816::IMG3: srcDP = 0xD6; break; + case W65816::IMG4: srcDP = 0xD8; break; + case W65816::IMG5: srcDP = 0xDA; break; + case W65816::IMG6: srcDP = 0xDC; break; + case W65816::IMG7: srcDP = 0xDE; break; + case W65816::IMG8: srcDP = 0xC0; break; + case W65816::IMG9: srcDP = 0xC2; break; + case W65816::IMG10: srcDP = 0xC4; break; + case W65816::IMG11: srcDP = 0xC6; break; + case W65816::IMG12: srcDP = 0xC8; break; + case W65816::IMG13: srcDP = 0xCA; break; + case W65816::IMG14: srcDP = 0xCC; break; + case W65816::IMG15: srcDP = 0xCE; break; default: break; } if (srcDP >= 0) { diff --git a/src/llvm/lib/Target/W65816/W65816RegisterInfo.td b/src/llvm/lib/Target/W65816/W65816RegisterInfo.td index 574cefe..0d3a505 100644 --- a/src/llvm/lib/Target/W65816/W65816RegisterInfo.td +++ b/src/llvm/lib/Target/W65816/W65816RegisterInfo.td @@ -38,22 +38,34 @@ def PBR : W65816Reg<6, "pbr">, DwarfRegNum<[6]>; def PC : W65816Reg<7, "pc">, DwarfRegNum<[7]>; def P : W65816Reg<8, "p">, DwarfRegNum<[8]>; -// Imaginary 16-bit registers backed by direct-page slots $D0..$DE. -// The regalloc treats them as physical registers with cheap LDA/STA dp -// inter-register moves. This relieves pressure on the single Acc16 -// register (A) so greedy regalloc can succeed on functions with -// multiple simultaneously-live i16 vregs. Caller-save: callees may -// freely overwrite them, so regalloc spills around any call that -// might touch them. Their HWEncoding is never emitted (asmprinter -// translates IMGn references into LDA/STA dp with the right address). -def IMG0 : W65816Reg<16, "img0">, DwarfRegNum<[16]>; -def IMG1 : W65816Reg<17, "img1">, DwarfRegNum<[17]>; -def IMG2 : W65816Reg<18, "img2">, DwarfRegNum<[18]>; -def IMG3 : W65816Reg<19, "img3">, DwarfRegNum<[19]>; -def IMG4 : W65816Reg<20, "img4">, DwarfRegNum<[20]>; -def IMG5 : W65816Reg<21, "img5">, DwarfRegNum<[21]>; -def IMG6 : W65816Reg<22, "img6">, DwarfRegNum<[22]>; -def IMG7 : W65816Reg<23, "img7">, DwarfRegNum<[23]>; +// Imaginary 16-bit registers backed by direct-page slots $C0..$DE +// (16 slots = 32 DP bytes). The regalloc treats them as physical +// registers with cheap LDA/STA dp inter-register moves. This +// relieves pressure on the single Acc16 register (A) so greedy +// regalloc can succeed on functions with multiple simultaneously- +// live i16 vregs. Caller-save: callees may freely overwrite them, +// so regalloc spills around any call that might touch them. Their +// HWEncoding is never emitted (asmprinter translates IMGn references +// into LDA/STA dp with the right address). +// +// Layout: IMG0..IMG7 at $D0..$DE (legacy slot block); IMG8..IMG15 +// at $C0..$CE. Avoid stepping on user DP allocations below $C0. +def IMG0 : W65816Reg<16, "img0">, DwarfRegNum<[16]>; +def IMG1 : W65816Reg<17, "img1">, DwarfRegNum<[17]>; +def IMG2 : W65816Reg<18, "img2">, DwarfRegNum<[18]>; +def IMG3 : W65816Reg<19, "img3">, DwarfRegNum<[19]>; +def IMG4 : W65816Reg<20, "img4">, DwarfRegNum<[20]>; +def IMG5 : W65816Reg<21, "img5">, DwarfRegNum<[21]>; +def IMG6 : W65816Reg<22, "img6">, DwarfRegNum<[22]>; +def IMG7 : W65816Reg<23, "img7">, DwarfRegNum<[23]>; +def IMG8 : W65816Reg<32, "img8">, DwarfRegNum<[32]>; +def IMG9 : W65816Reg<33, "img9">, DwarfRegNum<[33]>; +def IMG10 : W65816Reg<34, "img10">, DwarfRegNum<[34]>; +def IMG11 : W65816Reg<35, "img11">, DwarfRegNum<[35]>; +def IMG12 : W65816Reg<36, "img12">, DwarfRegNum<[36]>; +def IMG13 : W65816Reg<37, "img13">, DwarfRegNum<[37]>; +def IMG14 : W65816Reg<38, "img14">, DwarfRegNum<[38]>; +def IMG15 : W65816Reg<39, "img15">, DwarfRegNum<[39]>; // DPF0 — pseudo-physreg modeling the i16 storage at DP $F0..$F1. // Used as the carrier for the highest 16 bits of an i64/double @@ -85,8 +97,10 @@ def Idx16 : RegisterClass<"W65816", [i16], 16, (add X, Y)>; // may freely overwrite $D0..$DF, so the allocator must spill IMGn // vregs around any call. def Img16 : RegisterClass<"W65816", [i16], 16, - (add IMG0, IMG1, IMG2, IMG3, - IMG4, IMG5, IMG6, IMG7)>; + (add IMG0, IMG1, IMG2, IMG3, + IMG4, IMG5, IMG6, IMG7, + IMG8, IMG9, IMG10, IMG11, + IMG12, IMG13, IMG14, IMG15)>; // Acc-or-IMG combined class. Vregs that are not constrained to A // (i.e., not the source of an arithmetic op) get widened to this @@ -94,8 +108,10 @@ def Img16 : RegisterClass<"W65816", [i16], 16, // A first so the allocator's default order prefers A; cross-class // moves to/from A are LDA/STA dp via copyPhysReg. def Wide16 : RegisterClass<"W65816", [i16], 16, - (add A, IMG0, IMG1, IMG2, IMG3, - IMG4, IMG5, IMG6, IMG7)>; + (add A, IMG0, IMG1, IMG2, IMG3, + IMG4, IMG5, IMG6, IMG7, + IMG8, IMG9, IMG10, IMG11, + IMG12, IMG13, IMG14, IMG15)>; def PtrRegs : RegisterClass<"W65816", [i16], 16, (add SP)>; diff --git a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp index eae7eae..14bee9c 100644 --- a/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp +++ b/src/llvm/lib/Target/W65816/W65816StackSlotCleanup.cpp @@ -1301,10 +1301,29 @@ bool W65816StackSlotCleanup::runOnMachineFunction(MachineFunction &MF) { // implicit-def $a but the return-value flags aren't reliably set, // and other corner cases break smoke. auto isATransparent = [](const MachineInstr &MI) { - // Stores that don't touch A or P-bits-other-than-via-A. - return MI.getOpcode() == W65816::STAfi || - MI.getOpcode() == W65816::STAfi_indY || - MI.getOpcode() == W65816::STA8fi; + // Stores that don't touch A or P-bits-other-than-via-A. (Byte + // stores that internally SEP/REP wrap toggle the M flag, but that + // doesn't affect N/Z based on A's current value.) Also call-stack + // pseudos (ADJCALLSTACKDOWN / UP) which are zero-effect at this + // point in the pipeline (PEI eliminates UP; DOWN is always nil). + switch (MI.getOpcode()) { + case W65816::STAfi: + case W65816::STAfi_indY: + case W65816::STA8fi: + case W65816::STAabs: + case W65816::STA8abs: + case W65816::STAptr: + case W65816::STBptr: + case W65816::STAptrOff: + case W65816::STBptrOff: + case W65816::ADJCALLSTACKDOWN: + // DOWN expands to nothing (PUSH16 chain already shifted SP). + // UP is NOT transparent: when PEI doesn't process it, AsmPrinter + // emits a TSC/CLC/ADC/TCS sequence that clobbers A and flags. + return true; + default: + return false; + } }; // Returns true iff walking back from `Start` (exclusive) finds an // A-modifier as the first non-skip op. Skips debug ops and