65816-llvm-mos/STATUS.md
Scott Duensing f542f4fa01 Checkpoint
2026-05-03 21:31:53 -05:00

llvm816 — Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from llvm-mos as a separate W65816 target.

What works

End-to-end C-to-binary toolchain that produces 65816 machine code which runs correctly under MAME (apple2gs).

Language coverage at -O2 (no extra flags):

  • All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod (signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos + ASLA16 / shift libcalls.
  • Comparisons and signed/unsigned width conversions (sext, zext, trunc) for all the above sizes. Signed compare near INT_MIN handled via EOR-with-sign-bit transform.
  • Pointer arithmetic, array indexing, struct field access, struct return-by-value (up to 8 bytes — Pair, Vec4, double).
  • Pointer dereference (*p) lowers via LDAptr / STAptr / STBptr to [$E0],Y indirect-LONG with the bank byte at $E2 forced to 0 — DBR-independent, so pha;plb bank-switched callers don't corrupt data through callee local-pointer writes. Const-int pointers (*(volatile uint16 *)0x5000 = v MMIO idiom) lower to STAabs (DBR-relative) so bank-2 writes still work.
  • Bitfields, switch statements (verified up to ~12 cases + default), function pointers, function-pointer tables, indirect calls via __jsl_indir trampoline.
  • Recursion: factorial, Fibonacci, depth-3 binary-tree insert/sum/min/max, simple recursive quicksort.
  • Loops with goto / break / continue, nested loops, state machines.
  • <stdarg.h> varargs with int / long / unsigned long long mixed args.
  • Heap: malloc / free (libc.c first-fit allocator) — linked-list reverse with cons works; free-list coalesce verified.
  • Strings: hand-rolled strlen, strcmp, strcpy, strchr, atoi/itoa roundtrip.
  • Soft-float (single): all four ops + comparisons, MAME-verified.
  • Soft-double: add, sub, mul, div all return correct bit patterns bit-for-bit against gcc with round-to-nearest-even rounding; 3-iter Newton sqrt converges. Compiles at -O2 throughout. Long-running iterations may hit MAME's 1-second sim-time budget (test config issue, not a compiler bug).
  • Inline assembly with "a", "x", "y" register constraints and arbitrary opcode bytes (used for the pha;plb bank-switch idiom).
  • C++ minimal: clang++ compiles a class with virtual + non-trivial ctor (vtable + RTTI omitted; no exceptions).
  • printf with %d %x %s %c %p and width/precision specifiers.
  • sprintf / snprintf / vsprintf / vsnprintf with the same format coverage as printf (%d %u %x %ld %lu %s %c %f %p %% + width). C99 truncation semantics for snprintf. %.Nf produces the correct fractional digits with round-half-up.
  • qsort + bsearch over arbitrary element size with a user cmp callback.
  • Standard string/stdlib glue: strcat, strncat, strpbrk, strspn, strcspn, atol, llabs (kept in their own translation unit so vprintf's branch layout doesn't shift).
  • <math.h>: fabs, floor, ceil, fmod, copysign, sqrt, pow, sin, cos, tan, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh (and float variants). Bit-twiddling for fabs/floor/ceil/copysign; Newton iteration for sqrt; range-reduction + Taylor for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh. Accuracy is in the ~1e-6 range: good enough for typical numeric work, far short of glibc quality. These are slow (each call is dozens to hundreds of soft-double libcalls), so pre-compute or cache when possible.
  • setjmp / longjmp from libgcc.s.
  • Static constructors via crt0's init_array walk.
  • <stdio.h> file I/O against an in-memory FS: mfsRegister(path, buf, size, cap, writable) stages a buffer as a named file; fopen/fread/fwrite/fseek/ftell/fclose/fgetc/fgets/ungetc/fprintf operate on it via a per-FILE (kind, buf, size, cap, pos, eof, err, unget) record. stdin/stdout/stderr route through putchar as before.
  • <wchar.h>: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy / wcscat / wcschr / wcsrchr; mbtowc / wctomb / mbstowcs / wcstombs / mblen with the trivial 1:1 byte<->wide mapping (Latin-1). wchar_t is 16-bit on this target.
  • <signal.h>: in-process signal table. signal() registers a handler; raise() invokes it. Default actions: SIGABRT calls abort(), SIGINT/SIGTERM call exit(128+sig), others ignored.
  • <locale.h>: setlocale always returns "C"; localeconv returns a fixed C-locale lconv struct.
  • C++ subset: classes, single inheritance, multiple inheritance (Drawable+Movable through one Sprite), virtual base diamond (A and B virtually derive Base; Diamond inherits from both with one shared Base subobject), virtual functions, polymorphism via base-class pointer arrays, virtual dtors, this-pointer adjustment for non-leftmost bases, vbase offset tables. RTTI / dynamic_cast works (downcast, MI cross-cast, virtual-base sibling cast) via a minimal libcxxabi shim (runtime/src/libcxxabi.c) that provides __dynamic_cast + the three typeinfo class vtables (__class_type_info, __si_class_type_info, __vmi_class_type_info) + sized operator delete + __cxa_pure_virtual.
  • C++ exceptions via clang++ -fsjlj-exceptions: throw, catch, catch-by-value, multiple catch handlers, exception destruction. Backend wiring: MCAsmInfo selects ExceptionHandling::SjLj so clang's SjLjEHPrepare runs; a custom W65816SjLjFinalize IR pass (in src/llvm/lib/Target/W65816/) finishes the lowering by inserting an actual setjmp at function entry, building a switch-on-call-site dispatch block, building a per-function catch table referenced via the lsda field, and rewriting eh.typeid.for(@TI) to use typeinfo addresses as selectors. Runtime in runtime/src/libcxxabiSjlj.c provides the full Itanium SJLJ surface: _Unwind_SjLj_Register/Unregister/RaiseException/Resume, __cxa_allocate_exception, __cxa_throw, __cxa_begin_catch, __cxa_end_catch, __cxa_rethrow, plus a no-op __gxx_personality_sj0 (we dispatch via call_site directly, not via the personality). Two backend bug fixes were required along the way: longjmp's SP restore was off by 3 (libgcc.s subtracted 3 before TCS, leaving caller's stack 3 bytes off) and W65816StackSlotCleanup was eliminating volatile stores to dead-from-its-perspective stack slots (skipped via hasOrderedMemoryRef() gate).
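
The setjmp-at-entry plus switch-on-call-site shape described for the SJLJ lowering can be illustrated in plain C. This is a minimal host-side sketch, not the output of W65816SjLjFinalize or the code in libcxxabiSjlj.c; every name in it (SjljFrame, frameRegister, throwInt, guardedSum, and so on) is hypothetical, and the "catch table" is reduced to a switch that maps a call-site number to a return code.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-function context an SJLJ lowering sets up. */
typedef struct SjljFrame {
    jmp_buf jumpBuf;          /* filled by setjmp at function entry */
    volatile int callSite;    /* updated before every call that may throw */
    struct SjljFrame *prev;   /* next-outer registered frame */
} SjljFrame;

static SjljFrame *gFrameTop = NULL;   /* innermost registered frame */
static int gThrownValue = 0;          /* payload of the in-flight "exception" */

static void frameRegister(SjljFrame *frame) {
    frame->prev = gFrameTop;
    gFrameTop = frame;
}

static void frameUnregister(SjljFrame *frame) {
    gFrameTop = frame->prev;
}

/* "throw": record the payload, then jump to the innermost frame's setjmp. */
static void throwInt(int value) {
    assert(gFrameTop != NULL);
    gThrownValue = value;
    longjmp(gFrameTop->jumpBuf, 1);
}

int lastThrownValue(void) {
    return gThrownValue;
}

/* A callee that throws on negative input. */
static int doubleOrThrow(int x) {
    if (x < 0) {
        throwInt(x);
    }
    return x * 2;
}

/* One setjmp at entry; the landing code switches on the recorded call site. */
int guardedSum(int a, int b) {
    SjljFrame frame;
    volatile int total = 0;

    frame.callSite = 0;
    frameRegister(&frame);
    if (setjmp(frame.jumpBuf) != 0) {
        int site = frame.callSite;    /* which call was in flight */
        frameUnregister(&frame);
        switch (site) {
        case 1:  return -1;           /* handler for the first call */
        case 2:  return -2;           /* handler for the second call */
        default: return -99;          /* no handler recorded */
        }
    }

    frame.callSite = 1;
    total += doubleOrThrow(a);
    frame.callSite = 2;
    total += doubleOrThrow(b);

    frameUnregister(&frame);
    return total;
}
```

Calling guardedSum(3, 4) takes the normal path, while a negative first or second argument lands in the handler selected by the call-site number. The call-site field is volatile because it is modified between setjmp and longjmp and read afterwards.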

Toolchain:

  • clang / llc produce W65816 assembly + ELF object files.

  • tools/link816 resolves cross-translation-unit refs, lays out text/rodata/bss, emits a flat binary the IIgs ROM can load. Auto-relocates bss above text+rodata when the default --bss-base 0x2000 would overlap text, and skips past the IIgs IO window ($C000-$CFFF) if needed. --gc-sections (default ON) drops unreachable functions: a minimal program with full runtime linked shrinks from ~43KB to ~1.5KB.

  • link816 --segment-cap N packs .text greedily into multiple bank-aligned segments, capped at N bytes per segment. Segment 1 stays at --text-base in bank 0 (alongside rodata + bss + init); segments 2..M start at --segment-bank-base (default $040000) in successive banks. --manifest path.json writes a JSON file listing each segment's image, base, and entry offset. Cross-bank JSL (IMM24 reloc) just works — patched at link time with the full 24-bit address. Cross-bank IMM16 is permitted (uses DBR for bank — caller pins DBR to data's bank); cross-bank PCREL is rejected with a clear diagnostic. scripts/runMultiSeg.sh is a mini in-Lua loader for MAME that reads the manifest, places each segment's bytes, and runs from segment 1's entry — used by smoke to verify cross-bank JSL end-to-end (helper3 chain across 3 bank-aligned segments).

  • tools/omfEmit produces OMF v2.1 files in two modes: (a) single-segment — --input flat.bin --map flat.map --base ADDR --entry SYM, KIND=0x0000 (CODE, dynamic), ORG=0 (loader picks bank); (b) multi-segment — --manifest path.json reads link816's manifest and emits one OMF segment per entry with KIND=0x8800 (STATIC|ABSBANK|CODE) + ORG=segment-base, asking the GS/OS Loader to place each at its declared bank-aligned address. All intra-segment relocations were already patched by the linker, so no INTERSEG/RELOC opcodes are needed for v1 static placement.

  • link816 --debug-out FILE writes a DWARF sidecar with text/rodata/bss/init_array relocations applied to every .debug_* section, so .debug_addr / .debug_line PC values are final-image addresses.

  • runtime/build.sh builds crt0, libc, soft-float, soft-double, libgcc into linkable objects.

  • scripts/smokeTest.sh runs 126 end-to-end checks at -O2: scalar ops, control flow, calling conventions, MAME execution regressions, link816 bss-base safety + weak-symbol resolution + heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link, iigs/gsos.h compile + link, standalone runtime headers, AsmPrinter peepholes (STZ / PEA / PEI: single-STA, shared-LDA-multi-STA, DPF0-forwarding), malloc/free coalesce ordering, plus real-world coverage: Conway's Game of Life blinker (2D loop + neighbour bounds), binary search tree (recursive struct + malloc), function-pointer dispatch table (indirect JSL via __jsl_indir), memory-backed file I/O (mfsRegister + fopen/fread/fwrite/fseek/fprintf), C++ polymorphism (single inheritance), C++ multiple inheritance (Drawable+Movable), C++ virtual base diamond, C++ dynamic_cast (SI + MI cross-cast + virtual-base sibling cast through libcxxabi shim), SJLJ exception runtime end-to-end (libcxxabiSjlj.c throw/catch round-trip via setjmp/longjmp + catch-table walk), C++ -fsjlj-exceptions compile + link (the C++ frontend → backend path is execution-verified manually but skipped from MAME smoke due to MAME-side flakiness; see "Yet to come"), GS/OS wrapper round-trip via stub dispatcher pre-loaded at $E100A8 (validates PHA + PEA 0 + JSL + post-call SP-fixup contract end-to-end), wchar / signal core APIs, hex dumper writing through fprintf, JSON tokenizer state machine, hash-table command shell (parser + dispatch + chained collisions over fprintf-to-mfs), scripts/bench.sh size-vs-Calypsi harness. 100% pass.
  • scripts/bench.sh compiles a microbenchmark suite with both clang (this toolchain) and Calypsi cc65816, comparing emitted text-section size. Current ratio: ~1.9x (down from 2.2x once the W65816 target started overriding replexitval to "never" by default in LLVMInitializeW65816Target; SCEV's closed-form rewrite was promoting i16 induction expressions to i64 and hitting __muldi3, which on a 16-bit target is dramatically bigger than the loop it replaces). sumOfSquares went 335B → 128B, a 2.6x shrink with no other benchmark affected. Eight benchmarks shipped under benchmarks/. Remaining gap is structural: Calypsi uses (sr,s),Y for stack-relative pointer indirection where we route through DP $E0 indirect-long for bank safety.
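
The --segment-cap packing rule from the link816 entry above (segment 1 at --text-base, later segments at --segment-bank-base in successive banks) can be sketched as a next-fit loop. This is an illustration under assumptions, not link816's code: the packing unit ("chunk"), the names Segment and packSegments, and the strict input-order policy are all hypothetical. Only the $040000 default bank base and the one-bank-per-segment stride come from the description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one output segment. */
typedef struct {
    uint32_t base;   /* 24-bit load address of the segment */
    uint32_t used;   /* bytes of .text placed so far */
} Segment;

/*
 * Pack chunks (assumed here to be whole functions or sections) in input
 * order. A chunk that would push the current segment past `cap` opens the
 * next segment. Segment 0 sits at textBase; segment k >= 1 sits at
 * bankBase + (k - 1) * 0x10000, i.e. one bank per later segment.
 * Returns the number of segments used, or 0 if packing is impossible.
 */
size_t packSegments(const uint32_t *chunkSize, size_t chunkCount,
                    uint32_t cap, uint32_t textBase, uint32_t bankBase,
                    Segment *segs, size_t maxSegs, size_t *chunkSeg)
{
    size_t cur = 0;

    if (maxSegs == 0 || cap == 0) {
        return 0;
    }
    segs[0].base = textBase;
    segs[0].used = 0;

    for (size_t i = 0; i < chunkCount; i++) {
        if (chunkSize[i] > cap) {
            return 0;                     /* a single chunk exceeds the cap */
        }
        if (segs[cur].used + chunkSize[i] > cap) {
            if (++cur == maxSegs) {
                return 0;                 /* out of segment slots */
            }
            segs[cur].base = bankBase + (uint32_t)(cur - 1) * 0x10000u;
            segs[cur].used = 0;
        }
        chunkSeg[i] = cur;                /* remember where the chunk went */
        segs[cur].used += chunkSize[i];
    }
    return cur + 1;
}
```

With a cap of 0x8000 and chunks of 0x3000, 0x4000, 0x2000 and 0x5000 bytes, the first two chunks stay in segment 1 and the last two move to a second segment at $040000.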

Backend register allocation:

  • Basic regalloc as default at -O1+; fast at -O0/optnone. We use basic instead of greedy because greedy fails ("ran out of registers during register allocation") on functions with many cross-call Acc16 vregs (the ok |= bit; helper(); ok |= bit; pattern across many if-blocks). Basic handles those cleanly with negligible code-size overhead vs greedy on the bench suite (~0.6%).
  • Pre-RA passes: WidenAcc16 (Acc16→Wide16 promotion, lets greedy spread i16 pressure across A and 16 IMG slots); TiedDefSpill (handles tied-def-multi-use hazard); ABridgeViaX (bridges via X/Y when free).
  • Post-RA passes: SpillToX (STA/LDA pairs → TAX/TXA bridges when X dead); StackSlotCleanup (deletes redundant adjacent spills); NegYIndY (rewrites negative-Y indirect-Y stack-rel ops to avoid the 24-bit-add bank-cross).
  • Pre-emit: BranchExpand (long Bxx → INV_Bxx skip; BRA target); SepRepCleanup (coalesces adjacent SEP/REP toggles, plus a cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching X-flag-only ops, branches, transfers — saves 4B / 12cyc per collapse). AsmPrinter LDAi8imm peephole walks past mode-neutral MIs to fuse the closing REP into a following SEP.
  • Imaginary registers IMG0..IMG15 backed by DP $C0..$CE + $D0..$DE — gives greedy 17 effective i16 carriers (A + 16 IMG) before stack spills kick in.
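
The IMG0..IMG15 backing described above implies a simple index-to-direct-page mapping. The sketch below assumes that each imaginary register is a 2-byte slot and that the slots are assigned in ascending order ($C0, $C2, ..., $CE, then $D0, ..., $DE); the ordering and the function name imgDirectPageAddr are assumptions, not taken from the backend source.

```c
#include <assert.h>
#include <stdint.h>

enum { IMG_COUNT = 16 };   /* IMG0..IMG15 */

/*
 * Direct-page address of imaginary register `index`, assuming 2-byte slots
 * laid out in ascending order: IMG0..IMG7 at $C0..$CE, IMG8..IMG15 at
 * $D0..$DE.
 */
uint8_t imgDirectPageAddr(unsigned index)
{
    assert(index < IMG_COUNT);
    unsigned bankStart = (index < 8u) ? 0xC0u : 0xD0u;
    return (uint8_t)(bankStart + 2u * (index & 7u));
}
```

Sixteen such slots plus the accumulator give the 17 i16 carriers mentioned above.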

ABI:

  • arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL on the system stack with PHA. Caller deallocates via tsc;clc;adc #N;tcs or PLY*N/2.
  • Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for the highest 16 bits.
  • Frame is empty-descending (S points to next-free); offsets account for the +1 skew vs LLVM's full-descending model.
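
As a host-side illustration of the i64 return convention, the sketch below splits a 64-bit value into the four 16-bit carriers named above. Only the placement of the highest 16 bits in DP[$F0..$F1] is stated in this document; the assignment of the lower words (A lowest, then X, then Y) is an assumption, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the four 16-bit carriers of an i64 return value. */
typedef struct {
    uint16_t a;      /* assumed: bits 0..15  */
    uint16_t x;      /* assumed: bits 16..31 */
    uint16_t y;      /* assumed: bits 32..47 */
    uint16_t dpF0;   /* bits 48..63, stored at DP $F0..$F1 */
} Return64;

Return64 splitReturn64(uint64_t value)
{
    Return64 r;
    r.a    = (uint16_t)(value & 0xFFFFu);
    r.x    = (uint16_t)((value >> 16) & 0xFFFFu);
    r.y    = (uint16_t)((value >> 32) & 0xFFFFu);
    r.dpF0 = (uint16_t)((value >> 48) & 0xFFFFu);
    return r;
}

uint64_t joinReturn64(Return64 r)
{
    return (uint64_t)r.a
         | ((uint64_t)r.x << 16)
         | ((uint64_t)r.y << 32)
         | ((uint64_t)r.dpF0 << 48);
}
```

Under the same assumption, an i32 return would use only the a and x fields.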

IIgs toolbox:

  • iigs/toolbox.h — autogenerated wrappers for all ~1300 IIgs toolbox routines across 35 tool sets (Tool Locator, Memory Manager, Misc Tools, QuickDraw II / Aux, Event Manager, Sound Manager, Apple Desktop Bus, SANE, Integer Math, Text Tools, Window Manager, Menu Manager, Control Manager, LineEdit, Dialog Manager, Scrap Manager, Standard File, Note Synth/Sequencer, Font Manager, List Manager, ACE, Resource Manager, MIDI, Video Overlay, TextEdit, Media Control, Print Manager, Scheduler, Desk Manager, …). Names match Apple's IIgs Toolbox Reference exactly (TLStartUp, MMStartUp, NewWindow, SysBeep, …). 417 simple wrappers (zero/single-arg, i16-or-void return) inline in the header; 890 multi-arg ones live in runtime/src/iigsToolbox.s. Generated by scripts/genToolbox.py from ORCA-C's ORCACDefs/ (re-runnable when ORCA-C updates).

In flight

(Nothing currently — the four previous in-flight items all landed: basic-regalloc-by-default replaced greedy and resolved the long-arg-chain failure; time() reads ReadTimeHex when the program has called iigsToolboxInit() and clock() reads the VBL counter via 24-bit absolute load; the (sr,s),Y bank-wrap addressing is no longer emitted by any inserter and the W65816NegYIndY workaround is disabled; LC ceiling extended from $E000 to $10000 since crt0's lda $C083 read-twice enables RAM through $FFFF, gaining 8KB of bank-0 space.)

Yet to come

  • Multi-bank BSS / init_array — multi-segment splits text across banks but BSS + init_array still live in segment 1's bank (bank 0). Programs whose zero-init data exceeds the ~60KB bank-0 budget would need crt0 to walk a per-segment table of (start, end) pairs. Not blocking >64KB code programs; only matters for programs with very large global arrays.

  • GS/OS Loader OMF format compatibility: the OMF format we emit is now byte-equivalent to real Apple S16 segments at the header level. Verified by extracting the ABOUT segment from real /SYSTEM/START (FINDER) via Cadius (/tmp/cadius/cadius, not AppleCommander, which can't extract forks) and comparing field-by-field against ours. Five fixes landed in src/link816/omfEmit.cpp along the way:

    • (1) VERSION byte 0x21 → 0x02 (was BCD-style "2.1"; real format is an enum where 0x02 = v2.1). Cleared error $1102.
    • (2) Body opcode 0xF1 (DS = N zeros) → 0xF2 (compact LCONST, 2-byte length + N data bytes). Long-form 0xF5 LCONST is in the spec but real Loader appears to mis-parse it (3 stale copies of the segment ended up scattered in RAM). Every real segment we decoded uses 0xF2.
    • (3) KIND 0x0000 (CODE) → 0x8000 (CODE|STATIC) for legacy single-segment mode. Real ABOUT segment uses 0x8000; with 0x0000 the Loader returns $110A loadSegFailErr. Multi-segment mode keeps 0x8800 (CODE|STATIC|ABSBANK) since each seg has a fixed ORG.
    • (4) BANKSIZE 0 → 0x10000 (matches real code segments).
    • (5) LOAD_NAME emitted as 10 bytes of zeros immediately after the 44-byte header (some sources omit it; real OMFs include it).

    GS/OS 6.0.2 is installed under tools/gsos/ and boots cleanly to Finder in MAME. Replacing /SYSTEM/START with a known-good OMF (the extracted ABOUT segment) gives error $005C — identical to what we get with our test program — meaning our OMF is indistinguishable from real Apple S16 as far as the Loader is concerned. The $005C is not OMF rejection; it is the boot-launcher path failing because a minimal /SYSTEM/START doesn't chain to a real Finder via QUIT-with-pathname.

    runtime/src/crt0Gsos.s is committed: skips SEI/LC-reconfig (GS/OS owns CPU state), zeros BSS, runs init_array, calls main, then QUIT(pcount=2) chained to gChainPath (default /SYSTEM/START.ORIG). Linkage works.

    Tested with a marker write as the very first instruction of crt0Gsos, replacing /SYSTEM/START with our OMF and saving the original as /SYSTEM/START.ORIG for chain-back. After 110-second boot: marker $00/0078 is still 0 — the Loader places our segment in RAM (entry signature found in 3 banks via memory search) but never JSLs entry. Tested ENTRY=0, ENTRY=1 (with NOP pad), auxtype=0 and =DB03; all give the same $005C without ever calling our code. Conclusion: the boot-launcher path requires the ~ExpressLoad segment that every real /SYSTEM/START carries. Without ExpressLoad, the bootstrap takes a code path that loads our segment but never auto-calls it.

    The OMF format became fully Loader-compatible after reading the Merlin32 source. Final canonical fields (single-segment Finder-launchable app):

    • KIND=0x1000 (CODE|PRIV) — was 0x8000 (CODE|STATIC) which came from extracting ABOUT from real FINDER, but ABOUT is a sub-segment called as a subroutine, not a launchable app
    • LABLEN=10 (fixed-width 10-byte LOAD_NAME and SEG_NAME, space-padded) — was 0 (length-prefixed) which is what /SYSTEM/START FINDER uses but the Loader will only LOAD, not JSL-into, that format
    • VERSION=0x02 (OMF v2.1)
    • BANKSIZE=0x10000 for code segs
    • Body opcode 0xF2 LCONST with NUMLEN-byte (=4) count

    ExpressLoad emission also landed (omfEmit --expressload): 6-byte header + segment list + remap list + header info, byte-equivalent to Merlin32's BuildExpressLoadSegment.

    End-to-end runtime verification: new scripts/runViaFinder.sh injects an OMF as /SYSTEM.DISK/HELLO, boots GS/OS in MAME, drives Finder via Lua keyboard automation (S+Cmd-O to open System.Disk, H+Cmd-O to launch HELLO), and samples specified memory addresses to verify execution. Pattern adapted from joeylib/scripts/run-iigs-mame.sh in a sibling project. Pure-asm marker tests (sta $000078 long, value=$42) are confirmed running under the real GS/OS Loader, with runViaFinder.sh hello.omf --check 0x000078=0x42 returning exit 0.

    Compiled C now runs under real GS/OS Loader. Implemented option (a) from the analysis: OMF cRELOC opcode emission.

    • link816 --reloc-out FILE records every R_W65816_IMM24 relocation site (intra-segment 24-bit refs only — GS/OS dispatcher calls and other cross-bank refs are filtered out) as a binary sidecar of (patchOff, offsetRef) pairs.
    • omfEmit --relocs FILE reads the sidecar and emits a cRELOC opcode (0xF5) per site between the LCONST data and the END opcode. Format per Merlin32: 0xF5 ByteCnt(=3) Shift(=0) OffsetPatch(2) OffsetReference(2) = 7 bytes.
    • The Loader rewrites segment[OffsetPatch..OffsetPatch+2] to (segPlacedBase + OffsetReference) at load time, fixing every jsl/jml/sta long/lda long operand that targets an in-segment symbol.
    • End-to-end verified: a real C function call + for loop (sumTo(10) → 55, sumTo(100) → 5050) compiled with clang -O2, linked, OMF-emitted with cRELOC, injected as /SYSTEM.DISK/HELLO, launched from Finder via MAME-Lua keyboard automation, marker bytes verified at the expected values. Smoke check #62 verifies cRELOC opcode count matches the link816 sidecar count.
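
The 7-byte cRELOC record and the load-time patch described above can be sketched as follows. The field order and sizes follow the format line quoted from Merlin32 (0xF5, ByteCnt=3, Shift=0, 2-byte OffsetPatch, 2-byte OffsetReference); little-endian field encoding is assumed, and encodeCReloc / applyCReloc are hypothetical names, not functions from omfEmit or the GS/OS Loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CRELOC_OPCODE = 0xF5, CRELOC_SIZE = 7 };

/* Emit one cRELOC record; returns the number of bytes written (7). */
size_t encodeCReloc(uint8_t out[CRELOC_SIZE], uint16_t patchOff, uint16_t offsetRef)
{
    assert(out != NULL);
    out[0] = CRELOC_OPCODE;
    out[1] = 3;                            /* ByteCnt: patch 3 bytes (24-bit) */
    out[2] = 0;                            /* Shift: none */
    out[3] = (uint8_t)(patchOff & 0xFFu);  /* OffsetPatch, low byte first */
    out[4] = (uint8_t)(patchOff >> 8);
    out[5] = (uint8_t)(offsetRef & 0xFFu); /* OffsetReference, low byte first */
    out[6] = (uint8_t)(offsetRef >> 8);
    return CRELOC_SIZE;
}

/*
 * Model of the load-time fix-up: write (placedBase + OffsetReference) as a
 * 24-bit little-endian value at segment[OffsetPatch .. OffsetPatch + 2].
 */
void applyCReloc(uint8_t *segment, size_t segmentSize,
                 const uint8_t rec[CRELOC_SIZE], uint32_t placedBase)
{
    assert(rec[0] == CRELOC_OPCODE && rec[1] == 3 && rec[2] == 0);
    uint16_t patchOff  = (uint16_t)(rec[3] | (rec[4] << 8));
    uint16_t offsetRef = (uint16_t)(rec[5] | (rec[6] << 8));
    assert((size_t)patchOff + 3u <= segmentSize);

    uint32_t address = (placedBase + offsetRef) & 0xFFFFFFu;
    segment[patchOff]     = (uint8_t)(address & 0xFFu);
    segment[patchOff + 1] = (uint8_t)((address >> 8) & 0xFFu);
    segment[patchOff + 2] = (uint8_t)((address >> 16) & 0xFFu);
}
```

For a JSL at segment offset 0 whose target sits at in-segment offset $0123, the record patches bytes 1..3; if the segment is placed at $050000, the operand becomes $050123.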

    Smoke tests #59-#60 (omfEmit single + multi-segment) verify the structural format invariants (VERSION=0x02, KIND=0x8000 or 0x8800, body opcode 0xF2 LCONST) so regressions are caught. scripts/runMultiSeg.sh mini-loader continues to cover the >64KB use case end-to-end.

  • C++ exceptions in CI smoke — runs reliably outside smoke; see context below. The SJLJ runtime end-to-end test passes; the C++ frontend→backend path is compile/link verified in smoke; full execution path is left out due to a MAME-side I/O flakiness (same binary runs fine interactively).

  • GS/OS validated against a real ProDOS volume — the wrapper contract (PHA + PEA 0 + LDX + JSL $E100A8 + post-call SP fixup) is verified end-to-end in MAME against a stub dispatcher (scripts/runInMameWithGsosStub.sh). Validating against an actual GS/OS-loaded volume needs a bootable system disk image attached as a MAME smartport hard disk and Tool Locator init — out of scope for an automated CI smoke.
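
For the multi-bank BSS item at the top of this list, the per-segment (start, end) table walk that crt0 would need might look like the sketch below. It is purely illustrative: no such table exists in the runtime today, the names BssRange and zeroBssRanges are hypothetical, and a real crt0 would do this in 65816 assembly with 24-bit addresses rather than host pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: one zero-init range per segment. */
typedef struct {
    uint8_t *start;   /* first byte of the range */
    uint8_t *end;     /* one past the last byte of the range */
} BssRange;

/* Zero every range in the table, in table order. */
void zeroBssRanges(const BssRange *table, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        assert(table[i].start <= table[i].end);
        memset(table[i].start, 0, (size_t)(table[i].end - table[i].start));
    }
}
```

The linker would have to emit one entry per segment that carries zero-init data, and crt0 would run the walk before the init_array pass.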