llvm816 — Current Status
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate W65816 target.
What works
End-to-end C-to-binary toolchain that produces 65816 machine code which runs correctly under MAME (apple2gs).
Language coverage at -O2 (no extra flags):
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
plus ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all the above sizes. Signed compare near INT_MIN handled via EOR-with-sign-bit transform.
- Pointer arithmetic, array indexing, struct field access, struct return-by-value (up to 8 bytes — Pair, Vec4, double).
- Pointer dereference (*p) lowers via LDAptr / STAptr / STBptr to [$E0],Y indirect-LONG with the bank byte at $E2 forced to 0. This is DBR-independent, so pha;plb bank-switched callers don't corrupt data through callee local-pointer writes. Const-int pointers (the *(volatile uint16 *)0x5000 = v MMIO idiom) lower to STAabs (DBR-relative) so bank-2 writes still work.
- Bitfields, switch statements (verified up to ~12 cases + default), function pointers, function-pointer tables, indirect calls via the __jsl_indir trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- <stdarg.h> varargs with int / long / unsigned long long mixed args.
- Heap: malloc/free (libc.c first-fit allocator); linked-list reverse with cons works; free-list coalesce verified.
- Strings: hand-rolled strlen, strcmp, strcpy, strchr, atoi/itoa roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all match gcc bit-for-bit under round-to-nearest-even rounding; 3-iter Newton sqrt converges. Compiles at -O2 throughout. Long-running iterations may hit MAME's 1-second sim-time budget (test config issue, not a compiler bug).
- Inline assembly with "a", "x", "y" register constraints and arbitrary opcode bytes (used for the pha;plb bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial ctor (vtable + RTTI omitted; no exceptions).
- printf with %d %x %s %c %p and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format coverage as printf (%d %u %x %ld %lu %s %c %f %p %% + width). C99 truncation semantics for snprintf. %.Nf produces the correct fractional digits with round-half-up.
- qsort + bsearch over arbitrary element size with a user cmp callback.
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn, strcspn, atol, llabs (kept in their own translation unit so vprintf's branch layout doesn't shift).
- <math.h>: fabs, floor, ceil, fmod, copysign, sqrt, pow, sin, cos, tan, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh (and float variants). Bit-twiddling for fabs/floor/ceil/copysign; Newton iteration for sqrt; range-reduction + Taylor for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh. Accuracy is in the ~1e-6 range: good enough for typical numeric work, far short of glibc-quality. These are slow (each call is dozens to hundreds of soft-double libcalls), so pre-compute or cache when possible.
- setjmp/longjmp from libgcc.s.
- Static constructors via crt0's init_array walk.
- <stdio.h> file I/O against an in-memory FS: mfsRegister(path, buf, size, cap, writable) stages a buffer as a named file; fopen / fread / fwrite / fseek / ftell / fclose / fgetc / fgets / ungetc / fprintf operate on it via a per-FILE (kind, buf, size, cap, pos, eof, err, unget) record. stdin / stdout / stderr route through putchar as before.
- <wchar.h>: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy / wcscat / wcschr / wcsrchr; mbtowc / wctomb / mbstowcs / wcstombs / mblen with the trivial 1:1 byte<->wide mapping (Latin-1). wchar_t is 16-bit on this target.
- <signal.h>: in-process signal table. signal() registers a handler; raise() invokes it. Default actions: SIGABRT calls abort(), SIGINT/SIGTERM call exit(128+sig), others ignored.
- <locale.h>: setlocale always returns "C"; localeconv returns a fixed C-locale lconv struct.
- C++ subset: classes, single inheritance, multiple inheritance (Drawable+Movable through one Sprite), virtual base diamond (A and B virtually derive Base; Diamond inherits from both with one shared Base subobject), virtual functions, polymorphism via base-class pointer arrays, virtual dtors, this-pointer adjustment for non-leftmost bases, vbase offset tables. RTTI / dynamic_cast works (downcast, MI cross-cast, virtual-base sibling cast) via a minimal libcxxabi shim (runtime/src/libcxxabi.c) that provides __dynamic_cast + the three typeinfo class vtables (__class_type_info, __si_class_type_info, __vmi_class_type_info) + sized operator delete + __cxa_pure_virtual.
- C++ exceptions via clang++ -fsjlj-exceptions: throw, catch, catch-by-value, multiple catch handlers, exception destruction. The W65816SjLjFinalize IR pass inserts the call-site dispatch and per-function catch table; runtime/src/libcxxabiSjlj.c provides the Itanium SJLJ surface (_Unwind_SjLj_*, __cxa_throw, __cxa_begin_catch, etc.) plus a no-op personality.
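The EOR-with-sign-bit transform mentioned in the comparisons bullet above can be illustrated in plain C. This is a minimal sketch of the general technique, not the backend's actual lowering; the function name signedLess16 is hypothetical. Flipping bit 15 of both operands maps the signed range onto the unsigned range in the same order, so an unsigned compare gives the signed answer even at INT_MIN.

```c
#include <stdint.h>

/* Hypothetical helper: signed 16-bit "a < b" using only an unsigned compare.
   XOR with 0x8000 maps INT16_MIN..INT16_MAX onto 0x0000..0xFFFF in order,
   so no overflow-flag handling is needed near INT16_MIN. */
static int signedLess16(int16_t a, int16_t b)
{
    uint16_t ua = (uint16_t)((uint16_t)a ^ 0x8000u);
    uint16_t ub = (uint16_t)((uint16_t)b ^ 0x8000u);
    return ua < ub;
}
```

On the 65816 the same idea corresponds to an EOR #$8000 on each operand followed by an ordinary unsigned CMP and carry test.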
Toolchain:
- clang/llc produce W65816 assembly + ELF object files.
- tools/link816 resolves cross-translation-unit refs, lays out text/rodata/bss, emits a flat binary the IIgs ROM can load. Auto-relocates bss above text+rodata when the default --bss-base 0x2000 would overlap text, and skips past the IIgs IO window ($C000-$CFFF) if needed. --gc-sections (default ON) drops unreachable functions: a minimal program with full runtime linked shrinks from ~43KB to ~1.5KB.
- link816 --segment-cap N packs .text greedily into multiple bank-aligned segments, capped at N bytes per segment. Segment 1 stays at --text-base in bank 0 (alongside rodata + bss + init); segments 2..M start at --segment-bank-base (default $040000) in successive banks. --manifest path.json writes a JSON file listing each segment's image, base, and entry offset. Cross-bank JSL (IMM24 reloc) just works: it is patched at link time with the full 24-bit address. Cross-bank IMM16 is permitted (uses DBR for bank; caller pins DBR to data's bank); cross-bank PCREL is rejected with a clear diagnostic. scripts/runMultiSeg.sh is a mini in-Lua loader for MAME that reads the manifest, places each segment's bytes, and runs from segment 1's entry. Smoke uses it to verify cross-bank JSL end-to-end (helper3 chain across 3 bank-aligned segments).
- tools/omfEmit produces OMF v2.1 files in three modes: (a) single-segment: --input flat.bin --map flat.map --base ADDR --entry SYM, KIND=0x0000 (CODE, dynamic), ORG=0 (loader picks bank); (b) multi-segment: --manifest path.json reads link816's manifest and emits one OMF segment per entry with KIND=0x8800 (STATIC|ABSBANK|CODE) + ORG=segment-base, asking the GS/OS Loader to place each at its declared bank-aligned address. All intra-segment relocations were already patched by the linker, so no INTERSEG/RELOC opcodes are needed for v1 static placement. (c) --stack-size N (auto-enables --expressload) appends a ~Direct DP/Stack segment (KIND=0x1012) of N bytes so apps can request a custom DP+stack allocation from GS/OS instead of the Loader's 4KB default. Validated end-to-end via runViaFinder.sh under real GS/OS 6.0.2. The slow Loader path silently rejects multi-segment OMFs, so --stack-size is gated behind ExpressLoad emission.
- link816 --debug-out FILE writes a DWARF sidecar with text/rodata/bss/init_array relocations applied to every .debug_* section, so .debug_addr / .debug_line PC values are final-image addresses.
- runtime/build.sh builds crt0, libc, soft-float, soft-double, libgcc into linkable objects.
- scripts/smokeTest.sh runs 132 end-to-end checks at -O2: scalar ops, control flow, calling conventions, MAME execution regressions, link816 bss-base safety + weak-symbol resolution + heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link, iigs/gsos.h compile + link, standalone runtime headers, AsmPrinter peepholes (STZ / PEA / PEI: single-STA, shared-LDA-multi-STA, DPF0-forwarding), malloc/free coalesce ordering, plus real-world coverage: Conway's Game of Life blinker (2D loop + neighbour bounds), binary search tree (recursive struct + malloc), function-pointer dispatch table (indirect JSL via __jsl_indir), memory-backed file I/O (mfsRegister + fopen/fread/fwrite/fseek/fprintf), C++ polymorphism (single inheritance), C++ multiple inheritance (Drawable+Movable), C++ virtual base diamond, C++ dynamic_cast (SI + MI cross-cast + virtual-base sibling cast through libcxxabi shim), SJLJ exception runtime end-to-end (libcxxabiSjlj.c throw/catch round-trip via setjmp/longjmp + catch-table walk), C++ -fsjlj-exceptions compile + link (the C++ frontend → backend path is execution-verified manually but skipped from MAME smoke due to MAME-side flakiness; see "What's next"), GS/OS wrapper round-trip via stub dispatcher pre-loaded at $E100A8 (validates PHA + PEA 0 + JSL + post-call SP-fixup contract end-to-end), wchar / signal core APIs, hex dumper writing through fprintf, JSON tokenizer state machine, hash-table command shell (parser-dispatch + chained collisions over fprintf-to-mfs), scripts/bench.sh size-vs-Calypsi harness. 100% pass.
- scripts/benchCyclesPrecise.sh measures per-call cycle counts via MAME's emulated time counter. Eight benchmarks under benchmarks/. Current numbers: popcount 6888 cyc, bsearch 1108, memcmp 1569, strcpy 3580, dotProduct 4774, fib(10) 14152, sumOfSquares 49104. Speed is the optimization priority, not size.
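The bss auto-relocation rule described for link816 in the list above can be sketched as a small placement function. This is an illustration only: the name pickBssBase, its parameters, and the exact overlap tests are assumptions and are not taken from the link816 sources. Only the 0x2000 default and the $C000-$CFFF IO window come from the text.

```c
#include <stdint.h>

#define IO_WINDOW_START 0xC000u  /* first byte of the IIgs IO window */
#define IO_WINDOW_END   0xD000u  /* one past $CFFF */

/* Hypothetical sketch: choose a bss base given the text+rodata image range
   [imageStart, imageEnd) and the requested base (0x2000 by default). */
static uint32_t pickBssBase(uint32_t imageStart, uint32_t imageEnd,
                            uint32_t bssSize, uint32_t requested)
{
    uint32_t base = requested;

    /* If bss would overlap text+rodata, move it just above the image. */
    if (base < imageEnd && base + bssSize > imageStart)
        base = imageEnd;

    /* If bss would touch the IO window, skip past it. */
    if (base < IO_WINDOW_END && base + bssSize > IO_WINDOW_START)
        base = IO_WINDOW_END;

    return base;
}
```

With an image occupying 0x1000..0x3000 and 0x100 bytes of bss, this sketch moves the base from 0x2000 to 0x3000; an image ending at 0xBF00 with 0x200 bytes of bss lands at 0xD000.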
Backend register allocation:
- Basic regalloc as default at -O1+; fast at -O0/optnone. We use basic instead of greedy because greedy fails ("ran out of registers during register allocation") on functions with many cross-call Acc16 vregs (the ok |= bit; helper(); ok |= bit; pattern across many if-blocks). Basic handles those cleanly with negligible code-size overhead vs greedy on the bench suite (~0.6%).
- Pre-RA passes: WidenAcc16 (Acc16→Wide16 promotion, lets greedy spread i16 pressure across A and 16 IMG slots); TiedDefSpill (handles tied-def-multi-use hazard); ABridgeViaX (bridges via X/Y when free).
- Post-RA passes: SpillToX (STA/LDA pairs → TAX/TXA bridges when X dead); StackSlotCleanup (deletes redundant adjacent spills); NegYIndY (rewrites negative-Y indirect-Y stack-rel ops to avoid the 24-bit-add bank-cross).
- Pre-emit: BranchExpand (long Bxx → INV_Bxx skip; BRA target); SepRepCleanup (coalesces adjacent SEP/REP toggles, plus a cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching X-flag-only ops, branches, transfers; saves 4B / 12cyc per collapse). AsmPrinter LDAi8imm peephole walks past mode-neutral MIs to fuse the closing REP into a following SEP.
- Imaginary registers IMG0..IMG15 backed by DP $C0..$CE + $D0..$DE, giving greedy 17 effective i16 carriers (A + 16 IMG) before stack spills kick in.
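The BranchExpand item above depends on two facts about 65816 conditional branches: the displacement is a signed 8-bit value measured from the end of the 2-byte instruction, and each condition's inverse differs only in opcode bit 5. The sketch below shows both checks in C. The function names are hypothetical and this is not the pass's code; it also leaves out the unconditional jump emitted after the inverted branch.

```c
#include <stdint.h>

/* Hypothetical helper: does a 2-byte Bxx at branchAddr reach targetAddr?
   The displacement is relative to the address after the instruction. */
static int fitsShortBranch(uint32_t branchAddr, uint32_t targetAddr)
{
    int32_t disp = (int32_t)targetAddr - (int32_t)(branchAddr + 2);
    return disp >= -128 && disp <= 127;
}

/* Hypothetical helper: invert a conditional-branch opcode.
   BPL $10 <-> BMI $30, BVC $50 <-> BVS $70,
   BCC $90 <-> BCS $B0, BNE $D0 <-> BEQ $F0. */
static uint8_t invertCondBranch(uint8_t opcode)
{
    return (uint8_t)(opcode ^ 0x20u);
}
```

When fitsShortBranch fails, the expansion described above branches on the inverted condition over the following unconditional jump, so the original target is still reached only when the original condition holds.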
ABI:
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL on the system stack with PHA. Caller deallocates via tsc;clc;adc #N;tcs or PLY*N/2.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account for the +1 skew vs LLVM's full-descending model.
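The empty-descending stack and its +1 skew can be seen in a small host-side simulation. This is a sketch under stated assumptions: the memory array, the initial S value, and the helper names are hypothetical, and it models only 16-bit PHA, stack-relative loads, and the caller's deallocation of pushed argument bytes.

```c
#include <stdint.h>

static uint8_t  mem[0x10000];  /* simulated bank-0 memory */
static uint16_t S = 0x01FF;    /* stack pointer: address of the next free byte */

/* 16-bit PHA: high byte first, then low byte; S ends up below the data. */
static void push16(uint16_t v)
{
    mem[S--] = (uint8_t)(v >> 8);
    mem[S--] = (uint8_t)(v & 0xFF);
}

/* Stack-relative 16-bit load (LDA d,S): effective address is S + d. */
static uint16_t loadStackRel16(uint8_t d)
{
    uint16_t addr = (uint16_t)(S + d);
    return (uint16_t)(mem[addr] | (mem[(uint16_t)(addr + 1)] << 8));
}

/* Caller-side cleanup, equivalent in effect to tsc;clc;adc #n;tcs. */
static void callerDealloc(uint16_t n)
{
    S = (uint16_t)(S + n);
}
```

Because S points at the next free byte, the most recently pushed word is read at offset 1, not 0; a full-descending model would read it at offset 0, which is the skew the frame offsets have to absorb.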
IIgs toolbox:
iigs/toolbox.h: autogenerated wrappers for all ~1300 IIgs toolbox routines across 35 tool sets (Tool Locator, Memory Manager, Misc Tools, QuickDraw II / Aux, Event Manager, Sound Manager, Apple Desktop Bus, SANE, Integer Math, Text Tools, Window Manager, Menu Manager, Control Manager, LineEdit, Dialog Manager, Scrap Manager, Standard File, Note Synth/Sequencer, Font Manager, List Manager, ACE, Resource Manager, MIDI, Video Overlay, TextEdit, Media Control, Print Manager, Scheduler, Desk Manager, …). Names match Apple's IIgs Toolbox Reference exactly (TLStartUp, MMStartUp, NewWindow, SysBeep, …). 417 simple wrappers (zero/single-arg, i16-or-void return) are inline in the header; 890 multi-arg ones live in runtime/src/iigsToolbox.s. Generated by scripts/genToolbox.py from ORCA-C's ORCACDefs/ (re-runnable when ORCA-C updates).
What's next
Work is now optimization-focused; the toolchain is feature-complete for the common-case C / minimal-C++ workload. Priority is speed (cycle counts), not size.
Speed wins queued, ranked by expected impact:
- u16×u16 → u32 multiply path. sumOfSquares is 982 cyc/iter, bottlenecked by __mulsi3 for what's effectively a 16×16 multiply (both inputs are zext from u16). Adding a __umulhi3 libcall + SDAG hook to detect MUL(zext(a), zext(b)) could roughly halve the iteration cost.
- Fold while (x != 0) for i32 to lda lo; ora hi; bne. The combiner currently materializes a SETCC boolean and re-tests it, generating ~10 redundant ops in every i32-iteration loop. Hot in popcount, CRC, and any BigInt-style code.
- ptr32 pointer-increment overhead. *p++ under ptr32 emits a full 32-bit ADC chain even when the high half is provably unchanged. strcpy and memcmp pay 30+ cycles per byte for what should be 15-20. Needs a peephole or SDAG combine for i32 + 1 with provably-no-carry-into-hi.
- Greedy regalloc retry. Currently blocked on an upstream LLVM LiveRangeEdit::eliminateDeadDef assertion when our sub-register pair partial-defs reach it. Basic regalloc works but leaves measurable cycle waste in load/store shuffles.
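To make the first queued item concrete, here is a minimal C sketch of a 16×16 → 32 shift-and-add multiply. It is not the proposed __umulhi3 libcall; the name umul16x16 and the loop structure are assumptions for illustration. The point is that the loop runs at most 16 iterations, driven by a 16-bit multiplier, where a general 32×32 routine has to handle 32-bit operands.

```c
#include <stdint.h>

/* Hypothetical sketch: unsigned 16x16 -> 32 multiply by shift-and-add.
   The loop exits as soon as the remaining multiplier bits are zero. */
static uint32_t umul16x16(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;
    uint32_t addend = a;        /* a << i on iteration i */

    while (b != 0) {
        if (b & 1u)
            acc += addend;
        addend <<= 1;
        b >>= 1;
    }
    return acc;
}
```

In source terms, the pattern the SDAG hook would look for is (uint32_t)x * (uint32_t)y with both x and y of type uint16_t.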
Open limitations:
- Multi-bank BSS / init_array. Multi-segment mode splits .text across banks but BSS + init_array still live in segment 1's bank (bank 0). Programs with zero-init data exceeding the ~60KB bank-0 budget need crt0 to walk a per-segment (start, end) table. Not a blocker for >64KB code programs.
- C++ exceptions absent from CI smoke. The SJLJ runtime round-trip is in smoke; the full clang++ → backend → MAME execution path runs reliably interactively but is excluded from automated smoke due to MAME-side I/O flakiness.
- GS/OS validation uses a stub dispatcher. The wrapper contract (PHA + PEA 0 + LDX + JSL $E100A8 + post-call SP fixup) is verified end-to-end in MAME against a stub (scripts/runInMameWithGsosStub.sh). Validation against a real bootable GS/OS volume is left out of CI as it needs a smartport hard-disk image and live Tool Locator init.
- gmtime_r requires optnone. IR-level optimizer issue: loop rotation + IndVar simplify mis-evaluate days >= 365L + (__isLeap(...) ? 1 : 0), folding the comparison to compile-time-false. Not a backend bug; needs IR-pass-level diagnosis.
- softDouble dpack / dclass require noinline. Inlining triggers register pressure that overflows basic regalloc in __adddf3 / __muldf3 / __divdf3. Architectural for the same reason as qsort's earlier split.
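For the gmtime_r limitation, the affected loop shape looks roughly like the sketch below. This is a reconstruction from the comparison quoted above, not the runtime's source: yearFromDays, isLeap, and the NO_OPT macro are hypothetical names, and the 1970 epoch is an assumption. The macro applies clang's optnone attribute, which is the workaround the list describes, and expands to nothing on other compilers.

```c
#if defined(__clang__)
#define NO_OPT __attribute__((optnone))  /* keep IR optimizations off this function */
#else
#define NO_OPT
#endif

/* Hypothetical stand-in for the runtime's __isLeap helper. */
static int isLeap(long year)
{
    return (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
}

/* Hypothetical sketch: peel whole years off a day count since 1970-01-01. */
NO_OPT static long yearFromDays(long days)
{
    long year = 1970;

    while (days >= 365L + (isLeap(year) ? 1 : 0)) {
        days -= 365L + (isLeap(year) ? 1 : 0);
        year++;
    }
    return year;
}
```

If the loop condition were folded to false at compile time, as the list reports, every input would return 1970, so a check on a date in any later year is enough to detect the miscompile.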