65816-llvm-mos/STATUS.md
Scott Duensing f338d93bae Checkpoint
2026-05-02 18:30:15 -05:00

# llvm816 — Current Status
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.
## What works
End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).
**Language coverage at -O2 (no extra flags):**
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
+ ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
the above sizes. Signed compare near INT_MIN handled via EOR-with-
sign-bit transform.
- Pointer arithmetic, array indexing, struct field access, struct
return-by-value (up to 8 bytes — Pair, Vec4, double).
- Pointer dereference (`*p`) lowers via `LDAptr / STAptr / STBptr`
to `[$E0],Y` indirect-LONG with the bank byte at `$E2` forced to 0
— DBR-independent, so `pha;plb` bank-switched callers don't corrupt
data through callee local-pointer writes. Const-int pointers
(`*(volatile uint16 *)0x5000 = v` MMIO idiom) lower to `STAabs`
(DBR-relative) so bank-2 writes still work.
- Bitfields, switch statements (verified up to ~12 cases + default),
function pointers, function-pointer tables, indirect calls via
`__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
reverse with `cons` works; free-list coalesce verified.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all return correct bit patterns
bit-for-bit against gcc with round-to-nearest-even rounding;
3-iter Newton sqrt converges. Compiles at -O2 throughout. Long-
running iterations may hit MAME's 1-second sim-time budget (test
config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format
coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
C99 truncation semantics for snprintf. `%.Nf` produces the
correct fractional digits with round-half-up.
- qsort + bsearch over arbitrary element size with a user `cmp`
callback.
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
strcspn, atol, llabs (kept in their own translation unit so
vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
sin, cos, tan, exp, log, atan, atan2, asin, acos, sinh, cosh,
tanh (and float variants). Bit-twiddling for fabs/floor/ceil/
copysign; Newton iteration for sqrt; range-reduction + Taylor
for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/
cosh/tanh. Accuracy is in the ~1e-6 range — good enough for
typical numeric work, far short of glibc-quality. These are
slow (each call is dozens to hundreds of soft-double libcalls)
— pre-compute or cache when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
- `<stdio.h>` file I/O against an in-memory FS:
`mfsRegister(path, buf, size, cap, writable)` stages a buffer as a
named file; `fopen`/`fread`/`fwrite`/`fseek`/`ftell`/`fclose`/`fgetc`/
`fgets`/`ungetc`/`fprintf` operate on it via a per-FILE
(kind, buf, size, cap, pos, eof, err, unget) record. stdin/
stdout/stderr route through `putchar` as before.
- `<wchar.h>`: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy /
wcscat / wcschr / wcsrchr; mbtowc / wctomb / mbstowcs /
wcstombs / mblen with the trivial 1:1 byte<->wide mapping
(Latin-1). wchar_t is 16-bit on this target.
- `<signal.h>`: in-process signal table. signal() registers a
handler; raise() invokes it. Default actions: SIGABRT calls
abort(), SIGINT/SIGTERM call exit(128+sig), others ignored.
- `<locale.h>`: setlocale always returns "C"; localeconv returns
a fixed C-locale lconv struct.
- C++ subset: classes, single inheritance, virtual functions,
polymorphism via base-class pointer arrays, virtual dtors.
Compile with `clang++ -fno-exceptions -fno-rtti`. Multiple
inheritance with virtual bases, full RTTI, exceptions are
out of scope.
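
The sign-bit transform mentioned for signed comparisons can be illustrated in portable C. This is only a sketch of the idea, not the backend's lowering code; the helper names `signedLess16` and `checkSignedLess16` are hypothetical. Flipping bit 15 of both operands maps the signed range onto the unsigned range in the same order, so an unsigned compare gives the signed answer even at `INT16_MIN`, where a subtract-and-test-sign approach would overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Signed a < b using only an unsigned comparison.
   XOR with 0x8000 flips the sign bit: INT16_MIN maps to 0x0000 and
   INT16_MAX maps to 0xFFFF, so ordering is preserved. */
static int signedLess16(int16_t a, int16_t b) {
    uint16_t ua = (uint16_t)((uint16_t)a ^ 0x8000u);
    uint16_t ub = (uint16_t)((uint16_t)b ^ 0x8000u);
    return ua < ub;
}

/* Spot checks, including the INT16_MIN edge cases. */
static void checkSignedLess16(void) {
    assert(signedLess16(INT16_MIN, 0));
    assert(!signedLess16(0, INT16_MIN));
    assert(signedLess16(-1, 1));
    assert(!signedLess16(INT16_MAX, INT16_MIN));
    assert(!signedLess16(5, 5));
}
```

The same trick extends to wider types by flipping only the top bit of the most significant word.
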
**Toolchain:**
- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
text/rodata/bss, emits a flat binary the IIgs ROM can load.
Auto-relocates bss above text+rodata when the default
`--bss-base 0x2000` would overlap text, and skips past the
IIgs IO window ($C000-$CFFF) if needed. `--gc-sections`
(default ON) drops unreachable functions: a minimal program
with full runtime linked shrinks from ~43KB to ~1.5KB.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
native object format) for round-tripping with classic dev tools.
- `link816 --debug-out FILE` writes a DWARF sidecar with text/
rodata/bss/init_array relocations applied to every `.debug_*`
section, so `.debug_addr` / `.debug_line` PC values are final-
image addresses.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 113 end-to-end checks at -O2:
scalar ops, control flow, calling conventions, MAME execution
regressions, link816 bss-base safety + weak-symbol resolution +
heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
standalone runtime headers, AsmPrinter peepholes (STZ / PEA /
PEI — single-STA, shared-LDA-multi-STA, DPF0-forwarding),
malloc/free coalesce ordering, plus real-world coverage:
Conway's Game of Life blinker (2D loop + neighbour bounds),
binary search tree (recursive struct + malloc), function-pointer
dispatch table (indirect JSL via `__jsl_indir`), memory-backed
file I/O (mfsRegister + fopen/fread/fwrite/fseek/fprintf), C++
polymorphism (single inheritance + virtual functions), wchar /
signal core APIs, hex dumper writing through fprintf, JSON
tokenizer state machine, scripts/bench.sh size-vs-Calypsi
harness. 100% pass.
- `scripts/bench.sh` compiles a microbenchmark suite with both
clang (this toolchain) and Calypsi cc65816, comparing emitted
text-section size. Current ratio: ~2.2x (clang generates more
bytes than Calypsi on average; sumOfSquares is the worst case
at 6.45x because of __mulsi3 dispatch). Eight benchmarks
shipped under `benchmarks/`.
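
The bss auto-relocation rule described for `link816` can be sketched as a small placement function. This is an illustration under assumptions, not the linker's actual code: the name `placeBss`, its parameters, and the order of the two checks are hypothetical. Only the default base `0x2000` and the IO window `$C000-$CFFF` come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_BSS_BASE 0x2000u
#define IO_WINDOW_START  0xC000u
#define IO_WINDOW_END    0xD000u /* exclusive: first byte after $CFFF */

/* Pick a bss base for an image occupying [imageStart, imageEnd),
   where the image is text followed by rodata. */
static uint32_t placeBss(uint32_t imageStart, uint32_t imageEnd,
                         uint32_t bssSize) {
    assert(imageStart <= imageEnd);
    uint32_t base = DEFAULT_BSS_BASE;

    /* If the default range overlaps the image, move bss above it. */
    if (base < imageEnd && base + bssSize > imageStart)
        base = imageEnd;

    /* If the chosen range touches the IO window, skip past it. */
    if (base < IO_WINDOW_END && base + bssSize > IO_WINDOW_START)
        base = IO_WINDOW_END;

    return base;
}
```

For example, a 2KB image at `$1000` leaves bss at `$2000`, an image ending at `$3000` pushes bss to `$3000`, and an image ending at `$BF80` with 256 bytes of bss lands at `$D000`.
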
**Backend register allocation:**
- Greedy regalloc as default at -O1+; fast at -O0/optnone.
- Pre-RA passes: `WidenAcc16` (Acc16→Wide16 promotion, lets
greedy spread i16 pressure across A and 16 IMG slots);
`TiedDefSpill` (handles tied-def-multi-use hazard);
`ABridgeViaX` (bridges via X/Y when free).
- Post-RA passes: `SpillToX` (STA/LDA pairs → TAX/TXA bridges
when X dead); `StackSlotCleanup` (deletes redundant adjacent
spills); `NegYIndY` (rewrites negative-Y indirect-Y stack-rel
ops to avoid the 24-bit-add bank-cross).
- Pre-emit: `BranchExpand` (long Bxx → INV_Bxx skip; BRA target);
`SepRepCleanup` (coalesces adjacent SEP/REP toggles, plus a
cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching
X-flag-only ops, branches, transfers — saves 4B / 12cyc per
collapse). AsmPrinter LDAi8imm peephole walks past mode-neutral
MIs to fuse the closing REP into a following SEP.
- Imaginary registers IMG0..IMG15 backed by DP $C0..$CE +
$D0..$DE — gives greedy 17 effective i16 carriers (A + 16 IMG)
before stack spills kick in.
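
The imaginary-register layout can be spelled out as address arithmetic. This sketch assumes the slots are assigned in ascending order with IMG0 at `$C0`, which the list above does not state; `imgSlotAddr` is a hypothetical name. Because each slot is two bytes wide, the two ranges `$C0..$CE` and `$D0..$DE` form one contiguous run of 16 slots.

```c
#include <assert.h>
#include <stdint.h>

enum { IMG_SLOT_COUNT = 16, IMG_DP_BASE = 0xC0, IMG_SLOT_BYTES = 2 };

/* Direct-page address of imaginary register IMGn (assumed ordering). */
static uint8_t imgSlotAddr(unsigned n) {
    assert(n < IMG_SLOT_COUNT);
    return (uint8_t)(IMG_DP_BASE + IMG_SLOT_BYTES * n);
}

/* A plus the 16 IMG slots gives the 17 i16 carriers. */
static unsigned effectiveI16Carriers(void) {
    return 1u + IMG_SLOT_COUNT;
}
```
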
**ABI:**
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
on the system stack with PHA. Caller deallocates via `tsc;clc;adc
#N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
for the +1 skew vs LLVM's full-descending model.
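
The two caller-cleanup sequences differ in code size, which the following sketch tallies. It assumes 16-bit accumulator and index registers, so `ADC #N` takes a 2-byte immediate and each `PLY` pops one 16-bit word. The function names are hypothetical, and the heuristic the backend actually uses to choose between the sequences is not stated here.

```c
#include <assert.h>

/* TSC (1) + CLC (1) + ADC #imm16 (3) + TCS (1) */
enum { ADJUST_SEQ_BYTES = 6 };

/* Code bytes needed to pop argBytes of arguments with PLY:
   one 1-byte PLY per 16-bit word. */
static unsigned plyCleanupBytes(unsigned argBytes) {
    assert(argBytes % 2 == 0);
    return argBytes / 2;
}

/* Nonzero when the PLY run is strictly smaller than the TSC/ADC/TCS
   sequence. Size only; cycle counts and clobbers are ignored. */
static int plyIsSmaller(unsigned argBytes) {
    return plyCleanupBytes(argBytes) < ADJUST_SEQ_BYTES;
}
```

By this count the `PLY` run is smaller up to 10 bytes of arguments and ties at 12. Note that the `TSC` sequence routes the stack pointer through A, while `PLY` overwrites Y, so the live return registers also constrain the choice.
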
**IIgs toolbox:**
- `iigs/toolbox.h` — autogenerated wrappers for all ~1300 IIgs
toolbox routines across 35 tool sets (Tool Locator, Memory
Manager, Misc Tools, QuickDraw II / Aux, Event Manager,
Sound Manager, Apple Desktop Bus, SANE, Integer Math, Text
Tools, Window Manager, Menu Manager, Control Manager,
LineEdit, Dialog Manager, Scrap Manager, Standard File,
Note Synth/Sequencer, Font Manager, List Manager, ACE,
Resource Manager, MIDI, Video Overlay, TextEdit, Media
Control, Print Manager, Scheduler, Desk Manager, …). Names
match Apple's IIgs Toolbox Reference exactly (TLStartUp,
MMStartUp, NewWindow, SysBeep, …). 417 simple wrappers
(zero/single-arg, i16-or-void return) inline in the header;
890 multi-arg ones live in `runtime/src/iigsToolbox.s`.
Generated by `scripts/genToolbox.py` from ORCA-C's
`ORCACDefs/` (re-runnable when ORCA-C updates).
## In flight
- **Greedy regalloc fails on long-arg call chains** — a function
that strings ~7+ independent `helper(longArg1, longArg2)` calls
overflows greedy at -O1+ with "ran out of registers during
register allocation". IMG slot expansion (8→16) raised the
threshold; most "normal-looking" mixed-arity workloads now
compile, but pathological pressure (many i32+ args + bitmask
SETCC chain in one function) still fails. Workarounds: mark
the heaviest helper `__attribute__((noinline))`; or
`-mllvm -regalloc=fast` for that TU; or `__attribute__((optnone))`
on the affected function. Proper fix needs either a custom
greedy→fast fallback in
`W65816TargetMachine::createTargetRegisterAllocator` or a
smarter spill-placement pre-RA pass.
- **`time()` / `clock()` are stubs** returning 0. ReadTimeHex
(Misc Tool $0D03) needs the Tool Locator initialised in crt0
to not crash MAME; the VBL counter at $E1006B needs 24-bit
far-pointer support that the backend doesn't yet model.
- **`(d,s),y / (sr,s),y` addressing crosses into the next bank** when
Y holds a negative offset, because Y is added as a 16-bit unsigned
value in a 24-bit sum. Worked around by `W65816NegYIndY`
rewriting the affected ops to `TAX ; LDA/STA $0000,X`. The
workaround is correct for negative offsets like `arr[i-1]`,
but the underlying issue remains unfixed at the addressing-mode
level.
- **Bank-0 size limit (~48KB)** — the runtime + program must fit
in $1000-$BFFF (text+rodata) plus $D000-$DFFF (LC1 for rodata-
spill and BSS). Past that, link816 hard-fails because text
would cross the IO window. In practice rarely hit thanks to
`--gc-sections`, but programs that genuinely use most of the
runtime can still trip it. Future work: enable LC2 / shadow
RAM via crt0 to add ~16KB more.
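
The bank-cross problem with negative Y offsets comes down to address arithmetic, modelled below on the host. This is a simplified sketch, not backend code; the function names are hypothetical. The indirect-indexed form adds Y to the full 24-bit `DBR:pointer` address, so a Y of `$FFFE` (that is, -2) carries into the bank byte. The workaround shape does the add in 16 bits first, which wraps within the bank, and then indexes from `$0000`.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of (sr,S),Y: Y is added to the 24-bit DBR:ptr. */
static uint32_t eaIndirectY(uint8_t dbr, uint16_t ptr, uint16_t y) {
    uint32_t base = ((uint32_t)dbr << 16) | ptr;
    return (base + y) & 0xFFFFFFu;
}

/* Workaround shape: 16-bit add (wraps), then index $0000 by the sum. */
static uint32_t eaViaX(uint8_t dbr, uint16_t ptr, uint16_t y) {
    uint16_t x = (uint16_t)(ptr + y);
    return (((uint32_t)dbr << 16) + x) & 0xFFFFFFu;
}

/* arr at $00:3000 accessed at byte offset -2. */
static void demoNegativeIndex(void) {
    assert(eaIndirectY(0x00, 0x3000, 0xFFFE) == 0x012FFEu); /* wrong bank */
    assert(eaViaX(0x00, 0x3000, 0xFFFE) == 0x002FFEu);      /* intended */
}
```

For non-negative offsets that stay inside the bank, both functions return the same address, which is why only the negative-Y cases need rewriting.
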
## Yet to come
- **GS/OS-backed `<stdio.h>` file I/O** — current FS is
memory-backed (programs `mfsRegister` buffers as files). A
GS/OS backend would let programs see the real ProDOS volume
during MAME execution, but needs Tool Locator init in crt0
and a class-1 parm-block dispatch wrapper around $E100A8.
- **C++ exceptions / RTTI / multiple inheritance with virtual
bases** — only the `-fno-exceptions -fno-rtti` subset is
supported. `__cxa_throw` etc. would need an unwind ABI on
this target plus a personality routine.
- **Close the size gap to Calypsi** — `scripts/bench.sh`
shows clang at ~2.2x Calypsi text size on the included
microbenchmarks, with sumOfSquares as the worst case (6.45x)
due to __mulsi3 dispatch overhead. Targeted improvements:
inline 16x16->32 multiply for small operands; widen the
IMG slot heuristic so greedy uses them more aggressively;
cycle-time benchmark harness (separate from size).
- **Larger/real-world end-to-end programs** — current real-world
smoke (Game of Life, BST, dispatch, hex dumper, JSON tokenizer)
exercises core idioms. A multi-thousand-line program (e.g.
a small interactive shell, a text editor command loop) would
catch issues no smaller test reaches.