65816-llvm-mos/STATUS.md
Scott Duensing e65fedc8e1 Checkpoint
2026-05-13 15:48:34 -05:00

# llvm816 — Current Status
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.
## What works
End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).
**Language coverage at -O2 (no extra flags):**
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
+ ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
the above sizes. Signed compare near INT_MIN handled via EOR-with-
sign-bit transform.
- Pointer arithmetic, array indexing, struct field access, struct
return-by-value (up to 8 bytes — Pair, Vec4, double).
- Pointer dereference (`*p`) lowers via `LDAptr / STAptr / STBptr`
to `[$E0],Y` indirect-LONG with the bank byte at `$E2` forced to 0
— DBR-independent, so `pha;plb` bank-switched callers don't corrupt
data through callee local-pointer writes. Const-int pointers
(`*(volatile uint16 *)0x5000 = v` MMIO idiom) lower to `STAabs`
(DBR-relative) so bank-2 writes still work.
- Bitfields, switch statements (verified up to ~12 cases + default),
function pointers, function-pointer tables, indirect calls via
`__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
reverse with `cons` works; free-list coalesce verified.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all return correct bit patterns
bit-for-bit against gcc with round-to-nearest-even rounding;
3-iter Newton sqrt converges. Compiles at -O2 throughout. Long-
running iterations may hit MAME's 1-second sim-time budget (test
config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format
coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
C99 truncation semantics for snprintf. `%.Nf` produces the
correct fractional digits with round-half-up.
- scanf family: `sscanf` / `vsscanf` parse a C string; `fscanf` /
`vfscanf` bridge to vsscanf via a per-call line buffer (caps at
255 bytes / line; a longer line silently truncates). `scanf`
reads from stdin which always returns EOF on this target — the
surface compiles but isn't useful without a stdin source.
Format directives: `%d %i %u %x %X %o %s %c %ld %lu %lx %li %lo %%`.
- qsort + bsearch over arbitrary element size with a user `cmp`
callback.
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
strcspn, atol, llabs (kept in their own translation unit so
vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
sin, cos, tan, exp, log, atan, atan2, asin, acos, sinh, cosh,
tanh (and float variants). Bit-twiddling for fabs/floor/ceil/
copysign; Newton iteration for sqrt; range-reduction + Taylor
for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/
cosh/tanh. Accuracy is in the ~1e-6 range — good enough for
typical numeric work, far short of glibc-quality. These are
slow (each call is dozens to hundreds of soft-double libcalls)
— pre-compute or cache when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
- `<stdio.h>` file I/O with two backends:
  - **mfs**: `mfsRegister(path, buf, size, cap, writable)` stages a
    memory buffer as a named file. Used by smoke tests that don't
    have a real disk. Fully validated end-to-end.
  - **GS/OS**: `fopen` falls through to `gsosOpen` for any path not
    in the mfs table. Routes through the GS/OS class-1 dispatcher
    via wrappers in `runtime/src/iigsGsos.s` (Open/Read/Write/Close/
    SetMark/GetMark/SetEOF/GetEOF). The full stdio surface
    (`fread/fwrite/fseek/ftell/fclose/fgetc/fputc/fputs/fgets/ungetc/
    feof/ferror/clearerr/rewind/fprintf/vfprintf`) dispatches on
    backend. link816 honors weak symbols so programs that don't use
    the GS/OS backend don't have to link `iigsGsos.o`.
  - **Validation status:** code path compiles, links, and runs under
    `runViaFinder.sh --data` injection. `fopen` + `gsosOpen` hangs
    when invoked under real GS/OS 6.0.2 (JSL $E100A8 doesn't return);
    root cause not yet diagnosed. Stub-dispatcher GS/OS smoke (the
    existing one) validates the wrapper contract independently. An
    XFAIL'd end-to-end smoke is in `scripts/smokeTest.sh` gated
    behind `GSOS_FILE_SMOKE=1` for use after the dispatcher path is
    fixed. `runViaFinder.sh --data /PATH=local_file` is the
    automated-injection mechanism for runtime-test data files.
- stdin/stdout/stderr route through `putchar` as before.
- `<wchar.h>`: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy /
wcscat / wcschr / wcsrchr; mbtowc / wctomb / mbstowcs /
wcstombs / mblen with the trivial 1:1 byte<->wide mapping
(Latin-1). wchar_t is 16-bit on this target. Extended set:
wmemcpy / wmemmove / wmemset / wmemcmp / wmemchr;
wcstol / wcstoul / wcstoll / wcstoull / wcstod / wcstof;
swprintf / vswprintf; wcsftime. All delegate to the byte
equivalents under the Latin-1 model.
- `<signal.h>`: in-process signal table. signal() registers a
handler; raise() invokes it. Default actions: SIGABRT calls
abort(), SIGINT/SIGTERM call exit(128+sig), others ignored.
- `<locale.h>`: setlocale always returns "C"; localeconv returns
a fixed C-locale lconv struct.
- `<fenv.h>`: rounding mode + exception flag word tracked but
no-op (softFloat / softDouble are fixed RNE; exceptions never
raised). Surface compiles cleanly for portable code.
- `<tgmath.h>`: C11 type-generic math via `_Generic`; selects
`sqrtf` vs `sqrt` etc. based on argument type.
- `<stdatomic.h>`: C11 atomic surface, all ops lower to plain
ops (single-core uniprocessor — no real synchronization
needed). `_Atomic T` is treated as plain `T`.
- `<threads.h>`: stubs. `thrd_create` returns `thrd_error`;
mutex/cond ops are no-ops; `call_once` and `tss_*` work since
they're degenerate on a single-core target.
- `aligned_alloc` / `posix_memalign` / `aligned_free`: wrap
malloc with an over-allocation + pointer-stash trick. Match
C11 contract — `aligned_alloc(N, M)` returns N-aligned, free
with `aligned_free`.
- `<iso646.h>`: alternative operator spellings (`and`, `or`,
`not`, etc.) — C95 compat header.
- `<stdalign.h>`: aliases `_Alignas` / `_Alignof` to `alignas` /
`alignof`.
- `<stdnoreturn.h>`: aliases `_Noreturn` to `noreturn`.
- `<uchar.h>`: `char16_t` / `char32_t` typedefs + `mbrtoc16` /
`c16rtomb` / `mbrtoc32` / `c32rtomb` conversion helpers. In
our Latin-1 model these are 1:1 byte copies (no UTF-8 decode).
- `<wctype.h>`: wide-char classification + case folding.
Delegates to `<ctype.h>` for code-points 0..255; anything
outside Latin-1 returns false / unchanged.
- `<complex.h>`: C99 complex-number surface — clang built-in
`_Complex` lowers to soft-double under the hood. Macros
`complex` / `_Complex_I` / `I` / `CMPLX` / `CMPLXF` / `CMPLXL`
plus inline `creal` / `cimag` / `conj` / `cproj` / `cabs` /
`carg` and their `f` / `l` variants. Transcendental complex
routines (csin/ccos/cexp/etc.) intentionally not provided —
they would each need a polynomial-expansion implementation
with limited IIgs value.
- `<assert.h>`: adds C11 `static_assert` as a macro alias for
the `_Static_assert` keyword.
- `<errno.h>`: full C standard error codes (EDOM, ERANGE,
EILSEQ) plus common POSIX codes (EPERM..EPIPE, ENAMETOOLONG,
ENOSYS, ENOTEMPTY, ELOOP). `strerror` maps every defined
code to a human-readable string.
- `<stdio.h>`: adds C standard buffer-control surface
(`setvbuf` / `setbuf` as no-ops, `_IOFBF` / `_IOLBF` / `_IONBF`
/ `BUFSIZ`); `fgetpos` / `fsetpos` wrap `ftell` / `fseek`;
`remove` routes through `mfsUnregister`; `rename` / `tmpfile`
/ `tmpnam` are stubs.
- C++ subset: classes, single inheritance, multiple inheritance
(Drawable+Movable through one Sprite), virtual base diamond
(A and B virtually derive Base; Diamond inherits from both
with one shared Base subobject), virtual functions,
polymorphism via base-class pointer arrays, virtual dtors,
this-pointer adjustment for non-leftmost bases, vbase offset
tables. RTTI / `dynamic_cast` works (downcast, MI cross-cast,
virtual-base sibling cast) via a minimal libcxxabi shim
(`runtime/src/libcxxabi.c`) that provides `__dynamic_cast` +
the three typeinfo class vtables (`__class_type_info`,
`__si_class_type_info`, `__vmi_class_type_info`) + sized
`operator delete` + `__cxa_pure_virtual`.
- C++ exceptions via `clang++ -fsjlj-exceptions`: throw, catch,
catch-by-value, multiple catch handlers, exception destruction.
`W65816SjLjFinalize` IR pass inserts the call-site dispatch and
per-function catch table; `runtime/src/libcxxabiSjlj.c` provides
the Itanium SJLJ surface (`_Unwind_SjLj_*`, `__cxa_throw`,
`__cxa_begin_catch`, etc.) plus a no-op personality.
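
The EOR-with-sign-bit transform mentioned under comparisons can be illustrated on the host. This is a minimal sketch, not the backend's lowering code: `signedLessViaSignFlip` is a hypothetical name, and the 16-bit width is chosen only to match the 65816's accumulator.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: signed 16-bit "a < b" computed with an unsigned
 * comparison. XOR-ing the sign bit (0x8000) into both operands shifts
 * the signed range onto the unsigned range without changing the order. */
static bool signedLessViaSignFlip(int16_t a, int16_t b)
{
    uint16_t ua = (uint16_t)((uint16_t)a ^ 0x8000u);
    uint16_t ub = (uint16_t)((uint16_t)b ^ 0x8000u);
    return ua < ub;
}
```

Flipping the sign bit maps `INT16_MIN` to 0 and `INT16_MAX` to 0xFFFF, so the unsigned compare orders values the same way a signed compare would, including at the `INT16_MIN` edge.
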
**Toolchain:**
- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
text/rodata/bss, emits a flat binary the IIgs ROM can load.
Auto-relocates bss above text+rodata when the default
`--bss-base 0x2000` would overlap text, and skips past the
IIgs IO window ($C000-$CFFF) if needed. `--gc-sections`
(default ON) drops unreachable functions: a minimal program
with full runtime linked shrinks from ~43KB to ~1.5KB.
- `link816 --segment-cap N` packs `.text` greedily into multiple
bank-aligned segments, capped at N bytes per segment. Segment 1
stays at `--text-base` in bank 0 (alongside rodata + bss + init);
segments 2..M start at `--segment-bank-base` (default $040000)
in successive banks. `--manifest path.json` writes a JSON file
listing each segment's image, base, and entry offset.
Cross-bank `JSL` (IMM24 reloc) just works — patched at link
time with the full 24-bit address. Cross-bank IMM16 is
permitted (uses DBR for bank — caller pins DBR to data's bank);
cross-bank PCREL is rejected with a clear diagnostic.
`scripts/runMultiSeg.sh` is a mini in-Lua loader for MAME that
reads the manifest, places each segment's bytes, and runs from
segment 1's entry — used by smoke to verify cross-bank JSL
end-to-end (helper3 chain across 3 bank-aligned segments).
- `tools/omfEmit` produces OMF v2.1 files in three modes:
(a) single-segment — `--input flat.bin --map flat.map --base
ADDR --entry SYM`, KIND=0x0000 (CODE, dynamic), ORG=0 (loader
picks bank); (b) multi-segment — `--manifest path.json` reads
link816's manifest and emits one OMF segment per entry with
KIND=0x8800 (STATIC|ABSBANK|CODE) + ORG=segment-base, asking
the GS/OS Loader to place each at its declared bank-aligned
address. All intra-segment relocations were already patched by
the linker, so no INTERSEG/RELOC opcodes are needed for v1
static placement. (c) `--stack-size N` (auto-enables
`--expressload`) appends a `~Direct` DP/Stack segment
(KIND=0x1012) of N bytes so apps can request a custom DP+stack
allocation from GS/OS instead of the Loader's 4KB default.
Validated end-to-end via `runViaFinder.sh` under real GS/OS
6.0.2 — the slow Loader path silently rejects multi-segment
OMFs, so `--stack-size` is gated behind ExpressLoad emission.
- `link816 --debug-out FILE` writes a DWARF sidecar with text/
rodata/bss/init_array relocations applied to every `.debug_*`
section, so `.debug_addr` / `.debug_line` PC values are final-
image addresses.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 145 end-to-end checks at -O2:
scalar ops, control flow, calling conventions, MAME execution
regressions, link816 bss-base safety + weak-symbol resolution +
heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
iigs/gsos.h compile + link, standalone runtime headers,
AsmPrinter peepholes (STZ / PEA / PEI — single-STA, shared-
LDA-multi-STA, DPF0-forwarding), malloc/free coalesce ordering,
plus real-world coverage: Conway's Game of Life blinker
(2D loop + neighbour bounds), binary search tree (recursive
struct + malloc), function-pointer dispatch table (indirect
JSL via `__jsl_indir`), memory-backed file I/O (mfsRegister +
fopen/fread/fwrite/fseek/fprintf), C++ polymorphism (single
inheritance), C++ multiple inheritance (Drawable+Movable),
C++ virtual base diamond, C++ dynamic_cast (SI + MI cross-cast +
virtual-base sibling cast through libcxxabi shim), SJLJ exception
runtime end-to-end (libcxxabiSjlj.c throw/catch round-trip via
setjmp/longjmp + catch-table walk), C++ -fsjlj-exceptions
compile + link (the C++ frontend → backend path is execution-
verified manually but skipped from MAME smoke due to a
MAME-side flakiness — see "What's next"), GS/OS wrapper
round-trip via stub dispatcher pre-loaded at $E100A8 (validates
PHA + PEA 0 + JSL + post-call SP-fixup contract end-to-end),
wchar / signal core APIs, hex dumper writing through fprintf,
JSON tokenizer state machine, hash-table command shell (parser
+ dispatch + chained collisions over fprintf-to-mfs),
scripts/bench.sh size-vs-Calypsi harness. 100% pass.
- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
via MAME's emulated time counter. Eight benchmarks under
`benchmarks/`. Current numbers: popcount 3683 cyc, bsearch
852, memcmp 1091, strcpy 2558, dotProduct 2387, fib(10) 12617,
sumOfSquares 23529. Speed is the optimization priority, not
size.
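
The `--segment-cap` packing described above can be sketched as a greedy in-order pass. This is an illustrative model only: `packSegments`, its parameters, and the in-order next-fit policy are assumptions, not link816's actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of greedy packing: walk the items in order and
 * start a new segment whenever the next item would exceed the cap.
 * segOf[i] receives the zero-based segment index of item i.
 * Returns the number of segments used, or 0 if count is 0 or an item
 * is larger than the cap and so cannot be placed at all. */
static size_t packSegments(const uint32_t *sizes, size_t count,
                           uint32_t cap, size_t *segOf)
{
    size_t seg = 0;
    uint32_t used = 0;

    for (size_t i = 0; i < count; i++) {
        if (sizes[i] > cap) {
            return 0;
        }
        if (sizes[i] > cap - used) {   /* would overflow this segment */
            seg++;
            used = 0;
        }
        segOf[i] = seg;
        used += sizes[i];
    }
    return count ? seg + 1 : 0;
}
```

With sizes 300, 500, 400, 900, 100 and a cap of 1000, the items land in segments 0, 0, 1, 2, 2. Index 0 here corresponds to what the text calls segment 1.
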
**Backend register allocation:**
- Greedy regalloc as default at -O1+; fast at -O0/optnone. Greedy
was previously blocked by an upstream LLVM `LiveRangeEdit::eliminateDeadDef`
assertion firing on KILL pseudos with non-dead implicit-def $a. Fix landed in
`tools/llvm-mos/llvm/lib/CodeGen/InlineSpiller.cpp`: when InlineSpiller
converts a redundant STAfi to a KILL
pseudo, mark BOTH explicit and implicit defs dead (the original loop
only iterated `MI.defs()` = explicit-only, leaving the inherited
implicit-def $a live). Bench impact: popcount 19.4%, strcpy
18.9%, memcmp 8.6%, bsearch 9.2%.
- Pre-RA passes: `WidenAcc16` (Acc16→Wide16 promotion, lets
greedy spread i16 pressure across A and 16 IMG slots);
`TiedDefSpill` (handles tied-def-multi-use hazard);
`ABridgeViaX` (bridges via X/Y when free).
- Post-RA passes: `SpillToX` (STA/LDA pairs → TAX/TXA bridges
when X dead); `StackSlotCleanup` (deletes redundant adjacent
spills); `NegYIndY` (rewrites negative-Y indirect-Y stack-rel
ops to avoid the 24-bit-add bank-cross).
- Pre-emit: `BranchExpand` (long Bxx → INV_Bxx skip; BRA target);
`SepRepCleanup` (coalesces adjacent SEP/REP toggles, plus a
cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching
X-flag-only ops, branches, transfers — saves 4B / 12cyc per
collapse). AsmPrinter LDAi8imm peephole walks past mode-neutral
MIs to fuse the closing REP into a following SEP.
- Imaginary registers IMG0..IMG15 backed by DP $C0..$CE +
$D0..$DE — gives greedy 17 effective i16 carriers (A + 16 IMG)
before stack spills kick in.
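
The decision behind a pass like `BranchExpand` can be modelled in a few lines. The sketch below is not the pass itself; `fitsShortBranch` and `invertCondBranch` are hypothetical helpers that rely only on two properties of the 65816 encoding: conditional branches carry a signed 8-bit displacement measured from the end of the 2-byte instruction, and each condition's opcode differs from its inverse by bit 5.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if a 2-byte conditional branch located at
 * branchAddr can reach target with its signed 8-bit displacement. */
static bool fitsShortBranch(uint32_t branchAddr, uint32_t target)
{
    int32_t disp = (int32_t)target - (int32_t)(branchAddr + 2u);
    return disp >= -128 && disp <= 127;
}

/* Hypothetical helper: flip a conditional branch opcode to its inverse
 * (BEQ <-> BNE, BCC <-> BCS, BPL <-> BMI, BVC <-> BVS). */
static uint8_t invertCondBranch(uint8_t opcode)
{
    return (uint8_t)(opcode ^ 0x20u);
}
```

When `fitsShortBranch` fails, the expansion the bullet describes places the inverted branch in front of an unconditional transfer to the original target, so the inverted branch skips over it.
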
**ABI:**
- arg0 in A; arg1 in X for i32-first-arg signatures; remaining args are
pushed right-to-left on the system stack with PHA. Caller deallocates
via `tsc;clc;adc #N;tcs` or `PLY` repeated N/2 times.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
for the +1 skew vs LLVM's full-descending model.
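
The i64 return convention can be pictured as four 16-bit words. In this sketch only the placement of the highest word in DP `$F0..$F1` comes from the text above; the low-to-high order A, X, Y is an assumption, and `Ret64Words`, `splitRet64`, and `joinRet64` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical view of an i64 return value as four 16-bit carriers.
 * Assumption: A holds the lowest word, then X, then Y. The highest
 * word lives in direct page $F0..$F1, as the ABI notes state. */
typedef struct {
    uint16_t a;
    uint16_t x;
    uint16_t y;
    uint16_t dpF0;
} Ret64Words;

static Ret64Words splitRet64(uint64_t v)
{
    Ret64Words w;
    w.a    = (uint16_t)(v & 0xFFFFu);
    w.x    = (uint16_t)((v >> 16) & 0xFFFFu);
    w.y    = (uint16_t)((v >> 32) & 0xFFFFu);
    w.dpF0 = (uint16_t)((v >> 48) & 0xFFFFu);
    return w;
}

static uint64_t joinRet64(Ret64Words w)
{
    return (uint64_t)w.a
         | ((uint64_t)w.x << 16)
         | ((uint64_t)w.y << 32)
         | ((uint64_t)w.dpF0 << 48);
}
```

Under that assumed ordering, `0x1122334455667788` comes back as A=`0x7788`, X=`0x5566`, Y=`0x3344`, and `0x1122` in the DP slot.
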
**IIgs toolbox:**
- `iigs/toolbox.h` — autogenerated wrappers for all ~1300 IIgs
toolbox routines across 35 tool sets (Tool Locator, Memory
Manager, Misc Tools, QuickDraw II / Aux, Event Manager,
Sound Manager, Apple Desktop Bus, SANE, Integer Math, Text
Tools, Window Manager, Menu Manager, Control Manager,
LineEdit, Dialog Manager, Scrap Manager, Standard File,
Note Synth/Sequencer, Font Manager, List Manager, ACE,
Resource Manager, MIDI, Video Overlay, TextEdit, Media
Control, Print Manager, Scheduler, Desk Manager, …). Names
match Apple's IIgs Toolbox Reference exactly (TLStartUp,
MMStartUp, NewWindow, SysBeep, …). 417 simple wrappers
(zero/single-arg, i16-or-void return) inline in the header;
890 multi-arg ones live in `runtime/src/iigsToolbox.s`.
Generated by `scripts/genToolbox.py` from ORCA-C's
`ORCACDefs/` (re-runnable when ORCA-C updates).
## What's next
Work is now optimization-focused; the toolchain is feature-complete
for the common-case C / minimal-C++ workload. Priority is speed
(cycle counts), not size.
**Speed wins queued, ranked by expected impact:**
- **ptr32 pointer-increment overhead** (partially addressed). The
`i32 += 1` post-PEI peephole (`W65816I32IncFold`) detects the
6-instruction LDA/ADCi16imm 1/STA/LDA/ADCEi16imm 0/STA pattern and
rewrites to LDA/INA/STA/INC_HI_IF_CARRY (with private-label BNE
expansion in AsmPrinter). Saves ~13 cyc per increment on the
no-carry common path. memcmp 1330 → 1194 (10.2%), strcpy 3325 →
3154 (5.1%). LSR's `*p++ → base+offset` rewrite remains
unaddressed; tried `-disable-lsr` and `isLSRCostLess` override,
both regressed dotProduct.
- **More peephole / libcall opportunities.** `__mulsi3` just gained
early-exit when the multiplier shifts to 0; dotProduct dropped
4007→2472 (38.3%), sumOfSquares 40920→23870 (41.7%). Next
candidates: a true 16×16→32 multiply libcall (for `(u32)i*i`
patterns) and shift-by-N inlining for shifts 5+ that currently
go through `__ashlsi3`.
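
The early exit described for `__mulsi3` can be sketched as a shift-and-add loop that stops once no multiplier bits remain. This is a host-side model, not the runtime's assembly; `mulShiftAdd` and its `iterations` out-parameter are hypothetical and exist only to make the saved iterations visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a 32x32->32 shift-and-add multiply.
 * The loop condition is the early exit: once the multiplier has been
 * shifted down to 0 there are no more partial products to add.
 * If iterations is non-NULL it receives the number of loop passes. */
static uint32_t mulShiftAdd(uint32_t a, uint32_t b, unsigned *iterations)
{
    uint32_t result = 0;
    unsigned n = 0;

    while (b != 0) {
        if (b & 1u) {
            result += a;   /* add the current partial product */
        }
        a <<= 1;
        b >>= 1;
        n++;
    }
    if (iterations != NULL) {
        *iterations = n;
    }
    return result;
}
```

With a multiplier of 5 the loop runs three times instead of a fixed 32; results wrap modulo 2^32 as with any `uint32_t` multiply.
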
**Open limitations:**
- **Multi-bank BSS** — full support up to 4 banks (256KB). link816
splits BSS into up to 4 contiguous segments at link time; each
segment fits within a single bank. Linker emits
`__bss_seg{0..3}_lo16 / _bank / _size` symbols. crt0 walks the
table, setting DBR per segment. Per-segment size capped at
0xFF00 so the 16-bit `cpx #__bss_segN_size` loop comparison
doesn't wrap to 0 on a full-bank segment (a single full bank is
split into a 0xFF00-byte primary + 0x100-byte tail in the same
bank). Smoke 137/137 validates BSS spanning bank 3 + bank 4
(100KB) is zeroed end-to-end. Note: program access to non-DBR
bank globals still requires DBR management — the compiler emits
DBR-relative absolute for global accesses, so accessing BSS in
bank N needs the program to set DBR=N or use `sta long` via
inline asm.
- **C++ exceptions absent from CI smoke.** The SJLJ runtime
round-trip is in smoke; the full clang++ → backend → MAME
execution path runs reliably interactively but is excluded
from automated smoke due to MAME-side I/O flakiness.
- **GS/OS validation uses a stub dispatcher.** The wrapper
contract (PHA + PEA 0 + LDX + JSL $E100A8 + post-call SP
fixup) is verified end-to-end in MAME against a stub
(`scripts/runInMameWithGsosStub.sh`). Validation against a
real bootable GS/OS volume is left out of CI as it needs a
smartport hard-disk image and live Tool Locator init.
- **VLAs work end-to-end** (2026-05-09). Backend Custom-lowers
`ISD::DYNAMIC_STACKALLOC` for both i16 and i32 result types.
Loop patterns now produce correct results: `sum_n(3)→6`
verified in MAME smoke. Fix: in VLA functions PEI expands
STAfi/STA8fi/STAfi_indY to a 4-MC sequence ending in `LDY $F8`
which clobbers N/Z; the StackSlotCleanup PHP/PLP wrap pass
treats those pseudos as flag-corrupting so PLP wraps the entire
expansion. `expandFarFI` uses `STY $F8`/`LDY $F8` to a DP
scratch slot rather than PHY/PLY (PHY/PLY between PHP/PLP would
pollute the saved P).
- **dpack and dclass now both inline** (2026-05-10). dpack uses
a volatile-output array rewrite to defeat the backend stack-slot
coalesce bug that previously caused dadd(1.5, 2.5) →
0x4010_4010_0000_0000. dclass's pointer-arg stores lower to
STBptr/STAptr (indirect-long, DBR-independent) and inline
cleanly. All softDouble routines compile at -O2.
- **IMG8..IMG15 callee-save via W65816ImgCalleeSave** (2026-05-13).
New post-RA, pre-PEI pass detects use of IMG8..IMG15 ($C0..$CE)
in a function and emits prologue save + epilogue restore so those
slots behave as callee-saved AT THE ASM LEVEL — without going
through LLVM's CSR mechanism (which would shift regalloc decisions
and break unrelated tests). Save shape per used slot: `PHA; LDA
$C?; STAfi A,slot,2; PLA`; restore mirrors it. The `+2` ImmOffset
compensates for PHA's SP shift so the lowered `sta d,s` lands on
the same byte that subsequent normal-SP reads see. Cost: ~16
cycles + 6 bytes per used slot, applied only to functions that
actually use those slots (most don't). Fixed picol `expr 1+2 == 4`
(now `3`) and a class of recursive double-fn miscompiles with
compound `||` conditions — see `feedback_picol_expr_compound_or.md`.
Smoke 149/149 green including a new orBug regression test guarding
the fix.
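
The per-segment cap in the multi-bank BSS item can be sketched as follows. `splitBss` and `BssSeg` are hypothetical; the only details taken from the text are the 0xFF00 cap and the rule that a segment stays within one bank. The real linker's layout policy may differ.

```c
#include <stddef.h>
#include <stdint.h>

#define BSS_SEG_CAP 0xFF00u   /* keeps a 16-bit size compare from wrapping */

typedef struct {
    uint32_t base;   /* 24-bit address of the segment start */
    uint32_t size;   /* byte count, at most BSS_SEG_CAP */
} BssSeg;

/* Hypothetical splitter: carve [base, base+size) into chunks that never
 * cross a 64KB bank boundary and never exceed BSS_SEG_CAP bytes.
 * Returns the number of segments written, or 0 if maxSegs is too few. */
static size_t splitBss(uint32_t base, uint32_t size,
                       BssSeg *out, size_t maxSegs)
{
    size_t n = 0;

    while (size > 0) {
        if (n == maxSegs) {
            return 0;
        }
        uint32_t roomInBank = 0x10000u - (base & 0xFFFFu);
        uint32_t chunk = size < roomInBank ? size : roomInBank;
        if (chunk > BSS_SEG_CAP) {
            chunk = BSS_SEG_CAP;
        }
        out[n].base = base;
        out[n].size = chunk;
        n++;
        base += chunk;
        size -= chunk;
    }
    return n;
}
```

A full bank starting at `$030000` comes out as a 0xFF00-byte segment followed by a 0x100-byte tail at `$03FF00`, matching the primary-plus-tail split described in the item.
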