# llvm816 — Current Status

LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.

## What works

End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).

**Language coverage at -O2 (no extra flags):**

- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
  (signed and unsigned). Carry-chained multi-word ops via ADC/SBC
  pseudos plus ASLA16 / shift libcalls.
- Comparisons and signed/unsigned width conversions (sext, zext, trunc)
  for all of the above sizes. Signed compares near INT_MIN are handled
  via an EOR-with-sign-bit transform.
- Pointer arithmetic, array indexing, struct field access, struct
  return-by-value (up to 8 bytes: Pair, Vec4, double).
- Pointer dereference (`*p`) lowers via `LDAptr / STAptr / STBptr`
  to `[$E0],Y` indirect-LONG with the bank byte at `$E2` forced to 0.
  This is DBR-independent, so `pha;plb` bank-switched callers don't
  corrupt data through callee local-pointer writes. Const-int pointers
  (the `*(volatile uint16 *)0x5000 = v` MMIO idiom) lower to `STAabs`
  (DBR-relative) so bank-2 writes still work.
- Bitfields, switch statements (verified up to ~12 cases + default),
  function pointers, function-pointer tables, indirect calls via the
  `__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
  insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (first-fit allocator in libc.c). Linked-list
  reversal with `cons` works; free-list coalescing is verified.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, and an
  atoi/itoa round trip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, and div match gcc bit-for-bit under
  round-to-nearest-even rounding; a 3-iteration Newton sqrt converges.
  Compiles at -O2 throughout. Long-running iterations may hit MAME's
  1-second sim-time budget (a test-config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
  arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
  ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format
  coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
  C99 truncation semantics for snprintf. `%.Nf` produces the
  correct fractional digits with round-half-up.
- qsort + bsearch over arbitrary element size with a user `cmp`
  callback.
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
  strcspn, atol, llabs (kept in their own translation unit so
  vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
  sin, cos, tan, exp, log, atan, atan2, asin, acos, sinh, cosh,
  tanh (and float variants). Bit-twiddling for fabs/floor/ceil/copysign;
  Newton iteration for sqrt; range reduction + Taylor series for
  sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh.
  Accuracy is in the ~1e-6 range: good enough for typical numeric
  work, far short of glibc quality. These routines are slow (each call
  is dozens to hundreds of soft-double libcalls), so pre-compute or
  cache results when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
- `<stdio.h>` file I/O against an in-memory FS:
  `mfsRegister(path, buf, size, cap, writable)` stages a buffer as a
  named file; `fopen`/`fread`/`fwrite`/`fseek`/`ftell`/`fclose`/
  `fgetc`/`fgets`/`ungetc`/`fprintf` operate on it via a per-FILE
  (kind, buf, size, cap, pos, eof, err, unget) record.
  stdin/stdout/stderr route through `putchar` as before.
- `<wchar.h>`: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy /
  wcscat / wcschr / wcsrchr; mbtowc / wctomb / mbstowcs /
  wcstombs / mblen with the trivial 1:1 byte<->wide mapping
  (Latin-1). wchar_t is 16-bit on this target.
- `<signal.h>`: in-process signal table. signal() registers a
  handler; raise() invokes it. Default actions: SIGABRT calls
  abort(), SIGINT/SIGTERM call exit(128+sig), others are ignored.
- `<locale.h>`: setlocale always returns "C"; localeconv returns
  a fixed C-locale lconv struct.
- C++ subset: classes, single inheritance, multiple inheritance
  (Drawable+Movable through one Sprite), virtual base diamond
  (A and B virtually derive Base; Diamond inherits from both
  with one shared Base subobject), virtual functions,
  polymorphism via base-class pointer arrays, virtual dtors,
  this-pointer adjustment for non-leftmost bases, vbase offset
  tables. RTTI / `dynamic_cast` works (downcast, MI cross-cast,
  virtual-base sibling cast) via a minimal libcxxabi shim
  (`runtime/src/libcxxabi.c`) that provides `__dynamic_cast` +
  the three typeinfo class vtables (`__class_type_info`,
  `__si_class_type_info`, `__vmi_class_type_info`) + sized
  `operator delete` + `__cxa_pure_virtual`.
- C++ exceptions via `clang++ -fsjlj-exceptions`: throw, catch,
  catch-by-value, multiple catch handlers, exception destruction.
  The `W65816SjLjFinalize` IR pass inserts the call-site dispatch and
  per-function catch table; `runtime/src/libcxxabiSjlj.c` provides
  the Itanium SJLJ surface (`_Unwind_SjLj_*`, `__cxa_throw`,
  `__cxa_begin_catch`, etc.) plus a no-op personality.


**Toolchain:**

- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
  text/rodata/bss, and emits a flat binary the IIgs ROM can load.
  It auto-relocates bss above text+rodata when the default
  `--bss-base 0x2000` would overlap text, and skips past the
  IIgs IO window ($C000-$CFFF) if needed. `--gc-sections`
  (default ON) drops unreachable functions: a minimal program
  with the full runtime linked shrinks from ~43KB to ~1.5KB.
- `link816 --segment-cap N` packs `.text` greedily into multiple
  bank-aligned segments, capped at N bytes per segment. Segment 1
  stays at `--text-base` in bank 0 (alongside rodata + bss + init);
  segments 2..M start at `--segment-bank-base` (default $040000)
  in successive banks. `--manifest path.json` writes a JSON file
  listing each segment's image, base, and entry offset.
  Cross-bank `JSL` (IMM24 reloc) needs no special handling: it is
  patched at link time with the full 24-bit address. Cross-bank IMM16
  is permitted (it uses DBR for the bank, so the caller pins DBR to
  the data's bank); cross-bank PCREL is rejected with a clear
  diagnostic. `scripts/runMultiSeg.sh` is a small Lua-based loader for
  MAME that reads the manifest, places each segment's bytes, and runs
  from segment 1's entry. Smoke uses it to verify cross-bank JSL
  end-to-end (helper3 chain across 3 bank-aligned segments).
- `tools/omfEmit` produces OMF v2.1 files in three modes:
  (a) single-segment:
  `--input flat.bin --map flat.map --base ADDR --entry SYM`,
  KIND=0x0000 (CODE, dynamic), ORG=0 (loader picks bank);
  (b) multi-segment: `--manifest path.json` reads link816's manifest
  and emits one OMF segment per entry with KIND=0x8800
  (STATIC|ABSBANK|CODE) + ORG=segment-base, asking the GS/OS Loader
  to place each at its declared bank-aligned address. All
  intra-segment relocations were already patched by the linker, so no
  INTERSEG/RELOC opcodes are needed for v1 static placement;
  (c) `--stack-size N` (auto-enables `--expressload`) appends a
  `~Direct` DP/Stack segment (KIND=0x1012) of N bytes so apps can
  request a custom DP+stack allocation from GS/OS instead of the
  Loader's 4KB default. Validated end-to-end via `runViaFinder.sh`
  under real GS/OS 6.0.2. The slow Loader path silently rejects
  multi-segment OMFs, so `--stack-size` is gated behind ExpressLoad
  emission.
- `link816 --debug-out FILE` writes a DWARF sidecar with
  text/rodata/bss/init_array relocations applied to every `.debug_*`
  section, so `.debug_addr` / `.debug_line` PC values are final-image
  addresses.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double, and
  libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 132 end-to-end checks at -O2:
  scalar ops, control flow, calling conventions, MAME execution
  regressions, link816 bss-base safety + weak-symbol resolution +
  heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
  iigs/gsos.h compile + link, standalone runtime headers,
  AsmPrinter peepholes (STZ / PEA / PEI: single-STA,
  shared-LDA-multi-STA, DPF0-forwarding), malloc/free coalesce
  ordering, plus real-world coverage: Conway's Game of Life blinker
  (2D loop + neighbour bounds), binary search tree (recursive
  struct + malloc), function-pointer dispatch table (indirect
  JSL via `__jsl_indir`), memory-backed file I/O (mfsRegister +
  fopen/fread/fwrite/fseek/fprintf), C++ polymorphism (single
  inheritance), C++ multiple inheritance (Drawable+Movable),
  C++ virtual base diamond, C++ dynamic_cast (SI + MI cross-cast +
  virtual-base sibling cast through libcxxabi shim), SJLJ exception
  runtime end-to-end (libcxxabiSjlj.c throw/catch round-trip via
  setjmp/longjmp + catch-table walk), C++ -fsjlj-exceptions
  compile + link (the C++ frontend → backend path is
  execution-verified manually but skipped from MAME smoke due to
  MAME-side flakiness; see "What's next"), GS/OS wrapper
  round-trip via stub dispatcher pre-loaded at $E100A8 (validates
  PHA + PEA 0 + JSL + post-call SP-fixup contract end-to-end),
  wchar / signal core APIs, hex dumper writing through fprintf,
  JSON tokenizer state machine, hash-table command shell (parser,
  dispatch, and chained collisions over fprintf-to-mfs), and the
  scripts/bench.sh size-vs-Calypsi harness. 100% pass.
- `scripts/benchCyclesPrecise.sh` measures per-call cycle counts
  via MAME's emulated time counter. Eight benchmarks under
  `benchmarks/`. Current numbers: popcount 6888 cyc, bsearch
  1108, memcmp 1569, strcpy 3580, dotProduct 4774, fib(10) 14152,
  sumOfSquares 49104. Speed is the optimization priority, not
  size.


**Backend register allocation:**

- Basic regalloc is the default at -O1+; fast at -O0/optnone. We use
  basic instead of greedy because greedy fails ("ran out of
  registers during register allocation") on functions with many
  cross-call Acc16 vregs (the `ok |= bit; helper(); ok |= bit;`
  pattern across many if-blocks). Basic handles those cleanly
  with negligible code-size overhead vs greedy on the bench
  suite (~0.6%).
- Pre-RA passes: `WidenAcc16` (Acc16→Wide16 promotion, lets
  greedy spread i16 pressure across A and 16 IMG slots);
  `TiedDefSpill` (handles the tied-def-multi-use hazard);
  `ABridgeViaX` (bridges via X/Y when free).
- Post-RA passes: `SpillToX` (STA/LDA pairs → TAX/TXA bridges
  when X is dead); `StackSlotCleanup` (deletes redundant adjacent
  spills); `NegYIndY` (rewrites negative-Y indirect-Y stack-rel
  ops to avoid the 24-bit-add bank-cross).
- Pre-emit: `BranchExpand` (long Bxx → INV_Bxx skip; BRA target);
  `SepRepCleanup` (coalesces adjacent SEP/REP toggles, plus a
  cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching
  X-flag-only ops, branches, and transfers, saving 4B / 12cyc per
  collapse). The AsmPrinter LDAi8imm peephole walks past mode-neutral
  MIs to fuse the closing REP into a following SEP.
- Imaginary registers IMG0..IMG15 backed by DP $C0..$CE +
  $D0..$DE, which gives greedy 17 effective i16 carriers (A + 16 IMG)
  before stack spills kick in.


**ABI:**

- arg0 in A; arg1 in X for i32-first-arg signatures; the rest are
  pushed right-to-left on the system stack with PHA. The caller
  deallocates via `tsc;clc;adc #N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
  the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
  for the +1 skew vs LLVM's full-descending model.


**IIgs toolbox:**

- `iigs/toolbox.h`: autogenerated wrappers for all ~1300 IIgs
  toolbox routines across 35 tool sets (Tool Locator, Memory
  Manager, Misc Tools, QuickDraw II / Aux, Event Manager,
  Sound Manager, Apple Desktop Bus, SANE, Integer Math, Text
  Tools, Window Manager, Menu Manager, Control Manager,
  LineEdit, Dialog Manager, Scrap Manager, Standard File,
  Note Synth/Sequencer, Font Manager, List Manager, ACE,
  Resource Manager, MIDI, Video Overlay, TextEdit, Media
  Control, Print Manager, Scheduler, Desk Manager, …). Names
  match Apple's IIgs Toolbox Reference exactly (TLStartUp,
  MMStartUp, NewWindow, SysBeep, …). 417 simple wrappers
  (zero/single-arg, i16-or-void return) are inline in the header;
  890 multi-arg ones live in `runtime/src/iigsToolbox.s`.
  Generated by `scripts/genToolbox.py` from ORCA-C's
  `ORCACDefs/` (re-runnable when ORCA-C updates).

## What's next

Work is now optimization-focused; the toolchain is feature-complete
for the common-case C / minimal-C++ workload. Priority is speed
(cycle counts), not size.

**Speed wins queued, ranked by expected impact:**

- **u16×u16 → u32 multiply path.** sumOfSquares is 982 cyc/iter,
  bottlenecked by `__mulsi3` for what's effectively a 16×16
  multiply (both inputs are zext from u16). Adding a `__umulhi3`
  libcall + SDAG hook to detect `MUL(zext(a), zext(b))` could
  roughly halve the iteration cost.

- **Fold `while (x != 0)` for i32 to `lda lo; ora hi; bne`.**
  The combiner currently materializes a SETCC boolean and re-tests
  it, generating ~10 redundant ops in every i32-iteration loop.
  Hot in popcount, CRC, and any BigInt-style code.

- **ptr32 pointer-increment overhead.** `*p++` under ptr32 emits
  a full 32-bit `ADC` chain even when the high half is provably
  unchanged. strcpy and memcmp pay 30+ cycles per byte for what
  should be 15-20. Needs a peephole or SDAG combine for `i32 + 1`
  with provably-no-carry-into-hi.

- **Greedy regalloc retry.** Currently blocked on an upstream
  LLVM `LiveRangeEdit::eliminateDeadDef` assertion when our
  sub-register pair partial-defs reach it. Basic regalloc works
  but leaves measurable cycle waste in load/store shuffles.


**Open limitations:**

- **Multi-bank BSS / init_array.** Multi-segment mode splits
  `.text` across banks but BSS + init_array still live in
  segment 1's bank (bank 0). Programs with zero-init data
  exceeding the ~60KB bank-0 budget need crt0 to walk a
  per-segment `(start, end)` table. Not a blocker for >64KB
  *code* programs.

- **C++ exceptions absent from CI smoke.** The SJLJ runtime
  round-trip is in smoke; the full clang++ → backend → MAME
  execution path runs reliably interactively but is excluded
  from automated smoke due to MAME-side I/O flakiness.

- **GS/OS validation uses a stub dispatcher.** The wrapper
  contract (PHA + PEA 0 + LDX + JSL $E100A8 + post-call SP
  fixup) is verified end-to-end in MAME against a stub
  (`scripts/runInMameWithGsosStub.sh`). Validation against a
  real bootable GS/OS volume is left out of CI as it needs a
  smartport hard-disk image and live Tool Locator init.

- **gmtime_r requires `optnone`.** IR-level optimizer issue:
  loop rotation + IndVar simplify mis-evaluate
  `days >= 365L + (__isLeap(...) ? 1 : 0)`, folding the comparison
  to compile-time false. Not a backend bug; needs IR-pass-level
  diagnosis.

- **softDouble `dpack` / `dclass` require `noinline`.**
  Inlining triggers register pressure that overflows basic
  regalloc in `__adddf3`/`__muldf3`/`__divdf3`. This is an
  architectural limit, for the same reason as qsort's earlier split.