581 lines
31 KiB
Markdown
581 lines
31 KiB
Markdown
# llvm816 — Current Status
|
||
|
||
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
|
||
llvm-mos as a separate `W65816` target.
|
||
|
||
## What works
|
||
|
||
End-to-end C-to-binary toolchain that produces 65816 machine code
|
||
which runs correctly under MAME (apple2gs).
|
||
|
||
**Language coverage at -O2 (no extra flags):**
|
||
|
||
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
|
||
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
|
||
+ ASLA16 / shift libcalls.
|
||
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
|
||
the above sizes. Signed compare near INT_MIN handled via EOR-with-
|
||
sign-bit transform.
|
||
- Pointer arithmetic, array indexing, struct field access, struct
|
||
return-by-value (up to 8 bytes — Pair, Vec4, double).
|
||
- Pointer dereference (`*p`) lowers via `LDAptr / STAptr / STBptr`
|
||
to `[$E0],Y` indirect-LONG with the bank byte at `$E2` forced to 0
|
||
— DBR-independent, so `pha;plb` bank-switched callers don't corrupt
|
||
data through callee local-pointer writes. Const-int pointers
|
||
(`*(volatile uint16 *)0x5000 = v` MMIO idiom) lower to `STAabs`
|
||
(DBR-relative) so bank-2 writes still work.
|
||
- Bitfields, switch statements (verified up to ~12 cases + default),
|
||
function pointers, function-pointer tables, indirect calls via
|
||
`__jsl_indir` trampoline.
|
||
- Recursion: factorial, Fibonacci, depth-3 binary-tree
|
||
insert/sum/min/max, simple recursive quicksort.
|
||
- Loops with goto / break / continue, nested loops, state machines.
|
||
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
|
||
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
|
||
reverse with `cons` works; free-list coalesce verified.
|
||
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
|
||
roundtrip.
|
||
- Soft-float (single): all four ops + comparisons, MAME-verified.
|
||
- Soft-double: add, sub, mul, div all return correct bit patterns
|
||
bit-for-bit against gcc with round-to-nearest-even rounding;
|
||
3-iter Newton sqrt converges. Compiles at -O2 throughout. Long-
|
||
running iterations may hit MAME's 1-second sim-time budget (test
|
||
config issue, not a compiler bug).
|
||
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
|
||
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
|
||
- C++ minimal: clang++ compiles a class with virtual + non-trivial
|
||
ctor (vtable + RTTI omitted; no exceptions).
|
||
- printf with `%d %x %s %c %p` and width/precision specifiers.
|
||
- sprintf / snprintf / vsprintf / vsnprintf with the same format
|
||
coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
|
||
C99 truncation semantics for snprintf. `%.Nf` produces the
|
||
correct fractional digits with round-half-up.
|
||
- scanf family: `sscanf` / `vsscanf` parse a C string; `fscanf` /
|
||
`vfscanf` bridge to vsscanf via a per-call line buffer (caps at
|
||
255 bytes / line; a longer line silently truncates). `scanf`
|
||
reads from stdin which always returns EOF on this target — the
|
||
surface compiles but isn't useful without a stdin source.
|
||
Format directives: `%d %i %u %x %X %o %s %c %ld %lu %lx %li %lo %%`.
|
||
- qsort + bsearch over arbitrary element size with a user `cmp`
|
||
callback.
|
||
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
|
||
strcspn, atol, llabs (kept in their own translation unit so
|
||
vprintf's branch layout doesn't shift).
|
||
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
|
||
sin, cos, tan, exp, log, atan, atan2, asin, acos, sinh, cosh,
|
||
tanh (and float variants). Bit-twiddling for fabs/floor/ceil/
|
||
copysign; Newton iteration for sqrt; range-reduction + Taylor
|
||
for sin/cos/exp/log/atan; identities for asin/acos/atan2/sinh/
|
||
cosh/tanh. Accuracy is in the ~1e-6 range — good enough for
|
||
typical numeric work, far short of glibc-quality. These are
|
||
slow (each call is dozens to hundreds of soft-double libcalls)
|
||
— pre-compute or cache when possible.
|
||
- `setjmp` / `longjmp` from libgcc.s.
|
||
- Static constructors via crt0's init_array walk.
|
||
- `<stdio.h>` file I/O with two backends:
|
||
- **mfs** — `mfsRegister(path, buf, size, cap, writable)` stages a
|
||
memory buffer as a named file. Used by smoke tests that don't
|
||
have a real disk. Fully validated end-to-end.
|
||
- **GS/OS** — `fopen` falls through to `gsosOpen` for any path not
|
||
in the mfs table. Routes through the GS/OS class-1 dispatcher
|
||
via wrappers in `runtime/src/iigsGsos.s` (Open/Read/Write/Close/
|
||
SetMark/GetMark/SetEOF/GetEOF). The full stdio surface
|
||
(`fread/fwrite/fseek/ftell/fclose/fgetc/fputc/fputs/fgets/ungetc/
|
||
feof/ferror/clearerr/rewind/fprintf/vfprintf`) dispatches on
|
||
backend. link816 honors weak symbols so programs that don't use
|
||
the GS/OS backend don't have to link `iigsGsos.o`.
|
||
- **Validation status:** code path compiles, links, and runs under
|
||
`runViaFinder.sh --data` injection. `fopen` + `gsosOpen` hangs
|
||
when invoked under real GS/OS 6.0.2 (JSL $E100A8 doesn't return);
|
||
root cause not yet diagnosed. Stub-dispatcher GS/OS smoke (the
|
||
existing one) validates the wrapper contract independently. An
|
||
XFAIL'd end-to-end smoke is in `scripts/smokeTest.sh` gated
|
||
behind `GSOS_FILE_SMOKE=1` for use after the dispatcher path is
|
||
fixed. `runViaFinder.sh --data /PATH=local_file` is the
|
||
automated-injection mechanism for runtime-test data files.
|
||
- stdin/stdout/stderr route through `putchar` as before.
|
||
- `<wchar.h>`: wcslen / wcscmp / wcsncmp / wcscpy / wcsncpy /
|
||
wcscat / wcschr / wcsrchr; mbtowc / wctomb / mbstowcs /
|
||
wcstombs / mblen with the trivial 1:1 byte<->wide mapping
|
||
(Latin-1). wchar_t is 16-bit on this target. Extended set:
|
||
wmemcpy / wmemmove / wmemset / wmemcmp / wmemchr;
|
||
wcstol / wcstoul / wcstoll / wcstoull / wcstod / wcstof;
|
||
swprintf / vswprintf; wcsftime. All delegate to the byte
|
||
equivalents under the Latin-1 model.
|
||
- `<signal.h>`: in-process signal table. signal() registers a
|
||
handler; raise() invokes it. Default actions: SIGABRT calls
|
||
abort(), SIGINT/SIGTERM call exit(128+sig), others ignored.
|
||
- `<locale.h>`: setlocale always returns "C"; localeconv returns
|
||
a fixed C-locale lconv struct.
|
||
- `<fenv.h>`: rounding mode + exception flag word tracked but
|
||
no-op (softFloat / softDouble are fixed RNE; exceptions never
|
||
raised). Surface compiles cleanly for portable code.
|
||
- `<tgmath.h>`: C11 type-generic math via `_Generic`; selects
|
||
`sqrtf` vs `sqrt` etc. based on argument type.
|
||
- `<stdatomic.h>`: C11 atomic surface, all ops lower to plain
|
||
ops (single-core uniprocessor — no real synchronization
|
||
needed). `_Atomic T` is treated as plain `T`.
|
||
- `<threads.h>`: stubs. `thrd_create` returns `thrd_error`;
|
||
mutex/cond ops are no-ops; `call_once` and `tss_*` work since
|
||
they're degenerate on a single-core target.
|
||
- `aligned_alloc` / `posix_memalign` / `aligned_free`: wrap
|
||
malloc with an over-allocation + pointer-stash trick. Match
|
||
C11 contract — `aligned_alloc(N, M)` returns N-aligned, free
|
||
with `aligned_free`.
|
||
- `<iso646.h>`: alternative operator spellings (`and`, `or`,
|
||
`not`, etc.) — C95 compat header.
|
||
- `<stdalign.h>`: aliases `_Alignas` / `_Alignof` to `alignas` /
|
||
`alignof`.
|
||
- `<stdnoreturn.h>`: aliases `_Noreturn` to `noreturn`.
|
||
- `<uchar.h>`: `char16_t` / `char32_t` typedefs + `mbrtoc16` /
|
||
`c16rtomb` / `mbrtoc32` / `c32rtomb` conversion helpers. In
|
||
our Latin-1 model these are 1:1 byte copies (no UTF-8 decode).
|
||
- `<wctype.h>`: wide-char classification + case folding.
|
||
Delegates to `<ctype.h>` for code-points 0..255; anything
|
||
outside Latin-1 returns false / unchanged.
|
||
- `<complex.h>`: C99 complex-number surface — clang built-in
|
||
`_Complex` lowers to soft-double under the hood. Macros
|
||
`complex` / `_Complex_I` / `I` / `CMPLX` / `CMPLXF` / `CMPLXL`
|
||
plus inline `creal` / `cimag` / `conj` / `cproj` / `cabs` /
|
||
`carg` and their `f` / `l` variants. Transcendental complex
|
||
routines (csin/ccos/cexp/etc.) intentionally not provided —
|
||
they would each need a polynomial-expansion implementation
|
||
with limited IIgs value.
|
||
- `<assert.h>`: adds C11 `static_assert` as a macro alias for
|
||
the `_Static_assert` keyword.
|
||
- `<errno.h>`: full C standard error codes (EDOM, ERANGE,
|
||
EILSEQ) plus common POSIX codes (EPERM..EPIPE, ENAMETOOLONG,
|
||
ENOSYS, ENOTEMPTY, ELOOP). `strerror` maps every defined
|
||
code to a human-readable string.
|
||
- `<stdio.h>`: adds C standard buffer-control surface
|
||
(`setvbuf` / `setbuf` as no-ops, `_IOFBF` / `_IOLBF` / `_IONBF`
|
||
/ `BUFSIZ`); `fgetpos` / `fsetpos` wrap `ftell` / `fseek`;
|
||
`remove` routes through `mfsUnregister`; `rename` / `tmpfile`
|
||
/ `tmpnam` are stubs.
|
||
- C++ subset: classes, single inheritance, multiple inheritance
|
||
(Drawable+Movable through one Sprite), virtual base diamond
|
||
(A and B virtually derive Base; Diamond inherits from both
|
||
with one shared Base subobject), virtual functions,
|
||
polymorphism via base-class pointer arrays, virtual dtors,
|
||
this-pointer adjustment for non-leftmost bases, vbase offset
|
||
tables. RTTI / `dynamic_cast` works (downcast, MI cross-cast,
|
||
virtual-base sibling cast) via a minimal libcxxabi shim
|
||
(`runtime/src/libcxxabi.c`) that provides `__dynamic_cast` +
|
||
the three typeinfo class vtables (`__class_type_info`,
|
||
`__si_class_type_info`, `__vmi_class_type_info`) + sized
|
||
`operator delete` + `__cxa_pure_virtual`.
|
||
- C++ exceptions via `clang++ -fsjlj-exceptions`: throw, catch,
|
||
catch-by-value, multiple catch handlers, exception destruction.
|
||
`W65816SjLjFinalize` IR pass inserts the call-site dispatch and
|
||
per-function catch table; `runtime/src/libcxxabiSjlj.c` provides
|
||
the Itanium SJLJ surface (`_Unwind_SjLj_*`, `__cxa_throw`,
|
||
`__cxa_begin_catch`, etc.) plus a no-op personality.
|
||
|
||
**Toolchain:**
|
||
|
||
- `clang` / `llc` produce W65816 assembly + ELF object files.
|
||
- `tools/link816` resolves cross-translation-unit refs, lays out
|
||
text/rodata/bss, emits a flat binary the IIgs ROM can load.
|
||
Auto-relocates bss above text+rodata when the default
|
||
`--bss-base 0x2000` would overlap text, and skips past the
|
||
IIgs IO window ($C000-$CFFF) if needed. `--gc-sections`
|
||
(default ON) drops unreachable functions: a minimal program
|
||
with full runtime linked shrinks from ~43KB to ~1.5KB.
|
||
- `link816 --segment-cap N` packs `.text` greedily into multiple
|
||
bank-aligned segments, capped at N bytes per segment. Segment 1
|
||
stays at `--text-base` in bank 0 (alongside rodata + bss + init);
|
||
segments 2..M start at `--segment-bank-base` (default $040000)
|
||
in successive banks. `--manifest path.json` writes a JSON file
|
||
listing each segment's image, base, and entry offset.
|
||
Cross-bank `JSL` (IMM24 reloc) just works — patched at link
|
||
time with the full 24-bit address. Cross-bank IMM16 is
|
||
permitted (uses DBR for bank — caller pins DBR to data's bank);
|
||
cross-bank PCREL is rejected with a clear diagnostic.
|
||
`scripts/runMultiSeg.sh` is a mini in-Lua loader for MAME that
|
||
reads the manifest, places each segment's bytes, and runs from
|
||
segment 1's entry — used by smoke to verify cross-bank JSL
|
||
end-to-end (helper3 chain across 3 bank-aligned segments).
|
||
- `tools/omfEmit` produces OMF v2.1 files in three modes:
|
||
(a) single-segment — `--input flat.bin --map flat.map --base
|
||
ADDR --entry SYM`, KIND=0x0000 (CODE, dynamic), ORG=0 (loader
|
||
picks bank); (b) multi-segment — `--manifest path.json` reads
|
||
link816's manifest and emits one OMF segment per entry with
|
||
KIND=0x8800 (STATIC|ABSBANK|CODE) + ORG=segment-base, asking
|
||
the GS/OS Loader to place each at its declared bank-aligned
|
||
address. All intra-segment relocations were already patched by
|
||
the linker, so no INTERSEG/RELOC opcodes are needed for v1
|
||
static placement. (c) `--stack-size N` (auto-enables
|
||
`--expressload`) appends a `~Direct` DP/Stack segment
|
||
(KIND=0x1012) of N bytes so apps can request a custom DP+stack
|
||
allocation from GS/OS instead of the Loader's 4KB default.
|
||
Validated end-to-end via `runViaFinder.sh` under real GS/OS
|
||
6.0.2 — the slow Loader path silently rejects multi-segment
|
||
OMFs, so `--stack-size` is gated behind ExpressLoad emission.
|
||
- `link816 --debug-out FILE` writes a DWARF sidecar with text/
|
||
rodata/bss/init_array relocations applied to every `.debug_*`
|
||
section, so `.debug_addr` / `.debug_line` PC values are final-
|
||
image addresses.
|
||
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
|
||
libgcc into linkable objects.
|
||
- `scripts/smokeTest.sh` runs 148 end-to-end checks at -O2:
|
||
scalar ops, control flow, calling conventions, MAME execution
|
||
regressions, link816 bss-base safety + weak-symbol resolution +
|
||
heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link,
|
||
iigs/gsos.h compile + link, standalone runtime headers,
|
||
AsmPrinter peepholes (STZ / PEA / PEI — single-STA, shared-
|
||
LDA-multi-STA, DPF0-forwarding), malloc/free coalesce ordering,
|
||
plus real-world coverage: Conway's Game of Life blinker
|
||
(2D loop + neighbour bounds), binary search tree (recursive
|
||
struct + malloc), function-pointer dispatch table (indirect
|
||
JSL via `__jsl_indir`), memory-backed file I/O (mfsRegister +
|
||
fopen/fread/fwrite/fseek/fprintf), C++ polymorphism (single
|
||
inheritance), C++ multiple inheritance (Drawable+Movable),
|
||
C++ virtual base diamond, C++ dynamic_cast (SI + MI cross-cast +
|
||
virtual-base sibling cast through libcxxabi shim), SJLJ exception
|
||
runtime end-to-end (libcxxabiSjlj.c throw/catch round-trip via
|
||
setjmp/longjmp + catch-table walk), C++ -fsjlj-exceptions
|
||
compile + link (the C++ frontend → backend path is execution-
|
||
verified manually but skipped from MAME smoke due to a
|
||
MAME-side flakiness — see "What's next"), GS/OS wrapper
|
||
round-trip via stub dispatcher pre-loaded at $E100A8 (validates
|
||
PHA + PEA 0 + JSL + post-call SP-fixup contract end-to-end),
|
||
wchar / signal core APIs, hex dumper writing through fprintf,
|
||
JSON tokenizer state machine, hash-table command shell (parser
|
||
+ dispatch + chained collisions over fprintf-to-mfs),
|
||
scripts/bench.sh size-vs-Calypsi harness. 100% pass.
|
||
|
||
- `scripts/benchCycles.sh` measures per-iteration cycle counts via
|
||
MAME's emulated HBL counter. 13 benchmarks under `benchmarks/`
|
||
(8 int micro + 3 soft-FP + 2 "game-like": particles, mandelbrot).
|
||
Current numbers (2026-05-20):
|
||
bsearch 127, crc32 <65, dotProduct 144, fib 97, memcmp 113,
|
||
popcount 93, strcpy 91, sumOfSquares 126 cyc/iter (100 iters);
|
||
dadd 1157, ddiv 1261, dmul 1033 cyc/iter (10 iters);
|
||
particles 2253 cyc/iter (3 iters — 32-particle physics tick);
|
||
mandelbrot 11570 cyc/iter (1 iter — 4×4 fixed-point tile, max 8
|
||
Mandelbrot iters). Speed is the optimization priority, not size.
|
||
|
||
- `compare/` holds three side-by-side C tests with our asm and
|
||
Calypsi's listing for static-size comparison:
|
||
`sumSquares`/`evalAt`/`mul16to32`. `bash compare/regen.sh`
|
||
recompiles each under both `clang --target=w65816 -O2 -S` and
|
||
`cc65816 --speed -O 2 --64bit-doubles` and prints an
|
||
ours/Calypsi instruction-count ratio. Current ratios (2026-05-20):
|
||
sumSquares **0.84×** (26 inst — we beat Calypsi's 31),
|
||
evalAt 1.86× (472 inst), mul16to32 **0.25×** (1 inst — we beat
|
||
Calypsi's 4). See `compare/README.md`.
|
||
|
||
**Backend register allocation:**
|
||
|
||
- Greedy regalloc as default at -O1+; fast at -O0/optnone. Greedy
|
||
was previously blocked by an upstream LLVM `LiveRangeEdit::elimina-
|
||
teDeadDef` assertion firing on KILL pseudos with non-dead implicit-
|
||
def $a. Fix landed in `tools/llvm-mos/llvm/lib/CodeGen/InlineSpil-
|
||
ler.cpp`: when InlineSpiller converts a redundant STAfi to a KILL
|
||
pseudo, mark BOTH explicit and implicit defs dead (the original loop
|
||
only iterated `MI.defs()` = explicit-only, leaving the inherited
|
||
implicit-def $a live). Bench impact: popcount −19.4%, strcpy
|
||
−18.9%, memcmp −8.6%, bsearch −9.2%.
|
||
|
||
- Pre-RA passes: `WidenAcc16` (Acc16→Wide16 promotion, lets
|
||
greedy spread i16 pressure across A and 16 IMG slots);
|
||
`TiedDefSpill` (handles tied-def-multi-use hazard);
|
||
`ABridgeViaX` (bridges via X/Y when free).
|
||
- Post-RA passes: `SpillToX` (STA/LDA pairs → TAX/TXA bridges
|
||
when X dead); `StackSlotCleanup` (deletes redundant adjacent
|
||
spills); `NegYIndY` (rewrites negative-Y indirect-Y stack-rel
|
||
ops to avoid the 24-bit-add bank-cross).
|
||
- Pre-emit: `BranchExpand` (long Bxx → INV_Bxx skip; BRA target);
|
||
`SepRepCleanup` (coalesces adjacent SEP/REP toggles, plus a
|
||
cross-mode-neutral coalesce that drops REP/SEP pairs sandwiching
|
||
X-flag-only ops, branches, transfers — saves 4B / 12cyc per
|
||
collapse). AsmPrinter LDAi8imm peephole walks past mode-neutral
|
||
MIs to fuse the closing REP into a following SEP.
|
||
- Imaginary registers IMG0..IMG15 backed by DP $C0..$CE +
|
||
$D0..$DE — gives greedy 17 effective i16 carriers (A + 16 IMG)
|
||
before stack spills kick in.
|
||
|
||
**ABI:**
|
||
|
||
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
|
||
on the system stack with PHA. Caller deallocates via `tsc;clc;adc
|
||
#N;tcs` or `PLY*N/2`.
|
||
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
|
||
the highest 16 bits.
|
||
- Frame is empty-descending (S points to next-free); offsets account
|
||
for the +1 skew vs LLVM's full-descending model.
|
||
|
||
**IIgs toolbox:**
|
||
|
||
- `iigs/toolbox.h` — autogenerated wrappers for all ~1300 IIgs
|
||
toolbox routines across 35 tool sets (Tool Locator, Memory
|
||
Manager, Misc Tools, QuickDraw II / Aux, Event Manager,
|
||
Sound Manager, Apple Desktop Bus, SANE, Integer Math, Text
|
||
Tools, Window Manager, Menu Manager, Control Manager,
|
||
LineEdit, Dialog Manager, Scrap Manager, Standard File,
|
||
Note Synth/Sequencer, Font Manager, List Manager, ACE,
|
||
Resource Manager, MIDI, Video Overlay, TextEdit, Media
|
||
Control, Print Manager, Scheduler, Desk Manager, …). Names
|
||
match Apple's IIgs Toolbox Reference exactly (TLStartUp,
|
||
MMStartUp, NewWindow, SysBeep, …). 417 simple wrappers
|
||
(zero/single-arg, i16-or-void return) inline in the header;
|
||
890 multi-arg ones live in `runtime/src/iigsToolbox.s`.
|
||
Generated by `scripts/genToolbox.py` from ORCA-C's
|
||
`ORCACDefs/` (re-runnable when ORCA-C updates).
|
||
|
||
## What's next
|
||
|
||
Work is now optimization-focused; the toolchain is feature-complete
|
||
for the common-case C / minimal-C++ workload. Priority is speed
|
||
(cycle counts), not size.
|
||
|
||
**Recently landed (2026-05-25):**
|
||
|
||
- **Layer 1 ptr32 deref-fold (always on)** — Constant offset on a
|
||
ptr32 deref folds into the `[dp],Y` Y register instead of a CLC/ADC
|
||
carry-chain pre-add. Plus consecutive-deref CSE that shares the
|
||
`$E0/$E2` staging across `s->a`, `s->b`, ... accesses with the same
|
||
base. Always on; saves ~3 instructions per struct-field access.
|
||
See `feedback_ptr32_deref_fold_layer1_landed.md`.
|
||
|
||
- **Layer 2 ptr32 deref via `(d,S),Y` (opt-in)** —
|
||
`-mllvm -w65816-dbr-safe-ptrs` switches ptr32 derefs to the
|
||
one-instruction `lda (d,S),Y` (opcode 0xB3) at the cost of reading
|
||
only 16 bits of pointer. Bank byte is implicit DBR. Correct only
|
||
for code that touches memory inside DBR's bank — typical for
|
||
malloc/globals/BSS-only programs (Lua, Picol). Lua 5.1.5 shrinks
|
||
20.6%, dropping our total from 1.45× to 1.15× Calypsi. Default
|
||
off; per-TU opt-in. See `feedback_ptr32_layer2_landed.md` and
|
||
`docs/USAGE.md` for the safety rules.
|
||
|
||
- **Inline-threshold lowered target-wide to 50** (was LLVM default
|
||
225). LLVM's default is tuned for desktop ISAs where call overhead
|
||
is high relative to inlined-body byte cost. On W65816, `jsl` is
|
||
cheap (4 bytes / ~8 cycles) but inlined ptr32 derefs are expensive
|
||
even with Layer 2 — the tradeoff inverts. At 225, Lua's
|
||
`index2adr` (41 callers in lapi.c) and CoreMark's `matrix_test`
|
||
helpers got copied everywhere. At 50, neither does, and the cycle
|
||
benchmark suite is unchanged. With Layer 2 + threshold=50, total
|
||
Lua is **0.93× Calypsi** and total CoreMark is **0.79× Calypsi
|
||
(we beat by 21%)**. Override per-TU with
|
||
`-mllvm -inline-threshold=N`. See
|
||
`feedback_lapi_inline_threshold.md` and
|
||
`feedback_coremark_matrix_test_regression.md`.
|
||
|
||
- **CoreMark 1.0 ported** (`tests/coremark/`). EEMBC's standard
|
||
embedded benchmark, ~2K LOC. Exercises linked-list traversal,
|
||
matrix multiply, formal state machine, CRC — patterns Lua doesn't
|
||
hit. Build requires `--layer2` to fit a single bank
|
||
(otherwise crosses the IO window at 0xC000). See
|
||
`tests/coremark/README.md` and `feedback_coremark_landed.md`.
|
||
|
||
**Speed wins queued, ranked by expected impact:**
|
||
|
||
- **ptr32 pointer-increment overhead** (partially addressed). The
|
||
`i32 += 1` post-PEI peephole (`W65816I32IncFold`) detects the
|
||
6-instruction LDA/ADCi16imm 1/STA/LDA/ADCEi16imm 0/STA pattern and
|
||
rewrites to LDA/INA/STA/INC_HI_IF_CARRY (with private-label BNE
|
||
expansion in AsmPrinter). Saves ~13 cyc per increment on the
|
||
no-carry common path. memcmp 1330 → 1194 (−10.2%), strcpy 3325 →
|
||
3154 (−5.1%). Now also tolerates intervening TAX/TXA pseudo-saves
|
||
in the matcher (regalloc inserts them around STAfi's conservative
|
||
`Defs=[A]`); LSR-introduced i32 PHIs like `lsr.iv9 += 1` now match.
|
||
LSR's `*p++ → base+offset` rewrite remains unaddressed; tried
|
||
`-disable-lsr` and `isLSRCostLess` override, both regressed
|
||
dotProduct.
|
||
|
||
- **W65816StackSlotMerge — value-equivalent stack slot coalesce**
|
||
(2026-05-13). Pre-emit pass that merges PHI src/dst stack-slot
|
||
pairs which LLVM's StackSlotColoring can't see (they're
|
||
simultaneously live but hold the same value). Detects the
|
||
canonical loop-body `LDA X ; STA Y` PHI-copy in a self-looped
|
||
MBB, verifies value equivalence via bidirectional twin-pairing
|
||
(Case 1: same A in same MBB / Case 2: PHI-copy reload pattern /
|
||
Case 3: matching `LDA #const` init in different MBBs), and
|
||
renames slot X→Y function-wide. Runs AFTER SepRepCleanup so the
|
||
PHI copies are out of their PHP/PLP wraps and offsets are stable.
|
||
**A-define detection is opcode-based, not operand-based** —
|
||
LDA_DP / LDA_Abs / LDA_Long etc. omit the `implicit-def $a`
|
||
annotation in tablegen but semantically write A; the
|
||
`semanticallyDefsA` helper falls back to an opcode whitelist.
|
||
sumSquares static: 65 → 61 inst (1.97x — under 2x Calypsi for
|
||
the first time). sumOfSquares cyc/call: 18755 → 17391
|
||
(**−7.3%**). strcpy: 2558 → 2387 (−6.7%). See
|
||
W65816StackSlotMerge.cpp.
|
||
|
||
- **LSR-widened i32 IV narrowing** (`W65816NarrowI32Mul` Phase 2,
|
||
2026-05-13). After rewriting `mul i32 X, Y` to a `__umulhisi3`
|
||
call, scan for i32 PHIs whose only uses are (a) the truncs the
|
||
rewrite emitted and (b) a single self-feeding `add %P, const`.
|
||
When SCEV bounds the PHI to u16, build an i16 PHI + i16 add in
|
||
place, replace truncs, and erase the i32 chain. Care needed
|
||
to break the PN ↔ Incr use-cycle before erasing. sumSquares
|
||
frame: 14B → 12B; loop-internal `i++` shrinks from 7→3 inst.
|
||
|
||
- **PHI-hoist accepts LDA_Imm16 / LDAi16imm** (2026-05-13).
|
||
Init blocks contain `lda #const ; sta slot,s` pairs wrapped in
|
||
PHP/PLP around the pre-loop CMP — same shape as a PHI-copy
|
||
wrap but with an immediate load instead of a memory load.
|
||
Matcher extended to accept both the MC opcode (`LDA_Imm16`) and
|
||
the surviving pseudo (`LDAi16imm`), with an added **$a-live-out
|
||
guard**: if any successor MBB has $a in its live-in set, bail —
|
||
the LDA's A-value is a fall-through register-PHI consumed by
|
||
the successor's first STA, and hoisting clobbers it. Caught
|
||
by `sumTable` where `lda #0 ; sta 0x9,s` (wrap+trailing) ALSO
|
||
supplied A=0 to `bb.2`'s `sta 0x1,s`.
|
||
|
||
- **16x16→32 multiply via `__umulhisi3` + `W65816NarrowI32Mul` IR
|
||
pass** (2026-05-13). Added `__umulhisi3` (unsigned 16x16→32) to
|
||
`runtime/src/libgcc.s`. New IR pass in `addISelPrepare` walks
|
||
`mul i32 X, Y` and uses IR-level `computeKnownBits` plus a SCEV
|
||
unsigned-range fallback (`getUnsignedRange().getActiveBits() <=
|
||
16`) to detect operands with provably-zero high 16 bits — fires
|
||
on the canonical loop-internal `(u32)i*i` pattern after LSR
|
||
widens the i16 IV to i32. Rewrites to a call to `__umulhisi3`.
|
||
sumOfSquares 20801 → 19096 cyc/call by itself (-8.2% from
|
||
baseline).
|
||
|
||
- **Dead TAX/TXA peephole** (2026-05-13). STAfi's conservative
|
||
`Defs=[A]` (for the IMG-source PHA-bracketed expansion path)
|
||
causes regalloc to insert spurious TAX/TXA save/restore brackets
|
||
even when STAfi's source is A directly. `W65816SepRepCleanup`
|
||
now elides TXA/TYA whose next non-debug inst defines `$a`, and
|
||
TAX/TAY whose target reg is dead before its next redefinition.
|
||
Cross-MBB liveness via `Succ->isLiveIn(...)`; bails on
|
||
return-terminated MBBs (RTL doesn't model the i32-return
|
||
convention). Tracks `pRedef` so `TAX ; CLC ; ADC` chains
|
||
don't bail on ADC's $p-read (CLC freshens the carry flag).
|
||
|
||
- **i32 += i32 store-bypass** (2026-05-13). Regalloc materializes
|
||
the call-result `A:X` i32 pair into spill slots before the add,
|
||
then reloads — emitting a 10-instruction `STA-TXA-STA-LDA-CLC-
|
||
ADC-STA-LDA-ADC-STA` sequence. `W65816SepRepCleanup` matcher
|
||
rewrites to 6-instruction `CLC-ADC-STA-TXA-ADC-STA` (TXA preserves
|
||
carry; hi-half consumes it directly from X). Saves 4 inst / ~13
|
||
cyc per call-result-add site. sumOfSquares 20460 → 19096 (-6.7%).
|
||
|
||
- **PHI-copy hoist out of PHP/PLP wrap** (2026-05-13).
|
||
`W65816SepRepCleanup` detects the back-edge `CMP ; PHP ; (LDA/STA
|
||
pairs) ; PLP ; (trailing STA) ; Bxx ; BRA loop` pattern and
|
||
hoists the LDA/STA pairs + trailing above the CMP's $a-producer
|
||
chain, dropping PHP/PLP. Two safety guards: (1) **bump undo** —
|
||
in-wrap stack-rel offsets were pre-adjusted by +1 (PHP decrements
|
||
S; `W65816StackSlotCleanup`'s wrap pass compensates inside the
|
||
wrap), so the hoist subtracts 1 from each `LDA_StackRel` /
|
||
`STA_StackRel` offset; trailing STAs (already outside the wrap)
|
||
are untouched. (2) **pair-count check** — require
|
||
`#LDAs(Block) == #STAs(Block) + #STAs(Trailing)`; an extra LDA
|
||
is a memory-to-register PHI value live-out at the back-edge
|
||
(consumed by the loop top's first STA), and hoisting would
|
||
clobber A. Saves 2 inst / 8 cyc per occurrence. sumOfSquares
|
||
19096 → 18755 (-1.8%), popcount 3683 → 3478 (-5.6%).
|
||
|
||
- **More peephole / libcall opportunities.** __mulsi3 just gained
|
||
early-exit when the multiplier shifts to 0; dotProduct dropped
|
||
4007→2472 (−38.3%), sumOfSquares 40920→23870 (−41.6%). Next
|
||
candidates: shift-by-N inlining for shifts 5+ that currently
|
||
go through __ashlsi3; a `u32 += zext i16` SDAG combine to skip
|
||
the hi-half carry chain when one operand has known-zero high
|
||
16 bits.
|
||
|
||
- **W65816StackRelToImg peephole pipeline** (2026-05-20). Eight
|
||
always-on peepholes plus an extended phase 4 in the pre-emit
|
||
StackRelToImg pass: (1) `elidePhaBracket` with case-a single-store
|
||
bracket + case-b ImgCalleeSave multi-store with STA-hoist +
|
||
case-c STA_DP-only multi-pair + forward-walk liveness through
|
||
conditional branches; (2) `elideCallResultSaveSPReload` drops
|
||
STA/LDA $E0 round-trip in ADJCALLSTACKUP's Y-live i64-return
|
||
path; (3) `elideDeadStaCarry` drops first STA in i32-carry
|
||
STA/ADCE/STA pattern; (4) `elideRedundantLdaAfterPha`; (4b)
|
||
`elidePlaPhaPair` collapses consecutive PLA;PHA; (5)
|
||
`elideStoreForwarding` (gated to bail path + end-of-pass to
|
||
avoid IMG-slot reallocation cascade). Phase 4 extended to walk
|
||
past STX_DP/STY_DP between TYA and STA_DP with safety check
|
||
(post-STA op must redefine A) and to handle STA_StackRel
|
||
destination with offset compensation. Result: evalAt 498→472
|
||
inst (1.96×→1.86× vs Calypsi), fib -35% cyc/iter (149→97),
|
||
popcount -11% (104→93), 35 libc functions get TAY/TYA bracket
|
||
elided. Case (b) hoists the body's first STA before the
|
||
ImgCalleeSave bracket, enabling the existing phase 4 to remove
|
||
PEI's TAY/TYA round-trip in a synergistic chain.
|
||
|
||
- **__muldi3 32-bit short-circuit** (2026-05-20). When `a`'s high
|
||
32 bits ($E4/$E6) are zero, use a 32-iter shift-and-add loop
|
||
instead of 64 iters. Fires on every `mulhi64Aligned` call from
|
||
softDouble.c (4× per `__muldf3`), which always passes zero-
|
||
extended u32 operands. Result: **dmul 1605→1033 cyc/iter
|
||
(-36%)**. Single-side check (just `a`) is correct since `b`'s
|
||
high half being non-zero doesn't affect correctness — iters 32-63
|
||
would just shift b without adding.
|
||
|
||
- **Lua 5.1.5 compiles cleanly** (2026-05-20). Reference C
|
||
implementation (17K lines, 24 source files) builds + links into a
|
||
multi-segment binary. Loads in MAME. Lives under `tests/lua/`.
|
||
Three large functions (luaV_execute, symbexec, auxsort) hit
|
||
greedy regalloc's complexity budget and need `-mllvm
|
||
-regalloc=basic` (still at -O2 — basic-regalloc -O2 is ~3.5×
|
||
smaller than fast-regalloc -O0). Largest "real-world C" test
|
||
in the project.
|
||
|
||
**Open limitations:**
|
||
|
||
- **Multi-bank BSS** — full support up to 4 banks (256KB). link816
|
||
splits BSS into up to 4 contiguous segments at link time; each
|
||
segment fits within a single bank. Linker emits
|
||
`__bss_seg{0..3}_lo16 / _bank / _size` symbols. crt0 walks the
|
||
table, setting DBR per segment. Per-segment size capped at
|
||
0xFF00 so the 16-bit `cpx #__bss_segN_size` loop comparison
|
||
doesn't wrap to 0 on a full-bank segment (a single full bank is
|
||
split into a 0xFF00-byte primary + 0x100-byte tail in the same
|
||
bank). Smoke validates BSS spanning bank 3 + bank 4
|
||
(100KB) is zeroed end-to-end. Note: program access to non-DBR
|
||
bank globals still requires DBR management — the compiler emits
|
||
DBR-relative absolute for global accesses, so accessing BSS in
|
||
bank N needs the program to set DBR=N or use `sta long` via
|
||
inline asm.
|
||
|
||
- **C++ exceptions absent from CI smoke.** The SJLJ runtime
|
||
round-trip is in smoke; the full clang++ → backend → MAME
|
||
execution path runs reliably interactively but is excluded
|
||
from automated smoke due to MAME-side I/O flakiness.
|
||
|
||
- **GS/OS validation uses a stub dispatcher.** The wrapper
|
||
contract (PHA + PEA 0 + LDX + JSL $E100A8 + post-call SP
|
||
fixup) is verified end-to-end in MAME against a stub
|
||
(`scripts/runInMameWithGsosStub.sh`). Validation against a
|
||
real bootable GS/OS volume is left out of CI as it needs a
|
||
smartport hard-disk image and live Tool Locator init.
|
||
|
||
- **VLAs work end-to-end** (2026-05-09). Backend Custom-lowers
|
||
`ISD::DYNAMIC_STACKALLOC` for both i16 and i32 result types.
|
||
Loop patterns now produce correct results: `sum_n(3)→6`
|
||
verified in MAME smoke. Fix: in VLA functions PEI expands
|
||
STAfi/STA8fi/STAfi_indY to a 4-MC sequence ending in `LDY $F8`
|
||
which clobbers N/Z; the StackSlotCleanup PHP/PLP wrap pass
|
||
treats those pseudos as flag-corrupting so PLP wraps the entire
|
||
expansion. `expandFarFI` uses `STY $F8`/`LDY $F8` to a DP
|
||
scratch slot rather than PHY/PLY (PHY/PLY between PHP/PLP would
|
||
pollute the saved P).
|
||
|
||
- **dpack and dclass now both inline** (2026-05-10). dpack uses
|
||
a volatile-output array rewrite to defeat the backend stack-slot
|
||
coalesce bug that previously caused dadd(1.5, 2.5) →
|
||
0x4010_4010_0000_0000. dclass's pointer-arg stores lower to
|
||
STBptr/STAptr (indirect-long, DBR-independent) and inline
|
||
cleanly. All softDouble routines compile at -O2.
|
||
|
||
- **IMG8..IMG15 callee-save via W65816ImgCalleeSave** (2026-05-13).
|
||
New post-RA, pre-PEI pass detects use of IMG8..IMG15 ($C0..$CE)
|
||
in a function and emits prologue save + epilogue restore so those
|
||
slots behave as callee-saved AT THE ASM LEVEL — without going
|
||
through LLVM's CSR mechanism (which would shift regalloc decisions
|
||
and break unrelated tests). Save shape per used slot: `PHA; LDA
|
||
$C?; STAfi A,slot,2; PLA`; restore mirrors it. The `+2` ImmOffset
|
||
compensates for PHA's SP shift so the lowered `sta d,s` lands on
|
||
the same byte that subsequent normal-SP reads see. Cost: ~16
|
||
cycles + 6 bytes per used slot, applied only to functions that
|
||
actually use those slots (most don't). Fixed picol `expr 1+2 == 4`
|
||
(now `3`) and a class of recursive double-fn miscompiles with
|
||
compound `||` conditions — see `feedback_picol_expr_compound_or.md`.
|
||
Smoke green including a new orBug regression test guarding
|
||
the fix.
|