65816-llvm-mos/STATUS.md
Scott Duensing 07544f49f2 Checkpoint
2026-05-02 16:48:56 -05:00


# llvm816 — Current Status
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
llvm-mos as a separate `W65816` target.
## What works
End-to-end C-to-binary toolchain that produces 65816 machine code
which runs correctly under MAME (apple2gs).
**Language coverage at -O2 (no extra flags):**
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
+ ASLA16 / shift libcalls.
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
the above sizes.
- Pointer arithmetic, array indexing, struct field access, struct
return-by-value (up to 8 bytes — Pair, Vec4, double).
- Bitfields, switch statements (verified up to ~12 cases + default),
function pointers, function-pointer tables, indirect calls via
`__jsl_indir` trampoline.
- Recursion: factorial, Fibonacci, depth-3 binary-tree
insert/sum/min/max, simple recursive quicksort.
- Loops with goto / break / continue, nested loops, state machines.
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
reverse with `cons` works.
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
roundtrip.
- Soft-float (single): all four ops + comparisons, MAME-verified.
- Soft-double: add, sub, mul, div all return correct bit patterns
bit-for-bit against gcc with round-to-nearest-even rounding;
3-iter Newton sqrt converges. Long-running iterations may hit MAME's
1-second sim-time budget (test config issue, not a compiler bug).
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
- C++ minimal: clang++ compiles a class with virtual + non-trivial
ctor (vtable + RTTI omitted; no exceptions).
- printf with `%d %x %s %c %p` and width/precision specifiers.
- sprintf / snprintf / vsprintf / vsnprintf with the same format
coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
C99 truncation semantics for snprintf. `%.Nf` produces the
correct fractional digits with round-half-up.
- qsort + bsearch over arbitrary element size with a user `cmp`
callback (insertion-sort variant — sidesteps the greedy regalloc
bug in the recursive iterative-qsort form).
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
strcspn, atol, llabs (kept in their own translation unit so
vprintf's branch layout doesn't shift).
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
sin, cos, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh
(and float variants). Bit-twiddling for fabs/floor/ceil/copysign;
Newton iteration for sqrt; range-reduction + Taylor for sin/cos/
exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh.
Accuracy is in the ~1e-6 range — good enough for typical numeric
work, far short of glibc-quality. These are slow (each call is
dozens to hundreds of soft-double libcalls) — pre-compute or
cache when possible.
- `setjmp` / `longjmp` from libgcc.s.
- Static constructors via crt0's init_array walk.
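
As a rough illustration of the qsort item above, an insertion sort over arbitrary element sizes with a user `cmp` callback might look like the following. This is a minimal sketch, not the runtime's source: `insertionSortGeneric`, `swapBytes`, and `cmpInt` are hypothetical names, and the byte-wise swap is one assumed way to avoid a temporary element buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Swap two non-overlapping elements of `size` bytes each. */
static void swapBytes(unsigned char *a, unsigned char *b, size_t size) {
    while (size--) {
        unsigned char t = *a;
        *a++ = *b;
        *b++ = t;
    }
}

/* Hypothetical stand-in with the same signature as qsort(). */
void insertionSortGeneric(void *base, size_t nmemb, size_t size,
                          int (*cmp)(const void *, const void *)) {
    unsigned char *bytes = (unsigned char *)base;
    size_t i, j;

    assert(size > 0 && cmp != NULL);
    for (i = 1; i < nmemb; i++) {
        /* Bubble element i down until its left neighbour is <= it. */
        for (j = i; j > 0; j--) {
            unsigned char *left = bytes + (j - 1) * size;
            unsigned char *right = bytes + j * size;
            if (cmp(left, right) <= 0) {
                break;
            }
            swapBytes(left, right, size);
        }
    }
}

/* Example comparator for int elements. */
int cmpInt(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
```

The loop is iterative and calls nothing but the comparator, so it keeps register pressure low at the cost of O(n^2) comparisons.
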
**Toolchain:**
- `clang` / `llc` produce W65816 assembly + ELF object files.
- `tools/link816` resolves cross-translation-unit refs, lays out
text/rodata/bss, emits a flat binary the IIgs ROM can load.
Auto-relocates bss above text+rodata when the default
`--bss-base 0x2000` would overlap text, and skips past the
IIgs IO window ($C000-$CFFF) if needed.
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
native object format) for round-tripping with classic dev tools.
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
libgcc into linkable objects.
- `scripts/smokeTest.sh` runs 102 end-to-end checks (scalar ops,
control flow, calling conventions, MAME execution, regressions,
link816 bss-base safety + weak-symbol resolution +
heap_end-vs-heap_start sanity, iigs/toolbox.h compile-check,
standalone runtime headers, AsmPrinter peepholes for STZ /
PEA / PEI — single-STA, shared-LDA-multi-STA, and DPF0-
forwarding cases — malloc/free coalesce ordering).
Currently 100% pass at -O2 throughout.
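
The bss placement rule described for `tools/link816` above can be sketched as two range checks. This is only an illustration under stated assumptions: `placeBss` and its parameters are hypothetical, the image is modelled as one half-open range covering text+rodata, and the LC1 ceiling check is left out.

```c
#include <assert.h>
#include <stdint.h>

#define IO_START 0xC000u  /* first byte of the IIgs IO window */
#define IO_END   0xD000u  /* one past $CFFF */

/* Hypothetical helper: pick a bss base given the requested base and the
 * half-open range [imageStart, imageEnd) occupied by text+rodata. */
uint32_t placeBss(uint32_t requestedBase, uint32_t imageStart,
                  uint32_t imageEnd, uint32_t bssSize) {
    uint32_t base = requestedBase;

    assert(imageStart <= imageEnd);
    /* Step 1: if bss would overlap the image, move it just above. */
    if (base < imageEnd && base + bssSize > imageStart) {
        base = imageEnd;
    }
    /* Step 2: if bss would touch $C000-$CFFF, skip past the window. */
    if (base < IO_END && base + bssSize > IO_START) {
        base = IO_END;
    }
    return base;
}
```

With a default base of 0x2000, a small image leaves bss where it is, a larger image pushes it to the end of rodata, and an image ending just below $C000 pushes it to $D000.
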
**ABI:**
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
on the system stack with PHA. Caller deallocates via `tsc;clc;adc
#N;tcs` or `PLY*N/2`.
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
the highest 16 bits.
- Frame is empty-descending (S points to next-free); offsets account
for the +1 skew vs LLVM's full-descending model.
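
To make the i64 return convention above concrete, the sketch below splits a 64-bit value into four 16-bit pieces. The ascending assignment of the low three words to A, X, and Y is an assumption (the note only states that DP[$F0..$F1] holds the highest 16 bits), and `Ret64`, `splitReturn64`, and `joinReturn64` are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint16_t a;     /* bits  0..15 (assumed ordering) */
    uint16_t x;     /* bits 16..31 (assumed ordering) */
    uint16_t y;     /* bits 32..47 (assumed ordering) */
    uint16_t dpF0;  /* bits 48..63, held in DP[$F0..$F1] per the ABI note */
} Ret64;

/* Split a 64-bit value into the four 16-bit return locations. */
Ret64 splitReturn64(uint64_t v) {
    Ret64 r;
    r.a = (uint16_t)(v & 0xFFFFu);
    r.x = (uint16_t)((v >> 16) & 0xFFFFu);
    r.y = (uint16_t)((v >> 32) & 0xFFFFu);
    r.dpF0 = (uint16_t)((v >> 48) & 0xFFFFu);
    return r;
}

/* Reassemble the value the way a caller would read it back. */
uint64_t joinReturn64(Ret64 r) {
    return (uint64_t)r.a |
           ((uint64_t)r.x << 16) |
           ((uint64_t)r.y << 32) |
           ((uint64_t)r.dpF0 << 48);
}
```
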
## In flight
One tracked bug, now resolved:
1. **#107 — strtok / qsort -O1+ miscompile — RESOLVED.** Three
independent issues across the backend, runtime, and linker;
all fixed.
**Fix 1 (W65816StackSlotCleanup cross-MBB):** Pass -4 /
Pass -4c collapsed `LDA fs.X; STA stk.Y; ... LDA_indY stk.Y`
patterns with only an MBB-local safety check, missing cross-MBB
readers of stk.Y. Greedy regalloc had spilled an in-place INA
result back to stk.Y; eliminating the bb.3 init store left the
bb.10 reload reading garbage. Function-wide cross-MBB check
added.
**Fix 2 (W65816SepRepCleanup LDAi8imm hoist):** Pre-pass that
relocates LDAi8imm BEFORE byte-store SEP/REP wraps. LDAi8imm
expands at AsmPrinter to its own SEP+LDA8+REP that toggles M;
the post-RA scheduler was moving it INSIDE an STBptr wrap, so
the LDAi8imm's REP fired BEFORE the byte STA. The STA then
ran in M=16, writing 2 bytes of zero and clobbering the next
byte. Hoist puts the toggle in the outer M=16 zone, leaving
the byte STA in M=8.
**Fix 3 (link816 bss-base safety + strtok_r noinline):** With
the backend fixes, -O2 strtok grew large enough that the
strtok() wrapper inlining (~290 extra bytes) pushed the
binary's text+rodata past 0xC000 (IIgs IO window). Reads of
string literals or stdio handles in that range hit IO
registers and corrupted execution. Two complementary fixes:
`__attribute__((noinline))` on `strtok_r` so the wrapper
doesn't duplicate it (-O2 strtok.o now 1564B, was 2156B);
link816 auto-relocates bss above text+rodata when default
`--bss-base 0x2000` would overlap, and skips past the IO
window if needed.
strtok.c now compiles at -O2 with everything else. Smoke
#84 (4-call strtok continuation) and #92 (recursive parser)
both pass. Workaround comments in build.sh / smokeTest.sh
removed.
The `__attribute__((noinline,optnone))` defenses on iterative
qsort / RPN `runAll` / expression-parser `runAll` were
subsequently dropped; the smoke tests now compile them at plain
`-O2` without escape hatches.
The W65816 backend assembler now supports all common indirect
addressing modes (`(dp)`, `(dp),Y`, `(dp,X)`, `(d,s),Y`,
`[dp]`, `[dp],Y`, and `JMP (abs)`). All `.byte` opcode hacks in
the runtime have been removed in favour of the mnemonics. The
disassembler decodes them too.
The runtime now exposes a fairly complete C99 subset: sprintf/snprintf
with correct %.Nf precision, qsort/bsearch, the full string.h family
(strcat/strncat/strpbrk/strspn/strcspn/strtok/strtok_r), math.h with
thirteen common functions
(sqrt/pow/sin/cos/exp/log/atan/atan2/asin/acos/sinh/cosh/tanh),
atol/llabs/atexit/exit/abort, and a smoke test that exercises
malloc + struct pointers + strcmp/strcpy via a working hash table
end-to-end in MAME.
`strtok` / `strtok_r` live in their own TU at `-O2` (with
`__attribute__((noinline))` on `strtok_r` so the strtok() wrapper
doesn't duplicate it). Multi-call strtok over "a,b,,c" works
end-to-end in smoke. The layout-sensitive miscompile that
previously haunted strtok_r's inner CMP loop has been fixed by
modelling `Uses=[P]` on the conditional branches (the LICM/sink
interaction that elided "redundant" CMPs no longer fires); no
surgical workaround flags needed.
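
A multi-call strtok loop over "a,b,,c" of the kind mentioned above could look like this. It is a small sketch using only standard `strtok`; `splitCommas` is a hypothetical helper name, and the actual smoke-test source is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split `text` in place on commas, storing up to maxTokens pointers.
 * Returns the number of tokens stored. */
size_t splitCommas(char *text, char **tokens, size_t maxTokens) {
    size_t count = 0;
    char *tok;

    assert(text != NULL && tokens != NULL);
    tok = strtok(text, ",");
    while (tok != NULL && count < maxTokens) {
        tokens[count++] = tok;
        tok = strtok(NULL, ",");  /* continuation call: NULL resumes */
    }
    return count;
}
```

Standard `strtok` treats runs of delimiters as one separator, so the empty field between the two commas is skipped and the input yields three tokens.
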
A small **RPN calculator** test (smoke #87) chains strtok, atol,
push/pop over a static stack, snprintf "%ld", and strcmp to verify
the end-to-end composition under a realistic-ish workload — adds,
subs, muls, divs, and 3-deep operand stacks all work.
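
The composition described above (strtok, atol, a static stack, snprintf "%ld", strcmp) can be sketched roughly as follows. This is an assumed shape, not the smoke test itself: `rpnEval`, `rpnCheck`, the space-separated input format, and the stack depth of 8 are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RPN_STACK_MAX 8

static long rpnStack[RPN_STACK_MAX];
static int rpnTop;

static void push(long v) {
    assert(rpnTop < RPN_STACK_MAX);
    rpnStack[rpnTop++] = v;
}

static long pop(void) {
    assert(rpnTop > 0);
    return rpnStack[--rpnTop];
}

/* Evaluate a space-separated RPN expression. strtok modifies `expr`. */
long rpnEval(char *expr) {
    char *tok;

    rpnTop = 0;
    for (tok = strtok(expr, " "); tok != NULL; tok = strtok(NULL, " ")) {
        if (strcmp(tok, "+") == 0 || strcmp(tok, "-") == 0 ||
            strcmp(tok, "*") == 0 || strcmp(tok, "/") == 0) {
            long rhs = pop();  /* right operand was pushed last */
            long lhs = pop();
            switch (tok[0]) {
            case '+': push(lhs + rhs); break;
            case '-': push(lhs - rhs); break;
            case '*': push(lhs * rhs); break;
            default:  push(lhs / rhs); break;
            }
        } else {
            push(atol(tok));
        }
    }
    return pop();
}

/* Format the result with "%ld" and compare it against expected text. */
int rpnCheck(char *expr, const char *expected) {
    char out[16];

    snprintf(out, sizeof out, "%ld", rpnEval(expr));
    return strcmp(out, expected) == 0;
}
```

An input such as "1 2 3 * +" holds three operands on the stack before the first operator fires, which matches the 3-deep case the test covers.
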
**setjmp / longjmp** (smoke #88) now work end-to-end: setjmp saves
SP / 24-bit ret addr / DP, longjmp restores them and returns the
val argument as setjmp's "second return". Required two fixes:
(a) the W65816 assembler had no instruction definition for
`(dp)` / `(dp), y` / `(dp, x)` indirect addressing modes, so the
mnemonic forms silently fell through to absolute-,Y opcodes —
fixed in `src/llvm/lib/Target/W65816/W65816InstrFormats.td` +
`W65816InstrInfo.td` + `AsmParser/W65816AsmParser.cpp` (the runtime
.byte hacks have been replaced with mnemonics); (b) added
`__attribute__((returns_twice))` to the setjmp declaration so the
optimizer doesn't constant-fold post-setjmp env reads to 0.
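
A minimal use of the "second return" behaviour described above might look like the sketch below. `runGuarded`, `worker`, and `stepsCompleted` are hypothetical names; the `switch (setjmp(...))` form is used because the C standard only permits `setjmp` in a few expression contexts.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf recoverPoint;
static volatile int stepsCompleted;

/* Does part of its work, then bails out through longjmp on failure. */
static void worker(int failCode) {
    stepsCompleted = 1;
    if (failCode != 0) {
        longjmp(recoverPoint, failCode);  /* never returns */
    }
    stepsCompleted = 2;
}

/* Returns 0 on the normal path, 7 when worker jumped out with 7,
 * and -1 for any other jump value. */
int runGuarded(int failCode) {
    switch (setjmp(recoverPoint)) {
    case 0:
        worker(failCode);  /* first return from setjmp */
        return 0;
    case 7:
        return 7;          /* second return, val == 7 */
    default:
        return -1;         /* second return, other val */
    }
}
```
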
**CRC32** (smoke #89) verifies the standard "123456789" → 0xCBF43926
end-to-end — exercises uint32_t shifts, XORs, char-by-char loops.
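
For reference, a bitwise CRC-32 that reproduces the check value above can be written as follows. This is a sketch of the standard reflected algorithm (polynomial 0xEDB88320, initial value and final XOR of 0xFFFFFFFF); whether the smoke test uses this bitwise form or a table is not stated here, and `crc32Bitwise` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32 over `len` bytes of `data`. */
uint32_t crc32Bitwise(const char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    assert(data != NULL || len == 0);
    for (i = 0; i < len; i++) {
        crc ^= (uint8_t)data[i];
        for (bit = 0; bit < 8; bit++) {
            if (crc & 1u) {
                crc = (crc >> 1) ^ 0xEDB88320u;
            } else {
                crc >>= 1;
            }
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```
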
**Brainfuck interpreter** (smoke #90) executes a small bf program
and verifies the output bytes — exercises loop bracket matching,
pointer math (data pointer), branching on cell value.
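
An interpreter of the kind described above fits in a few dozen lines. The sketch below is an assumed shape rather than the smoke-test source: `bfRun`, the 256-cell wrapping tape, and the output-buffer interface are hypothetical, input (`,`) is ignored, and brackets are assumed to be balanced.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BF_TAPE 256

/* Run `prog`, storing up to outCap output bytes in `out`.
 * Returns the number of bytes stored. Assumes balanced brackets. */
size_t bfRun(const char *prog, unsigned char *out, size_t outCap) {
    unsigned char tape[BF_TAPE];
    size_t dp = 0, pc = 0, outLen = 0;
    size_t progLen = strlen(prog);
    int depth;

    memset(tape, 0, sizeof tape);
    while (pc < progLen) {
        switch (prog[pc]) {
        case '>': dp = (dp + 1) % BF_TAPE; break;
        case '<': dp = (dp + BF_TAPE - 1) % BF_TAPE; break;
        case '+': tape[dp]++; break;
        case '-': tape[dp]--; break;
        case '.':
            if (outLen < outCap) out[outLen++] = tape[dp];
            break;
        case '[':  /* cell is zero: skip forward to the matching ']' */
            if (tape[dp] == 0) {
                depth = 1;
                while (depth > 0 && ++pc < progLen) {
                    if (prog[pc] == '[') depth++;
                    else if (prog[pc] == ']') depth--;
                }
            }
            break;
        case ']':  /* cell is nonzero: jump back to the matching '[' */
            if (tape[dp] != 0) {
                depth = 1;
                while (depth > 0 && pc > 0) {
                    pc--;
                    if (prog[pc] == ']') depth++;
                    else if (prog[pc] == '[') depth--;
                }
            }
            break;
        default:
            break;  /* any other character is a comment */
        }
        pc++;
    }
    return outLen;
}
```

The program `++++++++[>++++++++<-]>+.` multiplies 8 by 8 in a loop, adds one, and emits the single byte 65 (`'A'`).
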
**Recursive-descent expression parser** (smoke #92) evaluates
"3+4", "2*3+4", "2+3*4", "(3+4)*5", "100/4-5*2+1" with proper
operator precedence and parentheses — exercises mutual recursion,
char-by-char tokenization, and integer arithmetic in concert.
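
The grammar behind those five inputs can be sketched with three mutually recursive functions. This is an illustrative reconstruction, not the smoke-test source: `evalExpr`, `parseExpr`, `parseTerm`, and `parseFactor` are hypothetical names, whitespace is not handled, and division by zero is not guarded.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static const char *cursor;  /* next unread character */

static long parseExpr(void);

/* factor := digits | '(' expr ')' */
static long parseFactor(void) {
    long value = 0;

    if (*cursor == '(') {
        cursor++;
        value = parseExpr();
        if (*cursor == ')') cursor++;
        return value;
    }
    while (isdigit((unsigned char)*cursor)) {
        value = value * 10 + (*cursor - '0');
        cursor++;
    }
    return value;
}

/* term := factor (('*' | '/') factor)* */
static long parseTerm(void) {
    long value = parseFactor();

    while (*cursor == '*' || *cursor == '/') {
        char op = *cursor++;
        long rhs = parseFactor();
        value = (op == '*') ? value * rhs : value / rhs;
    }
    return value;
}

/* expr := term (('+' | '-') term)* */
static long parseExpr(void) {
    long value = parseTerm();

    while (*cursor == '+' || *cursor == '-') {
        char op = *cursor++;
        long rhs = parseTerm();
        value = (op == '+') ? value + rhs : value - rhs;
    }
    return value;
}

long evalExpr(const char *text) {
    assert(text != NULL);
    cursor = text;
    return parseExpr();
}
```

Precedence falls out of the call structure: `parseExpr` only sees whole terms, so `2+3*4` multiplies before it adds.
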
The **DWARF sidecar** (`link816 --debug-out FILE`) now applies
text/rodata/bss/init_array relocations to every `.debug_*` section
before writing it. PC values in `.debug_addr` and `.debug_line` end
up as final-image addresses, so a consumer can map back to source
lines without re-running the linker. Intra-debug references (e.g.
`.debug_info` -> `.debug_str` offsets) are intentionally left
object-local — sections are concatenated, not recompacted, and each
slice carries an `; OBJ ... SEC ... SIZE ...` header so a multi-TU
consumer can scope intra-debug offsets per-slice. The smoke test
verifies the address of a known function appears in the patched
sidecar bytes.
## Known issues / workarounds
- **(d,s),y / (sr,s),y addressing wraps the bank** when Y is
negative as 16-bit unsigned. Worked around by `W65816NegYIndY`
rewriting the affected ops to `TAX ; LDA/STA $0000,X`. Stays
correct for negative offsets like `arr[i-1]`.
- **Pointer-deref bank policy is now split-by-syntax** (FIXED):
`*p` (where `p` is a runtime pointer / local-or-arg vreg) lowers
via `LDAptr / STAptr / STBptr` to `[$E0],Y` indirect-LONG with
the bank byte at `$E2` forced to 0 — DBR-independent. The
`*(volatile uint16 *)0x5000 = v` MMIO idiom (const-int pointer)
is matched by a separate TableGen pattern that lowers straight
to `STAabs` (DBR-relative) so the smoke tests' bank-2 write
path still works. This resolved two tracked issues:
(a) PHI-elim was eliding the inserter's `COPY $a = ptr_vreg`
when the loop body had multiple Acc16 PHIs competing for A —
the inserter now spills the pointer to a fresh stack slot and
reloads via LDAfi to keep RA honest; sumTable now correct.
(b) pointer staging through `[$E0]` is bank-0 only, so
switchToBank2 + helper-with-local-ptr no longer corrupts data
in the wrong bank. See `feedback_dbr_ptr_deref_spill.md`.
- **Greedy regalloc fails on long-arg call chains** — a function
that strings ~7+ independent `helper(longArg1, longArg2)` calls
overflows greedy at -O1+ ("ran out of registers during register
allocation"). Same root issue as softDouble's old -O2 hold-out.
Threshold raised somewhat by expanding IMG slots from 8 to 16
(now backed by DP $C0..$DE) — most "normal-looking" mixed-arity
workloads now compile, but pathological pressure (many i32+ args
+ bitmask SETCC chain) still fails. Workarounds (in order of
preference): mark the heaviest helper `__attribute__((noinline))`
to reduce caller pressure; `-mllvm -regalloc=fast` for that TU;
or `__attribute__((optnone))` on the affected function. A proper
fix needs either a custom greedy→fast fallback in
`W65816TargetMachine::createTargetRegisterAllocator` or a smarter
spill-placement pre-RA pass.
- **Bank-0 size limit (~48KB)** — the runtime + program must fit in
$1000-$BFFF (text+rodata) plus $D000-$DFFF (LC1 for rodata-spill
and BSS). Past that, link816 hard-fails because text would
cross the IO window. In practice this is rarely hit now that
link816 has `--gc-sections` (default ON, see Recently Fixed)
which drops unreachable functions: a minimal program shrinks
from ~43KB (whole runtime) to ~1.5KB. Programs that genuinely
use most of the runtime can still hit the limit.
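
For the greedy-regalloc workaround listed above, the first-choice fix is an attribute on the helper rather than a change to the caller. The sketch below shows only the placement of `__attribute__((noinline))`; `mixLongs` and `chainOfCalls` are hypothetical, and this small example is not claimed to trigger the allocator failure by itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper taking two 32-bit arguments. Keeping it out of
 * line means the caller only sees a call, not the helper's temporaries. */
__attribute__((noinline))
static uint32_t mixLongs(uint32_t lhs, uint32_t rhs) {
    return lhs + (rhs >> 4);
}

/* A caller that strings seven independent two-long-argument calls. */
uint32_t chainOfCalls(uint32_t seed) {
    uint32_t acc = seed;

    acc += mixLongs(seed, 0x10u);
    acc += mixLongs(seed, 0x20u);
    acc += mixLongs(seed, 0x30u);
    acc += mixLongs(seed, 0x40u);
    acc += mixLongs(seed, 0x50u);
    acc += mixLongs(seed, 0x60u);
    acc += mixLongs(seed, 0x70u);
    return acc;
}
```
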
## Recently fixed
- **#70 — iterative qsort -O2 miscompile** — `W65816StackSlotCleanup`
Pass -2 was deleting a store to a slot the loop body read.
Function-wide `slotHasOtherRefs` safety check added (Pass -1 and
Pass -2c hardened with the same pattern). Iterative qsort at
plain -O2 + greedy now compiles correctly; the `optnone` workaround
in smoke #70 was removed.
- **strtok -O2 layout-sensitive miscompile** — modelling `Uses=[P]`
on the conditional branches (BEQ/BNE/BCS/BCC/BMI/BPL/BVS/BVC) made
MachineCSE / scheduler / LICM / sink see the CMP→Bxx flag
dependency. An entire class of layout-sensitive flag-corruption
bugs went away; verified by sweeping `--rodata-base` from text-end
to text-end+300 in 13 increments — every layout returns the correct
strtok result. As a follow-on, MachineCSE has been re-enabled
(was previously disabled in
`W65816TargetMachine::addMachineSSAOptimization` as a workaround
for the same root cause).
- **link816 silently produced 4.3GB binaries** when `--rodata-base`
was set inside the text region. Now dies with a clear error:
`--rodata-base 0xX overlaps text 0xY+N (must start at or after 0xZ)`.
- **link816 BSS-relocate landed in IIgs Language Card area** —
when text+rodata grew past $C000, link816 placed BSS at $D000
(the LC1 area), where IIgs-by-default maps ROM (writes drop
silently, reads return ROM bytes). Globals never initialised;
caught by the expression-parser smoke (#92) when adding rand /
strnlen / etc. pushed the runtime past that threshold. Two-part
fix: crt0 now enables LC1 RAM via the standard `lda $C083`
read-twice trick at startup, and link816 hard-fails (rather
than silently corrupt) if BSS would exceed the LC1 ceiling
($E000) — past that you'd need crt0 to also enable LC2 / shadow
RAM, which we haven't wired up.
- **STZ peephole multi-STA latent miscompile** — AsmPrinter's
`LDA #0; STA $g` -> `STZ $g` peephole eliminated the LDA but
only consumed the FIRST `STA`. When SDAG-CSE shared one
`LDA #0` across multiple `STA`s (`g16=0; g32=0;` is one IR
shape), trailing `STA`s read whatever was in A on entry —
silently corrupting any global where A wasn't 0 at function
entry. Smoke happened to pass because A was 0 by luck in
every covered path. Fixed by gating the peephole on the
consuming `STA` killing A (regalloc only sets `killed` on the
last reader); smoke #98 added to lock the multi-STA case.
- **PEI AsmPrinter peephole** — new: `LDA $dp; PHA` -> `PEI $dp`
saves 1 byte and avoids touching A. Fires on the
`copyPhysReg(A=DPF0); PUSH16` pattern (i64-libcall return-value
forwarding into the next call's stacked args), which appears
in every chained soft-double / soft-int64 expression. Saves
68 bytes across the runtime (-64 in math.o alone). Same
next-instruction-modifies-A safety check as the PEA peephole.
Smoke #99 added.
- **PEA peephole opcode-allowlist replaced with `modifiesRegister`** —
the next-after-PUSH16 check that gates the PEA peephole was a
hand-curated list of opcodes that obviously redefine A; switched
to `MachineInstr::modifiesRegister(A, TRI)` which also catches
implicit-defs (e.g. JSL clobbering A as part of the call ABI).
Saves a few bytes and is more robust.
- **libgcc.s `lda #0; sta $XX` -> `stz $XX`** — 7 sites converted
in libgcc.s after STZ landed in the assembler. Saves 28 bytes;
also removes two PHA/PLA save-restore wraps around the LDA #0
(STZ doesn't touch A, so the wraps are unnecessary).
- **libgcc.s `lda dp; pha` -> `pei dp`** — 2 sites in __divhi3 /
__modhi3 where the loaded A is dead after the push. PEI
doesn't touch A, saves 1 byte each.
- **W65816StackSlotCleanup Pass 1c skip-list extended** — added
STAabs / STA8abs / STAptr / STBptr / STAptrOff / STBptrOff and
ADJCALLSTACKDOWN to the A-transparent list. Lets the redundant-
CMP-after-A-modifier elimination see through more pseudo
stores and the call-stack-down pseudo. Saves 8 bytes in math.o.
(ADJCALLSTACKUP is NOT transparent: when the PrologEpilogInserter
pass doesn't process it, AsmPrinter emits a TSC/CLC/ADC/TCS that
clobbers A.)
- **crt0.s `lda #0; sta` -> `stz`** — IRQ-disable block and the
BSS-zero loop both used `.byte 0xa9, 0x00 ; sta` raw-byte
workarounds for `lda #0` (the assembler emits a 16-bit immediate
in M=8, mis-encoding it). `stz` works in M=8 (stores 1 byte) and
doesn't touch A — both `.byte` workarounds removed; saves 4 bytes
in crt0.o.
- **Runtime correctness pass — five real bugs fixed:**
- `free()` coalesce: when a freed block was absorbed into a
lower-address neighbour (`bEnd == a` path), the absorbed entry
was left in the free list overlapping the extended one. A
follow-on malloc could hand out the same memory to two
callers. Fix: track outer-loop predecessor and excise the
absorbed entry. Smoke #100 added.
- `sqrt(-0.0)` returned NaN; should return -0.0 per IEEE-754.
The sign-bit check fired before the zero check. Fix: mask
sign bit when testing for zero.
- `log(0)` returned NaN; should return -Infinity (pole error).
Same sign-bit-vs-zero ordering issue; both ±0 now return
`-1.0/0.0`.
- `snprintf(buf, 0, ...)` wrote `'\0'` to `buf[-1]` (one byte
BEFORE the buffer). C99 says n=0 must not touch the buffer.
Fix: set `gEnd = NULL` for n=0 so neither the normal nor the
truncation NUL-write path fires. Smoke #76 extended.
- `malloc(>~32KB)` and `calloc(n, m)` had silent integer overflow
on size_t (16-bit), wrapping to small values and handing out
tiny allocations claiming huge sizes. Bumped malloc to bail
above 0x7FF0 (heap is at most ~32KB anyway) and made calloc
overflow-check before multiplying.
- **Removed** dead `runtime/src/softDouble.s` (a stub from before
`softDouble.c` was implemented; the build script doesn't reference
it but it was confusing to leave around).
- **inttypes.h PRId64 / PRIu64 / PRIx64** documented as
unsupported in the runtime's printf — the macros expand to
`"lld"`/`"llu"`/`"llx"` but the formatter only knows the `l`
length modifier, not `ll`, so the format prints literally and
the va_list misaligns. Use `PRId32` etc. for now.
- **More runtime fixes (round 2):**
- `fputs(s, stream)` was forwarding to `puts(s)`, which appends a
newline. C says fputs MUST NOT add one. Direct char-by-char
write now.
- `exit(code)` never invoked the registered `atexit` handler.
C99 7.20.4.3 requires it. Now runs the single-slot handler
(with re-entry guard) before the BRK.
- `printf("%f", -0.0)` printed `0.000000` instead of `-0.000000`
because `if (v < 0)` (a `__ltdf2` call) returns false for
negative zero. Switched to the IEEE-754 sign-bit test that
snprintf already uses.
- `vfprintf` was missing entirely (declared neither in stdio.h
nor implemented). Added a thin wrapper around vprintf.
- **link816 weak-symbol resolution:** the linker previously used
"last def wins" with no regard for STB_GLOBAL vs STB_WEAK. When
a user provided a strong override of a weak libc stub (e.g.
`putchar`), it worked only by link-order luck — reversing the
order let the weak stub silently overwrite the strong def.
Resolution now follows the usual rules: a strong definition beats a
weak one in any link order, two strong definitions are an error, and
of two weak definitions the first wins. Smoke #100 added.
- **More runtime fixes (round 3):**
- `writeHex` / `emitHex` had a stack buffer overrun
(`char buf[5]` but `printf("%08x", ...)` would write 8 bytes).
On 16-bit `unsigned int`, max useful width is 4 — buf shrunk
to 4 and width is now capped.
- `writeDec` / `writeSignedLong` / `emitDec` / `emitSignedLong`
used `-n` on signed input, which overflows for INT_MIN /
LONG_MIN (UB). All four switched to unsigned-negation
(`0u - (unsigned)n`) for correctness and to keep an
optimizer-aware compiler from exploiting the UB.
- `atoi` / `atol` / `strtol` / `strtoul` likewise built the
parsed magnitude in a signed accumulator and negated at the
end — same UB on the boundary value. All switched to
unsigned magnitude + unsigned-negation cast.
- `link816 parseInt` / `omfEmit parseInt` silently truncated
addresses > 24 bits to `uint32_t` low bits — `--text-base
0x100000000` would silently wrap to 0. Both now reject
out-of-range addresses with a clear error.
- **More runtime fixes (round 4):**
- `pow(x, y)` computed `n = -n` for the integer-y branch when
yi was INT_MIN (-32768); same signed-overflow UB pattern as
the print functions. Switched to unsigned magnitude.
- Added `perror(prefix)` — was missing from the runtime; common
pattern in portable code that reports I/O failure via
`errno + strerror`. Declared in stdio.h, implemented as
char-by-char emit through putchar (no fprintf dependency).
- **link816 `__heap_end` was hardcoded at $BF00**, ignoring where
`__heap_start` actually ended up. When BSS got auto-relocated
into LC1 ($D000+), heap_start ended up > heap_end and malloc
immediately returned NULL on every call — silently bricking any
program that allocated dynamic memory after the runtime grew
past the default-bss threshold. Heap_end now picks
$BF00 / $E000 based on where heap_start lands (and skips the IO
window if heap_start would have landed in $C000-$CFFF).
Smoke #102 added.
- **link816 rodata auto-skips IIgs IO window** ($C000-$CFFF). When
text+rodata grew past 0xC000 the rodata bytes silently corrupted
at runtime — string literals in the IO range read back as
hardware register values, breaking strcmp / strstr / printf / etc.
Now: rodata that would land in or cross $C000-$CFFF auto-skips
to $D000. Init_array gets the same treatment. Text that would
cross IO is hard-rejected at link time (no auto-fix possible —
PC fetches in IO would read hardware registers). This was the
root cause of the "tan/tanf triggers layout-sensitive failure"
symptom listed in older STATUS notes.
- **runInMame skips writes to IO window** during the binary load.
Without this, the zero-padding in the rodata-skip gap would
clobber soft switches (e.g. the LC1 RAM enable that crt0 sets
via $C083) when the loader naively wrote the entire image
byte-by-byte to memory.
- **link816 `--gc-sections` (default ON)** — discards sections not
reachable from the entry point (`__start` / `_start` / `main`
for the canonical crt0 setup) plus all `.init_array` sections.
Built on `-ffunction-sections` so each function is in its own
section. A minimal program with full runtime linked shrinks
from ~43KB to ~1.5KB. Adding `tan/tanf` to math.c (which
caused the latent layout-sensitive failure described above)
no longer pushes any test past the bank-0 limit. Tests that
intentionally check unreachable symbols pass `--no-gc-sections`
to opt out.
- **`fwrite(stdout, ...)` was a stub returning 0** even though
`stdout` has a working `putchar` route. Now actually writes
through `putchar` for stdout/stderr (only). Also gained the
same `size * nmemb` overflow guard as `calloc`.
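
Two of the patterns that recur in the runtime fixes above, unsigned negation and the `size * nmemb` overflow guard, are small enough to show in isolation. This is a minimal sketch with hypothetical helper names (`magnitudeOf`, `mulOverflows`), not the runtime's own code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Magnitude of n without evaluating -n, which is undefined for INT_MIN.
 * Unsigned arithmetic wraps, so 0u - (unsigned)n is always defined. */
unsigned magnitudeOf(int n) {
    return (n < 0) ? 0u - (unsigned)n : (unsigned)n;
}

/* Returns 1 when n * m would not fit in size_t, else 0.
 * The check divides instead of multiplying, so it cannot wrap. */
int mulOverflows(size_t n, size_t m) {
    return m != 0 && n > SIZE_MAX / m;
}
```

A `calloc`-style caller would test `mulOverflows(n, m)` first and return NULL before ever computing the product.
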
## What's still needed for a "ship-ready" toolchain
- **softDouble.c -O2 — FIXED.** Marking `dclass` noinline (in
addition to `dpack`) drops register pressure in `__muldf3`/
`__divdf3`/`__adddf3` enough that greedy regalloc no longer
runs out. The previous blocker was that noinline-dclass would
write through pointer args via the DBR-relative `(d,s),y` mode
and corrupt caller data after a bank switch — that path now
goes through `STAptr/STBptr` which use `[$E0],Y` indirect-long
with the bank byte forced to 0, so DBR is irrelevant. All
three smoke build sites moved to `-O2`.
- **More of the C standard library**: real `<stdio.h>` file I/O.
`fopen`, `fread`, `fseek`, and `fwrite` to anything other than
stdout/stderr are currently stubs returning success/zero; real
support would need a memory-backed FS or a MAME hook.
`<locale.h>` / `<signal.h>` / `<time.h>` are stubbed
(compile and return safe defaults). `<wchar.h>` mostly absent.
A `time()` impl wired to ReadTimeHex (Misc Tool $0D03) was
attempted but crashes MAME without the Tool Locator initialised
in crt0; `clock()` via VBL counter at $E1006B needs 24-bit
far-pointer support that the backend doesn't yet model.
- **C++ runtime support**: vtable layout for multiple inheritance,
RTTI, exceptions (or a documented `-fno-exceptions` requirement).
- **REP/SEP scheduling pass** (design doc §3.3): the current
prologue picks one M-mode for the whole function based on
whether any 8-bit accumulator value is used. A per-region
scheduler would reduce the SEP/REP wrap overhead on i8 stores.
- **Toolbox / IIgs system call bindings**: `iigs/toolbox.h` covers
the common entry points across Tool Locator, Memory Manager,
Misc Tools, QuickDraw II, Event Manager, Window Manager, plus
GS/OS Quit. Multi-arg wrappers (NewHandle, QDStartUp, MoveTo,
EMStartUp, GetNextEvent, NewWindow, CloseWindow) live in
`runtime/src/iigsToolbox.s` because the backend's inline-asm
constraints can't take memory operands. Single-arg / no-arg
wrappers stay inline. More routines (Menu Manager, Dialog
Manager, Standard File, Sound) still TBD.
- **Real-world program coverage**: the smoke tests are
microbenchmarks. A few known-good Apple IIgs C programs (e.g.
a textfile pager, a small game) compiled and run end-to-end
would catch issues no synthetic test currently exercises.
- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
says the goal is to "match or exceed" Calypsi. We have neither
baseline numbers nor a comparison harness yet.