553 lines
28 KiB
Markdown
553 lines
28 KiB
Markdown
# llvm816 — Current Status
|
||
|
||
LLVM/Clang backend for the WDC 65816 (Apple IIgs), forked from
|
||
llvm-mos as a separate `W65816` target.
|
||
|
||
## What works
|
||
|
||
End-to-end C-to-binary toolchain that produces 65816 machine code
|
||
which runs correctly under MAME (apple2gs).
|
||
|
||
**Language coverage at -O2 (no extra flags):**
|
||
|
||
- All scalar arithmetic: i8 / i16 / i32 / i64 add, sub, mul, div, mod
|
||
(signed and unsigned). Carry-chained multi-word ops via ADC/SBC pseudos
|
||
+ ASLA16 / shift libcalls.
|
||
- Comparisons and signed/unsigned widening (sext, zext, trunc) for all
|
||
the above sizes.
|
||
- Pointer arithmetic, array indexing, struct field access, struct
|
||
return-by-value (up to 8 bytes — Pair, Vec4, double).
|
||
- Bitfields, switch statements (verified up to ~12 cases + default),
|
||
function pointers, function-pointer tables, indirect calls via
|
||
`__jsl_indir` trampoline.
|
||
- Recursion: factorial, Fibonacci, depth-3 binary-tree
|
||
insert/sum/min/max, simple recursive quicksort.
|
||
- Loops with goto / break / continue, nested loops, state machines.
|
||
- `<stdarg.h>` varargs with int / long / unsigned long long mixed args.
|
||
- Heap: `malloc` / `free` (libc.c first-fit allocator) — linked-list
|
||
reverse with `cons` works.
|
||
- Strings: hand-rolled `strlen`, `strcmp`, `strcpy`, `strchr`, atoi/itoa
|
||
roundtrip.
|
||
- Soft-float (single): all four ops + comparisons, MAME-verified.
|
||
- Soft-double: add, sub, mul, div all return correct bit patterns
|
||
bit-for-bit against gcc with round-to-nearest-even rounding;
|
||
3-iter Newton sqrt converges. Long-running iterations may hit MAME's
|
||
1-second sim-time budget (test config issue, not a compiler bug).
|
||
- Inline assembly with `"a"`, `"x"`, `"y"` register constraints and
|
||
arbitrary opcode bytes (used for the `pha;plb` bank-switch idiom).
|
||
- C++ minimal: clang++ compiles a class with virtual + non-trivial
|
||
ctor (vtable + RTTI omitted; no exceptions).
|
||
- printf with `%d %x %s %c %p` and width/precision specifiers.
|
||
- sprintf / snprintf / vsprintf / vsnprintf with the same format
|
||
coverage as printf (`%d %u %x %ld %lu %s %c %f %p %%` + width).
|
||
C99 truncation semantics for snprintf. `%.Nf` produces the
|
||
correct fractional digits with round-half-up.
|
||
- qsort + bsearch over arbitrary element size with a user `cmp`
|
||
callback (insertion-sort variant — sidesteps the greedy regalloc
|
||
bug in the recursive iterative-qsort form).
|
||
- Standard string/stdlib glue: strcat, strncat, strpbrk, strspn,
|
||
strcspn, atol, llabs (kept in their own translation unit so
|
||
vprintf's branch layout doesn't shift).
|
||
- `<math.h>`: fabs, floor, ceil, fmod, copysign, sqrt, pow,
|
||
sin, cos, exp, log, atan, atan2, asin, acos, sinh, cosh, tanh
|
||
(and float variants). Bit-twiddling for fabs/floor/ceil/copysign;
|
||
Newton iteration for sqrt; range-reduction + Taylor for sin/cos/
|
||
exp/log/atan; identities for asin/acos/atan2/sinh/cosh/tanh.
|
||
Accuracy is in the ~1e-6 range — good enough for typical numeric
|
||
work, far short of glibc-quality. These are slow (each call is
|
||
dozens to hundreds of soft-double libcalls) — pre-compute or
|
||
cache when possible.
|
||
- `setjmp` / `longjmp` from libgcc.s.
|
||
- Static constructors via crt0's init_array walk.
|
||
|
||
**Toolchain:**
|
||
|
||
- `clang` / `llc` produce W65816 assembly + ELF object files.
|
||
- `tools/link816` resolves cross-translation-unit refs, lays out
|
||
text/rodata/bss, emits a flat binary the IIgs ROM can load.
|
||
Auto-relocates bss above text+rodata when the default
|
||
`--bss-base 0x2000` would overlap text, and skips past the
|
||
IIgs IO window ($C000-$CFFF) if needed.
|
||
- `tools/omfEmit` produces OMF v2.1 single-segment files (the IIgs's
|
||
native object format) for round-tripping with classic dev tools.
|
||
- `runtime/build.sh` builds crt0, libc, soft-float, soft-double,
|
||
libgcc into linkable objects.
|
||
- `scripts/smokeTest.sh` runs 107 end-to-end checks (scalar ops,
|
||
control flow, calling conventions, MAME execution, regressions,
|
||
link816 bss-base safety + weak-symbol resolution +
|
||
heap_end-vs-heap_start sanity, iigs/toolbox.h compile + link
|
||
check (singles inlined, multi-arg wrappers in iigsToolbox.s),
|
||
standalone runtime headers, AsmPrinter peepholes for STZ /
|
||
PEA / PEI — single-STA, shared-LDA-multi-STA, and DPF0-
|
||
forwarding cases — malloc/free coalesce ordering, plus
|
||
real-world coverage tests: Conway's Game of Life blinker
|
||
(2D loop + neighbour bounds), binary search tree (recursive
|
||
struct + malloc), function-pointer dispatch table (indirect
|
||
JSL via `__jsl_indir`). Currently 100% pass at -O2 throughout.
|
||
|
||
**ABI:**
|
||
|
||
- arg0 in A; arg1 in X for i32-first-arg signatures; rest pushed RTL
|
||
on the system stack with PHA. Caller deallocates via `tsc;clc;adc
|
||
#N;tcs` or `PLY*N/2`.
|
||
- Return: i8/i16 in A; i32 in A:X; i64 in A:X:Y plus DP[$F0..$F1] for
|
||
the highest 16 bits.
|
||
- Frame is empty-descending (S points to next-free); offsets account
|
||
for the +1 skew vs LLVM's full-descending model.
|
||
|
||
## In flight
|
||
|
||
Two open bugs tracked:
|
||
|
||
1. **#107 — strtok / qsort -O1+ miscompile — RESOLVED.** Three
|
||
independent issues across the backend, runtime, and linker;
|
||
all fixed.
|
||
|
||
**Fix 1 (W65816StackSlotCleanup cross-MBB):** Pass -4 /
|
||
Pass -4c collapsed `LDA fs.X; STA stk.Y; ... LDA_indY stk.Y`
|
||
patterns with only an MBB-local safety check, missing cross-MBB
|
||
readers of stk.Y. Greedy regalloc had spilled an in-place INA
|
||
result back to stk.Y; eliminating the bb.3 init store left the
|
||
bb.10 reload reading garbage. Function-wide cross-MBB check
|
||
added.
|
||
|
||
**Fix 2 (W65816SepRepCleanup LDAi8imm hoist):** Pre-pass that
|
||
relocates LDAi8imm BEFORE byte-store SEP/REP wraps. LDAi8imm
|
||
expands at AsmPrinter to its own SEP+LDA8+REP that toggles M;
|
||
the post-RA scheduler was moving it INSIDE an STBptr wrap, so
|
||
the LDAi8imm's REP fired BEFORE the byte STA. The STA then
|
||
ran in M=16, writing 2 bytes of zero and clobbering the next
|
||
byte. Hoist puts the toggle in the outer M=16 zone, leaving
|
||
the byte STA in M=8.
|
||
|
||
**Fix 3 (link816 bss-base safety + strtok_r noinline):** With
|
||
the backend fixes, -O2 strtok grew large enough that the
|
||
strtok() wrapper inlining (~290 extra bytes) pushed the
|
||
binary's text+rodata past 0xC000 (IIgs IO window). Reads of
|
||
string literals or stdio handles in that range hit IO
|
||
registers and corrupted execution. Two complementary fixes:
|
||
`__attribute__((noinline))` on `strtok_r` so the wrapper
|
||
doesn't duplicate it (-O2 strtok.o now 1564B, was 2156B);
|
||
link816 auto-relocates bss above text+rodata when default
|
||
`--bss-base 0x2000` would overlap, and skips past the IO
|
||
window if needed.
|
||
|
||
strtok.c now compiles at -O2 with everything else. Smoke
|
||
#84 (4-call strtok continuation) and #92 (recursive parser)
|
||
both pass. Workaround comments in build.sh / smokeTest.sh
|
||
removed.
|
||
|
||
The `__attribute__((noinline,optnone))` defenses on iterative
|
||
qsort / RPN `runAll` / expression-parser `runAll` were
|
||
subsequently dropped; the smoke now compiles them at plain
|
||
`-O2` without escape hatches.
|
||
|
||
The W65816 backend assembler now supports all common indirect
|
||
addressing modes (`(dp)`, `(dp),Y`, `(dp,X)`, `(d,s),Y`,
|
||
`[dp]`, `[dp],Y`, and `JMP (abs)`). All `.byte` opcode hacks in
|
||
the runtime have been removed in favour of the mnemonics. The
|
||
disassembler decodes them too.
|
||
|
||
Runtime now exposes a ~complete C99 subset: sprintf/snprintf with correct %.Nf precision, qsort/bsearch,
|
||
the full string.h family (strcat/strncat/strpbrk/strspn/strcspn/
|
||
strtok/strtok_r), math.h with the eleven common transcendentals
|
||
(sqrt/pow/sin/cos/exp/log/atan/atan2/asin/acos/sinh/cosh/tanh),
|
||
atol/llabs/atexit/exit/abort, and a smoke test that exercises
|
||
malloc + struct pointers + strcmp/strcpy via a working hash table
|
||
end-to-end in MAME.
|
||
|
||
`strtok` / `strtok_r` live in their own TU at `-O2` (with
|
||
`__attribute__((noinline))` on `strtok_r` so the strtok() wrapper
|
||
doesn't duplicate it). Multi-call strtok over "a,b,,c" works
|
||
end-to-end in smoke. The layout-sensitive miscompile that
|
||
previously haunted strtok_r's inner CMP loop has been fixed by
|
||
modelling `Uses=[P]` on the conditional branches (the LICM/sink
|
||
interaction that elided "redundant" CMPs no longer fires); no
|
||
surgical workaround flags needed.
|
||
|
||
A small **RPN calculator** test (smoke #87) chains strtok, atol,
|
||
push/pop over a static stack, snprintf "%ld", and strcmp to verify
|
||
the end-to-end composition under a realistic-ish workload — adds,
|
||
subs, muls, divs, and 3-deep operand stacks all work.
|
||
|
||
**setjmp / longjmp** (smoke #88) now work end-to-end: setjmp saves
|
||
SP / 24-bit ret addr / DP, longjmp restores them and returns the
|
||
val argument as setjmp's "second return". Required two fixes:
|
||
(a) the W65816 assembler had no instruction definition for
|
||
`(dp)` / `(dp), y` / `(dp, x)` indirect addressing modes, so the
|
||
mnemonic forms silently fell through to absolute-,Y opcodes —
|
||
fixed in `src/llvm/lib/Target/W65816/W65816InstrFormats.td` +
|
||
`W65816InstrInfo.td` + `AsmParser/W65816AsmParser.cpp` (the runtime
|
||
.byte hacks have been replaced with mnemonics); (b) added
|
||
`__attribute__((returns_twice))` to the setjmp declaration so the
|
||
optimizer doesn't constant-fold post-setjmp env reads to 0.
|
||
|
||
**CRC32** (smoke #89) verifies the standard "123456789" → 0xCBF43926
|
||
end-to-end — exercises uint32_t shifts, XORs, char-by-char loops.
|
||
|
||
**Brainfuck interpreter** (smoke #90) executes a small bf program
|
||
and verifies the output bytes — exercises loop bracket matching,
|
||
pointer math (data pointer), branching on cell value.
|
||
|
||
**Recursive-descent expression parser** (smoke #92) evaluates
|
||
"3+4", "2*3+4", "2+3*4", "(3+4)*5", "100/4-5*2+1" with proper
|
||
operator precedence and parentheses — exercises mutual recursion,
|
||
char-by-char tokenization, and integer arithmetic in concert.
|
||
|
||
The **DWARF sidecar** (`link816 --debug-out FILE`) now applies
|
||
text/rodata/bss/init_array relocations to every `.debug_*` section
|
||
before writing it. PC values in `.debug_addr` and `.debug_line` end
|
||
up as final-image addresses, so a consumer can map back to source
|
||
lines without re-running the linker. Intra-debug references (e.g.
|
||
`.debug_info` -> `.debug_str` offsets) are intentionally left
|
||
object-local — sections are concatenated, not recompacted, and each
|
||
slice carries an `; OBJ ... SEC ... SIZE ...` header so a multi-TU
|
||
consumer can scope intra-debug offsets per-slice. The smoke test
|
||
verifies the address of a known function appears in the patched
|
||
sidecar bytes.
|
||
|
||
## Known issues / workarounds
|
||
|
||
- **(d,s),y / (sr,s),y addressing wraps the bank** when Y is
|
||
negative as 16-bit unsigned. Worked around by `W65816NegYIndY`
|
||
rewriting the affected ops to `TAX ; LDA/STA $0000,X`. Stays
|
||
correct for negative offsets like `arr[i-1]`.
|
||
|
||
- **Pointer-deref bank policy is now split-by-syntax** (FIXED):
|
||
`*p` (where `p` is a runtime pointer / local-or-arg vreg) lowers
|
||
via `LDAptr / STAptr / STBptr` to `[$E0],Y` indirect-LONG with
|
||
the bank byte at `$E2` forced to 0 — DBR-independent. The
|
||
`*(volatile uint16 *)0x5000 = v` MMIO idiom (const-int pointer)
|
||
is matched by a separate TableGen pattern that lowers straight
|
||
to `STAabs` (DBR-relative) so the smoke tests' bank-2 write
|
||
path still works. Two tracked issues this resolved:
|
||
(a) PHI-elim was eliding the inserter's `COPY $a = ptr_vreg`
|
||
when the loop body had multiple Acc16 PHIs competing for A —
|
||
the inserter now spills the pointer to a fresh stack slot and
|
||
reloads via LDAfi to keep RA honest; sumTable now correct.
|
||
(b) pointer staging through `[$E0]` is bank-0 only, so
|
||
switchToBank2 + helper-with-local-ptr no longer corrupts data
|
||
in the wrong bank. See `feedback_dbr_ptr_deref_spill.md`.
|
||
|
||
- **Greedy regalloc fails on long-arg call chains** — a function
|
||
that strings ~7+ independent `helper(longArg1, longArg2)` calls
|
||
overflows greedy at -O1+ ("ran out of registers during register
|
||
allocation"). Same root issue as softDouble's old -O2 hold-out.
|
||
Threshold raised somewhat by expanding IMG slots from 8 to 16
|
||
(now backed by DP $C0..$DE) — most "normal-looking" mixed-arity
|
||
workloads now compile, but pathological pressure (many i32+ args
|
||
+ bitmask SETCC chain) still fails. Workarounds (in order of
|
||
preference): mark the heaviest helper `__attribute__((noinline))`
|
||
to reduce caller pressure; `-mllvm -regalloc=fast` for that TU;
|
||
or `__attribute__((optnone))` on the affected function. A proper
|
||
fix needs either a custom greedy→fast fallback in
|
||
`W65816TargetMachine::createTargetRegisterAllocator` or a smarter
|
||
spill-placement pre-RA pass.
|
||
|
||
- **Bank-0 size limit (~48KB)** — the runtime + program must fit in
|
||
$1000-$BFFF (text+rodata) plus $D000-$DFFF (LC1 for rodata-spill
|
||
and BSS). Past that, link816 hard-fails because text would
|
||
cross the IO window. In practice this is rarely hit now that
|
||
link816 has `--gc-sections` (default ON, see Recently Fixed)
|
||
which drops unreachable functions: a minimal program shrinks
|
||
from ~43KB (whole runtime) to ~1.5KB. Programs that genuinely
|
||
use most of the runtime can still hit the limit.
|
||
|
||
## Recently fixed
|
||
|
||
- **DBR pointer-deref RA elision (sumTable miscompile)** —
|
||
`LDAptr / STAptr / STBptr` inserter's first-thing
|
||
`COPY $a = ptr_vreg` was being elided by RA when the loop body
|
||
had multiple Acc16 PHIs competing for A. PHI-elim silently
|
||
dropped the COPY needed to refresh A with the pointer at the
|
||
top of each iteration; sumTable's inner loop did `STA $E0`
|
||
while A held the accumulator. Fix: spill the pointer to a
|
||
fresh stack slot via `STAfi` and reload via `LDAfi` — forces
|
||
RA to materialize the value through real machine ops. See
|
||
`feedback_dbr_ptr_deref_spill.md`.
|
||
|
||
- **softDouble.c -O2 hold-out** — with the DBR fix in place,
|
||
`dclass` can be `noinline` (its pointer-arg writes go through
|
||
`STBptr / STAptr` which now use `[$E0],Y` indirect-long with
|
||
bank=0). Drops register pressure in `__muldf3 / __divdf3 /
|
||
__adddf3` enough that greedy regalloc no longer runs out. All
|
||
three smoke build sites moved from `-O1` to `-O2`.
|
||
|
||
- **IMG slot count doubled (8 → 16)** — Img16 / Wide16 register
|
||
classes now hold IMG0..IMG15, backed by DP $C0..$CE + $D0..$DE.
|
||
Reduces greedy regalloc spills for moderately-busy functions.
|
||
Existing `IMG0..IMG7 → $D0..$DE` mapping unchanged so smoke
|
||
tests that assume specific DP carriers (e.g. DPF0 at $F0) still
|
||
work. User app DP is now $00..$BF (was $00..$CF).
|
||
|
||
- **Real-world smoke coverage added** — Conway's Game of Life
|
||
blinker (2D arrays + neighbour bounds), binary search tree
|
||
(recursive struct + malloc), function-pointer dispatch table
|
||
(indirect-JSL via `__jsl_indir`). Total smoke tests at 107.
|
||
|
||
- **iigs/toolbox.h expanded** — from 4 stubs to 18+ wrappers
|
||
across Tool Locator, Memory Manager, Misc Tools, QuickDraw II,
|
||
Event Manager, Window Manager, plus GS/OS Quit. Multi-arg
|
||
wrappers live in `runtime/src/iigsToolbox.s` (the backend's
|
||
inline-asm constraints can't take memory operands); single-arg
|
||
ones stay inline.
|
||
|
||
- **#70 — iterative qsort -O2 miscompile** — `W65816StackSlotCleanup`
|
||
Pass -2 was deleting a store to a slot the loop body read.
|
||
Function-wide `slotHasOtherRefs` safety check added (Pass -1 and
|
||
Pass -2c hardened with the same pattern). Iterative qsort at
|
||
plain -O2 + greedy now compiles correctly; the `optnone` workaround
|
||
in smoke #70 was removed.
|
||
|
||
- **strtok -O2 layout-sensitive miscompile** — modelling `Uses=[P]`
|
||
on the conditional branches (BEQ/BNE/BCS/BCC/BMI/BPL/BVS/BVC) made
|
||
MachineCSE / scheduler / LICM / sink see the CMP→Bxx flag
|
||
dependency. An entire class of layout-sensitive flag-corruption
|
||
bugs went away; verified by sweeping `--rodata-base` from text-end
|
||
to text-end+300 in 13 increments — every layout returns the correct
|
||
strtok result. As a follow-on, MachineCSE has been re-enabled
|
||
(was previously disabled in `W65816TargetMachine::addMachineSSAOpti
|
||
mization` as a workaround for the same root cause).
|
||
|
||
- **link816 silently produced 4.3GB binaries** when `--rodata-base`
|
||
was set inside the text region. Now dies with a clear error:
|
||
`--rodata-base 0xX overlaps text 0xY+N (must start at or after 0xZ)`.
|
||
|
||
- **link816 BSS-relocate landed in IIgs Language Card area** —
|
||
when text+rodata grew past $C000, link816 placed BSS at $D000
|
||
(the LC1 area), where IIgs-by-default maps ROM (writes drop
|
||
silently, reads return ROM bytes). Globals never initialised;
|
||
caught by the expression-parser smoke (#92) when adding rand /
|
||
strnlen / etc. pushed the runtime past that threshold. Two-part
|
||
fix: crt0 now enables LC1 RAM via the standard `lda $C083`
|
||
read-twice trick at startup, and link816 hard-fails (rather
|
||
than silently corrupt) if BSS would exceed the LC1 ceiling
|
||
($E000) — past that you'd need crt0 to also enable LC2 / shadow
|
||
RAM, which we haven't wired up.
|
||
|
||
- **STZ peephole multi-STA latent miscompile** — AsmPrinter's
|
||
`LDA #0; STA $g` -> `STZ $g` peephole eliminated the LDA but
|
||
only consumed the FIRST `STA`. When SDAG-CSE shared one
|
||
`LDA #0` across multiple `STA`s (`g16=0; g32=0;` is one IR
|
||
shape), trailing `STA`s read whatever was in A on entry —
|
||
silently corrupting any global where A wasn't 0 at function
|
||
entry. Smoke happened to pass because A was 0 by luck in
|
||
every covered path. Fixed by gating the peephole on the
|
||
consuming `STA` killing A (regalloc only sets `killed` on the
|
||
last reader); smoke #98 added to lock the multi-STA case.
|
||
|
||
- **PEI AsmPrinter peephole** — new: `LDA $dp; PHA` -> `PEI $dp`
|
||
saves 1 byte and avoids touching A. Fires on the
|
||
`copyPhysReg(A=DPF0); PUSH16` pattern (i64-libcall return-value
|
||
forwarding into the next call's stacked args), which appears
|
||
in every chained soft-double / soft-int64 expression. Saves
|
||
68 bytes across the runtime (-64 in math.o alone). Same
|
||
next-instruction-modifies-A safety check as the PEA peephole.
|
||
Smoke #99 added.
|
||
|
||
- **PEA peephole opcode-allowlist replaced with `modifiesRegister`** —
|
||
the next-after-PUSH16 check that gates the PEA peephole was a
|
||
hand-curated list of opcodes that obviously redefine A; switched
|
||
to `MachineInstr::modifiesRegister(A, TRI)` which also catches
|
||
implicit-defs (e.g. JSL clobbering A as part of the call ABI).
|
||
Saves a few bytes and is more robust.
|
||
|
||
- **libgcc.s `lda #0; sta $XX` -> `stz $XX`** — 7 sites converted
|
||
in libgcc.s after STZ landed in the assembler. Saves 28 bytes;
|
||
also removes two PHA/PLA save-restore wraps around the LDA #0
|
||
(STZ doesn't touch A, so the wraps are unnecessary).
|
||
|
||
- **libgcc.s `lda dp; pha` -> `pei dp`** — 2 sites in __divhi3 /
|
||
__modhi3 where the loaded A is dead after the push. PEI
|
||
doesn't touch A, saves 1 byte each.
|
||
|
||
- **W65816StackSlotCleanup Pass 1c skip-list extended** — added
|
||
STAabs / STA8abs / STAptr / STBptr / STAptrOff / STBptrOff and
|
||
ADJCALLSTACKDOWN to the A-transparent list. Lets the redundant-
|
||
CMP-after-A-modifier elimination see through more pseudo
|
||
stores and the call-stack-down pseudo. Saves 8 bytes in math.o.
|
||
(ADJCALLSTACKUP is NOT transparent — when PEI doesn't process
|
||
it, AsmPrinter emits a TSC/CLC/ADC/TCS that clobbers A.)
|
||
|
||
- **crt0.s `lda #0; sta` -> `stz`** — IRQ-disable block and the
|
||
BSS-zero loop both used `.byte 0xa9, 0x00 ; sta` raw-byte
|
||
workarounds for `lda #0` (the assembler emits a 16-bit immediate
|
||
in M=8, mis-encoding it). `stz` works in M=8 (stores 1 byte) and
|
||
doesn't touch A — both `.byte` workarounds removed; saves 4 bytes
|
||
in crt0.o.
|
||
|
||
- **Runtime correctness pass — five real bugs fixed:**
|
||
- `free()` coalesce: when a freed block was absorbed into a
|
||
lower-address neighbour (`bEnd == a` path), the absorbed entry
|
||
was left in the free list overlapping the extended one. A
|
||
follow-on malloc could hand out the same memory to two
|
||
callers. Fix: track outer-loop predecessor and excise the
|
||
absorbed entry. Smoke #100 added.
|
||
- `sqrt(-0.0)` returned NaN; should return -0.0 per IEEE-754.
|
||
The sign-bit check fired before the zero check. Fix: mask
|
||
sign bit when testing for zero.
|
||
- `log(0)` returned NaN; should return -Infinity (pole error).
|
||
Same sign-bit-vs-zero ordering issue; both ±0 now return
|
||
`-1.0/0.0`.
|
||
- `snprintf(buf, 0, ...)` wrote `'\0'` to `buf[-1]` (one byte
|
||
BEFORE the buffer). C99 says n=0 must not touch the buffer.
|
||
Fix: set `gEnd = NULL` for n=0 so neither the normal nor the
|
||
truncation NUL-write path fires. Smoke #76 extended.
|
||
- `malloc(>~32KB)` and `calloc(n, m)` had silent integer overflow
|
||
on size_t (16-bit), wrapping to small values and handing out
|
||
tiny allocations claiming huge sizes. Bumped malloc to bail
|
||
above 0x7FF0 (heap is at most ~32KB anyway) and made calloc
|
||
overflow-check before multiplying.
|
||
|
||
- **Removed** dead `runtime/src/softDouble.s` (a stub from before
|
||
`softDouble.c` was implemented; the build script doesn't reference
|
||
it but it was confusing to leave around).
|
||
|
||
- **inttypes.h PRId64 / PRIu64 / PRIx64** documented as
|
||
unsupported in the runtime's printf — the macros expand to
|
||
`"lld"`/`"llu"`/`"llx"` but the formatter only knows the `l`
|
||
length modifier, not `ll`, so the format prints literally and
|
||
the va_list misaligns. Use `PRId32` etc. for now.
|
||
|
||
- **More runtime fixes (round 2):**
|
||
- `fputs(s, stream)` was forwarding to `puts(s)`, which appends a
|
||
newline. C says fputs MUST NOT add one. Direct char-by-char
|
||
write now.
|
||
- `exit(code)` never invoked the registered `atexit` handler.
|
||
C99 7.20.4.3 requires it. Now runs the single-slot handler
|
||
(with re-entry guard) before the BRK.
|
||
- `printf("%f", -0.0)` printed `0.000000` instead of `-0.000000`
|
||
because `if (v < 0)` (a `__ltdf2` call) returns false for
|
||
negative zero. Switched to the IEEE-754 sign-bit test that
|
||
snprintf already uses.
|
||
- `vfprintf` was missing entirely (declared neither in stdio.h
|
||
nor implemented). Added a thin wrapper around vprintf.
|
||
|
||
- **link816 weak-symbol resolution:** the linker previously used
|
||
"last def wins" with no regard for STB_GLOBAL vs STB_WEAK. When
|
||
a user provided a strong override of a weak libc stub (e.g.
|
||
`putchar`), it worked only by link-order luck — reversing the
|
||
order let the weak stub silently overwrite the strong def.
|
||
Now properly: strong over weak (any order), strong + strong
|
||
errors out, weak + weak picks the first. Smoke #100 added.
|
||
|
||
- **More runtime fixes (round 3):**
|
||
- `writeHex` / `emitHex` had a stack-overflow buffer overrun
|
||
(`char buf[5]` but `printf("%08x", ...)` would write 8 bytes).
|
||
On 16-bit `unsigned int`, max useful width is 4 — buf shrunk
|
||
to 4 and width is now capped.
|
||
- `writeDec` / `writeSignedLong` / `emitDec` / `emitSignedLong`
|
||
used `-n` on signed input, which overflows for INT_MIN /
|
||
LONG_MIN (UB). All four switched to unsigned-negation
|
||
(`0u - (unsigned)n`) for correctness and to keep an
|
||
optimizer-aware compiler from exploiting the UB.
|
||
- `atoi` / `atol` / `strtol` / `strtoul` likewise built the
|
||
parsed magnitude in a signed accumulator and negated at the
|
||
end — same UB on the boundary value. All switched to
|
||
unsigned magnitude + unsigned-negation cast.
|
||
- `link816 parseInt` / `omfEmit parseInt` silently truncated
|
||
addresses > 24 bits to `uint32_t` low bits — `--text-base
|
||
0x100000000` would silently wrap to 0. Both now reject
|
||
out-of-range addresses with a clear error.
|
||
|
||
- **More runtime fixes (round 4):**
|
||
- `pow(x, y)` computed `n = -n` for the integer-y branch when
|
||
yi was INT_MIN (-32768); same signed-overflow UB pattern as
|
||
the print functions. Switched to unsigned magnitude.
|
||
- Added `perror(prefix)` — was missing from the runtime; common
|
||
pattern in portable code that reports I/O failure via
|
||
`errno + strerror`. Declared in stdio.h, implemented as
|
||
char-by-char emit through putchar (no fprintf dependency).
|
||
|
||
- **link816 `__heap_end` was hardcoded at $BF00**, ignoring where
|
||
`__heap_start` actually ended up. When BSS got auto-relocated
|
||
into LC1 ($D000+), heap_start ended up > heap_end and malloc
|
||
immediately returned NULL on every call — silently bricking any
|
||
program that allocated dynamic memory after the runtime grew
|
||
past the default-bss threshold. Heap_end now picks
|
||
$BF00 / $E000 based on where heap_start lands (and skips the IO
|
||
window if heap_start would have landed in $C000-$CFFF).
|
||
Smoke #102 added.
|
||
|
||
- **link816 rodata auto-skips IIgs IO window** ($C000-$CFFF). When
|
||
text+rodata grew past 0xC000 the rodata bytes silently corrupted
|
||
at runtime — string literals in the IO range read back as
|
||
hardware register values, breaking strcmp / strstr / printf / etc.
|
||
Now: rodata that would land in or cross $C000-$CFFF auto-skips
|
||
to $D000. Init_array gets the same treatment. Text that would
|
||
cross IO is hard-rejected at link time (no auto-fix possible —
|
||
PC fetches in IO would read hardware registers). This was the
|
||
root cause of the "tan/tanf triggers layout-sensitive failure"
|
||
symptom listed in older STATUS notes.
|
||
|
||
- **runInMame skips writes to IO window** during the binary load.
|
||
Without this, the zero-padding in the rodata-skip gap would
|
||
clobber soft switches (e.g. the LC1 RAM enable that crt0 sets
|
||
via $C083) when the loader naively wrote the entire image
|
||
byte-by-byte to memory.
|
||
|
||
- **link816 `--gc-sections` (default ON)** — discards sections not
|
||
reachable from the entry point (`__start` / `_start` / `main`
|
||
for the canonical crt0 setup) plus all `.init_array` sections.
|
||
Built on `-ffunction-sections` so each function is in its own
|
||
section. A minimal program with full runtime linked shrinks
|
||
from ~43KB to ~1.5KB. Adding `tan/tanf` to math.c (which
|
||
caused the latent layout-sensitive failure described above)
|
||
no longer pushes any test past the bank-0 limit. Tests that
|
||
intentionally check unreachable symbols pass `--no-gc-sections`
|
||
to opt out.
|
||
|
||
- **`fwrite(stdout, ...)` was a stub returning 0** even though
|
||
`stdout` has a working `putchar` route. Now actually writes
|
||
through `putchar` for stdout/stderr (only). Also gained the
|
||
same `size * nmemb` overflow guard as `calloc`.
|
||
|
||
## What's still needed for a "ship-ready" toolchain
|
||
|
||
- **softDouble.c -O2 — FIXED.** Marking `dclass` noinline (in
|
||
addition to `dpack`) drops register pressure in `__muldf3`/
|
||
`__divdf3`/`__adddf3` enough that greedy regalloc no longer
|
||
runs out. The previous blocker was that noinline-dclass would
|
||
write through pointer args via the DBR-relative `(d,s),y` mode
|
||
and corrupt caller data after a bank switch — that path now
|
||
goes through `STAptr/STBptr` which use `[$E0],Y` indirect-long
|
||
with the bank byte forced to 0, so DBR is irrelevant. All
|
||
three smoke build sites moved to `-O2`.
|
||
|
||
|
||
- **More of the C standard library**: real `<stdio.h>` file I/O
|
||
(`fopen`, `fread`, `fwrite`, `fseek` are currently stubs
|
||
returning success/zero) — would need a memory-backed FS or a
|
||
MAME hook. `<locale.h>` / `<signal.h>` / `<time.h>` are stubbed
|
||
(compile and return safe defaults). `<wchar.h>` mostly absent.
|
||
A `time()` impl wired to ReadTimeHex (Misc Tool $0D03) was
|
||
attempted but crashes MAME without the Tool Locator initialised
|
||
in crt0; `clock()` via VBL counter at $E1006B needs 24-bit
|
||
far-pointer support that the backend doesn't yet model.
|
||
|
||
- **C++ runtime support**: vtable layout for multiple inheritance,
|
||
RTTI, exceptions (or a documented `-fno-exceptions` requirement).
|
||
|
||
- **REP/SEP scheduling pass** (design doc §3.3): the current
|
||
prologue picks one M-mode for the whole function based on
|
||
whether any 8-bit accumulator value is used. A per-region
|
||
scheduler would reduce the SEP/REP wrap overhead on i8 stores.
|
||
|
||
- **Toolbox / IIgs system call bindings**: `iigs/toolbox.h` covers
|
||
the common entry points across Tool Locator, Memory Manager,
|
||
Misc Tools, QuickDraw II, Event Manager, Window Manager, plus
|
||
GS/OS Quit. Multi-arg wrappers (NewHandle, QDStartUp, MoveTo,
|
||
EMStartUp, GetNextEvent, NewWindow, CloseWindow) live in
|
||
`runtime/src/iigsToolbox.s` because the backend's inline-asm
|
||
constraints can't take memory operands. Single-arg / no-arg
|
||
wrappers stay inline. More routines (Menu Manager, Dialog
|
||
Manager, Standard File, Sound) still TBD.
|
||
|
||
- **Real-world program coverage**: the smoke tests are
|
||
microbenchmarks. A few known-good Apple IIgs C programs (e.g.
|
||
a textfile pager, a small game) compiled and run end-to-end
|
||
would catch issues no synthetic test currently exercises.
|
||
|
||
- **Cycle-time / size benchmarks vs Calypsi 5.16**: design doc §1
|
||
says the goal is to "match or exceed" Calypsi. We have neither
|
||
baseline numbers nor a comparison harness yet.
|