# Using llvm816

This document covers compiling a C program, linking it into an Apple IIgs
binary, and running it under MAME. It assumes you've followed
[INSTALL.md](INSTALL.md) and the install completed successfully.

If you've never used **clang** or a similar C compiler before, start with
[Quick orientation](#quick-orientation), which explains the moving parts.
If you already know what clang is, jump to
[Your first program](#your-first-program).

---

## Quick orientation

### What is clang?

Clang is a C / C++ compiler: the program that turns your `.c` source file
into machine code an actual CPU can execute. It's part of the LLVM project,
is the default C compiler on macOS, and is packaged by most modern Linux
distributions. If you've used `gcc` before, clang takes nearly the same
command-line flags.

A normal install of clang produces code for the machine it's running on
(x86-64 if you're on a typical Linux PC). Clang also has a **cross-compiler
mode**: pass `--target=` to make it emit code for a *different* CPU. The
W65816 (the Apple IIgs CPU) is one of the architectures we've added to a
fork of clang that ships with this project.

### What gets installed where

After `./setup.sh` completes, the project tree under your `llvm816/`
checkout looks roughly like this:

```
llvm816/                          ← repo root; everything is contained here
├── docs/                         ← this directory
├── runtime/                      ← C standard library + startup code
│   ├── build.sh                  ← script that builds the runtime .o files
│   ├── include/                  ← header files (<stdio.h>, etc.)
│   │   ├── stdio.h
│   │   ├── string.h
│   │   ├── ...
│   │   └── iigs/                 ← Apple IIgs-specific headers
│   │       ├── toolbox.h         ← ~1300 toolbox routine wrappers
│   │       ├── gsos.h
│   │       └── desktop.h
│   ├── src/                      ← sources for the runtime (.c and .s)
│   └── *.o                       ← compiled runtime objects (after build)
├── scripts/                      ← driver scripts
│   ├── runInMame.sh              ← run a binary in MAME and check memory
│   ├── benchCycles.sh            ← cycle-count benchmarks
│   └── smokeTest.sh              ← ~150 end-to-end correctness checks
├── src/                          ← OUR backend source (you compile from here)
├── tools/                        ← installed tools (~7 GB total)
│   ├── llvm-mos/                 ← LLVM source tree (~5 GB)
│   ├── llvm-mos-build/           ← built artifacts (~1.4 GB)
│   │   └── bin/
│   │       ├── clang             ← THE COMPILER YOU USE
│   │       ├── clang++           ← same, for C++
│   │       ├── llc               ← standalone IR → asm converter
│   │       ├── llvm-mc           ← standalone assembler
│   │       ├── llvm-objdump      ← disassembler
│   │       └── ...
│   ├── llvm-mos-sdk/             ← prebuilt llvm-mos SDK (~400 MB, mostly unused)
│   ├── link816                   ← OUR LINKER (single binary, ~120 KB)
│   ├── omfEmit                   ← turns flat binary → Apple IIgs OMF v2.1
│   ├── mame/                     ← Apple IIgs ROMs for MAME
│   ├── gsos/                     ← GS/OS 6.0.2 / 6.0.4 disk images
│   ├── calypsi/                  ← reference compiler for comparison (~580 MB)
│   └── orca-c/                   ← reference compiler (header sources)
├── demos/                        ← example IIgs programs
├── benchmarks/                   ← cycle-count benchmarks
├── compare/                      ← side-by-side ours-vs-Calypsi assembly
└── setup.sh                      ← one-shot installer
```

The two files you'll use most often:

| File | Purpose |
|---|---|
| **`tools/llvm-mos-build/bin/clang`** | The compiler. Pass `--target=w65816` to make it emit Apple IIgs code. |
| **`tools/link816`** | The linker. Takes `.o` files and produces a flat binary the IIgs can load. |

Nothing is installed into `/usr/local`, `/opt`, or anywhere else on your
system; the entire toolchain lives under your `llvm816/` checkout. To
uninstall, delete the directory.

### What about the system's `/usr/bin/clang`?

If your distribution provides a clang (most do), that's a *different* clang
for *your machine's* CPU.
It does **not** know about the W65816 target. When following this document,
always use the explicit path `./tools/llvm-mos-build/bin/clang` (or set an
alias / `$PATH`; see
[Setting up your environment](#setting-up-your-environment)).

### What the build process produces

When you compile a C file for the IIgs, the flow looks like this:

```
hello.c
   │
   │  clang --target=w65816          (cross-compile to 65816 machine code)
   ▼
hello.o                              (relocatable ELF object file)
   │
   │  + crt0.o + libc.o + libgcc.o   (runtime libraries you link in)
   │
   │  link816                        (our relocating linker)
   ▼
hello.bin                            (flat binary, loadable at $00:1000)
   │
   │  optionally: omfEmit hello.bin → hello.omf   (for GS/OS Loader)
   │
   │  scripts/runInMame.sh hello.bin
   ▼
runs in MAME's emulated Apple IIgs
```

Three stages:

1. **Compile**: clang turns `.c` into `.o`
2. **Link**: `link816` combines `.o` files + runtime libraries into a binary
3. **Run**: MAME boots an emulated IIgs and executes the binary

---

## Setting up your environment

To save typing, you can edit your `$PATH`, define shell aliases, or keep
typing explicit paths. Most examples in this document use explicit paths
relative to the checkout root so they work without any setup, but in
practice you'll want shortcuts.

### Option A: edit `$PATH` (recommended)

Add this to `~/.bashrc` (or `~/.zshrc`) so our tools are on your path:

```bash
export LLVM816_ROOT=$HOME/path/to/llvm816
export PATH="$LLVM816_ROOT/tools/llvm-mos-build/bin:$LLVM816_ROOT/tools:$PATH"
```

Then `source ~/.bashrc` (or restart your shell). After that you can just
type `clang --target=w65816 ...` without the path prefix.

> **Careful:** putting `tools/llvm-mos-build/bin` first on `$PATH` means
> *all* `clang` invocations in that shell go to our build, not the
> system clang. Ours still works for your machine's native target
> too (it's a multi-arch clang), but if you also need your distro's
> version, prefer Option B.
### Option B: shell aliases

In `~/.bashrc`:

```bash
LLVM816_ROOT=$HOME/path/to/llvm816
alias w65clang="$LLVM816_ROOT/tools/llvm-mos-build/bin/clang --target=w65816 -I $LLVM816_ROOT/runtime/include"
alias link816="$LLVM816_ROOT/tools/link816"
```

Then:

```bash
w65clang -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 ...
```

### Option C: nothing, just use full paths

Most examples in this document spell out the path relative to the checkout
root, so this works too. Verbose, but unambiguous.

---

## Your first program

Let's compile, link, and run a tiny program. Open a terminal in your
`llvm816/` checkout directory.

### 1. Write the source

Create `hello.c`:

```c
// hello.c: the smallest meaningful Apple IIgs program.
//
// Writes a value to bank-2 RAM at $02:5000, then halts. The MAME
// harness reads that memory cell to verify the result.

int main(void) {
    int x = 6 * 7;

    // Write directly to the 24-bit absolute address $02:5000. With
    // ptr32 mode (our default), constant pointers to >16-bit addresses
    // lower to `sta long $025000`, so no bank-switching is needed.
    *(volatile int *)0x025000 = x;

    while (1) {}   // halt; the harness reads memory + exits
    return 0;
}
```

### 2. Compile to a `.o` file

```bash
./tools/llvm-mos-build/bin/clang \
    --target=w65816 \
    -O2 \
    -I runtime/include \
    -c hello.c \
    -o hello.o
```

What each flag does:

| Flag | Meaning |
|---|---|
| `--target=w65816` | **Required.** Tells clang to emit W65816 machine code instead of the host CPU's code. |
| `-O2` | Optimization level. `-O2` is recommended; `-O0` works but produces 3-5× larger code. |
| `-I runtime/include` | Look for `<stdio.h>` etc. in our runtime headers. |
| `-c` | Compile only: produce a `.o`, don't link. |
| `-o hello.o` | Write the object to `hello.o`. |

If the command succeeds, you'll have a `hello.o` next to your `hello.c`.
You can inspect it:

```bash
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o | head -40
```

### 3. Link to a flat binary

```bash
./tools/link816 \
    -o hello.bin \
    --text-base 0x1000 \
    runtime/crt0.o \
    runtime/libc.o \
    runtime/libgcc.o \
    hello.o
```

Each argument:

| Argument | Why |
|---|---|
| `-o hello.bin` | Output file. |
| `--text-base 0x1000` | Where the code goes in memory. `0x1000` is conventional (the first 4 KB of bank 0 is reserved for the stack + direct page). |
| `runtime/crt0.o` | **Must come first.** The C runtime startup: sets up the stack, calls `main`, halts cleanly on return. |
| `runtime/libc.o` | Core C library (`printf`, `malloc`, `strlen`, etc.). |
| `runtime/libgcc.o` | Compiler-provided helpers for things the 65816 can't do natively (16×16 multiply, 32-bit divide, etc.). Required for almost every program. |
| `hello.o` | Your code. |

`link816` will print something like:

```
linked: text=[0x1000+128] rodata=[0x1080+0] bss=[0x1100+8] -> hello.bin
```

That tells you the code is 128 bytes, there is no read-only data, and BSS
is 8 bytes.

### 4. Run it in MAME

```bash
bash scripts/runInMame.sh hello.bin --check 0x025000=002a
```

`0x002a` is hexadecimal for 42 (= 6 × 7), and `0x025000` is the 24-bit
address `bank $02 + offset $5000`, which is where your program wrote `x`.

The script boots MAME's emulated Apple IIgs, loads your binary at
`$00:1000`, runs for 5 seconds, reads memory at `$02:5000`, and compares
it to the expected value. A pass looks like:

```
MAME-LOADED bytes=128
MAME-READ addr=0x025000 val=0x002a
[llvm816] MAME OK: 1 reads matched
```

If you get `MAME mismatch`, your program wrote a different value (or no
value). The most common cause in a new project is writing to a bank-0
address like `*(volatile int *)0x5000 = x;` (a plain `$5000`) instead of a
24-bit address like `*(volatile int *)0x025000 = x;` (`$02:5000`). The
verification harness reads bank 2; writes to bank 0 go to a different RAM
cell and the comparison fails.
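The address arithmetic behind the `--check` argument can be sketched on the host. This is only an illustration: `bankOf`, `offsetOf`, and `formatCheckArg` are hypothetical helper names that are not part of the toolchain, and the sketch assumes a 16-bit expected value printed as four hex digits, matching the `0x025000=002a` example above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bank byte of a 24-bit IIgs address, e.g. 0x025000 -> 0x02. */
static uint8_t bankOf(uint32_t addr24) {
    assert(addr24 <= 0xFFFFFFu);            /* only 24 bits are meaningful */
    return (uint8_t)(addr24 >> 16);
}

/* In-bank offset of a 24-bit IIgs address, e.g. 0x025000 -> 0x5000. */
static uint16_t offsetOf(uint32_t addr24) {
    assert(addr24 <= 0xFFFFFFu);
    return (uint16_t)(addr24 & 0xFFFFu);
}

/* Hypothetical helper: format "0xAAAAAA=VVVV" for a 16-bit expected value.
 * Returns the character count reported by snprintf. */
static int formatCheckArg(char *out, size_t outSize,
                          uint32_t addr24, uint16_t expected) {
    assert(addr24 <= 0xFFFFFFu);
    return snprintf(out, outSize, "0x%06x=%04x",
                    (unsigned)addr24, (unsigned)expected);
}
```

Calling `formatCheckArg(buf, sizeof buf, 0x025000, 6 * 7)` fills `buf` with the same string passed to `runInMame.sh` in step 4.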
---

## Compiling C: full reference

The compiler is invoked just like a normal clang, with one extra flag:

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c source.c -o source.o
```

### Recommended flags

| Flag | Meaning |
|---|---|
| `--target=w65816` | Selects the W65816 backend (required). |
| `-O2` | Default optimization. `-O0` and `-O1` work but produce ~3-5× larger code. `-O3` is the same as `-O2` for our backend. |
| `-ffunction-sections` | Put each function in its own section. Lets the linker drop unreferenced functions (smaller binaries). |
| `-I runtime/include` | Find `<stdio.h>`, `<string.h>`, etc. |
| `-c` | Compile only: produce `.o`, don't link. Without this, clang tries to invoke the host linker, which doesn't understand 65816 objects. |
| `-g` | Emit DWARF debug info. Useful with `link816 --debug-out`. |
| `-S` | Emit assembly (`.s`) instead of an object file. Useful for inspecting codegen. |

### What works at `-O2`

- All C99 scalars: `int8_t` through `int64_t`, signed and unsigned, all
  arithmetic operators
- Soft `float` and `double` (full IEEE-754 with round-to-nearest-even)
- Pointers, arrays, structs, unions, bitfields
- All control flow: `if`, `for`, `while`, `goto`, `switch`, recursion
- `<stdarg.h>` varargs
- `<setjmp.h>` setjmp/longjmp (SJLJ, no DWARF unwinder)
- Inline `__asm__` with `"a"`, `"x"`, `"y"` register constraints
- C++ subset: classes, single+multiple inheritance, virtual functions,
  RTTI, `dynamic_cast`. **No DWARF-unwound exceptions** (the unwinder is
  not implemented); limited SJLJ exceptions work via `-fsjlj-exceptions`.

See [STATUS.md](../STATUS.md) for the full feature matrix.

---

## Linking: full reference

`link816` produces a flat binary suitable for direct execution (loaded
into a fixed address) or, with `--omf`, an OMF binary that the GS/OS
Loader can load and relocate.
### Raw binary (fixed-address load)

```bash
./tools/link816 -o output.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o yourprog.o
```

- `--text-base 0x1000`: where code is loaded. `$1000` is conventional;
  the first 4 KB of bank 0 (`$00:0000`-`$00:0FFF`) is reserved for the
  stack and direct page.
- `--bss-base 0x020000`: where uninitialized data (BSS) goes. By default
  the linker places BSS immediately after rodata; supplying a different
  bank is useful when your text + data exceeds a single bank's free space.
- `--map output.map`: writes a human-readable map file showing every
  symbol's address. Useful for debugging.
- `--no-gc-sections`: keep all functions, even unreferenced ones.
  Section garbage collection (`--gc-sections`) is on by default and drops
  unused code, shrinking binaries dramatically (a minimal program with the
  full runtime linked goes from ~43 KB to ~1.5 KB).

### Runtime libraries

Each runtime library is built once by `runtime/build.sh` and lives as a
`.o` in `runtime/`. Link only what you use; `--gc-sections` drops the rest.

| Library | When you need it |
|---|---|
| `runtime/crt0.o` | **Always.** C runtime startup. |
| `runtime/crt0Gsos.o` | Instead of `crt0.o` for programs launched by the GS/OS Loader. |
| `runtime/libc.o` | `printf`, `malloc`, `strlen`, the usual. Almost always. |
| `runtime/libgcc.o` | Compiler helpers: multiply, divide, shift. Almost always. |
| `runtime/snprintf.o` | If you use `sprintf` / `snprintf` / `vsnprintf`. |
| `runtime/sscanf.o` | If you use `sscanf` / `vsscanf` / `fscanf`. |
| `runtime/softDouble.o` | If you use `double`-precision arithmetic anywhere. |
| `runtime/softFloat.o` | If you use `float`-precision arithmetic. |
| `runtime/math.o` | `fabs`, `floor`, `sqrt`, `sin`, `cos`, `pow`, etc. |
| `runtime/qsort.o` | `qsort` / `bsearch`. |
| `runtime/strtol.o` | `strtol` / `strtoul` / `atoi` / `atol`. |
| `runtime/strtok.o` | `strtok` / `strtok_r`. |
| `runtime/extras.o` | `strcat`, `strncat`, `llabs`, `rand`/`srand`. |
| `runtime/timeExt.o` | `time` / `gmtime` / `mktime`. |
| `runtime/iigsToolbox.o` | Apple IIgs Toolbox call wrappers. |
| `runtime/iigsGsos.o` | GS/OS class-1 call wrappers (file I/O, etc.). |
| `runtime/desktop.o` | `startdesk()` helper used by demos that need a Window Manager environment. |
| `runtime/libcxxabi.o` | C++ ABI runtime (vtable RTTI, `dynamic_cast`). |
| `runtime/libcxxabiSjlj.o` | C++ SJLJ-exception support (paired with `-fsjlj-exceptions`). |

To (re)build the runtime:

```bash
bash runtime/build.sh
```

### Multi-segment OMF (for GS/OS Loader)

For programs >60 KB (the usable bank-0 limit after the stack, direct page,
and I/O window are subtracted), build a multi-segment OMF that the GS/OS
Loader places across banks:

```bash
./tools/link816 -o myprog.bin \
    --text-base 0x1000 \
    --segment-cap 0xB000 \
    --segment-bank-base 0x040000 \
    --manifest myprog.manifest.json \
    runtime/crt0Gsos.o ... yourprog.o

./tools/omfEmit --manifest myprog.manifest.json --expressload -o myprog.omf
```

See [`docs/multiSegmentPlan.md`](multiSegmentPlan.md) for details and
[`scripts/runMultiSeg.sh`](../scripts/runMultiSeg.sh) for a working example.

---

## Running under MAME

[`scripts/runInMame.sh`](../scripts/runInMame.sh) launches MAME's
`apple2gs` driver, loads your binary at `$00:1000`, runs for a few
seconds, and reads a memory cell:

```bash
bash scripts/runInMame.sh prog.bin                          # just run for ~5 s
bash scripts/runInMame.sh prog.bin --check 0x025000=002a    # verify a value
bash scripts/runInMame.sh prog.bin 0x025000 0x025002        # dump these addresses
```

- `--check ADDR=VALUE` returns exit 0 if memory matches, exit 1 if not.
  Used by the smoke tests and CI.
- The bare-address form dumps the value without comparing.

The runner is headless by default (`-video none` +
`SDL_VIDEODRIVER=dummy`), so it runs in a terminal-only environment.

Useful environment variables:

| Variable | Default | Purpose |
|---|---|---|
| `MAME_CHECK_FRAME` | `300` | Frame at which to read the check address (300 ≈ 5 s at 60 Hz). |
| `MAME_SECS` | `6` | How long to let MAME run before forcibly exiting. |
| `MAME_TIMEOUT` | `30` | Wall-clock timeout for the whole MAME invocation. |
| `MAME_RAMSIZE` | unset | Override the emulated RAM size (e.g. `8M`). |

### Writing to non-bank-0 RAM

The 65816 has two registers that select which bank a memory access goes to:

- **PBR** (Program Bank Register): selects the bank for instruction
  fetches. Set by `jsl long_addr` and `rtl`.
- **DBR** (Data Bank Register): selects the bank for 16-bit absolute data
  accesses like `lda $5000`.

When the IIgs boots, DBR defaults to `$00`. Bank `$00` contains the I/O
window at `$C000-$CFFF`, the language card area, and the stack, so it is
not a great place for general data.

**With ptr32 mode** (the default: pointers are 32 bits wide and hold
24-bit addresses), constant pointers to non-bank-0 addresses lower
automatically to long (24-bit absolute) instructions that *ignore DBR*:

```c
*(volatile int *)0x025000 = 42;     // → sta long $025000 (DBR-independent)
*(volatile char *)0xE10068 = 1;     // → sta long $E10068 (vert position reg)
unsigned char v = *(volatile unsigned char *)0xE0C025;   // I/O soft-switch read
```

For typical programs (writing a result to a verification address, poking
IIgs hardware registers, accessing the SHR framebuffer at `$E1:2000`) you
just dereference the absolute pointer and the compiler does the right
thing. **DBR doesn't matter.**

### Legacy: the `switchToBank2()` idiom

You may see older code (pre-ptr32 migration) using a `switchToBank2()`
helper that pokes DBR to `$02` so that subsequent 16-bit-absolute stores
like `*(volatile X*)0x5000 = v` land in bank 2:

```c
__attribute__((noinline))
void switchToBank2(void) {
    __asm__ volatile (
        "sep #0x20\n"           // 8-bit A
        ".byte 0xa9,0x02\n"     // lda #2 (hand-encoded)
        "pha\n"                 // push A
        "plb\n"                 // pop into DBR
        "rep #0x20\n"           // back to 16-bit A
    );
}

// then:
switchToBank2();
*(volatile int *)0x5000 = x;
```

This still works but is **no longer needed** for new code.
Prefer the direct 24-bit pointer form (`*(volatile int *)0x025000 = x;`):
it's clearer, requires no inline asm, and produces fewer instructions
because the bank byte is encoded inline.

There's still one case where the helper is useful: if you do a *large
amount* of data work in a single bank and want every store to be 3 bytes
(`sta $5000,X` etc.) instead of 4 bytes (`sta long $025000,X`). In that
case, set DBR once with the helper above and use 16-bit-absolute
addresses afterward. Otherwise, the direct form is simpler.

### What never needs bank-switching

- **Local variables on the stack**: stack-relative accesses bypass DBR.
- **Direct-page accesses**: `lda $D0` always reads `$00:00D0`.
- **`[dp],Y` indirect-long pointers**: they carry their own bank byte.
- **Function calls**: `jsl` uses PBR + a long destination.
- **Pointers in ptr32 mode**: every C pointer is 32 bits, so dereferencing
  any pointer (even one to bank 0) generates DBR-independent code.

---

## Worked examples

### Recursion + printing

```c
// fib.c
#include <stdio.h>   // snprintf

unsigned long fib(unsigned n) {
    if (n < 2) return n;
    return fib(n-1) + fib(n-2);
}

int main(void) {
    char buf[32];
    int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));

    // Copy the formatted string into bank-2 RAM at $02:5000 so the
    // MAME harness can read it back. Each store goes through a 24-bit
    // long-address write, so no bank-switching is needed.
    for (int i = 0; i <= len; i++)
        ((volatile char *)0x025000)[i] = buf[i];

    while (1) {}
}
```

Build (snprintf needs soft-double + sscanf to link cleanly):

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c fib.c -o fib.o

./tools/link816 -o fib.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o \
    runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o \
    fib.o

bash scripts/runInMame.sh fib.bin --check 0x025000=0066   # 'f' (start of "fib")
```

### Apple IIgs Toolbox

```c
// hello_gs.c
#include <iigs/toolbox.h>

int main(void) {
    SysBeep();
    while (1) {}
}
```

Build (note `crt0Gsos.o` instead of `crt0.o`, which sets up the toolbox
environment):

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c hello_gs.c -o hello_gs.o

./tools/link816 -o hello_gs.bin --text-base 0x1000 \
    runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
    runtime/libgcc.o hello_gs.o
```

Programs that call the toolbox usually run under real GS/OS rather than in
the headless harness. See `demos/launch.sh` and `demos/build.sh` for a
working pipeline.

---

## Advanced: pointer-deref code generation

The W65816 backend treats every pointer as 32-bit (`p:32:16` datalayout,
so `sizeof(void *) == 4` from the C compiler's perspective). The high two
bytes carry the bank byte plus a pad byte; the low two carry the in-bank
offset. This lets a single C pointer reach any byte in the IIgs's 24-bit
address space.

A pointer dereference has to read up to 24 bits of address to know which
bank to touch. The CPU's `[dp],Y` (indirect-long-Y, opcode 0xB7) reads a
24-bit pointer from a direct-page slot and uses it as the effective
address: three bytes wide, bank byte explicit. This is the **safe
default** path and it works regardless of where the target memory lives.

There are two optimizations layered on top of the default path. One is
**always on** and safe. The other is **opt-in via a flag** and needs care.
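The pointer layout and the `[dp],Y` address calculation described above can be modelled on the host. This is a sketch, not backend code: `Ptr32`, `makePtr32`, and `derefAddress` are hypothetical names, and it assumes the pad byte sits in bits 24-31 and is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of a ptr32 value (assumed layout):
 * bits 0-15 = in-bank offset, bits 16-23 = bank byte, bits 24-31 = pad. */
typedef uint32_t Ptr32;

static Ptr32 makePtr32(uint8_t bank, uint16_t offset) {
    return ((uint32_t)bank << 16) | offset;
}

/* Effective address of a [dp],Y style access: the 24-bit base read from
 * the direct-page slot plus the 16-bit Y index, truncated to 24 bits.
 * DBR never enters the calculation. */
static uint32_t derefAddress(Ptr32 base, uint16_t y) {
    assert((base >> 24) == 0);            /* pad byte assumed clear */
    return (base + y) & 0xFFFFFFu;
}
```

For example, `derefAddress(makePtr32(0x02, 0x5000), 4)` models an access to `$02:5004`, and an index that runs past the end of a bank carries into the next one because the addition is 24 bits wide.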
### Layer 1: constant-offset peeling (default on, always safe)

When you write `s->c` for a struct field at offset `4`, the natural code
is "compute `s + 4`, then deref". Layer 1 recognizes that `[dp],Y` already
has a Y register that's added to the 24-bit pointer on the deref. Instead
of computing `s + 4` first, the backend stages the **base pointer** at
`$E0..$E2` and loads `Y = #4` for the deref. This saves three instructions
per struct-field access (the `clc; adc #4; ...; adc #0` carry chain).

A consecutive-access CSE peephole shares the `$E0/$E2` staging between
adjacent derefs of the same base, so `s->a + s->b + s->c + s->d` stages
once and emits four `ldy #K; lda [$E0],Y` pairs.

There's nothing to enable or disable. On its own this was about a 1% size
win across Lua. It's always on because it's structurally equivalent to the
un-optimized code: the same 24-bit deref, just with the offset folded into
Y instead of pre-added to the pointer.

### Layer 2: `-mllvm -w65816-dbr-safe-ptrs` (opt-in, unsafe if misused)

The default `[dp],Y` deref needs three bytes of staging at `$E0..$E2`
because it reads a 24-bit pointer. Calypsi uses `lda (d,S),Y` (opcode
0xB3, stack-rel-indirect-Y) for the same effect in ONE instruction, but
that opcode reads only **16 bits** of pointer. The bank byte is implicitly
DBR.

When you pass `-mllvm -w65816-dbr-safe-ptrs`, our backend uses the same
one-instruction path: it spills only the low 16 bits of the pointer to a
stack slot, sets Y to the offset, and emits `lda (slot,S),Y` (or
`sta (slot,S),Y`). Bank byte = whatever DBR holds at runtime.

Per-deref cost drops from ~5 instructions to 1. Lua 5.1.5 shrinks by
20.6% with the flag on.

**This is correct only when every pointer dereferenced in the TU points to
memory inside DBR's current bank.** Some examples:

| Pointer | Bank? | Safe with the flag? |
|---|---|---|
| `malloc()` result | DBR's bank (crt0 sets DBR to load bank; malloc allocates from BSS heap there) | Yes |
| Global variable address | DBR's bank (linker puts globals in the load segment) | Yes |
| `&local_array[i]` in a stack frame | Bank 0 (stack is always bank 0) | Yes IF DBR is 0 (typical) |
| Pointer returned by GS/OS Loader | The Loader's bank (might differ from yours) | **No**, would miscompile |
| Pointer cast from a `0x010000+addr` integer literal in bank 1 | Bank 1 | **No** if DBR is not bank 1 |
| `&ROMVECTORS[0]` from `iigs/`-style headers | Various IIgs system banks | **No** in general |

For Lua, Picol, and plain C programs that allocate via `malloc` and
operate on globals, this flag is safe. For GS/OS demos that interact with
Loader-returned segments or system memory, it would miscompile.

Default is **off**. Opt in per-TU:

```bash
clang --target=w65816 -O2 -mllvm -w65816-dbr-safe-ptrs -c hot.c -o hot.o
```

If you set the flag and your code does dereference cross-bank pointers,
the symptom is silent wrong-address reads: typically a read from the same
in-bank offset but in DBR's bank instead of the intended one. No abort, no
diagnostic.

**Mixing safely:** the flag is per-TU. You can compile your hot
struct-heavy code with the flag and your bank-aware code without. The two
`.o` files link cleanly together. Per-function or per-parameter control
isn't supported yet.

#### When the slot offset overflows 8 bits

`lda (d,S),Y` has an 8-bit `d` field, so the maximum slot offset is 255
from SP. If the function's frame is large enough that the spill slot
exceeds that, PEI emits a fallback sequence that long-indirects the slot
via `[$F6],Y` (the function's frame pointer), then stages at `$E0..$E2`
and derefs via `[$E0],Y`. This is ~8 instructions, which is worse than the
plain `[dp],Y` path the flag was meant to replace. Functions that hit this
need `usesDpFP=true` (set automatically for large frames); otherwise PEI
emits a fatal error.
In practice you'll only see this on functions with hundreds of local
variables.

### Inline-threshold tuning (default lowered to 50)

LLVM's default inline-cost threshold is 225, tuned for desktop CPUs where
call overhead is high relative to the size of the inlined body. On W65816
a `jsl long:foo` is just 4 bytes / ~8 cycles, but every inlined pointer
dereference expands to multiple instructions even with Layer 2. Aggressive
inlining bloats code without commensurate cycle wins.

The W65816 backend lowers the default to **50**. Calibration:

| Threshold | Lua size | CoreMark size | Cycle benches |
|----------:|---------:|--------------:|--------------|
| 225 (LLVM stock) | 1.47× Calypsi | (not measured) | baseline |
| 75 | 1.16× | 0.87× | identical |
| **50 (current)** | **1.13×** | **0.79×** | identical |
| 25 | 1.11× | 0.79× | identical |

At 225, Lua's `index2adr` (a multi-branch helper called 41 times in
`lapi.c`) was inlined into every API entry, adding ~2 KB per file, and
CoreMark's `matrix_test` was 17× Calypsi because the inliner copied 5
nested-loop helpers into it. At 50, both regressions vanish and the cycle
benchmarks are unchanged.

To override (e.g. on size-sensitive ROMs or speed-critical loops):

```bash
# Force aggressive inlining (back to LLVM default)
clang --target=w65816 -O2 -mllvm -inline-threshold=225 -c file.c -o file.o

# Force MORE conservative inlining
clang --target=w65816 -O2 -mllvm -inline-threshold=10 -c file.c -o file.o
```

A function marked `__attribute__((always_inline))` is always inlined
regardless of threshold. A function marked `__attribute__((noinline))` is
never inlined. Use these to override the global threshold for specific
cases.
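As an illustration of the two attributes, here is a minimal sketch. The function names `clampToByte` and `sumClamped` are made up for this example, and which helpers deserve which attribute in a real program is a judgment call that depends on call-site count and body size.

```c
#include <assert.h>
#include <stddef.h>

/* Tiny helper: always inlined, whatever the inline threshold is. */
static inline __attribute__((always_inline)) int clampToByte(int v) {
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

/* Larger helper: never inlined, so each call site stays a plain call
 * instead of receiving its own copy of the loop. */
__attribute__((noinline)) int sumClamped(const int *values, size_t count) {
    assert(values != NULL || count == 0);
    int total = 0;
    for (size_t i = 0; i < count; i++)
        total += clampToByte(values[i]);
    return total;
}
```

The attributes apply per function, so they can be combined with any `-mllvm -inline-threshold=N` setting for the rest of the translation unit.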
### Summary: which options to use when

| Goal | Compile flag |
|---|---|
| Smallest, safest binary (default) | `clang --target=w65816 -O2 ...` (Layer 1 is on, Layer 2 is off, threshold=50) |
| Smallest binary for code that touches only same-bank memory | Add `-mllvm -w65816-dbr-safe-ptrs` |
| Fastest possible code (size be damned) | Add `-mllvm -inline-threshold=500` |
| Reproduce LLVM's stock inlining behavior | Add `-mllvm -inline-threshold=225` |
| Explicit control over individual inlining decisions | Mark hot helpers `__attribute__((noinline))` explicitly |

---

## Inline assembly

The W65816 backend supports `__asm__` with operand constraints `"a"`,
`"x"`, `"y"`:

```c
unsigned short addOne(unsigned short x) {
    unsigned short r;
    __asm__("inc a" : "=a"(r) : "a"(x));
    return r;
}
```

Multi-instruction asm and raw bytes both work:

```c
__asm__ volatile (
    "sep #0x20\n"
    ".byte 0x68\n"      // pla
    "rep #0x20\n"
);
```

The `.byte` form is needed when llvm-mc can't yet parse an opcode
literally (some 65816 addressing modes still have gaps in the assembler).
Hand-encoding is a stopgap; report opcodes that need it.
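When hand-encoding, it helps to check the byte sequence on the host before pasting it into a `.byte` directive. The sketch below is only a host-side illustration: `encodeSetDbr` and the `OP_*` names are hypothetical, and the opcode values are the standard 65816 encodings for the instructions used in the legacy `switchToBank2()` helper shown earlier in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard 65816 opcode bytes for the instructions hand-encoded here. */
enum {
    OP_REP_IMM = 0xC2,  /* rep #imm8 */
    OP_SEP_IMM = 0xE2,  /* sep #imm8 */
    OP_LDA_IMM = 0xA9,  /* lda #imm (1-byte operand while A is 8-bit) */
    OP_PHA     = 0x48,
    OP_PLA     = 0x68,
    OP_PLB     = 0xAB
};

/* Hypothetical helper: write the bytes for
 *   sep #$20 ; lda #bank ; pha ; plb ; rep #$20
 * Returns the number of bytes written, or 0 if the buffer is too small. */
static size_t encodeSetDbr(uint8_t *out, size_t outSize, uint8_t bank) {
    const uint8_t code[] = {
        OP_SEP_IMM, 0x20, OP_LDA_IMM, bank, OP_PHA, OP_PLB, OP_REP_IMM, 0x20
    };
    assert(out != NULL);
    if (outSize < sizeof code)
        return 0;
    for (size_t i = 0; i < sizeof code; i++)
        out[i] = code[i];
    return sizeof code;
}
```

For bank `$02` this yields `E2 20 A9 02 48 AB C2 20`, which matches the `.byte 0xa9,0x02` fragment in the legacy helper.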
---

## Tools reference

| Tool | Location | Purpose |
|---|---|---|
| `clang` | `tools/llvm-mos-build/bin/clang` | C / C++ compiler |
| `clang++` | `tools/llvm-mos-build/bin/clang++` | C++ driver |
| `llc` | `tools/llvm-mos-build/bin/llc` | Standalone codegen (`.ll` → `.s`) |
| `llvm-mc` | `tools/llvm-mos-build/bin/llvm-mc` | Assembler |
| `llvm-objdump` | `tools/llvm-mos-build/bin/llvm-objdump` | Disassembler |
| `link816` | `tools/link816` | Our relocating linker |
| `omfEmit` | `tools/omfEmit` | Emit OMF v2.1 binary from `link816` output |
| `mame` | system `apt` install | Apple IIgs emulator |

---

## Debugging

### Look at the asm

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -S -o prog.s prog.c
cat prog.s
```

### Look at the MIR after each backend pass

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -mllvm -print-after-all -S prog.c 2>&1 | less
```

Useful pass names to filter on:

| Pass name | What it does |
|---|---|
| `w65816-isel` | SDAG → MachineInstr selection |
| `w65816-widen-acc16` | Promote Acc16 vregs to Wide16 (regalloc help) |
| `w65816-stack-slot-cleanup` | Remove redundant spill/reload |
| `w65816-stackrel-to-img` | Promote hot stack slots to DP IMG slots |
| `w65816-stack-slot-merge` | Collapse PHI src/dst slot pairs |
| `w65816-branch-expand` | Long-distance Bxx → INV_Bxx skip; BRA |

### Single-pass filter

```bash
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -mllvm -print-after=w65816-isel \
    -mllvm -filter-print-funcs=myfunc \
    -S prog.c 2>&1 | less
```

### Disassemble an object file

```bash
./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o
```

---

## Cycle-count benchmarks

13 microbenchmarks live under [`benchmarks/`](../benchmarks/): eight
integer/string micro-benches, three soft-double FP benches (`dadd`,
`dmul`, `ddiv`), and two "game-like" workloads: `particles` (32-particle
physics tick with i16 bounce/wall collision) and `mandelbrot` (4×4
fixed-point Mandelbrot tile exercising i32
multiply and conditional control flow).

```bash
bash scripts/benchCycles.sh
```

Output (2026-05-21):

```
| Benchmark | Per-iteration cycles |
|-----------|---------------------:|
| bsearch | 127 cyc/iter (100 iters) |
| crc32 | <65 (under timer resolution) |
| dadd | 1157 cyc/iter (10 iters) |
| ddiv | 1261 cyc/iter (10 iters) |
| dmul | 1033 cyc/iter (10 iters) |
| dotProduct | 144 cyc/iter (100 iters) |
| fib | 97 cyc/iter (100 iters) |
| mandelbrot | 11570 cyc/iter (1 iter, GRID=4 MAX_ITER=8) |
| memcmp | 113 cyc/iter (100 iters) |
| particles | 2253 cyc/iter (3 iters, N=32) |
| popcount | 93 cyc/iter (100 iters) |
| strcpy | 91 cyc/iter (100 iters) |
| sumOfSquares | 126 cyc/iter (100 iters) |
```

The legacy `scripts/benchCyclesPrecise.sh` (per-call cycle count via
`emu.time()`) is still available but slower to run.

The [`compare/`](../compare/) directory has side-by-side `.s` files vs
Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:

```bash
bash compare/regen.sh
```

---

## Known limitations

- **C++ exceptions** are not implemented for DWARF unwinding. `try` /
  `catch` compiles but doesn't unwind. `-fsjlj-exceptions` works for
  limited SJLJ-style throwing.
- **`stdin`**: reads always return EOF. `scanf` compiles but isn't
  useful. Use `sscanf` on a buffer instead.
- **File I/O** through `fopen` requires a backing implementation. The
  default `mfs` backing (memory-file-system) lets you simulate files via
  `mfsRegister()`, which is useful for tests but not for real disk I/O.
  GS/OS file I/O works via `runtime/iigsGsos.o` if you link against the
  GS/OS runtime.
- **`fork`/`exec`**: not applicable on a 65816, so there is no support.
- **Code generation gotcha:** very large stack frames (>200 bytes)
  trigger FP-relative addressing. Most programs fit under that limit. See
  the `frame-rel` discussion in
  [LLVM_65816_DESIGN.md](../LLVM_65816_DESIGN.md).
- **Three Lua functions** (`luaV_execute`, `symbexec`, `auxsort`) hit the
  greedy register allocator's complexity budget.
  Workaround: compile those TUs with `-mllvm -regalloc=basic`. Documented
  in [`tests/lua/README.md`](../tests/lua/README.md).

---

## Where to go next

- **Building real GS/OS apps:** see
  [`docs/multiSegmentPlan.md`](multiSegmentPlan.md) and the
  `demos/launch.sh` script for booting through real GS/OS 6.0.2 in MAME.
  The 9 demos under `demos/` are reasonable starting points.
- **Backend internals (you're hacking on the compiler):**
  [LLVM_65816_DESIGN.md](../LLVM_65816_DESIGN.md).
- **Smoke tests:** `scripts/smokeTest.sh` runs ~150 end-to-end checks.
  Read it for examples of every feature in action.
- **Cycle-bench a Lua port or other real-world C:** see
  [`tests/lua/README.md`](../tests/lua/README.md) for the recipe
  (vendoring + per-file regalloc tuning + libc stubs).