65816-llvm-mos/docs/USAGE.md
2026-05-25 21:00:32 -05:00


Using llvm816

This document covers compiling a C program, linking it into an Apple IIgs binary, and running it under MAME. It assumes you've followed INSTALL.md and the install completed successfully.

If you've never used clang or a similar C compiler before, start with Quick orientation — it explains the moving parts. If you already know what clang is, jump to Your first program.


Quick orientation

What is clang?

Clang is a C / C++ compiler: the program that turns your .c source file into machine code an actual CPU can execute. It's part of the LLVM project, is the default C compiler on macOS, and is packaged by most modern Linux distributions (where gcc is usually the default). If you've used gcc before, clang takes nearly the same command-line flags.

A normal install of clang produces code for the machine it's running on — x86-64 if you're on a typical Linux PC. Clang has a cross-compiler mode: pass --target=<arch> to make it emit code for a different CPU. The W65816 (the Apple IIgs CPU) is one of the architectures we've added to a fork of clang that ships with this project.

What gets installed where

After ./setup.sh completes, the project tree under your llvm816/ checkout looks roughly like this:

llvm816/                            ← repo root; everything is contained here
├── docs/                           ← this directory
├── runtime/                        ← C standard library + startup code
│   ├── build.sh                    ← script that builds the runtime .o files
│   ├── include/                    ← header files (<stdio.h>, etc.)
│   │   ├── stdio.h
│   │   ├── string.h
│   │   ├── ...
│   │   └── iigs/                   ← Apple IIgs-specific headers
│   │       ├── toolbox.h           ← ~1300 toolbox routine wrappers
│   │       ├── gsos.h
│   │       └── desktop.h
│   ├── src/                        ← sources for the runtime (.c and .s)
│   └── *.o                         ← compiled runtime objects (after build)
├── scripts/                        ← driver scripts
│   ├── runInMame.sh                ← run a binary in MAME and check memory
│   ├── benchCycles.sh              ← cycle-count benchmarks
│   └── smokeTest.sh                ← ~150 end-to-end correctness checks
├── src/                            ← OUR backend source (you compile from here)
├── tools/                          ← installed tools (~7 GB total)
│   ├── llvm-mos/                   ← LLVM source tree (~5 GB)
│   ├── llvm-mos-build/             ← built artifacts (~1.4 GB)
│   │   └── bin/
│   │       ├── clang               ← THE COMPILER YOU USE
│   │       ├── clang++             ← same, for C++
│   │       ├── llc                 ← standalone IR → asm converter
│   │       ├── llvm-mc             ← standalone assembler
│   │       ├── llvm-objdump        ← disassembler
│   │       └── ...
│   ├── llvm-mos-sdk/               ← prebuilt llvm-mos SDK (~400 MB, mostly unused)
│   ├── link816                     ← OUR LINKER (single binary, ~120 KB)
│   ├── omfEmit                     ← turns flat binary → Apple IIgs OMF v2.1
│   ├── mame/                       ← Apple IIgs ROMs for MAME
│   ├── gsos/                       ← GS/OS 6.0.2 / 6.0.4 disk images
│   ├── calypsi/                    ← reference compiler for comparison (~580 MB)
│   └── orca-c/                     ← reference compiler (header sources)
├── demos/                          ← example IIgs programs
├── benchmarks/                     ← cycle-count benchmarks
├── compare/                        ← side-by-side ours-vs-Calypsi assembly
└── setup.sh                        ← one-shot installer

The two files you'll use most often:

| File | Purpose |
|------|---------|
| tools/llvm-mos-build/bin/clang | The compiler. Pass --target=w65816 to make it emit Apple IIgs code. |
| tools/link816 | The linker. Takes .o files and produces a flat binary the IIgs can load. |

Nothing is installed into /usr/local, /opt, or anywhere else on your system — the entire toolchain lives under your llvm816/ checkout. To uninstall, delete the directory.

What about the system's /usr/bin/clang?

If your distribution provides a clang (most do), that's a different clang for your machine's CPU. It does not know about the W65816 target. When following this document, always use the full path ./tools/llvm-mos-build/bin/clang (or set an alias / $PATH — see Setting up your environment).

What the build process produces

When you compile a C file for the IIgs, the flow looks like this:

hello.c
   │
   │  clang --target=w65816    (cross-compile to 65816 machine code)
   ▼
hello.o                        (relocatable ELF object file)
   │
   │  + crt0.o + libc.o + libgcc.o      (runtime libraries you link in)
   │
   │  link816                  (our relocating linker)
   ▼
hello.bin                      (flat binary, loadable at $00:1000)
   │
   │  optionally: omfEmit hello.bin → hello.omf  (for GS/OS Loader)
   │
   │  scripts/runInMame.sh hello.bin
   ▼
runs in MAME's emulated Apple IIgs

Three stages:

  1. Compile: clang turns .c into .o
  2. Link: link816 combines .o files + runtime libraries into a binary
  3. Run: MAME boots an emulated IIgs and executes the binary

Setting up your environment

To save typing, you can either edit your $PATH or use absolute paths. The rest of this document uses absolute paths so the examples work without any setup, but in practice you'll want shortcuts.

Option A: add the tools to $PATH

Add this to ~/.bashrc (or ~/.zshrc) so our tools are on your path:

export LLVM816_ROOT=$HOME/path/to/llvm816
export PATH="$LLVM816_ROOT/tools/llvm-mos-build/bin:$LLVM816_ROOT/tools:$PATH"

Then source ~/.bashrc (or restart your shell). After that you can just type clang --target=w65816 ... without the path prefix.

Careful: putting tools/llvm-mos-build/bin first on $PATH means all clang invocations in that shell go to our build, not the system clang. Ours still works for your machine's native target too (it's a multi-arch clang), but if you also need your distro's version, prefer Option B.

Option B: shell aliases

In ~/.bashrc:

LLVM816_ROOT=$HOME/path/to/llvm816
alias w65clang="$LLVM816_ROOT/tools/llvm-mos-build/bin/clang --target=w65816 -I $LLVM816_ROOT/runtime/include"
alias link816="$LLVM816_ROOT/tools/link816"

Then:

w65clang -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 ...

Option C: nothing — just use full paths

Every example in this document spells out the full path, so this works too. Verbose, but unambiguous.


Your first program

Let's compile, link, and run a tiny program. Open a terminal in your llvm816/ checkout directory.

1. Write the source

Create hello.c:

// hello.c — the smallest meaningful Apple IIgs program.
//
// Writes a value to bank-2 RAM at $02:5000, then halts.  The MAME
// harness reads that memory cell to verify the result.

int main(void) {
    int x = 6 * 7;
    // Write directly to the 24-bit absolute address $02:5000.  With
    // ptr32 mode (our default), constant pointers to >16-bit addresses
    // lower to `sta long $025000` — no bank-switching needed.
    *(volatile int *)0x025000 = x;
    while (1) {}   // halt; the harness reads memory + exits
    return 0;
}

2. Compile to a .o file

./tools/llvm-mos-build/bin/clang \
    --target=w65816 \
    -O2 \
    -I runtime/include \
    -c hello.c \
    -o hello.o

What each flag does:

| Flag | Meaning |
|------|---------|
| --target=w65816 | Required. Tells clang to emit W65816 machine code instead of the host CPU's code. |
| -O2 | Optimization level. -O2 is recommended; -O0 works but produces 3-5× larger code. |
| -I runtime/include | Look for <stdio.h> etc. in our runtime headers. |
| -c | Compile only: produce a .o, don't link. |
| -o hello.o | Write the object to hello.o. |

If the command succeeds, you'll have a hello.o next to your hello.c. You can inspect it:

./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o | head -40

3. Link

./tools/link816 \
    -o hello.bin \
    --text-base 0x1000 \
    runtime/crt0.o \
    runtime/libc.o \
    runtime/libgcc.o \
    hello.o

Each argument:

| Argument | Why |
|----------|-----|
| -o hello.bin | Output file. |
| --text-base 0x1000 | Where the code goes in memory. 0x1000 is conventional (first 4 KB of bank 0 is reserved for stack + zero page). |
| runtime/crt0.o | Must come first. The C runtime startup: sets up the stack, calls main, halts cleanly on return. |
| runtime/libc.o | Core C library (printf, malloc, strlen, etc.). |
| runtime/libgcc.o | Compiler-provided helpers for things the 65816 can't do natively (16×16 multiply, 32-bit divide, etc.). Required for almost every program. |
| hello.o | Your code. |

link816 will print something like:

linked: text=[0x1000+128] rodata=[0x1080+0] bss=[0x1100+8] -> hello.bin

That tells you the code is 128 bytes, no read-only data, 8 bytes of BSS.
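If you want to check those numbers from a host-side script, the summary line can be parsed with sscanf. The sketch below is a host-side helper, not part of the toolchain; it assumes the line always has exactly the shape of the one sample shown above, and the names Region, parseLinkSummary, and regionEnd are made up for illustration.

```c
#include <stdio.h>

/* One region from the summary line: base address and size in bytes. */
struct Region {
    unsigned long base;
    unsigned long size;
};

/* Parses a line shaped like the sample above. Returns 1 on success and
 * 0 on mismatch. The format is an assumption taken from that one sample. */
int parseLinkSummary(const char *line, struct Region *text,
                     struct Region *rodata, struct Region *bss) {
    int n = sscanf(line,
                   "linked: text=[%lx+%lu] rodata=[%lx+%lu] bss=[%lx+%lu]",
                   &text->base, &text->size,
                   &rodata->base, &rodata->size,
                   &bss->base, &bss->size);
    return n == 6;
}

/* First address past the end of a region. */
unsigned long regionEnd(const struct Region *r) {
    return r->base + r->size;
}
```

For the sample line, the end of text (0x1000 + 128 = 0x1080) is exactly where rodata starts.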

4. Run it in MAME

bash scripts/runInMame.sh hello.bin --check 0x025000=002a

0x002a is hexadecimal for 42 (= 6 × 7), and 0x025000 is the 24-bit address bank $02 + offset $5000 — where your program wrote x. The script boots MAME's emulated Apple IIgs, loads your binary at $00:1000, runs for 5 seconds, reads memory at $02:5000, and compares to the expected value.
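The bank/offset split used in the --check argument is plain bit arithmetic. A minimal host-side sketch follows; the helper names bankOf, offsetOf, and makeAddr are illustrative and do not come from the runtime headers.

```c
#include <stdint.h>

/* Bank byte: bits 16..23 of a 24-bit address. */
uint8_t bankOf(uint32_t addr) {
    return (uint8_t)((addr >> 16) & 0xFF);
}

/* In-bank offset: the low 16 bits. */
uint16_t offsetOf(uint32_t addr) {
    return (uint16_t)(addr & 0xFFFF);
}

/* Rebuild the 24-bit address from its two parts. */
uint32_t makeAddr(uint8_t bank, uint16_t offset) {
    return ((uint32_t)bank << 16) | offset;
}
```

With these, makeAddr(0x02, 0x5000) gives 0x025000, the address passed to --check.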

A pass looks like:

MAME-LOADED bytes=128
MAME-READ addr=0x025000 val=0x002a
[llvm816] MAME OK: 1 reads matched

If you get MAME mismatch, your program wrote a different value (or no value). Most common cause for a new project is writing to a bank-0 address like *(volatile int *)0x5000 = x; (a plain $5000) instead of a 24-bit address like *(volatile int *)0x025000 = x; ($02:5000). The verification harness reads bank 2; writes to bank 0 go to a different RAM cell and the comparison fails.


Compiling C — full reference

The compiler is invoked just like a normal clang, with one extra flag:

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c source.c -o source.o

| Flag | Meaning |
|------|---------|
| --target=w65816 | Selects the W65816 backend (required). |
| -O2 | Recommended optimization level. -O0 and -O1 work but produce ~3-5× larger code. -O3 is the same as -O2 for our backend. |
| -ffunction-sections | Put each function in its own section. Lets the linker drop unreferenced functions (smaller binaries). |
| -I runtime/include | Find <stdio.h>, <stdlib.h>, <iigs/toolbox.h> etc. |
| -c | Compile only: produce .o, don't link. Without this, clang tries to invoke the host linker, which doesn't understand 65816 objects. |
| -g | Emit DWARF debug info. Useful with link816 --debug-out. |
| -S | Emit assembly (.s) instead of an object file. Useful for inspecting codegen. |

What works at -O2

  • All C99 scalars: int8_t through int64_t, signed and unsigned, all arithmetic operators
  • Soft float and double (full IEEE-754 with round-to-nearest-even)
  • Pointers, arrays, structs, unions, bitfields
  • All control flow: if, for, while, goto, switch, recursion
  • <stdarg.h> varargs
  • <setjmp.h> setjmp/longjmp (SJLJ, no DWARF unwinder)
  • Inline __asm__ with "a", "x", "y" register constraints
  • C++ subset: classes, single+multiple inheritance, virtual functions, RTTI, dynamic_cast. DWARF-unwinding exceptions are not supported (the unwinder is not implemented); SJLJ exceptions work via -fsjlj-exceptions.

See STATUS.md for the full feature matrix.


Linking — full reference

link816 produces a flat binary suitable for direct execution (loaded into a fixed address) or, with --omf, an OMF binary that the GS/OS Loader can load and relocate.

Raw binary (fixed-address load)

./tools/link816 -o output.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o yourprog.o

  • --text-base 0x1000: Where code is loaded. $1000 is conventional; the first 4 KB of bank 0 ($00:0000-$00:0FFF) is reserved for the stack and direct page.
  • --bss-base 0x020000: Where uninitialized data (BSS) goes. By default the linker places BSS immediately after rodata; supplying a different bank is useful when your text + data exceeds a single bank's free space.
  • --map output.map: Writes a human-readable map file showing every symbol's address. Useful for debugging.
  • --no-gc-sections: Keep all functions, even unreferenced ones. By default --gc-sections is on, so link816 drops unused code, which shrinks binaries dramatically (a minimal program with the full runtime linked goes from ~43 KB to ~1.5 KB).

Runtime libraries

Each runtime library is built once by runtime/build.sh and lives as a .o in runtime/. Link only what you use — --gc-sections drops the rest.

| Library | When you need it |
|---------|------------------|
| runtime/crt0.o | Always. C runtime startup. |
| runtime/crt0Gsos.o | Instead of crt0.o for programs launched by the GS/OS Loader. |
| runtime/libc.o | printf, malloc, strlen, the usual. Almost always. |
| runtime/libgcc.o | Compiler helpers: multiply, divide, shift. Almost always. |
| runtime/snprintf.o | If you use sprintf / snprintf / vsnprintf. |
| runtime/sscanf.o | If you use sscanf / vsscanf / fscanf. |
| runtime/softDouble.o | If you use double-precision arithmetic anywhere. |
| runtime/softFloat.o | If you use float-precision arithmetic. |
| runtime/math.o | fabs, floor, sqrt, sin, cos, pow, etc. |
| runtime/qsort.o | qsort / bsearch. |
| runtime/strtol.o | strtol / strtoul / atoi / atol. |
| runtime/strtok.o | strtok / strtok_r. |
| runtime/extras.o | strcat, strncat, llabs, rand/srand. |
| runtime/timeExt.o | time / gmtime / mktime. |
| runtime/iigsToolbox.o | Apple IIgs Toolbox call wrappers. |
| runtime/iigsGsos.o | GS/OS class-1 call wrappers (file I/O, etc.). |
| runtime/desktop.o | startdesk() helper used by demos that need a Window Manager environment. |
| runtime/libcxxabi.o | C++ ABI runtime (vtable RTTI, dynamic_cast). |
| runtime/libcxxabiSjlj.o | C++ SJLJ-exception support (paired with -fsjlj-exceptions). |

To (re)build the runtime:

bash runtime/build.sh

Multi-segment OMF (for GS/OS Loader)

For programs >60 KB (the usable bank-0 limit after the stack, zero page, and I/O window are subtracted), build a multi-segment OMF that GS/OS Loader places across banks:

./tools/link816 -o myprog.bin \
    --text-base 0x1000 \
    --segment-cap 0xB000 \
    --segment-bank-base 0x040000 \
    --manifest myprog.manifest.json \
    runtime/crt0Gsos.o ... yourprog.o
./tools/omfEmit --manifest myprog.manifest.json --expressload -o myprog.omf

See docs/multiSegmentPlan.md for details and scripts/runMultiSeg.sh for a working example.


Running under MAME

scripts/runInMame.sh launches MAME's apple2gs driver, loads your binary at $00:1000, runs for a few seconds, and reads a memory cell:

bash scripts/runInMame.sh prog.bin                       # just run for ~5 s
bash scripts/runInMame.sh prog.bin --check 0x025000=002a # verify a value
bash scripts/runInMame.sh prog.bin 0x025000 0x025002     # dump these addresses
  • --check ADDR=VALUE returns exit 0 if memory matches, exit 1 if not. Used by smoke and CI.
  • The bare-address form dumps the value without comparing.

The runner is headless by default (-video none + SDL_VIDEODRIVER=dummy) so it runs in a terminal-only environment. Useful environment variables:

| Variable | Default | Purpose |
|----------|---------|---------|
| MAME_CHECK_FRAME | 300 | Frame at which to read the check address (300 ≈ 5 s at 60 Hz). |
| MAME_SECS | 6 | How long to let MAME run before forcibly exiting. |
| MAME_TIMEOUT | 30 | Wall-clock timeout for the whole MAME invocation. |
| MAME_RAMSIZE | unset | Override the emulated RAM size (e.g. 8M). |

Writing to non-bank-0 RAM

The 65816 has two registers that select which bank a memory access goes to:

  • PBR (Program Bank Register) — selects the bank for instruction fetches. Set by jsl long_addr and rtl.
  • DBR (Data Bank Register) — selects the bank for 16-bit absolute data accesses like lda $5000.

When the IIgs boots, DBR defaults to $00. Bank $00 contains the I/O window at $C000-$CFFF, the language card area, and the stack — not a great place for general data.

With ptr32 mode (the default — pointers are 32 bits / 24-bit addresses), constant pointers to non-bank-0 addresses lower automatically to long (24-bit absolute) instructions that ignore DBR:

*(volatile int *)0x025000 = 42;   // → sta long $025000  (DBR-independent)
*(volatile char *)0xE10068 = 1;   // → sta long $E10068  (vert position reg)
unsigned char v = *(volatile char *)0xE0C025;  // read from bank-$E0 I/O space

For typical programs — writing a result to a verification address, poking IIgs hardware registers, accessing the SHR framebuffer at $E1:2000 — you just dereference the absolute pointer and the compiler does the right thing. DBR doesn't matter.

Legacy: the switchToBank2() idiom

You may see older code (pre-ptr32 migration) using a switchToBank2() helper that pokes DBR to $02 so that subsequent 16-bit-absolute stores like *(volatile X*)0x5000 = v land in bank 2:

__attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile (
        "sep #0x20\n"        // 8-bit A
        ".byte 0xa9,0x02\n"  // lda #2 (hand-encoded)
        "pha\n"              // push A
        "plb\n"              // pop into DBR
        "rep #0x20\n"        // back to 16-bit A
    );
}
// then:
switchToBank2();
*(volatile int *)0x5000 = x;

This still works but is no longer needed for new code. Prefer the direct 24-bit pointer form (*(volatile int *)0x025000 = x;) — it's clearer, requires no inline asm, and produces fewer instructions because the bank byte is encoded inline.

There's still one case where it's useful: if you do a lot of data access within a single bank and want every store to use the 3-byte form (sta $5000,X etc.) instead of the 4-byte form (sta long $025000,X). In that case, set DBR once with the helper above and use 16-bit-absolute addresses afterward. Otherwise, the direct form is simpler.
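Whether the DBR switch pays off in code size is a small break-even calculation. The sketch below is back-of-the-envelope arithmetic under stated assumptions, not a measurement of the compiler. It counts the helper body from the listing above (8 bytes). It also assumes one jsl (4 bytes) plus one rtl (1 byte) for the call and return, and it ignores the cost of restoring DBR afterward. The function name is hypothetical.

```c
enum {
    HELPER_BODY_BYTES = 8, /* sep(2) + lda #imm8(2) + pha(1) + plb(1) + rep(2) */
    CALL_RETURN_BYTES = 5, /* assumed: jsl (4) + rtl (1) */
    ABS_STORE_BYTES   = 3, /* sta $5000,X */
    LONG_STORE_BYTES  = 4  /* sta long $025000,X */
};

/* Net bytes saved by switching DBR once and then using 16-bit-absolute
 * stores. A negative result means the direct 24-bit form is smaller. */
int bytesSavedBySwitching(int storeCount) {
    int overhead = HELPER_BODY_BYTES + CALL_RETURN_BYTES;
    int perStoreSaving = LONG_STORE_BYTES - ABS_STORE_BYTES;
    return storeCount * perStoreSaving - overhead;
}
```

Under these assumptions the switch breaks even at 13 stores, so it only helps code that touches the bank many times.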

What never needs bank-switching

  • Local variables on the stack: stack-relative accesses bypass DBR.
  • Direct-page accesses: lda $D0 always reads $00:00D0.
  • [dp],Y indirect-long pointers: they carry their own bank byte.
  • Function calls: jsl uses PBR + a long destination.
  • Pointers in ptr32 mode: every C pointer is 32 bits, so deref'ing any pointer (even one to bank 0) generates DBR-independent code.

Worked examples

Recursion + printing

// fib.c
#include <stdio.h>
#include <stdlib.h>

unsigned long fib(unsigned n) {
    if (n < 2) return n;
    return fib(n-1) + fib(n-2);
}

int main(void) {
    char buf[32];
    int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));
    // Copy the formatted string into bank-2 RAM at $02:5000 so the
    // MAME harness can read it back.  Each store goes through a 24-bit
    // long-address write — no bank-switching needed.
    for (int i = 0; i <= len; i++)
        ((volatile char *)0x025000)[i] = buf[i];
    while (1) {}
}

Build (snprintf needs soft-double + sscanf to link cleanly):

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c fib.c -o fib.o

./tools/link816 -o fib.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o \
    runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o \
    fib.o

bash scripts/runInMame.sh fib.bin --check 0x025000=0066    # 'f' (start of "fib")

Apple IIgs Toolbox

// hello_gs.c
#include <iigs/toolbox.h>

int main(void) {
    SysBeep();
    while (1) {}
}

Build (note crt0Gsos.o instead of crt0.o — sets up the toolbox environment):

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c hello_gs.c -o hello_gs.o

./tools/link816 -o hello_gs.bin --text-base 0x1000 \
    runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
    runtime/libgcc.o hello_gs.o

Programs that call the toolbox usually run under real GS/OS rather than in the headless harness. See demos/launch.sh and demos/build.sh for a working pipeline.


Advanced: pointer-deref code generation

The W65816 backend treats every pointer as 32-bit (p:32:16 datalayout — sizeof(void *) == 4 from the C compiler's perspective). The high two bytes carry the bank byte plus a pad byte; the low two carry the in-bank offset. This lets a single C pointer reach any byte in the IIgs's 24-bit address space.
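To make the layout concrete, here is a host-side sketch that splits a 32-bit pointer value into the four bytes it would occupy in memory. It assumes little-endian storage (the 65816's byte order) with the pad byte as the topmost byte. The struct and function names are illustrative only.

```c
#include <stdint.h>

/* The four bytes of a ptr32 value, in ascending memory order. */
struct Ptr32Bytes {
    uint8_t offsetLo; /* byte 0: low byte of the in-bank offset  */
    uint8_t offsetHi; /* byte 1: high byte of the in-bank offset */
    uint8_t bank;     /* byte 2: bank byte                       */
    uint8_t pad;      /* byte 3: pad byte (assumed to be last)   */
};

struct Ptr32Bytes splitPtr32(uint32_t p) {
    struct Ptr32Bytes b;
    b.offsetLo = (uint8_t)(p & 0xFF);
    b.offsetHi = (uint8_t)((p >> 8) & 0xFF);
    b.bank     = (uint8_t)((p >> 16) & 0xFF);
    b.pad      = (uint8_t)((p >> 24) & 0xFF);
    return b;
}
```

For the SHR framebuffer address $E1:2000, splitPtr32(0x00E12000) yields offset bytes 0x00 and 0x20, bank 0xE1, and a zero pad byte.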

A pointer dereference has to read up to 24 bits of address to know which bank to touch. The CPU's [dp],Y (indirect-long-Y, opcode 0xB7) reads a 24-bit pointer from a direct-page slot and uses it as the effective address — three bytes wide, bank byte explicit. This is the safe default path and it works regardless of where the target memory lives.

There are two optimizations layered on top of the default path. One is always on and safe. The other is opt-in via a flag and needs care.

Layer 1: constant-offset peeling (default on, always safe)

When you write s->c for a struct field at offset 4, the natural code is "compute s + 4, then deref". Layer 1 recognizes that [dp],Y already has a Y register that's added to the 24-bit pointer on the deref — so instead of computing s + 4 first, the backend stages the base pointer at $E0..$E2 and loads Y = #4 for the deref. Saves three instructions per struct-field access (the clc; adc #4; ...; adc #0 carry chain).

A consecutive-access CSE peephole shares the $E0/$E2 staging between adjacent derefs of the same base, so s->a + s->b + s->c + s->d stages once and emits four ldy #K; lda [$E0],Y pairs.

There's nothing to enable or disable. This was a +1% Lua-wide size win on its own. It's always-on because it's structurally equivalent to the un-optimized code — the same 24-bit deref, just with the offset folded into Y instead of pre-added to the pointer.
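The equivalence claimed above can be modeled on the host. The sketch below is a model of the address arithmetic only, not the backend's code: it treats a [dp],Y deref as "24-bit base plus Y" and compares pre-adding the field offset against leaving it in Y. The struct Sample and both function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Four 16-bit fields, so field c sits at offset 4 as in the text. */
struct Sample {
    uint16_t a;
    uint16_t b;
    uint16_t c;
    uint16_t d;
};

/* Un-peeled form: compute base + offset first, then deref with Y = 0. */
uint32_t addrPreAdded(uint32_t base, uint16_t offset) {
    uint32_t adjusted = (base + offset) & 0xFFFFFF;
    return (adjusted + 0) & 0xFFFFFF;
}

/* Peeled form: stage the base unchanged and carry the offset in Y. */
uint32_t addrPeeled(uint32_t base, uint16_t y) {
    return (base + y) & 0xFFFFFF;
}
```

Both forms reach the same byte for any base and offset, which is why this layer needs no flag.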

Layer 2: -mllvm -w65816-dbr-safe-ptrs (opt-in, unsafe if misused)

The default [dp],Y deref needs three bytes of staging at $E0..$E2 because it reads a 24-bit pointer. Calypsi uses lda (d,S),Y (opcode 0xB3, stack-rel-indirect-Y) for the same effect in ONE instruction — but that opcode reads only 16 bits of pointer. The bank byte is implicit DBR.

When you pass -mllvm -w65816-dbr-safe-ptrs, our backend uses the same one-instruction path: it spills only the low 16 bits of the pointer to a stack slot, sets Y to the offset, and emits lda (slot,S),Y (or sta (slot,S),Y). Bank byte = whatever DBR holds at runtime.

Per-deref cost drops from ~5 instructions to 1. Lua 5.1.5 shrinks by 20.6% with the flag on.

This is correct only when every pointer dereferenced in the TU points to memory inside DBR's current bank. Some examples:

| Pointer | Bank? | Safe with the flag? |
|---------|-------|---------------------|
| malloc() result | DBR's bank (crt0 sets DBR to load bank; malloc allocates from BSS heap there) | Yes |
| Global variable address | DBR's bank (linker puts globals in the load segment) | Yes |
| &local_array[i] in a stack frame | Bank 0 (stack is always bank 0) | Yes IF DBR is 0 (typical) |
| Pointer returned by GS/OS Loader | The Loader's bank (might differ from yours) | No (would miscompile) |
| Pointer cast from a 0x010000+addr integer literal in bank 1 | Bank 1 | No if DBR is not bank 1 |
| &ROMVECTORS[0] from iigs/-style headers | Various IIgs system banks | No in general |

For Lua, Picol, plain C programs that allocate via malloc and operate on globals, this flag is safe. For GS/OS demos that interact with Loader-returned segments or system memory, it would miscompile.

Default is off. Opt in per-TU:

clang --target=w65816 -O2 -mllvm -w65816-dbr-safe-ptrs -c hot.c -o hot.o

If you set the flag and your code does dereference cross-bank pointers, the symptom is silent wrong-address reads — typically a read from the same in-bank offset but in DBR's bank instead of the intended one. No abort, no diagnostic.
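The failure mode can be modeled in a few lines. This host-side sketch assumes the full-pointer path resolves to "24-bit pointer plus Y" and the flagged path resolves to "DBR bank, low 16 bits of the pointer, plus Y". It models addresses only and is not the backend's code; the function names are hypothetical.

```c
#include <stdint.h>

/* Address reached by the default [dp],Y path: the full 24-bit pointer. */
uint32_t addrFullPtr(uint32_t ptr, uint16_t y) {
    return (ptr + y) & 0xFFFFFF;
}

/* Address reached by the (slot,S),Y path: only the low 16 bits of the
 * pointer are kept, and the bank comes from DBR at run time. */
uint32_t addrDbrSafe(uint32_t ptr, uint16_t y, uint8_t dbr) {
    uint32_t base = ((uint32_t)dbr << 16) | (ptr & 0xFFFF);
    return (base + y) & 0xFFFFFF;
}

/* Nonzero when both paths touch the same byte. */
int derefIsSafe(uint32_t ptr, uint16_t y, uint8_t dbr) {
    return addrFullPtr(ptr, y) == addrDbrSafe(ptr, y, dbr);
}
```

With DBR at $00, a pointer to $02:5000 dereferenced at offset 4 lands on $00:5004 instead of $02:5004: the same offset in the wrong bank.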

Mixing safely: the flag is per-TU. You can compile your hot struct-heavy code with the flag and your bank-aware code without. The two .o files link cleanly together. Per-function or per-parameter control isn't supported yet.

When the slot offset overflows 8 bits

lda (d,S),Y has an 8-bit d field — max slot offset 255 from SP. If the function's frame is large enough that the spill slot exceeds that, PEI emits a fallback sequence that long-indirects the slot via [$F6],Y (the function's frame-pointer), then stages at $E0..$E2 and derefs via [$E0],Y. This is ~8 instructions — worse than the plain [dp],Y path the flag was meant to replace. Functions that hit this need usesDpFP=true (set automatically for large frames); otherwise PEI emits a fatal error. In practice you'll only see this on functions with hundreds of local variables.

Inline-threshold tuning (default lowered to 50)

LLVM's default inline-cost threshold is 225, tuned for desktop CPUs where call overhead is high relative to the size of the inlined body. On W65816 a jsl long:foo is just 4 bytes / ~8 cycles, but every inlined pointer dereference expands to multiple instructions even with Layer 2. Aggressive inlining bloats code without commensurate cycle wins.

The W65816 backend lowers the default to 50. Calibration:

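| Threshold | Lua size | CoreMark size | Cycle benches |
|-----------|----------|---------------|---------------|
| 225 (LLVM stock) | 1.47× Calypsi | (not measured) | baseline |
| 75 | 1.16× | 0.87× | identical |
| 50 (current) | 1.13× | 0.79× | identical |
| 25 | 1.11× | 0.79× | identical |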

At 225, Lua's index2adr (a multi-branch helper called 41 times in lapi.c) was inlined into every API entry, adding ~2 KB per file — and CoreMark's matrix_test was 17× Calypsi because the inliner copied 5 nested-loop helpers into it. At 50, both regressions vanish and the cycle benchmarks are unchanged.

To override (e.g. on size-sensitive ROMs or speed-critical loops):

# Force aggressive inlining (back to LLVM default)
clang --target=w65816 -O2 -mllvm -inline-threshold=225 -c file.c -o file.o

# Force MORE conservative inlining
clang --target=w65816 -O2 -mllvm -inline-threshold=10 -c file.c -o file.o

A function marked __attribute__((always_inline)) is always inlined regardless of threshold. A function marked __attribute__((noinline)) is never inlined. Use these to override the global threshold for specific cases.
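As a small illustration of the two attributes, the sketch below pins one tiny helper inline and keeps a larger one out of line. The function names are made up; the attribute spellings are the standard GCC/clang ones.

```c
/* Tiny helper: always folded into its callers, whatever the threshold. */
static inline __attribute__((always_inline))
unsigned clampToByte(unsigned v) {
    return v > 255u ? 255u : v;
}

/* Larger helper: kept as a single out-of-line copy that callers jump to. */
__attribute__((noinline))
unsigned sumClamped(const unsigned *values, unsigned count) {
    unsigned total = 0;
    for (unsigned i = 0; i < count; i++)
        total += clampToByte(values[i]);
    return total;
}
```

Here clampToByte disappears into the loop body of sumClamped, while every caller of sumClamped shares one copy of the loop.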

Summary: which options to use when

| Goal | Compile flag |
|------|--------------|
| Smallest, safest binary (default) | clang --target=w65816 -O2 ... (Layer 1 is on, Layer 2 is off, threshold=50) |
| Smallest binary for code that touches only same-bank memory | Add -mllvm -w65816-dbr-safe-ptrs |
| Fastest possible code (size be damned) | Add -mllvm -inline-threshold=500 |
| Reproduce LLVM's stock inlining behavior | Add -mllvm -inline-threshold=225 |
| Maximum safety review of inlining decisions | Mark hot helpers __attribute__((noinline)) explicitly |

Inline assembly

The W65816 backend supports __asm__ with operand constraints "a", "x", "y":

unsigned short addOne(unsigned short x) {
    unsigned short r;
    __asm__("inc a" : "=a"(r) : "a"(x));
    return r;
}

Multi-instruction asm and raw bytes both work:

__asm__ volatile (
    "sep #0x20\n"
    ".byte 0x68\n"      // pla
    "rep #0x20\n"
);

The .byte form is needed when llvm-mc can't yet parse an opcode literally (some 65816 addressing modes still have gaps in the assembler). Hand-encoding is a stopgap; report opcodes that need it.


Tools reference

| Tool | Location | Purpose |
|------|----------|---------|
| clang | tools/llvm-mos-build/bin/clang | C / C++ compiler |
| clang++ | tools/llvm-mos-build/bin/clang++ | C++ driver |
| llc | tools/llvm-mos-build/bin/llc | Standalone codegen (.ll → .s) |
| llvm-mc | tools/llvm-mos-build/bin/llvm-mc | Assembler |
| llvm-objdump | tools/llvm-mos-build/bin/llvm-objdump | Disassembler |
| link816 | tools/link816 | Our relocating linker |
| omfEmit | tools/omfEmit | Emit OMF v2.1 binary from link816 output |
| mame | system (apt install) | Apple IIgs emulator |

Debugging

Look at the asm

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -S -o prog.s prog.c
cat prog.s

Look at the MIR after each backend pass

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -mllvm -print-after-all -S prog.c 2>&1 | less

Useful pass names to filter on:

| Pass name | What it does |
|-----------|--------------|
| w65816-isel | SDAG → MachineInstr selection |
| w65816-widen-acc16 | Promote Acc16 vregs to Wide16 (regalloc help) |
| w65816-stack-slot-cleanup | Remove redundant spill/reload |
| w65816-stackrel-to-img | Promote hot stack slots to DP IMG slots |
| w65816-stack-slot-merge | Collapse PHI src/dst slot pairs |
| w65816-branch-expand | Long-distance Bxx → INV_Bxx skip; BRA |

Single-pass filter

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -mllvm -print-after=w65816-isel \
    -mllvm -filter-print-funcs=myfunc \
    -S prog.c 2>&1 | less

Disassemble an object file

./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o

Cycle-count benchmarks

13 microbenchmarks live under benchmarks/ — eight integer/string micro-benches, three soft-double FP benches (dadd, dmul, ddiv), and two "game-like" workloads: particles (32-particle physics tick with i16 bounce/wall collision) and mandelbrot (4×4 fixed-point Mandelbrot tile exercising i32 multiply and conditional control flow).

bash scripts/benchCycles.sh

Output (2026-05-21):

| Benchmark | Per-iteration cycles |
|-----------|---------------------:|
| bsearch | 127 cyc/iter (100 iters) |
| crc32 | <65 (under timer resolution) |
| dadd | 1157 cyc/iter (10 iters) |
| ddiv | 1261 cyc/iter (10 iters) |
| dmul | 1033 cyc/iter (10 iters) |
| dotProduct | 144 cyc/iter (100 iters) |
| fib | 97 cyc/iter (100 iters) |
| mandelbrot | 11570 cyc/iter (1 iter, GRID=4 MAX_ITER=8) |
| memcmp | 113 cyc/iter (100 iters) |
| particles | 2253 cyc/iter (3 iters, N=32) |
| popcount | 93 cyc/iter (100 iters) |
| strcpy | 91 cyc/iter (100 iters) |
| sumOfSquares | 126 cyc/iter (100 iters) |

The legacy scripts/benchCyclesPrecise.sh (per-call cycle count via emu.time()) is still available but slower to run.
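To read the cycle counts as wall-clock time, divide by the CPU clock. The sketch below assumes the Apple IIgs's nominal 2.8 MHz fast-mode clock; that figure is not taken from the benchmark output, and real throughput is somewhat lower. The macro and function names are illustrative.

```c
/* Nominal Apple IIgs fast-mode clock. This is an assumption, not a value
 * reported by scripts/benchCycles.sh. */
#define IIGS_FAST_CLOCK_HZ 2800000.0

/* Converts a cycle count to microseconds at the given clock rate. */
double cyclesToMicroseconds(double cycles, double clockHz) {
    return cycles * 1000000.0 / clockHz;
}
```

Under that assumption, the 1157-cycle dadd iteration takes roughly 413 microseconds.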

The compare/ directory has side-by-side .s files vs Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:

bash compare/regen.sh

Known limitations

  • C++ exceptions are not implemented for DWARF unwinding. try / catch compiles but doesn't unwind. -fsjlj-exceptions works for limited SJLJ-style throwing.
  • stdin always returns EOF. scanf compiles but isn't useful. Use sscanf on a buffer instead.
  • File I/O through fopen requires a backing implementation. The default mfs backing (memory-file-system) lets you simulate files via mfsRegister() — useful for tests, not for real disk I/O. GS/OS file I/O works via runtime/iigsGsos.o if you link against the GS/OS runtime.
  • fork/exec — not applicable on a 65816, no support.
  • Code generation gotcha: very large stack frames (>200 bytes) trigger FP-relative addressing. Most programs fit under that limit. See the frame-rel discussion in LLVM_65816_DESIGN.md.
  • Three Lua functions (luaV_execute, symbexec, auxsort) hit the greedy register allocator's complexity budget. Workaround: compile those TUs with -mllvm -regalloc=basic. Documented in tests/lua/README.md.
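For the stdin limitation above, parsing from a buffer with sscanf is ordinary standard C. A minimal sketch follows, with a made-up parseSize helper; on the IIgs target it would also need runtime/sscanf.o at link time, per the runtime library table.

```c
#include <stdio.h>

/* Parses "WIDTHxHEIGHT" (for example "320x200") from a text buffer.
 * Returns 1 when both numbers were read, 0 otherwise. */
int parseSize(const char *text, int *width, int *height) {
    return sscanf(text, "%dx%d", width, height) == 2;
}
```

The buffer can come from anywhere, for example a constant string or bytes read through the GS/OS wrappers.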

Where to go next

  • Building real GS/OS apps: see docs/multiSegmentPlan.md and the demos/launch.sh script for booting through real GS/OS 6.0.2 in MAME. The 9 demos under demos/ are reasonable starting points.
  • Backend internals (you're hacking on the compiler): LLVM_65816_DESIGN.md.
  • Smoke tests: scripts/smokeTest.sh runs ~150 end-to-end checks. Read it for examples of every feature in action.
  • Cycle-bench a Lua port or other real-world C: see tests/lua/README.md for the recipe (vendoring + per-file regalloc tuning + libc stubs).