65816-llvm-mos/docs/USAGE.md
Scott Duensing da095402ec Updated
2026-06-02 23:17:57 -05:00

Using llvm816

This document covers compiling a C program, linking it into an Apple IIgs binary, and running it under MAME. It assumes you've followed INSTALL.md and the install completed successfully.

If you've never used clang or a similar C compiler before, start with Quick orientation — it explains the moving parts. If you already know what clang is, jump to Your first program.


Quick orientation

What is clang?

Clang is a C / C++ compiler — the program that turns your .c source file into machine code an actual CPU can execute. It's part of the LLVM project and is the default C compiler on macOS and on most modern Linux distributions. If you've used gcc before, clang takes nearly the same command-line flags.

A normal install of clang produces code for the machine it's running on — x86-64 if you're on a typical Linux PC. Clang has a cross-compiler mode: pass --target=<arch> to make it emit code for a different CPU. The W65816 (the Apple IIgs CPU) is one of the architectures we've added to a fork of clang that ships with this project.

What gets installed where

After ./setup.sh completes, the project tree under your llvm816/ checkout looks roughly like this:

llvm816/                            ← repo root; everything is contained here
├── docs/                           ← this directory
├── runtime/                        ← C standard library + startup code
│   ├── build.sh                    ← script that builds the runtime .o files
│   ├── include/                    ← header files (<stdio.h>, etc.)
│   │   ├── stdio.h
│   │   ├── string.h
│   │   ├── ...
│   │   └── iigs/                   ← Apple IIgs-specific headers
│   │       ├── toolbox.h           ← ~1300 toolbox routine wrappers
│   │       ├── gsos.h
│   │       └── desktop.h
│   ├── src/                        ← sources for the runtime (.c and .s)
│   └── *.o                         ← compiled runtime objects (after build)
├── scripts/                        ← driver scripts
│   ├── runInMame.sh                ← run a binary in MAME and check memory
│   ├── benchCycles.sh              ← cycle-count benchmarks
│   └── smokeTest.sh                ← ~150 end-to-end correctness checks
├── src/                            ← OUR backend source (you compile from here)
├── tools/                          ← installed tools (~7 GB total)
│   ├── llvm-mos/                   ← LLVM source tree (~5 GB)
│   ├── llvm-mos-build/             ← built artifacts (~1.4 GB)
│   │   └── bin/
│   │       ├── clang               ← THE COMPILER YOU USE
│   │       ├── clang++             ← same, for C++
│   │       ├── llc                 ← standalone IR → asm converter
│   │       ├── llvm-mc             ← standalone assembler
│   │       ├── llvm-objdump        ← disassembler
│   │       └── ...
│   ├── llvm-mos-sdk/               ← prebuilt llvm-mos SDK (~400 MB, mostly unused)
│   ├── link816                     ← OUR LINKER (single binary, ~120 KB)
│   ├── omfEmit                     ← turns flat binary → Apple IIgs OMF v2.1
│   ├── mame/                       ← Apple IIgs ROMs for MAME
│   ├── gsos/                       ← GS/OS 6.0.2 / 6.0.4 disk images
│   ├── calypsi/                    ← reference compiler for comparison (~580 MB)
│   └── orca-c/                     ← reference compiler (header sources)
├── demos/                          ← example IIgs programs
├── benchmarks/                     ← cycle-count benchmarks
├── compare/                        ← side-by-side ours-vs-Calypsi assembly
└── setup.sh                        ← one-shot installer

The two files you'll use most often:

| File | Purpose |
| --- | --- |
| tools/llvm-mos-build/bin/clang | The compiler. Pass --target=w65816 to make it emit Apple IIgs code |
| tools/link816 | The linker. Takes .o files and produces a flat binary the IIgs can load |

Nothing is installed into /usr/local, /opt, or anywhere else on your system — the entire toolchain lives under your llvm816/ checkout. To uninstall, delete the directory.

What about the system's /usr/bin/clang?

If your distribution provides a clang (most do), that's a different clang for your machine's CPU. It does not know about the W65816 target. When following this document, always use the full path ./tools/llvm-mos-build/bin/clang (or set an alias / $PATH — see Setting up your environment).

What the build process produces

When you compile a C file for the IIgs, the flow looks like this:

hello.c
   │
   │  clang --target=w65816    (cross-compile to 65816 machine code)
   ▼
hello.o                        (relocatable ELF object file)
   │
   │  + crt0.o + libc.o + libgcc.o      (runtime libraries you link in)
   │
   │  link816                  (our relocating linker)
   ▼
hello.bin                      (flat binary, loadable at $00:1000)
   │
   │  optionally: omfEmit hello.bin → hello.omf  (for GS/OS Loader)
   │
   │  scripts/runInMame.sh hello.bin
   ▼
runs in MAME's emulated Apple IIgs

Three stages:

  1. Compile: clang turns .c into .o
  2. Link: link816 combines .o files + runtime libraries into a binary
  3. Run: MAME boots an emulated IIgs and executes the binary

Setting up your environment

To save typing, you can add the tools to your $PATH (Option A), define shell aliases (Option B), or keep using absolute paths (Option C). The rest of this document uses absolute paths so the examples work without any setup, but in practice you'll want shortcuts.

Option A: add the tools to $PATH

Add this to ~/.bashrc (or ~/.zshrc) so our tools are on your path:

export LLVM816_ROOT=$HOME/path/to/llvm816
export PATH="$LLVM816_ROOT/tools/llvm-mos-build/bin:$LLVM816_ROOT/tools:$PATH"

Then source ~/.bashrc (or restart your shell). After that you can just type clang --target=w65816 ... without the path prefix.

Careful: putting tools/llvm-mos-build/bin first on $PATH means all clang invocations in that shell go to our build, not the system clang. Ours still works for your machine's native target too (it's a multi-arch clang), but if you also need your distro's version, prefer Option B.

Option B: shell aliases

In ~/.bashrc:

LLVM816_ROOT=$HOME/path/to/llvm816
alias w65clang="$LLVM816_ROOT/tools/llvm-mos-build/bin/clang --target=w65816 -I $LLVM816_ROOT/runtime/include"
alias link816="$LLVM816_ROOT/tools/link816"

Then:

w65clang -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 ...

Option C: nothing — just use full paths

Every example in this document spells out the full path, so this works too. Verbose, but unambiguous.


Your first program

Let's compile, link, and run a tiny program. Open a terminal in your llvm816/ checkout directory.

1. Write the source

Create hello.c:

// hello.c — the smallest meaningful Apple IIgs program.
//
// Writes a value to bank-2 RAM at $02:5000, then halts.  The MAME
// harness reads that memory cell to verify the result.

int main(void) {
    int x = 6 * 7;
    // Write directly to the 24-bit absolute address $02:5000.  With
    // ptr32 mode (our default), constant pointers to >16-bit addresses
    // lower to `sta long $025000` — no bank-switching needed.
    *(volatile int *)0x025000 = x;
    while (1) {}   // halt; the harness reads memory + exits
    return 0;
}

2. Compile to a .o file

./tools/llvm-mos-build/bin/clang \
    --target=w65816 \
    -O2 \
    -I runtime/include \
    -c hello.c \
    -o hello.o

What each flag does:

| Flag | Meaning |
| --- | --- |
| --target=w65816 | Required. Tells clang to emit W65816 machine code instead of the host CPU's code. |
| -O2 | Optimization level. -O2 is recommended; -O0 works but produces 3-5× larger code. |
| -I runtime/include | Look for <stdio.h> etc. in our runtime headers. |
| -c | Compile only: produce a .o, don't link. |
| -o hello.o | Write the object to hello.o. |

If the command succeeds, you'll have a hello.o next to your hello.c. You can inspect it:

./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o | head -40

3. Link into a binary

./tools/link816 \
    -o hello.bin \
    --text-base 0x1000 \
    runtime/crt0.o \
    runtime/libc.o \
    runtime/libgcc.o \
    hello.o

Each argument:

| Argument | Why |
| --- | --- |
| -o hello.bin | Output file. |
| --text-base 0x1000 | Where the code goes in memory. 0x1000 is conventional (first 4 KB of bank 0 is reserved for stack + zero page). |
| runtime/crt0.o | Must come first. The C runtime startup: sets up the stack, calls main, halts cleanly on return. |
| runtime/libc.o | Core C library (printf, malloc, strlen, etc.). |
| runtime/libgcc.o | Compiler-provided helpers for things the 65816 can't do natively (16×16 multiply, 32-bit divide, etc.). Required for almost every program. |
| hello.o | Your code. |

link816 will print something like:

linked: text=[0x1000+128] rodata=[0x1080+0] bss=[0x1100+8] -> hello.bin

That tells you the code is 128 bytes, no read-only data, 8 bytes of BSS.
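The bracketed fields read as [base+size], with the size printed in decimal. As a quick sanity check of that reading, the sketch below recomputes where a section ends. The struct and function names are made up for illustration and are not part of link816.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one "[base+size]" field from the link816 summary. */
struct section_span {
    uint32_t base;  /* load address of the section */
    uint32_t size;  /* size in bytes (printed in decimal by the linker) */
};

/* First address past the end of the section. */
uint32_t section_end(struct section_span s) {
    assert(s.base + s.size <= 0x1000000u);  /* must stay inside 24-bit space */
    return s.base + s.size;
}
```

For the sample output, text=[0x1000+128] ends at 0x1080, which is exactly where the rodata field begins.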

4. Run it in MAME

bash scripts/runInMame.sh hello.bin --check 0x025000=002a

0x002a is hexadecimal for 42 (= 6 × 7), and 0x025000 is the 24-bit address bank $02 + offset $5000 — where your program wrote x. The script boots MAME's emulated Apple IIgs, loads your binary at $00:1000, runs for 5 seconds, reads memory at $02:5000, and compares to the expected value.

A pass looks like:

MAME-LOADED bytes=128
MAME-READ addr=0x025000 val=0x002a
[llvm816] MAME OK: 1 reads matched

If you get MAME mismatch, your program wrote a different value (or no value). Most common cause for a new project is writing to a bank-0 address like *(volatile int *)0x5000 = x; (a plain $5000) instead of a 24-bit address like *(volatile int *)0x025000 = x; ($02:5000). The verification harness reads bank 2; writes to bank 0 go to a different RAM cell and the comparison fails.
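To make the bank-0 versus bank-2 distinction concrete, here is a small host-side sketch that splits a 24-bit address into its bank byte and in-bank offset. The helper names are illustrative only; nothing like them ships with the toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Bank byte (bits 16..23) of a 24-bit 65816 address. */
uint8_t addr_bank(uint32_t addr) {
    assert(addr <= 0xFFFFFFu);  /* only 24 bits are meaningful */
    return (uint8_t)(addr >> 16);
}

/* In-bank offset (bits 0..15) of a 24-bit 65816 address. */
uint16_t addr_offset(uint32_t addr) {
    return (uint16_t)(addr & 0xFFFFu);
}
```

With these, 0x025000 splits into bank 0x02 and offset 0x5000, while a plain 0x5000 has the same offset but bank 0x00, which is why the harness never sees that write.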


Compiling C — full reference

The compiler is invoked just like a normal clang, with one extra flag:

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -c source.c -o source.o

| Flag | Meaning |
| --- | --- |
| --target=w65816 | Selects the W65816 backend (required). |
| -O2 | Default optimization. -O0 and -O1 work but produce ~3-5× larger code. -O3 is the same as -O2 for our backend. |
| -ffunction-sections | Put each function in its own section. Lets the linker drop unreferenced functions (smaller binaries). |
| -I runtime/include | Find <stdio.h>, <stdlib.h>, <iigs/toolbox.h> etc. |
| -c | Compile only: produce .o, don't link. Without this, clang tries to invoke the host linker, which doesn't understand 65816 objects. |
| -g | Emit DWARF debug info. Useful with link816 --debug-out. |
| -S | Emit assembly (.s) instead of an object file. Useful for inspecting codegen. |

What works at -O2

  • All C99 scalars: int8_t through int64_t, signed and unsigned, all arithmetic operators
  • Soft float and double (full IEEE-754 with round-to-nearest-even)
  • Pointers, arrays, structs, unions, bitfields
  • All control flow: if, for, while, goto, switch, recursion
  • <stdarg.h> varargs
  • <setjmp.h> setjmp/longjmp (SJLJ, no DWARF unwinder)
  • Inline __asm__ with "a", "x", "y" register constraints
  • C++ subset: classes, single + multiple inheritance, virtual base diamonds, RTTI, dynamic_cast, new / delete / new[] / delete[], global ctors via .init_array (walked by the crt0), Meyers singletons (gated by __cxa_guard_acquire/release), and global + static-local dtors actually run at exit time — each crt0 calls __run_cxa_atexit after main() returns to walk the registered table LIFO. SJLJ exceptions via clang++ -fsjlj-exceptions (no DWARF unwinder).
  • printf / snprintf family: full C99 conversion + flag + width + precision + length surface — %d %i %u %x %X %o %c %s %p %f %F %e %E %g %G %n %%, flags - + space # 0, width and precision via decimal or *, length modifiers hh h l ll j z t. Hex-float %a / %A is the only intentional gap (niche).
  • IIgs desktop helpers: <iigs/desktop.h> (startdesk/enddesk), <iigs/sound.h> (SysBeep + FFStartSound wrappers), <iigs/eventLoop.h> (callback-based TaskMaster dispatch — close, menu, key, mouse, idle). See demos/cxxProbe.cpp / the smoke helpers test for usage.
  • Source-level debugger (post-mortem): build with clang -g and link with link816 --debug-out FOO.dwarf --map FOO.map, then resolve a runtime PC to source with scripts/pc2line.py --sidecar FOO.dwarf --map FOO.map 0xADDR. Output: PC=0x123A FILE=foo.c LINE=42 FUNC=add. See scripts/mameDebug.sh for a wrapper that takes --break FUNC / --break FILE:LINE and runs under MAME.
  • C++ containers via vendored ETL (Embedded Template Library) — #include "etl/vector.h", #include "etl/string.h", #include "etl/map.h", #include "etl/optional.h", #include "etl/delegate.h", etc. See the C++ shell commands section below for usage.

See STATUS.md for the full feature matrix.
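As an illustration of the printf / snprintf surface listed above, the following sketch uses only standard C99 conversions, flags, width, and precision. The function name is hypothetical, and the expected string is what a conforming C library produces; it is shown here as a host-side check, not as captured output from the IIgs runtime.

```c
#include <stdio.h>
#include <string.h>

/* Formats a few values with width, flag, and precision variations.
   Returns a pointer to a static buffer holding the result. */
const char *format_demo(void) {
    static char buf[64];
    snprintf(buf, sizeof buf, "[%5d] [%-5d] [%05d] [%#x] [%.3s]",
             42, 42, 42, 255u, "abcdef");
    return buf;
}
```

A conforming implementation yields `[   42] [42   ] [00042] [0xff] [abc]`. On this target, remember to add runtime/snprintf.o (and its dependencies) to the link line.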


Linking — full reference

link816 produces a flat binary suitable for direct execution (loaded into a fixed address) or, with --omf, an OMF binary that the GS/OS Loader can load and relocate.

Raw binary (fixed-address load)

./tools/link816 -o output.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o yourprog.o

  • --text-base 0x1000: Where code is loaded. $1000 is conventional; the first 4 KB of bank 0 ($00:0000-$00:0FFF) is reserved for the stack and direct page.
  • --bss-base 0x020000: Where uninitialized data (BSS) goes. By default the linker places BSS immediately after rodata; supplying a different bank is useful when your text + data exceed a single bank's free space.
  • --map output.map: Writes a human-readable map file showing every symbol's address. Useful for debugging.
  • --no-gc-sections: Keep all functions, even unreferenced ones. --gc-sections is on by default and drops unused code, shrinking binaries dramatically (a minimal program with the full runtime linked goes from ~43 KB to ~1.5 KB).

Runtime libraries

Each runtime library is built once by runtime/build.sh and lives as a .o in runtime/. Link only what you use — --gc-sections drops the rest.

| Library | When you need it |
| --- | --- |
| runtime/crt0.o | Always. C runtime startup. |
| runtime/crt0Gsos.o | Instead of crt0.o for programs launched by the GS/OS Loader. |
| runtime/libc.o | printf, malloc, strlen, the usual. Almost always. |
| runtime/libgcc.o | Compiler helpers: multiply, divide, shift. Almost always. |
| runtime/snprintf.o | If you use sprintf / snprintf / vsnprintf. |
| runtime/sscanf.o | If you use sscanf / vsscanf / fscanf. |
| runtime/softDouble.o | If you use double-precision arithmetic anywhere. |
| runtime/softFloat.o | If you use float-precision arithmetic. |
| runtime/math.o | fabs, floor, sqrt, sin, cos, pow, etc. |
| runtime/qsort.o | qsort / bsearch. |
| runtime/strtol.o | strtol / strtoul / atoi / atol. |
| runtime/strtok.o | strtok / strtok_r. |
| runtime/extras.o | strcat, strncat, llabs, rand/srand. |
| runtime/timeExt.o | time / gmtime / mktime. |
| runtime/iigsToolbox.o | Apple IIgs Toolbox call wrappers. |
| runtime/iigsGsos.o | GS/OS class-1 call wrappers (file I/O, etc.). |
| runtime/desktop.o | startdesk() helper used by demos that need a Window Manager environment. |
| runtime/libcxxabi.o | C++ ABI runtime (vtable RTTI, dynamic_cast). |
| runtime/libcxxabiSjlj.o | C++ SJLJ-exception support (paired with -fsjlj-exceptions). |

To (re)build the runtime:

bash runtime/build.sh

Multi-segment OMF (for GS/OS Loader)

For programs >60 KB (the usable bank-0 limit after the stack, zero page, and I/O window are subtracted), build a multi-segment OMF that GS/OS Loader places across banks:

./tools/link816 -o myprog.bin \
    --text-base 0x1000 \
    --segment-cap 0xB000 \
    --segment-bank-base 0x040000 \
    --manifest myprog.manifest.json \
    runtime/crt0Gsos.o ... yourprog.o
./tools/omfEmit --manifest myprog.manifest.json --expressload -o myprog.omf

See docs/multiSegmentPlan.md for details and scripts/runMultiSeg.sh for a working example.


Running under MAME

scripts/runInMame.sh launches MAME's apple2gs driver, loads your binary at $00:1000, runs for a few seconds, and reads a memory cell:

bash scripts/runInMame.sh prog.bin                       # just run for ~5 s
bash scripts/runInMame.sh prog.bin --check 0x025000=002a # verify a value
bash scripts/runInMame.sh prog.bin 0x025000 0x025002     # dump these addresses

  • --check ADDR=VALUE returns exit 0 if memory matches, exit 1 if not. Used by smoke and CI.
  • The bare-address form dumps the value without comparing.

The runner is headless by default (-video none + SDL_VIDEODRIVER=dummy) so it runs in a terminal-only environment. Useful environment variables:

| Variable | Default | Purpose |
| --- | --- | --- |
| MAME_CHECK_FRAME | 300 | Frame at which to read the check address (300 ≈ 5 s at 60 Hz). |
| MAME_SECS | 6 | How long to let MAME run before forcibly exiting. |
| MAME_TIMEOUT | 30 | Wall-clock timeout for the whole MAME invocation. |
| MAME_RAMSIZE | unset | Override the emulated RAM size (e.g. 8M). |
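The frame and second settings are related by simple arithmetic. The sketch below shows the conversion, plus a consistency check that assumes the check frame has to fall before MAME_SECS expires; that ordering is an inference from the defaults (300 frames < 6 s × 60 Hz), not documented behavior, and both function names are hypothetical.

```c
/* Frames needed to cover `seconds` of emulated time at `hz` frames per second. */
unsigned frames_for_seconds(unsigned seconds, unsigned hz) {
    return seconds * hz;
}

/* Returns 1 if the check frame lands before the run is cut off.
   ASSUMPTION: the memory read must happen before MAME_SECS elapses. */
int check_frame_in_window(unsigned check_frame, unsigned hz, unsigned mame_secs) {
    return check_frame < frames_for_seconds(mame_secs, hz);
}
```

If a slow program needs, say, 8 seconds before its result is valid, this arithmetic suggests MAME_CHECK_FRAME=480 together with a MAME_SECS larger than 8.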

Writing to non-bank-0 RAM

The 65816 has two registers that select which bank a memory access goes to:

  • PBR (Program Bank Register) — selects the bank for instruction fetches. Set by jsl long_addr and rtl.
  • DBR (Data Bank Register) — selects the bank for 16-bit absolute data accesses like lda $5000.

When the IIgs boots, DBR defaults to $00. Bank $00 contains the I/O window at $C000-$CFFF, the language card area, and the stack — not a great place for general data.

With ptr32 mode (the default — pointers are 32 bits / 24-bit addresses), constant pointers to non-bank-0 addresses lower automatically to long (24-bit absolute) instructions that ignore DBR:

*(volatile int *)0x025000 = 42;   // → sta long $025000  (DBR-independent)
*(volatile char *)0xE10068 = 1;   // → sta long $E10068  (vert position reg)
unsigned char v = *(volatile char *)0xE0C025;  // I/O register read (bank $E0, $C0xx soft-switch page)

For typical programs — writing a result to a verification address, poking IIgs hardware registers, accessing the SHR framebuffer at $E1:2000 — you just dereference the absolute pointer and the compiler does the right thing. DBR doesn't matter.

Legacy: the switchToBank2() idiom

You may see older code (pre-ptr32 migration) using a switchToBank2() helper that pokes DBR to $02 so that subsequent 16-bit-absolute stores like *(volatile X*)0x5000 = v land in bank 2:

__attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile (
        "sep #0x20\n"        // 8-bit A
        ".byte 0xa9,0x02\n"  // lda #2 (hand-encoded)
        "pha\n"              // push A
        "plb\n"              // pop into DBR
        "rep #0x20\n"        // back to 16-bit A
    );
}
// then:
switchToBank2();
*(volatile int *)0x5000 = x;

This still works but is no longer needed for new code. Prefer the direct 24-bit pointer form (*(volatile int *)0x025000 = x;) — it's clearer, requires no inline asm, and produces fewer instructions because the bank byte is encoded inline.

There's still one case where it's useful: if your code does a lot of data access within a single bank and you want every store to be 3 bytes (sta $5000,X etc.) instead of 4 bytes (sta long $025000,X). In that case, set DBR once with the helper above and use 16-bit-absolute addresses afterward. Otherwise, the direct form is simpler.

What never needs bank-switching

  • Local variables on the stack: stack-relative accesses bypass DBR.
  • Direct-page accesses: lda $D0 always reads $00:00D0.
  • [dp],Y indirect-long pointers: they carry their own bank byte.
  • Function calls: jsl uses PBR + a long destination.
  • Pointers in ptr32 mode: every C pointer is 32 bits, so deref'ing any pointer (even one to bank 0) generates DBR-independent code.

Running under GNO/ME

The MAME path above runs your program bare-metal. GNO/ME 2.0.6 is a Unix-like multitasking environment that runs on top of real GS/OS, and a llvm816-compiled C (or C++) program can run as a native GNO shell command — with console stdio, argv, and FILE* file I/O — booted through GS/OS 6.0.4 in MAME. This is a sibling to the MAME path: a different way to run the same C, inside a real OS.

This is verified headless and end-to-end. Three steps take you from C source to a running command.

1. Build the base GNO disk (once)

bash tools/gno/buildDisk.sh        # -> tools/gno/gnobase.po

This assembles the GNO/ME userland into an 800 KB ProDOS volume. Re-run it only when the GNO archive set changes.

One-time prerequisites. buildDisk.sh needs nulib2 (a system package: sudo apt-get install nulib2) and tools/cadius/cadius (run bash scripts/installCadius.sh if it is missing), plus the GNO/ME 2.0.6 .shk archives under tools/gno/dist/. The runner in step 3 also needs the GS/OS 6.0.4 system disk at tools/gsos/6.0.4 - System.Disk.po and the same IIgs ROMs the MAME path uses. None of these are installed by setup.sh today — see INSTALL.md for the full list. You also need the GNO runtime objects, which bash runtime/build.sh builds automatically.

2. Compile a C program into a GNO OMF

bash demos/buildGno.sh gnoHello    # demos/gnoHello.c -> demos/gnoHello.omf

buildGno.sh takes a single basename (required); it reads demos/<name>.c and emits demos/<name>.omf (plus .o/.bin/.map/.reloc sidecars). Bundled examples: gnoHello, gnoCat, gnoFile, gnoFmt, gnoStdin.

It links the GNO crt0 and runtime, then runs omfEmit --expressload --relocs ... --stack-size 0x4000. Override the DP/Stack size with the GNO_STACK_SIZE environment variable if needed (default 0x4000).

3. Boot, log in, run, and check

bash scripts/runInGno.sh demos/gnoHello.omf --check 0x025000=C0DE

The runner boots GS/OS 6.0.4 + GNO in headless MAME, logs in as root, runs your command, then probes memory. gnoHello writes 0xC0DE to $02:5000 as a harness marker, so a successful run prints:

[llvm816] GNO check OK: 0x025000 = 0xc0de

--check takes ADDR=VALUE pairs (multiple allowed after one --check). The address uses 0x form (0x025000); the expected value is bare hex with no prefix (C0DE, not 0xC0DE). The runner prints the matched value lowercased. Add --snapshots to capture a PNG of each boot/login/run stage to /tmp/gnosnaps.
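If you generate --check arguments from a script or test driver, the two halves need different formatting. The sketch below builds one ADDR=VALUE pair in the form described above: a 0x-prefixed 24-bit address and a bare 4-digit hex value. The function is hypothetical and assumes 16-bit check values, as in the examples in this document.

```c
#include <stdio.h>
#include <string.h>

/* Builds one "0xAAAAAA=VVVV" pair for --check.
   Returns a pointer to a static buffer (overwritten on each call). */
const char *format_check_arg(unsigned long addr, unsigned value) {
    static char buf[32];
    snprintf(buf, sizeof buf, "0x%06lX=%04X",
             addr & 0xFFFFFFul,   /* keep 24 address bits */
             value & 0xFFFFu);    /* ASSUMPTION: 16-bit expected value */
    return buf;
}
```

For the gnoHello marker this yields `0x025000=C0DE`, matching the command line shown in step 3.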

Things you must know

  • The OMF command basename must be ProDOS-legal — no hyphen. Name it testgno, not test-gno, or the command never launches.
  • stdio needs libcGno linked. buildGno.sh does this for C. Without it the program runs but prints nothing (the console hooks fall through to a dead sink).
  • Console file descriptors follow GNO's convention: stdin=1, stdout=2, stderr=3 (a documented deviation from POSIX 0/1/2).
  • Commands that do GS/OS file I/O need the --stack-size DP/Stack OMF segment that buildGno.sh passes (0x4000); the 4 KB default crashes.

C++ shell commands

demos/buildGno.sh <name> auto-detects .c vs .cpp and switches to clang++ -fno-exceptions -fno-rtti for the latter, linking runtime/libcxxabi.o + libcxxabiSjlj.o so the C++ ABI hooks (operator new/delete, __cxa_guard_*, __cxa_atexit + __run_cxa_atexit, RTTI typeinfo, dynamic_cast, SJLJ exception runtime) resolve. Link-time GC strips whatever isn't used, so a pure-C .c program pays nothing extra for the additional .o files on the link line.

Global / static-local dtors run at exit. Each crt0 calls __run_cxa_atexit after main() returns and before halt/QUIT — the registered dtor table is walked in LIFO order, so destructors for file-scope objects and static T x; locals actually execute. demos/cxxProbe.cpp is the worked example.
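The LIFO walk can be pictured with a small model in plain C. This is only a conceptual sketch of a registered-handler table walked newest-first; it is not the runtime's actual __cxa_atexit / __run_cxa_atexit code, and every name in it is invented for the example.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DTORS 8

typedef void (*dtor_fn)(void);

static dtor_fn dtor_table[MAX_DTORS];
static size_t dtor_count;

/* Register a handler; returns 0 on success, -1 if the table is full. */
int register_dtor(dtor_fn fn) {
    if (dtor_count == MAX_DTORS) return -1;
    dtor_table[dtor_count++] = fn;
    return 0;
}

/* Walk the table newest-first, emptying it as we go. */
void run_dtors(void) {
    while (dtor_count > 0) dtor_table[--dtor_count]();
}

/* Trace buffer so the demo can report the order handlers ran in. */
static char trace[MAX_DTORS + 1];
static size_t trace_len;
static void note(char c) { if (trace_len < MAX_DTORS) trace[trace_len++] = c; }

void dtor_a(void) { note('A'); }
void dtor_b(void) { note('B'); }

/* Registers A then B, runs the table, and returns the execution order. */
const char *lifo_demo(void) {
    trace_len = 0;
    register_dtor(dtor_a);
    register_dtor(dtor_b);
    run_dtors();
    trace[trace_len] = '\0';
    return trace;
}
```

Handlers registered as A then B run as B then A, mirroring how objects constructed later are destroyed first.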

ETL containers — the vendored Embedded Template Library at runtime/include/c++/etl/ provides fixed-capacity STL-style containers with no malloc and no exceptions. buildGno.sh adds -I runtime/include/c++ to the compile line, so:

#include "etl/vector.h"
#include "etl/string.h"
#include "etl/map.h"
#include "etl/optional.h"
#include "etl/delegate.h"

static int doubler(int x) { return x * 2; }

int main(void) {
    etl::vector<int, 8> v;
    for (int i = 1; i <= 5; i++) v.push_back(i);

    etl::string<32> s("Hello, ");
    s += "ETL";

    etl::map<int, int, 8> m;
    m[1] = 100;

    etl::optional<int> opt = 42;

    // etl::delegate is the std::function-equivalent (type-erased callable).
    // etl::function is for binding object methods, NOT general callables.
    etl::delegate<int(int)> fn = etl::delegate<int(int)>::create<doubler>();
    return fn(s.size());   // 20
}

The capacity N in etl::vector<T, N> (and etl::string<N>, etl::map<K,V,N>, …) is a template parameter, so storage is in-struct (no heap, no allocator). Pick N like you'd pick the size of a C array. Same trade-off — too small overflows, too large wastes BSS. Overflow today silently corrupts past the storage array (no exceptions, default ETL_ASSERT is a no-op); install a callback via etl::error_handler::set_callback(...) at startup if you want a halt on overflow.

The target profile at runtime/include/c++/etl_profile.h sets ETL_NO_STL, no atomics, no exceptions, and no std::ostream. Do not override it in user code. Full container list at etlcpp.com. demos/etlProbe.cpp exercises vector + string + map + optional + delegate end-to-end (20 KB total).

For hand-driven builds without buildGno.sh, link libcGno before libc so its strong console hooks win. See the gno target in stuff/baztest/Makefile for a worked recipe.

For the full picture — disk layout, the inline GS/OS QUIT convention, the double-run/QUIT trap, argv handover, FILE* round-trips, and the runInGno.sh environment hooks (GNO_STDIN, GNO_ADDFILE, GNO_RUNCMD, GNO_POLL_FRAMES) — see tools/gno/README.md.


Worked examples

Recursion + printing

// fib.c
#include <stdio.h>
#include <stdlib.h>

unsigned long fib(unsigned n) {
    if (n < 2) return n;
    return fib(n-1) + fib(n-2);
}

int main(void) {
    char buf[32];
    int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));
    // Copy the formatted string into bank-2 RAM at $02:5000 so the
    // MAME harness can read it back.  Each store goes through a 24-bit
    // long-address write — no bank-switching needed.
    for (int i = 0; i <= len; i++)
        ((volatile char *)0x025000)[i] = buf[i];
    while (1) {}
}

Build (snprintf needs soft-double + sscanf to link cleanly):

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c fib.c -o fib.o

./tools/link816 -o fib.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o \
    runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o \
    fib.o

bash scripts/runInMame.sh fib.bin --check 0x025000=0066    # 'f' (start of "fib")

Apple IIgs Toolbox

// hello_gs.c
#include <iigs/toolbox.h>

int main(void) {
    SysBeep();
    while (1) {}
}

Build (note crt0Gsos.o instead of crt0.o — sets up the toolbox environment):

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -I runtime/include -c hello_gs.c -o hello_gs.o

./tools/link816 -o hello_gs.bin --text-base 0x1000 \
    runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
    runtime/libgcc.o hello_gs.o

Programs that call the toolbox usually run under real GS/OS rather than in the headless harness. See demos/launch.sh and demos/build.sh for a working pipeline.


Advanced: pointer-deref code generation

The W65816 backend treats every pointer as 32-bit (p:32:16 datalayout — sizeof(void *) == 4 from the C compiler's perspective). The high two bytes carry the bank byte plus a pad byte; the low two carry the in-bank offset. This lets a single C pointer reach any byte in the IIgs's 24-bit address space.

A pointer dereference has to read up to 24 bits of address to know which bank to touch. The CPU's [dp],Y (indirect-long-Y, opcode 0xB7) reads a 24-bit pointer from a direct-page slot and uses it as the effective address — three bytes wide, bank byte explicit. This is the safe default path and it works regardless of where the target memory lives.

There are two optimizations layered on top of the default path. One is always on and safe. The other is opt-in via a flag and needs care.

Layer 1: constant-offset peeling (default on, always safe)

When you write s->c for a struct field at offset 4, the natural code is "compute s + 4, then deref". Layer 1 recognizes that [dp],Y already has a Y register that's added to the 24-bit pointer on the deref — so instead of computing s + 4 first, the backend stages the base pointer at $E0..$E2 and loads Y = #4 for the deref. Saves three instructions per struct-field access (the clc; adc #4; ...; adc #0 carry chain).

A consecutive-access CSE peephole shares the $E0/$E2 staging between adjacent derefs of the same base, so s->a + s->b + s->c + s->d stages once and emits four ldy #K; lda [$E0],Y pairs.

There's nothing to enable or disable. On its own this was about a 1% code-size win across Lua. It's always on because it's structurally equivalent to the un-optimized code: the same 24-bit deref, just with the offset folded into Y instead of pre-added to the pointer.
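The equivalence claim can be checked with a host-side model of the effective-address arithmetic. This is a sketch, not backend code: it assumes [dp],Y forms its address as the 24-bit pointer plus Y (with carry into the bank byte), and all function names are invented for the example.

```c
#include <stdint.h>

/* Model of [dp],Y: effective address = 24-bit pointer + Y.
   ASSUMPTION: the add carries into the bank byte, wrapping at 24 bits. */
uint32_t ea_indirect_long_y(uint32_t ptr24, uint16_t y) {
    return (ptr24 + y) & 0xFFFFFFu;
}

/* Un-optimized shape: add the field offset to the pointer, deref with Y = 0. */
uint32_t ea_preadd(uint32_t base, uint16_t field_off) {
    return ea_indirect_long_y((base + field_off) & 0xFFFFFFu, 0);
}

/* Layer 1 shape: leave the pointer alone and put the offset in Y. */
uint32_t ea_peeled(uint32_t base, uint16_t field_off) {
    return ea_indirect_long_y(base, field_off);
}
```

Both shapes produce the same address for any base and offset; only the instruction count differs, which is why the optimization needs no opt-out.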

Layer 2: -mllvm -w65816-dbr-safe-ptrs (opt-in, unsafe if misused)

The default [dp],Y deref needs three bytes of staging at $E0..$E2 because it reads a 24-bit pointer. Calypsi uses lda (d,S),Y (opcode 0xB3, stack-rel-indirect-Y) for the same effect in ONE instruction — but that opcode reads only 16 bits of pointer. The bank byte is implicit DBR.

When you pass -mllvm -w65816-dbr-safe-ptrs, our backend uses the same one-instruction path: it spills only the low 16 bits of the pointer to a stack slot, sets Y to the offset, and emits lda (slot,S),Y (or sta (slot,S),Y). Bank byte = whatever DBR holds at runtime.

Per-deref cost drops from ~5 instructions to 1. Lua 5.1.5 shrinks by 20.6% with the flag on.

This is correct only when every pointer dereferenced in the TU points to memory inside DBR's current bank. Some examples:

| Pointer | Bank? | Safe with the flag? |
| --- | --- | --- |
| malloc() result | DBR's bank (crt0 sets DBR to load bank; malloc allocates from BSS heap there) | Yes |
| Global variable address | DBR's bank (linker puts globals in the load segment) | Yes |
| &local_array[i] in a stack frame | Bank 0 (stack is always bank 0) | Yes IF DBR is 0 (typical) |
| Pointer returned by GS/OS Loader | The Loader's bank (might differ from yours) | No (would miscompile) |
| Pointer cast from a 0x010000+addr integer literal in bank 1 | Bank 1 | No if DBR is not bank 1 |
| &ROMVECTORS[0] from iigs/-style headers | Various IIgs system banks | No in general |

For Lua, Picol, plain C programs that allocate via malloc and operate on globals, this flag is safe. For GS/OS demos that interact with Loader-returned segments or system memory, it would miscompile.

Default is off. Opt in per-TU:

clang --target=w65816 -O2 -mllvm -w65816-dbr-safe-ptrs -c hot.c -o hot.o

If you set the flag and your code does dereference cross-bank pointers, the symptom is silent wrong-address reads — typically a read from the same in-bank offset but in DBR's bank instead of the intended one. No abort, no diagnostic.
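The failure mode is easy to model on the host. The sketch below compares the address the default path touches with the address the flagged path touches, under the description above (low 16 bits of the pointer, bank supplied by DBR). It is an illustrative model with invented names, not code from the backend.

```c
#include <stdint.h>

/* Default path: full 24-bit pointer plus Y. */
uint32_t ea_default(uint32_t ptr24, uint16_t y) {
    return (ptr24 + y) & 0xFFFFFFu;
}

/* Flagged path: only the low 16 bits of the pointer are used and the
   bank comes from DBR at runtime. */
uint32_t ea_dbr_safe(uint32_t ptr24, uint8_t dbr, uint16_t y) {
    uint32_t base = ((uint32_t)dbr << 16) | (ptr24 & 0xFFFFu);
    return (base + y) & 0xFFFFFFu;
}

/* Returns 1 when both paths touch the same byte, 0 when the flag would
   silently redirect the access into DBR's bank. */
int dbr_safe_ok(uint32_t ptr24, uint8_t dbr, uint16_t y) {
    return ea_default(ptr24, y) == ea_dbr_safe(ptr24, dbr, y);
}
```

A pointer to $02:5000 with DBR = $02 is fine, but with DBR = $00 the same dereference at offset 4 lands on $00:5004 instead of $02:5004.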

Mixing safely: the flag is per-TU. You can compile your hot struct-heavy code with the flag and your bank-aware code without. The two .o files link cleanly together. Per-function or per-parameter control isn't supported yet.

When the slot offset overflows 8 bits

lda (d,S),Y has an 8-bit d field — max slot offset 255 from SP. If the function's frame is large enough that the spill slot exceeds that, PEI emits a fallback sequence that long-indirects the slot via [$F6],Y (the function's frame-pointer), then stages at $E0..$E2 and derefs via [$E0],Y. This is ~8 instructions — worse than the plain [dp],Y path the flag was meant to replace. Functions that hit this need usesDpFP=true (set automatically for large frames); otherwise PEI emits a fatal error. In practice you'll only see this on functions with hundreds of local variables.

Inline-threshold tuning (default lowered to 50)

LLVM's default inline-cost threshold is 225, tuned for desktop CPUs where call overhead is high relative to the size of the inlined body. On W65816 a jsl long:foo is just 4 bytes / ~8 cycles, but every inlined pointer dereference expands to multiple instructions even with Layer 2. Aggressive inlining bloats code without commensurate cycle wins.

The W65816 backend lowers the default to 50. Calibration:

| Threshold | Lua size | CoreMark size | Cycle benches |
| --- | --- | --- | --- |
| 225 (LLVM stock) | 1.47× Calypsi | (not measured) | baseline |
| 75 | 1.16× | 0.87× | identical |
| 50 (current) | 1.13× | 0.79× | identical |
| 25 | 1.11× | 0.79× | identical |

At 225, Lua's index2adr (a multi-branch helper called 41 times in lapi.c) was inlined into every API entry, adding ~2 KB per file — and CoreMark's matrix_test was 17× Calypsi because the inliner copied 5 nested-loop helpers into it. At 50, both regressions vanish and the cycle benchmarks are unchanged.

To override (e.g. on size-sensitive ROMs or speed-critical loops):

# Force aggressive inlining (back to LLVM default)
clang --target=w65816 -O2 -mllvm -inline-threshold=225 -c file.c -o file.o

# Force MORE conservative inlining
clang --target=w65816 -O2 -mllvm -inline-threshold=10 -c file.c -o file.o

A function marked __attribute__((always_inline)) is always inlined regardless of threshold. A function marked __attribute__((noinline)) is never inlined. Use these to override the global threshold for specific cases.
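As an illustration of per-function overrides, the sketch below pins one tiny helper inline and keeps a larger one out of line. The helper names and bodies are hypothetical; only the two attributes come from the text above.

```c
/* Hypothetical helpers; only the attributes are taken from the text. */

/* Tiny accessor on a hot path: inlined at every call site,
   whatever -inline-threshold is set to. */
static inline __attribute__((always_inline))
unsigned short lowByte(unsigned short v) {
    return (unsigned short)(v & 0x00FFu);
}

/* Larger helper: keep a single out-of-line copy even if the
   threshold would otherwise allow inlining it. */
__attribute__((noinline))
unsigned short checksum(const unsigned char *buf, unsigned short len) {
    unsigned short sum = 0;
    for (unsigned short i = 0; i < len; i++) {
        sum = (unsigned short)(sum + buf[i]);
    }
    return lowByte(sum);
}
```

Because the attributes travel with the function, they keep working when a translation unit is rebuilt with a different global threshold.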

Summary: which options to use when

| Goal | Compile flag |
|------|--------------|
| Smallest, safest binary (default) | clang --target=w65816 -O2 ... (Layer 1 on, Layer 2 off, threshold=50) |
| Smallest binary for code that touches only same-bank memory | Add -mllvm -w65816-dbr-safe-ptrs |
| Fastest possible code (size be damned) | Add -mllvm -inline-threshold=500 |
| Reproduce LLVM's stock inlining behavior | Add -mllvm -inline-threshold=225 |
| Maximum safety review of inlining decisions | Mark hot helpers __attribute__((noinline)) explicitly |

Inline assembly

The W65816 backend supports __asm__ with operand constraints "a", "x", "y":

unsigned short addOne(unsigned short x) {
    unsigned short r;
    __asm__("inc a" : "=a"(r) : "a"(x));
    return r;
}

Multi-instruction asm and raw bytes both work:

__asm__ volatile (
    "sep #0x20\n"
    ".byte 0x68\n"      // pla
    "rep #0x20\n"
);

The .byte form is needed when llvm-mc can't yet parse an opcode literally (some 65816 addressing modes still have gaps in the assembler). Hand-encoding is a stopgap; report opcodes that need it.


Tools reference

| Tool         | Location                              | Purpose                                  |
|--------------|---------------------------------------|------------------------------------------|
| clang        | tools/llvm-mos-build/bin/clang        | C / C++ compiler                         |
| clang++      | tools/llvm-mos-build/bin/clang++      | C++ driver                               |
| llc          | tools/llvm-mos-build/bin/llc          | Standalone codegen (.ll → .s)            |
| llvm-mc      | tools/llvm-mos-build/bin/llvm-mc      | Assembler                                |
| llvm-objdump | tools/llvm-mos-build/bin/llvm-objdump | Disassembler                             |
| link816      | tools/link816                         | Our relocating linker                    |
| omfEmit      | tools/omfEmit                         | Emit OMF v2.1 binary from link816 output |
| mame         | system (apt install)                  | Apple IIgs emulator                      |

Debugging

Look at the asm

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 -S -o prog.s prog.c
cat prog.s

Look at the MIR after each backend pass

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -mllvm -print-after-all -S prog.c 2>&1 | less

Useful pass names to filter on:

| Pass name                 | What it does                                  |
|---------------------------|-----------------------------------------------|
| w65816-isel               | SDAG → MachineInstr selection                 |
| w65816-widen-acc16        | Promote Acc16 vregs to Wide16 (regalloc help) |
| w65816-stack-slot-cleanup | Remove redundant spill/reload                 |
| w65816-stackrel-to-img    | Promote hot stack slots to DP IMG slots       |
| w65816-stack-slot-merge   | Collapse PHI src/dst slot pairs               |
| w65816-branch-expand      | Long-distance Bxx → INV_Bxx skip; BRA         |

Single-pass filter

./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -mllvm -print-after=w65816-isel \
    -mllvm -filter-print-funcs=myfunc \
    -S prog.c 2>&1 | less

Disassemble an object file

./tools/llvm-mos-build/bin/llvm-objdump --triple=w65816 -d hello.o

ELF e_machine value

W65816 .o files use EM_W65816 = 0xFF16 in the ELF header.

The value sits in the 0xFF00-0xFFFF range reserved by the ELF spec for vendor-private / experimental targets, so no official e_machine registration is required. The 16 suffix is a mnemonic for "65816". (The natural choice, 65816 itself = 0x10118, does not fit the 16-bit Elf32_Half e_machine field.)
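The arithmetic behind that choice can be checked at compile time. This is a minimal sketch: the `Elf32Half` typedef is a local stand-in for the ELF type, and `truncatedMachine` is a hypothetical helper used only for illustration.

```c
#include <stdint.h>

#define EM_W65816 0xFF16u            /* value from the text above */

typedef uint16_t Elf32Half;          /* local stand-in for Elf32_Half */

_Static_assert(65816 == 0x10118, "65816 decimal is 0x10118");
_Static_assert(65816 > UINT16_MAX, "65816 does not fit a 16-bit field");
_Static_assert(EM_W65816 >= 0xFF00u && EM_W65816 <= 0xFFFFu,
               "EM_W65816 lies in the 0xFF00-0xFFFF range");

/* What a 16-bit e_machine field would actually keep if 65816 were
   stored in it: only the low 16 bits, 0x0118. */
static Elf32Half truncatedMachine(void) {
    return (Elf32Half)65816u;
}
```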

Why this matters:

  • llvm-dwarfdump, readelf, and other generic ELF consumers used to warn on every invocation because the file claimed EM_NONE (= no machine). Setting a real EM_ value silences the warning while still preventing a host-architecture .o from being accidentally linked.
  • link816 validates e_machine and rejects anything that isn't EM_W65816 (with EM_NONE still accepted for backwards compatibility with any pre-Phase-1.13 object files lingering in a build tree).
  • The relocation numbers R_W65816_* are unique under EM_W65816, so they're free to stay at the small stable integers 1-8 (see src/llvm/lib/Target/W65816/MCTargetDesc/W65816ELFObjectWriter.cpp).
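The acceptance rule in the second bullet can be sketched as below. This is not link816's actual code: the function name is hypothetical, and it assumes a little-endian ELF32 header, where e_machine is the 16-bit field at byte offset 18 (after e_ident[16] and e_type).

```c
#include <stddef.h>
#include <stdint.h>

#define EM_NONE   0x0000u
#define EM_W65816 0xFF16u

/* Hypothetical check modeled on the rule above. Returns 1 when the
   header's e_machine is EM_W65816, or EM_NONE (legacy objects);
   returns 0 for anything else or for a header too short to inspect. */
static int machineIsAcceptable(const uint8_t *hdr, size_t len) {
    if (hdr == NULL || len < 20) {
        return 0;                    /* cannot hold e_machine */
    }
    /* Assumed little-endian: low byte at offset 18, high byte at 19. */
    uint16_t machine = (uint16_t)(hdr[18] | ((uint16_t)hdr[19] << 8));
    return machine == EM_W65816 || machine == EM_NONE;
}
```

A host object such as x86-64 (e_machine 62) fails this check, which is the accidental-link protection the first bullet describes.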

Touchpoints if you ever need to change the value:

| File | What it does |
|------|--------------|
| tools/llvm-mos/llvm/include/llvm/BinaryFormat/ELF.h | Defines EM_W65816 enumerator |
| src/llvm/lib/Target/W65816/MCTargetDesc/W65816ELFObjectWriter.cpp | Passes value to MCELFObjectTargetWriter |
| src/link816/link816.cpp | Validates value on input |

Cycle-count benchmarks

Microbenchmarks live under benchmarks/: integer/string micro-benches plus soft-double FP benches.

W65816_CC_EXTRA="-mllvm -w65816-dbr-safe-ptrs" bash scripts/benchCyclesPrecise.sh

This measures per-call cycle counts via MAME's emu.time() between markers — apples-to-apples vs the matching scripts/benchCyclesCalypsi.sh runner (commercial Calypsi 5.16). Current ratios (2026-05-27, Layer 2):

| Benchmark    | Ours  | Calypsi | Ratio |
|--------------|------:|--------:|------:|
| dotProduct   | 1534  | 5712    | 0.27× |
| bsearch      | 682   | 2387    | 0.29× |
| sumOfSquares | 6820  | 16368   | 0.42× |
| bubbleSort   | 11594 | 17050   | 0.68× |
| strLen       | 767   | 1023    | 0.75× |
| djb2Hash     | 2046  | 2643    | 0.77× |
| popcount     | 1194  | 1534    | 0.78× |
| strcpy       | 1108  | 1194    | 0.93× |
| memcmp       | 682   | 716     | 0.95× |
| fib          | 11594 | 10912   | 1.06× |

Geomean: 0.62× Calypsi. 9 of 10 below 1.0×. The Layer 2 flag (-mllvm -w65816-dbr-safe-ptrs) enables stack-rel-indirect-Y ptr32 derefs — required for parity since Calypsi's pointer ABI assumes DBR matches the pointer's bank.
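For readers who want to re-derive the summary line, the sketch below recomputes the geometric mean from the cycle counts in the table. The function names are hypothetical, and the n-th root is found by bisection only so the snippet needs no math library; the result lands between 0.62 and 0.63, so the last printed digit depends on rounding.

```c
#include <stddef.h>

/* Cycle counts copied from the table above: {ours, Calypsi}. */
static const double kCycles[][2] = {
    {1534, 5712}, {682, 2387}, {6820, 16368}, {11594, 17050}, {767, 1023},
    {2046, 2643}, {1194, 1534}, {1108, 1194}, {682, 716}, {11594, 10912},
};
#define BENCH_COUNT (sizeof kCycles / sizeof kCycles[0])

/* n-th root of a value in (0, 1] by bisection on [0, 1]. */
static double nthRoot(double value, size_t n) {
    double lo = 0.0, hi = 1.0;
    for (int iter = 0; iter < 60; iter++) {
        double mid = 0.5 * (lo + hi);
        double power = 1.0;
        for (size_t k = 0; k < n; k++) {
            power *= mid;
        }
        if (power < value) lo = mid; else hi = mid;
    }
    return 0.5 * (lo + hi);
}

/* Geometric mean of the ours/Calypsi ratios. */
static double geomeanRatio(void) {
    double product = 1.0;
    for (size_t i = 0; i < BENCH_COUNT; i++) {
        product *= kCycles[i][0] / kCycles[i][1];
    }
    return nthRoot(product, BENCH_COUNT);
}

/* Number of benchmarks where our cycle count is below Calypsi's. */
static size_t countBelowParity(void) {
    size_t below = 0;
    for (size_t i = 0; i < BENCH_COUNT; i++) {
        if (kCycles[i][0] < kCycles[i][1]) below++;
    }
    return below;
}
```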

The scripts/benchCycles.sh (HBL-tick-based) script is still around but lower-resolution. Prefer the Precise runner above.

The compare/ directory has side-by-side .s files vs Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:

bash compare/regen.sh

UndefinedBehaviorSanitizer (UBSan, minimal runtime)

The W65816 target ships a hand-rolled minimal UBSan runtime (runtime/ubsan.o). No driver-side magic: pass the flags and link the runtime object explicitly.

# Compile with UBSan-min instrumentation.
./tools/llvm-mos-build/bin/clang --target=w65816 -O2 \
    -fsanitize=undefined -fsanitize-minimal-runtime \
    -ffunction-sections -I runtime/include \
    -c prog.c -o prog.o

# Link, including runtime/ubsan.o so the 25 __ubsan_handle_*_minimal
# symbols clang emits calls to resolve cleanly.  libgcc.o is needed
# whenever you exercise i16 div / i32 multiply / shift-by-N.
./tools/link816 -o prog.bin --text-base 0x1000 --bss-base 0xA000 \
    runtime/crt0.o prog.o runtime/ubsan.o runtime/libgcc.o

What's covered (25 of the 25 handlers upstream's minimal runtime emits):

type-mismatch            shift-out-of-bounds      invalid-objc-cast
alignment-assumption     out-of-bounds            function-type-mismatch
add-overflow             local-out-of-bounds      implicit-conversion
sub-overflow             builtin-unreachable (*)  nonnull-arg
mul-overflow             missing-return (*)       nonnull-return
negate-overflow          vla-bound-not-positive   nullability-arg
divrem-overflow          float-cast-overflow      nullability-return
                         load-invalid-value       pointer-overflow
                         invalid-builtin          cfi-check-fail

(*) recovering-only — no _abort variant emitted upstream.

When a UB site fires, the runtime calls a per-kind handler that:

  1. Looks up the caller PC in a 20-entry dedup table (single-threaded, no atomics).
  2. If first-seen, emits one line via the existing __putByteErr hook (GNO fd 3 / stderr) in the format ubsan: <kind> by 0x<8-hex>\n.
  3. The recover variant returns; the _abort variant calls __builtin_trap(), which lowers to BRK_pseudo + sentinel 0xBE @ $70 + tight-loop spin.
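The dedup-then-report steps above can be sketched on the host as follows. This is not the shipped runtime: the names are hypothetical, snprintf stands in for the __putByteErr byte sink, lowercase hex is an assumption, and the behavior once all 20 slots are full (here: suppress further reports) is an assumption the text does not confirm.

```c
#include <stdint.h>
#include <stdio.h>

#define UBSAN_DEDUP_SLOTS 20

static uint32_t seenPc[UBSAN_DEDUP_SLOTS];
static unsigned seenCount;

/* Returns 1 the first time a caller PC is seen, 0 on repeats.
   Assumption: when the table is full, new PCs are suppressed. */
static int firstSighting(uint32_t pc) {
    for (unsigned i = 0; i < seenCount; i++) {
        if (seenPc[i] == pc) {
            return 0;
        }
    }
    if (seenCount == UBSAN_DEDUP_SLOTS) {
        return 0;
    }
    seenPc[seenCount++] = pc;
    return 1;
}

/* Formats "ubsan: <kind> by 0x<8-hex>\n" into out on first sighting.
   Returns the line length, or 0 when the PC was already reported. */
static int reportOnce(const char *kind, uint32_t pc, char *out, size_t outLen) {
    if (!firstSighting(pc)) {
        return 0;
    }
    return snprintf(out, outLen, "ubsan: %s by 0x%08lx\n",
                    kind, (unsigned long)pc);
}
```

A linear scan is enough at this size, and a plain array with a counter needs no locking in a single-threaded runtime.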

ASan is out of scope — the 8:1 shadow-memory model would need ~2 MB of shadow for the 16 MB 65816 address space, while most IIgs programs run in one or two banks.
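The shadow-size figure follows directly from the 8:1 ratio. A worked version of the arithmetic, with hypothetical macro and function names:

```c
/* 24-bit address bus: 2^24 bytes = 16 MB of address space. */
#define ADDRESS_SPACE_BYTES (1ul << 24)

/* ASan's default mapping: 8 application bytes per shadow byte. */
#define SHADOW_RATIO 8ul

/* Shadow memory needed to cover a region of the given size. */
static unsigned long shadowBytes(unsigned long coveredBytes) {
    return coveredBytes / SHADOW_RATIO;
}
```

Covering the full address space gives 2^24 / 8 = 2^21 bytes, which is the ~2 MB quoted above.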

End-to-end smoke probe:

bash tests/ubsan/runUbsanProbe.sh

Exercises add-overflow + shift-out-of-bounds + divide-by-zero, verifies each handler fires and execution recovers past the UB site (sentinels at $025000..$025006). Wired into scripts/smokeTest.sh as the Phase 6.2 stage; override with SMOKE_SKIP_UBSAN=1.

The probe deliberately overrides three handlers with strong defs that record their firing in a state byte rather than printing — that lets the test verify the call edge without pulling libc.o (and the attached snprintf.o) into a smoke probe that doesn't need console I/O. A diagnostic-format smoke (asserting on the ubsan: ...\n line) is a follow-up under the cxxsmoke GNO MAME harness.


Known limitations

  • C++ exceptions are not implemented for DWARF unwinding. try / catch compiles but doesn't unwind. -fsjlj-exceptions works for limited SJLJ-style throwing.
  • stdin always returns EOF. scanf compiles but isn't useful. Use sscanf on a buffer instead.
  • File I/O through fopen requires a backing implementation. The default mfs backing (memory-file-system) lets you simulate files via mfsRegister() — useful for tests, not for real disk I/O. GS/OS file I/O works via runtime/iigsGsos.o if you link against the GS/OS runtime.
  • fork/exec — not applicable on a 65816, no support.
  • Code generation gotcha: very large stack frames (>200 bytes) trigger FP-relative addressing. Most programs fit under that limit. See the frame-rel discussion in LLVM_65816_DESIGN.md.
  • Three Lua functions (luaV_execute, symbexec, auxsort) hit the greedy register allocator's complexity budget. Workaround: compile those TUs with -mllvm -regalloc=basic. Documented in tests/lua/README.md.
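For the stdin limitation listed above, parsing from an in-memory buffer with sscanf looks like this. A minimal sketch; `parseSize` and the "320x200" input format are made up for illustration.

```c
#include <stdio.h>

/* Parses "<width>x<height>" from a string instead of reading stdin.
   Returns 1 when both numbers were converted, 0 otherwise. */
static int parseSize(const char *text, int *width, int *height) {
    return sscanf(text, "%dx%d", width, height) == 2;
}
```

The buffer can come from anywhere the runtime can already reach, such as a string constant or a file registered through mfsRegister().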

Where to go next

  • Building real GS/OS apps: see docs/multiSegmentPlan.md and the demos/launch.sh script for booting through real GS/OS 6.0.2 in MAME. The 9 demos under demos/ are reasonable starting points.
  • Running as a GNO/ME shell command: see Running under GNO/ME above, tools/gno/README.md, and the demos/gno*.c examples.
  • Backend internals (you're hacking on the compiler): LLVM_65816_DESIGN.md.
  • Smoke tests: scripts/smokeTest.sh runs ~150 end-to-end checks. Read it for examples of every feature in action.
  • Cycle-bench a Lua port or other real-world C: see tests/lua/README.md for the recipe (vendoring + per-file regalloc tuning + libc stubs).