65816-llvm-mos/docs/USAGE.md
2026-05-18 14:43:35 -05:00

Using llvm816

This document covers compiling a C program, linking it into an Apple IIgs binary, and running it under MAME. It assumes you've followed INSTALL.md and have a working tools/llvm-mos-build/bin/clang.

Quick reference

CLANG=tools/llvm-mos-build/bin/clang
LINK=tools/link816
RUNTIME=runtime

# 1. Compile C to object
$CLANG --target=w65816 -O2 -I$RUNTIME/include -c hello.c -o hello.o

# 2. Link to a raw binary (loadable at $00:1000)
$LINK -o hello.bin --text-base 0x1000 \
    $RUNTIME/crt0.o $RUNTIME/libc.o $RUNTIME/libgcc.o hello.o

# 3. Run under MAME
bash scripts/runInMame.sh hello.bin --check 0x025000=????

Compiling C

The compiler is invoked just like a normal clang, with --target=w65816:

clang --target=w65816 -O2 -c source.c -o source.o

Recommended flags:

| Flag | Meaning |
|------|---------|
| --target=w65816 | Selects the W65816 backend (required) |
| -O2 | Default optimization level. -O0 and -O1 work but produce ~3-5× larger code |
| -ffunction-sections | Put each function in its own section. Lets the linker drop unreferenced functions |
| -I runtime/include | Find <stdio.h> etc. |
| -c | Compile only: produce .o, don't link |

What works at -O2:

  • All C99 scalars: int8_t through int64_t, signed and unsigned, all arithmetic operators
  • Soft float and double (full IEEE-754 with round-to-nearest-even)
  • Pointers, arrays, structs, unions, bitfields
  • All control flow: if, for, while, goto, switch, recursion
  • <stdarg.h> varargs
  • <setjmp.h> setjmp/longjmp (SJLJ, no DWARF unwinder)
  • Inline __asm__ with "a", "x", "y" register constraints
  • C++ subset: classes, single+multiple inheritance, virtual functions, RTTI, dynamic_cast. No exceptions (DWARF unwinder not implemented).

See STATUS.md for the full feature matrix.
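
As a quick way to exercise a few of the features listed above, here is a minimal sketch in plain C99. The function names (sumInts, mulWide, tryOperation) are made up for illustration and are not part of the runtime. The code uses only standard headers, so it can also be compiled with a host compiler to compare results.

```c
#include <setjmp.h>
#include <stdarg.h>
#include <stdint.h>

// <stdarg.h>: sum `count` int arguments.
int32_t sumInts(int count, ...) {
    va_list ap;
    int32_t total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}

// 64-bit scalar arithmetic: widen before multiplying so no bits are lost.
uint64_t mulWide(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

static jmp_buf recover;

static void fail(void) {
    longjmp(recover, 1);
}

// <setjmp.h>: returns 0 on the normal path, -1 when fail() jumped back.
int tryOperation(int shouldFail) {
    if (setjmp(recover) != 0)
        return -1;
    if (shouldFail)
        fail();
    return 0;
}
```

On the target, a result from any of these can be written to $02:5000 and verified with --check, as in the "Hello, integer" example.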

Linking

The linker is tools/link816. It produces either a raw binary suitable for direct execution (loaded into a fixed address) or an OMF binary suitable for GS/OS Loader.

Raw binary

link816 -o output.bin --text-base 0x1000 crt0.o libc.o libgcc.o yourprog.o
  • --text-base 0x1000: physical address where code is loaded. 0x1000 is the conventional starting address; the first 4KB of bank 0 ($00:0000-$00:0FFF) is reserved for the stack and zero-page.
  • crt0.o: the C runtime startup. Sets DBR, calls main, halts. Always link first.
  • libc.o: printf, malloc, strlen, etc.
  • libgcc.o: compiler-helper routines (__mulhi3, __umulhisi3, __divhi3, __ashlhi3, etc.). Required by most non-trivial programs.

Additional runtime libraries

| Library | What you get |
|---------|--------------|
| runtime/libc.o | Core C library: printf, malloc, strlen, etc. |
| runtime/libgcc.o | Compiler helpers: multiply, divide, shift |
| runtime/snprintf.o | sprintf / snprintf / vsnprintf |
| runtime/sscanf.o | sscanf / vsscanf / fscanf |
| runtime/softDouble.o | IEEE 754 double-precision math |
| runtime/softFloat.o | IEEE 754 single-precision math |
| runtime/math.o | fabs, floor, sqrt, sin, cos, etc. |
| runtime/qsort.o | qsort / bsearch |
| runtime/strtol.o | strtol / strtoul / atoi / atol |
| runtime/strtok.o | strtok / strtok_r |
| runtime/extras.o | strcat, strncat, llabs, rand/srand |
| runtime/timeExt.o | time / gmtime / mktime |
| runtime/iigsToolbox.o | Apple IIgs Toolbox call wrappers |
| runtime/iigsGsos.o | GS/OS call wrappers |

Link only what you use — the linker drops unreferenced symbols.

Build them all once with:

bash runtime/build.sh

Multi-segment OMF (for GS/OS Loader)

For programs that need >60 KB of code (the usable bank-0 limit after subtracting the stack, zero-page, and I/O window), build a multi-segment OMF that GS/OS Loader can place across banks:

link816 -o myprog.bin --omf --manifest my.manifest \
    --expressload \
    crt0Gsos.o ... yourprog.o

See docs/multiSegmentPlan.md for details and scripts/runMultiSeg.sh for a working example.

Running under MAME

The supplied scripts/runInMame.sh launches MAME's apple2gs with the right ROM path, loads your binary at $00:1000, runs for a few seconds, and reads back a memory cell.

bash scripts/runInMame.sh prog.bin                     # just run for 5s
bash scripts/runInMame.sh prog.bin --check 0x025000=00ff
bash scripts/runInMame.sh prog.bin 0x025000 0x025002   # dump these addrs

The --check ADDR=VALUE form exits with status 0 if ADDR contains VALUE after the run, and with status 1 otherwise. Use ???? as VALUE (as in --check 0x025000=????) to dump the value without checking.

MAME is invoked headless by default (no window) via -video none + SDL_VIDEODRIVER=dummy. This works on servers/CI runners.

The bank-switch idiom

Background — why this is necessary

The 65816 has two registers that select which bank a memory access goes to:

  • PBR (Program Bank Register) — selects the bank for instruction fetches. Set by jsl long_addr and rtl.
  • DBR (Data Bank Register) — selects the bank for data accesses like lda $5000, sta $5000, etc.

When the IIgs boots, DBR defaults to $00. Bank $00 (the same bank as the language card / IIe-compatibility area) contains the I/O window at $C000-$CFFF. Any data access to addresses in that range goes to the soft-switches and slot ROMs, NOT to RAM. This is the same I/O hole the Apple IIe has, inherited by the IIgs for backward compatibility.

Concretely: if your DBR is $00 and you write to address $C100, you're poking the slot-1 ROM enable register — definitely not what you want. Similarly, $5000 in bank 0 is the language card area and may or may not be RAM depending on soft-switch state.

Banks $01-$DF are full 64K RAM banks ($E0/$E1 are aux/main shadow, $E0-$FF reserved). To do reliable data work, switch the DBR to any of these "normal" banks. $02 is conventional in this codebase because:

  1. $01:0000-$01:FFFF overlaps the stack page ($0100-$01FF in any bank ends up in the same physical RAM as bank $00's stack page — confusing).
  2. $02:0000-$02:FFFF is the first "clean" bank above the special-purpose banks.
  3. The smoke-test convention is to write a result word to $02:5000 so runInMame.sh can read it back.

If your program needs more than 64 KB of data, switch DBR to different banks as needed.
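
The way DBR combines with a 16-bit absolute address can be sketched in a few lines of host-side C. This is only a model of the addressing rule described above, not code from the runtime; the helper names (absoluteAddress, bankOf, offsetOf, hitsIoWindow) are invented for illustration, and the I/O window test covers only the bank-0 range this section mentions.

```c
#include <stdint.h>

// 24-bit address used by an absolute data access such as `sta $5000`:
// DBR supplies the bank byte, the instruction supplies the 16-bit offset.
uint32_t absoluteAddress(uint8_t dbr, uint16_t offset) {
    return ((uint32_t)dbr << 16) | offset;
}

// Split a 24-bit address such as 0x025000 back into bank and offset.
uint8_t bankOf(uint32_t addr24) {
    return (uint8_t)(addr24 >> 16);
}

uint16_t offsetOf(uint32_t addr24) {
    return (uint16_t)(addr24 & 0xFFFFu);
}

// True when the access would land in the bank-0 I/O window ($C000-$CFFF)
// instead of RAM.
int hitsIoWindow(uint8_t dbr, uint16_t offset) {
    return dbr == 0x00 && offset >= 0xC000u && offset <= 0xCFFFu;
}
```

With DBR=$02, absoluteAddress(0x02, 0x5000) yields 0x025000, the same 24-bit form that runInMame.sh takes for --check.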

What the assembly does, line by line

__attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile (
        "sep #0x20\n"        // (1) Switch A to 8-bit
        ".byte 0xa9,0x02\n"  // (2) lda #2 (8-bit immediate)
        "pha\n"              // (3) Push A onto stack (1 byte)
        "plb\n"              // (4) Pop into DBR (1 byte from stack)
        "rep #0x20\n"        // (5) Restore A to 16-bit
    );
}
  1. sep #0x20: sets the M bit in the status register P. M=1 makes A behave as 8-bit (and immediate operands become 1 byte). We need this so that the pha in step (3) pushes exactly 1 byte, matching what plb expects to pop. Calling-convention prologues always run in M=0 (16-bit), so this sep is required.

  2. .byte 0xa9,0x02: raw bytes for lda #$02. We hand-encode because llvm-mc can't yet emit lda #$02 with a 1-byte immediate; the assembler keeps encoding the operand as 16-bit. 0xa9 is the LDA-immediate opcode; 0x02 is the 1-byte operand. Result: A = $02 (8-bit).

  3. pha: pushes A. In M=1 mode, PHA pushes exactly 1 byte (the low half of A). Stack now has $02 on top.

  4. plb: pops 1 byte from the stack and stores it in DBR. DBR is now $02. All subsequent data accesses go to bank 2.

  5. rep #0x20: clears the M bit. A returns to 16-bit mode, matching the calling-convention contract for the rest of the function.

The DBR change persists across function returns. Once switchToBank2() returns, all data reads/writes in your program target bank 2 — until you switch DBR again.
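
To see why the accumulator width matters for the push/pop pairing, the five steps can be traced with a toy register model in host-side C. This is a hypothetical sketch: the Cpu struct and helper names are invented here, and it models only the behaviour described in the walkthrough (pha pushes 1 or 2 bytes depending on M, plb always pops 1 byte).

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical toy model: only the state the five instructions touch.
typedef struct {
    uint16_t a;         // accumulator (low byte is used when M=1)
    uint8_t  dbr;       // data bank register
    int      m8;        // 1 when the M flag is set (8-bit accumulator)
    uint8_t  stack[4];  // toy stack; the real one lives in bank 0
    int      depth;     // bytes currently on the toy stack
} Cpu;

static void push8(Cpu *c, uint8_t v) {
    assert(c->depth < 4);
    c->stack[c->depth++] = v;
}

static uint8_t pop8(Cpu *c) {
    assert(c->depth > 0);
    return c->stack[--c->depth];
}

static void sepM(Cpu *c) { c->m8 = 1; }   // sep #0x20
static void repM(Cpu *c) { c->m8 = 0; }   // rep #0x20

// lda #imm: with M=1 only the low byte of A is loaded.
static void ldaImm(Cpu *c, uint16_t imm) {
    if (c->m8)
        c->a = (uint16_t)((c->a & 0xFF00u) | (imm & 0x00FFu));
    else
        c->a = imm;
}

// pha: 1 byte when M=1; high byte then low byte when M=0.
static void pha(Cpu *c) {
    if (!c->m8)
        push8(c, (uint8_t)(c->a >> 8));
    push8(c, (uint8_t)(c->a & 0xFFu));
}

// plb: always pops exactly 1 byte into DBR.
static void plb(Cpu *c) { c->dbr = pop8(c); }

// Steps (1)-(5) in order, starting from M=0 and DBR=$00.
static Cpu runSwitch(void) {
    Cpu c = {0};
    sepM(&c);
    ldaImm(&c, 0x02);
    pha(&c);
    plb(&c);
    repM(&c);
    return c;
}

uint8_t dbrAfterSwitch(void)      { return runSwitch().dbr; }
int     leftoverAfterSwitch(void) { return runSwitch().depth; }

// For contrast: a 16-bit pha followed by plb strands one byte on the stack.
int leftoverAfter16BitPush(void) {
    Cpu c = {0};
    c.a = 0x0002;
    pha(&c);    // M=0: pushes 2 bytes
    plb(&c);    // pops only 1
    return c.depth;
}
```

In this model the five-step sequence ends with DBR=$02 and a balanced stack, while the 16-bit push leaves one byte behind. The model does not cover instruction decoding, so it says nothing about how the raw .byte pair would be interpreted if M were 0.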

When you need it

You need to switch DBR whenever you want to access data at an absolute address $XXXX and need it to land in a specific bank. Common cases:

  • MMIO from the test harness: *(volatile uint16_t *)0x5000 = x; Without DBR=2, this would go to bank 0's $5000 (which is in the language card area). With DBR=2, it goes to $02:5000, which is where runInMame.sh --check 0x025000=... reads from.
  • Anything in $C000-$CFFF: bank 0 has soft-switches here. Bank 2 has plain RAM.
  • Global arrays declared at link-time at fixed addresses: the linker may place them in bank 2 BSS (--bss-base 0x020000). Your DBR must match.

You DON'T need DBR=2 for:

  • Local variables on the stack: stack-relative addressing ignores DBR; lda $4,s reads from the stack regardless of DBR.
  • Direct-page accesses: lda $D0 reads from $00:00D0 (always bank 0). DP is anchored to bank 0.
  • Indirect-long pointers via [dp],y: these include their own bank byte and ignore DBR.
  • Function calls: jsl uses PBR + a long destination address. PBR is updated automatically.

Other ways to access non-bank-0 data

If you only need to write to a single non-bank-0 address, you can emit the store as STA_Long (24-bit absolute) which encodes the bank inline:

*(volatile unsigned short *)0x025000 = 42;   // becomes sta $025000

The W65816 backend recognizes const-int pointer + integer offset and lowers to sta long if the address has a bank byte. No switchToBank2() needed.

For frequent data work in a bank, switching DBR once and using plain sta $5000 (3 bytes: opcode plus a 16-bit address) is smaller and faster than sta $025000 (4 bytes) per access.

Caveats

  • Save/restore is your problem. switchToBank2() never restores DBR. If your caller expected DBR=0, you've broken its expectation. For long-running programs, that's usually fine (you just set DBR=2 once and stay there). For toolbox calls, GS/OS might assume DBR=0 — check the call's documentation.
  • The stack is in bank 0 regardless. Don't try to put the stack elsewhere; the 65816's stack-relative addressing modes ignore DBR.
  • Interrupts taken while M=1 may behave differently. The sep affects A's width but not the bank-switching machinery itself. Keep the sep/rep window short.
  • PBR vs DBR are independent. Code execution stays where it was; only data accesses change.

How runInMame.sh --check 0x025000=... works

The check address 0x025000 is a 24-bit address: bank $02, offset $5000. The MAME Lua runner reads this byte (and the next byte if you specify a 2-byte value) directly from physical RAM, bypassing DBR entirely. So the convention is:

  1. Your program switches DBR to bank 2.
  2. Your program writes its result to *(volatile X *)0x5000, which becomes sta $5000 — landing in bank 2 because of DBR.
  3. MAME reads bank 2's $5000 via the absolute 24-bit address.
  4. The runner compares to your expected value.

If you forget switchToBank2(), your store goes to the language card area (bank 0's $5000), MAME's check reads bank 2's unchanged $5000 (likely $00 or whatever was there), and the test fails.
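
The pass and fail cases above can be reproduced with a small host-side model. This is a sketch under stated assumptions, not the runner's actual code: the toy memory covers only banks $00-$02, the helper names are invented, and the 2-byte value is assumed to be stored and compared low byte first.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Toy memory: banks $00-$02 only (an assumption to keep the sketch small).
static uint8_t ram[3][0x10000];

// Models `sta $offset` with a 16-bit A: DBR picks the bank, low byte first.
static void storeWordAbs(uint8_t dbr, uint16_t offset, uint16_t value) {
    assert(dbr < 3 && offset != 0xFFFFu);
    ram[dbr][offset] = (uint8_t)(value & 0xFFu);
    ram[dbr][offset + 1] = (uint8_t)(value >> 8);
}

// Models the runner's read: a full 24-bit address, DBR not consulted.
static uint16_t readWord24(uint32_t addr24) {
    uint8_t bank = (uint8_t)(addr24 >> 16);
    uint16_t off = (uint16_t)(addr24 & 0xFFFFu);
    assert(bank < 3 && off != 0xFFFFu);
    return (uint16_t)(ram[bank][off] | (ram[bank][off + 1] << 8));
}

// Stores 42 through $5000 with the given DBR, then checks $02:5000
// against 0x002a. Returns 1 when the check would pass.
int checkSees42(uint8_t dbrAtStore) {
    memset(ram, 0, sizeof ram);
    storeWordAbs(dbrAtStore, 0x5000, 42);
    return readWord24(0x025000u) == 0x002a;
}
```

checkSees42(2) corresponds to a program that called switchToBank2() first; checkSees42(0) corresponds to one that forgot, where the store lands in bank 0 and the check reads an untouched bank 2.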

Examples

Hello, integer

__attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile (
        "sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"
    );
}

int main(void) {
    int x = 42;
    switchToBank2();
    *(volatile int *)0x5000 = x;
    while (1) {}
}

Build & run:

clang --target=w65816 -O2 -c hello.c -o hello.o
link816 -o hello.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o hello.o
bash scripts/runInMame.sh hello.bin --check 0x025000=002a    # 0x2a = 42

Recursion + printing

#include <stdio.h>
#include <stdlib.h>

unsigned long fib(unsigned n) {
    if (n < 2) return n;
    return fib(n-1) + fib(n-2);
}

__attribute__((noinline)) void switchToBank2(void) {
    __asm__ volatile (
        "sep #0x20\n.byte 0xa9,0x02\npha\nplb\nrep #0x20\n"
    );
}

int main(void) {
    char buf[32];
    int len = snprintf(buf, sizeof buf, "fib(10) = %lu", fib(10));
    switchToBank2();
    // Copy buf to $025000 so we can read it after the run
    for (int i = 0; i <= len; i++)
        ((volatile char *)0x5000)[i] = buf[i];
    while (1) {}
}

Build (note: need snprintf.o for snprintf):

clang --target=w65816 -O2 -I runtime/include -c fib.c -o fib.o
link816 -o fib.bin --text-base 0x1000 \
    runtime/crt0.o runtime/libc.o runtime/libgcc.o \
    runtime/snprintf.o runtime/softDouble.o runtime/sscanf.o fib.o
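
To know which bytes to expect at $02:5000 after the run, the same computation can be done with a host compiler. The sketch below reuses fib from the example; fibMessage is a hypothetical helper added only for this check.

```c
#include <stdio.h>

unsigned long fib(unsigned n) {
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

// Formats the same message as the example into a static buffer.
const char *fibMessage(unsigned n) {
    static char buf[32];
    snprintf(buf, sizeof buf, "fib(%u) = %lu", n, fib(n));
    return buf;
}
```

For n = 10 the message is "fib(10) = 55": 12 characters plus the terminating NUL, which the copy loop in the example also writes because it runs up to and including len.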

Apple IIgs Toolbox

#include <iigs/toolbox_full.h>

int main(void) {
    DrawString("\pHello, World");
    while (1) {}
}

Build:

clang --target=w65816 -O2 -I runtime/include -c hello_gs.c -o hello_gs.o
link816 -o hello_gs.bin --text-base 0x1000 \
    runtime/crt0Gsos.o runtime/iigsToolbox.o runtime/iigsGsos.o \
    runtime/libgcc.o hello_gs.o

Use crt0Gsos.o (not crt0.o) for programs that call into the toolbox — it sets up the IIgs runtime environment.

Inline assembly

The W65816 backend supports __asm__ with operand constraints "a", "x", "y":

unsigned short addOne(unsigned short x) {
    unsigned short r;
    __asm__("inc a" : "=a"(r) : "a"(x));
    return r;
}

Multi-instruction asm and raw bytes both work:

__asm__ volatile (
    "sep #0x20\n"
    ".byte 0x68\n"      // pla
    "rep #0x20\n"
);

The .byte 0xa9, ... form is sometimes needed to work around llvm-mc encoding gaps — the assembler doesn't yet support every 65816 addressing mode literally. The pattern works for any opcode whose mnemonic doesn't yet parse.

Tools reference

| Tool | Location | Purpose |
|------|----------|---------|
| clang | tools/llvm-mos-build/bin/clang | C/C++ compiler |
| llvm-mc | tools/llvm-mos-build/bin/llvm-mc | Assembler |
| llvm-objdump | tools/llvm-mos-build/bin/llvm-objdump | Disassembler |
| llc | tools/llvm-mos-build/bin/llc | Standalone codegen (.ll → .s) |
| link816 | tools/link816 | Our relocating linker |
| omfEmit | tools/omfEmit | Emit OMF v2.1 binary from link816 output |
| mame | apt (system-wide) | Apple IIgs emulator |

Debugging

Look at the asm

clang --target=w65816 -O2 -S -o prog.s prog.c

Look at the MIR after each pass

clang --target=w65816 -O2 -mllvm -print-after-all -S prog.c 2>&1 | less

Useful pass names to filter on:

| Pass name | What it does |
|-----------|--------------|
| w65816-isel | SDAG → MachineInstr selection |
| w65816-widen-acc16 | Promote Acc16 vregs to Wide16 (regalloc help) |
| w65816-stack-slot-cleanup | Remove redundant spill/reload |
| w65816-stackrel-to-img | Promote hot stack slots to DP IMG slots |
| w65816-stack-slot-merge | Collapse PHI src/dst slot pairs |
| w65816-branch-expand | Long-distance Bxx → INV_Bxx skip;BRA |

Single-pass filter

clang --target=w65816 -O2 -mllvm -print-after=w65816-isel \
    -mllvm -filter-print-funcs=myfunc -S prog.c 2>&1 | less

Cycle-count benchmarks

Eight microbenchmarks live under benchmarks/. Each runs N iterations of the bench function and reports a per-call cycle count via MAME's emu.time():

bash scripts/benchCyclesPrecise.sh

Output:

| Benchmark | Per-call cycles (clang) |
|-----------|------------------------:|
| bsearch | 767 cyc/call |
| dotProduct | 2131 cyc/call |
| fib | 12617 cyc/call |
| memcmp | 989 cyc/call |
| popcount | 2864 cyc/call |
| strcpy | 2216 cyc/call |
| sumOfSquares | 16709 cyc/call |

The compare/ directory has side-by-side .s files vs Calypsi 5.16 for sumSquares, evalAt, and mul16to32. Rerun with:

bash compare/regen.sh

Known limitations

  • C++ exceptions are not implemented. try/catch compiles but doesn't unwind. -fsjlj-exceptions works for limited SJLJ-style throwing.
  • stdin always returns EOF. scanf compiles but isn't useful. Use sscanf on a buffer instead.
  • File I/O through fopen etc. requires a backing implementation. The default mfs backing (memory-file-system) lets you simulate files via mfsRegister() — useful for tests, not for real disk I/O. GS/OS file I/O works via runtime/iigsGsos.o if you link against the GS/OS runtime.
  • fork/exec — not applicable on a 65816, no support.
  • Code generation gotcha: very large frames (>200 bytes) trigger FP-relative addressing. Most programs fit under that limit. See the frame-rel discussion in LLVM_65816_DESIGN.md.
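
For the stdin limitation above, parsing from a buffer with sscanf might look like the following minimal sketch. The helper names parsePair and sumOfPair are invented for illustration; on the target this presumably needs runtime/sscanf.o on the link line, and the sketch assumes the runtime's sscanf handles %d like the standard one.

```c
#include <stdio.h>

// Parses two decimal integers from a text buffer. Returns 1 on success.
int parsePair(const char *text, int *first, int *second) {
    return sscanf(text, "%d %d", first, second) == 2;
}

// Convenience wrapper: sum of the pair, or -1 if parsing fails.
int sumOfPair(const char *text) {
    int a, b;
    if (!parsePair(text, &a, &b))
        return -1;
    return a + b;
}
```

The input text can come from any in-memory source, for example a string constant compiled into the program or a buffer filled by a toolbox or GS/OS call.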

Where to go next

  • Building real GS/OS apps: see docs/multiSegmentPlan.md and the runViaFinder.sh script for booting through real GS/OS 6.0.2 in MAME.
  • Backend internals (you're hacking on the compiler): LLVM_65816_DESIGN.md.
  • Smoke tests: scripts/smokeTest.sh runs ~150 end-to-end checks. Read it for examples of every feature in action.