65816-llvm-mos/LLVM_65816_DESIGN.md
Scott Duensing 873eab4922 Checkpoint.
2026-04-25 17:07:28 -05:00

# LLVM 65816 Backend for Apple IIgs
## Project Design and Handoff Document
---
## 1. Project Goal
Build an optimized Clang/LLVM C compiler backend targeting the WDC 65816
processor, specifically for Apple IIgs development. The backend must produce
genuinely optimized output — not just correct code, but tight, cycle-efficient
65816 assembly that takes full advantage of the architecture.

ORCA/C is the existing open source option and its code generation quality is
poor. Calypsi is the commercial alternative with good output quality but is
closed source. This project aims to produce an open source backend that matches
or exceeds Calypsi's output quality.

---
## 2. Background and Context
### 2.1 The 65816 Architecture
The WDC W65C816S is a 16-bit microprocessor used in the Apple IIgs (and SNES).
Key characteristics relevant to code generation:
- **Dynamic register width:** The A (accumulator), X, and Y registers can be
either 8-bit or 16-bit depending on the M and X bits in the processor status
register. Mode switching is done via REP (Reset bits) and SEP (Set bits)
instructions. REP #$20 sets 16-bit accumulator mode; SEP #$20 sets 8-bit.
REP #$10 sets 16-bit index registers; SEP #$10 sets 8-bit.
- **Direct page:** A relocatable 256-byte window in bank 0 RAM, similar to
zero page on the 6502 but movable. The Direct Page Register (DP) holds the
base address. Direct page addressing saves one byte per instruction and one
cycle over absolute addressing, provided the low byte of DP is zero; a DP
that is not page-aligned costs one extra cycle per access. Highly valuable
for hot variables.
- **24-bit address space:** Addresses are bank:offset. The Data Bank Register
(DBR) implicitly provides the bank for 16-bit absolute data accesses. The
Program Bank Register (PBR) provides the bank for code. Long (24-bit)
addressing is available for cross-bank access.
- **Stack:** 16-bit stack pointer in bank 0. Stack-relative addressing is
available but expensive.
- **Native vs. emulation mode:** At reset the CPU starts in emulation mode
(behaves like a 65C02). Native mode (16-bit capable) is entered by clearing
the emulation bit (CLC followed by XCE). All 16-bit IIgs code runs in
native mode.
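
The REP/SEP rules above can be illustrated with a small standalone sketch. It is not part of the backend: `ModeSwitchT` and `planModeSwitch` are hypothetical names chosen for this example, and the sketch only considers the M and X bits of the status register. It computes which REP and SEP instructions are needed to get from one width state to another.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint8_t kMBit = 0x20;      // P bit 5: 1 = 8-bit accumulator
constexpr uint8_t kXBit = 0x10;      // P bit 4: 1 = 8-bit index registers
constexpr uint8_t kRepOpcode = 0xC2; // REP #imm clears the masked P bits
constexpr uint8_t kSepOpcode = 0xE2; // SEP #imm sets the masked P bits

// Hypothetical helper type: one REP or SEP with its immediate mask.
struct ModeSwitchT {
  uint8_t opcode;
  uint8_t mask;
};

// Returns the instructions needed to move the M and X bits from `current`
// to `wanted`. Both arguments may only contain the M and X bits.
std::vector<ModeSwitchT> planModeSwitch(uint8_t current, uint8_t wanted) {
  const uint8_t widthBits = kMBit | kXBit;
  assert((current & ~widthBits) == 0 && (wanted & ~widthBits) == 0);
  // Bits going 1 -> 0 select 16-bit mode and need a REP.
  const uint8_t toClear = static_cast<uint8_t>(current & ~wanted & widthBits);
  // Bits going 0 -> 1 select 8-bit mode and need a SEP.
  const uint8_t toSet = static_cast<uint8_t>(~current & wanted & widthBits);
  std::vector<ModeSwitchT> plan;
  if (toClear != 0) plan.push_back({kRepOpcode, toClear});
  if (toSet != 0) plan.push_back({kSepOpcode, toSet});
  return plan;
}
```

For example, `planModeSwitch(0x30, 0x00)` yields a single `REP #$30`, while going from 8-bit A with 16-bit index registers (`0x20`) to the opposite state (`0x10`) needs both a REP and a SEP.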
### 2.2 Why This Is Hard for LLVM
LLVM's register allocator assumes fixed-width registers. The 65816's dynamic
register widths are a fundamental mismatch. Key problems:
1. **REP/SEP scheduling:** Optimal code minimizes mode switches since each
REP/SEP costs 3 cycles. The backend must decide register width per
region of code, coalescing regions that use the same width to minimize
transitions. This is a global dataflow problem.
2. **Direct page as a register file:** LLVM has no model for a relocatable
zero page. Hot variables should be allocated to direct page offsets, but
GS/OS (the IIgs operating system) reserves certain direct page locations
for its own use. The allocator must know which regions are safe.
3. **Bank register management:** DBR affects all 16-bit absolute data accesses.
The backend must ensure DBR is set correctly for cross-bank data access.
4. **24-bit address space in 32-bit ELF:** LLVM's ELF format uses 32-bit
addresses. The llvm-mos project has already defined an ELF extension for
MOS processors that handles this correctly (ELFCLASS32 with unused high
bits zeroed).
5. **GS/OS calling conventions:** Apple IIgs toolbox calls use a stack-based
parameter protocol different from any standard calling convention LLVM
knows. Custom lowering is required for toolbox calls.
### 2.3 Existing Work
- **llvm-mos:** An open source LLVM fork providing first-class support for
MOS Technology 65xx processors. The 6502 backend is production quality.
The 65816 assembler (llvm-mc) has partial support — direct page assumed
at 0x0000, 24-bit long addressing supported, 16-bit immediates supported.
The **compiler backend** (C to 65816) is incomplete. Issues #32 and #321
in the llvm-mos GitHub track this work. The core blocker is REP/SEP
register width management.
- **jeremysrand/llvm-65816:** A separate LLVM backend attempt specifically
targeting Apple IIgs, aiming to output Merlin 32-compatible assembly.
The repo explicitly states "Don't even try to use it yet" and appears
stalled.
- **WorldsApartDevTeam/65816-c:** Claims C11 compliance for 65C816. Only
2 stars, 28 commits, no releases. Very early stage.
- **Calypsi:** Commercial closed-source compiler, version 5.16 released
April 15, 2026. Best available 65816 C compiler. Free for hobby use.
Use its output as a quality benchmark.
- **ORCA/C:** Open source, runs on-machine or via emulator. Mature but
generates poor code. The baseline we need to beat significantly.
### 2.4 llvm-mos as the Foundation
This project is built on top of llvm-mos, not vanilla LLVM:
- llvm-mos already has the MOS ELF specification implemented
- It has the 6502 backend as a reference for 65xx-family conventions
- It has the assembler-level 65816 support to build on
- It has the SDK infrastructure for platform-specific runtime support
- GitHub: https://github.com/llvm-mos/llvm-mos
### 2.5 Architectural Decision: Separate W65816 Target
llvm-mos already defines `FeatureW65816` as a subtarget feature of MOS, and
tracks the missing codegen support in issue #321. We could have added 16-bit
register handling to the existing MOS target as a subtarget feature.
**We did not.** This project is a **separate W65816 target**, maintained as
our own fork of llvm-mos. Reasons:
- Clean register model. MOS's register classes assume 8-bit hardware; aliasing
A/X/Y across 8 and 16-bit widths inside MOS would require invasive changes
to an already-shipping target. A fresh target can define
`Acc8`/`Acc16`/`Idx8`/`Idx16` classes directly and let the REP/SEP pass
operate at the MIR level.
- Independent evolution. This codebase can move at its own pace without
coordinating against llvm-mos's 6502 stability guarantees.
- No upstream burden. We are not attempting to land this in llvm-mos; this is
not a subtarget-feature extension to MOS, it is a sibling target.

The MOS target remains available in the same source tree, unchanged. We
borrow infrastructure (ELF writer, MC layer patterns) but implement our own
registers, instructions, scheduling, and lowering.

---
## 3. Architecture
### 3.1 LLVM Backend Structure
A standard LLVM backend consists of:
```
Clang frontend
LLVM IR (architecture-independent)
Target-specific lowering (SelectionDAG or GlobalISel)
Machine IR (MIR)
Register allocation
Instruction scheduling
Code emission (assembly or object file)
```
The backend adds:
- **TableGen definitions:** Register file, instruction encodings, calling
conventions, operand types, instruction selection patterns
- **Target machine:** Configuration, subtarget features, data layout
- **Instruction selector:** Lowers LLVM IR operations to 65816 instructions
- **Register allocator customization:** Handles the special register width
and direct page constraints
- **Pass pipeline:** Custom optimization passes for REP/SEP scheduling and
direct page allocation
### 3.2 Register File Definition
```tablegen
// All registers live in the W65816 namespace.
let Namespace = "W65816" in {
// Processor status register bits. Their assembler names must not clash
// with the X index register, so they are not simply "m" and "x".
def MBit : Register<"mflag">; // accumulator width (0=16-bit, 1=8-bit)
def XBit : Register<"xflag">; // index width (0=16-bit, 1=8-bit)
// Main registers
def A : Register<"a">; // accumulator (8 or 16-bit)
def X : Register<"x">; // index X (8 or 16-bit)
def Y : Register<"y">; // index Y (8 or 16-bit)
def SP : Register<"sp">; // stack pointer (16-bit)
def DP : Register<"dp">; // direct page register (16-bit)
def DBR : Register<"dbr">; // data bank register (8-bit)
def PBR : Register<"pbr">; // program bank register (8-bit)
def PC : Register<"pc">; // program counter (16-bit)
def P : Register<"p">; // processor status
}
// Register classes. The same physical register appears in both the 8-bit
// and the 16-bit class; the REP/SEP pass decides which width is live.
def Acc8 : RegisterClass<"W65816", [i8], 8, (add A)>;
def Acc16 : RegisterClass<"W65816", [i16], 16, (add A)>;
def Idx8 : RegisterClass<"W65816", [i8], 8, (add X, Y)>;
def Idx16 : RegisterClass<"W65816", [i16], 16, (add X, Y)>;
```
### 3.3 REP/SEP Scheduling Strategy
The core algorithmic challenge. Proposed approach:
**Phase 1 — Width inference:** For each basic block, determine the "natural"
width of each operation: loads/stores of i8 values prefer 8-bit mode, i16
values prefer 16-bit mode.
**Phase 2 — Width coalescing:** Run a dataflow analysis across the CFG. Assign
each basic block a preferred width that minimizes the total number of mode
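switches on all edges. With only two candidate widths per register this is a
minimum cut / graph partitioning problem. A greedy approach is the proposed
starting point; its quality still has to be measured against Calypsi output.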
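
To make the cost trade-off concrete, here is a minimal sketch of width coalescing restricted to a straight-line chain of blocks, where the optimum can be found exactly with dynamic programming. Everything in it is an assumption for illustration: `BlockCostT`, `WidthPlanT` and `coalesceWidths` are hypothetical names, the per-block cycle estimates are invented inputs, and a real pass would have to handle a general CFG.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class RegWidthT { W8, W16 };

// Hypothetical cycle estimates for running one block in each width.
struct BlockCostT {
  uint32_t cost8;
  uint32_t cost16;
};

struct WidthPlanT {
  std::vector<RegWidthT> widths;
  uint32_t totalCycles;
};

constexpr uint32_t kSwitchCycles = 3; // one REP or SEP

// Picks a width for every block in a chain so that block costs plus
// mode switch costs are minimal.
WidthPlanT coalesceWidths(const std::vector<BlockCostT> &blocks) {
  assert(!blocks.empty());
  const std::size_t n = blocks.size();
  // best[i][w]: cheapest cost of blocks 0..i with block i in width w
  // (index 0 = 8-bit, 1 = 16-bit). from[i][w]: width chosen for block i-1.
  std::vector<std::array<uint32_t, 2>> best(n);
  std::vector<std::array<int, 2>> from(n);
  best[0] = {blocks[0].cost8, blocks[0].cost16};
  from[0] = {0, 1};
  for (std::size_t i = 1; i < n; ++i) {
    const uint32_t own[2] = {blocks[i].cost8, blocks[i].cost16};
    for (int w = 0; w < 2; ++w) {
      const uint32_t stay = best[i - 1][w];
      const uint32_t flip = best[i - 1][1 - w] + kSwitchCycles;
      from[i][w] = (stay <= flip) ? w : 1 - w;
      best[i][w] = own[w] + std::min(stay, flip);
    }
  }
  int w = (best[n - 1][0] <= best[n - 1][1]) ? 0 : 1;
  WidthPlanT plan{std::vector<RegWidthT>(n), best[n - 1][w]};
  // Walk backwards to recover the chosen width of each block.
  for (std::size_t i = n; i-- > 0;) {
    plan.widths[i] = (w == 0) ? RegWidthT::W8 : RegWidthT::W16;
    w = from[i][w];
  }
  return plan;
}
```

With costs `{5, 20}, {11, 10}, {5, 20}` the middle block stays in 8-bit mode even though it is one cycle cheaper in 16-bit mode, because two switches would cost six cycles. Raising its 8-bit cost to 30 makes the switches worthwhile.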
**Phase 3 — Transition insertion:** At points where the width changes, insert
REP or SEP pseudo-instructions. These are later lowered to real REP/SEP.
**Phase 4 — Peephole cleanup:** Eliminate redundant REP/SEP pairs. If a
block ends with SEP #$20 and every successor starts with REP #$20, with no
width-sensitive instruction in between, the SEP is dead and can be removed.
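
The peephole idea can be sketched on a straight-line instruction sequence. This is a simplified model, not the planned MIR pass: `OpT` and `removeRedundantSwitches` are hypothetical names, only the accumulator width is tracked, and every `Other` instruction is assumed to leave the M bit unchanged (no PLP, RTI, or calls).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified instruction model: REP #$20, SEP #$20, or anything else.
enum class OpT { Rep20, Sep20, Other };

// Removes mode switches that cannot affect any instruction.
std::vector<OpT> removeRedundantSwitches(const std::vector<OpT> &code) {
  std::vector<OpT> out;
  bool known = false; // is the M bit known at this point?
  bool mIs8 = false;  // current accumulator width, valid when known
  for (std::size_t i = 0; i < code.size(); ++i) {
    const OpT op = code[i];
    if (op == OpT::Other) {
      out.push_back(op);
      continue;
    }
    // A switch directly followed by another switch is overwritten
    // before anything can observe it.
    if (i + 1 < code.size() && code[i + 1] != OpT::Other) continue;
    const bool want8 = (op == OpT::Sep20);
    // A switch to the width that is already active does nothing.
    if (known && mIs8 == want8) continue;
    out.push_back(op);
    known = true;
    mIs8 = want8;
  }
  assert(out.size() <= code.size());
  return out;
}
```

Given `REP, op, SEP, REP, op, REP, op`, the sketch keeps only the first REP and the three ordinary instructions.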
### 3.4 Direct Page Allocation
A custom LLVM pass runs after register allocation:
1. Identify the GS/OS-safe direct page region. The working assumption is
that GS/OS uses $00-$9F for toolbox and system use, leaving $A0-$FF
(96 bytes) for user code. This is unverified (see Open Question 1) and
should be configurable per project.
2. Score all local variables and spill slots by access frequency (using
LLVM's block frequency analysis).
3. Greedily allocate the highest-frequency variables to direct page
offsets, packing them tightly to fit within the 96-byte budget.
4. Rewrite all accesses to direct-page-allocated variables to use DP
addressing mode instead of stack-relative or absolute addressing.
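
Steps 2 and 3 can be sketched as a standalone greedy allocator. This is an illustration only: `DpCandidateT`, `DpSlotT` and `allocateDirectPage` are hypothetical names, the frequencies are plain numbers rather than LLVM block frequency data, and the default region is the provisional $A0-$FF window from step 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// A variable or spill slot that could live in the direct page.
struct DpCandidateT {
  std::string name;
  uint32_t sizeBytes;
  uint64_t frequency; // estimated access count
};

// A candidate that was given a direct page offset.
struct DpSlotT {
  std::string name;
  uint8_t offset;
};

// Greedily places the most frequently accessed candidates first and
// skips any candidate that no longer fits in the region.
std::vector<DpSlotT> allocateDirectPage(std::vector<DpCandidateT> candidates,
                                        uint32_t regionStart = 0xA0,
                                        uint32_t regionSize = 96) {
  assert(regionStart + regionSize <= 256);
  std::stable_sort(candidates.begin(), candidates.end(),
                   [](const DpCandidateT &a, const DpCandidateT &b) {
                     return a.frequency > b.frequency;
                   });
  std::vector<DpSlotT> slots;
  uint32_t used = 0;
  for (const DpCandidateT &c : candidates) {
    if (c.sizeBytes == 0 || c.sizeBytes > regionSize - used) continue;
    slots.push_back({c.name, static_cast<uint8_t>(regionStart + used)});
    used += c.sizeBytes;
  }
  return slots;
}
```

Sorting by raw frequency follows the description above; ranking by frequency per byte would be an alternative worth comparing once real data is available, since one large hot buffer can crowd out several small hot scalars.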
### 3.5 Calling Convention
Standard function calls:

- Parameters passed on stack (push right-to-left, caller cleans)
- Return value in A (8-bit or 16-bit) or A:X (32-bit)
- Callee saves: DP, DBR, direct page contents if modified
- Stack frame: return address (3 bytes for a JSL long call, 2 bytes for JSR), then locals

GS/OS toolbox calls:

- Parameters pushed as a parameter block
- Tool call via JSL $E10000 (tool dispatcher), with the tool number in X
- Custom call lowering needed: mark with a special ISD node and lower
to JSL with the correct tool number encoding
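
As a sketch of what the lowered toolbox call might look like at the byte level, the helper below encodes the dispatch sequence. It rests on assumptions that should be checked against the Apple IIgs Toolbox Reference: that the tool number is passed in X with the tool set in the low byte and the function number in the high byte, and that index registers are 16 bits wide at the call site. `encodeToolCall` is a hypothetical name; pushing the parameters is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes "LDX #toolNumber" followed by "JSL $E10000".
// Assumes 16-bit index registers, so LDX takes a two-byte immediate.
std::vector<uint8_t> encodeToolCall(unsigned toolSet, unsigned function) {
  assert(toolSet <= 0xFF && function <= 0xFF);
  return {
      0xA2,                           // LDX #imm16
      static_cast<uint8_t>(toolSet),  // low byte: tool set (assumed layout)
      static_cast<uint8_t>(function), // high byte: function (assumed layout)
      0x22,                           // JSL long
      0x00, 0x00, 0xE1                // $E10000, little-endian
  };
}
```

For tool set `0x02`, function `0x09`, the result is `A2 02 09 22 00 00 E1`.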
### 3.6 Memory Model
The Apple IIgs memory map relevant to code generation:
```
Bank 00:     $0000-$01FF  Zero page + stack (system reserved)
             $0200-$9FFF  User RAM (conventional)
             $A000-$BFFF  User RAM or LC bank
             $C000-$CFFF  I/O space
             $D000-$FFFF  ROM / language card
Bank 01:     $0000-$FFFF  Additional RAM
Bank E0:     $0000-$FFFF  Mega II (Apple //e on a chip) I/O space
Bank E1:     $0000-$FFFF  Video/reserved
Bank FC-FF:               ROM banks
```
The default code model places code in bank 0 (or bank 1 for large programs)
and data in bank 0. Long (24-bit) pointers are needed for cross-bank access.

---
## 4. Key Optimizations
### 4.1 Register Width Selection (highest impact)
Using 16-bit accumulator for word operations vs. two 8-bit operations cuts
cycle counts roughly in half for arithmetic. The REP/SEP scheduling described
above is the primary mechanism.
### 4.2 Direct Page Promotion (second highest impact)
Direct page addressing saves 1 byte and 1 cycle per instruction versus
absolute addressing. For variables accessed in tight loops, this compounds
significantly. The allocation pass described above handles this.
### 4.3 Addressing Mode Selection
The instruction selector must choose the most efficient addressing mode:
- Direct page (8-bit offset): preferred for hot variables
- Absolute (16-bit address): for global/static data in same bank
- Long (24-bit address): for cross-bank data access
- Stack-relative: avoid where possible, expensive on 65816
- Indexed: X or Y register indexed variants of the above
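
The preference order above can be sketched as a small selection function. This is an illustration under stated assumptions, not the instruction selector: `AddrModeT`, `selectAddrMode` and `operandBytes` are hypothetical names, the address, DP base and DBR value are assumed to be known constants, and stack-relative and indexed forms are left out.

```cpp
#include <cassert>
#include <cstdint>

enum class AddrModeT { DirectPage, Absolute, Long };

// Picks the shortest addressing mode that can reach a 24-bit address,
// given the current direct page base and data bank register.
AddrModeT selectAddrMode(uint32_t address, uint16_t dpBase, uint8_t dataBank) {
  assert(address <= 0xFFFFFF);
  const uint8_t bank = static_cast<uint8_t>(address >> 16);
  const uint16_t offset = static_cast<uint16_t>(address & 0xFFFF);
  // The direct page is a 256-byte window in bank 0 starting at dpBase.
  const uint16_t delta = static_cast<uint16_t>(offset - dpBase);
  if (bank == 0 && delta <= 0xFF) return AddrModeT::DirectPage;
  // Absolute addressing takes its bank from DBR.
  if (bank == dataBank) return AddrModeT::Absolute;
  return AddrModeT::Long;
}

// Operand size in bytes for each mode (excluding the opcode byte).
unsigned operandBytes(AddrModeT mode) {
  switch (mode) {
  case AddrModeT::DirectPage: return 1;
  case AddrModeT::Absolute: return 2;
  case AddrModeT::Long: return 3;
  }
  return 3;
}
```

With DP at `$0000` and DBR at `$02`, address `$0000A4` selects direct page, `$021234` selects absolute, and `$031234` falls back to long addressing.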
### 4.4 MVN/MVP for Block Moves
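The 65816 has block move instructions (MVN = move negative, MVP = move
positive) that copy memory at 7 cycles per byte. These map to LLVM's
memcpy/memmove intrinsics. The backend should recognize and emit these for
small to medium copies rather than emitting a loop. Both instructions leave
DBR set to the destination bank, so the lowering must restore DBR afterwards
if the surrounding code depends on it.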
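
A sketch of the register setup for a block move follows. `BlockMoveT` and `planBlockMove` are hypothetical names, and the sketch refuses copies that would cross a bank boundary instead of splitting them. It picks MVN (ascending addresses) when the destination is at or below the source and MVP (descending addresses, X and Y pointing at the last byte) otherwise, which keeps overlapping memmove-style copies safe.

```cpp
#include <cassert>
#include <cstdint>

// Everything needed to emit one block move.
struct BlockMoveT {
  uint8_t opcode;  // 0x54 = MVN, 0x44 = MVP
  uint8_t dstBank; // first operand byte in the machine encoding
  uint8_t srcBank; // second operand byte in the machine encoding
  uint16_t x;      // source offset
  uint16_t y;      // destination offset
  uint16_t a;      // byte count minus one
};

BlockMoveT planBlockMove(uint32_t src, uint32_t dst, uint32_t length) {
  assert(src <= 0xFFFFFF && dst <= 0xFFFFFF);
  assert(length >= 1 && length <= 0x10000);
  const uint32_t srcOff = src & 0xFFFF;
  const uint32_t dstOff = dst & 0xFFFF;
  // This sketch does not split copies that would wrap within a bank.
  assert(srcOff + length <= 0x10000 && dstOff + length <= 0x10000);
  BlockMoveT move;
  move.dstBank = static_cast<uint8_t>(dst >> 16);
  move.srcBank = static_cast<uint8_t>(src >> 16);
  move.a = static_cast<uint16_t>(length - 1);
  if (dst > src) {
    // Copy from the top down so an overlapping tail is not overwritten.
    move.opcode = 0x44;
    move.x = static_cast<uint16_t>(srcOff + length - 1);
    move.y = static_cast<uint16_t>(dstOff + length - 1);
  } else {
    move.opcode = 0x54;
    move.x = static_cast<uint16_t>(srcOff);
    move.y = static_cast<uint16_t>(dstOff);
  }
  return move;
}
```

Copying `$100` bytes from `$012000` down to `$011000` yields MVN with X=`$2000`, Y=`$1000`, A=`$00FF`; shifting a `$20`-byte buffer up by 16 bytes yields MVP with X and Y at the end of each range.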
### 4.5 Stack Frame Minimization
Minimize spills to the stack — they use expensive stack-relative addressing.
Prefer direct page allocation for spill slots (via the DP allocator above).

---
## 5. Testing Strategy
### 5.1 MAME Lua Interface for Automated Testing
MAME's Lua scripting interface allows automated correctness testing against
an emulated Apple IIgs.

Test harness workflow:
1. Compile a C test case with the backend
2. Load the binary into MAME's IIgs emulation
3. A Lua script sets breakpoints, runs the code, checks register and
memory state against expected values
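4. Report pass/fail

Example Lua test script structure. This is a sketch: MAME's documented
`bpset` takes its condition and action as debugger command strings rather
than a Lua callback, so the callback form shown here needs adapting to the
MAME version in use: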
```lua
emu.add_machine_reset_notifier(function()
local cpu = manager.machine.devices[":maincpu"]
-- breakpoint at end of test function
cpu.debug:bpset(0x010200, "", function()
local a = cpu.state["A"].value
local p = cpu.state["P"].value
local mBit = (p >> 5) & 1 -- M bit: 0=16-bit, 1=8-bit
-- validate result and register width state
if a ~= 0x1234 then
print(string.format("FAIL: A=0x%04X expected 0x1234", a))
elseif mBit ~= 0 then
print("FAIL: expected 16-bit accumulator mode")
else
print("PASS")
end
manager.machine:exit()
end)
end)
```
REP/SEP tracing — validate mode switch sequences:
```lua
-- Trace REP/SEP instructions to validate width scheduling.
-- NOTE: schematic. MAME's documented wpset takes (space, type, addr, len,
-- cond, action) with type "r", "w" or "rw" and has no execute watchpoint,
-- so logModeSwitch has to be driven by a per-instruction hook instead.
local cpu = manager.machine.devices[":maincpu"]
local program = cpu.spaces["program"]
local function logModeSwitch()
  -- check if instruction at PC is REP or SEP (PC is the 16-bit offset only)
  local pc = cpu.state["PC"].value
  local opcode = program:read_u8(pc)
  if opcode == 0xC2 or opcode == 0xE2 then
    local name = (opcode == 0xC2) and "REP" or "SEP"
    local operand = program:read_u8(pc + 1)
    print(string.format("%s #$%02X at $%04X", name, operand, pc))
  end
end
```
### 5.2 Calypsi Output Comparison
Compile the same C code with Calypsi and with our backend. Diff the assembly
output. Cases where our output is significantly worse (more instructions,
more REP/SEP transitions, missing direct page usage) are optimization bugs.
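
Before diffing by hand, a rough metric pass over both assembly listings can flag the worst cases. The sketch below counts instructions and REP/SEP transitions in assembly text. `AsmStatsT` and `measureAsm` are hypothetical names, and the parsing rules (`;` comments, labels ending in `:`, directives starting with `.`) are assumptions that will need adjusting to the actual Calypsi and llvm-mc output syntax.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

struct AsmStatsT {
  unsigned instructions = 0;
  unsigned modeSwitches = 0; // REP and SEP instructions
};

// Counts instructions and mode switches in an assembly listing.
AsmStatsT measureAsm(const std::string &text) {
  AsmStatsT stats;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    line = line.substr(0, line.find(';')); // strip trailing comment
    std::istringstream words(line);
    std::string word;
    if (!(words >> word)) continue; // blank line
    // Skip a leading label; keep going if an instruction follows it.
    if (word.back() == ':' && !(words >> word)) continue;
    if (word[0] == '.') continue; // assembler directive
    for (char &c : word)
      c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    ++stats.instructions;
    if (word == "REP" || word == "SEP") ++stats.modeSwitches;
  }
  assert(stats.modeSwitches <= stats.instructions);
  return stats;
}
```

Running it on both compilers' output for the same function gives two `AsmStatsT` values to compare; a large gap in `modeSwitches` points at the REP/SEP scheduling pass, while a gap in `instructions` alone points at instruction selection.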
### 5.3 Test Categories
- Basic arithmetic (8-bit and 16-bit, mixed width)
- Pointer operations (direct page, absolute, long)
- Function calls (stack frame setup/teardown)
- GS/OS toolbox calls
- Struct/array access
- Mixed 8/16-bit operations (the REP/SEP stress test)
- memcpy/memset (MVN/MVP emission)
- Interrupt handlers (bank register save/restore)
---
## 6. GS/OS Integration
### 6.1 Runtime Library
A minimal runtime library is needed:
- `crt0.s`: startup code, sets native mode, initializes DP and DBR,
sets up stack, calls `main()`
- `libc`: subset of C standard library functions adapted for GS/OS
- GS/OS system call wrappers (file I/O, memory management, etc.)
### 6.2 ProDOS 16 / GS/OS Output Format
The linker must produce output in a format the IIgs can load:
- OMF (Object Module Format) — the native Apple IIgs executable format
- Alternatively, a raw binary image for simple programs
- The llvm-mos SDK's infrastructure for platform-specific output formats
can be adapted here
---
## 7. Implementation Order
1. **Fork llvm-mos** and create the W65816 target directory structure
2. **TableGen: registers and instruction set** — describe all 65816 registers,
addressing modes, and instructions
3. **Target machine descriptor** — data layout, calling convention basics,
feature flags (native mode, emulation mode)
4. **Instruction selector** — basic patterns for arithmetic, load/store,
branches; initially target a fixed 16-bit accumulator mode (no REP/SEP)
5. **Calling convention lowering** — stack-based parameter passing, return
values
6. **Basic test suite** — compile simple C functions, verify correctness in
MAME via Lua harness
7. **REP/SEP scheduling pass** — the core optimization; width inference,
coalescing, transition insertion
8. **Direct page allocator** — promote hot variables to DP addressing
9. **GS/OS toolbox call lowering** — custom ISD nodes for JSL-based calls
10. **Runtime library** — crt0, minimal libc, GS/OS wrappers
11. **OMF output** — linker support for IIgs executable format
12. **Optimization tuning** — compare against Calypsi, close gaps
---
## 8. Open Questions
1. **GS/OS direct page reservation:** Exactly which direct page bytes does
GS/OS reserve vs. what is safe for user code? Need to verify against
the GS/OS reference documentation.
2. **Bank model:** Should the default code model assume everything fits in
bank 0/1, or should we support a large memory model with 24-bit pointers
by default? Small model is simpler; large model is more powerful.
3. **Interrupt handler ABI:** What registers must an IIgs interrupt handler
save/restore? DBR and DP especially — they may be in an unknown state
when an interrupt fires.
4. **ORCA/C compatibility:** Do we try to be ABI-compatible with ORCA/C to
allow linking against existing IIgs libraries? Or define a clean new ABI?
ORCA/C compatibility would be valuable but constraining.
5. **REP/SEP in leaf functions:** For small leaf functions that only use one
width, we can omit mode switches entirely if the caller guarantees a
known state. Is a "width contract" calling convention attribute worth
implementing?
6. **Cycle counting in MAME:** MAME's IIgs emulation may not be fully
cycle-accurate. For performance regression testing, do we need real
hardware or a more cycle-accurate emulator?
---
## 9. Coding Conventions
- `camelCase` for functions and variables
- `PascalCase` with `T` suffix for types (e.g., `RegWidthT`, `DpSlotT`)
- `//` single-line comments throughout
- Target language: C++ (LLVM requirement)
- Follow llvm-mos coding style for backend code
- Follow LLVM coding standards for everything else
---
## 10. References
- llvm-mos repository: https://github.com/llvm-mos/llvm-mos
- llvm-mos SDK: https://github.com/llvm-mos/llvm-mos-sdk
- llvm-mos ELF spec: https://llvm-mos.org/wiki/ELF_specification
- llvm-mos 65816 issue #32: https://github.com/llvm-mos/llvm-mos/issues/32
- llvm-mos 65816 issue #321: https://github.com/llvm-mos/llvm-mos/issues/321
- WDC 65816 data sheet: available from westerndesigncenter.com
- Apple IIgs Hardware Reference: archive.org
- Apple IIgs GS/OS Reference: archive.org
- Calypsi compiler (quality benchmark): https://www.calypsi.cc/
- ORCA/C (open source reference): https://github.com/byteworksinc/ORCA-C
- jeremysrand/llvm-65816 (prior attempt): https://github.com/jeremysrand/llvm-65816