# Session Resume — llvm816 project

Drop this into a new Claude Code session and say "read SESSION_STATE.md and continue where we left off." Pairs with `LLVM_65816_DESIGN.md` (the design doc — read that second).

---

## 1. Project in one sentence

Build an open-source LLVM/Clang backend for the WDC 65816 (Apple IIgs) that matches or exceeds Calypsi's output quality, forked from llvm-mos but maintained as our own separate W65816 target.

User is Scott; expert C dev, doesn't want hand-holding on LLVM or 65816 basics.

## 2. Where we are in the plan

Design doc section 7 lists a 12-step implementation order. We are at:

- [x] **Setup toolchain** (prior session)
- [x] **Architectural decision: separate W65816 target** (design doc §2.5)
- [x] **Repo-layout decision:** `src/` holds our authored files, `patches/` holds modifications to upstream llvm-mos files, `tools/llvm-mos/` is ephemeral and gitignored. `scripts/applyBackend.sh` stitches src + patches into the clone.
- [x] **Step 1 — scaffold W65816 target directory.** 41 files under `src/llvm/lib/Target/W65816/` + 2 files under `src/clang/lib/Basic/Targets/`. 4 upstream patches under `patches/`.
- [x] **Step 2 — verify the skeleton fully compiles and links.** All 8 tablegen generators run clean, three static libs (LLVMW65816Info/Desc/CodeGen) build, llc links with the target registered, zero warnings in the W65816-local build. `./bin/llc -march=w65816 -filetype=null /dev/null` → exit 0.
- [x] **Step 2a — real MC-layer instructions.** `W65816InstrInfo.td` now holds ~90 real 65816 opcodes (LDA/STA/LDX/LDY/STX/STY across immediate/DP/abs/DPX/AbsX/AbsY/long where applicable; ADC/SBC/CMP/AND/ORA/EOR/BIT; INC/DEC/ASL/LSR/ROL/ROR; all transfers; stack push/pull; REP/SEP/CLC/SEC/XCE/XBA; branches; JMP/JML/JSR/JSL; RTS/RTL/RTI; MVN/MVP). Instructions whose size depends on M or X bits exist as `_Imm8`/`_Imm16` pairs carrying the appropriate TSFlag bits (MLow/MHigh/XLow/XHigh) for the future REP/SEP pass.
- [x] **Step 2b — wire MCCodeEmitter.** Tablegen `-gen-emitter` runs cleanly; `W65816MCCodeEmitter.cpp` calls the tablegen-provided `getBinaryCodeForInstr` and emits Size bytes little-endian.
- [x] **Step 2c — symbolic fixups.** Each operand class (imm8/imm16/addrDP/addrAbs/addrLong/pcrel8/pcrel16) has its own `EncoderMethod` that emits a `W65816::fixup_*` at the correct byte offset for expression operands. `W65816AsmBackend::applyFixup` patches the data bytes little-endian for resolved fixups and defers to `maybeAddReloc` for unresolved ones. `W65816ELFObjectWriter::getRelocType` returns placeholder relocation numbers 1-5 (swap for canonical R_W65816_* names once the ELF `EM_` is decided — §7 item 1).
- [x] **Step 2d — patch 0005 eliminates data-layout warning.** The data-layout string for `Triple::w65816` lives in `llvm/lib/TargetParser/TargetDataLayout.cpp`; `W65816TargetMachine` now calls `TT.computeDataLayout()`. Zero warnings in the W65816 build.
- [x] **Step 2e — AsmParser scaffold.** 441-line `AsmParser/W65816AsmParser.cpp` ported from MSP430, stripped of register-operand handling (65816 has no MC register operands), with width-narrowing predicates on each operand class so the matcher picks the narrowest instruction variant the value fits (e.g. `sta $10` → STA_DP, `sta $1000` → STA_Abs, `sta $10000` → STA_Long). `#` is emitted as a literal token to match the AsmString tokenisation. Block-move (MVN/MVP) uses `addrDP` for both bank bytes so `mvn $01, $02` parses.
- [x] **Step 2f — operand bit-field wiring.** Every `Inst*` class in `W65816InstrFormats.td` now assigns named bitfields into `Inst{N-8}` (e.g. `let Inst{15-8} = imm;`). Without this tablegen emits an encoder that writes the opcode but leaves the operand bytes as zero — we had that bug for an iteration.
- [x] **Step 2g — smoke-test script.** `scripts/smokeTest.sh` checks llc registration, empty-module codegen, and llvm-mc encoding of a representative instruction mix.
  Run with `--build` to rebuild first.
- [x] **Step 2h — end-to-end ELF object.** `llvm-mc -filetype=obj` produces a valid ELF with relocations at correct byte offsets. Relocations are placeholder numbers 1-5 (§7 item 1 — decide EM_/R_* mapping).
- [x] **Step 2i — Disassembler.** 190-line `Disassembler/W65816Disassembler.cpp` tries 1/2/3/4-byte decode tables in ascending size order. Custom decoder callbacks for imm/addr/pcrel operands wrap raw bits into MCOperands. Mode-ambiguous opcodes (LDA/LDX/LDY/ADC/SBC/CMP/AND/ORA/EOR/BIT/CPX/CPY immediate forms) are parked in separate DecoderTableW65816{MHigh,XHigh}16 tables and the scaffold only reads the default tables — so those opcodes always disassemble as 3-byte 16-bit-immediate forms until a mode-aware decoder lands (alongside REP/SEP).
- [x] **Step 2j — register operands in the AsmParser.** Key fix found via round-trip test: tablegen treats `a`, `x`, `y` in AsmStrings (e.g. `"inc a"`, `"lda\t$addr, x"`) as references to the real register records, so the matcher expects register operands, not literal tokens. AsmParser now produces `k_Reg` operands for these identifiers. Verified: `inc a` → 0x1A, `lda $1000, x` → 0xBD,0x00,0x10, full ELF round-trip passes.
- [x] **Step 2k — smoke test covers disassembly.** The smoke test now feeds raw bytes through `llvm-mc --disassemble` and checks for expected mnemonics, so encoder/decoder asymmetries surface immediately.
- [x] **Step 3a — first DAG patterns.** Type-as-mode model (approved). `LDAi16imm` pseudo for i16 constants; `RTL` for retglue; `emitPrologue` emits canonical `REP #$30`. Mode-dependent `_Imm8` variants are `isCodeGenOnly` so the asm matcher never picks them.
- [x] **Step 3c — single-arg function calls.** `LowerFormalArguments` receives arg 0 in A; `LowerCall` passes arg 0 in A and JSLs via a JSL pseudo to bridge the i16 symbol operand to the MC `JSL_Long`'s 24-bit operand class. Result is back in A.
  Multi-arg call lowering still wants a `PUSHA` SDNode + SP unwind sequence — caller side currently fatals on >1 args.
- [x] **Step 3d — multi-arg via stack (callee side).** `LowerFormalArguments` now reads arg 1+ from stack via FrameIndex + load. `eliminateFrameIndex` translates LDAfi / STAfi / ADCfi / SBCfi / ANDfi / ORAfi / EORfi / CMPfi pseudos to their `LDA d,S` etc. counterparts with the offset baked in. Stack-relative MC instructions are in place; AsmParser recognises the `,s` suffix. Callee-side fully working: a `define i16 @sum3(i16 %a, i16 %b, i16 %c)` compiles to `clc; adc 4,s; clc; adc 6,s; rtl`.
- [x] **Step 3e — frame-index spill plumbing.** `storeRegToStackSlot` and `loadRegFromStackSlot` emit STAfi / LDAfi pseudos so the register allocator can spill Acc16 values when needed.
- [x] **Step 3f — multiplications via shifts.** Multiply by power-of-2 constants inherits the `shl` patterns (1/2/3/4 bits unrolled to `asl a` sequences). Multiply by arbitrary constants and runtime values fail at ISel pending library functions.
- [x] **Step 3h — clang front end builds.** Real C → 65816 machine code via the full `clang -target w65816 -c` pipeline. Bumped clang's `IntAlign`/`LongAlign`/`PointerAlign`/`SuitableAlign` from 8 to 16; also overrode `allowsMisalignedMemoryAccesses` to return true. `scripts/cDemo.sh` shows the full front-end pipeline on a built-in 7-function demo. Additional patterns: `INC_Abs`/`DEC_Abs` for `*p = *p + 1`; `ASRA16` (PHA;ASL;PLA;ROR sequence) for signed shift-right by 1.
- [x] **Step 3i — frame reservation + epilogue.** `emitPrologue` now emits `TSC; SEC; SBC #N; TCS` to reserve N bytes for locals and spills, then `emitEpilogue` reverses with `TSC; CLC; ADC #N; TCS` before the RTL. `eliminateFrameIndex` translates FrameIndex operands into stack-relative offsets via `disp = FrameOffset + StackSize`. `hasFPImpl` returns false (no native FP — direct page would be the logical home).
  This unblocks `clang -O0 -c` for pure-arithmetic functions (each arg gets spilled to its own stack slot). Stack-relative addressing modes for ADC/SBC/AND/ORA/EOR/CMP let the codegen fold loads from frame indices into the carry-arithmetic ops.
- [x] **Step 3g — basic i8 codegen.** Acc8 patterns now cover: `LDAi8imm` (constants), `INA_PSEUDO8` / `DEA_PSEUDO8` (inc/dec), `ADCi8imm` / `SBCi8imm` (add/sub immediate), `ANDi8imm` / `ORAi8imm` / `EORi8imm` (bitwise immediate), `LDA8abs` / `STA8abs` (load/store via global), `ASLA8` / `LSRA8` (1-bit shifts), `CMPi8imm` (compare against immediate, with BR_CC i8 lowering). Frame lowering scans the function IR for any i8 type usage (return, args, instruction values, operands) and picks `REP #$10; SEP #$20` prologue when found, else `REP #$30`. AsmPrinter masks i8 immediates to 8 bits before printing so `i8 -16` shows `0xf0` rather than `0xfff0`.

  Limitations: i8 mode is per-function only — mixed-mode functions get the i8 prologue (8-bit A) and i16 ops fail. Asm round-trip for i8 still loses M-mode info (the parser can't disambiguate `lda #imm` between Imm8 and Imm16); use `-filetype=obj` directly from llc to get the right encoding.
- [x] **Step 3b — globals, loads, stores, arithmetic, branches, bitwise.** `LowerOperation` custom-lowers `GlobalAddress` and `ExternalSymbol` to `W65816Wrapper(target...)`. Pseudo + AsmPrinter-expansion family covers:
  - `LDAi16imm`, `LDAabs`, `STAabs` (load/store/materialise via Wrapper of global)
  - `ADCi16imm`, `ADCabs`, `SBCi16imm`, `SBCabs` (add/sub with the required CLC/SEC carry prefix)
  - `ANDi16imm`, `ORAi16imm`, `EORi16imm` and their `*abs` memory-fold variants
  - `CMPi16imm`, `CMPabs` plus `W65816ISD::CMP` / `W65816ISD::BR_CC` SDNodes; `LowerBR_CC` swaps constant-on-LHS forms and rewrites SETULE/SETUGT/SETLE/SETGT to SETULT/SETUGE/SETLT/SETGE+1 so the canonicalised DAG hits our patterns; condition-code map covers BEQ/BNE/BCS/BCC plus signed BMI/BPL.
  - `BRA` for unconditional `br`.
  - `INA_PSEUDO` / `DEA_PSEUDO` for `add x, ±1` → `inc a` / `dec a`
  - `ASLA16` / `LSRA16` for `shl x, 1` and `lshr x, 1` → `asl a` / `lsr a`
  - `NEGA16` for `0 - x` → `eor #$ffff; inc a`
  - `(xor x, -1)` → `eor #$ffff` (bitwise NOT)
  - Zero-extending byte load: `lda addr; and #$ff`

  The end-to-end pipeline can now compile and assemble functions that read/write globals, do arithmetic on them, and branch conditionally — all with optimal-looking 65816 idioms (e.g. `lda x ; clc ; adc y` for `*x + *y`).
- [ ] **Step 3i — open codegen gaps:**
  1. **Multi-arg call lowering** (caller side). Callee side works; caller still bails on >1 arg. Needs PUSHA SDNode + SP-unwind in ADJCALLSTACKUP.
  2. **Frame-reserved scratch space.** Prologue doesn't reserve stack space for locals/spills, so any alloca'd value or allocator-spilled value lands at a negative SP offset and eliminateFrameIndex bails. Blocks: -O0 compilation of functions with parameters; loops with PHIs that need to compare two computed values; two-Acc16 binary ops in general. Fix: emit `TSC; CLC; ADC #-N; TCS` (or PHA-loop) in emitPrologue and the inverse in emitEpilogue, where N is the function's frame size.
  3. **Mixed-mode i8/i16.** Per-function mode only — the prologue picks one mode; the other type's ops fail. REP/SEP scheduling pass needed.
  4. **Signed `(a - b)` overflow handling.** BMI/BPL based signed comparisons are correct only when the subtraction can't overflow; pathological values give wrong results.
  5. **`sub imm, var`** and **`mul var, var`** (or non-power-of-2 constants). Need libcall support.
  6. **SETCC and SELECT_CC i16.** Boolean conversions like `(int)(cond != 0)` and `(cond) ? a : b` aren't selectable. Custom lowering needed.
  7. **Library functions.** `__mulhi3`, etc. — no runtime yet.
- [ ] **Step 4 — real frame lowering, calling convention, REP/SEP scheduling pass.** The prologue `REP #$30` is unconditional; the REP/SEP pass will remove it when redundant.
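
Gap 4 above (signed `(a - b)` overflow) can be checked on the host before touching the backend. The sketch below is a minimal Python model, not backend code: it computes the N and V flags a 16-bit `SEC; SBC` would produce and compares an N-only test (what BMI/BPL see) against N xor V. The helper names (`sub16_flags`, `to_signed16`, `signed_lt_n_only`, `signed_lt_n_xor_v`) are hypothetical and exist only for this illustration. One hardware detail to keep in mind: `CMP` leaves V untouched on the 65816, so an N-xor-V test has to go through `SBC`.

```python
def sub16_flags(a, b):
    """Model a 16-bit SEC; SBC. Returns (result, n, v) for a - b."""
    a &= 0xFFFF
    b &= 0xFFFF
    result = (a - b) & 0xFFFF
    n = (result >> 15) & 1
    # Signed overflow: operands differ in sign and the result's sign
    # differs from the sign of the minuend.
    v = (((a ^ b) & (a ^ result)) >> 15) & 1
    return result, n, v


def to_signed16(x):
    """Interpret a 16-bit pattern as a two's-complement value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def signed_lt_n_only(a, b):
    """What a BMI after the subtraction decides: N flag alone."""
    _, n, _ = sub16_flags(a, b)
    return bool(n)


def signed_lt_n_xor_v(a, b):
    """Overflow-safe signed less-than: N xor V."""
    _, n, v = sub16_flags(a, b)
    return bool(n ^ v)


if __name__ == "__main__":
    for a, b in [(1, 2), (2, 1), (0x8000, 0x0001), (0x7FFF, 0xFFFF)]:
        expected = to_signed16(a) < to_signed16(b)
        print(f"{a:#06x} < {b:#06x}: expected={expected} "
              f"n_only={signed_lt_n_only(a, b)} "
              f"n_xor_v={signed_lt_n_xor_v(a, b)}")
```

The last two cases (`0x8000` vs `0x0001`, `0x7FFF` vs `0xFFFF`) are pathological values of the kind gap 4 describes: the subtraction overflows, the N-only answer is wrong, and N xor V is right.
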
### Where we actually got to (current state, 2026-04-27)

The "open codegen gaps" list above is mostly resolved. Status of the seven sub-items:

1. **Multi-arg call lowering (caller side)** — done. `LowerCall` pushes args 1..N-1 right-to-left via `W65816ISD::PUSH`, `ADJCALLSTACKUP` unwinds with `tsc;clc;adc #N;tcs`.
2. **Frame-reserved scratch space** — done. `emitPrologue` / `emitEpilogue` use `tsc;sec;sbc #N;tcs` and the inverse.
3. **Mixed-mode i8/i16** — partial. Per-function mode based on IR scan; full REP/SEP scheduling pass still TODO (Step 4).
4. **Signed `(a - b)` overflow in compares** — handled for i8/i16 via the signed-CC promote-to-i16 path. Still has the BMI/BPL correctness caveat at INT16_MIN/MAX boundaries.
5. **`mul var, var` and friends** — done via libcalls; runtime stubs live in `runtime/src/libgcc.s` (`__mulhi3`, `__mulsi3`, `__ashlhi3`, `__ashrhi3`, `__lshrhi3`, `__ashlsi3`, `__ashrsi3`, `__lshrsi3`, `__udivhi3`, `__divhi3`, `__umodhi3`, `__modhi3`, `__udivsi3`, `__divsi3`, `__umodsi3`, `__modsi3`).
6. **SETCC and SELECT_CC i16** — done via custom inserter and the `W65816cmp + W65816selectcc` SDNode pair.
7. **Library functions** — done; see #5 above.

### i32 (long) support — landed (2026-04-26..28)

- Type legalization splits i32 into two i16 halves.
- ABI: i32 first-arg lives in A:X (lo:hi), matching the return ABI; subsequent i32 args go on stack 2 bytes per half. `RetCC_W65816` assigns `[A, X]` for two i16 returns so `__mulsi3` / `__divsi3` libcall returns work.
- ADD/SUB use the native ADC carry chain via ISD::ADDC/ADDE/SUBC/SUBE Legal: `ADCi16imm` etc. mark `Defs = [P]` and pattern-match `addc`; new `ADCEi16imm` / `ADCEabs` / `ADCEfi` (and SBC/E variants) mark `Uses = [P], Defs = [P]` for `adde`/`sube`. `ADDE_RR` / `SUBE_RR` have the inserter equivalent for two-Acc16 chains (e.g. fib32's loop). Net: an i32 add went from ~25 insns (manual UADDO + SETCC + add-of-bool) to ~17 incl.
  prologue/epilogue, with the core 8 being the optimal `clc;adc;sta;lda;adc;tax;lda;rtl`.
- NEGC16 / NEGE16 lower `(subc/sube 0, x)` for i32 negate via the ADD chain (`EOR #$FFFF; CLC; ADC #1` lo, `EOR #$FFFF; ADC #0` hi).
- MUL/DIV/MOD/SHL/SHR/USHR all libcalled; preferredShiftLegalizationStrategy returns `LowerToLibcall` for i32 to keep LLVM from emitting SHL_PARTS we'd have no pattern for.
- `BuildSDIVPow2` / `BuildSREMPow2` overrides return SDValue() to block the magic-constant pow2 expansion that emits unsupported BUILD_VECTOR.

### Other recent work

- `i1` `sext_inreg` lowered as `(sub 0, (and x, 1))`.
- `i8` `sext_inreg` and `sextload-i8` go through the existing branchless `((x & 0xFF) ^ 0x80) - 0x80` sequence (SEXTLOAD i8 set to Expand, sext_inreg pattern added).
- `extloadi8` from an `Acc16` register pointer maps to `LDAptr` (16-bit load; consumer ignores high byte).
- Bare `ISD::FrameIndex` selected as `ADDframe (FI, 0)` for alloca'd-array address-of; `eliminateFrameIndex` expands ADDframe into `tsc;clc;adc #disp` (LEA equivalent).
- **Indirect calls** (function pointers): `LowerCall` redirects through `__jsl_indir` in `runtime/src/libgcc.s` — caller stores the dynamic target to global `__indirTarget` then JSLs the trampoline, which does `JMP (__indirTarget)`. Target's RTL pops the original JSL frame and returns directly to the caller. Single-bank only (JMP indirect is bank-local).
- **Code-quality cleanup pass** (`W65816StackSlotCleanup`, addPostRegAlloc):
  - Removes redundant `LDAfi slot` after `STAfi reg, slot` when the LDA's destination matches and nothing in between clobbers either reg or slot. Catches the regalloc spill+reload cycle around COPY $a → vreg.
  - Removes dead `STAfi reg, slot` when a subsequent `STAfi` overwrites the same slot before any read, OR when the function returns without reading the slot (catches result-spill-before-return that the libcall return ABI makes redundant).
  - Combined with `isReMaterializable` on LDAfi from fixed FIs, the i32 add went from 17 → 11 instructions.
- **i32 shift-by-1 inline** (task #59). The type-legalizer's SHL_PARTS / SRL_PARTS expansion of `i32 << 1` / `>> 1` emits a `(srl x, 15)` or `(shl x, 15)` for the carry-cross-halves slot. Previously routed through `__lshrhi3` / `__ashlhi3` libcalls. Added SRL15A pseudo (`ASL A; LDA #0; ROL A`, 3 bytes) and SHL15A (`LSR A; LDA #0; ROR A`). i32 shl-by-1 went 33 → 26 insns; shr-by-1 29 → 23.
- **i16 shift-by-8 inline** (task #60). Same idea for `(srl x, 8)` and `(shl x, 8)` — used by i32 shift-by-8 type-legalization. XBA swaps the two bytes of A in 16-bit M; AND clears the half we don't want. 4 bytes per shift. i32 shl/shr-by-8 went 39/35 → 27/24 insns.
- **PUSH16X for direct X-push** (task #61). When LowerCall sees an outgoing arg whose SDValue is `CopyFromReg` of a vreg that's live-in from $x (i.e. the i32-first-arg-in-A:X hi half), emit `phx` directly instead of `txa; pha` (which also requires spilling $a to preserve it). mul32 went 19 → 13 insns.
- **Dead frame-slot trimming** (task #62). Extended W65816StackSlotCleanup to scan MIR for unreferenced (post-cleanup) local frame indices and zero-size them so PrologueEpilogue trims the prologue PHA/TSC reservation. Combined with the spill cleanup, shrinks frames in many functions by 2-4 bytes (one fewer PHA + PLY pair).
- **i32 first-arg in A:X (task #50)**. When the first original argument is i32 (LowerFormalArguments / LowerCall detect via `Outs[0..1].OrigArgIndex == 0` on i16 halves), pass it lo:hi in A:X — matching the i32 return ABI. Saves one stack slot per i32 arg. Required updating libgcc.s helpers (`__mulsi3`, `__udivsi3`, `__umodsi3`, `__divsi3`, `__modsi3`, `__ashlsi3`, `__lshrsi3`, `__ashrsi3`, `__divmodsi_setup`) to read arg0_hi from X (and shifted arg1 offsets).
- **Implicit Defs/Uses on stack-rel MC instructions**: was a pre-existing latent bug — `eliminateFrameIndex` strips the implicit A/P def/use info when it converts ADCfi/STAfi/etc. to the MC form (ADC d,S, STA d,S etc.). Machine Copy Propagation then sees stale dataflow and elides necessary TAX/TXA copies. Fixed by re-attaching `RegState::Implicit` operands on each expanded MC instruction in W65816RegisterInfo::eliminateFrameIndex. Without this, the i32-A:X ABI miscompiles return values (TAX gets elided, X retains arg0_hi instead of result_hi). The fix also benefits the existing single-A path; before it, certain Machine Copy Propagation choices were unsafe but happened not to trigger. Now they're also safe.

### Currently still pending

- **REP/SEP scheduling pass** (Step 4) — per-function mode only; mixed-mode functions don't work.
- **Vararg functions** — `LowerFormalArguments` reports a fatal error.
- **i32 comparison** — uses SETCC+ADD-of-bool instead of a CMP+SBC chain (analogous to the ADC chain we landed for add/sub).
- **Regalloc** (#56) — heapify-style functions with 4+ live i16 values run out of A.

### Smoke-test coverage (31 checks as of 2026-04-28)

`scripts/smokeTest.sh` covers: target registration, llvm-mc encode/disassemble, end-to-end IR→ELF, multi-pattern function, single-arg call, 3-arg stack reads, pure-i8 SEP prologue, multi-branch SETCC, SELECT_CC, two-Acc16 spill, libcall emission (`__mulhi3`/`__ashlhi3`), pointer load/store, runtime/build.sh, real-world program, libcall-symbol coverage, signed/eq i8 compare, -O2 tiny C, i32 add end-to-end, i32 carry-chain shape (1 clc + 2 adc + 0 bcc), i32 A:X first-arg ABI (1 txa), 32-bit fib loop (ADDE_RR inserter), `__mulsi3` libcall, alloca'd-array LEA, signed-byte strcmp (sextload + sext_inreg + extload-via-ptr), indirect call via `__jsl_indir` trampoline, i32 shift-by-1 inline (no hi3 libcall).

## 3. What is installed and where

All under `/home/scott/claude/llvm816/tools/`:

| Tool | Path | Notes |
|---|---|---|
| llvm-mos source | `tools/llvm-mos/` | shallow clone. Backend files are symlinked in from `src/`; patches applied on top. Reset cleanly via `scripts/updateLlvmMos.sh`. |
| llvm-mos build dir | `tools/llvm-mos-build/` | cmake-generated, ephemeral |
| llvm-mos-sdk | `tools/llvm-mos-sdk/` | prebuilt toolchain |
| MAME 0.264 | `/usr/games/mame` (apt) | supports `-console` (Lua) |
| Apple IIgs ROMs | `tools/mame/roms/apple2gs.zip`, `apple2gsr1.zip` | from archive.org |
| Calypsi 5.16 | `tools/calypsi/` | extracted .deb |
| ORCA/C source | `tools/orca-c/` | reference only |

`./setup.sh --verify-only` passed all checks as of the prior session.

## 4. Repo layout (current)

```
llvm816/                          # git repo, branch main
├── LLVM_65816_DESIGN.md          # tracked
├── SESSION_STATE.md              # this file
├── setup.sh                      # tracked
├── scripts/                      # tracked
│   ├── common.sh
│   ├── installDeps.sh  installCalypsi.sh  installOrcaC.sh
│   ├── installLlvmMos.sh         # non-destructive (see §8)
│   ├── installMame.sh  verify.sh
│   ├── applyBackend.sh           # src/ + patches/ -> tools/llvm-mos/
│   └── updateLlvmMos.sh          # reset clone, re-apply backend
├── src/                          # authored files, tracked
│   ├── llvm/lib/Target/W65816/   # 41 files
│   │   ├── MCTargetDesc/         (10 files)
│   │   ├── TargetInfo/           (3 files)
│   │   └── (28 top-level files)
│   └── clang/lib/Basic/Targets/
│       ├── W65816.h
│       └── W65816.cpp
├── patches/                      # unified diffs, tracked
│   ├── 0001-triple-add-w65816-arch.patch
│   ├── 0002-triple-cpp-add-w65816-cases.patch
│   ├── 0003-clang-basic-dispatch-w65816.patch
│   └── 0004-cmake-add-w65816-experimental.patch
├── tools/                        # gitignored, ephemeral
└── .gitignore                    # excludes tools/, .cache/
```

## 5. Key architectural decisions

### 5.1 Separate target, not MOS subtarget feature

llvm-mos has `FeatureW65816` declared in `MOSDevices.td` but codegen unimplemented (issue #321). We are NOT extending MOS.
Reasons:

- We cannot upstream an AI-assisted backend to llvm-mos anyway.
- Clean register model: `Acc8`/`Acc16`/`Idx8`/`Idx16` as separate classes.
- Independent evolution.

Recorded in design doc §2.5.

### 5.2 Symlinks + patches, not a fork

`applyBackend.sh` symlinks every file under `src/` into the corresponding path under `tools/llvm-mos/`, then applies each `patches/*.patch` with `git apply`. Idempotent: skips already-current symlinks and already-applied patches (detected via `git apply --reverse --check`).

`updateLlvmMos.sh` is the ONLY script allowed to destructively reset the clone. It reverses all patches, removes our symlinks, runs `git reset --hard FETCH_HEAD`, then re-runs `applyBackend.sh`.

`installLlvmMos.sh` refuses to touch the clone if it is dirty or off main — this is deliberate to protect applied patches.

## 6. Concrete next actions (in order)

### 6.1 Function arguments

`LowerFormalArguments` and `LowerCall` still fatal-error. Without arguments, every function we test has to use globals as inputs. The plan: pass i8/i16 args via the stack (push right-to-left, caller cleans), with the first 1-2 args optionally going in A or X for register-passing. Calypsi output is the reference for ABI choices.

### 6.2 i8 codegen

Currently every function gets `REP #$30` (16-bit mode). For i8 ops we need either:

- A scan-and-prepend approach: if the function has any i8 op, emit `SEP #$20` after the REP for whichever mode dominates, plus toggle pseudos around the off-mode regions.
- Or commit to widening all i8 to i16 pre-ISel (simpler, but uses 2x the cycles for byte-heavy code).

This is the natural lead-in to the REP/SEP scheduling pass (§6.4).

### 6.3 Frame indices, stack locals

Add `eliminateFrameIndex` and frame-pointer pseudos so we can spill to the stack. Today `W65816RegisterInfo::eliminateFrameIndex` is `llvm_unreachable`. Stack accesses on 65816 are `,s` and `(,s),y` indirect — needs new operand classes.

### 6.4 REP/SEP scheduling pass

The core algorithmic work.
TSFlag bits on every mode-dependent instruction are already in place; the pass walks MIR, dataflows the required mode per region, and inserts/removes REP/SEP transitions to minimise total mode switches. Design doc §3.3.

### 6.2 Wire frame lowering + calling convention (real)

`W65816FrameLowering.cpp` is still `llvm_unreachable`. The simplest working version: establish an i16 stack pointer-based frame using the native SP, locals accessed via stack-relative indirect via Y. Calypsi output for a trivial function is a good model.

`W65816CallingConv.td` covers i8/i16 return in A but nothing for arguments. Start with stack-based (push right-to-left, caller cleans) per design doc §3.5.

### 6.3 Disassembler mode-aware decoding (deferred)

The scaffold disassembler always decodes LDA/LDX/LDY/ADC/SBC/CMP/AND/ORA/EOR/BIT/CPX/CPY immediate forms as 3-byte 16-bit-immediate variants. A real decoder should track the M/X bits across the stream (consuming REP/SEP, XCE transitions) and choose between `DecoderTableW65816` (default) and `DecoderTableW65816{MHigh,XHigh}16` per instruction. Naturally pairs with the REP/SEP codegen pass since both need the same M/X tracking model.

### 6.4 REP/SEP scheduling pass

The core algorithmic work (design doc §3.3). Every real instruction now carries TSFlag bits indicating which M/X mode it requires. The pass reads those, does the width-inference / coalescing / transition insertion dataflow, and emits REP/SEP instructions at block boundaries. Plan to spend multiple sessions.

### 6.5 Tidy-ups (can happen in any order)

- Decide ELF `EM_` value (§7 item 1). Currently `EM_NONE`, with placeholder relocation numbers 1-5 in `W65816ELFObjectWriter`. Swap for canonical `R_W65816_*` names once chosen.
- Replace ASCII-art mnemonics (`inc a`, `dec a`, `asl a`, etc.) with proper InstAliases so both `INA` and `INC A` assemble to the same opcode. Requires AsmParser (§6.3).

## 7. Open design questions flagged by the scaffold

1. **ELF `EM_` machine number.** `W65816ELFObjectWriter.cpp` uses `ELF::EM_NONE` as a placeholder. llvm-mos uses `EM_MOS = 0x1966` for the 6502 family. Decide: share `EM_MOS`, or pick a new value?
2. **Data layout string** is hardcoded in `W65816TargetMachine.cpp` rather than routing through `Triple::computeDataLayout()`. That is OK for now — when we're ready to consolidate, add a case in `TargetDataLayout.cpp` and switch to `TT.computeDataLayout()`.
3. **i32 return convention** — does i32 return in A:X or via a hidden pointer? Currently `W65816CallingConv.td` only handles i8/i16. Design doc §3.5 says "A:X for 32-bit" but this isn't modelled yet.
4. **Register aliasing for mode-dependent widths.** `Acc8` and `Acc16` both contain physical register `A`. LLVM's allocator will not cope with this correctly. The REP/SEP management pass (§3.3) is required. Flagged per the design doc.
5. **Open questions from design doc §8** (GS/OS DP reservation, bank memory model, interrupt ABI, ORCA/C ABI compat, width-contract attribute, MAME cycle accuracy) — still unresolved. Punt until after we have a working instruction set.

## 8. Gotchas + hard-won knowledge

- **`installLlvmMos.sh` is non-destructive now.** It refuses to reset the clone if it is dirty or off main. Use `scripts/updateLlvmMos.sh` to refresh (the only script allowed to reset).
- **MAME `-console` flag** is listed by `-showusage`, NOT `-help`.
- **`log()` in `common.sh` writes to stderr.** Don't change it.
- **llvm-mos has `FeatureW65816` but not working codegen** (issue #321).
- **`RemapAllTargetPseudoPointerOperands` is required** in `W65816.td` or tablegen fails with 8 "missing target override for pseudoinstruction using PointerLikeRegClass" errors. Don't remove it.
- **`Triple::w65816` placement in Triple.h:** inserted right after `mos,` to keep the 65xx family clustered. See patch 0001.
- **Added to `LLVM_ALL_EXPERIMENTAL_TARGETS`** in `llvm/CMakeLists.txt` so `-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=all` picks up W65816. Not strictly required — passing the name explicitly also works.
- **Operand `OperandType` field wants LLVM's enum spelling**, not shortened. Use `OPERAND_IMMEDIATE`, `OPERAND_MEMORY`, `OPERAND_PCREL` (see `llvm/include/llvm/MC/MCInstrDesc.h` MCOI::OperandType). `OPERAND_IMM` / `OPERAND_IMM8` / `OPERAND_IMM16` are NOT valid.
- **`PrintMethod` signature for PC-rel operands takes `Address`.** Tablegen generates `printPCRel8(MI, Address, OpNo, O)` — 4 args, not 3. Non-PC-rel PrintMethods use the 3-arg form `(MI, OpNo, O)`.
- **Several `.cpp` files needed explicit `#include`s beyond what MSP430 ships with** because the tablegen-generated `.inc` references full types: `W65816RegisterInfo.cpp` needs `W65816Subtarget.h` and `W65816FrameLowering.h` (for `GET_REGINFO_TARGET_DESC`); `W65816InstPrinter.cpp` needs `llvm/MC/MCAsmInfo.h` (for `MAI.printExpr`).
- **Marker classes can't override mayLoad/mayStore via `let`.** TableGen's multi-inheritance doesn't let unrelated sibling classes touch fields from the base `Instruction`. Use `let isReturn = 1, ... in { ... }` blocks at def sites instead (idiomatic LLVM style).
- **Data layout is hardcoded** in `W65816TargetMachine.cpp` rather than computed from `TT.computeDataLayout()`, because `TargetDataLayout.cpp` doesn't have a case for `w65816` yet. This produces one `-Wswitch` warning in the llvm-mos build. §6.5 notes adding a 5th patch to silence it.

## 9. Disk space recovery

If space is tight before resume:

```
# safe to delete — regenerable from setup.sh + applyBackend.sh:
rm -rf /home/scott/claude/llvm816/tools/
rm -rf /home/scott/claude/llvm816/.cache/
```

Regenerate with `./setup.sh` then `./scripts/applyBackend.sh`.

The `tools/llvm-mos-build/` directory alone is ~2 GB after a full configure+tablegen. A full ninja build will be much more.

## 10. Quick verification commands for resume

```
# Verify the scaffold is in place:
ls src/llvm/lib/Target/W65816/ | wc -l   # expect ~20 top-level files
ls patches/                               # expect 4 .patch files

# Verify apply is clean:
./scripts/applyBackend.sh
# expect 0 new, 44 current symlinks; 0 new, 4 applied patches

# Verify cmake configures:
cmake -S tools/llvm-mos/llvm -B tools/llvm-mos-build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_TARGETS_TO_BUILD="" \
    -DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="MOS;W65816" \
    -DLLVM_ENABLE_PROJECTS="clang" \
    -DLLVM_INCLUDE_TESTS=OFF -DLLVM_INCLUDE_EXAMPLES=OFF \
    -DLLVM_INCLUDE_BENCHMARKS=OFF

# Verify full build + llc registration (slow first time, cached after):
( cd tools/llvm-mos-build && ninja LLVMW65816Info LLVMW65816Desc LLVMW65816CodeGen llc )
./tools/llvm-mos-build/bin/llc --version | grep w65816
./tools/llvm-mos-build/bin/llc -march=w65816 -filetype=null /dev/null ; echo $?
# Expect: grep matches; llc exits 0.
```

## 11. Files changed this session (not yet committed by user)

```
scripts/applyBackend.sh                 # idempotent src+patches apply
scripts/updateLlvmMos.sh                # safe reset+reapply
scripts/installLlvmMos.sh               # no longer destructively resets
scripts/smokeTest.sh                    # regression smoke test
src/llvm/lib/Target/W65816/             # full MC layer + first codegen:
                                        #   CodeGen scaffolds (~40 files)
                                        #   AsmParser/       (2 files)
                                        #   Disassembler/    (2 files)
                                        #   MCTargetDesc/    (11 files)
                                        #   TargetInfo/      (3 files)
                                        #   ~90 real instruction defs
                                        #   ~25 codegen pseudos +
                                        #     AsmPrinter expansion
src/clang/lib/Basic/Targets/W65816.{h,cpp}
patches/0001..0005.patch                # upstream llvm-mos mods
SESSION_STATE.md                        # this file
```

The tools/ tree is all ephemeral (gitignored).
### What now works end-to-end

Try it yourself:

```
./scripts/cDemo.sh                  # built-in demo
./scripts/cDemo.sh path/to/your.c
```

Sample output for the built-in demo (real C → real 65816):

```
get_counter:     lda counter ; rtl
set_counter:     sta counter ; rtl
sum_with_target: clc ; adc target ; rtl
doubler:         asl a ; rtl
half:            lsr a ; rtl
reset:           lda #0 ; sta counter ; rtl
answer:          lda #42 ; rtl
```

### Detail: command-line invocations

```
# Round-trip asm -> bytes -> asm:
echo ' lda #0x1234' | ./bin/llvm-mc -arch=w65816 -show-encoding
#   -> lda #0x1234 ; encoding: [0xa9,0x34,0x12]

echo '0xea 0xa9 0x34 0x12 0x6b' | ./bin/llvm-mc --disassemble --triple=w65816
#   -> nop ; lda #0x1234 ; rtl

# Full asm -> ELF -> disasm:
./bin/llvm-mc -arch=w65816 -filetype=obj foo.s -o foo.o
./bin/llvm-objdump --triple=w65816 -d foo.o

# Real codegen. This .ll compiles cleanly:
@x = global i16 0
@y = global i16 0
define i16 @fib_step() {
  %a = load i16, ptr @x
  %b = load i16, ptr @y
  %s = add i16 %a, %b
  store i16 %a, ptr @y
  store i16 %s, ptr @x
  ret i16 %s
}

# llc emits idiomatic 65816:
#   rep #0x30
#   lda x; clc; adc y    ; A = a + b
#   sta x                ; x = a + b
#   ...
```

### What doesn't work yet

- **Multi-arg calls** (caller side). Callee accepts stack-passed args; the matching push side is unimplemented. Functions with more than one arg can be defined and compile correctly, but cannot be called from another function.
- **Two-Acc16 cmp.** Loops with PHIs that need to compare two computed values fail at ISel — only one A.
- **i8 ops** (always 16-bit mode for now).
- **Signed overflow** in CMP-based branches: BMI/BPL test the N flag of the subtraction, which is incorrect when the subtract overflows.
- **`mul var, var`** (or by non-power-of-2 constants). Needs library functions (`__mulhi3` etc.).
- **`sub imm, var`** (only `sub var, imm` works).

See §6.1-§6.4 for the next steps.
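
The `-show-encoding` results in the command-line section above can be cross-checked by hand, since the emitter writes the opcode byte followed by the operand bytes little-endian. The following is a minimal Python sketch of that byte layout, assuming only the opcodes already quoted in this file (`0xA9` for `lda #imm16`, `0xBD` for `lda abs,x`, `0x1A` for `inc a`). `encode` and `show` are hypothetical helpers written for this note, not part of the toolchain.

```python
def encode(opcode, operand=0, operand_bytes=0):
    """Return the opcode byte followed by the operand, low byte first."""
    out = bytes([opcode & 0xFF])
    if operand_bytes:
        mask = (1 << (8 * operand_bytes)) - 1
        out += (operand & mask).to_bytes(operand_bytes, "little")
    return out


def show(encoded):
    """Format bytes in the same style llvm-mc uses for -show-encoding."""
    return "[" + ",".join(f"0x{b:02x}" for b in encoded) + "]"


if __name__ == "__main__":
    print("lda #0x1234  ", show(encode(0xA9, 0x1234, 2)))
    print("lda $1000, x ", show(encode(0xBD, 0x1000, 2)))
    print("inc a        ", show(encode(0x1A)))
```

If the operand bit-fields were not wired into the encoder (the Step 2f bug), the operand bytes would come out as zero, and a mismatch against this model shows it at a glance.
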