Session Resume — llvm816 project
Drop this into a new Claude Code session and say "read SESSION_STATE.md and
continue where we left off." Pairs with LLVM_65816_DESIGN.md (the design
doc — read that second).
1. Project in one sentence
Build an open-source LLVM/Clang backend for the WDC 65816 (Apple IIgs) that matches or exceeds Calypsi's output quality, forked from llvm-mos but maintained as our own separate W65816 target. User is Scott; expert C dev, doesn't want hand-holding on LLVM or 65816 basics.
2. Where we are in the plan
Design doc section 7 lists a 12-step implementation order. We are at:
- Setup toolchain (prior session)
- Architectural decision: separate W65816 target (design doc §2.5)
- Repo-layout decision: `src/` holds our authored files, `patches/` holds modifications to upstream llvm-mos files, `tools/llvm-mos/` is ephemeral and gitignored. `scripts/applyBackend.sh` stitches src + patches into the clone.
- Step 1: scaffold W65816 target directory. 41 files under `src/llvm/lib/Target/W65816/` + 2 files under `src/clang/lib/Basic/Targets/`. 4 upstream patches under `patches/`.
- Step 2: verify the skeleton fully compiles and links. All 8 tablegen generators run clean, three static libs (LLVMW65816Info/Desc/CodeGen) build, llc links with the target registered, zero warnings in the W65816-local build. `./bin/llc -march=w65816 -filetype=null /dev/null` → exit 0.
- Step 2a: real MC-layer instructions. `W65816InstrInfo.td` now holds ~90 real 65816 opcodes (LDA/STA/LDX/LDY/STX/STY across immediate/DP/abs/DPX/AbsX/AbsY/long where applicable; ADC/SBC/CMP/AND/ORA/EOR/BIT; INC/DEC/ASL/LSR/ROL/ROR; all transfers; stack push/pull; REP/SEP/CLC/SEC/XCE/XBA; branches; JMP/JML/JSR/JSL; RTS/RTL/RTI; MVN/MVP). Instructions whose size depends on M or X bits exist as `_Imm8`/`_Imm16` pairs carrying the appropriate TSFlag bits (MLow/MHigh/XLow/XHigh) for the future REP/SEP pass.
- Step 2b: wire MCCodeEmitter. Tablegen `-gen-emitter` runs cleanly; `W65816MCCodeEmitter.cpp` calls the tablegen-provided `getBinaryCodeForInstr` and emits Size bytes little-endian.
- Step 2c: symbolic fixups. Each operand class (imm8/imm16/addrDP/addrAbs/addrLong/pcrel8/pcrel16) has its own `EncoderMethod` that emits a `W65816::fixup_*` at the correct byte offset for expression operands. `W65816AsmBackend::applyFixup` patches the data bytes little-endian for resolved fixups and defers to `maybeAddReloc` for unresolved ones. `W65816ELFObjectWriter::getRelocType` returns placeholder relocation numbers 1-5 (swap for canonical `R_W65816_*` names once the ELF `EM_` is decided; see §7 item 1).
- Step 2d: patch 0005 eliminates data-layout warning. The data-layout string for `Triple::w65816` lives in `llvm/lib/TargetParser/TargetDataLayout.cpp`; `W65816TargetMachine` now calls `TT.computeDataLayout()`. Zero warnings in the W65816 build.
- Step 2e: AsmParser scaffold. 441-line `AsmParser/W65816AsmParser.cpp` ported from MSP430, stripped of register-operand handling (65816 has no MC register operands), with width-narrowing predicates on each operand class so the matcher picks the narrowest instruction variant the value fits (e.g. `sta $10` → STA_DP, `sta $1000` → STA_Abs, `sta $10000` → STA_Long). `#` is emitted as a literal token to match the AsmString tokenisation. Block-move (MVN/MVP) uses `addrDP` for both bank bytes so `mvn $01, $02` parses.
- Step 2f: operand bit-field wiring. Every `Inst*` class in `W65816InstrFormats.td` now assigns named bitfields into `Inst{N-8}` (e.g. `let Inst{15-8} = imm;`). Without this tablegen emits an encoder that writes the opcode but leaves the operand bytes as zero; we had that bug for an iteration.
- Step 2g: smoke-test script. `scripts/smokeTest.sh` checks llc registration, empty-module codegen, and llvm-mc encoding of a representative instruction mix. Run with `--build` to rebuild first.
- Step 2h: end-to-end ELF object. `llvm-mc -filetype=obj` produces a valid ELF with relocations at correct byte offsets. Relocations are placeholder numbers 1-5 (§7 item 1: decide EM_/R_* mapping).
- Step 2i: Disassembler. 190-line `Disassembler/W65816Disassembler.cpp` tries 1/2/3/4-byte decode tables in ascending size order. Custom decoder callbacks for imm/addr/pcrel operands wrap raw bits into MCOperands. Mode-ambiguous opcodes (LDA/LDX/LDY/ADC/SBC/CMP/AND/ORA/EOR/BIT/CPX/CPY immediate forms) are parked in separate `DecoderTableW65816{MHigh,XHigh}16` tables and the scaffold only reads the default tables, so those opcodes always disassemble as 3-byte 16-bit-immediate forms until a mode-aware decoder lands (alongside REP/SEP).
- Step 2j: register operands in the AsmParser. Key fix found via round-trip test: tablegen treats `a`, `x`, `y` in AsmStrings (e.g. `"inc a"`, `"lda\t$addr, x"`) as references to the real register records, so the matcher expects register operands, not literal tokens. AsmParser now produces `k_Reg` operands for these identifiers. Verified: `inc a` → 0x1A, `lda $1000, x` → 0xBD,0x00,0x10, full ELF round-trip passes.
- Step 2k: smoke test covers disassembly. The smoke test now feeds raw bytes through `llvm-mc --disassemble` and checks for expected mnemonics, so encoder/decoder asymmetries surface immediately.
- Step 3a: first DAG patterns. Type-as-mode model (approved). `LDAi16imm` pseudo for i16 constants; `RTL` for retglue; `emitPrologue` emits canonical `REP #$30`. Mode-dependent `_Imm8` variants are `isCodeGenOnly` so the asm matcher never picks them.
- Step 3c: single-arg function calls. `LowerFormalArguments` receives arg 0 in A; `LowerCall` passes arg 0 in A and calls through a JSL pseudo that bridges the i16 symbol operand to the MC `JSL_Long`'s 24-bit operand class. Result is back in A. Multi-arg call lowering still wants a `PUSHA` SDNode + SP unwind sequence; caller side currently fatals on >1 args.
- Step 3d: multi-arg via stack (callee side). `LowerFormalArguments` now reads arg 1+ from stack via FrameIndex + load. `eliminateFrameIndex` translates LDAfi / STAfi / ADCfi / SBCfi / ANDfi / ORAfi / EORfi / CMPfi pseudos to their `LDA d,S` etc. counterparts with the offset baked in. Stack-relative MC instructions are in place; AsmParser recognises the `,s` suffix. Callee-side fully working: a `define i16 @sum3(i16 %a, i16 %b, i16 %c)` compiles to `clc; adc 4,s; clc; adc 6,s; rtl`.
- Step 3e: frame-index spill plumbing. `storeRegToStackSlot` and `loadRegFromStackSlot` emit STAfi / LDAfi pseudos so the register allocator can spill Acc16 values when needed.
- Step 3f: multiplications via shifts. Multiply by power-of-2 constants inherits the `shl` patterns (1/2/3/4 bits unrolled to `asl a` sequences). Multiply by arbitrary constants and runtime values fail at ISel pending library functions.
- Step 3h: clang front end builds. Real C → 65816 machine code via the full `clang -target w65816 -c` pipeline. Bumped clang's `IntAlign`/`LongAlign`/`PointerAlign`/`SuitableAlign` from 8 to 16; also overrode `allowsMisalignedMemoryAccesses` to return true. `scripts/cDemo.sh` shows the full front-end pipeline on a built-in 7-function demo. Additional patterns: `INC_Abs`/`DEC_Abs` for `*p = *p + 1`; `ASRA16` (PHA;ASL;PLA;ROR sequence) for signed shift-right by 1.
- Step 3i: frame reservation + epilogue. `emitPrologue` now emits `TSC; SEC; SBC #N; TCS` to reserve N bytes for locals and spills, then `emitEpilogue` reverses with `TSC; CLC; ADC #N; TCS` before the RTL. `eliminateFrameIndex` translates FrameIndex operands into stack-relative offsets via `disp = FrameOffset + StackSize`. `hasFPImpl` returns false (no native FP; direct page would be the logical home). This unblocks `clang -O0 -c` for pure-arithmetic functions (each arg gets spilled to its own stack slot). Stack-relative addressing modes for ADC/SBC/AND/ORA/EOR/CMP let the codegen fold loads from frame indices into the carry-arithmetic ops.
- Step 3g: basic i8 codegen. Acc8 patterns now cover: `LDAi8imm` (constants), `INA_PSEUDO8`/`DEA_PSEUDO8` (inc/dec), `ADCi8imm`/`SBCi8imm` (add/sub immediate), `ANDi8imm`/`ORAi8imm`/`EORi8imm` (bitwise immediate), `LDA8abs`/`STA8abs` (load/store via global), `ASLA8`/`LSRA8` (1-bit shifts), `CMPi8imm` (compare against immediate, with BR_CC i8 lowering). Frame lowering scans the function IR for any i8 type usage (return, args, instruction values, operands) and picks the `REP #$10; SEP #$20` prologue when found, else `REP #$30`. AsmPrinter masks i8 immediates to 8 bits before printing so `i8 -16` shows `0xf0` rather than `0xfff0`. Limitations: i8 mode is per-function only; mixed-mode functions get the i8 prologue (8-bit A) and i16 ops fail. Asm round-trip for i8 still loses M-mode info (the parser can't disambiguate `lda #imm` between Imm8 and Imm16); use `-filetype=obj` directly from llc to get the right encoding.
- Step 3b: globals, loads, stores, arithmetic, branches, bitwise. `LowerOperation` custom-lowers `GlobalAddress` and `ExternalSymbol` to `W65816Wrapper(target...)`. Pseudo + AsmPrinter-expansion family covers:
  - `LDAi16imm`, `LDAabs`, `STAabs` (load/store/materialise via Wrapper of global)
  - `ADCi16imm`, `ADCabs`, `SBCi16imm`, `SBCabs` (add/sub with the required CLC/SEC carry prefix)
  - `ANDi16imm`, `ORAi16imm`, `EORi16imm` and their `*abs` memory-fold variants
  - `CMPi16imm`, `CMPabs` plus `W65816ISD::CMP` / `W65816ISD::BR_CC` SDNodes; `LowerBR_CC` swaps constant-on-LHS forms and rewrites SETULE/SETUGT/SETLE/SETGT to SETULT/SETUGE/SETLT/SETGE+1 so the canonicalised DAG hits our patterns; condition-code map covers BEQ/BNE/BCS/BCC plus signed BMI/BPL.
  - `BRA` for unconditional `br`.
  - `INA_PSEUDO` / `DEA_PSEUDO` for `add x, ±1` → `inc a` / `dec a`
  - `ASLA16` / `LSRA16` for `shl x, 1` and `lshr x, 1` → `asl a` / `lsr a`
  - `NEGA16` for `0 - x` → `eor #$ffff; inc a`
  - `(xor x, -1)` → `eor #$ffff` (bitwise NOT)
  - Zero-extending byte load: `lda addr; and #$ff`

  The end-to-end pipeline can now compile and assemble functions that read/write globals, do arithmetic on them, and branch conditionally, all with optimal-looking 65816 idioms (e.g. `lda x ; clc ; adc y` for `*x + *y`).
- Step 3i: open codegen gaps:
  1. **Multi-arg call lowering** (caller side). Callee side works; caller still bails on >1 arg. Needs PUSHA SDNode + SP-unwind in ADJCALLSTACKUP.
  2. **Frame-reserved scratch space.** Prologue doesn't reserve stack space for locals/spills, so any alloca'd value or allocator-spilled value lands at a negative SP offset and eliminateFrameIndex bails. Blocks: -O0 compilation of functions with parameters; loops with PHIs that need to compare two computed values; two-Acc16 binary ops in general. Fix: emit `TSC; CLC; ADC #-N; TCS` (or PHA-loop) in emitPrologue and the inverse in emitEpilogue, where N is the function's frame size.
  3. **Mixed-mode i8/i16.** Per-function mode only: the prologue picks one mode; the other type's ops fail. REP/SEP scheduling pass needed.
  4. **Signed `(a - b)` overflow handling.** BMI/BPL based signed comparisons are correct only when the subtraction can't overflow; pathological values give wrong results.
  5. **`sub imm, var`** and **`mul var, var`** (or non-power-of-2 constants). Need libcall support.
  6. **SETCC and SELECT_CC i16.** Boolean conversions like `(int)(cond != 0)` and `(cond) ? a : b` aren't selectable. Custom lowering needed.
  7. **Library functions.** `__mulhi3`, etc.: no runtime yet.
- Step 4: real frame lowering, calling convention, REP/SEP scheduling pass. The prologue `REP #$30` is unconditional; the REP/SEP pass will remove it when redundant.
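The width-narrowing rule described under Step 2e can be sketched in standalone C++. This is not the AsmParser code: `AddrForm`, `narrowestForm` and `encodedSize` are hypothetical names invented for the illustration, and the only facts taken from these notes are the three `sta` examples and the 24-bit ceiling of the long form.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the STA_DP / STA_Abs / STA_Long variants.
enum class AddrForm { DP, Abs, Long, TooWide };

// Pick the narrowest addressing form whose operand field can hold Value.
constexpr AddrForm narrowestForm(std::uint32_t Value) {
  if (Value <= 0xFF)
    return AddrForm::DP;   // 1 operand byte
  if (Value <= 0xFFFF)
    return AddrForm::Abs;  // 2 operand bytes
  if (Value <= 0xFFFFFF)
    return AddrForm::Long; // 3 operand bytes
  return AddrForm::TooWide; // does not fit in 24 bits
}

// Encoded size in bytes: one opcode byte plus the operand bytes.
constexpr unsigned encodedSize(AddrForm Form) {
  switch (Form) {
  case AddrForm::DP:
    return 2;
  case AddrForm::Abs:
    return 3;
  case AddrForm::Long:
    return 4;
  default:
    return 0; // TooWide: nothing to encode
  }
}

// The three examples from Step 2e.
static_assert(narrowestForm(0x10) == AddrForm::DP, "sta $10");
static_assert(narrowestForm(0x1000) == AddrForm::Abs, "sta $1000");
static_assert(narrowestForm(0x10000) == AddrForm::Long, "sta $10000");
```

Ordering the checks from narrowest to widest is what makes the result "the narrowest variant the value fits"; a matcher that tested the long form first would accept every value and never pick the shorter encodings.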
Where we actually got to (current state, 2026-04-27)
The "open codegen gaps" list above is mostly resolved. Status of its seven sub-items:
1. Multi-arg call lowering (caller side): done. `LowerCall` pushes args 1..N-1 right-to-left via `W65816ISD::PUSH`, `ADJCALLSTACKUP` unwinds with `tsc;clc;adc #N;tcs`.
2. Frame-reserved scratch space: done. `emitPrologue`/`emitEpilogue` use `tsc;sec;sbc #N;tcs` and the inverse.
3. Mixed-mode i8/i16: partial. Per-function mode based on IR scan; full REP/SEP scheduling pass still TODO (Step 4).
4. Signed `(a - b)` overflow in compares: handled for i8/i16 via the signed-CC promote-to-i16 path. Still has the BMI/BPL correctness caveat at INT16_MIN/MAX boundaries.
5. `mul var, var` and friends: done via libcalls; runtime stubs live in `runtime/src/libgcc.s` (__mulhi3, __mulsi3, __ashlhi3, __ashrhi3, __lshrhi3, __ashlsi3, __ashrsi3, __lshrsi3, __udivhi3, __divhi3, __umodhi3, __modhi3, __udivsi3, __divsi3, __umodsi3, __modsi3).
6. SETCC and SELECT_CC i16: done via custom inserter and the `W65816cmp` + `W65816selectcc` SDNode pair.
7. Library functions: done; see #5 above.
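The BMI/BPL caveat on signed compares mentioned above can be reproduced off-target. The following is a minimal sketch, assuming plain two's-complement 16-bit subtraction; `SubFlags`, `sub16`, `lessByN` and `lessByNxorV` are hypothetical names. The N-xor-V test is the textbook overflow-aware signed compare and is shown for contrast only; it is not a description of what the backend currently emits.

```cpp
#include <cassert>
#include <cstdint>

// N and V as a 16-bit subtract would leave them (hypothetical helper type).
struct SubFlags {
  bool N; // sign bit of the 16-bit result
  bool V; // signed overflow of A - B
};

constexpr SubFlags sub16(std::int16_t A, std::int16_t B) {
  const std::uint16_t UA = static_cast<std::uint16_t>(A);
  const std::uint16_t UB = static_cast<std::uint16_t>(B);
  const std::uint16_t R = static_cast<std::uint16_t>(UA - UB);
  // Overflow: operands differ in sign and the result's sign differs from A's.
  const bool V = (((UA ^ UB) & (UA ^ R)) & 0x8000) != 0;
  return SubFlags{(R & 0x8000) != 0, V};
}

// BMI/BPL-style test: trusts the N flag alone.
constexpr bool lessByN(std::int16_t A, std::int16_t B) {
  return sub16(A, B).N;
}

// Overflow-aware test: A < B exactly when N != V.
constexpr bool lessByNxorV(std::int16_t A, std::int16_t B) {
  return sub16(A, B).N != sub16(A, B).V;
}

static_assert(lessByN(1, 2) && lessByNxorV(1, 2), "no overflow: both agree");
static_assert(!lessByN(INT16_MIN, 1), "N alone misses INT16_MIN < 1");
static_assert(lessByNxorV(INT16_MIN, 1), "N xor V gets it right");
```

One practical wrinkle if this were ever applied on the real CPU: `CMP` updates N, Z and C but not V, so an overflow-aware sequence would have to go through `SEC; SBC` rather than `CMP`.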
i32 (long) support — landed (2026-04-26..28)
- Type legalization splits i32 into two i16 halves.
- ABI: i32 first-arg lives in A:X (lo:hi), matching the return ABI; subsequent i32 args go on stack 2 bytes per half. `RetCC_W65816` assigns `[A, X]` for two i16 returns so `__mulsi3`/`__divsi3` libcall returns work.
- ADD/SUB use the native ADC carry chain via ISD::ADDC/ADDE/SUBC/SUBE Legal: `ADCi16imm` etc. mark `Defs = [P]` and pattern-match `addc`; new `ADCEi16imm`/`ADCEabs`/`ADCEfi` (and SBC/E variants) mark `Uses = [P], Defs = [P]` for `adde`/`sube`. `ADDE_RR`/`SUBE_RR` have the inserter equivalent for two-Acc16 chains (e.g. fib32's loop). Net: an i32 add went from ~25 insns (manual UADDO + SETCC + add-of-bool) to ~17 incl. prologue/epilogue, with the core 8 being the optimal `clc;adc;sta;lda;adc;tax;lda;rtl`.
- NEGC16 / NEGE16 lower `(subc/sube 0, x)` for i32 negate via the ADD chain (`EOR #$FFFF; CLC; ADC #1` lo, `EOR #$FFFF; ADC #0` hi).
- MUL/DIV/MOD/SHL/SHR/USHR all libcalled; `preferredShiftLegalizationStrategy` returns `LowerToLibcall` for i32 to keep LLVM from emitting SHL_PARTS we'd have no pattern for. `BuildSDIVPow2`/`BuildSREMPow2` overrides return SDValue() to block the magic-constant pow2 expansion that emits unsupported BUILD_VECTOR.
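The carry-chain idea behind the i32 add (low halves added with carry cleared, high halves added with the carry produced by the low step) can be modelled in a few lines of C++. This is a sketch of the arithmetic only, under the assumption that each 16-bit step behaves like `ADC`; `Halves`, `AdcResult`, `adc16` and `add32` are hypothetical names, not backend code.

```cpp
#include <cassert>
#include <cstdint>

// An i32 split the way type legalization splits it: two i16 halves.
struct Halves {
  std::uint16_t Lo;
  std::uint16_t Hi;
};

constexpr Halves split(std::uint32_t V) {
  return Halves{static_cast<std::uint16_t>(V & 0xFFFF),
                static_cast<std::uint16_t>(V >> 16)};
}

constexpr std::uint32_t join(Halves H) {
  return (static_cast<std::uint32_t>(H.Hi) << 16) | H.Lo;
}

// One ADC-like step: A + B + carry-in, yielding a 16-bit sum and carry-out.
struct AdcResult {
  std::uint16_t Sum;
  bool Carry;
};

constexpr AdcResult adc16(std::uint16_t A, std::uint16_t B, bool CarryIn) {
  const std::uint32_t Wide = std::uint32_t(A) + B + (CarryIn ? 1u : 0u);
  return AdcResult{static_cast<std::uint16_t>(Wide & 0xFFFF), Wide > 0xFFFF};
}

constexpr std::uint32_t add32(std::uint32_t X, std::uint32_t Y) {
  const Halves A = split(X);
  const Halves B = split(Y);
  const AdcResult Lo = adc16(A.Lo, B.Lo, false);    // clc; adc  (addc role)
  const AdcResult Hi = adc16(A.Hi, B.Hi, Lo.Carry); // adc       (adde role)
  return join(Halves{Lo.Sum, Hi.Sum});
}

static_assert(add32(0x0001FFFF, 1) == 0x00020000, "carry crosses the halves");
static_assert(add32(0xFFFFFFFF, 1) == 0, "wraps like a 32-bit add");
```

The point of modelling `Defs = [P]` / `Uses = [P]` in the real instruction definitions is visible here as the `Lo.Carry` data dependency: the high-half step is only correct if nothing clobbers the carry between the two adds.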
Other recent work
- `i1` `sext_inreg` lowered as `(sub 0, (and x, 1))`.
- `i8` `sext_inreg` and `sextload-i8` go through the existing branchless `((x & 0xFF) ^ 0x80) - 0x80` sequence (SEXTLOAD i8 set to Expand, sext_inreg pattern added).
- `extloadi8` from an `Acc16` register pointer maps to `LDAptr` (16-bit load; consumer ignores high byte).
- Bare `ISD::FrameIndex` selected as `ADDframe (FI, 0)` for alloca'd-array address-of; `eliminateFrameIndex` expands ADDframe into `tsc;clc;adc #disp` (LEA equivalent).
- Indirect calls (function pointers): `LowerCall` redirects through `__jsl_indir` in `runtime/src/libgcc.s`. The caller stores the dynamic target to global `__indirTarget` then JSLs the trampoline, which does `JMP (__indirTarget)`. Target's RTL pops the original JSL frame and returns directly to the caller. Single-bank only (JMP indirect is bank-local).
- Code-quality cleanup pass (`W65816StackSlotCleanup`, addPostRegAlloc):
  - Removes redundant `LDAfi slot` after `STAfi reg, slot` when the LDA's destination matches and nothing in between clobbers either reg or slot. Catches the regalloc spill+reload cycle around COPY $a → vreg.
  - Removes dead `STAfi reg, slot` when a subsequent `STAfi` overwrites the same slot before any read, OR when the function returns without reading the slot (catches result-spill-before-return that the libcall return ABI makes redundant).
  - Combined with `isReMaterializable` on LDAfi from fixed FIs, the i32 add went from 17 → 11 instructions.
- i32 shift-by-1 inline (task #59). The type-legalizer's SHL_PARTS / SRL_PARTS expansion of `i32 << 1` / `>> 1` emits a `(srl x, 15)` or `(shl x, 15)` for the carry-cross-halves slot. Previously routed through __lshrhi3 / __ashlhi3 libcalls. Added SRL15A pseudo (ASL A; LDA #0; ROL A, 3 bytes) and SHL15A (LSR A; LDA #0; ROR A). i32 shl-by-1 went 33 → 26 insns; shr-by-1 29 → 23.
- i16 shift-by-8 inline (task #60). Same idea for `(srl x, 8)` and `(shl x, 8)`, used by i32 shift-by-8 type-legalization. XBA swaps the two bytes of A in 16-bit M; AND clears the half we don't want. 4 bytes per shift. i32 shl/shr-by-8 went 39/35 → 27/24 insns.
- PUSH16X for direct X-push (task #61). When LowerCall sees an outgoing arg whose SDValue is `CopyFromReg` of a vreg that's live-in from $x (i.e. the i32-first-arg-in-A:X hi half), emit `phx` directly instead of `txa; pha` (which also requires spilling $a to preserve it). mul32 went 19 → 13 insns.
- Dead frame-slot trimming (task #62). Extended W65816StackSlotCleanup to scan MIR for unreferenced (post-cleanup) local frame indices and zero-size them so PrologueEpilogue trims the prologue PHA/TSC reservation. Combined with the spill cleanup, shrinks frames in many functions by 2-4 bytes (one fewer PHA + PLY pair).
- i32 first-arg in A:X (task #50). When the first original argument is i32 (LowerFormalArguments / LowerCall detect via `Outs[0..1].OrigArgIndex == 0` on i16 halves), pass it lo:hi in A:X, matching the i32 return ABI. Saves one stack slot per i32 arg. Required updating libgcc.s helpers (`__mulsi3`, `__udivsi3`, `__umodsi3`, `__divsi3`, `__modsi3`, `__ashlsi3`, `__lshrsi3`, `__ashrsi3`, `__divmodsi_setup`) to read arg0_hi from X (and shifted arg1 offsets).
- Implicit Defs/Uses on stack-rel MC instructions: this was a pre-existing latent bug. `eliminateFrameIndex` strips the implicit A/P def/use info when it converts ADCfi/STAfi/etc. to the MC form (ADC d,S, STA d,S etc.). Machine Copy Propagation then sees stale dataflow and elides necessary TAX/TXA copies. Fixed by re-attaching `RegState::Implicit` operands on each expanded MC instruction in W65816RegisterInfo::eliminateFrameIndex. Without this, the i32-A:X ABI miscompiles return values (TAX gets elided, X retains arg0_hi instead of result_hi). The fix also benefits the existing single-A path; before it, certain Machine Copy Propagation choices were unsafe but happened not to trigger. Now they're also safe.
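The two sign-extension lowerings listed at the top of this section are easy to check in isolation. Below is a minimal sketch that evaluates the same expressions on a 16-bit value; `sext8` and `sext1` are hypothetical helper names, and the code models the arithmetic only, not the selected instruction sequence.

```cpp
#include <cassert>
#include <cstdint>

// i8 sext_inreg: ((x & 0xFF) ^ 0x80) - 0x80, truncated back to 16 bits.
// The XOR flips the byte's sign bit; subtracting 0x80 then borrows into the
// high byte exactly when the original sign bit was set.
constexpr std::uint16_t sext8(std::uint16_t X) {
  return static_cast<std::uint16_t>(((X & 0xFF) ^ 0x80) - 0x80);
}

// i1 sext_inreg: (sub 0, (and x, 1)) gives all-ones for 1 and zero for 0.
constexpr std::uint16_t sext1(std::uint16_t X) {
  return static_cast<std::uint16_t>(0 - (X & 1));
}

static_assert(sext8(0x00F0) == 0xFFF0, "0xF0 is -16 as an i8");
static_assert(sext8(0xAB7F) == 0x007F, "high byte of the input is ignored");
static_assert(sext1(0x0003) == 0xFFFF, "low bit set -> -1");
static_assert(sext1(0x0002) == 0x0000, "low bit clear -> 0");
```

Both forms are branch-free, which is presumably why they are attractive on a CPU where a compare-and-branch costs more than an `AND`/`EOR`/`SBC` triple.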
Currently still pending
- REP/SEP scheduling pass (Step 4): per-function mode only; mixed-mode functions don't work.
- Vararg functions: `LowerFormalArguments` reports a fatal error.
- i32 comparison: uses SETCC+ADD-of-bool instead of a CMP+SBC chain (analogous to the ADC chain we landed for add/sub).
- Regalloc (#56): heapify-style functions with 4+ live i16 values run out of A.
Smoke-test coverage (31 checks as of 2026-04-28)
scripts/smokeTest.sh covers: target registration, llvm-mc encode/
disassemble, end-to-end IR→ELF, multi-pattern function, single-arg
call, 3-arg stack reads, pure-i8 SEP prologue, multi-branch SETCC,
SELECT_CC, two-Acc16 spill, libcall emission (__mulhi3/__ashlhi3),
pointer load/store, runtime/build.sh, real-world program,
libcall-symbol coverage, signed/eq i8 compare, -O2 tiny C, i32 add
end-to-end, i32 carry-chain shape (1 clc + 2 adc + 0 bcc), i32
A:X first-arg ABI (1 txa), 32-bit fib loop (ADDE_RR inserter),
__mulsi3 libcall, alloca'd-array LEA, signed-byte strcmp
(sextload + sext_inreg + extload-via-ptr), indirect call via
__jsl_indir trampoline, i32 shift-by-1 inline (no hi3 libcall).
3. What is installed and where
All under /home/scott/claude/llvm816/tools/:
| Tool | Path | Notes |
|---|---|---|
| llvm-mos source | `tools/llvm-mos/` | shallow clone. Backend files are symlinked in from `src/`; patches applied on top. Reset cleanly via `scripts/updateLlvmMos.sh`. |
| llvm-mos build dir | `tools/llvm-mos-build/` | cmake-generated, ephemeral |
| llvm-mos-sdk | `tools/llvm-mos-sdk/` | prebuilt toolchain |
| MAME 0.264 | `/usr/games/mame` (apt) | supports `-console` (Lua) |
| Apple IIgs ROMs | `tools/mame/roms/apple2gs.zip`, `apple2gsr1.zip` | from archive.org |
| Calypsi 5.16 | `tools/calypsi/` | extracted .deb |
| ORCA/C source | `tools/orca-c/` | reference only |
./setup.sh --verify-only passed all checks as of the prior session.
4. Repo layout (current)
llvm816/ # git repo, branch main
├── LLVM_65816_DESIGN.md # tracked
├── SESSION_STATE.md # this file
├── setup.sh # tracked
├── scripts/ # tracked
│ ├── common.sh
│ ├── installDeps.sh installCalypsi.sh installOrcaC.sh
│ ├── installLlvmMos.sh # non-destructive (see §8)
│ ├── installMame.sh verify.sh
│ ├── applyBackend.sh # src/ + patches/ -> tools/llvm-mos/
│ └── updateLlvmMos.sh # reset clone, re-apply backend
├── src/ # authored files, tracked
│ ├── llvm/lib/Target/W65816/ # 41 files
│ │ ├── MCTargetDesc/ (10 files)
│ │ ├── TargetInfo/ (3 files)
│ │ └── (28 top-level files)
│ └── clang/lib/Basic/Targets/
│ ├── W65816.h
│ └── W65816.cpp
├── patches/ # unified diffs, tracked
│ ├── 0001-triple-add-w65816-arch.patch
│ ├── 0002-triple-cpp-add-w65816-cases.patch
│ ├── 0003-clang-basic-dispatch-w65816.patch
│ └── 0004-cmake-add-w65816-experimental.patch
├── tools/ # gitignored, ephemeral
└── .gitignore # excludes tools/, .cache/
5. Key architectural decisions
5.1 Separate target, not MOS subtarget feature
llvm-mos has FeatureW65816 declared in MOSDevices.td but codegen
unimplemented (issue #321). We are NOT extending MOS. Reasons:
- We cannot upstream an AI-assisted backend to llvm-mos anyway.
- Clean register model: `Acc8`/`Acc16`/`Idx8`/`Idx16` as separate classes.
- Independent evolution.

Recorded in design doc §2.5.
5.2 Symlinks + patches, not a fork
applyBackend.sh symlinks every file under src/ into the corresponding
path under tools/llvm-mos/, then applies each patches/*.patch with
git apply. Idempotent: skips already-current symlinks and
already-applied patches (detected via git apply --reverse --check).
updateLlvmMos.sh is the ONLY script allowed to destructively reset the
clone. It reverses all patches, removes our symlinks, git reset --hard FETCH_HEAD, then re-runs applyBackend.sh.
installLlvmMos.sh refuses to touch the clone if it is dirty or off
main — this is deliberate to protect applied patches.
6. Concrete next actions (in order)
6.1 Function arguments
LowerFormalArguments and LowerCall still fatal-error. Without
arguments, every function we test has to use globals as inputs. The
plan: pass i8/i16 args via the stack (push right-to-left, caller
cleans), with the first 1-2 args optionally going in A or X for
register-passing. Calypsi output is the reference for ABI choices.
6.2 i8 codegen
Currently every function gets REP #$30 (16-bit mode). For i8 ops
we need either:
- A scan-and-prepend approach: if the function has any i8 op, emit `SEP #$20` after the REP for whichever mode dominates, plus toggle pseudos around the off-mode regions.
- Or commit to widening all i8 to i16 pre-ISel (simpler, but uses 2x the cycles for byte-heavy code).

This is the natural lead-in to the REP/SEP scheduling pass (§6.4).
6.3 Frame indices, stack locals
Add eliminateFrameIndex and frame-pointer pseudos so we can spill
to the stack. Today W65816RegisterInfo::eliminateFrameIndex is
llvm_unreachable. Stack accesses on 65816 are ,s and (,s),y
indirect — needs new operand classes.
6.4 REP/SEP scheduling pass
The core algorithmic work. TSFlag bits on every mode-dependent instruction are already in place; the pass walks MIR, dataflows the required mode per region, and inserts/removes REP/SEP transitions to minimise total mode switches. Design doc §3.3.
6.2 Wire frame lowering + calling convention (real)
W65816FrameLowering.cpp is still llvm_unreachable. The simplest
working version: establish an i16 stack pointer-based frame using the
native SP, locals accessed via stack-relative indirect via Y. Calypsi
output for a trivial function is a good model.
W65816CallingConv.td covers i8/i16 return in A but nothing for
arguments. Start with stack-based (push right-to-left, caller cleans)
per design doc §3.5.
6.3 Disassembler mode-aware decoding (deferred)
The scaffold disassembler always decodes LDA/LDX/LDY/ADC/SBC/CMP/AND/
ORA/EOR/BIT/CPX/CPY immediate forms as 3-byte 16-bit-immediate
variants. A real decoder should track the M/X bits across the stream
(consuming REP/SEP, XCE transitions) and choose between
DecoderTableW65816 (default) and DecoderTableW65816{MHigh,XHigh}16
per instruction. Naturally pairs with the REP/SEP codegen pass since
both need the same M/X tracking model.
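The M/X tracking described here can be sketched as a small length decoder. This is an illustration under stated assumptions, not the disassembler: it handles only a handful of opcodes (NOP, RTL, REP, SEP and the LDA/LDX/LDY immediates), ignores XCE, assumes decoding starts in 16-bit mode (as after `REP #$30`), and `ModeState` and `decodeLengths` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tracked processor-status bits. Assumption: decoding starts with both
// clear, i.e. 16-bit accumulator and index registers.
struct ModeState {
  bool M8 = false; // accumulator/memory is 8-bit
  bool X8 = false; // index registers are 8-bit
};

// Returns the byte length of each instruction in a tiny opcode subset,
// stopping at the first unknown opcode or truncated instruction.
std::vector<unsigned> decodeLengths(const std::vector<std::uint8_t> &Bytes,
                                    ModeState State = {}) {
  std::vector<unsigned> Lengths;
  std::size_t I = 0;
  while (I < Bytes.size()) {
    const std::uint8_t Op = Bytes[I];
    unsigned Len = 0;
    switch (Op) {
    case 0xEA: case 0x6B: Len = 1; break;                 // NOP, RTL
    case 0xC2: case 0xE2: Len = 2; break;                 // REP, SEP
    case 0xA9: Len = State.M8 ? 2 : 3; break;             // LDA #imm
    case 0xA2: case 0xA0: Len = State.X8 ? 2 : 3; break;  // LDX/LDY #imm
    default: return Lengths;                              // outside the subset
    }
    if (I + Len > Bytes.size())
      return Lengths; // truncated instruction
    if (Op == 0xC2 || Op == 0xE2) {
      // REP clears the masked status bits, SEP sets them.
      const std::uint8_t Mask = Bytes[I + 1];
      const bool Set = (Op == 0xE2);
      if (Mask & 0x20) State.M8 = Set;
      if (Mask & 0x10) State.X8 = Set;
    }
    Lengths.push_back(Len);
    I += Len;
  }
  return Lengths;
}
```

For example, the stream `E2 20 A9 12 C2 20 A9 34 12 6B` decodes as lengths 2, 2, 2, 3, 1: the first `LDA #` is two bytes because `SEP #$20` preceded it, the second is three bytes because `REP #$20` switched back. A linear tracker like this is only sound for straight-line code; branches and calls are where a real decoder would need extra information.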
6.4 REP/SEP scheduling pass
The core algorithmic work (design doc §3.3). Every real instruction now carries TSFlag bits indicating which M/X mode it requires. The pass reads those, does the width-inference / coalescing / transition insertion dataflow, and emits REP/SEP instructions at block boundaries. Plan to spend multiple sessions.
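To make the "transition insertion" part concrete, here is a deliberately simplified sketch: a single straight-line walk that tracks only the M bit and inserts `sep #$20` / `rep #$20` lazily, right before the first instruction whose requirement differs from the current mode. It is not the design-doc algorithm (no control flow, no X bit, no coalescing across blocks), and `Need`, `Insn` and `insertModeSwitches` are hypothetical names; in the real pass the requirement would come from the TSFlag bits rather than a hand-written enum.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// What an instruction requires of the accumulator width.
enum class Need { Any, M8, M16 };

// (assembly text, requirement) pairs stand in for MachineInstrs here.
using Insn = std::pair<std::string, Need>;

// Emit a mode switch only when the next constrained instruction needs a
// width other than the one currently active; Need::Any never forces one.
std::vector<std::string> insertModeSwitches(const std::vector<Insn> &Insns,
                                            bool StartM8 = false) {
  std::vector<std::string> Out;
  bool CurM8 = StartM8;
  for (const Insn &I : Insns) {
    if (I.second == Need::M8 && !CurM8) {
      Out.push_back("sep #$20"); // set M: 8-bit accumulator
      CurM8 = true;
    } else if (I.second == Need::M16 && CurM8) {
      Out.push_back("rep #$20"); // clear M: 16-bit accumulator
      CurM8 = false;
    }
    Out.push_back(I.first);
  }
  return Out;
}
```

Given `lda #$1234` (M16), `tax` (Any), `lda #$12` (M8), `sta $10` (M8), `lda #$5678` (M16), the walk emits exactly two switches: one `sep #$20` before the first 8-bit load and one `rep #$20` before the final 16-bit load. Being lazy about `Any` instructions is what keeps runs of mode-agnostic code from triggering needless transitions.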
6.5 Tidy-ups (can happen in any order)
- Decide ELF `EM_` value (§7 item 1). Currently `EM_NONE`, with placeholder relocation numbers 1-5 in `W65816ELFObjectWriter`. Swap for canonical `R_W65816_*` names once chosen.
- Replace ASCII-art mnemonics (`inc a`, `dec a`, `asl a`, etc.) with proper InstAliases so both `INA` and `INC A` assemble to the same opcode. Requires AsmParser (§6.3).
7. Open design questions flagged by the scaffold
1. ELF `EM_` machine number. `W65816ELFObjectWriter.cpp` uses `ELF::EM_NONE` as a placeholder. llvm-mos uses `EM_MOS = 0x1966` for the 6502 family. Decide: share `EM_MOS`, or pick a new value?
2. Data layout string is hardcoded in `W65816TargetMachine.cpp` rather than routing through `Triple::computeDataLayout()`. That is OK for now; when we're ready to consolidate, add a case in `TargetDataLayout.cpp` and switch to `TT.computeDataLayout()`.
3. i32 return convention: does i32 return in A:X or via a hidden pointer? Currently `W65816CallingConv.td` only handles i8/i16. Design doc §3.5 says "A:X for 32-bit" but this isn't modelled yet.
4. Register aliasing for mode-dependent widths. `Acc8` and `Acc16` both contain physical register `A`. LLVM's allocator will not cope with this correctly. The REP/SEP management pass (§3.3) is required. Flagged per the design doc.
5. Open questions from design doc §8 (GS/OS DP reservation, bank memory model, interrupt ABI, ORCA/C ABI compat, width-contract attribute, MAME cycle accuracy): still unresolved. Punt until after we have a working instruction set.
8. Gotchas + hard-won knowledge
- `installLlvmMos.sh` is non-destructive now. It refuses to reset the clone if it is dirty or off main. Use `scripts/updateLlvmMos.sh` to refresh (the only script allowed to reset).
- MAME `-console` flag is listed by `-showusage`, NOT `-help`.
- `log()` in `common.sh` writes to stderr. Don't change it.
- llvm-mos has `FeatureW65816` but not working codegen (issue #321).
- `RemapAllTargetPseudoPointerOperands<PtrRegs>` is required in `W65816.td` or tablegen fails with 8 "missing target override for pseudoinstruction using PointerLikeRegClass" errors. Don't remove it.
- `Triple::w65816` placement in Triple.h: inserted right after `mos,` to keep the 65xx family clustered. See patch 0001.
- Added to `LLVM_ALL_EXPERIMENTAL_TARGETS` in `llvm/CMakeLists.txt` so `-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=all` picks up W65816. Not strictly required; passing the name explicitly also works.
- Operand `OperandType` field wants LLVM's enum spelling, not shortened. Use `OPERAND_IMMEDIATE`, `OPERAND_MEMORY`, `OPERAND_PCREL` (see `llvm/include/llvm/MC/MCInstrDesc.h` `MCOI::OperandType`). `OPERAND_IMM`/`OPERAND_IMM8`/`OPERAND_IMM16` are NOT valid.
- `PrintMethod` signature for PC-rel operands takes `Address`. Tablegen generates `printPCRel8(MI, Address, OpNo, O)`: 4 args, not 3. Non-PC-rel PrintMethods use the 3-arg form `(MI, OpNo, O)`.
- Several `.cpp` files needed explicit `#include`s beyond what MSP430 ships with because the tablegen-generated `.inc` references full types: `W65816RegisterInfo.cpp` needs `W65816Subtarget.h` and `W65816FrameLowering.h` (for `GET_REGINFO_TARGET_DESC`); `W65816InstPrinter.cpp` needs `llvm/MC/MCAsmInfo.h` (for `MAI.printExpr`).
- Marker classes can't override mayLoad/mayStore via `let`. TableGen's multi-inheritance doesn't let unrelated sibling classes touch fields from the base `Instruction`. Use `let isReturn = 1, ... in { ... }` blocks at def sites instead (idiomatic LLVM style).
- Data layout is hardcoded in `W65816TargetMachine.cpp` rather than computed from `TT.computeDataLayout()`, because `TargetDataLayout.cpp` doesn't have a case for `w65816` yet. This produces one `-Wswitch` warning in the llvm-mos build. §6.5 notes adding a 5th patch to silence it.
9. Disk space recovery
If space is tight before resume:
# safe to delete — regenerable from setup.sh + applyBackend.sh:
rm -rf /home/scott/claude/llvm816/tools/
rm -rf /home/scott/claude/llvm816/.cache/
Regenerate with ./setup.sh then ./scripts/applyBackend.sh.
The tools/llvm-mos-build/ directory alone is ~2 GB after a full
configure+tablegen. A full ninja build will be much more.
10. Quick verification commands for resume
# Verify the scaffold is in place:
ls src/llvm/lib/Target/W65816/ | wc -l # expect ~20 top-level files
ls patches/ # expect 4 .patch files
# Verify apply is clean:
./scripts/applyBackend.sh # expect 0 new, 44 current symlinks; 0 new, 4 applied patches
# Verify cmake configures:
cmake -S tools/llvm-mos/llvm -B tools/llvm-mos-build -G Ninja \
-DCMAKE_BUILD_TYPE=Release \
-DLLVM_TARGETS_TO_BUILD="" \
-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD="MOS;W65816" \
-DLLVM_ENABLE_PROJECTS="clang" \
-DLLVM_INCLUDE_TESTS=OFF -DLLVM_INCLUDE_EXAMPLES=OFF \
-DLLVM_INCLUDE_BENCHMARKS=OFF
# Verify full build + llc registration (slow first time, cached after):
( cd tools/llvm-mos-build && ninja LLVMW65816Info LLVMW65816Desc LLVMW65816CodeGen llc )
./tools/llvm-mos-build/bin/llc --version | grep w65816
./tools/llvm-mos-build/bin/llc -march=w65816 -filetype=null /dev/null ; echo $?
# Expect: grep matches; llc exits 0.
11. Files changed this session (not yet committed by user)
scripts/applyBackend.sh # idempotent src+patches apply
scripts/updateLlvmMos.sh # safe reset+reapply
scripts/installLlvmMos.sh # no longer destructively resets
scripts/smokeTest.sh # regression smoke test
src/llvm/lib/Target/W65816/ # full MC layer + first codegen:
# CodeGen scaffolds (~40 files)
# AsmParser/ (2 files)
# Disassembler/ (2 files)
# MCTargetDesc/ (11 files)
# TargetInfo/ (3 files)
# ~90 real instruction defs
# ~25 codegen pseudos +
# AsmPrinter expansion
src/clang/lib/Basic/Targets/W65816.{h,cpp}
patches/0001..0005.patch # upstream llvm-mos mods
SESSION_STATE.md # this file
The tools/ tree is all ephemeral (gitignored).
What now works end-to-end
Try it yourself:
./scripts/cDemo.sh # built-in demo
./scripts/cDemo.sh path/to/your.c
Sample output for the built-in demo (real C → real 65816):
get_counter: lda counter ; rtl
set_counter: sta counter ; rtl
sum_with_target: clc ; adc target ; rtl
doubler: asl a ; rtl
half: lsr a ; rtl
reset: lda #0 ; sta counter ; rtl
answer: lda #42 ; rtl
Detail: command-line invocations
# Round-trip asm -> bytes -> asm:
echo ' lda #0x1234' | ./bin/llvm-mc -arch=w65816 -show-encoding
# -> lda #0x1234 ; encoding: [0xa9,0x34,0x12]
echo '0xea 0xa9 0x34 0x12 0x6b' | ./bin/llvm-mc --disassemble --triple=w65816
# -> nop ; lda #0x1234 ; rtl
# Full asm -> ELF -> disasm:
./bin/llvm-mc -arch=w65816 -filetype=obj foo.s -o foo.o
./bin/llvm-objdump --triple=w65816 -d foo.o
# Real codegen. This .ll compiles cleanly:
@x = global i16 0
@y = global i16 0
define i16 @fib_step() {
%a = load i16, ptr @x
%b = load i16, ptr @y
%s = add i16 %a, %b
store i16 %a, ptr @y
store i16 %s, ptr @x
ret i16 %s
}
# llc emits idiomatic 65816:
# rep #0x30
# lda x; clc; adc y ; A = a + b
# sta x ; x = a + b
# ...
What doesn't work yet
- Multi-arg calls (caller side). Callee accepts stack-passed args; the matching push side is unimplemented. Functions with more than one arg can be defined and compile correctly, but cannot be called from another function.
- Two-Acc16 cmp. Loops with PHIs that need to compare two computed values fail at ISel — only one A.
- i8 ops (always 16-bit mode for now).
- Signed overflow in CMP-based branches: BMI/BPL test the N flag of the subtraction, which is incorrect when the subtract overflows.
- `mul var, var` (or by non-power-of-2 constants). Needs library functions (`__mulhi3` etc.).
- `sub imm, var` (only `sub var, imm` works).
See §6.1-§6.4 for the next steps.