65816-llvm-mos/docs/multiSegmentPlan.md
Scott Duensing f542f4fa01 Checkpoint
2026-05-03 21:31:53 -05:00


Multi-segment OMF support — plan

Why

Single-segment cap: ~60KB usable in bank 0 after the I/O window ($C000-$CFFF), the stack at $0FFF, and crt0 / runtime overhead. Real IIgs applications need hundreds of KB across multiple banks. The GS/OS Loader is designed for this: it loads each segment into its chosen bank, fixes up inter-segment references at load time, and jumps to the entry segment.

Today

  • link816 produces a flat binary covering [--text-base, ...] in a single bank-0 image. All sections are concatenated into one address space. Inter-section relocations are resolved at link time.
  • omfEmit wraps that flat binary in a single OMF segment (KIND=CODE, ORG=0, SEGNUM=1, body = one DS opcode + END). No relocation records emitted (image is already absolute).
  • crt0 enables LC RAM, zeroes BSS, runs .init_array, calls main.
  • All cross-function calls already use JSL (3-byte long) — we never emit JSR. That's accidentally helpful for multi-segment.

Target

A program that builds 4 segments — say:

  • Segment 1 ("MAIN"): crt0 + main + a few hot routines, in bank 1
  • Segment 2 ("CODE"): bulk of code, in bank 2
  • Segment 3 ("DATA"): rodata, in bank 3
  • Segment 4 ("BSS"): uninitialized data + heap, in bank 4

GS/OS Loader places each segment, applies inter-segment relocations (every JSL foo where foo lives in a different segment becomes a JSL <segment-relative-addr> patched at load time with the absolute address), and jumps to the entry.

The four hard problems

1. Section → segment assignment policy

We need a deterministic rule that maps every input object's .text / .rodata / .bss / .init_array section into a specific segment. Three options:

A. Per-object → one segment. Each .o becomes one segment. Simple mental model; bad locality (many tiny segments, lots of inter-segment JSLs); GS/OS Loader has 8KB+ minimum overhead per segment.

B. Greedy bin-packing. Compute total code size; cap each segment at N bytes (e.g. 32KB to leave headroom); pack .text sections into segments greedily in input order. Same for .rodata / .bss. Predictable, but a function near the end of segment N might want to JSL a function at the start of segment N+1 — common pattern, every call becomes inter-segment.

C. Static call graph + clustering. Compute call graph from the relocations, cluster co-calling functions together, pack clusters into segments to minimize inter-segment edges. Best locality, real linker work.

Recommendation: B for v1. Add a --segment-cap option (default 32768). Real applications will want C eventually, but B unblocks "my program is bigger than 64KB" today.
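As a rough illustration of option B, the packing rule can be sketched in a few lines of Python. This is a minimal sketch, not the link816 implementation: the `Section` and `Segment` types, the `pack_greedy` name, and the example sizes are all hypothetical, and it assumes a section is never split across segments.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str  # hypothetical label, e.g. "a.o:.text"
    size: int  # size in bytes


@dataclass
class Segment:
    number: int  # 1-based, matching OMF SEGNUM
    sections: list = field(default_factory=list)
    used: int = 0


def pack_greedy(sections, cap=32768):
    """Pack sections in input order; open a new segment when the cap would be exceeded."""
    segments = []
    current = None
    for sec in sections:
        if sec.size > cap:
            # Assumption: a single section is never split across segments.
            raise ValueError(f"{sec.name} ({sec.size} bytes) exceeds cap {cap}")
        if current is None or current.used + sec.size > cap:
            current = Segment(number=len(segments) + 1)
            segments.append(current)
        current.sections.append(sec)
        current.used += sec.size
    return segments


text = [
    Section("a.o:.text", 20000),
    Section("b.o:.text", 15000),
    Section("c.o:.text", 10000),
]
segs = pack_greedy(text)
for seg in segs:
    print(seg.number, seg.used, [s.name for s in seg.sections])
```

With these sizes, a.o fills segment 1 alone and b.o plus c.o share segment 2. The packer never goes back to fill leftover space in an earlier segment, which keeps the output deterministic for a given input order at the cost of some wasted bytes.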

2. Inter-segment relocation tracking

When a JSL foo reloc resolves to a function in a different segment, we MUST emit an OMF relocation record instead of patching the bytes in-place. Currently link816 patches everything at link time and emits zero reloc records.

The reloc model becomes per-segment:

  • Intra-segment IMM16 / PCREL: patch at link time, no OMF record.
  • Intra-segment IMM24 (JSL): patch the segment-relative offset at link time, and also emit an OMF reloc record, because the load bank is not known until the loader places the segment.
  • Inter-segment IMM24 (cross-bank JSL): emit an INTERSEG record ($E3) pointing at (target_segment_num, offset_within_segment).
  • Inter-segment IMM16 data ref: requires the data segment to land in the same bank as the referencing code OR we need the loader to fail (16-bit absolute can't cross banks). In v1, force all data refs to be to a "data segment" that's in a fixed bank, OR rewrite to long addressing.
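The per-segment rules above amount to a small decision table. The sketch below is one hedged way to write it down; the `Action` names, the `classify` function, and the `data_seg` parameter (the one segment assumed to sit in the DBR bank at a known address) are illustrative assumptions, not existing link816 code.

```python
from enum import Enum


class Action(Enum):
    PATCH_AT_LINK = "patch bytes at link time, no OMF record"
    OMF_RELOC = "patch offset at link time and emit a RELOC record"
    OMF_INTERSEG = "emit an INTERSEG record"
    REJECT = "link error: reference cannot cross segments"


def classify(kind, site_seg, target_seg, data_seg=None):
    """Decide how to handle one relocation.

    kind       -- "PCREL", "IMM16" or "IMM24"
    site_seg   -- segment number containing the patch site
    target_seg -- segment number containing the target symbol
    data_seg   -- assumption: the single data segment pinned to the DBR bank
    """
    same = site_seg == target_seg
    if kind == "PCREL":
        # Relative branches cannot leave the segment.
        return Action.PATCH_AT_LINK if same else Action.REJECT
    if kind == "IMM16":
        if same or target_seg == data_seg:
            return Action.PATCH_AT_LINK
        return Action.REJECT
    if kind == "IMM24":
        return Action.OMF_RELOC if same else Action.OMF_INTERSEG
    raise ValueError(f"unknown relocation kind: {kind}")


print(classify("IMM24", 1, 2).name)
print(classify("IMM16", 2, 3, data_seg=3).name)
print(classify("IMM16", 2, 4, data_seg=3).name)
```

Treating a cross-segment IMM16 reference as a hard error unless it targets the pinned data segment means the 16-bit problem shows up at link time rather than as a wrong read at run time.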

The IMM16 cross-segment problem is the killer. Three responses:

i. Punt: Disallow it. All .rodata references must be in the same segment as the code, OR refs to global data must use long addressing (rewrite at compile time via __attribute__((far))).

ii. Promote to long at link time: Detect IMM16 cross-segment refs, rewrite the instruction's encoding from absolute (3-byte) to absolute-long (4-byte). Changes code size, shifts everything after the patched site, so it is invasive.

iii. Same-bank constraint: Ensure the data segment's bank == the code's DBR. Means all code segments share one DBR, all data lives in one segment in that DBR's bank.

Recommendation: iii for v1. All .rodata lives in one segment in the bank our code uses for DBR. We already pin DBR to bank 0 in crt0 (well, code does pha;plb for bank 2 sometimes for tests, but not in general). For v1, all .rodata goes in bank 0 alongside the first text segment, and code segments in higher banks reference data via long absolute addressing. Need to confirm what addressing modes our backend actually emits for global access.

3. crt0 / loader contract

Current crt0 assumes flat layout:

__start:
  setup CPU mode, stack
  enable LC RAM
  zero BSS [__bss_start..__bss_end]
  run .init_array
  jsl main
  spin

Multi-segment changes:

  • BSS may span multiple segments (bank 0 LC + bank N segment). The __bss_start / __bss_end symbols need to be per-segment, OR a loop over a list of (start, end) pairs the linker emits.
  • .init_array ditto.
  • LC RAM enable only applies to bank 0 — fine.
  • The OMF Loader will handle the actual memory placement; crt0 just runs after Loader is done.
  • The Loader's entry call lands at the segment marked with the entry field. By convention that's segment 1.

Decision: Designate segment 1 as "init segment" containing crt0 + its required symbols (__bss_start_seg1, __init_array_start_seg1, etc.) and the linker emits a __bss_table and __init_array_table — arrays of (start, end) pointers walked by crt0. Same idea Mac OS X's loader uses for multi-segment programs.
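To make the table idea concrete, here is a small Python model of what the linker would emit and what crt0 would do with it. The entry layout (two little-endian 32-bit words per range, with a (0, 0) terminator) is an assumption chosen for illustration; the real table could equally use 24-bit pointers or an explicit count.

```python
import struct

ENTRY = "<II"  # assumed layout: start and end as little-endian 32-bit words
ENTRY_SIZE = struct.calcsize(ENTRY)


def build_range_table(ranges):
    """Serialize (start, end) pairs, followed by a (0, 0) terminator entry."""
    out = bytearray()
    for start, end in ranges:
        if end < start:
            raise ValueError(f"bad range {start:#x}..{end:#x}")
        out += struct.pack(ENTRY, start, end)
    out += struct.pack(ENTRY, 0, 0)
    return bytes(out)


def zero_bss(memory, table):
    """Walk the table as crt0 would and clear each [start, end) range."""
    offset = 0
    while True:
        start, end = struct.unpack_from(ENTRY, table, offset)
        if start == 0 and end == 0:
            break
        memory[start:end] = bytes(end - start)
        offset += ENTRY_SIZE


# Toy 64-byte "memory" filled with 0xFF so the cleared ranges are visible.
mem = bytearray(b"\xff" * 64)
bss_table = build_range_table([(4, 8), (16, 20)])
zero_bss(mem, bss_table)
print(mem.hex())
```

An __init_array_table could reuse the same layout, with crt0 calling each pointer in a range instead of clearing it.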

4. Build pipeline + tests

  • link816 --segment-cap N emits multiple (image, base, syms) triples plus inter-segment reloc records.
  • New intermediate format between linker and omfEmit: a small manifest file listing each segment's body, base, name, and reloc records. Easier than passing all that on the CLI.
  • omfEmit reads the manifest and emits a single multi-segment OMF file with proper INTERSEG opcodes.
  • Smoke needs new test: build a program with --segment-cap 8192 so it forces ≥2 segments even for our small benches; verify under MAME via a GS/OS-loader-aware test path. (We don't have GS/OS-loaded tests today — see "Risks" below.)

Phased implementation

Phase 1: linker emits per-segment images + manifest

  • link816 --segment-cap N --manifest manifest.json -o out
  • Pack .text greedy into segments 1..K capped at N bytes each.
  • All .rodata into segment K+1 (the "data segment").
  • All .bss into segment K+2.
  • Resolve intra-segment relocations.
  • Write inter-segment relocations into the manifest.
  • Emit one flat binary per segment; manifest references them by path.
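A possible shape for the manifest, with a consistency check that could run before omfEmit is involved, is sketched below. Every field name here ("number", "image", "relocs", "target_segment", and so on) and every file name is a placeholder; the actual schema is still to be decided.

```python
import json


def validate(manifest):
    """Return a list of problems; an empty list means the manifest is self-consistent."""
    sizes = {seg["number"]: seg["size"] for seg in manifest["segments"]}
    problems = []
    for seg in manifest["segments"]:
        for rel in seg["relocs"]:
            if rel["offset"] + rel["width"] > seg["size"]:
                problems.append(f"segment {seg['number']}: patch site past end of image")
            target = rel["target_segment"]
            if target not in sizes:
                problems.append(f"segment {seg['number']}: unknown target segment {target}")
            elif rel["target_offset"] >= sizes[target]:
                problems.append(f"segment {seg['number']}: target offset outside segment {target}")
    return problems


manifest = {
    "version": 1,
    "segments": [
        {
            "number": 1, "name": "MAIN", "kind": "code",
            "image": "out.seg1.bin", "size": 0x200,
            "relocs": [
                # JSL at offset 0x10: the 3-byte operand starts at 0x11.
                {"type": "interseg", "offset": 0x11, "width": 3,
                 "target_segment": 2, "target_offset": 0x40},
            ],
        },
        {
            "number": 2, "name": "CODE", "kind": "code",
            "image": "out.seg2.bin", "size": 0x100,
            "relocs": [],
        },
    ],
}

text = json.dumps(manifest, indent=2)
print(text)
print(validate(json.loads(text)))
```

Keeping sizes and relocation targets in the manifest lets Phase 1 be checked by inspection, without reading the per-segment images back in.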

Phase 2: omfEmit consumes manifest, emits multi-segment OMF

  • One OMF segment header per manifest entry.
  • LCONST ($F2) records for body bytes (DS, $F1, only reserves zero-filled space, which suits the BSS segment).
  • INTERSEG ($E3) opcodes for inter-segment reloc patch sites.
  • RELOC ($E2) opcodes for intra-segment relocations that need load-time fixup (JSL targets within same segment but different bank than expected).
  • END opcode terminator per segment.
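The record encodings for a segment body can be sketched as follows. The opcode values and field layouts (LCONST $F2, DS $F1, RELOC $E2, INTERSEG $E3, END $00, 4-byte little-endian numbers) reflect my reading of the OMF record table and should be checked against the official specification before omfEmit relies on them; the helper names are hypothetical.

```python
import struct


def lconst(data):
    """LCONST ($F2): 4-byte length followed by literal body bytes."""
    return b"\xF2" + struct.pack("<I", len(data)) + bytes(data)


def ds(count):
    """DS ($F1): reserve `count` zero-filled bytes."""
    return b"\xF1" + struct.pack("<I", count)


def reloc(width, shift, offset, value):
    """RELOC ($E2): patch `width` bytes at `offset` with a same-segment address."""
    return struct.pack("<BBBII", 0xE2, width, shift & 0xFF, offset, value)


def interseg(width, shift, offset, segnum, target_offset, filenum=1):
    """INTERSEG ($E3): patch `width` bytes at `offset` with an address in another segment."""
    return struct.pack("<BBBIHHI", 0xE3, width, shift & 0xFF,
                       offset, filenum, segnum, target_offset)


END = b"\x00"

# A segment whose body is a single JSL ($22) with a placeholder operand,
# patched at load time to point at offset 0x40 of segment 2.
body = bytes([0x22, 0x00, 0x00, 0x00])
records = lconst(body) + interseg(3, 0, 1, segnum=2, target_offset=0x40) + END
print(records.hex(" "))
```

The patch offset is 1 rather than 0 because the relocation covers the three operand bytes after the JSL opcode, not the opcode itself.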

Phase 3: runtime updates

  • Linker emits __bss_table[] and __init_array_table[] instead of single __bss_start/__bss_end symbols.
  • crt0 walks those tables.
  • crt0.s removes the LC-enable hardcoding from segment 1 if segment 1 isn't bank 0 (configurable).

Phase 4: tests + smoke

  • Bench harness builds with --segment-cap 8192 to force multi-segment even for small programs; verify output size growth (should be small — just OMF headers + reloc records overhead).
  • Need a GS/OS-aware MAME test path (boot a ProDOS volume with our OMF binary, let GS/OS Loader load it, check markers in bank 2). This is the test we deferred earlier in the GS/OS smoke task. Phase 4 reopens the GS/OS-volume smoke decision — multi-segment is the main reason to even care about that.

Scope estimate

  • Phase 1: 2-3 sessions (linker rework, careful with reloc accounting)
  • Phase 2: 1 session (mostly OMF format work, well-specified)
  • Phase 3: 1 session (crt0 + linker symbol table changes)
  • Phase 4: 2-3 sessions (GS/OS-loaded test infra is the slog, not the multi-segment logic itself)

Total: ~6-8 focused sessions. Phases 1-3 deliver a working multi-segment binary; phase 4 makes it testable in CI.

Risks

  • DBR management is genuinely tricky. Code in segment 2 (bank 2) doing lda foo where foo is in segment K+1 (bank 0): the absolute fetch uses DBR. If DBR != bank-of-foo, we read garbage. The cleanest rule (DBR=0 always; data refs use long via __attribute__((far)) or a backend pass that promotes them) requires backend cooperation we don't have. v1's "all data in one segment in DBR's bank" works but constrains data size to ~60KB.
  • The Loader's behaviour around segment placement is poorly documented. Apple's GS/OS Loader picks banks dynamically; we may end up with code segments in banks the loader chose, with relocations that work, but layouts that surprise us. Mitigation: use STATIC segments (KIND bit) initially so the loader can't move them.
  • Smoke needs a real GS/OS volume image. This is the same blocker as the deferred GS/OS file I/O smoke — needs a 2img/po image with ProDOS volume + a way to run our OMF through the actual loader. Without that, multi-segment logic is testable only by inspection of the OMF bytes and a hand-rolled mini-loader (which we'd have to write).
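The DBR risk in the first bullet comes down to how the 65816 forms a data address. The toy model below (function names are illustrative, and it ignores indexing and direct-page modes) shows why the same 16-bit operand reads a different location once DBR changes, while a long operand does not depend on DBR at all.

```python
def effective_absolute(dbr, operand16):
    """Absolute addressing: the bank byte comes from DBR, the low 16 bits from the operand."""
    return ((dbr & 0xFF) << 16) | (operand16 & 0xFFFF)


def effective_long(operand24):
    """Absolute-long addressing: the full 24-bit address is in the instruction."""
    return operand24 & 0xFFFFFF


foo = 0x001234  # hypothetical data symbol living in bank 0

ok = effective_absolute(0x00, foo & 0xFFFF)   # DBR = 0: reads foo
bad = effective_absolute(0x02, foo & 0xFFFF)  # DBR = 2: reads bank 2 instead
far = effective_long(foo)                     # independent of DBR

print(f"{ok:06X} {bad:06X} {far:06X}")
```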

Recommendation

Start Phase 1. The linker work is contained, mostly mechanical, and the manifest format gives us a clean handoff to omfEmit work in Phase 2. We can validate Phase 1 by inspecting the per-segment images + manifest before any OMF / loader work.

Phase 4's GS/OS-volume test path is the biggest unknown. Reasonable to defer that decision until Phases 1-3 are working — at that point we can decide whether to invest in proper GS/OS-loaded smoke or accept "multi-segment OMF emits valid bytes per the spec" as the test bar.