65816-llvm-mos/docs/multiSegmentPlan.md

# Multi-segment OMF support — plan

## Why

A single segment caps usable space at roughly 60KB in bank 0, after the
I/O window ($C000-$CFFF), the stack at $0FFF, and crt0 / runtime
overhead. Real IIgs applications need hundreds of KB across multiple
banks. The GS/OS Loader is designed for this: it loads each segment into
its chosen bank, fixes up inter-segment references at load time, and
jumps to the entry segment.

## Today

- `link816` produces a flat binary covering `[--text-base, ...]` in a
  single bank-0 image. All sections are concatenated into one address
  space. Inter-section relocations are resolved at link time.
- `omfEmit` wraps that flat binary in a single OMF segment (KIND=CODE,
  ORG=0, SEGNUM=1, body = one LCONST record carrying the image bytes +
  END; a DS record can only reserve zero-filled space). No relocation
  records are emitted, because the image is already absolute.
- `crt0` enables LC RAM, zeroes BSS, runs `.init_array`, calls `main`.
- All cross-function calls already use JSL (long call with a 3-byte,
  24-bit target); we never emit JSR. That's accidentally helpful for
  multi-segment.

## Target

A program that builds 4 segments — say:
- Segment 1 ("MAIN"): crt0 + main + a few hot routines, in bank 1
- Segment 2 ("CODE"): bulk of code, in bank 2
- Segment 3 ("DATA"): rodata, in bank 3
- Segment 4 ("BSS"): uninitialized data + heap, in bank 4

GS/OS Loader places each segment, applies inter-segment relocations
(every `JSL foo` where `foo` lives in a different segment becomes a
`JSL <segment-relative-addr>` patched at load time with the absolute
address), and jumps to the entry.

## The four hard problems

### 1. Section → segment assignment policy

We need a deterministic rule that maps every input object's `.text` /
`.rodata` / `.bss` / `.init_array` section into a specific segment.
Three options:

**A. Per-object → one segment.** Each `.o` becomes one segment. Simple
mental model; bad locality (many tiny segments, lots of inter-segment
JSLs); GS/OS Loader has 8KB+ minimum overhead per segment.

**B. Greedy bin-packing.** Compute total code size; cap each segment at
N bytes (e.g. 32KB to leave headroom); pack `.text` sections into
segments greedily in input order. Same for `.rodata` / `.bss`.
Predictable, but the split points ignore the call graph: when a function
near the end of segment N calls a function at the start of segment N+1
(a common pattern for adjacent code), every such call becomes
inter-segment.

**C. Static call graph + clustering.** Compute call graph from the
relocations, cluster co-calling functions together, pack clusters into
segments to minimize inter-segment edges. Best locality, real linker
work.

**Recommendation: B for v1.** Add a `--segment-cap` option (default
32768). Real applications will want C eventually, but B unblocks
"my program is bigger than 64KB" today.

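A minimal sketch of option B's packing rule, written in Python for readability rather than as `link816` code. The `Section` and `Segment` types and the `pack_greedy` name are hypothetical; the sketch only assumes that each input section has a known size and that a section is never split across segments.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str  # hypothetical label, e.g. "foo.o:.text"
    size: int  # size in bytes


@dataclass
class Segment:
    number: int  # 1-based segment number
    sections: list = field(default_factory=list)
    used: int = 0  # bytes already assigned to this segment


def pack_greedy(sections, cap=32768):
    """Assign sections to segments in input order without exceeding cap."""
    segments = []
    current = None
    for sec in sections:
        if sec.size > cap:
            raise ValueError(f"{sec.name} ({sec.size} bytes) exceeds cap {cap}")
        # Open a new segment when the current one cannot hold this section.
        if current is None or current.used + sec.size > cap:
            current = Segment(number=len(segments) + 1)
            segments.append(current)
        current.sections.append(sec)
        current.used += sec.size
    return segments


example = pack_greedy(
    [Section("a", 20000), Section("b", 10000), Section("c", 5000), Section("d", 30000)]
)
```

Because the packer never revisits an earlier segment, the result depends only on input order, which is what makes it deterministic. The cost is wasted space: in the example, `c` opens a new segment that `d` then cannot share.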
### 2. Inter-segment relocation tracking

When a `JSL foo` reloc resolves to a function in a different segment,
we MUST emit an OMF relocation record instead of patching the bytes
in-place. Currently `link816` patches everything at link time and emits
zero reloc records.

The reloc model becomes per-segment:

- Intra-segment IMM16 / PCREL: patch at link time, no OMF record. (For
  IMM16 this assumes the segment's offset within its bank is fixed at
  link time.)
- Intra-segment IMM24 (JSL): patch the segment-relative offset in at
  link time and also emit an OMF reloc record, because the load bank is
  not known until the loader places the segment.
- Inter-segment IMM24 (cross-bank JSL): emit an `INTERSEG` record (`E3`)
  pointing at `(target_segment_num, offset_within_segment)`.
- Inter-segment IMM16 data ref: a 16-bit absolute operand can't cross
  banks, so either the data segment lands in the same bank as the
  referencing code or the load has to fail. In v1, either force all data
  refs to target a "data segment" in a fixed bank, or rewrite them to
  long addressing.
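
The per-segment rules above can be written down as a small decision function. This is a sketch in Python, not `link816` code: the `Action` enum, the `classify` name, and the string relocation kinds are all hypothetical, and the cross-segment IMM16 case is reported back to the caller rather than resolved, since the policy for it is still open.

```python
from enum import Enum


class Action(Enum):
    PATCH_AT_LINK = "patch at link time, no OMF record"
    OMF_RELOC = "patch offset, emit RELOC record"
    OMF_INTERSEG = "emit INTERSEG record"
    NEEDS_POLICY = "cross-segment 16-bit ref, policy required"


def classify(kind, site_segment, target_segment):
    """Map one relocation to the action the linker should take.

    kind is one of "PCREL", "IMM16", "IMM24" (hypothetical names).
    """
    same_segment = site_segment == target_segment
    if kind == "PCREL":
        if not same_segment:
            # Relative branches cannot reach another segment.
            raise ValueError("PC-relative reference crosses a segment boundary")
        return Action.PATCH_AT_LINK
    if kind == "IMM16":
        return Action.PATCH_AT_LINK if same_segment else Action.NEEDS_POLICY
    if kind == "IMM24":
        # The load bank is unknown at link time, so even same-segment
        # 24-bit targets need a load-time fixup.
        return Action.OMF_RELOC if same_segment else Action.OMF_INTERSEG
    raise ValueError(f"unknown relocation kind: {kind}")
```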

The IMM16 cross-segment problem is the killer. Three responses:

  i. **Punt:** Disallow it. All `.rodata` references must be in the
     same segment as the code, OR refs to global data must use long
     addressing (rewrite at compile time via `__attribute__((far))`).
  ii. **Promote to long at link time:** Detect IMM16 cross-segment
      refs, rewrite the instruction's encoding from absolute (3-byte)
      to absolute-long (4-byte). Changes code size, shifts everything
      after the patched site — invasive.
  iii. **Same-bank constraint:** Ensure the data segment's bank ==
       the code's DBR. Means all code segments share one DBR, all data
       lives in one segment in that DBR's bank.

  **Recommendation: iii for v1.** All `.rodata` lives in one segment
  in the bank our code uses for DBR. crt0 already pins DBR to bank 0
  (some tests switch to bank 2 with `pha; plb`, but nothing does so in
  general). So for v1, all `.rodata` goes in bank 0 alongside the first
  text segment. Code segments in higher banks reach it either through
  16-bit absolute operands with DBR left at 0, or through long absolute
  addressing. Which of these applies depends on the addressing modes
  our backend actually emits for global access, and that still needs to
  be confirmed.
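
To make the DBR dependence concrete, here is a small Python model of how the 65816 forms a data address for absolute versus absolute-long operands. The function names are hypothetical, and the model ignores the corner case of a 16-bit access that straddles a bank boundary.

```python
def absolute_effective_address(dbr, operand16):
    """Absolute data addressing: bank comes from DBR, offset from the operand."""
    return ((dbr & 0xFF) << 16) | (operand16 & 0xFFFF)


def long_effective_address(operand24):
    """Absolute-long addressing: the operand carries its own bank byte."""
    return operand24 & 0xFFFFFF


# foo lives at $00:8000. Code running with DBR = 2 that does `lda $8000`
# reads $02:8000 instead, which is the failure mode option iii rules out.
foo = 0x008000
wrong = absolute_effective_address(dbr=0x02, operand16=foo & 0xFFFF)
right = absolute_effective_address(dbr=0x00, operand16=foo & 0xFFFF)
far = long_effective_address(foo)
```

With DBR held at 0, the 3-byte absolute form and the 4-byte long form resolve to the same address, so no instruction has to change size.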

### 3. crt0 / loader contract

Current crt0 assumes flat layout:

```
__start:
  setup CPU mode, stack
  enable LC RAM
  zero BSS [__bss_start..__bss_end]
  run .init_array
  jsl main
  spin
```

Multi-segment changes:

- BSS may span multiple segments (bank 0 LC + bank N segment). The
  `__bss_start` / `__bss_end` symbols need to be per-segment, OR a
  loop over a list of `(start, end)` pairs the linker emits.
- `.init_array` ditto.
- LC RAM enable only applies to bank 0 — fine.
- The OMF Loader will handle the actual memory placement; crt0 just
  runs after Loader is done.
- The Loader's entry call lands at the segment marked with the entry
  field. By convention that's segment 1.

**Decision:** Designate segment 1 as "init segment" containing crt0 +
its required symbols (`__bss_start_seg1`, `__init_array_start_seg1`,
etc.) and the linker emits a `__bss_table` and `__init_array_table` —
arrays of `(start, end)` pointers walked by crt0. Same idea Mac OS X's
loader uses for multi-segment programs.
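
A sketch of the table idea, modelled in Python rather than 65816 assembly. The entry layout is an assumption made for illustration: each entry is a `(start, end)` pair stored as two 4-byte little-endian values holding 24-bit addresses, with `end` exclusive. The real `__bss_table` layout has not been decided, and `emit_range_table` / `walk_and_zero` are hypothetical names.

```python
import struct

ENTRY_SIZE = 8  # assumed: two 4-byte little-endian addresses per entry


def emit_range_table(ranges):
    """Linker side: serialise (start, end) address pairs into table bytes."""
    out = bytearray()
    for start, end in ranges:
        if not (0 <= start <= end <= 0xFFFFFF):
            raise ValueError(f"bad range {start:#x}..{end:#x}")
        out += struct.pack("<II", start, end)
    return bytes(out)


def walk_and_zero(table, memory):
    """crt0 side: zero every [start, end) range listed in the table."""
    for off in range(0, len(table), ENTRY_SIZE):
        start, end = struct.unpack_from("<II", table, off)
        memory[start:end] = bytes(end - start)


# Two BSS regions: one in bank 0, one in bank 2.
memory = bytearray(b"\xff" * 0x30000)
bss_table = emit_range_table([(0x000800, 0x000810), (0x020000, 0x020004)])
walk_and_zero(bss_table, memory)
```

An `.init_array` table would be walked the same way, except that each range holds function pointers to call rather than bytes to clear.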

### 4. Build pipeline + tests

- `link816 --segment-cap N` emits multiple `(image, base, syms)`
  triples plus inter-segment reloc records.
- New intermediate format between linker and `omfEmit`: a small
  manifest file listing each segment's body, base, name, and reloc
  records. Easier than passing all that on the CLI.
- `omfEmit` reads the manifest and emits a single multi-segment OMF
  file with proper INTERSEG opcodes.
- Smoke needs new test: build a program with `--segment-cap 8192` so it
  forces ≥2 segments even for our small benches; verify under MAME via
  a GS/OS-loader-aware test path. (We don't have GS/OS-loaded tests
  today — see "Risks" below.)

## Phased implementation

### Phase 1: linker emits per-segment images + manifest
- `link816 --segment-cap N --manifest manifest.json -o out`
- Pack `.text` greedy into segments 1..K capped at N bytes each.
- All `.rodata` into segment K+1 (the "data segment").
- All `.bss` into segment K+2.
- Resolve intra-segment relocations.
- Write inter-segment relocations into the manifest.
- Emit one flat binary per segment; manifest references them by path.
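
One possible shape for the manifest, sketched in Python. Every field name here (`number`, `image`, `site_offset`, and so on), the `build_manifest` helper, and the per-segment file names are hypothetical; the only things taken from the plan are that the manifest lists each segment's body, base, name, and inter-segment reloc records, and that bodies are referenced by path.

```python
import json


def build_manifest(segments, relocs):
    """Check that relocs reference known segments, then assemble the manifest."""
    numbers = {seg["number"] for seg in segments}
    for r in relocs:
        for key in ("site_segment", "target_segment"):
            if r[key] not in numbers:
                raise ValueError(f"reloc references unknown segment {r[key]}")
    return {"version": 1, "segments": segments, "relocs": relocs}


segments = [
    {"number": 1, "name": "MAIN", "kind": "code", "base": 0,
     "image": "out.seg1.bin", "size": 8000},
    {"number": 2, "name": "CODE", "kind": "code", "base": 0,
     "image": "out.seg2.bin", "size": 8100},
]
relocs = [
    # A JSL in segment 1 whose 3-byte operand targets offset $0040 of segment 2.
    {"site_segment": 1, "site_offset": 0x0124, "width": 3,
     "target_segment": 2, "target_offset": 0x0040},
]
manifest_text = json.dumps(build_manifest(segments, relocs), indent=2)
```

Keeping relocations as `(segment, offset)` pairs on both ends means `omfEmit` does not need the symbol table to produce its records.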

### Phase 2: omfEmit consumes manifest, emits multi-segment OMF
- One OMF segment header per manifest entry.
- LCONST records for body bytes (DS only reserves zero-filled space, so
  it fits `.bss`, not code or data).
- INTERSEG (`E3`) records for inter-segment reloc patch sites.
- RELOC (`E2`) records for intra-segment relocations that need
  load-time fixup (e.g. JSL targets in the same segment, whose bank is
  only known once the loader has placed the segment).
- END opcode terminator per segment.
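
A sketch of how the two relocation records might be serialised, in Python. The field order and widths reflect my reading of the OMF v2 record layout (RELOC = `$E2`, INTERSEG = `$E3`, 4-byte offsets, 2-byte file and segment numbers) and should be checked against the OMF appendix of the GS/OS Reference before `omfEmit` relies on them. The function names are hypothetical.

```python
import struct

OMF_RELOC = 0xE2      # assumed record code, verify against the OMF spec
OMF_INTERSEG = 0xE3   # assumed record code, verify against the OMF spec


def encode_reloc(byte_count, bit_shift, patch_offset, ref_offset):
    """Intra-segment fixup: patch byte_count bytes at patch_offset."""
    return struct.pack("<BBbII", OMF_RELOC, byte_count, bit_shift,
                       patch_offset, ref_offset)


def encode_interseg(byte_count, bit_shift, patch_offset,
                    file_num, seg_num, ref_offset):
    """Inter-segment fixup: the target is (seg_num, ref_offset)."""
    return struct.pack("<BBbIHHI", OMF_INTERSEG, byte_count, bit_shift,
                       patch_offset, file_num, seg_num, ref_offset)


# JSL at offset $1234 in this segment: its 3-byte operand starts at $1235
# and should resolve to offset $0040 in segment 2 of the same load file.
record = encode_interseg(byte_count=3, bit_shift=0, patch_offset=0x1235,
                         file_num=1, seg_num=2, ref_offset=0x0040)
```

Note that the patch offset points at the operand, one byte past the JSL opcode, so the manifest needs to record operand offsets rather than instruction offsets (or `omfEmit` has to add one).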

### Phase 3: runtime updates
- Linker emits `__bss_table[]` and `__init_array_table[]` instead of
  single `__bss_start`/`__bss_end` symbols.
- crt0 walks those tables.
- `crt0.s` makes the LC-enable step configurable, so it can be skipped
  when segment 1 isn't in bank 0 instead of being hardcoded.

### Phase 4: tests + smoke
- Bench harness builds with `--segment-cap 8192` to force multi-segment
  even for small programs; verify output size growth (should be small —
  just OMF headers + reloc records overhead).
- Need a GS/OS-aware MAME test path (boot a ProDOS volume with our OMF
  binary, let GS/OS Loader load it, check markers in bank 2). This is
  the test we deferred earlier in the GS/OS smoke task. **Phase 4
  reopens the GS/OS-volume smoke decision** — multi-segment is the
  main reason to even care about that.

## Scope estimate

- Phase 1: 2-3 sessions (linker rework, careful with reloc accounting)
- Phase 2: 1 session (mostly OMF format work, well-specified)
- Phase 3: 1 session (crt0 + linker symbol table changes)
- Phase 4: 2-3 sessions (GS/OS-loaded test infra is the slog, not the
  multi-segment logic itself)

Total: ~6-8 focused sessions. Phases 1-3 deliver a working multi-
segment binary; phase 4 makes it testable in CI.

## Risks

- **DBR management is genuinely tricky.** Code in segment 2 (bank 2)
  doing `lda foo` where foo is in segment K+1 (bank 0): the absolute
  fetch uses DBR. If DBR != bank-of-foo, we read garbage. The cleanest
  rule (DBR=0 always; data refs use long via `__attribute__((far))` or
  a backend pass that promotes them) requires backend cooperation
  we don't have. v1's "all data in one segment in DBR's bank" works
  but constrains data size to ~60KB.
- **The Loader's behaviour around segment placement is poorly
  documented.** Apple's GS/OS Loader picks banks dynamically; we may
  end up with code segments in banks the loader chose, with relocations
  that work, but layouts that surprise us. Mitigation: use STATIC
  segments (KIND bit) initially so the loader can't move them.
- **Smoke needs a real GS/OS volume image.** This is the same blocker
  as the deferred GS/OS file I/O smoke: it needs a 2IMG (`.2mg`) or
  `.po` disk image holding a ProDOS volume, plus a way to run our OMF
  through the actual loader. Without that, multi-segment logic is
  testable only by inspecting the OMF bytes and by running a hand-rolled
  mini-loader (which we'd have to write).

## Recommendation

Start Phase 1. The linker work is contained, mostly mechanical, and
the manifest format gives us a clean handoff to `omfEmit` work in
Phase 2. We can validate Phase 1 by inspecting the per-segment images
+ manifest before any OMF / loader work.

Phase 4's GS/OS-volume test path is the biggest unknown. Reasonable to
defer that decision until Phases 1-3 are working — at that point we
can decide whether to invest in proper GS/OS-loaded smoke or accept
"multi-segment OMF emits valid bytes per the spec" as the test bar.