65816-llvm-mos/docs/multiSegmentPlan.md
Scott Duensing f542f4fa01 Checkpoint
2026-05-03 21:31:53 -05:00

# Multi-segment OMF support — plan
## Why
A single segment caps out at roughly 60KB usable in bank 0 after the
IO window ($C000-$CFFF), the stack at $0FFF, and crt0 / runtime
overhead. Real IIgs applications need hundreds of KB across multiple
banks. The GS/OS Loader is designed for this: it loads each segment
into its chosen bank, fixes up inter-segment references at load time,
and jumps to the entry segment.
## Today
- `link816` produces a flat binary covering `[--text-base, ...]` in a
single bank-0 image. All sections are concatenated into one address
space. Inter-section relocations are resolved at link time.
- `omfEmit` wraps that flat binary in a single OMF segment (KIND=CODE,
ORG=0, SEGNUM=1, body = one LCONST opcode + END; in OMF, DS only
reserves zero-filled space). No relocation records are emitted (the
image is already absolute).
- `crt0` enables LC RAM, zeroes BSS, runs `.init_array`, calls `main`.
- All cross-function calls already use JSL (24-bit long address); we
never emit JSR. That's accidentally helpful for multi-segment.
## Target
A program that builds 4 segments — say:
- Segment 1 ("MAIN"): crt0 + main + a few hot routines, in bank 1
- Segment 2 ("CODE"): bulk of code, in bank 2
- Segment 3 ("DATA"): rodata, in bank 3
- Segment 4 ("BSS"): uninitialized data + heap, in bank 4
GS/OS Loader places each segment, applies inter-segment relocations
(every `JSL foo` where `foo` lives in a different segment becomes a
`JSL <segment-relative-addr>` patched at load time with the absolute
address), and jumps to the entry.
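
To make the load-time fixup concrete, here is a minimal Python sketch of what the loader conceptually does to one `JSL` operand. The name `patch_jsl` and its parameters are hypothetical; the real Loader works from OMF relocation records, not from direct calls like this.

```python
def patch_jsl(segment: bytearray, operand_offset: int,
              target_base: int, target_offset: int) -> None:
    """Overwrite a JSL's 3-byte operand with a 24-bit absolute address.

    target_base is where the loader placed the target segment
    (hypothetical input); target_offset is the callee's offset in it.
    """
    address = (target_base + target_offset) & 0xFFFFFF
    # 65816 operands are little-endian: low byte, high byte, bank byte.
    segment[operand_offset:operand_offset + 3] = address.to_bytes(3, "little")


# JSL is opcode $22; the linker left a segment-relative placeholder.
code = bytearray([0x22, 0x34, 0x12, 0x00])
patch_jsl(code, 1, target_base=0x020000, target_offset=0x1234)
assert code == bytearray([0x22, 0x34, 0x12, 0x02])
```

The only thing that changes in this example is the bank byte, which is why the link-time image can already carry the correct 16-bit offset.
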
## The four hard problems
### 1. Section → segment assignment policy
We need a deterministic rule that maps every input object's `.text` /
`.rodata` / `.bss` / `.init_array` section into a specific segment.
Three options:

**A. Per-object → one segment.** Each `.o` becomes one segment. Simple
mental model; bad locality (many tiny segments, lots of inter-segment
JSLs); GS/OS Loader has 8KB+ minimum overhead per segment.

**B. Greedy bin-packing.** Compute total code size; cap each segment at
N bytes (e.g. 32KB to leave headroom); pack `.text` sections into
segments greedily in input order. Same for `.rodata` / `.bss`.
Predictable, but a function near the end of segment N might want to
JSL a function at the start of segment N+1. That is a common pattern,
and every such call becomes inter-segment.

**C. Static call graph + clustering.** Compute call graph from the
relocations, cluster co-calling functions together, pack clusters into
segments to minimize inter-segment edges. Best locality, real linker
work.

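
Option B is simple enough to sketch in a few lines. The following Python is an illustration only: the `Section` type and `pack_greedy` function are hypothetical names, not existing `link816` code, and the sketch assumes a section is never split across segments.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    size: int  # bytes


def pack_greedy(sections, cap=32768):
    """Pack sections into segments in input order, each at most cap bytes."""
    segments, current, used = [], [], 0
    for sec in sections:
        if sec.size > cap:
            # Assumption: an oversized section is an error, not split.
            raise ValueError(f"{sec.name} ({sec.size} bytes) exceeds cap {cap}")
        if current and used + sec.size > cap:
            segments.append(current)   # close the full segment
            current, used = [], 0
        current.append(sec)
        used += sec.size
    if current:
        segments.append(current)
    return segments


layout = pack_greedy([Section("a", 20000), Section("b", 10000),
                      Section("c", 5000), Section("d", 30000)])
print([[s.name for s in seg] for seg in layout])
```

Because packing follows input order, `c` ends up alone in a segment even though it would have fit next to `d` under a smarter policy. That is the predictability-versus-locality trade-off described for option B.
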
**Recommendation: B for v1.** Add a `--segment-cap` option (default
32768). Real applications will want C eventually, but B unblocks
"my program is bigger than 64KB" today.
### 2. Inter-segment relocation tracking
When a `JSL foo` reloc resolves to a function in a different segment,
we MUST emit an OMF relocation record instead of patching the bytes
in-place. Currently `link816` patches everything at link time and emits
zero reloc records.
The reloc model becomes per-segment:
- Intra-segment IMM16 / PCREL: patch at link time, no OMF record.
- Intra-segment IMM24 (JSL): patch the low 24 bits with the
segment-relative offset at link time, and also emit an OMF `RELOC`
record, because we don't know the load bank until the loader places
the segment.
- Inter-segment IMM24 (cross-bank JSL): emit `INTERSEG` opcode (`E3`)
pointing at `(target_segment_num, offset_within_segment)`.
- Inter-segment IMM16 data ref: requires the data segment to land in
the same bank as the referencing code OR we need the loader to fail
(16-bit absolute can't cross banks). In v1, force all data refs to be
to a "data segment" that's in a fixed bank, OR rewrite to long
addressing.
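
The per-segment rules above amount to a small decision table. A minimal Python sketch follows; the `Action` enum, the `classify` function, and the string reloc kinds are hypothetical names for illustration. Treating a cross-segment PCREL as an error is an assumption, since the list above does not cover that case.

```python
from enum import Enum


class Action(Enum):
    PATCH_AT_LINK = "patch at link time, no OMF record"
    EMIT_RELOC = "patch offset at link time and emit OMF RELOC"
    EMIT_INTERSEG = "emit OMF INTERSEG"
    NEEDS_POLICY = "cross-segment IMM16: needs a policy decision"


def classify(kind: str, site_seg: int, target_seg: int) -> Action:
    """Decide how the linker handles one relocation."""
    same = site_seg == target_seg
    if kind == "IMM24":
        # The load bank is unknown at link time, so even the
        # intra-segment case needs a load-time fixup.
        return Action.EMIT_RELOC if same else Action.EMIT_INTERSEG
    if kind == "IMM16":
        return Action.PATCH_AT_LINK if same else Action.NEEDS_POLICY
    if kind == "PCREL":
        if same:
            return Action.PATCH_AT_LINK
        # Assumption: relative branches cannot leave their segment.
        raise ValueError("PC-relative reference crosses segments")
    raise ValueError(f"unknown reloc kind: {kind}")


print(classify("IMM24", 1, 2).value)
```
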

The IMM16 cross-segment problem is the killer. Three responses:

i. **Punt:** Disallow it. All `.rodata` references must be in the
same segment as the code, OR refs to global data must use long
addressing (rewrite at compile time via `__attribute__((far))`).

ii. **Promote to long at link time:** Detect IMM16 cross-segment
refs, rewrite the instruction's encoding from absolute (3-byte)
to absolute-long (4-byte). This changes code size and shifts
everything after the patched site, so it is invasive.

iii. **Same-bank constraint:** Ensure the data segment's bank ==
the code's DBR. This means all code segments share one DBR, and all
data lives in one segment in that DBR's bank.

**Recommendation: iii for v1.** All `.rodata` lives in one segment
in the bank our code uses for DBR. We already pin DBR to bank 0 in
crt0 (some test code does `pha;plb` to switch to bank 2, but not in
general). For v1, all `.rodata` goes in bank 0 alongside the
first text segment, and code segments in higher banks reference data
via long absolute addressing. We still need to confirm what addressing
modes our backend actually emits for global access.
### 3. crt0 / loader contract
Current crt0 assumes flat layout:
```
__start:
setup CPU mode, stack
enable LC RAM
zero BSS [__bss_start..__bss_end]
run .init_array
jsl main
spin
```
Multi-segment changes:
- BSS may span multiple segments (bank 0 LC + bank N segment). The
`__bss_start` / `__bss_end` symbols need to be per-segment, OR a
loop over a list of `(start, end)` pairs the linker emits.
- `.init_array` ditto.
- LC RAM enable only applies to bank 0 — fine.
- The OMF Loader will handle the actual memory placement; crt0 just
runs after Loader is done.
- The Loader's entry call lands at the segment marked with the entry
field. By convention that's segment 1.

**Decision:** Designate segment 1 as the "init segment" containing
crt0 and its required symbols (`__bss_start_seg1`,
`__init_array_start_seg1`, etc.). The linker emits `__bss_table` and
`__init_array_table` arrays of `(start, end)` pointers, which crt0
walks. Same idea Mac OS X's loader uses for multi-segment programs.
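
As a model of the table walk, here is a small Python sketch. The real crt0 is assembly and the real table layout is not decided yet. The encoding used here (32-bit little-endian `start`/`end` words, a `(0, 0)` terminator, and a half-open `[start, end)` range) is an assumption for illustration, as are the function names.

```python
import struct


def build_range_table(ranges):
    """Linker side: encode (start, end) pairs, then a (0, 0) terminator."""
    out = bytearray()
    for start, end in ranges:
        if end < start:
            raise ValueError("range end precedes start")
        out += struct.pack("<II", start, end)
    out += struct.pack("<II", 0, 0)
    return bytes(out)


def zero_bss(memory: bytearray, table: bytes) -> None:
    """crt0 side: zero [start, end) for each entry until the terminator."""
    for start, end in struct.iter_unpack("<II", table):
        if start == 0 and end == 0:
            break
        memory[start:end] = bytes(end - start)


memory = bytearray(b"\xff" * 32)
table = build_range_table([(4, 8), (16, 20)])
zero_bss(memory, table)
assert memory[4:8] == bytes(4) and memory[8:16] == b"\xff" * 8
```

The same walk works for `__init_array_table`, with the loop body calling each function pointer in the range instead of zeroing it.
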
### 4. Build pipeline + tests
- `link816 --segment-cap N` emits multiple `(image, base, syms)`
triples plus inter-segment reloc records.
- New intermediate format between linker and `omfEmit`: a small
manifest file listing each segment's body, base, name, and reloc
records. Easier than passing all that on the CLI.
- `omfEmit` reads the manifest and emits a single multi-segment OMF
file with proper INTERSEG opcodes.
- Smoke needs a new test: build a program with `--segment-cap 8192` so
it forces ≥2 segments even for our small benches; verify under MAME via
a GS/OS-loader-aware test path. (We don't have GS/OS-loaded tests
today; see "Risks" below.)
## Phased implementation
### Phase 1: linker emits per-segment images + manifest
- `link816 --segment-cap N --manifest manifest.json -o out`
- Pack `.text` greedy into segments 1..K capped at N bytes each.
- All `.rodata` into segment K+1 (the "data segment").
- All `.bss` into segment K+2.
- Resolve intra-segment relocations.
- Write inter-segment relocations into the manifest.
- Emit one flat binary per segment; manifest references them by path.
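
One possible shape for the manifest, sketched in Python. The manifest schema is not defined anywhere yet, so every field name here (`num`, `image`, `site_seg`, and so on) and the `make_manifest` helper are hypothetical.

```python
import json


def make_manifest(segments, relocs):
    """Build the linker-to-omfEmit manifest as a JSON string.

    segments: list of dicts with name, kind, image (path), base, size.
    relocs:   list of (site_seg, site_offset, width,
                       target_seg, target_offset) tuples.
    """
    known = range(1, len(segments) + 1)   # segment numbers are 1-based
    entries = [{"num": num, **seg}
               for num, seg in enumerate(segments, start=1)]
    records = []
    for site_seg, site_off, width, target_seg, target_off in relocs:
        if site_seg not in known or target_seg not in known:
            raise ValueError("reloc refers to an unknown segment")
        records.append({"site_seg": site_seg, "site_offset": site_off,
                        "width": width, "target_seg": target_seg,
                        "target_offset": target_off})
    return json.dumps({"segments": entries, "relocs": records}, indent=2)


text = make_manifest(
    [{"name": "MAIN", "kind": "code", "image": "seg1.bin", "base": 0, "size": 512},
     {"name": "CODE", "kind": "code", "image": "seg2.bin", "base": 0, "size": 4096}],
    [(1, 0x11, 3, 2, 0x0200)],   # a JSL operand in MAIN targeting CODE+$0200
)
print(text)
```

Keeping the reloc list flat, rather than nested per segment, makes Phase 1 output easy to diff and inspect before any OMF work exists.
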
### Phase 2: omfEmit consumes manifest, emits multi-segment OMF
- One OMF segment header per manifest entry.
- LCONST (`F2`) opcodes for body bytes; DS (`F1`) only reserves
zero-filled space.
- INTERSEG (`E3`) opcodes for inter-segment reloc patch sites.
- RELOC (`E2`) opcodes for intra-segment relocations that need
load-time fixup (JSL targets within the same segment, whose bank is
not known until load).
- END opcode terminator per segment.
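
A sketch of how the two relocation records could be serialized, in Python. The field layouts follow my reading of the OMF load-file format (RELOC = `$E2`, 11 bytes; INTERSEG = `$E3`, 15 bytes) and should be checked against Apple's OMF specification before use. The function names and the default file number of 1 are assumptions.

```python
import struct

RELOC = 0xE2      # intra-segment load-time fixup
INTERSEG = 0xE3   # fixup that targets another segment


def encode_reloc(nbytes, shift, patch_offset, value):
    """opcode, byte count, signed bit shift, patch offset, relative value."""
    return struct.pack("<BBbII", RELOC, nbytes, shift, patch_offset, value)


def encode_interseg(nbytes, shift, patch_offset, seg_num, target_offset,
                    file_num=1):
    """Same leading fields, then file number, segment number, target offset."""
    return struct.pack("<BBbIHHI", INTERSEG, nbytes, shift, patch_offset,
                       file_num, seg_num, target_offset)


# A JSL operand at offset 1, targeting offset $1234 in segment 2.
record = encode_interseg(3, 0, 1, 2, 0x1234)
assert len(record) == 15 and record[0] == INTERSEG
assert len(encode_reloc(3, 0, 1, 0x40)) == 11
```

For a `JSL`, `nbytes` is 3 (the 24-bit operand) and `shift` is 0. Non-zero shifts only matter for operands that take the bank byte or high word of an address.
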
### Phase 3: runtime updates
- Linker emits `__bss_table[]` and `__init_array_table[]` instead of
single `__bss_start`/`__bss_end` symbols.
- crt0 walks those tables.
- `crt0.s` makes the LC-enable step configurable instead of hardcoded,
so it can be skipped when segment 1 isn't in bank 0.
### Phase 4: tests + smoke
- Bench harness builds with `--segment-cap 8192` to force multi-segment
even for small programs; verify output size growth (should be small —
just OMF headers + reloc records overhead).
- Need a GS/OS-aware MAME test path (boot a ProDOS volume with our OMF
binary, let GS/OS Loader load it, check markers in bank 2). This is
the test we deferred earlier in the GS/OS smoke task. **Phase 4
reopens the GS/OS-volume smoke decision** — multi-segment is the
main reason to even care about that.
## Scope estimate
- Phase 1: 2-3 sessions (linker rework, careful with reloc accounting)
- Phase 2: 1 session (mostly OMF format work, well-specified)
- Phase 3: 1 session (crt0 + linker symbol table changes)
- Phase 4: 2-3 sessions (GS/OS-loaded test infra is the slog, not the
multi-segment logic itself)

Total: ~6-8 focused sessions. Phases 1-3 deliver a working
multi-segment binary; Phase 4 makes it testable in CI.
## Risks
- **DBR management is genuinely tricky.** Code in segment 2 (bank 2)
doing `lda foo` where foo is in segment K+1 (bank 0): the absolute
fetch uses DBR. If DBR != bank-of-foo, we read garbage. The cleanest
rule (DBR=0 always; data refs use long via `__attribute__((far))` or
a backend pass that promotes them) requires backend cooperation
we don't have. v1's "all data in one segment in DBR's bank" works
but constrains data size to ~60KB.
- **The Loader's behaviour around segment placement is poorly
documented.** Apple's GS/OS Loader picks banks dynamically; we may
end up with code segments in banks the loader chose, with relocations
that work, but layouts that surprise us. Mitigation: use STATIC
segments (KIND bit) initially so the loader can't move them.
- **Smoke needs a real GS/OS volume image.** This is the same blocker
as the deferred GS/OS file I/O smoke — needs a 2img/po image with
ProDOS volume + a way to run our OMF through the actual loader.
Without that, multi-segment logic is testable only by inspection of
the OMF bytes and a hand-rolled mini-loader (which we'd have to
write).
## Recommendation
Start Phase 1. The linker work is contained, mostly mechanical, and
the manifest format gives us a clean handoff to `omfEmit` work in
Phase 2. We can validate Phase 1 by inspecting the per-segment images
and manifest before any OMF / loader work.

Phase 4's GS/OS-volume test path is the biggest unknown. Reasonable to
defer that decision until Phases 1-3 are working; at that point we
can decide whether to invest in proper GS/OS-loaded smoke or accept
"multi-segment OMF emits valid bytes per the spec" as the test bar.