Multi-segment OMF support — plan
Why
Single-segment cap: ~60KB usable in bank 0 after the IO window ($C000-$CFFF), the stack at $0FFF, and crt0 / runtime overhead. Real IIgs applications need hundreds of KB across multiple banks. The GS/OS Loader is designed for this: it loads each segment into its chosen bank, fixes up inter-segment references at load time, and jumps to the entry segment.
Today
- link816 produces a flat binary covering [--text-base, ...] in a single bank-0 image. All sections are concatenated into one address space. Inter-section relocations are resolved at link time.
- omfEmit wraps that flat binary in a single OMF segment (KIND=CODE, ORG=0, SEGNUM=1, body = one DS opcode + END). No relocation records emitted (image is already absolute).
- crt0 enables LC RAM, zeroes BSS, runs .init_array, calls main.
- All cross-function calls already use JSL (3-byte long); we never emit JSR. That's accidentally helpful for multi-segment.
Target
A program that builds 4 segments — say:
- Segment 1 ("MAIN"): crt0 + main + a few hot routines, in bank 1
- Segment 2 ("CODE"): bulk of code, in bank 2
- Segment 3 ("DATA"): rodata, in bank 3
- Segment 4 ("BSS"): uninitialized data + heap, in bank 4
GS/OS Loader places each segment, applies inter-segment relocations
(every JSL foo where foo lives in a different segment becomes a
JSL <segment-relative-addr> patched at load time with the absolute
address), and jumps to the entry.
The four hard problems
1. Section → segment assignment policy
We need a deterministic rule that maps every input object's .text /
.rodata / .bss / .init_array section into a specific segment.
Three options:
A. Per-object → one segment. Each .o becomes one segment. Simple
mental model; bad locality (many tiny segments, lots of inter-segment
JSLs); GS/OS Loader has 8KB+ minimum overhead per segment.
B. Greedy bin-packing. Compute total code size; cap each segment at
N bytes (e.g. 32KB to leave headroom); pack .text sections into
segments greedily in input order. Same for .rodata / .bss.
Predictable, but a function near the end of segment N might want to
JSL a function at the start of segment N+1 — common pattern, every
call becomes inter-segment.
C. Static call graph + clustering. Compute call graph from the relocations, cluster co-calling functions together, pack clusters into segments to minimize inter-segment edges. Best locality, real linker work.
Recommendation: B for v1. Add a --segment-cap option (default
32768). Real applications will want C eventually, but B unblocks
"my program is bigger than 64KB" today.
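A minimal sketch of option B, assuming each input section is known only by a name and a size. The `Section` and `Segment` types, the `pack_greedy` helper, and the `a.o:.text` naming are hypothetical illustrations, not existing link816 code. Sections are packed in input order, and a new segment is opened whenever the next section would exceed the cap.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    name: str  # hypothetical naming, e.g. "a.o:.text"
    size: int  # size in bytes


@dataclass
class Segment:
    number: int
    sections: list = field(default_factory=list)
    used: int = 0


def pack_greedy(sections, cap=32768):
    """Pack sections into segments in input order, never exceeding cap."""
    segments = [Segment(number=1)]
    for sec in sections:
        if sec.size > cap:
            # A single section larger than the cap can never be placed.
            raise ValueError(f"{sec.name} ({sec.size} bytes) exceeds cap {cap}")
        cur = segments[-1]
        if cur.used + sec.size > cap:
            # Current segment is full: open the next one.
            cur = Segment(number=cur.number + 1)
            segments.append(cur)
        cur.sections.append(sec)
        cur.used += sec.size
    return segments


text = [
    Section("a.o:.text", 20000),
    Section("b.o:.text", 10000),
    Section("c.o:.text", 5000),
    Section("d.o:.text", 30000),
]
segs = pack_greedy(text)
for seg in segs:
    print(seg.number, seg.used, [s.name for s in seg.sections])
```

With these sizes the packer produces three segments. It never goes back to fill earlier gaps, which keeps the layout predictable but can waste space at the end of each segment.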
2. Inter-segment relocation tracking
When a JSL foo reloc resolves to a function in a different segment,
we MUST emit an OMF relocation record instead of patching the bytes
in-place. Currently link816 patches everything at link time and emits
zero reloc records.
The reloc model becomes per-segment:
- Intra-segment IMM16 / PCREL: patch at link time, no OMF record.
- Intra-segment IMM24 (JSL): patch the low 24 bits with the segment-relative offset at link time, and also emit an OMF RELOC record so the loader can adjust the address once the segment is placed (we don't know the load bank at link time).
- Inter-segment IMM24 (cross-bank JSL): emit INTERSEG opcode (E3) pointing at (target_segment_num, offset_within_segment).
- Inter-segment IMM16 data ref: requires the data segment to land in the same bank as the referencing code, OR we need the loader to fail (16-bit absolute can't cross banks). In v1, force all data refs to be to a "data segment" that's in a fixed bank, OR rewrite to long addressing.
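The per-segment model above can be sketched as a small decision function. The `Action` enum, the `classify` helper, and the string relocation kinds are hypothetical names chosen for illustration. Treating a cross-segment PC-relative reference as an error is an assumption, since the plan only discusses the intra-segment case.

```python
from enum import Enum


class Action(Enum):
    PATCH = "patch at link time, no OMF record"
    RELOC = "patch segment-relative offset and emit RELOC"
    INTERSEG = "emit INTERSEG record"
    NEEDS_POLICY = "16-bit ref crosses segments: needs same-bank rule or long addressing"
    ERROR = "PC-relative ref crosses segments"


def classify(kind, src_seg, dst_seg):
    """Decide how the linker handles one relocation.

    kind is "IMM16", "IMM24" or "PCREL"; src_seg and dst_seg are segment numbers.
    """
    same_segment = src_seg == dst_seg
    if kind == "IMM24":
        # The load bank is unknown at link time, so even intra-segment
        # 24-bit targets need a loader fixup.
        return Action.RELOC if same_segment else Action.INTERSEG
    if kind == "IMM16":
        return Action.PATCH if same_segment else Action.NEEDS_POLICY
    if kind == "PCREL":
        # Assumption: branches cannot leave their segment.
        return Action.PATCH if same_segment else Action.ERROR
    raise ValueError(f"unknown relocation kind: {kind!r}")


for kind, src, dst in [("IMM24", 1, 1), ("IMM24", 1, 2), ("IMM16", 2, 3)]:
    print(kind, src, dst, "->", classify(kind, src, dst).name)
```

The `NEEDS_POLICY` result is deliberately not resolved here; which of the responses to the IMM16 problem applies is a separate decision.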
The IMM16 cross-segment problem is the killer. Three responses:
i. Punt: Disallow it. All .rodata references must be in the
same segment as the code, OR refs to global data must use long
addressing (rewrite at compile time via __attribute__((far))).
ii. Promote to long at link time: Detect IMM16 cross-segment
refs, rewrite the instruction's encoding from absolute (3-byte)
to absolute-long (4-byte). Changes code size, shifts everything
after the patched site — invasive.
iii. Same-bank constraint: Ensure the data segment's bank ==
the code's DBR. Means all code segments share one DBR, all data
lives in one segment in that DBR's bank.
Recommendation: iii for v1. All .rodata lives in one segment
in the bank our code uses for DBR. We already pin DBR to bank 0 in
crt0 (well, code does pha;plb for bank 2 sometimes for tests, but
not in general). For v1, all .rodata goes in bank 0 alongside the
first text segment, and code segments in higher banks reference data
via long absolute addressing. Need to confirm what addressing modes
our backend actually emits for global access.
3. crt0 / loader contract
Current crt0 assumes flat layout:
    __start:
        setup CPU mode, stack
        enable LC RAM
        zero BSS [__bss_start..__bss_end]
        run .init_array
        jsl main
        spin
Multi-segment changes:
- BSS may span multiple segments (bank 0 LC + bank N segment). The __bss_start/__bss_end symbols need to be per-segment, OR a loop over a list of (start, end) pairs the linker emits.
- .init_array ditto.
- LC RAM enable only applies to bank 0, which is fine.
- The OMF Loader will handle the actual memory placement; crt0 just runs after Loader is done.
- The Loader's entry call lands at the segment marked with the entry field. By convention that's segment 1.
Decision: Designate segment 1 as "init segment" containing crt0 +
its required symbols (__bss_start_seg1, __init_array_start_seg1,
etc.) and the linker emits a __bss_table and __init_array_table —
arrays of (start, end) pointers walked by crt0. Same idea Mac OS X's
loader uses for multi-segment programs.
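To make the table idea concrete, here is a small Python model of both sides: the linker serializing (start, end) pairs and crt0 walking them to zero BSS. The layout used here is an assumption for illustration only: 32-bit little-endian words, half-open ranges, and a (0, 0) terminator pair. The names `build_range_table` and `zero_bss` are hypothetical.

```python
import struct


def build_range_table(ranges):
    """Linker side: serialize (start, end) pairs, terminated by (0, 0).

    Each address is stored as a 32-bit little-endian word (assumed layout).
    """
    out = bytearray()
    for start, end in ranges:
        if not 0 <= start <= end <= 0xFFFFFF:
            raise ValueError(f"bad range {start:#x}..{end:#x}")
        out += struct.pack("<II", start, end)
    out += struct.pack("<II", 0, 0)
    return bytes(out)


def zero_bss(memory, table):
    """crt0 side: walk the table and zero each [start, end) range."""
    pos = 0
    while True:
        start, end = struct.unpack_from("<II", table, pos)
        if start == 0 and end == 0:
            break
        memory[start:end] = bytes(end - start)
        pos += 8


# Model five banks of memory filled with 0xFF so zeroing is visible.
memory = bytearray(b"\xff" * 0x50000)
table = build_range_table([(0x00E000, 0x00E100), (0x040000, 0x040200)])
zero_bss(memory, table)
print(memory[0x00E0FF], memory[0x00E100], memory[0x0401FF], memory[0x040200])
```

The same walker shape would serve __init_array_table, with the loop body calling each function pointer in the range instead of clearing bytes.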
4. Build pipeline + tests
- link816 --segment-cap N emits multiple (image, base, syms) triples plus inter-segment reloc records.
- New intermediate format between linker and omfEmit: a small manifest file listing each segment's body, base, name, and reloc records. Easier than passing all that on the CLI.
- omfEmit reads the manifest and emits a single multi-segment OMF file with proper INTERSEG opcodes.
- Smoke needs new test: build a program with --segment-cap 8192 so it forces ≥2 segments even for our small benches; verify under MAME via a GS/OS-loader-aware test path. (We don't have GS/OS-loaded tests today; see "Risks" below.)
Phased implementation
Phase 1: linker emits per-segment images + manifest
- link816 --segment-cap N --manifest manifest.json -o out
- Pack .text greedily into segments 1..K capped at N bytes each.
- All .rodata into segment K+1 (the "data segment").
- All .bss into segment K+2.
- Resolve intra-segment relocations.
- Write inter-segment relocations into the manifest.
- Emit one flat binary per segment; manifest references them by path.
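One possible shape for the manifest, sketched in Python with a sanity check that could run before omfEmit consumes it. Every field name here (`num`, `image`, `relocs`, `target_seg`, and so on) and the `out.segN.bin` file names are assumptions; the plan only fixes that the manifest lists each segment's body, base, name, and reloc records.

```python
import json

# Hypothetical manifest: two code segments, one cross-segment JSL.
manifest = {
    "segments": [
        {
            "num": 1, "name": "MAIN", "image": "out.seg1.bin", "size": 0x1F40,
            "relocs": [
                {"offset": 0x0124, "size": 3, "target_seg": 2, "target_offset": 0x0040}
            ],
        },
        {"num": 2, "name": "CODE", "image": "out.seg2.bin", "size": 0x2000, "relocs": []},
    ]
}


def validate(manifest):
    """Return a list of problems; an empty list means the manifest is consistent."""
    sizes = {seg["num"]: seg["size"] for seg in manifest["segments"]}
    errors = []
    for seg in manifest["segments"]:
        for r in seg["relocs"]:
            if r["offset"] + r["size"] > seg["size"]:
                errors.append(f"seg {seg['num']}: patch site {r['offset']:#x} past end")
            if r["target_seg"] not in sizes:
                errors.append(f"seg {seg['num']}: unknown target segment {r['target_seg']}")
            elif r["target_offset"] >= sizes[r["target_seg"]]:
                errors.append(f"seg {seg['num']}: target offset {r['target_offset']:#x} past end")
    return errors


text = json.dumps(manifest, indent=2)  # what the linker would write
problems = validate(json.loads(text))  # what omfEmit would check on read
print(problems)
```

Checking patch sites and targets against segment sizes also gives Phase 1 a cheap validation step before any OMF or loader work exists.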
Phase 2: omfEmit consumes manifest, emits multi-segment OMF
- One OMF segment header per manifest entry.
- DS opcodes for body bytes.
- INTERSEG (E3) opcodes for inter-segment reloc patch sites.
- RELOC (E2) opcodes for intra-segment relocations that need load-time fixup (JSL targets within same segment but different bank than expected).
- END opcode terminator per segment.
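A hedged sketch of encoding the two relocation records, based on my reading of the OMF record layouts in Apple's GS/OS Reference (RELOC = $E2, INTERSEG = $E3). Verify the field order and widths against the spec before relying on it. The function names are hypothetical, and using file number 1 is an assumption.

```python
import struct

RELOC = 0xE2     # intra-segment relocation record
INTERSEG = 0xE3  # inter-segment relocation record


def encode_reloc(patch_offset, value, size=3, shift=0):
    """RELOC: opcode, byte count, bit shift, patch offset, segment-relative value."""
    return struct.pack("<BBbII", RELOC, size, shift, patch_offset, value)


def encode_interseg(patch_offset, target_seg, target_offset,
                    size=3, shift=0, file_num=1):
    """INTERSEG: opcode, byte count, bit shift, patch offset,
    file number, target segment number, offset within the target segment."""
    return struct.pack("<BBbIHHI", INTERSEG, size, shift,
                       patch_offset, file_num, target_seg, target_offset)


# A JSL whose 3-byte operand starts at offset $0124 in this segment and
# targets offset $0040 in segment 2.
rec = encode_interseg(0x0124, target_seg=2, target_offset=0x0040)
print(len(rec), rec.hex())
```

The patch offset points at the operand bytes rather than at the JSL opcode itself, so the manifest should record operand offsets or omfEmit should add one when converting.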
Phase 3: runtime updates
- Linker emits __bss_table[] and __init_array_table[] instead of single __bss_start/__bss_end symbols.
- crt0 walks those tables.
- crt0.s removes the LC-enable hardcoding from segment 1 if segment 1 isn't bank 0 (configurable).
Phase 4: tests + smoke
- Bench harness builds with --segment-cap 8192 to force multi-segment even for small programs; verify output size growth (should be small: just OMF headers + reloc records overhead).
- Need a GS/OS-aware MAME test path (boot a ProDOS volume with our OMF binary, let GS/OS Loader load it, check markers in bank 2). This is the test we deferred earlier in the GS/OS smoke task. Phase 4 reopens the GS/OS-volume smoke decision; multi-segment is the main reason to even care about that.
Scope estimate
- Phase 1: 2-3 sessions (linker rework, careful with reloc accounting)
- Phase 2: 1 session (mostly OMF format work, well-specified)
- Phase 3: 1 session (crt0 + linker symbol table changes)
- Phase 4: 2-3 sessions (GS/OS-loaded test infra is the slog, not the multi-segment logic itself)
Total: ~6-8 focused sessions. Phases 1-3 deliver a working multi-segment binary; phase 4 makes it testable in CI.
Risks
- DBR management is genuinely tricky. Code in segment 2 (bank 2) doing lda foo where foo is in segment K+1 (bank 0): the absolute fetch uses DBR. If DBR != bank-of-foo, we read garbage. The cleanest rule (DBR=0 always; data refs use long via __attribute__((far)) or a backend pass that promotes them) requires backend cooperation we don't have. v1's "all data in one segment in DBR's bank" works but constrains data size to ~60KB.
- The Loader's behaviour around segment placement is poorly documented. Apple's GS/OS Loader picks banks dynamically; we may end up with code segments in banks the loader chose, with relocations that work, but layouts that surprise us. Mitigation: use STATIC segments (KIND bit) initially so the loader can't move them.
- Smoke needs a real GS/OS volume image. This is the same blocker as the deferred GS/OS file I/O smoke — needs a 2img/po image with ProDOS volume + a way to run our OMF through the actual loader. Without that, multi-segment logic is testable only by inspection of the OMF bytes and a hand-rolled mini-loader (which we'd have to write).
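The DBR risk in the first item comes down to simple address arithmetic. This sketch models only a plain 16-bit absolute data access with no indexing; `effective_address` is a hypothetical helper and the address of foo is made up for illustration.

```python
def effective_address(dbr, addr16):
    """24-bit address used by a 16-bit absolute data access: DBR supplies the bank."""
    return ((dbr & 0xFF) << 16) | (addr16 & 0xFFFF)


foo = 0x001234  # hypothetical: foo lives in bank 0

# lda foo with DBR = 0 reads the intended byte.
good = effective_address(0x00, foo & 0xFFFF)
# The same instruction with DBR = 2 reads bank 2 instead.
wrong = effective_address(0x02, foo & 0xFFFF)

print(f"{good:06x} {wrong:06x}")
```

The instruction bytes are identical in both cases, which is why the linker cannot catch this on its own; only a fixed DBR convention or long addressing removes the dependency.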
Recommendation
Start Phase 1. The linker work is contained, mostly mechanical, and
the manifest format gives us a clean handoff to omfEmit work in
Phase 2. We can validate Phase 1 by inspecting the per-segment images
+ manifest before any OMF / loader work.
Phase 4's GS/OS-volume test path is the biggest unknown. Reasonable to defer that decision until Phases 1-3 are working — at that point we can decide whether to invest in proper GS/OS-loaded smoke or accept "multi-segment OMF emits valid bytes per the spec" as the test bar.