
feat(x86_64): boot Asterinas as zone1 via Multiboot2, with virtio-blk/net/console #322

Open
yydawx wants to merge 4 commits into syswonder:dev-asterinas from yydawx:ccf-asterinas

Conversation

yydawx commented Jun 3, 2026

Summary
Adds Multiboot2 protocol support to boot Asterinas OS as a zone1 guest, using a minimal ASM bootloader. Minimal changes to core code — all x86-specific logic stays under arch/x86_64/.

Changes
Multiboot2 Boot Support (Commit 1: feat)

New mb2_boot.S bootloader: 16→32-bit transition with GDT + TSS setup, jumps to kernel entry
Loaded via boot_filepath in zone1 config, with GPA→HPA offset translation in hvisor-tool
ELF segment loading with kernel_entry_gpa passed to bootloader via ESI
multiboot_info_paddr/multiboot_enabled added to HvArchZoneConfig (x86-specific)
Multiboot path gated behind multiboot_enabled flag — Linux zone1 paths unaffected
Removed unused print_memory_map
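
As a rough illustration of the GPA→HPA offset translation mentioned above, the sketch below looks up the region containing a guest-physical address and applies that region's offset. The `Region` type, the `gpa_to_hpa` function, and the addresses are hypothetical and are not taken from hvisor-tool (which is written in C); this only shows the idea.

```rust
/// Hypothetical description of one guest memory region (not hvisor-tool's real type).
#[derive(Clone, Copy)]
struct Region {
    gpa_start: u64,
    hpa_start: u64,
    size: u64,
}

/// Translate a guest-physical address to a host-physical address by finding
/// the region that contains it and applying that region's offset.
/// Returns None when no region covers the address.
fn gpa_to_hpa(regions: &[Region], gpa: u64) -> Option<u64> {
    regions
        .iter()
        .find(|r| gpa >= r.gpa_start && gpa - r.gpa_start < r.size)
        .map(|r| r.hpa_start + (gpa - r.gpa_start))
}

fn main() {
    // Illustrative values only; a real mapping would come from the zone config.
    let regions = [Region {
        gpa_start: 0x2000_0000,
        hpa_start: 0x1_0000_0000,
        size: 0x1000,
    }];
    println!("{:#x?}", gpa_to_hpa(&regions, 0x2000_0010));
}
```

Returning `Option` lets the loader reject an ELF segment or Multiboot2 info address that falls outside every mapped region instead of writing through a bogus pointer.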
Exception Handling (Commit 2: fix)

S2PT (EPT) violation handler via MMIO dispatch
GS_BASE/FS_BASE MSR read/write support for 64-bit guests
x2APIC MSR fallback for unrecognized registers in x2APIC range
TSC frequency reporting via CPUID
Virtio Robustness (Commit 3: fix)

NULL guard for VIRTIO_BRIDGE.res_agent(): returns gracefully instead of panicking
Struct & Config Fixes (Commit 4: feat)

Added v_bus/v_device/v_function to HvPciDevConfig to match C side (fixes 128-byte zone_config size mismatch)
Bumped CONFIG_MAGIC_VERSION to 0x7 on both C and Rust sides
Zone0 memory layout and virtio config adjustments for zone1 coexistence
Example zone1 config: zone1-asterinas.json
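
A size mismatch between the C and Rust definitions of a shared config struct, like the one fixed above, can be caught at build time. The sketch below is a minimal example under assumptions: only `v_bus`, `v_device`, and `v_function` come from this PR, while the other field names, their types, and the expected sizes are invented for illustration.

```rust
use core::mem::size_of;

/// Hypothetical "before" layout: only the physical BDF fields.
#[allow(dead_code)]
#[repr(C)]
struct PciDevConfigOld {
    domain: u8,
    bus: u8,
    device: u8,
    function: u8,
    dev_type: u32,
}

/// Hypothetical "after" layout with the virtual BDF fields added.
#[allow(dead_code)]
#[repr(C)]
struct PciDevConfigNew {
    domain: u8,
    bus: u8,
    device: u8,
    function: u8,
    v_bus: u8,
    v_device: u8,
    v_function: u8,
    dev_type: u32,
}

// Fail the build if the Rust layout drifts from the size the C side expects.
// In a real tree the expected value would come from the C header.
const _: () = assert!(size_of::<PciDevConfigNew>() == 12);

fn main() {
    println!("old = {} bytes", size_of::<PciDevConfigOld>());
    println!("new = {} bytes", size_of::<PciDevConfigNew>());
}
```

With a check like this in place, a field added on one side only breaks compilation rather than silently shifting every later field of the enclosing config.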
Requires
yydawx/hvisor-tool#98 — Multiboot2 loading with GPA→HPA offset translation

github-actions bot added the x86_64 and feature (New feature or request) labels Jun 3, 2026
yydawx commented Jun 3, 2026
Author

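Because verifying this is quite tedious, I am providing an agent-generated guide. Feel free to reach out if you have any questions:

Running Asterinas on hvisor (x86_64 QEMU)

Overview

This document describes how to boot the Asterinas kernel as a zone1 virtual machine on hvisor via the Multiboot2 protocol.

Tested versions:

  • hvisor: upstream d3260d0 (v0.4 release baseline) + ccf-asterinas patches

  • hvisor-tool: upstream b45971a + ccf-asterinas patches

  • Asterinas: OSDK 0.17.2, SMP=2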

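Asterinas kernel requirements

Asterinas kernel build parameters:

make kernel SMP=2 BENCHMARK=sysbench/cpu_lat ENABLE_REGRESSION_TEST=true

Where:

  • SMP=2: matches the number of CPUs assigned to zone1

  • BENCHMARK=sysbench/cpu_lat: packages the benchmark tools into the initramfs

  • ENABLE_REGRESSION_TEST=true: packages the regression tests into the initramfs

Note: Asterinas does not need our kernel modifications to boot under hvisor. The optional iface/init.rs patch is only needed to make the virtio-net eth0 interface appear; without it, the console, block device I/O, benchmarks, and regression tests all run normally.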

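hvisor-side changes (ccf-asterinas branch)

4 commits in total, based on upstream d3260d0:

  1. Multiboot2 boot protocol support: sets up 32-bit protected-mode guest state for Multiboot2-compliant kernels. Adds ELF segment loading, Multiboot2 info structure construction, CPUID leaf 0x15 support, and a fix for the EFER LME/LMA bits.

  2. Exception handling and interrupt routing improvements: an EOI stall watchdog (drops a stuck interrupt after 5000 attempts), IRQ deduplication (the same vector is not injected twice), forwarding of device interrupts that arrive on non-zone0 CPUs back to zone0, and INS/OUTS instruction support.

  3. EPT PCI mappings and virtio robustness: an ECAM passthrough mapping so the guest can access PCI configuration space, PCI MMIO window mapping, DMA region mapping, the virtio bridge returning an Option instead of panicking, and skipping the IOMMU for zones without PCI devices.

  4. Example configuration files: virtio-asterinas-example.json and zone1-asterinas-example.json, located in platform/x86_64/qemu/configs/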

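hvisor-tool-side changes (ccf-asterinas branch)

4 commits in total, based on upstream b45971a:

  1. Virtio device fixes: removed the bounds check on queue_sel writes (so the guest can iterate over the queues), added a bounds check to QUEUE_NUM_MAX reads that returns 0, added a NULL pointer check after GPA translation, and initialized the blk_size field.

  2. Terminal newline fix: \n to \r\n conversion in virtio-console TX handling, plus completed error handling for PTY initialization.

  3. Multiboot2 zone loading: ELF segment parsing, Multiboot2 info structure construction, and zone_config struct extension.

  4. Example configuration files: same as hvisor commit 4.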
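
The newline fix in item 2 amounts to expanding each bare line feed into a carriage return plus line feed before the bytes reach the terminal. The actual change lives in hvisor-tool's C code; the Rust function below (`lf_to_crlf`, a name made up here) is only a sketch of that conversion, and it assumes existing `\r\n` pairs should be left alone.

```rust
/// Expand each bare `\n` into `\r\n`, leaving existing `\r\n` pairs untouched.
fn lf_to_crlf(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut prev = 0u8;
    for &b in input {
        // Only insert a carriage return when the previous byte was not one.
        if b == b'\n' && prev != b'\r' {
            out.push(b'\r');
        }
        out.push(b);
        prev = b;
    }
    out
}

fn main() {
    let converted = lf_to_crlf(b"line one\nline two\n");
    println!("{:?}", String::from_utf8_lossy(&converted));
}
```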

Host environment

QEMU:      qemu-system-x86_64 with KVM acceleration
Machine:   q35, kernel-irqchip=split
CPU:       host,+x2apic,+invtsc,+vmx
Memory:    12 GB (adjustable; ≥ 8 GB recommended)
IOMMU:     Intel VT-d (intel-iommu,caching-mode=on,device-iotlb=on)
Disk:      virtio-blk-pci, attached to PCIe bus 1
Network:   user-mode NIC

Memory layout

Example 8 GB layout for Asterinas zone1 (non-contiguous EPT regions that skip the ECAM hole):

GPA 0x00000000-0x1ff00000  (511 MB)     low RAM
GPA 0x1ff00000-0x20000000  (1 MB)       ACPI tables
GPA 0x20000000-0xb0000000  (2.25 GB)    mid RAM (up to the ECAM hole)
GPA 0x100000000-0x150000000 (1.25 GB)   high RAM
GPA 0x150000000-0x250000000 (4 GB)      extended RAM
GPA 0xfeb00000-0xfeb02000               virtio MMIO region

About 8 GB in total, spread across 5 RAM regions in the EPT.
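
The arithmetic behind this layout can be checked mechanically. The sketch below sums the five RAM regions listed above and confirms that none of them overlaps the ECAM hole; the hole bounds (0xb0000000, size 0x200000) are taken from the zone.rs hunk quoted later in the review, and the function names are made up for illustration.

```rust
const GIB: u64 = 1 << 30;

/// (start GPA, end GPA, label) for the RAM regions listed above.
const RAM_REGIONS: [(u64, u64, &str); 5] = [
    (0x0000_0000, 0x1ff0_0000, "low RAM"),
    (0x1ff0_0000, 0x2000_0000, "ACPI tables"),
    (0x2000_0000, 0xb000_0000, "mid RAM"),
    (0x1_0000_0000, 0x1_5000_0000, "high RAM"),
    (0x1_5000_0000, 0x2_5000_0000, "extended RAM"),
];

/// Sum of the region sizes in bytes.
fn total_bytes(regions: &[(u64, u64, &str)]) -> u64 {
    regions.iter().map(|&(start, end, _)| end - start).sum()
}

/// True if no region reaches into [hole_start, hole_end).
fn avoids_hole(regions: &[(u64, u64, &str)], hole_start: u64, hole_end: u64) -> bool {
    regions
        .iter()
        .all(|&(start, end, _)| end <= hole_start || start >= hole_end)
}

fn main() {
    println!("total = {} GiB", total_bytes(&RAM_REGIONS) / GIB);
    println!(
        "avoids ECAM hole: {}",
        avoids_hole(&RAM_REGIONS, 0xb000_0000, 0xb020_0000)
    );
}
```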

Build and deployment

Full build

1. Build Asterinas (requires a Docker container):

docker exec syswand-build bash -c 'cd /root/syswand_asterinas/asterinas && make kernel SMP=2 BENCHMARK=sysbench/cpu_lat ENABLE_REGRESSION_TEST=true'

2. Build hvisor:

cd /home/yyda/workspace/syswand_asterinas/hvisor
make clean && make ARCH=x86_64 BOARD=qemu LOG=off

3. Build the hvisor daemon:

cd /home/yyda/workspace/syswand_asterinas/hvisor-tool
make all ARCH=x86_64 LOG=LOG_INFO KDIR=/home/yyda/workspace/syswand_asterinas/linux

4. Deploy to the rootfs:

cd /home/yyda/workspace/syswand_asterinas
sudo mount rootfs1.img -t ext4 /mnt
sudo rm -f /mnt/hvisor /mnt/hvisor.ko
sudo cp asterinas/target/osdk/iso_root/boot/aster-kernel-osdk-bin /mnt/
sudo cp asterinas/target/osdk/iso_root/boot/initramfs.cpio.gz /mnt/
sudo cp zone1-asterinas.json /mnt/
sudo cp virtio_cfg.json /mnt/
sudo cp hvisor-tool/output/hvisor /mnt/
sudo cp hvisor-tool/output/hvisor.ko /mnt/
sudo umount /mnt
sudo cp rootfs1.img ./hvisor/platform/x86_64/qemu/image/virtdisk/

Quick rebuild (hvisor only)

cd hvisor && make clean && make ARCH=x86_64 BOARD=qemu LOG=off

Quick rebuild (daemon only)

# Run from the workspace root; the subshell keeps the later relative paths valid.
(cd hvisor-tool && make all ARCH=x86_64 LOG=LOG_INFO KDIR=/path/to/linux)
sudo mount rootfs1.img -t ext4 /mnt
sudo rm -f /mnt/hvisor /mnt/hvisor.ko
sudo cp hvisor-tool/output/hvisor /mnt/
sudo cp hvisor-tool/output/hvisor.ko /mnt/
sudo umount /mnt
sudo rm -f ./hvisor/platform/x86_64/qemu/image/virtdisk/rootfs1.img
sudo cp rootfs1.img ./hvisor/platform/x86_64/qemu/image/virtdisk/
sudo chown $(whoami):$(whoami) ./hvisor/platform/x86_64/qemu/image/virtdisk/rootfs1.img

Running

Start hvisor

cd /home/yyda/workspace/syswand_asterinas/hvisor
make ARCH=x86_64 BOARD=qemu run LOG=off

LOG=off disables hvisor log output and keeps the terminal clean.

Start the services and zone1 from zone0 (root Linux)

# 1. (Optional) Create a TAP device for virtio-net
ip tuntap add tap0 mode tap
ip link set tap0 up
ip addr add 192.168.100.1/24 dev tap0

# 2. Start the virtio daemon

nohup ./hvisor virtio start virtio_cfg.json > /daemon.log 2>&1 &
sleep 2

# 3. Start Asterinas zone1

./hvisor zone start ./zone1-asterinas.json

Operating inside zone1 (Asterinas)

# List the filesystem
ls

# Run the regression tests

/test/run_regression_test.sh

# Run the benchmarks

mkdir -p /ext2
mount -t ext2 /dev/vda /ext2
sh /benchmark/run_all.sh

# Write to the persistent disk (mounted at /ext2 above)

echo "hello" > /ext2/test.txt
cat /ext2/test.txt

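Configuration files

zone1-asterinas.json

Defines zone1's memory regions, CPUs, kernel path, initramfs, and Multiboot2 parameters. Key fields:

Field                   Description
memory_regions          GPA→HPA mappings, covering the RAM regions and the virtio MMIO region
multiboot_enabled       Must be set to true
multiboot_info_paddr    GPA of the Multiboot2 info structure
kernel_cmdline          Command-line arguments passed to the Asterinas kernel
initramfs_filepath      Path to the initramfs file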

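Known issues

  1. pty_blocking test deadlock: this test has been skipped in the device regression tests.

  2. pivot_root errno differences: in two pivot_root edge cases, the Asterinas VFS returns a different errno than Linux (EBUSY vs EINVAL, ENOENT vs EINVAL).

  3. Missing cgroup filesystem: process regression tests that need /sys/fs/cgroup fail because cgroupfs is not mounted.

  4. /proc/self/exe unavailable: the setup phase of the memory and security regression tests fails because Asterinas procfs does not implement the /proc/self/exe symlink.

  5. fio slab allocation failure: when zone1 has less than 8 GB of memory, fio tests may trigger an "Allocating a slot from a full slab" panic. Giving zone1 at least 8 GB is recommended.

  6. QEMU memory must cover all zone1 HPAs: if QEMU's -m is smaller than zone1's highest HPA, an EPT violation (#GP storm) is triggered. Make sure -m ≥ highest HPA + largest region size.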

yydawx commented Jun 3, 2026
Author

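I have found that configuring virtio for Asterinas currently requires modifying the Asterinas source code. Is that acceptable, or do we need to find a better approach?
It seems Asterinas compiles a kernel with the corresponding device information baked in from some hard-coded settings, so changing the virtio setup always means modifying part of the Asterinas source.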

@caodg caodg requested a review from Solicey June 3, 2026 21:32
Solicey commented Jun 6, 2026
Collaborator

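> I have found that configuring virtio for Asterinas currently requires modifying the Asterinas source code. Is that acceptable, or do we need to find a better approach? It seems Asterinas compiles a kernel with the corresponding device information baked in from some hard-coded settings, so changing the virtio setup always means modifying part of the Asterinas source.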

I also encountered this problem when configuring virtio, and I think it is acceptable to make a few changes to Asterinas.

Comment thread src/device/irqchip/pic/ioapic.rs Outdated
let zone = this_zone_arc.read();
// The guest IOAPIC RTE may route to a CPU outside this zone.
// If so, redirect to the zone's first CPU so the interrupt
// reaches the correct guest.

Collaborator

I added a CPU redirect fix in VirtIoApic::write() (lines 136-142) in my last commit, so you can remove the redundant fix here.

Comment thread src/arch/x86_64/trap.rs Outdated

/// Walk guest page tables for virtual address `vaddr` using CR3 as the PML4 base.
/// Prints the full page table hierarchy for debugging.
fn walk_guest_page_table(vaddr: usize, cr3_gpa: usize) {

Collaborator

You could reuse the gva_to_gpa() function in mmio.rs for the page walk.

Comment thread src/config.rs Outdated
pub name: [u8; CONFIG_NAME_MAXLEN],
// Multiboot support (NEW)
pub multiboot_info_paddr: u64,
pub multiboot_enabled: u32,

Collaborator

Consider putting multiboot_info_paddr and multiboot_enabled inside arch_config, since they are x86-specific settings and should not be shared with other architectures.

Comment thread src/hypercall/mod.rs Outdated
Comment on lines +150 to +165
Some(zone_arc) => {
let target_cpu = get_target_cpu(irq_id as _, target_zone as _);
// Verify target_cpu belongs to target_zone.
// The guest IOAPIC may route IRQs to an APIC ID that now
// belongs to a different zone, which would cause the IRQ
// to be injected into the wrong guest.
let zone = zone_arc.read();
if zone.cpu_set.bitmap & (1u64 << target_cpu) != 0 {
target_cpu
} else {
trace!("virtio: IRQ {} for zone {} routed to CPU {} outside zone, falling back to CPU {}",
irq_id, target_zone, target_cpu,
zone.cpu_set.first_cpu().unwrap());
zone.cpu_set.first_cpu().unwrap()
}
}

Collaborator

This is another redundant IOAPIC redirect fix that should be removed. Also, we should avoid adding arch-specific content to code and files shared by all architectures.

Author

The IOAPIC redirect in VirtIoApic::write() only takes effect when the guest actively reconfigures IOAPIC entries, but the initial RTE state is inherited from zone0 on zone1 startup and may point to CPUs outside zone1. Without the fallback in handle_hvc_finish_req, virtio IRQs are delivered to the wrong guest. Tested: removing this breaks virtio console input.
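
The fallback under discussion can be reduced to a small bitmap check. The sketch below uses a hypothetical `CpuSetSketch` type rather than hvisor's real `CpuSet`, and it assumes a 64-bit CPU bitmap as in the quoted hunk. Unlike that hunk, it guards against CPU numbers of 64 or more and against an empty set instead of shifting out of range or calling `unwrap()`.

```rust
/// Hypothetical stand-in for a zone's CPU set; hvisor's real CpuSet may differ.
struct CpuSetSketch {
    bitmap: u64,
}

impl CpuSetSketch {
    /// True if `cpu` is a member of this set.
    fn contains(&self, cpu: u32) -> bool {
        cpu < 64 && self.bitmap & (1u64 << cpu) != 0
    }

    /// Lowest-numbered CPU in the set, if any.
    fn first_cpu(&self) -> Option<u32> {
        if self.bitmap == 0 {
            None
        } else {
            Some(self.bitmap.trailing_zeros())
        }
    }
}

/// Keep the routed CPU when it belongs to the zone, otherwise fall back
/// to the zone's first CPU.
fn pick_target_cpu(set: &CpuSetSketch, routed_cpu: u32) -> Option<u32> {
    if set.contains(routed_cpu) {
        Some(routed_cpu)
    } else {
        set.first_cpu()
    }
}

fn main() {
    let zone1 = CpuSetSketch { bitmap: 0b1100 }; // CPUs 2 and 3
    println!("{:?}", pick_target_cpu(&zone1, 0));
}
```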

Comment thread src/arch/x86_64/zone.rs Outdated
Comment on lines +221 to +318
// Map PCI ECAM region into guest EPT so the guest can access PCI config space.
// The MCFG table tells the guest ECAM is at HPA 0xb0000000, which becomes
// GPA 0xb0000000 when copied into guest ACPI tables. Without this mapping,
// any PCI config access causes an EPT violation.
if self.id == 0 {
// Zone0: full ECAM identity-map
let ecam_base = 0xb000_0000usize;
let ecam_size = 0x20_0000usize;
self.gpm.insert(MemoryRegion::new_with_offset_mapper(
ecam_base as GuestPhysAddr,
ecam_base as HostPhysAddr,
ecam_size,
MemFlags::READ | MemFlags::WRITE,
))?;
} else {
// Non-root zone: identity-map ECAM except for the page containing
// the virtio-blk device (01:00.0). That page gets an MMIO handler
// that returns 0xffffffff for the vendor-ID read, hiding the device
// so the guest never tries to access its BAR and corrupt IOMMU state.
let ecam_base = 0xb000_0000usize;
let virtio_blk_ecam_gpa = ecam_base + 0x10_0000; // bus 1, dev 0, func 0
let ecam_page = 0x1000usize;

// ECAM before virtio-blk page: 0xb0000000..0xb0100000
if virtio_blk_ecam_gpa > ecam_base {
self.gpm.insert(MemoryRegion::new_with_offset_mapper(
ecam_base as GuestPhysAddr,
ecam_base as HostPhysAddr,
virtio_blk_ecam_gpa - ecam_base,
MemFlags::READ | MemFlags::WRITE,
))?;
}
// Virtio-blk ECAM page: MMIO-handler that hides the device
self.mmio_region_register(
virtio_blk_ecam_gpa,
ecam_page,
ecam_virtio_blk_hide_handler,
virtio_blk_ecam_gpa,
);
// ECAM after virtio-blk page: 0xb0101000..0xb0200000
let after_gpa = virtio_blk_ecam_gpa + ecam_page;
let ecam_end = ecam_base + 0x20_0000usize;
if after_gpa < ecam_end {
self.gpm.insert(MemoryRegion::new_with_offset_mapper(
after_gpa as GuestPhysAddr,
after_gpa as HostPhysAddr,
ecam_end - after_gpa,
MemFlags::READ | MemFlags::WRITE,
))?;
}
}

// Map PCI 32-bit MMIO window so the guest can access PCI device BARs.
let pci_mmio_base = 0xC000_0000usize;
let pci_mmio_size = 0x3EB0_0000usize; // up to 0xFEB00000
self.gpm.insert(MemoryRegion::new_with_offset_mapper(
pci_mmio_base as GuestPhysAddr,
pci_mmio_base as HostPhysAddr,
pci_mmio_size,
MemFlags::READ | MemFlags::WRITE,
))?;

// Continue PCI MMIO after the virtio MMIO hole
let pci_mmio2_base = 0xFEB0_1000usize;
let pci_mmio2_size = 0xFF000usize; // ~1MB, up to IOAPIC at 0xFEC00000
self.gpm.insert(MemoryRegion::new_with_offset_mapper(
pci_mmio2_base as GuestPhysAddr,
pci_mmio2_base as HostPhysAddr,
pci_mmio2_size,
MemFlags::READ | MemFlags::WRITE,
))?;

// Map 64-bit PCI BAR window for non-root zones (zone0 maps it below too,
// but via the RAM regions which cover all of HPA).
let pci_bar64_base = 0x8_0000_0000usize;
let pci_bar64_size = 0x1000_0000usize;
self.gpm.insert(MemoryRegion::new_with_offset_mapper(
pci_bar64_base as GuestPhysAddr,
pci_bar64_base as HostPhysAddr,
pci_bar64_size,
MemFlags::READ | MemFlags::WRITE,
))?;

// Map DMA memory region for non-root zones: cover the guest's I/O memory
// allocator low range (0x20000000..0xB0000000, i.e. up to ECAM) using
// HPA 0x1_0000_0000 (4GB, reserved for zone1 by zone0).
if self.id != 0 {
let dma_gpa_base = 0x2000_0000usize;
let dma_hpa_base = 0x1_0000_0000usize;
let dma_size = 0x9000_0000usize; // 2.25GB, up to ECAM at 0xB0000000
self.gpm.try_insert(MemoryRegion::new_with_offset_mapper(
dma_gpa_base as GuestPhysAddr,
dma_hpa_base as HostPhysAddr,
dma_size,
MemFlags::READ | MemFlags::WRITE,
));
}

Collaborator

Hard-coding configuration values in code is not a good idea. Also, we have already added a memory mapping for PCI config space; take a look at the code under pci.
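
One way to avoid literals such as `ecam_base + 0x10_0000` is to derive the offset from the bus/device/function triple. In the standard PCIe ECAM layout each function owns a 4 KiB configuration page, so the offset is `bus << 20 | device << 15 | function << 12`. The sketch below applies that formula; `ecam_offset` is a name made up here and `ECAM_BASE` repeats the value from the quoted hunk, which a real implementation would read from the platform configuration instead.

```rust
/// ECAM base as it appears in the quoted hunk; a real implementation would
/// take this from the MCFG table or the zone config rather than a constant.
const ECAM_BASE: u64 = 0xb000_0000;

/// Offset of a function's 4 KiB config page inside the ECAM window.
/// Returns None for out-of-range device or function numbers.
fn ecam_offset(bus: u8, dev: u8, func: u8) -> Option<u64> {
    if dev >= 32 || func >= 8 {
        return None;
    }
    Some(((bus as u64) << 20) | ((dev as u64) << 15) | ((func as u64) << 12))
}

fn main() {
    // 01:00.0 is the virtio-blk device referenced in the hunk above.
    if let Some(off) = ecam_offset(1, 0, 0) {
        println!("01:00.0 config page at {:#x}", ECAM_BASE + off);
    }
}
```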

Comment thread src/arch/x86_64/trap.rs Outdated
Comment on lines +100 to +118
}
IdtVector::I8042_KEYBOARD_VECTOR => {}
IdtVector::APIC_SPURIOUS_VECTOR | IdtVector::APIC_ERROR_VECTOR => {}
_ => {
if vector >= 0x20 && this_cpu_data().arch_cpu.power_on {
inject_vector(this_cpu_id(), vector, None, false);
IdtVector::APIC_SPURIOUS_VECTOR
| IdtVector::APIC_ERROR_VECTOR => {}
// programmed the LAPIC. They belong to the CURRENT zone,
// not zone0. Device interrupts (0x20-0xdf) always belong to
// zone0 and must be forwarded if they arrive on a non-zone0 CPU.
// Check if this is a LAPIC-local interrupt.
// The guest's timer vector is dynamically allocated and may be < 0xe0,
// so we also check against the tracked LAPIC timer vector.
let is_lapic_local = vector >= 0xe0
|| vector == this_cpu_data().arch_cpu.virt_lapic.virt_timer_vector as u8;
if zone_id == 0 || is_lapic_local {
inject_vector(cpu_id, vector, None, false);
} else {
// Forward device interrupt to zone0.
let zone0 = crate::zone::find_zone(0).unwrap();
let zone0_cpu = zone0.read().cpu_set.first_cpu().unwrap_or(0);
inject_vector(zone0_cpu, vector, None, false);
}

Collaborator

Non-root zones should also be able to receive vectors injected by real hardware. Sometimes we may let zone1 use real devices instead of virtio devices.

@Solicey Solicey left a comment

Collaborator

I suggest making as few changes as possible to get Asterinas booting. You can take a look at my previous commit to see what has already been fixed, so that you do not need to add redundant fixes in your PR.

Comment thread src/device/irqchip/pic/lapic.rs Outdated
Comment on lines +76 to +95
IA32_X2APIC_APICID => {
// info!("apicid: {:x}", this_cpu_id());
Ok(this_apic_id() as u64)
}
IA32_X2APIC_LDR => Ok(this_apic_id() as u64), // logical apic id
IA32_X2APIC_APICID => Ok(this_apic_id() as u64),
IA32_X2APIC_VERSION => Ok(0x1415), // version 0x14, max LVT entry 0x15
IA32_X2APIC_LDR => Ok(this_apic_id() as u64),
IA32_X2APIC_SIVR => Ok(self.virt_svr as u64),
IA32_X2APIC_ISR0 | IA32_X2APIC_ISR1 | IA32_X2APIC_ISR2 | IA32_X2APIC_ISR3
| IA32_X2APIC_ISR4 | IA32_X2APIC_ISR5 | IA32_X2APIC_ISR6 | IA32_X2APIC_ISR7 => {
// info!("isr!");
Ok(0)
}
| IA32_X2APIC_ISR4 | IA32_X2APIC_ISR5 | IA32_X2APIC_ISR6 | IA32_X2APIC_ISR7 => Ok(0),
IA32_X2APIC_IRR0 | IA32_X2APIC_IRR1 | IA32_X2APIC_IRR2 | IA32_X2APIC_IRR3
| IA32_X2APIC_IRR4 | IA32_X2APIC_IRR5 | IA32_X2APIC_IRR6 | IA32_X2APIC_IRR7 => {
// info!("irr!");
Ok(0)
}
IA32_X2APIC_LVT_TIMER => Ok(self.virt_lvt_timer_bits as _),
_ => hv_result_err!(ENOSYS),
| IA32_X2APIC_IRR4 | IA32_X2APIC_IRR5 | IA32_X2APIC_IRR6 | IA32_X2APIC_IRR7 => Ok(0),
IA32_X2APIC_ESR => Ok(0),
IA32_X2APIC_LVT_TIMER => Ok(self.virt_lvt_timer_bits as u64),
IA32_X2APIC_LVT_THERMAL | IA32_X2APIC_LVT_PMI | IA32_X2APIC_LVT_LINT0
| IA32_X2APIC_LVT_LINT1 | IA32_X2APIC_LVT_ERROR => Ok(1 << 16), // masked
IA32_X2APIC_INIT_COUNT => Ok(0),
IA32_X2APIC_CUR_COUNT => Ok(0),
IA32_X2APIC_DIV_CONF => Ok(0),
IA32_TSC_DEADLINE => Ok(0),
_ => Ok(0), // safe default for unknown MSRs

Collaborator

Could you explain why we need to add more x2APIC handlers?
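
For context on the catch-all arm in the hunk above: the x2APIC registers occupy the MSR range 0x800 to 0x8FF, so a fallback can be limited to that range instead of returning 0 for every unknown MSR. The sketch below shows one possible shape of that check. The function names and the string error are hypothetical; hvisor's real handler returns its own result type.

```rust
/// x2APIC registers are exposed as MSRs in this range.
const X2APIC_MSR_BASE: u32 = 0x800;
const X2APIC_MSR_END: u32 = 0x8ff;

/// True if `msr` falls inside the x2APIC MSR range.
fn is_x2apic_msr(msr: u32) -> bool {
    (X2APIC_MSR_BASE..=X2APIC_MSR_END).contains(&msr)
}

/// Fallback for MSRs without a dedicated handler: read as zero inside the
/// x2APIC range, report an error everywhere else.
fn read_unhandled_msr(msr: u32) -> Result<u64, &'static str> {
    if is_x2apic_msr(msr) {
        Ok(0)
    } else {
        Err("unsupported MSR")
    }
}

fn main() {
    println!("{:?}", read_unhandled_msr(0x802)); // IA32_X2APIC_APICID
    println!("{:?}", read_unhandled_msr(0x6e0)); // IA32_TSC_DEADLINE, outside the range
}
```

Keeping the error path for MSRs outside the range means a guest access to a truly unsupported register still surfaces instead of being masked as a zero read.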

Comment thread src/device/irqchip/pic/ioapic.rs Outdated
Comment on lines +233 to +264
/// When a non-root zone starts on a set of CPUs, ensure critical physical
/// interrupts (UART, etc.) are not routed to those CPUs. If they are, re-route
/// them to CPU 0 which stays in the root zone. Without this, zone0 can become
/// unresponsive because physical interrupts get injected into a guest that has
/// no handler for them.
pub fn ioapic_reroute_from_cpus(cpu_set: &crate::cpu_data::CpuSet) {
// Critical IRQs that the root zone needs for interactive console.
const CRITICAL_IRQS: &[u8] = &[irqs::UART_COM1_IRQ];

let mut io_apic = IO_APIC.lock();
for &irq in CRITICAL_IRQS {
// table_entry returns RedirectionTableEntry, transmute to u64 for
// bit-field manipulation.
let entry = unsafe { io_apic.table_entry(irq) };
let raw: u64 = unsafe { core::mem::transmute(entry) };
let dest_apic_id = raw.get_bits(56..=63) as usize;
let dest_cpu = get_cpu_id(dest_apic_id);
if cpu_set.bitmap & (1u64 << dest_cpu) != 0 {
// Re-route to CPU 0 which is always in the root zone.
let cpu0_apic_id = get_apic_id(0) as u64;
let mut new_raw = raw;
new_raw.set_bits(56..=63, cpu0_apic_id);
let new_entry = unsafe { core::mem::transmute(new_raw) };
unsafe { io_apic.set_table_entry(irq, new_entry) };
warn!(
"ioapic: rerouted IRQ {} from CPU {} (APIC {:#x}) to CPU 0 (APIC {:#x})",
irq, dest_cpu, dest_apic_id, cpu0_apic_id
);
}
}
}

Collaborator

I don't think it is necessary to handle IOAPIC rerouting here. As mentioned earlier, this issue was fixed in my last commit. You can build your own modifications on top of my fix, but please avoid fixing the same problem with redundant code.

Comment thread src/arch/x86_64/zone.rs Outdated
Comment on lines +120 to +139
// Use kernel's existing GDT at GPA 0x80014f0
// The kernel's GDT has:
// - Selector 0x08: 64-bit code segment
// - Selector 0x10: Data segment
// - Selector 0x18: 32-bit code segment
//
// We need a TSS for VMX. Put it at GPA 0x8048000 (below stack at 0x804a000)
// DO NOT use 0x8009000 - that's kernel boot code!
let tss_gpa = 0x8048000usize;
if let Ok((tss_hpa, _, _)) = unsafe { self.gpm.page_table_query(tss_gpa) } {
let tss_ptr = tss_hpa as *mut u8;
unsafe {
// Zero out 104 bytes (32-bit TSS size)
for i in 0..104 {
core::ptr::write_volatile(tss_ptr.add(i), 0);
}
}
info!("[ZONE{}] TSS written to GPA {:#x} (HPA {:#x})", zone_id, tss_gpa, tss_hpa);
} else {
warn!("[ZONE{}] Failed to write TSS: GPA {:#x} not mapped", zone_id, tss_gpa);

Collaborator

Could you explain why we are adding a TSS entry to the GDT for Multiboot?

Author

Intel SDM Vol 3C §26.3.1.2 requires the TR selector to point to a valid TSS descriptor on VM entry. The multiboot2 guest starts in 32-bit protected mode and only sets up its own TSS after transitioning to 64-bit mode. A minimal blank TSS at a safe GPA is needed to satisfy the VMX check during the early boot window.

Collaborator

If it is necessary to place a TSS entry in the GDT, you can take a look at platform/x86_64/qemu/image/bootloader/boot.S, where we set up the GDT the first time we enter the guest OS.

Collaborator

Besides, hard-coding numeric configuration values is not a good idea, because it hurts readability and makes future maintenance more difficult. It may be better to move them to a dedicated config so the code stays cleaner and easier to manage.
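
To make the TSS discussion in this thread concrete, the sketch below packs a legacy 8-byte GDT descriptor from named constants rather than inline literals. The bit layout is the standard x86 segment-descriptor format. The constant names are hypothetical, and the TSS address is the illustrative value from the hunk quoted at the top of the thread, not necessarily what the final patch uses.

```rust
/// Illustrative placement, taken from the hunk at the top of this thread.
const TSS_GPA: u32 = 0x0804_8000;
/// A 32-bit TSS is 104 bytes, so its limit is size - 1.
const TSS_LIMIT: u32 = 104 - 1;
/// Access byte: present, DPL 0, system segment, type 9 (available 32-bit TSS).
const ACCESS_TSS32_AVAILABLE: u8 = 0x89;

/// Pack base, limit, access byte, and flags nibble into an 8-byte descriptor.
fn gdt_descriptor(base: u32, limit: u32, access: u8, flags: u8) -> u64 {
    let base = base as u64;
    let limit = limit as u64;
    (limit & 0xffff)                      // limit[15:0]
        | ((base & 0xff_ffff) << 16)      // base[23:0]
        | ((access as u64) << 40)         // access byte
        | (((limit >> 16) & 0xf) << 48)   // limit[19:16]
        | (((flags as u64) & 0xf) << 52)  // flags (G, D/B, L, AVL)
        | (((base >> 24) & 0xff) << 56)   // base[31:24]
}

fn main() {
    let tss = gdt_descriptor(TSS_GPA, TSS_LIMIT, ACCESS_TSS32_AVAILABLE, 0);
    println!("TSS descriptor = {:#018x}", tss);
}
```

The VMCS TR access-rights value in the later cpu.rs hunk (0x008B) uses type 11, a busy 32-bit TSS, while a descriptor that has not yet been loaded with LTR typically carries type 9 as above.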

Comment thread src/device/irqchip/pic/mod.rs Outdated
yydawx commented Jun 6, 2026
Author

Most of the redundant code was added to work around problems I hit while booting, but some of it may indeed be unnecessary. I will find out which parts are useless. Thanks for your review!

@yydawx yydawx force-pushed the ccf-asterinas branch 2 times, most recently from 138a5b2 to a589ab3 Compare June 8, 2026 03:39
@yydawx yydawx marked this pull request as draft June 8, 2026 03:39
@yydawx yydawx force-pushed the ccf-asterinas branch 3 times, most recently from 0cb3ac4 to dd456e1 Compare June 8, 2026 08:03
@yydawx yydawx requested a review from Solicey June 8, 2026 08:04
yydawx commented Jun 8, 2026
Author

@Solicey Hi! I removed most of the redundant code and the debug output/comments. I also avoided hard-coding.
It should be better now.

@yydawx yydawx marked this pull request as ready for review June 8, 2026 10:45
Comment thread src/arch/x86_64/boot.rs Outdated
Comment on lines +526 to +541
@@ -536,7 +538,7 @@ pub fn print_memory_map() {

/// copy kernel modules to the right place
pub fn module_init(info_addr: usize) {
-    println!("module_init");
+    info!("module_init");

Collaborator

We haven't initialized the logger at this point, so we'd better keep using println.

Comment thread src/arch/x86_64/cpu.rs Outdated
Comment on lines 595 to 693
fn setup_multiboot_guest_state(&mut self, entry: GuestPhysAddr) -> HvResult {
let cr0_fixed0 = Msr::IA32_VMX_CR0_FIXED0.read();
let cr0_fixed1 = Msr::IA32_VMX_CR0_FIXED1.read();
let mut cr0_guest = Cr0Flags::PROTECTED_MODE_ENABLE.bits();
let cr0_fixed1_excluding_pe_pg =
cr0_fixed1 | Cr0Flags::PAGING.bits() | Cr0Flags::PROTECTED_MODE_ENABLE.bits();
let cr0_fixed0_excluding_pe_pg =
cr0_fixed0 & !(Cr0Flags::PAGING.bits() | Cr0Flags::PROTECTED_MODE_ENABLE.bits());
cr0_guest = (cr0_guest | cr0_fixed0_excluding_pe_pg) & cr0_fixed1_excluding_pe_pg;

let cr4_fixed0 = Msr::IA32_VMX_CR4_FIXED0.read();
let cr4_fixed1 = Msr::IA32_VMX_CR4_FIXED1.read();
let cr4_guest = (cr4_fixed0 & cr4_fixed1) as usize;

VmcsGuestNW::CR0.write(cr0_guest as usize)?;
VmcsControlNW::CR0_READ_SHADOW.write(cr0_guest as usize)?;
let cr0_mask = Cr0Flags::CACHE_DISABLE.bits()
| Cr0Flags::NOT_WRITE_THROUGH.bits()
| Cr0Flags::NUMERIC_ERROR.bits()
| Cr0Flags::EXTENSION_TYPE.bits();
VmcsControlNW::CR0_GUEST_HOST_MASK.write(cr0_mask as usize)?;

VmcsGuestNW::CR3.write(0)?;

VmcsGuestNW::CR4.write(cr4_guest)?;
VmcsControlNW::CR4_READ_SHADOW.write(cr4_guest)?;
VmcsControlNW::CR4_GUEST_HOST_MASK
.write(Cr4Flags::VIRTUAL_MACHINE_EXTENSIONS.bits() as usize)?;

// CS: 32-bit code at selector 0x18 (kernel's GDT)
VmcsGuestNW::CS_BASE.write(0)?;
VmcsGuest32::CS_LIMIT.write(0xFFFFFFFF)?;
VmcsGuest16::CS_SELECTOR.write(0x18)?;
VmcsGuest32::CS_ACCESS_RIGHTS.write(0xC09B)?;

// DS, ES, SS: data at selector 0x10
VmcsGuestNW::DS_BASE.write(0)?;
VmcsGuest32::DS_LIMIT.write(0xFFFFFFFF)?;
VmcsGuest16::DS_SELECTOR.write(0x10)?;
VmcsGuest32::DS_ACCESS_RIGHTS.write(0xC093)?;
VmcsGuestNW::ES_BASE.write(0)?;
VmcsGuest32::ES_LIMIT.write(0xFFFFFFFF)?;
VmcsGuest16::ES_SELECTOR.write(0x10)?;
VmcsGuest32::ES_ACCESS_RIGHTS.write(0xC093)?;
VmcsGuestNW::SS_BASE.write(0)?;
VmcsGuest32::SS_LIMIT.write(0xFFFFFFFF)?;
VmcsGuest16::SS_SELECTOR.write(0x10)?;
VmcsGuest32::SS_ACCESS_RIGHTS.write(0xC093)?;

// FS, GS: unusable
VmcsGuestNW::FS_BASE.write(0)?;
VmcsGuest32::FS_LIMIT.write(0)?;
VmcsGuest16::FS_SELECTOR.write(0)?;
VmcsGuest32::FS_ACCESS_RIGHTS.write(0x10000)?;
VmcsGuestNW::GS_BASE.write(0)?;
VmcsGuest32::GS_LIMIT.write(0)?;
VmcsGuest16::GS_SELECTOR.write(0)?;
VmcsGuest32::GS_ACCESS_RIGHTS.write(0x10000)?;

// TR: TSS at MB2_TSS_GPA
VmcsGuestNW::TR_BASE.write(MB2_TSS_GPA)?;
VmcsGuest32::TR_LIMIT.write(MB2_TSS_SIZE as u32 - 1)?;
VmcsGuest16::TR_SELECTOR.write((MB2_GDT_TSS_ENTRY * 8) as u16)?;
VmcsGuest32::TR_ACCESS_RIGHTS.write(0x008B)?;

// LDTR: unusable
VmcsGuestNW::LDTR_BASE.write(0)?;
VmcsGuest32::LDTR_LIMIT.write(0)?;
VmcsGuest16::LDTR_SELECTOR.write(0)?;
VmcsGuest32::LDTR_ACCESS_RIGHTS.write(0x10000)?;

VmcsGuestNW::GDTR_BASE.write(MB2_GDT_BASE_GPA)?;
VmcsGuest32::GDTR_LIMIT.write(((MB2_GDT_TSS_ENTRY + 2) * 8 - 1) as u32)?;
VmcsGuestNW::IDTR_BASE.write(0)?;
VmcsGuest32::IDTR_LIMIT.write(0xffff)?;
VmcsGuest32::IDTR_LIMIT.write(0)?;

VmcsGuestNW::DR7.write(0x400)?;
VmcsGuestNW::RSP.write(rsp)?;
VmcsGuestNW::RSP.write(MB2_STACK_GPA)?;
VmcsGuestNW::RIP.write(entry)?;
VmcsGuestNW::RFLAGS.write(0x2)?;
VmcsGuestNW::PENDING_DBG_EXCEPTIONS.write(0)?;
VmcsGuestNW::IA32_SYSENTER_ESP.write(0)?;
VmcsGuestNW::IA32_SYSENTER_EIP.write(0)?;
VmcsGuest32::IA32_SYSENTER_CS.write(0)?;

VmcsGuest32::INTERRUPTIBILITY_STATE.write(0)?;
VmcsGuest32::ACTIVITY_STATE.write(0)?;
VmcsGuest32::VMX_PREEMPTION_TIMER_VALUE.write(0)?;

VmcsGuest64::LINK_PTR.write(u64::MAX)?; // SDM Vol. 3C, Section 24.4.2
VmcsGuest64::LINK_PTR.write(u64::MAX)?;
VmcsGuest64::IA32_DEBUGCTL.write(0)?;
VmcsGuest64::IA32_PAT.write(Msr::IA32_PAT.read())?;
VmcsGuest64::IA32_EFER.write(0)?;

// for AP start up, set CS_BASE to entry address, and RIP to 0.
if self.power_on && !this_cpu_data().boot_cpu {
VmcsGuestNW::RIP.write(0)?;
VmcsGuestNW::CS_BASE.write(entry)?;
}
info!(
"[MULTIBOOT] 32-bit guest: CR0={:#x}, CR4={:#x}, RIP={:#x}, GDT={:#x}",
cr0_guest, cr4_guest, entry, MB2_GDT_BASE_GPA
);

Ok(())
}

Collaborator

I suggest putting this part of the code in an ASM file. You can find a reference in platform/x86_64/qemu/image/bootloader/boot.S. By setting boot_filepath and boot_load_paddr in arch_config, you can load this boot code into memory when booting zone1, and by setting entry_point you can point the guest entry at it.

@Solicey Solicey Jun 9, 2026

Collaborator

By doing this, we can make as few changes to our main code as possible while introducing Multiboot2.
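
For readers following the CR0 computation at the top of the quoted function: VMX requires every bit set in IA32_VMX_CR0_FIXED0 to be 1 in the guest CR0 and every bit clear in IA32_VMX_CR0_FIXED1 to be 0, and the hunk exempts PE and PG from both constraints (an exemption that applies when the unrestricted-guest control is enabled). The sketch below restates that logic as a pure function. The name `guest_cr0` is made up, and the MSR values in `main` are illustrative rather than read from hardware.

```rust
const CR0_PE: u64 = 1 << 0; // protected mode enable
const CR0_PG: u64 = 1 << 31; // paging

/// Apply the VMX fixed-bit constraints to a desired CR0 value, leaving the
/// guest free to choose PE and PG itself.
fn guest_cr0(wanted: u64, fixed0: u64, fixed1: u64) -> u64 {
    let must_be_one = fixed0 & !(CR0_PE | CR0_PG);
    let may_be_one = fixed1 | CR0_PE | CR0_PG;
    (wanted | must_be_one) & may_be_one
}

fn main() {
    // Illustrative MSR values: FIXED0 demands PG, NE, and PE; FIXED1 allows
    // all of the low 32 bits.
    let fixed0 = 0x8000_0021;
    let fixed1 = 0xffff_ffff;
    println!("{:#x}", guest_cr0(CR0_PE, fixed0, fixed1));
}
```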

Comment thread src/arch/x86_64/trap.rs Outdated
}

let res = match exit_info.exit_reason {
VmxExitReason::EXCEPTION_NMI => handle_exception(arch_cpu, &exit_info),

Collaborator

Could you explain why we are adding an NMI handler?

yydawx added 3 commits June 12, 2026 10:19
- Add mb2_boot.S bootloader for 16-bit to 32-bit mode transition
- Bootloader sets up GDT with TSS and jumps to kernel entry
- Pass kernel entry via ESI to bootloader on VM entry
- Add multiboot_info_paddr/multiboot_enabled to HvArchZoneConfig
- Remove unused print_memory_map
- Add v_bus/v_device/v_function to HvPciDevConfig
- Add S2PT violation handler via MMIO dispatch
- Add GS_BASE/FS_BASE MSR read/write support
- Add NULL guard for VIRTIO_BRIDGE res_agent
- Adjust zone0 memory layout for zone1 coexistence
- Update virtio configuration for multi-zone setup