On 19/06/23 20:35, Almudena Garcia wrote:
> But the code which starts the secondary cpus runs much later than the crash.
> So the crash could be caused by reading the ACPI tables, which are
> supposed to be in a certain memory region, defined by a physical address.
> phystokv doesn't fully solve the problem, because the lapic address is out
> of the range allowed by this function. Currently, we are using paging to map
> every ACPI table which we need to access (to get a virtual address for it).
> But the search for the initial ACPI address is based on a physical address range.
I was able to go a bit further with debugging, and the problem seems to
be a bit different: removing the 1:1 map exposed an issue that had been
hidden so far.
In my test the cpu is reset by a triple fault (you can see this by
enabling interrupt and cpu_reset logging with qemu, e.g. using -d
int,cpu_reset) which is triggered after the first call to splvm:
(gdb) bt
#0 splvm () at ../i386/i386/spl.S:122
#1 0xc1001da6 in pmap_enter (pmap=<optimized out>, v=<optimized out>,
pa=<optimized out>, prot=<optimized out>, wired=<optimized out>) at
../i386/intel/pmap.c:2171
#2 0xc1029b99 in pmap_steal_memory (size=<optimized out>) at
../vm/vm_resident.c:278
#3 0xc1029c48 in vm_page_bootstrap (startp=<optimized out>,
endp=<optimized out>) at ../vm/vm_resident.c:207
#4 0xc101b893 in vm_mem_bootstrap () at ../vm/vm_init.c:65
#5 0xc10161d1 in setup_main () at ../kern/startup.c:115
#6 0xc1004652 in c_boot_entry (bi=<optimized out>) at
../i386/i386at/model_dep.c:578
#7 0xc1000093 in iplt_done () at ../i386/i386at/boothdr.S:103
(gdb) si
124 cli
1: x/i $pc
=> 0xc100ac5d <splvm+5>: cli
(gdb)
125 CPU_NUMBER(%edx)
1: x/i $pc
=> 0xc100ac5e <splvm+6>: mov %cs:0xc109bc6c,%edx
(gdb)
0xc100ac65 125 CPU_NUMBER(%edx)
1: x/i $pc
=> 0xc100ac65 <splvm+13>: mov %cs:0x20(%edx),%edx
(gdb)
t_page_fault () at ../i386/i386/locore.S:435
435 pushl $(T_PAGE_FAULT) /* mark a page fault trap */
1: x/i $pc
=> 0xc100a42c <t_page_fault>: push $0xe
... and from here it re-enters t_page_fault recursively, because
trap_from_kernel contains another CPU_NUMBER. I guess the triple fault
is triggered because at some point the exception stack overflows.
With --enable-ncpu=2 it seems that CPU_NUMBER is defined as:
#define CPU_NUMBER(reg) \
movl %cs:lapic, reg ;\
movl %cs:APIC_ID(reg), reg ;\
shrl $24, reg ;\
and at this stage the lapic pointer is not yet initialized:
(gdb) p lapic
$4 = (volatile ApicLocalUnit *) 0x0
(gdb) x &lapic
0xc109bc6c <lapic>: 0x00000000
I guess so far this worked because the address 0 was mapped, and now it
isn't.
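
To make the failure mode concrete, here is a minimal user-space C sketch of what CPU_NUMBER computes. It is only an illustration under assumptions: FakeApicLocalUnit is a hypothetical stand-in for the real ApicLocalUnit (only the ID register at offset 0x20 is modelled, matching the 0x20(%edx) access in the disassembly above), and cpu_number_guarded() is a hypothetical helper, not something that exists in the tree or a fix proposed in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's ApicLocalUnit:
 * only the field used here is modelled. The APIC ID register is at
 * byte offset 0x20, matching the 0x20(%edx) access in the trace. */
typedef struct {
    uint8_t  reserved[0x20];
    uint32_t apic_id;           /* the ID is held in bits 31..24 */
} FakeApicLocalUnit;

/* Mirrors the state seen in gdb: the pointer is still NULL. */
static volatile FakeApicLocalUnit *lapic = NULL;

/* What CPU_NUMBER does, written in C. With lapic == NULL this reads
 * address 0x20, which faults once page 0 is no longer mapped. */
int cpu_number_unguarded(void)
{
    return (int)(lapic->apic_id >> 24);
}

/* Hypothetical guard (an assumption, not existing code): report 0
 * while lapic is not yet set up, instead of dereferencing NULL. */
int cpu_number_guarded(void)
{
    if (lapic == NULL)
        return 0;
    return (int)(lapic->apic_id >> 24);
}
```

The unguarded variant only "worked" before because the read from address 0x20 happened to hit mapped memory. Whether a guard like this is acceptable in the assembly fast path, or whether lapic should instead be initialized earlier, is exactly the open question.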
I'm not sure what the proper way to solve this would be. I tried
moving the call to machine_init() before vm_mem_bootstrap()
(to have lapic initialized), but this triggers another assert.
Any idea?
Luca