Hi,
It is the same hardware. The problem with macOS is the emulation; it looks
like the virtualization overhead is too large.

My hope is to get this working better with KVM. The problem with macOS is the
processor hardware: if the CPU is not supported by Apple, you have to hide
some details by modifying CPU flags (see previous mail) so the macOS kernel
thinks it is running on Intel hardware. It just refuses to boot otherwise.
KVM is not easily able to hide the real CPU identification.
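
For reference, the CPUID masking mentioned above looks roughly like this on
the VirtualBox side (only a sketch: the VM name "macOS" is a placeholder and
the leaf values are the commonly circulated "old Intel Core" ones, not
necessarily the exact ones I use):

# report an old Intel CPUID leaf 1 so the macOS kernel agrees to boot
VBoxManage modifyvm "macOS" --cpuid-set 00000001 000106e5 00100800 0098e3fd bfebfbff
# pretend to be a real Mac model in the EFI DMI tables
VBoxManage setextradata "macOS" "VBoxInternal/Devices/efi/0/Config/DmiSystemProduct" "iMac19,1"

With QEMU/KVM one can pass something like "-cpu Penryn,vendor=GenuineIntel",
but KVM still exposes more of the real host CPU to the guest - which is
exactly the problem described above.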

Uwe

On 18 March 2025 at 18:09:32 CET, Houston Putman <hous...@apache.org> wrote:
>Thanks for all the work here Uwe!
>
>I see that the OSX builds are a part of your TODOs, but with the new
>hardware, do you expect the OSX VM to be faster, or is the VM not living on
>the same hardware?
>We see a ton of OSX build failures because the "eventual consistency" in
>the tests doesn't expect the hardware to be quite as slow as the OSX VM
>is...
>
>- Houston
>
>On Tue, Mar 18, 2025 at 12:03 PM Uwe Schindler <u...@thetaphi.de> wrote:
>
>> P.S.: Fun fact: The old Policeman server's NVMe SSDs were long past
>> their rated lifetime - that was the main reason to replace it (the
>> failing network adapter just came earlier). Lucene did a good job of
>> burning through the SSDs. It is still interesting that Lucene/Solr's
>> tests write more than they read(!?!???!):
>>
>> # nvme smart-log /dev/nvme0
>> Smart Log for NVME device:nvme0 namespace-id:ffffffff
>> critical_warning                        : 0
>> temperature                             : 45 °C (318 K)
>> available_spare                         : 100%
>> available_spare_threshold               : 10%
>> percentage_used                         : 211%
>> endurance group critical warning summary: 0
>> Data Units Read                         : 317332814 (162.47 TB)
>> Data Units Written                      : 2383037910 (1.22 PB)
>> host_read_commands                      : 10452268853
>> host_write_commands                     : 57744004908
>> controller_busy_time                    : 73212
>> power_cycles                            : 9
>> power_on_hours                          : 45321
>> unsafe_shutdowns                        : 4
>> media_errors                            : 0
>> num_err_log_entries                     : 0
>> Warning Temperature Time                : 0
>> Critical Composite Temperature Time     : 0
>> Temperature Sensor 1                    : 45 °C (318 K)
>> Thermal Management T1 Trans Count       : 0
>> Thermal Management T2 Trans Count       : 0
>> Thermal Management T1 Total Time        : 0
>> Thermal Management T2 Total Time        : 0
>>
>> # nvme smart-log /dev/nvme1
>> Smart Log for NVME device:nvme1 namespace-id:ffffffff
>> critical_warning                        : 0
>> temperature                             : 42 °C (315 K)
>> available_spare                         : 100%
>> available_spare_threshold               : 10%
>> percentage_used                         : 217%
>> endurance group critical warning summary: 0
>> Data Units Read                         : 152984082 (78.33 TB)
>> Data Units Written                      : 2385237910 (1.22 PB)
>> host_read_commands                      : 1870329041
>> host_write_commands                     : 57743490085
>> controller_busy_time                    : 62644
>> power_cycles                            : 9
>> power_on_hours                          : 45321
>> unsafe_shutdowns                        : 4
>> media_errors                            : 0
>> num_err_log_entries                     : 0
>> Warning Temperature Time                : 0
>> Critical Composite Temperature Time     : 0
>> Temperature Sensor 1                    : 42 °C (315 K)
>> Thermal Management T1 Trans Count       : 0
>> Thermal Management T2 Trans Count       : 0
>> Thermal Management T1 Total Time        : 0
>> Thermal Management T2 Total Time        : 0
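>>
>> In case anyone wonders how the TB/PB figures are derived: per the NVMe
>> spec a "data unit" is 1000 512-byte blocks, i.e. 512,000 bytes. A quick
>> check for the writes on nvme0 (just the arithmetic, done with bc):
>>
>> $ echo '2383037910 * 512000 / 10^15' | bc -l
>> 1.22011540992000000000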
>>
>> Uwe
>>
>> On 18.03.2025 at 17:52, Uwe Schindler wrote:
>> >
>> > Moin moin,
>> >
>> > Policeman Jenkins got new hardware yesterday - no functional changes.
>> >
>> > Background: The old server had some strange problems with the network
>> > adapter (Intel's "igb" kernel driver) reporting "Detected Tx Unit
>> > Hang". This caused some short downtimes, and the monitoring complained
>> > all the time about lost pings, which drove me crazy at the weekend. It
>> > worked better after a restart and also with a kernel downgrade, but as
>> > I was about to replace the machine with a newer one anyway, I ordered
>> > a replacement with the new hardware version (previously it was a
>> > Hetzner AX51-NVME; now it is a Hetzner AX52).
>> >
>> > The migration started yesterday at lunchtime in Europe (12:00 CET):
>> > both servers were booted from the network into the datacenter's
>> > recovery environment with temporary IPs, then both root disks were
>> > mounted and a large rsync was run (with checksums, extended
>> > attributes, numeric uid/gid, and the delete option - roughly as
>> > sketched below). Luckily this worked with the old server (the Intel
>> > adapter did not break). The whole downtime should have taken only 1 to
>> > 1.5 hours (the time the copy over 1 GBit/s and the reconfiguration
>> > need), but unfortunately PCI Express on the new server complained
>> > about (recoverable) errors on the NVMe communication. After some
>> > support round trips (they first replaced only the failing NVMe
>> > controller, which did not help), they replaced the whole server.
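>> >
>> > For reference, the rsync invocation was roughly of this shape (only a
>> > sketch - the mount points and the target host are placeholders, not
>> > the exact command):
>> >
>> > # -a archive, -A ACLs, -X extended attributes, -H hard links;
>> > # compare by checksum, keep numeric uid/gid, delete extraneous files
>> > rsync -aAXH --checksum --numeric-ids --delete \
>> >     /mnt/oldroot/ root@<temporary-ip>:/mnt/newroot/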
>> >
>> > At 18:30 CET, I started the copy to the new server again and all went
>> > well; dmesg showed no PCI Express checksum errors. Finally, after
>> > fixing the boot setup (the old server used MBR, the new one EFI), the
>> > server was mounted at the original location by the team, and all IPv4
>> > addresses and the IPv6 network were available. Since then (approx.
>> > 20:30 CET), Policeman Jenkins has been back up and running.
>> >
>> > The TODOs for the future:
>> >
>> >   * Replace the macOS VM and update it to a new version (it's
>> >     complicated, as it is a "Hackintosh", so according to Apple it
>> >     shouldn't exist at all)
>> >   * Possibly migrate away from VirtualBox to KVM, but it's unclear
>> >     whether Hackintoshes work there.
>> >
>> > Have fun with the new hardware; the builds on the Lucene main branch
>> > are now 1.5 times faster (10 instead of 15 minutes).
>> >
>> > The new hardware is described here:
>> > https://www.hetzner.com/dedicated-rootserver/ax52/; it has AVX-512...
>> > let's see what comes of it. No test failures yet.
>> >
>> > vendor_id       : AuthenticAMD
>> > cpu family      : 25
>> > model           : 97
>> > model name      : AMD Ryzen 7 7700 8-Core Processor
>> > stepping        : 2
>> > microcode       : 0xa601209
>> > cpu MHz         : 5114.082
>> > cache size      : 1024 KB
>> > physical id     : 0
>> > siblings        : 16
>> > core id         : 7
>> > cpu cores       : 8
>> > apicid          : 15
>> > initial apicid  : 15
>> > fpu             : yes
>> > fpu_exception   : yes
>> > cpuid level     : 16
>> > wp              : yes
>> > flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> > mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
>> > fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl
>> > xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq
>> > monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c
>> > rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
>> > 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb
>> > bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba
>> > perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2
>> > smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap
>> > avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt
>> > xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total
>> > cqm_mbm_local user_shstk avx512_bf16 clzero irperf xsaveerptr rdpru
>> > wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean
>> > flushbyasid decodeassists pausefilter pfthreshold avic vgif x2avic
>> > v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes
>> > vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid
>> > overflow_recov succor smca fsrm flush_l1d amd_lbr_pmc_freeze
>> > bugs            : sysret_ss_attrs spectre_v1 spectre_v2
>> > spec_store_bypass srso
>> > bogomips        : 7585.28
>> > TLB size        : 3584 4K pages
>> > clflush size    : 64
>> > cache_alignment : 64
>> > address sizes   : 48 bits physical, 48 bits virtual
>> > power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
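>> >
>> > As an aside: a quick way to pull just the AVX-512 extensions out of
>> > that flag soup (a generic one-liner, nothing specific to this box):
>> >
>> > # list the distinct avx512* feature flags reported by the kernel
>> > grep -o 'avx512[a-z0-9_]*' /proc/cpuinfo | sort -u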
>> >
>> > # lspci | fgrep -i volati
>> > 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd
>> > NVMe SSD Controller PM9A1/PM9A3/980PRO
>> > 02:00.0 Non-Volatile memory controller: Micron Technology Inc 3400
>> > NVMe SSD [Hendrix]
>> >
>> > I have no idea why the replacement server has two different NVMe
>> > SSDs, but you never know beforehand what you will get! From the SMART
>> > info I know that both SSDs were fresh (only 6 hours of total uptime).
>> >
>> > Uwe
>> >
>> > --
>> >
>> > Uwe Schindler
>> > Achterdiek 19, D-28357 Bremen
>> > https://www.thetaphi.de
>> > eMail: u...@thetaphi.de
>>
>> --
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>

--
Uwe Schindler
Achterdiek 19, 28357 Bremen
https://www.thetaphi.de
