Data Corruption in ZFS #196
Comments
My first suspect would be bad RAM. Do you have ECC DIMMs? Have you run
memtest86 to test your RAM?
Mike
…On Sun, Oct 6, 2019 at 11:47 PM bautsche ***@***.***> wrote:
I appear to be getting data corruption in ZFS leading to panics. I've been
asked by Brian Bennett to open a new issue for a kernel developer to look
at this.
Here's more detail:
A bit of history: I have recently moved to SmartOS. I have a home-built server on which I am currently running just a handful of VMs, though the idea is that this will become my main server. The machine used to crash quite regularly. I eventually traced that to an additional SATA adapter I had in the server, so I removed it (it isn't really required; I can make do with the four available on-board ports).
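Since the earlier crashes were traced to that SATA adapter, it may be worth confirming that the remaining on-board controller and disks are not quietly logging errors before blaming ZFS itself. A minimal check from the global zone (a sketch; nothing here is specific to this machine):

# per-device soft/hard/transport error counters since boot
iostat -En
# FMA telemetry: raw error reports, plus anything already diagnosed as faulty
fmdump -e
fmadm faulty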
I thought I had it fixed and was contemplating migrating my remaining VMs over when, last night, the machine panicked again, twice.
Here's what I got out of the first panic:
***@***.*** # pwd
/var/crash/volatile
***@***.*** # savecore -f vmdump.1
savecore: incomplete dump on dump device
savecore: System dump time: Sat Sep 21 21:03:21 2019
savecore: saving system crash dump in /var/crash/volatile/{unix,vmcore}.1
Constructing namelist /var/crash/volatile/unix.1
Constructing corefile /var/crash/volatile/vmcore.1
pfn 16454656 not found for as=fffffffffbc490a0, va=fffffe0040000000
pfn 16454657 not found for as=fffffffffbc490a0, va=fffffe0040001000
pfn 16454658 not found for as=fffffffffbc490a0, va=fffffe0040002000
pfn 16454659 not found for as=fffffffffbc490a0, va=fffffe0040003000
pfn 16454660 not found for as=fffffffffbc490a0, va=fffffe0040004000
pfn 16454661 not found for as=fffffffffbc490a0, va=fffffe0040005000
pfn 16454662 not found for as=fffffffffbc490a0, va=fffffe0040006000
pfn 16454663 not found for as=fffffffffbc490a0, va=fffffe0040007000
pfn 16454664 not found for as=fffffffffbc490a0, va=fffffe0040008000
pfn 16454665 not found for as=fffffffffbc490a0, va=fffffe0040009000
4:54 99% done: 1823654 of 1824153 pages saved
savecore: bad data after page 1823654
***@***.*** # ls -la
total 24388208
drwx------ 2 root root 9 Sep 22 09:32 .
drwxr-xr-x 3 root root 3 Aug 9 10:36 ..
-rw-r--r-- 1 root root 2 Sep 22 00:07 bounds
-rw-r--r-- 1 root root 3014 Sep 22 00:07 METRICS.csv
-rw-r--r-- 1 root root 2324901 Sep 22 09:32 unix.1
-rw-r--r-- 1 root root 6859653120 Sep 22 09:37 vmcore.1
-rw-r--r-- 1 root root 2734096384 Aug 30 19:19 vmdump.0
-rw-r--r-- 1 root root 2739470336 Sep 21 21:07 vmdump.1
-rw-r--r-- 1 root root 2168651776 Sep 22 00:07 vmdump.2
***@***.*** # mdb 1
mdb: failed to read .symtab header for 'unix', id=0: no mapping for address
mdb: failed to read .symtab header for 'genunix', id=1: no mapping for
address
mdb: failed to read modctl at fffffe58c97eaf08: no mapping for address
::status
debugging crash dump vmcore.1 (64-bit) from nexon
operating system: 5.11 joyent_20190731T235744Z (i86pc)
image uuid: (not set)
panic message: checksum of cached data doesn't match BP err=50 hdr=fffffe5941b98658 bp=fffffe5a67d99800 abd=fffffe5942208088 buf=fffffe5a45a5a000
dump content: kernel pages only
::stack
vpanic()
0xfffffffff7e93d08()
arc_read+0xb43()
dbuf_read_impl+0x38a()
dbuf_read+0xee()
dmu_buf_hold_array_by_dnode+0x128()
dmu_read_uio_dnode+0x5a()
dmu_read_uio+0x5b()
zvol_read+0x17a()
cdev_read+0x2d()
spec_read+0x2b9()
fop_read+0xf3()
preadv+0x4fb()
sys_syscall+0x19f()
::msgbuf
mdb: invalid command '::msgbuf': unknown dcmd name
::panicinfo
mdb: invalid command '::panicinfo': unknown dcmd name
::cpuinfo -v
mdb: invalid command '::cpuinfo': unknown dcmd name
::ps
mdb: invalid command '::ps': unknown dcmd name
Note the failed commands at the end.
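The failed dcmds are consistent with the "incomplete dump on dump device" warning above: savecore could not recover every page, so parts of the kernel image are missing from vmcore.1. One thing worth looking at (a sketch only) is whether the dump zvol is sized generously enough for a kernel-pages dump on this machine:

# current dump device, savecore directory and dump content
dumpadm
# size of the zvol backing /dev/zvol/dsk/zones/dump
zfs get volsize zones/dump
If it turns out to be undersized, growing zones/dump (zfs set volsize=...) and re-registering it with dumpadm -d /dev/zvol/dsk/zones/dump would be the obvious next step, assuming ZFS permits resizing the active dump zvol.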
For the second panic, I get something similar, but the commands work (the second panic occurred while my daily snapshot and zfs send to a remote system were running via cron).
(Note that I have re-run the zfs snapshots and zfs send manually this morning, without any issues, so this may be a red herring.)
***@***.*** # savecore -f vmdump.2
savecore: System dump time: Sun Sep 22 00:04:32 2019
savecore: saving system crash dump in /var/crash/volatile/{unix,vmcore}.2
Constructing namelist /var/crash/volatile/unix.2
Constructing corefile /var/crash/volatile/vmcore.2
2:36 100% done: 1532295 of 1532295 pages saved
***@***.*** # ls -la
total 34188827
drwx------ 2 root root 11 Sep 22 09:42 .
drwxr-xr-x 3 root root 3 Aug 9 10:36 ..
-rw-r--r-- 1 root root 2 Sep 22 00:07 bounds
-rw-r--r-- 1 root root 3388 Sep 22 09:45 METRICS.csv
-rw-r--r-- 1 root root 2324901 Sep 22 09:32 unix.1
-rw-r--r-- 1 root root 2322416 Sep 22 09:42 unix.2
-rw-r--r-- 1 root root 7654342656 Sep 22 09:37 vmcore.1
-rw-r--r-- 1 root root 6374752256 Sep 22 09:45 vmcore.2
-rw-r--r-- 1 root root 2734096384 Aug 30 19:19 vmdump.0
-rw-r--r-- 1 root root 2739470336 Sep 21 21:07 vmdump.1
-rw-r--r-- 1 root root 2168651776 Sep 22 00:07 vmdump.2
***@***.*** # mdb 2
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix
scsi_vhci ufs ip hook neti sockfs arp usba xhci mm sd fctl stmf_sbd stmf
zfs lofs idm sata crypto fcp random cpc logindmux ptm kvm sppp nsmb smbsrv
nfs ]
::status
debugging crash dump vmcore.2 (64-bit) from nexon
operating system: 5.11 joyent_20190731T235744Z (i86pc)
git branch: release-20190801
git rev: ec6335f
<ec6335f>
image uuid: (not set)
panic message: checksum of cached data doesn't match BP err=50 hdr=fffffe5970f343b0 bp=fffffe59bc8a8e58 abd=fffffe5970ec9d38 buf=fffffe5909849000
dump content: kernel pages only
::msgbuf
MESSAGE
***@***.******@***.*** (zcons3) online
NOTICE: vnic1020 registered
NOTICE: vnic1020 link up, 1000 Mbps, unknown duplex
NOTICE: vnic1021 registered
NOTICE: vnic1021 link up, 1000 Mbps, unknown duplex
NOTICE: vnic1023 registered
NOTICE: vnic1023 link up, 1000 Mbps, unknown duplex
kvm_lapic_reset: vcpu=fffffe59148ec000, id=0, base_msr= fee00100 PRIx64
base_address=fee00000
vmcs revision_id = 4
kvm_lapic_reset: vcpu=fffffe59148e5000, id=1, base_msr= fee00000 PRIx64
base_address=fee00000
vmcs revision_id = 4
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfebdc
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0x0 data 0
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfea1c
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfea1c
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xffdfedf8 data 3ef1cafb0
unhandled wrmsr: 0xffdfedf8 data 3ef1cafb0
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=fffffe591441b000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xef1bbdd0 data fffffc7fef1cafb0
unhandled wrmsr: 0xef1bbdd0 data fffffc7fef1cafb0
unhandled wrmsr: 0xffdfee68 data 3ef1beff8
unhandled wrmsr: 0xffdfee68 data 3ef1beff8
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=fffffe59148f3000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfea1c
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfea1c
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xffdfedf8 data 3ef1cafb0
unhandled wrmsr: 0xffdfedf8 data 3ef1cafb0
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=fffffe59148e5000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
unhandled wrmsr: 0x0 data 0
Creating /etc/devices/devid_cache
Creating /etc/devices/pci_unitaddr_persistent
Creating /etc/devices/devname_cache
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
xsvc0 at root: space 0 offset 0
xsvc0 is ***@***.***,0
pseudo-device: devinfo0
devinfo0 is ***@***.***
Block device: ***@***.***,0, blkdev0
blkdev0 is ***@***.******@***.******@***.******@***.***,0
sd0 at scsa2usb0: target 0 lun 0
sd0 is ***@***.******@***.******@***.******@***.***,0
pseudo-device: llc10
llc10 is ***@***.***
pseudo-device: ramdisk1024
ramdisk1024 is ***@***.***
pseudo-device: ucode0
ucode0 is ***@***.***
pseudo-device: dcpc0
dcpc0 is ***@***.***
pseudo-device: fbt0
fbt0 is ***@***.***
pseudo-device: profile0
profile0 is ***@***.***
pseudo-device: lockstat0
lockstat0 is ***@***.***
pseudo-device: sdt0
sdt0 is ***@***.***
pseudo-device: systrace0
systrace0 is ***@***.***
pseudo-device: fcp0
fcp0 is ***@***.***
pseudo-device: fcsm0
fcsm0 is ***@***.***
pseudo-device: ipd0
ipd0 is ***@***.***
pseudo-device: stmf0
stmf0 is ***@***.***
pseudo-device: fssnap0
fssnap0 is ***@***.***
pseudo-device: bpf0
bpf0 is ***@***.***
pseudo-device: pm0
pm0 is ***@***.***
pseudo-device: nsmb0
nsmb0 is ***@***.***
device ***@***.***(display#0) keeps up device ***@***.***,0(disk#0), but
the former is not power managed
pseudo-device: lx_systrace0
lx_systrace0 is ***@***.***
pseudo-device: inotify0
inotify0 is ***@***.***
pseudo-device: eventfd0
eventfd0 is ***@***.***
pseudo-device: timerfd0
timerfd0 is ***@***.***
pseudo-device: signalfd0
signalfd0 is ***@***.***
pseudo-device: viona0
viona0 is ***@***.***
unhandled rdmsr: 0x140
vcpu 1 received sipi with vector # 9a
kvm_lapic_reset: vcpu=fffffe59148f3000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
unhandled rdmsr: 0x140
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
vcpu 1 received sipi with vector # 40
kvm_lapic_reset: vcpu=fffffe59148e5000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
vcpu 1 received sipi with vector # 40
kvm_lapic_reset: vcpu=fffffe591441b000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
panic[cpu5]/thread=fffffe58f6015be0:
checksum of cached data doesn't match BP err=50 hdr=fffffe5970f343b0
bp=fffffe59bc8a8e58 abd=fffffe5970ec9d38 buf=fffffe5909849000
fffffe007a8a4640 zfs:zfs_nfsshare_inited+37dc3c6c ()
fffffe007a8a4760 zfs:arc_read+b43 ()
fffffe007a8a4810 zfs:do_dump+215 ()
fffffe007a8a49d0 zfs:dmu_send_impl+7e3 ()
fffffe007a8a4b10 zfs:dmu_send_obj+2f0 ()
fffffe007a8a4bb0 zfs:zfs_ioc_send+105 ()
fffffe007a8a4c70 zfs:zfsdev_ioctl+512 ()
fffffe007a8a4cb0 genunix:cdev_ioctl+39 ()
fffffe007a8a4d00 specfs:spec_ioctl+60 ()
fffffe007a8a4d90 genunix:fop_ioctl+55 ()
fffffe007a8a4eb0 genunix:ioctl+9b ()
fffffe007a8a4f10 unix:brand_sys_sysenter+1d3 ()
dumping to /dev/zvol/dsk/zones/dump, offset 65536, content: kernel
NOTICE: ahci1: ahci_tran_reset_dport port 0 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 2 reset port
::panicinfo
cpu 5
thread fffffe58f6015be0
message checksum of cached data doesn't match BP err=50
hdr=fffffe5970f343b0 bp=fffffe59bc8a8e58 abd=fffffe5970ec9d38
buf=fffffe5909849000
rdi fffffffff7fab408
rsi fffffe007a8a4560
rdx fffffe5970f343b0
rcx fffffe59bc8a8e58
r8 fffffe5970ec9d38
r9 fffffe5909849000
rax fffffe007a8a4580
rbx fffffe59bc8a8e58
rbp fffffe007a8a45c0
r10 c0dd0
r11 0
r12 fffffe5970f343b0
r13 32
r14 fffffe5970ec9d38
r15 2000
fsbase fffffc7fef060a40
gsbase fffffe58dcaad000
ds 38
es 38
fs 0
gs 0
trapno 0
err 0
rip fffffffffb881010
cs 30
rflags 286
rsp fffffe007a8a4558
ss 38
gdt_hi 0
gdt_lo 1ef
idt_hi 0
idt_lo 9000ffff
ldt 0
task 70
cr0 80050033
cr2 a768b7f710
cr3 fec051000
cr4 3626f8
::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc4a000 1f 0 0 -1 no no t-0 fffffe0079805c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
1 fffffe58db55a000 1f 0 0 -1 no no t-0 fffffe0079adbc20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
2 fffffe58db558000 1f 0 0 -1 no no t-5 fffffe0079b61c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
3 fffffe58db841000 1f 0 0 -1 no no t-1 fffffe0079bb7c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
4 fffffe58dc909000 1f 0 0 -1 no no t-0 fffffe0079cadc20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
5 fffffffffbc4fe00 1b 0 0 59 no no t-0 fffffe58f6015be0 zfs
|
RUNNING <--+
READY
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
6 fffffe58dcaa7000 1f 0 0 -1 no no t-0 fffffe0079e69c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
7 fffffe58dcaa5000 1f 0 0 -1 no no t-11 fffffe0079eefc20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
8 fffffe58dcaa1000 1f 1 0 -1 no no t-0 fffffe0079f75c20 (idle)
| |
RUNNING <--+ +--> PRI THREAD PROC
READY 60 fffffe007aa79c20 sched
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
9 fffffe58dca9d000 1f 0 0 -1 no no t-0 fffffe0079ffbc20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
10 fffffe58dca99000 1f 0 0 -1 no no t-42 fffffe007a081c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
11 fffffe58dcd8d000 1f 0 0 -1 no no t-5 fffffe007a107c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
fffffe5970ec9d38::whatis
fffffe5970ec9d38 is allocated from kmem_alloc_48
fffffe5970ec9d38::print abd_t
{
abd_flags = 0x2 (ABD_FLAG_OWNER)
abd_size = 0x2000
abd_parent = 0
abd_children = {
rc_count = 0
}
abd_u = {
abd_scatter = {
abd_offset = 0
abd_chunk_size = 0x1000
abd_chunks = [ ... ]
}
abd_linear = {
abd_buf = 0x100000000000
}
}
}
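The panic message also gives the arc_buf_hdr_t address (hdr=fffffe5970f343b0), so the same kind of inspection can be applied to the header itself; a sketch run against the same dump, with the b_l1hdr.b_pabd field name taken from the arc.c excerpt quoted below:

# dump the whole ARC buffer header named in the panic message
echo 'fffffe5970f343b0::print arc_buf_hdr_t' | mdb 2
# and just the pointer to the (compressed) data buffer it owns
echo 'fffffe5970f343b0::print arc_buf_hdr_t b_l1hdr.b_pabd' | mdb 2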
Looks like I am again having ZFS issues...
The error message I'm seeing comes from this bit of code:
https://github.com/openzfs/openzfs/blob/master/usr/src/uts/common/fs/zfs/arc.c
specifically static void arc_hdr_verify_checksum(spa_t *spa, arc_buf_hdr_t *hdr, const blkptr_t *bp), which carries the (helpful?) comment: "this should be changed to return an error, and callers re-read from disk on failure (on nondebug bits)".
The question, then, is: why am I getting there at all?
Looking at the code in question, I seem to end up with abd != NULL, which, if my C is not too rusty, is only the case if I use either encryption (which I don't) or compression (which I do):
int err = 0;
abd_t *abd = NULL;

if (BP_IS_ENCRYPTED(bp)) {
        if (HDR_HAS_RABD(hdr)) {
                abd = hdr->b_crypt_hdr.b_rabd;
        }
} else if (HDR_COMPRESSION_ENABLED(hdr)) {
        abd = hdr->b_l1hdr.b_pabd;
}

if (abd != NULL) {
if (abd != NULL) {
So would a workaround be to stop using compression? Or am I suffering from some underlying issue, and would I just be moving the problem somewhere else?
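Before experimenting, it is worth seeing what the datasets are actually using, and being clear about what the workaround would (and would not) do; a sketch, using the vmdg pool and the io0 zvol from the listings below:

# which compression setting each dataset/zvol in the pool really has
zfs get -r compression vmdg
# the workaround being asked about: this only affects newly written blocks;
# existing blocks stay compressed on disk, so at best it masks the problem
zfs set compression=off vmdg/io0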
Here's some other data from the system around its configuration, in case that helps. Basically, the disks for the various VMs reside on the zpool vmdg. There is one Windows, one Linux and two Solaris VMs. Yes, I know all about images and zones, but these are "special" in various different ways, so they need to be VMs:
***@***.*** $ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
vmdg 952G 250G 702G - - 11% 26% 1.00x ONLINE -
zones 222G 20.3G 202G - - 0% 9% 1.00x ONLINE -
***@***.*** $ zfs list
NAME USED AVAIL REFER MOUNTPOINT
vmdg 621G 301G 24K /export/vm
vmdg/galatea0 63.1G 343G 7.05G -
vmdg/gaspra0 58.3G 343G 13.5G -
vmdg/io0 452G 549G 105G -
vmdg/nereid0 47.2G 343G 5.70G -
zones 85.3G 130G 538K /zones
zones/789a6253-c226-ed36-cf15-c5b0c5e79f23 193K 10.0G 49K /zones/789a6253-c226-ed36-cf15-c5b0c5e79f23
zones/89d9e83c-437b-edf5-db03-b0004eb2b273 128K 10.0G 48K /zones/89d9e83c-437b-edf5-db03-b0004eb2b273
zones/a0d804ec-5f92-c97c-af8a-810d656b1e93 126K 10.0G 48K /zones/a0d804ec-5f92-c97c-af8a-810d656b1e93
zones/archive 50K 130G 24K /zones/archive
zones/config 272K 130G 59K legacy
zones/cores 144K 130G 24K none
zones/cores/789a6253-c226-ed36-cf15-c5b0c5e79f23 24K 100G 24K /zones/789a6253-c226-ed36-cf15-c5b0c5e79f23/cores
zones/cores/89d9e83c-437b-edf5-db03-b0004eb2b273 24K 100G 24K /zones/89d9e83c-437b-edf5-db03-b0004eb2b273/cores
zones/cores/a0d804ec-5f92-c97c-af8a-810d656b1e93 24K 100G 24K /zones/a0d804ec-5f92-c97c-af8a-810d656b1e93/cores
zones/cores/e67a4093-4a30-49dd-a75a-8aead207180e 24K 100G 24K /zones/e67a4093-4a30-49dd-a75a-8aead207180e/cores
zones/cores/global 24K 10.0G 24K /zones/global/cores
zones/dump 2.56G 130G 2.56G -
zones/e67a4093-4a30-49dd-a75a-8aead207180e 248K 10.0G 50.5K /zones/e67a4093-4a30-49dd-a75a-8aead207180e
zones/opt 384K 130G 238K legacy
zones/swap 65.7G 195G 738M -
zones/usbkey 58.5K 130G 32.5K legacy
zones/var 17.0G 130G 14.5G legacy
***@***.*** $ su -
Password:
- SmartOS (build: 20190731T235744Z)
System is running the following applications:
vm server
DISPLAY=192.168.140.92:0.0
***@***.*** # vmadm list
UUID TYPE RAM STATE ALIAS
789a6253-c226-ed36-cf15-c5b0c5e79f23 KVM 4096 running gaspra
89d9e83c-437b-edf5-db03-b0004eb2b273 KVM 4096 running galatea
a0d804ec-5f92-c97c-af8a-810d656b1e93 KVM 4096 running nereid
e67a4093-4a30-49dd-a75a-8aead207180e KVM 4096 running io
***@***.*** #
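Given that the panic fires when cached data fails to match its block pointer's checksum, a scrub would show whether the same blocks are also bad on disk or whether the corruption exists only in memory; a sketch, with both pool names taken from the zpool list above:

# re-read and verify every allocated block against its checksum
zpool scrub vmdg
zpool scrub zones
# once the scrubs complete, look for CKSUM errors and affected files
zpool status -v vmdg
zpool status -v zones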
The motherboard is an ASRock Z370 Extreme4 and does not support ECC RAM. I just ran 12 hours' worth of memtest86 tests overnight without any errors.
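memtest86 only exercises the RAM offline; on the running system, FMA keeps its own log of whatever hardware errors the platform reports. With non-ECC DIMMs it will not see memory bit flips, but it can still rule other hardware in or out (a sketch, run from the global zone):

# anything FMA has already diagnosed as faulty
fmadm faulty
# the raw error-report log (CPU, PCIe, disk, etc.); fmdump -eV shows full detail
fmdump -e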
I'm getting something similar. I recently set up a new CN, an Intel NUC 10. When I try to migrate a VM from the head node to the CN, the CN panics and reboots. When the CN comes back online, I see the following message in mail:
I ran
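Assuming the CN's panic left a crash dump behind, the same triage steps used earlier in this thread should apply on the new machine (N stands for whichever dump number savecore reports):

cd /var/crash/volatile
# extract the dump from the vmdump file
savecore -f vmdump.N
# summarise the panic: status, stack and recent console messages
printf '::status\n::stack\n::msgbuf\n' | mdb N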
Closing since this is a hardware issue.
Re-opening because this is a more generalized issue than a single server's hardware problems.