Data Corruption in ZFS #196
Comments
My first suspect would be bad RAM. Do you have ECC DIMMs? Have you run
memtest86 to test your RAM?
Mike
…On Sun, Oct 6, 2019 at 11:47 PM bautsche ***@***.***> wrote:
I appear to be getting data corruption in ZFS leading to panics. I've been
asked by Brian Bennett to open a new issue for a kernel developer to look
at this.
Here's more detail:
A bit of history: I have recently moved to SmartOS. I have a home-built server on which I am currently running just a handful of VMs, though the idea is that this will become my main server. The machine used to crash quite regularly. I eventually traced that to an additional SATA adapter I had in the server, so I removed it (it isn't really required; I can make do with the four available on-board ports).
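Since the earlier crashes were traced to that SATA adapter, it may be worth confirming that the remaining on-board controller and disks are not quietly logging errors before blaming ZFS itself. A minimal check from the global zone (a sketch; nothing here is specific to this machine):

# per-device soft/hard/transport error counters since boot
iostat -En
# FMA telemetry: raw error reports, plus anything already diagnosed as faulty
fmdump -e
fmadm faulty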
I thought I had it fixed and was contemplating migrating my remaining VMs over when, last night, the machine panicked again, twice.
Here's what I got out of the first panic:
***@***.*** # pwd
/var/crash/volatile
***@***.*** # savecore -f vmdump.1
savecore: incomplete dump on dump device
savecore: System dump time: Sat Sep 21 21:03:21 2019
savecore: saving system crash dump in /var/crash/volatile/{unix,vmcore}.1
Constructing namelist /var/crash/volatile/unix.1
Constructing corefile /var/crash/volatile/vmcore.1
pfn 16454656 not found for as=fffffffffbc490a0, va=fffffe0040000000
pfn 16454657 not found for as=fffffffffbc490a0, va=fffffe0040001000
pfn 16454658 not found for as=fffffffffbc490a0, va=fffffe0040002000
pfn 16454659 not found for as=fffffffffbc490a0, va=fffffe0040003000
pfn 16454660 not found for as=fffffffffbc490a0, va=fffffe0040004000
pfn 16454661 not found for as=fffffffffbc490a0, va=fffffe0040005000
pfn 16454662 not found for as=fffffffffbc490a0, va=fffffe0040006000
pfn 16454663 not found for as=fffffffffbc490a0, va=fffffe0040007000
pfn 16454664 not found for as=fffffffffbc490a0, va=fffffe0040008000
pfn 16454665 not found for as=fffffffffbc490a0, va=fffffe0040009000
4:54 99% done: 1823654 of 1824153 pages saved
savecore: bad data after page 1823654
***@***.*** # ls -la
total 24388208
drwx------ 2 root root 9 Sep 22 09:32 .
drwxr-xr-x 3 root root 3 Aug 9 10:36 ..
-rw-r--r-- 1 root root 2 Sep 22 00:07 bounds
-rw-r--r-- 1 root root 3014 Sep 22 00:07 METRICS.csv
-rw-r--r-- 1 root root 2324901 Sep 22 09:32 unix.1
-rw-r--r-- 1 root root 6859653120 Sep 22 09:37 vmcore.1
-rw-r--r-- 1 root root 2734096384 Aug 30 19:19 vmdump.0
-rw-r--r-- 1 root root 2739470336 Sep 21 21:07 vmdump.1
-rw-r--r-- 1 root root 2168651776 Sep 22 00:07 vmdump.2
***@***.*** # mdb 1
mdb: failed to read .symtab header for 'unix', id=0: no mapping for address
mdb: failed to read .symtab header for 'genunix', id=1: no mapping for
address
mdb: failed to read modctl at fffffe58c97eaf08: no mapping for address
::status
debugging crash dump vmcore.1 (64-bit) from nexon
operating system: 5.11 joyent_20190731T235744Z (i86pc)
image uuid: (not set)
panic message: checksum of cached data doesn't match BP err=50 hdr=fffffe5941b98658 bp=fffffe5a67d99800 abd=fffffe5942208088 buf=fffffe5a45a5a000
dump content: kernel pages only
::stack
vpanic()
0xfffffffff7e93d08()
arc_read+0xb43()
dbuf_read_impl+0x38a()
dbuf_read+0xee()
dmu_buf_hold_array_by_dnode+0x128()
dmu_read_uio_dnode+0x5a()
dmu_read_uio+0x5b()
zvol_read+0x17a()
cdev_read+0x2d()
spec_read+0x2b9()
fop_read+0xf3()
preadv+0x4fb()
sys_syscall+0x19f()
::msgbuf
mdb: invalid command '::msgbuf': unknown dcmd name
::panicinfo
mdb: invalid command '::panicinfo': unknown dcmd name
::cpuinfo -v
mdb: invalid command '::cpuinfo': unknown dcmd name
::ps
mdb: invalid command '::ps': unknown dcmd name
Note the failed commands at the end.
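The failed dcmds are consistent with the "incomplete dump on dump device" warning above: savecore could not recover every page, so parts of the kernel image are missing from vmcore.1. One thing worth looking at (a sketch only) is whether the dump zvol is sized generously enough for a kernel-pages dump on this machine:

# current dump device, savecore directory and dump content
dumpadm
# size of the zvol backing /dev/zvol/dsk/zones/dump
zfs get volsize zones/dump
If it turns out to be undersized, growing zones/dump (zfs set volsize=...) and re-registering it with dumpadm -d /dev/zvol/dsk/zones/dump would be the obvious next step, assuming ZFS permits resizing the active dump zvol.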
For the second panic, I get something similar, but the commands work (the second panic occurred while my daily snapshot and zfs send to a remote system were running via cron).
(Note that I have re-run the zfs snapshots and zfs send manually this morning, without any issues, so this may be a red herring.)
***@***.*** # savecore -f vmdump.2
savecore: System dump time: Sun Sep 22 00:04:32 2019
savecore: saving system crash dump in /var/crash/volatile/{unix,vmcore}.2
Constructing namelist /var/crash/volatile/unix.2
Constructing corefile /var/crash/volatile/vmcore.2
2:36 100% done: 1532295 of 1532295 pages saved
***@***.*** # ls -la
total 34188827
drwx------ 2 root root 11 Sep 22 09:42 .
drwxr-xr-x 3 root root 3 Aug 9 10:36 ..
-rw-r--r-- 1 root root 2 Sep 22 00:07 bounds
-rw-r--r-- 1 root root 3388 Sep 22 09:45 METRICS.csv
-rw-r--r-- 1 root root 2324901 Sep 22 09:32 unix.1
-rw-r--r-- 1 root root 2322416 Sep 22 09:42 unix.2
-rw-r--r-- 1 root root 7654342656 Sep 22 09:37 vmcore.1
-rw-r--r-- 1 root root 6374752256 Sep 22 09:45 vmcore.2
-rw-r--r-- 1 root root 2734096384 Aug 30 19:19 vmdump.0
-rw-r--r-- 1 root root 2739470336 Sep 21 21:07 vmdump.1
-rw-r--r-- 1 root root 2168651776 Sep 22 00:07 vmdump.2
***@***.*** # mdb 2
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix
scsi_vhci ufs ip hook neti sockfs arp usba xhci mm sd fctl stmf_sbd stmf
zfs lofs idm sata crypto fcp random cpc logindmux ptm kvm sppp nsmb smbsrv
nfs ]
::status
debugging crash dump vmcore.2 (64-bit) from nexon
operating system: 5.11 joyent_20190731T235744Z (i86pc)
git branch: release-20190801
git rev: ec6335f
<ec6335f>
image uuid: (not set)
panic message: checksum of cached data doesn't match BP err=50 hdr=fffffe5970f343b0 bp=fffffe59bc8a8e58 abd=fffffe5970ec9d38 buf=fffffe5909849000
dump content: kernel pages only
::msgbuf
MESSAGE
***@***.******@***.*** (zcons3) online
NOTICE: vnic1020 registered
NOTICE: vnic1020 link up, 1000 Mbps, unknown duplex
NOTICE: vnic1021 registered
NOTICE: vnic1021 link up, 1000 Mbps, unknown duplex
NOTICE: vnic1023 registered
NOTICE: vnic1023 link up, 1000 Mbps, unknown duplex
kvm_lapic_reset: vcpu=fffffe59148ec000, id=0, base_msr= fee00100 PRIx64
base_address=fee00000
vmcs revision_id = 4
kvm_lapic_reset: vcpu=fffffe59148e5000, id=1, base_msr= fee00000 PRIx64
base_address=fee00000
vmcs revision_id = 4
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfebdc
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0x0 data 0
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfea1c
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfea1c
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xffdfedf8 data 3ef1cafb0
unhandled wrmsr: 0xffdfedf8 data 3ef1cafb0
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=fffffe591441b000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xef1bbdd0 data fffffc7fef1cafb0
unhandled wrmsr: 0xef1bbdd0 data fffffc7fef1cafb0
unhandled wrmsr: 0xffdfee68 data 3ef1beff8
unhandled wrmsr: 0xffdfee68 data 3ef1beff8
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=fffffe59148f3000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfea1c
unhandled wrmsr: 0xef3b6e97 data fffffc7fffdfea1c
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xef197394 data 0
unhandled wrmsr: 0xffdfedf8 data 3ef1cafb0
unhandled wrmsr: 0xffdfedf8 data 3ef1cafb0
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=fffffe59148e5000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
unhandled wrmsr: 0x0 data 0
Creating /etc/devices/devid_cache
Creating /etc/devices/pci_unitaddr_persistent
Creating /etc/devices/devname_cache
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
xsvc0 at root: space 0 offset 0
xsvc0 is ***@***.***,0
pseudo-device: devinfo0
devinfo0 is ***@***.***
Block device: ***@***.***,0, blkdev0
blkdev0 is ***@***.******@***.******@***.******@***.***,0
sd0 at scsa2usb0: target 0 lun 0
sd0 is ***@***.******@***.******@***.******@***.***,0
pseudo-device: llc10
llc10 is ***@***.***
pseudo-device: ramdisk1024
ramdisk1024 is ***@***.***
pseudo-device: ucode0
ucode0 is ***@***.***
pseudo-device: dcpc0
dcpc0 is ***@***.***
pseudo-device: fbt0
fbt0 is ***@***.***
pseudo-device: profile0
profile0 is ***@***.***
pseudo-device: lockstat0
lockstat0 is ***@***.***
pseudo-device: sdt0
sdt0 is ***@***.***
pseudo-device: systrace0
systrace0 is ***@***.***
pseudo-device: fcp0
fcp0 is ***@***.***
pseudo-device: fcsm0
fcsm0 is ***@***.***
pseudo-device: ipd0
ipd0 is ***@***.***
pseudo-device: stmf0
stmf0 is ***@***.***
pseudo-device: fssnap0
fssnap0 is ***@***.***
pseudo-device: bpf0
bpf0 is ***@***.***
pseudo-device: pm0
pm0 is ***@***.***
pseudo-device: nsmb0
nsmb0 is ***@***.***
device ***@***.***(display#0) keeps up device ***@***.***,0(disk#0), but
the former is not power managed
pseudo-device: lx_systrace0
lx_systrace0 is ***@***.***
pseudo-device: inotify0
inotify0 is ***@***.***
pseudo-device: eventfd0
eventfd0 is ***@***.***
pseudo-device: timerfd0
timerfd0 is ***@***.***
pseudo-device: signalfd0
signalfd0 is ***@***.***
pseudo-device: viona0
viona0 is ***@***.***
unhandled rdmsr: 0x140
vcpu 1 received sipi with vector # 9a
kvm_lapic_reset: vcpu=fffffe59148f3000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
unhandled rdmsr: 0x140
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
unhandled rdmsr: 0x4b4d564b
unhandled wrmsr: 0x0 data 4002014001ff01ff
vcpu 1 received sipi with vector # 40
kvm_lapic_reset: vcpu=fffffe59148e5000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
vcpu 1 received sipi with vector # 40
kvm_lapic_reset: vcpu=fffffe591441b000, id=1, base_msr= fee00800 PRIx64
base_address=fee00000
panic[cpu5]/thread=fffffe58f6015be0:
checksum of cached data doesn't match BP err=50 hdr=fffffe5970f343b0
bp=fffffe59bc8a8e58 abd=fffffe5970ec9d38 buf=fffffe5909849000
fffffe007a8a4640 zfs:zfs_nfsshare_inited+37dc3c6c ()
fffffe007a8a4760 zfs:arc_read+b43 ()
fffffe007a8a4810 zfs:do_dump+215 ()
fffffe007a8a49d0 zfs:dmu_send_impl+7e3 ()
fffffe007a8a4b10 zfs:dmu_send_obj+2f0 ()
fffffe007a8a4bb0 zfs:zfs_ioc_send+105 ()
fffffe007a8a4c70 zfs:zfsdev_ioctl+512 ()
fffffe007a8a4cb0 genunix:cdev_ioctl+39 ()
fffffe007a8a4d00 specfs:spec_ioctl+60 ()
fffffe007a8a4d90 genunix:fop_ioctl+55 ()
fffffe007a8a4eb0 genunix:ioctl+9b ()
fffffe007a8a4f10 unix:brand_sys_sysenter+1d3 ()
dumping to /dev/zvol/dsk/zones/dump, offset 65536, content: kernel
NOTICE: ahci1: ahci_tran_reset_dport port 0 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 2 reset port
::panicinfo
cpu 5
thread fffffe58f6015be0
message checksum of cached data doesn't match BP err=50
hdr=fffffe5970f343b0 bp=fffffe59bc8a8e58 abd=fffffe5970ec9d38
buf=fffffe5909849000
rdi fffffffff7fab408
rsi fffffe007a8a4560
rdx fffffe5970f343b0
rcx fffffe59bc8a8e58
r8 fffffe5970ec9d38
r9 fffffe5909849000
rax fffffe007a8a4580
rbx fffffe59bc8a8e58
rbp fffffe007a8a45c0
r10 c0dd0
r11 0
r12 fffffe5970f343b0
r13 32
r14 fffffe5970ec9d38
r15 2000
fsbase fffffc7fef060a40
gsbase fffffe58dcaad000
ds 38
es 38
fs 0
gs 0
trapno 0
err 0
rip fffffffffb881010
cs 30
rflags 286
rsp fffffe007a8a4558
ss 38
gdt_hi 0
gdt_lo 1ef
idt_hi 0
idt_lo 9000ffff
ldt 0
task 70
cr0 80050033
cr2 a768b7f710
cr3 fec051000
cr4 3626f8
::cpuinfo -v
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 fffffffffbc4a000 1f 0 0 -1 no no t-0 fffffe0079805c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
1 fffffe58db55a000 1f 0 0 -1 no no t-0 fffffe0079adbc20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
2 fffffe58db558000 1f 0 0 -1 no no t-5 fffffe0079b61c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
3 fffffe58db841000 1f 0 0 -1 no no t-1 fffffe0079bb7c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
4 fffffe58dc909000 1f 0 0 -1 no no t-0 fffffe0079cadc20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
5 fffffffffbc4fe00 1b 0 0 59 no no t-0 fffffe58f6015be0 zfs
|
RUNNING <--+
READY
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
6 fffffe58dcaa7000 1f 0 0 -1 no no t-0 fffffe0079e69c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
7 fffffe58dcaa5000 1f 0 0 -1 no no t-11 fffffe0079eefc20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
8 fffffe58dcaa1000 1f 1 0 -1 no no t-0 fffffe0079f75c20 (idle)
| |
RUNNING <--+ +--> PRI THREAD PROC
READY 60 fffffe007aa79c20 sched
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
9 fffffe58dca9d000 1f 0 0 -1 no no t-0 fffffe0079ffbc20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
10 fffffe58dca99000 1f 0 0 -1 no no t-42 fffffe007a081c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
11 fffffe58dcd8d000 1f 0 0 -1 no no t-5 fffffe007a107c20 (idle)
|
RUNNING <--+
READY
QUIESCED
EXISTS
ENABLE
fffffe5970ec9d38::whatis
fffffe5970ec9d38 is allocated from kmem_alloc_48
fffffe5970ec9d38::print abd_t
{
abd_flags = 0x2 (ABD_FLAG_OWNER)
abd_size = 0x2000
abd_parent = 0
abd_children = {
rc_count = 0
}
abd_u = {
abd_scatter = {
abd_offset = 0
abd_chunk_size = 0x1000
abd_chunks = [ ... ]
}
abd_linear = {
abd_buf = 0x100000000000
}
}
}
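The panic message also gives the arc_buf_hdr_t address (hdr=fffffe5970f343b0), so the same kind of inspection can be applied to the header itself; a sketch run against the same dump, with the b_l1hdr.b_pabd field name taken from the arc.c excerpt quoted below:

# dump the whole ARC buffer header named in the panic message
echo 'fffffe5970f343b0::print arc_buf_hdr_t' | mdb 2
# and just the pointer to the (compressed) data buffer it owns
echo 'fffffe5970f343b0::print arc_buf_hdr_t b_l1hdr.b_pabd' | mdb 2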
Looks like I am again having ZFS issues...
The error message I'm seeing comes from this bit of code:
https://github.com/openzfs/openzfs/blob/master/usr/src/uts/common/fs/zfs/arc.c
specifically static void arc_hdr_verify_checksum(spa_t *spa, arc_buf_hdr_t *hdr, const blkptr_t *bp), which carries the (helpful?) comment: "this should be changed to return an error, and callers re-read from disk on failure (on nondebug bits)".
The question, then, is: why am I getting there at all?
Looking at the code in question, I seem to end up with abd != NULL, which, if my C is not too rusty, is only the case if I use either encryption (which I don't) or compression (which I do):
int err = 0;
abd_t *abd = NULL;

if (BP_IS_ENCRYPTED(bp)) {
        if (HDR_HAS_RABD(hdr)) {
                abd = hdr->b_crypt_hdr.b_rabd;
        }
} else if (HDR_COMPRESSION_ENABLED(hdr)) {
        abd = hdr->b_l1hdr.b_pabd;
}

if (abd != NULL) {
if (abd != NULL) {
So would a workaround be to stop using compression? Or am I suffering from some underlying issue, and would I just be moving the problem somewhere else?
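Before experimenting, it is worth seeing what the datasets are actually using, and being clear about what the workaround would (and would not) do; a sketch, using the vmdg pool and the io0 zvol from the listings below:

# which compression setting each dataset/zvol in the pool really has
zfs get -r compression vmdg
# the workaround being asked about: this only affects newly written blocks;
# existing blocks stay compressed on disk, so at best it masks the problem
zfs set compression=off vmdg/io0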
Here's some other data from the system around its configuration, in case that helps. Basically, the disks for the various VMs reside on the zpool vmdg. There is one Windows, one Linux and two Solaris VMs. Yes, I know all about images and zones, but these are "special" in various different ways, so they need to be VMs:
***@***.*** $ zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
vmdg 952G 250G 702G - - 11% 26% 1.00x ONLINE -
zones 222G 20.3G 202G - - 0% 9% 1.00x ONLINE -
***@***.*** $ zfs list
NAME USED AVAIL REFER MOUNTPOINT
vmdg 621G 301G 24K /export/vm
vmdg/galatea0 63.1G 343G 7.05G -
vmdg/gaspra0 58.3G 343G 13.5G -
vmdg/io0 452G 549G 105G -
vmdg/nereid0 47.2G 343G 5.70G -
zones 85.3G 130G 538K /zones
zones/789a6253-c226-ed36-cf15-c5b0c5e79f23 193K 10.0G 49K /zones/789a6253-c226-ed36-cf15-c5b0c5e79f23
zones/89d9e83c-437b-edf5-db03-b0004eb2b273 128K 10.0G 48K /zones/89d9e83c-437b-edf5-db03-b0004eb2b273
zones/a0d804ec-5f92-c97c-af8a-810d656b1e93 126K 10.0G 48K /zones/a0d804ec-5f92-c97c-af8a-810d656b1e93
zones/archive 50K 130G 24K /zones/archive
zones/config 272K 130G 59K legacy
zones/cores 144K 130G 24K none
zones/cores/789a6253-c226-ed36-cf15-c5b0c5e79f23 24K 100G 24K /zones/789a6253-c226-ed36-cf15-c5b0c5e79f23/cores
zones/cores/89d9e83c-437b-edf5-db03-b0004eb2b273 24K 100G 24K /zones/89d9e83c-437b-edf5-db03-b0004eb2b273/cores
zones/cores/a0d804ec-5f92-c97c-af8a-810d656b1e93 24K 100G 24K /zones/a0d804ec-5f92-c97c-af8a-810d656b1e93/cores
zones/cores/e67a4093-4a30-49dd-a75a-8aead207180e 24K 100G 24K /zones/e67a4093-4a30-49dd-a75a-8aead207180e/cores
zones/cores/global 24K 10.0G 24K /zones/global/cores
zones/dump 2.56G 130G 2.56G -
zones/e67a4093-4a30-49dd-a75a-8aead207180e 248K 10.0G 50.5K /zones/e67a4093-4a30-49dd-a75a-8aead207180e
zones/opt 384K 130G 238K legacy
zones/swap 65.7G 195G 738M -
zones/usbkey 58.5K 130G 32.5K legacy
zones/var 17.0G 130G 14.5G legacy
***@***.*** $ su -
Password:
- SmartOS (build: 20190731T235744Z)
System is running the following applications:
vm server
DISPLAY=192.168.140.92:0.0
***@***.*** # vmadm list
UUID TYPE RAM STATE ALIAS
789a6253-c226-ed36-cf15-c5b0c5e79f23 KVM 4096 running gaspra
89d9e83c-437b-edf5-db03-b0004eb2b273 KVM 4096 running galatea
a0d804ec-5f92-c97c-af8a-810d656b1e93 KVM 4096 running nereid
e67a4093-4a30-49dd-a75a-8aead207180e KVM 4096 running io
***@***.*** #
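Given that the panic fires when cached data fails to match its block pointer's checksum, a scrub would show whether the same blocks are also bad on disk or whether the corruption exists only in memory; a sketch, with both pool names taken from the zpool list above:

# re-read and verify every allocated block against its checksum
zpool scrub vmdg
zpool scrub zones
# once the scrubs complete, look for CKSUM errors and affected files
zpool status -v vmdg
zpool status -v zones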
The motherboard is an ASRock Z370 Extreme4 and does not support ECC RAM. I just ran 12 hours' worth of memtest86 tests overnight without any errors.
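memtest86 only exercises the RAM offline; on the running system, FMA keeps its own log of whatever hardware errors the platform reports. With non-ECC DIMMs it will not see memory bit flips, but it can still rule other hardware in or out (a sketch, run from the global zone):

# anything FMA has already diagnosed as faulty
fmadm faulty
# the raw error-report log (CPU, PCIe, disk, etc.); fmdump -eV shows full detail
fmdump -e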
I'm getting something similar. I recently set up a new CN, an Intel NUC 10. When I try to migrate a VM from the head node to the CN, the CN panics and reboots. When the CN comes back online, I see the following message in mail:
I ran
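Assuming the CN's panic left a crash dump behind, the same triage steps used earlier in this thread should apply on the new machine (N stands for whichever dump number savecore reports):

cd /var/crash/volatile
# extract the dump from the vmdump file
savecore -f vmdump.N
# summarise the panic: status, stack and recent console messages
printf '::status\n::stack\n::msgbuf\n' | mdb N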
Closing since this is a hardware issue.
Re-opening because this is a more generalized issue than a single server's hardware problems.