Xen randomly crashing server – part 2

Some weeks ago I blogged about “Xen randomly crashing server”. The problem back then was that I couldn’t get any information about why the server was rebooting. Using netconsole was not possible, because netconsole refused to work with the bridge that is used for Xen networking. Luckily my colocation partner rrbone.net connected the second network port of my server to the network, so that I could use eth1 instead of the bridged eth0 for netconsole.
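
For anyone who wants to try the same thing without a syslog daemon on the receiving side: netconsole sends kernel messages as plain UDP datagrams, by default to port 6666. The following is only a minimal sketch of a receiver, not what I actually run. The function names `format_message` and `receive` are made up for this example, and the timestamp format differs from the syslog-style lines shown in this post.

```python
import socket
from datetime import datetime


def format_message(payload: bytes, sender: str, when: datetime) -> str:
    """Turn one netconsole datagram into a single log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{when.isoformat(timespec='seconds')} {sender} {text}"


def receive(port: int = 6666, bind_addr: str = "0.0.0.0"):
    """Yield one formatted line per datagram; blocks forever (assumed UDP/IPv4)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        while True:
            payload, (sender, _port) = sock.recvfrom(65535)
            yield format_message(payload, sender, datetime.now())
```

Looping over `receive()` and printing each line to a file should be enough to keep the last messages before a crash, as long as the receiver runs on a different machine than the one that panics.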

Today the server crashed several times, and I was able to collect more information than just the IPMI/KVM console screenshots shown in my last blog entry (the full netconsole output is attached as a file):

May 12 11:56:39 31.172.31.251 [829681.040596] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-4-amd64 #1 Debian 3.16.7-ckt25-2
May 12 11:56:39 31.172.31.251 [829681.040647] Hardware name: Supermicro X9SRE/X9SRE-3F/X9SRi/X9SRi-3F/X9SRE/X9SRE-3F/X9SRi/X9SRi-3F, BIOS 3.0a 01/03/2014
May 12 11:56:39 31.172.31.251 [829681.040701] task: ffffffff8181a460 ti: ffffffff81800000 task.ti: ffffffff81800000
May 12 11:56:39 31.172.31.251 [829681.040749] RIP: e030:[<ffffffff812b7e56>]
May 12 11:56:39 31.172.31.251  [<ffffffff812b7e56>] memcpy+0x6/0x110
May 12 11:56:39 31.172.31.251 [829681.040802] RSP: e02b:ffff880280e03a58  EFLAGS: 00010286
May 12 11:56:39 31.172.31.251 [829681.040834] RAX: ffff88026eec9070 RBX: ffff88023c8f6b00 RCX: 00000000000000ee
May 12 11:56:39 31.172.31.251 [829681.040880] RDX: 00000000000004a0 RSI: ffff88006cd1f000 RDI: ffff88026eec9422
May 12 11:56:39 31.172.31.251 [829681.040927] RBP: ffff880280e03b38 R08: 00000000000006c0 R09: ffff88026eec9062
May 12 11:56:39 31.172.31.251 [829681.040973] R10: 0100000000000000 R11: 00000000af9a2116 R12: ffff88023f440d00
May 12 11:56:39 31.172.31.251 [829681.041020] R13: ffff88006cd1ec66 R14: ffff88025dcf1cc0 R15: 00000000000004a8
May 12 11:56:39 31.172.31.251 [829681.041075] FS:  0000000000000000(0000) GS:ffff880280e00000(0000) knlGS:ffff880280e00000
May 12 11:56:39 31.172.31.251 [829681.041124] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
May 12 11:56:39 31.172.31.251 [829681.041153] CR2: ffff88006cd1f000 CR3: 0000000271ae8000 CR4: 0000000000042660
May 12 11:56:39 31.172.31.251 [829681.041202] Stack:
May 12 11:56:39 31.172.31.251 [829681.041225]  ffffffff814d38ff
May 12 11:56:39 31.172.31.251  ffff88025b5fa400
May 12 11:56:39 31.172.31.251  ffff880280e03aa8
May 12 11:56:39 31.172.31.251  9401294600a7012a
May 12 11:56:39 31.172.31.251 
May 12 11:56:39 31.172.31.251 [829681.041287]  0100000000000000
May 12 11:56:39 31.172.31.251  ffffffff814a000a
May 12 11:56:39 31.172.31.251  000000008181a460
May 12 11:56:39 31.172.31.251  00000000000080fe
May 12 11:56:39 31.172.31.251 
May 12 11:56:39 31.172.31.251 [829681.041346]  1ad902feff7ac40e
May 12 11:56:39 31.172.31.251  ffff88006c5fd980
May 12 11:56:39 31.172.31.251  ffff224afc3e1600
May 12 11:56:39 31.172.31.251  ffff88023f440d00
May 12 11:56:39 31.172.31.251 
May 12 11:56:39 31.172.31.251 [829681.041407] Call Trace:
May 12 11:56:39 31.172.31.251 [829681.041435]  <IRQ>
May 12 11:56:39 31.172.31.251 
May 12 11:56:39 31.172.31.251 [829681.041441]
May 12 11:56:39 31.172.31.251  [<ffffffff814d38ff>] ? ndisc_send_redirect+0x3bf/0x410
May 12 11:56:39 31.172.31.251 [829681.041506]  [<ffffffff814a000a>] ? ipmr_device_event+0x7a/0xd0
May 12 11:56:39 31.172.31.251 [829681.041548]  [<ffffffff814bc74c>] ? ip6_forward+0x71c/0x850
May 12 11:56:39 31.172.31.251 [829681.041585]  [<ffffffff814c9e54>] ? ip6_route_input+0xa4/0xd0
May 12 11:56:39 31.172.31.251 [829681.041621]  [<ffffffff8141f1a3>] ? __netif_receive_skb_core+0x543/0x750
May 12 11:56:39 31.172.31.251 [829681.041729]  [<ffffffff8141f42f>] ? netif_receive_skb_internal+0x1f/0x80
May 12 11:56:39 31.172.31.251 [829681.041771]  [<ffffffffa0585eb2>] ? br_handle_frame_finish+0x1c2/0x3c0 [bridge]
May 12 11:56:39 31.172.31.251 [829681.041821]  [<ffffffffa058c757>] ? br_nf_pre_routing_finish_ipv6+0xc7/0x160 [bridge]
May 12 11:56:39 31.172.31.251 [829681.041872]  [<ffffffffa058d0e2>] ? br_nf_pre_routing+0x562/0x630 [bridge]
May 12 11:56:39 31.172.31.251 [829681.041907]  [<ffffffffa0585cf0>] ? br_handle_local_finish+0x80/0x80 [bridge]
May 12 11:56:39 31.172.31.251 [829681.041955]  [<ffffffff8144fb65>] ? nf_iterate+0x65/0xa0
May 12 11:56:39 31.172.31.251 [829681.041987]  [<ffffffffa0585cf0>] ? br_handle_local_finish+0x80/0x80 [bridge]
May 12 11:56:39 31.172.31.251 [829681.042035]  [<ffffffff8144fc16>] ? nf_hook_slow+0x76/0x130
May 12 11:56:39 31.172.31.251 [829681.042067]  [<ffffffffa0585cf0>] ? br_handle_local_finish+0x80/0x80 [bridge]
May 12 11:56:39 31.172.31.251 [829681.042116]  [<ffffffffa0586220>] ? br_handle_frame+0x170/0x240 [bridge]
May 12 11:56:39 31.172.31.251 [829681.042148]  [<ffffffff8141ee24>] ? __netif_receive_skb_core+0x1c4/0x750
May 12 11:56:39 31.172.31.251 [829681.042185]  [<ffffffff81009f9c>] ? xen_clocksource_get_cycles+0x1c/0x20
May 12 11:56:39 31.172.31.251 [829681.042217]  [<ffffffff8141f42f>] ? netif_receive_skb_internal+0x1f/0x80
May 12 11:56:39 31.172.31.251 [829681.042251]  [<ffffffffa063f50f>] ? xenvif_tx_action+0x49f/0x920 [xen_netback]
May 12 11:56:39 31.172.31.251 [829681.042299]  [<ffffffffa06422f8>] ? xenvif_poll+0x28/0x70 [xen_netback]
May 12 11:56:39 31.172.31.251 [829681.042331]  [<ffffffff8141f7b0>] ? net_rx_action+0x140/0x240
May 12 11:56:39 31.172.31.251 [829681.042367]  [<ffffffff8106c6a1>] ? __do_softirq+0xf1/0x290
May 12 11:56:39 31.172.31.251 [829681.042397]  [<ffffffff8106ca75>] ? irq_exit+0x95/0xa0
May 12 11:56:39 31.172.31.251 [829681.042432]  [<ffffffff8135a285>] ? xen_evtchn_do_upcall+0x35/0x50
May 12 11:56:39 31.172.31.251 [829681.042469]  [<ffffffff8151669e>] ? xen_do_hypervisor_callback+0x1e/0x30
May 12 11:56:39 31.172.31.251 [829681.042499]  <EOI>
May 12 11:56:39 31.172.31.251 
May 12 11:56:39 31.172.31.251 [829681.042506]
May 12 11:56:39 31.172.31.251  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
May 12 11:56:39 31.172.31.251 [829681.042561]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
May 12 11:56:39 31.172.31.251 [829681.042592]  [<ffffffff81009e7c>] ? xen_safe_halt+0xc/0x20
May 12 11:56:39 31.172.31.251 [829681.042627]  [<ffffffff8101c8c9>] ? default_idle+0x19/0xb0
May 12 11:56:39 31.172.31.251 [829681.042666]  [<ffffffff810a83e0>] ? cpu_startup_entry+0x340/0x400
May 12 11:56:39 31.172.31.251 [829681.042705]  [<ffffffff81903076>] ? start_kernel+0x497/0x4a2
May 12 11:56:39 31.172.31.251 [829681.042735]  [<ffffffff81902a04>] ? set_init_arg+0x4e/0x4e
May 12 11:56:39 31.172.31.251 [829681.042767]  [<ffffffff81904f69>] ? xen_start_kernel+0x569/0x573
May 12 11:56:39 31.172.31.251 [829681.042797] Code:
May 12 11:56:39 31.172.31.251  <f3>
May 12 11:56:39 31.172.31.251 
May 12 11:56:39 31.172.31.251 [829681.043113] RIP
May 12 11:56:39 31.172.31.251  [<ffffffff812b7e56>] memcpy+0x6/0x110
May 12 11:56:39 31.172.31.251 [829681.043145]  RSP <ffff880280e03a58>
May 12 11:56:39 31.172.31.251 [829681.043170] CR2: ffff88006cd1f000
May 12 11:56:39 31.172.31.251 [829681.043488] ---[ end trace 1838cb62fe32daad ]---
May 12 11:56:39 31.172.31.251 [829681.048905] Kernel panic - not syncing: Fatal exception in interrupt
May 12 11:56:39 31.172.31.251 [829681.048978] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
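
The output above is a bit hard to read because continuation fragments ended up on their own lines. As a rough sketch (the helper names `rejoin` and `call_trace` are my own, and the prefix pattern assumes exactly the "month day time address" layout seen above), the fragments can be glued back together and the call trace reduced to a list of symbols like this:

```python
import re

# Prefix added by the log receiver, e.g. "May 12 11:56:39 31.172.31.251 "
# (assumed layout: month, day, time, sender address).
SYSLOG_PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+ ?")
# Kernel timestamp that starts a new record, e.g. "[829681.041506]".
KERNEL_STAMP = re.compile(r"^\[\s*\d+\.\d+\]")
# One frame, e.g. "[<ffffffffa0586220>] ? br_handle_frame+0x170/0x240 [bridge]".
FRAME = re.compile(
    r"\[<[0-9a-f]{16}>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def rejoin(lines):
    """Glue fragments without a kernel timestamp onto the previous record."""
    records = []
    for line in lines:
        body = SYSLOG_PREFIX.sub("", line.rstrip("\n"))
        if KERNEL_STAMP.match(body) or not records:
            records.append(body)
        else:
            records[-1] += body
    return records


def call_trace(records):
    """Return (symbol, module, uncertain) for each frame after "Call Trace:"."""
    frames = []
    in_trace = False
    for record in records:
        if "Call Trace:" in record:
            in_trace = True
        elif in_trace and "Code:" in record:
            break
        elif in_trace:
            match = FRAME.search(record)
            if match:
                uncertain, symbol, module = match.groups()
                frames.append((symbol, module, uncertain is not None))
    return frames
```

Calling `call_trace(rejoin(open("netconsole.log")))` on the attached file (the file name here is just a placeholder) gives one tuple per frame. The last element is `True` for frames printed with a "?", which the kernel uses for entries it found on the stack but could not verify as part of the actual call chain.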

I’m not that good at reading this kind of output, but to me it seems that ndisc_send_redirect is at fault: it is the first entry in the call trace, and the kernel faulted in a memcpy. When googling for “ndisc_send_redirect” you can find a patch on lkml.org and Debian bug #804079, both of which seem to be related to IPv6.
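
One detail in the register dump above might help whoever looks into this: CR2 and RSI both hold ffff88006cd1f000. On x86-64, CR2 contains the address that triggered a page fault, and memcpy reads its source data through RSI, so this looks like memcpy reading from an unmapped page in its source buffer. That is only my reading of the dump, not a confirmed diagnosis. A small sketch to spot such matches automatically (the names `registers` and `faulting_registers` are invented for this example):

```python
import re

# Matches "RAX: ffff88026eec9070" style pairs plus CR2; RIP and RSP are
# skipped on purpose because they carry a segment prefix in this dump.
REGISTER = re.compile(r"\b(R[A-Z0-9]{1,2}|CR2):\s*([0-9a-f]{16})\b")


def registers(text):
    """Map register name to value, keeping the first occurrence of each."""
    regs = {}
    for name, value in REGISTER.findall(text):
        regs.setdefault(name, int(value, 16))
    return regs


def faulting_registers(text):
    """Names of general-purpose registers whose value equals CR2."""
    regs = registers(text)
    cr2 = regs.get("CR2")
    if cr2 is None:
        return []
    return sorted(name for name, value in regs.items()
                  if name != "CR2" and value == cr2)
```

Feeding the whole oops text into `faulting_registers` returns the registers that point exactly at the faulting address, which is a quick hint about whether the read side or the write side of a copy went wrong.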

When looking at the Linux kernel source mentioned in the lkml patch, I see that this patch is already applied (line 1510):

        if (ha) 
                ndisc_fill_addr_option(buff, ND_OPT_TARGET_LL_ADDR, ha);

So although the patch was meant to fix a problem described as “leading to data corruption or in the worst case a panic when the skb_put failed”, it does not help in my case or in the case of #804079.

Any tips are appreciated!

PS: I’ll contribute to that bug in the BTS, of course!

2 thoughts on “Xen randomly crashing server – part 2”

  1. same here
    Hi Ingo, we also suffer from this bug from time to time.
    in august of 2014, we ran into it multiple times per week running some 3.15 kernel with xen 4.4.
    we tried several versions and somehow it suddenly stopped after a crash and NO kernel change.
    the server then ran for over a year, we rebooted regularly and a few weeks ago, running debian stable kernel 3.16 and xen 4.4.4, the crash happened again when doing an apt-upgrade in a debian oldstable vps (kernel 3.2, not sure if related).
    today, the exact same thing happened again…
    i upgraded the vps to 3.16 from backports, maybe that helps.
    but even if it is related, the guest shouldn’t crash the dom0 kernel…

    1. Hi Andreas!

      I have tested even the kernel from unstable, but still the server crashes from time to time. Apparently it’s crashing because of some IPv6 routine, so deactivating IPv6 helps. But really: this cannot be the solution for this problem. Please contribute to Debian Bug #804079…
