
March 2016

Xen randomly crashing server

It's a long story... an odyssey of almost two years...

But to start from the beginning: back then I rented a server at Hetzner, until they decided to bill for every IP address you got from them. I had been given a /26 in the past, so I would have had to pay for every IP address of that subnet in addition to the server rent of 79.- EUR/month. That would have meant nearly doubling the monthly costs. So I moved my server from Hetzner to rrbone Net, which offered me a /26 on a rented Cisco C200 M2 server for a competitive price.

After migrating the VMs from Hetzner to rrbone, with the same setup that had been running just fine at Hetzner, I experienced spontaneous reboots of the server, sometimes several times per day and within a short time frame. The hosting provider was very, very helpful in debugging this, for example by exchanging the memory and setting up a remote logging service for the CIMC. But in the end we found no root cause. The CIMC logs only showed that the OS was rebooting the machine.

Anyway, I then bought my own server and replaced the Cisco C200 with my own hardware, but the reboots still happen as before. Sometimes the server runs for weeks, sometimes it crashes 4-6 times a day. Usually there is a pattern: when it crashes and reboots, it will do so again within a few hours, and after the second reboot chances are high that the server will run for several days, or even weeks, without a reboot.

The strange thing is that there are absolutely no hints in the logs, neither in syslog nor in the Xen logs, so I assume it's something quite deep in the kernel that causes the reboot. Another hint is that the reboots fairly often happened when I used my Squid proxy on one of the VMs to access the net. For example, I connect via SSH with port forwarding to one VM while the proxy runs on another VM, which leads to network traffic between the VMs. Sometimes the server crashed on the very first proxy requests. So I replaced Squid with tinyproxy and other proxies, and moved the proxy onto the VM I connect to via SSH, because I thought that the inter-VM traffic might be causing the reboots. Moving the proxy to another virtual server, which I rent at another hosting provider to host my secondary nameserver, did help a little bit, but I have no hard proof or statistics for that, just my impression.

I moved from the xm toolstack to the xl toolstack as well, but that didn't help either. The reboots are still happening, and in the last few days very frequently. Even in the new server I exchanged the memory and used memory mirroring as well, because I thought it might be a faulty memory module or something, but it is still rebooting out of the blue.

During the last weekend I configured GRUB to add "noreboot" to the command line and then got my first proof that somehow the Xen network stack is causing the reboots:

This is a screenshot of the IPMI console, so it doesn't show the full kernel oops, but as you can see, the parts most likely involved are the bridge, netif, xenvif and the physical igb NIC.

Here's another screenshot of a crash from this night: 

Slightly different information, but the network is still involved somehow, as you can see in the first line (net_rx_action).

So the big question is: is this a bug in Xen or in my setup? I'm using the xl toolstack, and the xl.conf is basically the default, I think:

## Global XL config file ##

# automatically balloon down dom0 when xen doesn't have enough free
# memory to create a domain
autoballoon=0

# full path of the lockfile used by xl during domain creation
#lockfile="/var/lock/xl"

# default vif script
#vif.default.script="vif-bridge"

With this, the default network scripts of the distribution (i.e. Debian stable) should be used. The network setup consists of two bridges:

auto xenbr0
iface xenbr0 inet static
        address 31.172.31.193
        netmask 255.255.255.192
        gateway 31.172.31.254
        bridge_ports eth0
        pre-up brctl addbr xenbr0

auto xenbr1
iface xenbr1 inet static
        address 192.168.254.254
        netmask 255.255.255.0
        pre-up brctl addbr xenbr1

There are some more lines in that config, such as up commands that set up some iptables rules. But as you can see, my eth0 NIC is part of the "main" Xen bridge with all the IP addresses that are reachable from the outside. The second bridge is used for internal networking like database connections.
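As a quick sanity check of the xenbr0 addressing above, here is a minimal Python sketch using only the standard ipaddress module. It just re-derives what the interfaces stanza already states (address, netmask and gateway are copied from it) and makes no claim about the cause of the crashes.

```python
import ipaddress

# Values copied from the xenbr0 stanza above.
iface = ipaddress.ip_interface("31.172.31.193/255.255.255.192")
gateway = ipaddress.ip_address("31.172.31.254")

net = iface.network
print(net)                 # the subnet the bridge address belongs to
print(net.num_addresses)   # total number of addresses in that subnet
print(gateway in net)      # the gateway should be on-link for xenbr0
```

The netmask 255.255.255.192 corresponds to the /26 mentioned at the start of the post, and the gateway lies inside that subnet.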

I would like to use netconsole to capture the full debug output in case of a new crash, but unfortunately this only works until the bridge is brought up:

[    0.000000] Command line: placeholder root=UUID=c3....22 ro debug ignore_loglevel loglevel=7 netconsole=port@31.172.31.193/eth0,514@5.45.x.y/e0:ac:f1:4c:y:x
[   32.565624] netpoll: netconsole: local port $port
[   32.565683] netpoll: netconsole: local IPv4 address 31.172.31.193
[   32.565742] netpoll: netconsole: interface 'eth0'
[   32.565799] netpoll: netconsole: remote port 514
[   32.565855] netpoll: netconsole: remote IPv4 address 5.45.x.y
[   32.565914] netpoll: netconsole: remote ethernet address e0:ac:f1:4c:y:x
[   32.565982] netpoll: netconsole: device eth0 not up yet, forcing it
[   36.126294] netconsole: network logging started
[   49.802600] netconsole: network logging stopped on interface eth0 as it is joining a master device
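For reference, the netconsole= value on the command line above follows the layout src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac. A small Python sketch that assembles such a value could look like this; the helper name netconsole_param and the addresses in the usage example are made up for illustration (documentation ranges), and the port defaults of 6665 and 6666 are the kernel's documented defaults, not something taken from my setup.

```python
def netconsole_param(src_ip, dev, tgt_ip, tgt_port=6666, src_port=6665, tgt_mac=""):
    """Build a netconsole= kernel parameter (hypothetical helper).

    Layout: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
    An empty tgt_mac leaves the target MAC field blank.
    """
    return f"netconsole={src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"


# Example with placeholder addresses, logging to a syslog port as in the post.
print(netconsole_param("192.0.2.10", "eth0", "192.0.2.20",
                       tgt_port=514, tgt_mac="00:00:5e:00:53:01"))
```

This only illustrates the syntax; it does not solve the problem that logging stops as soon as eth0 joins the bridge.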

So, the first question is: how to use netconsole with an interface that is used on a bridge?

The second question is: is a setup with two bridges OK with Xen? I've been using this setup for years now, and it worked fairly well on the Hetzner server too, although there I used the xm toolstack with a mix of bridged and routed setup, because Hetzner didn't like seeing the MAC addresses of the other VMs on the switch and shut the port down when that happened.


Letsencrypt: challenging challenges solved

A few weeks ago, in "Letsencrypt: challenging challenges", I was wondering how to set up Letsencrypt when a domain is spread across several virtual machines (VMs). One possible solution would be to consolidate everything on one single VM, which is nothing I would like to do. The second option would be to generate the Letsencrypt certs on the webserver and copy them over to the appropriate VM, either on a regular basis or event driven. The third option is to use a network share, and this is what I'm using right now.

So, my setup is as follows, after I solved the GlusterFS issue with rpcbind binding to all interfaces although it had been configured to listen only on certain interfaces (the solution was to simply remove all NFS-related stuff):

On Dom0 (the host machine) I run GlusterFS as a server on a small 1 GB LVM volume, as part of a replica with the VM that does the actual Letsencrypt work:

Volume Name: le
Type: Replicate
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 192.168.x.254:/srv/gfs/le
Brick2: 192.168.x.1:/srv/gfs/le

This is to ensure that after a reboot of the machine every other VM using Letsencrypt certs can mount the GlusterFS share: the host machine will be there for sure, whereas the VM generating the certs with the letsencrypt.sh script might still be booting. And when the GlusterFS share is missing, services will of course not start on the other VMs because of the missing certs. So the replica on the virtualization host (Dom0) only acts as a kind of always-available network share, because, well, the other VM will not always be there... for example during a kernel update when a reboot is required.

The same setup is on my mailserver, acting as the second GlusterFS brick of that replica drive. The mailserver hosts the bind9 nameserver as well, and I might set things up so that new domains with Letsencrypt certs get added to my DNSSEC setup too. Of course, when the letsencrypt.sh script creates or updates the certs, it needs the share mounted at the configured location, so I needed to add a line to /etc/fstab:

192.168.x.254:/le /etc/letsencrypt.sh/certs glusterfs noexec,nodev,_netdev 0 0

Basically the same needs to be done on the other VMs where you want to use the certs as well, but you may want to mount the share as read-only there.
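To illustrate the read-only variant for the other VMs, here is a small Python sketch that takes the fstab line from above and adds the generic "ro" mount option. The helper name make_read_only is made up, and I'm assuming that the GlusterFS mount honours "ro" like other filesystems do, so treat it as a sketch rather than a tested recipe.

```python
def make_read_only(fstab_line: str) -> str:
    """Return the fstab entry with 'ro' added to its mount options.

    Hypothetical helper: expects the usual six whitespace-separated
    fstab fields (device, mount point, type, options, dump, pass).
    """
    fields = fstab_line.split()
    if len(fields) != 6:
        raise ValueError("expected six fstab fields")
    # Drop an explicit 'rw' and put 'ro' in front of the other options.
    options = [opt for opt in fields[3].split(",") if opt != "rw"]
    if "ro" not in options:
        options.insert(0, "ro")
    fields[3] = ",".join(options)
    return " ".join(fields)


line = "192.168.x.254:/le /etc/letsencrypt.sh/certs glusterfs noexec,nodev,_netdev 0 0"
print(make_read_only(line))
```

The only difference to the line on the cert-generating VM is the extra "ro" in the options field.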

The next step was a little more tricky. When letsencrypt.sh generates new certs, Letsencrypt will contact the webserver for that domain to respond to the ACME challenge. This requires a webserver on each VM where you want to use letsencrypt. Well, actually it requires at least that somewhere there is a webserver that can answer these requests for that specific domain...

Now, the setup of the webserver (Apache in my case) is like this: 

I'm using the Apache macro module to make it easier, so I created two small configs in /etc/apache2/conf-available and enabled them with a2enconf: letsencrypt-proxy.conf does some setup for proxying the ACME challenges to a common website called acme.example.org, and letsencrypt-sslredir.conf sets up SSL redirection once everything is in place and the domain can be switched over to HTTPS-only.

letsencrypt-proxy.conf: 

<Macro le_proxy>
     ProxyRequests Off
     <Proxy *>
            Order deny,allow
            Allow from all
     </Proxy>
     ProxyPass /.well-known/acme-challenge/ http://acme.windfluechter.net/
     ProxyPassReverse / http://%{HTTP_HOST}/.well-known/acme-challenge/
</Macro>

letsencrypt-sslredir.conf:

<Macro le_sslredir>
    RewriteEngine on
    RewriteCond %{HTTPS} !=on
    RewriteRule . https://%{HTTP_HOST}%{REQUEST_URI}  [L]
</Macro>

So, after all the setup of a virtual host for Apache looks like this: 

<Macro example.org>
(lots of setup stuff)
</Macro>
<VirtualHost 31.172.31.x:443 [2a01:a700:4629:x::1]:443>
    Header always set Strict-Transport-Security "max-age=31556926; includeSubDomains"
    SSLEngine on
    # letsencrypt certs:
    SSLCertificateFile /etc/letsencrypt.sh/certs/example.org/fullchain.pem
    SSLCertificateKeyFile /etc/letsencrypt.sh/certs/example.org/privkey.pem
    SSLHonorCipherOrder On
    Use example.org
    Use le_proxy
</VirtualHost>
<VirtualHost 31.172.31.x:80 [2a01:a700:4629:x::1]:80>
    Use example.org
    Use le_proxy
    Use le_sslredir
</VirtualHost>

le_sslredir is only needed when you are sure that you want all traffic redirected to HTTPS. For example, when your blog is listed on planet.debian.org or other Planets, you might want to omit it from your HTTP config because bug #813313 is not yet solved.

In the end, when creating a new Letsencrypt cert, you need to add the le_proxy macro to your website and add the domain to the letsencrypt.sh config in /etc/letsencrypt.sh/domains.txt. The script will then request a new cert from Letsencrypt, handle the ACME challenge via the le_proxy rule that forwards it to your acme.example.org site, and finally write your new cert to the GlusterFS share. From that share you can then use the new cert on all VMs that need it, be it your mailserver, webserver or XMPP/SIP server VMs.

At least this works for me.

UPDATE:
Of course you should be careful with the file permissions on that GlusterFS share: they must allow the automatic key renewal to work, but must not be so permissive that everyone can obtain your private keys.
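One way to keep an eye on this is a small check that lists private keys which are accessible to "other" users. The following Python sketch assumes the directory layout from the Apache config above (one directory per domain containing a privkey.pem); the function name world_accessible_keys is made up for illustration, and which owner and group the keys should have depends on your services.

```python
import stat
from pathlib import Path


def world_accessible_keys(cert_root):
    """Return privkey.pem files under cert_root with any 'other' permission bit set.

    Hypothetical helper: assumes <cert_root>/<domain>/privkey.pem.
    """
    exposed = []
    for key in Path(cert_root).glob("*/privkey.pem"):
        mode = stat.S_IMODE(key.stat().st_mode)
        if mode & stat.S_IRWXO:  # read, write or execute for "other"
            exposed.append(key)
    return sorted(exposed)


if __name__ == "__main__":
    for key in world_accessible_keys("/etc/letsencrypt.sh/certs"):
        print(f"too permissive: {key}")
```

Run on the VM that mounts the share, it prints nothing when all keys are restricted to owner and group.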
