Outages on Nerdculture.de due to Ceph – Part 2

Last weekend I had “fun” with Ceph again on a Saturday evening. But let’s start at the beginning…

Before the weekend I announced a downtime/maintenance window to upgrade PostgreSQL from v15 to v17 – because of the Debian upgrade from Bookworm to Trixie. After some tests with a cloned VM I decided to use the quick path of pg_upgradecluster 15 main -v 17 -m upgrade --clone. As this was my first time upgrading PostgreSQL that way, I made several backups. In the end everything went smoothly and the database is now on v17.
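For the curious, the whole procedure boils down to a handful of commands (the dump path is just an example, and the final cleanup should of course only happen after verifying the new cluster):

sudo -u postgres pg_dumpall > /var/backups/pg_dumpall-v15.sql   # safety net before touching anything
pg_upgradecluster 15 main -v 17 -m upgrade --clone              # upgrade the v15 “main” cluster to v17
pg_dropcluster 15 main                                          # drop the old cluster once v17 checks out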

However, there was also a new Proxmox kernel and other package updates, so I also upgraded one Proxmox node and rebooted it. And then the issues began:

But before that I also encountered an issue with Redis for Mastodon. It complained about this:

Unable to obtain the AOF file appendonly.aof.4398.base.rdb

The solution was to change the Redis configuration to appendonly no, i.e. to disable AOF persistence.
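That is a one-line change in redis.conf; it can also be flipped at runtime (both are stock Redis, the config path is the Debian default):

# in /etc/redis/redis.conf
appendonly no

# or live, without a restart:
redis-cli CONFIG SET appendonly no

Note that without AOF, Redis only persists data via its periodic RDB snapshots.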

And then CephFS was unavailable again, complaining about a laggy MDS or no MDS at all, which – of course – was totally wrong. I searched for solutions and read many posts in the Proxmox forum, but nothing helped. I also read the official Ceph documentation. After a whole day of all services being offline for my thousands of users, I somehow managed to get CephFS mounted again with systemctl reset-failed mnt-pve-cephfs && systemctl start mnt-pve-cephfs. Shortly before that I had followed the advice in the Ceph docs for RADOS Health, especially the section about Troubleshooting Monitors.
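For the record, these are the kind of checks those docs walk you through (all standard Ceph CLI, run on any node with an admin keyring):

ceph -s                # overall cluster state
ceph health detail     # expanded health warnings
ceph fs status         # MDS state of the CephFS filesystem
ceph quorum_status     # do the monitors still have quorum?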

In the end, I can’t say which step exactly did the trick to get CephFS working again. But as it seems, I will have one or two more chances to find out, because only one server out of three has been upgraded so far.

Another issue during the downtime was that one server crashed/rebooted and didn’t come back. It hung in the middle of the upgrade, at the point of running update-grub. Usually that wouldn’t be a big deal: just go to the IPMI web interface and reboot the server.

Nah! That’s too simple!

For some unknown reason the IPMI interfaces had lost their DHCP leases: the DHCP server at the colocation was not handing out IPs. So I opened a ticket, got an acknowledgement from support, but also the statement “maybe tomorrow or on Monday…”. Hmpf!

On Sunday evening I managed to bring back CephFS. As said: no idea which specific step did the trick. But the story continues: on Monday before lunchtime the IPMI DHCP was working again and I could access the web interfaces again, logged in… and was forcefully locked out again:

Your session has timed out. You will need to open a new session

I hit the problem described here. But cold resetting the BMC didn’t work, so there was still no working web interface to deal with the issue. However, I have the “IPMIView” app on my phone, and that still worked and showed the KVM console. But what I saw there didn’t make me happy either:

The reason for this is apparently the crash while running update-grub. Anyway, using the Grub bootloader and selecting an older kernel works fine. The server boots, Proxmox shows the node as up and… the previously working CephFS is stalled again! Fsck!

Rebooting the node or stopping Ceph on that node immediately results in a working CephFS again.
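“Stopping Ceph” here means stopping all Ceph daemons on that node via their common systemd target:

systemctl stop ceph.target   # stops the mon/mgr/osd/mds services on this node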

Currently I’m moving everything off of Ceph to the local disks of the two nodes. Once everything is on local disks, I can work on debugging CephFS without interrupting the service for the users (hopefully). But it also means that there will be no redundancy for Mastodon and mail in the meantime.
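Moving a VM disk off Ceph can be done live from the Proxmox CLI (VM ID, disk name and target storage below are made-up examples; on older releases the command is called qm move_disk):

qm disk move 101 scsi0 local-lvm --delete 1   # copy the disk to local storage, then drop the Ceph copy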

When I have more detailed information about possible reasons and such, I may post to the Proxmox forum.

