Confusion about mkfs.xfs and log stripe size being too big

Recently I bought some new disks, put them into my computer, and built a RAID5 on these 3x 4 TB disks. Creating a physical volume (PV) with pvcreate, a volume group (VG) with vgcreate and some logical volumes (LV) with lvcreate was as easy and familiar as creating an XFS filesystem on the LVs… but something was strange! I had never seen this message before when creating XFS filesystems with mkfs.xfs:

log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB

Usually I don’t mess around with the parameters of mkfs.xfs, because mkfs.xfs is smart enough to find near-optimal parameters for your filesystem. But apparently mkfs.xfs wanted to use a log stripe unit of 512 KiB, although the maximum is 256 KiB. Why? So I started to google and in parallel asked on #xfs@freenode. Eric Sandeen, one of the core developers of XFS, suggested that I report the issue to the mailing list. He had already run into this issue himself, but couldn’t remember the details.

So I collected some more information about my setup and wrote to the XFS ML. Of course I included information about my RAID5 setup:

muaddib:/home/ij# mdadm --detail /dev/md7
        Version : 1.2
  Creation Time : Sun Jun 24 14:58:21 2012
     Raid Level : raid5
     Array Size : 7811261440 (7449.40 GiB 7998.73 GB)
  Used Dev Size : 3905630720 (3724.70 GiB 3999.37 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Tue Jun 26 05:13:03 2012
          State : active, resyncing
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

  Resync Status : 98% complete

           Name : muaddib:7  (local to host muaddib)
           UUID : b56a714c:d193231e:365e6297:2ca61b65
         Events : 16

    Number   Major   Minor   RaidDevice State
       0       8       52        0      active sync   /dev/sdd4
       1       8       68        1      active sync   /dev/sde4
       2       8       84        2      active sync   /dev/sdf4

Apparently, mkfs.xfs takes the chunk size of the RAID5 and wants to use it as its log stripe unit. So that explains why mkfs.xfs wants to use 512 KiB, but why is the chunk size 512 KiB at all? I didn’t mess around with chunk sizes when creating the RAID5 either, and all of my other RAIDs use chunk sizes of 64 KiB. The reason was quickly found: the new RAID5 has a 1.2 format superblock, whereas the older ones have a 0.90 format superblock.

So it seems that at some point the mdadm default for the superblock format used for its metadata was changed. I asked on .de@ircnet and someone answered that this was changed in Debian after the release of Squeeze. Even in Squeeze the 0.90 format superblock was obsolete and had only been kept for backward compatibility. Well, ok. There actually was a change of defaults, which explains why mkfs.xfs now wants to set the log stripe unit to 512 KiB.

But what is the impact of falling back to a 32 KiB log stripe unit? Dave Chinner, another XFS developer, explains:

Best thing in general is to align all log writes to the
underlying stripe unit of the array. That way as multiple frequent
log writes occur, it is guaranteed to form full stripe writes and
basically have no RMW overhead. 32k is chosen by default because
that’s the default log buffer size and hence the typical size of
log writes.

If you increase the log stripe unit, you also increase the minimum
log buffer size that the filesystem supports. The filesystem can
support up to 256k log buffers, and hence the limit on maximum log
stripe alignment.
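The behaviour I ran into can be summed up in a few lines of Python. This is only a sketch of what the mkfs.xfs messages above show, not the actual mkfs.xfs source: the function name and constants are made up for illustration, and the pass-through branch for stripe units up to 256 KiB is my assumption based on Dave's explanation. The real tool checks more conditions than this.

```python
KIB = 1024

# Values taken from the mkfs.xfs messages and Dave's explanation above.
MAX_LOG_STRIPE_UNIT = 256 * KIB      # "maximum is 256KiB"
FALLBACK_LOG_STRIPE_UNIT = 32 * KIB  # "log stripe unit adjusted to 32KiB"


def pick_log_stripe_unit(data_stripe_unit: int) -> int:
    """Return the log stripe unit in bytes for a data stripe unit in bytes.

    Hypothetical helper: it models only the fallback seen in the warning.
    """
    if data_stripe_unit > MAX_LOG_STRIPE_UNIT:
        # Too large for the log: fall back to the default log buffer size.
        return FALLBACK_LOG_STRIPE_UNIT
    # Assumption: smaller stripe units are used for the log unchanged.
    return data_stripe_unit


print(pick_log_stripe_unit(512 * KIB))  # 32768, my new RAID5
print(pick_log_stripe_unit(64 * KIB))   # 65536, my older 64 KiB arrays
```
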

And in another mail, when asked whether the 256 KiB limit could be raised to 512 KiB because mdadm now defaults to 512 KiB chunks:

You can’t, simple as that. The maximum supported is 256k. As it is,
a default chunk size of 512k is probably harmful to most workloads –
large chunk sizes mean that just about every write will trigger a
RMW cycle in the RAID because it is pretty much impossible to issue
full stripe writes. Writeback doesn’t do any alignment of IO (the
generic page cache writeback path is the problem here), so we will
almost always be doing unaligned IO to the RAID, and there will be
little opportunity for sequential IOs to merge and form full stripe
writes (24 disks @ 512k each on RAID6 is a 11MB full stripe write).

IOWs, every time you do a small isolated write, the MD RAID volume
will do a RMW cycle, reading 11MB and writing 12MB of data to disk.
Given that most workloads are not doing lots and lots of large
sequential writes this is, IMO, a pretty bad default given typical
RAID5/6 volume configurations we see….
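To make the numbers in Dave's mail concrete, here is a small Python sketch of the stripe arithmetic. The helper name is made up for illustration, and I read his "MB" as MiB (1024 × 1024 bytes), which is what the figures work out to.

```python
KIB = 1024
MIB = 1024 * KIB


def stripe_sizes(total_disks, parity_disks, chunk_bytes):
    """Return (data bytes per full stripe, bytes per stripe including parity)."""
    data_disks = total_disks - parity_disks
    return data_disks * chunk_bytes, total_disks * chunk_bytes


# Dave's example: 24 disks, RAID6 (2 parity chunks per stripe), 512 KiB chunks
data, total = stripe_sizes(24, 2, 512 * KIB)
print(data / MIB, total / MIB)  # 11.0 12.0

# My array: 3 disks, RAID5 (1 parity chunk per stripe), 512 KiB chunks
data, total = stripe_sizes(3, 1, 512 * KIB)
print(data / MIB, total / MIB)  # 1.0 1.5
```

With the old 64 KiB chunks, the same 3-disk RAID5 would have a full data stripe of only 128 KiB, which is much easier to fill with ordinary writes.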

So, reducing the log stripe size is in fact a good thing[TM]. Anyone who would benefit from larger log stripe sizes will be knowledgeable enough to play around with mkfs.xfs parameters and tune them to the needs of the workload.

Eric Sandeen suggested, though, removing the warning from mkfs.xfs. Dave objects, and maybe a good compromise would be to extend the warning with a URL for a FAQ entry that explains this issue in more depth than a short warning can?

Maybe someone else is facing the same issue, searches for information, and finds this blog entry helpful in the meantime…


3 thoughts on “Confusion about mkfs.xfs and log stripe size being too big”

  1. Unaligned writes on Linux md raid
    Linux md is smart enough to know it doesn’t have to touch the other sectors of a stripe for RMW, other than those that were effectively modified. It will operate as if the stripe had 4KiB (page size) granularity. I.e., it will not RMW any IO that is a multiple of 4KiB in size. It will just write to all spindles (data + parity), no reads at all.

    Hardware RAID is often not this smart.

  2. mkfs.xfs -l size=128m
    I have seen the same error when formatting a 50TB LVM striped logical volume. Thank you for the write-up.
    # mkfs.xfs -l size=128m -L Paperback00 /dev/VGPaperback00/LVPaperback00
    log stripe unit (1048576 bytes) is too large (maximum is 256KiB)
    log stripe unit adjusted to 32KiB

  3. Thanks for this post
    Thanks for this post. It explains enough to keep using xfs on a raid 6 system with 40 1TB disks. Without your post I would most likely have taken another fs.

