January 2019

Too much disk IO on sda in RAID10 setup - part 2

Some days ago I blogged about my issue with one of the disks in my server having high utilization and latency. There were several ideas and guesses about what the reason might be, but I think I found the root cause today:

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     50113         60067424
# 2  Short offline       Completed without error       00%      3346   
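
This self-test log can be printed at any time with smartctl (assuming smartmontools is installed):

# smartctl -l selftest /dev/sda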

After the RAID-Sync-Sunday last weekend, I removed sda from my RAID arrays today and started a "smartctl -t long /dev/sda". That test aborted quickly because it ran into a read error after just a few minutes. Currently I'm still running a "badblocks -w" test; this is the result so far (a rough sketch of the commands follows below the output):

# badblocks -s -c 65536 -w /dev/sda
Testing with pattern 0xaa: done
Reading and comparing: 42731372 done, 4:57:27 elapsed. (0/0/0 errors)
42731373 done, 4:57:30 elapsed. (1/0/0 errors)
42731374 done, 4:57:33 elapsed. (2/0/0 errors)
42731375 done, 4:57:36 elapsed. (3/0/0 errors)
 46.82% done, 6:44:41 elapsed. (4/0/0 errors)
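
For reference, the removal and self-test steps described above look roughly like this; the md device and partition names are only placeholders, not necessarily my actual layout:

# mdadm /dev/md0 --fail /dev/sda1 --remove /dev/sda1
# smartctl -t long /dev/sda
# smartctl -l selftest /dev/sda
# badblocks -s -c 65536 -w /dev/sda

Note that badblocks -w is a destructive write test, so it must only be run on a disk that no longer holds any data.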

Long story short: I already ordered a replacement disk!

But what's also interesting is this:

I removed the disk today at approx. 12:00 and you can see the immediate effect on the other disks/LVs (the tall blue graph for sda shows the badblocks test), even though the RAID10 is now running in degraded mode. It is interesting what effect (currently) 4 defective blocks can have on RAID10 performance without smartctl taking notice of them. Smartctl only reported an issue after I issued the self-test. It's also strange that the latency and high utilization increased slowly over time, over roughly the last 6 months.
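
A quick way to see whether the disk itself has started to remap or flag sectors is to check the usual SMART attributes; a minimal check, assuming smartmontools:

# smartctl -A /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

In my case smartctl only flagged the problem after the self-test, so watching these counters alone would apparently not have been enough.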

Too much disk IO on sda in RAID10 setup

I have a RAID10 setup with 4x 2 TB WD Red disks in my server. Although the setup works fairly well and has enough throughput, there is one strange issue: /dev/sda has more utilization/load than the other 3 disks. See the blue line in the following graph, which represents the weekly utilization of sda:

As you can see from the graphs and from the numbers below, sda has a 2-3 times higher utilization than sdb, sdc or sdd, especially when looking at the disk latency graph from Munin:

Although the graphs are a little confusing, you can easily spot the big difference from the values below. And it's not only Munin showing this strange behaviour of sda, but also atop:

Here you can see that sda is 94% busy although the writes to it are slightly lower than on the other disks. The screenshot of atop was taken before I moved MySQL/MariaDB to my NVMe disk 4 weeks ago. You can also spot that sda is slowing down the whole RAID10.
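
Independent of Munin and atop, per-disk utilization and latency can also be compared directly with iostat from the sysstat package; a minimal sketch (column names vary slightly between sysstat versions):

# iostat -dx 5

The %util and await columns make it easy to see whether one disk is consistently busier or slower than its siblings.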

So the big question is: why are the utilization and latency of sda that high? It's the same disk model as the other disks. All disks are connected to a Supermicro X9SRi-F mainboard. The first two SATA ports are 6 Gbit/s ports, the other 4 are 3 Gbit/s ports:

SMART information for sda, sdb, sdc and sdd (in that order):
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC301414725
LU WWN Device Id: 5 0014ee 65887fe2c
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jan  5 17:24:58 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC301414372
LU WWN Device Id: 5 0014ee 65887f374
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jan  5 17:27:01 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC301411045
LU WWN Device Id: 5 0014ee 603329112
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jan  5 17:30:15 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD20EFRX-68AX9N0
Serial Number:    WD-WMC301391344
LU WWN Device Id: 5 0014ee 60332a8aa
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jan  5 17:30:26 2019 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

The disks even run the same firmware version. If anything, I would have expected slower disk IO from the 3 Gbit/s disks (sdc & sdd), but not from sda. All disks are configured in the BIOS to use AHCI.
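
The negotiated link speed can also be double-checked from the running system; two possible ways (the sysfs path may differ between kernel versions):

# dmesg | grep -i 'SATA link up'
# cat /sys/class/ata_link/link*/sata_spd

Both should match the "current" values reported by smartctl above, i.e. 6.0 Gb/s for sda and sdb, and 3.0 Gb/s for sdc and sdd.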

I cannot explain why sda has higher latency and more utilization than the other disks. Any ideas are welcome and appreciated. You can also reach me in the Fediverse (Friendica: https://nerdica.net/profile/ij & Mastodon: https://nerdculture.de/) or via XMPP at ij@nerdica.net.
