Ceph OSD Flapping: Debugging and Recovery


Throughout our extensive series of articles on Ceph and its various features and functionalities, we’ve reviewed what Ceph is, how to properly build a Ceph cluster for maximum performance, and how Ceph’s backing technologies like CRUSH ensure data redundancy and resiliency.  Understanding how Ceph works and the underlying technologies that power it is important, but what happens when those technologies stop working well together, or stop working altogether?

In this article, we’ll discuss what to do if you find your Ceph OSDs stop peering at random, have issues with scrubbing, or intermittently disconnect from the cluster before rejoining it.  This behavior is referred to as flapping, and it is detrimental to both the performance and the durability of your cluster for a number of reasons.

Symptoms

When one or more OSDs begin flapping, the first thing you’ll probably notice is a discernible slowdown in both read and write speeds.  This happens for several reasons.  While an OSD is down during an active flap, you have effectively lost the combined throughput of the flapping OSD(s).  On a large cluster servicing tens of thousands of IOPS, this can be catastrophic.

Once the flapping OSD(s) return to active service, the first thing the cluster does is attempt recovery.  Recovery is a very expensive process: the underlying blocks that are replicated across multiple OSDs on multiple hosts must be checksum-verified for integrity, and any block whose checksum does not match must be re-replicated.
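
If scrubbing later finds replicas that don’t match, the affected placement groups are flagged as inconsistent and can be repaired once the flapping is resolved.  A minimal sketch, where 2.1f is a hypothetical placement group ID:

    ceph health detail | grep -i inconsistent    # list placement groups flagged as inconsistent
    ceph pg repair 2.1f                          # ask Ceph to repair the inconsistent objects in that PG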

This checksum verification and retransfer process places a very heavy strain on the physical infrastructure behind your Ceph cluster, spiking traffic on the cluster network.  If the cluster is already heavily loaded, this additional, unnecessary load can degrade performance further to the point where the cluster stops responding, or it can crash other daemons and cascade additional OSDs into failure.
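
If recovery traffic is overwhelming an already busy cluster, it can be throttled while you investigate.  A rough sketch using injected settings (the values are conservative examples rather than recommendations, and they revert when the OSDs restart):

    ceph tell osd.* injectargs '--osd_max_backfills 1'          # fewer concurrent backfill operations per OSD
    ceph tell osd.* injectargs '--osd_recovery_max_active 1'    # fewer concurrent recovery operations per OSD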

So how do you review the symptoms of your cluster to see whether a performance issue is related to flapping OSDs?  The first, and easiest, way is simply to check your cluster’s health status.  You can do this with the Ceph Dashboard included in Luminous and later releases, accessible at “http://IP.OF.CEPH.NODE:7000,” or whatever custom port you set up during the initial installation.  You can also view the status of the Ceph cluster from the command line on any ceph-admin enabled host: running “ceph -s” outputs the current health status of the cluster and will alert you if a repair is in progress or an OSD is currently marked down.
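
If “ceph -s” reports a health warning, a few follow-up commands narrow down whether flapping OSDs are the cause.  The grep filter below is just one convenient way to spot down OSDs:

    ceph -s                        # overall health, client and recovery I/O, OSD up/in counts
    ceph health detail             # expands any warnings into per-OSD and per-PG detail
    ceph osd tree | grep -i down   # lists any OSDs currently marked down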

Causes and Resolutions

The causes of OSD flapping are similar to those of OSD failure in general.  Poor hardware health, excessive heat, network issues, and overall system load can all contribute to flapping.
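
While you investigate, you can temporarily tell the monitors to stop marking OSDs up and down, which keeps the flap-and-recover cycle from compounding the load.  This is a stop-gap only, so remember to clear the flags once the root cause is fixed:

    ceph osd set noup       # newly booted OSDs are not marked up
    ceph osd set nodown     # OSDs that miss heartbeats are not marked down
    # ...investigate and fix the underlying problem, then:
    ceph osd unset noup
    ceph osd unset nodown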

From a hardware perspective, the underlying spindle disk could be close to failing.  Determining this on a modern SAS or SATA hard drive is straightforward, requiring a single utility known as smartctl.  smartctl can be used to determine whether the S.M.A.R.T. status of the hard drive reports an imminent failure, which can lead to OSD failures and restarts at random.  Check the S.M.A.R.T. status of the drive by logging in to the potentially affected host and running ‘smartctl -a /dev/sdX’, where X is the device ID.  You can pipe this command to grep and search for SMART.  This will output the drive’s SMART status and may indicate imminent failure.  If it does, replace the drive!
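
For example, assuming the flapping OSD sits on /dev/sdX (a placeholder; on ceph-volume/LVM-based deployments you can map an OSD to its device with “ceph-volume lvm list”):

    smartctl -H /dev/sdX                      # short overall-health self-assessment only
    smartctl -a /dev/sdX | grep -i health     # same verdict pulled from the full SMART report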

Another cause of OSD flapping can be as simple as an MTU mismatch on an underlying interface that carries your Ceph storage network.  MTU, or Maximum Transmission Unit, specifies the largest packet an interface is allowed to send before the kernel has to perform fragmentation, splitting packets into smaller chunks that fit within the configured MTU.  Because Ceph is a storage technology that moves large blocks of data, the larger your MTU, the more throughput you’ll get with the least overhead on the system.

If you have an MTU mismatch anywhere in your cluster, for example one server with an MTU set to 9000 and another set to 1500, packets will be dropped or fragmented and Ceph replication can grind to a halt.  This can also cause OSD flapping, as heartbeat messages are delayed or lost along with the replication traffic.

To check an interface’s MTU, run “ip addr” (or the legacy “ifconfig”) on the command line, match the mtu value to the interface, and compare the interfaces on every host across your cluster for consistency.
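
A quick sketch of both checks, assuming a hypothetical interface name eth0 and a peer address on the cluster network.  The ping test sets the don’t-fragment bit with an 8972-byte payload (9000 minus 28 bytes of IP and ICMP headers) to confirm jumbo frames pass end to end:

    ip link show eth0 | grep -o 'mtu [0-9]*'    # MTU configured on this interface
    ping -M do -s 8972 -c 3 CLUSTER.PEER.IP     # should succeed on a consistent jumbo-frame path; errors or drops indicate a mismatch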

Conclusion

AMDS Cosmos Engineers are available to assist you with the architecture, design, selection, deployment, and ongoing maintenance of your Ceph cluster, or any related Linux, Windows, or storage project.  With extensive experience in both vendor management and open source software, we can augment your team’s existing skill set to help you grow into new technology.
