
Raspberry Pi Ceph Cluster

A Raspberry Pi Ceph Cluster using 2TB USB drives.

I needed an easily expandable storage solution for warehousing my ever-growing hoard of data. I decided to go with Ceph since it's open source and I had some experience with it from work. The most important benefit is that I can continuously expand the storage capacity as needed simply by adding more nodes and drives.

Current Hardware:

  • 1x RPi 4 w/ 2GB RAM
    • management machine
    • influxdb, apcupsd, apt-cacher
  • 3x RPi 4 w/ 4GB RAM
    • 1x ceph mon/mgr/mds per RPi
  • 18x RPi 4 w/ 8GB RAM
    • 2 ceph osds per RPi
    • 2x Seagate 2TB USB 3.0 HDD per RPi

Current Total Raw Capacity: 65 TiB
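
The raw figure can be checked from the drive count in the hardware list. This is a minimal sketch, assuming each "2TB" drive holds exactly 2 × 10^12 bytes (the decimal unit on the label) while Ceph reports binary TiB; the function name is made up for illustration.

```python
# Raw capacity sketch: drive counts come from the hardware list above.
TB = 10**12   # decimal terabyte, as printed on the drive label (assumption)
TIB = 2**40   # binary tebibyte, the unit Ceph reports

def raw_capacity_tib(nodes: int, drives_per_node: int, drive_tb: int) -> float:
    """Return the combined raw capacity of all drives in TiB."""
    return nodes * drives_per_node * drive_tb * TB / TIB

capacity = raw_capacity_tib(nodes=18, drives_per_node=2, drive_tb=2)
print(f"{capacity:.1f} TiB")
```

Usable capacity is lower than this, since it depends on the replication and erasure-coding settings of each pool.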

The RPi's are all housed in a nine drawer cabinet with rear exhaust fans.  Each drawer has an independent 5V 10A power supply.  There is a 48-port network switch in the rear of the cabinet to provide the necessary network fabric.

The HDDs are double-stacked five wide to fit 10 HDDs in each drawer along with five RPi 4's.  A 2" x 7" x 1/8" aluminum bar is sandwiched between the drives for heat dissipation.  Each drawer has a custom 5-port USB power fanout board to power the RPi's.  The RPi's have the USB PMIC bypassed with a jumper wire to power the HDDs since the 1.2A current limit is insufficient to spin up both drives.

  • 1 × Raspberry Pi 4 w/ 2GB RAM
  • 3 × Raspberry Pi 4 w/ 4GB RAM
  • 18 × Raspberry Pi 4 w/ 8GB RAM
  • 22 × MB-MJ64GA/AM Samsung PRO Endurance 64GB 100MB/s (U1) MicroSDXC Memory Card with Adapter
  • 22 × USB C Cable (1 ft) USB Type C Cable Braided Fast Charge Cord


  • Instability Followup and Resolution

    Robert Rouquette, 03/30/2022 at 15:36

    The OSD instability I encountered after the kernel update persisted, though with less frequency.  I've finally determined that the cause is a confluence of small issues that amplify each other:

    • Power Supply Aging - The 5V 10A supplies have lost a small amount of their output headroom with age.
    • CPU Governor Changes - The ondemand CPU governor is no longer as aggressive at reducing the CPU frequency.
    • CPU Aging - The CPU and PCIe controller appear to have become more prone to core undervoltage.

    I've remediated the instability by underclocking the CPU.  Underclocking was insufficient on its own, so I've also applied a slight overvoltage.  The OSD RPi's have been holding steady after applying both changes.
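
One way to check a node for undervoltage events is the firmware's throttle status word. The sketch below is not taken from the log above; it assumes the `vcgencmd` utility is installed on the node and decodes the bit flags documented for `vcgencmd get_throttled`. The helper names are hypothetical.

```python
import subprocess

# Bit meanings as documented for `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}

def decode_throttled(value: int) -> list[str]:
    """Return the description of every flag bit set in the status word."""
    return [name for bit, name in sorted(THROTTLE_FLAGS.items())
            if value & (1 << bit)]

def read_throttled() -> int:
    """Run `vcgencmd get_throttled` (assumed to be installed) and parse
    its output, which looks like `throttled=0x50005`."""
    out = subprocess.run(["vcgencmd", "get_throttled"],
                         capture_output=True, text=True, check=True).stdout
    return int(out.strip().split("=", 1)[1], 16)

# On a Pi: print(decode_throttled(read_throttled()))
```

A node that reports any of the "has occurred" bits since boot would be a candidate for the power-related instability described above.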

    Here's the current state of my usercfg.txt:

    # Place "config.txt" changes (dtparam, dtoverlay, disable_overscan, etc.) in
    # this file. Please refer to the README file for a description of the various
    # configuration files on the boot partition.

    These changes do not appear to have impacted my Ceph performance.

  • Unstable Kernel (5.4.0-1041-raspi)

    Robert Rouquette, 08/24/2021 at 16:42

    The linux-image-5.4.0-1041-raspi version of the Ubuntu linux-image package appears to be unstable.  I've had two of the OSD RPi boards randomly lock up.  The boards recover on their own once power-cycled, but this is the first time I've observed this behavior.  There are no messages in the system logs. The logs simply stop at the time of the lockup, and resume on reboot.  I've upgraded all of the RPis to the latest image version (1042) which should hopefully resolve the issue.

  • Rebalance Complete

    Robert Rouquette, 06/28/2021 at 14:26

  • Two More OSDs

    Robert Rouquette, 06/17/2021 at 01:40

    I've added the last two drives for this round of expansion which brings the total for the cluster to 34 HDDs (OSDs).  Amazon would not allow me to purchase more of the STGX2000400 drives, so I went with the STKB2000412 drives instead.  They have roughly the same performance, but cost about $4 more per drive.  The aluminum top portion of the case should provide better thermal performance though.

  • Additional Storage

    Robert Rouquette, 06/09/2021 at 01:32

    Added the first two of four additional drives.  I plan to add the other two once the rebalance completes.

  • OSD Recovery Complete

    Robert Rouquette, 05/19/2021 at 03:24

    Once the failed drive was replaced, the cluster was able to rebalance and repair the inconsistent PGs.

        id:     105370dd-a69b-4836-b18c-53bcb8865174
        health: HEALTH_OK
        mon: 3 daemons, quorum ceph-mon00,ceph-mon01,ceph-mon02 (age 33m)
        mgr: ceph-mon02(active, since 13d), standbys: ceph-mon00, ceph-mon01
        mds: cephfs:2 {0=ceph-mon00=up:active,1=ceph-mon02=up:active} 1 up:standby
        osd: 30 osds: 30 up (since 9d), 30 in (since 9d)
        pools:   5 pools, 385 pgs
        objects: 8.42M objects, 29 TiB
        usage:   41 TiB used, 14 TiB / 55 TiB avail
        pgs:     385 active+clean
        client:   7.7 KiB/s wr, 0 op/s rd, 0 op/s wr
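
The usage line can be turned into two quick ratios. This is a rough sketch using only the rounded figures printed in the status output above, so the results are approximate; the variable names are illustrative.

```python
# Figures copied from the status output above (rounded to whole TiB there).
used_tib = 41     # raw space consumed across all OSDs
total_tib = 55    # raw capacity of the cluster at the time
stored_tib = 29   # logical size of the stored objects

fill_ratio = used_tib / total_tib        # fraction of raw capacity in use
raw_per_stored = used_tib / stored_tib   # raw bytes written per logical byte

print(f"fill: {fill_ratio:.1%}, raw/stored: {raw_per_stored:.2f}x")
```

A raw-to-stored ratio near 1.4x is what one would expect if most of the data sits in an erasure-coded pool with five data and two coding chunks, with the remainder in replicated pools.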

    After poking around on the failed drive, it looks like the actual 2.5" drive itself is fine.  The USB-to-SATA controller seems to be the culprit, and randomly garbles data over the USB interface.  I was also able to observe it fail to enumerate on the USB bus.  A failure rate of 1 in 30 isn't bad considering the cost of the drives.

  • OSD Failure

    Robert Rouquette, 05/04/2021 at 19:13

    The deep scrubs turned up a few more repairable inconsistencies until a few days ago when they grew concerning.  It turned out that one of the OSDs had unexplained read errors.  Smartctl showed that there were no read errors recorded by the disk, so I initially assumed it was just the result of a power failure.  It became obvious that something was physically wrong with the disk when previously clean or repaired PGs were found to have new errors.

    As a result I've marked the suspect OSD out of the cluster and I have ordered a replacement drive.  The exact cause of the read errors is unknown, but since it is isolated to a single drive, and the other OSD on the same RPi is fine, it's most likely just a bad drive.

    Ceph is currently rebuilding the data from the bad drive, and I'll post an update once the new drive arrives.

    The inconsistent PGs all have a single OSD in common: 2147483647 (formerly identified as 25)
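
The value 2147483647 is 2^31 - 1, which Ceph uses as a placeholder (`CRUSH_ITEM_NONE`) in a PG's up or acting set when no OSD is currently mapped to that slot, as happens once an OSD has been marked out. A minimal sketch for spotting such slots follows; the acting set in the usage line is hypothetical, not taken from this cluster.

```python
CRUSH_ITEM_NONE = 0x7FFFFFFF  # 2147483647, Ceph's "no OSD mapped here" marker

def missing_slots(acting: list[int]) -> list[int]:
    """Return the positions in an acting set that have no OSD assigned."""
    return [i for i, osd in enumerate(acting) if osd == CRUSH_ITEM_NONE]

# Hypothetical acting set with the second slot unmapped.
print(missing_slots([3, 2147483647, 11]))
```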

  • Inconsistent Placement Group

    Robert Rouquette, 04/08/2021 at 16:54

    The OSD deep scrubbing found an inconsistency in one of the placement groups.  I've marked the PG in question for repair, so hopefully it's correctable and is merely a transient issue.  The repair operation should be complete in the next few hours.

    Ceph was able to successfully repair the PG.

  • Zabbix Monitoring

    Robert Rouquette, 01/22/2021 at 00:30

    I decided I have enough nodes that some comprehensive monitoring was worth the time, so I configured Zabbix to monitor the nodes in the Ceph cluster.  I used the zbx-smartctl project for collecting the SMART data.

  • Rebalance Complete

    Robert Rouquette, 12/10/2020 at 15:15

    The rebalance finally completed.  I had to relax the global mon_target_pg_per_osd setting on my cluster to allow the PG count increase and the balancer to settle.  Without setting that parameter to 1, the balancer and PG autoscaler were caught in a slow thrashing loop.


miouge wrote 11/25/2020 at 08:48

Looks great. Just a couple of questions:

- Which install method did you use for Ceph?

- What kind of data protection do you use? Replication or EC? How has the performance been?


Robert Rouquette wrote 11/25/2020 at 19:22

I used the ceph-deploy method.  It's simpler and makes more sense for lower-power embedded systems like the Raspberry Pi since it's a bare-metal installation instead of being containerized.

I use 3x replication for the meta, fs root, and rbd pools.  I use 5:2 EC for the majority of my cephfs data.
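
The space cost of the two protection schemes can be compared with a little arithmetic. This is a minimal sketch, assuming "5:2 EC" means an erasure-code profile with k=5 data chunks and m=2 coding chunks; the function names are illustrative.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes written per logical byte for an N-copy replicated pool."""
    return float(copies)

def ec_overhead(k: int, m: int) -> float:
    """Raw bytes written per logical byte for a k+m erasure-coded pool."""
    return (k + m) / k

rep = replication_overhead(3)   # 3x replication
ec = ec_overhead(k=5, m=2)      # assumed k=5, m=2 profile

print(f"replication: {rep:.1f}x raw, usable fraction {1 / rep:.1%}")
print(f"EC 5+2:      {ec:.1f}x raw, usable fraction {1 / ec:.1%}")
```

Under that assumption the EC pool stores each logical byte with 1.4x raw space while tolerating the loss of any two chunks, which is why it suits the bulk of the data.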


Toby Archer wrote 11/20/2020 at 14:15

20TB raw capacity per shelf is great. How are you finding heat? It would be very cool to wire in two temperature probes to your management Pi's GPIO and monitor temperature by the exhausts.

Have you found single gigabit per node to become a bottleneck?

Awesome project, makes me very jealous.


Robert Rouquette wrote 11/21/2020 at 00:54

The CPU and disk temperatures tend to stay below 55 C.  The Gigabit Ethernet hasn't been a bottleneck so far, and I don't expect it to be.  The disks don't perform well enough with random IO to saturate the networking, and filesystem IO on CephFS is typically uniformly distributed across all of the OSDs.  As a result the networking bottleneck is almost always on the client side.


Aaron Covrig wrote 11/16/2020 at 18:33

This is a sweet looking project!  I noticed that you look to be playing it safe with how you distributed your Pi's based on available RAM.  Are you able to provide any details on what the memory consumption has looked like when idle and under load?


Robert Rouquette wrote 11/20/2020 at 00:40

The OSDs are configured for a maximum of 3 GiB per service instance and they tend to consume all of it.  That comes to 6 GiB per 8 GiB RPi just for the OSD services.  The kernel and other system services consume a minor amount as well, so the nodes tend to hover around 20% free memory nearly all the time.  The extra "unused" memory is necessary as padding since there is no swap space.  (Adding swap on an SD card is simply inviting premature hardware failure.)
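
The per-node memory budget works out as follows. This is a rough sketch using the figures quoted above; treating the 3 GiB limit as Ceph's `osd_memory_target` option is an assumption, and the variable names are illustrative.

```python
GIB = 2**30

osds_per_node = 2
osd_limit_bytes = 3 * GIB    # per-OSD limit quoted above (assumed to be osd_memory_target)
node_ram_bytes = 8 * GIB     # RPi 4 8GB model

osd_total_bytes = osds_per_node * osd_limit_bytes
headroom_bytes = node_ram_bytes - osd_total_bytes
headroom_fraction = headroom_bytes / node_ram_bytes

print(f"OSDs: {osd_total_bytes / GIB:.0f} GiB, "
      f"headroom before kernel and services: {headroom_fraction:.0%}")
```

The 25% left over before the kernel and system services take their share is in line with the roughly 20% that stays free in practice.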


Aaron Covrig wrote 11/20/2020 at 02:00

Awesome, thank you.  
