Homelab Storage: A Journey to Ceph

tldr: I started with a modest NAS setup in my homelab, but my curiosity led me to build an 8-node Ceph distributed storage cluster. Ceph, though overkill for a home environment, offers scalable, resilient storage by distributing data across multiple servers. It supports various storage types like block, file, and object storage, making it versatile for different needs. My current setup includes management nodes and storage nodes with NVMe and SATA SSDs, providing high performance and capacity. I use CephFS for Docker containers, RBD for Proxmox VMs, and maintain backups with TrueNAS and Unraid. This journey has transformed my homelab into a robust, interesting playground for storage technologies.

Storage in my homelab started modest and pretty typical: a NAS appliance and some PCs/Macs that connected to it. Fast forward to today, and I have an 8-node Ceph distributed storage cluster and multiple NAS servers. This is the story of my Ceph rabbit hole.

In the beginning...

The beginning was unremarkable. I have historically used a variety of NAS appliances from Synology, TrueNAS, and others. Storage was just a thing that I had. It sat in the background and was uninteresting.

But my homelab is a hobby. I mess with it because I have fun messing with it. The thing that triggered my journey into "let's play with storage" was actually a desire for non-NFS shared storage. I was using NFS mounts on my NAS to provide shared storage for VM virtual disks on my Proxmox cluster. But when I started playing with Kubernetes and Docker Swarm, I found myself wanting shared storage for containers that wasn't simply an NFS mount—if only for the fun of it.

I was already using Proxmox, and the Proxmox console has this tab for "Ceph." So naturally, I started to read about it, and it intrigued me.

Ceph

Ceph is overkill for a homelab of my size. A NAS is more cost-effective and straightforward. And boring. Ok, that's out of the way.

Ceph is a distributed storage system designed to scale out by combining a bunch of commodity hardware into one big storage cluster. What's cool about Ceph is that instead of storing data on single-purpose storage appliances like a traditional NAS, Ceph spreads the data across many regular servers, automatically keeping multiple copies so that everything keeps working even if some hardware fails.

So in my homelab, instead of having one big NAS box, I've got a whole bunch of small servers with Ceph installed that all work together to provide one big distributed storage system that's more resilient and scalable than a normal NAS. And not boring.

With Ceph, I just add hard drives (SSDs in my case) to the cluster, and their storage capacity is assimilated into the cluster. I define "pools" to store different kinds of data, and each pool has its own replication rules that determine how data in that pool is spread across devices.

In a traditional NAS with some variation of RAID, things work at the array level. That has a couple of interesting aspects. If you have a drive failure, you replace the drive and the array rebuilds itself to ensure you have protection. Until the drive is replaced, your array is degraded.

Also, if you need more space, you generally can't just add another drive. The array was built with a specific set of drives, and you can't just turn a 5-drive RAID 5 array into a 6-drive RAID 6 array. There are NAS solutions like Unraid that get around this. There are also solutions like ZFS that let you essentially combine multiple "sub arrays" into a larger pool of data. Of course, each has pros and cons.

With Ceph, data is not managed in "arrays" or at the level of drives. Instead, data in the cluster is broken down into chunks (called objects by Ceph), and those objects are distributed across available drives. Objects are protected according to the rules you specify, either with simple "make multiple copies" replication or with erasure coding, which is a parity-like approach similar in concept to RAID 5/6 but at the object level instead of at the drive level.
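To make the trade-off between the two protection schemes concrete, here is a minimal sketch of the capacity arithmetic. The function names are made up for illustration; the only facts used are that replication stores every object N times and that a k+m erasure-coded pool stores k data chunks plus m coding chunks per object.

```python
from fractions import Fraction


def replicated_usable_fraction(copies: int) -> Fraction:
    """Share of raw capacity that holds unique data when each object is stored `copies` times."""
    return Fraction(1, copies)


def erasure_coded_usable_fraction(k: int, m: int) -> Fraction:
    """Share of raw capacity that holds unique data with k data chunks and m coding chunks."""
    return Fraction(k, k + m)


three_copies = replicated_usable_fraction(3)
ec_2_2 = erasure_coded_usable_fraction(2, 2)

print(f"3 copies: {float(three_copies):.1%} of raw capacity is usable")
print(f"EC 2+2:   {float(ec_2_2):.1%} of raw capacity is usable")
```

Both schemes in this example can lose two copies or chunks of an object without losing data, but the erasure-coded pool gets there with half of the raw capacity usable instead of a third. The price is extra CPU work and slower recovery, which is why erasure coding tends to suit bulk data better than latency-sensitive workloads.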

When a drive fails, the pools that have data on that drive become degraded, but Ceph immediately starts making new copies of the lost data on available space elsewhere in the cluster. Usually, within a short period of time (relative to a RAID array rebuild), your cluster is automatically back to a healthy state.

All you need to make this viable is enough free space in your cluster to accommodate one or more drives failing, and that free space can be small amounts across multiple nodes. You can replace the drive that died at your leisure so that you once again have spare capacity in the cluster for future failures.
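A rough way to sanity-check that headroom is sketched below. It is a deliberate simplification: it ignores failure domains and uneven data distribution, and the drive sizes and usage figures are hypothetical. The 0.85 fill limit is an assumption chosen to mirror Ceph's default "nearfull" warning threshold.

```python
def survives_drive_loss(capacities_tb, raw_used_tb, failed=1, max_fill=0.85):
    """Return True if the raw data still fits under `max_fill` once the
    `failed` largest drives are gone (the worst case for capacity)."""
    surviving = sorted(capacities_tb)[: len(capacities_tb) - failed]
    return raw_used_tb <= sum(surviving) * max_fill


# Hypothetical cluster: 25 drives of 2 TB each holding 40 TB of raw data
# (replicas and erasure-coded chunks included).
drives = [2.0] * 25
print(survives_drive_loss(drives, 40.0, failed=1))
print(survives_drive_loss(drives, 40.0, failed=2))
```

With these made-up numbers the cluster can re-protect itself after one drive failure but not after two, which is the point at which replacing the dead drive stops being a "whenever" task.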

Adding more capacity is as easy as adding a drive. Ceph will assimilate that new drive into the appropriate pools (based on rules), and behind the scenes, it will start rebalancing where data is stored so that the new drive takes on a share of the responsibility for the existing data.

When it comes to making this pool of storage available to clients, Ceph provides multiple options. Block storage (called RBD) is ideal for things like Proxmox virtual disk images. File storage (CephFS) lets remote servers mount folders similarly to how they do NFS shares, and this is what I use for the shared storage needs of my Docker cluster. Ceph also has S3-compatible object storage for cases where that is needed, but I don't currently have a use case for that in my own lab.

The other aspect I found really interesting was how the distribution of data actually works. In a distributed storage system, you might imagine that any time a client needs data, it first goes to some central server and asks, "Where can I find data X?" The central server then looks up the location and directs the client to the actual node that has the data. But Ceph does data distribution with math.

CRUSH (Controlled Replication Under Scalable Hashing) is the secret sauce that allows any Ceph client to calculate where a particular piece of data is stored in the cluster without having to ask a central lookup table. It does this by using a deterministic hashing function that takes into account factors like the cluster topology and desired replication level.

This means that clients can read and write data directly to the right places without a centralized bottleneck, which is a big part of what allows Ceph to scale out so well in real-world scenarios much larger than my lab.
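The idea of "placement by calculation" can be illustrated with a toy example. To be clear, this is not CRUSH: real CRUSH maps objects to placement groups and then walks a weighted hierarchy of hosts, racks, and so on. The sketch below uses rendezvous (highest-random-weight) hashing as a much simpler stand-in, with hypothetical OSD and object names, just to show how every client can independently compute the same answer.

```python
import hashlib


def score(object_name: str, osd: str) -> int:
    """Deterministic pseudo-random score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_name}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_name: str, osds: list[str], copies: int = 3) -> list[str]:
    """Pick the `copies` OSDs with the highest scores for this object."""
    return sorted(osds, key=lambda osd: score(object_name, osd), reverse=True)[:copies]


# Hypothetical cluster with 8 OSDs, then the same cluster after adding one more.
osds = [f"osd.{i}" for i in range(8)]
before = {f"obj-{n}": place(f"obj-{n}", osds) for n in range(1000)}
after = {name: place(name, osds + ["osd.8"]) for name in before}

moved = sum(before[name] != after[name] for name in before)
print(f"{moved} of {len(before)} objects changed placement after adding osd.8")
```

Any client that knows the list of OSDs computes the same placement with no lookup service involved. Adding a drive only relocates the objects for which the new drive scores into the top three, so most data stays where it is. That is the same flavor of behavior Ceph shows when it rebalances after a new drive joins.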

Sorry. I think Ceph is really cool, and I got to rambling there a bit.

My Ceph Cluster

As I mentioned, I originally heard about Ceph because of Proxmox, and in fact, my original cluster was a Proxmox-managed Ceph cluster. That was cool and worked okay, but it meant anytime I wanted to add a new node to the cluster, it had to be a full-blown Proxmox node even if it was never going to perform any VM compute duties.

So recently, I transitioned to a standalone Ceph cluster managed by cephadm, one of the official orchestration engines available for Ceph. I now have a nice little cluster that can be centrally managed either from the built-in dashboard or from the command line on any of the three management nodes.

The orchestration engine makes sure that required services are deployed appropriately, including a monitoring suite. And anytime I add a new drive to any of the nodes, the orchestration engine notices it and adds it to the pool of available storage.

3x Minisforum MS-01 boxes (Intel i9-12900H, 64GB RAM, 10GbE, Ubuntu 22.04)
- These are essentially management nodes. They run the mon, mgr, and mds services as well as things like Prometheus, Grafana, Alertmanager, and other cluster support services.
- Each also has 2x NVMe M.2 SSDs (on top of the drive used for the OS) that back NVMe-specific storage pools for performance-sensitive workloads.
5x general "storage" nodes (misc Intel CPUs, 32GB RAM, 10GbE, Ubuntu 22.04)
- These are mini-ITX tower PCs that were originally used as NAS boxes, so they have plenty of SATA connectivity. These don't run any Ceph services except OSDs.
- Each currently has 5x SATA SSDs that are the default storage for pools and certainly for the pool that contains most of my media files.

I have allocated storage as follows:

A CephFS file system that is mounted by my Docker VMs and serves as shared storage for container data so that a container can start up on any node and have access to its data. The filesystem is backed by two pools depending on the subfolder:
- An NVMe pool that is used for Docker application files by default. This pool uses 3x replication (three copies of every object).
- A higher-capacity but slower SATA SSD pool that is used for media files (photos, movies, TV shows, audiobooks). This pool uses 2+2 erasure coding, which takes less raw space than 3x replication.
- Individual folder trees are mapped to specific backend pools using CephFS File Layouts.
An RBD pool that is used by Proxmox to store VM virtual disks. This pool is backed by the NVMe drives for performance reasons and also uses 3x replication.
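For anyone curious what the File Layouts mapping looks like in practice: CephFS exposes a directory's layout as extended attributes, and the data pool is controlled by the `ceph.dir.layout.pool` attribute. The sketch below only builds the `setfattr` command rather than running it, and the mount path and pool name are hypothetical placeholders, not the ones used in this cluster.

```python
import shlex


def layout_command(directory: str, pool: str) -> list[str]:
    """Build the setfattr call that points a CephFS directory at a data pool."""
    return ["setfattr", "-n", "ceph.dir.layout.pool", "-v", pool, directory]


# Hypothetical mount point and pool name, for illustration only.
cmd = layout_command("/mnt/cephfs/media", "cephfs_media_ec")
print(shlex.join(cmd))
```

Two caveats apply. The pool has to be attached to the file system as a data pool before a layout can reference it, and a layout change only affects files created afterwards; existing files stay in their original pool until they are rewritten.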

All nodes in the cluster have 10 GbE connections to each other and to the Proxmox hosts and VMs that connect to the cluster.

Backups

Ceph is now my primary storage mechanism, but I still have my old NAS servers. They are effectively used as backup targets at this point. One is currently running TrueNAS, and on the other I am experimenting with Unraid.

I have two NAS servers because one of them used to be offsite (3-2-1 backups), but I had to bring it back to rebuild it. Once I decide whether I like Unraid more than TrueNAS as a backup NAS, I will probably move one of them offsite again.

The data in the CephFS file system is backed up to each NAS. The TrueNAS server runs rsync tasks hourly to get any changes. For the Unraid server, I am experimenting with Borg Backup. I am not sure which I will standardize on in the long term.

Proxmox currently sends its nightly backups of VM virtual disks to the Unraid server.

Additionally, I still have offsite backups for the most valuable files (family photos, important documents) via a nightly rclone task on the TrueNAS server that pushes an encrypted backup to Storj. Depending on how my experiments with Borg Backup go, I may use it to do remote backups to something like BorgBase, but that's TBD.

Wrapping Up

So that's my journey into the world of Ceph. What started as a simple curiosity has turned into a pretty cool distributed storage setup that keeps my homelab both functional and interesting. Sure, it's overkill for a home environment, and not inexpensive, but it's not boring!
