Designing My Multi Site HomeLab

Jun 01, 2024
As you saw in my last video, I decided to locate some of my servers in a data center not too far away. In that video, you saw most of my hardware choices, which gave my HomeLab servers a new life, in a new rack, on a new network, in a new location. And with this new location comes a host of challenges, including networking, security, VPN, virtualization, DNS, multiple clusters, backups, and much more. That's why I wanted to share some of my progress candidly. This will be much less formal, but much more in-depth about some of my upcoming choices.
And that's where I could use your help again. Think of this as an architectural review, where you are the judge. Don't be too hard on me. So let's dive into my architecture and get started. The first part was setting up my network. And at first I thought this was going to be the really tricky part, but it turned out a little easier than I thought it would be. Moving my public workloads to co-location simplified many things on my network. So, a quick review of what my network was like before this. I drew this diagram about three months ago.
But I had many VLANs here. As you can see, I had my default network, a camera VLAN, an IoT VLAN, a primary VLAN, a guest VLAN, an untrusted servers VLAN, and then a trusted VLAN. And between all of these networks, I had a lot of complicated firewall rules, or ACLs for lack of a better term. A lot of complicated ACLs that said, "Hey, devices on this core network can communicate with some devices on the trusted network." Or a rule that said: "Anything on the default network can communicate with anything on the external network." And then a rule that said: "Only established and related traffic can return." So it was really complicated.
And I did this because I only had one network and I hosted publicly accessible things from my house. A lot of that stuff lived on the untrusted network of this server you see here. And by moving those workloads out of my house, it also means I no longer have to port forward or allow incoming traffic on my network. I had a lot of complicated port forwarding rules to allow that traffic in. So now I can move most of those rules and some of that complexity to the UDM that's co-located.
That will greatly simplify all these rules that you can see on my home network. I no longer need probably 75% of these rules, because I'm no longer allowing some of that traffic in. And I was also thinking about flattening some of my VLANs, but that's where I'll need your help later. I then moved some of those VLANs to my co-located site, and you can see that I only have a couple of VLANs there. I have one called trusted, and I'm not even sure if I'm going to use it or just consider everything there to be untrusted.
I have one called servers, which isn't really trusted, but it's where I put things like my DNS server that I don't want to expose to the public. And then I have one called public, which is just that: workloads that are directly exposed to the public, either through an ingress controller or a load balancer, with port forwarding rules to those servers. I'll show them to you here in a second. And then there's management, which is for a lot of my admin interfaces and is pretty locked down. So how does that work between my two networks?
Well, I currently have a site-to-site VPN set up. Now, I thought this was going to be very difficult, but it turned out to be much easier because I ended up using UniFi's Site Magic. Part of the reason I chose UniFi devices is that it's literally as simple as checking these checkboxes and clicking connect. And in about five seconds, devices on my main network can communicate with, for example, devices on the public network. Now, the initial configuration of Site Magic is one thing, but the firewall rules are another. I'm not sure if it's complicated or confusing because of the way UniFi does it, or if it's just complicated and confusing because that's what site-to-site VPNs and their ACLs are like. They're just complicated and confusing. But after many attempts, I figured out how these firewall rules work across the site-to-site VPN.
And it's this little section called LAN Out. Now, most of the time I've used LAN In for my firewall rules, which I think restricts traffic on the LAN side, inside the LAN; I believe the rules you configure there apply to traffic coming from inside that LAN. I don't know, maybe LAN Out makes sense. You'd think it's traffic that's leaving, but I don't think so. It's actually traffic that has already passed through the VPN and is trying to get into this VLAN here, so I think it's evaluated on the outside. I'm not a networking person, and I'm definitely not an expert on all things UniFi.
Leaving the products aside and focusing on the technology, I set up an SD-WAN, or site-to-site VPN, between these two sites: home and away. And this is what is happening between the two sites, home obviously being my house and away being the colocation. Now, this is not a formal or sophisticated diagram. It's something I threw together; I had to get it all out of my head and put it into something, and it ended up looking like this. Now, I know you're going to ask what this tool is, because in my last video, when I showed a diagram, a thousand people asked me what the tool was because I didn't mention it.
This is FigJam, from Figma. You could think of Figma as a design tool mostly for web design, but they also make a collaboration board where multiple people can work together. And I found it's really nice for making these boards. Anyway, back to the VPN. What I was saying is that I now have a site-to-site VPN set up between these two sites, and that allows me to send traffic here if I want and/or send traffic back if I want. And this makes it a little complicated, just because of all the rules you have to have; it's not just the rules you want, you also have to make sure that you don't expose anything that you don't want to expose.
So with Site Magic, I filtered out all the VLANs that I don't want to expose anyway, but I still need to create these ACLs here to make sure some of these devices can't get back onto my network. There are a lot of VPN settings here, and I could definitely use your help if you have any advice. So at the colo, I have three servers and I'm running Proxmox on each one: PVE1, PVE2, and PVE3. These are my three Proxmox nodes. And inside these three nodes, I have virtual machines running. Inside the first one, I have DNS1 running.
And inside the third one — don't ask me why there are three — I have DNS2 running. So these are two DNS servers. Now, these are actually PiHoles, as you can see here, and there is an explanation for that. I know there are better DNS systems out there, but I already had a lot of DNS entries that I created over time, along with a lot of CNAMEs — probably 30, 40, 50 CNAMEs. And I didn't want to duplicate that on another DNS system, at least not now. So the easiest way I found was to just build two more PiHoles, place them here, and then sync them back to my PiHole server.
And this worked very well. I was surprised how well it worked. I mean, it works at home — though at home I technically have three DNS servers, but that doesn't matter. And I'm using Gravity Sync to sync these DNS servers. So I have DNS1 here, which is the source of truth, and currently all of these DNS servers pull from DNS1. What this means is that I only have to set things in one place, it's the source of truth, and everyone is happy. Now, I could change this to a push configuration, which might be a little bit better, where I allow the trusted side to push to the untrusted ones.
But I could also end up choosing a completely different DNS server, so I'll leave it as is. So if we double-click on that diagram, here's my Proxmox cluster: the first Proxmox server, the second Proxmox server, and the third Proxmox server. And as you can see, I already have some virtual machines running here. DNS1 — that's the one we just talked about. GHARUNNER1 — this is a GitHub Actions runner that runs the CI/CD jobs for my Ansible K3S GitHub repository. So every time an approved person opens a pull request, it will actually test the K3S Ansible playbook, create all kinds of clusters, tear them down, and make sure the tests pass.
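To give a rough idea of what that kind of check can look like, here's a minimal sketch of a GitHub Actions workflow that a self-hosted runner like GHARUNNER1 could pick up. This isn't the actual workflow from my repo — the file name, the playbook and inventory paths, and the job layout are all assumptions for illustration.

```yaml
# Hypothetical workflow, e.g. .github/workflows/test.yml -- names and paths are assumptions
name: Test k3s-ansible

on:
  pull_request:            # run whenever a PR is opened or updated

jobs:
  lint:
    runs-on: self-hosted   # picked up by a runner VM like GHARUNNER1
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Ansible and linters
        run: pip install ansible ansible-lint yamllint
      - name: Lint playbooks
        run: |
          yamllint .
          ansible-lint

  cluster-test:
    needs: lint
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Stand up a throwaway cluster, then tear it down
        run: |
          # site.yml / reset.yml and the test inventory are assumed names
          ansible-playbook site.yml -i inventory/test/hosts.ini
          ansible-playbook reset.yml -i inventory/test/hosts.ini
```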
Then I have k8s-public-01 and k8s-public-worker-01. As you might have guessed, these are Kubernetes nodes: this is the etcd and control plane node for this cluster, and this is the worker for that cluster. As you can see, I have a naming convention here — 1 1 1, 2 2 2, and 3 3 3. I do this on purpose so I know which VMs are on which node, which helps me a little bit, but it also spreads them across these three servers to make things a little more resilient and HA. But we'll talk about load balancers and things like that in a moment.
And then I have a Rancher server, and as you can see I have Rancher 1, 2, and 3. Yes, I still use Rancher. Right now I'm managing five Kubernetes clusters with Rancher, because I'm in the middle of migrating my local Rancher and my local cluster to this new Rancher instance and then creating downstream clusters for everything. It's a little complicated, but it's actually a lot of fun to do. Currently these are the machines that are running at the colocation. And one thing that made this super simple was connecting this Proxmox cluster over the site-to-site VPN to my NAS. What's on my NAS?
Well, my backups, over NFS. So if I look at the backups on one of these nodes, once it loads, you can see I have a lot of backups — a lot of backups. So what I'm doing temporarily is connecting both the public cluster and my private Proxmox cluster at home to the same NFS share, so I can back them up at home and then restore them here at the colocation. So let's look at the diagram real quick, because this might make a little more sense. All of these nodes go through the site-to-site VPN to my NAS down here.
And right now they are backing up to NFS, and as I mentioned, they're backing up to the same share — I'll call it share A. In the future, I want to break this out and have share A and share B over NFS. The public ones go to, say, A and the private ones go to B, because I want to separate them in case someone gets into these Proxmox nodes. I want to make sure that if they do, and they can access the backups, they will only be able to access the backups of those servers and not the backups of my private workloads.
So, TBD, I still have to do that. Not to be determined — to be done. How about that? Also, most of these things will be finished by the time you watch this video. I probably wouldn't be this forthcoming with all my architecture otherwise — I've been pretty open with what I run and how I run it, but some of these things seem like gaps in my security. Just know that some of the things I'm talking about will already be in place when you watch this video. So, coming back to this diagram: I've talked a little bit about the services I'm keeping at home versus the services I'm putting at the colocation. My plan so far is to separate public and home, but give public a way to get to home for some of these services.
Now, I don't want to build a NAS and put it at the colo, or have NFS there, or the same with SMB if I need it there, or the same with my object storage with MinIO or S3. So I decided — for me, I think it's a little easier, maybe not the right choice — to set up a firewall rule to allow these devices to back up to my NAS this way. And that also makes it good for disaster recovery, because let's say those backups lived here at the colocation: if the colo failed and you didn't have it anymore, you wouldn't be able to access those backups or restore anything.
Although there is a slight risk, I decided that backing up offsite to my NAS is a little better, or at least easier to manage. I could be wrong — let me know if I'm wrong. Another way to solve it would be to put those backups in another cloud somewhere else and then pull them down here: basically, back up to a different cloud and have this site pull it down. That might be a better way to do it, but I don't have a third site. So, these are some of the workloads running there at the colo.
I decided to try RKE2 at my colocation, while I still run K3S at home, and that allows me to dabble in both. And I like RKE2 because it's already hardened to some government standards and is much closer to upstream Kubernetes. Not that I've had any issues with K3S, or that it's far off from Kubernetes, but that's one of the selling points of RKE2. And now that I have two private clouds, I thought I'd give it a try there. Calico came as the default CNI with RKE2, and I just left it at the default because I really didn't want to run into any unknown errors from trying something else.
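As a point of reference on the CNI choice: with RKE2 you can also pin the CNI explicitly per node instead of relying on the default. This is just a generic sketch of RKE2's config file, not my actual config — the hostname under tls-san is made up.

```yaml
# /etc/rancher/rke2/config.yaml -- generic sketch, not my actual config
# Omitting `cni` keeps whatever the installer picks as the default;
# setting it pins the choice explicitly.
cni: calico

# Example of a typical server-side option (hostname is made up):
tls-san:
  - k8s-public.example.com
```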
That's different at home, and we'll get into that — that's where I'm going to try other things. But here I also run my MySQL database. I run another database that should be on the diagram, MongoDB, but I don't see it; maybe you will. Anyway, pretend MongoDB is here, and also pretend Postgres is here, because I have no idea where it is. But I run my databases there, and I run them clustered, and I decided that instead of putting them at home and going through this site-to-site VPN all the time, putting them there was probably a better option. Not so much because of latency or anything like that.
It's not that I really care about database latency — things are going to take half a millisecond longer. What I do care about is the DR side of things. I feel like this colo should be, for the most part, self-sufficient, to the point that if I lost internet at home, didn't pay my bill, the UDM broke, or someone cut the line, this can stand on its own legs with the databases there. I don't know if that's a good idea — let me know. Add comments; it's up to you to decide. And then I'm also running Jekyll for my documentation, cert-manager for certificates, and Longhorn there.
I also run my GitHub Action runners there, Shlink for my redirects or short links, my documentation site and other websites on Nginx, and my GitLab runners, which build code. And I run Flux. So I'm doing GitOps again using Flux, and I keep a lot of my custom code there — some of the APIs and bots that I run all over the place. Now I'll host those at my colo instead of at home. But back to Flux: I use GitOps, and you've probably heard me talk about it before. If you haven't, I have a video explaining what it is, why it's cool, and why it's awesome. Flux really made this super simple.
So, as I mentioned in the last video, I wasn't going to back up my current cluster and restore it at the colo; I was going to make some changes to the architecture, like you saw there. And that meant I had to build a new cluster there and then migrate some of my workloads over. This was made very easy with Flux. Let me show you why it's super easy and why GitOps is great. It's amazing because my cluster is defined in code, and I know that sounds intimidating, but the more you do it, the more amazing it is.
And it really helped me move some of these workloads to the colo. So in my public cluster, you can see some of these applications, and this was as simple as copying all of these folders and pasting them into a new cluster. For example, if I wanted to deploy all those applications to a new cluster, I first need to create the folder. Then I would paste in all of the folders and files that define these workloads — everything from the ingress, to a Helm release, to a cluster for my database. And if I commit this and push it, in a couple of minutes this cluster will have all these applications running.
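To make that copy-and-paste idea a bit more concrete, this is roughly the shape of a Flux Kustomization that points a cluster at a folder of manifests. The names and the path here are made up for illustration; they're not my actual repo layout.

```yaml
# Minimal Flux Kustomization sketch -- names and paths are made up
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m                      # how often Flux re-reconciles this path
  path: ./clusters/public-01/apps    # hypothetical folder holding the pasted workloads
  prune: true                        # remove things that get deleted from Git
  sourceRef:
    kind: GitRepository
    name: flux-system                # the Git repo Flux was bootstrapped from
```

Commit and push a folder under that path, and Flux reconciles it onto the cluster on the next interval.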
That helped me a lot, because what I ended up doing was exactly that. I went into my cluster 01, which was my home cluster — you can see I only have home things there — copied them all, and pasted them into my public 01, where they are now. Now, there is a caveat to this. I said it was as simple as copy and paste, but there were a couple of things I had to have in place first, which brings me to my storage, and that is Longhorn. This is my Longhorn instance at home, and as you can see, these volumes are all detached.
These were attached to Kubernetes workloads, but I removed them, so those containers are no longer attached to these volumes. But what I did before that was back up all of these volumes. You can see that all of these volumes here are backed up — some ran three or four days ago, some 15 minutes ago. There should be many more, but I also disabled that job. I back these up to object storage — S3 or MinIO, whatever you want to use. Then, on my new cluster, which is right here, I went to my backups and restored them. It's similar to how I backed up my VMs to NFS and then went into my new cluster and restored those VMs from NFS.
I did the same thing, but with object storage. So on my NAS I have object storage running, and from my old home cluster I backed up all those Longhorn volumes to object storage. Then, after installing Longhorn on the new cluster, I connected it, went into the Longhorn UI — it could also have been done via GitOps — and restored all of those volumes to the cluster. After that, I could copy and paste all of those applications, and they would come up and attach to those volumes. That saved me a lot of time. If I didn't have them defined in code, I'd probably still be clicking buttons and trying to figure out how this all fits together.
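If you want to copy this flow, the piece that makes the restore possible is pointing Longhorn on both clusters at the same backup target. Here's a rough sketch using the Longhorn Helm chart's default settings and a MinIO-style S3 endpoint — the bucket, region, endpoint, and secret name are placeholders, not my real values.

```yaml
# Sketch of Longhorn Helm values -- placeholders, not my real setup
defaultSettings:
  backupTarget: s3://longhorn-backups@us-east-1/    # bucket@region on the NAS's object storage
  backupTargetCredentialSecret: minio-credentials
---
# Credentials Longhorn uses to reach the S3/MinIO endpoint (a regular Kubernetes Secret)
apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
  namespace: longhorn-system
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<access key>"
  AWS_SECRET_ACCESS_KEY: "<secret key>"
  AWS_ENDPOINTS: "https://minio.example.internal:9000"   # hypothetical NAS endpoint
```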
Another reason I like GitOps — and I'll drop it after this last bit — is that it's also my documentation. That way I never have to worry or wonder how things are set up. I can always look here to see how they are set up, and I can compare them to other people's if I want. Anyway, I said I was going to drop the GitOps talk, so let's do it. Now, just backtracking a bit on how Rancher got there: I built a new Rancher cluster in the colo and created new downstream clusters from Rancher. That's the great thing about Rancher.
You can spin up a Rancher management cluster and then, from there, easily create downstream clusters. And that's exactly what I did, so let me show you. In Rancher, you can see that I have my Rancher management cluster, where nothing but Rancher is running, and then you can see that I have my public cluster and now my home cluster. So I decided to put my Rancher instance in the colo and let it manage all the clusters, no matter where in the world they are located. And I decided to do that because I didn't want to host Rancher at home anymore.
Remember, I want to cut down on the firewall rules. I don't want to port forward anymore. So I figured running Rancher on my private cloud at the colo, with a proper static IP and DNS, was probably a better option than hosting it at home. And that's what I ended up doing. I just created a new cluster: I chose custom nodes, I chose which Kubernetes distribution I wanted, I chose the CNI I wanted, I chose whether or not to include the Nginx ingress, and all the other options. Then I created the cluster, and it gives you a curl command to run, and that's how you build out your cluster.
You run that curl command on all of those machines, and that's what I did on these machines at home. As you can see, my home cluster is not doing very well right now. The Storinator is off because I removed all those workloads. Some machines are turned off; some machines are still running. And now I'm running, as you can see, k8s-home-01, k8s-home-02, and k8s-home-03. Those are the etcd and control plane nodes that control and manage Kubernetes. Then you can see home worker one, home worker two, and home worker three. These are spread across those three nodes.
And that's how I distribute that load and make things a little more available than if I had a single server. There are caveats to that, which I've talked a lot about before, but for the most part it's a little more available than it would otherwise be. Again, to illustrate that, Rancher would be here. All of these clusters communicate with this Rancher management cluster — all of these nodes, and all of these nodes at home too. And that allows me to manage three clusters right now with one Rancher instance. And I still have my old Rancher instance, which I'm moving some of the workloads over from.
Once I do that, I'll shut it down. So I'm kind of in the middle of that migration right now. But let me know: is this a smart thing to do? Should I have left my Rancher servers here at home and figured out something different? Or is keeping them somewhere public and allowing my nodes to connect to them a better architecture? You tell me. And now, diving into my home environment: as you saw, we freed up some compute at home on all of those servers. Some nodes were down, and many of them will be removed. And at home, I'm just going to run some virtual machines.
Yes, I still have a HomeLab at home. Just think of my colo as giving new life to those servers, because they were down for four months. And so yes, I still have a HomeLab. Did I move everything to the colo? No — I took servers that were down and that I was probably going to sell, gave them a new life, and put them there. Anyway, I feel like I'm explaining that a lot to a lot of people, but I still have a HomeLab. I absolutely do. So what am I going to run in my HomeLab? That's what I want to talk about.
So at home, I'm still running my DNS servers, and I have my home cluster — we just talked about that. And here is the capacity that I freed up, right here. I didn't quite know what to put there, so I set up a play area, because that's what my HomeLab has always been about. Although it hosted some utilities for my documentation and stuff like that, and self-hosted many things for the community, it also had space to play. This gives me more room to play, which I'm really excited about. So what am I going to run in that play space?
Well, I already chose RKE2 at home, and that's just for now — I think I'm going to change it to K3S, and I think I might try a different CNI. So I might try Cilium instead of Calico at home. I also run Traefik at home, which is great and which I know very well, but I'm thinking of experimenting with the Nginx ingress controller. And the same goes for Longhorn: I have been running Longhorn at home, and I'm thinking, hey, why not? I have this playground; I can play with Rook Ceph. And there are still things I'm going to keep running at home.
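On the Cilium idea: the usual pattern with K3s is to tell it not to ship its bundled networking and then install Cilium on top. Here's a minimal sketch of that node config, assuming the standard /etc/rancher/k3s/config.yaml location — this is an illustration, not my actual setup.

```yaml
# /etc/rancher/k3s/config.yaml -- sketch for swapping flannel out for Cilium
flannel-backend: "none"        # don't deploy the built-in flannel CNI
disable-network-policy: true   # Cilium handles network policy itself
disable:
  - traefik                    # optional: drop bundled Traefik while testing ingress-nginx
```

Cilium itself would then get installed afterwards, for example via its Helm chart or a Flux HelmRelease.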
I'll still run my DNS, I'll still run Home Assistant, and a lot of other things. I'm still going to run Scrypted, I'm still going to run some custom code, and I'm still going to run Flux. But hey, while I'm trying new things, who knows — maybe I could try Argo CD at home. So there are tons of possibilities that I wouldn't otherwise have, because I didn't have much compute left before. So yeah, the moral of the story is: I could be trying a lot of different things at home, and it may not be a mirror of the architecture that's out there now, which is pretty much tried and true.
And I think I'm going to keep that going for a long time and experiment more at home. And I know you can probably see this TailScale icon. This was a question I posed in the last video: should I use a site-to-site VPN or should I use TailScale? I think the answer is yes — to both. I'm going to keep the site-to-site VPN I already created, mainly because I have it working and the firewall rules were, well, complicated at first, but now I've figured them out. But I think TailScale still has a place here.
And I'm thinking I could do some interesting things with TailScale, like running an exit node up here and an exit node here, so I can, you know, maybe open up some services on my NAS and have them come out here at the colo. Or, who knows, maybe I could install TailScale on my NAS and on these Proxmox servers so they can reach my NAS for NFS, SMB, and object storage without creating this site-to-site VPN rule. I'm not sure. Or it could be something even crazier, like, hey, what if I moved all the etcd nodes?
The ones that hold the Kubernetes state, all the secrets — what if I pulled them out of the colo, moved them back home, and then used TailScale to automatically connect them to these workers? I could configure that in different ways: in Proxmox, inside a virtual machine, or even in Kubernetes itself, and then expose only a certain number of services from here to there through TailScale. As you can see, I still have a lot to figure out, and I'd love to hear from you. If you see that I made mistakes along the way, let me know in the comments below and I'll be sure to address them, maybe in the next video.
I think I'll continue this series where you help me build this architecture at my colocation, and who knows — maybe in the future I'll even open source my entire architecture, so a pull request can be opened, CI/CD runs, and those changes get applied almost in real time. I'm very excited about all the possibilities. I learned a lot about site-to-site VPNs and how to migrate services between private clouds, and I hope you learned something too. And remember, if you found anything useful in this video, don't forget to like and subscribe. Thanks for watching.
