The New Jorviker

LAN Improvements

Hello there again. No, I didn’t go and forget about you all after a single post. I’ve just had a crazy last couple of months. Since the last time I posted, I interviewed with Google, got hired, and moved two timezones away to California. Talk about a busy period. But the project is still ongoing, and even has a name now: Ravenna.

One advantage of living in San Jose, heart of Silicon Valley, is that high-speed internet is ubiquitous. I’m ostensibly getting a gigabit per second through Comcast. Unfortunately, when I performed a speed test from my laptop over wifi, I was only getting somewhere around 10 Mbps. Oh man…

So I troubleshot the problem the best I could. I tested a wired device downstream from my router – same speed. I eliminated the router altogether and just tested the speed plugged directly into the modem. Now I was getting around 600 Mbps. Not quite a gig, but not shabby. Certainly much better than I was getting through WoW back in Huntsville.

So what was the problem, then? The first culprit in my mind was the custom router firmware I was running – dd-wrt. I use a Nighthawk r7000, a router I bought a few years ago without too much thought. After a while, I decided I wanted finer-grained controls over my network than the default Netgear GUI would allow. I wanted to be able to automate things using scripts. Unfortunately, the r7000 isn’t particularly well-supported under dd-wrt.

After I got under the hood, I could see that the router was basically an underpowered dual-core ARM processor with under a gigabyte of RAM, hooked up to a Broadcom ASIC. The ASIC handles the data plane while the CPUs handle the control plane. However, the ASIC’s API is proprietary, so the authors of dd-wrt had to reverse engineer its interface in order to support hardware-accelerated NAT.

I saw a lot of dark secrets while playing around with dd-wrt, including a 2.X series Linux kernel. But it met my needs. I didn’t notice a decrease in connection speed and I gained the flexibility of running scripts directly on my router. So I left it the way it was, for the most part.

Fast forward to a couple of weeks ago, when I moved in. Now I could see it. This performance was unacceptable. So I bit my tongue and reflashed the locked-down Netgear Duplo firmware… and all of a sudden I was getting 600 Mbps wired connections and 200 Mbps wireless connections! Not bad!

But switching away from dd-wrt meant losing all of the flexibility it had given me over my router. So I decided to break out the sidecar pattern. I’d let my router do what it’s good at – routing, firewalling, and NATing – and let another, more flexible component do the rest. So I set up a Raspberry Pi with a static IP and let it be both the DNS server and the DHCP server. dnsmasq allowed me to do both.

But I had a laundry list of things I needed DNS to do. I needed hostnames to be resolvable after DHCPing. I needed service names to be resolvable after they’d started up (e.g. jenkins, gitlab), and I needed all outbound DNS traffic to be encrypted. A tall order for just dnsmasq. So I decided to set up a bit of a DNS pipeline. After all was said and done, it looked a bit like this:

DNS Pipeline

consul is a great tool for service discovery. It offers a REST API both for registering and querying services and their associated data. It also offers a read-only DNS API. It was a perfect fit for my use case.
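
If you haven’t used it before, both halves of that REST API are easy to poke at from a few lines of Python. This is just a minimal sketch, assuming a consul agent listening on its default address of localhost:8500 and the requests library installed; consul registers itself in its own catalog, so it works on a fresh install.

```python
import requests

CONSUL = "http://localhost:8500"  # assumes a consul agent on its default address

# List every registered service (consul itself is always in the catalog).
services = requests.get(f"{CONSUL}/v1/catalog/services").json()
print(sorted(services))

# Look up the nodes behind one of them.
for entry in requests.get(f"{CONSUL}/v1/catalog/service/consul").json():
    print(entry["Node"], entry["Address"], entry["ServicePort"])
```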

cloudflared is a service created by the eponymous Cloudflare that proxies plaintext UDP DNS queries over DNS over HTTPS (DoH). Since all of my outbound queries are proxied through it, none of my outbound DNS traffic should be plaintext. This service uses Cloudflare’s relatively recent public DNS server 1.1.1.1, which has a new-fangled HTTPS interface for secure name resolution.
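
To get a feel for what cloudflared is doing on the pipeline’s behalf, here’s roughly what one of those lookups looks like over HTTPS. This sketch uses the JSON flavor of the 1.1.1.1 endpoint for readability (example.com is just a placeholder name), so treat it as an illustration of the idea rather than exactly what the proxy sends on the wire.

```python
import requests

# Ask 1.1.1.1 for an A record over HTTPS, using its JSON interface for readability.
resp = requests.get(
    "https://1.1.1.1/dns-query",
    params={"name": "example.com", "type": "A"},
    headers={"Accept": "application/dns-json"},
)
resp.raise_for_status()
for answer in resp.json().get("Answer", []):
    print(answer["name"], answer["type"], answer["data"])
```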

PSA: 1.1.1.1 is not suitable for use as a fake IP in tests. I’ve seen integration tests fail because people assumed it was. If you must use a fake IP, the IETF has set aside test subnets for exactly this purpose: 192.0.2.0/24, 198.51.100.0/24, and 203.0.113.0/24 (RFC 5737).

So let’s say I want to resolve node2. The request goes to the sidecar, since that’s the nameserver DHCP handed out. dnsmasq is listening on port 53 of the sidecar and will resolve the request immediately, since node2 is one of its active DHCP leases.

How about jenkins? Before I even make the request, Jenkins should have registered itself with consul on startup via consul’s REST API. Now, dnsmasq will receive the DNS request and, since it has no record of jenkins, will proxy it to consul. consul will return the appropriate result. I’m actually glossing over things a little bit here: I had to make use of a search path to make this work, since consul returns its DNS records in the form of <SERVICE_NAME>.service.<DOMAIN>.
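
That startup registration doesn’t need much. Here’s a rough sketch of what a service like Jenkins could run when it comes up, again assuming the agent’s default address and a made-up port; consul’s docs use PUT for this endpoint.

```python
import requests

# Register this node's Jenkins instance with the local consul agent so that
# jenkins.service.<DOMAIN> starts resolving to it.
requests.put(
    "http://localhost:8500/v1/agent/service/register",
    json={"Name": "jenkins", "Port": 8080},  # hypothetical port
).raise_for_status()
```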

Finally, suppose I try to reach google.com. This sort of request should be the average case. The request will make its way through dnsmasq and consul, neither of which will know the answer, and will be passed along to the next link in the chain. cloudflared will receive the request, translate it into an equivalent HTTPS request, and query 1.1.1.1.
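
If you want to watch all three cases take the same path, something like this works from any machine on the LAN. It’s a sketch that leans on the third-party dnspython package and points it straight at the sidecar; the 192.168.1.2 address is a stand-in for whatever static IP your sidecar actually has, and consul’s default domain of consul stands in for <DOMAIN>.

```python
import dns.resolver  # third-party dnspython package (pip install dnspython)

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["192.168.1.2"]  # hypothetical static IP of the sidecar

DOMAIN = "consul"  # consul's default; substitute whatever <DOMAIN> is in your setup

for name in ("node2",                      # answered by dnsmasq from its DHCP leases
             f"jenkins.service.{DOMAIN}",  # forwarded to consul
             "google.com"):                # forwarded on to cloudflared and 1.1.1.1
    answers = resolver.resolve(name, "A")
    print(name, "->", ", ".join(rdata.address for rdata in answers))
```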

This all might sound quite circuitous, but in practice, I haven’t noticed much of an impact on my name resolution latency. Certainly not enough to abandon the level of flexibility this affords me. With this model, I can plug a compute node into my server farm and with zero manual configuration, the services hosted on it magically become available. Getting Jenkins up is as simple as plugging in a D plug and a cat5 cable.

After messing around with the configurations of these various services, I wrapped them all up into Debian packages to make the setup real and reproducible. Stay safe, kids. Practice immutable infrastructure.

There are a couple of warts in this system. I haven’t integrated with consul’s health check system yet, so if I ever migrate a service to another node, or a node changes addresses, I’ll have a stale DNS record hanging around for that service. I’ll need to fix that in the near future. For the moment, I’m marking it down as technical debt and moving along. A lot of changes have been made to the simulation since I last posted, and I really want to get back to working on the meat of the project – geometry, physics, and procedural generation. Stay tuned for more.