Recently in networking Category

Why tcp-mss-clamp still matters

By SysAdmin1138 on May 28, 2024 8:40 AM

This is blogging in anger after fighting this over the weekend. Because I'm like that I have a backup cable ISP in case my primary fiber ISP flakes out. I work from home, so the existence of internet is critical to me getting paid, and neither cell phone has good enough service to hotspot reliably. Thus, having two ISPs. It's expensive, but then so would be missing work for a week while I wait for a cable tech to come out to diagnose why their stuff isn't working.

The backup ISP hasn't been working well for a while, but the network card pointing to the second cable modem flaked out two weeks ago and that meant replacement. Which refused to pick up address info (v4 or v6) off of DHCP. Doing a hard reset from the provider side fixed the issue, but left me with the curious circumstance of:

I can curl from the router
But nothing behind it could curl.
A packet trace of the behind-the-router case showed the TCP handshake finishing, but the TLS handshake failing after the initial hello.

What the actual fuck.

What fixed the problem was the following policy added to my firewalld config in /etc/firewalld/policies/backuprouter.xml.

<rule> <tcp-mss-clamp value="1448"/> </rule>

MSS means 'maximum segment size', a TCP option indicating how much TCP payload a single packet can carry. For networks with a typical Maximum Transmission Unit (MTU) size of 1500, MSS is typically 1460. Networking over things like VPNs trims the effective MTU due to tunnel overhead, often to 1492 with a corresponding reduction in MSS to 1452. The tcp-mss-clamp setting tells firewalld to cap MSS at 1448; if something behind the router advertises a higher MSS during the TCP handshake, the router rewrites the advertised value down to 1448 so that both ends send segments that fit.
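The arithmetic behind those numbers can be sketched in a few lines of Python. This assumes IPv4 with no IP options and a bare 20-byte TCP header; the function names are made up for illustration.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


def mtu_for_mss(mss: int) -> int:
    """Smallest MTU that carries a segment of the given MSS unfragmented."""
    return mss + IPV4_HEADER + TCP_HEADER


print(mss_for_mtu(1500))  # 1460
print(mss_for_mtu(1492))  # 1452
print(mtu_for_mss(1448))  # 1488
```

By that math, a clamp of 1448 corresponds to an effective MTU of 1488, four bytes under the 1492 implied by an MSS of 1452.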

The tcp-mss-clamp setting can be set to 'pmtu' which will cause firewalld to probe what the effective MTU (and by proxy MSS) number should be so you don't have to hard-code. And yet, here I am, hard-coding because crossing my own router seems to require an extra 4 bytes. I don't know why, and that angers me. Packet traces from the router itself show MSS of 1452 working fine, but that provably doesn't work from behind my router.

Whatever. It works now, which is what matters, and now I'm contributing this nugget back to the internet.

What my CompSci degree got me

By SysAdmin1138 on January 20, 2017 9:00 AM

The "what use is a CSci degree" meme has been going around again, so I thought I'd interrogate what mine got me.

First, a few notes on my career journey:

Elected not to go to grad-school. Didn't have the math for a masters or doctorate.
Got a job in helpdesk, intending to get into Operations.
Got promoted into sysadmin work.
Did some major scripting as part of Y2K remediation, first big coding project after school.
Got a new job, at WWU.
Microsoft released PowerShell.
Performed a few more acts of scripting. Knew I so totally wasn't a software engineer.
Managed to change career tracks into Linux. Started learning Ruby as a survival mechanism.
Today: I write code every day. Still don't consider myself a 'software engineer'.

Elapsed time: 20ish years.

As it happens, even though my career has been ops-focused I still got a lot out of that degree. Here are the big points.

Continue reading What my CompSci degree got me.

Redundancy in the Cloud

By SysAdmin1138 on September 26, 2014 4:38 AM

Strange as it might be to contemplate, imagine what would happen if AWS went into receivership and was shut down to liquidate assets. What would that mean for your infrastructure? Project? Or even startup?

It would be pretty bad.

Startups have been deploying preferentially on AWS or other Cloud services for some time now, in part due to venture-capitalist push to not have physical infrastructure to liquidate should the startup go *pop* and to scale fast should a much desired rocket-launch happen. If AWS shut down fully for, say, a week, the impact to pretty much everything would be tremendous.

Or what if it was Azure? Fully debilitating for those that are on it, but the wide impacts would be less.

Cloud vendors are big things. In the old physical days we used to deal with the all-our-eggs-in-one-basket problem by putting eggs in multiple places. If you're on AWS, Amazon is very big about making sure you deploy across multiple Availability Zones and helping you become multi-region in the process if that's important to you. See? More than one basket for your eggs. I have to presume Azure and the others are similar, since I haven't used them.

Do you put your product on multiple cloud-vendors as your more-than-one-basket approach?

It isn't as easy as it was with datacenters, that's for sure.

This approach can work if you treat the Cloud vendors as nothing but Virtualization and block-storage vendors. The multiple-datacenter approach worked in large part because colos sell only a few things that impact the technology (power, space, network connectivity, physical access controls), though pricing and policies may differ wildly. Cloud vendors are not like that, they differentiate in areas that are technically relevant.

Do you deploy your own MySQL servers, or do you use RDS?
Do you deploy your own MongoDB servers, or do you use DynamoDB?
Do you deploy your own CDN, or do you use CloudFront?
Do you deploy your own Redis group, or do you use SQS?
Do you deploy your own Chef, or do you use OpsWorks?

The deeper down the hole of Managed Services you dive, and Amazon is very invested in pushing people to use them, the harder it is to take your toys and go elsewhere. Or run your toys on multiple Cloud infrastructures. Azure and the other vendors are building up their own managed service offerings because AWS is successfully differentiating from everyone else by having the widest offering. The end-game here is to have enough managed services offerings that virtual private servers don't need to be used at all.

Deploying your product on multiple cloud vendors requires either eschewing managed-services entirely, or accepting greater management overhead due to very significant differences in how certain parts of your stack are managed. Cloud vendors are very much Infrastructure-as-Code, and deploying on both AWS and Azure is like deploying the same application in Java and .NET; it takes a lot of work, the dialect differences can be insurmountable, and the expertise required means different people are going to be working on each environment which creates organizational challenges. Deploying on multiple cloud-vendors is far harder than deploying in multiple physical datacenters, and this is very much intentional.

It can be done, it just takes drive.

New features will be deployed on one infrastructure before the others, and the others will follow on as the integration teams figure out how to port it.
Some features may only ever live on one infrastructure as they're not deemed important enough to go to all of the effort to port to another infrastructure. Even if policy says everything must be multi-infrastructure, because that's how people work.
The extra overhead of running in multiple infrastructures is guaranteed to become a target during cost-cutting drives.

The ChannelRegister article's assertion that AWS is now in "too big to fail" territory, and thus requiring governmental support to prevent wide-spread industry collapse, is a reasonable assertion. It just plain costs too much to plan for that kind of disaster in corporate disaster-response planning.

Barriers to internal IPv6 deployment: human factors

By SysAdmin1138 on February 24, 2014 4:40 AM

While the push for IPv6 at the Internet edge is definitely there, the push for internal adoption is not nearly as strong. In the absence of a screaming crisis or upper-management commands to push things along, it is human-factors that will prevent such a push. I'm going to go into a few.

Continue reading Barriers to internal IPv6 deployment: human factors.

NetWare Retrospective Part 3: Network protocol migration

By SysAdmin1138 on January 16, 2014 5:00 AM

Worried about the IPv4 to IPv6 migration?

NetWare users had a similar migration when Novell finally got off of IPX and moved to native TCP/IP with the release of NetWare 5.0 on or around 1999. We've done it before. Like the IPv6 transition, it was reasons other than "because it's a good idea" that pushed for the retirement of IPX from the core network. Getting rid of old networking protocols is hard and involves a lot of legacy, so they stick around for a long, long time.

As it happens IPv6 is spookily familiar to old IPX hands, but better in pretty much every way. It's what Novell had in mind back in the 80's, but done right.

Dynamic network addressing that doesn't require DHCP.
A mechanism for whole-network announcements (SAP in IPX, various multicast methods for IPv6)
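As one illustration of addressing without DHCP, here is a minimal Python sketch of how an IPv6 host can derive a link-local address from its own MAC address using the modified EUI-64 scheme. Many current systems use randomized or privacy-oriented interface identifiers instead, so treat this as one possible method, not the only one. The function name and the sample MAC are made up for the example.

```python
import ipaddress


def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Build an fe80::/64 address from a MAC using modified EUI-64."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the two halves of the MAC.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    prefix = bytes.fromhex("fe80") + bytes(6)  # the 64-bit fe80:: prefix
    return ipaddress.IPv6Address(prefix + interface_id)


print(link_local_from_mac("00:11:22:33:44:55"))  # fe80::211:22ff:fe33:4455
```

No server hands that address out; the host computes it from information it already has.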

Anyway, you have a network protocol you need to eventually retire, but pretty much everything uses it. What do you do? Like the stages of grief, there is a progression at work here:

Ignore it. We're using the old system just fine, it's going to work for the foreseeable future, no reason to migrate.
On by default, but disabled manually. The installer asks for the new stuff, but we just turn it off as soon as the system is up. We're not migrating yet.
The WAN link doesn't support the old stuff. Um, crap. Tunnel the old stuff over the new stuff for that link and otherwise... continue to not migrate.
Clients go on-by-default, but disabled manually. New clients are supporting the new stuff, but we disable it manually when we push out new clients. We're not migrating.
Clients get trouble related to protocol negotiation. Thanks to the tunnel there is new stuff out there and clients are finding it, but can't talk to it. Which is creating network delays and causing support tickets. Find ways to disable protocol discovery, push that out to clients.
Internal support says all the manual changes are harshing their workflow, and can we please migrate since everything supports it now anyway. Okay, maybe we can go dual stack now.
Network team asks if they can turn off the old stuff since everything is also using the new stuff. Say no, and revise deploy guides to start disabling the old stuff on clients but keep it on servers just in case.
Network team asks again since the networking vendor has issued a bulletin on this stuff. Audit servers to see if there is any old stuff usage. Find that the only usage is between the servers themselves and some really old, extremely broken stuff. Replace the broken stuff, turn off the old stuff stack on servers.
Migration complete.

At WWU we finished our IPX to IP migration by following this road and it took us something like 7 years to do it.

Ask yourself where you are in your IPv6 implementation. At WWU when I left we'd gotten to step 5 (but didn't have a step 3).

I've done this before, and so have most old NetWare hands. Appeals to best practices and address-space exhaustion won't work as well as you'd hope; feeling the pain of the protocol transition does. Just like we're seeing right now. Migration will happen after operational pain is felt, because people are lazy. We're going to have RFC1918 IPv4 islands hiding behind corporate firewalls for years and years to come, with full migration only happening after devices stop supporting IPv4 at all.

The IPX transition was a private-network only transition since it was never transited over the public Internet. The IPv6 transition is Internet wide, but there are local mitigations that will allow local v4 islands to function for a long, long time. I know this, since I've done it before.

An unexpectedly long evening.

By SysAdmin1138 on August 19, 2013 7:38 AM

Friday evening that is.

Right before I left for the day I noticed my computer lost network. Seeing as it's directly connected to a switch, this was surprising. When I bipped into the utility room to see what was going on, I found the switch in reboot mode and a fellow employee behind the rack doing perfectly legitimate business things.

Perfectly legitimate business things that over the course of a year or so had managed to work the power cable out of the Ethernet switch. We didn't get one with redundant power supplies, it's just the office network, not critical like our actual revenue systems, so this caused a switch reboot.

It didn't come up. Crap.

Very, very happily I'd already figured out what combination of serial cable and minicom settings I needed to talk to this switch over the console port so I was able to plug in and see WTF was going wrong.

Error, /cfa0/boot.ini corrupted; please reboot to console and repair.

Bugger.

Happily, I already had software images on that laptop so I proceeded to set my baud rate to 115,200 and uploaded a new one via XModem. Since I was not doing this at 9600 baud, this only took about 10 minutes for an 8.2MB file.
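For a rough sense of why the baud rate matters, here is a back-of-the-envelope sketch in Python. It assumes 8N1 serial framing (10 bits on the wire per byte), reads "8.2MB" as decimal megabytes, and ignores XModem's own block overhead and retransmits, so it is a best-case line-rate estimate and not a measurement.

```python
def transfer_minutes(size_bytes: float, baud: int, bits_per_byte: int = 10) -> float:
    """Best-case minutes to push size_bytes over a serial line.

    bits_per_byte defaults to 10, assuming 8N1 framing
    (1 start bit + 8 data bits + 1 stop bit).
    """
    bytes_per_second = baud / bits_per_byte
    return size_bytes / bytes_per_second / 60


image_size = 8.2e6  # 8.2MB, assuming decimal megabytes

print(f"115200 baud: {transfer_minutes(image_size, 115200):.0f} minutes")
print(f"  9600 baud: {transfer_minutes(image_size, 9600) / 60:.1f} hours")
```

Under those assumptions the fast setting lands in the neighborhood of ten-ish minutes, while 9600 baud would take well over two hours.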

Software image corrupted.

Bugger. Looking around the file-system I saw a strange directory in there:

/cfa0/.Trash-nebeker/

Huh. A ".Trash-$Username" folder is dropped by Gnome2 on removable media if something is deleted on it. How in blazes did that get onto a factory firmware image? A bit of Googling brought me to a certain HP Customer Advisory. Yep, looks like from 2009 to May of 2010 that directory was indeed baked into switches, and was definitely causing problems.

Since my switch was in the "switch has already failed to reboot or failed on software update" state, I had to follow that workflow. Running the given lshw command and removing the bad boot.ini file did allow the switch to boot into its normal state. I tried updating to "a switch software version which automatically removes the extraneous files", but no matter how I tried to update the firmware I got

Software image corrupted.

USB, TFTP, even another XModem upload. Same thing, every time, from fresh downloads even. Clearly, this option wasn't going to work for me, so I had to go to the "show tech custom" script they mention.

Fraught with peril.

Continue reading An unexpectedly long evening..

The push for IPv6

By SysAdmin1138 on March 27, 2013 2:30 PM | 2 Comments

This is inspired by last night's LOPSA DC meeting. The topic was IPv6 and we had a round-table.

One of the big questions brought up was, "What's making me go IPv6?"

The stock answer to that is, "IPv4 addresses are running out, we'll have to learn at some point or be left behind."

That's all well and good, but for us? Most of us are working in, for, or with the US Government, an entity that is not going to be experiencing v4 address scarcity any time soon. What is going to push us to go v6 (other than the already existing mandate to have support that is)?

In my opinion, it'll come from the edges. IPv6 is a natural choice for rapidly expanding networks such as wireless networks, and extremely large networks like Comcast/Verizon run for their kit. These are two areas that sysadmins in general don't deal with much at all (VPN and mobile-access being the two major exceptions).

If your phone has an IPv6 address and accesses the IPv4 internet through a carrier-grade NAT device, you may never notice. Joe Average User is going to be even less likely to notice so long as that widget just works. Once v6 is in the hands of the "I don't care how it works so long as it works" masses, it'll start becoming our problem.

Once having a native v6 site means slightly better perceived mobile performance (those DNS lookups do cause a bit of latency you know), you can guarantee that hungry startups are going to start pushing v6 from launch. Once that ecosystem develops it'll start dragging the entrenched legacy stuff (the, er, government) along with it. Some agency sites are very sensitive to performance perception and will adapt early. Others only put their data online because they were told to and will only move when the pain gets to be too much.

Business-to-business links (or those between .gov agencies, and their .com suppliers) will likely stay v4 for a very, very long time. Those will also be subject to pain-based mitigation strategies.

But the emergence of v6 on mobile will likely push a lot of us to get v6 to at least our edges. Internal use may be a long time coming, but when it does show up it'll be because of the need to connect with others.

So why DO VPN clients use UDP?

By SysAdmin1138 on October 7, 2012 8:26 AM | 2 Comments

I've wondered why IPSec VPNs do that for a while, but hadn't taken the time to figure out why that is. I recently took the time.

The major reason comes down to one very big problem: NAT traversal.

When IPSec VPNs came out originally, I remember there being many problems with the NAT gateways most houses (and hotels) had. It eventually cleared up but I didn't pay attention; it wasn't a problem that was in my area of responsibility, so I didn't do any troubleshooting for it.

There are three problems IPSec VPNs encounter with NAT gateways. One is intrinsic to NAT, the other two are specific to some implementations of NAT.

IPv4 IPSec traffic uses IP Protocol 50, which is neither TCP (proto 6) nor UDP (proto 17), and protocol 50 uses no ports on the packet. Therefore, a VPN concentrator can only support a single VPN client behind a specific NAT gateway. This can be a problem if four people from the same company are staying in the same hotel for a conference.
IPv4 IPSec traffic uses IP Protocol 50, which is neither TCP nor UDP. Some NAT gateways drop anything that isn't TCP or UDP, which will be a problem for IPSec VPNs.
NAT gateways rewrite certain headers and play games with packet checksums, which IPSec doesn't like. So if IPSec is going to tunnel via TCP or UDP, there will be issues.
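The first problem can be sketched with a toy NAT table in Python. This is a deliberately simplified model, not how any particular gateway is implemented: it assumes every client is talking to the same concentrator (so the remote address is ignored) and that the gateway tells return traffic apart only by protocol and port. The class and names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Key: (IP protocol number, public-side port, or None when the protocol has no ports)
# Value: the inside address that replies get forwarded to.
NatKey = Tuple[int, Optional[int]]

TCP, UDP, ESP = 6, 17, 50


class ToyNat:
    """Toy NAT gateway that tracks one inside host per (protocol, port)."""

    def __init__(self) -> None:
        self.table: Dict[NatKey, str] = {}
        self.next_port = 40000  # arbitrary starting port for the sketch

    def outbound(self, inside_ip: str, protocol: int) -> bool:
        """Record a mapping for an outbound flow; False if it cannot be tracked."""
        if protocol in (TCP, UDP):
            key = (protocol, self.next_port)  # a fresh port keeps flows distinct
            self.next_port += 1
        else:
            key = (protocol, None)  # no ports: every flow lands on the same key
        if key in self.table and self.table[key] != inside_ip:
            return False  # replies could not be told apart
        self.table[key] = inside_ip
        return True


nat = ToyNat()
print(nat.outbound("192.168.1.10", ESP))  # True
print(nat.outbound("192.168.1.11", ESP))  # False
print(nat.outbound("192.168.1.10", UDP))  # True
print(nat.outbound("192.168.1.11", UDP))  # True
```

With no port to key on, the second protocol 50 client collides with the first, while the same two clients coexist happily once the traffic is carried over UDP.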

These are some of the reasons SSL VPNs became popular.

This is where RFC 3715 comes in. It's titled, "IPsec-Network Address Translation (NAT) Compatibility Requirements" oddly enough. It turns out that packet checksums are not required for IPv4 UDP packets, which makes them a natural choice to tunnel an IPSec VPN through a stupid NAT gateway. The VPN concentrator pulls the IPSec packet out of the UDP packet, and thanks to the cryptographic nature of IPSec it already has ways to detect packet corruption and will handle that (and any required retransmits) at the IPSec layer.

Continue reading So why DO VPN clients use UDP?.

Fixing the other home network

By SysAdmin1138 on July 15, 2012 1:38 PM | 1 Comment

Part of the blog-silence the past few weeks is because I was on vacation.

Nice, nice vacation.

However, it was at my parents' place and as with any technically savvy sprog who comes home, there be questions. In my case, it was an ongoing problem of slowness with the network. The first few days I didn't have time to delve, but I did eventually get into it.

The symptoms:

Wifi download speeds were about 75KB/s, and upload about 3x that. Yes, upload was faster. Yes, my 4G phone had faster downloads.
Wired speeds were at the rated speeds for the broadband connection.

Clearly, something was wrong with the wireless.

My little spectrum analyzer showed remarkably clean airwaves for a residential area. There was some congestion on their frequency, but moving it didn't help much.

And then I noticed something when I ran iwconfig on my laptop:

wlan0     IEEE 802.11abgn  ESSID:"dadhome"  
          Mode:Managed  Frequency:2.462 GHz  Access Point: 06:1D:61:FF:BB:00   
          Bit Rate=54 Mb/s   Tx-Power=15 dBm   
          Retry  long limit:7   RTS thr:off   Fragment thr:off
          Encryption key:on
          Power Management:off
          Link Quality=70/70  Signal level=-31 dBm  
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:2942  Invalid misc:1892   Missed beacon:0

Specifically, that "Tx excessive retries" statistic was incrementing, and I'd never seen that other than zero before. Most odd.
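If you want to watch that counter without eyeballing the output each time, a small Python sketch like the following could pull the number out of saved iwconfig output. The sample text is trimmed from the output above; the function names are hypothetical, and the regular expression assumes the counter is labelled exactly as shown here.

```python
import re

SAMPLE = """\
wlan0     IEEE 802.11abgn  ESSID:"dadhome"
          Bit Rate=54 Mb/s   Tx-Power=15 dBm
          Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
          Tx excessive retries:2942  Invalid misc:1892   Missed beacon:0
"""


def tx_excessive_retries(iwconfig_text: str) -> int:
    """Pull the 'Tx excessive retries' counter out of iwconfig output."""
    match = re.search(r"Tx excessive retries:(\d+)", iwconfig_text)
    if match is None:
        raise ValueError("no 'Tx excessive retries' counter found")
    return int(match.group(1))


def retries_delta(before: str, after: str) -> int:
    """How much the counter grew between two iwconfig snapshots."""
    return tx_excessive_retries(after) - tx_excessive_retries(before)


print(tx_excessive_retries(SAMPLE))  # 2942
```

Taking two snapshots a few seconds apart and passing them to retries_delta shows whether the counter is still climbing.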

The router was a Cisco/Linksys, and a pretty new one at that. The firmware was latest, so that wasn't it. After a bit of poking about I found out that I could get vastly better throughput by setting the Wifi to G-Only, instead of B/G/N. In fact, setting it to N-only made the problem worse! Clearly, this router's N implementation is a bit off.

Wifi wasn't at the broadband speed yet, but it was still a vast improvement. That's where I left it, and they're happy.

New racy TLDs, more defensive domain-buys

By SysAdmin1138 on June 11, 2012 4:44 AM | 1 Comment

I've been talking about this one for a while around the new .xxx top-level-domain. Simply put, some organizations such as my old employer consider a certain string of characters to be their trademark regardless of what follows the . in the name. This is why WWU had, at the time I left, somewhere in the mid-twenties of domains that all redirect to wwu.edu. The same will happen with .xxx.

And also with .sex, .adult and .porn.

Each new TLD means yet another domain to buy defensively.

For the extremely cash-strapped non-profit, now entering year 5 of shrinking or stagnating budgets, such forced trademark defense expenses are highly resented.
