Sniffing packets

When I first started this sysadmin gig 'round about 1997, Windows-based packet sniffers were still in their infancy. In fact, the word 'sniffer' was (and probably still is) a trademarked term for the software and hardware package for, er, sniffing packets. Sniffer. So when I needed to figure out a problem on the network, I went to the Network Guys, who plugged their Sniffer into any available port on the 10baseT hub I needed analysis on and went to work. They told me what was wrong. Like a JetDirect card transmitting packets whenever it sensed a packet on the wire, thus bringing the network to its knees. Things like that.

Time passed and Sniffer was bought by Network Associates. Who then added a zero to the price because that package really did have a lock on the market. The next rev then more than doubled the already inflated price. So when it came time to renew/upgrade because our Sniffer couldn't handle Fast Ethernet, the price was eye-watering. So. On came the free sniffers.

At first I was using Ether Boy, a now long-lost packet sniffer. But eventually I found Ethereal (now Wireshark), and I went to work. By the time I left my old job in 2003 I already had a rep for knowing WTF I was looking at, and the network guys didn't bat an eyelash when I asked for a span port. This ability was very handy when diagnosing slow Novell logins.

Fast forward to now. Right now I'm trying to figure out why the heck a certain NetWare server is so slow talking to the Data Protector media agent. It isn't obviously a TSA problem, but I've had problems with DP and NW talking to each other at the TCP level so that's where I'm looking now. Unfortunately for me, the desktop-grade GigE NIC I have on the span isn't, shall we say, resourced enough to sniff a full GigE stream without at least a few buffer overruns. So I'm not getting ALL of the packets.

When I asked for the span port, the telecom guy said he greatly respected my ability to dig into TCP issues. And said it in the voice of, "I think you're better at that kind of troubleshooting than we are." Which is a bit disconcerting to hear from your telecom router-gods. But there it is. What it means is that I can't very well ask for help interpreting these traces.

So far I've been able to determine that there is something hinky going on with network delays. There are some 200ms delays in there, which hints strongly at a failed protocol negotiation somewhere. But there are some rather longer delays, and it could be due to window size negotiation problems. Server 2008, the media-agent server, has a much newer TCP/IP stack than NetWare so it is entirely possible that they just don't work well together. I don't understand that quite well enough to manually deconstruct what's going on, so that's what I'm googling on right now.
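For the curious, hunting for those delays doesn't have to be done by eyeball. Here's a minimal sketch of the idea in Python, assuming the packet timestamps for one conversation have already been exported from the sniffer as a plain list of seconds. The function names and the 0.5 second cutoff between "roughly 200ms" stalls and "rather longer" ones are my own made-up illustration, not anything the tools give you.

```python
def find_gaps(timestamps, threshold=0.2):
    """Return (packet_index, gap_seconds) for every packet that arrived
    at least `threshold` seconds after the packet before it."""
    gaps = []
    for i in range(1, len(timestamps)):
        delta = timestamps[i] - timestamps[i - 1]
        if delta >= threshold:
            gaps.append((i, delta))
    return gaps


def split_gaps(gaps, long_threshold=0.5):
    """Split gaps into the ~200ms class and the longer class.
    The 0.5s boundary is an arbitrary assumption for illustration."""
    short = [(i, g) for i, g in gaps if g < long_threshold]
    long_ = [(i, g) for i, g in gaps if g >= long_threshold]
    return short, long_


if __name__ == "__main__":
    # Hypothetical timestamps (seconds since start of capture), not real data.
    sample = [0.000, 0.001, 0.002, 0.205, 0.206, 1.250]
    short, long_ = split_gaps(find_gaps(sample))
    for i, g in short:
        print(f"packet {i}: {g * 1000:.0f} ms stall")
    for i, g in long_:
        print(f"packet {i}: {g * 1000:.0f} ms stall (long)")
```

Keep in mind that with dropped packets in the capture, a gap in the trace is only a hint that something stalled on the wire, not proof.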

And why Saturday? Because of course the volume that's doing this is our single largest, and it is on the weekend that it is in the failed state, when I can pry the hood off and look. Who knows, I may resort to posting packets and crowdsourcing the problem.

Update 12/23/09: Found it.

2 Comments

It sounds like you're on the right track, IMHO.

I'd take a look at some of the newer features in the Server 2008 TCP stack, referenced here: http://blogs.dirteam.com/blogs/sanderberkouwer/archive/2008/05/15/backward-compatible-networking-with-server-core.aspx

Specifically, I'd give a look at the "Receive Window auto-tuning" feature, as this has caused a few other random delay issues I've heard of.