The dirty secret of systems administration

When you break the click-wrap around a new piece of software, you agree to certain terms of service. Right there in the ToS is usually a paragraph or five born of the lesson IBM learned when Compaq figured out how to build an IBM-compatible BIOS. Which is to say: Thou Shalt Not Reverse Engineer This Software. Perfectly understandable; they want to be able to sue into oblivion anyone who tries that kind of thing (again).

Unfortunately, sometimes... you just have to do the reverse engineering. And sysadmins are kinda the people on the front line of that.

Here's how it works:

You have a piece of software. It was purchased, it came as a binary blob, and it was installed. It has a user interface, and maybe an API. It does its job. Your users are happy. The black box hums the way it should hum. You are happy.

Then it breaks. It stops doing what it should. The manual and online docs are useless. Your users are unhappy. You are unhappy.

At this point, you have a choice. You can call support and have THEM deal with it. This is the ideal option, since that's how this model of software is supposed to work. But what if the department that bought this software didn't buy the support contract, because they're cheap bastards (sorry, because they blew the budget just getting the software)? What if there is no support contract, the vendor charges per incident, and there's no budget for that either?

The pressure is still on you to fix the damned thing.

So you blow right past the NO USER SERVICEABLE PARTS INSIDE -- VOIDS WARRANTY IF BROKEN seal and try to figure out how it works so you can fix it. Or prove to the financial powers that be that it's really in their best interests to pay for support in this case. Enter now the land of reverse engineering.

There are a variety of tools to use in your quest to figure out WTF. A small list:

  • Packet captures to figure out how it talks on the wire.
  • Utilities like strace to figure out what files it's looking at, and what rights it expects.
  • Debugging tools to tease out what system functions it calls, or to isolate where in its code the fault lies.
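
As a rough illustration of the strace step, here is a small Python sketch that pulls file paths and results out of strace output, so failed opens (such as EACCES, a hint about the permissions the program expects) stand out. It assumes lines in the usual `openat(AT_FDCWD, "/path", FLAGS) = fd` form, for example from something like `strace -f -e trace=openat -o trace.log <command>`. The sample lines and paths are made up, and real traces carry things this sketch ignores (timestamps, unfinished/resumed calls, escaped quotes in paths).

```python
import re

# Matches lines such as:
#   openat(AT_FDCWD, "/etc/app/app.conf", O_RDONLY) = 3
#   4242  openat(AT_FDCWD, "/var/lib/app/license.key", O_RDONLY) = -1 EACCES (Permission denied)
# An optional leading PID (as written by `strace -f` into a log file) is allowed.
OPENAT_RE = re.compile(
    r'^(?:\d+\s+)?openat\([^,]+,\s*"(?P<path>[^"]*)",\s*(?P<flags>[^,)]+)'
    r'.*\)\s*=\s*(?P<result>-?\d+)(?:\s+(?P<errno>E[A-Z]+))?'
)


def parse_openat(lines):
    """Return (path, flags, succeeded, errno) for every openat call found."""
    results = []
    for line in lines:
        m = OPENAT_RE.match(line.strip())
        if not m:
            continue  # some other syscall, or a shape this sketch doesn't handle
        succeeded = int(m.group("result")) >= 0
        results.append((m.group("path"), m.group("flags").strip(), succeeded, m.group("errno")))
    return results


# Hypothetical trace excerpt for illustration only.
sample = [
    'openat(AT_FDCWD, "/etc/app/app.conf", O_RDONLY) = 3',
    '4242  openat(AT_FDCWD, "/var/lib/app/license.key", O_RDONLY) = -1 EACCES (Permission denied)',
    'read(3, "...", 4096) = 120',
]

for path, flags, ok, err in parse_openat(sample):
    print(path, flags, "ok" if ok else err)
```

Filtering for `succeeded == False` is usually the quickest way to spot a missing file or a permissions problem.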

These will give you a good idea how it interacts with its environment, which in turn gives you clues about how it runs internally. This can help solve problems.

As it happens, I've been doing some of this work over the past week or so. Only I don't have a black box; I have actual software we wrote ourselves! I was asked to isolate performance problems in a dynamic web app so the software engineers can better decide where to spend their optimization time. As I'm still very much a novice in the programming language used here, a code audit wasn't worth my time, so I exercised my reverse-engineering skills instead.

And found some good stuff.

A settings page was taking a long time to render. When I examined a packet trace of the server's activity while the page loaded, one thing stood out: a pattern of SQL queries that looked like this:

SELECT firstname FROM settings WHERE user_id=42
SELECT lastname FROM settings WHERE user_id=42
SELECT avatar_file FROM settings WHERE user_id=42
SELECT city FROM settings WHERE user_id=42
SELECT state FROM settings WHERE user_id=42
[...]
SELECT paternal_grandmother FROM settings WHERE user_id=42
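
A pattern like this is easy to miss when it's buried in thousands of captured lines. A small Python sketch, assuming the SQL text has already been pulled out of the capture into one statement per line, can group single-column SELECTs by table and WHERE clause and flag any group that keeps hitting the same row. The regex only understands the simple `SELECT col FROM table WHERE ...` shape seen here, the threshold of 3 is an arbitrary choice, and the `themes` query in the sample log is hypothetical.

```python
import re
from collections import defaultdict

# Only the simple "SELECT <one column> FROM <table> WHERE <condition>" shape is recognised.
SINGLE_COL_RE = re.compile(
    r"^\s*SELECT\s+(?P<col>\w+)\s+FROM\s+(?P<table>\w+)\s+WHERE\s+(?P<where>.+?)\s*;?\s*$",
    re.IGNORECASE,
)


def find_column_at_a_time(statements, threshold=3):
    """Return {(table, where): [columns]} for groups with at least `threshold` queries."""
    groups = defaultdict(list)
    for stmt in statements:
        m = SINGLE_COL_RE.match(stmt)
        if m:
            key = (m.group("table").lower(), m.group("where"))
            groups[key].append(m.group("col"))
    return {key: cols for key, cols in groups.items() if len(cols) >= threshold}


log = [
    "SELECT firstname FROM settings WHERE user_id=42",
    "SELECT lastname FROM settings WHERE user_id=42",
    "SELECT avatar_file FROM settings WHERE user_id=42",
    "SELECT city FROM settings WHERE user_id=42",
    "SELECT name FROM themes WHERE theme_id=7",  # hypothetical unrelated query
]

for (table, where), cols in find_column_at_a_time(log).items():
    print(f"{len(cols)} single-column SELECTs on {table} WHERE {where}: {', '.join(cols)}")
```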

It was reading the full row of the 'settings' table for user_id=42 one column, and one SELECT statement, at a time. This stage alone accounted for 1.2 seconds of the 9-second total load time. All that data could be fetched with a SINGLE select statement naming every field needed; instead it's being fetched one field at a time.
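
To make the difference concrete, here is a Python sketch using the standard-library `sqlite3` module and an in-memory database. The table is a hypothetical four-column cut-down of `settings` with made-up data; the sketch fetches the same row both ways and counts executed statements with `Connection.set_trace_callback`. Against a real networked database each statement is also a round trip, which is where the time goes; in-memory numbers say nothing about the 1.2 seconds measured above.

```python
import sqlite3

# Hypothetical four-column subset of the real settings table.
COLUMNS = ["firstname", "lastname", "city", "state"]

# isolation_level=None puts the connection in autocommit mode, so the trace
# below sees only our SELECTs (no implicit BEGIN statements).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE settings (user_id INTEGER PRIMARY KEY, "
    + ", ".join(f"{c} TEXT" for c in COLUMNS)
    + ")"
)
conn.execute(
    "INSERT INTO settings VALUES (?, ?, ?, ?, ?)",
    (42, "Jane", "Doe", "Springfield", "IL"),
)

statements = []
conn.set_trace_callback(statements.append)  # called once per executed statement


def fetch_one_column_at_a_time(user_id):
    # The pattern seen in the capture: one SELECT per column.
    # Column names come from the constant list above, never from user input.
    row = {}
    for col in COLUMNS:
        (row[col],) = conn.execute(
            f"SELECT {col} FROM settings WHERE user_id=?", (user_id,)
        ).fetchone()
    return row


def fetch_whole_row(user_id):
    # One SELECT naming every column the page needs.
    values = conn.execute(
        f"SELECT {', '.join(COLUMNS)} FROM settings WHERE user_id=?", (user_id,)
    ).fetchone()
    return dict(zip(COLUMNS, values))


statements.clear()
slow = fetch_one_column_at_a_time(42)
slow_count = len(statements)

statements.clear()
fast = fetch_whole_row(42)
fast_count = len(statements)

print("same data:", slow == fast, "| statements:", slow_count, "vs", fast_count)
```

Both functions return identical data; only the number of statements differs, and that number grows with every column the page displays.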

WTF?

This is an artifact of a library being used to build the settings page.

What's most likely happening here is that the settings page is a list of defined text-entry objects, each linked to a specific field in the settings table. The page itself is written in a higher-level, object-oriented programming language. When it comes time to render the page, the renderer pulls each object's value individually, which gives us the umpty separate SELECT statements.
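
A minimal Python sketch of that suspected mechanism, under loud assumptions: `TextField` and `SettingsPage` are hypothetical stand-ins, not the real library, and the database is an in-memory `sqlite3` table with made-up data. Each field object knows only its own column and fetches its value when rendered, so a page of N fields issues N SELECTs against the same row.

```python
import sqlite3
from html import escape


class TextField:
    """Hypothetical form widget bound to a single column of the settings table."""

    def __init__(self, column):
        self.column = column

    def render(self, conn, user_id):
        # The widget loads only its own value, so every field costs one query.
        # Column names come from our own field definitions, never from user input.
        (value,) = conn.execute(
            f"SELECT {self.column} FROM settings WHERE user_id=?", (user_id,)
        ).fetchone()
        return f'<input name="{escape(self.column)}" value="{escape(str(value))}">'


class SettingsPage:
    """Hypothetical page: an ordered list of fields, each rendered on its own."""

    def __init__(self, fields):
        self.fields = fields

    def render(self, conn, user_id):
        return "\n".join(field.render(conn, user_id) for field in self.fields)


conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit: no implicit BEGINs
conn.execute(
    "CREATE TABLE settings (user_id INTEGER PRIMARY KEY, "
    "firstname TEXT, lastname TEXT, city TEXT)"
)
conn.execute("INSERT INTO settings VALUES (42, 'Jane', 'Doe', 'Springfield')")

queries = []
conn.set_trace_callback(queries.append)  # record every statement SQLite executes

page = SettingsPage([TextField("firstname"), TextField("lastname"), TextField("city")])
markup = page.render(conn, 42)
print(len(queries), "queries issued for", len(page.fields), "fields")
```

Nothing in either class is wrong on its own; the cost only appears when the page composes them.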

What can be done to fix this? I'm not sure; I'm not a software engineer. Perhaps the settings-page library isn't doing enough to optimize its database calls.
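
One common way form libraries avoid this, offered purely as a possibility and not as something this particular library is known to support, is for the page to collect every field's column first, fetch the row once, and hand each field its value. A Python sketch with hypothetical `TextField` and `SettingsPage` classes and an in-memory `sqlite3` table holding made-up data:

```python
import sqlite3
from html import escape


class TextField:
    """Hypothetical form widget bound to one column; it never queries by itself."""

    def __init__(self, column):
        self.column = column

    def render(self, value):
        return f'<input name="{escape(self.column)}" value="{escape(str(value))}">'


class SettingsPage:
    """Hypothetical page that batches all of its fields' columns into one SELECT."""

    def __init__(self, fields):
        self.fields = fields

    def render(self, conn, user_id):
        columns = [field.column for field in self.fields]
        # Column names come from our own field definitions, never from user input.
        row = conn.execute(
            f"SELECT {', '.join(columns)} FROM settings WHERE user_id=?", (user_id,)
        ).fetchone()
        values = dict(zip(columns, row))
        return "\n".join(field.render(values[field.column]) for field in self.fields)


conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit: no implicit BEGINs
conn.execute(
    "CREATE TABLE settings (user_id INTEGER PRIMARY KEY, "
    "firstname TEXT, lastname TEXT, city TEXT)"
)
conn.execute("INSERT INTO settings VALUES (42, 'Jane', 'Doe', 'Springfield')")

queries = []
conn.set_trace_callback(queries.append)  # record every statement SQLite executes

page = SettingsPage([TextField("firstname"), TextField("lastname"), TextField("city")])
markup = page.render(conn, 42)
print(len(queries), "query issued for", len(page.fields), "fields")
```

The design choice is simply moving data loading up from the individual widget to the page, which is the one object that knows every field it will need.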

Anyway, using my super-powers for good is always a nice feeling.