Speeeeeed with data!

All values Megs/Minute:
[Using TSAFS]
StuSrv1/Stu1  FacSrv1/Share1  FacSrv2/User3  FacSrv2/Share3  StuSrv2/Class1
         908            2106           1565            2492             846
         930            2105           1536            2447             885
         925            2070           1582            2519

[Using TSA600]
StuSrv1/Stu1  FacSrv1/Share1  FacSrv2/User3  FacSrv2/Share3  StuSrv2/Class1
         238             875            625            1160             221
         265             904            656            1160             218
         272             895            670            1184

The format of the labels is nodename/volume. Each dataset contained more data than any known buffer in the data path, in an effort to minimize the acceleration those buffers provide. In most tests, subsequent runs of the same dataset were faster than earlier runs because of caching on the server itself and on the EVA back-end SAN. The data were gathered with the TSATEST utility shipping in the NW6SP4 service pack. The FacSrv2 tests are unique in that User3 was run at 5GB first, then Share3 at 5GB, in order to bust buffers where possible. The only variable between datasets was the TSA loaded and any itty-bitty file changes that may have happened during normal usage.

What this test clearly shows is that TSAFS is faster than TSA600. In every case tested, TSAFS ran at more than twice the TSA600 rate; the speed-up was north of 200% across the board.
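A quick sanity check on that claim, using the first-run rates from the two tables above (this assumes the first TSAFS column is StuSrv1/Stu1, matching the TSA600 table layout):

```python
# First-run throughput (MB/min) per dataset, taken from the tables above.
datasets = ["StuSrv1/Stu1", "FacSrv1/Share1", "FacSrv2/User3",
            "FacSrv2/Share3", "StuSrv2/Class1"]
tsafs  = [908, 2106, 1565, 2492, 846]
tsa600 = [238,  875,  625, 1160, 221]

for name, fast, slow in zip(datasets, tsafs, tsa600):
    print(f"{name:15s} {fast / slow:.2f}x")

# The smallest ratio (FacSrv2/Share3) is still above 2x, i.e. TSAFS
# runs at more than 200% of the TSA600 rate on every dataset.
ratios = [f / s for f, s in zip(tsafs, tsa600)]
print(f"minimum speed-up: {min(ratios):.2f}x")
```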

Other observations

One thing I did observe is that the FacSrv servers are faster than the StuSrv servers when it comes to TSA backups. That merited investigation, and I have come to the conclusion that the PSM file in use has an impact on speed. Spurious interrupts (displayable with "display interrupts" from the console) increment dramatically when CPQACPI.PSM (v1.06, 5/12/03) is loaded, as it is on the StuSrv series of servers. The FacSrv series of servers are old enough that they load CPQMPK.PSM instead, and the spurious interrupts on those servers are worlds lower than the numbers reported on the StuSrv series. I predict that backup speed improvements are to be had by changing which PSM we load.

It also looked like data characterization played a role in how fast things backed up. User volumes went slower than 'shared' volumes (Class1 is the student version of Share1). This may have something to do with the smaller average file size on User volumes, or perhaps the churn rate on the user volumes leads to noticeable fragmentation.

Another test, run once out of curiosity, was to back up Share1 from FacSrv1 and Share3 from FacSrv2 at the same time with TSAFS and monitor the results. Both servers backed up 5GB of data, and I recorded each server's rate as it hit the 5GB mark. When the first server hit that mark I let its backup continue, so the slower server saw the same contended I/O path throughout. The totals were:

FacSrv1/Share1 @ 1862 MB/Min
FacSrv2/Share3 @ 2065 MB/Min
Aggregate throughput of SAN I/O channel: 3927 MB/Min
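In more familiar units, that aggregate works out as follows (just unit conversion on the numbers above):

```python
facsrv1_share1 = 1862  # MB/min at the 5 GB mark
facsrv2_share3 = 2065  # MB/min at the 5 GB mark

# Both streams share the same SAN I/O channel, so the channel
# sustained the sum of the two rates.
aggregate = facsrv1_share1 + facsrv2_share3
print(aggregate)                     # 3927 MB/min
print(f"{aggregate / 60:.2f} MB/s")  # 65.45 MB/s sustained
```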

This tells me a number of things:
  • Parallel backups won't drop absolute performance below the rated tape-drive performance spec
  • The SAN storage really is fast
  • A third stream (Share2?) could be added and still probably maintain real-world network speeds.

We have some work to do. For one, we need to either get Compaq to address the spurious interrupt issue, or drop back to CPQMPK.PSM in order to get good results out of the StuSrv series. Once this is done, I hope the StuSrv series will be able to provide performance matching or besting that of the older FacSrv line.

The networking infrastructure needs attention. The maximum theoretical throughput of 100Mb Ethernet is 750 megs/minute, and every one of the TSAFS backups went faster than that. The fastest speeds observed would put a dent even in Gig Ethernet, though at those rates tape-drive latency comes into play. The current 100Mb connections are not adequate.
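The wire-speed arithmetic behind those figures, for reference:

```python
def link_mb_per_min(megabits_per_sec: float) -> float:
    """Theoretical throughput of a link, converted to MB/min."""
    return megabits_per_sec / 8 * 60  # bits -> bytes, seconds -> minutes

fast_ethernet = link_mb_per_min(100)   # 750.0 MB/min
gig_ethernet  = link_mb_per_min(1000)  # 7500.0 MB/min
print(fast_ethernet, gig_ethernet)

# Every TSAFS run beat 750 MB/min, so 100Mb links are the bottleneck;
# the best observed run (2492 MB/min) is about a third of Gig wire speed.
print(f"{2492 / gig_ethernet:.0%} of Gig Ethernet")
```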

The backup server needs to be robust. With multiple high-rate streams coming to the server, its own Gig Ethernet connection may become saturated if all four tape drives are streaming from a fast source. At speeds that high, PCI-bus contention actually becomes a factor, so the server has to be built with high I/O in mind. No "PC on steroids" here. Best case would be multiple PCI buses, or at the very minimum 64-bit PCI; neither option shows up on sub-$2000 hardware.
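To put rough numbers on that contention worry: the PCI figures below are the spec's nominal theoretical peaks, not measurements from this test, and the four-stream rate simply scales the two-stream parallel test above.

```python
pci_32_33 = 133  # MB/s, 32-bit / 33 MHz PCI nominal peak
pci_64_66 = 533  # MB/s, 64-bit / 66 MHz PCI nominal peak
gig_e     = 125  # MB/s, Gig Ethernet wire speed

# The two-stream test sustained 3927 MB/min across the pair;
# four comparable streams would roughly double that.
four_streams = 4 * (3927 / 2 / 60)  # MB/s
print(f"{four_streams:.0f} MB/s")   # ~131 MB/s

# That alone is right at the ceiling of plain 32-bit PCI, before the
# tape and NIC traffic cross the same bus; 64-bit PCI leaves headroom.
print(four_streams > gig_e, four_streams < pci_64_66)
```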

Our storage back-end is robust. The EVA on the back end can pitch data at amazing speeds. It is very nice to see rates like that, even on volumes that have had 12 months of heavy usage to fragment the bejebers out of them. We can scale this system out a lot further before running into caps.

We need to verify whether our backup software solution is compatible with TSAFS. We use BackupExec for Windows, and that answer is not quite clear yet. BackupExec for NetWare is already there, so at least part of the product line knows how to handle it. We need a decision on how to handle open files before progressing on that front. But TSAFS versus BEREMOTE is something I can't test easily.

TSAFS is the way to go, but it is only one piece of the puzzle.