September 2006 Archives
This also means that project-wise things are really quieting down around here. The first week of classes is always spent mostly on fire watch, so not a lot else gets done. It also means that after-hours, "it's broke, fix it right now," calls are sadly more common. It's amazing how easy it is to get used to not having students around; having them all back suddenly is a bit of a shock.
One amusing note. My boss is in a corner office that looks across the street at an apartment complex that houses a lot of students. He has remarked humorously at the number of wireless access points his laptop picks up. So far, at least 15 are visible.
Please note that Open Enterprise Server services currently run on SUSE Linux Enterprise Server 9. New purchases of Open Enterprise Server will not include SUSE Linux Enterprise Server 10 until it officially becomes part of Open Enterprise Server in the next release, scheduled for mid-2007.

Hmm. This tells me that what we'd be seeing at BrainShare '07 will be beta builds of OES2. March is not 'mid-2007'.
This further raises the question of what the Big Thing will be at BS-07. Last year it was SUSE 10. All. Over. The. Place. OES2 will be big for me, but I'm not convinced that Novell will give the next OES the same push it did for SLES 10. I'm a bit irked that they seem to be minimizing the file and print serving that made the company, but that's just business; file-servers don't make for profit anymore. On the other hand, I may be wrong.
The flagship products are SLES, GroupWise, Zen, and Identity Manager. Identity Manager is a big consulting driver, and still a hot technology, so that'll still get a big focus. Zen7 SP1 shipped recently enough that SP2 or even a version 8 is probably not going to happen by BrainShare time. GroupWise 7 has been out a while now, but I haven't heard any mumblings about a v8 for that product.
On the other hand, openSUSE 10.2 is in Alpha right now. According to the roadmap, 10.2 will release December 7th. What this means for SLES is unclear to me, but it could mean that beta builds of SLES 10.2 may be available at BrainShare. You can find a list of changes from 10.1 to 10.2 (for openSUSE; this isn't for SLES) here. The changes aren't terribly significant, just improvements to the X Windows environment (both Gnome and KDE) and related applications.
So no, I can't yet tell what the Big All Consuming Message will be. Eh. Time will tell.
Tags: novell, OES2
Once upon a time this was a Developer.novell.com-only option, but somewhere along the line it snuck into the ConsoleOne directory. Take a look at...
Use and enjoy.
Tags: novell, edir
There are a number of ways to get around the problem, but Microsoft has suggested a few. You can read their take on things here.
It turns out that one of the methods recommended by Microsoft is actually pretty easily done through Zen for Desktops.
Microsoft has tested the following workaround. While this workaround will not correct the underlying vulnerability, it helps block known attack vectors. When a workaround reduces functionality, it is identified in the following section.
Note The following steps require Administrative privileges. It is recommended that the system be restarted after applying this workaround. It is also possible to log out and log back in after applying the workaround; however, the recommendation is to restart the system.
To un-register Vgx.dll, follow these steps:
Click Start, click Run, type "regsvr32 -u "%ProgramFiles%\Common Files\Microsoft Shared\VGX\vgx.dll"" (without the quotation marks), and then click OK.
A dialog box appears to confirm that the un-registration process has succeeded. Click OK to close the dialog box.
Impact of Workaround: Applications that render VML will no longer do so once Vgx.dll has been unregistered.
To undo this change, re-register Vgx.dll by following the above steps. Replace the text in Step 1 with "regsvr32 "%ProgramFiles%\Common Files\Microsoft Shared\VGX\vgx.dll"" (without the quotation marks).
And this in the Parameters:
-u "%*ProgramFiles%\Common Files\Microsoft Shared\VGX\vgx.dll"
Set it to run in system impersonation and associate it how you will with a force-run and probably run-once. To undo it once the patch is out or you have confidence that your AntiVirus vendor will catch the bug, re-registering it the same way is just as easy.
Note: This is just a wild idea, not something we have running. We might, but we have several layers of approvals to get through before we push something like this out to everyone. Feel free to riff on this idea to your own needs.
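Riffing a bit further: rather than pasting the raw path and parameters into the app object, you could push a tiny wrapper script that builds the same regsvr32 command line. This is a sketch of my own invention, not anything from the Microsoft advisory; the /s switch and the %ProgramFiles% fallback are the only assumptions in it.

```python
import os

def vgx_command(register):
    """Build the regsvr32 command line for (un)registering vgx.dll.

    Reads %ProgramFiles% from the environment, falling back to the
    usual default when it isn't set (e.g. when testing off-Windows).
    """
    program_files = os.environ.get("ProgramFiles", r"C:\Program Files")
    dll = os.path.join(program_files,
                       "Common Files", "Microsoft Shared", "VGX", "vgx.dll")
    cmd = ["regsvr32", "/s"]   # /s = silent; no confirmation dialog to click
    if not register:
        cmd.append("-u")       # -u = unregister, per the advisory
    cmd.append(dll)
    return cmd

if __name__ == "__main__":
    # The force-run object would actually execute this; here we just show it.
    print(" ".join(vgx_command(register=False)))
```

The silent switch matters for a force-run: you don't want every user on campus clicking an OK box.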
On the topic of clusters, do you find the benefits of a cluster/SAN setup outweighed by the increased complication in node upgrades/patching and the "all your eggs in one basket" problem when it comes to storage on the SAN?

One of the biggest things to get used to with clustering is that your uptimes for your cluster nodes will go down dramatically from what you're used to with your existing mainline servers, but your service uptimes will go up. Once we put in the cluster we haven't had an unplanned multi-hour outage that wasn't attributable to network issues. The key here is 'unplanned'. We've had several planned outages for both service-packing and actual hardware upgrades to the SAN array itself.
Prior to the cluster, WWU had three 'facstaff' servers and three 'student' servers to handle user directories and shared directories. This way when one server died, only a third of that class of user was out of luck. The cluster still follows this design for the user directories, but that's more for load-balancing between the cluster nodes than disaster resilience. Since the cluster went in we've merged all of our facstaff shared volumes into a single volume. This was done because we were getting more and more cases of departments needing access to both Share1 and Share3, and we didn't have drive letters for that.
Patching and service-packing the cluster is easier than it would be with stand-alone servers. I can script things so that three of our six cluster nodes vacate themselves from the cluster in the middle of the night, so I can apply service-packs to them in the middle of the day. Repeat the same trick the next day. I can have a service-pack rolled out to the cluster in 48 hours with no after hours work on my part. THAT is a savings (unless you're counting on the overtime pay, which I don't get anyway).
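The night-time rotation I'm describing could be as simple as a cron-driven script that picks half the nodes per night. A sketch, with invented node names; 'cluster leave' is the NetWare cluster console command, and how it gets delivered to each departing node's console is site-specific:

```python
# Hypothetical node names for a six-node cluster like ours.
NODES = ["wuf-n1", "wuf-n2", "wuf-n3", "wuf-n4", "wuf-n5", "wuf-n6"]

def evacuation_batch(night, batch_size=3):
    """Which nodes leave the cluster on a given night (0-based)."""
    start = night * batch_size
    return NODES[start:start + batch_size]

def leave_commands(night):
    """The console command each departing node gets that night."""
    return ["%s: cluster leave" % node for node in evacuation_batch(night)]

print(leave_commands(0))  # night one: first half of the cluster
print(leave_commands(1))  # night two: the other half
```

With the batch evacuated overnight, those nodes sit idle and patchable the whole next business day.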
The downside is the 'eggs in one basket' problem. If this building sinks into the Earth right now, WWU is screwed. Recovering from tape, after we get replacement hardware of course, would take close to a week. Don't think we haven't noticed this problem.
To be fair, though, we'd have this problem even if we were still on separate servers. True disaster recovery requires multi-location of data and services, which stand-alone servers also suffer from. Under the old architecture and presuming those servers were split between campus and our building, the 'building sinking into the ground' scenario would cause a significant portion of campus to stop working and a significant portion of students to lose everything for the days it'd take us to recover from tape. During that time WWU's teaching function would probably halt as, 'the Earth ate my homework,' would be a very valid excuse.
In our case losing a third or two thirds of all user-directory and shared-directory data would halt the business of the university. While the outage wouldn't be quite as severe as it would be if our SAN melted, it would be just as disruptive. Because of that, going for an 'all or nothing' solution that increases perceived uptime was very much in order.
We're in the process of trying to replicate our SAN data to a backup datacenter on campus. We can't afford Novell's Business Continuity Cluster, which would provide automation to make this exact thing work. So we're having to make do on our own. We don't yet have a firm plan on how to make it work, and the 'fail back' plan is just as shaky; we only got the hardware for the backup SAN a month ago. It will happen, we just don't know what the final solution will look like.
As for iSCSI versus FibreChannel, my personal bias is for FC. However, I fully realize that gigabit ethernet is w-a-y cheaper than any FC solution out there today. I prefer FC because the bandwidth is higher and, due to how it is designed, I/O contention on the wire has less impact on overall performance. Just remember that iSCSI really really likes jumbo frames (MTU >1500 bytes), and not all router techs are OK with twiddling that; you may end up with a parallel and separate ethernet setup between your servers and the iSCSI storage.
As for iSCSI throughput, I haven't done tests on that. However I just got done looking at a whole bunch of throughput tests in and out of our FC SAN. During the IOZONE tests on NetWare, I recorded a high-water mark of 101 MB/s out of the EVA. This is 80% GigE speed, and therefore theoretically this transfer rate was doable over iSCSI. The true high-water mark was achieved by running IOZONE on the Linux server locally on the server, and on a cluster node running TSATEST on a locally mounted volume. At that time I saw a maximum transfer rate of 146 MB/s, which is 117% of GigE speed, so iSCSI wouldn't have been able to handle that. On the other hand, during day to day operations and during backups the transfer rate has never exceeded the 125 MB/s GigE mark. It's come close, but not exceeded it.
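Those percentages fall straight out of the gigabit-ethernet ceiling of 125 MB/s (1000 Mbit/s divided by 8 bits per byte). A quick sanity check on the numbers quoted above:

```python
# GigE line rate in MB/s: 1000 Mbit/s over 8 bits per byte.
GIGE_MBS = 1000 / 8.0    # 125 MB/s

def pct_of_gige(rate_mbs):
    """A transfer rate expressed as a percentage of the GigE ceiling."""
    return round(rate_mbs / GIGE_MBS * 100, 1)

print(pct_of_gige(101))   # NetWare IOZONE high-water mark out of the EVA
print(pct_of_gige(146))   # local IOZONE + TSATEST peak: past what GigE carries
```

The 146 MB/s figure lands at 116.8% of GigE, which is why iSCSI couldn't have carried it; the 101 MB/s mark, at about 81%, theoretically could have fit.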
I strongly suspect a contributing factor is where the code executes. In NetWare everything is in Ring 0 (kernel-land) unless exiled to a Protected Memory Space whereupon it executes in Ring 3 (user-land). My CNE classes said that stuff running in a protected memory space typically runs 3-5% slower than in the OS memory space on NetWare. On Linux, at least as far as the 2.6 kernels anyway, memory accessible from Ring 0 is limited to the first 1GB of RAM and most processes are supposed to run in Ring 3. This is the architecture that permits things like "kill -9 [pid]" to work on Linux, but abend the server in NetWare.
There was a very handy slide at BrainShare 2006 that showed the differences in the NCP/NSS architecture in NetWare and Linux. The session was IO104: File System Roadmap by Richard Jones. Because you can purchase your very own BrainShare DVD, I'm going to assume that any NDAs on this information have lapsed. You'll want to open these links in different tabs, I'll be referring to the contents of them.
IO104 Slide 40: Linux and NetWare Architectures
The NetWare architecture is very familiar. I've been looking at that chart for years. The thing to note is that the NSS and NCP bits are right next to each other in kernel-land, so they run well together with little interference.
IO104 Slide 41: NSS on Linux in OES
This is how NSS and NCP are crammed into Linux. The 'up call' box is how communication between kernel-land and user-land is performed. Every piece of I/O that comes in on an NSS volume over any file protocol, NCP, Samba, NFS, or AFP, has to pass the user/kernel interface. If you look at slide 40 you can see that this is true for all file-systems on Linux.
The side information on slide 41 hints at a major problem when OES-Linux first shipped. At that time the file-cache was being kept in kernel-land like it is in NetWare. This gave some screaming numbers. Unfortunately Linux is limited to 1GB of RAM in kernel-land, and that has to be shared with everything in kernel-land. So it screamed... so long as you had very small file systems. Ahem. SP1 changed that so NSS could use Linux's native caching mechanism. It dropped the speed a bit, but it could again handle large file-systems.
Since every I/O request on a file-system has to pass the computing equivalent of the blood/brain barrier, this introduces certain lags. The true impact of this is unknown to me, as my linux-fu is too weak to know where to stick the probes to get an idea as to where all that CPU is going. Watching the split of load types, I clearly saw that the CPU spent very little time in IOWAIT, and was split roughly evenly between USER and SYSTEM. The NCP server was doing something, but NSS (all that SYSTEM time) clearly was quite busy as well. Due to how file-servers are handled on Linux, if I had run this against Samba the busy process would have been SMBD, since CPU for file-system work is 'charged' against the calling process.
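One probe that doesn't take much linux-fu is /proc/stat, which is where top gets those USER/SYSTEM/IOWAIT numbers in the first place. A rough sketch; the field order matches the 2.6 kernels, and the sample tick counts are invented to resemble the split described above:

```python
def cpu_split(stat_line):
    """Turn the aggregate 'cpu' line of /proc/stat into percentages.

    2.6-kernel field order after the 'cpu' label:
    user nice system idle iowait irq softirq ...
    """
    fields = stat_line.split()
    names = ["user", "nice", "system", "idle", "iowait"]
    ticks = dict(zip(names, map(int, fields[1:6])))
    total = sum(ticks.values()) or 1
    return {k: round(v * 100.0 / total, 1) for k, v in ticks.items()}

# Made-up sample: heavy USER and SYSTEM, almost no IOWAIT.
sample = "cpu 4500 0 4300 900 300 0 0"
print(cpu_split(sample))
```

Sampling that line twice a few seconds apart and differencing the ticks gives the live split; a single snapshot like this gives the since-boot average.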
Then there is the possibility of just not having fully optimized code. I've heard that NSS as a linux file system runs 'only' 12% slower than reiser (when called locally on the Linux server, and not over a file-serving protocol), which says that NSS is pretty butch as it is. Scale is the key question, though.
The same File System Futures presentation had a few slides about where NSS is likely to go in future revisions of OES and SLES, where 'future' is likely the version past the one coming out Real Soon Now, and it looks quite promising. The block diagram for how the NetWare Services shim into Linux is much cleaner. The plan, as of March, was to shim in a 'NetWare Modular Features' layer between the file-systems and the Virtual File Services layer. The advantage to this would be at a minimum NetWare-style trustees on reiser, JFS, UFS, etc.
Once the next version of OES ships I'll see if I can get the hardware to re-run the dir-create and file-create tests. Even doing a single workstation should tell me what improvements, if any, were put into OES when it comes to scalability.
Tags: novell, oes
I was testing the performance of an NSS volume mounted over NCP. In part this is because NetWare clustering only works with NSS, but mostly because of two other reasons. The only other viable file-server for Linux is Samba, and I already know it has 'concurrency issues' that crop up well below the level of concurrency we show on the WUF cluster. Second, the rich meta-data that NSS provides is extensively used by us. I don't believe any Linux file system has an equivalent for directory quotas.
- HP ProLiant BL20P G2
- 2x 2.8GHz CPU
- 4GB RAM
- HP EVA3000 fibre attached
- NetWare 6.5, SP5 (a.k.a. OES NetWare SP2)
- N65NSS5B patch
- 200GB NSS volume, no salvage, RAID0, on EVA3000
- OES Linux SP2
- Post-patches up to 9/12/06
- 200GB NSS volume, no salvage, RAID0, on EVA3000
To facilitate the testing I was granted the use of one of the computer labs in mothballs between terms. This lab had 32 stations in it, though only 30 stations were ever used in a test. I thank ATUS for lending us the lab.
- Windows XP Sp2, patched
- P3 1.6GHz CPU
- 256MB RAM
- Novell Client version 126.96.36.19951209 + patches
- NWFS.SYS dated 11/22/05
Unfortunately, the Linux configuration hits its performance ceiling well before the NetWare server does. Linux just doesn't scale as well as NetWare. I/O operations on Linux are much more CPU bound than on NetWare, as CPU load on all tests on the Linux server was excessive. The impact of that loading was very variable, though, so there is some leeway.
Both of the file-create and dir-create tests created 600,000 objects in each run of the test. This is a clearly synthetic benchmark that also happened to highlight one of the weaknesses of the NCP Server on Linux. During both tests it was 'ndsd' that was showing the high load, and that is the process that handles the NCP server. Very little time was spent in "IO WAIT", with the rest evenly split between USER and SYSTEM.
The IOZONE tests also drove CPU quite high due to NCP traffic, but it seems that actual I/O throughput was not greatly affected by the load. In this test it seems that Linux may have out-run NetWare in terms of how fast it drove the network. The difference is slight, a few percentage points, but looks to be present. I regret not having firm data for that, but what I do have is suggestive of this.
But what does that mean for WWU?
The answer to this comes with understanding the characteristics of the I/O pattern of the WUF cluster. The vast majority of it is read/write, with create and delete thrown in as very small minority operations. Backup performance is exclusively read, and that is the most I/O intensive thing we do with these volumes. There are a few middling sized Access databases on some of the shared volumes, but most of our major databases have been housed in the MS SQL server (or Oracle).
For a hypothetical reformat of WUF to be OES-Linux based, I can expect CPU on the servers doing file-serving to be in the 60-80% range with frequent peaks to 100%. I can also expect 100% CPU during backups. This, I believe, is the high end of the acceptable performance envelope for the server hardware we have right now. With half of the nodes scheduled for hardware replacement in the next 18 months, the possibility of dual and even quad-core systems becomes much more attractive if OES Linux is to be a long term goal.
OES-Linux meets our needs. Barely, but it does. Now to see what OES2 does for us!
Tags: novell, benchmarking
With the throughput tests, there were no perceivable differences between 16 simultaneous threads and 32 simultaneous threads. The NetWare throughput test showed signs of client-side caching as well, so those results are tainted. Plus I learned that there were some client-side considerations that impacted the test. The clients all had WinXP SP2 in 256MB of RAM, and instantiating 16 to 32 simultaneous IOZone threads causes serious page faults to occur during the test.
As such, I'm left with much more rough data from these tests. CPU load for the servers in question, network load, and fibre-channel switch throughput. Since these didn't record very granular details, the results are very rough and hard to draw conclusions from. But I'll do what I can.
At the outset I predicted that these tests would be I/O intensive, not CPU intensive. It turns out I was wrong for Linux, as CPU loads approached those exhibited by the dir-create and file-create tests for the whole iozone run. On the other hand, the data are suggestive that the CPU loading did not affect performance to a significant degree. CPU load on NetWare did approach 80% during the very early phases of the iozone tests, when file-sizes were under 8MB, and decreased markedly as the test went on. It was during this time that the highest throughputs were reported on the SAN.
Looking at the network throughput graphs for both the lab-switch uplink to the router core and the NIC on the server itself suggests that throughput to/from OES-Linux was actually faster than OES-NetWare. The difference is slight if it is there, but at a minimum both servers drove an equivalent speed of data over the ethernet. Unfortunately, the presence of client-side caching on the clients for the NetWare run prevents me from determining the actual truth of this.
On the fibre-channel switch attached to the server and the disk device (an HP EVA) I watched the throughputs recorded on the fibre ports for both devices. The high-water mark for data transfer occurred during the first 30 minutes of the iozone run with NetWare; the Linux test may have posted an equivalent level, but that test was run during the night and therefore its high-water mark went unobserved. At the time of the NetWare high-water mark all 32 stations were pounding on the server with file-sizes under 16MB. The level posted was 101 MB/s (or 6060 MB/Minute), which is quite zippy. This transfer rate coincided quite well with the rate observed on the ethernet. This translates to about 80% utilization on the ethernet, which is pretty close to the maximum expected throughput for parallel streams.
For comparison, the absolute maximum transfer rate I've achieved with this EVA is 146 MB/s (8760 MB/Min). This was done with iozone running locally on the OES-Linux box and TSATEST running on one of the WUF cluster nodes backing up a large locally mounted volume. Since this setup involved no ethernet overhead, it did test the EVA to its utmost. It was quite clear that the iozone I/O was contending with the TSATEST data, as when the iozone test was terminated the TSATEST screen reported throughput increasing from 830 MB/Min to 1330 MB/Min. I should also note that due to the zoning on the Fibre Channel switch, this I/O occurred on different controllers on the EVA.
These tests suggest that when it comes to shoveling data as fast as possible in parallel, OES-Linux performs at a minimum the equivalent of OES-NetWare and may even surpass it by a few percentage points. This test exercised modify, read, and write operations, which, except for the initial file-create and final file-delete operations, are metadata-light. Unlike file-create, the modify, read, and write operations on OES-Linux appear not to be significantly impacted by CPU loading.
Tags: novell, benchmarking
I just put the Min values in the error bars to make it a cleaner graph. But here you can see the trend mentioned in the file-create tests about the 4000 object line. Only here 4500 objects seems to be the point where file-create passes dir-create in terms of time per operation. This is a result of CPU usage, and the fact that file-create appears to be more affected by it on Linux than it is on NetWare. The identical NetWare chart is illustrative, but since CPU never went above 70% for more than a few moments it isn't a pure apples-to-apples comparison.
In this case, file-create remains below dir-create for the whole run. What's more, dir-create drove CPU a lot harder than file-create did. The early data in the Linux run shows that OES-Linux would follow this file-create-is-faster pattern given sufficient CPU.
Exactly why file-create performance degrades so fast when CPU contention begins is unclear to me. In terms of disk bandwidth, all four tests barely twitched the needle on the SAN monitor; these tests do not involve big I/O transfers. As far as NSS is concerned, a directory and a file are very similar objects in the grand scheme of things. Yet NSS seems to track more data related to directories than files, so it seems counterintuitive that file-create would lag when CPU becomes a problem. This question is one I should bring with me to BrainShare 2007.
Next, IOZONE and throughput tests.
Tags: novell, benchmarking
30 workstations create a sub-directory, and in that sub-directory create 20,000 files. At each 500 files it does a directory listing and times how long it takes to retrieve the list. A running total of the time taken to create files is kept, and a log of how long each entry takes to create is also kept.
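For the curious, the per-workstation harness boils down to something like this sketch. It's a local-filesystem toy (the real test ran against a mapped NCP drive), and passing make=os.mkdir gives the dir-create variant of the test:

```python
import os
import time

def create_test(root, count=20000, step=500,
                make=lambda p: open(p, "w").close()):
    """Create `count` objects in a fresh sub-directory of `root`.

    Every `step` objects, time a full directory listing. Returns
    (create_log, enum_log): per-object create times, and a list of
    (entry_count, listing_seconds) pairs.
    """
    sub = os.path.join(root, "bench")
    os.mkdir(sub)
    create_log, enum_log = [], []
    for i in range(count):
        t0 = time.time()
        make(os.path.join(sub, "obj%06d" % i))   # one file (or directory)
        create_log.append(time.time() - t0)
        if (i + 1) % step == 0:                  # periodic enumeration
            t0 = time.time()
            entries = os.listdir(sub)
            enum_log.append((len(entries), time.time() - t0))
    return create_log, enum_log
```

Thirty copies of that loop running at once, against one server, is what produced the charts below.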
This chart is interesting in several ways. First of all, note the lower error bars for the Linux line. Those bars overlap, and up to about 4000 files are actually below the NetWare average. This says to me that when there is CPU room, Linux may be faster than NetWare when responding to file creates. The shape of this particular line has the same cause as in the previous test, namely that some test stations started up to 30 seconds before the whole group was running and therefore had a window of uncontended I/O. Those same workstations finished their tests while others were still around 12000 files, which further explains the downward trend of the Linux line above that threshold.
The second interesting thing is the sheer variability of the results. As with the dir-create test, CPU was completely utilized on the OES-Linux box. The reported load-averages were very similar to dir-create. Some test workstations were able to run a complete test before others even got to 12000 files. Yet others took a really long time to process. The file-create test ran well over an hour, whereas the same test on NetWare took just under 30 minutes.
This graph shows significant differences between the two platforms. As with the first chart, at 4000 files and under some workstations turned in NetWare-equivalent response times when speaking to OES-Linux. As with the above, this was due to uncontended I/O. But once all the clients started running the test, the response time for directory enumeration was greatly degraded.
Because file-create seems to clog the I/O channels more than dir-create did, directory enumeration had to compete in the same channels and thus response times suffered. Towards the end of the test, when some workstations had finished early, response times were creeping back towards parity with OES-NetWare.
Next, create operation differences.
Tags: novell, benchmarking
30 workstations create a sub-directory, and in that sub-directory create 20,000 directories. At each 500 directories it does a directory listing and times how long it takes to retrieve the list. A running total of the time taken to create directories is kept, and a log of how long each entry takes to create is also kept.
This chart shows it very well. As I've said before, the state of the server affected this run. At its peak, the NetWare server had a CPU load around 65%. The Linux server had a load average around 18, which roughly translates to a CPU load of 900%. Directory Create is an expensive operation due to the amount of meta-data involved. This is clearly much more expensive on the Linux platform than it is on the NetWare platform.
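The 900% figure is just the load average scaled against the two CPUs in each blade; a load average equal to the CPU count is roughly 100% busy, and everything beyond that is runnable work queued up waiting:

```python
def load_as_pct(load_avg, cpus):
    """Load average relative to CPU count, as a rough 'percent CPU'."""
    return round(load_avg / float(cpus) * 100)

print(load_as_pct(18, 2))   # the Linux box during dir-create
print(load_as_pct(1.3, 2))  # roughly where the NetWare box sat, ~65%
```

Put another way: for every process actually on a CPU, eight more were waiting in line.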
The range of results is also quite interesting. Generally speaking, when speaking to a NetWare server the clients had a pretty even spread of response times. Some times were faster than others. It just happens. Because of testing limits I was not able to start all stations at exactly the same time; however, start-times were within 30 seconds of each other. The stations that went first recorded really good times for the first 3000 directories or so, then slowed down as everyone got going. This effect was quite clear in the raw Linux data, though it is hidden in the above chart.
A side effect of that is that when the fast clients finished, it removed some of the I/O contention going on. You can see that in the downward curve of the Linux line towards the end of the test. That doesn't indicate that Linux was getting better at higher speeds, just that some clients had finished working and had removed themselves from the testing environment.
This is the chart that describes how long it takes to enumerate a single directory inside of a dir-list of the created sub-directory. As the test progressed there were more directories to enumerate. Mere enumeration isn't an expensive operation, as it just involves a subset of the metadata in the directory entries. As with the dir-create test, dir-enum shows that Linux is slower on the ball than NetWare is under heavy load conditions. This is pretty clearly CPU related, as a single client running these tests shows very little difference between the platforms.
The hump and fall-off of the Linux line is an artifact of faster workstations getting done quicker and getting out of the way. The sheer variability of the linux line is interesting in and of itself. I'm sure further testing may identify the cause of that, but I'm limited on time and other resources so I won't be investigating it now.
Next, on Monday, file-create and file-enumerate.
Tags: novell, benchmarking
Today over at Cool Blogs, Richard Jones posted about the progress of this technology in the industry. The short version is that Novell implemented SMS on Linux, and vendors that already had a solid Linux client had to completely rewrite it. Which would explain why it has taken almost two years for the big storage players to come out with supported product. Novell has taken steps to support the really big storage players in UnixLand (IBM, et al.) in their clients, using extended attributes (Xattrs).
Turns out that xattr thing was slipped into a patch on the 11th of August. I wonder if that's the same package that had shadow volumes included?
Tags: novell, OES
But, I figured I'd give some impressions I got from the tests. For brevity purposes, when I say NetWare I mean, "OES NetWare 6.5 SP3 with patches up to 8/23/06", and when I say Linux, I mean, "OES Linux SP2, with patches up to 9/1/06". Also, when talking about I/O, I'm referring to, "I/O performed over the network via NCP to an NSS volume."
- I/O on Linux is more CPU bound than on NetWare. For absolute sure, dir-create and file-create are much more expensive operations CPU-wise. They both perform similarly when done with unloaded systems, but the system hit for create on Linux is much higher than on NetWare. This could be due to System/User memory barriers, but my testing isn't robust enough to test that sort of thing. NetWare is all Ring 0, where by necessity Novell has brought a lot of the file-sharing functions in Linux into Ring 3.
- Bulk I/O speed is similar. When talking about bulk I/O functions, in my case this was the IOZONE test, both platforms perform similarly. Unfortunately, caching played a big role on the NetWare test and didn't perform any role in the Linux test. This is the inverse of my findings in January. The testing gods frowned on me.
- Linux seems to support faster network I/O than NetWare. Unfortunately, this may just be a side-effect of the caching. But network loads were higher when running the bulk IO tests on Linux than they were with NetWare. This can be a good thing (Linux supports more network I/O than NetWare) or a bad thing (Linux requires more network I/O for similar performance). Not sure at this time which it is.
Another thing to note is that the bulk IO test with IOZONE also induced very high load-averages on Linux, but the apparent throughput was very comparable to NetWare. IOZONE works by creating a file of size X and running a series of tests on records of size Y. Unlike the dir-create and file-create tests, this test doesn't test how fast you can create files; it tests how fast you can get data. Clearly record I/O within files still induces CPU load in the form of NDSD activity; however, unlike the dir-create and file-create tests, the apparent throughput is not nearly as affected by high-CPU conditions.
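To make the distinction concrete, here's a toy version of one IOZONE-style pass against a single file: write it once, then rewrite it record by record. This is a sketch of the access pattern, not IOZONE's actual code; only the initial open creates anything, and everything after is record-sized operations within the existing file.

```python
def record_rewrite(path, file_size, record_size):
    """Write a file of size X, then rewrite it in records of size Y.

    Returns the number of record operations performed. Mimics the
    metadata-light shape of a bulk-I/O test: one create, then many
    seek-and-write operations against the same file.
    """
    buf = b"x" * record_size
    ops = 0
    with open(path, "wb") as f:          # initial write pass (file create)
        for _ in range(file_size // record_size):
            f.write(buf)
            ops += 1
    with open(path, "r+b") as f:         # rewrite pass: seek + write
        for i in range(file_size // record_size):
            f.seek(i * record_size)
            f.write(buf)
            ops += 1
    return ops
```

Compare that with the create tests, where every single operation touches the directory metadata; here the directory is touched once.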
From this early stage it looks like we could convert WUF to Linux and still not need new hardware. But we'd be running that hardware harder, much harder, than it would have run under NetWare. Since we're not pushing the envelope with our NetWare servers now, we have the room to move. If our servers were running closer to 20% CPU, the answer would be quite different.
As I read the documentation, it looks like NCPserv is a function of ndsd. Therefore, seeing ndsd taking up CPU cycles that way was due to NCP operations, not DS operations. If that's the case, substituting a reiser partition for the NSS partition would decrease CPU loading some, but probably not the order of magnitude it needs.
Tags: novell, benchmarking
Yet, when the fibre throughput monitor is reporting 125 MB/s (1Gb/s), it also shows a utilization of only 25%. Buh? Am I missing something here?
BENCHTEST-LIN: help

Note the shadow_volume commands in that output. Perhaps Novell has slipped Shadow Volumes into a post-SP2 update? Doing help on the 'create shadow_volume' command gives this output:
BENCHTEST-LIN: help create shadow_volume
NAME: create shadow_volume - Create NCP shadow volume
create shadow_volume ncp_volume_name path
Use this command to create an association between an NCP volume
and a NCP shadow volume. This command only adds the NCP shadow
volume mount information to "/etc/opt/novell/ncpserv.conf".
This command can be added to a cluster load script.
You can run ncpcon console commands without entering NCPCON by
prefacing the command with ncpcon.
create shadow_volume vol1 /home/shadows/vol1
BENCHTEST-LIN: help shadow

Yes, 'EXAMPLE:' is blank in the HELP. Hmmmmmm. I don't see any documentation updates, but those commands are indeed present. Richard Jones mentioned that shadow volumes are an OES2 feature, and to try it out in the beta. Perhaps there is an OES2 beta in the near future? Who knows.
NAME: shadow - Perform Shadow Volume operations on a NCP Volume - (null)
shadow volumename operation [options]
You can run ncpcon console commands without entering NCPCON by
prefacing the command with ncpcon.
operation=[lp][ls][mp][ms] - (lp) List primary files
(ls) List shadow files
(mp) Move files to primary
(ms) Move files to shadow
pattern="searchPattern" - File pattern to match against
owner="username.context" - Username and Context
uid=uidValue - User ID
time=[m][a][c] - (m) Last Time Modified (a) Last Time Accessed
(c) Last Time Changed
range=[time period] - See Time period
size=[size differential] - See Size differential
output="filename" - Output all results to the specified filename
Time period:
(a) Within Last Day
(b) 1 Day - 1 Week
(c) 1 Week - 2 Weeks
(d) 2 Weeks - 1 Month
(e) 1 Month - 2 Months
(f) 2 Months - 4 Months
(g) 4 Months - 6 Months
(h) 6 Months - 1 Year
(i) 1 Year - 2 Years
(j) More Than 2 Years
Size differential:
(a) Less than 1KB
(b) 1 KB - 4 KB
(c) 4 KB - 16 KB
(d) 16 KB - 64 KB
(e) 64 KB - 256 KB
(f) 256 KB - 1 MB
(g) 1 MB - 4 MB
(h) 4 MB - 16 MB
(i) 16 MB - 64 MB
(j) 64 MB - 256 MB
(k) More than 256 MB
Tags: novell, shadowvolumes
Unfortunately, we seem to have an 'apples to apples' problem. While the network utilization appears to be higher with the OES Linux server, implying better throughputs, it is clear from the few clients that have finished the run that there was no caching involved with this particular test. Comparing numbers, therefore, will be a bear.
Ideally I'd rerun the NetWare test with client caching and oplock 2 disabled, but I don't have time for that. This server needs to be given back to the service I borrowed it from.
Tags: novell, benchmarking
Okay, it turned 'warning'. Before it was either green/working, or red/broken. They'd never seen yellow/high-load before. They were quite happy.
Anyway... the 1Gb link between the lab with all the workstations and the router core was running 79-81% utilization. Nice!
On the SAN link we had around 20% utilization, the hardest I'd ever seen the EVA driven before.
Right now I can't tell what that link is running, but the link into the server itself is running in the 50-60% range. Better analysis will occur tomorrow when I can ask the Telecom guys how that link behaved overnight. As for the server, load-levels are well above 3.0 again. Right at this moment it's at 14ish, with ndsd being the prime process.
At this point I'm beginning to question what Unix load averages mean when compared to the CPU percentage reported by NetWare. Are they comparable? How does one compare them? Anyway, the dir-create and file-create tests showed themselves to be much more CPU-bound on Linux than on NetWare, and this sort of bulk I/O seems to have a similar binding. Late in the test, CPU on NetWare was fairly low, in the 20% range, with the prime teller of loading being allocated Service Processes. So I'm pretty curious as to what load will look like when all the stations get into the 128MB file sizes and larger.
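As a rough back-of-envelope comparison (my own reasoning, not anything from Novell's docs): dividing the load average by the CPU count gives something loosely comparable to a utilization percentage, with the caveat that load average counts runnable (and, on Linux, uninterruptible-I/O) tasks rather than time spent busy:

```python
def load_to_pct(load_avg, cpus):
    """Very rough: load average per CPU, expressed as a percentage.
    Values over 100% mean tasks are queueing for the CPU (or stuck in
    I/O wait), which NetWare's 0-100% CPU gauge can never show."""
    return 100.0 * load_avg / cpus

# the test server: load average around 14 on a 2-CPU box
print(load_to_pct(14, 2))  # 700.0 -- seven runnable tasks per CPU
```

By that yardstick, a load of 14 on two CPUs is far past saturation, which would explain why the Linux clients report so much worse than the NetWare box sitting at 20% CPU.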
Tags: novell, benchmarking
I haven't looked at the data closely yet, but I suspect that the same trends reported in the dir-create test follow here. I didn't do a test for dir-create and file-create on NetWare with a smaller number of stations, but then it didn't seem like I needed to. The 'break even' point, where CPU is just under 100%, on the dir-create looks to be in the 4-6 station range, with the file-create point on or around 10 stations.
Tags: novell, benchmarking
Test 1 is the 'big directory' test. The client stations create 20,000 sub-directories in a sub-directory named after the machine. The time to create each directory is tracked, as is the time it takes to enumerate each directory. In testing out the benchmark, it is clear that mkdir is a more expensive operation than the 'touch' equivalent in the make-file test (also 20,000 files).
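A minimal sketch of that kind of test harness (my reconstruction, not the actual benchmark script; names and structure are assumptions):

```python
import os
import socket
import time

def big_directory_test(root, count=20000):
    """Create `count` sub-directories under a directory named after this
    machine, timing each mkdir, then time a full enumeration of them."""
    base = os.path.join(root, socket.gethostname())
    os.makedirs(base, exist_ok=True)
    mkdir_times = []
    for i in range(count):
        start = time.perf_counter()
        os.mkdir(os.path.join(base, "dir%05d" % i))
        mkdir_times.append(time.perf_counter() - start)
    # enumeration pass, also timed
    start = time.perf_counter()
    entries = os.listdir(base)
    enum_time = time.perf_counter() - start
    return mkdir_times, enum_time, len(entries)
```

Even scaled down, this reproduces the shape of the test: thousands of metadata operations and almost no data transfer, which is exactly the kind of load that lands on NCP/ndsd rather than the I/O subsystem.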
On NetWare, with 30 client machines pounding the server, CPU rose to about 80% or so and stayed there. Load on the CPUs was equal. There was some form of bottlenecking going on, because some clients finished much faster than others, and it isn't clear what separated the two classes.
On Linux the load-average is pretty stable around 18. The process taking up that CPU is ndsd. The numbers I'm getting back from the clients are vastly worse than NetWare. The first time I ran it I figured that this was due to the workstation objects not having the posixAccount extension. So I fixed that, and now the percentages are better, but still much worse than NetWare. I'll run this test again with only 10 clients, so I get to compare smaller concurrent access numbers.
That kind of load is not exactly 'real user load', it's a synthetic load designed to show how well either platform handles abuse. The iozone benchmark should be closer to comparable since that's just a single file, and ndsd shouldn't be involved with those accesses much at all. That'll be almost entirely i/o subsystem.
Tags: novell, benchmarking
iozone -Rab \report-dump\IOZONE-std\%COMPUTERNAME%-iozone1.xls -g 1G -i 0 -i 1 -i 2 -i 3 -i 4 -i 5
(-a runs the automatic test matrix, -R and -b dump an Excel-format report to the named file, -g caps the maximum file size at 1GB, and the -i flags select the write/rewrite, read/reread, random read/write, read-backwards, record-rewrite, and strided-read tests.)
Right now all the stations are chewing on the 1GB file, and are all at various record-size stages. But the fun thing is the "nss /cachestats" output:
BENCHTEST-NW:nss /cachestat
Yep. All that I/O is only partially being satisfied by cache-reads. As it should be at this stage of the game.
***** Buffer Cache Statistics *****
Min cache buffers: 512
Num hash buckets: 524288
Min OS free cache buffers: 256
Num cache pages allocated: 414103
Cache hit percentage: 63%
Cache hit: 3407435
Cache miss: 1978789
Cache hit percentage(user): 60%
Cache hit(user): 3031275
Cache miss(user): 1978789
Cache hit percentage(sys): 100%
Cache hit(sys): 376160
Cache miss(sys): 0
Percent of buckets used: 48%
Max entries in a bucket: 7
Total entries: 399112
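For the record, those hit percentages are just hits over (hits + misses), truncated to a whole number; checking against the figures above:

```python
def hit_pct(hits, misses):
    """Cache hit percentage the way the console reports it (truncated)."""
    return int(100 * hits / (hits + misses))

print(hit_pct(3407435, 1978789))  # overall: 63
print(hit_pct(3031275, 1978789))  # user: 60
```

Both match the console output, so the user/system split accounts for all of the misses: every cache miss in this run was on the user side.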
What surprised me yesterday when I kicked off this particular test was how badly hammered the server was at the very beginning. This is the small file-size test, and better approximates actual usage. CPU during the first 30 minutes of the test was in the 70-90% range, and was asymmetric; CPU1 was nearly pegged. During that phase we also drove network utilization of 79-83% on the GigE uplink between the switch serving the testing machines and the router core. And on the Fibre Channel switch serving the test server, the high-water mark for transfer speed was 101 MB/second (~20% utilization).
The FC speed is notable. The fastest throughput I had previously been able to produce on the port linking the EVA was about 25 MB/second, and that was done with TSATEST running against local volumes in parallel on three machines. Clearly our EVA is capable of much higher performance than we've been demanding of it. Nice to know.
Depending on how the numbers look once this test is done, I might change my testing procedure a bit. Run a separate 'small file' run in IOZone to capture the big-load periods, and perhaps a separate 'big file' run with 1G files to capture the 'cache exhaustion' performance.
From a NetWare note, the 'Current MP Service Processes' counter hit the max of 750 pretty fast during the early stages of the test. Upping the max to 1000 showed how utilization of service processes progressed during the test. Right now it's steady at 530 used processes. Since I don't think Linux has a similar tunable parameter, this could be one factor making a difference between the platforms.
Tags: benchmarking, novell
Item: Apple is now shipping on Intel hardware.
Item: OS 10.5 will be shipping with Boot Camp pre built in.
Given those, I suspect that game makers now have less incentive to create games for OS X. The presumption being that anyone who needs to can boot Windows and run their games there. Why go to the extensive effort to port games to OS X?
Therefore: There may be fewer games released for the Mac
Therefore: More gaming will be done on Mac hardware, but under Windows.
Therefore: Macintosh machines will come under similar upgrade pressure as PC machines, thanks to increased gaming numbers.
Therefore: Apple will be under increased pressure to permit at least graphics card upgrades to their mid-line machines (iMac).
Anyway, the trick:
- Make sure all the clients are imported as Workstation Objects.
- Create a Workstation Group, and add all of the clients into it.
- Add the newly created Workstation Group as a R/W trustee of the volume I'm benchmarking against. This allows the workstations as themselves, not users, to write files.
- Create a Workstation Policy, associate it to the group.
- In the Workstation Policy, create a Scheduled Task. Point it at the batchfile I wrote that'll map a drive to the correct volume, run the tests, and clean up.
- Modify the schedule so it'll run at a specific time, making sure to uncheck the 'randomize' box.
- Force a refresh of the Policies on the clients (restarting the Workstation Manager service will do it).
The jobs all seem to start within 30 seconds of the scheduled time. This doesn't seem to be due to differences in the workstation clocks (on checking, those are all within 3 seconds of 'true'), but rather to the Workstation Manager task polling interval. I wish I could get true 'everyone right now' performance, but that's not possible without w-a-y more minions.
On the 'large number of sub-directories' test, the early jumpers seemed to keep a continued edge over the late starters. The time to create directories for the early jumpers was consistently in the 3-5ms range, while the late jumpers were in the 10-13ms range. Significant difference there. And some started fast and became slow, so there is clearly some threshold involved here beyond just the server dealing with all those new directory entries. CPU load on the NetWare box (what I have staged up first) during the test, with 32 clients creating and enumerating large directories, was in the 55-70% range. That load is spread equally over both CPUs, so those bits of NSS are fully MP-enabled.
Tags: benchmarking, zenworks
In benchmarking news, I'm still building the testing protocol. But one thing has shown itself once more. Back in the original benchmark I noticed an artifact in the data. The IOZONE "random read" test shows a trough in the NCP-on-NetWare data: when using a 64kb record size, there is a marked decrease in performance. It showed up in the original data, and in some of the test runs I've just completed. The hardware behind the January test and this one is different; this server is a bit older, but it is hooked up to the SAN.
|32kb ||64kb ||128kb |
This is sample data. For the 8MB file size, you can see the three record sizes either side of the 64kb line. The performance drop exhibited there is repeated throughout the whole random read test. I wonder why that particular record size is so slow?
Find it: http://www.novell.com/training/attlive/sessions.html
To which I say, 'eh.'
- We don't do GroupWise.
- We don't do Identity Manager.
- We don't do high-performance Linux. We barely do web-serving with Linux at the moment.
- Which leaves ZEN, the one bright spot.
Tags: novell, brainshare