For a good run-down of the types of data we most commonly work with, there is a very nice write-up over here on DiscoveryBrain. Those are the top twelve file-types we run into, and six of the twelve are Microsoft-specific file-types.
There is a long tail of other file-types we work with, which is where we get into how we're competitive versus other companies. We won a major contract a while back because we could natively handle Lotus Notes archives, rather than converting them to PST before processing like some other vendors. Things like that.
Processing all of those MS-Office files can be tricky to do with pure open-source tools. OpenOffice is very good at a lot of things, but there are some corner cases (or in some instances corner offices) where it doesn't yield very good results. So we may process with actual MS-Office, which in turn means we need Windows around.
Once in a great while we'll run into some Mac-specific formats. We can handle those too, though we don't do so with Macs.
We've even run into some Unix-specific formats. But the OSS support for those is rather strong, so those are pretty well handled.
But still. The vast majority of our processing is those twelve formats.