Re: WAR FileSystem for fast nested JAR access?

Jeremy Boynes Wed, 04 Mar 2015 09:20:07 -0800

On Mar 4, 2015, at 3:49 AM, Konstantin Kolinko <knst.koli...@gmail.com> wrote:
> 
> 2015-03-04 8:20 GMT+03:00 Jeremy Boynes <jboy...@apache.org>:
>> In https://bz.apache.org/bugzilla/show_bug.cgi?id=57251, Mark Thomas wrote:
>> 
>>> The fix for bug 57472 might shave a few seconds off the deployment time but
>>> it doesn't appear to make a significant difference.
>>> 
>>> The fundamental problem when running from a packed WAR is that to access any
>>> resource in a JAR, Tomcat has to do the following:
>>> - open the WAR
>>> - get the entry for the JAR
>>> - get the InputStream for the JAR entry
>>> - Create a JarInputStream
>>> - Read the JarInputStream until it finds the entry it wants
>>> 
>>> This is always going to be slow.
>>> 
>>> The reason that it is fast in Tomcat 7 and earlier took some digging. If
>>> unpackWARs is false in Tomcat 7, it unpacks the JARs anyway into the work
>>> directory and uses them from there. Performance is therefore comparable with
>>> unpackWARs="true".
>> 
>> Has anyone looked into using a NIO2 FileSystem for this? It may offer a way 
>> to avoid having to stream the entry in order to be able to locate a 
>> resource. ZipFile is fast, I believe, because it has random access to the 
>> archive and can seek directly to an entry's location based on the zip index; 
>> the jar: FileSystem seems to be able to do the same.
>> 
>> However, neither can cope with nested entries: ZipFile because its 
>> constructor takes a File rather than a Path and uses native code, and ZipFS 
>> because it relies on URIs and can't cope with a jar: URI based on another 
>> jar: URI (ye olde problem with jar: URL syntax).
>> 
>> What a FileSystem can do differently is return a FileChannel which supports 
>> seek operations over the archive's content. IOW, if ZipFS can work given a 
>> random access channel to bytes on disk, the same approach could be adopted 
>> with a random access channel to bytes on a virtual FileSystem.
>> 
>> I imagine that would get pretty hairy for write operations but fortunately 
>> we would not need to deal with that.
>> 
>> If no-one’s looked at it yet I'll take a shot.
>> Cheers
>> Jeremy
>> 
>> FWIW, this could also be exposed to web applications e.g.
>>  FileSystem webappFS = servletContext.getFileSystem();
>>  Path resource = webappFS.getPath(request.getPathInfo());
>>  Files.copy(resource, response.getOutputStream());
>> 
> 
> The fundamental issue is how the data of JAR file (as a whole) is
> available via API.
> 
> To be able to use random access with the JAR you technically have to
> 
> 1) Jump to the end of the JAR file and read the ZIP index ("Central
> directory") that is located there. See the image at:
> http://en.wikipedia.org/wiki/Zip_%28file_format%29
> 
> 2) Jump to the specific file.
> 
> As JAR itself is compressed, there is no real API to jump to a
> position in it, besides maybe InputStream.skip(). This skip() will
> involve the same overhead as the current implementation that scans the
> jar, unless the war has zero compression.
> 
> 
> Also
> 1. Reading the zip index takes time and would better be cached. That
> is the issue behind
> https://bz.apache.org/bugzilla/show_bug.cgi?id=52448
> 
> 2. It makes sense to cache the list of directories (packages) in the
> zip file. Scanning the whole jar for a class that is not present there
> is the worst case.  A bonus is that it can improve handling of JARs
> that do not have explicit entries for directories.


I agree caching would help but I’m not convinced the lack thereof is the main 
cause of the speed issue here. From Mark’s description above, “Read the 
JarInputStream until it finds the entry it wants” sounds more problematic.

“Open the WAR” and “get the entry for the JAR” can use ZipFile, which uses 
random access to locate the bytes for the nested JAR. However, ZipFile only 
exposes those bytes as an InputStream, so we have to read through the nested 
JAR sequentially to locate the resource entry.
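
To make the cost concrete, here is a minimal sketch of that lookup path using 
only java.util.zip and java.util.jar. The class and method names 
(NestedJarScan, read) are made up for illustration and this is not Tomcat's 
actual code; it only mirrors the steps Mark listed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, for illustration only.
final class NestedJarScan {

    /**
     * Returns the bytes of resourceName inside the JAR stored at
     * jarEntryName in the WAR, or null if either is missing.
     */
    static byte[] read(String warPath, String jarEntryName, String resourceName)
            throws IOException {
        // ZipFile uses the central directory, so this lookup is random access.
        try (ZipFile war = new ZipFile(warPath)) {
            ZipEntry jar = war.getEntry(jarEntryName);
            if (jar == null) {
                return null;
            }
            // The nested JAR is only available as a stream, so its entries
            // have to be visited one by one until the name matches.
            try (InputStream raw = war.getInputStream(jar);
                 JarInputStream jis = new JarInputStream(raw)) {
                JarEntry entry;
                while ((entry = jis.getNextJarEntry()) != null) {
                    if (entry.getName().equals(resourceName)) {
                        ByteArrayOutputStream out = new ByteArrayOutputStream();
                        byte[] buf = new byte[8192];
                        int n;
                        while ((n = jis.read(buf)) != -1) {
                            out.write(buf, 0, n);
                        }
                        return out.toByteArray();
                    }
                }
            }
        }
        return null;
    }
}
```

A miss is the worst case: every entry of the nested JAR is inflated and 
skipped before the method can return null.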

As an aside, there’s also the issue that zip archives can have zombie entries 
left in the stream but removed from the central directory, so the only way to 
know if an entry should actually be returned is to read to the directory which 
happens to be at the end. AIUI, ZipInputStream will return those zombies as it 
proceeds. This is seldom an issue for JARs as they typically don’t have zombies.
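
One rough way to check an archive for such zombies is to compare what 
ZipInputStream yields with what the central directory lists. The sketch below 
assumes entry names are unique within the archive; ZombieCheck and 
streamedButNotIndexed are illustrative names, not an existing API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

// Hypothetical helper, for illustration only.
final class ZombieCheck {

    /** Names seen while streaming that the central directory does not list. */
    static Set<String> streamedButNotIndexed(Path archive) throws IOException {
        Set<String> names = new LinkedHashSet<>();
        // Pass 1: sequential read of the local entry headers.
        try (InputStream in = Files.newInputStream(archive);
             ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry e;
            while ((e = zis.getNextEntry()) != null) {
                names.add(e.getName());
            }
        }
        // Pass 2: drop every name the central directory knows about.
        try (ZipFile zf = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                names.remove(entries.nextElement().getName());
            }
        }
        return names;
    }
}
```

For a typical JAR the returned set should be empty.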

My suggestion for using an NIO2 FileSystem is because its API provides for 
nesting and for random access to the entries in the filesystem. Something like:

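   // Sketch only: assumes a hypothetical "war:" provider and a "jar:" provider
   // that accepts nested URIs. Needs java.net.URI, java.nio.file.* and
   // java.util.Collections.
   Path war = FileSystems.getDefault().getPath("real/path/of/application.war");
   FileSystem warFS = FileSystems.newFileSystem(
           URI.create("war:" + war.toUri()), Collections.emptyMap());
   Path nestedJar = warFS.getPath("/WEB-INF/lib/some.jar");
   FileSystem jarFS = FileSystems.newFileSystem(
           URI.create("jar:" + nestedJar.toUri()), Collections.emptyMap());
   Path resource = jarFS.getPath("some/resource.txt");
   return Files.newInputStream(resource); // or FileChannel.open(resource) etc.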

There are two requirements on the archive FileSystem implementation for this to 
work:
* Support for nesting in the URI
* Functioning implementation of newByteChannel or newFileChannel

Unfortunately the jar: provider that comes with the JRE won’t do that. It has 
ye olde jar: URL nesting issues and requires the archive Path be provided by 
the default FileSystem. Its newByteChannel() returns a SeekableByteChannel that 
is not seekable (doh!) and newFileChannel() works by extracting the entry to a 
temp file.
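
The channel behaviour is easy to probe. The sketch below opens an archive on 
the default FileSystem with whatever zip provider the running JRE installs and 
tries to reposition the channel for one entry. ZipFsProbe and byteAt are 
illustrative names, and the outcome may differ between JDK releases, so treat 
it as a probe rather than a statement of what any given JRE does.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class ZipFsProbe {

    /**
     * Returns the unsigned byte at the given offset of entryName, -1 if the
     * channel refuses position(long), or -2 if the offset is past the end.
     */
    static int byteAt(Path archive, String entryName, long offset)
            throws IOException {
        // The cast selects the (Path, ClassLoader) overload unambiguously.
        try (FileSystem fs = FileSystems.newFileSystem(archive, (ClassLoader) null);
             SeekableByteChannel ch = Files.newByteChannel(fs.getPath(entryName))) {
            try {
                ch.position(offset);
            } catch (UnsupportedOperationException e) {
                return -1; // "seekable" in name only
            }
            ByteBuffer one = ByteBuffer.allocate(1);
            return ch.read(one) == 1 ? (one.get(0) & 0xFF) : -2;
        }
    }
}
```

Even where position() succeeds, that alone does not show whether the provider 
seeks within the compressed data or has simply inflated the whole entry first.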

The nesting problem seems easy to work around. Supporting a seekable channel 
without extraction would be trickier, as you would need to map channel 
positions to the actual positions in the compressed data, which would mean 
digging into the compression block structure. However, I think the naive 
approach of scanning the entry data once and then caching the block offsets 
would still be quicker than inflating to a temp file.

—
Jeremy
