Re: WAR FileSystem for fast nested JAR access?

Jeremy Boynes Sun, 08 Mar 2015 08:40:26 -0700

On Mar 8, 2015, at 5:28 AM, Christopher Schultz <[email protected]> 
wrote:
> 
> Jeremy,
> 
> On 3/7/15 1:13 PM, Jeremy Boynes wrote:
>> On Mar 6, 2015, at 7:43 AM, Mark Thomas <[email protected]> wrote:
>>> Interesting. The deciding factor for me will be performance. Keep
>>> in mind that we might not need all the API. As long as there is
>>> enough to implement WebResourceSet and WebResource, we probably
>>> have all we need.
>> 
>> I ran a micro-benchmark using the greenhouse WAR associated with the
>> original bug. I instrumented JarWarResource to log all resources
>> opened during startup and record the time. On my system it took
>> ~21,000ms to start the application of which ~16,000ms was spent in
>> getJarInputStreamWrapper(). 2935 resources were opened, primarily
>> class files.
>> 
>> I then replayed the log against the sandbox FS. With the current
>> implementation it took ~300ms to open the war, ~350ms to open all the
>> jars, and ~5ms to open all the entries with newInputStream().
>> 
>> I interpret that to mean that there is pretty constant time taken to
>> inflate 15MB of data - the 300ms to scan the archive and the ~350ms
>> to scan each of the jars within (each one that was used at least).
>> The speed up here comes because we only scan each archive once, the
>> downside is the extra memory used to store the inflated data.
>> 
>> This is promising enough to me that I’m going to keep exploring.
>> 
>> Konstantin’s patch, AIUI, creates an index for each jar which
>> eliminates the need to scan jars on the classpath that don’t contain
>> the class being requested. However, once the classloader has
>> determined the jar file to use we still need to stream through that
>> jar until we reach the desired entry.
> 
> I have a dumb question about this: why does the JAR file have to be
> /searched/ for a particular entry? Opening the JAR file should seek to
> the end of the file to read the TOC, and then the file offsets should be
> immediately available. Need file #27? Look in entries[27] and the offset
> into the file should be available.
> 
> Is the problem that, because the JAR is inside a WAR file, the "offset"
> into the JAR file is meaningless because there isn't an easy way to
> determine the mapping from uncompressed-offset to compressed-offset?


It’s a limitation of the classical API. We can open the WAR as a JarFile, which 
allows this type of random access using just this mechanism. You can access any 
entry and retrieve an InputStream, which we wrap as a JarInputStream. That’s 
not seekable, so the only way to locate an entry in that JarInputStream (e.g. a 
resource or class) is to search through that stream.

JarFile though can only open a File (i.e. something on the default filesystem). 
This is why it is much faster when the JAR is extracted to the file system 
where it can then be opened with JarFile to give random access to its contents.
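
To illustrate the difference, here is a minimal sketch (not the Tomcat code). 
The class name NestedJarLookup and the demo archive it builds are made up for 
the example. The outer lookup goes straight to the entry through JarFile, while 
the inner lookup has to walk the JarInputStream entry by entry:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class NestedJarLookup {

    /** Returns the bytes of entryName inside the nested jar, or null if absent. */
    static byte[] readNested(Path war, String jarName, String entryName) {
        try (JarFile outer = new JarFile(war.toFile())) {
            // Outer archive is a real file: JarFile looks the entry up directly.
            JarEntry jar = outer.getJarEntry(jarName);
            if (jar == null) return null;
            // Inner archive is only a stream: entries must be visited in order.
            try (JarInputStream in = new JarInputStream(outer.getInputStream(jar))) {
                JarEntry e;
                while ((e = in.getNextJarEntry()) != null) {
                    if (e.getName().equals(entryName)) return drain(in);
                }
            }
            return null;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    private static byte[] drain(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
        return out.toByteArray();
    }

    /** Builds an in-memory zip holding a single entry (demo helper). */
    static byte[] zipOf(String name, byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream z = new ZipOutputStream(buf)) {
            z.putNextEntry(new ZipEntry(name));
            z.write(data);
            z.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    /** Writes a temporary WAR that contains one nested jar (demo helper). */
    static Path demoWar() {
        try {
            byte[] jar = zipOf("com/example/Hello.txt",
                    "hello".getBytes(StandardCharsets.UTF_8));
            Path war = Files.createTempFile("demo", ".war");
            war.toFile().deleteOnExit();
            Files.write(war, zipOf("WEB-INF/lib/inner.jar", jar));
            return war;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        byte[] data = readNested(demoWar(), "WEB-INF/lib/inner.jar",
                "com/example/Hello.txt");
        System.out.println(new String(data, StandardCharsets.UTF_8));
    }
}
```

Every call to readNested() re-reads the nested jar from the start, which is 
the kind of repeated scan that shows up in the timings quoted above.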

There are many code paths in our code and in the JDK (e.g. 
URLClassLoader/URLClassPath) to detect whether something is on disk and can be 
optimized (i.e. is it a directory which allows path manipulation, or is it a 
file:// URL that could be opened using JarFile).

> 
>> I think we can avoid that here by digging into the zip file’s
>> internal metadata. Where I am currently streaming the jar to build
>> the directory, with random access I can build an index just by
>> reading the central directory structure. An index entry would contain
>> the name, metadata, and the offset in the archive of the entry’s
>> data.
> 
> Which archive do you mean, here? The inner JAR or the outer WAR?

Either, basically whatever archive is being mounted. During the mount, it 
builds an index of the entries in the jar. As a quick hack, it currently does 
that by scanning all the content using a JarInputStream. What I’m planning on 
doing next is building that index by reading the archive’s structure directly. 
Doing that requires seeking backwards through the zip data structures, which 
the current APIs don’t support.
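
As a rough sketch of what reading the structure directly could look like, the 
code below finds the end-of-central-directory record by scanning backwards from 
the end of the data and then walks the central directory headers. It assumes 
the archive bytes are already in memory, ignores ZIP64 and multi-disk archives, 
and treats names as UTF-8. The class name CentralDirectoryIndex and its helpers 
are hypothetical; the field offsets follow the published ZIP format.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class CentralDirectoryIndex {
    static final int EOCD_SIG = 0x06054b50;  // end of central directory record
    static final int CEN_SIG = 0x02014b50;   // central directory file header
    static final int EOCD_MIN = 22;          // record size without a comment
    static final int CEN_FIXED = 46;         // fixed part of a central header

    /** Maps each entry name to the offset of its local file header. */
    static Map<String, Long> index(byte[] zip) {
        ByteBuffer b = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        // The record is at the very end, followed only by an optional comment
        // of at most 65535 bytes, so scan backwards over that window.
        int eocd = -1;
        int lowest = Math.max(0, zip.length - EOCD_MIN - 0xFFFF);
        for (int p = zip.length - EOCD_MIN; p >= lowest; p--) {
            if (b.getInt(p) == EOCD_SIG) { eocd = p; break; }
        }
        if (eocd < 0) throw new IllegalArgumentException("no end-of-central-directory record");

        int count = b.getShort(eocd + 10) & 0xFFFF;  // total number of entries
        int pos = b.getInt(eocd + 16);               // offset of first central header
        Map<String, Long> idx = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            if (b.getInt(pos) != CEN_SIG) throw new IllegalArgumentException("bad central header");
            int nameLen = b.getShort(pos + 28) & 0xFFFF;
            int extraLen = b.getShort(pos + 30) & 0xFFFF;
            int commentLen = b.getShort(pos + 32) & 0xFFFF;
            long localOffset = b.getInt(pos + 42) & 0xFFFFFFFFL;
            String name = new String(zip, pos + CEN_FIXED, nameLen, StandardCharsets.UTF_8);
            idx.put(name, localOffset);
            pos += CEN_FIXED + nameLen + extraLen + commentLen;
        }
        return idx;
    }

    /** Builds a small two-entry archive in memory (demo helper). */
    static byte[] sample() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream z = new ZipOutputStream(buf)) {
            z.putNextEntry(new ZipEntry("a.txt"));
            z.write("first".getBytes(StandardCharsets.UTF_8));
            z.closeEntry();
            z.putNextEntry(new ZipEntry("b/c.class"));
            z.write(new byte[64]);
            z.closeEntry();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        index(sample()).forEach((name, off) -> System.out.println(off + "\t" + name));
    }
}
```

A real index entry would also carry the compression method and the compressed 
and uncompressed sizes from the same header, so that the entry data can be 
inflated without scanning anything else.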

> 
>> When an entry is opened, we would inflate the data so that it
>> could be used to underpin the channel. When the channel is closed the
>> memory would be released.
>> 
>> In general, I don’t think there’s a need for the FileSystem to retain
>> inflated data after the channel is closed. This would be particularly
>> true for the leaf resources which are not likely to be reused; for
>> example, once a ClassLoader has used the .class file to define the
>> Class or once a framework has processed a .xml config file then
>> neither will need it again.
> 
> You could use a small LRU cache or something if you wanted to get fancy.
> Once the majority of class loading is done, it might help for other
> resources that are requested with some regularity.
> 
>> However, I think the WAR ClassLoader would benefit from keeping the
>> JAR files on the classpath open to avoid re-inflating them. The
>> pattern though would be bursty e.g. lots of class loads during
>> startup followed by quiescence. I can think of two ways to handle
>> that: 1) FileSystem maintains a cache of inflated entries much
>> like a disk filesystem has buffers. The FileSystem would be
>> responsible for evictions, perhaps on an LRU or timed basis. 2) Having
>> the classloader keep the JARs opened/mounted after loading a resource
>> until such time as it thinks quiescence is reached. It would then
>> unmount JARs to free the memory. We could do both as they don’t
>> conflict.
> 
> I like both LRU + timed expiration. If a 1GiB resource is requested a
> single time and then nothing else for a long time (days?), that file
> will sit there hogging-up heap space.
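
A small sketch of how LRU and idle-time eviction could be combined, assuming 
an access-ordered LinkedHashMap is good enough and a monotonic clock. The class 
name EntryCache, its limits, and the injectable clock are invented for the 
example; a real cache would more likely bound total bytes than entry count.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

final class EntryCache {
    private static final class Slot {
        final byte[] data;
        long lastUsed;
        Slot(byte[] data, long lastUsed) { this.data = data; this.lastUsed = lastUsed; }
    }

    private final long maxIdleMillis;
    private final LongSupplier clock;
    private final LinkedHashMap<String, Slot> map;

    EntryCache(final int maxEntries, long maxIdleMillis, LongSupplier clock) {
        this.maxIdleMillis = maxIdleMillis;
        this.clock = clock;
        // accessOrder=true keeps the least recently used entry first.
        this.map = new LinkedHashMap<String, Slot>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Slot> eldest) {
                return size() > maxEntries;
            }
        };
    }

    synchronized byte[] get(String name) {
        long now = clock.getAsLong();
        evictIdle(now);
        Slot s = map.get(name);  // a hit moves the entry to the most-recent end
        if (s == null) return null;
        s.lastUsed = now;
        return s.data;
    }

    synchronized void put(String name, byte[] data) {
        long now = clock.getAsLong();
        evictIdle(now);
        map.put(name, new Slot(data, now));  // may evict the LRU entry
    }

    synchronized int size() {
        return map.size();
    }

    // Entries are ordered by last use, so stop at the first one still fresh.
    private void evictIdle(long now) {
        Iterator<Slot> it = map.values().iterator();
        while (it.hasNext()) {
            if (now - it.next().lastUsed <= maxIdleMillis) break;
            it.remove();
        }
    }

    public static void main(String[] args) {
        EntryCache cache = new EntryCache(2, 60_000, System::currentTimeMillis);
        cache.put("a.class", new byte[10]);
        cache.put("b.class", new byte[10]);
        cache.get("a.class");
        cache.put("c.xml", new byte[10]);  // b.class is now the LRU entry
        System.out.println(cache.get("b.class") == null);
    }
}
```

Idle entries are only dropped when the cache is next touched, so the large 
one-off resource in the example above would still need a background sweep or 
an unmount to be released during a long quiet period.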
> 
>> Next step will be to look into building the index directly from the
>> archive’s central directory rather than by streaming it.
> 
> Is this possible using java.util.zip.ZipFile? Skimming the API, it
> doesn't seem so. This kind of thing really ought to exist already.
> Perhaps there is an ASL-compatible tool available we could use.

Right, the current API does not allow it. Even if we proposed a patch to 
OpenJDK and it was accepted, it wouldn’t help in the short term.

There are a couple of ASL-compatible tools I’ve looked at for inspiration:
* Sun’s FileSystem demo, which became the jar: FileSystem bundled in the JDK 
(BSD license)
* Apache Commons VFS, an earlier VFS implementation with Zip support. It 
predates the NIO2 APIs.

There’s also JZlib, a BSD-licensed (de-)compressor, but I’ve not looked at that 
yet as I think j.u.z.Inflater may be sufficient.
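
For reference, a minimal sketch of inflating one entry’s data with 
j.u.z.Inflater. Zip entries hold raw DEFLATE data without a zlib header, hence 
the nowrap=true constructor argument. The class name RawInflate and both helper 
methods are hypothetical, and the uncompressed size is assumed to come from the 
index built at mount time.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class RawInflate {

    /** Inflates raw DEFLATE data whose uncompressed size is already known. */
    static byte[] inflateRaw(byte[] compressed, int uncompressedSize) {
        Inflater inf = new Inflater(true);  // true = no zlib header or checksum
        try {
            // The Inflater Javadoc asks for one extra dummy byte in nowrap mode.
            inf.setInput(Arrays.copyOf(compressed, compressed.length + 1));
            byte[] out = new byte[uncompressedSize];
            int n = 0;
            while (n < out.length && !inf.finished()) {
                int r = inf.inflate(out, n, out.length - n);
                if (r == 0 && (inf.needsInput() || inf.needsDictionary())) break;
                n += r;
            }
            if (n != out.length) throw new IllegalStateException("truncated or corrupt data");
            return out;
        } catch (DataFormatException ex) {
            throw new IllegalArgumentException(ex);
        } finally {
            inf.end();  // release the native zlib state
        }
    }

    /** Produces raw DEFLATE data, as stored in a zip entry (demo helper). */
    static byte[] deflateRaw(byte[] data) {
        Deflater def = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try {
            def.setInput(data);
            def.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!def.finished()) {
                out.write(buf, 0, def.deflate(buf));
            }
            return out.toByteArray();
        } finally {
            def.end();
        }
    }

    public static void main(String[] args) {
        byte[] original = "WEB-INF/classes/Example.class".getBytes(StandardCharsets.UTF_8);
        byte[] restored = inflateRaw(deflateRaw(original), original.length);
        System.out.println(Arrays.equals(original, restored));
    }
}
```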
