Re: WAR FileSystem for fast nested JAR access?

Jeremy Boynes Wed, 18 Mar 2015 08:59:24 -0700

On Mar 17, 2015, at 9:01 AM, Christopher Schultz <ch...@christopherschultz.net> 
wrote:
> 
> Jeremy,
> 
> On 3/17/15 2:39 AM, Jeremy Boynes wrote:
>> On Mar 7, 2015, at 10:13 AM, Jeremy Boynes <jboy...@apache.org> wrote:
>>> 
>>> On Mar 6, 2015, at 7:43 AM, Mark Thomas <ma...@apache.org> wrote:
>>>> Interesting. The deciding factor for me will be performance. Keep in
>>>> mind that we might not need all the API. As long as there is enough to
>>>> implement WebResourceSet and WebResource, we probably have all we need.
>>> 
>>> I ran a micro-benchmark using the greenhouse WAR associated with the 
>>> original bug. I instrumented JarWarResource to log all resources opened 
>>> during startup and record the time. On my system it took ~21,000ms to start 
>>> the application of which ~16,000ms was spent in getJarInputStreamWrapper(). 
>>> 2935 resources were opened, primarily class files.
>>> 
>>> I then replayed the log against the sandbox FS. With the current 
>>> implementation it took ~300ms to open the war, ~350ms to open all the jars, 
>>> and ~5ms to open all the entries with newInputStream().
>>> 
>>> I interpret that to mean that there is pretty constant time taken to 
>>> inflate 15MB of data - the 300ms to scan the archive and the ~350ms to scan 
>>> each of the jars within (each one that was used at least). The speed up 
>>> here comes because we only scan each archive once, the downside is the 
>>> extra memory used to store the inflated data.
>>> 
>>> This is promising enough to me that I’m going to keep exploring.
>>> 
>>> Konstantin’s patch, AIUI, creates an index for each jar which eliminates 
>>> the need to scan jars on the classpath that don’t contain the class being 
>>> requested. However, once the classloader has determined the jar file to use 
>>> we still need to stream through that jar until we reach the desired entry.
>>> 
>>> I think we can avoid that here by digging into the zip file’s internal 
>>> metadata. Where I am currently streaming the jar to build the directory, 
>>> with random access I can build an index just by reading the central 
>>> directory structure. An index entry would contain the name, metadata, and 
>>> the offset in the archive of the entry’s data. When an entry is opened 
>>> we would inflate the data so that it could be used to underpin the channel. 
>>> When the channel is closed the memory would be released.
>>> 
>>> In general, I don’t think there’s a need for the FileSystem to retain 
>>> inflated data after the channel is closed. This would be particularly true 
>>> for the leaf resources which are not likely to be reused; for example, once 
>>> a ClassLoader has used the .class file to define the Class or once a 
>>> framework has processed a .xml config file then neither will need it again.
>>> 
>>> However, I think the WAR ClassLoader would benefit from keeping the JAR 
>>> files on the classpath open to avoid re-inflating them. The pattern though 
>>> would be bursty e.g. lots of class loads during startup followed by 
>>> quiescence. I can think of two ways to handle that:
>>> 1) FileSystem maintains a cache of inflated entries much like a disk 
>>> filesystem has buffers.
>>> The FileSystem would be responsible for evictions, perhaps on an LRU or 
>>> timed basis.
>>> 2) Having the classloader keep the JARs opened/mounted after loading a 
>>> resource until such time as it thinks quiescence is reached. It would then 
>>> unmount JARs to free the memory.
>>> We could do both as they don’t conflict.
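
For illustration, option 1 could be sketched as a size-bounded LRU cache built on LinkedHashMap in access order. This is a minimal sketch under stated assumptions, not the sandbox code: the class name InflatedEntryCache, the byte budget, and keying by entry name are all hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical cache of inflated entry data, bounded by total bytes held. */
final class InflatedEntryCache {
    private final long maxBytes;
    private long currentBytes;
    // accessOrder=true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, byte[]> entries =
            new LinkedHashMap<>(16, 0.75f, true);

    InflatedEntryCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Returns the cached data, or null; a hit marks the entry as recently used. */
    synchronized byte[] get(String name) {
        return entries.get(name);
    }

    synchronized void put(String name, byte[] data) {
        byte[] old = entries.put(name, data);
        if (old != null) {
            currentBytes -= old.length;
        }
        currentBytes += data.length;
        // Evict from the least-recently-used end until back under budget.
        // An entry larger than the whole budget ends up evicting itself.
        Iterator<Map.Entry<String, byte[]>> it = entries.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            currentBytes -= eldest.getValue().length;
            it.remove();
        }
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A timed eviction policy would need a timestamp per entry and a periodic sweep; that is left out here.
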
>>> 
>>> Next step will be to look into building the index directly from the 
>>> archive’s central directory rather than by streaming it.
>> 
>> Next step was actually just to verify that we could make a URLClassLoader 
>> work with this API. I got this to work by turning the path URIs into 
>> collection URLs (ending in ‘/’) which prevented the classloader from trying 
>> to open them as JarFiles.
>> 
>> The classloader works but the classpath search is pretty inefficient, relying 
>> on URLConnection#getInputStream throwing an Exception to detect if a 
>> resource exists. Using it to load the 2935 resources from before took 
>> ~1900ms even after the jars had been indexed. getInputStream() was called 
>> ~120,000 times as the classpath was scanned, i.e. 15us per check with an 
>> average of ~40 checks per resource which seems about right for a classpath 
>> that contains 73 jars.
>> 
>> An obvious solution to avoid the repeated search would be to union the jars’ 
>> directories into a single index. I may try this with a PathClassLoader that 
>> operates using a list of Paths rather than URLs.
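
To make the union idea concrete, here is a minimal sketch, assuming each jar's entry names are already available from its index. The class name UnionIndex and its methods are hypothetical, not part of the sandbox code. The first jar added wins, which mirrors classpath order.

```java
import java.nio.file.Path;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical union of several jars' directories: entry name to owning jar. */
final class UnionIndex {
    private final Map<String, Path> owners = new HashMap<>();

    /**
     * Adds one jar's entry names. Call in classpath order: a name already
     * claimed by an earlier jar is left alone, so the first jar wins.
     */
    void add(Path jar, Collection<String> entryNames) {
        for (String name : entryNames) {
            owners.putIfAbsent(name, jar);
        }
    }

    /** Returns the jar that supplies the resource, or null if no jar has it. */
    Path find(String name) {
        return owners.get(name);
    }
}
```

A lookup is then a single map probe instead of ~40 failed getInputStream() calls. Supporting getResources() (plural) would need a list of owning jars per name rather than just the first.
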
> 
> I just wanted to let you know that I'm reading these with interest. I'm
> anxious to find out if this is going to pan-out.


Thanks. Real life is a bit busy at the moment so progress will be sporadic. If 
you or anyone would like to jump in there are a few areas which still have 
unknowns:
* a way to read the zip’s central directory
* a way to seek into a deflated zip entry without inflating the entire thing
* is a ClassLoader built from a list of Paths helpful?
* how to deal with the file locking model on Windows
* how to work with Paths that are directories - do we get this for free?
* how to use the WatchService to detect changes e.g. web.xml or *.jsp touched?
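
On the first item, a minimal sketch of reading the central directory through a SeekableByteChannel, following the layout in the ZIP format specification. The assumptions are loud ones: no Zip64, a single-disk archive, and UTF-8 entry names. The names CentralDirectoryReader and IndexEntry are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class CentralDirectoryReader {
    private static final int EOCD_SIG = 0x06054b50; // end of central directory
    private static final int CEN_SIG = 0x02014b50;  // central directory file header
    private static final int EOCD_MIN = 22;         // EOCD size with empty comment
    private static final int CEN_FIXED = 46;        // fixed part of a file header

    /** Hypothetical index entry: name, sizes, and where the local header starts. */
    static final class IndexEntry {
        final String name;
        final int method;
        final long compressedSize, size, localHeaderOffset;

        IndexEntry(String name, int method, long compressedSize, long size, long offset) {
            this.name = name;
            this.method = method;
            this.compressedSize = compressedSize;
            this.size = size;
            this.localHeaderOffset = offset;
        }
    }

    static Map<String, IndexEntry> readIndex(SeekableByteChannel ch) throws IOException {
        // The EOCD record sits at the end, followed by a comment of at most 65535 bytes.
        long fileSize = ch.size();
        int tailLen = (int) Math.min(fileSize, EOCD_MIN + 65535);
        ByteBuffer tail = read(ch, fileSize - tailLen, tailLen);
        int eocd = -1;
        for (int i = tailLen - EOCD_MIN; i >= 0; i--) { // scan backwards for the signature
            if (tail.getInt(i) == EOCD_SIG) {
                eocd = i;
                break;
            }
        }
        if (eocd < 0) {
            throw new IOException("end of central directory record not found");
        }
        int count = tail.getShort(eocd + 10) & 0xffff;
        long cenSize = tail.getInt(eocd + 12) & 0xffffffffL;
        long cenOffset = tail.getInt(eocd + 16) & 0xffffffffL;

        // Read the whole central directory in one go and walk its headers.
        ByteBuffer cen = read(ch, cenOffset, (int) cenSize);
        Map<String, IndexEntry> index = new LinkedHashMap<>();
        int pos = 0;
        for (int i = 0; i < count; i++) {
            if (cen.getInt(pos) != CEN_SIG) {
                throw new IOException("bad central directory header at " + pos);
            }
            int method = cen.getShort(pos + 10) & 0xffff;
            long csize = cen.getInt(pos + 20) & 0xffffffffL;
            long size = cen.getInt(pos + 24) & 0xffffffffL;
            int nameLen = cen.getShort(pos + 28) & 0xffff;
            int extraLen = cen.getShort(pos + 30) & 0xffff;
            int commentLen = cen.getShort(pos + 32) & 0xffff;
            long offset = cen.getInt(pos + 42) & 0xffffffffL;
            // Assumption: names are UTF-8 (typical for jars).
            String name = new String(cen.array(), pos + CEN_FIXED, nameLen,
                    StandardCharsets.UTF_8);
            index.put(name, new IndexEntry(name, method, csize, size, offset));
            pos += CEN_FIXED + nameLen + extraLen + commentLen;
        }
        return index;
    }

    private static ByteBuffer read(SeekableByteChannel ch, long position, int length)
            throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length).order(ByteOrder.LITTLE_ENDIAN);
        ch.position(position);
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0) {
                throw new IOException("unexpected end of archive");
            }
        }
        buf.flip();
        return buf;
    }
}
```

The recorded offset points at the entry's local file header, not its data; the data starts after that header's own name and extra fields, so opening an entry still needs one more small read.
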

I think it’s time for a wiki page.
Cheers
Jeremy
