On Mar 7, 2015, at 10:13 AM, Jeremy Boynes <jboy...@apache.org> wrote:
> 
> On Mar 6, 2015, at 7:43 AM, Mark Thomas <ma...@apache.org> wrote:
>> Interesting. The deciding factor for me will be performance. Keep in
>> mind that we might not need all the API. As long as there is enough to
>> implement WebResourceSet and WebResource, we probably have all we need.
> 
> I ran a micro-benchmark using the greenhouse WAR associated with the original 
> bug. I instrumented JarWarResource to log all resources opened during startup 
> and record the time. On my system it took ~21,000ms to start the application, 
> of which ~16,000ms was spent in getJarInputStreamWrapper(). 2935 resources 
> were opened, primarily class files.
> 
> I then replayed the log against the sandbox FS. With the current 
> implementation it took ~300ms to open the war, ~350ms to open all the jars, 
> and ~5ms to open all the entries with newInputStream().
> 
> I interpret that to mean that there is a fairly constant cost to inflate 
> 15MB of data: ~300ms to scan the archive and ~350ms in total to scan the jars 
> within it (those that were used, at least). The speed-up here comes because 
> we only scan each archive once; the downside is the extra memory used to 
> store the inflated data.
> 
> This is promising enough to me that I’m going to keep exploring.
> 
> Konstantin’s patch, AIUI, creates an index for each jar which eliminates the 
> need to scan jars on the classpath that don’t contain the class being 
> requested. However, once the classloader has determined the jar file to use 
> we still need to stream through that jar until we reach the desired entry.
> 
> I think we can avoid that here by digging into the zip file’s internal 
> metadata. Where I am currently streaming the jar to build the directory, 
> with random access I can build an index just by reading the central directory 
> structure. An index entry would contain the name, metadata, and the offset in 
> the archive of the entry’s data. When an entry is opened we would inflate the 
> data so that it could be used to underpin the channel. When the channel is 
> closed the memory would be released.
> 
> In general, I don’t think there’s a need for the FileSystem to retain 
> inflated data after the channel is closed. This would be particularly true 
> for the leaf resources which are not likely to be reused; for example, once a 
> ClassLoader has used the .class file to define the Class or once a framework 
> has processed a .xml config file then neither will need it again.
> 
> However, I think the WAR ClassLoader would benefit from keeping the JAR files 
> on the classpath open to avoid re-inflating them. The pattern though would be 
> bursty e.g. lots of class loads during startup followed by quiescence. I can 
> think of two ways to handle that:
> 1) The FileSystem maintains a cache of inflated entries, much like a disk 
> filesystem has buffers. The FileSystem would be responsible for evictions, 
> perhaps on an LRU or timed basis.
> 2) The classloader keeps the JARs opened/mounted after loading a resource 
> until such time as it thinks quiescence is reached. It would then unmount 
> the JARs to free the memory.
> We could do both as they don’t conflict.
> 
> Next step will be to look into building the index directly from the archive’s 
> central directory rather than by streaming it.

The next step was actually just to verify that we could make a URLClassLoader 
work with this API. I got this to work by turning the path URIs into collection 
URLs (ending in ‘/’), which prevented the classloader from trying to open them 
as JarFiles.
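
As a rough illustration of the trailing-slash trick (not the sandbox code), the 
sketch below appends ‘/’ to a location URI so that URLClassLoader treats it as 
a directory rather than a JAR. The class and method names are made up for the 
example, and a plain directory on the default filesystem stands in for a 
mounted archive; a custom FileSystem would additionally need a URL scheme the 
JVM can resolve. It assumes Java 9+ for readAllBytes().

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
class CollectionUrlSketch {

    // URLClassLoader treats any URL ending in '/' as a directory and
    // anything else as a JAR file, so make sure the slash is present.
    static URL toCollectionUrl(URI uri) throws IOException {
        String s = uri.toString();
        return URI.create(s.endsWith("/") ? s : s + "/").toURL();
    }

    // Load a resource as text through a URLClassLoader rooted at 'root'.
    // Returns null if the resource is not found.
    static String readResource(Path root, String name) throws IOException {
        URL[] urls = { toCollectionUrl(root.toUri()) };
        try (URLClassLoader loader = new URLClassLoader(urls, null);
             InputStream in = loader.getResourceAsStream(name)) {
            return in == null
                    ? null
                    : new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```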

The classloader works, but the classpath search is pretty inefficient: it relies 
on URLConnection#getInputStream throwing an Exception to detect whether a 
resource exists. Using it to load the 2935 resources from before took ~1900ms 
even after the jars had been indexed. getInputStream() was called ~120,000 times 
as the classpath was scanned, i.e. ~15us per check with an average of ~40 checks 
per resource, which seems about right for a classpath that contains 73 jars.
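
To make the cost model concrete, here is a minimal sketch of that kind of 
search: each classpath root is tried in order and a failed open is the only 
signal that the resource is absent. The class name and probe counter are 
invented for the example, and directories stand in for mounted jars. If 
resources were spread evenly over 73 roots (an assumption), a hit would need 
(73 + 1) / 2 = 37 probes on average, which is in line with the ~40 measured.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical illustration of an exception-driven classpath scan.
class LinearProbeSketch {
    int probes = 0; // total number of open attempts made so far

    // Try each root in classpath order; returns null if no root has 'name'.
    byte[] load(List<Path> roots, String name) {
        for (Path root : roots) {
            probes++;
            try (InputStream in = Files.newInputStream(root.resolve(name))) {
                return in.readAllBytes();
            } catch (IOException e) {
                // Absence is only reported via the exception; keep scanning.
            }
        }
        return null;
    }
}
```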

An obvious solution to avoid the repeated search would be to union the jars’ 
directories into a single index. I may try this with a PathClassLoader that 
operates using a list of Paths rather than URLs.
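
A minimal sketch of what such a union index could look like, assuming each 
classpath entry is available as a root Path (plain directories here, standing 
in for mounted jars). The class name and methods are hypothetical and this is 
not an actual PathClassLoader; it only shows the lookup structure. Roots are 
walked once in classpath order and the first root to provide a name wins, so a 
lookup becomes a single map access instead of a scan over every jar.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical union of several roots' directories into one index.
class UnionIndexSketch {
    // resource name (with '/' separators) -> file in the first root that has it
    private final Map<String, Path> index = new HashMap<>();

    UnionIndexSketch(List<Path> roots) throws IOException {
        for (Path root : roots) {
            String sep = root.getFileSystem().getSeparator();
            try (Stream<Path> files = Files.walk(root)) {
                files.filter(Files::isRegularFile).forEach(p -> {
                    String name = root.relativize(p).toString().replace(sep, "/");
                    index.putIfAbsent(name, p); // earlier roots take precedence
                });
            }
        }
    }

    // Returns the resource bytes, or null if no root provides 'name'.
    byte[] load(String name) throws IOException {
        Path p = index.get(name);
        return p == null ? null : Files.readAllBytes(p);
    }

    int size() {
        return index.size();
    }
}
```

The trade-off is the memory for the map and the up-front walk of every root, 
in exchange for avoiding the per-resource scan described above.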

Cheers
Jeremy

