On Tue, 11 Mar 2025 14:46:40 GMT, Jaikiran Pai <j...@openjdk.org> wrote:

>> Can I please get a review of this change which proposes to fix an issue 
>> `java.util.zip.ZipFile` which would cause failures when multiple instances 
>> of `ZipFile` using non-UTF8 `Charset` were operating against the same 
>> underlying ZIP file? This addresses 
>> https://bugs.openjdk.org/browse/JDK-8347712.
>> 
>> ZIP file specification allows for ZIP entries to mark a `UTF-8` flag to 
>> indicate that the entry name and comment are encoded using UTF8. A 
>> `java.util.zip.ZipFile` can be constructed by passing it a `Charset`. This 
>> `Charset` (which defaults to UTF-8) gets used for decoding entry names and 
>> comments for non-UTF8 entries.
>> 
>> The internal implementation of `ZipFile` uses a `ZipCoder` (backed by 
>> `java.nio.charset.CharsetEncoder/CharsetDecoder` instance) for the given 
>> `Charset`. Except for UTF8 `ZipCoder`, other `ZipCoder`s are not thread safe.
>> 
>> The internal implementation of `ZipFile` maintains a cache of 
>> `ZipFile$Source`. A `Source` corresponds to the underlying ZIP file and 
>> during construction, uses a `ZipCoder` for parsing the ZIP entries and once 
>> constructed holds on to the parsed ZIP structure. Multiple instances of a 
>> `ZipFile` which all correspond to the same ZIP file on the filesystem, share 
>> a single instance of `Source` (after the `Source` has been constructed and 
>> cached). Although `ZipFile` instances aren't expected to be thread-safe, the 
>> fact that multiple different instances of `ZipFile` could be sharing the 
>> same instance of `Source` in concurrent threads, mandates that the `Source` 
>> must be thread-safe.
>> 
>> In Java 15, we did a performance optimization through 
>> https://bugs.openjdk.org/browse/JDK-8243469. As part of that change, we 
>> started holding on to the `ZipCoder` instance (corresponding to the 
>> `Charset` provided during `ZipFile` construction) in the `Source`. This 
>> stored `ZipCoder` was then used for `ZipFile` operations when working with 
>> the ZIP entries. As noted previously, any non-UTF8 `ZipCoder` is not 
>> thread-safe and as a result, any usages of `ZipCoder` in the `Source` makes 
>> `Source` not thread-safe too. That effectively violates the requirement that 
>> `Source` must be thread-safe to allow for its usage in multiple different 
>> `ZipFile` instances concurrently. This then causes `ZipFile` usages to fail 
>> in unexpected ways like the one shown in the linked 
>> https://bugs.openjdk.org/browse/JDK-8347712.
>> 
>> The commit in this PR addresses the issue by not maintaining `ZipCoder` as a 
>> instance field of `Source`. Instead the `ZipCoder` is now mainta...
>
> Jaikiran Pai has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   tiny typo fix in newly introduced documentation

@jaikiran I looked over this and added some comments. Some or most may be 
personal preference, take whatever you find useful. (Not a Reviewer)

src/java.base/share/classes/java/util/zip/ZipFile.java line 86:

> 84:     private final String fileName;     // name of the file
> 85:     // the ZipCoder instance is derived from the Charset passed to the 
> ZipFile constructor
> 86:     // and will be used for decoding the non-UTF-8 entry names and the 
> ZIP file comment.

Maybe a personal preference, and perhaps I'm too intimitely familiar with this 
code, but I feel this field comment is overly verbose. 

Not sure how useful it is to document what the instance is derived from, it's 
not something we do for other fields and maintainers can always inspect the 
constructor to find out how it's being assigned.

The "non-UTF-8" part is correct, but I feel this information belongs closer to 
where that decision is made, not here.

The ZipCoder is used when decoding entry comments as well, so the current 
comment is not fully correct.

Perhaps something like "Used for decoding binary encoded names and comments 
into strings" would do?

src/java.base/share/classes/java/util/zip/ZipFile.java line 87:

> 85:     // the ZipCoder instance is derived from the Charset passed to the 
> ZipFile constructor
> 86:     // and will be used for decoding the non-UTF-8 entry names and the 
> ZIP file comment.
> 87:     private final ZipCoder entryNameCommentCoder;

Since we do not have separate ZipCoders for decoding the differnt ZIP values, 
I'm not sure it's useful to put the field names ("name"/"comment") into the 
instance field name here. Especially if the comment already has this 
information.

Could we call this just `zc` or `zipCoder`?

src/java.base/share/classes/java/util/zip/ZipFile.java line 87:

> 85:     // the ZipCoder instance is derived from the Charset passed to the 
> ZipFile constructor
> 86:     // and will be used for decoding the non-UTF-8 entry names and the 
> ZIP file comment.
> 87:     private final ZipCoder entryNameCommentCoder;

The Source `ZipCoder` field had a `@Stable` annotation. Any reason why this did 
not survive the move?

src/java.base/share/classes/java/util/zip/ZipFile.java line 376:

> 374:      * @param pos the entry position
> 375:      */
> 376:     private static boolean useUTF8Coder(final byte[] cen, final int pos) 
> {

This seems mostly used to determine which `ZipCoder` to pick. Could we fold 
this into the method, calling it `zipCoderForPos` and make it just return the 
`ZipCoder`, like we had in `Source`?

If it needs to be static, then the non-UTF8 ZipCoder could be passed as a 
parameter?

If a method to determine whether a CEN entry uses UTF-8 is still needed, then I 
think it would be more appropriately named  `isUTF8Entry`.

src/java.base/share/classes/java/util/zip/ZipFile.java line 1222:

> 1220:                 table[hsh] = index;
> 1221:                 // Record the CEN offset and the name hash in our hash 
> cell.
> 1222:                 entries[index] = hash;

Seems unrelated to the issue at hand. Perhaps better left for a separate PR, 
backed by a benchmark.

src/java.base/share/classes/java/util/zip/ZipFile.java line 1426:

> 1424:             private final Charset entryNameCommentCharset;
> 1425: 
> 1426:             public Key(File file, BasicFileAttributes attrs, Charset 
> entryNameCommentCharset) {

I feel this parameter and the field it is assigned to could also just be called 
`charset`. The comment has the information about what it's used for.

src/java.base/share/classes/java/util/zip/ZipFile.java line 1434:

> 1432:             @Override
> 1433:             public int hashCode() {
> 1434:                 long t = entryNameCommentCharset.hashCode();

This represents a behavioral change, right? Is a CSR warranted?

src/java.base/share/classes/java/util/zip/ZipFile.java line 1744:

> 1742: 
> 1743:                 int entryPos = pos + CENHDR;
> 1744:                 final ZipCoder entryZipCoder;

If you want to construct this lazily, then I think a comment documenting that 
purpose would be useful here.

-------------

PR Review: https://git.openjdk.org/jdk/pull/23986#pullrequestreview-2675038945
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989545209
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989552146
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989553775
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989558136
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989562779
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989564232
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989566059
PR Review Comment: https://git.openjdk.org/jdk/pull/23986#discussion_r1989622836

Reply via email to