[
https://issues.apache.org/jira/browse/TIKA-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611084#comment-17611084
]
Tong Wang commented on TIKA-3864:
---------------------------------
Thank you for updating the documentation. This is very helpful.
This is good enough for me for now. But since any serious application has to
anticipate and deal with non-ASCII file names, I wonder if it makes sense to
either drop the fetcherName and fetchKey headers or enhance them so that they
accept Unicode. In their current form, the headers are not useful.
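
To illustrate the kind of failure described in the issue below, here is a
minimal Java sketch. It assumes (this is not confirmed in the thread) that the
header bytes sent as UTF-8 are read back as ISO-8859-1 somewhere on the server
side, which is a common default for HTTP header values. The class and method
names are hypothetical and are not part of Tika. The sketch also shows how a
percent-encoded key could be decoded if the server chose to do so.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: illustrates header mis-decoding.
class HeaderEncodingSketch {

    // Simulates reading UTF-8 bytes as ISO-8859-1 (assumed server behavior).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the mis-decoding. ISO-8859-1 maps every byte to exactly one
    // char, so the original UTF-8 bytes can be recovered without loss.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes a percent-encoded key as UTF-8 (requires Java 10 or later).
    // Note: URLDecoder also turns '+' into a space, which may or may not be
    // wanted for file names.
    static String percentDecode(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "\u4e2d\u6587.txt"; // the file name from the issue
        String garbled = misdecode(key);
        System.out.println(garbled.length());            // 6 bytes + ".txt"
        System.out.println(repair(garbled).equals(key));
        System.out.println(
            percentDecode("%E4%B8%AD%E6%96%87.txt").equals(key));
    }
}
```

If the assumption holds, a server could either re-interpret the header bytes as
UTF-8 or percent-decode the value before resolving the path. Which of these, if
either, fits Tika is a design decision for the maintainers.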
> Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher
> -------------------------------------------------------------------------
>
> Key: TIKA-3864
> URL: https://issues.apache.org/jira/browse/TIKA-3864
> Project: Tika
> Issue Type: Bug
> Components: tika-pipes, tika-server
> Affects Versions: 2.4.1
> Environment: debian:bullseye docker container running
> tika-server-standard-2.4.1.jar
> Reporter: Tong Wang
> Priority: Major
>
> When using FileSystemFetcher, if there are non-ASCII characters in fetchKey,
> Tika Server throws an exception because the file name is incorrect. Here is an
> example:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" \
>   --header "fetchKey: 中文.txt"
> {code}
> I get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/䏿.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
>
> When I try to percent-encode the characters:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" \
>   --header "fetchKey: %E4%B8%AD%E6%96%87.txt"
> {code}
> I still get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
> BTW, the locale is set to C.UTF-8 on the Tika Server:
> {code:java}
> # locale
> LANG=C.UTF-8
> LANGUAGE=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_PAPER="C.UTF-8"
> LC_NAME="C.UTF-8"
> LC_ADDRESS="C.UTF-8"
> LC_TELEPHONE="C.UTF-8"
> LC_MEASUREMENT="C.UTF-8"
> LC_IDENTIFICATION="C.UTF-8"
> LC_ALL=
> {code}
>