[ 
https://issues.apache.org/jira/browse/TIKA-3864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611084#comment-17611084
 ] 

Tong Wang commented on TIKA-3864:
---------------------------------

Thank you for updating the documentation. This is very helpful.

This is good enough for me for now. But since any serious application has to 
anticipate and deal with non-ASCII file names, I wonder whether it makes sense 
to either drop the fetcherName and fetchKey headers or enhance them to accept 
Unicode. In their current form, the headers are not useful.
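
One possible shape for the "enhance" option, sketched in Java: the client percent-encodes the fetchKey as UTF-8 so the header stays pure ASCII, and the server decodes it before resolving the path. The class and method names here (FetchKeyCodec, encode, decode) are hypothetical and are not part of Tika. The stack trace quoted below suggests the server currently uses the header value as-is, so the decode step is an assumption about what would have to be added, not a description of existing behavior.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Tika class: keeps a fetchKey header ASCII-only.
class FetchKeyCodec {

    // Client side: percent-encode the key as UTF-8.
    // URLEncoder writes spaces as '+', so rewrite them to %20 afterwards
    // (a literal '+' in the key has already become %2B at that point).
    static String encode(String fetchKey) {
        return URLEncoder.encode(fetchKey, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Server side (assumed step): decode the header value back to the real key.
    static String decode(String headerValue) {
        return URLDecoder.decode(headerValue, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "\u4e2d\u6587.txt";           // the file name from the report
        String wire = encode(key);
        System.out.println(wire);                   // %E4%B8%AD%E6%96%87.txt
        System.out.println(decode(wire).equals(key));
    }
}
```

A scheme like this would need to be opt-in or versioned, because a decoder of this kind changes the meaning of keys that already contain a literal `%` or `+`.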

> Non-ascii UTF-8 characters in fetchKey not working with FileSystemFetcher
> -------------------------------------------------------------------------
>
>                 Key: TIKA-3864
>                 URL: https://issues.apache.org/jira/browse/TIKA-3864
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-pipes, tika-server
>    Affects Versions: 2.4.1
>         Environment: debian:bullseye docker container running 
> tika-server-standard-2.4.1jar
>            Reporter: Tong Wang
>            Priority: Major
>
> When using FileSystemFetcher, if there are non-ASCII characters in fetchKey, 
> Tika Server throws an exception because the file name is incorrect. Here is an 
> example:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" \
>   --header "fetchKey: 中文.txt"
> {code}
> I get java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/䏿–‡.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
>  {code}
>  
> When I try to quote the characters:
> {code:java}
> curl -v -X PUT http://tika:9998/rmeta/text --header "fetcherName: restricted" \
>   --header "fetchKey: %E4%B8%AD%E6%96%87.txt"
> {code}
> I still get a java.nio.file.NoSuchFileException:
> {code:java}
> Caused by: java.nio.file.NoSuchFileException: /restricted/%E4%B8%AD%E6%96%87.txt
>     at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
>     at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
>     at java.base/sun.nio.fs.UnixPath.toRealPath(UnixPath.java:860)
>     at org.apache.tika.pipes.fetcher.fs.FileSystemFetcher.fetch(FileSystemFetcher.java:64)
>     at org.apache.tika.server.core.FetcherStreamFactory.getInputStream(FetcherStreamFactory.java:90)
>     at org.apache.tika.server.core.resource.TikaResource.getInputStream(TikaResource.java:159)
> {code}
> BTW, locale is set to C.UTF-8 on Tika Server:
> {code:java}
> # locale
> LANG=C.UTF-8
> LANGUAGE=
> LC_CTYPE="C.UTF-8"
> LC_NUMERIC="C.UTF-8"
> LC_TIME="C.UTF-8"
> LC_COLLATE="C.UTF-8"
> LC_MONETARY="C.UTF-8"
> LC_MESSAGES="C.UTF-8"
> LC_PAPER="C.UTF-8"
> LC_NAME="C.UTF-8"
> LC_ADDRESS="C.UTF-8"
> LC_TELEPHONE="C.UTF-8"
> LC_MEASUREMENT="C.UTF-8"
> LC_IDENTIFICATION="C.UTF-8"
> LC_ALL= {code}
>  

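The garbled path in the first stack trace above looks like the usual result of UTF-8 bytes being read with a single-byte charset: its trailing `–‡` is what bytes 0x96 0x87 (the tail of the UTF-8 encoding of 文) look like under Windows-1252. The Java sketch below reproduces that effect with ISO-8859-1, which many HTTP stacks apply to header values. Whether Tika Server's header handling follows exactly this path is an assumption, and the class name HeaderMojibake is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika.
class HeaderMojibake {

    // Simulate a receiver that reads UTF-8 header bytes with the wrong charset.
    static String misdecode(String original, Charset wrong) {
        return new String(original.getBytes(StandardCharsets.UTF_8), wrong);
    }

    // Undo the damage; this only works when the wrong charset is lossless,
    // as ISO-8859-1 is (every byte value maps to exactly one char).
    static String repair(String garbled, Charset wrong) {
        return new String(garbled.getBytes(wrong), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String key = "\u4e2d\u6587.txt";   // the file name from the report
        String garbled = misdecode(key, StandardCharsets.ISO_8859_1);
        // Two CJK characters become six single-byte characters.
        System.out.println(garbled.length());   // 10
        System.out.println(repair(garbled, StandardCharsets.ISO_8859_1).equals(key));
    }
}
```

Re-encoding on the server, as in repair, is only a workaround under the stated assumption; it silently breaks if a client or proxy already sends the header in another encoding.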


--
This message was sent by Atlassian Jira
(v8.20.10#820010)