: However, when I convert InputStream stream (inside parse function) to File, 
: it seems that Solr is adding header and footer that contains Metadata so the
: file won't be converted properly.

        ...

It's not totally clear from your problem description, but I *think* you 
are saying that you are using SolrJ to stream these special XML files you 
created to Solr, and then you are using a custom parser registered with 
Tika/ExtractingRequestHandler to parse them into documents.  The output 
you've pasted below appears to be a hex dump of the raw HTTP stream from 
this communication.

Solr isn't adding any header/footer to your XML files; what you are seeing 
are the normal MIME multipart boundary lines and part headers that wrap a 
file when it is sent in an HTTP multipart/form-data request.  You may also 
occasionally notice "chunked encoding" markers used to stream arbitrary 
amounts of data over HTTP w/o requiring the clients to pre-calculate the 
total "Content-Length".   This is all happening at the HTTP protocol level, 
and will be dealt with by the HttpClient and Servlet Container before Solr 
ever sees the InputStreams -- let alone hands them to Tika -- so it should 
be completely transparent to you (unless you go sniffing the wire like this).
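
To make that concrete, here is a rough Python sketch (not anything Solr or 
the servlet container actually runs) that rebuilds a single-part multipart 
body from the delimiter and part headers visible in the dump quoted below, 
then strips the envelope off again.  The function name split_single_part 
and the stand-in payload are made up for illustration; the real file 
contents between the first and last rows of the dump are not shown in your 
mail, so they are not reproduced here.

```python
# Delimiter line as it appears in the dump: 30 dashes followed by "19208b977df7".
DELIMITER = b"-" * 30 + b"19208b977df7"

# Stand-in payload (an assumption): the six payload bytes visible at the end
# of the "header" rows plus the ten zero bytes visible before the "footer".
PAYLOAD = b"\xd0\xcf\x11\xe0\xa1\xb1" + b"\x00" * 10

RAW_BODY = b"".join([
    DELIMITER, b"\r\n",
    b'Content-Disposition: form-data; name="myfile"; filename="hwp2.hwp"\r\n',
    b"Content-Type: application/octet-stream\r\n",
    b"\r\n",                      # blank line ends the part headers
    PAYLOAD,
    b"\r\n", DELIMITER, b"--\r\n",  # closing delimiter, i.e. the "footer"
])


def split_single_part(raw: bytes):
    """Return (headers, payload) for a body holding exactly one part."""
    delimiter, rest = raw.split(b"\r\n", 1)
    header_block, body = rest.split(b"\r\n\r\n", 1)
    end = body.rfind(b"\r\n" + delimiter + b"--")
    if end < 0:
        raise ValueError("closing delimiter not found")
    headers = {}
    for line in header_block.split(b"\r\n"):
        name, _, value = line.decode("ascii").partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers, body[:end]


headers, payload = split_single_part(RAW_BODY)
print(headers["content-type"])
print(payload == PAYLOAD)
```

The bytes between the blank line and the closing delimiter come back 
unchanged, which is the same unwrapping the servlet container does for you 
before your parser ever gets the InputStream.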

If you are encountering an actual problem, then you need to give us a lot 
more details about how you are using SolrJ/Solr, what servlet container 
you are using, what your custom parser code looks like, and what kind of 
errors you are getting, so someone can try to reproduce the problem.

: Following text is added as a header
: 
:   1 0000000: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d  ----------------
:   2 0000010: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 3139  --------------19
:   3 0000020: 3230 3862 3937 3764 6637 0d0a 436f 6e74  208b977df7..Cont
:   4 0000030: 656e 742d 4469 7370 6f73 6974 696f 6e3a  ent-Disposition:
:   5 0000040: 2066 6f72 6d2d 6461 7461 3b20 6e61 6d65   form-data; name
:   6 0000050: 3d22 6d79 6669 6c65 223b 2066 696c 656e  ="myfile"; filen
:   7 0000060: 616d 653d 2268 7770 322e 6877 7022 0d0a  ame="hwp2.hwp"..
:   8 0000070: 436f 6e74 656e 742d 5479 7065 3a20 6170  Content-Type: ap
:   9 0000080: 706c 6963 6174 696f 6e2f 6f63 7465 742d  plication/octet-
:  10 0000090: 7374 7265 616d 0d0a 0d0a d0cf 11e0 a1b1  stream
: 
: 
: Following text is added as a footer
: 
: 554 0002290: 0000 0000 0000 0000 0000 0d0a 2d2d 2d2d  ............----
: 555 00022a0: 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d 2d2d  ----------------
: 556 00022b0: 2d2d 2d2d 2d2d 2d2d 2d2d 3139 3230 3862  ----------19208b
: 557 00022c0: 3937 3764 6637 2d2d 0d0a                        977df7--..


-Hoss
