Re: ContentTypes supported by Solr to index

Andrea Gazzarini Wed, 15 Apr 2015 08:30:24 -0700

Sorry, attachments are not supported here :(

Anyway, I believe the misunderstanding lies in what you think "image indexing" should mean: AFAIK, Tika indexes only a) the textual content of a given resource and b) its metadata.

So


- for a JPG file (or in general, an image) you will get only its metadata

- for a compressed archive, the Commons Compress API will decompress the archive and, once that is done, each file within the archive will be associated with a proper parser. So here it actually depends on the file types you have in your archive.
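
To make the archive case concrete, here is a minimal sketch of the idea, not Tika's actual code. It uses the JDK's java.util.zip instead of Commons Compress, and the class name ArchiveDispatchSketch, its methods, and the extension-based lookup are all hypothetical stand-ins (Tika selects a parser from the detected media type, not from the file name). It only shows that what ends up searchable depends on each entry inside the archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveDispatchSketch {

    // Hypothetical stand-in for parser selection (assumption: real Tika
    // detects the media type of each entry rather than trusting its name).
    public static String whatGetsIndexed(String entryName) {
        String lower = entryName.toLowerCase();
        if (lower.endsWith(".jpg") || lower.endsWith(".jpeg") || lower.endsWith(".png")) {
            return "metadata only";
        }
        if (lower.endsWith(".txt") || lower.endsWith(".pdf") || lower.endsWith(".docx")) {
            return "text content + metadata";
        }
        return "depends on the parser chosen for this type";
    }

    // Walks the archive and records, per entry, what a text extractor
    // could contribute to the index.
    public static Map<String, String> describeEntries(byte[] zipBytes) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    result.put(entry.getName(), whatGetsIndexed(entry.getName()));
                }
            }
        }
        return result;
    }

    // Builds a tiny in-memory ZIP so the sketch runs without files on disk.
    public static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bytes)) {
            zout.putNextEntry(new ZipEntry("notes.txt"));
            zout.write("hello".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("photo.jpg"));
            zout.write(new byte[] {(byte) 0xFF, (byte) 0xD8}); // placeholder bytes, not a real image
            zout.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        describeEntries(sampleZip())
                .forEach((name, what) -> System.out.println(name + " -> " + what));
    }
}
```

So a ZIP that contains only images would contribute metadata but no body text, which may be why such documents look like they were "not indexed" when searching on content.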


Best,
Andrea



Is that close to what you were thinking?

On 04/15/2015 05:16 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Thanks Andrea. I can see that Tika 1.5 supports both compressed (ZIP) and image (JPG) formats. If that's the case, why could SolrCell not index the .zip and .jpg documents? Am I missing something here? No error is thrown in the overall process and the Java program completes successfully. But when I query the Solr UI, only 8 files are indexed.
Attached is a simple screenshot of the file types I am trying to index.

Thanks & Regards
Vijay
On 15 April 2015 at 15:27, Andrea Gazzarini <a.gazzar...@gmail.com> wrote:
    Hi Vijay,
    here you can find all supported formats by Tika, which is
    internally used by SolrCell:

     * https://tika.apache.org/1.4/formats.html
     * https://tika.apache.org/1.5/formats.html
     * https://tika.apache.org/1.6/formats.html
     * https://tika.apache.org/1.7/formats.html

    Best,
    Andrea




    On 04/15/2015 04:20 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

        Hi,

        I am trying to index various binary file types into Solr. However,
        some file types seem to be ignored and are not getting indexed, though
        the metadata is being extracted successfully for all the types.

        Specifically, zip files and jpg files are not getting indexed, whereas
        pdf and MS Office documents are. Hence I am wondering whether there is
        a defined list of indexable file types.

        Moreover, I am just wondering why Solr could not index the jpg and zip
        documents when it was able to extract the metadata from those files?

        The code snippet is as below:

        contentStreamUpdateReq.addFile(file, fileType);
        contentStreamUpdateReq.setParam("literal.id", literalId);
        contentStreamUpdateReq.setParam("uprefix", "attr_");
        contentStreamUpdateReq.setParam("fmap.content", "content");
        contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        solrServer.request(contentStreamUpdateReq);

        Thanks & Regards
        Vijay
