Re: ContentTypes supported by Solr to index

Jack Krupansky Wed, 15 Apr 2015 08:56:12 -0700

Check to see if there are any errors in the Solr log for jpg and zip files.
Solr should do something for them - if not, file a Jira to suggest that it
should, as an improvement. Zip should give a list of the enclosed files.
Images should at least give the metadata.
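
For comparison, here is roughly the kind of listing I would expect to come back for a zip. This is only a minimal sketch using the JDK's java.util.zip, not Tika or Solr Cell code, and the class name ZipEntryLister and the method listEntries are made up for illustration. Running it against the same .zip you are posting at least confirms that the archive is readable and shows which entry names an extractor could report.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Solr or Tika: lists the entries of a zip.
class ZipEntryLister {

    /** Returns the entry names found in a zip stream, in archive order. */
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // getNextEntry() returns null once the last entry has been read.
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipEntryLister <file.zip>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : listEntries(in)) {
                System.out.println(name);
            }
        }
    }
}
```

If this prints the entries you expect but Solr still shows nothing for the document, the Solr log is the next place to look.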


-- Jack Krupansky

On Wed, Apr 15, 2015 at 11:45 AM, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Thanks Andrea. For image files and zip files, even metadata is not
> available. Just to explain further, I have indexed a total of 10 files, out
> of which a .jpg file and .zip file are present.
>
> After the indexing process is complete, no information about either of
> these files is present in the Solr query UI when I give *.* as the query
> parameter. Not even metadata is displayed. In fact, in the response,
> *numFound* shows only 8 documents, which are the ones other than the zip
> and jpg files.
>
> Thanks & Regards
> Vijay
>
>
> On 15 April 2015 at 16:29, Andrea Gazzarini <a.gazzar...@gmail.com> wrote:
>
> > Sorry, attachments are not supported here :(
> >
> > Anyway, I believe the misunderstanding lies in what you take "image
> > indexing" to mean: AFAIK, Tika indexes only a) the textual content of a
> > given resource and b) its metadata.
> > So
> >
> > - for a JPG file (or, in general, an image) you will get only its metadata
> > - for a compressed archive, the Commons Compress API will decompress the
> > archive and, once that is done, each file within the archive will be
> > handed to the appropriate parser. So here it actually depends on the
> > file types you have in your archive.
> >
> > Best,
> > Andrea
> >
> >
> >
> > Is that close to what you were thinking?
> >
> > On 04/15/2015 05:16 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
> >
> >> Thanks Andrea. I can see that Tika 1.5 supports both compressed (ZIP) and
> >> image (JPG) formats. If that's the case, why could SolrCell not index the
> >> .zip and .jpg documents? Am I missing something here? No error is
> >> thrown in the overall process and the Java program completes successfully.
> >> But when I query the Solr UI, only 8 files are indexed.
> >>
> >> Attached is a simple screenshot of the file types I am trying to index.
> >>
> >> Thanks & Regards
> >> Vijay
> >>
> >> On 15 April 2015 at 15:27, Andrea Gazzarini <a.gazzar...@gmail.com> wrote:
> >>
> >>     Hi Vijay,
> >>     here you can find all supported formats by Tika, which is
> >>     internally used by SolrCell:
> >>
> >>      * https://tika.apache.org/1.4/formats.html
> >>      * https://tika.apache.org/1.5/formats.html
> >>      * https://tika.apache.org/1.6/formats.html
> >>      * https://tika.apache.org/1.7/formats.html
> >>
> >>     Best,
> >>     Andrea
> >>
> >>
> >>
> >>
> >>     On 04/15/2015 04:20 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
> >>
> >>         Hi,
> >>
> >>         I am trying to index various binary file types into Solr.
> >>         However, some file types seem to be ignored and are not getting
> >>         indexed, though the metadata is being extracted successfully for
> >>         all the types.
> >>
> >>         Specifically, zip files and jpg files are not getting indexed,
> >>         whereas pdf and MS Office documents are. Hence I am wondering
> >>         whether there is a defined list of indexable file types.
> >>
> >>         Moreover, I am wondering why Solr could not index the jpg and zip
> >>         documents when it was able to extract the metadata from those
> >>         files?
> >>
> >>         The code snippet is as below:
> >>
> >>         contentStreamUpdateReq.addFile(file, fileType);
> >>         contentStreamUpdateReq.setParam("literal.id", literalId);
> >>         contentStreamUpdateReq.setParam("uprefix", "attr_");
> >>         contentStreamUpdateReq.setParam("fmap.content", "content");
> >>         contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT,
> >>                 true, true);
> >>         solrServer.request(contentStreamUpdateReq);
> >>
> >>         Thanks & Regards
> >>         Vijay
> >>
> >>
> >>
> >>
> >> The contents of this e-mail are confidential and for the exclusive use of
> >> the intended recipient. If you receive this e-mail in error please delete
> >> it from your system immediately and notify us either by e-mail or
> >> telephone. You should not copy, forward or otherwise disclose the content
> >> of the e-mail. The views expressed in this communication may not
> >> necessarily be the view held by WHISHWORKS.
> >>
> >
> >
>
>
