Check the Solr log for any errors related to the jpg and zip files. Solr should do something with them - if not, file a Jira suggesting that it should, as an improvement. Zip should give a list of the enclosed files. Images should at least give the metadata.
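
To illustrate what "a list of the enclosed files" means for a zip, here is a minimal sketch that uses only the JDK's java.util.zip. The class name ZipListing and the method listEntries are made up for this example; it is not how Solr Cell or Tika process archives internally, only a quick way to see which entries an archive contains before comparing that with what ends up in the index.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Solr or Tika.
class ZipListing {

    // Returns the names of all entries (files and directories) in the archive,
    // in the order they appear in the zip's central directory.
    static List<String> listEntries(Path zipPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipListing <archive.zip>");
            return;
        }
        // Print one entry name per line.
        for (String name : listEntries(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Running it against the zip that was sent to Solr shows which enclosed file types would each need their own parser.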

-- Jack Krupansky

On Wed, Apr 15, 2015 at 11:45 AM, Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomire...@whishworks.com> wrote:

> Thanks Andrea. For image files and zip files, even metadata is not
> available. Just to explain further, I have indexed a total of 10 files, out
> of which a .jpg file and .zip file are present.
>
> After the indexing process is complete, no information about either of
> these files is present in the Solr query UI when I give *.* as the query
> parameters. Not even metadata is displayed. In fact, in the response,
> *numFound* is showing only 8 documents, which are the ones apart from the zip
> and jpg files.
>
> Thanks & Regards
> Vijay
>
> On 15 April 2015 at 16:29, Andrea Gazzarini <a.gazzar...@gmail.com> wrote:
>
> > Sorry, attachments are not supported here :(
> >
> > Anyway, I believe the misunderstanding resides in what you think
> > "image indexing" should mean: actually, AFAIK, Tika indexes only a) the
> > textual content of a given resource b) its metadata.
> > So
> >
> > - for a JPG file (or in general, an image) you will get only its metadata
> > - for a compressed archive, the Commons Compress API will decompress the
> >   archive and, once that is done, each file within the archive will be
> >   associated with a proper parser. So here it actually depends on the
> >   files (types) you have in your archive.
> >
> > Best,
> > Andrea
> >
> > Is that close to what you were thinking?
> >
> > On 04/15/2015 05:16 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
> >
> >> Thanks Andrea. I can see that Tika 1.5 supports both compressed (ZIP) and
> >> image (JPG) formats. If that's the case, why could SolrCell not index the
> >> .zip and .jpg documents? Am I missing something here? No error is
> >> thrown in the overall process and the Java program completes successfully.
> >> But when I query the Solr UI, only 8 files are indexed.
> >>
> >> Attached is a simple screenshot of the file types I am trying to index.
> >>
> >> Thanks & Regards
> >> Vijay
> >>
> >> On 15 April 2015 at 15:27, Andrea Gazzarini <a.gazzar...@gmail.com> wrote:
> >>
> >>     Hi Vijay,
> >>     here you can find all the formats supported by Tika, which is
> >>     used internally by SolrCell:
> >>
> >>       * https://tika.apache.org/1.4/formats.html
> >>       * https://tika.apache.org/1.5/formats.html
> >>       * https://tika.apache.org/1.6/formats.html
> >>       * https://tika.apache.org/1.7/formats.html
> >>
> >>     Best,
> >>     Andrea
> >>
> >>     On 04/15/2015 04:20 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
> >>
> >>         Hi,
> >>
> >>         I am trying to index various binary file types into Solr.
> >>         However, some file types seem to be ignored and are not getting
> >>         indexed, though the metadata is being extracted successfully
> >>         for all the types.
> >>
> >>         Specifically, zip files and jpg files are not getting indexed,
> >>         whereas pdf and MS Office documents are getting indexed. Hence I am
> >>         wondering whether there is a defined list of indexable file types.
> >>
> >>         Moreover, I am just wondering why Solr could not index the jpg
> >>         and zip documents when it was able to extract the metadata from
> >>         those files?
> >>
> >>         The code snippet is as below:
> >>
> >>         contentStreamUpdateReq.addFile(file, fileType);
> >>         contentStreamUpdateReq.setParam("literal.id", literalId);
> >>         contentStreamUpdateReq.setParam("uprefix", "attr_");
> >>         contentStreamUpdateReq.setParam("fmap.content", "content");
> >>         contentStreamUpdateReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> >>         solrServer.request(contentStreamUpdateReq);
> >>
> >>         Thanks & Regards
> >>         Vijay
> >>
> >> The contents of this e-mail are confidential and for the exclusive use of
> >> the intended recipient. If you receive this e-mail in error please delete
> >> it from your system immediately and notify us either by e-mail or
> >> telephone. You should not copy, forward or otherwise disclose the content
> >> of the e-mail. The views expressed in this communication may not
> >> necessarily be the view held by WHISHWORKS.