> Another side issue: Using the extracting handler for handling rich documents
> is discouraged. Tika (which is what is used by the extracting handler) is
> pretty amazing software, but it has a habit of crashing or consuming all the
> heap memory when it encounters a document that it doesn't know how to
> properly handle. It is best to run Tika in your external program and send
> its output to Solr, so that if there's a problem, it won't affect your
> search capability.
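For what it's worth, a minimal sketch of the isolation the quoted advice describes might look like the following: extraction runs in a child JVM via the standalone tika-app jar, so a crash or runaway heap kills only the child process, never the indexer. The jar path, heap cap, and timeout below are my assumptions, not anything from the original setup:

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.TimeUnit;

    public class ExternalTikaExtractor {

        // Assumed location of the standalone Tika jar; adjust to your install.
        private static final String TIKA_APP_JAR = "/opt/tika/tika-app.jar";

        /**
         * Extracts plain text by running Tika in a child JVM. A crash, OOM,
         * or hang in Tika kills only the child, never the indexer. Returns
         * null when extraction fails so the caller can skip the file.
         */
        public static String extractText(File f) throws Exception {
            Path out = Files.createTempFile("tika-", ".txt");
            try {
                Process p = new ProcessBuilder(
                        "java", "-Xmx512m", "-jar", TIKA_APP_JAR,
                        "--text", f.getCanonicalPath())
                    // Write stdout to a file and drop stderr, so a full pipe
                    // buffer can never deadlock the child.
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
                // Kill the child if it wedges on a pathological document.
                if (!p.waitFor(60, TimeUnit.SECONDS)) {
                    p.destroyForcibly();
                    return null;
                }
                if (p.exitValue() != 0) {
                    return null;
                }
                return new String(Files.readAllBytes(out),
                        StandardCharsets.UTF_8);
            } finally {
                Files.delete(out);
            }
        }
    }

The extracted text can then be added to a SolrInputDocument and sent with SolrJ, exactly as in the fragment below, keeping Tika's failure modes out of the indexing process.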
As an alternative to the earlier code, I had tried this (with exactly the same set of files going in):

    // Class-level imports this fragment relies on (the Tika and SolrJ jars
    // must be on the classpath):
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;

    File f = new File(filename);
    // Cap the in-memory body at 10 MB so a huge document can't blow the heap.
    ContentHandler textHandler = new BodyContentHandler(10 * 1024 * 1024);
    Metadata metadata = new Metadata();
    Parser parser = new AutoDetectParser();
    ParseContext context = new ParseContext();
    // try-with-resources closes the stream; the original version leaked it.
    try (InputStream input = new FileInputStream(f)) {
        parser.parse(input, textHandler, metadata, context);
    } catch (Exception e) {
        // The original call passed null as the message and the formatted
        // string as a parameter; this form logs the message and the stack.
        Logger.getLogger(JsMapAdminService.class.getName()).log(Level.SEVERE,
                String.format("File %s failed", f.getCanonicalPath()), e);
    }
    SolrInputDocument up = new SolrInputDocument();
    up.addField("id", f.getCanonicalPath());
    up.addField("fileLocation", idString);
    up.addField("access", access);
    up.addField("title", metadata.get("title"));
    up.addField("author", metadata.get("author"));
    String content = textHandler.toString();
    up.addField("_text_", content);
    solr.add(up);
    return true;

Exceptions were never triggered, but the metadata was essentially empty except for the content type, and content was always an empty string. I don't know what the parser was doing, so I gave up and went with the extracting handler route instead, which did at least build a full index.
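If anyone else hits the same symptom: empty content with no exception and nothing in the metadata beyond the content type is, as far as I can tell, the classic sign that AutoDetectParser found no real parsers on the classpath. With only tika-core present (no tika-parsers jar and its dependencies), type detection still works, but every document silently falls through to the empty fallback parser. A small check along these lines (the class name is mine, for illustration) would confirm it:

    import java.util.Map;
    import org.apache.tika.mime.MediaType;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.Parser;

    public class TikaClasspathCheck {
        public static void main(String[] args) {
            AutoDetectParser parser = new AutoDetectParser();
            // AutoDetectParser extends CompositeParser, so we can list
            // exactly which parsers were discovered on the classpath.
            Map<MediaType, Parser> parsers = parser.getParsers();
            System.out.println("Registered parser mappings: " + parsers.size());
            // With only tika-core present this prints 0, which would match
            // the empty-content symptom described above.
            parsers.forEach((type, p) ->
                    System.out.println(type + " -> " + p.getClass().getName()));
        }
    }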