I already provided feedback; you haven't shown any attempt to follow up on it.
Best,
Erick

> On Aug 29, 2019, at 4:54 AM, Khare, Kushal (MIND) <kushal.kh...@mind-infotech.com> wrote:
>
> Erick,
> I am using the code that I posted yesterday, but I am not getting anything in
> textHandler.toString(). Please check my snippet and guide me, because I think
> I am very close to my requirement yet stuck here. I also debugged my code: it
> is not going inside doTikaDocuments() and is giving a NullPointerException.
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 28 August 2019 16:50
> To: solr-user@lucene.apache.org
> Subject: Re: Require searching only for file content and not metadata
>
> Attachments are aggressively stripped by the mailing list; you'll have to
> either post the file someplace and provide a link, or paste the relevant
> sections into the e-mail.
>
> You're not getting any metadata because you're not adding any metadata to the
> documents with doc.addField("metadatafield1", value_of_metadata_field1);
>
> The only thing ever in the doc is what you explicitly put there. At this
> point it's just "id" and "_text_".
>
> As for why _text_ isn't showing up: does the schema have 'stored="true"' for
> the field? And when you query, are you specifying &fl=_text_? _text_ is
> usually a catch-all field in the default schemas with this definition:
>
> <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
>
> Since stored=false, well, it's not stored, so it can't be returned. If you're
> successfully _searching_ on that field but not getting it back in the "fl"
> list, this is almost certainly a stored="false" issue.
>
> As for why you might have gotten all the metadata in this field with the post
> tool, check that there are no "copyField" directives in the schema that
> automatically copy other data into _text_.
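For illustration, the two schema elements discussed here might look like this in managed-schema (a stored="true" variant of the stock _text_ field, plus a hypothetical catch-all copyField of the kind that would explain metadata ending up in _text_; the wildcard source is an assumption, not taken from the attached schema):

```xml
<!-- Stored variant of _text_, so the field can be returned with fl=_text_ -->
<field name="_text_" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- A catch-all directive like this copies every field, metadata included, into _text_ -->
<copyField source="*" dest="_text_"/>
```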
>
> Best,
> Erick
>
>> On Aug 28, 2019, at 7:03 AM, Khare, Kushal (MIND) <kushal.kh...@mind-infotech.com> wrote:
>>
>> Attaching managed-schema.xml
>>
>> -----Original Message-----
>> From: Khare, Kushal (MIND) [mailto:kushal.kh...@mind-infotech.com]
>> Sent: 28 August 2019 16:30
>> To: solr-user@lucene.apache.org
>> Subject: RE: Require searching only for file content and not metadata
>>
>> I already tried this example; I am currently working on it. I have compiled
>> the code and it is indexing the documents, but it is not adding anything to
>> the _text_ field, and it is not giving any metadata either. In
>> doc.addField("_text_", textHandler.toString()); the textHandler.toString()
>> is blank for all 40 documents. All I am getting is the 'id' and '_version_'
>> field.
>>
>> This is the code that I tried:
>>
>> package mind.solr;
>>
>> import org.apache.solr.client.solrj.SolrServerException;
>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>> import org.apache.solr.client.solrj.impl.XMLResponseParser;
>> import org.apache.solr.client.solrj.response.UpdateResponse;
>> import org.apache.solr.common.SolrInputDocument;
>> import org.apache.tika.metadata.Metadata;
>> import org.apache.tika.parser.AutoDetectParser;
>> import org.apache.tika.parser.ParseContext;
>> import org.apache.tika.sax.BodyContentHandler;
>> import org.xml.sax.ContentHandler;
>>
>> import java.io.File;
>> import java.io.FileInputStream;
>> import java.io.IOException;
>> import java.io.InputStream;
>> import java.util.ArrayList;
>> import java.util.Collection;
>>
>> public class solrJExtract {
>>
>>   private HttpSolrClient client;
>>   private long start = System.currentTimeMillis();
>>   private AutoDetectParser autoParser;
>>   private int totalTika = 0;
>>   private int totalSql = 0;
>>
>>   @SuppressWarnings("rawtypes")
>>   private Collection docList = new ArrayList();
>>
>>   public static void main(String[] args) {
>>     try {
>>       solrJExtract idxer = new solrJExtract("http://localhost:8983/solr/tika");
>>       idxer.doTikaDocuments(new File("D:\\docs"));
>>       idxer.endIndexing();
>>     } catch (Exception e) {
>>       e.printStackTrace();
>>     }
>>   }
>>
>>   private solrJExtract(String url) throws IOException, SolrServerException {
>>     // Create a SolrCloud-aware client to send docs to Solr
>>     // Use something like HttpSolrClient for stand-alone
>>     client = new HttpSolrClient.Builder(url)  // use the url parameter instead of hard-coding it
>>         .withConnectionTimeout(10000)
>>         .withSocketTimeout(60000)
>>         .build();
>>
>>     // binary parser is used by default for responses
>>     client.setParser(new XMLResponseParser());
>>
>>     // One of the ways Tika can be used to attempt to parse arbitrary files.
>>     autoParser = new AutoDetectParser();
>>   }
>>
>>   // Just a convenient place to wrap things up.
>>   @SuppressWarnings("unchecked")
>>   private void endIndexing() throws IOException, SolrServerException {
>>     if (docList.size() > 0) {       // Are there any documents left over?
>>       client.add(docList, 300000);  // Commit within 5 minutes
>>     }
>>     client.commit();  // Only needs to be done at the end,
>>                       // commitWithin should do the rest.
>>                       // Could even be omitted
>>                       // assuming commitWithin was specified.
>>     long endTime = System.currentTimeMillis();
>>     System.out.println("Total Time Taken: " + (endTime - start) +
>>         " milliseconds to index " + totalSql +
>>         " SQL rows and " + totalTika + " documents");
>>   }
>>
>>   /**
>>    * ***************************Tika processing here
>>    */
>>   // Recursively traverse the filesystem, parsing everything found.
>>   private void doTikaDocuments(File root) throws IOException, SolrServerException {
>>     // Simple loop for recursively indexing all the files
>>     // in the root directory passed in.
>>     File[] files = root.listFiles();
>>     if (files == null) {  // listFiles() returns null if root is not a readable
>>       return;             // directory -- without this check the loop throws an NPE
>>     }
>>     for (File file : files) {
>>       if (file.isDirectory()) {
>>         doTikaDocuments(file);
>>         continue;
>>       }
>>       // Get ready to parse the file.
>>       ContentHandler textHandler = new BodyContentHandler();
>>       Metadata metadata = new Metadata();
>>       ParseContext context = new ParseContext();
>>       // Tim Allison noted the following, thanks Tim!
>>       // If you want Tika to parse embedded files (attachments within your
>>       // .doc or any other embedded files), you need to send in the
>>       // autodetectparser in the parsecontext:
>>       // context.set(Parser.class, autoParser);
>>
>>       InputStream input = new FileInputStream(file);
>>
>>       // Try parsing the file. Note we haven't checked at all to
>>       // see whether this file is a good candidate.
>>       try {
>>         autoParser.parse(input, textHandler, metadata, context);
>>       } catch (Exception e) {
>>         // Needs better logging of what went wrong in order to
>>         // track down "bad" documents.
>>         System.out.println(String.format("File %s failed", file.getCanonicalPath()));
>>         e.printStackTrace();
>>         continue;
>>       } finally {
>>         input.close();  // don't leak the file handle
>>       }
>>       // Just to show how much meta-data and what form it's in.
>>       dumpMetadata(file.getCanonicalPath(), metadata);
>>
>>       // Index just a couple of the meta-data fields.
>>       SolrInputDocument doc = new SolrInputDocument();
>>
>>       doc.addField("id", file.getCanonicalPath());
>>
>>       // Crude way to get known meta-data fields.
>>       // Also possible to write a simple loop to examine all the
>>       // metadata returned and selectively index it and/or
>>       // just get a list of them.
>>       // One can also use the Lucidworks field mapping to
>>       // accomplish much the same thing.
>>       String author = metadata.get("Author");
>>
>>       /*
>>        * if (author != null) { //doc.addField("author", author); }
>>        */
>>
>>       doc.addField("_text_", textHandler.toString());
>>       //doc.addField("meta", metadata.get("Last_Modified"));
>>       docList.add(doc);
>>       ++totalTika;
>>
>>       // Completely arbitrary, just batch up more than one document
>>       // for throughput!
>>       if (docList.size() >= 1000) {
>>         // Commit within 5 minutes.
>>         UpdateResponse resp = client.add(docList, 300000);
>>         if (resp.getStatus() != 0) {
>>           System.out.println("Some horrible error has occurred, status is: " + resp.getStatus());
>>         }
>>         docList.clear();
>>       }
>>     }
>>   }
>>
>>   // Just to show all the metadata that's available.
>>   private void dumpMetadata(String fileName, Metadata metadata) {
>>     System.out.println("Dumping metadata for file: " + fileName);
>>     for (String name : metadata.names()) {
>>       System.out.println(name + ":" + metadata.get(name));
>>     }
>>     System.out.println("........xxxxxxxxxxxxxxxxxxxxxxxxx..........");
>>   }
>> }
>>
>> Also, I am attaching the solrconfig.xml and managed-schema.xml for my
>> collection. Please look at them and suggest where I am going wrong.
>> I can't even see the _text_ field in the query result, even though its
>> stored parameter is true.
>> Any help would really be appreciated.
>> Thanks!
>>
>> -----Original Message-----
>> From: Shawn Heisey [mailto:apa...@elyograg.org]
>> Sent: 28 August 2019 14:18
>> To: solr-user@lucene.apache.org
>> Subject: Re: Require searching only for file content and not metadata
>>
>> On 8/27/2019 7:18 AM, Khare, Kushal (MIND) wrote:
>>> Basically, what problem I am facing is - I am getting the textual content +
>>> other metadata in my _text_ field. But, I want only the textual content
>>> written inside the document.
>>> I tried various Request Handler Update Extract configurations, but none of
>>> them worked for me.
>>> Please help me resolve this as I am badly stuck in this.
>>
>> Controlling exactly what gets indexed in which fields is likely going to
>> require that you write the indexing software yourself -- a program that
>> extracts the data you want and sends it to Solr for indexing.
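A side note on the NullPointerException reported earlier in the thread: File.listFiles() returns null (rather than an empty array) when the path does not exist, is not a directory, or cannot be read, which is exactly what a recursive walker in the style of doTikaDocuments() must guard against. A minimal, stdlib-only sketch (no Tika/Solr; "D:\\docs" is just the path from the thread):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class SafeWalker {
    // Recursively collect the regular files under root. listFiles()
    // returns null for a missing or unreadable path, so check before looping.
    public static List<File> collectFiles(File root) {
        List<File> out = new ArrayList<>();
        File[] children = root.listFiles();
        if (children == null) {
            System.err.println("Not a readable directory: " + root);
            return out;                        // empty list instead of an NPE
        }
        for (File f : children) {
            if (f.isDirectory()) {
                out.addAll(collectFiles(f));   // recurse into subdirectories
            } else {
                out.add(f);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With a path that does not exist on this machine, this prints a
        // warning and reports zero files instead of crashing.
        List<File> files = collectFiles(new File("D:\\docs"));
        System.out.println("Found " + files.size() + " files");
    }
}
```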
>>
>> We do not recommend running the Extracting Request Handler in production --
>> Tika is known to crash when given some documents (usually PDF files are the
>> problematic ones, but other formats can cause it too), and if it crashes
>> while running inside Solr, it will take Solr down with it.
>>
>> Here is an example program that uses Tika for rich document parsing. It
>> also talks to a database, but that part could be easily removed or modified:
>>
>> https://lucidworks.com/post/indexing-with-solrj/
>>
>> Thanks,
>> Shawn
>>
>> ________________________________
>>
>> The information contained in this electronic message and any attachments to
>> this message are intended for the exclusive use of the addressee(s) and may
>> contain proprietary, confidential or privileged information. If you are not
>> the intended recipient, you should not disseminate, distribute or copy this
>> e-mail. Please notify the sender immediately and destroy all copies of this
>> message and any attachments. WARNING: Computer viruses can be transmitted
>> via email. The recipient should check this email and any attachments for
>> the presence of viruses. The company accepts no liability for any damage
>> caused by any virus/trojan/worms/malicious code transmitted by this email.
>> www.motherson.com