I already provided feedback; you haven't shown any attempt to follow up on it.
Best,
Erick

> On Aug 29, 2019, at 4:54 AM, Khare, Kushal (MIND) <kushal.kh...@mind-infotech.com> wrote:
>
> Erick,
> I am using the code that I posted yesterday, but I am not getting anything in
> textHandler.toString(). Please check my snippet and guide me, because I think
> I am very close to my requirement yet stuck here. I also debugged my code: it
> is not going inside doTikaDocuments() and is giving a NullPointerException.
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: 28 August 2019 16:50
> To: solr-user@lucene.apache.org
> Subject: Re: Require searching only for file content and not metadata
>
> Attachments are aggressively stripped by the mailing list; you'll have to
> either post the file someplace and provide a link, or paste the relevant
> sections into the e-mail.
>
> You're not getting any metadata because you're not adding any metadata to the
> documents with doc.addField("metadatafield1", value_of_metadata_field1);
>
> The only thing ever in the doc is what you explicitly put there. At this
> point it's just "id" and "_text_".
>
> As for why _text_ isn't showing up: does the schema have 'stored="true"' for
> the field? And when you query, are you specifying &fl=_text_? _text_ is
> usually a catch-all field in the default schemas with this definition:
>
> <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
>
> Since stored=false, well, it's not stored, so it can't be returned. If you're
> successfully _searching_ on that field but not getting it back in the "fl"
> list, this is almost certainly a stored="false" issue.
>
> As for why you might have gotten all the metadata in this field with the post
> tool, check that there are no "copyField" directives in the schema that
> automatically copy other data into _text_.
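For illustration, the two schema elements discussed here might look like this in managed-schema (a stored="true" variant of the stock _text_ field, plus a hypothetical catch-all copyField of the kind that would explain metadata ending up in _text_; the wildcard source is an assumption, not taken from the attached schema):

```xml
<!-- Stored variant of _text_, so the field can be returned with fl=_text_ -->
<field name="_text_" type="text_general" indexed="true" stored="true" multiValued="true"/>

<!-- A catch-all directive like this copies every field, metadata included, into _text_ -->
<copyField source="*" dest="_text_"/>
```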
>
> Best,
> Erick
>
>> On Aug 28, 2019, at 7:03 AM, Khare, Kushal (MIND) <kushal.kh...@mind-infotech.com> wrote:
>>
>> Attaching managed-schema.xml
>>
>> -----Original Message-----
>> From: Khare, Kushal (MIND) [mailto:kushal.kh...@mind-infotech.com]
>> Sent: 28 August 2019 16:30
>> To: solr-user@lucene.apache.org
>> Subject: RE: Require searching only for file content and not metadata
>>
>> I already tried this example; I am currently working on it. I have compiled
>> the code and it is indexing the documents, but it is not adding anything to
>> the _text_ field, and it is not giving any metadata either. In
>> doc.addField("_text_", textHandler.toString()); the textHandler.toString()
>> is blank for all 40 documents. All I am getting is the 'id' and '_version_'
>> field.
>>
>> This is the code that I tried:
>>
>> package mind.solr;
>>
>> import org.apache.solr.client.solrj.SolrServerException;
>> import org.apache.solr.client.solrj.impl.HttpSolrClient;
>> import org.apache.solr.client.solrj.impl.XMLResponseParser;
>> import org.apache.solr.client.solrj.response.UpdateResponse;
>> import org.apache.solr.common.SolrInputDocument;
>> import org.apache.tika.metadata.Metadata;
>> import org.apache.tika.parser.AutoDetectParser;
>> import org.apache.tika.parser.ParseContext;
>> import org.apache.tika.sax.BodyContentHandler;
>> import org.xml.sax.ContentHandler;
>>
>> import java.io.File;
>> import java.io.FileInputStream;
>> import java.io.IOException;
>> import java.io.InputStream;
>> import java.util.ArrayList;
>> import java.util.Collection;
>>
>> public class solrJExtract {
>>
>>   private HttpSolrClient client;
>>   private long start = System.currentTimeMillis();
>>   private AutoDetectParser autoParser;
>>   private int totalTika = 0;
>>   private int totalSql = 0;
>>
>>   @SuppressWarnings("rawtypes")
>>   private Collection docList = new ArrayList();
>>
>>   public static void main(String[] args) {
>>     try {
>>       solrJExtract idxer = new solrJExtract("http://localhost:8983/solr/tika");
>>       idxer.doTikaDocuments(new File("D:\\docs"));
>>       idxer.endIndexing();
>>     } catch (Exception e) {
>>       e.printStackTrace();
>>     }
>>   }
>>
>>   private solrJExtract(String url) throws IOException, SolrServerException {
>>     // Create a SolrCloud-aware client to send docs to Solr
>>     // Use something like HttpSolrClient for stand-alone
>>     client = new HttpSolrClient.Builder(url)  // use the url parameter instead of hard-coding it
>>         .withConnectionTimeout(10000)
>>         .withSocketTimeout(60000)
>>         .build();
>>
>>     // binary parser is used by default for responses
>>     client.setParser(new XMLResponseParser());
>>
>>     // One of the ways Tika can be used to attempt to parse arbitrary files.
>>     autoParser = new AutoDetectParser();
>>   }
>>
>>   // Just a convenient place to wrap things up.
>>   @SuppressWarnings("unchecked")
>>   private void endIndexing() throws IOException, SolrServerException {
>>     if (docList.size() > 0) {       // Are there any documents left over?
>>       client.add(docList, 300000);  // Commit within 5 minutes
>>     }
>>     client.commit();  // Only needs to be done at the end,
>>                       // commitWithin should do the rest.
>>                       // Could even be omitted
>>                       // assuming commitWithin was specified.
>>     long endTime = System.currentTimeMillis();
>>     System.out.println("Total Time Taken: " + (endTime - start) +
>>         " milliseconds to index " + totalSql +
>>         " SQL rows and " + totalTika + " documents");
>>   }
>>
>>   /**
>>    * ***************************Tika processing here
>>    */
>>   // Recursively traverse the filesystem, parsing everything found.
>>   private void doTikaDocuments(File root) throws IOException, SolrServerException {
>>     // Simple loop for recursively indexing all the files
>>     // in the root directory passed in.
>>     File[] files = root.listFiles();
>>     if (files == null) {  // listFiles() returns null if root is not a readable
>>       return;             // directory -- without this check the loop throws an NPE
>>     }
>>     for (File file : files) {
>>       if (file.isDirectory()) {
>>         doTikaDocuments(file);
>>         continue;
>>       }
>>       // Get ready to parse the file.
>>       ContentHandler textHandler = new BodyContentHandler();
>>       Metadata metadata = new Metadata();
>>       ParseContext context = new ParseContext();
>>       // Tim Allison noted the following, thanks Tim!
>>       // If you want Tika to parse embedded files (attachments within your
>>       // .doc or any other embedded files), you need to send in the
>>       // autodetectparser in the parsecontext:
>>       // context.set(Parser.class, autoParser);
>>
>>       InputStream input = new FileInputStream(file);
>>
>>       // Try parsing the file. Note we haven't checked at all to
>>       // see whether this file is a good candidate.
>>       try {
>>         autoParser.parse(input, textHandler, metadata, context);
>>       } catch (Exception e) {
>>         // Needs better logging of what went wrong in order to
>>         // track down "bad" documents.
>>         System.out.println(String.format("File %s failed", file.getCanonicalPath()));
>>         e.printStackTrace();
>>         continue;
>>       } finally {
>>         input.close();  // don't leak the file handle
>>       }
>>       // Just to show how much meta-data and what form it's in.
>>       dumpMetadata(file.getCanonicalPath(), metadata);
>>
>>       // Index just a couple of the meta-data fields.
>>       SolrInputDocument doc = new SolrInputDocument();
>>
>>       doc.addField("id", file.getCanonicalPath());
>>
>>       // Crude way to get known meta-data fields.
>>       // Also possible to write a simple loop to examine all the
>>       // metadata returned and selectively index it and/or
>>       // just get a list of them.
>>       // One can also use the Lucidworks field mapping to
>>       // accomplish much the same thing.
>>       String author = metadata.get("Author");
>>
>>       /*
>>        * if (author != null) { //doc.addField("author", author); }
>>        */
>>
>>       doc.addField("_text_", textHandler.toString());
>>       //doc.addField("meta", metadata.get("Last_Modified"));
>>       docList.add(doc);
>>       ++totalTika;
>>
>>       // Completely arbitrary, just batch up more than one document
>>       // for throughput!
>>       if (docList.size() >= 1000) {
>>         // Commit within 5 minutes.
>>         UpdateResponse resp = client.add(docList, 300000);
>>         if (resp.getStatus() != 0) {
>>           System.out.println("Some horrible error has occurred, status is: " + resp.getStatus());
>>         }
>>         docList.clear();
>>       }
>>     }
>>   }
>>
>>   // Just to show all the metadata that's available.
>>   private void dumpMetadata(String fileName, Metadata metadata) {
>>     System.out.println("Dumping metadata for file: " + fileName);
>>     for (String name : metadata.names()) {
>>       System.out.println(name + ":" + metadata.get(name));
>>     }
>>     System.out.println("........xxxxxxxxxxxxxxxxxxxxxxxxx..........");
>>   }
>> }
>>
>> Also, I am attaching the solrconfig.xml and managed-schema.xml for my
>> collection. Please look at them and suggest where I am going wrong.
>> I can't even see the _text_ field in the query result, even though its
>> stored parameter is true.
>> Any help would really be appreciated.
>> Thanks!
>>
>> -----Original Message-----
>> From: Shawn Heisey [mailto:apa...@elyograg.org]
>> Sent: 28 August 2019 14:18
>> To: solr-user@lucene.apache.org
>> Subject: Re: Require searching only for file content and not metadata
>>
>> On 8/27/2019 7:18 AM, Khare, Kushal (MIND) wrote:
>>> Basically, what problem I am facing is - I am getting the textual content +
>>> other metadata in my _text_ field. But, I want only the textual content
>>> written inside the document.
>>> I tried various Request Handler Update Extract configurations, but none of
>>> them worked for me.
>>> Please help me resolve this as I am badly stuck in this.
>>
>> Controlling exactly what gets indexed in which fields is likely going to
>> require that you write the indexing software yourself -- a program that
>> extracts the data you want and sends it to Solr for indexing.
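A side note on the NullPointerException reported earlier in the thread: File.listFiles() returns null (rather than an empty array) when the path does not exist, is not a directory, or cannot be read, which is exactly what a recursive walker in the style of doTikaDocuments() must guard against. A minimal, stdlib-only sketch (no Tika/Solr; "D:\\docs" is just the path from the thread):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class SafeWalker {
    // Recursively collect the regular files under root. listFiles()
    // returns null for a missing or unreadable path, so check before looping.
    public static List<File> collectFiles(File root) {
        List<File> out = new ArrayList<>();
        File[] children = root.listFiles();
        if (children == null) {
            System.err.println("Not a readable directory: " + root);
            return out;                        // empty list instead of an NPE
        }
        for (File f : children) {
            if (f.isDirectory()) {
                out.addAll(collectFiles(f));   // recurse into subdirectories
            } else {
                out.add(f);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // With a path that does not exist on this machine, this prints a
        // warning and reports zero files instead of crashing.
        List<File> files = collectFiles(new File("D:\\docs"));
        System.out.println("Found " + files.size() + " files");
    }
}
```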
>>
>> We do not recommend running the Extracting Request Handler in production --
>> Tika is known to crash when given some documents (usually PDF files are the
>> problematic ones, but other formats can cause it too), and if it crashes
>> while running inside Solr, it will take Solr down with it.
>>
>> Here is an example program that uses Tika for rich document parsing. It
>> also talks to a database, but that part could be easily removed or modified:
>>
>> https://lucidworks.com/post/indexing-with-solrj/
>>
>> Thanks,
>> Shawn
>>
>> ________________________________
>>
>> The information contained in this electronic message and any attachments to
>> this message are intended for the exclusive use of the addressee(s) and may
>> contain proprietary, confidential or privileged information. If you are not
>> the intended recipient, you should not disseminate, distribute or copy this
>> e-mail. Please notify the sender immediately and destroy all copies of this
>> message and any attachments. WARNING: Computer viruses can be transmitted
>> via email. The recipient should check this email and any attachments for
>> the presence of viruses. The company accepts no liability for any damage
>> caused by any virus/trojan/worms/malicious code transmitted by this email.
>> www.motherson.com