Re: Too long to index PDF - SOLR 3.5, SOLRNet 0.4.0.2001, Tom Cat 7.0

Soumitra Banerjee Wed, 07 Dec 2011 18:36:28 -0800

Thanks for the response. I will set the stream type accordingly. As for
extraction of the text from the PDF, I want the entire content of the PDF. This
content will be part of a SOLR document, which has a unique ID.


What is the unique key for? Here's my schema:

  <fields>
    <field name="InternalCheckID" type="string" indexed="true" stored="true" required="true" />
    <field name="ProductName" type="text" indexed="true" stored="false" required="false" />
    <field name="ProductID" type="text" indexed="true" stored="false" required="false" />
    <field name="Manufacturer" type="text" indexed="true" stored="false" required="false" />
    <field name="RevisionDate" type="date" indexed="true" stored="false" required="false"/>
    <field name="FilePath" type="text" indexed="true" stored="false" />
    <field name="Content" type="text" indexed="true" stored="false" multiValued="true"/>
  </fields>
  <uniqueKey>InternalCheckID</uniqueKey>
  <defaultSearchField>Content</defaultSearchField>

Thanks for your help as always.

Regards, Soumitra


On Wed, Dec 7, 2011 at 3:06 PM, Mauricio Scheffer <
mauricioschef...@gmail.com> wrote:

> Try setting the StreamType to application/pdf, that way Tika doesn't have
> to infer it.
> BTW the second argument to ExtractParameters is the unique key... a value
> of "*" probably doesn't make sense.
>
> --
> Mauricio
>
>
> On Wed, Dec 7, 2011 at 5:50 PM, Soumitra Banerjee <
> soumitrabaner...@gmail.com> wrote:
>
> > All -
> >
> > I am using SOLR 3.5, SOLRNet 0.4.0.2001, and Tomcat 7.0, and am running a job
> > to extract the text from PDFs stored on my local hard disk.
> >
> > *Tomcat StdErr log Shows:*
> >
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:\XXX\10310.pdf&extractFormat=text&version=2.2} status=0 QTime=125
> > Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 141
> > Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:XXX\10311.pdf&extractFormat=text&version=2.2} status=0 QTime=141
> > Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 125
> > Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:\XXX\3M_US_EN_10313.pdf&extractFormat=text&version=2.2} status=0 QTime=125
> >
> > *Catalina Log Shows:*
> > INFO: {} 0 281
> > Dec 7, 2011 12:29:04 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:\XXX\11511.pdf&extractFormat=text&version=2.2} status=0 QTime=281
> > Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 391
> > Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:XXX\_11513.pdf&extractFormat=text&version=2.2} status=0 QTime=391
> > Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 328
> > Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:\XXX\11514.pdf&extractFormat=text&version=2.2} status=0 QTime=328
> >
> > The average pdf file size is around 50 KB. My questions are as follows:
> >
> > 1. Can I improve performance by changing any configuration files
> > (SolrConfig, Tomcat, others)?
> > 2. Since I am using:
> >
> > var response = solr.Extract(new ExtractParameters(pdffile, "*"));
> >
> >
> > from SOLRNet 0.4.0.2001, which just came out (Beta), is this a known issue
> > to be fixed in upcoming versions?
> >
> >
> > Any help/pointers from the experts will be highly appreciated. Also let me
> > know if you need additional information; I will be more than happy
> > to provide it.
> >
> > Regards, Soumitra
> >
>
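
The parameters discussed in this thread can be sketched as a plain HTTP request to Solr's /update/extract handler. This is a minimal illustration in Python rather than SolrNet, and it makes several assumptions: the host and port are hypothetical, `build_extract_url` is an illustrative helper that is not part of Solr or SolrNet, and the field names come from the schema quoted above. The parameter names themselves (`literal.<field>`, `stream.type`, `resource.name`, `extractOnly`, `extractFormat`, `fmap.content`) are those of Solr's ExtractingRequestHandler.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def build_extract_url(base_url, unique_key_field, doc_id, resource_name,
                      extract_only=False):
    """Build a URL for Solr's /update/extract handler (illustrative helper)."""
    params = {
        # The unique key identifies the Solr document the PDF text belongs to,
        # so it should be a real ID rather than "*".
        "literal." + unique_key_field: doc_id,
        "resource.name": resource_name,
        # Declaring the content type means Tika does not have to infer it.
        "stream.type": "application/pdf",
    }
    if extract_only:
        # Return the extracted text to the caller without indexing it.
        params["extractOnly"] = "true"
        params["extractFormat"] = "text"
    else:
        # Map Tika's extracted body text into the schema's Content field.
        # No commit parameter is set here; committing is left to the caller.
        params["fmap.content"] = "Content"
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Hypothetical base URL; only the /Solr webapp and core1 names appear in the logs.
url = build_extract_url("http://localhost:8080/Solr/core1",
                        "InternalCheckID", "10310", r"C:\XXX\10310.pdf")
for name, values in sorted(parse_qs(urlsplit(url).query).items()):
    print(name, "=", values[0])
```

The PDF bytes themselves would be sent as the body of a POST to this URL. With `extract_only=True` the request matches the extract-only calls seen in the logs, except that the unique key carries an actual document ID.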
