Thanks for the response. I will set the stream type accordingly. As for extracting the text from the PDF, I want the entire content of the PDF. This content will be part of a SOLR document, which has a unique ID.
What is the unique key used for? Here's my schema:

<fields>
  <field name="InternalCheckID" type="string" indexed="true" stored="true" required="true" />
  <field name="ProductName" type="text" indexed="true" stored="false" required="false" />
  <field name="ProductID" type="text" indexed="true" stored="false" required="false" />
  <field name="Manufacturer" type="text" indexed="true" stored="false" required="false" />
  <field name="RevisionDate" type="date" indexed="true" stored="false" required="false"/>
  <field name="FilePath" type="text" indexed="true" stored="false" />
  <field name="Content" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>
<uniqueKey>InternalCheckID</uniqueKey>
<defaultSearchField>Content</defaultSearchField>

Thanks for your help as always.

Regards,
Soumitra

On Wed, Dec 7, 2011 at 3:06 PM, Mauricio Scheffer <mauricioschef...@gmail.com> wrote:

> Try setting the StreamType to application/pdf, that way Tika doesn't have
> to infer it.
> BTW the second argument to ExtractParameters is the unique key... a value
> of "*" probably doesn't make sense.
>
> --
> Mauricio
>
>
> On Wed, Dec 7, 2011 at 5:50 PM, Soumitra Banerjee <soumitrabaner...@gmail.com> wrote:
>
> > All -
> >
> > I am using SOLR 3.5, SOLRNet 0.4.0.2001, and Tomcat 7.0, and am running a job
> > to extract the text from PDFs stored on my local hard disk.
> >
> > *Tomcat StdErr log Shows:*
> >
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:\XXX\10310.pdf&extractFormat=text&version=2.2} status=0 QTime=125
> > Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 141
> > Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:XXX\10311.pdf&extractFormat=text&version=2.2} status=0 QTime=141
> > Dec 7, 2011 12:29:36 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 125
> > Dec 7, 2011 12:29:36 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:\XXX\3M_US_EN_10313.pdf&extractFormat=text&version=2.2} status=0 QTime=125
> >
> > *Catalina Log Shows:*
> >
> > INFO: {} 0 281
> > Dec 7, 2011 12:29:04 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:\XXX\11511.pdf&extractFormat=text&version=2.2} status=0 QTime=281
> > Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 391
> > Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:XXX\_11513.pdf&extractFormat=text&version=2.2} status=0 QTime=391
> > Dec 7, 2011 12:29:05 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> > INFO: {} 0 328
> > Dec 7, 2011 12:29:05 PM org.apache.solr.core.SolrCore execute
> > INFO: [core1] webapp=/Solr path=/update/extract params={extractOnly=true&literal.id=*&resource.name=C:\XXX\11514.pdf&extractFormat=text&version=2.2} status=0 QTime=328
> >
> > The average PDF file size is around 50 KB. My questions are as follows:
> >
> > 1. Can I improve performance by updating any configuration file
> >    (SolrConfig, Tomcat, others)?
> > 2. Since I am using:
> >
> > var response = solr.Extract(new ExtractParameters(pdffile, "*")
> >
> > from SOLRNet 0.4.0.2001, which just came out (Beta), is this a known issue
> > to be fixed in upcoming versions?
> >
> > Any help/pointers from the experts will be highly appreciated. Also let me
> > know if you need additional information; I will be more than happy
> > to provide it.
> >
> > Regards,
> > Soumitra
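
For reference, the request parameters visible in the logs, combined with the two suggestions in Mauricio's reply (an explicit stream type and a real unique key instead of "*"), can be sketched as a small helper that builds the query string for Solr's /update/extract handler. This is only an illustration in Python, not SolrNet code. The function name `build_extract_params` and the ID value "10310" are hypothetical, and the sketch assumes the unique key is passed as `literal.InternalCheckID` (the `uniqueKey` in the schema) and that the extracted body should be mapped onto the `Content` field with `fmap.content`.

```python
from urllib.parse import urlencode


def build_extract_params(unique_id, resource_name, index_document=True):
    """Build query parameters for Solr's /update/extract handler (sketch)."""
    # A wildcard is not a usable key: every document needs its own value.
    if not unique_id or unique_id == "*":
        raise ValueError("unique_id must be a real per-document key, not '*'")

    params = {
        # Assumption: InternalCheckID is the uniqueKey, as in the schema.
        "literal.InternalCheckID": unique_id,
        "resource.name": resource_name,
        # Declare the content type so Tika does not have to infer it.
        "stream.type": "application/pdf",
    }
    if index_document:
        # Assumption: map Tika's extracted body onto the schema's Content field.
        params["fmap.content"] = "Content"
    else:
        # Same mode as the logged requests: return the text, index nothing.
        params["extractOnly"] = "true"
        params["extractFormat"] = "text"
    return params


# Hypothetical ID value; the path is the redacted one from the logs.
query_string = urlencode(build_extract_params("10310", r"C:\XXX\10310.pdf"))
print(query_string)
```

With `extractOnly=true`, as in the logged requests, Solr returns the extracted text without adding anything to the index, so storing the PDF content in a document requires a request without that flag and with a distinct key per file.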