Best platform for hosting Solr

2015-02-18 Thread Ganesh.Yadav
Guys,

1.   Can anyone suggest the best platform to host Solr on: a Unix or a Windows server?

2.   All I will be doing is importing lots of PDF documents into Solr. I 
believe Solr will automatically build the schema for imported documents.

3.   Can someone suggest the maximum size of a PDF document that can be imported into Solr? (See the sketch after this list.)

4.   What implementation would make importing faster?
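
(For context, a minimal import can be tried with the ExtractingRequestHandler; a sketch, assuming the stock Solr 4.x example core "collection1" and curl. The post size limit is governed by multipartUploadLimitInKB in solrconfig.xml; the raised value below is only illustrative:)

  # post one PDF through Solr Cell (Tika does the extraction)
  curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true" \
       -F "myfile=@sample.pdf"

  <!-- solrconfig.xml, inside <requestDispatcher>: raise the multipart upload cap (value in KB) -->
  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="4194304" />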



Thanks

Ganesh




PDF search functionality using Solr

2015-01-06 Thread Ganesh.Yadav
Hello Solr-users and developers,
Can you please suggest,

1.   What should I do to index PDF content information column-wise?

2.   Do I need to extract the contents using some Analyzer, Tokenizer, and Filter combination and then add them to the index? How can I test the results at a command prompt? I do not know which specific Analyzer, Tokenizer, and Filter to select for this purpose.

3.   How can I verify that the needed column info is extracted out of a PDF and indexed?

4.   For example, how do I verify that the ticket number is extracted into a Ticket_number field and indexed?

5.   Is it OK to post 4 GB worth of PDFs to be imported and indexed by Solr? I think I saw some posts complaining about limits on how large a posted file can be.

6.   What will enable Solr to search across many PDFs for words such as "Runtime" or "Error" and return a link to the matching PDF?

My PDFs are nothing but exports from a Jira ticketing system.
Each PDF has info on:
Ticket Number:
Desc:
Client:
Status:
Submitter:
And so on:


1.   I imported a PDF document into Solr; it does the necessary searching, and I can test some of it using the provided /browse client interface.

2.   I have 80 GB worth of PDFs.

3.   The total number of PDFs is about 200.

4.   Many of the PDFs are about 4 GB in size.

5.   How do you suggest I import such large PDFs? What tools can you suggest to first extract the PDF contents into some XML format and then post that XML to be indexed by Solr? (A sketch of the XML update format follows below.)
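
(One route, sketched with hypothetical field names and values taken from the ticket layout above: extract each ticket into Solr's XML update format and post it with the post.jar tool shipped in the Solr example:)

  <add>
    <doc>
      <field name="ticket_number">TICKET-1042</field>
      <field name="client">Acme Corp</field>
      <field name="status">Closed</field>
      <field name="desc">Runtime error in the nightly batch job</field>
    </doc>
  </add>

  # post the file (assuming the example's exampledocs/post.jar)
  java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar tickets.xml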







Your early response is much appreciated.



Thanks

G



RE: PDF search functionality using Solr Schema.xml and SolrConfig.xml question

2015-01-06 Thread Ganesh.Yadav
Thanks, Jürgen, for your quick reply.

I am still looking for an answer on schema.xml and solrconfig.xml:


1.   Do I need to tell Solr how to extract the Title from a PDF: look for the word "Title", extract the entire line after the tag, collect all such occurrences from hundreds of PDFs, and build and index the Title column data?


2.   How do I define my own schema in Solr? (See the schema sketch after this list.)

3.   Say I defined my fields Title, Ticket_number, Submitter, Client, and so on. How can I verify that the respective data is extracted into the specific columns in Solr and indexed? Any suggestion on which Analyzer, Tokenizer, and Filter will help for this purpose?
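
(A sketch of how this could look in schema.xml; the field names come from the questions above, the type names follow the default example schema, and the two queries below are just one way to verify the result:)

  <field name="title"         type="text_general" indexed="true" stored="true"/>
  <field name="ticket_number" type="string"       indexed="true" stored="true"/>
  <field name="submitter"     type="string"       indexed="true" stored="true"/>
  <field name="client"        type="string"       indexed="true" stored="true"/>

  <!-- text_general in the default schema is roughly this analyzer chain -->
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  # see what the analysis chain does to sample text (FieldAnalysisRequestHandler)
  curl "http://localhost:8983/solr/collection1/analysis/field?analysis.fieldname=title&analysis.fieldvalue=Runtime+Error&wt=json"

  # after indexing, check that a field really holds the expected value
  curl "http://localhost:8983/solr/collection1/select?q=ticket_number:TICKET-1042&fl=ticket_number,title&wt=json"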


1.   I do not want to dump the entire 4 GB of PDF contents into one searchable field (ATTR_CONTENT) in Solr.

2.   Even if the entire PDF content is extracted into the above field by default, I still want to extract specific searchable column data into respective fields.

3.   Rather, I want to configure Solr to have column-wise searchable contents such as Title, Ticket_number, and so on (see the field-mapping sketch below).
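
(The extract handler has mapping parameters for exactly this; a sketch, assuming the default schema's ignored_* dynamic field, which is neither indexed nor stored:)

  # route Tika's catch-all "content" field, and any unknown field, into ignored_*
  curl "http://localhost:8983/solr/collection1/update/extract?literal.id=t1&uprefix=ignored_&fmap.content=ignored_content&commit=true" \
       -F "file=@ticket.pdf"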

Any suggestions on performance? The PDF database is 80 GB; will it be fast enough? Do I need to divide it into multiple cores, across multiple machines, multiple web apps, or a cluster? (A multi-core sketch follows.)
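
(Splitting across cores is one option; a legacy-style solr.xml sketch for Solr 4.x, with hypothetical core names and one data directory per drive:)

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="tickets1" instanceDir="tickets1" dataDir="/d/1/tickets1/data"/>
      <core name="tickets2" instanceDir="tickets2" dataDir="/d/2/tickets2/data"/>
    </cores>
  </solr>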


I should have mentioned that my PDFs are from a Jira-like ticketing system that was retired from production long ago; all I have left is the ticketing system's PDF database.


4.   My system will be used internally by just a select few people.

5.   They can wait for a 4 GB PDF to load.

6.   I agree that many matches will be found in one large PDF, depending on the search criteria.

7.   To make searches faster, I want Solr to create more columns and column-based indexes.

8.   Underneath, Solr uses Tika, which extracts the contents and strips all the rich formatting present in the PDF document.

9.   I believe the resulting extraction is about 1/5th the size of the original PDF; just a rough guess based on one sample extraction. (An extract-only test sketch follows.)
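
(Tika's output can be inspected without indexing anything via the extractOnly flag; a sketch:)

  # return the extracted text and metadata instead of indexing them
  curl "http://localhost:8983/solr/collection1/update/extract?extractOnly=true&wt=json" \
       -F "file=@sample.pdf"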




From: "Jürgen Wagner (DVT)" [mailto:juergen.wag...@devoteam.com]
Sent: Tuesday, January 06, 2015 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: PDF search functionality using Solr

Hello,
  no matter which search platform you will use, this will pose two challenges:

- The size of the documents will render search less and less useful as the 
likelihood of matches increases with document size. So, without a proper 
semantic extraction (e.g., using decent NER or relationship extraction with a 
commercial text mining product), I doubt you will get the required precision to 
make this overly useful.

- PDFs can have their own character sets based on the characters actually used. 
Such file-specific character sets are almost impossible to parse, i.e., if your 
PDFs happen to use this "feature" of the PDF format, you won't be lucky getting 
any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all necessary documents 
and index the resulting XML or attachment formats. As the REST API provides 
filtering capabilities, you could easily create incremental feeds to avoid 
humongous indexing every time there's new information in Jira. Dumping Jira 
stuff as PDF seems to me to be the least suitable way of handling this.
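
(For illustration, a hypothetical incremental pull through Jira's REST search endpoint; the host, credentials, and JQL window are placeholders:)

  # fetch issues updated in the last day, with only the fields worth indexing
  curl -u user:password "https://jira.example.com/rest/api/2/search?jql=updated%3E%3D-1d&fields=summary,description,status,reporter&maxResults=100"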

Best regards,
--Jürgen









RE: Running Multiple Solr Instances

2015-01-06 Thread Ganesh.Yadav
Nishanth,

1.   I understand you are implementing clustering for the web apps, running the same application as multiple instances on one or more machines.

2.   If each of your web apps points to a different index directory, how will a search switch to the next web app, with its different index, if the search term is not found in the first index directory?

3.   Or will the web app collect the results from all the index directories and present the combined collection to the user? (See the distributed-search sketch below.)
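
(For what it's worth, Solr does not fall through indexes sequentially: a distributed query fans out to every shard listed in the shards parameter and merges the results by score. A sketch against the four instances mentioned in this thread, with hypothetical host and core names:)

  curl "http://host1:8080/solr/core0/select?q=error&shards=host1:8080/solr/core0,host1:8081/solr/core1,host1:8082/solr/core2,host1:8083/solr/core3"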



Please share your thoughts



Thanks

G







-Original Message-
From: Nishanth S [mailto:nishanth.2...@gmail.com]
Sent: Tuesday, January 06, 2015 12:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Running Multiple Solr Instances



Thanks a lot, guys. As a beginner, these are very helpful for me.



Thanks,

Nishanth



On Tue, Jan 6, 2015 at 5:12 AM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:



> I would do one of the following:
>
> 1. Set a different Solr home for each instance. I'd use the
> -Dsolr.solr.home=/d/2 command line switch when launching Solr to do so.
>
> 2. RAID 10 the drives. If you expect the Solr instances to get uneven
> traffic, pooling the drives will allow a given Solr instance to share
> the capacity of all of them.
>
> On 1/5/15 23:31, Nishanth S wrote:
>
>> Hi folks,
>>
>> I am running multiple Solr instances (Solr 4.10.3 on Tomcat 8). There
>> are 3 physical machines and I have 4 Solr instances running on each
>> machine, on ports 8080, 8081, 8082 and 8083. The setup is fine up to
>> this point. Now I want to point each of these instances to a different
>> index directory. The drives in the machines are mounted as /d/1, /d/2,
>> /d/3, /d/4, etc. If I define /d/1 as the Solr home, all Solr index
>> directories are created in /d/1 while the other drives remain unused.
>> So how do I configure Solr to make use of all the drives so that I can
>> get maximum storage for Solr? I would really appreciate any help in
>> this regard.
>>
>> Thanks,
>> Nishanth
>


RE: .htaccess / password

2015-01-06 Thread Ganesh.Yadav
Craig,

1.   What is a .htaccess file meant for?

2.   What are the contents of this file? (See the sketch after this list.)

3.   How will you, or how will Solr, know to look for this file to bring the needed security to this area (and which area)?

4.   What event causes you to re-index the engine every night?
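
(For context, .htaccess holds per-directory Apache httpd rules; a minimal HTTP Basic auth sketch, assuming Solr is proxied behind Apache httpd. It has no effect on a bare Tomcat or Jetty:)

  AuthType Basic
  AuthName "Restricted Solr admin"
  AuthUserFile /etc/apache2/.htpasswd
  Require valid-user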



Please share



Thanks

G



-Original Message-
From: Craig Hoffman [mailto:choff...@eclimb.net]
Sent: Tuesday, January 06, 2015 12:29 PM
To: Apache Solr
Subject: .htaccess / password



Quick question: if I put a .htaccess file in front of www.mydomin.com:8983/solr/#/, will Solr continue to function properly? One thing to note: I will have a cron job that runs nightly and re-indexes the engine. In a nutshell, I'm looking for a way to secure this area.



Thanks,

Craig

--

Craig Hoffman

w: http://www.craighoffmanphotography.com

FB: www.facebook.com/CraigHoffmanPhotography

TW: https://twitter.com/craiglhoffman


OutOfMemoryError for PDF document upload into Solr

2015-01-14 Thread Ganesh.Yadav
Hello,

Can someone pass on hints for getting around the following error? Is there a heap size parameter I can set in Tomcat or in the Solr web app that gets deployed? (A heap-sizing sketch follows.)

I am running the Solr web app inside Tomcat on my local machine, which has 12 GB of RAM. I have a PDF document, up to 4 GB in size, that needs to be loaded into Solr.
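
(The usual knob is the JVM options Tomcat starts with; a sketch via bin/setenv.sh, with sizes that are only a guess for a 12 GB machine:)

  # $CATALINA_HOME/bin/setenv.sh (create the file if it does not exist)
  export CATALINA_OPTS="$CATALINA_OPTS -Xms2g -Xmx8g"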




Exception in thread "http-apr-8983-exec-6" java.lang.OutOfMemoryError: Java heap space
        at java.util.AbstractCollection.toArray(Unknown Source)
        at java.util.ArrayList.<init>(Unknown Source)
at org.apache.pdfbox.cos.COSDocument.getObjects(COSDocument.java:518)
at org.apache.pdfbox.cos.COSDocument.close(COSDocument.java:575)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:254)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1238)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1203)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:111)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:170)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:421)
at 
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1070)
at 
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:611)
at 
org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2462)
at 
org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2451)


Thanks
Ganesh



RE: OutOfMemoryError for PDF document upload into Solr

2015-01-15 Thread Ganesh.Yadav
Siegfried and Michael, thank you for your replies and help.

-Original Message-
From: Siegfried Goeschl [mailto:sgoes...@gmx.at] 
Sent: Thursday, January 15, 2015 3:45 AM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemoryError for PDF document upload into Solr

Hi Ganesh,

you can increase the heap size but parsing a 4 GB PDF document will very likely 
consume A LOT OF memory - I think you need to check if that large PDF can be 
parsed at all :-)
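
(One way to check that outside Solr is the standalone Tika app; a sketch, with the jar version and file name as placeholders:)

  # see whether the PDF parses at all, with a big heap and the text discarded
  java -Xmx8g -jar tika-app-1.7.jar --text huge-ticket-dump.pdf > /dev/null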

Cheers,

Siegfried Goeschl

On 14.01.15 18:04, Michael Della Bitta wrote:
> Yep, you'll have to increase the heap size for your Tomcat container.
>
> http://stackoverflow.com/questions/6897476/tomcat-7-how-to-set-initial-heap-size-correctly
>
> Michael Della Bitta
>
> Senior Software Engineer
>
> appinions inc.