Re: Running Multiple Solr Instances

2015-01-06 Thread Shawn Heisey
On 1/5/2015 9:31 PM, Nishanth S wrote:
> I am running multiple solr instances (Solr 4.10.3 on Tomcat 8). There are
> 3 physical machines and I have 4 solr instances running on each machine
> on ports 8080, 8081, 8082 and 8083. The setup is well up to this point. Now I
> want to point each of these instances to a different index directory. The
> drives in the machines are mounted as /d/1, /d/2, /d/3, /d/4 etc. Now if I
> define /d/1 as the solr home, all solr index directories are created in /d/1
> whereas the other drives remain unused. So how do I configure solr to
> make use of all the drives so that I can get maximum storage for solr? I
> would really appreciate any help in this regard.

You should only run one Solr instance per machine.  One instance can
handle as many indexes as you want to run.  Running multiple instances
will waste a fair amount of system resources, and will also make the
entire setup a lot more complicated than it needs to be.

If you don't plan on setting up RAID (which would probably be a lot
easier to manage), here's an idea:

Set up the solr home somewhere on the root filesystem, then create
symlinks under that which will be the instance directories, pointed to
various directories under your other mount points.  When Solr starts, it
should begin core detection at the solr home and follow those symlinks
into the other locations.  I'm not aware of any problems with using
symlinks in this way.
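
For example, a minimal sketch of that layout, assuming a solr home of
/opt/solr-home and one hypothetical core directory per mount:

  # create the solr home, plus one core directory on each drive
  mkdir -p /opt/solr-home
  mkdir -p /d/1/core1 /d/2/core2 /d/3/core3 /d/4/core4
  # symlink each core directory under the solr home
  for n in 1 2 3 4; do ln -s /d/$n/core$n /opt/solr-home/core$n; done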

If you're running SolrCloud, that can be a little more complicated,
because creating a new collection from scratch will create the cores
under the solr home ... but you can move them and symlink them after
they're created, then either reload the collection or restart Solr.
Just be sure that no indexing is happening when you begin the move.
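
For instance (core and collection names here are hypothetical, and assume
the same /opt/solr-home layout as above):

  # move the core onto another drive, symlink it back, then reload
  mv /opt/solr-home/collection1_shard1_replica1 /d/2/
  ln -s /d/2/collection1_shard1_replica1 /opt/solr-home/collection1_shard1_replica1
  curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1"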

Thanks,
Shawn



Vertical search Engine

2015-01-06 Thread klunwebale
Hello,

I want to create a vertical search engine like trovit.com.

I have installed Solr and Solarium.

What else do I need? Can you recommend a suitable crawler,
and how should I structure my data to be indexed?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Vertical search Engine

2015-01-06 Thread Furkan KAMACI
Hi,

You should estimate the size of the data you will index before you choose a
crawler. Crawlers are out of scope for this mailing list. If you will crawl a
large amount of data, you can check the Apache Nutch user list.

Furkan KAMACI

2015-01-06 10:39 GMT+02:00 klunwebale :

> Hello,
>
> I want to create a vertical search engine like trovit.com.
>
> I have installed Solr and Solarium.
>
> What else do I need? Can you recommend a suitable crawler,
> and how should I structure my data to be indexed?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


edismax with multiple words for keyword tokenizer splitting on space

2015-01-06 Thread Sankalp Gupta
Hi
I came across some weird behaviour in Solr, and I'm not sure why it is
desired. I have filed this on Stack Overflow. Please check
http://stackoverflow.com/questions/27795177/edismax-with-multiple-words-for-keyword-tokenizer-splitting-on-space

Thanks
Sankalp Gupta


Solr startup script in version 4.10.3

2015-01-06 Thread Dominique Bejean
Hi,

In release 4.10.3, the following lines were removed from the Solr start
script (bin/solr):

# TODO: see SOLR-3619, need to support server or example
# depending on the version of Solr
if [ -e "$SOLR_TIP/server/start.jar" ]; then
  DEFAULT_SERVER_DIR=$SOLR_TIP/server
else
  DEFAULT_SERVER_DIR=$SOLR_TIP/example
fi

However, the usage message still says

"  -d   Specify the Solr server directory; defaults to server"


Either the usage message has to be fixed or the removed lines put back into
the script.

Personally, I like defaulting to the server directory.

My installation process, in order to get a clean, empty Solr instance, is to
copy example into server and remove directories like example-DIH,
example-schemaless, multicore and solr/collection1.

The Solr server (or node) can then be started without the -d parameter.
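
With the default restored, these two commands would be equivalent (assuming
an unmodified directory layout):

  bin/solr start
  bin/solr start -d server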

If this makes sense, a Jira issue could be opened.

Dominique
http://www.eolya.fr/


Re: FOSDEM Open source search devroom

2015-01-06 Thread Charlie Hull

On 02/01/2015 08:37, Bram Van Dam wrote:

Hi folks,

There will be an Open source search devroom[1] at this year's FOSDEM in
Brussels, 31st of January & 1st of February.

I don't know if there will be a Lucene/Solr presence (there's no
schedule for the dev room yet), but this seems like a good place to meet up
and talk shop.

I'll be there, and I hope some of you will as well.


Sadly I won't, but my colleague (and committer) Alan Woodward will, 
talking about text search for stream processing.


C


  - Bram

[1] https://fosdem.org/2015/schedule/track/open_source_search/




--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: edismax with multiple words for keyword tokenizer splitting on space

2015-01-06 Thread Jack Krupansky
You need to escape the space in your query (using backslash or quotes
around the term) - the query parser doesn't parse based on the
analyzer/tokenizer for each field.
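
For example, against a hypothetical core "mycore" with a keyword-tokenized
field "city" (%5C is the URL-encoded backslash, %22 a double quote):

  curl "http://localhost:8983/solr/mycore/select?defType=edismax&q=city:New%5C%20York"
  curl "http://localhost:8983/solr/mycore/select?defType=edismax&q=city:%22New%20York%22"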

-- Jack Krupansky

On Tue, Jan 6, 2015 at 4:05 AM, Sankalp Gupta 
wrote:

> Hi
> I came across some weird behaviour in Solr, and I'm not sure why it is
> desired. I have filed this on Stack Overflow. Please check
>
> http://stackoverflow.com/questions/27795177/edismax-with-multiple-words-for-keyword-tokenizer-splitting-on-space
>
> Thanks
> Sankalp Gupta
>


Re: Vertical search Engine

2015-01-06 Thread Jack Krupansky
Consider the Fusion product from LucidWorks:
http://lucidworks.com/product/fusion/

Structuring of your data should be driven by your queries and access
patterns - what are the most common queries, and what are the most extreme
and complex queries that you expect to handle, both in terms of how the
queries are expressed and the results being returned.

-- Jack Krupansky

On Tue, Jan 6, 2015 at 3:39 AM, klunwebale  wrote:

> Hello,
>
> I want to create a vertical search engine like trovit.com.
>
> I have installed Solr and Solarium.
>
> What else do I need? Can you recommend a suitable crawler,
> and how should I structure my data to be indexed?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Running Multiple Solr Instances

2015-01-06 Thread Michael Della Bitta

I would do one of the following:

1. Set a different Solr home for each instance. I'd use the 
-Dsolr.solr.home=/d/2 command line switch when launching Solr to do so 
(see the sketch after point 2).


2. RAID 10 the drives. If you expect the Solr instances to get uneven 
traffic, pooling the drives will allow a given Solr instance to share 
the capacity of all of them.
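
A rough sketch of option 1 on one machine - four Tomcat instances, each with
its own CATALINA_BASE, port, and solr home (all paths hypothetical):

  export CATALINA_HOME=/opt/tomcat8
  for n in 1 2 3 4; do
    # server.xml in each base sets its own port (8080-8083)
    export CATALINA_BASE=/opt/tomcat-base-$n
    export JAVA_OPTS="-Dsolr.solr.home=/d/$n/solr"
    $CATALINA_HOME/bin/startup.sh
  done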


On 1/5/15 23:31, Nishanth S wrote:

Hi folks,

I am running multiple solr instances (Solr 4.10.3 on Tomcat 8). There are
3 physical machines and I have 4 solr instances running on each machine
on ports 8080, 8081, 8082 and 8083. The setup is well up to this point. Now I
want to point each of these instances to a different index directory. The
drives in the machines are mounted as /d/1, /d/2, /d/3, /d/4 etc. Now if I
define /d/1 as the solr home, all solr index directories are created in /d/1
whereas the other drives remain unused. So how do I configure solr to
make use of all the drives so that I can get maximum storage for solr? I
would really appreciate any help in this regard.

Thanks,
Nishanth





Re: How to limit the number of result sets of the 'export' handler

2015-01-06 Thread Alexandre Rafalovitch
Export was specifically designed to get everything, which is very
expensive otherwise.

If you just want the subset, you might be better off with normal
queries and/or with deep paging (cursor).

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 6 January 2015 at 00:30, Sandy Ding  wrote:
> Using rows=xxx doesn't seem to work.
> Is there a way to do this?


RE: Frequent deletions

2015-01-06 Thread Amey Jadiye
Well, we are doing the same thing (in a way). We have to do frequent deletions
in bulk; at a time we are deleting around 20M+ documents. All I am doing is,
after deletion, firing the command below on each of our Solr nodes, and keeping
some patience, as it takes a lot of time.

curl -vvv "http://node1.solr.x.com/collection1/update?optimize=true&distrib=false" >> /tmp/__solr_clener_log

After the optimize finishes, curl returns the XML below:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">10268995</int>
  </lst>
</response>


Regards,
Amey

> Date: Wed, 31 Dec 2014 02:32:37 -0700
> From: inna.gel...@elbitsystems.com
> To: solr-user@lucene.apache.org
> Subject: Frequent deletions
> 
> Hello,
> We perform frequent deletions from our index, which greatly increases the
> index size.
> How can we perform an optimization in order to reduce the size.
> Please advise,
> Thanks.
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Frequent-deletions-tp4176689.html
> Sent from the Solr - User mailing list archive at Nabble.com.
  

Re: Vertical search Engine

2015-01-06 Thread Ahmet Arslan
Hi,

http://manifoldcf.apache.org is another option to consider.
It is useful for crawling protected pages.

Free resources :

http://www.manning.com/wright/ManifoldCFinAction_manuscript.pdf
https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
 
Ahmet 



On Tuesday, January 6, 2015 1:56 PM, Jack Krupansky  
wrote:
Consider the Fusion product from LucidWorks:
http://lucidworks.com/product/fusion/

Structuring of your data should be driven by your queries and access
patterns - what are the most common queries, and what are the most extreme
and complex queries that you expect to handle, both in terms of how the
queries are expressed and the results being returned.

-- Jack Krupansky


On Tue, Jan 6, 2015 at 3:39 AM, klunwebale  wrote:

> Hello,
>
> I want to create a vertical search engine like trovit.com.
>
> I have installed Solr and Solarium.
>
> What else do I need? Can you recommend a suitable crawler,
> and how should I structure my data to be indexed?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


IstvanKulcsar - Wiki Solr

2015-01-06 Thread ikulcsar

Hi,

I would like to suggest pages which use Solr and were developed by my company.

Please add these pages to this site:
http://wiki.apache.org/solr/PublicServers

http://www.odrportal.hu/kereso/
http://idea.unideb.hu/idealista/
http://www.jobmonitor.hu
http://www.profession.hu/
http://webicina.com/
http://www.cylex.hu/

Thanks for your answer.

STeve


Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Charles VALLEE
I am considering using Solr to extend Hortonworks Data Platform 
capabilities to search.

- I found tutorials to index documents into a Solr instance from HDFS, but 
I guess this solution would require a Solr cluster distinct from the Hadoop 
cluster. Is it possible to have Solr integrated into the Hadoop cluster 
instead - with the index stored in HDFS?
- Where would the processing take place (could it be handed down to 
Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to 
integrate with Yarn?
- What about SolrCloud: what does it bring regarding Hadoop-based 
use-cases? Does it stand for a Solr-only cluster?
- Well, if that could lead to something working with a roles-based 
authorization-compliant Banana, it would be Christmas again!
Thanks a lot for any help!
Charles




Re: IstvanKulcsar - Wiki Solr

2015-01-06 Thread Shawn Heisey
On 1/6/2015 7:28 AM, ikulc...@precognox.com wrote:
> I would like to suggest pages which use Solr and were developed by my company.
> 
> Please add these pages to this site:
> http://wiki.apache.org/solr/PublicServers
> 
> http://www.odrportal.hu/kereso/
> http://idea.unideb.hu/idealista/
> http://www.jobmonitor.hu
> http://www.profession.hu/
> http://webicina.com/
> http://www.cylex.hu/

Create a user on the Solr wiki and let us know what your username is.
We will get your username added to the group that allows you to edit the
wiki.  Is IstvanKulcsar (first thing in the subject) your username on
the wiki?

Thanks,
Shawn



PDF search functionality using Solr

2015-01-06 Thread Ganesh.Yadav
Hello Solr-users and developers,
Can you please suggest:

1.   What should I do to index PDF content information column-wise?

2.   Do I need to extract the contents using one of the Analyzer, Tokenizer 
and Filter combinations and then add it to the index? How can I test the 
results at the command prompt? I do not know which specific Analyzer, 
Tokenizer and Filter to select for this purpose.

3.   How can I verify that the needed column info is extracted out of the PDF 
and is indexed?

4.   So, for example, how do I verify that the ticket number is extracted into 
a Ticket_number tag and is indexed?

5.   Is it OK to post 4 GB worth of PDF to be imported and indexed by Solr? 
I think I saw some posts complaining about how large a post can be.

6.   What will enable Solr to search in any PDF out of many, with different 
words such as "Runtime" "Error" "", with the result providing a link to the 
PDF?

My PDFs are nothing but a Jira ticket system.
Each PDF has info on:
Ticket Number:
Desc:
Client:
Status:
Submitter:
And so on:


1.   I imported a PDF document into Solr and it does the necessary searching, 
and I can test some of it using the browse client interface provided.

2.   I have 80 GB worth of PDFs.

3.   The total number of PDFs is about 200.

4.   Many PDFs are of size 4 GB.

5.   How do you suggest I import such large PDFs? What tools can you 
suggest to extract PDF contents first into some XML format and later post 
that XML to be indexed by Solr?







Your early response is much appreciated.



Thanks

G



Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Hi Charles,

See http://search-lucene.com/?q=solr+hdfs and
https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
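
The short version from that wiki page is a directory factory change in
solrconfig.xml (the namenode URI and path below are hypothetical):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  </directoryFactory>

plus <lockType>hdfs</lockType> in <indexConfig>.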

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE 
wrote:

> I am considering using *Solr* to extend *Hortonworks Data Platform*
> capabilities to search.
>
> - I found tutorials to index documents into a Solr instance from *HDFS*,
> but I guess this solution would require a Solr cluster distinct to the
> Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop
> cluster instead? - *With the index stored in HDFS?*
>
> - Where would the processing take place (could it be handed down to
> Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to
> integrate with *Yarn*?
>
> - What about *SolrCloud*: what does it bring regarding Hadoop based
> use-cases? Does it stand for a Solr-only cluster?
>
> - Well, if that could lead to something working with a roles-based
> authorization-compliant *Banana*, it would be Christmas again!
>
> Thanks a lot for any help!
>
> Charles
>
>


Re: Solr on HDFS in a Hadoop cluster

2015-01-06 Thread Otis Gospodnetic
Oh, and https://issues.apache.org/jira/browse/SOLR-6743

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 12:52 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hi Charles,
>
> See http://search-lucene.com/?q=solr+hdfs and
> https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Jan 6, 2015 at 11:02 AM, Charles VALLEE 
> wrote:
>
>> I am considering using *Solr* to extend *Hortonworks Data Platform*
>> capabilities to search.
>>
>> - I found tutorials to index documents into a Solr instance from *HDFS*,
>> but I guess this solution would require a Solr cluster distinct to the
>> Hadoop cluster. Is it possible to have a Solr integrated into the Hadoop
>> cluster instead? - *With the index stored in HDFS?*
>>
>> - Where would the processing take place (could it be handed down to
>> Hadoop)? Is there a way to guarantee a level of service (CPU, RAM) - to
>> integrate with *Yarn*?
>>
>> - What about *SolrCloud*: what does it bring regarding Hadoop based
>> use-cases? Does it stand for a Solr-only cluster?
>>
>> - Well, if that could lead to something working with a roles-based
>> authorization-compliant *Banana*, it would be Christmas again!
>>
>> Thanks a lot for any help!
>>
>> Charles
>>
>


Re: PDF search functionality using Solr

2015-01-06 Thread Jürgen Wagner (DVT)
Hello,
  no matter which search platform you will use, this will pose two
challenges:

- The size of the documents will render search less and less useful as
the likelihood of matches increases with document size. So, without a
proper semantic extraction (e.g., using decent NER or relationship
extraction with a commercial text mining product), I doubt you will get
the required precision to make this overly useful.

- PDFs can have their own character sets based on the characters
actually used. Such file-specific character sets are almost impossible
to parse, i.e., if your PDFs happen to use this "feature" of the PDF
format, you won't have much luck getting any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all necessary
documents and index the resulting XML or attachment formats. As the REST
API provides filtering capabilities, you could easily create incremental
feeds to avoid humongous indexing every time there's new information in
Jira. Dumping Jira stuff as PDF seems to me to be the least suitable way
of handling this.

Best regards,
--Jürgen


On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:
> Hello Solr-users and developers,
> Can you please suggest,
>
> 1.   What I should do to index PDF content information column wise?
>
> 2.   Do I need to extract the contents using one of the Analyzer, 
> Tokenize and Filter combination and then add it to Index? How can test the 
> results on command prompt? I do not know the selection of specific Analyzer, 
> Tokenizer and Filter for this purpose
>
> 3.   How can I verify that the needed column info is extracted out of PDF 
> and is indexed?
>
> 4.   So for example How to verify Ticket number is extracted in 
> Ticket_number tag and is indexed?
>
> 5.   Is it ok to post 4 GB worth of PDF to be imported and indexed by 
> Solr? I think I saw some posts complaining on how large size that can be 
> posted ?
>
> 6.   What will enable Solr to search in any PDF out of many, with 
> different words such as "Runtime" "Error" "" and result will provide the 
> link to the PDF
>
> My PDFs are nothing but Jira ticket system.
> PDF has info on
> Ticket Number:
> Desc:
> Client:
> Status:
> Submitter:
> And so on:
>
>
> 1.   I imported PDF document in Solr and it does the necessary searching 
> and I can test some of it using the browse client interface provided.
>
> 2.   I have 80 GB worth of PDFs.
>
> 3.   Total number of PDFs are about 200
>
> 4.   Many PDFs are of size 4 GB
>
> 5.   What do you suggest me to import such a large PDFs? What tools can 
> you suggest to extract PDF contents first in some XML format and later Post 
> that XML to be indexed by Solr.?
>
>
>
>
>
>
>
> Your early response is much appreciated.
>
>
>
> Thanks
>
> G
>
>


-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com
, URL: www.devoteam.de



Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




htaccess

2015-01-06 Thread Craig Hoffman
Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/ will 
Solr continue to function properly? One thing to note: I will have a CRON job 
that runs nightly that re-indexes the engine. In a nutshell I’m looking for a 
way to secure this area.

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman

Re: Running Multiple Solr Instances

2015-01-06 Thread Nishanth S
Thanks a lot, guys. As a beginner, these are very helpful for me.

Thanks,
Nishanth

On Tue, Jan 6, 2015 at 5:12 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> I would do one of the following:
>
> 1. Set a different Solr home for each instance. I'd use the
> -Dsolr.solr.home=/d/2 command line switch when launching Solr to do so.
>
> 2. RAID 10 the drives. If you expect the Solr instances to get uneven
> traffic, pooling the drives will allow a given Solr instance to share the
> capacity of all of them.
>
>
> On 1/5/15 23:31, Nishanth S wrote:
>
>> Hi folks,
>>
>> I am running multiple solr instances (Solr 4.10.3 on Tomcat 8). There are
>> 3 physical machines and I have 4 solr instances running on each machine
>> on ports 8080, 8081, 8082 and 8083. The setup is well up to this point.
>> Now I want to point each of these instances to a different index
>> directory. The drives in the machines are mounted as /d/1, /d/2, /d/3,
>> /d/4 etc. Now if I define /d/1 as the solr home, all solr index
>> directories are created in /d/1 whereas the other drives remain unused.
>> So how do I configure solr to make use of all the drives so that I can
>> get maximum storage for solr? I would really appreciate any help in this
>> regard.
>> Thanks,
>> Nishanth
>>
>>
>


RE: PDF search functionality using Solr Schema.xml and SolrConfig.xml question

2015-01-06 Thread Ganesh.Yadav
Thanks, Jürgen, for your quick reply.

I am still looking for an answer on schema.xml and solrconfig.xml:


1.   Do I need to tell Solr to extract the Title from the PDF - go look for the 
word "Title", extract the entire line after the tag, collect all such 
occurrences from hundreds of PDFs, and build the Title column data and index it?


2.   How do I define my own schema for Solr?

3.   Say I defined my fields Title, Ticket_number, Submitter, Client and so 
on; how can I verify the respective data is extracted into specific columns in 
Solr and indexed? Any suggestion on which Analyzer, Tokenizer and Filter will 
help for this purpose?


1.   I do not want to dump the entire 4 GB PDF contents into one searchable 
field (ATTR_CONTENT) in Solr.

2.   Even if the entire PDF contents is extracted into the above field as a 
default, I still want to extract specific searchable column data into the 
respective fields.

3.   Rather, I want to configure Solr to have column-wise searchable 
contents such as Title, number, and so on.

Any suggestions on performance? The PDF database is 80 GB; will it be fast 
enough? Do I need to divide it into multiple cores, on multiple machines, on 
multiple web apps? And clustering?


I should have mentioned my PDFs are from a ticketing system like Jira which 
was retired from production long ago, and all I have is the ticketing 
system's PDF database.


4.   My system will be used internally by just a select few people.

5.   They can wait for a 4 GB PDF to get loaded.

6.   I agree many matches will be found in one large PDF, based on the 
search criteria.

7.   To make searches faster I want Solr to create more columns and 
column-based indexes.

8.   Solr underneath uses Tika, which extracts contents and gets rid 
of all the rich-content formatting characters present in the PDF document.

9.   I believe the resulting extraction size is 1/5th of the original PDF 
- just a random guess based on one sample extraction.




From: "Jürgen Wagner (DVT)" [mailto:juergen.wag...@devoteam.com]
Sent: Tuesday, January 06, 2015 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: PDF search functionality using Solr

Hello,
  no matter which search platform you will use, this will pose two challenges:

- The size of the documents will render search less and less useful as the 
likelihood of matches increases with document size. So, without a proper 
semantic extraction (e.g., using decent NER or relationship extraction with a 
commercial text mining product), I doubt you will get the required precision to 
make this overly useful.

- PDFs can have their own character sets based on the characters actually used. 
Such file-specific character sets are almost impossible to parse, i.e., if your 
PDFs happen to use this "feature" of the PDF format, you won't have much luck getting 
any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all necessary documents 
and index the resulting XML or attachment formats. As the REST API provides 
filtering capabilities, you could easily create incremental feeds to avoid 
humongous indexing every time there's new information in Jira. Dumping Jira 
stuff as PDF seems to me to be the least suitable way of handling this.

Best regards,
--Jürgen




.htaccess / password

2015-01-06 Thread Craig Hoffman
Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/ will 
Solr continue to function properly? One thing to note: I will have a CRON job 
that runs nightly that re-indexes the engine. In a nutshell I’m looking for a 
way to secure this area.

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman

RE: Running Multiple Solr Instances

2015-01-06 Thread Ganesh.Yadav
Nishanth,

1.   I understand you are implementing clustering for the web apps, i.e. 
running the same application on multiple instances on one or more machines.

2.   If each of your web apps points to a different index directory, how will 
it switch to the next web app with a different index if the search term is 
not found in the first index directory?

3.   Or will the web app collect the results sequentially from all the index 
directories and present the resulting collection to the user?

Please share your thoughts.

Thanks
G







-Original Message-
From: Nishanth S [mailto:nishanth.2...@gmail.com]
Sent: Tuesday, January 06, 2015 12:17 PM
To: solr-user@lucene.apache.org
Subject: Re: Running Multiple Solr Instances



Thanks a lot, guys. As a beginner, these are very helpful for me.

Thanks,
Nishanth

On Tue, Jan 6, 2015 at 5:12 AM, Michael Della Bitta <
michael.della.bi...@appinions.com> wrote:

> I would do one of the following:
>
> 1. Set a different Solr home for each instance. I'd use the
> -Dsolr.solr.home=/d/2 command line switch when launching Solr to do so.
>
> 2. RAID 10 the drives. If you expect the Solr instances to get uneven
> traffic, pooling the drives will allow a given Solr instance to share
> the capacity of all of them.
>
> On 1/5/15 23:31, Nishanth S wrote:
>
>> Hi folks,
>>
>> I am running multiple solr instances (Solr 4.10.3 on Tomcat 8). There are
>> 3 physical machines and I have 4 solr instances running on each machine
>> on ports 8080, 8081, 8082 and 8083. The setup is well up to this point.
>> Now I want to point each of these instances to a different index
>> directory. The drives in the machines are mounted as /d/1, /d/2, /d/3,
>> /d/4 etc. Now if I define /d/1 as the solr home, all solr index
>> directories are created in /d/1 whereas the other drives remain unused.
>> So how do I configure solr to make use of all the drives so that I can
>> get maximum storage for solr? I would really appreciate any help in this
>> regard.
>>
>> Thanks,
>> Nishanth
>


Re: How large is your solr index?

2015-01-06 Thread Peter Sturge
Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
performs reasonably well on commodity hardware with lots of faceting and
concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
it works.

++1 for the automagic shard creator. We've been looking into doing this
sort of thing internally - i.e. when a shard reaches a certain size/num
docs, it creates 'sub-shards' to which new commits are sent and queries to
the 'parent' shard are included. The concept works, as long as you don't
try any non-dist stuff - it's one reason why all our fields are always
single valued. There are also other implications like cleanup, deletes and
security to take into account, to name a few.
A cool side-effect of sub-sharding (for lack of a snappy term) is that the
parent shard then stops suffering from auto-warming latency due to commits
(we do a fair amount of committing). In theory, you could carry on
sub-sharding until your hardware starts gasping for air.


On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam  wrote:

> On 01/04/2015 02:22 AM, Jack Krupansky wrote:
>
>> The reality doesn't seem to
>> be there today. 50 to 100 million documents, yes, but beyond that takes
>> some kind of "heroic" effort, whether a much beefier box, very careful and
>> limited data modeling or limiting of query capabilities or tolerance of
>> higher latency, expert tuning, etc.
>>
>
> I disagree. On the scale, at least. Up until 500M Solr performs "well"
> (read: well enough considering the scale) in a single shard on a single box
> of commodity hardware. Without any tuning or heroic efforts. Sure, some
> queries aren't as snappy as you'd like, and sure, indexing and querying at
> the same time will be somewhat unpleasant, but it will work, and it will
> work well enough.
>
> Will it work for thousands of concurrent users? Of course not. Anyone who
> is after that sort of thing won't find themselves in this scenario -- they
> will throw hardware at the problem.
>
> There is something to be said for making sharding less painful. It would
> be nice if, for instance, Solr would automagically create a new shard once
> some magic number was reached (2B at the latest, I guess). But then that'll
> break some query features ... :-(
>
> The reason we're using single large instances (sometimes on beefy
> hardware) is that SolrCloud is a pain. Not just from an administrative
> point of view (though that seems to be getting better, kudos for that!),
> but mostly because some queries cannot be executed with distributed=true.
> Our users, at least, prefer a slow query over an impossible query.
>
> Actually, this 2B limit is a good thing. It'll help me convince
> $management to donate some of our time to Solr :-)
>
>  - Bram
>


Re: PDF search functionality using Solr

2015-01-06 Thread Erick Erickson
Seconding Jürgen's comment. 4 GB docs are almost, but not quite, totally
useless to search. How many JIRAs each? That's _one_ document unless
you do some fancy dancing. Pulling the data directly using the JIRA
API sounds far superior.

If you _must_ use the JIRA->PDF->Solr option, consider the following:
use Tika on the client to parse the doc, take control of the mapping
of the metadata, and, probably, break things up into individual
documents, one Solr document per JIRA.

That'll give you a chance to deal with charset issues and the like.
Here's an example:

https://lucidworks.com/blog/indexing-with-solrj/

That one has both Tika and database connectivity but should be pretty
straightforward to adapt; just pull the database junk out.

Best,
Erick

On Tue, Jan 6, 2015 at 9:55 AM, "Jürgen Wagner (DVT)"
 wrote:
> Hello,
>   no matter which search platform you will use, this will pose two
> challenges:
>
> - The size of the documents will render search less and less useful as the
> likelihood of matches increases with document size. So, without a proper
> semantic extraction (e.g., using decent NER or relationship extraction with
> a commercial text mining product), I doubt you will get the required
> precision to make this overly useful.
>
> - PDFs can have their own character sets based on the characters actually
> used. Such file-specific character sets are almost impossible to parse,
> i.e., if your PDFs happen to use this "feature" of the PDF format, you won't
> have much luck getting any meaningful text out of them.
>
> My suggestion is to use the Jira REST API to collect all necessary documents
> and index the resulting XML or attachment formats. As the REST API provides
> filtering capabilities, you could easily create incremental feeds to avoid
> humongous indexing every time there's new information in Jira. Dumping Jira
> stuff as PDF seems to me to be the least suitable way of handling this.
>
> Best regards,
> --Jürgen
>
>
>
> On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:
>
> Hello Solr-users and developers,
> Can you please suggest,
>
> 1.   What I should do to index PDF content information column wise?
>
> 2.   Do I need to extract the contents using one of the Analyzer,
> Tokenize and Filter combination and then add it to Index? How can test the
> results on command prompt? I do not know the selection of specific Analyzer,
> Tokenizer and Filter for this purpose
>
> 3.   How can I verify that the needed column info is extracted out of
> PDF and is indexed?
>
> 4.   So for example How to verify Ticket number is extracted in
> Ticket_number tag and is indexed?
>
> 5.   Is it ok to post 4 GB worth of PDF to be imported and indexed by
> Solr? I think I saw some posts complaining on how large size that can be
> posted ?
>
> 6.   What will enable Solr to search in any PDF out of many, with
> different words such as "Runtime" "Error" "" and result will provide the
> link to the PDF
>
> My PDFs are nothing but Jira ticket system.
> PDF has info on
> Ticket Number:
> Desc:
> Client:
> Status:
> Submitter:
> And so on:
>
>
> 1.   I imported PDF document in Solr and it does the necessary searching
> and I can test some of it using the browse client interface provided.
>
> 2.   I have 80 GB worth of PDFs.
>
> 3.   Total number of PDFs are about 200
>
> 4.   Many PDFs are of size 4 GB
>
> 5.   What do you suggest me to import such a large PDFs? What tools can
> you suggest to extract PDF contents first in some XML format and later Post
> that XML to be indexed by Solr.?
>
>
>
>
>
>
>
> Your early response is much appreciated.
>
>
>
> Thanks
>
> G
>
>
>
>
> --
>
> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
> уважением
> i.A. Jürgen Wagner
> Head of Competence Center "Intelligence"
> & Senior Cloud Consultant
>
> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
> E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
>
> 
> Managing Board: Jürgen Hatzipantelis (CEO)
> Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
>
>


Re: How large is your solr index?

2015-01-06 Thread Erick Erickson
Have you considered pre-supposing SolrCloud and using the SPLITSHARD
API command?
Even after that's done, the sub-shard needs to be physically moved to
another machine
(probably), but that too could be scripted.

May not be desirable, but I thought I'd mention it.
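
For reference, the call looks like this (collection and shard names are
hypothetical):

  curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1"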

Best,
Erick

On Tue, Jan 6, 2015 at 10:33 AM, Peter Sturge  wrote:
> Yes, totally agree. We run 500m+ docs in a (non-cloud) Solr4, and it even
> performs reasonably well on commodity hardware with lots of faceting and
> concurrent indexing! Ok, you need a lot of RAM to keep faceting happy, but
> it works.
>
> ++1 for the automagic shard creator. We've been looking into doing this
> sort of thing internally - i.e. when a shard reaches a certain size/num
> docs, it creates 'sub-shards' to which new commits are sent and queries to
> the 'parent' shard are included. The concept works, as long as you don't
> try any non-dist stuff - it's one reason why all our fields are always
> single valued. There are also other implications like cleanup, deletes and
> security to take into account, to name a few.
> A cool side-effect of sub-sharding (for lack of a snappy term) is that the
> parent shard then stops suffering from auto-warming latency due to commits
> (we do a fair amount of committing). In theory, you could carry on
> sub-sharding until your hardware starts gasping for air.
>
>
> On Sun, Jan 4, 2015 at 1:44 PM, Bram Van Dam  wrote:
>
>> On 01/04/2015 02:22 AM, Jack Krupansky wrote:
>>
>>> The reality doesn't seem to
>>> be there today. 50 to 100 million documents, yes, but beyond that takes
>>> some kind of "heroic" effort, whether a much beefier box, very careful and
>>> limited data modeling or limiting of query capabilities or tolerance of
>>> higher latency, expert tuning, etc.
>>>
>>
>> I disagree. On the scale, at least. Up until 500M Solr performs "well"
>> (read: well enough considering the scale) in a single shard on a single box
>> of commodity hardware. Without any tuning or heroic efforts. Sure, some
>> queries aren't as snappy as you'd like, and sure, indexing and querying at
>> the same time will be somewhat unpleasant, but it will work, and it will
>> work well enough.
>>
>> Will it work for thousands of concurrent users? Of course not. Anyone who
>> is after that sort of thing won't find themselves in this scenario -- they
>> will throw hardware at the problem.
>>
>> There is something to be said for making sharding less painful. It would
>> be nice if, for instance, Solr would automagically create a new shard once
>> some magic number was reached (2B at the latest, I guess). But then that'll
>> break some query features ... :-(
>>
>> The reason we're using single large instances (sometimes on beefy
>> hardware) is that SolrCloud is a pain. Not just from an administrative
>> point of view (though that seems to be getting better, kudos for that!),
>> but mostly because some queries cannot be executed with distributed=true.
>> Our users, at least, prefer a slow query over an impossible query.
>>
>> Actually, this 2B limit is a good thing. It'll help me convince
>> $management to donate some of our time to Solr :-)
>>
>>  - Bram
>>


Re: htaccess

2015-01-06 Thread Gora Mohanty
Hi,

Your message seems quite confused (even the URL is not right for most
normal Solr setups), and it is not clear what you mean by "function
properly". Solr is a search engine, and has no idea about .htaccess files.

Are you asking whether Solr respects directives in .htaccess files? I am
pretty sure that cannot be the case.

With regards to Solr security, it is again normally not a Solr concern.
Please start from https://wiki.apache.org/solr/SolrSecurity

No offence, but it seems that your real concerns might lie elsewhere.
Please take a look at http://people.apache.org/~hossman/#xyproblem

Please do follow up on this list if your questions have not been addressed.

Regards,
Gora


On 6 January 2015 at 23:28, Craig Hoffman  wrote:

> Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/
> will Solr continue to function properly? One thing to note, I will have a
> CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
> looking for a way to secure this area.
>
> Thanks,
> Craig
> --
> Craig Hoffman
> w: http://www.craighoffmanphotography.com
> FB: www.facebook.com/CraigHoffmanPhotography
> TW: https://twitter.com/craiglhoffman
>



RE: .htaccess / password

2015-01-06 Thread Ganesh.Yadav
Craig,

1.   What is the .htaccess file meant for?

2.   What are the contents of this file?

3.   How will you, or how will Solr, know that it needs to look for this 
file to bring the needed security to this (which?) area?

4.   What event is causing you to re-index the engine every night?

Please share.

Thanks
G



-Original Message-
From: Craig Hoffman [mailto:choff...@eclimb.net]
Sent: Tuesday, January 06, 2015 12:29 PM
To: Apache Solr
Subject: .htaccess / password



Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/ 
will Solr continue to function properly? One thing to note: I will have a 
CRON job that runs nightly that re-indexes the engine. In a nutshell I’m 
looking for a way to secure this area.



Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman


Re: .htaccess / password

2015-01-06 Thread Otis Gospodnetic
Hi Craig,

If you want to protect Solr, put it behind something like Apache / Nginx /
HAProxy and put .htaccess at that level, in front of Solr.
Or try something like
http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/
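
A rough nginx sketch of that approach (host name and paths are hypothetical):

  server {
      listen 80;
      server_name solr.example.com;
      location / {
          auth_basic           "Solr admin";
          # created with: htpasswd -c /etc/nginx/.htpasswd youruser
          auth_basic_user_file /etc/nginx/.htpasswd;
          proxy_pass           http://127.0.0.1:8983;
      }
  }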

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman  wrote:

> Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/
> will Solr continue to function properly? One thing to note, I will have a
> CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
> looking for a way to secure this area.
>
> Thanks,
> Craig
> --
> Craig Hoffman
> w: http://www.craighoffmanphotography.com
> FB: www.facebook.com/CraigHoffmanPhotography
> TW: https://twitter.com/craiglhoffman
>
>


Solr Memory Usage - How to reduce memory footprint for solr

2015-01-06 Thread Abhishek Sharma
*Q* - I am forced to set the Java Xmx as high as 3.5g for my Solr app. If I
keep this low, my CPU hits 100% and response time for indexing increases a
lot, and I have hit OOM errors as well when this value is low.

Is this too high? If so, how can I reduce this?

*Machine Details* 4 G RAM, SSD

*Solr App Details* (Standalone solr app, no shards)

   1. num. of Solr Cores = 5
   2. Index Size - 2 g
   3. num. of Search Hits per sec - 10 [*IMP* - All search queries have
   faceting..]
   4. num. of times Re-Indexing per hour per core - 10 (it may happen at
   the same time at a moment for all the 5 cores)
   5. Query Result Cache, Document cache and Filter Cache are all default
   size - 4 kb.

*top* stats -

     VIRT     RES    SHR  S  %CPU  %MEM
  6446600  3.478g  18308  S  11.3  94.6

*iotop* stats

  DISK READ    DISK WRITE   SWAPIN   IO>
  0-1200 K/s   0-100 K/s    0        0-5%


RE: Solr Memory Usage - How to reduce memory footprint for solr

2015-01-06 Thread Toke Eskildsen
Abhishek Sharma [abhishe...@unbxd.com] wrote:

> *Q* - I am forced to set Java Xmx as high as 3.5g for my solr app.. If i
> keep this low, my CPU hits 100% and response time for indexing increases a
> lot.. And i have hit OOM Error as well when this value is low..

[...]

>   2. Index Size - 2 g
>   3. num. of Search Hits per sec - 10 [*IMP* - All search queries have
>   faceting..]

Faceting is often the reason for high memory usage. If you are not already 
doing so, do enable DocValues for the fields you are faceting on. If you have a 
lot of unique values in your facets (millions), you might also consider 
limiting the amount of concurrent searches.
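
In schema.xml that is a per-field attribute (the field name here is
hypothetical, and enabling it requires a reindex):

  <field name="category" type="string" indexed="true" stored="true" docValues="true"/>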

Still, 3.5GB heap seems like quite a bit for a 2GB index. How many documents do 
you have?

- Toke Eskildsen


facet.contains

2015-01-06 Thread Will Butler
https://issues.apache.org/jira/browse/SOLR-1387 contains a patch to support 
facet.contains and facet.contains.ignoreCase, making it possible to easily 
filter facet results without the facet.prefix limitations. I know that it is 
possible to approximate this using two fields, searching against one and 
displaying values from the other. However, this also has its limitations, and 
doesn’t work properly for multi-valued fields. Are there any other suggestions 
for being able to display facet values and counts based on a user-supplied 
string?

Thanks,

Will

Re: .htaccess / password

2015-01-06 Thread Craig Hoffman
Thanks, Otis. Do you think a .htaccess / .passwd file in the Solr admin dir 
would interfere with its operation?
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman

> On Jan 6, 2015, at 1:09 PM, Otis Gospodnetic  
> wrote:
> 
> Hi Craig,
> 
> If you want to protect Solr, put it behind something like Apache / Nginx /
> HAProxy and put .htaccess at that level, in front of Solr.
> Or try something like
> http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/
> 
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
> 
> 
> On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman  wrote:
> 
>> Quick question: If put a .htaccess file in www.mydomin.com/8983/solr/#/
>> will Solr continue to function properly? One thing to note, I will have a
>> CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
>> looking for a way to secure this area.
>> 
>> Thanks,
>> Craig
>> --
>> Craig Hoffman
>> w: http://www.craighoffmanphotography.com
>> FB: www.facebook.com/CraigHoffmanPhotography
>> TW: https://twitter.com/craiglhoffman



Re: .htaccess / password

2015-01-06 Thread Michael Della Bitta
The Jetty servlet container that Solr uses doesn't understand those 
files. It would not use them to determine access, and would likely make 
them accessible to web requests in plain text.
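
If you want to stay with the bundled Jetty rather than fronting Solr with a
web server, the servlet-spec way is HTTP basic auth declared in web.xml. A
rough sketch, with the role and realm names as placeholders (you would also
need to configure a matching login service on the Jetty side, e.g. a
HashLoginService backed by a properties file):

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr</web-resource-name>
      <url-pattern>/*</url-pattern>
    </web-resource-collection>
    <auth-constraint>
      <role-name>search-role</role-name>
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>
    <realm-name>Solr Realm</realm-name>
  </login-config>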


On 1/6/15 16:01, Craig Hoffman wrote:

Thanks Otis. Do you think a .htaccess / .passwd file in the Solr admin dir would
interfere with its operation?
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman

On Jan 6, 2015, at 1:09 PM, Otis Gospodnetic  wrote:

Hi Craig,

If you want to protect Solr, put it behind something like Apache / Nginx /
HAProxy and put .htaccess at that level, in front of Solr.
Or try something like
http://blog.jelastic.com/2013/06/17/secure-access-to-your-jetty-web-application/

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jan 6, 2015 at 1:28 PM, Craig Hoffman  wrote:


Quick question: If I put a .htaccess file in www.mydomin.com/8983/solr/#/
will Solr continue to function properly? One thing to note, I will have a
CRON job that runs nightly that re-indexes the engine. In a nutshell I’m
looking for a way to secure this area.

Thanks,
Craig
--
Craig Hoffman
w: http://www.craighoffmanphotography.com
FB: www.facebook.com/CraigHoffmanPhotography
TW: https://twitter.com/craiglhoffman


solrcloud without faceting, i.e. for failover only

2015-01-06 Thread Will Milspec
Hi all,

We have a smallish index that performs well for searches, and we are
considering using SolrCloud -- but just for high availability/redundancy,
i.e. without any sharding.

The indexes would be replicated, but not distributed.

I know that "there are no stupid questions..Only stupid people"...but here
goes:

-is solrcloud w/o sharding done?( I.e. "it's just not done!!" )
-any downside (i.e. aside from the lack of horizontal scalability )

will


SOLR - any open source framework

2015-01-06 Thread Vishal Swaroop
I am new to SOLR and was able to configure it and run the samples, as well
as index data using DIH (from a database).

Just wondering if there are open source frameworks to query and
display/visualize the results.

Regards


Re: solrcloud without faceting, i.e. for failover only

2015-01-06 Thread Michael Della Bitta

The downsides that come to mind:

1. Every write gets amplified by the number of nodes in the cloud. 1000 
write requests end up creating 1000*N HTTP calls as the leader forwards 
those writes individually to all of the followers in the cloud. Contrast 
that with classical replication where only changed index segments get 
replicated asynchronously.


2. Slightly more complicated infrastructure in terms of having to run a 
zookeeper cluster.


#1 is a trade off against being possibly more available to writes in the 
case of a single down node. In the cloud case, you're still open for 
business. In the classical replication case, you're no longer available 
for writes if the downed node is the master.


My two cents.

On 1/6/15 16:30, Will Milspec wrote:

Hi all,

We have a smallish index that performs well for searches and are
considering using solrcloud --but just for high availability/redundancy,
i.e. without any sharding.

The indexes would be replicated, but not distributed.

I know that "there are no stupid questions..Only stupid people"...but here
goes:

-is solrcloud w/o sharding done?( I.e. "it's just not done!!" )
-any downside (i.e. aside from the lack of horizontal scalability )

will





Re: solrcloud without faceting, i.e. for failover only

2015-01-06 Thread Chris Hostetter

: #1 is a trade off against being possibly more available to writes in the case
: of a single down node. In the cloud case, you're still open for business. In
: the classical replication case, you're no longer available for writes if the
: downed node is the master.

or to put it another way: classic replication lets you use N nodes for 
high availability reads, but you have a single point of failure for 
writes.

solr cloud gives you high availability for reads and writes -- including 
NRT support -- at the expense of more network overhead when writes happen.

: > -is solrcloud w/o sharding done?( I.e. "it's just not done!!" )
: > -any downside (i.e. aside from the lack of horizontal scalability )

it is certainly done -- specifically it is a matter of creating a 
collection with numShards=1 and replicationFactor=N.
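
For example, via the Collections API (collection name and replica count
are placeholder values):

  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=3

That gives you one logical shard with three replicas, any of which can
serve reads, and writes keep working as long as a replica is up to take
over leadership.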



-Hoss
http://www.lucidworks.com/


Re: SOLR - any open source framework

2015-01-06 Thread Alexandre Rafalovitch
That's a very general question. So, the following are three random ideas
just to get you started thinking about options.

*) spring.io (Spring Data Solr) + Vaadin
*)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder too)
*) http://projectblacklight.org/

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 6 January 2015 at 16:35, Vishal Swaroop  wrote:
> I am new to SOLR and was able to configure, run samples as well as able to
> index data using DIH (from database).
>
> Just wondering if there are open source framework to query and
> display/visualize.
>
> Regards


Re: SOLR - any open source framework

2015-01-06 Thread Erick Erickson
There's also the VelocityResponseWriter that comes with Solr. It takes
some effort to modify, but not a lot. It's useful for very fast iterations.
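
For reference, the stock example solrconfig.xml registers it with
something like:

  <queryResponseWriter name="velocity" class="solr.VelocityResponseWriter"
                       startup="lazy"/>

and the /browse request handler then renders results through the Velocity
templates under conf/velocity, which is where most of the modification
happens.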

Best,
Erick

On Tue, Jan 6, 2015 at 1:58 PM, Alexandre Rafalovitch
 wrote:
> That's very general question. So, the following are three random ideas
> just to get you started to think of options.
>
> *) spring.io (Spring Data Solr) + Vaadin
> *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder too)
> *) http://projectblacklight.org/
>
> Regards,
>Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 6 January 2015 at 16:35, Vishal Swaroop  wrote:
>> I am new to SOLR and was able to configure, run samples as well as able to
>> index data using DIH (from database).
>>
>> Just wondering if there are open source framework to query and
>> display/visualize.
>>
>> Regards


Re: SOLR - any open source framework

2015-01-06 Thread Roman Chyla
Hi Vishal, Alexandre,

Here is another one, using Backbone, just released v1.0.16

https://github.com/adsabs/bumblebee

you can see it in action: http://ui.adslabs.org/

While it primarily serves our own needs, I tried to architect it to be
extendible (within reasonable limits of code, man power)

Roman

On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch 
wrote:

> That's very general question. So, the following are three random ideas
> just to get you started to think of options.
>
> *) spring.io (Spring Data Solr) + Vaadin
> *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
> too)
> *) http://projectblacklight.org/
>
> Regards,
>Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 6 January 2015 at 16:35, Vishal Swaroop  wrote:
> > I am new to SOLR and was able to configure, run samples as well as able
> to
> > index data using DIH (from database).
> >
> > Just wondering if there are open source framework to query and
> > display/visualize.
> >
> > Regards
>


Re: SOLR - any open source framework

2015-01-06 Thread Vishal Swaroop
Great... Thanks for the inputs... I explored the VelocityResponseWriter; some
posts suggest it is good for prototyping but not for production...
On Jan 6, 2015 4:59 PM, "Alexandre Rafalovitch"  wrote:

> That's very general question. So, the following are three random ideas
> just to get you started to think of options.
>
> *) spring.io (Spring Data Solr) + Vaadin
> *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
> too)
> *) http://projectblacklight.org/
>
> Regards,
>Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 6 January 2015 at 16:35, Vishal Swaroop  wrote:
> > I am new to SOLR and was able to configure, run samples as well as able
> to
> > index data using DIH (from database).
> >
> > Just wondering if there are open source framework to query and
> > display/visualize.
> >
> > Regards
>


Re: SOLR - any open source framework

2015-01-06 Thread Vishal Swaroop
Thanks Roman... I will check it... Maybe it's off topic but how about
Angular...
On Jan 6, 2015 5:17 PM, "Roman Chyla"  wrote:

> Hi Vishal, Alexandre,
>
> Here is another one, using Backbone, just released v1.0.16
>
> https://github.com/adsabs/bumblebee
>
> you can see it in action: http://ui.adslabs.org/
>
> While it primarily serves our own needs, I tried to architect it to be
> extendible (within reasonable limits of code, man power)
>
> Roman
>
> On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch 
> wrote:
>
> > That's very general question. So, the following are three random ideas
> > just to get you started to think of options.
> >
> > *) spring.io (Spring Data Solr) + Vaadin
> > *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
> > too)
> > *) http://projectblacklight.org/
> >
> > Regards,
> >Alex.
> > 
> > Sign up for my Solr resources newsletter at http://www.solr-start.com/
> >
> >
> > On 6 January 2015 at 16:35, Vishal Swaroop  wrote:
> > > I am new to SOLR and was able to configure, run samples as well as able
> > to
> > > index data using DIH (from database).
> > >
> > > Just wondering if there are open source framework to query and
> > > display/visualize.
> > >
> > > Regards
> >
>


Re: SOLR - any open source framework

2015-01-06 Thread Roman Chyla
We compared several projects before starting - AngularJS was among them.
It is great for stuff where you can find ready-made components,
but writing custom components was easier in other frameworks (you need to
take this statement with a grain of salt: it was specific to our situation),
and that was one year ago...

On Tue, Jan 6, 2015 at 5:20 PM, Vishal Swaroop  wrote:

> Thanks Roman... I will check it... Maybe it's off topic but how about
> Angular...
> On Jan 6, 2015 5:17 PM, "Roman Chyla"  wrote:
>
> > Hi Vishal, Alexandre,
> >
> > Here is another one, using Backbone, just released v1.0.16
> >
> > https://github.com/adsabs/bumblebee
> >
> > you can see it in action: http://ui.adslabs.org/
> >
> > While it primarily serves our own needs, I tried to architect it to be
> > extendible (within reasonable limits of code, man power)
> >
> > Roman
> >
> > On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> > > That's very general question. So, the following are three random ideas
> > > just to get you started to think of options.
> > >
> > > *) spring.io (Spring Data Solr) + Vaadin
> > > *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI builder
> > > too)
> > > *) http://projectblacklight.org/
> > >
> > > Regards,
> > >Alex.
> > > 
> > > Sign up for my Solr resources newsletter at http://www.solr-start.com/
> > >
> > >
> > > On 6 January 2015 at 16:35, Vishal Swaroop 
> wrote:
> > > > I am new to SOLR and was able to configure, run samples as well as
> able
> > > to
> > > > index data using DIH (from database).
> > > >
> > > > Just wondering if there are open source framework to query and
> > > > display/visualize.
> > > >
> > > > Regards
> > >
> >
>


Re: SOLR - any open source framework

2015-01-06 Thread Vishal Swaroop
Thanks a lot... We are in the process of analyzing what to use with SOLR...
On Jan 6, 2015 5:30 PM, "Roman Chyla"  wrote:

> We've compared several projects before starting - AngularJS was on them,
>  it is great for stuff where you could find components (already prepared)
> but writing custom components was easier in other framworks (you need to
> take this statement with grain of salt: it was specific to our situation),
> but that was one year ago...
>
> On Tue, Jan 6, 2015 at 5:20 PM, Vishal Swaroop 
> wrote:
>
> > Thanks Roman... I will check it... Maybe it's off topic but how about
> > Angular...
> > On Jan 6, 2015 5:17 PM, "Roman Chyla"  wrote:
> >
> > > Hi Vishal, Alexandre,
> > >
> > > Here is another one, using Backbone, just released v1.0.16
> > >
> > > https://github.com/adsabs/bumblebee
> > >
> > > you can see it in action: http://ui.adslabs.org/
> > >
> > > While it primarily serves our own needs, I tried to architect it to be
> > > extendible (within reasonable limits of code, man power)
> > >
> > > Roman
> > >
> > > On Tue, Jan 6, 2015 at 4:58 PM, Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > > wrote:
> > >
> > > > That's very general question. So, the following are three random
> ideas
> > > > just to get you started to think of options.
> > > >
> > > > *) spring.io (Spring Data Solr) + Vaadin
> > > > *)  http://gethue.com/ (it's primarily Hadoop, but has Solr UI
> builder
> > > > too)
> > > > *) http://projectblacklight.org/
> > > >
> > > > Regards,
> > > >Alex.
> > > > 
> > > > Sign up for my Solr resources newsletter at
> http://www.solr-start.com/
> > > >
> > > >
> > > > On 6 January 2015 at 16:35, Vishal Swaroop 
> > wrote:
> > > > > I am new to SOLR and was able to configure, run samples as well as
> > able
> > > > to
> > > > > index data using DIH (from database).
> > > > >
> > > > > Just wondering if there are open source framework to query and
> > > > > display/visualize.
> > > > >
> > > > > Regards
> > > >
> > >
> >
>


cloudsolrserver

2015-01-06 Thread tharpa
We are switching from a direct HTTP connection to using CloudSolrServer.  I
have looked for, but failed to find, an example of code for connecting with
CloudSolrServer.  Are there any tutorials or code examples?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Vertical search Engine

2015-01-06 Thread Dominique Bejean
Hi,

You can have a look at www.crawl-anywhere.com, a web crawler on top of
Solr. It is used for the following vertical search engines:

http://www.hurisearch.org/
http://www.searchamnesty.org/

Regards

Dominique


2015-01-06 15:22 GMT+01:00 Ahmet Arslan :

> Hi,
>
> http://manifoldcf.apache.org is another option to consider.
> It is useful for crawling protected pages.
>
> Free resources :
>
> http://www.manning.com/wright/ManifoldCFinAction_manuscript.pdf
> https://manifoldcfinaction.googlecode.com/svn/trunk/pdfs/
>
> Ahmet
>
>
>
> On Tuesday, January 6, 2015 1:56 PM, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
> Consider the Fusion product from LucidWorks:
> http://lucidworks.com/product/fusion/
>
> Structuring of your data should be driven by your queries and access
> patterns - what are the most common queries and what are the most extreme
> and complex queries that you expect to handle, both in terms of how the
> queries are expressed and the results being returned.
>
> -- Jack Krupansky
>
>
> On Tue, Jan 6, 2015 at 3:39 AM, klunwebale  wrote:
>
> > hello
> >
> > i want to create a vertical search engine like trovit.com.
> >
> > I have installed solr  and solarium.
> >
> > What else to i need can you recommend a suitable crawler
> > and how to structure my data to be indexed
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Vertical-search-Engine-tp4177542.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: cloudsolrserver

2015-01-06 Thread Anshum Gupta
To get started, the ref guide should be helpful.

https://cwiki.apache.org/confluence/display/solr/Using+SolrJ

You just need to pass the Zk host string to the constructor and then use
the server.
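
A minimal sketch (ZooKeeper addresses and the collection name are
placeholders):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  public class CloudExample {
    public static void main(String[] args) throws Exception {
      // Connect via ZooKeeper instead of pointing at one Solr node
      CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
      server.setDefaultCollection("collection1");

      // Requests get routed to live replicas automatically
      QueryResponse rsp = server.query(new SolrQuery("*:*"));
      System.out.println("Found " + rsp.getResults().getNumFound() + " docs");

      server.shutdown();
    }
  }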

Also, what do you mean by *connect to CloudSolrServer*? You mean connect
*using* it, right?


On Tue, Jan 6, 2015 at 2:58 PM, tharpa <7kavsn...@sneakemail.com> wrote:

> We are switching from a direct HTTP connection to use cloudsolrserver.  I
> have looked and failed for an example of code for connecting to
> cloudsolrserver.  Are there any tutorials or code examples?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Anshum Gupta
http://about.me/anshumgupta


Re: cloudsolrserver

2015-01-06 Thread tharpa
Thanks Anshum.

If you say that "connect using CloudSolrServer" is more correct than saying,
"connect to CloudSolrServer", I believe you.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/cloudsolrserver-tp4177724p4177728.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to limit the number of result sets of the 'export' handler

2015-01-06 Thread Sandy Ding
Thanks Alexandre.
I actually need the whole result set. But it is large (perhaps 10m-100m) and
I find select is slow.
How does export differ from select, except that select will make distributed
requests and do the merge?
Will select with ‘distrib=false’ have comparable performance to export?
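
For reference, /export is invoked like a regular query but requires
explicit sort and fl parameters, and every requested field must have
DocValues enabled - e.g., with placeholder collection and field names:

  http://localhost:8983/solr/collection1/export?q=*:*&sort=id+asc&fl=id,price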


2015-01-06 20:55 GMT+08:00 Alexandre Rafalovitch :

> Export was specifically designed to get everything which is very
> expensive otherwise.
>
> If you just want the subset, you might be better off with normal
> queries and/or with deep paging (cursor).
>
> Regards,
>Alex.
> 
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
>
> On 6 January 2015 at 00:30, Sandy Ding  wrote:
> > Using rows=xxx doesn't seem to work.
> > Is there a way to do this?
>