Re: Network segmentation of replica

2017-07-06 Thread Dave
I have tested that out in Solr Cloud, but for Solr master/slave replication the 
configsets will not take effect without a reload, even if specified in the slave 
settings. 
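
If you need to kick the slave by hand, the Core Admin reload is something along 
these lines (host and core name below are only placeholders):

  http://localhost:8983/solr/admin/cores?action=RELOAD&core=yourcore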

> On Jul 6, 2017, at 5:56 PM, Erick Erickson  wrote:
> 
> I'm not entirely sure what happens if the sequence is
> 1> node drops out due to network glitch but Solr is still running
> 2> you upload a new configset
> 3> the network glitch repairs itself
> 4> the Solr instance reconnects.
> 
> Certainly if the Solr node is _restarted_ or _reloaded_ the new
> configs are read down.
> 
> The _index_ is always checked after a node is unavailable, so I'm sure
> of this sequence.
> 1> node drops out due to network glitch but Solr is still running
> 2> indexing continues
> 3> the network glitch repairs itself
> 4> the Solr instance reconnects.
> 5> the index is synchronized if necessary
> 
> Anyone else wants to chime in?
> 
> 
> Best,
> Erick
> 
> On Thu, Jul 6, 2017 at 2:27 PM, Lars Karlsson
>  wrote:
>> Ok, so although there was a configuration change and/or schema change
>> (during network segmentation) that normally requires a manual core reload
>> (that nowadays happen automatically via the schema API), this replica will
>> get instructions from Zookeeper to update its configuration and schema,
>> reload its core, then synchronize and finally serve again,
>> 
>> Please confirm.
>> 
>> Regards
>> Lars
>> 
>> 
>>> On Thu, 6 Jul 2017 at 23:18, Erick Erickson  wrote:
>>> 
>>> right, when the node connects again to Zookeeper, it will also rejoin
>>> the collection. At that point it's index is synchronized with the
>>> leader and when it goes "active", then it should again start serving
>>> queries.
>>> 
>>> Best,
>>> Erick
>>> 
>>> On Thu, Jul 6, 2017 at 2:04 PM, Lars Karlsson
>>>  wrote:
 Hi all, please help clarify how solr will handle network segmented
>>> replica
 meanwhile configuration and reload of cores/nodes for one collection is
 applied?
 
 Does the replica become part of the collection after connectivity is
 restored?
 
 Hence the node is not down, but lost ability to communicate to zookeepers
 and other nodes for a short while.
 
 Regards
 Lars
>>> 


Re: Network segmentation of replica

2017-07-06 Thread Dave
Sorry, that should have read: I have not tested that in Solr Cloud. 

> On Jul 6, 2017, at 6:37 PM, Dave  wrote:
> 
> I have tested that out in solr cloud, but for solr master slave replication 
> the config sets will not go without a reload, even if specified in the 
> slave settings. 
> 
>> On Jul 6, 2017, at 5:56 PM, Erick Erickson  wrote:
>> 
>> I'm not entirely sure what happens if the sequence is
>> 1> node drops out due to network glitch but Solr is still running
>> 2> you upload a new configset
>> 3> the network glitch repairs itself
>> 4> the Solr instance reconnects.
>> 
>> Certainly if the Solr node is _restarted_ or _reloaded_ the new
>> configs are read down.
>> 
>> The _index_ is always checked after a node is unavailable, so I'm sure
>> of this sequence.
>> 1> node drops out due to network glitch but Solr is still running
>> 2> indexing continues
>> 3> the network glitch repairs itself
>> 4> the Solr instance reconnects.
>> 5> the index is synchronized if necessary
>> 
>> Anyone else wants to chime in?
>> 
>> 
>> Best,
>> Erick
>> 
>> On Thu, Jul 6, 2017 at 2:27 PM, Lars Karlsson
>>  wrote:
>>> Ok, so although there was a configuration change and/or schema change
>>> (during network segmentation) that normally requires a manual core reload
>>> (that nowadays happen automatically via the schema API), this replica will
>>> get instructions from Zookeeper to update its configuration and schema,
>>> reload its core, then synchronize and finally serve again,
>>> 
>>> Please confirm.
>>> 
>>> Regards
>>> Lars
>>> 
>>> 
>>>> On Thu, 6 Jul 2017 at 23:18, Erick Erickson  
>>>> wrote:
>>>> 
>>>> right, when the node connects again to Zookeeper, it will also rejoin
>>>> the collection. At that point it's index is synchronized with the
>>>> leader and when it goes "active", then it should again start serving
>>>> queries.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Thu, Jul 6, 2017 at 2:04 PM, Lars Karlsson
>>>>  wrote:
>>>>> Hi all, please help clarify how solr will handle network segmented
>>>> replica
>>>>> meanwhile configuration and reload of cores/nodes for one collection is
>>>>> applied?
>>>>> 
>>>>> Does the replica become part of the collection after connectivity is
>>>>> restored?
>>>>> 
>>>>> Hence the node is not down, but lost ability to communicate to zookeepers
>>>>> and other nodes for a short while.
>>>>> 
>>>>> Regards
>>>>> Lars
>>>> 


Re: solr cloud vs standalone solr

2017-07-29 Thread Dave
There is no solid rule. Honestly, standalone Solr can handle quite a bit; I 
don't think there's a valid reason to go to Cloud unless you are starting from 
scratch and want to use the newest buzzword. Standalone can handle well over 
half a terabyte of index at sub-second speeds all day long.  

> On Jul 29, 2017, at 11:24 AM, Aman Tandon  wrote:
> 
> Hello Sara,
> 
> There is no hard and fast rule; performance depends on caches, RAM, HDD, etc., and
> how many resources you could invest to keep acceptable performance.
> Information on the number of indexed documents and the number of dynamic fields can
> be viewed at the link below. I hope this helps.
> 
> http://lucene.472066.n3.nabble.com/Solr-limitations-td4076250.html
> 
>> On Sat, Jul 29, 2017, 13:23 sara hajili  wrote:
>> 
>> hi all,
>> I want to know when standalone solr is no longer sufficient for storing data
>> and we need to migrate to solr cloud? for example, standalone solr takes too
>> much time to return query results or to store documents, etc.
>> 
>> in other words, what is the best capacity and index size for standalone
>> solr that doesn't hurt query and indexing
>> performance? and after passing this index size must I switch to solr cloud?
>> 


Re: Move index directory to another partition

2017-08-01 Thread Dave
To add to this, not sure if Solr Cloud uses it, but you're going to want to 
delete the write.lock file as well.
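
Something along these lines, with Solr stopped first and your own core's data 
directory substituted (the path below is only an example):

  rm /var/solr/data/mycore/data/index/write.lock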

> On Aug 1, 2017, at 9:31 PM, Shawn Heisey  wrote:
> 
>> On 8/1/2017 7:09 PM, Erick Erickson wrote:
>> WARNING: what I currently understand about the limitations of AWS
>> could fill volumes so I might be completely out to lunch.
>> 
>> If you ADDREPLICA with the new replica's  data residing on the new EBS
>> volume, then wait for it to sync (which it'll do all by itself) then
>> DELETEREPLICA on the original you'll be all set.
>> 
> In recent Solr versions, there's also the MOVEREPLICA collections API call.
> 
> I did consider mentioning that as a possible way forward, but I hate to
> rely on special configurations with core.properties, particularly if the
> newly built replica core instanceDirs aren't in the solr home (or
> coreRootDirectory) at all.  I didn't want to try and explain the precise
> steps required to get that plan to work.  I would expect to need some
> arcane Collections API work or manual ZK modification to reach a correct
> state -- steps that would be prone to error.
> 
> The idea I mentioned seemed to me to be the way forward that would
> require the least specialized knowledge.  Here's a simplified stating of
> the steps:
> 
> * Mount the new volume somewhere.
> * Use multiple rsync passes to get the data copied.
> * Stop Solr.
> * Do a final rsync pass.
> * Unmount the original volume.
> * Remount the new volume in the original location.
> * Start Solr.
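> 
> For the copy passes, something like this (paths are only examples):
> 
>   rsync -av /var/solr/data/ /mnt/newvolume/solr/data/            # repeat while Solr is running
>   rsync -av --delete /var/solr/data/ /mnt/newvolume/solr/data/   # final pass after Solr is stopped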
> 
> Thanks,
> Shawn
> 


Re: MongoDb vs Solr

2017-08-04 Thread Dave
One's a search engine and the other is a NoSQL DB. They're nothing alike and are 
completely different tools for completely different jobs. 




> On Aug 4, 2017, at 7:16 PM, Francesco Viscomi  wrote:
> 
> Hi all,
> why i have to choose solr if mongoDb is easier to learn and to use?
> Both are NoSql database, is there a good reason to chose solr and not
> mongoDb?
> 
> thanks really much
> 
> -- 
> Ing. Viscomi Francesco


Re: MongoDb vs Solr

2017-08-04 Thread Dave
Uhm. Dude are you drinking?

1. Lucidworks would never say that. 
2. Maria is not JSON + MySQL. Maria is a fork of the last open source version 
of MySQL before Oracle bought them. 
3. Walter is 100% correct. Solr is search. The only complex data structure it 
has is an array. Something like Mongo can do arrays, hashes, arrays of hashes, 
etc.; it's actually JSON based. But it can't search as well as a search engine can. 

There is no one tool. Use each for their own abilities. 


> On Aug 4, 2017, at 10:35 PM, GW  wrote:
> 
> The people @ Lucidworks would beg to disagree but I know exactly what you
> are saying Walter.
> 
> A simple flat file like a cardx is fine and dandy as a Solrcloud noSQL DB.
> I like to express it as knowing when to fish and when to cut bait. As soon
> as you are in the one - many or many - many world a real DB is a whole lot
> more sensible.
> 
> Augment your one-many|many-many NoSQL DB with a Solrcloud and you've got a
> rocket. Maria (MySQL with JSON) has had text search for a long time but It
> just does not compare to Solr. Put the two together and you've got some
> serious magic.
> 
> No offense intended, There's nothing wrong with being 97.5% correct. I wish
> I could be 97.5% correct all the time. :-)
> 
> 
> 
>> On 4 August 2017 at 18:41, Walter Underwood  wrote:
>> 
>> Solr is NOT a database. If you need a database, don’t choose Solr.
>> 
>> If you need both a database and search, choose MarkLogic.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Aug 4, 2017, at 4:16 PM, Francesco Viscomi 
>> wrote:
>>> 
>>> Hi all,
>>> why i have to choose solr if mongoDb is easier to learn and to use?
>>> Both are NoSql database, is there a good reason to chose solr and not
>>> mongoDb?
>>> 
>>> thanks really much
>>> 
>>> --
>>> Ing. Viscomi Francesco
>> 
>> 


Re: MongoDb vs Solr

2017-08-05 Thread Dave
Solr=search engine
Mongodb=schemaless database



> On Aug 5, 2017, at 12:26 AM, Walter Underwood  wrote:
> 
> MarkLogic can do many-to-many. I worked there six years ago. They use search 
> engine index structure with generational updates, including segment level 
> caches. With locking. Pretty good stuff.
> 
> A many to many relationship is an intersection across posting lists, with 
> transactions. Straightforward, but not easy to do it fast.
> 
> The “Inside MarkLogic Server” paper does a good job of explaining the guts.
> 
> Now, back to our regularly scheduled Solr presentations.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
>> On Aug 4, 2017, at 8:13 PM, David Hastings  wrote:
>> 
>> Also, id love to see an example of a many to many relationship in a nosql db 
>> as you described, since that's a rdbms concept. If it exists in a nosql 
>> environment I would like to learn how...
>> 
>>> On Aug 4, 2017, at 10:56 PM, Dave  wrote:
>>> 
>>> Uhm. Dude are you drinking?
>>> 
>>> 1. Lucidworks would never say that. 
>>> 2. Maria is not a json +MySQL. Maria is a fork of the last open source 
>>> version of MySQL before oracle bought them 
>>> 3.walter is 100% correct. Solr is search. The only complex data structure 
>>> it has is an array. Something like mongo can do arrays hashes arrays of 
>>> hashes etc, it's actually json based. But it can't search well as a search 
>>> engine can. 
>>> 
>>> There is no one tool. Use each for their own abilities. 
>>> 
>>> 
>>>> On Aug 4, 2017, at 10:35 PM, GW  wrote:
>>>> 
>>>> The people @ Lucidworks would beg to disagree but I know exactly what you
>>>> are saying Walter.
>>>> 
>>>> A simple flat file like a cardx is fine and dandy as a Solrcloud noSQL DB.
>>>> I like to express it as knowing when to fish and when to cut bait. As soon
>>>> as you are in the one - many or many - many world a real DB is a whole lot
>>>> more sensible.
>>>> 
>>>> Augment your one-many|many-many NoSQL DB with a Solrcloud and you've got a
>>>> rocket. Maria (MySQL with JSON) has had text search for a long time but It
>>>> just does not compare to Solr. Put the two together and you've got some
>>>> serious magic.
>>>> 
>>>> No offense intended, There's nothing wrong with being 97.5% correct. I wish
>>>> I could be 97.5% correct all the time. :-)
>>>> 
>>>> 
>>>> 
>>>>> On 4 August 2017 at 18:41, Walter Underwood  wrote:
>>>>> 
>>>>> Solr is NOT a database. If you need a database, don’t choose Solr.
>>>>> 
>>>>> If you need both a database and search, choose MarkLogic.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>> 
>>>>>> On Aug 4, 2017, at 4:16 PM, Francesco Viscomi 
>>>>> wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> why i have to choose solr if mongoDb is easier to learn and to use?
>>>>>> Both are NoSql database, is there a good reason to chose solr and not
>>>>>> mongoDb?
>>>>>> 
>>>>>> thanks really much
>>>>>> 
>>>>>> --
>>>>>> Ing. Viscomi Francesco
>>>>> 
>>>>> 
> 


Re: MongoDb vs Solr

2017-08-05 Thread Dave
Also, I wouldn't really recommend MongoDB at all; it should only be used as a 
fast front end to an ACID-compliant relational DB, same as with memcached, for 
example. If you're going to stick to open source, as I do, you should use the 
correct tool for the job. 

> On Aug 5, 2017, at 7:32 AM, GW  wrote:
> 
> Insults for Walter only.. sorry..
> 
>> On 5 August 2017 at 06:28, GW  wrote:
>> 
>> For The Guardian, Solr is the new database | Lucidworks
>> <https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwiR1rn6_b_VAhVB7IMKHWGKBj4QFgguMAE&url=https%3A%2F%2Flucidworks.com%2F2010%2F04%2F29%2Ffor-the-guardian-solr-is-the-new-database%2F&usg=AFQjCNE6CwwFRMvNhgzvEZu-Sryu_vtL8A>
>> https://lucidworks.com/2010/04/29/for-the-guardian-solr-
>> is-the-new-database/
>> Apr 29, 2010 - For The Guardian, *Solr* is the new *database*. I blogged
>> a few days ago about how open search source is disrupting the relationship
>> between ...
>> 
>> You are arrogant and probably lame as a programmer.
>> 
>> All offense intended
>> 
>>> On 5 August 2017 at 06:23, GW  wrote:
>>> 
>>> Watch their videos
>>> 
>>> On 4 August 2017 at 23:26, Walter Underwood 
>>> wrote:
>>> 
>>>> MarkLogic can do many-to-many. I worked there six years ago. They use
>>>> search engine index structure with generational updates, including segment
>>>> level caches. With locking. Pretty good stuff.
>>>> 
>>>> A many to many relationship is an intersection across posting lists,
>>>> with transactions. Straightforward, but not easy to do it fast.
>>>> 
>>>> The “Inside MarkLogic Server” paper does a good job of explaining the
>>>> guts.
>>>> 
>>>> Now, back to our regularly scheduled Solr presentations.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>> 
>>>>> On Aug 4, 2017, at 8:13 PM, David Hastings 
>>>> wrote:
>>>>> 
>>>>> Also, id love to see an example of a many to many relationship in a
>>>> nosql db as you described, since that's a rdbms concept. If it exists in a
>>>> nosql environment I would like to learn how...
>>>>> 
>>>>>> On Aug 4, 2017, at 10:56 PM, Dave 
>>>> wrote:
>>>>>> 
>>>>>> Uhm. Dude are you drinking?
>>>>>> 
>>>>>> 1. Lucidworks would never say that.
>>>>>> 2. Maria is not a json +MySQL. Maria is a fork of the last open
>>>> source version of MySQL before oracle bought them
>>>>>> 3.walter is 100% correct. Solr is search. The only complex data
>>>> structure it has is an array. Something like mongo can do arrays hashes
>>>> arrays of hashes etc, it's actually json based. But it can't search well as
>>>> a search engine can.
>>>>>> 
>>>>>> There is no one tool. Use each for their own abilities.
>>>>>> 
>>>>>> 
>>>>>>> On Aug 4, 2017, at 10:35 PM, GW  wrote:
>>>>>>> 
>>>>>>> The people @ Lucidworks would beg to disagree but I know exactly
>>>> what you
>>>>>>> are saying Walter.
>>>>>>> 
>>>>>>> A simple flat file like a cardx is fine and dandy as a Solrcloud
>>>> noSQL DB.
>>>>>>> I like to express it as knowing when to fish and when to cut bait.
>>>> As soon
>>>>>>> as you are in the one - many or many - many world a real DB is a
>>>> whole lot
>>>>>>> more sensible.
>>>>>>> 
>>>>>>> Augment your one-many|many-many NoSQL DB with a Solrcloud and you've
>>>> got a
>>>>>>> rocket. Maria (MySQL with JSON) has had text search for a long time
>>>> but It
>>>>>>> just does not compare to Solr. Put the two together and you've got
>>>> some
>>>>>>> serious magic.
>>>>>>> 
>>>>>>> No offense intended, There's nothing wrong with being 97.5% correct.
>>>> I wish
>>>>>>> I could be 97.5% correct all the time. :-)
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 4 August 2017 at 18:41, Walter Underwood 
>>>> wrote:
>>>>>>>> 
>>>>>>>> Solr is NOT a database. If you need a database, don’t choose Solr.
>>>>>>>> 
>>>>>>>> If you need both a database and search, choose MarkLogic.
>>>>>>>> 
>>>>>>>> wunder
>>>>>>>> Walter Underwood
>>>>>>>> wun...@wunderwood.org
>>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Aug 4, 2017, at 4:16 PM, Francesco Viscomi 
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> why i have to choose solr if mongoDb is easier to learn and to use?
>>>>>>>>> Both are NoSql database, is there a good reason to chose solr and
>>>> not
>>>>>>>>> mongoDb?
>>>>>>>>> 
>>>>>>>>> thanks really much
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Ing. Viscomi Francesco
>>>>>>>> 
>>>>>>>> 
>>>> 
>>>> 
>>> 
>> 


Re: MongoDb vs Solr

2017-08-05 Thread Dave
And to add to the conversation, 7-year-old blog posts are not a reason to make 
decisions for your tech stack. 

And insults are not something I'd like to see on this mailing list at all, so 
please do not repeat any such disrespectful or condescending statements in your 
contributions to a mailing list that's supposed to serve as a source of help, 
which you asked for. 

> On Aug 5, 2017, at 7:54 AM, Dave  wrote:
> 
> Also I wouldn't really recommend mongodb at all, it should only to be used as 
> a fast front end to an acid compliant relational db same with  memcahed for 
> example. If you're going to stick to open source, as I do, you should use the 
> correct tool for the job. 
> 
>> On Aug 5, 2017, at 7:32 AM, GW  wrote:
>> 
>> Insults for Walter only.. sorry..
>> 
>>> On 5 August 2017 at 06:28, GW  wrote:
>>> 
>>> For The Guardian, Solr is the new database | Lucidworks
>>> <https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwiR1rn6_b_VAhVB7IMKHWGKBj4QFgguMAE&url=https%3A%2F%2Flucidworks.com%2F2010%2F04%2F29%2Ffor-the-guardian-solr-is-the-new-database%2F&usg=AFQjCNE6CwwFRMvNhgzvEZu-Sryu_vtL8A>
>>> https://lucidworks.com/2010/04/29/for-the-guardian-solr-
>>> is-the-new-database/
>>> Apr 29, 2010 - For The Guardian, *Solr* is the new *database*. I blogged
>>> a few days ago about how open search source is disrupting the relationship
>>> between ...
>>> 
>>> You are arrogant and probably lame as a programmer.
>>> 
>>> All offense intended
>>> 
>>>> On 5 August 2017 at 06:23, GW  wrote:
>>>> 
>>>> Watch their videos
>>>> 
>>>> On 4 August 2017 at 23:26, Walter Underwood 
>>>> wrote:
>>>> 
>>>>> MarkLogic can do many-to-many. I worked there six years ago. They use
>>>>> search engine index structure with generational updates, including segment
>>>>> level caches. With locking. Pretty good stuff.
>>>>> 
>>>>> A many to many relationship is an intersection across posting lists,
>>>>> with transactions. Straightforward, but not easy to do it fast.
>>>>> 
>>>>> The “Inside MarkLogic Server” paper does a good job of explaining the
>>>>> guts.
>>>>> 
>>>>> Now, back to our regularly scheduled Solr presentations.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>> 
>>>>>> On Aug 4, 2017, at 8:13 PM, David Hastings 
>>>>> wrote:
>>>>>> 
>>>>>> Also, id love to see an example of a many to many relationship in a
>>>>> nosql db as you described, since that's a rdbms concept. If it exists in a
>>>>> nosql environment I would like to learn how...
>>>>>> 
>>>>>>> On Aug 4, 2017, at 10:56 PM, Dave 
>>>>> wrote:
>>>>>>> 
>>>>>>> Uhm. Dude are you drinking?
>>>>>>> 
>>>>>>> 1. Lucidworks would never say that.
>>>>>>> 2. Maria is not a json +MySQL. Maria is a fork of the last open
>>>>> source version of MySQL before oracle bought them
>>>>>>> 3.walter is 100% correct. Solr is search. The only complex data
>>>>> structure it has is an array. Something like mongo can do arrays hashes
>>>>> arrays of hashes etc, it's actually json based. But it can't search well 
>>>>> as
>>>>> a search engine can.
>>>>>>> 
>>>>>>> There is no one tool. Use each for their own abilities.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Aug 4, 2017, at 10:35 PM, GW  wrote:
>>>>>>>> 
>>>>>>>> The people @ Lucidworks would beg to disagree but I know exactly
>>>>> what you
>>>>>>>> are saying Walter.
>>>>>>>> 
>>>>>>>> A simple flat file like a cardx is fine and dandy as a Solrcloud
>>>>> noSQL DB.
>>>>>>>> I like to express it as knowing when to fish and when to cut bait.
>>>>> As soon
>>>>>>>> as you are in the one - many or many - many world a real DB is a
>>>>> wh

Re: Need help with query syntax

2017-08-10 Thread Dave
Erick, are you going to Vegas next month? 

> On Aug 10, 2017, at 7:38 PM, Erick Erickson  wrote:
> 
> Omer:
> 
> Solr does not implement pure boolean logic, see:
> https://lucidworks.com/2011/12/28/why-not-and-or-and-not/.
> 
> With appropriate parentheses it can give the same results as you're
> discovering.
> 
> Best
> Erick
> 
>> On Thu, Aug 10, 2017 at 3:00 PM, OTH  wrote:
>> Thanks for the help!
>> That's resolved the issue.
>> 
>> On Fri, Aug 11, 2017 at 1:48 AM, David Hastings <
>> hastings.recurs...@gmail.com> wrote:
>> 
>>> type:value AND (name:america^1+name:state^1+name:united^1)
>>> 
>>> but in reality what you want to do is use the fq parameter with type:value
>>> 
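>>> i.e. something along the lines of (untested):
>>> 
>>> select?q=name:america^1+name:state^1+name:united^1&fq=type:value
>>> 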
 On Thu, Aug 10, 2017 at 4:36 PM, OTH  wrote:
 
 Hello,
 
 I have the following use case:
 
 I have two fields (among others); one is 'name' and the other is 'type'.
 'Name' is the field I need to search, whereas, with 'type', I need to
>>> make
 sure that it has a certain value, depending on the situation.  Often,
>>> when
 I search the 'name' field, the search query would have multiple tokens.
 Furthermore, each query token needs to have a scoring weight attached to
 it.
 
 However, I'm unable to figure out the syntax which would allow all these
 things to happen.
 
 For example, if I use the following query:
 select?q=type:value+AND+name:america^1+name:state^1+name:united^1
 It would only return documents where 'name' includes the token 'america'
 (and where type==value).  It will totally ignore
 "+name:state^1+name:united^1", it seems.
 
 This does not happen if I omit "type:value+AND+".  So, with the following
 query:
 select?q=name:america^1+name:state^1+name:united^1
 It returns all documents which contain any of the three tokens {america,
 state, united}; which is what I need.  However, it also returns documents
 where type != value; which I can't have.
 
 If I put "type:value" at the end of the query command, like so:
 select?q=name:america^1+name:state^1+name:united^1+AND+type:value
 In this case, it will only return documents which contain the "united"
 token in the name field (and where type==value).  Again, it will totally
 ignore "name:america^1+name:state^1", it seems.
 
 I tried putting an "AND" between everything, like so:
 select?q=type:value+AND+name:america^1+AND+name:state^1+
>>> AND+name:united^1
 But this, of course, would only return documents which contain all the
 tokens {america, state, united}; whereas I need all documents which
>>> contain
 any of those tokens.
 
 
 If anyone could help me out with how this could be done / what the
>>> correct
 syntax would be, that would be a huge help.
 
 Much thanks
 Omer
 
>>> 


Re: Fetch a binary field

2017-08-11 Thread Dave
Why didn't you set it to be indexed? Sure, it would make a small dent in the index size.

> On Aug 11, 2017, at 5:20 PM, Barbet Alain  wrote:
> 
> Re,
> I take a look on the source code where this msg happen
> https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/schema/SchemaField.java#L186
> 
> I use version 6.5 who differ from master.
> In 6.5:
> if (! (indexed() || hasDocValues()) ) {
> As my field is not indexed (and hasDocValues() is false as binary),
> this test fail.
> Ok but so, what is the way to get this field as I can see it in Luke
> (in hex format) ?
> 
> 
> 2017-08-11 15:41 GMT+02:00 Barbet Alain :
>> Hi !
>> 
>> 
>> I've a Lucene base coming from a C++ program linked with Lucene++, a
>> port of Lucene 3.5.9. When I open this base with Luke, it show Lucene
>> 2.9. Can see a binary field I have in Luke, with data encoded in
>> base64.
>> 
>> 
>> I have upgrade this base from 2.9 => 4.0 => 5.0 =>6.0 so I can use it
>> with Solr 6.5.1. I rebuild a schema for this base, who have one field
>> in binary:
>> 
>> 
>> 
>> 
>> 
>> When I try to retrieve this field with a query (with SOLR admin
>> interface) it's fail with "can not use FieldCache on a field which is
>> neither indexed nor has doc values: document"
>> 
>> 
>> I can retrieve others fields, but can't find a way for this one. Does
>> someone have an idea ? (At the end I want do this with php, but as it
>> fail with Solr interface too ).
>> 
>> Thank you for any help !


Re: MongoDb vs Solr

2017-08-12 Thread Dave
Personally I say use an RDBMS for data storage; it's what it's for. Solr is for 
search and retrieval, at the expense of possibly losing all the data, in which case 
you rebuild it. 

> On Aug 12, 2017, at 11:26 AM, Muwonge Ronald  wrote:
> 
> Hi Solr can use mongodb for storage and you can play with the data as it
> grows depending on your data goals.Ease of learning doesn't mean
> happiness.I recommend you use both for serious projects that won't collapse
> soon.
> Ronny
>> On 5 Aug 2017 02:16, "Francesco Viscomi"  wrote:
>> 
>> Hi all,
>> why i have to choose solr if mongoDb is easier to learn and to use?
>> Both are NoSql database, is there a good reason to chose solr and not
>> mongoDb?
>> 
>> thanks really much
>> 
>> --
>> Ing. Viscomi Francesco
>> 


Re: Different order of docs between SOLR-4.10.4 to SOLR-6.5.1

2017-08-13 Thread Dave
Rebuild your index. It's just the safest way. 

On Aug 13, 2017, at 2:02 PM, SOLR4189  wrote:

>> If you are changing things like WordDelimiterFilterFactory to the graph 
>> version, you'll definitely want to reindex
> 
> What does it mean "*want to reindex*"? If I change
> WordDelimiterFilterFactory to the graph and use IndexUpgrader is it mistake?
> Or changes will not be affected only?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Different-order-of-docs-between-SOLR-4-10-4-to-SOLR-6-5-1-tp4349021p4350413.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: query with wild card with AND taking lot of time

2017-09-03 Thread Dave
My other concern would be your p's and q's. If you start mixing in Boolean 
logic, given Solr's weak respect for it, it could be unpredictable. 

> On Sep 3, 2017, at 5:43 PM, Phil Scadden  wrote:
> 
> 5 seems a reasonable limit to me. After that revert to slow.
> 
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, 2 September 2017 12:01 p.m.
> To: solr-user 
> Subject: Re: query with wild card with AND taking lot of time
> 
> How far would you take that? Say you had 100 terms joined by AND (ridiculous 
> I know, just sayin' ). Then you'd chew up 100 entries in the filterCache.
> 
>> On Fri, Sep 1, 2017 at 4:24 PM, Walter Underwood  
>> wrote:
>> Hmm. Solr really should convert an fq of “a AND b” to separate “a” and “b” 
>> fq filters. That should be a simple special-case rewrite. It might take less 
>> time to implement than explaining it to everyone.
>> 
>> Well, I guess then we’d have to explain how it wasn’t really necessary
>> to send separate fq params…
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Sep 1, 2017, at 2:01 PM, Erick Erickson  wrote:
>>> 
>>> Shawn:
>>> 
>>> See: https://issues.apache.org/jira/browse/SOLR-7219
>>> 
>>> Try fq=filter(foo) filter(bar) filter(baz)
>>> 
>>> Patches to docs welcome ;)
>>> 
 On Fri, Sep 1, 2017 at 1:50 PM, Shawn Heisey  wrote:
> On 9/1/2017 8:13 AM, Alexandre Rafalovitch wrote:
> You can OR cachable filter queries in the latest Solr. There is a
> special
> (filter) syntax for that.
 
 This is actually possible?  If so, I didn't see anything come across
 the dev list about it.
 
 I opened an issue for it, didn't know anything had been implemented.
 After I opened the issue, I discovered that I was merely the latest
 to do so, it had been requested before.
 
 Can you point to the relevant part of the reference guide and the
 Jira issue where the change was committed?
 
 Thanks,
 Shawn
 
>> 
> Notice: This email and any attachments are confidential and may not be used, 
> published or redistributed without the prior written consent of the Institute 
> of Geological and Nuclear Sciences Limited (GNS Science). If received in 
> error please destroy and immediately notify GNS Science. Do not copy or 
> disclose the contents.


Re: Performance Test

2017-09-04 Thread Dave
Get the raw logs from normal use, script out something to replay the searches, 
and have it fork to as many cores as the Solr server has. That's what I'd 
do. 
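
As a rough sketch (assuming you've already pulled the query URLs out of the logs 
into queries.txt, one full URL per line; the file name and core count are just 
examples):

  xargs -P 8 -n 1 curl -s -o /dev/null < queries.txt

with -P set to however many cores the box has.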



> On Sep 4, 2017, at 5:26 AM, Daniel Ortega  wrote:
> 
> I would recommend you Solrmeter cloud
> 
> This fork supports solr cloud:
> https://github.com/idealista/solrmeter/blob/master/README.md
> 
> Disclaimer: This fork was developed by idealista, the company where I work
> 
> El El lun, 4 sept 2017 a las 11:18, Selvam Raman 
> escribió:
> 
>> Hi All,
>> 
>> which is the best tool for solr perfomance test. I want to identify how
>> much load my solr could handle and how many concurrent users can query on
>> solr.
>> 
>> Please suggest.
>> 
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>> 


Re: length of indexed value

2017-10-03 Thread Dave
I’d personally use your second option. Simple and straightforward if you can 
afford the time for a reindex 
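
Roughly, something like this in the schema (field and type names here are only 
illustrative; use whatever int type your configset defines), with your indexing 
code filling in the value:

  <field name="name_length" type="pint" indexed="true" stored="true"/>

and then filter with fq=name_length:[5 TO *] or whatever range you need.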

> On Oct 3, 2017, at 6:23 PM, John Blythe  wrote:
> 
> hey all.
> 
> was hoping to find a query function that would allow me to filter based on
> the length of an indexed value. only things i could find/think of would be
> one of two things:
> 
> - use the strdist function to see how different things are (instead of
> comparing specific lengths)
> - create a new field to be indexed with the length of what i know will end
> up being the indexed value's length
> 
> am i missing out on an easier, more straight forward solution?
> 
> thanks!
> 
> --
> John Blythe


Re: Semantic Knowledge Graph

2017-10-09 Thread Dave
Thanks Trey. Also thanks for the presentation. It was for me the best one I 
attended. Really looking forward to experimenting with it. Are there any plans 
of it getting into the core distribution?

> On Oct 9, 2017, at 12:30 PM, Trey Grainger  wrote:
> 
> Hi David, that's my fault. I need to do a final proofread through them
> before they get posted (and may have to push one quick code change, as
> well). I'll try to get that done within the next few days.
> 
> All the best,
> 
> Trey Grainger
> SVP of Engineering @ Lucidworks
> Co-author, Solr in Action <http://solrinaction.com>
> http://www.treygrainger.com
> 
> 
> On Mon, Oct 9, 2017 at 10:14 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> 
>> Hey All, slides form the 2017 lucene revolution were put up recently, but
>> unfortunately, the one I have the most interest in, the semantic knowledge
>> graph, have not been put up:
>> 
>> https://lucenesolrrevolution2017.sched.com/event/BAwX/the-
>> apache-solr-semantic-knowledge-graph?iframe=no&w=100%&sidebar=yes&bg=no
>> 
>> 
>> dont suppose any one knows where i may be able to find them, or point me in
>> a direction to get more information about this tool.
>> 
>> Thanks - dave
>> 


Re: Scaling issue with Solr

2017-12-27 Thread Dave
You may find that buying some more memory will be your best bang for the buck 
in your setup. 32-64 GB isn't expensive. 

> On Dec 27, 2017, at 6:57 PM, Suresh Pendap  wrote:
> 
> What is the downside of configuring ramBufferSizeMB to be equal to 5GB ?
> Is it only that the window of time for flush is larger, so recovery time will 
> be higher in case of a crash?
> 
> Thanks
> Suresh
> 
> On 12/27/17, 1:34 PM, "Erick Erickson"  wrote:
> 
>You are probably hitting more and more background merging which will
>slow things down. Your system looks to be severely undersized for this
>scale.
> 
>One thing you can try (and I emphasize I haven't prototyped this) is
>to increase your RamBufferSizeMB solrcofnig.xml setting significantly.
>By default, Solr won't merge segments to greater than 5G, so
>theoretically you could just set your ramBufferSizeMB to that figure
>and avoid merging all together. Or you could try configuring the
>NoMergePolicy in solrconfig.xml (but beware that you're going to
>create a lot of segments unless you set the rambuffersize higher).
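> 
> e.g. in the indexConfig section of solrconfig.xml (the value here is purely
> an illustration):
> 
>   <ramBufferSizeMB>5120</ramBufferSizeMB>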
> 
>How this will affect your indexing throughput I frankly have no data.
>You can see that with numbers like this, though, a 4G heap is much too
>small.
> 
>Best,
>Erick
> 
>On Wed, Dec 27, 2017 at 2:18 AM, Prasad Tendulkar
> wrote:
>> Hello All,
>> 
>> We have been building a Solr based solution to hold a large amount of data 
>> (approx 4 TB/day or > 24 Billion documents per day). We are developing a 
>> prototype on a small scale just to evaluate Solr performance gradually. Here 
>> is our setup configuration.
>> 
>> Solr cloud:
>> node1: 16 GB RAM, 8 Core CPU, 1TB disk
>> node2: 16 GB RAM, 8 Core CPU, 1TB disk
>> 
>> Zookeeper is also installed on above 2 machines in cluster mode.
>> Solr commit intervals: Soft commit 3 minutes, Hard commit 15 seconds
>> Schema: Basic configuration. 5 fields indexed (out of one is text_general), 
>> 6 fields stored.
>> Collection: 12 shards (6 per node)
>> Heap memory: 4 GB per node
>> Disk cache: 12 GB per node
>> Document is a syslog message.
>> 
>> Documents are being ingested into Solr from different nodes. 12 SolrJ 
>> clients ingest data into the Solr cloud.
>> 
>> We are experiencing issues when we keep the setup running for long time and 
>> after processing around 100 GB of index size (I.e. Around 600 Million 
>> documents). Note that we are only indexing the data and not querying it. So 
>> there should not be any query overhead. From the VM analysis we figured out 
>> that over time the disk operations starts declining and so does the CPU, RAM 
>> and Network usage of the Solr nodes. We concluded that Solr is unable to 
>> handle one big collection due to index read/write overhead and most of the 
>> time it ends up doing only the commit (evident in Solr logs). And because of 
>> that indexing is getting hampered (?)
>> 
>> So we thought of creating small sized collections instead of one big 
>> collection anticipating the commit performance might improve. But eventually 
>> the performance degrades even with that and we observe more or less similar 
>> charts for CPU, memory, disk and network.
>> 
>> To put forth some stats here are the number of documents processed every hour
>> 
>> 1St hour: 250 million
>> 2nd hour: 250 million
>> 3rd hour: 240 million
>> 4th hour: 200 million
>> .
>> .
>> 11th hour: 80 million
>> 
>> Could you please help us identifying the root cause of degradation in the 
>> performance? Are we doing something wrong with the Solr configuration or the 
>> collections/sharding etc? Due to this performance degradation we are 
>> currently stuck with Solr.
>> 
>> Thank you very much in advance.
>> 
>> Prasad Tendulkar
>> 
>> 
> 
> 
> 


Re: Solr Data Import Handler

2017-02-12 Thread Dave
That sounds pretty much like a hack. So if two imports happen at the same time 
they have to wait for each other?

> On Feb 12, 2017, at 4:01 PM, Shawn Heisey  wrote:
> 
>> On 2/12/2017 10:30 AM, Minh wrote:
>> Hi everyone,
>> How can i run multithreads of DIH in a cluster for a collection?
> 
> The DIH handler is single-threaded.  It used to have a config option for
> multiple threads, but it was removed since it didn't actually work.
> 
> If you create multiple DIH handlers and start an import on them all at
> the same time, then you'll have multiple threads.
> 
> Thanks,
> Shawn
> 


Re: Issues with Solr Morphline reading RFC822 files

2017-02-13 Thread Dave
Can't see what's color coded in the email. 

> On Feb 13, 2017, at 5:35 PM, Anatharaman, Srinatha (Contractor) 
>  wrote:
> 
> Hi,
> 
> I am loading email files which are in RFC822 format into SolrCloud using Flume
> But some meta data of the emails is not getting loaded to Solr.
> Please find below sample email, text which is colored in Bold Red is ignored 
> by Solr
> I can read this files ONLY using org.apache.tika.parser.mail.RFC822Parser 
> Parser, If I want to read it using TXTparser Solr ignores the files with 
> error "No supported MIME type found for _attachment_mimetype=message/rfc822"
> 
> How do I overcome this issue? I want to read the emails files without losing 
> single word from the file
> 
> Received: from resqmta-po-08v.sys..net ([196.114.154.167])
>by csp-imta02.westchester.pa.bo..net with bizsmtp
>id EClZ1u0013cy81c01E9enp; Wed, 30 Nov 2016 14:09:38 +
> Received: from resimta-po-14v.sys. .net ([96.114.154.142])
>by resqmta-po-08v.sys..net with SMTP
>id C5ZqcRB3e2dNjC5ZqcQvHl; Wed, 30 Nov 2016 14:09:38 +
> Received: from outgoingemail1.digitalrightscorp.com ([69.36.73.150])
>by resimta-po-14v.sys..net with SMTP
>id C5ZNcJfg9npCYC5Zcceh9K; Wed, 30 Nov 2016 14:09:25 +
> X-Xfinity-Message-Heuristics: IPv6:N;TLS=0;SPF=0;DMARC=
> Received: from outgoingemail1-69-150 (localhost [127.0.0.1])
>by outgoingemail1. XRightsCorp.com (Postfix) with ESMTP id 
> 15EB7100419
>for ; Wed, 30 Nov 2016 06:05:52 -0800 (PST)
> From: a...@xrightscorp.com
> To: d...@.net
> Message-ID: <551271522.6.1480514752082.JavaMail.root@outgoingemail1-69-150>
> Subject: Unauthorized Use of Copyrights RE:
> TC-cc0ae97d-8918-4a4b-8515-749ff9303bc0
> MIME-Version: 1.0
> Content-Type: text/plain; charset=us-ascii
> Content-Transfer-Encoding: 7bit
> Date: Wed, 30 Nov 2016 06:05:52 -0800 (PST)
> X-CMAE-Envelope: 
> MS4wfAIoEnMl1VVV7nPS/7pis5Gr/ijSjTNaioaGiZVCAo4cXRoeTl9Z1Nt8SYSY4kX7RpDlZuxzGbzyeRDJIorfdeodi9fzNtQETs56Or8SwlysmgQQQt4R
> kKDdiZaRx3Q0be579K6C4XZGyRC6JMDzDi1X6bXgBL8KYDFFA/aEyOBd+2Zrz1YpOi2aTjzyRc4d4MXJwaIGivtlXtZc6R5KypOhVP6eX1kx/qV9OwVzXAz6
> 
> **NOTE TO ISP: PLEASE FORWARD THE ENTIRE NOTICE***
> 
> Re: Unauthorized Use of Copyrights Owned Exclusively by The Bicycle Music 
> Company
> 
> Reference#: ZBP96D4  IP Address: 73.166.122.44
> 
> Dear Sir or Madam:
> .
> .
> .
> .
> .
> .
> 
> 
> Regards,
> ~Sri


Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Dave
B is a better option long term. Solr is meant for retrieving flat data, fast, 
not hierarchical data. That's what a database is for, and trust me, you would rather 
have a real database at the end point. Each tool has a purpose: Solr can never 
replace a relational database, and a relational database could not replace 
Solr. Start with the slow model (database) for control/display and enhance with 
the fast model (Solr) for retrieval/search. 



> On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
> 
> To learn how to properly use Solr, I'm building a little experimental
> project with it to search for used car listings.
> 
> Car listings appear on a variety of different places ... central places
> Craigslist and also many many individual Used Car dealership websites.
> 
> I am wondering, should I:
> 
> (a) deploy a Solr search engine and build individual indexers for every
> type of web site I want to find listings on?
> 
> or
> 
> (b) build my own database to store car listings, and then build services
> that scrape data from different sites and feed entries into the database;
> then point my Solr search to my database, one simple source of listings?
> 
> My concerns are:
> 
> With (a) ... I have to be smart enough to understand all those different
> data sources and remove/update listings when they change; while this be
> harder to do with custom Solr indexers than writing something from scratch?
> 
> With (b) ... I'm maintaining a huge database of all my listings which seems
> redundant; google doesn't make a *copy* of everything on the internet, it
> just knows it's there.  Is maintaining my own database a bad design?
> 
> Thanks for reading!


Re: Question about best way to architect a Solr application with many data sources

2017-02-21 Thread Dave
Ha, I think I went to one of your training seminars in NYC maybe 4 years ago, 
Erick. I'm going to have to respectfully disagree about the RDBMS. It's such a 
well-known data format that you could hire a high school programmer to help with 
the DB end if you knew how to flatten it to Solr. Besides, it's easy to 
visualize and interact with the data before it goes to Solr. A JSON/NoSQL 
format would work just as well, but I really think a database has its place in 
a scenario like this. 

> On Feb 21, 2017, at 8:20 PM, Erick Erickson  wrote:
> 
> I'll add that I _guarantee_ you'll want to re-index the data as you
> change your schema
> and the like. You'll be able to do that much more quickly if the data
> is stored locally somehow.
> 
> A RDBMS is not necessary however. You could simply store the data on
> disk in some format
> you could re-read and send to Solr.
> 
> Best,
> Erick
> 
>> On Tue, Feb 21, 2017 at 5:17 PM, Dave  wrote:
>> B is a better option long term. Solr is meant for retrieving flat data, 
>> fast, not hierarchical. That's what a database is for and trust me you would 
>> rather have a real database on the end point.  Each tool has a purpose, solr 
>> can never replace a relational database, and a relational database could not 
>> replace solr. Start with the slow model (database) for control/display and 
>> enhance with the fast model (solr) for retrieval/search
>> 
>> 
>> 
>>> On Feb 21, 2017, at 7:57 PM, Robert Hume  wrote:
>>> 
>>> To learn how to properly use Solr, I'm building a little experimental
>>> project with it to search for used car listings.
>>> 
>>> Car listings appear on a variety of different places ... central places
>>> Craigslist and also many many individual Used Car dealership websites.
>>> 
>>> I am wondering, should I:
>>> 
>>> (a) deploy a Solr search engine and build individual indexers for every
>>> type of web site I want to find listings on?
>>> 
>>> or
>>> 
>>> (b) build my own database to store car listings, and then build services
>>> that scrape data from different sites and feed entries into the database;
>>> then point my Solr search to my database, one simple source of listings?
>>> 
>>> My concerns are:
>>> 
>>> With (a) ... I have to be smart enough to understand all those different
>>> data sources and remove/update listings when they change; while this be
>>> harder to do with custom Solr indexers than writing something from scratch?
>>> 
>>> With (b) ... I'm maintaining a huge database of all my listings which seems
>>> redundant; google doesn't make a *copy* of everything on the internet, it
>>> just knows it's there.  Is maintaining my own database a bad design?
>>> 
>>> Thanks for reading!


Re: solr warning - filling logs

2017-02-26 Thread Dave
You shouldn't use the embedded ZooKeeper with Solr; it's just for development 
and not anywhere near worthy of being in production. Otherwise, it looks like 
you may have a port scanner running. In any case, don't use the ZK that comes 
with Solr.
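
For production you'd run an external ensemble and point Solr at it with something 
like this (hostnames are only placeholders):

  bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181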

> On Feb 26, 2017, at 6:52 PM, Satya Marivada  wrote:
> 
> Hi All,
> 
> I have configured solr with SSL and enabled http authentication. It is all
> working fine on the solr admin page, indexing and querying process. One
> bothering thing is that it is filling up logs every second saying no
> authority, I have configured host name, port and authentication parameters
> right in all config files. Not sure, where is it coming from. Any
> suggestions, please. Really appreciate it. It is with sol-6.3.0 cloud with
> embedded zookeeper. Could it be some bug with solr-6.3.0 or am I missing
> some configuration?
> 
> 2017-02-26 23:32:43.660 WARN (qtp606548741-18) [c:plog s:shard1
> r:core_node2 x:plog_shard1_replica1] o.e.j.h.HttpParser parse exception:
> java.lang.IllegalArgumentException: No Authority for
> HttpChannelOverHttp@6dac689d{r=0,c=false,a=IDLE,uri=null}
> java.lang.IllegalArgumentException: No Authority
> at
> org.eclipse.jetty.http.HostPortHttpField.(HostPortHttpField.java:43)
> at org.eclipse.jetty.http.HttpParser.parsedHeader(HttpParser.java:877)
> at org.eclipse.jetty.http.HttpParser.parseHeaders(HttpParser.java:1050)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1266)
> at
> org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:344)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:227)
> at org.eclipse.jetty.io
> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:186)
> at org.eclipse.jetty.io
> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at org.eclipse.jetty.io
> .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
> at java.lang.Thread.run(Thread.java:745)


Re: solr warning - filling logs

2017-02-26 Thread Dave
I don't know about your network setup, but a port scanner can sometimes be an IT 
security device that, well, scans ports looking to see if they're open. 

> On Feb 26, 2017, at 7:14 PM, Satya Marivada  wrote:
> 
> May I ask about the port scanner running? Can you please elaborate?
> Sure, will try to move out to external zookeeper
> 
>> On Sun, Feb 26, 2017 at 7:07 PM Dave  wrote:
>> 
>> You shouldn't use the embedded zookeeper with solr, it's just for
>> development not anywhere near worthy of being out in production. Otherwise
>> it looks like you may have a port scanner running. In any case don't use
>> the zk that comes with solr
>> 
>>> On Feb 26, 2017, at 6:52 PM, Satya Marivada 
>> wrote:
>>> 
>>> Hi All,
>>> 
>>> I have configured solr with SSL and enabled http authentication. It is
>> all
>>> working fine on the solr admin page, indexing and querying process. One
>>> bothering thing is that it is filling up logs every second saying no
>>> authority, I have configured host name, port and authentication
>> parameters
>>> right in all config files. Not sure, where is it coming from. Any
>>> suggestions, please. Really appreciate it. It is with sol-6.3.0 cloud
>> with
>>> embedded zookeeper. Could it be some bug with solr-6.3.0 or am I missing
>>> some configuration?
>>> 
>>> 2017-02-26 23:32:43.660 WARN (qtp606548741-18) [c:plog s:shard1
>>> r:core_node2 x:plog_shard1_replica1] o.e.j.h.HttpParser parse exception:
>>> java.lang.IllegalArgumentException: No Authority for
>>> HttpChannelOverHttp@6dac689d{r=0,c=false,a=IDLE,uri=null}
>>> java.lang.IllegalArgumentException: No Authority
>>> at
>>> 
>> org.eclipse.jetty.http.HostPortHttpField.(HostPortHttpField.java:43)
>>> at org.eclipse.jetty.http.HttpParser.parsedHeader(HttpParser.java:877)
>>> at org.eclipse.jetty.http.HttpParser.parseHeaders(HttpParser.java:1050)
>>> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1266)
>>> at
>>> 
>> org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:344)
>>> at
>>> 
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:227)
>>> at org.eclipse.jetty.io
>>> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
>>> at
>> org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:186)
>>> at org.eclipse.jetty.io
>>> .AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
>>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
>>> at org.eclipse.jetty.io
>>> .SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>>> at
>>> 
>> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
>>> at
>>> 
>> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
>>> at
>>> 
>> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
>>> at
>>> 
>> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
>>> at java.lang.Thread.run(Thread.java:745)
>> 


Re: SOLR JOIN

2017-02-28 Thread Dave
That seems difficult if not impossible. The joins are just complex queries, 
with the same data set. 

> On Feb 28, 2017, at 11:37 PM, Nitin Kumar  wrote:
> 
> Hi,
> 
> Can we use join query for more than 2 cores in solr. If yes, please provide
> reference or example.
> 
> Thanks,
> Nitin


Re: Facet? Search problem

2017-03-13 Thread Dave
Perhaps look into grouping on that field. 

> On Mar 13, 2017, at 9:08 PM, Scott Smith  wrote:
> 
> I'm trying to solve a search problem and wondering if facets (or something 
> else) might solve the problem.
> 
> Let's assume I have a bunch of documents (100 million+).  Each document has a 
> category (keyword) assigned to it.  A single document my only have one 
> category, but there may be multiple documents with the same category (1 to a 
> few hundred documents may be in any one category).  There are several million 
> categories.
> 
> Supposed I'm doing a search with a page size of 50.  What I want to do is do 
> a search (e.g., "dog") and get back the top 50 documents that match the 
> contain the word "dog" and are all in different categories.  So, there needs 
> to be one document from 50 different categories.
> 
> If that's not possible, then is it possible to do it if I know the 50 
> categories up-front and hand that off as part of the search (so "find 50 
> documents that match the term 'dog' and there is one document from each of 50 
> specified categories").
> 
> Is there a way to do this?
> 
> I'm not extremely knowledgeable about facets, but thought that might be a 
> solution.  But, it doesn't have to be facets.
> 
> Thanks for any help
> 
> Scott
> 
> 


Re: Facet? Search problem

2017-03-13 Thread Dave
https://wiki.apache.org/solr/FieldCollapsing
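
Roughly (untested, and assuming the category field is literally named "category"):

  /select?q=dog&group=true&group.field=category&group.limit=1&rows=50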

> On Mar 13, 2017, at 9:59 PM, Dave  wrote:
> 
> Perhaps look into grouping on that field. 
> 
>> On Mar 13, 2017, at 9:08 PM, Scott Smith  wrote:
>> 
>> I'm trying to solve a search problem and wondering if facets (or something 
>> else) might solve the problem.
>> 
>> Let's assume I have a bunch of documents (100 million+).  Each document has 
>> a category (keyword) assigned to it.  A single document my only have one 
>> category, but there may be multiple documents with the same category (1 to a 
>> few hundred documents may be in any one category).  There are several 
>> million categories.
>> 
>> Supposed I'm doing a search with a page size of 50.  What I want to do is do 
>> a search (e.g., "dog") and get back the top 50 documents that match the 
>> contain the word "dog" and are all in different categories.  So, there needs 
>> to be one document from 50 different categories.
>> 
>> If that's not possible, then is it possible to do it if I know the 50 
>> categories up-front and hand that off as part of the search (so "find 50 
>> documents that match the term 'dog' and there is one document from each of 
>> 50 specified categories").
>> 
>> Is there a way to do this?
>> 
>> I'm not extremely knowledgeable about facets, but thought that might be a 
>> solution.  But, it doesn't have to be facets.
>> 
>> Thanks for any help
>> 
>> Scott
>> 
>> 


Re: Phrase Fields performance

2017-04-01 Thread Dave
Maybe CommonGrams could help this, but it boils down to speed/quality/cheap: 
choose two. Thanks.
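
If you try that route, it's roughly a matter of adding something like this to the 
index-time analysis chain (the words file name is just a placeholder):

  <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>

with solr.CommonGramsQueryFilterFactory on the query side, and then reindexing.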

> On Apr 1, 2017, at 10:28 AM, Shawn Heisey  wrote:
> 
>> On 3/31/2017 1:55 PM, David Hastings wrote:
>> So I un-commented out the line, to enable it to go against 6 important
>> fields. Afterwards through monitoring performance I noticed that my
>> searches were taking roughly 50% to 100% (2x!) longer, and it started
>> at the exact time I committed that change, 1:40 pm, qtimes below in a
>> 15 minute average cycle with the start time listed. 
> 
> That is fully expected.  Using both pf and qf basically has Solr doing
> the exact same queries twice, once as specified on fields in qf, then
> again as a phrase query on fields in pf.  If you add pf2 and/or pf3, you
> can expect further speed drops.
> 
> If you're sorting by relevancy, using pf with higher boosts than qf
> generally will make your results better, but it comes at a cost in
> performance.
> 
> Thanks,
> Shawn
> 


Re: Filter Facet Query

2017-04-17 Thread Dave
facet.mincount is what you're looking for to drop the zero-count facets.
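
i.e. something like:

  /select?facet.field=research&facet=on&facet.mincount=1&q=content:test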

> On Apr 17, 2017, at 6:51 PM, Furkan KAMACI  wrote:
> 
> My query:
> 
> /select?facet.field=research&facet=on&q=content:test
> 
> Q1) Facet returns research values with 0 counts which has a research value
> that is not from a document matched by main query (content:test). Is that
> usual?
> 
> Q2) I want to filter out research values with empty string ("") from facet
> result. How can I do that?
> 
> Kind Regards,
> Furkan KAMACI


Re: SOLR as nosql database store

2017-05-08 Thread Dave
You will want to have both Solr and a SQL/NoSQL data storage option. They serve 
different purposes. 


> On May 8, 2017, at 10:43 PM, bharath.mvkumar  
> wrote:
> 
> Hi All,
> 
> We have a use case where we have mysql database which stores documents and
> also some of the fields in the document is also indexed in solr. 
> We plan to move all those documents to solr by making solr as the nosql
> datastore for storing those documents. The reason we plan to do this is
> because we have to support cross center data replication for both mysql and
> solr and we are in a way duplicating the same data.The number of writes we
> do per second is around 10,000. Also currently we have only one shard and we
> have around 70 million records and we plan to support close to 1 billion
> records and also perform sharding.
> 
> Using solr as the nosql database is a good choice or should we look at
> Cassandra for our use case? 
> 
> Thanks,
> Bharath Kumar
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SOLR-as-nosql-database-store-tp4334119.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Best practices for backup & restore

2017-05-16 Thread Dave
I think it depends on what you are backing up and restoring from. Hardware 
failure? Accidental delete? For my use case, my master indexer stores the index 
on a SAN with daily snapshots for reliability; my live searching master 
is on a SAN as well, and my live slave searchers are all on SSD drives for speed. 
In my situation that means the test index is backed up daily, a copy of the 
live index is backed up daily, and the SSDs can die and it doesn't matter to 
me. I don't think there is a best practice; just figure out how risk averse you are 
and how much performance you require.

> On May 16, 2017, at 6:38 PM, Jay Potharaju  wrote:
> 
> Hi,
> I was wondering if there are any best practices for doing solr backup &
> restore. In the past when running backup, I stopped indexing during the
> backup process.
> 
> I am looking at this documentation and it says that indexing can continue
> when backup is in progress.
> https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups
> 
> Any recommendations ?
> 
> -- 
> Thanks
> Jay


Re: Solr in NAS or Network Shared Drive

2017-05-26 Thread Dave
This could be useful in a space-constrained situation, although the reason I 
wanted to try it was multiple Solr instances on one server reading one index on 
the SSD. This use case, with the index on NFS, still leads to a single-point-of-failure 
situation on one of the most fragile parts of a server, the disk, on 
one machine. So if the NFS master gets corrupted, then all clients are dead, rather 
than the slaves each having their own copy of the index. 





> On May 26, 2017, at 5:37 PM, Florian Gleixner  wrote:
> 
> 
> Just tested: if file metadata (last change time, access permissions ...)
> on NFS storage change, then all NFS clients invalidate the memory cache
> of the file completely.
> So, if your index does not get changed, caching is good on readonly
> slaves - the NFS client queries only file metadata sometimes.
> But if yout index changes, all affected files have to be read again from
> NFS. You can try this by "touching" the files.
> 
> fincore from linux ftools can be used to view the file caching status.
> 
> "touching" a file on a local mount does not invalidate the memory cache.
> The kernel knows, that no file data have been changed.
> 
> 
>> On 26.05.2017 19:53, Robert Haschart wrote:
>> 
>> The individual servers cannot do a merge on their own, since they mount
>> the NAS read-only.   Nothing they can do will affect the index.  I
>> believe this allows each machine to cache much of the index in memory,
>> with no fear that their cache will be made invalid by one of the others.
>> 
>> -Bob Haschart
>> University of Virginia Library
>> 
> 
> 
> 


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
If you are not capable of even writing your own indexing code, let alone a 
crawler, I would prefer that you just stop now. No one is going to help you 
with this request, at least I'd hope not. 

> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt since I can not index some big sites. Or if someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index the site I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.


Re: Solr Web Crawler - Robots.txt

2017-06-01 Thread Dave
And I mean that in the context of stealing content from sites that explicitly 
declare they don't want to be crawled. Robots.txt is to be followed. 

> On Jun 1, 2017, at 5:31 PM, David Choi  wrote:
> 
> Hello,
> 
>   I was wondering if anyone could guide me on how to crawl the web and
> ignore the robots.txt since I can not index some big sites. Or if someone
> could point how to get around it. I read somewhere about a
> protocol.plugin.check.robots
> but that was for nutch.
> 
> The way I index is
> bin/post -c gettingstarted https://en.wikipedia.org/
> 
> but I can't index the site I'm guessing because of the robots.txt.
> I can index with
> bin/post -c gettingstarted http://lucene.apache.org/solr
> 
> which I am guessing allows it. I was also wondering how to find the name of
> the crawler bin/post uses.


Using SOLR Autocomplete for addresses (i.e. multiple terms)

2012-01-02 Thread Dave
Hi,

I'm new to SOLR, but I've got it up and running, indexing data via the DIH,
and properly returning results for queries. I'm trying to setup another
core to run suggester, in order to autocomplete geographical locations. We
have a web application that needs to take a city, state / region, country
input. We'd like to do this in a single entry box. Here are some examples:

Brooklyn, New York, United States of America
Philadelphia, Pennsylvania, United States of America
Barcelona, Catalunya, Spain

Assume for now that every location around the world can be split into this
3-form input. I've setup my DIH to create a TemplateTransformer field that
combines the 4 tables (city, state and country are all independent tables
connected to each other by a master places table) into a field called
"fullplacename":



I've defined a "text_auto" field in schema.xml:








and have defined these two fields as well:




Now, here's my problem. This works fine for the first term, i.e. if I type
"brooklyn" I get the results I'd expect, using this URL to query:

http://localhost:8983/solr/places/suggest?q=brooklyn

However, as soon as I put a comma and/or a space in there, it breaks them
up into 2 suggestions, and I get a suggestion for each:

http://localhost:8983/solr/places/suggest?q=brooklyn%2C%20ny

Gives me a suggestion for "brooklyn" and a suggestion for "ny" instead of a
suggestion that matches "brooklyn, ny". I've tried every solution I can
find via google and haven't had any luck. Is there something simple that
I've missed, or is this the wrong approach?

Just in case, here's the searchComponent and requestHandler definition:



<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">name_autocomplete</str>
  </lst>
</searchComponent>



Thanks for any assistance!


Using SOLR Autocomplete for addresses (i.e. multiple terms)

2012-01-02 Thread Dave
Hi,

I'm reposting my StackOverflow question to this thread as I'm not getting
much of a response there. Thank you for any assistance you can provide!

http://stackoverflow.com/questions/8705600/using-solr-autocomplete-for-addresses

I'm new to SOLR, but I've got it up and running, indexing data via the DIH,
and properly returning results for queries. I'm trying to setup another
core to run suggester, in order to autocomplete geographical locations. We
have a web application that needs to take a city, state / region, country
input. We'd like to do this in a single entry box. Here are some examples:

Brooklyn, New York, United States of America
Philadelphia, Pennsylvania, United States of America
Barcelona, Catalunya, Spain

Assume for now that every location around the world can be split into this
3-form input. I've setup my DIH to create a TemplateTransformer field that
combines the 4 tables (city, state and country are all independent tables
connected to each other by a master places table) into a field called
"fullplacename":



I've defined a "text_auto" field in schema.xml:








and have defined these two fields as well:




Now, here's my problem. This works fine for the first term, i.e. if I type
"brooklyn" I get the results I'd expect, using this URL to query:

http://localhost:8983/solr/places/suggest?q=brooklyn

However, as soon as I put a comma and/or a space in there, it breaks them
up into 2 suggestions, and I get a suggestion for each:

http://localhost:8983/solr/places/suggest?q=brooklyn%2C%20ny

Gives me a suggestion for "brooklyn" and a suggestion for "ny" instead of a
suggestion that matches "brooklyn, ny". I've tried every solution I can
find via google and haven't had any luck. Is there something simple that
I've missed, or is this the wrong approach?

Just in case, here's the searchComponent and requestHandler definition:



<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">name_autocomplete</str>
  </lst>
</searchComponent>



Thanks for any assistance!


Re: Using SOLR Autocomplete for addresses (i.e. multiple terms)

2012-01-03 Thread Dave
Hi Jan,

Yes, I just saw the answer. I've implemented that, and it's working as
expected. I do have Suggest running on its own core, separate from my
standard search handler. I think, however, that the custom QueryConverter
that was linked to is now too restrictive. For example, it works perfectly
when someone enters "brooklyn, n", but if they start by entering "ny" or
"new york" it doesn't return anything. I think what you're talking about,
suggesting from whole input and individual tokens is the way to go. Is
there anything you can point me to as a starting point? I think I've got
the basic setup, but I'm not quite comfortable enough with SOLR and the
SOLR architecture yet (honestly I've only been using it for about 2 weeks
now).
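
For the whole-input half of that, here's a rough sketch of the kind of
QueryConverter I have in mind (the class and package names are made up, and it
doesn't handle the individual-token suggestions yet):

package com.example.solr;  // placeholder package

import java.util.Collection;
import java.util.Collections;

import org.apache.lucene.analysis.Token;
import org.apache.solr.spelling.QueryConverter;

/**
 * Keeps the entire user input (commas, spaces and all) as a single token,
 * so "brooklyn, ny" is looked up as one suggestion phrase instead of two.
 */
public class WholeInputQueryConverter extends QueryConverter {

  @Override
  public Collection<Token> convert(String original) {
    if (original == null || original.trim().length() == 0) {
      return Collections.<Token>emptyList();
    }
    // One token spanning the full input string.
    return Collections.singletonList(new Token(original, 0, original.length()));
  }
}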

Thanks for the help!

Dave

On Tue, Jan 3, 2012 at 8:24 AM, Jan Høydahl  wrote:

> Hi,
>
> As you see, you've got an answer at StackOverflow already with a proposed
> solution to implement your own QueryConverter.
>
> Another way is to create a Solr core solely for Suggest, and tune it
> exactly the way you like. Then you can have it suggest from the whole input
> as well as individual tokens and weigh these as you choose, as well as
> implement phonetic normalization and other useful tricks.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 3. jan. 2012, at 00:52, Dave wrote:
>
> > Hi,
> >
> > I'm reposting my StackOverflow question to this thread as I'm not getting
> > much of a response there. Thank you for any assistance you can provide!
> >
> >
> http://stackoverflow.com/questions/8705600/using-solr-autocomplete-for-addresses
> >
> > I'm new to SOLR, but I've got it up and running, indexing data via the
> DIH,
> > and properly returning results for queries. I'm trying to setup another
> > core to run suggester, in order to autocomplete geographical locations.
> We
> > have a web application that needs to take a city, state / region, country
> > input. We'd like to do this in a single entry box. Here are some
> examples:
> >
> > Brooklyn, New York, United States of America
> > Philadelphia, Pennsylvania, United States of America
> > Barcelona, Catalunya, Spain
> >
> > Assume for now that every location around the world can be split into
> this
> > 3-form input. I've setup my DIH to create a TemplateTransformer field
> that
> > combines the 4 tables (city, state and country are all independent tables
> > connected to each other by a master places table) into a field called
> > "fullplacename":
> >
> > 
> >
> > I've defined a "text_auto" field in schema.xml:
> >
> > 
> >
> >
> >
> >
> > 
> >
> > and have defined these two fields as well:
> >
> >  > stored="true" multiValued="true" />
> > 
> >
> > Now, here's my problem. This works fine for the first term, i.e. if I
> type
> > "brooklyn" I get the results I'd expect, using this URL to query:
> >
> > http://localhost:8983/solr/places/suggest?q=brooklyn
> >
> > However, as soon as I put a comma and/or a space in there, it breaks them
> > up into 2 suggestions, and I get a suggestion for each:
> >
> > http://localhost:8983/solr/places/suggest?q=brooklyn%2C%20ny
> >
> > Gives me a suggestion for "brooklyn" and a suggestion for "ny" instead
> of a
> > suggestion that matches "brooklyn, ny". I've tried every solution I can
> > find via google and haven't had any luck. Is there something simple that
> > I've missed, or is this the wrong approach?
> >
> > Just in case, here's the searchComponent and requestHandler definition:
> >
> >  > class="org.apache.solr.handler.component.SearchHandler">
> >
> >true
> >suggest
> >10
> >
> >
> >suggest
> >
> > 
> >
> > 
> >
> >suggest
> > name="classname">org.apache.solr.spelling.suggest.Suggester
> > name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup
> >name_autocomplete`
> >
> > 
> >
> > Thanks for any assistance!
>
>


Re: Using SOLR Autocomplete for addresses (i.e. multiple terms)

2012-01-03 Thread Dave
I've got another question for anyone that might have some insight - how do
you get all of your indexed information along with the suggestions? i.e. if
each suggestion has an ID# associated with it, do I have to then query for
that ID#, or is there some way of specifying a field list in the URL to the
suggester?

Thanks!
Dave

On Tue, Jan 3, 2012 at 9:41 AM, Dave  wrote:

> Hi Jan,
>
> Yes, I just saw the answer. I've implemented that, and it's working as
> expected. I do have Suggest running on its own core, separate from my
> standard search handler. I think, however, that the custom QueryConverter
> that was linked to is now too restrictive. For example, it works perfectly
> when someone enters "brooklyn, n", but if they start by entering "ny" or
> "new york" it doesn't return anything. I think what you're talking about,
> suggesting from whole input and individual tokens is the way to go. Is
> there anything you can point me to as a starting point? I think I've got
> the basic setup, but I'm not quite comfortable enough with SOLR and the
> SOLR architecture yet (honestly I've only been using it for about 2 weeks
> now).
>
> Thanks for the help!
>
> Dave
>
>
> On Tue, Jan 3, 2012 at 8:24 AM, Jan Høydahl  wrote:
>
>> Hi,
>>
>> As you see, you've got an answer at StackOverflow already with a proposed
>> solution to implement your own QueryConverter.
>>
>> Another way is to create a Solr core solely for Suggest, and tune it
>> exactly the way you like. Then you can have it suggest from the whole input
>> as well as individual tokens and weigh these as you choose, as well as
>> implement phonetic normalization and other useful tricks.
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
>> On 3. jan. 2012, at 00:52, Dave wrote:
>>
>> > Hi,
>> >
>> > I'm reposting my StackOverflow question to this thread as I'm not
>> getting
>> > much of a response there. Thank you for any assistance you can provide!
>> >
>> >
>> http://stackoverflow.com/questions/8705600/using-solr-autocomplete-for-addresses
>> >
>> > I'm new to SOLR, but I've got it up and running, indexing data via the
>> DIH,
>> > and properly returning results for queries. I'm trying to setup another
>> > core to run suggester, in order to autocomplete geographical locations.
>> We
>> > have a web application that needs to take a city, state / region,
>> country
>> > input. We'd like to do this in a single entry box. Here are some
>> examples:
>> >
>> > Brooklyn, New York, United States of America
>> > Philadelphia, Pennsylvania, United States of America
>> > Barcelona, Catalunya, Spain
>> >
>> > Assume for now that every location around the world can be split into
>> this
>> > 3-form input. I've setup my DIH to create a TemplateTransformer field
>> that
>> > combines the 4 tables (city, state and country are all independent
>> tables
>> > connected to each other by a master places table) into a field called
>> > "fullplacename":
>> >
>> > 
>> >
>> > I've defined a "text_auto" field in schema.xml:
>> >
>> > 
>> >
>> >
>> >
>> >
>> > 
>> >
>> > and have defined these two fields as well:
>> >
>> > > > stored="true" multiValued="true" />
>> > 
>> >
>> > Now, here's my problem. This works fine for the first term, i.e. if I
>> type
>> > "brooklyn" I get the results I'd expect, using this URL to query:
>> >
>> > http://localhost:8983/solr/places/suggest?q=brooklyn
>> >
>> > However, as soon as I put a comma and/or a space in there, it breaks
>> them
>> > up into 2 suggestions, and I get a suggestion for each:
>> >
>> > http://localhost:8983/solr/places/suggest?q=brooklyn%2C%20ny
>> >
>> > Gives me a suggestion for "brooklyn" and a suggestion for "ny" instead
>> of a
>> > suggestion that matches "brooklyn, ny". I've tried every solution I can
>> > find via google and haven't had any luck. Is there something simple that
>> > I've missed, or is this the wrong approach?
>> >
>> > Just in case, here's the searchComponent and requestHandler definition:
>> >
>> > > > class="org.apache.solr.handler.component.SearchHandler">
>> >
>> >true
>> >suggest
>> >10
>> >
>> >
>> >suggest
>> >
>> > 
>> >
>> > 
>> >
>> >suggest
>> >> name="classname">org.apache.solr.spelling.suggest.Suggester
>> >> name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup
>> >name_autocomplete`
>> >
>> > 
>> >
>> > Thanks for any assistance!
>>
>>
>


Re: SOLR results case

2012-01-05 Thread Dave
Hi Juan,

When I'm storing the content, the field has a LowerCaseFilterFactory
filter, so that when I'm searching it's not case sensitive. Is there a way
to re-filter the data when it's presented as a result to restore the case
or convert to Title Case?
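
For reference, by "has a LowerCaseFilterFactory filter" I mean a field type
along these lines (the type name and tokenizer here are just placeholders for
what I'm actually using):

<fieldType name="text_lc" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>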

Thanks,
Dave

On Thu, Jan 5, 2012 at 12:41 PM, Juan Grande  wrote:

> Hi Dave,
>
> The stored content (which is returned in the results) isn't modified by the
> analyzers, so this shouldn't be a problem. Could you describe in more
> detail what you are doing and the results that you're getting?
>
> Thanks,
>
> *Juan*
>
>
>
> On Thu, Jan 5, 2012 at 2:17 PM, Dave  wrote:
>
> > I'm running all of my indexed data and queries through a
> > LowerCaseFilterFactory because I don't want to worry about case when
> > matching. All of my results are titles - is there an easy way to restore
> > case or convert all results to Title Case when returning them? My results
> > are returned as JSON if that makes any difference.
> >
> > Thanks,
> > Dave
> >
>


Re: SOLR results case

2012-01-06 Thread Dave
Hi Juan,

You're correct, the search results casing is working fine. This is my
mistake, I didn't specify that I'm using the Suggester component in order
to drive an auto-completion field on a website. Is there any way to change
the output of the Suggester to maintain case? Everything is coming out
lower-case.

Thanks!
Dave

On Thu, Jan 5, 2012 at 2:01 PM, Juan Grande  wrote:

> Hi Dave,
>
> Have you tried running a query and taking a look at the results?
>
> The filters that you define in the fieldType don't affect the way the data
> is *stored*, it affects the way the data is *indexed*. With this I mean
> that the filters affect the way that a query matches a document, and will
> affect other features that rely on the *indexed* values (like faceting) but
> won't affect the way in which results are shown, which depends on the
> *stored* value.
>
> *Juan*
>
>
>
> On Thu, Jan 5, 2012 at 3:19 PM, Dave  wrote:
>
> > Hi Juan,
> >
> > When I'm storing the content, the field has a LowerCaseFilterFactory
> > filter, so that when I'm searching it's not case sensitive. Is there a
> way
> > to re-filter the data when it's presented as a result to restore the case
> > or convert to Title Case?
> >
> > Thanks,
> > Dave
> >
> > On Thu, Jan 5, 2012 at 12:41 PM, Juan Grande 
> > wrote:
> >
> > > Hi Dave,
> > >
> > > The stored content (which is returned in the results) isn't modified by
> > the
> > > analyzers, so this shouldn't be a problem. Could you describe in more
> > > detail what you are doing and the results that you're getting?
> > >
> > > Thanks,
> > >
> > > *Juan*
> > >
> > >
> > >
> > > On Thu, Jan 5, 2012 at 2:17 PM, Dave  wrote:
> > >
> > > > I'm running all of my indexed data and queries through a
> > > > LowerCaseFilterFactory because I don't want to worry about case when
> > > > matching. All of my results are titles - is there an easy way to
> > restore
> > > > case or convert all results to Title Case when returning them? My
> > results
> > > > are returned as JSON if that makes any difference.
> > > >
> > > > Thanks,
> > > > Dave
> > > >
> > >
> >
>


Trying to understand SOLR memory requirements

2012-01-16 Thread Dave
at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:128)
 at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:161)
at org.apache.lucene.util.fst.Builder.compilePrevTail(Builder.java:247)
 at org.apache.lucene.util.fst.Builder.add(Builder.java:364)
at
org.apache.lucene.search.suggest.fst.FSTLookup.buildAutomaton(FSTLookup.java:486)
 at org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:179)
at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
 at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
at org.apache.solr.spelling.suggest.Suggester.reload(Suggester.java:153)
 at
org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener.newSearcher(SpellCheckComponent.java:675)
at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1181)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

Jan 16, 2012 4:06:15 PM org.apache.solr.core.SolrCore registerSearcher
INFO: [places] Registered new searcher Searcher@34b0ede5 main



Basically this means that once I've run a full-import, I cannot exit the SOLR
process, because I receive this error every time I restart it. I've tried
different -Xmx arguments, and I'm really at a loss at this point. Is there any
guideline for how much RAM I need? I've got 8GB
on this machine, although that could be increased if necessary. However, I
can't understand why it would need so much memory. Could I have something
configured incorrectly? I've been over the configs several times, trying to
get them down to the bare minimum.
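
For concreteness, by "different -Xmx arguments" I mean the heap flag on the
usual Jetty start command, e.g.:

java -Xmx4g -jar start.jar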

Thanks for any assistance!

Dave


Re: Trying to understand SOLR memory requirements

2012-01-16 Thread Dave
I've tried up to -Xmx5g

On Mon, Jan 16, 2012 at 9:15 PM, qiu chi  wrote:

> What is the largest -Xmx value you have tried?
> Your index size seems not very big
> Try -Xmx2048m , it should work
>
> On Tue, Jan 17, 2012 at 9:31 AM, Dave  wrote:
>
> > I'm trying to figure out what my memory needs are for a rather large
> > dataset. I'm trying to build an auto-complete system for every
> > city/state/country in the world. I've got a geographic database, and have
> > setup the DIH to pull the proper data in. There are 2,784,937 documents
> > which I've formatted into JSON-like output, so there's a bit of data
> > associated with each one. Here is an example record:
> >
> > Brooklyn, New York, United States?{ |id|: |2620829|,
> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United
> > States| }
> >
> > I've got the spellchecker / suggester module setup, and I can confirm
> that
> > everything works properly with a smaller dataset (i.e. just a couple of
> > countries worth of cities/states). However I'm running into a big problem
> > when I try to index the entire dataset. The
> dataimport?command=full-import
> > works and the system comes to an idle state. It generates the following
> > data/index/ directory (I'm including it in case it gives any indication
> on
> > memory requirements):
> >
> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> >
> > Next I try to run the suggest?spellcheck.build=true command, and I get
> the
> > following error:
> >
> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build
> > INFO: build()
> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > at java.lang.String.(String.java:215)
> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> >  at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> >  at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> > at
> > org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
> >  at
> > org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
> > at
> >
> >
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75)
> >  at
> >
> >
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125)
> > at
> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157)
> >  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
> > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
> >  at
> >
> >
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
> > at
> >
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
> >  at
> >
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
> >  at
> >
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> > at
> >
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> >  at
> >
> >
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> > at
> org.mortbay.jetty

Re: Trying to understand SOLR memory requirements

2012-01-16 Thread Dave
According to http://wiki.apache.org/solr/Suggester FSTLookup is the least
memory-intensive of the lookupImpl's. Are you suggesting a different
approach entirely or is that a lookupImpl that is not mentioned in the
documentation?


On Mon, Jan 16, 2012 at 9:54 PM, qiu chi  wrote:

> you may disable FST look up and use lucene index as the suggest method
>
> FST look up loads all documents into the memory, you can use the lucene
> spell checker instead
>
> On Tue, Jan 17, 2012 at 10:31 AM, Dave  wrote:
>
> > I've tried up to -Xmx5g
> >
> > On Mon, Jan 16, 2012 at 9:15 PM, qiu chi  wrote:
> >
> > > What is the largest -Xmx value you have tried?
> > > Your index size seems not very big
> > > Try -Xmx2048m , it should work
> > >
> > > On Tue, Jan 17, 2012 at 9:31 AM, Dave  wrote:
> > >
> > > > I'm trying to figure out what my memory needs are for a rather large
> > > > dataset. I'm trying to build an auto-complete system for every
> > > > city/state/country in the world. I've got a geographic database, and
> > have
> > > > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> > > > which I've formatted into JSON-like output, so there's a bit of data
> > > > associated with each one. Here is an example record:
> > > >
> > > > Brooklyn, New York, United States?{ |id|: |2620829|,
> > > > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> > > > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> > > > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> > > > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> > > > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> > United
> > > > States| }
> > > >
> > > > I've got the spellchecker / suggester module setup, and I can confirm
> > > that
> > > > everything works properly with a smaller dataset (i.e. just a couple
> of
> > > > countries worth of cities/states). However I'm running into a big
> > problem
> > > > when I try to index the entire dataset. The
> > > dataimport?command=full-import
> > > > works and the system comes to an idle state. It generates the
> following
> > > > data/index/ directory (I'm including it in case it gives any
> indication
> > > on
> > > > memory requirements):
> > > >
> > > > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> > > > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> > > > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> > > > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> > > > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> > > > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> > > > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> > > > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> > > > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> > > > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> > > >
> > > > Next I try to run the suggest?spellcheck.build=true command, and I
> get
> > > the
> > > > following error:
> > > >
> > > > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester
> > build
> > > > INFO: build()
> > > > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> > > > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> > > >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > > > at java.lang.String.(String.java:215)
> > > >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> > > > at
> > org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> > > >  at
> > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> > > > at
> > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> > > >  at
> > org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> > > > at
> > > >
> > org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
> > > >  at
> > > >
> > org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
> > > > at
> > > >
> > > >
> > >

Re: Trying to understand SOLR memory requirements

2012-01-17 Thread Dave
Thank you Robert, I'd appreciate that. Any idea how long it will take to
get a fix? Would I be better switching to trunk? Is trunk stable enough for
someone who's very much a SOLR novice?

Thanks,
Dave

On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:

> looks like https://issues.apache.org/jira/browse/SOLR-2888.
>
> Previously, FST would need to hold all the terms in RAM during
> construction, but with the patch it uses offline sorts/temporary
> files.
> I'll reopen the issue to backport this to the 3.x branch.
>
>
> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> > I'm trying to figure out what my memory needs are for a rather large
> > dataset. I'm trying to build an auto-complete system for every
> > city/state/country in the world. I've got a geographic database, and have
> > setup the DIH to pull the proper data in. There are 2,784,937 documents
> > which I've formatted into JSON-like output, so there's a bit of data
> > associated with each one. Here is an example record:
> >
> > Brooklyn, New York, United States?{ |id|: |2620829|,
> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United
> > States| }
> >
> > I've got the spellchecker / suggester module setup, and I can confirm
> that
> > everything works properly with a smaller dataset (i.e. just a couple of
> > countries worth of cities/states). However I'm running into a big problem
> > when I try to index the entire dataset. The
> dataimport?command=full-import
> > works and the system comes to an idle state. It generates the following
> > data/index/ directory (I'm including it in case it gives any indication
> on
> > memory requirements):
> >
> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> >
> > Next I try to run the suggest?spellcheck.build=true command, and I get
> the
> > following error:
> >
> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build
> > INFO: build()
> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > at java.lang.String.(String.java:215)
> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> >  at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> >  at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> > at
> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
> >  at
> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
> > at
> >
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75)
> >  at
> >
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125)
> > at
> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157)
> >  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
> > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
> >  at
> >
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
> > at
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
> >  at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
> >  at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.

Re: Trying to understand SOLR memory requirements

2012-01-18 Thread Dave
I'm using 3.5

On Tue, Jan 17, 2012 at 7:57 PM, Lance Norskog  wrote:

> Which version of Solr do you use? 3.1 and 3.2 had a memory leak bug in
> spellchecking. This was fixed in 3.3.
>
> On Tue, Jan 17, 2012 at 5:59 AM, Robert Muir  wrote:
> > I committed it already: so you can try out branch_3x if you want.
> >
> > you can either wait for a nightly build or compile from svn
> > (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
> >
> > On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> >> Thank you Robert, I'd appreciate that. Any idea how long it will take to
> >> get a fix? Would I be better switching to trunk? Is trunk stable enough
> for
> >> someone who's very much a SOLR novice?
> >>
> >> Thanks,
> >> Dave
> >>
> >> On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
> >>
> >>> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> >>>
> >>> Previously, FST would need to hold all the terms in RAM during
> >>> construction, but with the patch it uses offline sorts/temporary
> >>> files.
> >>> I'll reopen the issue to backport this to the 3.x branch.
> >>>
> >>>
> >>> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> >>> > I'm trying to figure out what my memory needs are for a rather large
> >>> > dataset. I'm trying to build an auto-complete system for every
> >>> > city/state/country in the world. I've got a geographic database, and
> have
> >>> > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> >>> > which I've formatted into JSON-like output, so there's a bit of data
> >>> > associated with each one. Here is an example record:
> >>> >
> >>> > Brooklyn, New York, United States?{ |id|: |2620829|,
> >>> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> >>> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> >>> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> >>> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> >>> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> United
> >>> > States| }
> >>> >
> >>> > I've got the spellchecker / suggester module setup, and I can confirm
> >>> that
> >>> > everything works properly with a smaller dataset (i.e. just a couple
> of
> >>> > countries worth of cities/states). However I'm running into a big
> problem
> >>> > when I try to index the entire dataset. The
> >>> dataimport?command=full-import
> >>> > works and the system comes to an idle state. It generates the
> following
> >>> > data/index/ directory (I'm including it in case it gives any
> indication
> >>> on
> >>> > memory requirements):
> >>> >
> >>> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> >>> > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> >>> > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> >>> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> >>> > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> >>> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> >>> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> >>> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> >>> > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> >>> > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> >>> >
> >>> > Next I try to run the suggest?spellcheck.build=true command, and I
> get
> >>> the
> >>> > following error:
> >>> >
> >>> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester
> build
> >>> > INFO: build()
> >>> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> >>> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >>> > at java.lang.String.(String.java:215)
> >>> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> >>> > at
> org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> >>

Re: Trying to understand SOLR memory requirements

2012-01-18 Thread Dave
Robert, where can I pull down a nightly build from? Will it include the
apache-solr-core-3.3.0.jar and lucene-core-3.3-SNAPSHOT.jar jars? I need to
re-build with a custom SpellingQueryConverter.java.

Thanks,
Dave

On Tue, Jan 17, 2012 at 8:59 AM, Robert Muir  wrote:

> I committed it already: so you can try out branch_3x if you want.
>
> you can either wait for a nightly build or compile from svn
> (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
>
> On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> > Thank you Robert, I'd appreciate that. Any idea how long it will take to
> > get a fix? Would I be better switching to trunk? Is trunk stable enough
> for
> > someone who's very much a SOLR novice?
> >
> > Thanks,
> > Dave
> >
> > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
> >
> >> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> >>
> >> Previously, FST would need to hold all the terms in RAM during
> >> construction, but with the patch it uses offline sorts/temporary
> >> files.
> >> I'll reopen the issue to backport this to the 3.x branch.
> >>
> >>
> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> >> > I'm trying to figure out what my memory needs are for a rather large
> >> > dataset. I'm trying to build an auto-complete system for every
> >> > city/state/country in the world. I've got a geographic database, and
> have
> >> > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> >> > which I've formatted into JSON-like output, so there's a bit of data
> >> > associated with each one. Here is an example record:
> >> >
> >> > Brooklyn, New York, United States?{ |id|: |2620829|,
> >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> United
> >> > States| }
> >> >
> >> > I've got the spellchecker / suggester module setup, and I can confirm
> >> that
> >> > everything works properly with a smaller dataset (i.e. just a couple
> of
> >> > countries worth of cities/states). However I'm running into a big
> problem
> >> > when I try to index the entire dataset. The
> >> dataimport?command=full-import
> >> > works and the system comes to an idle state. It generates the
> following
> >> > data/index/ directory (I'm including it in case it gives any
> indication
> >> on
> >> > memory requirements):
> >> >
> >> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> >> > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> >> > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> >> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> >> > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> >> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> >> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> >> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> >> > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> >> > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> >> >
> >> > Next I try to run the suggest?spellcheck.build=true command, and I get
> >> the
> >> > following error:
> >> >
> >> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester
> build
> >> > INFO: build()
> >> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> >> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >> > at java.lang.String.(String.java:215)
> >> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> >> > at
> org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> >> >  at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> >> > at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> >> >  at
> org.apache.lucene.index.SegmentReader.docFreq

Re: Trying to understand SOLR memory requirements

2012-01-18 Thread Dave
Ok, I've been able to pull the code from SVN, build it, and compile my
SpellingQueryConverter against it. However, I'm at a loss as to where to
find, or how to build, the solr.war file.
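
My assumption is that, with the stock 3.x checkout, something like this from
the solr/ directory should produce the war under dist/ (please correct me if
that's wrong):

cd branch_3x/solr
ant dist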

On Tue, Jan 17, 2012 at 8:59 AM, Robert Muir  wrote:

> I committed it already: so you can try out branch_3x if you want.
>
> you can either wait for a nightly build or compile from svn
> (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
>
> On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> > Thank you Robert, I'd appreciate that. Any idea how long it will take to
> > get a fix? Would I be better switching to trunk? Is trunk stable enough
> for
> > someone who's very much a SOLR novice?
> >
> > Thanks,
> > Dave
> >
> > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
> >
> >> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> >>
> >> Previously, FST would need to hold all the terms in RAM during
> >> construction, but with the patch it uses offline sorts/temporary
> >> files.
> >> I'll reopen the issue to backport this to the 3.x branch.
> >>
> >>
> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> >> > I'm trying to figure out what my memory needs are for a rather large
> >> > dataset. I'm trying to build an auto-complete system for every
> >> > city/state/country in the world. I've got a geographic database, and
> have
> >> > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> >> > which I've formatted into JSON-like output, so there's a bit of data
> >> > associated with each one. Here is an example record:
> >> >
> >> > Brooklyn, New York, United States?{ |id|: |2620829|,
> >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> United
> >> > States| }
> >> >
> >> > I've got the spellchecker / suggester module setup, and I can confirm
> >> that
> >> > everything works properly with a smaller dataset (i.e. just a couple
> of
> >> > countries worth of cities/states). However I'm running into a big
> problem
> >> > when I try to index the entire dataset. The
> >> dataimport?command=full-import
> >> > works and the system comes to an idle state. It generates the
> following
> >> > data/index/ directory (I'm including it in case it gives any
> indication
> >> on
> >> > memory requirements):
> >> >
> >> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> >> > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> >> > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> >> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> >> > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> >> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> >> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> >> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> >> > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> >> > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> >> >
> >> > Next I try to run the suggest?spellcheck.build=true command, and I get
> >> the
> >> > following error:
> >> >
> >> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester
> build
> >> > INFO: build()
> >> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> >> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >> > at java.lang.String.(String.java:215)
> >> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> >> > at
> org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> >> >  at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> >> > at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> >> >  at
> org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> >> > at
> >>
> org.

Re: Trying to understand SOLR memory requirements

2012-01-18 Thread Dave
Unfortunately, that doesn't look like it solved my problem. I built the new
.war file, dropped it in, and restarted the server. When I tried to build
the spellchecker index, it ran out of memory again. Is there anything I
needed to change in the configuration? Did I need to upload new .jar files,
or was replacing the .war file enough?

Jan 18, 2012 2:20:25 PM org.apache.solr.spelling.suggest.Suggester build
INFO: build()


Jan 18, 2012 2:22:06 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:344)
at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:352)
 at org.apache.lucene.util.fst.FST$BytesWriter.writeByte(FST.java:975)
at org.apache.lucene.util.fst.FST.writeLabel(FST.java:395)
 at org.apache.lucene.util.fst.FST.addNode(FST.java:499)
at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:182)
 at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:270)
at org.apache.lucene.util.fst.Builder.add(Builder.java:365)
 at
org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.buildAutomaton(FSTCompletionBuilder.java:228)
at
org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.build(FSTCompletionBuilder.java:202)
 at
org.apache.lucene.search.suggest.fst.FSTCompletionLookup.build(FSTCompletionLookup.java:199)
at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
 at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
at
org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
 at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1375)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:358)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:253)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)


On Tue, Jan 17, 2012 at 8:59 AM, Robert Muir  wrote:

> I committed it already: so you can try out branch_3x if you want.
>
> you can either wait for a nightly build or compile from svn
> (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
>
> On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> > Thank you Robert, I'd appreciate that. Any idea how long it will take to
> > get a fix? Would I be better switching to trunk? Is trunk stable enough
> for
> > someone who's very much a SOLR novice?
> >
> > Thanks,
> > Dave
> >
> > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
> >
> >> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> >>
> >> Previously, FST would need to hold all the terms in RAM during
> >> construction, but with the patch it uses offline sorts/temporary
> >> files.
> >> I'll reopen the issue to backport this to the 3.x branch.
> >>
> >>
> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> >> > I'm trying to figure out what my memory needs are for a rather large
> >> > dataset. I'm trying to build an auto-complete system for every
> >> > city/state/country in the world. I've got a geographic database, and
> have
> >> > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> >> > which I've formatted into JSON-like output, so there's a bit of data
> >> > associated with each one. Here is an example record:
> >> >
> >> > Brooklyn, New York, United States?{ |id|: |2620829|,
> >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> >> > |Brooklyn|, |name|: |Bro

Re: Trying to understand SOLR memory requirements

2012-01-19 Thread Dave
I'm also seeing the error when I try to start up the SOLR instance:

SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:344)
 at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:352)
at org.apache.lucene.util.fst.FST$BytesWriter.writeByte(FST.java:975)
 at org.apache.lucene.util.fst.FST.writeLabel(FST.java:395)
at org.apache.lucene.util.fst.FST.addNode(FST.java:499)
 at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:182)
at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:270)
 at org.apache.lucene.util.fst.Builder.add(Builder.java:365)
at
org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.buildAutomaton(FSTCompletionBuilder.java:228)
 at
org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.build(FSTCompletionBuilder.java:202)
at
org.apache.lucene.search.suggest.fst.FSTCompletionLookup.build(FSTCompletionLookup.java:199)
 at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
 at org.apache.solr.spelling.suggest.Suggester.reload(Suggester.java:153)
at
org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener.newSearcher(SpellCheckComponent.java:675)
 at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1184)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


On Wed, Jan 18, 2012 at 5:24 PM, Dave  wrote:

> Unfortunately, that doesn't look like it solved my problem. I built the
> new .war file, dropped it in, and restarted the server. When I tried to
> build the spellchecker index, it ran out of memory again. Is there anything
> I needed to change in the configuration? Did I need to upload new .jar
> files, or was replacing the .war file enough?
>
> Jan 18, 2012 2:20:25 PM org.apache.solr.spelling.suggest.Suggester build
> INFO: build()
>
>
> Jan 18, 2012 2:22:06 PM org.apache.solr.common.SolrException log
>  SEVERE: java.lang.OutOfMemoryError: Java heap space
> at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:344)
> at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:352)
>  at org.apache.lucene.util.fst.FST$BytesWriter.writeByte(FST.java:975)
> at org.apache.lucene.util.fst.FST.writeLabel(FST.java:395)
>  at org.apache.lucene.util.fst.FST.addNode(FST.java:499)
> at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:182)
>  at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:270)
> at org.apache.lucene.util.fst.Builder.add(Builder.java:365)
>  at
> org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.buildAutomaton(FSTCompletionBuilder.java:228)
> at
> org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.build(FSTCompletionBuilder.java:202)
>  at
> org.apache.lucene.search.suggest.fst.FSTCompletionLookup.build(FSTCompletionLookup.java:199)
> at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
>  at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
> at
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
>  at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1375)
> at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:358)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:253)
> at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at
> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>  at
> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
> at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>  at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>  at org.mortbay.jetty.Server.handle(Server.java:326)
> at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>  at
> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
> at org.mortbay.jetty.HttpParser

Re: Trying to understand SOLR memory requirements

2012-01-19 Thread Dave
In my original post I included one of my terms:

Brooklyn, New York, United States?{ |id|: |2620829|,
|timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
|region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
|Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
|2300664|, |label|: |Brooklyn, New York, United States|, |value|:
|Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United
States| }

I'm matching on the first part of the term (the part before the ?), and the
rest is passed through to the JavaScript on the page, which parses it as a
JSON object. Here is my data-config.xml file, in case it sheds any
light:


  
  




























  





On Thu, Jan 19, 2012 at 11:52 AM, Robert Muir  wrote:

> I don't think the problem is FST, since it sorts offline in your case.
>
> More importantly, what are you trying to put into the FST?
>
> it appears you are indexing terms from your term dictionary, but your
> term dictionary is over 1GB, why is that?
>
> what do your terms look like? 1GB for 2,784,937 documents does not make
> sense.
> for example, all place names in geonames (7.2M documents) creates a
> term dictionary of 22MB.
>
> So there is something wrong with your data importing and/or analysis
> process, your terms are not what you think they are.
>
> On Thu, Jan 19, 2012 at 11:27 AM, Dave  wrote:
> > I'm also seeing the error when I try to start up the SOLR instance:
> >
> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> > at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:344)
> >  at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:352)
> > at org.apache.lucene.util.fst.FST$BytesWriter.writeByte(FST.java:975)
> >  at org.apache.lucene.util.fst.FST.writeLabel(FST.java:395)
> > at org.apache.lucene.util.fst.FST.addNode(FST.java:499)
> >  at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:182)
> > at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:270)
> >  at org.apache.lucene.util.fst.Builder.add(Builder.java:365)
> > at
> >
> org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.buildAutomaton(FSTCompletionBuilder.java:228)
> >  at
> >
> org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.build(FSTCompletionBuilder.java:202)
> > at
> >
> org.apache.lucene.search.suggest.fst.FSTCompletionLookup.build(FSTCompletionLookup.java:199)
> >  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
> > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
> >  at org.apache.solr.spelling.suggest.Suggester.reload(Suggester.java:153)
> > at
> >
> org.apache.solr.handler.component.SpellCheckComponent$SpellCheckerListener.newSearcher(SpellCheckComponent.java:675)
> >  at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1184)
> > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >  at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> > at java.lang.Thread.run(Thread.java:662)
> >
> >
> > On Wed, Jan 18, 2012 at 5:24 PM, Dave  wrote:
> >
> >> Unfortunately, that doesn't look like it solved my problem. I built the
> >> new .war file, dropped it in, and restarted the server. When I tried to
> >> build the spellchecker index, it ran out of memory again. Is there
> anything
> >> I needed to change in the configuration? Did I need to upload new .jar
> >> files, or was replacing the .war file enough?
> >>
> >> Jan 18, 2012 2:20:25 PM org.apache.solr.spelling.suggest.Suggester build
> >> INFO: build()
> >>
> >>
> >> Jan 18, 2012 2:22:06 PM org.apache.solr.common.SolrException log
> >>  SEVERE: java.lang.OutOfMemoryError: Java heap space
> >> at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:344)
> >> at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:352)
> >>  at org.apache.lucene.util.fst.FST$BytesWriter.writeByte(FST.java:975)
> >> at org.apache.lucene.util.fst.FST.writeLabel(FST.java:395)
> >>  at org.apache.lucene.util.fst.FST.addNode(FST.java:499)
>

Re: Trying to understand SOLR memory requirements

2012-01-19 Thread Dave
That was how I originally tried to implement it, but I could not figure out
how to get the suggester to return anything but the suggestion. How do you
do that?

On Thu, Jan 19, 2012 at 1:13 PM, Robert Muir  wrote:

> I really don't think you should put a huge json document as a search term.
>
> Just make "Brooklyn, New York, United States" or whatever you intend
> the user to actually search on/type in as your search term.
> put the rest in different fields (e.g. stored-only, not even indexed
> if you dont need that) and have solr return it that way.
>
> On Thu, Jan 19, 2012 at 12:31 PM, Dave  wrote:
> > In my original post I included one of my terms:
> >
> > Brooklyn, New York, United States?{ |id|: |2620829|,
> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United
> > States| }
> >
> > I'm matching on the first part of the term (the part before the ?), and
> > then the rest is being passed via JSON into Javascript, then converted
> to a
> > JSON term itself. Here is my data-config.xml file, in case it sheds any
> > light:
> >
> > 
> >   >  driver="com.mysql.jdbc.Driver"
> >  url=""
> >  user=""
> >  password=""
> >  encoding="UTF-8"/>
> >  
> > >pk="id"
> >query="select p.id as placeid, c.id, c.plainname, c.name,
> > p.timezone from countries c, places p where p.regionid = 1 AND p.cityid
> = 1
> > AND c.id=p.countryid AND p.settingid=1"
> >transformer="TemplateTransformer">
> >
> >
> >
> >
> >
> >
> >
> > >pk="id"
> >query="select p.id as placeid, p.countryid as countryid,
> > c.plainname as countryname, p.timezone as timezone, r.id as regionid,
> > r.plainname as regionname, r.population as regionpop from places p,
> regions
> > r, countries c where r.id = p.regionid AND p.settingid = 1 AND
> p.regionid >
> > 1 AND p.countryid=c.id AND p.cityid=1 AND r.population > 0"
> >transformer="TemplateTransformer">
> >
> >
> >
> >
> >
> >
> >
> > >pk="id"
> >query="select c2.id as cityid, c2.plainname as cityname,
> > c2.population as citypop, p.id as placeid, p.countryid as countryid,
> > c.plainname as countryname, p.timezone as timezone, r.id as regionid,
> > r.plainname as regionname from places p, regions r, countries c, cities
> c2
> > where c2.id = p.cityid AND p.settingid = 1 AND p.regionid > 1 AND
> > p.countryid=c.id AND r.id=p.regionid"
> >transformer="TemplateTransformer">
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >  
> > 
> >
> >
> >
> >
> > On Thu, Jan 19, 2012 at 11:52 AM, Robert Muir  wrote:
> >
> >> I don't think the problem is FST, since it sorts offline in your case.
> >>
> >> More importantly, what are you trying to put into the FST?
> >>
> >> it appears you are indexing terms from your term dictionary, but your
> >> term dictionary is over 1GB, why is that?
> >>
> >> what do your terms look like? 1GB for 2,784,937 documents does not make
> >> sense.
> >> for example, all place names in geonames (7.2M documents) creates a
> >> term dictionary of 22MB.
> >>
> >> So there is something wrong with your data importing and/or analysis
> >> process, your terms are not what you think they are.
> >>
> >> On Thu, Jan 19, 2012 at 11:27 AM, Dave  wrote:
> >> > I'm also seeing the error when I try to start up the SOLR instance:
> >> >
> >> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> >> > at org.apache.lucene.util.ArrayUtil.grow(Array

Re: Trying to understand SOLR memory requirements

2012-01-22 Thread Dave
I take it from the overwhelming silence on the list that what I've asked is
not possible? It seems like the suggester component is not well supported
or understood, and limited in functionality.

Does anyone have any ideas for how I would implement the functionality I'm
looking for? I'm trying to implement a single location auto-suggestion box
that will search across multiple DB tables. It would take several possible
inputs: city, state, country; state, country; or country. In addition, there
are many aliases for each city, state and country that map back to the
original city/state/country. Once they select a suggestion, that suggestion
needs to have certain information associated with it. It seems that the
Suggester component is not the right tool for this. Anyone have other ideas?

Thanks,
Dave

On Thu, Jan 19, 2012 at 6:09 PM, Dave  wrote:

> That was how I originally tried to implement it, but I could not figure
> out how to get the suggester to return anything but the suggestion. How do
> you do that?
>
>
> On Thu, Jan 19, 2012 at 1:13 PM, Robert Muir  wrote:
>
>> I really don't think you should put a huge json document as a search term.
>>
>> Just make "Brooklyn, New York, United States" or whatever you intend
>> the user to actually search on/type in as your search term.
>> put the rest in different fields (e.g. stored-only, not even indexed
>> if you dont need that) and have solr return it that way.
>>
>> On Thu, Jan 19, 2012 at 12:31 PM, Dave  wrote:
>> > In my original post I included one of my terms:
>> >
>> > Brooklyn, New York, United States?{ |id|: |2620829|,
>> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
>> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
>> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
>> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
>> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
>> United
>> > States| }
>> >
>> > I'm matching on the first part of the term (the part before the ?), and
>> > then the rest is being passed via JSON into Javascript, then converted
>> to a
>> > JSON term itself. Here is my data-config.xml file, in case it sheds any
>> > light:
>> >
>> > 
>> >  > >  driver="com.mysql.jdbc.Driver"
>> >  url=""
>> >  user=""
>> >  password=""
>> >  encoding="UTF-8"/>
>> >  
>> >> >pk="id"
>> >query="select p.id as placeid, c.id, c.plainname, c.name,
>> > p.timezone from countries c, places p where p.regionid = 1 AND p.cityid
>> = 1
>> > AND c.id=p.countryid AND p.settingid=1"
>> >transformer="TemplateTransformer">
>> >
>> >
>> >
>> >
>> >
>> >> template="${countries.plainname}?{
>> > |id|: |${countries.placeid}|, |timezone|:|${countries.timezone}|,|type|:
>> > |1|, |country|: { |id| : |${countries.id}|, |plainname|:
>> > |${countries.plainname}|, |name|: |${countries.plainname}| }, |region|:
>> {
>> > |id| : |0| }, |city|: { |id|: |0| }, |hint|: ||, |label|:
>> > |${countries.plainname}|, |value|: |${countries.plainname}|, |title|:
>> > |${countries.plainname}| }"/>
>> >
>> >> >pk="id"
>> >query="select p.id as placeid, p.countryid as countryid,
>> > c.plainname as countryname, p.timezone as timezone, r.id as regionid,
>> > r.plainname as regionname, r.population as regionpop from places p,
>> regions
>> > r, countries c where r.id = p.regionid AND p.settingid = 1 AND
>> p.regionid >
>> > 1 AND p.countryid=c.id AND p.cityid=1 AND r.population > 0"
>> >transformer="TemplateTransformer">
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >> >pk="id"
>> >query="select c2.id as cityid, c2.plainname as cityname,
>> > c2.population as citypop, p.id as placeid, p.countryid as countryid,
>> > c.plainname as countryname, p.timezone as timezone, r.id as regionid,
>> > r.plainname as regionname from places p, regions r, 

Re: Using SOLR Autocomplete for addresses (i.e. multiple terms)

2012-01-29 Thread Dave
Thanks Jan, this is perfect! I'm going to work on implementing it this week
and let you know how it works for us. Thanks again!

Dave
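
For anyone who lands on this thread later: the separate-core approach generally comes down to an edge-ngram analyzed field for matching plus ordinary stored fields for whatever you want returned with each suggestion. A rough sketch of the field type (the name and gram sizes here are just examples, not the exact config from the blog post):

  <fieldType name="text_autocomplete" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Since it is a normal core, the response can carry any stored fields (id, timezone, label, and so on) via fl, which is exactly what the suggester alone would not return.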

On Wed, Jan 25, 2012 at 1:10 PM, Jan Høydahl  wrote:

> Hi,
>
> I don't think that the suggester can output multiple fields. You would
> have to encode your data in a special way with separators.
>
> Using the separate Solr core approach, you may return whatever fields you
> choose to the suggest Ajax component.
> I've written up a blog post and uploaded an example to GitHub. See
> http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 3. jan. 2012, at 20:41, Dave wrote:
>
> > I've got another question for anyone that might have some insight - how
> do
> > you get all of your indexed information along with the suggestions? i.e.
> if
> > each suggestion has an ID# associated with it, do I have to then query
> for
> > that ID#, or is there some way or specifying a field list in the URL to
> the
> > suggester?
> >
> > Thanks!
> > Dave
> >
> > On Tue, Jan 3, 2012 at 9:41 AM, Dave  wrote:
> >
> >> Hi Jan,
> >>
> >> Yes, I just saw the answer. I've implemented that, and it's working as
> >> expected. I do have Suggest running on its own core, separate from my
> >> standard search handler. I think, however, that the custom
> QueryConverter
> >> that was linked to is now too restrictive. For example, it works
> perfectly
> >> when someone enters "brooklyn, n", but if they start by entering "ny" or
> >> "new york" it doesn't return anything. I think what you're talking
> about,
> >> suggesting from whole input and individual tokens is the way to go. Is
> >> there anything you can point me to as a starting point? I think I've got
> >> the basic setup, but I'm not quite comfortable enough with SOLR and the
> >> SOLR architecture yet (honestly I've only been using it for about 2
> weeks
> >> now).
> >>
> >> Thanks for the help!
> >>
> >> Dave
> >>
> >>
> >> On Tue, Jan 3, 2012 at 8:24 AM, Jan Høydahl 
> wrote:
> >>
> >>> Hi,
> >>>
> >>> As you see, you've got an answer at StackOverflow already with a
> proposed
> >>> solution to implement your own QueryConverter.
> >>>
> >>> Another way is to create a Solr core solely for Suggest, and tune it
> >>> exactly the way you like. Then you can have it suggest from the whole
> input
> >>> as well as individual tokens and weigh these as you choose, as well as
> >>> implement phonetic normalization and other useful tricks.
> >>>
> >>> --
> >>> Jan Høydahl, search solution architect
> >>> Cominvent AS - www.cominvent.com
> >>> Solr Training - www.solrtraining.com
> >>>
> >>> On 3. jan. 2012, at 00:52, Dave wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I'm reposting my StackOverflow question to this thread as I'm not
> >>> getting
> >>>> much of a response there. Thank you for any assistance you can
> provide!
> >>>>
> >>>>
> >>>
> http://stackoverflow.com/questions/8705600/using-solr-autocomplete-for-addresses
> >>>>
> >>>> I'm new to SOLR, but I've got it up and running, indexing data via the
> >>> DIH,
> >>>> and properly returning results for queries. I'm trying to setup
> another
> >>>> core to run suggester, in order to autocomplete geographical
> locations.
> >>> We
> >>>> have a web application that needs to take a city, state / region,
> >>> country
> >>>> input. We'd like to do this in a single entry box. Here are some
> >>> examples:
> >>>>
> >>>> Brooklyn, New York, United States of America
> >>>> Philadelphia, Pennsylvania, United States of America
> >>>> Barcelona, Catalunya, Spain
> >>>>
> >>>> Assume for now that every location around the world can be split into
> >>> this
> >>>> 3-form input. I've setup my DIH to create a TemplateTransformer field
> >>> that
> >>>> combines the 4 tables (city, state and country are all independent

7.3 to 7.5

2018-10-18 Thread Dave
Would a minor solr upgrade such as this require a reindexing in order to take 
advantage of the skg functionality, or would it work regardless?  A full 
reindex is quite a large operation in my use case


Re: Index optimization takes too long

2018-11-03 Thread Dave
On a side note, does adding docvalues to an already indexed field, and then 
optimizing, prevent the need to reindex to take advantage of docvalues? I was 
under the impression you had to reindex the content. 
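
(For context, "adding docValues" is a one-attribute change on the field definition in the schema, along the lines of the sketch below with an invented field name; the question is whether flipping it on an already indexed field, then optimizing, works without a full reindex:)

  <field name="price_sort" type="plong" indexed="true" stored="true" docValues="true"/>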

> On Nov 3, 2018, at 4:41 AM, Deepak Goel  wrote:
> 
> I would start by monitoring the hardware (CPU, Memory, Disk) & software
> (heap, threads) utilization's and seeing where the bottlenecks are. Or what
> is getting utilized the most. And then tune that parameter.
> 
> I would also look at profiling the software.
> 
> 
> Deepak
> "The greatness of a nation can be judged by the way its animals are
> treated. Please consider stopping the cruelty by becoming a Vegan"
> 
> +91 73500 12833
> deic...@gmail.com
> 
> Facebook: https://www.facebook.com/deicool
> LinkedIn: www.linkedin.com/in/deicool
> 
> "Plant a Tree, Go Green"
> 
> Make In India : http://www.makeinindia.com/home
> 
> 
>> On Sat, Nov 3, 2018 at 4:30 AM Wei  wrote:
>> 
>> Hello,
>> 
>> After a recent schema change,  it takes almost 40 minutes to optimize the
>> index.  The schema change is to enable docValues for all sort/facet fields,
>> which increase the index size from 12G to 14G. Before the change it only
>> takes 5 minutes to do the optimization.
>> 
>> I have tried to increase maxMergeAtOnceExplicit because the default 30
>> could be too low:
>> 
>> 100
>> 
>> But it doesn't seem to help. Any suggestions?
>> 
>> Thanks,
>> Wei
>> 


Re: Solr Cloud configuration

2018-11-20 Thread Dave
But then I would lose the streaming expressions, right?

> On Nov 20, 2018, at 6:00 PM, Edward Ribeiro  wrote:
> 
> Hi David,
> 
> Well, as a last resort you can resort to classic schema.xml if you are
> using standalone Solr and don't bother to give up schema API. Then you are
> back to manually editing conf/ files. See:
> 
> https://lucene.apache.org/solr/guide/7_4/schema-factory-definition-in-solrconfig.html
> 
> Best regards,
> Edward
> 
> 
> Em ter, 20 de nov de 2018 18:21, Adam Constabaris  escreveu:
> 
>> David,
>> 
>> One benefit of the way recommended in the reference guide is that it lets
>> you use zookeeper upconfig/downconfig as deployment tools on a set of text
>> files, which in turn allows you to manage your Solr configuration like any
>> other bit of source code, e.g. with version control and, if your situation
>> permits, things like branching and pull requests or other review
>> mechanisms.
>> 
>> In particular I have found the capacity to view diffs, have peers review,
>> and the ease of deploying changes to test and staging environments before
>> moving them into production is worth the effort all by itself.
>> 
>> HTH,
>> 
>> AC
>> 
>> 
>> 
>> On Tue, Nov 20, 2018 at 2:22 PM David Hastings <
>> hastings.recurs...@gmail.com>
>> wrote:
>> 
>>> Well considering that any access to the user interface by anyone can
>>> completely destroy entire collections/cores, I would think the security
>> of
>>> the stop word file wouldnt be that important
>>> Thanks Erick, it seems the only reason I have any desire to use SolrCloud
>>> is the use of streaming expressions.  I think thats the only benefit that
>>> more hardware cant solve.
>>> 
>>> On Tue, Nov 20, 2018 at 2:17 PM Erick Erickson 
>>> wrote:
>>> 
 David:
 
 Sure would. See https://issues.apache.org/jira/browse/SOLR-5287.
 Especially the bits about how allowing this leads to security
 vulnerabilities. You're not the first one who had this idea ;).
 
 Whether those security issues are still valid is another question I
 suppose.
 
 Best,
 Erick
 On Tue, Nov 20, 2018 at 11:01 AM David Hastings
  wrote:
> 
> Thanks, researching that now, but this seems extremely annoying.
>>> wouldnt
> it just be easier if you could edit the config files raw from the
>> admin
> UI?
> 
> On Tue, Nov 20, 2018 at 1:41 PM Pure Host - Wolfgang Freudenberger <
> w.freudenber...@pure-host.de> wrote:
> 
>> Hi David,
>> 
>> 
>> You can upload configuration to the zookeeper - it is nearly the
>> same
 as
>> the standaloneconfig.
>> 
>> You can also edit the schema.xml in this file. At least I do it
>> like
 this.
>> 
>> Mit freundlichem Gruß / kind regards
>> 
>> Wolfgang Freudenberger
>> Pure Host IT-Services
>> Münsterstr. 14
>> 48341 Altenberge
>> GERMANY
>> Tel.: (+49) 25 71 - 99 20 170
>> Fax: (+49) 25 71 - 99 20 171
>> 
>> Umsatzsteuer ID DE259181123
>> 
>> Informieren Sie sich über unser gesamtes Leistungsspektrum unter
>> www.pure-host.de
>> Get our whole services at www.pure-host.de
>> 
>>> Am 20.11.2018 um 19:38 schrieb David Hastings:
>>> I cant seem to find the documentation on how to actually edit the
 schema
>>> file myself, everything seems to lead me to using an API to add
 fields
>> and
>>> stop words etc.  this is more or less obnoxious, and the admin
>> api
 for
>>> adding fields/field types is not exactly functional.  is there a
 guide or
>>> something to let me know how to do it normally like in standalone
 solr?
>>> 
>> 
>> 
 
>>> 
>> 


Re: Large Number of Collections takes down Solr 7.3

2019-01-22 Thread Dave
Do you mind if I ask why you need so many collections, rather than a field in one 
collection plus a filter query per customer to restrict the result set, assuming 
you’re the one controlling the middleware?
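
(For what I mean: a single collection with a customer field, where every request the middleware sends carries something like the line below; field name and value are invented for illustration:)

  q=<the user query>&fq=customer_id:12345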

> On Jan 22, 2019, at 4:43 PM, Monica Skidmore 
>  wrote:
> 
> We have been running Solr 5.4 in master-slave mode with ~4500 cores for a 
> couple of years very successfully.  The cores represent individual customer 
> data, so they can vary greatly in size, and some of them have gotten too 
> large to be manageable.
> 
> We are trying to upgrade to Solr 7.3 in cloud mode, with ~4500 collections, 2 
>  NRTreplicas total per collection.  We have experimented with additional 
> servers and ZK nodes as a part of this move.  We can create up to ~4000 
> collections, with a slow-down to ~20s per collection to create, but if we go 
> much beyond that, the time to create collections shoots up, some collections 
> fail to be created, and we see some of the nodes crash.  Autoscaling brings 
> nodes back into the cluster, but they don’t have all the replicas created on 
> them that they should – we’re pretty sure this is related to the challenge of 
> adding the large number of collections on those node as they come up.
> 
> There are some approaches we could take that don’t separate our customers 
> into collections, but we get some benefits from this approach that we’d like 
> to keep.  We’d also like to add the benefits of cloud, like balancing where 
> collections are placed and the ability to split large collections.
> 
> Is anyone successfully running Solr 7x in cloud mode with thousands or more 
> of collections?  Are there some configurations we should be taking a closer 
> look at to make this feasible?  Should we try a different replica type?  (We 
> do want NRT-like query latency, but we also index heavily – this cluster will 
> have 10’s of millions of documents.)
> 
> I should note that the problems are not due to the number of documents – the 
> problems occur on a new cluster while we’re creating the collections we know 
> we’ll need.
> 
> Monica Skidmore
> 
> 
> 


Re: English Analyzer

2019-02-05 Thread Dave
This will tell you pretty much everything you need to get started:

https://lucene.apache.org/solr/guide/6_6/language-analysis.html
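
A typical English chain from that page combines stopword removal, lowercasing and stemming; roughly like this sketch (treat the file path and filter choices as examples to adapt, not a drop-in config):

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

Indexing the same text into this type and into a plain string field, then comparing them in the Analysis screen of the admin UI, is the quickest way to see the before/after difference.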

> On Feb 5, 2019, at 4:55 AM, akash jayaweera  wrote:
> 
> Hello All,
> 
> Can i get details how to use English analyzer with stemming,
> lemmatizatiion, stopword removal techniques.
> I want to see the difference between before and after applying the English
> analyzer
> 
> Regards,
> *Akash Jayaweera.*
> 
> 
> E akash.jayawe...@gmail.com 
> M + 94 77 2472635 <+94%2077%20247%202635>


Re: edismax: sorting on numeric fields

2019-02-16 Thread Dave
Sounds like you need to use code and post-process your results, as it sounds too 
specific to your use case. Just my opinion, unless you want to get into spatial 
queries, which is a whole different animal and something I don’t think many have 
experience with, including myself.

> On Feb 16, 2019, at 10:10 AM, Nicolas Paris  wrote:
> 
> Hi
> 
> Thanks.
> To clarify, I don't want to sort by numeric fields, instead, I d'like to
> get sort by distance to my query.
> 
> 
>> On Thu, Feb 14, 2019 at 06:20:19PM -0500, Gus Heck wrote:
>> Hi Niclolas,
>> 
>> Solr has no difficulty sorting on numeric fields if they are indexed as a
>> numeric type. Just use "&sort=weight asc" If you're field is indexed as
>> text of course it won't sort properly, but then you should fix your schema.
>> 
>> -Gus
>> 
>> On Thu, Feb 14, 2019 at 4:10 PM David Hastings 
>> wrote:
>> 
>>> Not clearly understanding your question here.  if your query is
>>> q=kind:animal weight:50 you will get no results, as nothing matches
>>> (assuming a q.op of AND)
>>> 
>>> 
>>> On Thu, Feb 14, 2019 at 4:06 PM Nicolas Paris 
>>> wrote:
>>> 
 Hi
 
 I have a numeric field (say "weight") and I d'like to be able to get
 results sorted.
 q=kind:animal weight:50
 pf=kind^2 weight^3
 
 would return:
 name=dog, kind=animal, weight=51
 name=tiger, kind=animal,weight=150
 name=elephant, kind=animal,weight=2000
 
 
 In other terms how to deal with numeric fields ?
 
 My first idea is to encode numeric into letters (one x per value)
 dog x
 tiger x
 elephant
 
>>> 
 
 and the query would be
 kind:animal, weight:xxx
 
 
 How to deal with numeric fields ?
 
 Thanks
 --
 nicolas
 
>>> 
>> 
>> 
>> -- 
>> http://www.the111shift.com
> 
> -- 
> nicolas


Re: MLT and facetting

2019-02-25 Thread Dave
Use the MLT results to build the queries you then use for getting facets, in a 
two-search approach.

> On Feb 25, 2019, at 10:18 PM, Zheng Lin Edwin Yeo  
> wrote:
> 
> Hi Martin,
> 
> I think there are some pictures which are not being sent through in the
> email.
> 
> Do send your query that you are using, and which version of Solr you are
> using?
> 
> Regards,
> Edwin
> 
>> On Mon, 25 Feb 2019 at 20:54, Martin Frank Hansen (MHQ)  wrote:
>> 
>> Hi,
>> 
>> 
>> 
>> I am trying to combine the mlt functionality with facets, but Solr throws
>> org.apache.solr.common.SolrException: ":"Unable to compute facet ranges,
>> facet context is not set".
>> 
>> 
>> 
>> What I am trying to do is quite simple, find similar documents using mlt
>> and group these using the facet parameter. When using mlt and facets
>> separately everything works fine, but not when combining the functionality.
>> 
>> 
>> 
>> 
>> 
>> {
>> 
>>  "responseHeader":{
>> 
>>"status":500,
>> 
>>"QTime":109},
>> 
>>  "match":{"numFound":1,"start":0,"docs":[
>> 
>>  {
>> 
>>"Journalnummer":" 00759",
>> 
>>"id":"6512815"  },
>> 
>>  "response":{"numFound":602234,"start":0,"docs":[
>> 
>>  {
>> 
>>"Journalnummer":" 00759",
>> 
>>"id":"6512816",
>> 
>>  {
>> 
>>"Journalnummer":" 00759",
>> 
>>"id":"6834653"
>> 
>>  {
>> 
>>"Journalnummer":" 00739",
>> 
>>"id":"6202373"
>> 
>>  {
>> 
>>"Journalnummer":" 00739",
>> 
>>"id":"6748105"
>> 
>> 
>> 
>>  {
>> 
>>"Journalnummer":" 00803",
>> 
>>"id":"7402155"
>> 
>>  },
>> 
>>  "error":{
>> 
>>"metadata":[
>> 
>>  "error-class","org.apache.solr.common.SolrException",
>> 
>>  "root-error-class","org.apache.solr.common.SolrException"],
>> 
>>"msg":"Unable to compute facet ranges, facet context is not set",
>> 
>>"trace":"org.apache.solr.common.SolrException: Unable to compute facet
>> ranges, facet context is not set\n\tat
>> org.apache.solr.handler.component.RangeFacetProcessor.getFacetRangeCounts(RangeFacetProcessor.java:66)\n\tat
>> org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:331)\n\tat
>> org.apache.solr.handler.component.FacetComponent.getFacetCounts(FacetComponent.java:295)\n\tat
>> org.apache.solr.handler.MoreLikeThisHandler.handleRequestBody(MoreLikeThisHandler.java:240)\n\tat
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
>> org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)\n\tat
>> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)\n\tat
>> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)\n\tat
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
>> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
>> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
>> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
>> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
>> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
>> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
>> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
>> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
>> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
>> org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat
>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnect

Re: MLT and facetting

2019-02-28 Thread Dave
I’m more curious what you’d expect to see, and what possible benefit you could 
get from it

> On Feb 28, 2019, at 8:48 PM, Zheng Lin Edwin Yeo  wrote:
> 
> Hi Martin,
> 
> I have no idea on this, as the case has not been active for almost 2 years.
> Maybe I can try to follow up.
> 
> Faceting by default will show the list according to the number of
> occurrences. But I'm not sure how it will affect the MLT score or how it
> will be output when combine together, as it is not working currently and
> there is no way to test.
> 
> Regards,
> Edwin
> 
>> On Thu, 28 Feb 2019 at 14:51, Martin Frank Hansen (MHQ)  wrote:
>> 
>> Hi Edwin,
>> 
>> Ok that is nice to know. Do you know when this bug will get fixed?
>> 
>> By ordering I mean that MLT score the documents according to its
>> similarity function (believe it is cosine similarity), and I don’t know how
>> faceting will affect this score? Or ignore it all together?
>> 
>> Best regards
>> 
>> Martin
>> 
>> 
>> Internal - KMD A/S
>> 
>> -Original Message-
>> From: Zheng Lin Edwin Yeo 
>> Sent: 28. februar 2019 06:19
>> To: solr-user@lucene.apache.org
>> Subject: Re: MLT and facetting
>> 
>> Hi Martin,
>> 
>> According to the JIRA, it says it is a bug, as it was working previously
>> in Solr 4. I have not tried Solr 4 before, so I'm not sure how it works.
>> 
>> For the ordering of the documents, do you mean to sort them according to
>> the criteria that you want?
>> 
>> Regards,
>> Edwin
>> 
>> On Wed, 27 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
>> wrote:
>> 
>>> Hi Edwin,
>>> 
>>> Thanks for your response. Are you sure it is a bug? Or is it not meant
>>> to work together?
>>> After doing some thinking I do see a problem faceting a MLT-result.
>>> MLT-results have a clear ordering of the documents which will be hard
>>> to maintain with facets. How will faceting MLT-results deal with the
>>> ordering of the documents? Will the ordering just be ignored?
>>> 
>>> Best regards
>>> 
>>> Martin
>>> 
>>> 
>>> 
>>> Internal - KMD A/S
>>> 
>>> -Original Message-
>>> From: Zheng Lin Edwin Yeo 
>>> Sent: 27. februar 2019 03:38
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: MLT and facetting
>>> 
>>> Hi Martin,
>>> 
>>> I also get the same problem in Solr 7.7 if I turn on faceting in /mlt
>>> requestHandler.
>>> 
>>> Found this issue in the JIRA:
>>> https://issues.apache.org/jira/browse/SOLR-7883
>>> Seems like it is a bug in Solr and it has not been resolved yet.
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> On Tue, 26 Feb 2019 at 21:03, Martin Frank Hansen (MHQ) 
>>> wrote:
>>> 
 Hi Edwin,
 
 Here it is:
 
 
 
 
 
 -
 
 
 -
 
 text
 
 1
 
 1
 
 true
 
 
 
 
 
 
 Internal - KMD A/S
 
 -Original Message-
 From: Zheng Lin Edwin Yeo 
 Sent: 26. februar 2019 08:24
 To: solr-user@lucene.apache.org
 Subject: Re: MLT and facetting
 
 Hi Martin,
 
 What is your setting in your /mlt requestHandler in solrconfig.xml?
 
 Regards,
 Edwin
 
 On Tue, 26 Feb 2019 at 14:43, Martin Frank Hansen (MHQ) 
 wrote:
 
> Hi Edwin,
> 
> Thanks for your response.
> 
> Yes you are right. It was simply the search parameters from Solr.
> 
> The query looks like this:
> 
> http://
> .../solr/.../mlt?df=text&facet.field=Journalnummer&facet=on&fl=id,
> Jo
> ur
> nalnummer&q=id:*6512815*
> 
> best regards,
> 
> Martin
> 
> 
> Internal - KMD A/S
> 
> -Original Message-
> From: Zheng Lin Edwin Yeo 
> Sent: 26. februar 2019 03:54
> To: solr-user@lucene.apache.org
> Subject: Re: MLT and facetting
> 
> Hi Martin,
> 
> I think there are some pictures which are not being sent through
> in the email.
> 
> Do send your query that you are using, and which version of Solr
> you are using?
> 
> Regards,
> Edwin
> 
> On Mon, 25 Feb 2019 at 20:54, Martin Frank Hansen (MHQ)
> 
> wrote:
> 
>> Hi,
>> 
>> 
>> 
>> I am trying to combine the mlt functionality with facets, but
>> Solr throws
>> org.apache.solr.common.SolrException: ":"Unable to compute facet
>> ranges, facet context is not set".
>> 
>> 
>> 
>> What I am trying to do is quite simple, find similar documents
>> using mlt and group these using the facet parameter. When using
>> mlt and facets separately everything works fine, but not when
>> combining the
> functionality.
>> 
>> 
>> 
>> 
>> 
>> {
>> 
>>  "responseHeader":{
>> 
>>"status":500,
>> 
>>"QTime":109},
>> 
>>  "match":{"numFound":1,"start":0,"docs":[
>> 
>>  {
>> 
>>"Journalnummer":" 00759",
>> 
>>"id":"6512815"  },
>> 
>>  "response":{"numFound":602234,

Re: SOLR Text Field

2019-04-06 Thread Dave
Wow. Ok dude, relax and take a nap. It sounds like you don’t even have a core 
defined. Maybe you do and I’m reaching a bit, but start there: solr is super 
simple and only gets complicated when you’re complicated. 
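
(To spell out the usual gotcha: the type attribute on a <field> has to name a <fieldType> defined in the schema, not the Java class. TextField is the class; text_general is a type name that happens to use it. A minimal sketch, with the field name taken from the post and everything else as an example:)

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="metatag.myfield" type="text_general" indexed="true" stored="true"/>

The "For input string" error is a different problem: it usually means the field ended up with a numeric type (often via add-unknown-fields guessing), so text like 15c0188 cannot be parsed into it.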

> On Apr 6, 2019, at 8:59 AM, Dave Beckstrom  wrote:
> 
> Hi Everyone,
> 
> I'm really hating SOLR.   All I want is to define a text field that data
> can be indexed into and which is searchable.  Should be super simple.  But
> I run into issue after issue.  I'm running SOLR 7.3 because it's compatible
> with the version of NUTCH I'm running.
> 
> The docs say that SOLR ships with a default TextField but that seems to be
> wrong.  I define:
> 
> 
>  indexed="true"/>
> 
> The above throws error  "Unable to create core [MyCore] Caused by: Unknown
> fieldType 'TextField' specified on field metadata.myfield"
> 
> Then I try:
> 
> 
> 
> Same error.
> 
> Then as a workaround I got into defining a "Text_General" field because I
> couldn't get Text to work.  Text_General extends the Text field which seems
> to indicate there should be a text field built into SOLR!
> 
> Text_General causes a new set of problems.   How does one go about using
> the supposed default text field available in SOLR?
> 
> When I defined Text_General:
> 
>  name="add-schema-fields">
>
>  java.lang.String
>  text_general
>  true
>
> 
> Text_General with type=string complains when I try and insert data that has
> characters and numbers:
> 
> java.lang.Exception:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at http://127.0.0.1:/solr/MyCore: ERROR: [doc=
> http://xxx.xxx.com/services/mydocument.htm] Error adding field
> 'metatag.myfield'='15c0188' msg=For input string: "15c0188"
> 
> I'm very frustrated.  If anyone is able to help sort this out I would
> really appreciate it!  What do I need to do to be able to define a simple
> text field that is stored and searchable?
> 
> Thank you!
> 
> -- 
> *Fig Leaf Software, Inc.* 
> https://www.figleaf.com/ 
> <https://www.figleaf.com/>  
> 
> Full-Service Solutions Integrator
> 
> 
> 
> 
> 
> 


Re: Sorting on ip address

2018-06-18 Thread Dave
Store it as an untokenized string (a single atom) rather than as an IP address type. 
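
(One common approach, sketched with invented field names: keep the raw address in one string field and a zero-padded copy in another, so a plain lexicographic sort lines up with numeric order:)

  <field name="ip"      type="string" indexed="true" stored="true"/>
  <field name="ip_sort" type="string" indexed="true" stored="false" docValues="true"/>

  <!-- the client pads each octet before indexing, e.g. 10.2.30.4 -> 010.002.030.004,
       then queries with sort=ip_sort asc -->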

> On Jun 18, 2018, at 12:14 PM, root23  wrote:
> 
> Hi all,
> is there a  built in data type which i can use for ip address which can
> provide me sorting ip address based on the class? if not then what is the
> best way to sort based on ip address ?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Clarity on Stable Release

2020-01-29 Thread Dave
But!

If we don’t have people throwing a new release into production and finding real 
world problems, we can’t trust that the current release’s problems will be exposed 
and then remedied, so it’s a double-edged sword. I personally agree with 
staying a major version back, but that’s because it takes a long time to 
reindex another terabyte in combined indexes when a bug is found. However, 
that’s not the norm, and I’m on an edge case where a full reindex takes a few 
weeks or longer; if it were less than a day or so I would be on 8.x.

> On Jan 29, 2020, at 7:43 PM, Jeff  wrote:
> 
> Thanks Shawn! Your answer is very helpful. Especially your note about
> keeping up to date with the latest major version after a number of releases.
> 
>> On Wed, Jan 29, 2020 at 6:35 PM Shawn Heisey  wrote:
>> 
>>> On 1/29/2020 11:24 AM, Jeff wrote:
>>> Now, we are considering 8.2.0, 8.3.1, or 8.4.1 to use as they seem to be
>>> stable. But it is hard to determine if we should be using the bleeding
>> edge
>>> or a few minor versions back since each of  these includes many bug
>> fixes.
>>> It is unclear to me why some fixes get back-patched and why some are
>>> released under new minor version changes (which include some hefty
>>> improvements and features).
>> 
>> 
>> 
>>> 
>>> To clarify, I am mostly asking for some clarity on which versions
>> *should*
>>> be used for a stable system and that we somehow can make it more clear in
>>> the future. I am not trying to point the finger at specific bugs, but am
>>> simply using them as examples as to why it is hard to determine a release
>>> as stable.
>>> 
>>> If anybody has insight on this, please let me know.
>> 
>> My personal thought about any particular major version is that before
>> using that version, it's a good idea to wait for a few releases, so that
>> somebody braver than me can find the really big problems.
>> 
>> If 8.x were still brand new, I'd run the latest version of 7.x.  Since
>> 8.x has had a number of releases, my current thought for a new
>> deployment would be to run the latest version of 8.x.  I would also plan
>> on watching for new issues and being aggressive about upgrading to
>> future 8.x versions.  I would maintain a test environment to qualify
>> those releases.
>> 
>> All releases are called "stable".  That is the intent with any release
>> -- for it to be good enough for anyone to use in production.  Sometimes
>> we find problems after release.  When a problem is noted, we almost
>> always create a test that will alert us if that problem should resurface.
>> 
>> What you refer to as "bleeding edge" is the master branch, and that
>> branch is never used to create releases.
>> 
>> Thanks,
>> Shawn
>> 


Re: Upgrading Solrcloud indexes from 7.2 to 8.4.1

2020-03-06 Thread Dave
You're best off doing a full reindex to a single solr cloud 8.x node, and then when 
done start taking down 7.x nodes, upgrade them to 8.x and add them to the new 
cluster. Upgrading indexes has so many potential issues. 

> On Mar 6, 2020, at 9:21 PM, lstusr 5u93n4  wrote:
> 
> Hi Webster,
> 
> When we upgraded from 7.5 to 8.1 we ran into a very strange issue:
> https://lucene.472066.n3.nabble.com/Stored-field-values-don-t-update-after-7-gt-8-upgrade-td4442934.html
> 
> 
> We ended up having to do a full re-index to solve this issue, but if you're
> going to do this upgrade I would love to know if this issue shows up for
> you too. At the very least, I'd suggest doing some variant of the test
> outlined in that post, so you can be confident in your data integrity.
> 
> Kyle
> 
>> On Fri, 6 Mar 2020 at 14:08, Webster Homer 
>> wrote:
>> 
>> We are looking at upgrading our Solrcoud  instances from 7.2 to the most
>> recent version of solr 8.4.1 at this time. The last time we upgraded a
>> major solr release we were able to upgrade the index files to the newer
>> version, this prevented us from having an outage. Subsequently we've
>> reindexed all our collections. However the Solr documentation for 8.4.1
>> states that we need to be at Solr 7.3 or later to run the index upgrade.
>> https://lucene.apache.org/solr/guide/8_4/solr-upgrade-notes.html
>> 
>> So if we upgrade to 7.7,  and then move to 8.4.1  and run the index
>> upgrade script just once?
>> I guess I'm confused about the 7.2 -> 8.* issue is it data related?
>> 
>> Regards,
>> Webster
>> 
>> 
>> 
>> This message and any attachment are confidential and may be privileged or
>> otherwise protected from disclosure. If you are not the intended recipient,
>> you must not copy this message or attachment or disclose the contents to
>> any other person. If you have received this transmission in error, please
>> notify the sender immediately and delete the message and any attachment
>> from your system. Merck KGaA, Darmstadt, Germany and any of its
>> subsidiaries do not accept liability for any omissions or errors in this
>> message which may arise as a result of E-Mail-transmission or for damages
>> resulting from any unauthorized changes of the content of this message and
>> any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
>> subsidiaries do not guarantee that this message is free of viruses and does
>> not accept liability for any damages caused by any virus transmitted
>> therewith.
>> 
>> 
>> 
>> Click http://www.merckgroup.com/disclaimer to access the German, French,
>> Spanish and Portuguese versions of this disclaimer.
>> 


Re: Script to check if solr is running

2020-06-08 Thread Dave
A simple Perl script would be able to cover this. I have a cron job Perl script 
that does a search with an expected result; if the result isn’t there it fails 
over to a backup search server and sends me an email, and I fix what’s wrong. The 
backup search server is a direct clone of the live server and just as strong, so 
there's no interruption (aside from the five-minute window). 

If you need a hand with this I’d gladly help. Everything I run is Linux based, 
but it’s a simple curl command and a server switch on failure. 
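
(If it helps, the request the cron script makes doesn't have to be a real user query; Solr's ping handler exists for exactly this. A rough solrconfig.xml sketch follows, with the query being just a placeholder:)

  <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">id:healthcheck</str>
    </lst>
  </requestHandler>

The script then only has to check that GET /solr/<core>/admin/ping comes back with status OK, and fail over or restart when it doesn't.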

> On Jun 8, 2020, at 12:14 PM, Jörn Franke  wrote:
> 
> Use the solution described by Walter. This allows you to automatically 
> restart in case of failure and is also cleaner than defining a cronjob. 
> Otherwise This would be another dependency one needs to keep in mind - means 
> if there is an issue and someone does not know the system the person has to 
> look at different places which never is good 
> 
>> Am 04.06.2020 um 18:36 schrieb Ryan W :
>> 
>> Does anyone have a script that checks if solr is running and then starts it
>> if it isn't running?  Occasionally my solr stops running even if there has
>> been no Apache restart.  I haven't been able to determine the root cause,
>> so the next best thing might be to check every 15 minutes or so if it's
>> running and run it if it has stopped.
>> 
>> Thanks.


Re: How to determine why solr stops running?

2020-06-09 Thread Dave
I’ll add that whenever I’ve had a solr instance shut down, for me it’s been a 
hardware failure. Either the ram or the disk got a “glitch” and both of these 
are relatively fragile and wear and tear type parts of the machine, and should 
be expected to fail and be replaced from time to time. Solr is pretty 
aggressive with its logging, so there are a lot of writes always happening and 
of course reads; if the disk or the memory has any issues it can lock up and 
bring her down, more so if you have any spellcheck dictionaries or suggesters 
being built on startup. 

Just my experience with this, could be wrong (most likely wrong) but we always 
have extra drives and memory around the server room for this reason.  At least 
once or twice a year we will have a disk failure in the raid and need to swap 
in a new one. 

Good luck though; also, solr should be logging its failures, so it would be good 
to look there too.

> On Jun 9, 2020, at 2:35 AM, Shawn Heisey  wrote:
> 
> On 5/14/2020 7:22 AM, Ryan W wrote:
>> I manage a site where solr has stopped running a couple times in the past
>> week. The server hasn't been rebooted, so that's not the reason.  What else
>> causes solr to stop running?  How can I investigate why this is happening?
> 
> Any situation where Solr stops running and nobody requested the stop is a 
> result of a serious problem that must be thoroughly investigated.  I think 
> it's a bad idea for Solr to automatically restart when it stops unexpectedly. 
>  Chances are that whatever caused the crash is going to simply make the crash 
> happen again until the problem is solved. Automatically restarting could hide 
> problems from the system administrator.
> 
> The only way a Solr auto-restart would be acceptable to me is if it sends a 
> high priority alert to the sysadmin EVERY time it executes an auto-restart.  
> It really is that bad of a problem.
> 
> The causes of Solr crashes (that I can think of) include the following. I 
> believe I have listed these four options from most likely to least likely:
> 
> * Java OutOfMemoryError exceptions.  On non-windows systems, the "bin/solr" 
> script starts Solr with an option that results in Solr's death anytime one of 
> these exceptions occurs.  We do this because program operation is 
> indeterminate and completely unpredictable when OOME occurs, so it's far 
> safer to stop running.  That exception can be caused by several things, some 
> of which actually do not involve memory at all.  If you're running on Windows 
> via the bin\solr.cmd command, then this will not happen ... but OOME could 
> still cause a crash, because as I already mentioned, program operation is 
> unpredictable when OOME occurs.
> 
> * The OS kills Solr because system memory is completely exhausted and Solr is 
> the process using the most memory.  Linux calls this the "oom-killer" ... I 
> am pretty sure something like it exists on most operating systems.
> 
> * Corruption somewhere in the system.  Could be in Java, the OS, Solr, or 
> data used by any of those.
> 
> * A very serious bug in Solr's code that we haven't discovered yet.
> 
> I included that last one simply for completeness.  A bug that causes a crash 
> *COULD* exist, but as of right now, we have not seen any supporting evidence.
> 
> My guess is that Java OutOfMemoryError is the cause here, but I can't be 
> certain.  If that is happening, then some resource (which might not be 
> memory) is fully depleted.  We would need to see the full OutOfMemoryError 
> exception in order to determine why it is happening. Sometimes the exception 
> is logged in solr.log, sometimes it isn't.  We cannot predict what part of 
> the code will be running when OOME occurs, so it would be nearly impossible 
> for us to guarantee logging.  OOME can happen ANYWHERE - even in code that 
> the compiler thinks is immune to exceptions.
> 
> Side note to fellow committers:  I wonder if we should implement an uncaught 
> exception handler in Solr.  I have found in my own programs that it helps 
> figure out thorny problems.  And while I am on the subject of handlers that 
> might not be general knowledge, I didn't find a shutdown hook or a security 
> manager outside of tests.
> 
> Thanks,
> Shawn


Re: Getting rid of zookeeper

2020-06-09 Thread Dave
Is it horrible that I’m already burnt out from just reading that?

I’m going to stick to the classic solr master-slave setup for the foreseeable 
future; at least that lets me focus more on the search theory rather than the 
back end system non stop. 

> On Jun 9, 2020, at 5:11 PM, Vincenzo D'Amore  wrote:
> 
> My 2 cents, I have few solrcloud productions installations, I would share
> some thoughts of what I learned in the latest 4/5 years (fwiw) just as they
> come out of my mind.
> 
> - to configure a SolrCloud *production* Cluster you have to be a zookeeper
> expert even if you only need Solr.
> - the Zookeeper ensemble (3 or 5 zookeeper nodes) is recommended to run on
> separate machines but for many customers this is too expensive. And for the
> rest it is expensive just to have the instances (i.e. dockers). It is
> expensive even to have people that know Zookeeper or even only train them.
> - given the high availability function of a zookeeper cluster you have
> to monitor it and promptly backup and restore. But it is hard to monitor
> (and configure the monitoring) and it is even harder to backup and restore
> (when it is running).
> - You can't add or remove nodes in zookeeper when it is up. Only the latest
> version should finally give the possibility to add/remove nodes when it is
> running, but afak this is not still supported by SolrCloud (out of the box).
> - many people fail when they try to run a SolrCloud cluster because it is
> hard to set up, for example: SolrCloud zkcli runs poorly on windows.
> - it is hard to admin the zookeeper remotely, basically there are no
> utilities that let you easily list/read/write/delete files on a zookeeper
> filesystem.
> - it was really hard to create a zookeeper ensemble in kubernetes, only
> recently appeared few solutions. This was so counter-productive for the
> Solr project because now the world is moving to Kubernetes, and there is
> basically no support.
> - well, after all these troubles, when the solrcloud clusters are
> configured correctly then, well, they are solid (rock?). And even if few
> Solr nodes/replicas went down the entire cluster can restore itself almost
> automatically, but how much work.
> 
> Believe me, I like Solr, but at the end of this long journey, sometimes I
> would really use only paas/saas instead of having to deal with all these
> troubles.


Re: ***URGENT***Re: Questions about Solr Search

2020-07-03 Thread Dave
Seriously. Doug answered all of your questions. 

> On Jul 3, 2020, at 6:12 AM, Atri Sharma  wrote:
> 
> Please do not cross post. I believe your questions were already answered?
> 
>> On Fri, Jul 3, 2020 at 3:08 PM Gautam K  wrote:
>> 
>> Since it's a bit of an urgent request so if could please help me on this by 
>> today it will be highly appreciated.
>> 
>> Thanks & Regards,
>> Gautam Kanaujia
>> 
>>> On Thu, Jul 2, 2020 at 7:49 PM Gautam K  wrote:
>>> 
>>> Dear Team,
>>> 
>>> Hope you all are doing well.
>>> 
>>> Can you please help with the following question? We are using Solr search 
>>> in our Organisation and now checking whether Solr provides search 
>>> capabilities like Google Enterprise search(Google Knowledge Graph Search).
>>> 
>>> 1, Does Solr Search provide Voice Search like Google?
>>> 2. Does Solar Search provide NLP Search(Natural Language Processing)?
>>> 3. Does Solr have all the capabilities which Google Knowledge Graph 
>>> provides like below?
>>> 
>>> Getting a ranked list of the most notable entities that match certain 
>>> criteria.
>>> Predictively completing entities in a search box.
>>> Annotating/organizing content using the Knowledge Graph entities.
>>> 
>>> 
>>> Your help will be appreciated highly.
>>> 
>>> Many thanks
>>> Gautam Kanaujia
>>> India
> 
> -- 
> Regards,
> 
> Atri
> Apache Concerted


Re: sorting help

2020-07-15 Thread Dave
That’s a good place to start. The idea was to make sure titles that started 
with a date would not always be at the forefront and the actual title of the 
doc would be sorted. 
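
(For reference, the kind of sort chain being discussed is a single-token one; a rough sketch of the field type, with the type name invented here:)

  <fieldType name="alpha_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    </analyzer>
  </fieldType>

Anything that is not a-z (digits, punctuation, spaces) is stripped before sorting, which is what keeps leading dates out of the sort order.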

> On Jul 15, 2020, at 4:58 PM, Erick Erickson  wrote:
> 
> Yeah, it’s always a question “how much is enough/too much”.
> 
> That looks reasonable for alphatitle, but what about title? Your original
> question was that the sorting changes depending on which field you 
> sort on. If your title field uses something that tokenizes or doesn’t
> include the same analysis chain (particularly the lowercasing
> and patternreplace) then I’d expect the order to change.
> 
> Best,
> Erick
> 
>> On Jul 15, 2020, at 4:49 PM, David Hastings  
>> wrote:
>> 
>> thanks, ill check the admin, didnt want to send a big block of text but:
>> 
>> 
>>  -
>> -
>> 
>> Tokenizer:
>>   solr.KeywordTokenizerFactory (org.apache.lucene.analysis.core.KeywordTokenizerFactory, luceneMatchVersion: 7.1.0)
>>
>> Token Filters:
>>   solr.LowerCaseFilterFactory (org.apache.lucene.analysis.core.LowerCaseFilterFactory, luceneMatchVersion: 7.1.0)
>>   solr.TrimFilterFactory (org.apache.lucene.analysis.miscellaneous.TrimFilterFactory, luceneMatchVersion: 7.1.0)
>>   solr.PatternReplaceFilterFactory (org.apache.lucene.analysis.pattern.PatternReplaceFilterFactory, pattern: ([^a-z]), replacement: "", replace: all, luceneMatchVersion: 7.1.0)
>>  -
>> 
>>  Query Analyzer:
>>  
>> 
>>  org.apache.solr.analysis.TokenizerChain
>> -
>> 
>> Tokenizer:
>>   solr.KeywordTokenizerFactory (org.apache.lucene.analysis.core.KeywordTokenizerFactory, luceneMatchVersion: 7.1.0)
>>
>> Token Filters:
>>   solr.LowerCaseFilterFactory (org.apache.lucene.analysis.core.LowerCaseFilterFactory, luceneMatchVersion: 7.1.0)
>>   solr.TrimFilterFactory (org.apache.lucene.analysis.miscellaneous.TrimFilterFactory, luceneMatchVersion: 7.1.0)
>>   solr.PatternReplaceFilterFactory (org.apache.lucene.analysis.pattern.PatternReplaceFilterFactory, pattern: ([^a-z]), replacement: "", replace: all, luceneMatchVersion: 7.1.0)
>> 
>> 
>>> On Wed, Jul 15, 2020 at 4:47 PM Erick Erickson 
>>> wrote:
>>> 
>>> I’d look two places:
>>> 
>>> 1> try the admin/analysis page from the admin UI. In particular, look at
>>> what tokens actually get in the index.
>>> 
>>> 2> again, the admin UI will let you choose the field (alphatitle and
>>> title) and see what the actual indexed tokens are.
>>> 
>>> Both have the issue that I don’t know what tokenizer you are using. For
>>> sorting it better be something
>>> like KeywordTokenizer. Anything that breaks up the input into separate
>>> tokens will produce surprises.
>>> 
>>> And unless you have lowercaseFilter in front of your patternreplace,
>>> you’re removing uppercase characters.
>>> 
>>> Best,
>>> Erick
>>> 
 On Jul 15, 2020, at 3:06 PM, David Hastings <
>>> hastings.recurs...@gmail.com> wrote:
 
 howdy,
 i have a field that sorts fine all other content, and i cant seem to
>>> debug
 why it wont sort for me on this one chunk of it.
 "sort":"alphatitle asc", "debugQuery":"on", "_":"1594733127740"}},
>>> "response
 ":{"numFound":3,"start":0,"docs":[ { "title":"Money orders", {
 "title":"Finance,
 consolidation and rescheduling of debts", { "title":"Rights in former
 German Islands in Pacific", },
 
 its using a copyfield from "title" to "alphatitle" that replaces all
 punctuation
 pattern: ([^a-z])replace: allclass: solr.PatternReplaceFilterFactory
 
 and if i use just title it flips:
 
 "title":"Finance, consolidation and rescheduling of debts"}, {
>>> "title":"Rights
 in former German Islands in Pacific"}, { "title":"Money orders"}]
 
 and im banging my head trying to figure out what it is about this
 content in particular that is not sorting the way I would expect.
 don't suppose someone would be able to lead me to a good place to look?
>>> 
>>> 
> 


Re: solr startup

2020-08-07 Thread Dave
It sounds like you have suggester indexes being built on startup.  Without them 
they just come up in a second or so
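
(If that is the cause, the usual fix is in the suggester definition in solrconfig.xml; a rough sketch, with the suggester name, lookup implementation and field being placeholders:)

  <searchComponent name="suggest" class="solr.SuggestComponent">
    <lst name="suggester">
      <str name="name">mySuggester</str>
      <str name="lookupImpl">FuzzyLookupFactory</str>
      <str name="field">title</str>
      <str name="buildOnStartup">false</str>
      <str name="buildOnCommit">false</str>
    </lst>
  </searchComponent>

With buildOnStartup=false the cores load immediately, and the suggester can be built later by passing suggest.build=true on a request.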

> On Aug 7, 2020, at 6:03 PM, Schwartz, Tony  wrote:
> 
> I have many collections.  When I start solr, it takes 30 - 45 minutes to 
> start up and load all the collections.  My collections are named per day.  
> During startup, solr loads the collections in alpha-numeric name order.  I 
> would like solr to load the collections in the descending order.  So the most 
> recent collections are loaded first and are available for searching while the 
> older collections are not as important.  Is this possible?
> 
> 


Re: solr startup

2020-08-08 Thread Dave
Ah, glad you found it. Yeah, warming queries are much better substituted with 
home-made scripts if you need them. I like to take the previous day's logs and 
run the last couple hundred queries or so on a cron in the morning. 
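
(For anyone hitting the same slow startup: the warming queries in question live in a newSearcher/firstSearcher listener in solrconfig.xml, roughly like the sketch below, and removing or trimming that block is the change being described; the query itself is only an example:)

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst><str name="q">*:*</str><str name="sort">timestamp desc</str></lst>
    </arr>
  </listener>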

> On Aug 8, 2020, at 9:39 AM, Schwartz, Tony  wrote:
> 
> I did not have a suggester set up.  I disabled the spell checker component, 
> but that wasn't the problem.  I found my issue... it was related to a warming 
> query i was running for each newly opened searcher.  Early on I enabled that, 
> but I completely forgot about it.  And i don't believe it's needed.  I was 
> hoping it would help with performance related to time filtering and sorting.  
> But, now it seems to be performing quite well without it.
> 
> Tony
> 
> 
> 
> From: Schwartz, Tony
> Sent: Friday, August 7, 2020 6:27 PM
> To: solr-user@lucene.apache.org
> Subject: RE: solr startup
> 
> suggester?  what do i need to look for in the configs?
> 
> Tony
> 
> 
> 
> Sent from my Verizon, Samsung Galaxy smartphone
> 
> 
> 
>  Original message 
> From: Dave <hastings.recurs...@gmail.com>
> Date: 8/7/20 18:23 (GMT-05:00)
> To: solr-user@lucene.apache.org
> Subject: Re: solr startup
> 
> It sounds like you have suggester indexes being built on startup.  Without 
> them they just come up in a second or so
> 
>> On Aug 7, 2020, at 6:03 PM, Schwartz, Tony <tony.schwa...@cinbell.com> wrote:
>> 
>> I have many collections.  When I start solr, it takes 30 - 45 minutes to 
>> start up and load all the collections.  My collections are named per day.  
>> During startup, solr loads the collections in alpha-numeric name order.  I 
>> would like solr to load the collections in the descending order.  So the 
>> most recent collections are loaded first and are available for searching 
>> while the older collections are not as important.  Is this possible?
>> 
>> 


Re: Solr endpoint on the public internet

2020-10-08 Thread Dave
#1. This is a HORRIBLE IDEA.
#2. If I were going to do this, I would destroy the update request handler as well 
as the entire admin UI on that Solr instance, and set up replication from a 
secure Solr instance on an interval. That way no one could send an update or 
delete command, you could still update the index, and it would still be readable. Just 
remove any request handler that isn't a search or replicate handler, and put the 
replication only on a port shared between the master and slave.
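
As a sketch of the read-only slave side of that setup (the host name, core name and interval are made up):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- pull the index from an internal, non-public master -->
    <str name="masterUrl">http://internal-master:8983/solr/mycore/replication</str>
    <str name="pollInterval">00:15:00</str>
  </lst>
</requestHandler>

The public node then only needs /select and /replication; everything else, including /update and the admin UI, can be removed or blocked.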

> On Oct 8, 2020, at 2:27 PM, Marco Aurélio  wrote:
> 
> Hi!
> 
> We're looking into the option of setting up search with Solr without an
> intermediary application. This would mean our backend would index data into
> Solr and we would have a public Solr endpoint on the internet that would
> receive search requests directly.
> 
> Since I couldn't find an existing solution similar to ours, I would like to
> know whether it's possible to secure Solr in a way that allows anyone only
> read-access only to collections and how to achieve that. Specifically
> because of this part of the documentation:
> 
> *No Solr API, including the Admin UI, is designed to be exposed to
> non-trusted parties. Tune your firewall so that only trusted computers and
> people are allowed access. Because of this, the project will not regard
> e.g., Admin UI XSS issues as security vulnerabilities. However, we still
> ask you to report such issues in JIRA.*
> Is there a way we can restrict read-only access to Solr collections so as
> to allow users to make search requests directly to it or should we always
> keep our Solr instances completely private?
> 
> Thanks in advance!
> 
> Best regards,
> Marco Godinho


Re: Avoiding single digit and single charcater ONLY query by putting them in stopwords list

2020-10-27 Thread Dave
Agreed. Just a JavaScript check on the input box would work fine for 99% of 
cases, unless something automatic is running them in which case just server 
side redirect back to the form. 

> On Oct 27, 2020, at 11:54 AM, Mark Robinson  wrote:
> 
> Hi  Konstantinos ,
> 
> Thanks for the reply.
> I too feel the same. Wanted to find what others also in the Solr world
> thought about it.
> 
> Thanks!
> Mark.
> 
>> On Tue, Oct 27, 2020 at 11:45 AM Konstantinos Koukouvis <
>> konstantinos.koukou...@mecenat.com> wrote:
>> 
>> Oh hi Mark!
>> 
>> Why would you wanna do such a thing in the solr end. Imho it would be much
>> more clean and easy to do it on the client side
>> 
>> Regards,
>> Konstantinos
>> 
>> 
 On 27 Oct 2020, at 16:42, Mark Robinson  wrote:
>>> 
>>> Hello,
>>> 
>>> I want to block queries having only a digit like "1" or "2" ,... or
>>> just a letter like "a" or "b" ...
>>> 
>>> Is it a good idea to block them ... ie just single digits 0 - 9 and  a -
>> z
>>> by putting them as a stop word? The problem with this I can anticipate
>> is a
>>> query like "1 inch screw" can have the important information "1" stripped
>>> out if I tokenize it.
>>> 
>>> So what would be a good way to avoid  single digit only and single letter
>>> only queries, from the Solr end?
>>> Or should I not do this at the Solr end at all?
>>> 
>>> Could someone please share your thoughts?
>>> 
>>> Thanks!
>>> Mark
>> 
>> ==
>> Konstantinos Koukouvis
>> konstantinos.koukou...@mecenat.com
>> 
>> Using Golang and Solr? Try this: https://github.com/mecenat/solr
>> 
>> 
>> 
>> 
>> 
>> 


Re: Need help to resolve Apache Solr vulnerability

2020-11-12 Thread Dave
Solr isn't meant to be public-facing. Not sure how anyone would send these 
commands, since it can't be reached from the outside world.

> On Nov 12, 2020, at 7:12 AM, Sheikh, Wasim A. 
>  wrote:
> 
> Hi Team,
> 
> Currently we are facing the below vulnerability for Apache Solr tool. So can 
> you please check the below details and help us to fix this issue.
> 
> /etc/init.d/solr-master version
> 
> Server version: Apache Tomcat/7.0.62
> Server built: May 7 2015 17:14:55 UTC
> Server number: 7.0.62.0
> OS Name: Linux
> OS Version: 2.6.32-431.29.2.el6.x86_64
> Architecture: amd64
> JVM Version: 1.8.0_20-b26
> JVM Vendor: Oracle Corporation
> 
> 
> "solr-spec-version":"4.10.4",
> Solr is an enterprise search platform.
> Solr is prone to remote code execution vulnerability.
> 
> Affected Versions:
> Apache Solr version prior to 6.6.2 and prior to 7.1.0
> 
> QID Detection Logic (Unauthenticated):
> This QID sends specifically crafted request which include special entities in 
> the xml document and looks for the vulnerable response.
> Alternatively, in another check, this QID matches vulnerable versions in the 
> response webpage
> Successful exploitation allows attacker to execute arbitrary code.
> The vendor has issued updated packages to fix this vulnerability. For more 
> information about the vulnerability and obtaining patches, refer to the 
> following Fedora security advisories: Apache Solr 6.6.2 (https://lucene.apache.org/solr/news.html). 
> More information regarding the update can be found at Apache Solr 7.1.0 
> (https://lucene.apache.org/solr/news.html).
> 
> 
> 
> 
> 
> 
> 
> Patch:
> Following are links for downloading patches to fix the vulnerabilities:
>  https://lucene.apache.org/solr/news.html"; TARGET="_blank">Apache 
> Solr 6.6.2 https://lucene.apache.org/solr/news.html"; 
> TARGET="_blank">Apache Solr 7.1.0
> 
> 
> Thanks...
> Wasim Shaikh
> 
> 
> 


Re: Recovering deleted files without backup

2020-11-13 Thread Dave
Just rebuild the index. Pretty sure they're gone if they aren't in your VM 
backup. Solr isn't a document storage tool; it's a place to index the data 
from your document store, so it's understood more or less that it can always be 
rebuilt when needed.

> On Nov 13, 2020, at 9:52 PM, Alex Hanna  wrote:
> 
> Hi all,
> 
> I've accidentally deleted some documents and am trying to recover them.
> Unfortunately I don't have a snapshot or backup of the core, but have daily
> backups of my VM. When my sysadmin restores the data folder, however,
> the documents don't come back for some reason.
> 
> I'm running a pretty old version of Solr (5.x). Also, it looks like the
> only new files created recently are .liv files, which were created at the
> time of deletion, and also a segment_ file.
> 
> I'd love some guidance on this.
> 
> Thanks,
> - A
> 
> -- 
> Alex Hanna, PhD
> alex-hanna.com
> @alexhanna


Re: Solr8.7 - How to optmize my index ?

2020-12-02 Thread Dave
I'm going to go against the advice SLIGHTLY; it really depends on how your 
Solr server hosting is set up. If you're searching 
off the same Solr server you're indexing to, then don't ever optimize; it will 
take care of itself. People much smarter than us, like Erick/Walter/Yonik, have 
spent time on this, and if they say don't do it, don't do it.

 In my particular use case I do see a measured improvement from optimizing 
every three or four months.  In my case a large portion, over 75% of the 
documents, which each measure around 500KB to 3MB, get reindexed every month, as 
the fields in the documents change every month, while documents are added 
daily as well.  So when I can go from a 650GB index to a 450GB one once in a while, 
it makes a difference if I only have 500GB of memory to work with on the 
searchers and can fit all the segments straight into memory. Also, I use the old 
setup of master/slave, so my indexing server, when it's optimizing, has no 
impact on the searching servers.  Once the optimized index gets warmed back up 
in the searcher I do notice improvement in my qtimes (I like to think); however, 
I've been using the same integration process of occasional hard optimizations 
since 1.4, and it might just be that I like to watch the index inflate to three times 
the size and then shrivel up. Old habits die hard.

> On Dec 2, 2020, at 10:28 PM, Matheo Software  wrote:
> 
> Hi Erick,
> Hi Walter,
> 
> Thanks for these information,
> 
> I will study the Solr article you gave me seriously. 
> I thought it was important to always delete and optimize the collection.
> 
> More information concerning my collection,
> Index size is about 390GB for 130M docs (3-5KB / doc), around 25 fields 
> (indexed, stored).
> Every Tuesday I do an update of around 1M docs, and every Thursday I add new 
> docs (around 50,000).
> 
> Many thanks !
> 
> Regards,
> Bruno
> 
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Wednesday, December 2, 2020 14:07
> To: solr-user@lucene.apache.org
> Subject: Re: Solr8.7 - How to optmize my index ?
> 
> expungeDeletes is unnecessary, optimize is a superset of expungeDeletes.
> The key difference is commit=true. I suspect if you’d waited until your 
> indexing process added another doc and committed, you’d have seen the index 
> size drop.
> 
> Just to check, you send the command to my_core but talk about collections.
> Specifying the collection is sufficient, but I’ll assume that’s a typo and 
> you’re really saying my_collection.
> 
> I agree with Walter like I always do, you shouldn’t be running optimize 
> without some proof that it’s helping. About the only time I think it’s 
> reasonable is when you have a static index, unless you can demonstrate 
> improved performance. The optimize button was removed precisely because it 
> was so tempting. In much earlier versions of Lucene, it made a demonstrable 
> difference so was put front and center. In more recent versions of Solr 
> optimize doesn’t help nearly as much so it was removed.
> 
> You say you have 38M deleted documents. How many documents total? If this is 
> 50% of your index, that’s one thing. If it’s 5%, it’s certainly not worth the 
> effort. You’re rewriting 466G of index, if you’re not seeing demonstrable 
> performance improvements, that’s a lot of wasted effort…
> 
> See: https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/
> and the linked article for what happens in pre 7.5 solr versions.
> 
> Best,
> Erick
> 
>> On Dec 1, 2020, at 2:31 PM, Info MatheoSoftware  
>> wrote:
>> 
>> Hi All,
>> 
>> 
>> 
>> I found the solution, I must do :
>> 
>> curl 'http://xxx:8983/solr/my_core/update?commit=true&expungeDeletes=true'
>> 
>> 
>> 
>> It works fine
>> 
>> 
>> 
>> Thanks,
>> 
>> Bruno
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> From: Matheo Software [mailto:i...@matheo-software.com] 
>> Sent: Tuesday, December 1, 2020 13:28 
>> To: solr-user@lucene.apache.org 
>> Subject: Solr8.7 - How to optmize my index ?
>> 
>> 
>> 
>> Hi All,
>> 
>> 
>> 
>> With Solr5.4, I used the UI button but in Solr8.7 UI this button is missing.
>> 
>> 
>> 
>> So I decide to use the command line:
>> 
>> curl http://xxx:8983/solr/my_core/update?optimize=true
>> 
>> 
>> 
>> My collection my_core exists of course.
>> 
>> 
>> 
>> The answer of the command line is:
>> 
>> {
>> 
>> "responseHeader":{
>> 
>>   "status":0,
>> 
>>   "QTime":18}
>> 
>> }
>> 
>> 
>> 
>> But nothing change.
>> 
>> I always have 38M deleted docs in my collection and directory size no 
>> change like with solr5.4.
>> 
>> The size of the collection stay always at : 466.33Go
>> 
>> 
>> 
>> Could you tell me how can I purge deleted docs ?
>> 
>> 
>> 
>> Cordialement, Best Regards
>> 
>> Bruno Mannina
>> 
>>  www.matheo-software.com
>> 
>>  www.patent-pulse.com
>> 
>> Tél. +33 0 970 738

Re: Using Solr as a Database?

2019-06-02 Thread Dave
You *can* use Solr as a database, in the same sense that you *can* use a chainsaw 
to remodel your bathroom.  Is it the right tool for the job? No. Can you make 
it work? Yes.  As for HA and clustered RDBMS, Galera Cluster works great for 
MariaDB, and is ACID compliant.  I'm sure any other database has its own 
cluster-like product, or would hope so.  Generally it's best to use the right 
tool for the job and not force behavior from something it's not intended to do.


That being said, I heavily abuse Solr as a data store, as pre-processing data 
before it goes into the index is more efficient for me than a bunch of SQL 
joins for the outgoing product; but for actually editing and storing the data, a 
relational DB is easier to deal with, and it's easier to find others who can work 
with it.



> On Jun 2, 2019, at 4:33 PM, Ralph Soika  wrote:
> 
> Thanks Jörn and Erick for your explanations.
> 
> What I do so far is the following:
> 
>  * I have a RDBMS with one totally flatten table holding all the data and the 
> id.
>  * The data is unstructured. Fields can vary from document to document. I 
> have no fixed schema. A dataset is represented by a Hashmap.
>  * Lucene (7.5) is perfect to index the data - with analysed-fulltext and 
> also with non-analysed-fields.
> 
> The whole system is highly transactional as it runs on Java EE with JPA and 
> Session EJBs.
> I can easily rebuild my index on any time as I have all the data in a RDBMS. 
> And of course it was necessary in the past to rebuild the index for many 
> projects after upgrading lucene (e.g. from 4.x to 7.x).
> 
> So, as far as I understand, you recommend to leave the data in the RDBMS?
> 
> The problem with RDBMS is that you can not easily scale over many nodes with 
> a master less cluster. This was why I thought Solr can solve this problem 
> easily. On the other hand my Lucene index also did not scale over multiple 
> nodes. Maybe Solr would be a solution to scale just the index?
> 
> Another solution I am working on is to store all my data in a HA Cassandra 
> cluster because I do not need the SQL-Core functionallity. But in this case I 
> only replace the RDBMS with Cassandra and Lucene/Solr holds again only the 
> index.
> 
> So Solr can't improve my architecture, with the exception of the fact that 
> the search index could be distributed across multiple nodes with Solr. Did I 
> get that right?
> 
> 
> ===
> Ralph
> 
> 
>> On 02.06.19 16:35, Erick Erickson wrote:
>> You must be able to rebuild your index completely when, at some point, you 
>> change your schema in incompatible ways. For that reason, either you have to 
>> play tricks with Solr (i.e. store all fields or the original document or….) 
>> or somehow have access to the original document.
>> 
>> Furthermore, starting with Lucene 8, Lucene will not even open an index 
>> _ever_ touched with Lucene 6. In general you can’t even open an index with 
>> Lucene X that was ever worked on with Lucene X-2 (starting where X = 8).
>> 
>> That said, it’s a common pattern to put enough information into Solr that a 
>> user can identify documents that they need then go to the system-of-record 
>> for the full document, whether that is an RDBMS or file system or whatever. 
>> I’ve seen lots of hybrid systems that store additional data besides the id 
>> and let the user get to the document she wants and only when she clicks on a 
>> single document go to the system-of-record and fetch it. Think of a Google 
>> search where the information you see as the result of a search is stored in 
>> Solr, but when the user clicks on a link the original doc is fetched from 
>> someplace other than Solr.
>> 
>> FWIW,
>> Erick
>> 
>>> On Jun 2, 2019, at 7:05 AM, Jörn Franke  wrote:
>>> 
>>> It depends what you want to do with it. You can store all fields in Solr 
>>> and filter on them. However, as soon as it comes to Acid guarantees or if 
>>> you need to join the data you will be probably needing something else than 
>>> Solr (or have other workarounds eg flatten the table ).
>>> 
>>> Maybe you can describe more what the users do in Solr or in the database.
>>> 
 Am 02.06.2019 um 15:28 schrieb Ralph Soika :
 
 Inspired by an article in the last german JavaMagazin written by Uwe 
 Schindler I wonder if Solr can also be used as a database?
 
 In our open source project Imixs-Workflow we use Lucene 
  since several years with great 
 success. We have unstructured document-like data generated by the workflow 
 engine. We store all the data in a transactional RDBMS into a blob column 
 and index the data with lucene. This works great and is impressive fast 
 also when we use complex queries.
 
 The thing is that we do not store any fields into lucene - only the 
 primary key of our dataset is stored in lucene. The document data is 
 stored in the SQL database.
 
 Now as far as I understand is solr a cluste

Re: Solr 7.7.2 Autoscaling policy - Poor performance

2019-09-03 Thread Dave
You're going to want to start by having more than 3GB for memory, in my opinion, 
but the rest of your setup is more complex than I've dealt with.

On Sep 3, 2019, at 1:10 PM, Andrew Kettmann  
wrote:

>> How many zookeepers do you have? How many collections? What is there size?
>> How much CPU / memory do you give per container? How much heap in comparison 
>> to total memory of the container ?
> 
> 3 Zookeepers.
> 733 containers/nodes
> 735 total cores. Each core ranges from ~4-10GB of index. (Autoscaling splits 
> at 12GB)
> 10 collections, ranging from 147 shards at most, to 3 at least. Replication 
> factor of 2 other than .system which has 3 replicas.
> Each container has a min/max heap of 750MB other than the overseer containers 
> which have a min/max of 3GB.
> Containers aren't hard limited by K8S on memory or CPU but the machines the 
> containers are on have 4 cores and ~13GB of ram.
> 
> Now that I look at the CPU usage on a per container basis, it looks like it 
> is maxing out all four cores on the VM that is hosting the overseer 
> container. Barely using the heap (300MB).
> 
> I suppose that means that if we put the overseers on machines with more 
> cores, it might be able to get things done a bit faster. Though that still 
> seems like a limited solution as we are going to grow this cluster at least 
> double in size if not larger.
> 
> We are using the solr:7.7.2 container.
> 
> Java Options on the home page are below:
>-DHELM_CHART=overseer
>-DSTOP.KEY=solrrocks
>-DSTOP.PORT=7983
>-Dhost=overseer-solr-0.solr.DOMAIN
>-Djetty.home=/opt/solr/server
>-Djetty.port=8983
>-Dsolr.data.home=
>-Dsolr.default.confdir=/opt/solr/server/solr/configsets/_default/conf
>-Dsolr.install.dir=/opt/solr
>-Dsolr.jetty.https.port=8983
>-Dsolr.log.dir=/opt/solr/server/logs
>-Dsolr.log.level=INFO
>-Dsolr.solr.home=/opt/solr/server/home
>-Duser.timezone=UTC
>-DzkClientTimeout=6
>
> -DzkHost=zookeeper-1.DOMAIN:2181,zookeeper-2.DOMAIN:2181,zookeeper-3.DOMAIN:2181/ZNODE
>-XX:+CMSParallelRemarkEnabled
>-XX:+CMSScavengeBeforeRemark
>-XX:+ParallelRefProcEnabled
>-XX:+UseCMSInitiatingOccupancyOnly
>-XX:+UseConcMarkSweepGC
>-XX:-OmitStackTraceInFastThrow
>-XX:CMSInitiatingOccupancyFraction=50
>-XX:CMSMaxAbortablePrecleanTime=6000
>-XX:ConcGCThreads=4
>-XX:MaxTenuringThreshold=8
>-XX:NewRatio=3
>-XX:ParallelGCThreads=4
>-XX:PretenureSizeThreshold=64m
>-XX:SurvivorRatio=4
>-XX:TargetSurvivorRatio=90
>
> -Xlog:gc*:file=/opt/solr/server/logs/solr_gc.log:time,uptime:filecount=9,filesize=20M
>-Xmx3g
>-Xmx3g
>-Xss256k
> 
> 
> 


Re: Need more info on MLT (More Like This) feature

2019-09-13 Thread Dave
As a side note, if you use shingles with the MLT handler I believe you will get 
better scores/more relevant results. So "to be free" gets indexed as "to_be", 
"to_be_free" and "be_free", but also as each word. It makes the index 
significantly larger but creates better "unique terms" in my opinion, and it 
improved the results for me at least.
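
A rough sketch of the kind of shingled field type being described (names and sizes here are illustrative only):

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- emit single words plus 2- and 3-word shingles joined with "_" -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="3"
            outputUnigrams="true" tokenSeparator="_"/>
  </analyzer>
</fieldType>

Pointing mlt.fl (or the knnSearch qf) at a field of this type gives MLT rarer, more specific terms to work with, at the cost of a larger index.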

> On Sep 13, 2019, at 2:51 PM, Srisatya Pyla  wrote:
> 
> Thank you very much for quick response. This is very much helpful to us.
> While analyzing the results for some jobs, it is returning high score for a 
> document which is not much relevant to the base document. 
> Is there any way we can improve the results and scoring?  
> How it exactly give the score for matching document based on a matching 
> field?  This is helpful to know why it is giving highest matching score for 
> the specific documents.
> 
> 
> Regards,
> SST  Narasimha Rao Pyla
> IBM Talent Management Solutions
> Mobile :+91 9849315546
> E-mail :srisp...@in.ibm.com   
> 
> 
> IBM Visakha Hills
> Visakhapatnam, AP 530045
> India
> 
> 
> 
> 
> 
> From:Chee Yee Lim 
> To:Srisatya Pyla 
> Cc:solr-user@lucene.apache.org, Rajeev Kasarabada1 
> , Archana Gavini1 
> Date:13/09/2019 04:32 PM
> Subject:[EXTERNAL] Re: Need more info on MLT (More Like This) feature
> 
> 
> 
> To use knnSearch, you need to submit a POST request to the Stream request 
> handler.
> 
> Using your example query, you will need to rewrite them from this :
> 
> http://[SOLRURL]/mlt?q=sjkey:1414462-25600-5258&wt=json&indent=true&mlt=true&rows=100&mlt.fl=jobdescription&mlt.mindf=1&mlt.mintf=1&fl=jobtitle,jobdescription&fq=siteid:5258
> 
> to this (using curl as an example to send POST request) :
> 
> curl --data-urlencode 'expr=knnSearch([collection_name],
> id="1414462-25600-5258",
> qf="jobdescription",
> k=100,
> fl="jobtitle,jobdescription,score",
> sort="score desc",
> fq="siteid:5258",
> mintf=1, 
> mindf=1)' http://[SOLRURL]/stream
> 
> Note that this assume your document ID is sjkey.
> 
> More detailed documentation on how Stream handler works can be seen here, 
> https://lucene.apache.org/solr/guide/8_1/streaming-expressions.html.
> 
> Best wishes,
> Chee Yee
> 
> On Fri, 13 Sep 2019 at 17:57, Srisatya Pyla  wrote:
> Hi Chee Yee Lim,
> 
> 
> Thank you for your quick response.  
> We do not find much documentation on knnsearch on how to do use that.   
> Could you please guide us with more info on how this can be used?
> 
> Can we use this the way we use Solr by querying with Solr URL like   
> http://[SOLR URL]/mlt ?  OR any other way?
> And also please provide with any more detailed documentation if you have any.
> 
> 
> Regards,
> SST  Narasimha Rao Pyla
> IBM Talent Management Solutions
> Mobile :+91 9849315546
> E-mail :srisp...@in.ibm.com   
> 
> 
> IBM Visakha Hills
> Visakhapatnam, AP 530045
> India
> 
> 
> 
> 
> 
> 
>  
>  
> - Original message -
> From: Chee Yee Lim 
> To: solr-user@lucene.apache.org
> Cc: Archana Gavini1 , Rajeev Kasarabada1 
> 
> Subject: [EXTERNAL] Re: Need more info on MLT (More Like This) feature
> Date: Thu, Sep 12, 2019 6:43 PM
>  
> I've been working with MLT handler (Solr 8.1.1) by calling it the same way 
> you did, http://[SOLRURL]/mlt. But the response is very unreliable with 90% 
> of the same queries resulting in Java null pointer exception, and only 10% 
> returning expected response. I do not know what is the cause of this.
>  
> I overcame this problem by using knnSearch via Stream handler 
> (https://lucene.apache.org/solr/guide/8_1/stream-source-reference.html#knnsearch).
>  It is just a wrapper on MLT, and it works brilliantly. It is worth checking 
> it out if you are running Solr in cloud mode.
>  
> If you pass the fl="score"&sort="score desc" to knnSearch, you will be able 
> to get the results sorted by matching scores.
>  
> Best wishes,
> Chee Yee
>   
> On Thu, 12 Sep 2019 at 19:44, Srisatya Pyla  wrote:
> Hi Solr Seatch Team,
> 
> I am a developer from IBM Kenexa Brassring.  We are using Solr Search engine 
> for searching jobs in our applications.
> We are planning to use MLT feature to get the similar matching documents 
> (jobs) based on one document (job).
> 
> When trying to explore this option, we are using matching field as 
> JobDescription of the job and we are getting some unrelated documents in the 
> MLT results which are not expected.
> 
> The query like below:
> 
> http://[SOLRURL]/mlt?q=sjkey:1414462-25600-5258&wt=json&indent=true&mlt=true&rows=100&mlt.fl=jobdescription&mlt.mindf=1&mlt.mintf=1&fl=jobtitle,jobdescription&fq=siteid:5258
> 
> 
> We have few questions:
> 1) Is there any way we can get the matching score for each of the matching 
> document we get in the MLT results, so that we can get the sorting done on 
> the score to have the highest matching document at the top of the result.
> 
> 2) Are there any best practices using MLT Handler?
> 
> 
> Regards,
> SST  Narasimha Rao Pyla
> IBM Talent Manageme

Re: Sample JWT Solr configuration

2019-09-19 Thread Dave
I know this has nothing to do with the issue at hand but if you have a public 
facing solr instance you have much bigger issues.  

> On Sep 19, 2019, at 10:16 PM, Tyrone Tse  wrote:
> 
> I finally got JWT Authentication working on Solr 8.1.1.
> This is my security.json file contents
> {
>   "authentication":{
>  "class":"solr.JWTAuthPlugin",
>  "jwk":{
> "kty":"oct",
> "use":"sig",
> "kid":"k1",
> 
> "k":"xbQNocUhLJKSmGi0Qp_4hAVfls9CWH5WoTrw543WTXi5H6G-AXFlHRaTKWoGZtLKAD9jn6-MFC49jvR3bJI2L_H9a3yeRgd3tMkhxcR7ABsnhFz2WutN7NSZHiAxCJzTxR8YsgzMM9SXjvp6H1xpNWALdi67YIogKFTLiUIRDtdp3xBJxMP9IQlSYxK4ov81lt4hpAhSdkfpeczgRGd2xxrMbN38uDqtoIXSPRX-7d3pf1YvlyzWKHudTz30sjM6R2h-RRDBOp-SK_tDq4vjG72DyqFYt7BRyzSzrxGl-Ku5yURr21u6vep6suWeJ2_fmA8hgd304e60DBKZoFebxQ",
> "alg":"HS256"
>  },
>  "aud":"Solr"
>   },
>   "authorization":{
>  "class":"solr.RuleBasedAuthorizationPlugin",
>  "permissions":[
> {
>"name":"open_select",
>"path":"/select/*",
>"role":null
> },
> {
>"name":"all-admin",
>"collection":null,
>"path":"/*",
>"role":"admin"
> },
> {
>"name":"update",
>"role":"solr-update"
> }
>  ],
>  "user-role":{
> "admin":"solr-update"
>  }
>   }
> }
> 
> I used the web site to generate the JWK key.
> 
> So I am using the "k" value from the JWK to sign the JWT token.
> 
> Initially, I used website
> https://jwt.io/#debugger-io?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJhZG1pbiIsImF1ZCI6InNvbHIiLCJleHAiOjk5MTYyMzkwMjJ9.rqMpVpTSbNUHDA7VLSYUpv4ebeMjvwQMD6hwMDpvcBQ
> 
> to generate the JWT and sign it with the value
> xbQNocUhLJKSmGi0Qp_4hAVfls9CWH5WoTrw543WTXi5H6G-AXFlHRaTKWoGZtLKAD9jn6-MFC49jvR3bJI2L_H9a3yeRgd3tMkhxcR7ABsnhFz2WutN7NSZHiAxCJzTxR8YsgzMM9SXjvp6H1xpNWALdi67YIogKFTLiUIRDtdp3xBJxMP9IQlSYxK4ov81lt4hpAhSdkfpeczgRGd2xxrMbN38uDqtoIXSPRX-7d3pf1YvlyzWKHudTz30sjM6R2h-RRDBOp-SK_tDq4vjG72DyqFYt7BRyzSzrxGl-Ku5yURr21u6vep6suWeJ2_fmA8hgd304e60DBKZoFebxQ
> 
> The header is
> {
>  "alg": "HS256",
>  "typ": "JWT"
> }
> 
> and the payload is
> 
> {
>  "sub": "admin",
>  "aud": "Solr",
>  "exp": 9916239022
> }
> 
> This generates the JWT key of
> eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJhZG1pbiIsImF1ZCI6IlNvbHIiLCJleHAiOjk5MTYyMzkwMjJ9._H1qeNvlpIOn3X9IpDG0QiRWnEDXITMhZm1NMfuocSc
> 
> So when I use this JWT token generated https://jwt.io/  JWT authentication
> is working, and I can authenticate as the user admin and Post data to the
> Solr collections/cores.
> 
> Now we have decided to get the JWT token generated using Java before we
> authenticate as the user admin to Post data to Solr, and to have a
> calculated expiration date
> 
> Here is the Java Snippet for generating the JWT token
> 
> import io.jsonwebtoken.Jwts;
> import io.jsonwebtoken.SignatureAlgorithm;
> ...
> ...
>String
> key="xbQNocUhLJKSmGi0Qp_4hAVfls9CWH5WoTrw543WTXi5H6G-AXFlHRaTKWoGZtLKAD9jn6-MFC49jvR3bJI2L_H9a3yeRgd3tMkhxcR7ABsnhFz2WutN7NSZHiAxCJzTxR8YsgzMM9SXjvp6H1xpNWALdi67YIogKFTLiUIRDtdp3xBJxMP9IQlSYxK4ov81lt4hpAhSdkfpeczgRGd2xxrMbN38uDqtoIXSPRX-7d3pf1YvlyzWKHudTz30sjM6R2h-RRDBOp-SK_tDq4vjG72DyqFYt7BRyzSzrxGl-Ku5yURr21u6vep6suWeJ2_fmA8hgd304e60DBKZoFebxQ";
>Calendar cal =Calendar.getInstance();
>Date issueAt = cal.getTime();
>cal.add(Calendar.MINUTE,60);
>Date expDate = cal.getTime();
>String jws = Jwts.builder().
>setSubject("admin")
>.setAudience("Solr")
>.setExpiration(expDate)
>.signWith(SignatureAlgorithm.HS256,key).compact();
>System.out.println(jws);
> 
> This does not generate a valid JWT token, when I use it I am getting the
> error message
> 
> 
> 
>
>Error 401 Signature invalid
> 
> 
> 
>HTTP ERROR 401
>Problem accessing /solr/stores/update. Reason:
> Signature invalid
>
> 
> 
> 
> 
> I tried generating the JWT token using JavaScript from this codepen
> https://codepen.io/tyrone-tse/pen/MWgzExB
> 
> and it too generates an invalid JWT key.
> 
> How come it works when the JWT is generated from
> https://jwt.io/#debugger-io?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiJhZG1pbiIsImF1ZCI6InNvbHIiLCJleHAiOjk5MTYyMzkwMjJ9.rqMpVpTSbNUHDA7VLSYUpv4ebeMjvwQMD6hwMDpvcBQ
> 
> 
> 
> 
> 
> 
> 
>> On Sat, Sep 14, 2019 at 9:06 AM Jan Høydahl  wrote:
>> 
>> See answer in other thread. JWT works for 8.1 or later, don’t attempt it
>> in 7.x.
>> 
>> You could try to turn on debug logging for or.apache.solr.security to get
>> more logging.
>> 
>> Jan Høydahl
>> 
>>> 13. sep. 2019 kl. 00:24 skrev Tyrone Tse :
>>> 
>>> Jan
>>> 
>>> I tried using the JWT Plugin https://github.com/cominvent/solr-auth-jwt
>>> 
>>> If my security.json file is
>>> 
>>> {
>>> "authentication": {
>>>   "class":"com.cominvent.solr.JWTAuthPlugin",
>>>   "jwk" : {
>>>  
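
One frequent cause of "Signature invalid" in setups like this is the shared secret being treated as plain text on one side and as base64url-encoded bytes on the other. Purely as an illustrative sketch (not necessarily the fix for the code quoted above), signing with the base64url-decoded bytes of the JWK "k" value looks like this with the same jjwt library:

import java.util.Base64;
import java.util.Calendar;
import java.util.Date;
import io.jsonwebtoken.Jwts;
import io.jsonwebtoken.SignatureAlgorithm;

public class JwtSketch {
    public static void main(String[] args) {
        // the "k" value from the JWK in security.json
        String k = "xbQNocUhLJKSmGi0Qp_4hAVfls9CWH5WoTrw543WTXi5H6G-AXFlHRaTKWoGZtLKAD9jn6-MFC49jvR3bJI2L_H9a3yeRgd3tMkhxcR7ABsnhFz2WutN7NSZHiAxCJzTxR8YsgzMM9SXjvp6H1xpNWALdi67YIogKFTLiUIRDtdp3xBJxMP9IQlSYxK4ov81lt4hpAhSdkfpeczgRGd2xxrMbN38uDqtoIXSPRX-7d3pf1YvlyzWKHudTz30sjM6R2h-RRDBOp-SK_tDq4vjG72DyqFYt7BRyzSzrxGl-Ku5yURr21u6vep6suWeJ2_fmA8hgd304e60DBKZoFebxQ";
        // decode the base64url string into the raw HMAC key bytes
        byte[] keyBytes = Base64.getUrlDecoder().decode(k);

        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.MINUTE, 60);
        Date expDate = cal.getTime();

        String jws = Jwts.builder()
                .setSubject("admin")
                .setAudience("Solr")
                .setExpiration(expDate)
                // pass the decoded key bytes rather than the String form of the key
                .signWith(SignatureAlgorithm.HS256, keyBytes)
                .compact();
        System.out.println(jws);
    }
}

Comparing a token produced this way against one produced from the String overload makes it easy to see which interpretation of the secret the server actually expects.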

Re: POS Tagger

2019-10-25 Thread Dave
Yeah. My mistake in explanation. But it really does help with better relevance 
in the returned documents
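
A minimal sketch of that kind of query-time boost, assuming the extracted entities were indexed into a hypothetical field called entities_person alongside the normal text field:

http://localhost:8983/solr/mycollection/select?q=rush+limbaugh&defType=edismax&qf=text+entities_person^5

Documents where "rush limbaugh" was tagged as a person at index time match the boosted entities_person field and come back above documents where the two words merely co-occur in a sentence.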

> On Oct 25, 2019, at 12:39 PM, Audrey Lorberfeld - audrey.lorberf...@ibm.com 
>  wrote:
> 
> Oh I see I see 
> 
> -- 
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
> 
> 
> On 10/25/19, 12:21 PM, "David Hastings"  wrote:
> 
>oh i see what you mean, sorry, i explained it incorrectly.
> those sentences are what would be in the index, and a general search for
>'rush limbaugh' would come back with results where he is an entity higher
>than if it was two words in a sentence
> 
>On Fri, Oct 25, 2019 at 12:12 PM David Hastings <
>hastings.recurs...@gmail.com> wrote:
> 
>> nope, i boost the fields already tagged at query time against teh query
>> 
>> On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com  wrote:
>> 
>>> So then you do run your POS tagger at query-time, Dave?
>>> 
>>> --
>>> Audrey Lorberfeld
>>> Data Scientist, w3 Search
>>> IBM
>>> audrey.lorberf...@ibm.com
>>> 
>>> 
>>> On 10/25/19, 12:06 PM, "David Hastings" 
>>> wrote:
>>> 
>>>I use them for query boosting, so if someone searches for:
>>> 
>>>i dont want to rush limbaugh out the door
>>>vs
>>>i talked to rush limbaugh through the door
>>> 
>>>my documents where 'rush limbaugh' is a known entity (noun) and a
>>> person
>>>(look at the sentence, its obviously a person and the nlp finds that)
>>> have
>>>'rush limbaugh' stored in a field, which is boosted on queries.  this
>>> makes
>>>sure results from the second query with him as a person will be
>>> boosted
>>>above those from the first query
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
>>> nicolas.pa...@riseup.net>
>>>wrote:
>>> 
>>>> Also we are using stanford POS tagger for french. The processing
>>> time is
>>>> mitigated by the spark-corenlp package which distribute the process
>>> over
>>>> multiple node.
>>>> 
>>>> Also I am interesting in the way you use POS information within solr
>>>> queries, or solr fields.
>>>> 
>>>> Thanks,
>>>>> On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
>>>>> ah, yeah its not the fastest but it proved to be the best for my
>>>> purposes,
>>>>> I use it to pre-process data before indexing, to apply more
>>> metadata to
>>>> the
>>>>> documents in a separate field(s)
>>>>> 
>>>>> On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
>>>>> audrey.lorberf...@ibm.com  wrote:
>>>>> 
>>>>>> No, I meant for part-of-speech tagging __ But that's
>>> interesting that
>>>> you
>>>>>> use StanfordNLP. I've read that it's very slow, so we are
>>> concerned
>>>> that it
>>>>>> might not work for us at query-time. Do you use it at
>>> query-time, or
>>>> just
>>>>>> index-time?
>>>>>> 
>>>>>> --
>>>>>> Audrey Lorberfeld
>>>>>> Data Scientist, w3 Search
>>>>>> IBM
>>>>>> audrey.lorberf...@ibm.com
>>>>>> 
>>>>>> 
>>>>>> On 10/25/19, 10:30 AM, "David Hastings" <
>>> hastings.recurs...@gmail.com
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>Do you mean for entity extraction?
>>>>>>I make a LOT of use from the stanford nlp project, and get
>>> out the
>>>>>> entities
>>>>>>and use them for different purposes in solr
>>>>>>-Dave
>>>>>> 
>>>>>>On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
>>>>>>audrey.lorberf...@ibm.com 
>>> wrote:
>>>>>> 
>>>>>>> Hi All,
>>>>>>> 
>>>>>>> Does anyone use a POS tagger with their Solr instance
>>> other than
>>>>>>> OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> --
>>>>>>> Audrey Lorberfeld
>>>>>>> Data Scientist, w3 Search
>>>>>>> IBM
>>>>>>> audrey.lorberf...@ibm.com
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> --
>>>> nicolas
>>>> 
>>> 
>>> 
>>> 
> 
> 


Re: Active directory integration in Solr

2019-11-20 Thread Dave
I guess I don't understand why one wouldn't simply make a basic front end for 
Solr; it's literally the easiest thing to throw together, and then you control 
all authentication and filters per user.  Even a basic one would be some 
W3Schools tutorials with PHP + JSON + whatever authentication mechanism you want to use.  
Access to the UI lets you, right away, drop entire cores or collections; 
there's no way anyone not familiar with what they're doing should be allowed to 
touch it.

> On Nov 20, 2019, at 6:22 PM, Jörn Franke  wrote:
> 
> Well i propose for Solr Kerberos authentication on HTTPS (2) for the web ui 
> backend. Then the web ui backend does any type of authentication / 
> authorization of users you need.
> I would not let users access directly access Solr in any environment. 
> 
> 
> 
>> Am 20.11.2019 um 20:19 schrieb Kevin Risden :
>> 
>> So I wrote the blog more of an experiment above. I don't know if it is
>> fully operating other than on a single node. That being said, the Hadoop
>> authentication plugin doesn't require running on HDFS. It just uses the
>> Hadoop code to do authentication.
>> 
>> I will echo what Jorn said though - I wouldn't expose Solr to the internet
>> or directly without some sort of API. Whether you do
>> authentication/authorization at the API is a separate question.
>> 
>> Kevin Risden
>> 
>> 
>>> On Wed, Nov 20, 2019 at 1:54 PM Jörn Franke  wrote:
>>> 
>>> I would not give users directly access to Solr - even with LDAP plugin.
>>> Build a rest interface or web interface that does the authentication and
>>> authorization and security sanitization. Then you can also manage better
>>> excessive queries or explicitly forbid certain type of queries (eg specific
>>> streaming expressions - I would not expose all of them to users).
>>> 
> Am 19.11.2019 um 11:02 schrieb Kommu, Vinodh K. :
 
 Thanks Charlie.
 
 We are already using Basic authentication in our existing clusters,
>>> however it's getting difficult to maintain number of users as we are
>>> getting too many requests for readonly access from support teams. So we
>>> desperately looking for active directory solution. Just wondering if
>>> someone might have same requirement need.
 
 
 Regards,
 Vinodh
 
 -Original Message-
 From: Charlie Hull 
 Sent: Tuesday, November 19, 2019 2:55 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Active directory integration in Solr
 
 ATTENTION! This email originated outside of DTCC; exercise caution.
 
 Not out of the box, there are a few authentication plugins bundled but
>>> not for AD
 
>>> https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Flucene.apache.org%2Fsolr%2Fguide%2F7_2%2Fauthentication-and-authorization-plugins.html&data=02%7C01%7Cvkommu%40dtcc.com%7C2e17e1feef78432502e008d76cd26635%7C0465519d7f554d47998b55e2a86f04a8%7C0%7C0%7C637097523245309858&sdata=fkahJ62aWFYh7QxcyFQbJV9u8OsTYSWp6pv0MNdzjps%3D&reserved=0
 - there's also some useful stuff in Apache ManifoldCF
 
>>> https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.francelabs.com%2Fblog%2Ftutorial-on-authorizations-for-manifold-cf-and-solr%2F&data=02%7C01%7Cvkommu%40dtcc.com%7C2e17e1feef78432502e008d76cd26635%7C0465519d7f554d47998b55e2a86f04a8%7C0%7C0%7C637097523245319858&sdata=iYiKRDJKYBZaxUd%2F%2BIddFBwxB2RhSqih2KZc26aZlRU%3D&reserved=0
 
 
 Best
 
 Charlie
 
> On 18/11/2019 15:08, Kommu, Vinodh K. wrote:
> Hi,
> 
> Does anyone know that Solr has any out of the box capability to
>>> integrate Active directory (using LDAP) when security is enabled? Instead
>>> of creating users in security.json file, planning to use users who already
>>> exists in active directory so they can use their individual credentials
>>> rather than defining in Solr. Did anyone came across similar requirement?
>>> If so was there any working solution?
> 
> 
> Thanks,
> Vinodh
> 
> DTCC DISCLAIMER: This email and any files transmitted with it are
>>> confidential and intended solely for the use of the individual or entity to
>>> whom they are addressed. If you have received this email in error, please
>>> notify us immediately and delete the email and any attachments from your
>>> system. The recipient should check this email and any attachments for the
>>> presence of viruses. The company accepts no liability for any damage caused
>>> by any virus transmitted by this email.
> 
 
 --
 Charlie Hull
 Flax - Open Source Enterprise Search
 
 tel/fax: +44 (0)8700 118334
 mobile:  +44 (0)7767 825828
 web:
>>> https://nam02.safelinks.protection.outlook.com/?url=www.flax.co.uk&data=02%7C01%7Cvkommu%40dtcc.com%7C2e17e1feef78432502e008d76cd26635%7C0465519d7f554d47998b55e2a86f04a8%7C0%7C0%7C637097523245319858&sdata=YNGIg%2FVgL2w82i3JWsBkBTJeefHMjSxbjLaQyOdJVt0%3D&reserved=0
 
 DTCC DISCLAIMER: This email and any fil

Re: Solr process takes several minutes before accepting commands after restart

2019-11-21 Thread Dave
https://lucidworks.com/post/solr-suggester/

You must set buildOnStartup to false; the default is true. Try it.
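
As a sketch, the relevant bit of solrconfig.xml looks like this (component and field names here are made up):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">titleSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <!-- set these explicitly rather than relying on defaults -->
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>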

> On Nov 21, 2019, at 3:21 PM, Koen De Groote  
> wrote:
> 
> Erick:
> 
> No suggesters. There is 1 spellchecker for
> 
> text_general
> 
> But no buildOnCommit or buildOnStartup setting mentioned anywhere.
> 
> That being said, the point in time at which this occurs, the database is
> guaranteed to be empty, as the data folders had previously been deleted and
> recreated empty. Then the docker container is restarted and this behavior
> is observed.
> 
> Long shot, but even if Solr is getting data from zookeeper telling of file
> locations and checking for the existence of these files... that should be
> pretty fast, I'd think.
> 
> This is really disturbing. I know what to expect when recovering now, but
> someone doing this on a live environment that has to be up again ASAP is
> probably going to be sweating bullets.
> 
> 
> On Thu, Nov 21, 2019 at 2:45 PM Erick Erickson 
> wrote:
> 
>> Koen:
>> 
>> Do you have any spellcheckers or suggesters defined with buildOnCommit or
>> buildOnStartup set to “true”? Depending on the implementation, this may
>> have to read the stored data for the field used in the
>> suggester/spellchecker from _every_ document in your collection, which can
>> take many minutes. Even if your implementation in your config is file-based
>> it can still take a while.
>> 
>> Shot in the dark….
>> 
>> Erick
>> 
>>> On Nov 21, 2019, at 4:03 AM, Koen De Groote 
>> wrote:
>>> 
>>> The logs files showed a startup, printing of all the config options that
>>> had been set, 1 or 2 commands that got executed and then nothing.
>>> 
>>> Sending the curl did not get shown in the logs files until after that
>>> period where Solr became unresponsive.
>>> 
>>> Service mesh, I don't think so? It's in a docker container, but that
>>> shouldn't be a problem, it usually never is.
>>> 
>>> 
>>> On Wed, Nov 20, 2019 at 10:42 AM Jörn Franke 
>> wrote:
>>> 
 Have you checked the log files of Solr?
 
 
 Do you have a service mesh in-between? Could it be something at the
 network layer/container orchestration  that is blocking requests for
>> some
 minutes?
 
> Am 20.11.2019 um 10:32 schrieb Koen De Groote <
 koen.degro...@limecraft.com>:
> 
> Hello
> 
> I was testing some backup/restore scenarios.
> 
> 1 of them is Solr7.6 in a docker container(7.6.0-slim), set up as
> SolrCloud, with zookeeper.
> 
> The steps are as follows:
> 
> 1. Manually delete the data folder.
> 2. Restart the container. The process is now in error mode, complaining
> that it cannot find the cores.
> 3. Fix the install, meaning create new data folders, which are empty at
> this point.
> 4. Restart the container again, to pick up the empty folders and not be
 in
> error anymore.
> 5. Perform the restore
> 6. Check if everything is available again
> 
> The problem is between step 4 and 5. After step 4, it takes several
 minutes
> before solr actually responds to curl commands.
> 
> Once responsive, the restore happened just fine. But it's very
>> stressful
 in
> a situation where you have to restore a production environment and the
> process just doesn't respond for 5-10 minutes.
> 
> We're talking about 20GB of data here, so not very much, but not little
> either.
> 
> Is it normal that it takes so long before solr responds? If not, what
> should I look at in order to find the cause?
> 
> I have asked this before recently, though the wording was confusing.
>> This
> should be clearer.
> 
> Kind regards,
> Koen De Groote
 
>> 
>> 


Re: Solr process takes several minutes before accepting commands after restart

2019-11-21 Thread Dave
https://doc.sitecore.com/developers/90/platform-administration-and-architecture/en/using-solr-auto-suggest.html


Another reference, if you need more. Set all parameters yourself; don't rely on 
defaults.

> On Nov 21, 2019, at 3:41 PM, Dave  wrote:
> 
> https://lucidworks.com/post/solr-suggester/
> 
> You must set buildonstartup to false, the default is true. Try it
> 
>> On Nov 21, 2019, at 3:21 PM, Koen De Groote  
>> wrote:
>> 
>> Erick:
>> 
>> No suggesters. There is 1 spellchecker for
>> 
>> text_general
>> 
>> But no buildOnCommit or buildOnStartup setting mentioned anywhere.
>> 
>> That being said, the point in time at which this occurs, the database is
>> guaranteed to be empty, as the data folders had previously been deleted and
>> recreated empty. Then the docker container is restarted and this behavior
>> is observed.
>> 
>> Long shot, but even if Solr is getting data from zookeeper telling of file
>> locations and checking for the existence of these files... that should be
>> pretty fast, I'd think.
>> 
>> This is really disturbing. I know what to expect when recovering now, but
>> someone doing this on a live environment that has to be up again ASAP is
>> probably going to be sweating bullets.
>> 
>> 
>> On Thu, Nov 21, 2019 at 2:45 PM Erick Erickson 
>> wrote:
>> 
>>> Koen:
>>> 
>>> Do you have any spellcheckers or suggesters defined with buildOnCommit or
>>> buildOnStartup set to “true”? Depending on the implementation, this may
>>> have to read the stored data for the field used in the
>>> suggester/spellchecker from _every_ document in your collection, which can
>>> take many minutes. Even if your implementation in your config is file-based
>>> it can still take a while.
>>> 
>>> Shot in the dark….
>>> 
>>> Erick
>>> 
>>>> On Nov 21, 2019, at 4:03 AM, Koen De Groote 
>>> wrote:
>>>> 
>>>> The logs files showed a startup, printing of all the config options that
>>>> had been set, 1 or 2 commands that got executed and then nothing.
>>>> 
>>>> Sending the curl did not get shown in the logs files until after that
>>>> period where Solr became unresponsive.
>>>> 
>>>> Service mesh, I don't think so? It's in a docker container, but that
>>>> shouldn't be a problem, it usually never is.
>>>> 
>>>> 
>>>> On Wed, Nov 20, 2019 at 10:42 AM Jörn Franke 
>>> wrote:
>>>> 
>>>>> Have you checked the log files of Solr?
>>>>> 
>>>>> 
>>>>> Do you have a service mesh in-between? Could it be something at the
>>>>> network layer/container orchestration  that is blocking requests for
>>> some
>>>>> minutes?
>>>>> 
>>>>>> Am 20.11.2019 um 10:32 schrieb Koen De Groote <
>>>>> koen.degro...@limecraft.com>:
>>>>>> 
>>>>>> Hello
>>>>>> 
>>>>>> I was testing some backup/restore scenarios.
>>>>>> 
>>>>>> 1 of them is Solr7.6 in a docker container(7.6.0-slim), set up as
>>>>>> SolrCloud, with zookeeper.
>>>>>> 
>>>>>> The steps are as follows:
>>>>>> 
>>>>>> 1. Manually delete the data folder.
>>>>>> 2. Restart the container. The process is now in error mode, complaining
>>>>>> that it cannot find the cores.
>>>>>> 3. Fix the install, meaning create new data folders, which are empty at
>>>>>> this point.
>>>>>> 4. Restart the container again, to pick up the empty folders and not be
>>>>> in
>>>>>> error anymore.
>>>>>> 5. Perform the restore
>>>>>> 6. Check if everything is available again
>>>>>> 
>>>>>> The problem is between step 4 and 5. After step 4, it takes several
>>>>> minutes
>>>>>> before solr actually responds to curl commands.
>>>>>> 
>>>>>> Once responsive, the restore happened just fine. But it's very
>>> stressful
>>>>> in
>>>>>> a situation where you have to restore a production environment and the
>>>>>> process just doesn't respond for 5-10 minutes.
>>>>>> 
>>>>>> We're talking about 20GB of data here, so not very much, but not little
>>>>>> either.
>>>>>> 
>>>>>> Is it normal that it takes so long before solr responds? If not, what
>>>>>> should I look at in order to find the cause?
>>>>>> 
>>>>>> I have asked this before recently, though the wording was confusing.
>>> This
>>>>>> should be clearer.
>>>>>> 
>>>>>> Kind regards,
>>>>>> Koen De Groote
>>>>> 
>>> 
>>> 


Re: A Last Message to the Solr Users

2019-11-30 Thread Dave
I’m young here I think, not even 40 and only been using solr since like 2008 or 
so, so like 1.4 give or take. But I know a really good therapist if you want to 
talk about it. 

> On Nov 30, 2019, at 6:56 PM, Mark Miller  wrote:
> 
> Now I have sacrificed to give you a new chance. A little for my community.
> It was my community. But it was mostly for me. The developer I started as
> would kick my ass today.  Circumstances and luck has brought money to our
> project. And it has corrupted our process, our community, and our code.
> 
> In college i would talk about past Mark screwing future Mark and too bad
> for him. What did he ever do for me? Well, he got me again ;)
> 
> I’m out of steam, time and wife patentice.
> 
> Enough key people are aware of the scope of the problem now that you won’t
> need me. I was never actually part of the package. To the many, many people
> that offered me private notes of encouragement and future help - thank you
> so much. Your help will be needed.
> 
> You will reset. You will fix this. Or I will be back.
> 
> Mark
> 
> 
> -- 
> - Mark
> 
> http://about.me/markrmiller


Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread Dave
It clarifies, yes. You need new fields. In this case something like
address_us
address_uk
and index and search them accordingly, with different stopword files used in 
different field types, hence the copyField from "address" into as many new 
fields as needed.
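
As a sketch of the schema side (type, field and file names are made up):

<fieldType name="text_address_us" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_us.txt"/>
  </analyzer>
</fieldType>

<fieldType name="text_address_uk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_uk.txt"/>
  </analyzer>
</fieldType>

<field name="address_us" type="text_address_us" indexed="true" stored="false"/>
<field name="address_uk" type="text_address_uk" indexed="true" stored="false"/>

<copyField source="address" dest="address_us"/>
<copyField source="address" dest="address_uk"/>

At query time the application then picks the right field based on the document's country value, since the schema itself cannot branch on another field.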

> On Dec 2, 2019, at 7:33 PM,   wrote:
> 
> To clarify, a document would look like this : 
> 
> {
>  address: "123 main Street",
>  country : "US"
> }
> 
> What I'd like to do when I configure my index is to apply a set of different 
> stop words to the address field depending on the value of the country. For 
> example, something like this : 
> 
> If (country == US) -> File1
> Else If (country == UK) -> File2
> 
> Etc..
> 
> Hopefully, that clarifies.
> 
> -Original Message-
> From: Jörn Franke  
> Sent: Monday, December 2, 2019 3:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to have different Stop words depending on the 
> value of a field?
> 
> You can have different fields by country. I am not sure about your stop words 
> but if they are not occurring in the other languages then you have not a 
> problem. 
> On the other hand: it you need more than stop words (eg lemmatizing, 
> specialized way of tokenization etc) then you need a different field per 
> language. You don’t describe your full use case, but if you have different 
> fields for different language then your client application needs to handle 
> this (not difficult, but you have to be aware).
> Not sure if you need to search a given address in all languages or if you use 
> the language of the user etc.
> 
>> Am 02.12.2019 um 20:13 schrieb yeikel valdes :
>> 
>> Hi,
>> 
>> 
>> I have an index that stores addresses from different countries.
>> 
>> 
>> As every country has different stop words, I was wondering if it is possible 
>> to apply a different set of stop words depending on the value of a field. 
>> 
>> 
>> Or do I need different indexes/do itnat the ETL step to accomplish this?
>> 
>> 
> 
> 


Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread Dave
I’ll add to that since I’m up. Stopwords are in a practical sense useless and 
serve no purpose. It’s an old way to save index size that’s not needed any 
more. You’d need very specific use cases to want to use them. Maybe you do, but 
generally you never do unless it’s for training a machine or something a bit 
more on the experimental side. If you can explain *why you think you need stop 
words that would be helpful in perhaps guiding you to an alternative 

> On Dec 2, 2019, at 7:45 PM,   wrote:
> 
> That makes sense, thank you for the clarification!
> 
> @wun...@wunderwood.org If you can, please build on your explanation as It 
> sounds relevant. 
> -Original Message-
> From: Dave  
> Sent: Monday, December 2, 2019 7:38 PM
> To: solr-user@lucene.apache.org
> Cc: jornfra...@gmail.com
> Subject: Re: Is it possible to have different Stop words depending on the 
> value of a field?
> 
> It clarifies yes. You need new fields. In this case something like Address_us 
> Address_uk And index and search them accordingly with different stopword 
> files used in different field types, hence the copy field from “address” into 
> as many new fields as needed
> 
>> On Dec 2, 2019, at 7:33 PM,   wrote:
>> 
>> To clarify, a document would look like this : 
>> 
>> {
>> address: "123 main Street",
>> country : "US"
>> }
>> 
>> What I'd like to do when I configure my index is to apply a set of different 
>> stop words to the address field depending on the value of the country. For 
>> example, something like this : 
>> 
>> If (country == US) -> File1
>> Else If (country == UK) -> File2
>> 
>> Etc..
>> 
>> Hopefully, that clarifies.
>> 
>> -Original Message-
>> From: Jörn Franke 
>> Sent: Monday, December 2, 2019 3:25 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Is it possible to have different Stop words depending on the 
>> value of a field?
>> 
>> You can have different fields by country. I am not sure about your stop 
>> words but if they are not occurring in the other languages then you have not 
>> a problem. 
>> On the other hand: it you need more than stop words (eg lemmatizing, 
>> specialized way of tokenization etc) then you need a different field per 
>> language. You don’t describe your full use case, but if you have different 
>> fields for different language then your client application needs to handle 
>> this (not difficult, but you have to be aware).
>> Not sure if you need to search a given address in all languages or if you 
>> use the language of the user etc.
>> 
>>> Am 02.12.2019 um 20:13 schrieb yeikel valdes :
>>> 
>>> Hi,
>>> 
>>> 
>>> I have an index that stores addresses from different countries.
>>> 
>>> 
>>> As every country has different stop words, I was wondering if it is 
>>> possible to apply a different set of stop words depending on the value of a 
>>> field. 
>>> 
>>> 
>>> Or do I need different indexes/do itnat the ETL step to accomplish this?
>>> 
>>> 
>> 
>> 
> 
> 


Re: xms/xmx choices

2019-12-06 Thread Dave
Actually, at about that time the replication finished and added about 20-30GB to 
the index from the master.  My current setup goes:
Indexing master -> indexer slave/production master (only replicated on 
command) -> three search slaves (replicate every 15 minutes)

We added about 2.3M docs, then I replicated to the production master, and 
since there was a change it replicated out to the slave node the GC came from.

I'll set one of the slaves to 31/31 and force all load to that one and see how 
she does. Thanks!
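
(For reference, pinning both sides of the heap at 31GB is typically done in solr.in.sh, e.g. SOLR_HEAP="31g" or SOLR_JAVA_MEM="-Xms31g -Xmx31g"; the exact file and variable depend on how the install was set up.)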


> On Dec 6, 2019, at 1:02 AM, Shawn Heisey  wrote:
> 
> On 12/5/2019 12:57 PM, David Hastings wrote:
>> That probably isnt enough data, so if youre interested:
>> https://gofile.io/?c=rZQ2y4
> 
> The previous one was less than 4 minutes, so it doesn't reveal anything 
> useful.
> 
> This one is a little bit less than two hours.  That's more useful, but still 
> pretty short.
> 
> Here's the "heap after GC" graph from the larger file:
> 
> https://www.dropbox.com/s/q9hs8fl0gfkfqi1/david.hastings.gc.graph.2019.12.png?dl=0
> 
> At around 14:15, the heap usage was rather high. It got up over 25GB. There 
> were some very long GCs right at that time, which probably means they were 
> full GCs.  And they didn't free up any significant amount of memory.  So I'm 
> betting that sometimes you actually *do* need a big chunk of that 60GB of 
> heap.  You might try reducing it to 31g instead of 60g.  Java's memory 
> usage is a lot more efficient if the max heap size is less than 32 GB.
> 
> I can't give you any information about what happened at that time which 
> required so much heap.  You could see if you have logfiles that cover that 
> timeframe.
> 
> Thanks,
> Shawn


Re: How to add a new field to already an existing index in Solr 6.6 ?

2019-12-08 Thread Dave
Or just do it the lazy way and use a dynamic field. I’ve found little to no 
drawbacks with them aside from a complete lack of documentation of the field in 
the schema itself 
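
A sketch of what that looks like in the schema (the suffix and type here are arbitrary):

<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>

Any document can then carry a field such as newfield_txt without touching the schema or reloading the collection.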

> On Dec 8, 2019, at 8:07 AM, David Barnett  wrote:
> 
> Also - look at adding fields using Solr admin, this will these will be
> available to use  (I believe) without the need to restart and is very
> easy to do.
> 
>> On Sun, 8 Dec 2019, 13:03 David Barnett,  wrote:
>> 
>> There is a few ways to add fields, adding the field definition in the
>> managed-schema will do this for you but make sure you have downloaded the
>> current config before you edit and reload the schema.
>> 
>> Google - solr 6.6 upconfig downconfig for lots of guides on this
>> 
>>> On Tue, 3 Dec 2019, 13:21 Erick Erickson,  wrote:
>>> 
>>> Update your schema to include the new field and reload your collection.
>>> 
>>> Then updating your field should work.
>>> 
>>> Best,
>>> Erick
>>> 
 On Dec 3, 2019, at 4:40 AM, Vignan Malyala 
>>> wrote:
 
 How to add a new field to an already existing index in Solr 6.6?
 
 I tried to use 'set' for this, but it shows an "undefined field" error.
 However, I could create a new index with 'set'.
 But how do I add a new field to already indexed data?
 Is it possible?
 
 Thank you!
 
 Regards,
 Sai
>>> 
>>> 


Re: Indexing strategies for user profiles

2019-12-10 Thread Dave


I would index the products a user purchased, as well as the number of times each 
was purchased. Then I would take a user and search their bought products, boosted 
by how many times each was purchased, against other users, with a facet on 
products, and filter out the top bought products that are not already on the 
user's purchased list.  That leaves you with a list of products purchased by users 
with the same buying habits as your user that they have not yet bought.  Over time 
you can tune your original search with geographic info, age, or other demographics 
that return more relatable users, etc. 
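
A rough sketch of that first query in plain query-parameter form (the field names purchased_products and user_id are invented for illustration, and the boosts would come from the target user's own purchase counts):

   q=purchased_products:(sku_123^8 OR sku_456^3 OR sku_789)
   &fq=-user_id:42
   &facet=true&facet.field=purchased_products&facet.mincount=2
   &rows=50

The facet counts over the matching (similar) users give you the candidate products; the client then drops anything already on user 42's purchase list.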

Fun mental project. It would be fun to have a data set like Amazon's or Walmart's 
or something, to see if you start getting legit results.  You could even see if 
you could start predicting their next purchase when you throw in things like time 
of year and recently purchased items, and cross-reference with the other users. 
With enough data I really start enjoying weaponizing Solr; it can be quite 
entertaining, as long as you have no qualms about privacy, or they clicked the 
little box allowing you to do anything you want. But that's how Facebook and the 
like make a lot of money: by taking your friends, following them around the 
Internet, and doing the above to place the exact ad for a bottle of wine that 
your best friend bought a few weeks ago and brought to an event you were 
invited to. 

Gets addicting:)

> On Dec 10, 2019, at 5:56 PM, Arnold Bronley  wrote:
> 
> Hi,
> 
> I have a Solr collection 'products' for different products that users
> interact with. With MoreLikeThis, I can retrieve for a given product
> another related product. Now, I want to create a Solr collection for users
> such that I can use MoreLikeThis approach between users and products. Not
> just that, I would also like to get relevant product for a user based on
> some sort of collaborative filtering. What should be my indexing
> and collection creation strategy to tackle this problem in general?


Re: does copyFields increase indexe size ?

2019-12-25 Thread Dave
#1 Merry Xmas.
#2 You initially said you were talking about 1k documents.  That will not be a 
large enough sample size to see the index size difference with this new field; 
in any case the index size should never really matter.  But if you go to a few 
million documents, you will notice the size has increased by a good amount. Other 
things come into play, like whether the index was wiped clean with a commit before 
indexing or reindexed without, or whether we are talking about documents that have 
a lot of similar words between them; many other scenarios can increase or 
decrease the index size. But no matter what, if you have a copy field, the text is 
going somewhere.
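
For reference, the copy in question is an index-time rule in the schema, something like (assuming the stock catch-all _text_ destination mentioned later in the thread):

   <copyField source="*" dest="_text_"/>

The destination field is typically indexed but not stored, so the extra bytes show up in the index structures (terms, postings) rather than in stored fields, which is also part of why the growth can be hard to spot on a tiny sample.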

> On Dec 25, 2019, at 3:07 AM, Nicolas Paris  wrote:
> 
> 
>> 
>> If you are redoing the indexing after changing the schema and
>> reloading/restarting, then you can ignore me.
> 
> I am sorry to say that I have to ignore you. Indeed, my tests include
> recreating the collection from scratch - with and without the copy
> fields.
> In both cases the index size is the same ! (while the _text_ field is
> working correctly)
> 
>> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote:
>>> On 12/24/2019 5:11 PM, Nicolas Paris wrote:
>>> Do you mean "copy fields" is only a matter of changing the schema?
>>> I was thinking it was adding a new field, and possibly a new index, to
>>> the collection
>> 
>> The copy that copyField does happens at index time.  Reindexing is required
>> after changing the schema, or nothing happens.
>> 
>> If you are redoing the indexing after changing the schema and
>> reloading/restarting, then you can ignore me.
>> 
>> Thanks,
>> Shawn
>> 
> 
> -- 
> nicolas


Re: Solr 7.5 speed up, accuracy details

2019-12-28 Thread Dave
There is no increase in raw speed, but there are new features. DocValues add some, 
but it’s hard to quantify, and some people think SolrCloud brings speed increases, 
but I don’t think they exist once hardware cost is taken out of the equation, and 
it adds too much complexity to something that should be simple.  
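
If it helps, the DocValues gain comes from flipping a single schema attribute on fields you sort, facet, or group on; a minimal sketch (the field name and type are just examples for a 7.x schema):

   <field name="price" type="pfloat" indexed="true" stored="true" docValues="true"/>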

> On Dec 28, 2019, at 12:52 PM, Rajdeep Sahoo  
> wrote:
> 
> Hi all,
>  How can I get the performance improvement features in indexing and search
> in solr 7.5...
> 
>> On Sat, 28 Dec, 2019, 9:18 PM Rajdeep Sahoo, 
>> wrote:
>> 
>> Thank you for the information
>>  Why are you recommending using the schema API instead of schema.xml?
>> 
>> 
>>> On Sat, 28 Dec, 2019, 8:01 PM Jörn Franke,  wrote:
>>> 
>>> This highly depends on how you designed your collections etc. - there is
>>> no general answer. You have to do a performance test based on your
>>> configuration and documents.
>>> 
>>> I also recommend to check the Solr documentation on how to design a
>>> collection for 7.x and maybe start even from scratch defining it with a new
>>> fresh schema (using the schema api instead of schema.xml and solrconfig.xml
>>> etc). You will have to reindex everything anyway, so it is also a good
>>> opportunity to look at your existing processes and optimize them.
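
For example, adding a field through the Schema API is just an HTTP call (the collection and field names here are placeholders):

   curl -X POST -H 'Content-type:application/json' \
     'http://localhost:8983/solr/mycollection/schema' \
     -d '{"add-field":{"name":"title_s","type":"string","stored":true}}'
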
>>> 
 Am 28.12.2019 um 15:19 schrieb Rajdeep Sahoo <
>>> rajdeepsahoo2...@gmail.com>:
 
 Hi all,
 Is there any way I can get the speed-up / accuracy details, i.e. the
 performance improvements of Solr 7.5 in comparison with Solr 4.6?
 Currently, we are using Solr 4.6 and we are in the process of upgrading to
 Solr 7.5. Need these details.
 
 Thanks in advance
>>> 
>> 


Re: Failed to connect to server

2020-01-17 Thread Dave
It doesn’t need to be identical, just anything with a build-on-reload statement.

> On Jan 17, 2020, at 12:17 PM, rhys J  wrote:
> 
> On Fri, Jan 17, 2020 at 12:10 PM David Hastings <
> hastings.recurs...@gmail.com> wrote:
> 
>> something like this in your solr config:
>> 
>>   <searchComponent name="suggest" class="solr.SuggestComponent">
>>     <lst name="suggester">
>>       <str name="name">autosuggest</str>
>>       <str name="exactMatchFirst">false</str>
>>       <str name="suggestAnalyzerFieldType">text</str>
>>       <float name="threshold">0.005</float>
>>       <str name="dictionaryImpl">DocumentDictionaryFactory</str>
>>       <str name="field">title</str>
>>       <str name="weightField">weight</str>
>>       <str name="buildOnCommit">true</str>
>>       <str name="buildOnOptimize">true</str>
>>     </lst>
>>   </searchComponent>
>> 
>> 
> I checked both /var/solr/solr/data/solr.xml and
> /var/solr/data/CORE/solrconfig.xml, and I did not find this entry.
> 
> Thanks,
> 
> Rhys


Re: Solr cloud production set up

2020-01-18 Thread Dave
Agreed with the above. What’s your idea of “huge”? I have 600-ish GB in one 
core, plus another 250 GB x 2 in two more, on the same standalone Solr instance, 
and it runs more than fine.

> On Jan 18, 2020, at 11:31 AM, Shawn Heisey  wrote:
> 
> On 1/18/2020 1:05 AM, Rajdeep Sahoo wrote:
>> Our index size is huge, and in master/slave the full indexing time is almost
>> 24 hrs.
>> In future the number of documents will increase.
>> So, please can someone recommend the number of nodes and configuration, like
>> RAM and CPU cores, for SolrCloud.
> 
> Indexing is not going to be any faster in SolrCloud.  It would probably be a 
> little bit slower.  The best way to speed up indexing, whether running 
> SolrCloud or not, is to make your indexing processes run in parallel, so that 
> multiple batches of documents are being indexed at the same time.
> 
> SolrCloud is not a magic bullet that solves all problems.  It's just a 
> different way of managing indexes that has more automation, and makes initial 
> setup of a distributed index a lot easier.  It doesn't do the job any faster 
> than running without SolrCloud.  The legacy master/slave mode is likely to be 
> a little bit faster.
> 
> You haven't provided any of the information required for us to guess about 
> the system requirements.  And it will be a guess ... we could be completely 
> wrong.
> 
> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> 
> Thanks,
> Shawn


Re: Solr cloud production set up

2020-01-18 Thread Dave
If you’re not getting values, don’t ask for the facet. Facets are expensive as 
hell; maybe you should think more about your queries than your infrastructure. 
SolrCloud won’t help you at all, especially if you’re asking for things you don’t 
need.
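
A hedged example of what trimming the request can look like (field names are made up): instead of sending all ~200 facet.field parameters on every search, send only the handful the page actually renders, e.g.

   q=shirt&facet=true&facet.mincount=1
   &facet.field=brand&facet.field=color&facet.field=size

and request the remaining facet groups lazily, only when the user expands them.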

> On Jan 18, 2020, at 1:25 PM, Rajdeep Sahoo  wrote:
> 
> We have assigned 16 GB out of 24 GB for heap.
> No other process is running on that node.
> 
> There are 200 facet fields in the query, but we will not be getting
> values for every facet on every search.
> There can be a max of 50-60 facets for which we will be getting values.
> 
> We are using caching; is it not going to help?
> 
> 
> 
>> On Sat, 18 Jan, 2020, 11:36 PM Shawn Heisey,  wrote:
>> 
>>> On 1/18/2020 10:09 AM, Rajdeep Sahoo wrote:
>>> We have 2.3 million documents and the size is 2.5 GB.
>>>   10-core CPU and 24 GB RAM. 16 slave nodes.
>>> 
>>>   Still, some of the queries are taking 50 sec at the Solr end,
>>> as we are using Solr 4.6.
>>>   The other thing is we have 200 (avg) facet fields in a query,
>>>  and 30 searchable fields.
>>>  Is there any way to identify why it is taking 50 sec for a query?
>>> Multiple concurrent requests are there.
>> 
>> Searching 30 fields and computing 200 facets is never going to be super
>> fast.  Switching to cloud will not help, and might make it slower.
>> 
>> Your index is pretty small to a lot of us.  There are people running
>> indexes with billions of documents that take terabytes of disk space.
>> 
>> As Walter mentioned, computing 200 facets is going to require a fair
>> amount of heap memory.  One *possible* problem here is that the Solr
>> heap size is too small, so a lot of GC is required.  How much of the
>> 24GB have you assigned to the heap?  Is there any software other than
>> Solr running on these nodes?
>> 
>> Thanks,
>> Shawn
>> 


Performance of facet contain search in 5.2.1

2015-07-21 Thread Lo Dave
I found that a facet contains search takes much longer than a facet prefix 
search. Does anyone have an idea how to make the contains search faster?
org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select 
params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.prefix=duty+of+care&rows=1&wt=json&facet=true&_=1437462916852}
 hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance] 
webapp=/solr path=/select 
params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.contains=duty+of+care&rows=1&wt=json&facet=true&facet.contains.ignoreCase=true}
 hits=1856 status=0 QTime=10951 
As shown above, the prefix search takes 5 ms but the contains search takes 10951 ms.
Thanks.
  

RE: Performance of facet contain search in 5.2.1

2015-07-22 Thread Lo Dave
Yes. I am going to provide autocomplete with the facet count as the rank, i.e. when 
your input is "owe a duty", the system will suggest "xxx owe a duty yyy" with the 
highest count.
Thanks.
Dave
> Date: Wed, 22 Jul 2015 14:35:40 +0100
> Subject: Re: Performance of facet contain search in 5.2.1
> From: benedetti.ale...@gmail.com
> To: solr-user@lucene.apache.org
> 
> I think, as Erick usually says, this is an X-Y problem.
> I think the user was trying to solve the infix autocomplete problem with
> faceting.
> 
> We should get from him the initial problem to try to suggest a better
> solution.
> 
> Cheers
> 
> 2015-07-22 14:01 GMT+01:00 Markus Jelsma :
> 
> > Hello - why not index the facet field as n-grams? It blows up the index
> > but is very fast!
> > Markus
> >
> > -Original message-
> > > From:Erick Erickson 
> > > Sent: Tuesday 21st July 2015 21:36
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Performance of facet contain search in 5.2.1
> > >
> > > "contains" has to basically examine each and every term to see if it
> > > matches. Say my
> > > facet.contains=bbb. A matching term could be
> > > aaabbbxyz
> > > or
> > > zzzbbbxyz
> > >
> > > So there's no way to _know_ when you've found them all without
> > > examining every last
> > > one. So I'd try to redefine the problem to not require that. If it's
> > > absolutely required,
> > > you can do some interesting things but it's going to inflate your index.
> > >
> > > For instance, "rotate" words (assuming word boundaries here). So, for
> > > instance, you have
> > > a text field with "my dog has fleas". Index things like
> > > my dog has fleas|my dog has fleas
> > > dog has fleas my|my dog has fleas
> > > has fleas my dog|my dog has fleas
> > > fleas my dog has|my dog has fleas
> > >
> > > Literally with the pipe followed by the original text. Now all your
> > > contains clauses are
> > > simple prefix facets, and you can have the UI split the token on the
> > > pipe and display the
> > > original.
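
With rotations indexed like that, the contains facet from the first message could become a plain prefix facet against the rotated field; a sketch, using autocomplete_rot as a hypothetical field name for the rotated values:

   q=sentence:"duty of care"&facet=true
   &facet.field=autocomplete_rot&facet.prefix=duty+of+care

The client then splits each returned term on the pipe to display the original phrase.
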
> > >
> > > Best,
> > > Erick
> > >
> > > On Tue, Jul 21, 2015 at 1:16 AM, Lo Dave  wrote:
> > > > I found that facet contain search take much longer time than facet
> > prefix search. Do anyone have idea how to make contain search faster?
> > > > org.apache.solr.core.SolrCore; [concordance] webapp=/solr path=/select
> > params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.prefix=duty+of+care&rows=1&wt=json&facet=true&_=1437462916852}
> > hits=1856 status=0 QTime=5 org.apache.solr.core.SolrCore; [concordance]
> > webapp=/solr path=/select
> > params={q=sentence:"duty+of+care"&facet.field=autocomplete&indent=true&facet.contains=duty+of+care&rows=1&wt=json&facet=true&facet.contains.ignoreCase=true}
> > hits=1856 status=0 QTime=10951
> > > > As show above, prefix search take 5 but contain search take 10951
> > > > Thanks.
> > > >
> >
> 
> 
> 
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card - http://about.me/alessandro_benedetti
> Blog - http://alexbenedetti.blogspot.co.uk
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England
  

Search speed issue on new core creation

2015-04-08 Thread dhaivat dave
Hello All,

I am using Master - Slave architecture setup with hundreds of cores getting
replicated between master and slave servers. I am facing very weird issue
while creating a new core.

Whenever there is a new call for a new core creation (using
CoreAdminRequest.createCore(coreName,instanceDir,serverObj)) all the
searches issued to other cores are getting blocked.

Any help or thoughts would be highly appreciated.

Regards,
Dhaivat


Poor Solr Cloud Query Performance against a Small Dataset

2016-11-01 Thread Dave Seltzer
Hello!

I'm trying to utilize Solr Cloud to help with a hash search problem. The
record set has only 4,300 documents.

When I run my search against a single core I get results on the order of
10ms. When I run the same search against Solr Cloud results take about
5,000 ms.

Is there something about this particular query which makes it perform
poorly in a Cloud environment? The query looks like this (linebreaks added
for readability):

{!frange+l%3D5+u%3D25}sum(
termfreq(hashTable_0,'225706351'),
termfreq(hashTable_1,'17664000'),
termfreq(hashTable_2,'86447642'),
termfreq(hashTable_3,'134816033'),
termfreq(hashTable_4,'1061820218'),
termfreq(hashTable_5,'543627850'),
termfreq(hashTable_6,'-1828379348'),
termfreq(hashTable_7,'423236759'),
termfreq(hashTable_8,'522192943'),
termfreq(hashTable_9,'572537937'),
termfreq(hashTable_10,'286991887'),
termfreq(hashTable_11,'789711386'),
termfreq(hashTable_12,'235801909'),
termfreq(hashTable_13,'67109911'),
termfreq(hashTable_14,'609628285'),
termfreq(hashTable_15,'1796472850'),
termfreq(hashTable_16,'202312085'),
termfreq(hashTable_17,'306200840'),
termfreq(hashTable_18,'85657669'),
termfreq(hashTable_19,'671548727'),
termfreq(hashTable_20,'71309060'),
termfreq(hashTable_21,'1125848323'),
termfreq(hashTable_22,'1077548043'),
termfreq(hashTable_23,'117638159'),
termfreq(hashTable_24,'-1408039642'))

The schema looks like this:

   
   
   
   
   
   subFingerprintId

I've included some sample output below. I wasn't sure if this was a matter
of changing the routing key in the collections system, or if this is a more
fundamental problem with the way Term Frequencies are counted in a Solr
Cloud environment.
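
(One thing I may try, though I have not verified it against this data set: with only 4,300 documents, keeping the collection on a single shard should avoid the distributed fan-out for the frange query entirely, e.g.

   bin/solr create -c fingerprints -shards 1 -replicationFactor 2

where the collection name is just a placeholder.)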

Many thanks!

-Dave

-- Single Core Example Query:
{
  "responseHeader":{
"status":0,
"QTime":13,
"params":{
  "q":"{!frange l=5
u=25}sum(termfreq(hashTable_0,'354749018'),termfreq(hashTable_1,'286534657'),termfreq(hashTable_2,'1798007322'),termfreq(hashTable_3,'151854851'),termfreq(hashTable_4,'142869766'),termfreq(hashTable_5,'240584768'),termfreq(hashTable_6,'68120837'),termfreq(hashTable_7,'134945863'),termfreq(hashTable_8,'688067644'),termfreq(hashTable_9,'621220625'),termfreq(hashTable_10,'1732446991'),termfreq(hashTable_11,'505547282'),termfreq(hashTable_12,'135990559'),termfreq(hashTable_13,'123097623'),termfreq(hashTable_14,'454174225'),termfreq(hashTable_15,'788988675'),termfreq(hashTable_16,'53480196'),termfreq(hashTable_17,'487550779'),termfreq(hashTable_18,'455477045'),termfreq(hashTable_19,'1141310997'),termfreq(hashTable_20,'71322652'),termfreq(hashTable_21,'805503533'),termfreq(hashTable_22,'656158000'),termfreq(hashTable_23,'302410303'),termfreq(hashTable_24,'194970957'))",
  "indent":"on",
  "wt":"json",
  "debugQuery":"on",
  "_":"1478024378680"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"subFingerprintId":"f6c9093e-e8e9-4c0f-aa2a-387b46e7ef2a",
"trackId":"5207095a-0126-4c41-8787-16d41165158a",
"sequenceNumber":136,
"sequenceAt":12.5399129172714,
"hashTable_0":354749018,
"hashTable_1":287779841,
"hashTable_2":1797994010,
"hashTable_3":151854851,
"hashTable_4":375260422,
"hashTable_5":441911360,
"hashTable_6":68120837,
"hashTable_7":420158535,
"hashTable_8":16979004,
"hashTable_9":1443304209,
"hashTable_10":1732468239,
"hashTable_11":455215642,
"hashTable_12":135990559,
"hashTable_13":123093271,
"hashTable_14":1444029969,
"hashTable_15":788988675,
"hashTable_16":53480196,
"hashTable_17":488255035,
"hashTable_18":505809973,
"hashTable_19":201814293,
"hashTable_20":70208520,
"hashTable_21":805503541,
"hashTable_22":658713904,
"hashTable_23":302387775,
"hashTable_24":19497095
