Increasing Solr5 time out from 30 seconds while starting solr

2015-12-07 Thread D

Hi,

Many times while starting Solr I see the message below, and then Solr
is not reachable.


debraj@boutique3:~/solr5$ sudo bin/solr start -p 8789
Waiting to see Solr listening on port 8789 [-]
Still not seeing Solr listening on 8789 after 30 seconds!


However, when I try to start Solr again by executing the same command, it
says "solr is already running on port 8789. Try using a different port
with -p".


I have two cores in my local set-up. I am guessing this is happening
because one of the cores is a little big, so Solr is timing out while
loading that core. If I take that core out of Solr then everything works
fine.


Can someone let me know how I can increase this timeout value from the
default 30 seconds?


I am using Solr 5.2.1 on Debian 7.
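For reference, the 30-second wait appears to be hard-coded in the bin/solr
script of this release, and the JVM keeps loading cores in the background
even after the warning is printed. A minimal workaround sketch, assuming
curl is available and using the port from the message above, is to ignore
the script's warning and poll Solr yourself until it answers:

  sudo bin/solr start -p 8789 || true
  # Poll the admin info endpoint for up to 120 seconds until Solr answers.
  for i in $(seq 1 120); do
    if curl -sf "http://localhost:8789/solr/admin/info/system" > /dev/null; then
      echo "Solr is answering after ~${i}s"
      break
    fi
    sleep 1
  done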

Thanks,




Re: Solr edismax field boosting

2016-05-09 Thread Nick D
You can add the debug flag to the end of the request and see exactly what
the scoring is and why things are happening.

&debug=ALL will show you everything including the scoring.

Showing the result of the debug query, or adding it into your question
here, should help you decipher what is going on with your scoring and how
the boosts are(n't) working.
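For example, a hedged sketch of an edismax request with per-field boosts
and full debug output (the collection and field names here are
placeholders, not the poster's exact setup):

  curl "http://localhost:8983/solr/mycollection/select" \
    --data-urlencode "q=foo" \
    --data-urlencode "defType=edismax" \
    --data-urlencode "qf=metatag.description^9 title^8" \
    --data-urlencode "debug=all" \
    --data-urlencode "wt=json"

The "explain" section of the response shows, per document, which field
matches and boosts contributed to the final score.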

Nick

On Mon, May 9, 2016 at 7:22 PM, Megha Bhandari 
wrote:

> Hi
>
> We are trying to boost certain fields with relevancy. However we are not
> getting results as per expectation. Below is the configuration in
> solr-config.xml.
> Even though the title field has a lesser boost than metatag.description
> results for title field are coming higher.
>
> We even created test data that have data only in description in
> metatag.description and title. Example , page 1 has foo in description and
> page 2 has foo in title. Solr is still returning page 2 before page 1.
>
> We are using Solr 5.5 and Nutch 1.11 currently.
>
> Following is the configuration we are using. Any ideas on what we are
> missing to enable correct field boosting?
>
> 
> 
>   
> metatag.keywords^10 metatag.description^9 title^8 h1^7 h2^6 h3^5
> h4^4 id _text_^1
>   
>   explicit
>   10
>
>   
>
>   explicit
>   _text_
>   default
>   on
>   false
>   10
>   5
>   5
>   false
>   true
>   10
>   5
> 
>   id title metatag.description itemtype
> lang metatag.hideininternalsearch metatag.topresultthumbnailalt
> metatag.topresultthumbnailurl playerid playerkey
>   on
>   0
>   title metatag.description
>   
>   
> 
> 
>   spellcheck
> elevator
> 
>   
>
> Thanks
> Megha
>


Re: Solr edismax field boosting

2016-05-09 Thread Nick D
;q.alt":"Upendra",
>   "indent":"on",
>   "qf":"metatag.description^9 h1^7 h2^6 h3^5 h4^4 _text_^1 id^0.5",
>   "wt":"json",
>   "debugQuery":"on",
>   "_":"1462810987788"}},
>   "response":{"numFound":3,"start":0,"maxScore":0.8430033,"docs":[
>   {
> "h2":["Looks like your browser is a little out-of-date."],
> "h3":["Already a member?"],
> "strtitle":"Upendra Custon",
> "id":"
> http://localhost:4503/content/uhcdotcom/en/home/waysin/poc/upendra-custon.html
> ",
> "tstamp":"2016-05-09T17:15:57.604Z",
> "metatag.hideininternalsearch":[false],
> "segment":[20160509224553],
> "digest":["844296a63233b3e4089424fe1ec9d036"],
> "boost":[1.4142135],
> "lang":"en",
> "_version_":1533871839698223104,
> "host":"localhost",
> "url":"
> http://localhost:4503/content/uhcdotcom/en/home/waysin/poc/upendra-custon.html
> ",
> "score":0.8430033},
>   {
> "metatag.description":"test",
> "h1":["Health care"],
> "h2":["Looks like your browser is a little out-of-date."],
> "h3":["Already a member?"],
> "strtitle":"Upendra",
> "id":"
> http://localhost:4503/content/uhcdotcom/en/home/waysin/poc/upendra-custon/health-care.html
> ",
> "tstamp":"2016-05-09T17:15:57.838Z",
> "metatag.hideininternalsearch":[false],
> "metatag.topresultthumbnailalt":[","],
> "segment":[20160509224553],
> "digest":["dd4ef8879be2d4d3f28e24928e9b84c5"],
> "boost":[1.4142135],
> "lang":"en",
> "metatag.keywords":",",
> "_version_":1533871839731777536,
> "host":"localhost",
> "url":"
> http://localhost:4503/content/uhcdotcom/en/home/waysin/poc/upendra-custon/health-care.html
> ",
> "score":0.7009616},
>   {
> "metatag.description":"Upendra decription testing",
> "h1":["healthcare description"],
> "h2":["Looks like your browser is a little out-of-date."],
> "h3":["Already a member?"],
> "strtitle":"healthcare description",
> "id":"
> http://localhost:4503/content/uhcdotcom/en/home/waysin/poc/upendra-custon/healthcare-description.html
> ",
> "tstamp":"2016-05-09T17:15:57.682Z",
> "metatag.hideininternalsearch":[false],
> "metatag.topresultthumbnailalt":[","],
> "segment":[20160509224553],
> "digest":["6262795db6aed05a5de7cc3cbe496401"],
> "boost":[1.4142135],
> "lang":"en",
> "metatag.keywords":",",
> "_version_":1533871839739117568,
> "host":"localhost",
> "url":"
> http://localhost:4503/content/uhcdotcom/en/home/waysin/poc/upendra-custon/healthcare-description.html
> ",
> "score":0.5481102}]
>   },
>   "debug":{
> "rawquerystring":"Upendra",
> "querystring":"Upendra",
>
> "parsedquery":"(+DisjunctionMaxQuery(((metatag.description:Upendra)^9.0 |
> (h1:Upendra)^7.0 | (h2:Upendra)^6.0 | (h3:Upendra)^5.0 | (id:Upendra)^0.5 |
> (h4:Upendra)^4.0 | _text_:upendra)~0.99))/no_coord",
> "parsedquery_toString":"+((metatag.description:Upendra)^9.0 |
> (h1:Upendra)^7.0 | (h2:Upendra)^6.0 | (h3:Upendra)^5.0 | (id:Upendra)^0.5 |
> (h4:Upendra)^4.0 | _text_:upendra)~0.99",
> "explain":{
>   "
> http://localhost:4503/content/uhcdotcom/en/home/waysin/poc/upendra-custon.html":"\n0.84300333
> = max plus 0.99 times others of:\n  0.84300333 = weight(_text_:upendra in
> 0) [], result of:\n0.84300333 = score(doc=0,freq=6.0 = termFreq=6.0\n),
> product of:\n  0.44183275 = id

Re: Solr edismax field boosting

2016-05-10 Thread Nick D
plain'=>{
>   'http://localhost:4503/baseurl/upendra-custon.html'=>'
> 0.14641379 = max of:
>   0.14641379 = weight(_text_:upendra in 0) [], result of:
> 0.14641379 = score(doc=0,freq=8.0 = termFreq=8.0
> ), product of:
>   0.074107975 = idf(docFreq=6, docCount=6)
>   1.9756819 = tfNorm, computed from:
> 8.0 = termFreq=8.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 168.3 = avgFieldLength
> 113.8 = fieldLength
> ',
>   '
> http://localhost:4503/baseurl/upendra-custon/care-body-content.html'=>'
> 0.13738367 = max of:
>   0.13738367 = weight(_text_:upendra in 1) [], result of:
> 0.13738367 = score(doc=1,freq=4.0 = termFreq=4.0
> ), product of:
>   0.074107975 = idf(docFreq=6, docCount=6)
>   1.853831 = tfNorm, computed from:
> 4.0 = termFreq=4.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 168.3 = avgFieldLength
> 83.591835 = fieldLength
> ',
>   'http://localhost:4503/baseurl/upendra-custon/care-keyword.html'=>'
> 0.13738367 = max of:
>   0.13738367 = weight(_text_:upendra in 2) [], result of:
> 0.13738367 = score(doc=2,freq=4.0 = termFreq=4.0
> ), product of:
>   0.074107975 = idf(docFreq=6, docCount=6)
>   1.853831 = tfNorm, computed from:
> 4.0 = termFreq=4.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 168.3 = avgFieldLength
> 83.591835 = fieldLength
> ',
>   'http://localhost:4503/baseurl/upendra-custon/care.html'=>'
> 0.13286635 = max of:
>   0.13286635 = weight(_text_:upendra in 3) [], result of:
> 0.13286635 = score(doc=3,freq=4.0 = termFreq=4.0
> ), product of:
>   0.074107975 = idf(docFreq=6, docCount=6)
>   1.7928753 = tfNorm, computed from:
> 4.0 = termFreq=4.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 168.3 = avgFieldLength
> 113.8 = fieldLength
> ',
>   '
> http://localhost:4503/baseurl/upendra-custon/care-description.html'=>'
> 0.13053702 = max of:
>   0.13053702 = weight(_text_:upendra in 4) [], result of:
> 0.13053702 = score(doc=4,freq=3.0 = termFreq=3.0
> ), product of:
>   0.074107975 = idf(docFreq=6, docCount=6)
>   1.7614436 = tfNorm, computed from:
> 3.0 = termFreq=3.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 168.3 = avgFieldLength
> 83.591835 = fieldLength
> ',
>   'http://localhost:4503/baseurl/upendra-custon/care-without-.html'=>'
> 0.11870542 = max of:
>   0.11870542 = weight(_text_:upendra in 5) [], result of:
> 0.11870542 = score(doc=5,freq=2.0 = termFreq=2.0
> ), product of:
>   0.074107975 = idf(docFreq=6, docCount=6)
>   1.6017901 = tfNorm, computed from:
> 2.0 = termFreq=2.0
> 1.2 = parameter k1
> 0.75 = parameter b
> 168.3 = avgFieldLength
> 83.591835 = fieldLength
> '},
> 'QParser'=>'ExtendedDismaxQParser',
> 'altquerystring'=>nil,
> 'boost_queries'=>nil,
> 'parsed_boost_queries'=>[],
> 'boostfuncs'=>nil,
> 'timing'=>{
>   'time'=>6.0,
>   'prepare'=>{
> 'time'=>0.0,
> 'query'=>{
>   'time'=>0.0},
> 'facet'=>{
>   'time'=>0.0},
> 'facet_module'=>{
>   'time'=>0.0},
> 'mlt'=>{
>   'time'=>0.0},
> 'highlight'=>{
>   'time'=>0.0},
> 'stats'=>{
>   'time'=>0.0},
> 'expand'=>{
>   'time'=>0.0},
> 'debug'=>{
>   'time'=>0.0}},
>   'process'=>{
> 'time'=>5.0,
> 'query'=>{
>   'time'=>0.0},
> 'facet'=>{
>   'time'=>0.0},
> 'facet_module'=>{
>   'time'=>0.0},
> 'mlt'=>{
>   'time'=>0.0},
> 'highlight'=>{
>   'time'=>0.0},
> 'stats'=>{
>   'time'=>0.0},
> 'expand'=>{
>   'time'=>0.0},
> 'debug'=>{
>

Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Nick D
You can use a combination of ngram or edgengram fields, and possibly the
shingle factory if you want to combine words. You might also want an exact
text field with no query slop if the two words, even as partial text, need
to be right next to each other. EdgeNGram is great for left-to-right
matching; NGram is great just for splitting up by size. There are a number
of tokenizers you can try out.
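If it helps, here is a hedged sketch of defining an ngram-analyzed field
type through the Schema API (this assumes a managed schema; the type name
and gram sizes are placeholders, and an EdgeNGramFilterFactory could be
swapped in for left-to-right grams):

  curl -X POST "http://localhost:8983/solr/mycore/schema" \
    -H "Content-type: application/json" \
    -d '{
      "add-field-type": {
        "name": "text_ngram",
        "class": "solr.TextField",
        "analyzer": {
          "tokenizer": { "class": "solr.StandardTokenizerFactory" },
          "filters": [
            { "class": "solr.LowerCaseFilterFactory" },
            { "class": "solr.NGramFilterFactory",
              "minGramSize": "3", "maxGramSize": "15" }
          ]
        }
      }
    }'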

Nick
On May 10, 2016 9:22 AM, "Thrinadh Kuppili"  wrote:

> I am trying to search a field named Address which has a space in it.
> Example :
> Address has the below values in it.
> 1. 2000 North Derek Dr Fullerton
> 2. 2011 N Derek Drive Fullerton
> 3. 2108 N Derek Drive Fullerton
> 4. 2100 N Derek Drive Fullerton
> 5. 2001 N Drive Derek Fullerton
>
> Search Query:- Derek Drive or rek Dr
> Expectation is it should return all  2,3,4 and it should not return 1 & 5 .
>
> Finally i am trying to find a word which can search similar to database
> search of %N Derek%
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to search in solr for words like %rek Dr%

2016-05-10 Thread Nick D
I don't really get what "Q= {!dismax qf=address} "rek Dr*" - It is not
allowed since prefix in Quotes is not allowed" means. Why can't you use
exact phrase matching? Do you have some limitation on quoting? Since you
are specifically looking for an exact phrase, I don't see why you wouldn't
want exact matching.


Anyways

You can look into using another type of tokenizer; my guess is you are
probably using the standard tokenizer or possibly the whitespace tokenizer.
You may want to try a different one and see what results you get. Also, you
probably won't need to use the wildcards if you set up your gram sizes the
way you want.

The shingle factory can do stuff like this (my memory is a bit fuzzy on
this, but I have played with it in the admin page):

This is a sentence
shingle = 4
this_is_a_sentence

Combine that with your ngram factory, with minGramSize=4 and
maxGramSize=50, and you can get something like:
this
this_i
this_is

this_is_a_sentence

his_i
his_is

his_is_a_sentence

etc.


Then apply the shingle factory at query time so that something like
"his is" becomes his_is, and you will get that phrase back.

My personal favorite is just using EdgeNGram on something like the
following; the concept is the same with regular old NGram:

2001 N Drive Derek Fullerton

2          [32]                          0   1   1  word  1
20         [32 30]                       0   2   1  word  1
200        [32 30 30]                    0   3   1  word  1
2001       [32 30 30 31]                 0   4   1  word  1
n          [6e]                          5   6   1  word  2
d          [64]                          7   8   1  word  3
dr         [64 72]                       7   9   1  word  3
dri        [64 72 69]                    7  10   1  word  3
driv       [64 72 69 76]                 7  11   1  word  3
drive      [64 72 69 76 65]              7  12   1  word  3
d          [64]                         13  14   1  word  4
de         [64 65]                      13  15   1  word  4
der        [64 65 72]                   13  16   1  word  4
dere       [64 65 72 65]                13  17   1  word  4
derek      [64 65 72 65 6b]             13  18   1  word  4
f          [66]                         19  20   1  word  5
fu         [66 75]                      19  21   1  word  5
ful        [66 75 6c]                   19  22   1  word  5
full       [66 75 6c 6c]                19  23   1  word  5
fulle      [66 75 6c 6c 65]             19  24   1  word  5
fuller     [66 75 6c 6c 65 72]          19  25   1  word  5
fullert    [66 75 6c 6c 65 72 74]       19  26   1  word  5
fullerto   [66 75 6c 6c 65 72 74 6f]    19  27   1  word  5
fullerton  [66 75 6c 6c 65 72 74 6f 6e] 19  28   1  word  5

Works great for a quick type-ahead field type.

Oh, and by the way, your ngram size is too small for _rek_ to be produced
from _derek_.


Setting up a few different field types and playing with the analyzer in
the admin page will give you a good idea of what the index-time and
query-time results look like, and with your tiny data set that is the best
way I can think of to see instant results with your new field types.
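As a side note, the same analysis the admin page shows can be pulled from
the command line through the field analysis handler; a hedged sketch (core
and field type names are placeholders, and the handler is registered at
/analysis/field in the stock configs):

  curl "http://localhost:8983/solr/mycore/analysis/field" \
    --data-urlencode "analysis.fieldtype=text_edge" \
    --data-urlencode "analysis.fieldvalue=2001 N Drive Derek Fullerton" \
    --data-urlencode "wt=json"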

Nick

On Tue, May 10, 2016 at 10:01 AM, Thrinadh Kuppili 
wrote:

> I have tried with  maxGramSize="12"/> and search using the Extended Dismax
>
> Q= {!dismax qf=address} rek Dr* - It did not work as expected since i am
> getting all the records which has rek, Dr .
>
> Q= {!dismax qf=address} "rek Dr*" - It is not allowed since a prefix in
> quotes is not allowed.
>
> Q= {!complexphrase inOrder=true}address:"rek dr*" - It did not work since
> it
> is searching for words starts with rek
>
> I am not aware of shingle factory as of now will try to use and findout how
> i can use.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-search-in-solr-for-words-like-rek-Dr-tp4275854p4275859.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dynamically change solr suggest field

2016-05-11 Thread Nick D
There are only two ways I can think of to accomplish this, and neither of
them dynamically sets the suggester field: as far as I can tell from the
docs (which do sometimes have gaps, so I might be wrong), you cannot set
something like *suggest.fl=combo_box_field* at query time. But maybe these
two can help you get started.

1. Multiple suggester request handlers, one for each option in the combo
box. This way you just change the request handler in the query you submit
based on the context.

2. Use copy fields to put all possible suggestions into the same field (so
no more dynamic field settings), add another field defining which combo-box
option each document belongs to, and use context filters, which can be
passed at query time to limit the suggestions to whatever is selected in
the combo box (see the sketch after the link below).
https://cwiki.apache.org/confluence/display/solr/Suggester#Suggester-ContextFiltering
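A hedged sketch of what option 2 can look like at query time, assuming a
suggester built with a contextField and a lookup implementation that
supports context filtering (handler path, dictionary name and values are
placeholders):

  curl "http://localhost:8983/solr/mycore/suggest" \
    --data-urlencode "suggest=true" \
    --data-urlencode "suggest.dictionary=shared_suggester" \
    --data-urlencode "suggest.q=lapt" \
    --data-urlencode "suggest.cfq=electronics"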

Hope this helps a bit

Nick

On Wed, May 11, 2016 at 7:05 AM, Lasitha Wattaladeniya 
wrote:

> Hello devs,
>
> I'm trying to implement auto complete text suggestions using solr. I have a
> text box and next to that there's a combo box. So the auto complete should
> suggest based on the value selected in the combo box.
>
> Basically I should be able to change the suggest field based on the value
> selected in the combo box. I was trying to solve this problem whole day but
> not much luck. Can anybody tell me is there a way of doing this ?
>
> Regards,
> Lasitha.
>
> Lasitha Wattaladeniya
> Software Engineer
>
> Mobile : +6593896893
> Blog : techreadme.blogspot.com
>


Re: More Like This on not new documents

2016-05-13 Thread Nick D
https://wiki.apache.org/solr/MoreLikeThisHandler

Bottom of the page, using content streams. I believe this still works in
newer versions of Solr, although I have not tested it on a recent one.

But if you plan on indexing the document anyway, then just indexing it and
then passing the ID to mlt isn't a bad thing at all.
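A hedged sketch of the content-stream variant, assuming a MoreLikeThis
handler registered at /mlt and placeholder field names; the body of the
POST is the text of the not-yet-indexed document:

  curl "http://localhost:8983/solr/mycore/mlt?mlt.fl=title,body&mlt.mintf=1&mlt.mindf=1&wt=json" \
    -H "Content-Type: text/plain" \
    --data-binary "Full text of the new document goes here..."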

Nick

On Fri, May 13, 2016 at 2:23 AM, Vincenzo D'Amore 
wrote:

> Hi all,
>
> anybody know if is there a chance to use the mlt component with a new
> document not existing in the collection?
>
> In other words, if I have a new document, should I always first add it to
> my collection and only then, using the mlt component, have the list of
> similar documents?
>
>
> Best regards,
> Vincenzo
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>


Re: Facet data type

2016-05-26 Thread Nick D
Although you did mention that you won't need to sort and that you are using
multiValued=true, on the off chance you do change to something like
multiValued=false docValues=false, then this will come into play:

https://issues.apache.org/jira/browse/SOLR-7495

This has been a rather large pain to deal with in terms of faceting (the
Lucene change that caused a number of issues is also referenced in that
Jira).

Nick


On Thu, May 26, 2016 at 11:45 AM, Erick Erickson 
wrote:

> I always prefer ints to strings, they can't help but take
> up less memory, comparing two ints is much faster than
> two strings etc. Although Lucene can play some tricks
> to make that less noticeable.
>
> Although if these are just a few values, it'll be hard to
> actually measure the perf difference.
>
> And if it's a _lot_ of unique values, you have other problems
> than the int/string distinction. Faceting on very high
> cardinality fields is something that can have performance
> implications.
>
> But I'd certainly add docValues="true" to the definition no matter
> which you decide on.
>
> Best,
> Erick
>
> On Wed, May 25, 2016 at 9:29 AM, Steven White 
> wrote:
> > Hi everyone,
> >
> > I will be faceting on data of type integers and I'm wonder if there is
> any
> > difference on how I design my schema.  I have no need to sort or use
> range
> > facet, given this, in terms of Lucene performance and index size, does it
> > make any difference if I use:
> >
> > #1:  indexed="true"
> > required="true" stored="false"/>
> >
> > Or
> >
> > #2:  > required="true" stored="false"/>
> >
> > (notice how I changed the "type" from "string" to "int" in #2)
> >
> > Thanks in advanced.
> >
> > Steve
>


Re: Facet data type

2016-05-27 Thread Nick D
Steven,

The case that I was pointing to was specifically about an int needing to be
set to multiValued=true for the field to be usable as a facet.field. I
personally ran into it when upgrading to 5.x from 4.10.2. I believe setting
docValues=true will not have an effect (untested by me, but there was
mention of that in the Jira). There are also some linked Jiras that talk
about other issues with facets in 5.x, but my guess is that if you aren't
upgrading from 4.x to 5.x you probably won't hit the issue, though there
are some things people are finding with docValues and performance with 4.x
upgrades.

I think there are some even more knowledgeable people on here who could
chime in with a more detailed explanation or correct me if I misspoke.
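For what it's worth, a hedged sketch of Erick's suggestion from the quoted
reply below (managed schema assumed; the core and field names are
placeholders, and "int" refers to the stock trie int type): define the
facet field with docValues and facet on it.

  curl -X POST "http://localhost:8983/solr/mycore/schema" \
    -H "Content-type: application/json" \
    -d '{ "add-field": { "name": "category_id", "type": "int",
          "indexed": true, "stored": false, "docValues": true } }'

  curl "http://localhost:8983/solr/mycore/select" \
    --data-urlencode "q=*:*" \
    --data-urlencode "rows=0" \
    --data-urlencode "facet=true" \
    --data-urlencode "facet.field=category_id"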

Nick

On Fri, May 27, 2016 at 12:11 PM, Steven White  wrote:

> Thanks Erick.
>
> What about Solr defect SOLR-7495 that Nick mentioned?  It sounds like
> because of this defect, I should NOT set docValues="true" on a filed when:
> a) type="int" and b) multiValued="true".  Can you confirm that I got this
> right?  I'm on Solr 5.2.1
>
> Steve
>
>
> On Fri, May 27, 2016 at 1:30 PM, Erick Erickson 
> wrote:
>
> > bq: my index size grew by 20%.  Is this expected
> >
> > Yes. But don't worry about it ;). Basically, you've serialized
> > to disk the "uninverted" form of the field. But, that is
> > accessed through Lucene by MMapDirectory, see:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > If you don't use DocValues, the uninverted version
> > is built in Java's memory, which is much more expensive
> > for a variety of reasons. What you lose in disk size you gain
> > in a lower JVM footprint, fewer GC problems etc.
> >
> > But the implication is, indeed, that you should use DocValues
> > for field you intend to facet and/or sort etc on. If you only search
> > it's just wasted space.
> >
> > Best,
> > Erick
> >
> > On Fri, May 27, 2016 at 6:25 AM, Steven White 
> > wrote:
> > > Thank you Erick for pointing out about DocValues.  I re-indexed my data
> > > with it set to true and my index size grew by 20%.  Is this expected?
> > >
> > > Hi Nick, I'm not clear about SOLR-7495.  Are you saying I should not
> use
> > > docValues=true if:type="int"and multiValued="true"?  I'm on Solr 5.2.1.
> > > Thanks.
> > >
> > > Steve
> > >
> > > On Thu, May 26, 2016 at 9:29 PM, Nick D  wrote:
> > >
> > >> Although you did mention that you wont need to sort and you are using
> > >> mutlivalued=true. On the off chance you do change something like
> > >> multivalued=false docValues=false then this will come in to play:
> > >>
> > >> https://issues.apache.org/jira/browse/SOLR-7495
> > >>
> > >> This has been a rather large pain to deal with in terms of faceting.
> > (the
> > >> Lucene change that caused a number of Issues is also referenced in
> this
> > >> Jira).
> > >>
> > >> Nick
> > >>
> > >>
> > >> On Thu, May 26, 2016 at 11:45 AM, Erick Erickson <
> > erickerick...@gmail.com>
> > >> wrote:
> > >>
> > >> > I always prefer ints to strings, they can't help but take
> > >> > up less memory, comparing two ints is much faster than
> > >> > two strings etc. Although Lucene can play some tricks
> > >> > to make that less noticeable.
> > >> >
> > >> > Although if these are just a few values, it'll be hard to
> > >> > actually measure the perf difference.
> > >> >
> > >> > And if it's a _lot_ of unique values, you have other problems
> > >> > than the int/string distinction. Faceting on very high
> > >> > cardinality fields is something that can have performance
> > >> > implications.
> > >> >
> > >> > But I'd certainly add docValues="true" to the definition no matter
> > >> > which you decide on.
> > >> >
> > >> > Best,
> > >> > Erick
> > >> >
> > >> > On Wed, May 25, 2016 at 9:29 AM, Steven White  >
> > >> > wrote:
> > >> > > Hi everyone,
> > >> > >
> > >> > > I will be faceting on data of type integers and I'm wonder if
> there
> > is
> > >> > any
> > >> > > difference on how I design my schema.  I have no need to sort or
> use
> > >> > range
> > >> > > facet, given this, in terms of Lucene performance and index size,
> > does
> > >> it
> > >> > > make any difference if I use:
> > >> > >
> > >> > > #1:  > >> > indexed="true"
> > >> > > required="true" stored="false"/>
> > >> > >
> > >> > > Or
> > >> > >
> > >> > > #2:  > indexed="true"
> > >> > > required="true" stored="false"/>
> > >> > >
> > >> > > (notice how I changed the "type" from "string" to "int" in #2)
> > >> > >
> > >> > > Thanks in advanced.
> > >> > >
> > >> > > Steve
> > >> >
> > >>
> >
>


Re: Trouble connecting to IRC

2017-06-29 Thread Aravind D
Hi Anshum,

I'm not having any issue connecting. 

-Aravind



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Trouble-connecting-to-IRC-tp4343512p4343515.html
Sent from the Solr - User mailing list archive at Nabble.com.


Weighting the Lucene score

2008-08-26 Thread s d
I want to compute a weighted average of the Lucene score with an additional
score I have, i.e. (W1 * Lucene score + W2 * other score) / (W1 + W2).
What is the easiest way to do this?
Also, is the Lucene score normalized?
Thanks,


Re: Weighting the Lucene score

2008-08-26 Thread s d
But function query doesn't give access to the SOLR score, only to fields in
the index, no ?
thx
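For what it's worth, a hedged sketch of the additive flavor of what Otis
suggests below: the dismax bf parameter adds a function of an indexed
numeric field to the relevance score (so this gives W * other score added
to the Lucene score rather than a true normalized weighted average;
"other_score" is a placeholder field name):

  curl "http://localhost:8983/solr/select" \
    --data-urlencode "qt=dismax" \
    --data-urlencode "q=some query" \
    --data-urlencode "qf=text" \
    --data-urlencode "bf=product(other_score,0.3)"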

On Tue, Aug 26, 2008 at 2:02 PM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> I think the easiest approach might be making use of Lucene's function
> query.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: s d <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Tuesday, August 26, 2008 1:55:38 PM
> > Subject: Weighting the Licene score
> >
> > I want to weighted average the Lucene score with an additional score i
> have,
> > i.e. (W1 * Lucene score + W2 * Other score) / (W1 + W2) .
> > What is the easiest way to do this?
> > Also, is the Lucene score normalized.
> > Thanks,
>
>


Partitioning the index

2008-12-17 Thread s d
Hi,
Is there a recommended index size (on disk, number of documents) for when
to start partitioning it to ensure good response time?
Thanks,
S


Multicore admin problem in Websphere

2010-04-28 Thread D. Manning
Hello, I have Solr 1.4 deployed in WebSphere 6.1. I'm trying to add a
URL-based security constraint to my project, but if I specify the core name
in the constraint, the path to the admin of each core gives a 404 error.
Does anyone have any experience of this or suggestions for how I can work
around it?

Thank you


shards design/customization coding question

2010-05-17 Thread D C

We have a large index, separated into multiple shards, that consists of
records exported from a database. One requirement is to support near
real-time synchronization with the database. To accomplish this we are
considering creating a "daily" shard where created and updated documents
(records never get deleted) will be posted; at the end of the day we will
"empty" the daily shard into the other shards and start afresh the next
day.

The problem with this approach is that when an existing database record is
updated into the daily shard, the daily shard contains an updated document
that has a duplicate id with another shard. It is my understanding that in
the case of duplicate document ids returned from multiple shards, the
document returned first will be returned in the search results and the
other duplicate document ids will be discarded.

My question is: where can I customize the Solr code to specify that
documents from a particular shard should be given precedence in the search
results? Any pointers would be very much appreciated.
  

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
There seems to be some code out for Tika now (not packaged/announced yet,
but...). Could someone please take a look at it and see if that could fit
in? I am eagerly waiting for a reply back from tika-dev, but no luck yet.

http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/

I see that Eric's patch uses POI (for most of it)... so that's great! I
have seen too many duplicated efforts, even in Apache projects alone, and
this is one step closer to fixing that (other than Tika, which isn't
'complete' yet). Are there any plans for releasing this patch with the Solr
distribution? Or any instructions on using/installing the patch itself?

Thanks
Vish


On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> Christian,
>
> Eric Pugh created implemented this functionality for a project we were
> doing and has released to code on JIRA.  We have had very good results
> with it.  If I can be of any help using it beyond the Java code itself
> let me know.  The last revision I used with it was 552853, so if the
> build happens to fail you can roll back to that and it will work.
>
> https://issues.apache.org/jira/browse/SOLR-284
>
> - Pete
>
> On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> > Hi Solr Users,
> >
> > i have set up a Solr-Server with a custom Schema.
> > Now i have updated the index with some content form
> > xml-files.
> >
> > Now i try to update the contents of a folder.
> > The folder consits of various document-types
> > (pdf,doc,xls,...).
> >
> > Is there anywhere an howto how can i parse the
> > documents, make an xml of the paresed content
> > and post it to the solr server?
> >
> > Thanks in advance.
> >
> > Christian
> >
> >
>


Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
Pete,

Thanks for the great explanation.

Thinking it through my process, I am not sure how to use it:

I have a bunch of docs that pretty much contain a lot of meta-data, some
of which include full-text files (.pdf, .ppt, etc...). I use these docs
correctly to index/update into Solr. The next step now is to somehow index
the text from the full-text files. One way to think about it is, I could
have a placeholder field 'data' and keep it empty for the first pass, and
then run update/rich to index the actual full-text, but using the same
unique doc id. But this would actually overwrite the doc in the index, won't
it? And, there really isn't a 'merge' operation, right?

There might be a better way to use this full-text indexing option,
schema-wise, say:

- have a new option richData that will take in a source field name,
- validate its value (valid filename/file),
- recognize the file type,
- and put the 'data' into another field

What do you think?  I am not a true Java developer, so not sure if I could
do it myself, but only hope that someone else on the project could ;-)...

Rao
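Based on Pete's description in the quoted message below, a hypothetical
invocation of the SOLR-284 rich-document handler might look like the
following; every value is a placeholder, and since this is an external
patch the handler path and parameters may differ between revisions:

  curl "http://localhost:8983/solr/update/rich" \
    --data-urlencode "stream.file=/data/docs/report-103.pdf" \
    --data-urlencode "stream.type=pdf" \
    --data-urlencode "id=103" \
    --data-urlencode "stream.fieldname=data" \
    --data-urlencode "fieldnames=cat,name,desc" \
    --data-urlencode "cat=reports" \
    --data-urlencode "name=Quarterly report" \
    --data-urlencode "desc=Q2 financial summary"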

On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> Installing the patch requires downloading the latest solr via
> subversion and applying the patch to the source.  Eric has updated his
> patch with various revisions of subversion.  To make sure it will
> compile I suggest getting the revision he lists.
>
> As for using the features of this patch.  This is the url that would be
> called
>
>
> /solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description
>
> Breaking this down
>
> You have stream.file which will be the absolute path to the file you
> want to index.  You then have stream.type which specifies the type of
> file, which currently supports pdf, xls, doc, ppt.  The next field is
> the id, which is where you specify the unique value for the id in your
> schema.  Example is we had a document reference in a database, and
> that id was 103, so we would specify the value 103 to identify which
> document it was in the index.  Stream.fieldname is the name of the
> field in your index that will actually be storing the text from the
> document.  We had the field 'data' so it would be
> stream.fieldname=data in the url.
>
> The parameter fieldnames is any additional fields in your index that
> need to be filled.  We were passing a category, description for the
> document, a name, and the type.  So you just need to specify the names
> of the fields.  Solr will then look for corresponding parameters with
> those names, which you can see at the end of my URL.  The values
> passed for the additional parameters need to be sent url encoded.
>
> I'm not a Java programmer so if you have questions about the internals
> of the code, definitely direct those to Eric as I cannot help.  I have
> only implemented it in web applications.  If you have any other
> questions about the use of the patch I can answer those questions.
>
> Enjoy!
>
> - Pete
>
> On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> > There seems to be some code out for Tika now (not packaged/announced
> yet,
> > but...). Could someone please take a look at it and see if that could
> fit
> > in? I am eagerly waiting for a reply back from tika-dev, but no luck
> yet.
> >
> >
> http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/
> >
> > I see that Eric's patch uses POI (for most of it)...so that's great! I
> have
> > seen too many duplicated efforts, even in Apache projects alone, and
> this is
> > one step close to fixing it (other than Tika, which isnt' 'complete'
> yet).
> > Are there any plans on releasing this patch with Solr dist? Or, any
> > instructions on using/installing the patch itself?
> >
> > Thanks
> > Vish
> >
> >
> > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > Christian,
> > >
> > > Eric Pugh created implemented this functionality for a project we were
> > > doing and has released to code on JIRA.  We have had very good results
> > > with it.  If I can be of any help using it beyond the Java code itself
> > > let me know.  The last revision I used with it was 552853, so if the
> > > build happens to fail you can roll back to that and it will work.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-284
> > >
> > > - Pete
> > >
> > > On 8/21/07, Christian Klinger <[EMAIL PROTEC

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> I am a little confused how you have things setup, so these meta data
> files contain certain information and there may or may not be a pdf,
> xls, doc that it is associated with?


Yes, you have it right.

If that is the case, if it were me I would write something to parse
> the meta data files, and if there is a binary file associated with it
> submit it using the url I showed you.  If the meta data is just that
> and has no associated documents submit it in XML form.  The script
> shouldn't be  too complicated, but that would depend on the complexity
> of the meta data you are parsing.
>
> To give you an idea how I use it, we have hundreds of documents in
> PDF, DOC, XLS, HTML, TXT, CSV, and PPT formats.  When a document is to
> be indexed by solr we look at the extension, if it is a txt or
> html,htm we read the data in and submit it with the xml handler.  If
> the document is one of the binary formats we submit it with the url I
> showed you.  All information about these files is stored in a database
> and some of the 'documents' in the database are just links to external
> documents.  In that case we are only indexing a description, title,
> and category.
>
> You are correct, it would overwrite the data by doing an update unless
> you parsed the meta data, and if you are parsing the meta data you
> might as well just parse it from the start and index once.
>
> How are you handling these meta data files right now?  are they simply
> xml files like in the solr example where you are just running the bash
> script on or is something parsing the contents already?


Yes, I am running a similar bash script to index these meta-data xml docs.
The big downside of the URL way is that, for one thing, it has the
character limit (1024, is it?). So if I had a lot of meta-data, or even a
long description for a record, that might not work all that well. I am
guessing you haven't run into this issue yet, right?

- Pete


The proposed schema additions might not make sense for everyone, since the
actual requirements might be more complex than just that (i.e., say you
want to extract text, structure it in various elements, update your doc
xml, and then index). But it goes well with Solr's search-engine-in-a-box
perception, now with a full-text- prefix to it. Another way I can see it
happening is to extend the default handler and still take in an xml doc,
but look out for, say, a field name ''. From here on, within the handler,
you can validate the filename, handle it any way you want (create extra
elements, create '' for pdf files and '' for html files, etc..),
etc... This strips out having to deal with if/else scripting outside of
Solr.

Rao



On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> > Pete,
> >
> > Thanks for the great explanation.
> >
> > Thinking it through my process, I am not sure how to use it:
> >
> > I have a bunch of docs that pretty much contain a lot of meta-data, some
> > which include full-text files (.pdf, .ppt, etc...). I use these docs
> > correctly to index/update into Solr. The next step now is to somehow
> index
> > the text from the full-text files. One way to think about it is, I could
> > have a placeholder field 'data' and keep it empty for the first pass,
> and
> > then run update/rich to index the actual full-text, but using the same
> > unique doc id. But this would actually overwrite the doc in the index,
> won't
> > it? And, there really isn't a 'merge' operation, right?
> >
> > There might be a better way to use this full-text indexing option,
> > schema-wise, say:
> > 
> > - have a new option richData that will take in a source field name,
> > - validate it's value (valid filename/file),
> > - recognize the file type,
> > - and put the 'data' into another field
> >
> > What do you think?  I am not a true Java developer, so not sure if I
> could
> > do it myself, but only hope that someone else on the project could
> ;-)...
> >
> > Rao
> >
> > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > Installing the patch requires downloading the latest solr via
> > > subversion and applying the patch to the source.  Eric has updated his
> > > patch with various revisions of subversion.  To make sure it will
> > > compile I suggest getting the revision he lists.
> > >
> > > As for using the features of this patch.  This is the url that would
> be
> > > called
> > >
> > >
> > >
> /solr/update/rich?stream.file=filename&stream.typ

Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

2007-08-21 Thread Vish D.
On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
>
> On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> >
> > I am a little confused how you have things setup, so these meta data
> > files contain certain information and there may or may not be a pdf,
> > xls, doc that it is associated with?
>
>
> Yes, you have it right.
>
> If that is the case, if it were me I would write something to parse
> > the meta data files, and if there is a binary file associated with it
> > submit it using the url I showed you.  If the meta data is just that
> > and has no associated documents submit it in XML form.  The script
> > shouldn't be  too complicated, but that would depend on the complexity
> > of the meta data you are parsing.
> >
> > To give you an idea how I use it, we have hundreds of documents in
> > PDF, DOC, XLS, HTML, TXT, CSV, and PPT formats.  When a document is to
> > be indexed by solr we look at the extension, if it is a txt or
> > html,htm we read the data in and submit it with the xml handler.  If
> > the document is one of the binary formats we submit it with the url I
> > showed you.  All information about these files is stored in a database
> > and some of the 'documents' in the database are just links to external
> > documents.  In that case we are only indexing a description, title,
> > and category.
> >
> > You are correct, it would overwrite the data by doing an update unless
> > you parsed the meta data, and if you are parsing the meta data you
> > might as well just parse it from the start and index once.
> >
> > How are you handling these meta data files right now?  are they simply
> > xml files like in the solr example where you are just running the bash
> > script on or is something parsing the contents already?
>
>
> Yes, I am running a similar bash script to index these meta-data xml docs.
> The big downside in using the url way is that, for one thing, it has the
> characters-limit (1024, is it?). So, if I had a lot of meta-data, or even a
> long description for a record, that might not work all that well. I am
> guessing you haven't run into this issue yet, right?
>
> - Pete
>
>
> The proposed schema additions might not make sense for everyone, since the
> actual requirements might be more complex than just that (i.e., say you
> want to extract text, structure it in various elements, update your doc xml,
> and then index). But, it goes well with Solr's search-engine-in-a-box
> perception, but now with full-text- prefix to it. Another way I can see it
> happen, is to extend the default handler and still take in a xml doc, but
> look out for, say, a field name ''. From here on, within the handler,
> you can validate the filename, handle it anyways you want (create extra
> elements, create '' for pdf files and '' for html files, etc..),
> etc... This strips out having to deal with if/else scripting outside of
> Solr.
>

Of course, I meant to create a new handler and not just modify the standard
one (which is a big no-no, I know that!). Perhaps name it '/update/file'.

Rao
>
>
>
> On 8/21/07, Vish D. <[EMAIL PROTECTED] > wrote:
> > > Pete,
> > >
> > > Thanks for the great explanation.
> > >
> > > Thinking it through my process, I am not sure how to use it:
> > >
> > > I have a bunch of docs that pretty much contain a lot of meta-data,
> > some
> > > which include full-text files (.pdf, .ppt, etc...). I use these docs
> > > correctly to index/update into Solr. The next step now is to somehow
> > index
> > > the text from the full-text files. One way to think about it is, I
> > could
> > > have a placeholder field 'data' and keep it empty for the first pass,
> > and
> > > then run update/rich to index the actual full-text, but using the same
> > > unique doc id. But this would actually overwrite the doc in the index,
> > won't
> > > it? And, there really isn't a 'merge' operation, right?
> > >
> > > There might be a better way to use this full-text indexing option,
> > > schema-wise, say:
> > > 
> > > - have a new option richData that will take in a source field name,
> > > - validate it's value (valid filename/file),
> > > - recognize the file type,
> > > - and put the 'data' into another field
> > >
> > > What do you think?  I am not a true Java developer, so not sure if I
> > could
> > > do it myself, but only hope that someone else on the project c

display tokens

2007-12-07 Thread s d
How can I retrieve the "analyzed tokens" (e.g. the stemmed values) of a
specific field?


Is there a way to retrieve the "analyzed tokens" (e.g. the stemmed values) of a field from the SOLR index ?

2007-12-09 Thread s d
Is there a way to retrieve the "analyzed tokens" (e.g. the stemmed
values) of a field from the SOLR index ?
Almost like using SOLR as a utility for generating the tokens.
Thanks !


Lucene And SOLR

2007-12-19 Thread s d
Is there a way to import a Lucene index (as is) into SOLR? Basically, I'm
looking to enjoy the "web context" and caching provided by SOLR but keep the
index under my control in Lucene.


RAMDirectory

2007-12-27 Thread s d
Is there a way to use RAMDirectory with SOLR? If you can point me to
documentation that would be great.
Thanks,
S


Query Syntax (Standard handler) Question

2008-01-04 Thread s d
Is there a simpler way to write this query (I'm using the standard handler)
?
field1:t1 field1:t2 field1:"t1 t2" field2:t1 field2:t2 field2:"t1 t2"
Thanks,


Re: Query Syntax (Standard handler) Question

2008-01-04 Thread s d
but i want to sum the scores and not use max, can i still do it with the
DisMax? am i missing anything ?

On Jan 4, 2008 2:32 AM, Erik Hatcher <[EMAIL PROTECTED]> wrote:

>
> On Jan 4, 2008, at 4:40 AM, s d wrote:
> > Is there a simpler way to write this query (I'm using the standard
> > handler)
> > ?
> > field1:t1 field1:t2 field1:"t1 t2" field2:t1 field2:t2 field2:"t1 t2"
>
> Looks like you'd be better off using the DisMax handler for 
> (without the brackets).
>
>Erik
>
>


queryResultCache

2008-01-05 Thread s d
What is the best approach to tune queryResultCache? For example, the
default size is size="512", but since a document id is just an int (it is
an int, right?), i.e. 4 bytes, why not set size to 10,000,000 for example
(it's only ~38Mb)?
I sense there is something that I'm missing here :). Any help would be
appreciated.
Thanks,


Boosting a Field (Standard Handler)

2008-01-05 Thread s d
How do I boost a field (not a term) using the standard handler syntax? I
know I can do that with DisMax, but I'm trying to stay with the standard
handler. Can this be done?
Thanks,


Re: queryResultCache

2008-01-06 Thread s d
Thanks. A factor of 20 or even 30 over my numbers still gives a much larger
number than the default one, and I was wondering: is there any disadvantage
in having a big number/cache? BTW, where is the TTL controlled?

On Jan 6, 2008 7:23 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> On Jan 6, 2008 12:59 AM, s d <[EMAIL PROTECTED]> wrote:
> > What is the best approach to tune queryResultCache ?For example  the
> default
> > size is: size="512" but since a document id is just an int (it is an
> int,
> > right?) ,i.e 4 bytes why not set size to 10,000,000 for example (it's
> only
> > ~38Mb).
>
> This cache size refers to the number of id lists that are stored.
> One query + sort that yields the top 20 results == 1 entry in the cache.
>
> -Yonik
>


Re: queryResultCache

2008-01-06 Thread s d
Got it. Smart.
Thx

On 1/6/08, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : number than the default one and i was wondering is there any disadvantage
> in
> : having a big number/ cache?BTW, where is the TTL controlled ?
>
> no disadvantage as long as you've got the RAM ... NOTE: the magic "512"
> number you refered to isn't a "default" -- it's an "example" in the
> "example"
> solrconfig.xml
>
> There is no TTL for Solr caches, as noted in the wiki...
>
> http://wiki.apache.org/solr/SolrCaching
>
> Solr caches are associated with an Index Searcher -- a particular 'view'
> of the index that doesn't change. So as long as that Index Searcher is
> being used, any items in the cache will be valid and available for reuse.
> Caching in Solr is unlike ordinary caches in that Solr cached objects will
> not expire after a certain period of time; rather, cached objects will be
> valid as long as the Index Searcher is valid.
>
>
>
> -Hoss
>
>


How do i normalize diff information (different type of documents) in the index ?

2008-01-07 Thread s d
E.g., if the index has field1 and field2, documents of type (A) always have
information for field1 AND information for field2, while documents of type
(B) always have information for field1 but NEVER information for field2.
The problem is that the formula will sum field1 and field2, hence skewing
in favour of documents of type (A).
If I combine the 2 fields into 1 field (in an attempt to normalize) I will
obviously skew the statistics.
Please advise,
Thanks,


Re: How do i normalize diff information (different type of documents) in the index ?

2008-01-07 Thread s d
Isn't there a better way to take the information into account but still
normalize? Taking the score of only one of the fields doesn't sound like
the best thing to do (it's basically ignoring part of the information).

On Jan 7, 2008 9:20 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:

>
> On 7-Jan-08, at 9:02 PM, s d wrote:
>
> > e.g. if the index is field1 and field2 and documents of type (A)
> > always have
> > information for field1 AND information for field2 while document of
> > type (B)
> > always have information for field1 but NEVER information for field2.
> > The problem is that the formula will sum field1 and field2 hence
> > skewing in
> > favour of documents of type (A).
> > If i combine the 2 fields into 1 field (in an attempt to normalize)
> > i will
> > obviously skew the statistics.
>
> Try the dismax handler.  It's main goal is to query multiple fields
> while only counting the score of the highest-scoring one (mostly).
>
> -Mike
>


Performance - FunctionQuery

2008-01-08 Thread s d
Adding a FunctionQuery made the query response time slower by ~300ms, and
adding a 2nd FunctionQuery added another ~300ms, so overall I got over
0.5 sec for a response time (slow). Is this expected, or am I doing
something wrong?
Thx


Min-Score Filter

2008-01-08 Thread s d
Is there a way, or a point, in filtering out all results below a certain
score? E.g. exclude all results below score Y.
Thanks


Re: How do i normalize diff information (different type of documents) in the index ?

2008-01-08 Thread s d
Got it (
http://wiki.apache.org/solr/DisMaxRequestHandler#head-cfa8058622bce1baaf98607b197dc906a7f09590)
.
thx !

On Jan 8, 2008 12:11 AM, Chris Hostetter < [EMAIL PROTECTED]> wrote:

>
> : Isn't there a better way to take the information into account but still
> : normalize? taking the score of only one of the fields doesn't sound like
> the
> : best thing to do (it's basically ignoring part of the information).
>
> note the word "mostly" in Mike's response about dismax ... the "tie" param
>
> lets you decide how much the other fields influence the score.  Try it,
> it works really well ... trust me/us.
>
> For the record: i'm really not sure what your question is ... you say you
> want to normalize for the fact that some docs don't have a value in some
> fields, but you don't want to combine the fields because it will skew the
> statistics ... isn't that "skewing" exactly what you are trying to
> achieve?
>
> don't you need to introduce some "skew" in favor of hte docs that don't
> have a value for field2 to compensate forr the existing "counter skew"
> they already have?
>
>
>
> -Hoss
>
>


DisMax Syntax

2008-01-08 Thread s d
User Query: x1 x2
Desired query (Lucene): field:x1 x2 field:"x1 x2"~a^b

In the standard handler the only way i saw how to make this work was:
field:x1 field:x2 field:"x1 x2"!a^b

Now that I want to try DisMax, is there a way to implement this without
having duplicate fields? I.e., since the fields and the terms are separated
in DisMax, how do I achieve the same query?

Thanks


Re: DisMax Syntax

2008-01-08 Thread s d
I may be mistaken, but this is not equivalent to my query. In my query I
have matches for x1, matches for x2 without slop and/or boosting, and then
a match for "x1 x2" (exact match) with slop (~) a and boost (b), in order
to have results with an exact match score better.
The total score is the sum of all the above.
Your query seems different.

On Jan 8, 2008 11:56 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:

>
> : User Query: x1 x2
> : Desired query (Lucene): field:x1 x2 field:"x1 x2"~a^b
> :
> : In the standard handler the only way i saw how to make this work was:
> : field:x1 field:x2 field:"x1 x2"!a^b
> :
> : Now that i want to try the DisMax is there a way to implement this
> without
> : having duplicate fields? i.e. since the fields and the terms are
> separated
> : in the DisMax how do i achieve the same query ?
>
> i'm not sure what you mean by "without duplicate fields" but assuming i
> understand your goal, this seems trivial...
>
>q = x1 x2
>qf = field
>pf = field^b
>ps = a
>
>
> -Hoss
>
>


Inconsistent results

2008-02-01 Thread s d
Hi,
I use Solr with the standard handler, and when I send the exact same query
to Solr I get different results every time (i.e. I refresh the page with
the query and get different results).
Any ideas?
Thx,


Interleaved results form different sources

2008-04-14 Thread s d
We have an index of documents from different sources, and we want to make
sure the results we display are interleaved from the different sources and
not only ranked based on relevancy. Is there a way to do this?
Thanks,
S.


result limit / diversity with an OR query

2008-05-12 Thread s d
Hi,
I have a query similar to: x OR y OR z, and I want to know if there is a
way to make sure I get 1 result with x, 1 result with y, and one with z?
Alternatively, is it possible to achieve this through facets?
Thanks,
S.


Does SOLR support RAMDirectory ?

2008-06-01 Thread s d
Can I use RAMDirectory in SOLR?
Thanks,
S


Re: Interest in Extending SOLR

2006-04-13 Thread Vish D.
Mike,

I am currently evaluating different search engine technologies (esp., open
source ones), and this is very interesting to me, for the following reasons:

Our data is much like yours in that we have different types of data
(abstracts, fulltext, music, etc...), which eventually fall under different
"databases" in our subscription/offering model. So the ability to have
different indexes (on the database level, and on the type level) would be
the ideal solution. The only difference, comparing to your needs, is that
it would be a requirement to be able to search across different indexes
(searching between "databases"), but also to be able to search only within
types. That is, with your proposal, objectType could be "type" or
"database." The point here isn't that it would be nice to have a second
parameter, but that it would be a necessity to be able to search across
indexes.

I am truly interested in how this all works out, and hope to get myself
involved in Solr technology.





On 4/12/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
>
> Yonik -
>
> > So the number of filters is equal to the number of sites?
> > How many sites are there?
>
> Today: When new customers join, we generally don't do anything
> special. Currently we have roughly 400 customers, most of which have
> one site each. Note that a few customers have as many as 50 sites. In
> total, we probably filter data in 500 unique ways, before we actually
> search on the query string entered by the user. Of the 500 unique ways
> in which we filter data, there are approximately 50 for which we would
> prefer to use a unique index. I don't have 100% accurate numbers, but
> these should be in the ballpark.
>
> Future: We are planning to expand on the concepts we've developed to
> integrate Lucene and hopefully SOLR into other applications. One in
> particular:
>
>   * Provides a core data set of 100K records
>
>   * Allows each of 1,000 customers to create their own view of that
> data
>
>   * In theory, our overall dataset may contain up to 100K * 1,000
> records (100M), but we know that at any given time, only 100K
> records should be made available.
>
> We did rough tests and found that creating multiple indexes performed
> better at run time, especially as the logic to determine what results
> should be presented to which customer became more complex.
>
>
> > Support for indexing from CSV files as well as simple pulling from a
> > database is on our "todo" list: http://wiki.apache.org/solr/TaskList
>
> I had seen this on the TODO list. I'm offering to contribute this
> piece when we've got an idea of overall fit...
>
>
> > How would one identify what index (or SolrCore) an update is
> > targeted to?
>
> This is a good question. I think the query interface itself would have
> to be extended. That is, a new parameter would have to be introduced
> which identified the objectType you would like to search/update. If
> omitted,
> the default object type would be used. In our current system, we set
> the objectType to the name of the database table and thus can issue
> queries like:
>
>   search.jsp?tableName=users&queryString=email:michael.bryzek
>
>
> > What is the relationship between the multiple indicies... do queries
> > ever go across multiple indicies, or would there be an "objectType"
> > parameter passed in as part of the query?
>
> In our case, there is no relationship between the multiple indices,
> but I do see value here (more on this below). In our specific case, we
> have a one to one mapping between a database table and a Lucene index
> and have not needed to search across tables.
>
> I think the value of the objectType is this true independence. If you
> are indexing similar data, use a field on your data. If your data sets
> are truly different, use a different object type.
>
>
> > What is the purpose of multiple indicies... is it so search results
> > are always restricted to a single site, but it's not practical to
> > have that many Solr instances?  It looks like the indicies are
> > partitioned along the lines of object type, and not site-id though.
>
> Your questions and comments are good. Thinking about it has helped me
> to clarify what exactly we're trying to accomplish. I think it boils
> down to these goals:
>
>   a) Minimize the number of instances of SOLR. If I have 3 web
>  applications, each with 12 database tables to index, I don't want
>  to run 36 JVMs. I think introducing an objectType would address
>  this.
>
>   b) Optimize retrieval when I have some knowledge that I can use to
>  define partitions of data. This may actually be more appropriate
>  for Lucene itself, but I see SOLR pretty well positioned to
>  address. One approach is to introduce a "partitionField" that
>  SOLR would use to figure out if a new index is required. For each
>  unique value of the partitionField, we create a separate physical
>  index. If the query does NOT contain a term for the
>  partitionField, 

Re: Interest in Extending SOLR

2006-04-18 Thread Vish D.
Yonik/Chris,

Do we have an ETA on "Allow multiple independent Solr *webapps* in the same
app server"?

After reading up, silently, on the many emails on this topic, I agree with
you that it would be worthwhile to test out the current implementation and
see how it performs. But, it makes sense to run a comparison against the
multiple 'indexes' idea...and so, my question above.

Thanks!

On 4/15/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> On 4/15/06, Bryzek.Michael <[EMAIL PROTECTED]> wrote:
> >   * Translate a table design in the database to an XML document that
> > only defined the fields and their types for that table
>
> Just so anyone new to Lucene/Solr isn't mislead by this thread...
>
> Lucene (and hence Solr) documents don't need to be homogeneous within
> a single Lucene index.  Some documents can have one set of fields, and
> other documents in the same index can have a completely different set
> of fields.
>
> -Yonik
>


Faceted Browsing questions

2006-06-23 Thread Vish D.

Hi all,

I am trying to figure out how I can have some type of faceted browsing
working. I am also in need of a way to get a list of unique field values
within a query's results set (for filtering, etc...). When I say trying, I
mean having it up and running without much coding, b/c of time reasons. I
would most definitely be involved in some customizing just because of the
nature of the data I am working with. I have searched through the mailing
list and seen some posts mentioning BitSets, DocSets, etc., but wasn't clear
on whether those are already built into Solr's nightly builds (I don't see
any documentation either on the wiki or online). Can someone please steer me
in the right direction to get the above up and running in a short time?
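
(For anyone reading this later in the archive: this is what the built-in
facet parameters that arrived in later Solr releases cover. A sketch, with
"category" as a placeholder field name:

  select?q=ipod&rows=0&facet=true&facet.field=category&facet.mincount=1

This returns the distinct values of "category" within the result set for the
query, each with its count, which addresses both the faceted-browsing and the
unique-field-values parts of the question.)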

Thanks a lot!

Vish


Re: Faceted Browsing questions

2006-06-24 Thread Vish D.

Thank you Chris and Erik. That makes it a bit clearer, but I might need to
sit down and look at the code (nines + DisMax...) a bit closer to see how it
all works in Solr.

Erik, when do you plan on having your implementation refactored with "good"
use of code? Or, in general, when is Solr planning on having this feature
out (as I see it on the wiki for near term features)? It might be better for
me to wait and see how the group decides to implement it, rather than having
something done myself and have to drop it in the end. Plus, you guys
probably have the upper hand when it comes to knowing the details of
Solr/Lucene and its reusable features.

Thanks all, and just wanted to say -- I am quite impressed by how Solr is
being taken on by the community. It's a solid search api, if it fits your
needs.

On 6/23/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: nature of the data I am working with. I have searched through the
mailing
: list and seen some posts mentioning BitSets DocSets, etc.., but wasn't
clear
: on if those are already built into the solr's nightly builds (I don't
see
: any documentation either on the wiki, or online). Can some please steer
me
: towards the right direction to have the above up in the short time?

You'll want to start with the Solr javadocs, which are linked to from the
left nav of every page on the Solr website ("Documentation > API Docs")...

http://incubator.apache.org/solr/docs/api/

The DocSet classes are in fact a core part of Solr.

There are some examples in email threads where Erik sent out some code
demonstrating how he was doing faceting using BitSets, and I suggested
ways he could do things using DocSets ... another good example you can
look at is the code for the DisMaxRequestHandler.  It doesn't do faceting,
but it does use DocSets when dealing with the "fq" (filter query) param.

That should be a good place to start.


-Hoss




Re: Faceted Browsing questions

2006-06-24 Thread Vish D.

Erik,

Oh good! Keep me (us) updated!!

As for committing some code into Solr, and the real world uses, I am sure we
can find some generic/abstract rules for faceted browsing -- simplest being,
a set of fields/categories defined in schema.xml, which could be used for an
optional extented query response, or a custom/new response by itself.

I am also sure that we have at least a couple other implementation of this
feature, which might bring in some good insights in "better" use of code. In
any case, I am eager to see this feature "ironed" out on the community
level.

Thanks!


On 6/24/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:



On Jun 24, 2006, at 12:38 PM, Vish D. wrote:
> Erik, when do you plan on having your implementation refactored
> with "good"
> use of code?

This weekend :)   I have imported more data than my hacked
implementation can handle without bumping up Jetty's JVM heap size,
so I'm now at the point where it is necessary for me to start using
the LRUCache.  Though I have already refactored to use OpenBitSet
instead of BitSet.

> Or, in general, when is Solr planning on having this feature
> out (as I see it on the wiki for near term features)? It might be
> better for
> me to wait and see how the group decides to implement it, rather
> than having
> something done myself and have to drop it at the end. Plus, you guys
> probably have the higher hand when it comes to knowing the details of
> Solr/Lucene, and its re-useable features.

The best way for Solr to get this functionality is for those that
have implemented it in a custom fashion to get together and
generalize it, so that we have a proven architecture that is
configurable enough to handle real world situations.  My
implementation is still being ironed out.  And it does rely on custom
request handlers to utilize the facets and return back the counts per
facet.

Erik





Re: Faceted Browsing questions

2006-06-28 Thread Vish D.

Erik,

Any update on your progress? Eager to get my hands on on your latest code...
:=)

Thanks!


On 6/28/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:


: > well, the most obvious solution i can think of would be a patch adding
an
: > invert() method to DocSet, HashDocSet and BitDocSet.   :)
:
: Yes, we could add a flip(startIndex, endIndex) or flip(maxDoc)

Yeah ... like i suggested before just make maxDoc an intrinsic property
that DocSets know when they are created.

Another issue though is one of performance ... inverting a HashDocSet with
only a few docs should probably produce a BitDocSet -- ideally using the
same configured maxSize and loadFactor information that the
SolrIndexSearcher uses ... perhaps the method for inverting/flipping a
DocSet should live in in SolrIndexSearcher?


-Hoss




Re: Is solr scalable with respect to number of documents?

2006-09-27 Thread Vish D.

Are there any plans to implement a MultiSearcher in Solr?

I have been following the list for a while, and have read quite a few topics
on multiple instances of Solr, both to accommodate multiple schemas and to
break down index sizes for performance reasons. I have a use case that
sits right in the middle of these. I need to be able to break the data down
into collections, which might be defined by business factors, but also be
able to search across these collections. When I say search across, I would
expect to see a result list ordered by relevance, just as you would expect
when searching a single collection.

Does solr support such a use case? Does anyone have a working implementation
of it that you would like to share? If there aren't any plans on
implementing the above, any suggestions on tackling the above requirement?



On 7/11/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 7/11/06, Wang, Ningjun (LNG-NPV) <[EMAIL PROTECTED]> wrote:
> Can I solve this problem by using lucene directly without solr? In
> another word, does lucene offer and API to merge search results from
> different machine?

A MultiSearcher over multiple RemoteSearchers can work in Lucene.

-Yonik



Re: Is solr scalable with respect to number of documents?

2006-09-27 Thread Vish D.

I just noticed that link on the first reply from Yonik about
FederatedSearch. I see that a lot of thought went into it. I guess the
question to ask would be: any progress on it, Yonik? :)

On 9/27/06, Vish D. <[EMAIL PROTECTED]> wrote:


Are there any plans to implement a MultiSearcher in Solr?

I have been following the list for a while, and have read quite a few topics
on multiple instances of Solr, both to accommodate multiple schemas and to
break down index sizes for performance reasons. I have a use case that
sits right in the middle of these. I need to be able to break the data down
into collections, which might be defined by business factors, but also be
able to search across these collections. When I say search across, I would
expect to see a result list ordered by relevance, just as you would expect
when searching a single collection.

Does solr support such a use case? Does anyone have a working
implementation of it that you would like to share? If there aren't any plans
on implementing the above, any suggestions on tackling the above
requirement?



On 7/11/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> On 7/11/06, Wang, Ningjun (LNG-NPV) <[EMAIL PROTECTED]> wrote:
> > Can I solve this problem by using lucene directly without solr? In
> > another word, does lucene offer and API to merge search results from
> > different machine?
>
> A MultiSearcher over multiple RemoteSearchers can work in Lucene.
>
> -Yonik
>




LIUS/Fulltext indexing

2007-06-11 Thread Vish D.

Anyone have experience working with LIUS (
http://sourceforge.net/projects/lius/)? I can't seem to find any real
documentation on it, even though it seems 'active' @ sourceforge. I need a
way to index various types of fulltext, and LIUS seems very promising at
first glance. What do you guys think? Is there a similar implementation you
recommend, even something that might provide the simple text extraction
functionality for the various types? I figure, I would need to do that
anyways, and massage the text into Solr-type docs.

Vish


Re: LIUS/Fulltext indexing

2007-06-12 Thread Vish D.

Sounds interesting. I can't seem to find any clear dates on the project
website. Do you know? ...V1 shipping date?

Thanks!
On 6/12/07, Bertrand Delacretaz <[EMAIL PROTECTED]> wrote:


On 6/12/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:

>... I think Tika will be the way forward (some of the code for Tika is
> coming from LIUS)...

Work has indeed started to incoroporate the Lius code into Tika, see
https://issues.apache.org/jira/browse/TIKA-7 and
http://incubator.apache.org/projects/tika.html

-Bertrand



Re: LIUS/Fulltext indexing

2007-06-12 Thread Vish D.

Wonder if TOM could be useful to integrate?
http://tom.library.upenn.edu/convert/sofar.html

On 6/12/07, Bertrand Delacretaz <[EMAIL PROTECTED]> wrote:


On 6/12/07, Vish D. <[EMAIL PROTECTED]> wrote:
> ...Sounds interesting. I can't seem to find any clear dates on the
project
> website. Do you know? ...V1 shipping date?...

Not at the moment, Tika just entered incubation and it's impossible to
predict what will happen.

But help is welcome, of course ;-)

-Bertrand



Min-should-match and Mutli-word synonyms unexpected result

2018-02-05 Thread Nick D
I have run into an issue with multi-word synonyms and a min-should-match
(MM) of anything other than `0`, *Solr version 6.6.0*.

Here is my example query, first with mm set to zero and the second with a
non-zero value:

With MM set to 0
select?fl=*&indent=on&wt=json&debug=ALL&q=EIB&qf=ngs_title%20ngs_field_description&sow=false&mm=0

which parses to:

parsedquery_toString":"+(((+ngs_field_description:enterprise
+ngs_field_description:interface +ngs_field_description:builder)
ngs_field_description:eib) | ((+ngs_title:enterprise
+ngs_title:interface +ngs_title:builder) ngs_title:eib))~0.01"

and using my default MM (2<-35%)
select?fl=*&indent=on&wt=json&debug=ALL&q=EIB&qf=ngs_title%20ngs_field_description&sow=false

which parses to

+ngs_field_description:enterprise +ngs_field_description:interface
+ngs_field_description:builder) ngs_field_description:eib)~2) |
(((+ngs_title:enterprise +ngs_title:interface +ngs_title:builder)
ngs_title:eib)~2))

My synonym here is:
EIB, Enterprise Interface Builder

For my two documents I have the field ngs_title with values "EIB" (Doc 1)
and "enterprise interface builder" (Doc 2)

For both queries doc 1 is always returned, as EIB is matched. But for
doc 2, although I have EIB and Enterprise Interface Builder defined as
equivalent synonyms, that document is not returned when MM is not set to
zero. From the parsed query I see the ~2 being applied for the MM, but my
expectation was that it had been satisfied via the synonyms, given that I
am not actually searching a phrase.

I couldn't find much on the relationship between the two, outside of some
of the things Doug Turnbull had linked in another solr-user question and
this blog post that mentions weirdness around MM and multi-word synonyms:

https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/

http://opensourceconnections.com/blog/2013/10/27/why-is-multi-term-synonyms-so-hard-in-solr/

Also looked through the comments here,
https://issues.apache.org/jira/browse/SOLR-9185, but at first glance didn't
see anything that jumped out at me.

Here is the field definition for the ngs_* fields:

(The fieldType XML was stripped by the list archive. The quoted copy in the
reply below preserves some of the attributes: a mapping-ISOLatin1Accent.txt
char filter, two pattern-replace steps, a stopwords.txt stop filter, an edge
n-gram filter with maxGramSize=50 on the index side, and a synonyms.txt
filter with ignoreCase=true and expand=true on the query side.)

I am not sure if we can no longer use MM for these types of queries, or if
there is something I set up incorrectly; any help would be greatly
appreciated.

Nick


Re: Min-should-match and Mutli-word synonyms unexpected result

2018-02-06 Thread Nick D
Thanks Steve,

I'll test out that version.

Nick

On Feb 6, 2018 6:23 AM, "Steve Rowe"  wrote:

> Hi Nick,
>
> I think this was fixed by https://issues.apache.org/
> jira/browse/LUCENE-7878 in Solr 6.6.1.
>
> --
> Steve
> www.lucidworks.com
>
> > On Feb 5, 2018, at 3:58 PM, Nick D  wrote:
> >
> > I have run into an issue with multi-word synonyms and a min-should-match
> > (MM) of anything other than `0`, *Solr version 6.6.0*.
> >
> > Here is my example query, first with mm set to zero and the second with a
> > non-zero value:
> >
> > With MM set to 0
> > select?fl=*&indent=on&wt=json&debug=ALL&q=EIB&qf=ngs_title%
> 20ngs_field_description&sow=false&mm=0
> >
> > which parse to:
> >
> > parsedquery_toString":"+(((+ngs_field_description:enterprise
> > +ngs_field_description:interface +ngs_field_description:builder)
> > ngs_field_description:eib) | ((+ngs_title:enterprise
> > +ngs_title:interface +ngs_title:builder) ngs_title:eib))~0.01"
> >
> > and using my default MM (2<-35%)
> > select?fl=*&indent=on&wt=json&debug=ALL&q=EIB&qf=ngs_title%
> 20ngs_field_description&sow=false
> >
> > which parse to
> >
> > +ngs_field_description:enterprise +ngs_field_description:interface
> > +ngs_field_description:builder) ngs_field_description:eib)~2) |
> > (((+ngs_title:enterprise +ngs_title:interface +ngs_title:builder)
> > ngs_title:eib)~2))
> >
> > My synonym here is:
> > EIB, Enterprise Interface Builder
> >
> > For my two documents I have the field ngs_title with values "EIB" (Doc 1)
> > and "enterprise interface builder" (Doc 2)
> >
> > For both queries the doc 1 is always returned as EIB is matched, but for
> > doc 2 although I have EIB and Enterprise interface builder defined as
> > equivalent synonyms when the MM is not set to zero that document is not
> > returned. From the parsestring I see the ~2 being applied for the MM but
> my
> > expectation was that it has been met via the synonyms and the fact that I
> > am not actaully searching a phrase.
> >
> > I couldn't find much on the relationship between the two outside of a
> some
> > of the things Doug Turnbull had linked to another solr-user question and
> > this blog post that mentions weirdness around MM and multi-word:
> >
> > https://lucidworks.com/2017/04/18/multi-word-synonyms-
> solr-adds-query-time-support/
> >
> > http://opensourceconnections.com/blog/2013/10/27/why-is-
> multi-term-synonyms-so-hard-in-solr/
> >
> > Also looked through the comments here,
> > https://issues.apache.org/jira/browse/SOLR-9185, but at first glance
> didn't
> > see anything that jumped out at me.
> >
> > Here is the field definition for the ngs_* fields:
> >
> >  positionIncrementGap="100">
> >  
> > > mapping="mapping-ISOLatin1Accent.txt"/>
> > > pattern="([()])" replacement=""/>
> >
> > > pattern="(^[^0-9A-Za-z_]+)|([^0-9A-Za-z_]+$)" replacement=""/>
> > > words="stopwords.txt"/>
> >
> >
> > > maxGramSize="50"/>
> >  
> >  
> >
> > > synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >
> >  
> >
> >
> > I am not sure if we cannot use MM anymore for these type of queries or if
> > there is something I setup incorrectly, any help would be greatly
> > appreciated.
> >
> > Nick
>
>


Does anyone want to put in a few hours at their rate?

2019-06-18 Thread Val D
To whom it may concern:

I have a Windows based system running Java 8.  I have installed SOLR 7.7.2  ( I 
also tried this with version 8.1.1 as well with the same results ). I have SQL 
Server 2018 with 1 table that contains 22+ columns and a few thousand rows.  I 
am attempting to index the SQL Server table using SOLR as the indexing 
mechanism.  Once indexed I need to be able to search the table using the SOLR 
ADMIN module.  That is it.  I have tried every on-line example, sample or 
explanation and none of them have helped.  I can only assume that I am doing 
something wrong.  I do not have the required expertise to determine where I am 
going wrong.  I would like to know if you can help, what it would cost to have 
someone log into my system to get it working and how long you think it might 
take.  I will be working very late today.
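
(For context on what is being attempted: the usual route for this in Solr
7.x/8.x is the DataImportHandler. A sketch of the wiring, with placeholder
database, table, and column names:

db-data-config.xml:

  <dataConfig>
    <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost;databaseName=MyDb"
                user="solr_reader" password="..."/>
    <document>
      <entity name="mytable" query="SELECT id, col1, col2 FROM dbo.MyTable">
        <field column="id" name="id"/>
        <field column="col1" name="col1"/>
        <field column="col2" name="col2"/>
      </entity>
    </document>
  </dataConfig>

solrconfig.xml:

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">db-data-config.xml</str>
    </lst>
  </requestHandler>

The SQL Server JDBC driver jar also has to be on the core's lib path, and the
import is kicked off with /dataimport?command=full-import.)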

Thank you in advance,
Vincent

(650) 334-2925
US UTC-5   Feel free to call at any hour.


Synonym expansions w/ phrase slop exhausting memory after upgrading to SOLR 7

2019-12-17 Thread Nick D
Hello All,

We recently upgraded from Solr 6.6 to Solr 7.7.2 and have since had spikes in
memory that eventually caused either an OOM or almost 100% utilization of
the available memory. After trying a few things (increasing the JVM heap,
making sure docValues were set for all sort and facet fields, in case the
fieldCache was blowing up), I was able to isolate a single query that
would cause the available memory to become fully exhausted and effectively
render the instance dead. After applying a timeAllowed value to the query
and reducing the query phrase (the system would crash without ever logging
the warning on longer queries containing synonyms), I was able to identify
the following warning in the logs:

o.a.s.s.SolrIndexSearcher Query: <very long synonym expansion>

the request took too long to iterate over terms. Timeout: timeoutAt:
812182664173653 (System.nanoTime(): 812182715745553),
TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@7a0db441

I have narrowed the problem down to the following:
the way synonyms are being expanded along with phrase slop.

With ps=5 I get 4096 possible permutations of the phrase being searched,
because of synonyms, looking similar to:
ngs_title:"bereavement leave type build bereavement leave type data p"~5
 ngs_title:"bereavement leave type build bereavement bereavement type data
p"~5
 ngs_title:"bereavement leave type build bereavement jury duty type data
p"~5
 ngs_title:"bereavement leave type build bereavement maternity leave type
data p"~5
 ngs_title:"bereavement leave type build bereavement paternity type data
p"~5
 ngs_title:"bereavement leave type build bereavement paternity leave type
data p"~5
 ngs_title:"bereavement leave type build bereavement adoption leave type
data p"~5
 ngs_title:"bereavement leave type build jury duty maternity leave type
data p"~5
 ngs_title:"bereavement leave type build jury duty paternity type data p"~5
 ngs_title:"bereavement leave type build jury duty paternity leave type
data p"~5
 ngs_title:"bereavement leave type build jury duty adoption leave type data
p"~5
 ngs_title:"bereavement leave type build jury duty absence type data p"~5
 ngs_title:"bereavement leave type build maternity leave leave type data
p"~5
 ngs_title:"bereavement leave type build maternity leave bereavement type
data p"~5
 ngs_title:"bereavement leave type build maternity leave jury duty type
data p"~5



Previously in Solr 6 that same query, with the same synonyms (and query
analysis chain), would produce a parsedQuery like the following when using &ps=5:
DisjunctionMaxQuery(((ngs_field_description:\"leave leave type build leave
leave type data ? p leave leave type type.enabled\"~5)^3.0 |
(ngs_title:\"leave leave type build leave leave type data ? p leave leave
type type.enabled\"~5)^10.0)

The expansion wasn't being applied to the added DisjunctionMaxQuery when
adjusting rankings with phrase slop.

In general the parsed queries between 6 and 7 are different, with some new
`spanNears` showing, but they don't create the memory consumption issues
that I have seen when a large synonym expansion happens along with a ps
parameter.

I didn't see much in terms of release note changes for synonyms
(outside of sow=false becoming the default as of version 7).

The field being operated on has the following query analysis chain (the
XML was stripped by the list archive, so only this placeholder remains):

Not sure if there is a change in phrase slop that now takes synonyms into
account, and whether there is a way to disable that kind of expansion. I am
not sure if it is related to SOLR-10980 or not; it does seem related, but
that issue referenced Solr 6, which does not do the expansion.

Any help would be greatly appreciated.

Nick


Re: Synonym expansions w/ phrase slop exhausting memory after upgrading to SOLR 7

2019-12-18 Thread Nick D
Michael,

Thank you so much, that was extremely helpful. My googlefu wasn't good
enough I guess.

1. That was my initial fix, just to stop it from exploding.

2. That will be the permanent solution for now, until we can get some things
squared away for 8.0.

Sounds like even in 8 there is still the potential for any graph query
expansion to grow rather large, but it just won't consume all available
memory; is that correct?

One final question: why didn't the maxBooleanClauses value in the solrconfig
still apply? Reading through all the JIRAs I thought that was supposed to
still be a failsafe; did I miss something?

Thanks again for your help,

Nick
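
(For the record, option 1 in Michael's mail quoted below amounts to not
requesting any phrase slop, either per request or in the handler defaults.
A sketch using the same edismax parameters as in the original report, with
the query text elided:

  select?defType=edismax&q=...&qf=ngs_title ngs_field_description&sow=false&ps=0

With slop 0 the graph phrase queries fall back to the span-based
implementation (LUCENE-8531) instead of the combinatorial expansion
described in the original post.)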

On Wed, Dec 18, 2019, 8:10 AM Michael Gibney 
wrote:

> This is related to this issue:
> https://issues.apache.org/jira/browse/SOLR-13336
>
> Also tangentially relevant:
> https://issues.apache.org/jira/browse/LUCENE-8531
> https://issues.apache.org/jira/browse/SOLR-12243
>
> I think your options include:
> 1. setting slop=0, which restores SpanNearQuery as the graph phrase
> query implementation (see LUCENE-8531)
> 2. downgrading to 7.5 would avoid the OOM, but would cause graph
> phrase queries to be effectively ignored (see SOLR-12243)
> 3. upgrade to 8.0, which will restore the failsafe maxBooleanClauses,
> avoiding OOM but returning an error code for affected queries (which
> in your case sounds like most queries?) (see SOLR-13336)
>
> Michael
>
> On Tue, Dec 17, 2019 at 4:16 PM Nick D  wrote:
> >
> > Hello All,
> >
> > We recently upgraded from Solr 6.6 to Solr 7.7.2 and recently had spikes
> in
> > memory that eventually caused either an OOM or almost 100% utilization of
> > the available memory. After trying a few things, increasing the JVM heap,
> > making sure docValues were set for all Sort, facet fields (thought maybe
> > the fieldCache was blowing up), I was able to isolate a single query that
> > would cause the used memory to become fully exhausted and effectively
> > render the instance dead. After applying a timeAllowed  value to the
> query
> > and reducing the query phrase (system would crash on without throwing the
> > warning on longer queries containing synonyms). I was able to idenitify
> the
> > following warning in the logs:
> >
> > o.a.s.s.SolrIndexSearcher Query: <very long synonym expansion>
> >
> > the request took too long to iterate over terms. Timeout: timeoutAt:
> > 812182664173653 (System.nanoTime(): 812182715745553),
> > TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@7a0db441
> >
> > I have narrowed the problem down to the following:
> > the way synonyms are being expaneded along with phrase slop.
> >
> > With a ps=5 I get 4096 possible permutations of the phrase being searched
> > with because of synonyms, looking similar to:
> > ngs_title:"bereavement leave type build bereavement leave type data p"~5
> >  ngs_title:"bereavement leave type build bereavement bereavement type
> data
> > p"~5
> >  ngs_title:"bereavement leave type build bereavement jury duty type data
> > p"~5
> >  ngs_title:"bereavement leave type build bereavement maternity leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build bereavement paternity type data
> > p"~5
> >  ngs_title:"bereavement leave type build bereavement paternity leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build bereavement adoption leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build jury duty maternity leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build jury duty paternity type data
> p"~5
> >  ngs_title:"bereavement leave type build jury duty paternity leave type
> > data p"~5
> >  ngs_title:"bereavement leave type build jury duty adoption leave type
> data
> > p"~5
> >  ngs_title:"bereavement leave type build jury duty absence type data p"~5
> >  ngs_title:"bereavement leave type build maternity leave leave type data
> > p"~5
> >  ngs_title:"bereavement leave type build maternity leave bereavement type
> > data p"~5
> >  ngs_title:"bereavement leave type build maternity leave jury duty type
> > data p"~5
> >
> > 
> >
> > Previously in Solr 6 that same query, with the same synonyms (and query
> > analysis chain) would produce a parsedQuery like when using a &ps=5:
> > DisjunctionMaxQuery(((ngs_field_description:\"leave leave type build
> leave
> > leave type data ? p leave leave type type.enabled\"~5

Rolling Up Group Queries

2014-08-03 Thread Jonathan D Kaufman
Hi all,

I have a question regarding field collapsing.

Has anyone successfully used SOLR to roll up more information than just 
"numFound" in the response to a group query? 

For some background, I'm trying to see if any artifact in the group has been 
marked with a certain priority. Ideally I can do this by having group.limit set 
to 1. The ultimate goal is to get the first artifact in a group as well as some 
metadata about the entire group.
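
(For context, the request being described is roughly the following, with
placeholder field and value names:

  select?q=priority:urgent&group=true&group.field=artifact_group&group.limit=1&group.ngroups=true

Each group in the response carries its own numFound for the matching docs in
that group, plus the single top document allowed by group.limit=1; the
question below is about attaching more roll-up information than that.)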

Where in the implementation of group queries is "numFound" calculated? Is it 
possible to, at this time, add in code to support additional roll up 
information?


Thanks,

Jon


Simplification of boolean query failed

2011-12-02 Thread Mark D Sievers
I've put the question nicely formatted on StackOverflow here
http://stackoverflow.com/questions/8360257/solr-lucene-why-is-this-or-query-failing-when-the-two-individual-queries-suc

Here is that question Verbatim:

I have a Solr document schema with a solr.TrieDateField and noticed
this boolean query (not authored by me) which I thought could benefit from
some simplification:

q=-(-event_date:[2011-12-02T00:00:00.000Z TO NOW/DAY+90DAYS] OR
(event_date:[* TO *]))

which means *events within the next 90 days or non-events* (see "Pure
Negative" queries for Solr boolean NOT notation). My simplification looked like

q=event_date:[2011-12-02T00:00:00.000Z TO NOW/DAY+90DAYS] OR
-event_date:[* TO *]

As stated, this didn't work (0 results). So as a test I ran the two sides
of the OR query individually, and the sum of the two result counts (both
non-zero) equaled the count from the original query; I can't come up with a
good explanation why. Running with debugQuery=true didn't present anything
helpful.
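
(One thing worth checking with that simplification: the default query parser
treats a negated clause at the top level as a prohibition on the whole query
rather than as one side of the OR, so the negated half usually needs an
explicit match-all wrapped around it. A sketch of the usual rewrite, not
tested against this schema:

  q=event_date:[2011-12-02T00:00:00.000Z TO NOW/DAY+90DAYS] OR (*:* -event_date:[* TO *])
)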
Thanks,

Mark


RE: Solr - Load Increasing.

2009-11-16 Thread Sudarsan, Sithu D.
 
Hi,

Lakh or Lac - 100,000
Crore   - 100,00,000 (ten million)

Commonly used in India

Sincerely,
Sithu D Sudarsan

-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Monday, November 16, 2009 5:22 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr - Load Increasing.

Probably "lakh": 100,000.

So, 900k qpd and 3M docs.

http://en.wikipedia.org/wiki/Lakh

wunder

On Nov 16, 2009, at 2:17 PM, Otis Gospodnetic wrote:

> Hi,
> 
> Your autoCommit settings are very aggressive.  I'm guessing that's
what's causing the CPU load.
> 
> btw. what is "laks"?
> 
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> - Original Message 
>> From: kalidoss 
>> To: solr-user@lucene.apache.org
>> Sent: Mon, November 16, 2009 9:11:21 AM
>> Subject: Solr - Load Increasing. 
>> 
>> Hi All.
>> 
>>   My server solr box cpu utilization  increasing b/w 60 to 90% and
some time 
>> solr is getting down and we are restarting it manually.
>> 
>>   No of documents in solr 30 laks.
>>   No of add/update requrest solr 30 thousand / day. Avg of every 30
minutes 
>> around 500 writes.
>>   No of search request 9laks / day.
>>   Size of the data directory: 4gb.
>> 
>> 
>>   My system ram is 8gb.
>>   System available space 12gb.
>>   processor Family: Pentium Pro
>> 
>>   Our solr data size can be increase in number like 90 laks. and
writes per day 
>> will be around 1laks.   - Hope its possible by solr.
>> 
>> For write commit i have configured like (the autoCommit XML tags were
>> stripped by the list archive; only the two values, 1 and 10, survive):
>>  1
>>  10
>> 
>> 
>>   Is all above can be possible? 90laks datas and 1laks per day writes
and 
>> 30laks per day read??  - if yes what type of system configuration
would require.
>> 
>>   Please suggest us.
>> 
>> thanks,
>> Kalidoss.m,
>> 
>> 
> 
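
(For reference, a less aggressive autoCommit block than the very low values
quoted above, 1 and 10, might look like this in solrconfig.xml; the numbers
here are only illustrative and should be tuned to the actual write rate:

  <autoCommit>
    <maxDocs>10000</maxDocs>
    <maxTime>60000</maxTime>
  </autoCommit>
)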



RE: java.lang.OutOfMemoryError while autowarming

2009-12-22 Thread Sudarsan, Sithu D.
 
I'd recommend setting -Xms and -Xmx to the same value.
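
For the Tomcat instance described below that might look like the following,
e.g. in Tomcat's bin/setenv.sh (1g is only a starting point, in line with
Jason's suggestion; size it to the actual working set):

  export CATALINA_OPTS="-Xms1g -Xmx1g"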


Sincerely,
Sithu D Sudarsan

-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] 
Sent: Tuesday, December 22, 2009 2:31 PM
To: solr-user@lucene.apache.org
Subject: Re: java.lang.OutOfMemoryError while autowarming

> -Xmx64m JVM parameter.

Is going to be far too low.  For example I always start at 1 GB and
move up from there.

On Tue, Dec 22, 2009 at 4:35 AM, c82famo
 wrote:
>
> Hi,
>
> I'm facing some OutOfMemory issues with SOLR.
> Tomcat is started with a -Xmx64m JVM parameter.
> Tomcat manages 14 SOLR applications (not using multi-core : we're using SOLR
> 1.2).
>
> The SOLR application impacted manages an 500Mo index (ie 250 lucene
> documents).
>
> Here is catalina.out log :
> INFO: autowarming searc...@23d823d8 main from searc...@1ef21ef2 main
>
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=1,evictions=0,size=1,cumulative_lookups=1,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=1,cumulative_evictions=0}
> Dec 19, 2009 4:57:44 AM org.apache.solr.core.SolrException log
> SEVERE: Error during auto-warming of
> key:org.apache.solr.search.queryresult...@e2819933:java.lang.OutOfMemoryError
>        at
> org.apache.lucene.search.FieldCacheImpl$3.createValue(FieldCacheImpl.java:164)
>        at
> org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
>        at 
> org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:154)
>        at 
> org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:148)
>        at
> org.apache.lucene.search.FieldSortedHitQueue.comparatorInt(FieldSortedHitQueue.java:204)
>        at
> org.apache.lucene.search.FieldSortedHitQueue$1.createValue(FieldSortedHitQueue.java:175)
>        at
> org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
>        at
> org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:155)
>        at
> org.apache.lucene.search.FieldSortedHitQueue.(FieldSortedHitQueue.java:56)
>        at
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:856)
>        at
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:805)
>        at
> org.apache.solr.search.SolrIndexSearcher.access$100(SolrIndexSearcher.java:60)
>        at
> org.apache.solr.search.SolrIndexSearcher$2.regenerateItem(SolrIndexSearcher.java:251)
>        at org.apache.solr.search.LRUCache.warm(LRUCache.java:193)
>        at
> org.apache.solr.search.SolrIndexSearcher.warm(SolrIndexSearcher.java:1385)
>        at org.apache.solr.core.SolrCore$1.call(SolrCore.java:488)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:284)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:665)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:690)
>        at java.lang.Thread.run(Thread.java:810)
>
> I have attached the solrconfig.xml file to this post.
>
> Thank you for your help.
>
> Frédéric http://old.nabble.com/file/p26887639/solrconfig.xml solrconfig.xml
> --
> View this message in context: 
> http://old.nabble.com/java.lang.OutOfMemoryError-while-autowarming-tp26887639p26887639.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Newbie question - using existing Lucene Index

2008-12-03 Thread Sudarsan, Sithu D.
Hi All,

Using Lucene, an index has been created. It has five different fields.

How can I just use that index from Solr for searching? I tried changing
the schema as in the tutorial, and copied the index to the data directory,
but all searches return empty and there is no error message!
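
(If it helps to compare notes: the field names declared in schema.xml have to
match the field names the Lucene index was built with, and the existing
segment files have to sit under the core's data/index directory. A sketch
with placeholder names for the fields:

  <field name="title"  type="text"   indexed="true" stored="true"/>
  <field name="author" type="string" indexed="true" stored="true"/>
  ...

  solr/data/index/segments_*, *.cfs, ...

A mismatch in field names typically gives exactly this symptom: empty results
with no error.)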

Is there a sample project available which shows using Tomcat as the servlet
container rather than Jetty?

Your help is appreciated,
Sincerely,
Sithu D Sudarsan

ORISE Fellow, DESE/OSEL/CDRH
WO62 - 3209
&
GRA, UALR

[EMAIL PROTECTED]
[EMAIL PROTECTED]



Use of scanned documents for text extraction and indexing

2009-02-26 Thread Sudarsan, Sithu D.

Hi All:

Is there any study/research on using scanned paper documents as
images (maybe PDF), then using OCR or another technique to extract text,
and on the resulting index quality?


Thanks in advance,
Sithu D Sudarsan

sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu




RE: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Sudarsan, Sithu D.
Thanks Hannes,

The tool looks good. 

Sincerely,
Sithu D Sudarsan

sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu

-Original Message-
From: hannesc...@googlemail.com [mailto:hannesc...@googlemail.com] On
Behalf Of Hannes Carl Meyer
Sent: Thursday, February 26, 2009 11:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Use of scanned documents for text extraction and indexing

Hi Sithu,

there is a project called ocropus done by the DFKI, check the online
demo
here: http://demo.iupr.org/cgi-bin/main.cgi

And also http://sites.google.com/site/ocropus/

Regards

Hannes

m...@hcmeyer.com
http://mimblog.de

On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
sithu.sudar...@fda.hhs.gov> wrote:

>
> Hi All:
>
> Is there any study / research done on using scanned paper documents as
> images (may be PDF), and then use some OCR or other technique for
> extracting text, and the resultant index quality?
>
>
> Thanks in advance,
> Sithu D Sudarsan
>
> sithu.sudar...@fda.hhs.gov
> sdsudar...@ualr.edu
>
>
>


RE: Use of scanned documents for text extraction and indexing

2009-02-27 Thread Sudarsan, Sithu D.
 

Thanks to all who have responded (Hannes, Shashi, Vikram, Bastian,
Renaud and the rest).

Using OCRopus might provide the flexibility to use multi-column
documents and formatted ones.

Regarding literature on OCR, a few follow-ups to the paper link Renaud
provided do exist, but I could not locate anything significant.

I'll update if I can find something useful to report.



Sincerely,
Sithu 
sithu.sudar...@fda.hhs.gov
sdsudar...@ualr.edu

-Original Message-
From: Vikram Kumar [mailto:vikrambku...@gmail.com] 
Sent: Friday, February 27, 2009 5:44 AM
To: solr-user@lucene.apache.org; Shashi Kant
Subject: Re: Use of scanned documents for text extraction and indexing

Check this:
http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions

> How well does it work?
>
The character recognition accuracy of OCRopus right now (04/2007) is
about
> like Tesseract. That's because the only character recognition plug-in
in
> OCRopus is, in fact, Tesseract. In the future, there will be
additional
> character recognition plug-ins, both for Latin and for other character
sets.
>
The big area of improvement relative to other open source OCR systems
right
> now is in the area of layout analysis; in our benchmarks, OCRopus
greatly
> reduces layout errors compared to other open source systems."
>
OCR is only a part of the solution with scanned documents, i.e., they
recognize text.

For structural/semantic understanding of documents, you need engines
like
OCRopus that can do layout analysis and provide meaningful data for
document
analysis and understanding.

>From their own Wiki:

Should I use OCRopus or Tesseract?
>
You might consider using OCRopus right now if you require layout
analysis,
> if you want to contribute to it, if you find its output format more
> convenient (HTML with embedded OR information), and/or if you
anticipate
> requiring some of its other capabilities in the future (pluggability,
> multiple scripts, statistical language models, etc.).
>
In terms of character error rates, OCRopus performs similar to
Tesseract. In
> terms of layout analysis, OCRopus is significantly better than
Tesseract.
>
The main reasons not to use OCRopus yet is that it hasn't been packaged
yet,
> that it has limited multi-platform support, and that it runs somewhat
> slower. We hope to address all those issues by the beta release."
>


On Thu, Feb 26, 2009 at 11:35 PM, Shashi Kant 
wrote:

> Can anyone back that up?
>
> IMHO Tesseract is the state-of-the-art in OCR, but not sure that
"Ocropus
> builds on Tesseract".
> Can you confirm that Vikram has a point?
>
> Shashi
>
>
>
>
> - Original Message 
> From: Vikram Kumar 
> To: solr-user@lucene.apache.org; Shashi Kant 
> Sent: Thursday, February 26, 2009 9:21:07 PM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Tesseract is pure OCR. Ocropus builds on Tesseract.
> Vikram
>
> On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant 
> wrote:
>
> > Another project worth investigating is Tesseract.
> >
> > http://code.google.com/p/tesseract-ocr/
> >
> >
> >
> >
> > - Original Message 
> > From: Hannes Carl Meyer 
> > To: solr-user@lucene.apache.org
> > Sent: Thursday, February 26, 2009 11:35:14 AM
> > Subject: Re: Use of scanned documents for text extraction and
indexing
> >
> > Hi Sithu,
> >
> > there is a project called ocropus done by the DFKI, check the online
demo
> > here: http://demo.iupr.org/cgi-bin/main.cgi
> >
> > And also http://sites.google.com/site/ocropus/
> >
> > Regards
> >
> > Hannes
> >
> > m...@hcmeyer.com
> > http://mimblog.de
> >
> > On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> > sithu.sudar...@fda.hhs.gov> wrote:
> >
> > >
> > > Hi All:
> > >
> > > Is there any study / research done on using scanned paper
documents as
> > > images (may be PDF), and then use some OCR or other technique for
> > > extracting text, and the resultant index quality?
> > >
> > >
> > > Thanks in advance,
> > > Sithu D Sudarsan
> > >
> > > sithu.sudar...@fda.hhs.gov
> > > sdsudar...@ualr.edu
> > >
> > >
> > >
> >
> >
>
>


Authentication for Solr 6.4.2 , when deployed as WAR in tomcat

2018-01-22 Thread D Dasaradha Rami Reddy
Hi All,

We have Solr 6.4.2 currently deployed as a WAR in Tomcat. It doesn't have 
authentication now. I want to set up authentication for Solr. When it is 
deployed as a WAR in Tomcat, the process specified in the page below does not 
work: even after adding security.json to the Solr home directory, a curl 
request for authentication says it is not configured.

https://lucene.apache.org/solr/guide/6_6/authentication-and-authorization-plugins.html#AuthenticationandAuthorizationPlugins-EnablePluginswithsecurity.json
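
(For reference, the standard security.json for basic auth looks like the
following; the credentials line should be the guide's published hash pair for
user solr / password SolrRocks, best copied from the guide page. Under a
standalone Solr it is dropped next to solr.xml in the Solr home directory:

  {
    "authentication": {
      "blockUnknown": true,
      "class": "solr.BasicAuthPlugin",
      "credentials": {
        "solr": "IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0= Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="
      }
    },
    "authorization": {
      "class": "solr.RuleBasedAuthorizationPlugin",
      "permissions": [{"name": "security-edit", "role": "admin"}],
      "user-role": {"solr": "admin"}
    }
  }

Whether the plugin framework is initialized at all when Solr runs as a WAR
inside Tomcat is a separate question; the WAR layout stopped being the
supported way to run Solr in the 5.x/6.x timeframe, so the guide's
instructions assume the standalone scripts.)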

Has anyone done this before? If so, please guide me.

Thanks

Rami Reddy


Solr Question

2018-10-14 Thread Joseph Costello - F&amp;D Reports
I had a quick question regarding using Solr for doing fast geospatial 
calculations against multiple locations.  For example we have a product that 
takes 2 to 10 companies at a time (e.g. McDonalds 14,000, Subway 20,000, Duncan 
Donuts 5,000), and identifies and maps any store overlap based on a radius 
between 0.1 and 20 miles.   As you are probably aware, with this many locations, 
performing these calculations on the fly just takes too long.   Our initial 
solution was to process all distance calculations via a nightly process so the 
system just needs to retrieve them from the database.  This has for the most 
part worked really well and returns results almost immediately, no matter how 
large the dataset.

I know that Solr is very fast, especially for geospatial queries, but is 
there any way it will be faster doing millions of on-the-fly geospatial 
calculations than having the calculations already done and just retrieving 
them from the database?
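
(For scale, the on-the-fly version of one overlap pass is a filter per anchor
store; the field names below are placeholders, and geofilt distances are in
kilometers, so a 5 mile radius is roughly d=8:

  select?q=chain:subway&fq={!geofilt}&sfield=store_location&pt=40.7580,-73.9855&d=8&fl=store_id,dist:geodist()&sort=geodist() asc

That is still one such request per McDonald's location, which is where a
nightly pre-calculation, or pre-computed pair documents in the index, keeps
winning on total work.)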

Regards,

Joe


Joseph Costello
Chief Information Officer

F&D Reports | Creditntell | ARMS
===
Information Clearinghouse Inc. & Market Service Inc.
310 East Shore Road, Great Neck, NY 11023
email: jose...@fdreports.com | Tel: 800.789.0123 ext 112 | Cell: 516.263.6555 | 
www.informationclearinghouseinc.com

Need Help?  Email soluti...@fdreports.com to request IT Support.



Solr performing Calculations vs. Pulling data Values Directly From DB Question

2018-10-15 Thread Joseph Costello - F&amp;D Reports

My question is regarding using Solr for doing fast geospatial calculations 
against multiple locations.  For example we have a product that takes 2 to 10 
companies at a time (e.g. McDonalds 14,000, Subway 20,000, Duncan Donuts 
5,000), and identifies and maps any store overlap based on a radius between 
0.1 and 20 miles.   As you are probably aware, with this many locations, 
performing these calculations on the fly just takes too long.   Our initial 
solution was to process all distance calculations via a nightly process so 
the system just needs to retrieve them from the database.  This has for the 
most part worked really well and returns results almost immediately, no 
matter how large the dataset.

I know that Solr is very fast, especially for geospatial queries, but is 
there any way it could be faster doing millions of on-the-fly geospatial 
calculations than having the calculations already done and just retrieving 
them from the database?   If it could, I would not need to run a nightly 
process that pre-calculates the distances.  Thoughts?

Regards,

Joe


Joseph Costello
Chief Information Officer

F&D Reports | Creditntell | ARMS
===
Information Clearinghouse Inc. & Market Service Inc.
310 East Shore Road, Great Neck, NY 11023
email: jose...@fdreports.com | Tel: 800.789.0123 ext 112 | Cell: 516.263.6555 | 
www.informationclearinghouseinc.com

Need Help?  Email soluti...@fdreports.com to request IT Support.



RE: Solr performing Calculations vs. Pulling data Values Directly From DB Question

2018-10-17 Thread Joseph Costello - F&amp;D Reports
Any feedback from the group on the question below?

The question was: will Solr performing distance calculations (10,000+) on the 
fly be faster than a SQL query simply pulling pre-calculated distance values 
directly from the database?

I know Solr is fast, but basic computer science tells me that pulling 
pre-calculated data values directly would be faster.  Just looking for 
confirmation before I go down a rabbit hole seeing if Solr can be faster.   
Thoughts?

#SaveMeFromTheRabbitHole. Thx in advance

Regards,

Joe

Joseph Costello
Chief Information Officer

F&D Reports | Creditntell | ARMS
===
Information Clearinghouse Inc. & Market Service Inc.
310 East Shore Road, Great Neck, NY 11023
email: jose...@fdreports.com | Tel: 800.789.0123 ext 112 | Cell: 516.263.6555 | 
www.informationclearinghouseinc.com

Need Help?  Email soluti...@fdreports.com to request IT Support.

From: Joseph Costello - F&D Reports
Sent: Monday, October 15, 2018 12:21 PM
To: 'solr-user@lucene.apache.org' 
Subject: Solr performing Calculations vs. Pulling data Values Directly From DB 
Question


My question is regarding using Solr for doing fast geospatial calculations 
against multiple locations.  For example we have a product that takes 2 to 10 
companies at a time (i.e. McDonalds 14,000 Subway 20,000, Duncan Donuts 5000), 
and identifies and maps any store overlap based on a range between 0.1 or 20 
mile radius.   As you are probably aware with this many locations performing 
these calculations on the fly just takes too long.   Our initial solution was 
to process all distance calculations via a nightly process so the system just 
needs to retrieve them from the database.  This for the most part has work 
really well and returns results no matter how large the dataset almost 
immediately.

I know that Solr is very fast, especially in the Geospatial queries, but is 
there any way it could be faster doing millions of on the fly geospatial 
calculations, then having the calculations already done and just retrieving 
them from the Database?   If it could I would not need to run a nightly process 
that pre calculates the distances.  Thougths?

Regards,

Joe


Joseph Costello
Chief Information Officer

F&D Reports | Creditntell | ARMS
===
Information Clearinghouse Inc. & Market Service Inc.
310 East Shore Road, Great Neck, NY 11023
email: jose...@fdreports.com | Tel: 800.789.0123 ext 112 | Cell: 516.263.6555 | 
www.informationclearinghouseinc.com

Need Help?  Email soluti...@fdreports.com to request IT Support.



RE: Solr performing Calculations vs. Pulling data Values Directly From DB Question

2018-10-17 Thread Joseph Costello - F&amp;D Reports
First, thank you for your response. With that said, the on-the-fly calculations 
that Solr would be tasked to perform would be as follows: perform a store 
location overlap analysis of McDonald's (14,000 locations), Subway (20,000 
locations), and Duncan Donuts (4,000 locations), utilizing their geocodes and 
returning all locations that are within a 5 mile radius.



I will absolutely be using the Solr index down the road with my pre-calculated 
value data to get the best of both worlds.   However, the current frontend will 
need to be re-programmed to take advantage of this.  This will be a phase 2 
initiative.



The main reason for the question was that we are optimizing the nightly 
pre-calculation job to only do the calculations on the deltas, which will reduce 
the nightly job's processing time by 85%.  That will make it less likely to 
fail and, when it does fail, quicker to re-process.



If the process pre-calculates without issue, the speed at which the SQL query 
returns data is super fast and not the issue at all.  Thanks again and I look 
forward to any additional replies and/or feedback.



Regards,



Joe





Joseph Costello

Chief Information Officer



F&D Reports | Creditntell | ARMS

===

Information Clearinghouse Inc. & Market Service Inc.

310 East Shore Road, Great Neck, NY 11023

email: jose...@fdreports.com | Tel: 800.789.0123 ext 112 | Cell: 516.263.6555 | 
www.informationclearinghouseinc.com



Need Help?  Click here to request IT Support.



-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, October 17, 2018 11:30 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr performing Calculations vs. Pulling data Values Directly From 
DB Question



On 10/17/2018 9:19 AM, Joseph Costello - F&D Reports wrote:

>

> Any feedback from the group on the question below.

>

> The question was will solr performing distance calculations (10,000++)

> on the fly, perform faster than SQL query simply pulling

> pre-calculated distance values directly from the database.

>



If your database contains pre-calculated values, then pulling those is likely 
to be faster than doing calculations on the fly.  Whether that's true in 
practice depends on precisely what kinds of calculation you are doing, and what 
must be done to obtain the values that go into the calculation.



If your database has pre-calculated values, and you need those in search 
results, why not just put the pre-calculated values into your Solr index when 
you build it?  One of the key things done with a search engine for performance 
is handling as much as possible at index time, so there's less work to do at 
query time.
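
(A sketch of what that can look like for the overlap data, with made-up field
names: one document per qualifying store pair, computed by the nightly job,
and then the lookup is a plain filter query:

  { "id": "mcd_00123__sub_00456",
    "chain_a": "mcdonalds", "chain_b": "subway",
    "distance_miles": 3.2,
    "location_a": "40.7580,-73.9855" }

  select?q=*:*&fq=chain_a:mcdonalds&fq=chain_b:subway&fq=distance_miles:[0 TO 5]
)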



Thanks,

Shawn


RE: Solr performing Calculations vs. Pulling data Values Directly From DB Question

2018-10-18 Thread Joseph Costello - F&amp;D Reports
Sorry for the late response, Walter, but I believe the bottom line of what you 
are saying below is that having the data pre-calculated (either in a cache or in 
the database) is faster than calculating the values on the fly.  However, once 
the data is pre-calculated, using Solr with HTTP caching will produce the best 
performance results.  Correct?

Regards,

Joe

Joseph Costello
Chief Information Officer

F&D Reports | Creditntell | ARMS
===
Information Clearinghouse Inc. & Market Service Inc.
310 East Shore Road, Great Neck, NY 11023
email: jose...@fdreports.com | Tel: 800.789.0123 ext 112 | Cell: 516.263.6555 | 
www.informationclearinghouseinc.com

    Need Help?  Click here to request IT Support.


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, October 17, 2018 11:35 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr performing Calculations vs. Pulling data Values Directly 
From DB Question

Instead of putting things in a database, I would put an HTTP cache, like 
Varnish, in front of Solr. Then you get the best of both worlds. Fast fetches 
for things that have already been calculated and results for any query that 
shows up.

The client does not need special code to get the manually cached values. 
Everything is the same kind of HTTP request.

An HTTP cache is extremely fast, almost certainly faster than a database.
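
(One detail worth checking if you go this route: the stock solrconfig.xml
usually ships with HTTP cache headers disabled via never304="true". Something
along these lines lets a cache such as Varnish validate and reuse responses;
the values shown are only illustrative:

  <requestDispatcher>
    <httpCaching never304="false" lastModifiedFrom="openTime" etagSeed="Solr">
      <cacheControl>max-age=300, public</cacheControl>
    </httpCaching>
  </requestDispatcher>
)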

wunder
Walter Underwood
wun...@wunderwood.org
https://url.emailprotection.link/?aOxxHWms19X_FosFdd-I1waYrAmmgiQjoUY9nT2nRS00~ 
 (my blog)

> On Oct 17, 2018, at 8:29 AM, Shawn Heisey  wrote:
> 
> On 10/17/2018 9:19 AM, Joseph Costello - F&D Reports wrote:
>> 
>> Any feedback from the group on the question below.
>> 
>> The question was will solr performing distance calculations (10,000++) on 
>> the fly, perform faster than SQL query simply pulling pre-calculated 
>> distance values directly from the database.
>> 
> 
> If your database contains pre-calculated values, then pulling those is likely 
> to be faster than doing calculations on the fly.  Whether that's true in 
> practice depends on precisely what kinds of calculation you are doing, and 
> what must be done to obtain the values that go into the calculation.
> 
> If your database has pre-calculated values, and you need those in search 
> results, why not just put the pre-calculated values into your Solr index when 
> you build it?  One of the key things done with a search engine for 
> performance is handling as much as possible at index time, so there's less 
> work to do at query time.
> 
> Thanks,
> Shawn
>