CommonTerms & slow queries

2019-03-29 Thread Erie Data Systems
Using Solr 8.0.0, single instance, single core, 50M records (38 GB index)
on one SSD, 96 GB RAM, 16-core CPU.

Most queries run very fast (<1 sec); however, we have noticed that queries
containing "common" words are quite slow, sometimes 10+ sec. We are
currently using edismax with 2 text_general fields in qf and pf, qs=0, ps=0.

I came across these, which describe the issue:
https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

https://lucene.apache.org/core/5_5_3/queries/org/apache/lucene/queries/CommonTermsQuery.html

Test queries with issues:
1. things to do in seattle with eric
2. year of the cat
3. time of my life
4. when will i be loved
5. once upon a time in the west

Stopwords are not an option: in the case of #2, if "of" and "the" are
removed, relevance is essentially destroyed. Is there a commonly suggested
solution, besides adding stopwords, to what would seem to be a common issue?

Thank you.
Craig Stadler


Re: CommonTerms & slow queries

2019-03-29 Thread Erie Data Systems
Michael,

select/?&rows=12&qf=title+description&q=once+upon+a+time+in+the+west&fl=*&hl=true&hl.field=desc&hl.fragsize=250&hl.maxAnalyzedChars=20&ps=1&qs=1&df=title&mm=2&defType=edismax&debugQuery=off&indent=on&wt=json&debug=true
"rawquerystring":"once upon a time in the west",
"querystring":"once upon a time in the west",
"parsedquery":"+(DisjunctionMaxQuery((description:once | title:once))
DisjunctionMaxQuery((description:upon | title:upon))
DisjunctionMaxQuery((description:a | title:a))
DisjunctionMaxQuery((description:time | title:time))
DisjunctionMaxQuery((description:in | title:in))
DisjunctionMaxQuery((description:the | title:the))
DisjunctionMaxQuery((description:west | title:west)))~2",
"parsedquery_toString":"+(((description:once | title:once)
(description:upon | title:upon) (description:a | title:a) (description:time
| title:time) (description:in | title:in) (description:the | title:the)
(description:west | title:west))~2)"

Removing pf cuts the time almost in half, but it's still 5+ sec.

Thank you for your help. I'm more than happy to include more output.
-Craig


On Fri, Mar 29, 2019 at 12:24 PM Michael Gibney wrote:

> Can you post the query that's actually built for some of these inputs
> ("parsedquery" or "parsedquery_toString" output included for requests with
> "debug=query" parameter)? What is performance like if you turn off pf
> (i.e., no implicit phrase searching)?
> Michael


Re: CommonTerms & slow queries

2019-03-29 Thread Erie Data Systems
All great advice, thanks Michael. Have an excellent weekend! Testing the
common grams.

-Craig
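
For readers unfamiliar with the common grams idea mentioned above, here is
a minimal sketch in Python. It is only an illustration of the concept, not
Lucene's implementation: each token is kept, and a bigram is added whenever
either word of an adjacent pair is in a list of common words. In Solr this
is normally done with CommonGramsFilterFactory at index time and
CommonGramsQueryFilterFactory at query time; the function name and the
common-word list below are hypothetical.

```python
from typing import Iterable, List, Set


def common_grams(tokens: Iterable[str], common_words: Set[str]) -> List[str]:
    """Return each token, plus a bigram "a_b" when a or b is a common word.

    Simplified illustration only: token positions and the query-time
    variant (which prefers the bigrams) are not modeled here.
    """
    toks = list(tokens)
    out: List[str] = []
    for i, tok in enumerate(toks):
        out.append(tok)  # the single term is always kept
        if i + 1 < len(toks):
            nxt = toks[i + 1]
            if tok in common_words or nxt in common_words:
                out.append(f"{tok}_{nxt}")  # bigram overlaid on the pair
    return out


if __name__ == "__main__":
    common = {"of", "the"}  # hypothetical common-word list
    print(common_grams("year of the cat".split(), common))
```

The point of the bigrams is that a term such as "of_the" or "the_cat" has a
far shorter postings list than "the" alone, so phrase-like queries made of
common words can be answered without walking the huge lists.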


Interesting Grouping/Facet issue

2019-04-09 Thread Erie Data Systems
Solr 8.0.0: I have a HASHTAG string field that I am trying to facet on to
get the most popular hashtags (top 100) across many sources. (The SITE
field is also a string.)

/select?facet.field=hashtag&facet=on&rows=0&q=%2Bhashtag:*%20%2BDT:[" .
date('Y-m-d') . "T00:00:00Z+TO+" . date('Y-m-d')  .
"T23:59:59Z]&facet.limit=100&facet.mincount=1&facet.method=fc

It works, but not the way I feel it should. For example, if one site has
1000 rows on today's date and they all have a HASHTAG in common, that
HASHTAG automatically rises to the top simply because one SITE has 1000
pages with the same HASHTAG.

Is there a way to get a more even distribution of top HASHTAGs for a given
date, i.e. to facet by a grouping, distinct, or filter of some sort? I'm
more interested in knowing whether a HASHTAG is used frequently across
SITEs, not just by one.

Hope this makes sense... any recommendations welcomed.

Thank you in advance,
-Craig
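
One possible approach, sketched here under assumptions, is Solr's JSON
Facet API: a terms facet on the hashtag field, sorted by the number of
distinct sites rather than by document count, using the unique()
aggregation. The Python below only builds the request body and does not
send it. The field names hashtag, SITE, and DT are taken from the post
(their exact case in the schema is assumed), and the function name and the
labels top_hashtags and site_count are hypothetical.

```python
import json
from datetime import date
from typing import Any, Dict


def build_hashtag_facet_request(day: date) -> Dict[str, Any]:
    """Build a JSON request body ranking hashtags by distinct SITE count."""
    start = f"{day.isoformat()}T00:00:00Z"
    end = f"{day.isoformat()}T23:59:59Z"
    return {
        "query": "*:*",
        # Same restrictions as the original q, expressed as filters.
        "filter": ["hashtag:*", f"DT:[{start} TO {end}]"],
        "limit": 0,  # no documents needed, only facet counts
        "facet": {
            "top_hashtags": {
                "type": "terms",
                "field": "hashtag",
                "limit": 100,
                "mincount": 1,
                # Order buckets by the per-bucket statistic defined below.
                "sort": "site_count desc",
                "facet": {"site_count": "unique(SITE)"},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_hashtag_facet_request(date.today()), indent=2))
```

With this ordering, a hashtag used on 1000 pages of a single site gets a
site_count of 1 and no longer outranks a hashtag used by many sites.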


Issue with max documents on single instance

2019-05-31 Thread Erie Data Systems
Solr 8.0.0 (single server, single instance, single core), CentOS 6 x86_64
Error: number of documents in the index cannot exceed 2147483519

I've read about the maximum number of documents per index, which means I
need to move to SolrCloud.
My question is this: can I implement a "clustered" environment on a single
server so I can take advantage of the segmented data? I have a ton of RAM
(96 GB) and plenty of SSD disk space available.

Thanks,
-Craig
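
As a rough aid to the question above: the number in the error message is
the per-index document limit, which equals the largest 32-bit signed
integer minus 128. Each shard is its own index with its own limit, and
SolrCloud can host several shards of one collection on a single node. The
Python sketch below only does the arithmetic for a lower bound on the shard
count; the function name is hypothetical, and real sizing should leave
headroom for growth.

```python
# Per-index document limit reported in the error message.
MAX_DOCS_PER_CORE = 2**31 - 1 - 128  # 2147483519


def min_shards(total_docs: int, per_shard_limit: int = MAX_DOCS_PER_CORE) -> int:
    """Smallest number of shards so that no shard exceeds per_shard_limit.

    Assumes documents are spread evenly across shards, which hash-based
    routing only approximates.
    """
    if total_docs < 0 or per_shard_limit <= 0:
        raise ValueError("total_docs must be >= 0 and per_shard_limit > 0")
    # Integer ceiling division; at least one shard is always needed.
    return max(1, -(-total_docs // per_shard_limit))


if __name__ == "__main__":
    for docs in (50_000_000, MAX_DOCS_PER_CORE + 1, 5_000_000_000):
        print(docs, "documents ->", min_shards(docs), "shard(s) minimum")
```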


Odd error with Solr 8 log / ingestion

2019-06-06 Thread Erie Data Systems
Hello everyone,

I recently set up Solr 8 in SolrCloud mode. Previously I was using
standalone mode and was able to easily push 10,000 records per HTTP call
with autocommit. Ingestion occurs when server A pushes a payload over HTTPS
to server B (SolrCloud) on the LAN.

However, since converting to SolrCloud (1 node, 3 shards, 1 replica) I am
seeing the following error:

ConcurrentUpdateHttp2SolrClient
Error consuming and closing http response stream.

I'm wondering what the possible causes could be; I'm not seeing much
documentation online specific to Solr.

Thanks in advance for any assistance,
Craig


Moving to solrcloud from single instance

2019-08-12 Thread Erie Data Systems
I am starting the planning stages of moving from a single instance of Solr
8 to a SolrCloud implementation.

Currently I have a 148 GB index on a single dedicated server with 96 GB
RAM, 16 cores at 2.4 GHz each, and SSD storage. The search is fast, but
obviously the index size is greater than the physical memory, which to my
understanding is not a good thing.

I have a lot of experience with single-instance Solr but none with
SolrCloud. I have 3 machines (other than my main one) with exactly the same
hardware, so essentially 3 x 96 GB of RAM, which should be plenty.

My issue is that I'm not sure where to go to learn how to set this up (how
many shards, how many replicas, etc.), and I would rather hire somebody, or
find something (a detailed video or document), to guide me through the
process and help make decisions along the way. For example, I think a shard
is a piece of the index, but I don't even know how to decide how many
replicas to use, or what they are.

Thanks everyone.
-Craig
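
As a starting point for the sizing question: a shard is one slice of the
index, and a replica is a copy of a shard, so the disk and page-cache
footprint per machine grows with the replication factor. The Python sketch
below does this back-of-the-envelope arithmetic for the numbers in the post
(148 GB index, 3 machines with 96 GB RAM each). The function name is
hypothetical, and the model assumes shards are equal in size and spread
evenly; it ignores JVM heap, OS overhead, and index growth.

```python
def index_gb_per_node(total_index_gb: float, num_shards: int,
                      replication_factor: int, num_nodes: int) -> float:
    """Approximate index size hosted on each node, in GB."""
    if min(num_shards, replication_factor, num_nodes) <= 0:
        raise ValueError("shards, replication factor and nodes must be > 0")
    shard_gb = total_index_gb / num_shards    # size of one slice
    cores = num_shards * replication_factor   # every copy of every slice
    return shard_gb * cores / num_nodes       # even spread across nodes


if __name__ == "__main__":
    ram_gb = 96  # RAM per machine, from the post
    for rf in (1, 2):
        per_node = index_gb_per_node(148, num_shards=3,
                                     replication_factor=rf, num_nodes=3)
        verdict = "below" if per_node < ram_gb else "above"
        print(f"replication factor {rf}: {per_node:.1f} GB per node, "
              f"{verdict} {ram_gb} GB of RAM")
```

Under these assumptions, three shards with a single copy each put about a
third of the 148 GB index on every machine, while keeping two copies of
each shard roughly doubles that.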