Very low filter cache hit ratio

2019-05-29 Thread Saurabh Sharma
Hi All,

I am trying to run an index on SolrCloud 7.3.1 with 3 nodes. The plan is
to run a full index once a day and a delta index every 30 minutes. The
purpose of keeping a slightly stale index was to make use of Solr's
caches. But to my surprise, when I put real traffic on this index, the
cache hit ratio was very low. It varied between 0 and 10% irrespective of
the size of the filter cache.

I tried varying the cache size, but nothing changed and usage stayed very low.
Most of the fields in the index are stored or have docValues.

I tried with cache sizes of 1024, 10024, 100024.

What can be the possible reasons for low cache usage?
How can I leverage cache feature for high traffic indexes?

Thanks
Saurabh Sharma


Re: Very low filter cache hit ratio

2019-05-29 Thread Shawn Heisey

On 5/29/2019 6:57 AM, Saurabh Sharma wrote:
> What can be the possible reasons for low cache usage?
> How can I leverage cache feature for high traffic indexes?


Your usage apparently does not use the exact same query (or filter 
query, in the case of filterCache) very often.


In order to achieve a high hit ratio on a cache, the same query will 
need to be used by many users.  That's not happening here.  I'm betting 
that each user is sending something unique to Solr - which means it will 
be impossible to get a hit, unless that user sends the same query again.


Thanks,
Shawn


Re: Very low filter cache hit ratio

2019-05-29 Thread Saurabh Sharma
Hi Shawn,

Many filters are common among the queries. AFAIK, filter cache entries are
created against filters, and by that logic one should get a good hit ratio
for those cached filter conditions. I tried to create a cache of 100K size
and that too was not producing a good hit ratio. Is there any document or
suggestion about efficient usage of the various caches and their internal
workings?

Thanks
Saurabh



Re: Very low filter cache hit ratio

2019-05-29 Thread Atita Arora
You can refer to this one:
https://teaspoon-consulting.com/articles/solr-cache-tuning.html

HTH,
Atita



RE: Very low filter cache hit ratio

2019-05-29 Thread Markus Jelsma
Hello,

What is missing in that article is that you must never use NOW in a filter
query without rounding it down. If you use it, round it down to the minute,
hour or day to prevent flooding the filter cache.

Regards,
Markus
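
As a rough illustration of the rounding advice above, here is a minimal
Python sketch that builds a filter query with NOW rounded down using Solr
date math. The field name created_at and the helper names are assumptions
made for the example; the thread does not show the real schema or queries.

```python
from urllib.parse import urlencode

# Hypothetical date field; the thread does not show the real schema.
DATE_FIELD = "created_at"


def rounded_fq(days: int, unit: str = "DAY") -> str:
    """Filter for roughly 'the last N days', with NOW rounded down.

    NOW/DAY is the start of the current day, so every request sent on
    the same day carries the same filter string.
    """
    return f"{DATE_FIELD}:[NOW/{unit}-{days}DAYS TO NOW/{unit}+1{unit}]"


def unrounded_fq(days: int) -> str:
    """The pattern to avoid: NOW expands to the current millisecond."""
    return f"{DATE_FIELD}:[NOW-{days}DAYS TO NOW]"


# Request parameters as they might be sent to a /select handler.
query_string = urlencode({"q": "*:*", "fq": rounded_fq(7)})
print(rounded_fq(7))
print(query_string)
```

Passing unit="HOUR" trades a little cache reuse for a fresher window; which
rounding is acceptable depends on the application's requirements.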



ExactSharedStatsCache vs LRUStatsCache

2019-05-29 Thread Walter Underwood
Running 6.6, why should I prefer one over the other? And what kind of cache 
does Exact use if it isn’t LRU?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Very low filter cache hit ratio

2019-05-29 Thread Shawn Heisey

On 5/29/2019 7:33 AM, Saurabh Sharma wrote:
> Many filters are common among the queries. AFAIK, filter cache entries are
> created against filters, and by that logic one should get a good hit ratio
> for those cached filter conditions. I tried to create a cache of 100K size
> and that too was not producing a good hit ratio. Is there any document or
> suggestion about efficient usage of the various caches and their internal
> workings?


In order to produce a cache hit, the query or filter must be identical 
in every way.  Whitespace and all.  And it must be identical after parts 
of it are substituted or expanded by Solr.


Take note of the reply you received from Markus Jelsma.  The "NOW" 
keyword is replaced by a current timestamp with millisecond accuracy -- 
which effectively means that queries using NOW are always different and 
cannot produce a cache hit.  Rounding the timestamp using NOW/HOUR or 
NOW/DAY, if that fits user requirements, can be one solution to that 
problem.


Be careful with defining a large filterCache.  The memory requirements 
can become VERY extreme.


Thanks,
Shawn
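
The effect described above can be reproduced with a toy model. The sketch
below is not Solr's implementation: it assumes a cache keyed on the exact
filter string after NOW has been replaced by a timestamp, and the field
name, request spacing and helper names are all made up for illustration.

```python
HOUR_MS = 3_600_000


def expand_now(fq: str, now_ms: int, round_to_ms: int = 1) -> str:
    """Toy stand-in for NOW substitution.

    Replaces NOW with an epoch-millisecond timestamp, rounded down to a
    multiple of round_to_ms (1 means no rounding).
    """
    rounded = now_ms - (now_ms % round_to_ms)
    return fq.replace("NOW", str(rounded))


def hit_ratio(request_times_ms, fq, round_to_ms=1):
    """Replay requests against a cache keyed on the exact expanded string."""
    seen = set()
    hits = 0
    for now_ms in request_times_ms:
        key = expand_now(fq, now_ms, round_to_ms)
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / len(request_times_ms)


# 1,000 requests, 137 ms apart, starting at 2019-05-29T12:00:00Z.
start_ms = 1_559_131_200_000
times = [start_ms + i * 137 for i in range(1000)]
fq = "created_at:[NOW-1DAY TO NOW]"  # hypothetical field name

unrounded = hit_ratio(times, fq)          # every key is different
rounded = hit_ratio(times, fq, HOUR_MS)   # all keys share one hour
print(f"unrounded NOW: {unrounded:.3f}")
print(f"NOW rounded to the hour: {rounded:.3f}")
```

In this model the unrounded filter never hits, while the rounded one misses
only on the first request of each hour. The same exact-match rule is why
two filters that differ only in whitespace count as different entries.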


Re: Very low filter cache hit ratio

2019-05-29 Thread Erick Erickson
You must show us the _exact_ filter queries you’re using, or at least a 
representative sample.

Bumping the cache up very high is almost always the wrong thing to do. Each 
entry takes approximately maxDoc/8 bytes so unless your corpus is very small, 
you’ll eventually blow memory up.

To Markus’ point about NOW, a full treatment is here: 
https://dzone.com/articles/solr-date-math-now-and-filter

Best,
Erick
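
To put the maxDoc/8 figure above into numbers, here is a back-of-the-envelope
Python sketch. The corpus size of 10 million documents is an assumption (the
thread never states it); the cache sizes are the three values tried in the
original post. Treat the result as a rough worst case in which every slot
holds a full bitset.

```python
def filter_cache_bytes(max_doc: int, entries: int) -> int:
    """Approximate worst case: one bit per document for each cached filter."""
    bytes_per_entry = max_doc // 8
    return bytes_per_entry * entries


MAX_DOC = 10_000_000  # assumed corpus size; not stated in the thread

for size in (1024, 10024, 100024):
    gib = filter_cache_bytes(MAX_DOC, size) / 2**30
    print(f"size={size:>6}: up to ~{gib:,.1f} GiB if the cache fills")
```

Under that assumption the 100024-entry setting would need more than 100 GiB
if it ever filled up, which is why raising the size is rarely the right fix
for a low hit ratio.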




Re: problem indexing GPS metadata for video upload

2019-05-29 Thread Where is Where
Sorry Tim! I missed your last message about this issue! Thank you very much
for the information.
Does the latest Tika release, 1.21, already incorporate the change? And how
about Solr?

Thanks!

On Fri, May 3, 2019 at 11:28 AM Where is Where  wrote:

> Thank you very much Tim. How do I make the Tika change apply to Solr? I
> saw the Tika core, parsers and xml jar files (tika-core.jar,
> tika-parsers.jar, tika-xml.jar) in Solr's contrib/extraction/lib folder.
> Do we just replace these files? Thanks!
>
> On Thu, May 2, 2019 at 12:16 PM Where is Where  wrote:
>
>> Thank you Alex and Tim.
>> I have looked at the solrconfig.xml file (I am trying the techproducts
>> demo config); the only related place I can find is the extract handler:
>>
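>> <requestHandler name="/update/extract"
>>   startup="lazy"
>>   class="solr.extraction.ExtractingRequestHandler" >
>> <lst name="defaults">
>>   <str name="lowernames">true</str>
>>   
>>
>>   
>>   <str name="captureAttr">true</str>
>>   <str name="fmap.a">links</str>
>>   <str name="fmap.div">ignored_</str>
>> </lst>
>>   </requestHandler>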
>>
>> I am using this command bin/post -c techproducts
>> example/exampledocs/1.mp4 -params "literal.id=mp4_1&uprefix=attr_"
>>
>> I have tried commenting out ignored_ and
>> changing to div
>> but it is still not working. I don't quite understand why images get GPS
>> and other metadata while videos behave differently, even though both use
>> the same solrconfig and the GPS metadata are in the same fields. The
>> solrconfig settings do not differentiate between image and video.
>>
>> Tim yes this is related to the TIKA link. Thank you!
>>
>> Here is the output in solr for mp4.
>>
>> {
>> "attr_meta":["stream_size",
>>   "5721559",
>>   "date",
>>   "2019-03-29T04:36:39Z",
>>   "X-Parsed-By",
>>   "org.apache.tika.parser.DefaultParser",
>>   "X-Parsed-By",
>>   "org.apache.tika.parser.mp4.MP4Parser",
>>   "stream_content_type",
>>   "application/octet-stream",
>>   "meta:creation-date",
>>   "2019-03-29T04:36:39Z",
>>   "Creation-Date",
>>   "2019-03-29T04:36:39Z",
>>   "tiff:ImageLength",
>>   "1080",
>>   "resourceName",
>>   "/Volumes/Data/inData/App/solr/example/exampledocs/1.mp4",
>>   "dcterms:created",
>>   "2019-03-29T04:36:39Z",
>>   "dcterms:modified",
>>   "2019-03-29T04:36:39Z",
>>   "Last-Modified",
>>   "2019-03-29T04:36:39Z",
>>   "Last-Save-Date",
>>   "2019-03-29T04:36:39Z",
>>   "xmpDM:audioSampleRate",
>>   "1000",
>>   "meta:save-date",
>>   "2019-03-29T04:36:39Z",
>>   "modified",
>>   "2019-03-29T04:36:39Z",
>>   "tiff:ImageWidth",
>>   "1920",
>>   "xmpDM:duration",
>>   "2.64",
>>   "Content-Type",
>>   "video/mp4"],
>> "id":"mp4_4",
>> "attr_stream_size":["5721559"],
>> "attr_date":["2019-03-29T04:36:39Z"],
>> "attr_x_parsed_by":["org.apache.tika.parser.DefaultParser",
>>   "org.apache.tika.parser.mp4.MP4Parser"],
>> "attr_stream_content_type":["application/octet-stream"],
>> "attr_meta_creation_date":["2019-03-29T04:36:39Z"],
>> "attr_creation_date":["2019-03-29T04:36:39Z"],
>> "attr_tiff_imagelength":["1080"],
>> 
>> "resourcename":"/Volumes/Data/inData/App/solr/example/exampledocs/1.mp4",
>> "attr_dcterms_created":["2019-03-29T04:36:39Z"],
>> "attr_dcterms_modified":["2019-03-29T04:36:39Z"],
>> "last_modified":"2019-03-29T04:36:39Z",
>> "attr_last_save_date":["2019-03-29T04:36:39Z"],
>> "attr_xmpdm_audiosamplerate":["1000"],
>> "attr_meta_save_date":["2019-03-29T04:36:39Z"],
>> "attr_modified":["2019-03-29T04:36:39Z"],
>> "attr_tiff_imagewidth":["1920"],
>> "attr_xmpdm_duration":["2.64"],
>> "content_type":["video/mp4"],
>> "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  
>> \n  \n  \n  \n  \n  \n  \n  \n \n   "],
>> "_version_":1632383499325407232}]
>>   }}
>>
>> JPEG is getting these:
>> "attr_meta":[
>> "GPS Latitude",
>>   "37° 47' 41.99\"",
>> 
>> "attr_gps_latitude":["37° 47' 41.99\""],
>>
>>
>> On Wed, May 1, 2019 at 2:57 PM Where is Where  wrote:
>>
>>> I am uploading video to Solr via Tika:
>>> https://lucene.apache.org/solr/guide/7_7/uploading-data-with-solr-cell-using-apache-tika.html
>>> The index contains no GPS metadata for video, although it is extracted
>>> and indexed for images such as JPEG. I have checked both MP4 and MOV
>>> files; the files I checked all have GPS Exif data embedded in the same
>>> fields as the images. Any idea? Thanks!
>>>
>>