Re: Solr as an html cache

Aristedes Maniatis Mon, 21 Nov 2016 23:57:08 -0800

Thanks Erick

Very helpful indeed.


Your guesses on data size are about right. There might only be 50,000 items in 
the whole index. And typically we'd fetch a batch of 10. Disk is cheap and this 
really isn't taking much room anyway. For such a tiny data set, it seems like 
this approach will work well.
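
As a rough back-of-the-envelope check, here is a small Python sketch of the extra disk space. It only uses figures already mentioned in this thread (50,000 items, 2-10kB of html per record, roughly 2:1 compression); the function name and the decimal MB conversion are my own choices, and the real ratio will depend on the actual HTML.

```python
# Back-of-the-envelope estimate of the extra stored-field size on disk.
# All inputs are the rough figures quoted in this thread, not measurements.
DOC_COUNT = 50_000          # "50,000 items in the whole index"
HTML_KB_RANGE = (2, 10)     # "2-10kB of html" per record
COMPRESSION_RATIO = 2.0     # "about a 2:1 compression ratio"


def estimate_stored_mb(doc_count, kb_per_doc, compression_ratio):
    """Return (raw_mb, on_disk_mb) for doc_count docs of kb_per_doc each."""
    raw_mb = doc_count * kb_per_doc / 1000  # decimal units: 1 MB = 1000 kB
    return raw_mb, raw_mb / compression_ratio


for kb in HTML_KB_RANGE:
    raw, disk = estimate_stored_mb(DOC_COUNT, kb, COMPRESSION_RATIO)
    print(f"{kb} kB/doc: ~{raw:.0f} MB raw, ~{disk:.0f} MB on disk")
```

Under those assumptions the stored HTML adds somewhere between about 50 MB and 250 MB on disk, which is small for a single index.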


This seems like it might even be a good approach for creating additional cores 
primarily for the purpose of caching: that is, a core full of records that are 
only ever queried by some unique key. I wouldn't want to abuse Solr for a 
purpose it wasn't designed for, but since it is already there it appears to be a 
useful approach. Rather than getting some data from the db, we fetch it from 
Solr pre-assembled.
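
To make the idea concrete, here is a minimal sketch of what such a cache core could look like from the client side, using only the Python standard library. The base URL, the core name `html_cache`, the field name `html`, the `string` field type and the key field `id` are all assumptions for illustration, not our actual setup. It builds a Schema API `add-field` request for a stored-only field and fetches a batch of records by unique key with the `{!terms}` query parser.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

SOLR_BASE = "http://localhost:8983/solr"  # assumption: a local Solr instance
CACHE_CORE = "html_cache"                 # hypothetical core name


def add_field_request(base, core, field_name="html"):
    """Build a Schema API request that adds a stored-only field.

    Assumes a managed schema and a field type called "string".
    """
    payload = {"add-field": {
        "name": field_name,
        "type": "string",
        "stored": True,      # the HTML is kept so it can be returned
        "indexed": False,    # never searched on
        "docValues": False,  # never sorted or faceted on
    }}
    return Request(
        f"{base}/{quote(core)}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def lookup_url(base, core, ids, field_name="html"):
    """Build a /select URL that fetches documents by unique key.

    Assumes the key field is "id" and that ids contain no commas.
    """
    params = urlencode({
        "q": "{!terms f=id}" + ",".join(ids),
        "fl": f"id,{field_name}",
        "rows": len(ids),
        "wt": "json",
    })
    return f"{base}/{quote(core)}/select?{params}"


def fetch_cached(base, core, ids, field_name="html"):
    """Return {id: html} for the ids found in the cache core."""
    with urlopen(lookup_url(base, core, ids, field_name)) as resp:
        body = json.load(resp)
    return {d["id"]: d.get(field_name) for d in body["response"]["docs"]}
```

A caller would do something like `fetch_cached(SOLR_BASE, CACHE_CORE, ["a1", "b2"])` and fall back to the database for any id missing from the result.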

Thanks
Ari



On 22/11/16 3:28am, Erick Erickson wrote:
> Searching isn't really going to be impacted much, if at all. You're
> essentially talking about setting some field with stored="true" and
> stuffing the HTML into that, right? It will probably have indexed="false"
> and docValues="false".
> 
> So what that means is that very early in the indexing process, the
> raw data is dumped to the *.fdt and *.fdx files for the segment. These
> are totally irrelevant for querying; they aren't even read from disk to score
> the docs. So let's say your numFound = 10,000 and rows=10. Those 10,000
> docs are scored without having to look at the stored data at all. Now, when
> the 10 docs are assembled for return, the stored data is read off disk,
> decompressed, and returned.
> 
> So the additional cost will be
> 1> your index is larger on disk
> 2> merging etc. will be a bit more costly. This doesn't
>      seem like a problem if your index doesn't change all
>      that often.
> 3> there will be some additional load to decompress the data
>      and return it.
> 
> This is a perfectly reasonable approach. My guess is that any difference
> in search speed will be lost in the noise of measuring and that the
> additional load of decompressing will be more than offset by not having
> to make a separate service call to actually get the doc. But as always,
> measuring the performance is the proof you need.
> 
> You haven't indicated how _many_ docs you have in your corpus, but a
> rough indication of the additional disk space is about half the raw HTML size,
> we've usually seen about a 2:1 compression ratio. With a zillion docs
> that could be sizeable, but disk space is cheap.
> 
> 
> Best,
> Erick
> 
> On Mon, Nov 21, 2016 at 8:08 AM, Aristedes Maniatis
> <amania...@apache.org> wrote:
>> After 7-8 years of Solr usage, I'm familiar enough with how it performs as a 
>> full text search index, including spatial coordinates and much more. But for 
>> the most part, we've been returning database ids from Solr rather than a 
>> full record ready to display. We then grab the data and related records from 
>> the database in the usual way and display it.
>>
>> We are thinking now about improving performance of our app. One option is 
>> Redis to store html pieces for reuse, rather than assembling the html from 
>> dozens of queries to the database. We've done what we can with caching at 
>> the ORM level, and we can't do too much with Varnish because of differences 
>> in page rendering per user (e.g. shopping baskets).
>>
>> But we are thinking about storing the rendered html directly in Solr. The 
>> downsides appear to be:
>>
>> * adding 2-10kB of html to each record and the performance hit this might 
>> have on searching and retrieving
>> * additional load of ensuring we rebuild Solr's data every time some part of 
>> that html changes (but this is minimal in our use case)
>> * additional cores that we'll want to add to cache other data that isn't yet 
>> in Solr
>>
>> Is this a reasonable approach to avoid running yet another cluster of 
>> services? Are there downsides to this I haven't thought of? How does Solr 
>> scale with record size?
>>
>>
>>
>> Cheers
>> Ari
>>
>>
>>
>>
>> --
>> -------------------------->
>> Aristedes Maniatis
>> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A


-- 
-------------------------->
Aristedes Maniatis
GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
