Searching isn't really going to be impacted much, if at all. You're
essentially talking about defining a field with stored="true" and
stuffing the HTML into it, right? It will presumably also have
indexed="false" and docValues="false".
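As a minimal sketch (the field name is hypothetical, the attributes are
the ones above), the schema entry would look something like:

    <field name="rendered_html" type="string"
           indexed="false" stored="true" docValues="false"/>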

So.. what that means is that very early in the indexing process, the
raw data is dumped to the segment's *.fdt and *.fdx files (the stored
field data and its index). These are totally irrelevant for querying;
they aren't even read from disk to score the docs. So let's say your
numFound = 10,000 and rows=10. Those 10,000 docs are scored without
having to look at the stored data at all. Only when the 10 docs are
assembled for return is the stored data read off disk, decompressed,
and returned.
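As a hypothetical illustration (the core name and field names are
assumptions), a request like this scores every match but only touches
the stored data for the 10 rows it actually returns:

    curl 'http://localhost:8983/solr/mycore/select?q=title:foo&rows=10&fl=id,rendered_html'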

So the additional costs will be:
1> your index is larger on disk
2> merging etc. will be a bit more costly. This doesn't
     seem like a problem if your index doesn't change all
     that often.
3> there will be some additional load to decompress the data
     and return it.

This is a perfectly reasonable approach. My guess is that any
difference in search speed will be lost in the noise of measuring, and
that the additional load of decompressing will be more than offset by
not having to make a separate service call to actually get the doc.
But as always, measuring the performance is the proof you need.
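One quick sanity check (URLs and field names are assumptions, not a
definitive benchmark): time the same query with and without the stored
HTML in the field list and compare the total response times:

    # baseline: return ids only
    curl -s -o /dev/null -w '%{time_total}\n' \
      'http://localhost:8983/solr/mycore/select?q=*:*&rows=10&fl=id'
    # same query, but also fetch and decompress the stored HTML
    curl -s -o /dev/null -w '%{time_total}\n' \
      'http://localhost:8983/solr/mycore/select?q=*:*&rows=10&fl=id,rendered_html'

Note that Solr's QTime doesn't include writing the response, which is
where the stored fields get read, so wall-clock time is the fairer
comparison here.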

You haven't indicated how _many_ docs you have in your corpus, but a
rough indication of the additional disk space is about half the raw
HTML size; we've usually seen about a 2:1 compression ratio. With a
zillion docs that could be sizeable, but disk space is cheap.
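For a rough back-of-the-envelope (the doc count is a made-up
assumption; the per-doc size is the midpoint of the 2-10kB range you
mention):

    1,000,000 docs x ~6 kB HTML = ~6 GB raw
    at ~2:1 compression         = ~3 GB extra in the *.fdt files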


Best,
Erick

On Mon, Nov 21, 2016 at 8:08 AM, Aristedes Maniatis
<amania...@apache.org> wrote:
> I'm familiar enough, after 7-8 years of Solr usage, with how it performs as a full 
> text search index, including spatial coordinates and much more. But for the 
> most part, we've been returning database ids from Solr rather than a full 
> record ready to display. We then grab the data and related records from the 
> database in the usual way and display it.
>
> We are thinking now about improving the performance of our app. One option is 
> Redis, to store HTML pieces for reuse rather than assembling the HTML from 
> dozens of queries to the database. We've done what we can with caching at the 
> ORM level, and we can't do much with Varnish because of differences in 
> page rendering per user (e.g. shopping baskets).
>
> But we are thinking about storing the rendered HTML directly in Solr. The 
> downsides appear to be:
>
> * adding 2-10kB of HTML to each record, and the performance hit this might 
> have on searching and retrieving
> * the additional load of ensuring we rebuild Solr's data every time some part of 
> that HTML changes (but this is minimal in our use case)
> * additional cores that we'll want to add to cache other data that isn't yet 
> in Solr
>
> Is this a reasonable approach to avoid running yet another cluster of 
> services? Are there downsides to this I haven't thought of? How does Solr 
> scale with record size?
>
>
>
> Cheers
> Ari
>
>
>
>
> --
> -------------------------->
> Aristedes Maniatis
> GPG fingerprint CBFB 84B4 738D 4E87 5E5C  5EFA EF6A 7D2E 3E49 102A
