Re: SOLR OutOfMemoryError Java heap space

2014-03-06 Thread Angel Tchorbadjiiski

Hi Shawn,

a big thanks for the long and detailed answer. I am aware of how Linux 
uses free RAM for caching and of the problems related to the JVM and GC. It 
is nice to hear how this correlates to Solr. I'll take some time and 
think it over. The facet.method=enum and probably a combination of 
DocValues fields could be the solution needed in this case.


Thanks again to both of you and Toke for the feedback!

Cheers
Angel

On 05.03.2014 17:06, Shawn Heisey wrote:

On 3/5/2014 4:40 AM, Angel Tchorbadjiiski wrote:

Hi Shawn,

On 05.03.2014 10:05, Angel Tchorbadjiiski wrote:

Hi Shawn,


It may be your facets that are killing you here.  As Toke mentioned, you
have not indicated what your max heap is.  20 separate facet fields with
millions of documents will use a lot of fieldcache memory if you use the
standard facet.method, fc.

Try adding facet.method=enum to all your facet queries, or you can put
it in the defaults section of each request handler definition.
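
For reference, a minimal sketch of what that defaults entry could look like
in solrconfig.xml (the handler name and the other settings are placeholders,
not taken from this setup):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="facet">true</str>
      <str name="facet.method">enum</str>
    </lst>
  </requestHandler>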

Ok, that is easy to try out.


Changing the facet.method does not really help, as the performance of the
queries is really bad. This is mostly due to the small cache values, but
even trying to tune them for the "enum" case didn't help much.

The number of documents and unique facet values seems to be too high.
Trying to cache them even with a size of 512 results in many misses and
Solr tries to repopulate the cache all the time. This makes the
performance even worse.


Good performance with Solr requires a fair amount of memory.  You have
two choices when it comes to where that memory gets used - inside Solr
in the form of caches, or free memory, available to the operating system
for caching purposes.

Solr caches are really amazing things.  Data gathered for one query can
significantly speed up another query, because part (or all) of that
query can be simply skipped, the results read right out of the cache.

There are two potential problems with relying exclusively on Solr
caches, though.  One is that they require Java heap memory, which
requires garbage collection.  A large heap causes GC issues, some of
which can be alleviated by GC tuning.  The other problem is that you
must actually do a query in order to get the data into the cache.  When
you do a commit and open a new searcher, that cache data goes away, so
you have to do the query over again.

The primary reason for slow uncached queries is disk access.  Reading
index data off the disk is a glacial process, comparatively speaking.
This is where OS disk caching becomes a benefit.  Most queries, even
complex ones, become lightning fast if all of the relevant index data is
already in RAM and no disk access is required.  When queries are fast to
begin with, you can reduce the cache sizes in Solr, reducing the heap
requirements.  With a smaller heap, more memory is available for the OS
disk cache.

The facet.method=enum parameter shifts the RAM requirement from Solr to
the OS.  It does not really reduce the amount of required system memory.
  Because disk caching is a kernel level feature and does not utilize
garbage collection, it is far more efficient than Solr ever could be at
caching *raw* data.  Solr's caches are designed for *processed* data.

What this all boils down to is that I suspect you'll simply need more
memory on the machine.  With facets on so many fields, your queries are
probably touching nearly the entire index, so you'll want to put the
entire index into RAM.

Therefore, after Solr allocates its heap and any other programs on the
system allocate their required memory, you must have enough memory left
over to fit all (or most) of your 50GB index data.  Combine this with
facet.method=enum and everything should be good.




Re: need suggestions for storing TBs of structured data in SolrCloud

2014-03-06 Thread Toke Eskildsen
On Thu, 2014-03-06 at 08:17 +0100, Chia-Chun Shih wrote:
>1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
>2. One collection serves one day. 200-day history data is required.

So once your data are indexed, they will not change? It seems to me that
1 shard/day is a fine choice. Consider optimizing down to a single
segment when a day's data has been indexed.
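
A single call against that day's collection should be enough for the
optimize; a hedged example (host, port and collection name are made up):

  curl 'http://localhost:8983/solr/collection_20140306/update?optimize=true&maxSegments=1'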

It sounds like your indexing needs CPU power, while your searches are
likely to be I/O bound. You might consider a dedicated indexing machine,
if it is acceptable that data only go live when a day's indexing has
been finished (and copied). 

>3. Take less than 10 hours to build one-day index.
>4. Allow to execute an ordinary query (may span 1~7 days) in 10 minutes

Since you want to have 200 days and each day takes about 60GB (guessing
from your test), we're looking at 12TB of index at any time.

At the State and University Library, Denmark, we are building an index
for our web archive. We estimate about 20TB of static index to begin
with. We have done some tests of up to 16*200GB clouded indexes (details
at http://sbdevel.wordpress.com/2013/12/06/danish-webscale/ ) and our
median was about 1200ms for simple queries with light faceting, when we
used a traditional spinning drives backend. That put our estimated
median search time for the full corpus at 10 seconds, which was too slow
for us.

With a response time requirement of 10 minutes, which seems extremely
generous in these sub-second times, I am optimistic that "Just make
daily blocks and put them on traditional storage" will work for you.
Subject to your specific data and queries of course.

If you want a whole other level of performance then use SSDs as your
backend. Especially for your large index scenario, where it is very
expensive to try and compensate for slow spinning drives with RAM. We
designed our search machine around commodity SSDs (Samsung 840) and it
was, relative to data size and performance, dirt cheap.

>5. concurrent user < 10

Our measurements showed that for this amount of data on spinning drives,
throughput was nearly independent of threads: 4 concurrent requests
meant 4 times as long as a single request. YMMV.

Your corpus does represent an extra challenge as it sounds like most of
the indexes will be dormant most of the time. As disk cache favours
often accessed data, I'm guessing that you will get some very ugly
response times when you process one of the rarer queries.

> I have built an experimental SolrCloud based on 3 VMs, each equipped with 8
> cores, 64GB RAM.  Each collection has 3 shards and no replication. Here are
> my findings:
> 
>1. Each collection's actual index size is between 30GB to 90GB,
>depending on the number of stored field.

I'm guessing that 30-90GB is a day's worth of data? How many documents
does a shard contain?

>2. It takes 6 to 12 hours to load raw data. I use multiple (15~30)
>threads to launch http requests. (http://wiki.apache.org/solr/UpdateCSV)

I'm guessing that profiling, tweaking and fiddling will shave the top 2
hours from those numbers.

Regards,
Toke Eskildsen, State and University Library, Denmark




Need help regarding SOLR Fulltext search

2014-03-06 Thread Raman Jhajj
Hello Everyone,

Let me first introduce myself: I am Raman, a Masters of CS student. I
am doing a project for my studies which needs the use of SOLR. For some
reasons I have to use SOLR 4.3.0 for the project.

I am facing an issue with page numbers in the search result. I came across a
workaround for that, https://issues.apache.org/jira/browse/SOLR-380, but this
one is quite old and is not working now with 4.3.0.

I need some help regarding this, if anyone can please help me out. I am new
to SOLR and don't know much.

My particular issue is, we have several manuals of research data which are
in TEI format. I have indexed them for full-text search and highlighting.
Everything is working fine so far. Now, when I show results in the webpage
and a particular result is clicked, I need to move to that particular page in
the TEI-formatted document and display it. I can display the whole document by
just giving an anchor tag for the document, but I am not sure how to move
to a specific page. I am a bit confused about how I can implement a solution for
this. If anyone has worked on something like this, can you please guide me?
I am not finding any way out.

-- 
Kind Regards,

*Ramaninder Singh Jhajj*


Re: Need help regarding SOLR Fulltext search

2014-03-06 Thread Ahmet Arslan
Hi Raman,

I did similar project, this is how :

1) Index page by page. The Solr document (unit of retrieval) will be a page. You can 
generate a uniqueKey by concatenating docId and pageNo => doc50_page0. With 
this you will have the page number information. (A small SolrJ sketch follows the links below.) 

2) Later on you can group by document_id with 
https://wiki.apache.org/solr/FieldCollapsing

OR 

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-CollapsingQueryParser
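
A minimal SolrJ sketch of option 1 (the field names, the page splitting, and
the collection URL are assumptions, not part of Ahmet's actual setup):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class TeiPageIndexer {
    public static void main(String[] args) throws Exception {
      SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
      String docId = "doc50";                                  // id of the whole TEI manual
      String[] pages = {"page 0 text ...", "page 1 text ..."}; // one entry per TEI page
      for (int pageNo = 0; pageNo < pages.length; pageNo++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docId + "_page" + pageNo);          // uniqueKey = docId + pageNo
        doc.addField("document_id", docId);                    // used later for grouping/collapsing
        doc.addField("page_no", pageNo);
        doc.addField("content", pages[pageNo]);
        server.add(doc);
      }
      server.commit();
      server.shutdown();
    }
  }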

Ahmet



On Thursday, March 6, 2014 1:06 PM, Raman Jhajj  wrote:
Hello Everyone,

Let me first introduce myself: I am Raman, a Masters of CS student. I
am doing a project for my studies which needs the use of SOLR. For some
reasons I have to use SOLR 4.3.0 for the project.

I am facing an issue with page numbers in the search result. I came across a
workaround for that, https://issues.apache.org/jira/browse/SOLR-380, but this
one is quite old and is not working now with 4.3.0.

I need some help regarding this, if anyone can please help me out. I am new
to SOLR and don't know much.

My particular issue is, we have several manuals of research data which are
in TEI format. I have indexed them for full-text search and highlighting.
Everything is working fine so far. Now, when I show results in the webpage
and a particular result is clicked, I need to move to that particular page in
the TEI-formatted document and display it. I can display the whole document by
just giving an anchor tag for the document, but I am not sure how to move
to a specific page. I am a bit confused about how I can implement a solution for
this. If anyone has worked on something like this, can you please guide me?
I am not finding any way out.

-- 
Kind Regards,

*Ramaninder Singh Jhajj*



Re: Min Number Should Match (mm) and joins

2014-03-06 Thread mm

Any suggestions?


Quoting m...@preselect-media.com:


Hello,

I'm using eDisMax to do scoring for my search results.
I have a nested structure of documents. The main (parent) document  
with meta data and the child documents with fulltext content. So I  
have to join them.


My qf looks like this "title^40.0 subtitle^40.0 original_title^10.0  
keywords^5.0" and my query like this:


test _query_:"{!join from=doc_id to=id}{!dismax qf='content  
content_de content_en content_fr content_it content_es' v='test'}"^5


If I use the mm attribute, for example with something like "2<-25%", I  
get all results where the term "test" was found in the metadata of  
the main document. My problem now: I have documents where the  
search term is not in the metadata of the main document, but in the  
fulltext content of the child document. If I use mm, these results  
are never shown if the term is not in the main document. If I don't  
use mm at all, I get strange results with documents that don't  
contain the term at all.


Is there a solution for this problem?

Thx
- Moritz






Mixing lucene scoring and other scoring

2014-03-06 Thread Benson Margulies
Some months ago, I talked to some people at LR about this, but I can't
find my notes.

Imagine a function of some fields that produces a score between 0 and 1.

Imagine that you want to combine this score with relevance over some
more or less complex ordinary query.

What are the options, given the arbitrary nature of Lucene scores?


Re: Polygon search returning "Invalid Number" error.

2014-03-06 Thread leevduhl
My bad, I think this error was actually a result of using the Solr Admin
utility to query the index and the query I entered included the double
quotes.

However, this left me with a different error that I may post a question
about if I cannot figure it out.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Polygon-search-returning-Invalid-Number-error-tp4121189p4121677.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Mixing lucene scoring and other scoring

2014-03-06 Thread Otis Gospodnetic
Hi Benson,

http://lucene.apache.org/core/4_7_0/expressions/org/apache/lucene/expressions/Expression.html
https://issues.apache.org/jira/browse/SOLR-5707

That?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
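
Another common option (not necessarily what Benson has in mind) is to store
the precomputed 0-1 value in a field and fold it into the score with a
function, e.g. via the boost query parser; a hedged sketch, assuming a float
field named quality_score:

  q={!boost b=quality_score v=$qq}&qq=(your complex ordinary query)

or, with edismax, a multiplicative boost=quality_score parameter. Since raw
Lucene scores are unbounded, multiplying by the 0-1 value keeps the relative
ordering within a query instead of trying to add two incompatible scales.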


On Thu, Mar 6, 2014 at 8:34 AM, Benson Margulies wrote:

> Some months ago, I talked to some people at LR about this, but I can't
> find my notes.
>
> Imagine a function of some fields that produces a score between 0 and 1.
>
> Imagine that you want to combine this score with relevance over some
> more or less complex ordinary query.
>
> What are the options, given the arbitrary nature of Lucene scores?
>


Re: Replicating Between Solr Clouds

2014-03-06 Thread perdurabo
Toby Lazar wrote
> Unless Solr is your system of record, aren't you already replicating your
> source data across the WAN?  If so, could you load Solr in colo B from
> your colo B data source?  You may be duplicating some indexing work, but
> at least your colo B Solr would be more closely in sync with your colo B
> data.

Our system of record exists in a SQL DB that is indeed replicated via
always-on mirroring to the failover data center.  However, a complete forced
re-index of all of the data could take hours and our SLA requires us to be
back up with searchable indices in minutes.  Because we may have to
replicate multiple data centers' data (three plus data centers, A, B and the
failover DC) into this failover data center, we can't dedicate the failover
data center's SolrCloud to constantly re-index data from a single SQL mirror
when we could potentially need it to take over for any given one. 

One thought we had was to have a situation where DCs A and B would run a
cron job that would force a backup of the indices using the
"replication?command=backup" API command, and then we would sync those
backup snapshots to the failover DC's shut-down SolrCloud instance, in a
separate filesystem directory dedicated to DC A's or DC B's indices.  Then,
in the case of a failover, we would have to run a script that would symlink
the snapshots for the particular DC we want to fail over for to the index dir
for the failover DC's SolrCloud and then start up the nodes.  The problem
comes with how to handle different indices on different nodes in the
SolrCloud when we have 2 shards.  We would have to do a 1:1 copy of each of
the four nodes in DCs A and B to each of the other nodes in the failover DC. 
Sounds pretty ugly.

Looking at this thread, even this plan may not work:
http://lucene.472066.n3.nabble.com/solrcloud-shards-backup-restoration-td4088447.html

As far as the SolrEntityProcessor, I'm not sure how you would configure it. 
From what I gather, you have to configure a new requestHandler section in
your solrconfig.xml like this:



<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/data/solr/mysolr/conf/data-config.xml</str>
  </lst>
</requestHandler>

And then you have to configure a "/data/solr/mysolr/conf/data-config.xml"
with the following contents:

<dataConfig>
  <document>
    <entity processor="SolrEntityProcessor"
            url="http://solrsource.example.com:8983/solr/" query="*:*"/>
  </document>
</dataConfig>


However, this doesn't seem to work for me as I'm using a SolrCloud with
zookeeper.  I created these files in my conf directory and uploaded them to
zookeeper, then reloaded the collection/cores but all I got were
initialization errors.  I don't think the docs assume you'll be doing this
under a SolrCloud scenario.

Any other insight?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4121685.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing huge data

2014-03-06 Thread Rallavagu

Erick,

That helps so I can focus on the problem areas. Thanks.

On 3/5/14, 6:03 PM, Erick Erickson wrote:

Here's the easiest thing to try to figure out where to
concentrate your energies. Just comment out the
server.add call in your SolrJ program. Well, and any
commits you're doing from SolrJ.

My bet: Your program will run at about the same speed
it does when you actually index the docs, indicating that
your problem is in the data acquisition side. Of course
the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running
Solr. I often see it idling along < 30% when indexing, or
even < 10%, again indicating that the bottleneck is on the
acquisition side.
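
A hedged sketch of that test in SolrJ (the data-fetching part is a
placeholder for whatever currently pulls the 6 million records):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import java.util.Collections;
  import java.util.List;

  public class IndexBenchmark {
    public static void main(String[] args) throws Exception {
      SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
      for (String[] row : fetchRows()) {            // the data acquisition side
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", row[0]);
        doc.addField("title", row[1]);
        // server.add(doc);  // commented out: if the run is still slow, the
        //                   // bottleneck is data acquisition, not Solr
      }
      // server.commit();    // also skipped for the test
      server.shutdown();
    }

    // stand-in for the real code that reads from the db and the other sources
    static List<String[]> fetchRows() {
      return Collections.emptyList();
    }
  }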

Note I haven't mentioned any solutions, I'm a believer in
identifying the _problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky  wrote:

Make sure you're not doing a commit on each individual document add. Commit
every few minutes or every few hundred or few thousand documents is
sufficient. You can set up auto commit in solrconfig.xml.
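
A hedged example of such an autoCommit block in solrconfig.xml (the interval
values are arbitrary):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>300000</maxTime>            <!-- hard commit every 5 minutes -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>60000</maxTime>             <!-- soft commit every minute for visibility -->
    </autoSoftCommit>
  </updateHandler>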

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best practices/common practices to index/re-index huge
amounts of data in Solr. The data is about 6 million entries in the db
and other sources (the data is not located in one resource). I am trying a
solrj-based solution to collect data from different resources and index
it into Solr. It takes hours to index into Solr.

Thanks in advance


Re: Indexing huge data

2014-03-06 Thread Rallavagu
Yeah. I have thought about spitting out JSON and running it against Solr 
using parallel HTTP threads separately. Thanks.


On 3/5/14, 6:46 PM, Susheel Kumar wrote:

One more suggestion is to collect/prepare the data in CSV format (a 1-2 million document sample, 
depending on size) and then import the data directly into Solr using the CSV handler & curl.  
This will give you the pure indexing time & the differences.
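
A hedged example of that kind of direct CSV load (URL, file name and
parameters are assumptions):

  curl 'http://localhost:8983/solr/collection1/update/csv?commit=true&separator=%2C&header=true' \
       --data-binary @sample.csv -H 'Content-type: text/csv; charset=utf-8'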

Thanks,
Susheel

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data

Here's the easiest thing to try to figure out where to concentrate your 
energies. Just comment out the server.add call in your SolrJ program. Well, 
and any commits you're doing from SolrJ.

My bet: Your program will run at about the same speed it does when you actually 
index the docs, indicating that your problem is in the data acquisition side. 
Of course the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running Solr. I often see it idling 
along < 30% when indexing, or even < 10%, again indicating that the bottleneck 
is on the acquisition side.

Note I haven't mentioned any solutions, I'm a believer in identifying the 
_problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky  wrote:

Make sure you're not doing a commit on each individual document add.
Commit every few minutes or every few hundred or few thousand
documents is sufficient. You can set up auto commit in solrconfig.xml.

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best practices/common practices to index/re-index huge
amounts of data in Solr. The data is about 6 million entries in the db
and other sources (the data is not located in one resource). I am trying a
solrj-based solution to collect data from different resources and
index it into Solr. It takes hours to index into Solr.

Thanks in advance


Polygon search returning "InvalidShapeException: incompatible dimension (2)... error.

2014-03-06 Thread leevduhl
Getting the following error when attempting to run a polygon query from the
Solr Admin utility: :"com.spatial4j.core.exception.InvalidShapeException:
incompatible dimension (2) and values (Intersects).  Only 0 values
specified",
"code":400

My query is as follows:
q=geoloc:Intersects(POLYGON((-83.6349 42.4718, -83.5096 42.471868, -83.5096
42.4338, -83.6349 42.4338, -83.6349 42.4718)))

The response is as follows:
{
  "responseHeader":{
"status":400,
"QTime":2,
"params":{
  "debugQuery":"true",
  "fl":"id, openhousestartdate, geoloc",
  "sort":"openhousestartdate desc",
  "indent":"true",
  "q":"geoloc:Intersects(POLYGON((83.6349 42.4718, 83.5096 42.471868,
83.5096 42.4338, 83.6349 42.4338, 83.6349 42.4718)))",
  "wt":"json"}},
  "error":{
"msg":"com.spatial4j.core.exception.InvalidShapeException: incompatible
dimension (2) and values (Intersects).  Only 0 values specified",
"code":400}}

My "geoloc" dimension/field is setup as follows in my Schema.xml:


Some sample document "geoloc" data is shown below.
"docs": [
  {
"geoloc": "-82.549200,43.447400"
  },
  {
"geoloc": "-82.671551,43.421797"
  }
]

My Solr version info is as follows:
solr-spec: 4.6.1
solr-impl: 4.6.1 1560866 - mark - 2014-01-23 20:21:50
lucene-spec: 4.6.1
lucene-impl: 4.6.1 1560866 - mark - 2014-01-23 20:11:13

Any info on a solution to this problem would be appreciated.

Thanks
Lee



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Polygon-search-returning-InvalidShapeException-incompatible-dimension-2-error-tp4121704.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: need suggestions for storing TBs of structured data in SolrCloud

2014-03-06 Thread Shawn Heisey
On 3/6/2014 12:17 AM, Chia-Chun Shih wrote:
> I am planning a system for searching TB's of structured data in SolrCloud.
> I need suggestions for handling such huge amount of data in SolrCloud.
> (e.g., number of shards per collection, number of nodes, etc.)
> 
> Here are some specs of the system:
> 
>1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
>2. One collection serves one day. 200-day history data is required.
>3. Take less than 10 hours to build one-day index.
>4. Allow to execute an ordinary query (may span 1~7 days) in 10 minutes
>5. concurrent user < 10
> 
> I have built an experimental SolrCloud based on 3 VMs, each equipped with 8
> cores, 64GB RAM.  Each collection has 3 shards and no replication. Here are
> my findings:
> 
>1. Each collection's actual index size is between 30GB to 90GB,
>depending on the number of stored field.
>2. It takes 6 to 12 hours to load raw data. I use multiple (15~30)
>threads to launch http requests. (http://wiki.apache.org/solr/UpdateCSV)

Nobody can give you any specific answers because there are simply too
many variables:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

You do have one unusually loose restriction there -- that the query must
take less than 10 minutes.  Most people tend to say that it must take
less than a second, but they'll settle for several seconds.  Almost any
reasonable way you could architect your system will probably take less
than ten minutes for a query.

With this much data and potentially a LOT of servers, you might run into
limits that require config changes to address.  Things like the thread
limits on the servlet container, connection limits on the shard handler
in Solr, etc.

These blog posts (there are two pages of them) may interest you:

http://www.hathitrust.org/blogs/large-scale-search

One thing that I can tell you is that the more RAM you can get your
hands on, the better it will perform.  Ideally you'd have as much free
memory across the whole system as the entire size of your Solr indexes.
 The problem with this idea for you is that with 200 collections
averaging 60GB, that's about twelve terabytes of memory across all your
servers -- for one single copy of the index.  You'll probably want at
least two copies, so you can survive at least one hardware failure.  If
you can't get enough RAM to cache the whole index, putting the index
data on SSD can make a MAJOR difference.

Some strong advice: do everything you can to reduce the size of your
index, which reduces the OS disk cache (RAM) requirements.  Don't store
all your fields.  Use less aggressive tokenization where possible.
Avoid termVectors and docValues unless they are actually needed.  Omit
anything you can -- term frequencies, positions, norms, etc.
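
A hedged illustration of those schema-level savings (the field names and
types are placeholders):

  <!-- searched but not stored, no norms, no term frequencies/positions -->
  <field name="body" type="text_general" indexed="true" stored="false"
         omitNorms="true" omitTermFreqAndPositions="true"/>
  <!-- stored for display only, never searched -->
  <field name="raw_record" type="string" indexed="false" stored="true"/>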

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: SOLR OutOfMemoryError Java heap space

2014-03-06 Thread Divyang Shah
hi,
the heap problem is due to memory being full.
you should remove unnecessary data and restart the server once.



On Thursday, 6 March 2014 10:39 AM, Angel Tchorbadjiiski 
 wrote:
 
Hi Shawn,

a big thanks for the long and detailed answer. I am aware of how Linux 
uses free RAM for caching and of the problems related to the JVM and GC. It 
is nice to hear how this correlates to Solr. I'll take some time and 
think it over. The facet.method=enum and probably a combination of 
DocValues fields could be the solution needed in this case.

Thanks again to both of you and Toke for the feedback!

Cheers
Angel


On 05.03.2014 17:06, Shawn Heisey wrote:
> On 3/5/2014 4:40 AM, Angel Tchorbadjiiski wrote:
>> Hi Shawn,
>>
>> On 05.03.2014 10:05, Angel Tchorbadjiiski wrote:
>>> Hi Shawn,
>>>
 It may be your facets that are killing you here.  As Toke mentioned, you
 have not indicated what your max heap is.  20 separate facet fields with
 millions of documents will use a lot of fieldcache memory if you use the
 standard facet.method, fc.

 Try adding facet.method=enum to all your facet queries, or you can put
 it in the defaults section of each request handler definition.
>>> Ok, that is easy to try out.
>>>
>> Changing the facet.method does not help really as the performance of the
>> queries is really bad. This lies mostly on the small cache values, but
>> even trying to tune them for the "enum" case didn't help much.
>>
>> The number of documents and unique facet values seems to be too high.
>> Trying to cache them even with a size of 512 results in many misses and
>> Solr tries to repopulate the cache all the time. This makes the
>> performances even worse.
>
> Good performance with Solr requires a fair amount of memory.  You have
> two choices when it comes to where that memory gets used - inside Solr
> in the form of caches, or free memory, available to the operating system
> for caching purposes.
>
> Solr caches are really amazing things.  Data gathered for one query can
> significantly speed up another query, because part (or all) of that
> query can be simply skipped, the results read right out of the cache.
>
> There are two potential problems with relying exclusively on Solr
> caches, though.  One is that they require Java heap memory, which
> requires garbage collection.  A large heap causes GC issues, some of
> which can be alleviated by GC tuning.  The other problem is that you
> must actually do a query in order to get the data into the cache.  When
> you do a commit and open a new searcher, that cache data goes away, so
> you have to do the query over again.
>
> The primary reason for slow uncached queries is disk access.  Reading
> index data off the disk is a glacial process, comparatively speaking.
> This is where OS disk caching becomes a benefit.  Most queries, even
> complex ones, become lightning fast if all of the relevant index data is
> already in RAM and no disk access is required.  When queries are fast to
> begin with, you can reduce the cache sizes in Solr, reducing the heap
> requirements.  With a smaller heap, more memory is available for the OS
> disk cache.
>
> The facet.method=enum parameter shifts the RAM requirement from Solr to
> the OS.  It does not really reduce the amount of required system memory.
>   Because disk caching is a kernel level feature and does not utilize
> garbage collection, it is far more efficient than Solr ever could be at
> caching *raw* data.  Solr's caches are designed for *processed* data.
>
> What this all boils down to is that I suspect you'll simply need more
> memory on the machine.  With facets on so many fields, your queries are
> probably touching nearly the entire index, so you'll want to put the
> entire index into RAM.
>
> Therefore, after Solr allocates its heap and any other programs on the
> system allocate their required memory, you must have enough memory left
> over to fit all (or most) of your 50GB index data.  Combine this with
> facet.method=enum and everything should be good.

Re: Replicating Between Solr Clouds

2014-03-06 Thread Shawn Heisey
On 3/6/2014 7:54 AM, perdurabo wrote:
> Toby Lazar wrote
>> Unless Solr is your system of record, aren't you already replicating your
>> source data across the WAN?  If so, could you load Solr in colo B from
>> your colo B data source?  You may be duplicating some indexing work, but
>> at least your colo B Solr would be more closely in sync with your colo B
>> data.
> 
> Our system of record exists in a SQL DB that is indeed replicated via
> always-on mirroring to the failover data center.  However, a complete forced
> re-index of all of the data could take hours and our SLA requires us to be
> back up with searchable indices in minutes.  Because we may have to
> replicate multiple data centers' data (three plus data centers, A, B and the
> failover DC) into this failover data center, we can't dedicate the failover
> data center's SolrCloud to constantly re-index data from a single SQL mirror
> when we could potentially need it to take over for any given one. 

There are a lot of issues with availability and multiple data centers
that must be addressed before SolrCloud can handle this all internally.

Until that day comes, here's what I would do:

Have a SolrCloud install at each online data center, just as you already
do.  It should have collection names that are unique to the functions of
that DC, and may include the DC name.  If you MUST have the same
collection name in all online data centers despite there being different
data, you can use collection aliasing.  The actual collection name would
be something like stuff_dca, but you'd have an alias called stuff that
can be used for both indexing and querying.
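
A hedged example of setting up such an alias with the Collections API (host
and names are made up):

  curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=stuff&collections=stuff_dca'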

You would need to index the data for all data centers to the SolrCloud
install at the failover DC.  Ideally that would be done from the
failover DC's SQL, not over the WAN ... but it really wouldn't matter.
Because each production DC collection will have a unique name, all
collections can coexist on the failover SolrCloud.  If a failover
becomes necessary, you can make or change any required collection
aliases on the fly.

Although I don't use SolrCloud, and I don't have multiple data centers,
my own index uses a similar paradigm.  I have two completely independent
copies of my index.  My indexing program knows about them both and
indexes them independently.

There is another benefit to this: I can make changes (Solr upgrades, new
config/schema, a complete rebuild, etc.) to one copy of my index without
affecting the search application at all.  By simply enabling or
disabling the ping handler in Solr, my load balancer will keep requests
going to whichever copy I choose.

Thanks,
Shawn



Race condition in Leader Election

2014-03-06 Thread KNitin
Hi

 When restarting a node in SolrCloud, I run into scenarios where both the
replicas for a shard get into "recovering" state and never come up causing
the error "No servers hosting this shard". To fix this, I either unload one
core or restart one of the nodes again so that one of them becomes the
leader.

Is there a way to "force" leader election for a shard for solrcloud? Is
there a way to break ties automatically (without restarting nodes) to make
a node the leader for the shard?


Thanks
Nitin


SolrCloud setup guidance

2014-03-06 Thread Priti Solanki
Hello Everyone,

I would like to take your guidance on the following:

I have a single core with 124 GB of index data size. Indexing and reading
are both very slow, as I have 7 GB RAM to support this huge data.  Almost 8
million documents.

Hence, we thought of going to SolrCloud so that we can accommodate more
upcoming data. I have data for 13 countries with their millions of products
and we want to set up SolrCloud for the same.

I am in need of some initial thoughts about how to set up SolrCloud for such a
requirement. How do we come to know how many nodes/cores I would need
to support this...

We are thinking of hosting this on Amazon... Any guidance, reading links,
or case studies will be highly appreciated.

Regards,


Re: Replicating Between Solr Clouds

2014-03-06 Thread perdurabo
Well, I think I finally figured out how to get SolrEntityProcessor to work,
but there are still some issues.  I had to add a library path to
solrconfig.xml, but the cores are finally coming up and I am now manually
able to run a data import that does seem to index all of the documents on
the remote SolrCloud.  I ran into the issue here where I got version
conflicts:

http://lucene.472066.n3.nabble.com/Version-conflict-during-data-import-from-another-Solr-instance-into-clean-Solr-td4046937.html

I used the suggestion of adding fl="*,old_version:_version_" to the
data-config.xml entity config line.  This seems to be working but I don't
know if this will cause a problem.  When I do a manual data import I get the
correct number of documents from the source SolrCloud (the total number of
docs added up between both shards is 6357 in this test case):

Indexing completed. Added/Updated: 6,357 documents. Deleted 0 documents.
(Duration: 22s)
Requests: 0 (0/s), Fetched: 6,357 (289/s), Skipped: 0, Processed: 6,357 

However, when I check the number of docs indexed for each shard in the core
admin UI on the destination SolrCloud, the numbers are way off and a lot
less than 6357.  There's nothing in the logs to indicate collisions or
dropped documents.  What could account for the disparity?

I would assume down the road what I need to do is configure multiple
collections/cores on the failover cluster representing each DC its
replicating from, but how would you create multiple collections when using
zookeeper?  How do you upload multiple sets of config files for each one and
keep them separate?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4121737.html
Sent from the Solr - User mailing list archive at Nabble.com.


Date Range Query taking more time.

2014-03-06 Thread Vijay Kokatnur
I am working with a date range query that is not giving me fast response
times.  After modifying the date range construct based on advice from several
forums, response time is now around 200ms, down from 2-3 secs.

However, I was wondering if there is still some way to improve upon it, as
queries without the date range have around 2-10ms latency.

Query : To look up upcoming booked trips for a user whenever he logs in to
the app-

q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
AND ClientID:4 AND StartDate:[NOW/DAY TO NOW/DAY+1YEAR]

Date configuration in Schema :

 


Appreciate any inputs.

Thanks!


Re: Date Range Query taking more time.

2014-03-06 Thread Ahmet Arslan
Hi,

Since your range query has NOW in it, it won't be cached meaningfully.
http://solr.pl/en/2012/03/05/use-of-cachefalse-and-cost-parameters/

This is untested but can you try this?

&q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
&fq=Status:Booked
&fq=ClientID:4
&fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]




On Thursday, March 6, 2014 8:29 PM, Vijay Kokatnur  
wrote:
I am working with date range query that is not giving me faster response
times.  After modifying date range construct after reading several forums,
response time now is around 200ms, down from 2-3secs.

However, I was wondering if there still some way to improve upon it as
queries without date range have around 2-10ms latency,

Query : To look up upcoming booked trips for a user whenever he logs in to
the app-

q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
ANDClientID:4 AND  StartDate:[NOW/DAY TO NOW/DAY+1YEAR]

Date configuration in Schema :




Appreciate any inputs.

Thanks!



Re: Date Range Query taking more time.

2014-03-06 Thread Vijay Kokatnur
Ahmet, I have tried filter queries before to fine-tune query performance.

However, whenever we use filter queries the response time goes up and
stays there.  With the above change, the response time was consistently
around 4-5 secs.  We are using the default cache settings.

Are there any settings I missed?


On Thu, Mar 6, 2014 at 10:44 AM, Ahmet Arslan  wrote:

> Hi,
>
> Since your range query has NOW in it, it won't be cached meaningfully.
> http://solr.pl/en/2012/03/05/use-of-cachefalse-and-cost-parameters/
>
> This is untested but can you try this?
>
> &q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
> &fq=Status:Booked
> &fq=ClientID:4
> &fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
>
>
>
> On Thursday, March 6, 2014 8:29 PM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
> I am working with date range query that is not giving me faster response
> times.  After modifying date range construct after reading several forums,
> response time now is around 200ms, down from 2-3secs.
>
> However, I was wondering if there still some way to improve upon it as
> queries without date range have around 2-10ms latency,
>
> Query : To look up upcoming booked trips for a user whenever he logs in to
> the app-
>
> q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
> ANDClientID:4 AND  StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
> Date configuration in Schema :
>
> 
>  positionIncrementGap="0"/>
>
> Appreciate any inputs.
>
> Thanks!
>
>


Re: Race condition in Leader Election

2014-03-06 Thread Mark Miller
Are you using an old version?

- Mark

http://about.me/markrmiller

On Mar 6, 2014, at 11:50 AM, KNitin  wrote:

> Hi
> 
> When restarting a node in solrcloud, i run into scenarios where both the
> replicas for a shard get into "recovering" state and never come up causing
> the error "No servers hosting this shard". To fix this, I either unload one
> core or restart one of the nodes again so that one of them becomes the
> leader.
> 
> Is there a way to "force" leader election for a shard for solrcloud? Is
> there a way to break ties automatically (without restarting nodes) to make
> a node as the leader for the shard?
> 
> 
> Thanks
> Nitin



Re: Date Range Query taking more time.

2014-03-06 Thread Ahmet Arslan
Hi,

Did you try with non-cached filter queries before?
Cached filter queries are useful when they are re-used. How often do you commit?

I thought that we could do something if we disable caching of the filter queries and 
manipulate their execution order with the cost parameter.

What happens with this :
&q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
&fq={!cache=false cost=100}Status:Booked
&fq={!cache=false cost=50}ClientID:4

&fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR] 



On Thursday, March 6, 2014 9:15 PM, Vijay Kokatnur  
wrote:
Ahmet, I have tried filter queries before to fine tune query performance.

However, whenever we use filter queries the response time goes up and
remains there.  With above change, the response time was consistently
around 4-5 secs.  We are using the default cache settings.

Is there any settings I missed?



On Thu, Mar 6, 2014 at 10:44 AM, Ahmet Arslan  wrote:

> Hi,
>
> Since your range query has NOW in it, it won't be cached meaningfully.
> http://solr.pl/en/2012/03/05/use-of-cachefalse-and-cost-parameters/
>
> This is untested but can you try this?
>
> &q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
> &fq=Status:Booked
> &fq=ClientID:4
> &fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
>
>
>
> On Thursday, March 6, 2014 8:29 PM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
> I am working with date range query that is not giving me faster response
> times.  After modifying date range construct after reading several forums,
> response time now is around 200ms, down from 2-3secs.
>
> However, I was wondering if there still some way to improve upon it as
> queries without date range have around 2-10ms latency,
>
> Query : To look up upcoming booked trips for a user whenever he logs in to
> the app-
>
> q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
> ANDClientID:4 AND  StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
> Date configuration in Schema :
>
> 
>  positionIncrementGap="0"/>
>
> Appreciate any inputs.
>
> Thanks!
>
>



Re: Date Range Query taking more time.

2014-03-06 Thread Vijay Kokatnur
That did the trick Ahmet.  The first response was around 200ms, but the
subsequent queries were around 2-5ms.

I tried this

&q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
&fq={!cache=false cost=100}Status:Booked
&fq={!cache=false cost=50}ClientID:4
&fq={!cache=false cost=50}[NOW/DAY TO NOW/DAY+1YEAR]



On Thu, Mar 6, 2014 at 11:49 AM, Ahmet Arslan  wrote:

> Hi,
>
> Did you try with non-cached filter quries before?
> cached Filter queries are useful when they are re-used. How often do you
> commit?
>
> I thought that we can do something if we disable cache filter queries and
> manipulate their execution order with cost parameter.
>
> What happens with this :
> &q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
> &fq={!cache=false cost=100}Status:Booked
> &fq={!cache=false cost=50}ClientID:4
>
> &fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
>
>
> On Thursday, March 6, 2014 9:15 PM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
> Ahmet, I have tried filter queries before to fine tune query performance.
>
> However, whenever we use filter queries the response time goes up and
> remains there.  With above change, the response time was consistently
> around 4-5 secs.  We are using the default cache settings.
>
> Is there any settings I missed?
>
>
>
> On Thu, Mar 6, 2014 at 10:44 AM, Ahmet Arslan  wrote:
>
> > Hi,
> >
> > Since your range query has NOW in it, it won't be cached meaningfully.
> > http://solr.pl/en/2012/03/05/use-of-cachefalse-and-cost-parameters/
> >
> > This is untested but can you try this?
> >
> > &q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
> > &fq=Status:Booked
> > &fq=ClientID:4
> > &fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
> >
> >
> >
> >
> > On Thursday, March 6, 2014 8:29 PM, Vijay Kokatnur <
> > kokatnur.vi...@gmail.com> wrote:
> > I am working with date range query that is not giving me faster response
> > times.  After modifying date range construct after reading several
> forums,
> > response time now is around 200ms, down from 2-3secs.
> >
> > However, I was wondering if there still some way to improve upon it as
> > queries without date range have around 2-10ms latency,
> >
> > Query : To look up upcoming booked trips for a user whenever he logs in
> to
> > the app-
> >
> > q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
> > ANDClientID:4 AND  StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
> >
> > Date configuration in Schema :
> >
> > 
> >  > positionIncrementGap="0"/>
> >
> > Appreciate any inputs.
> >
> > Thanks!
> >
> >
>
>


hung threads and CLOSE_WAIT sockets

2014-03-06 Thread Avishai Ish-Shalom
Hi,

We've had a strange mishap with a solr cloud cluster (version 4.5.1) where
we observed high search latency. The problem appears to develop over
several hours until such point where the entire cluster stopped responding
properly.

After investigation we found that the number of threads (both solr and
jetty) gradually rose over several hours until it hit the maximum allowed
at which point the cluster stopped responding properly. After restarting
several nodes the number of threads dropped and the cluster started
responding again.
We've examined nodes that were not restarted and found a high number of
CLOSE_WAIT sockets held by the solr process; these sockets were using a
random local port and 8983 remote port - meaning they were outgoing
connections. A thread dump did not show a large number of solr threads and
we were unable to determine which thread(s) are holding these sockets.

has anyone else encountered such a situation?

Regards,
Avishai


Re: Polygon search returning "InvalidShapeException: incompatible dimension (2)... error.

2014-03-06 Thread leevduhl
Ok, I think the issue here is that I need to install the JTS library.  I will
have that done and try again.

Lee
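
For reference, a hedged sketch of a JTS-backed spatial type for a Solr 4.6
schema.xml (the names and tuning values are assumptions; the JTS jar has to
be on Solr's classpath):

  <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
             spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
             geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>
  <field name="geoloc" type="location_rpt" indexed="true" stored="true"/>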



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Polygon-search-returning-InvalidShapeException-incompatible-dimension-2-error-tp4121704p4121796.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Filter Cache Size

2014-03-06 Thread Otis Gospodnetic
What Erick said.  That's a giant Filter Cache.  Have a look at these Solr
metrics and note the Filter Cache in the middle:
http://www.flickr.com/photos/otis/8409088080/

Note how small the cache is and how high the hit rate is.  Those are stats
for http://search-lucene.com/ and http://search-hadoop.com/ where you can
see facets on the right that end up being used as filter queries.  Most
Solr apps I've seen had small Filter Caches.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 3:34 PM, Erick Erickson wrote:

> This, BTW, is an ENORMOUS number of cached queries.
>
> Here's a rough guide:
> Each entry will be (length of query) + maxDoc/8 bytes long.
>
> Think of the filterCache as a map where the key is the query
> and the value is a bitmap large enough to hold maxDoc bits.
>
> BTW, I'd kick this back to the default (512?) and periodically check
> it with the admin>>plugins/stats page to see what kind of hit ratio
> I have and adjust from there.
>
> Best,
> Erick
>
> On Mon, Mar 3, 2014 at 11:00 AM, Benjamin Wiens
>  wrote:
> > How can we calculate how much heap memory the filter cache will consume?
> We
> > understand that in order to determine a good size we also need to
> evaluate
> > how many filterqueries would be used over a certain time period.
> >
> >
> >
> > Here's our setting:
> >
> >
> >
> > <filterCache class="solr.FastLRUCache"
> >              size="300000"
> >              initialSize="300000"
> >              autowarmCount="50000"/>
> >
> >
> >
> > According to the post below, 53 GB of RAM would be needed just by the
> > filter cache alone with 1.4 million Docs. Not sure if this true and how
> > this would work.
> >
> >
> >
> > Reference:
> >
> http://stackoverflow.com/questions/2004/solr-filter-cache-fastlrucache-takes-too-much-memory-and-results-in-out-of-mem
> >
> >
> >
> > We filled the filterquery cache with Solr Meter and had a JVM Heap Size
> of
> > far less than 53 GB.
> >
> >
> >
> > Can anyone chime in and enlighten us?
> >
> >
> >
> > Thank you!
> >
> >
> > Ben Wiens & Benjamin Mosior
>
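
As a rough sanity check of Erick's formula quoted above: assuming maxDoc is
about 1.4 million, each filterCache entry needs roughly 1,400,000 / 8 =
175,000 bytes (about 171 KB) for its bitmap, so a cache sized at a few
hundred thousand entries lands in the tens of gigabytes (for example,
300,000 entries x ~171 KB is roughly 50 GB), which is consistent with the
~53 GB figure from the Stack Overflow post.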


RE: SolrCloud setup guidance

2014-03-06 Thread Susheel Kumar
Setting up SolrCloud (horizontal scaling) is definitely a good idea for this 
big index, but before going to SolrCloud, are you able to upgrade your single 
node to 128GB of memory (vertical scaling) to see the difference?

Thanks,
Susheel

-Original Message-
From: Priti Solanki [mailto:pritiatw...@gmail.com] 
Sent: Thursday, March 06, 2014 10:51 AM
To: solr-user@lucene.apache.org
Subject: SolrCloud setup guidance

Hello Everyone,

I would like to take your guidance on the following:

I have a single core with 124 GB of index data size. Indexing and reading are both 
very slow, as I have 7 GB RAM to support this huge data.  Almost 8 million 
documents.

Hence, we thought of going to SolrCloud so that we can accommodate more 
upcoming data. I have data for 13 countries with their millions of products and 
we want to set up SolrCloud for the same.

I am in need of some initial thoughts about how to set up SolrCloud for such a 
requirement. How do we come to know how many nodes/cores I would need to 
support this...

We are thinking of hosting this on Amazon... Any guidance, reading links, or case 
studies will be highly appreciated.

Regards,


Re: Race condition in Leader Election

2014-03-06 Thread KNitin
I am using 4.3.1.


On Thu, Mar 6, 2014 at 11:48 AM, Mark Miller  wrote:

> Are you using an old version?
>
> - Mark
>
> http://about.me/markrmiller
>
> On Mar 6, 2014, at 11:50 AM, KNitin  wrote:
>
> > Hi
> >
> > When restarting a node in solrcloud, i run into scenarios where both the
> > replicas for a shard get into "recovering" state and never come up
> causing
> > the error "No servers hosting this shard". To fix this, I either unload
> one
> > core or restart one of the nodes again so that one of them becomes the
> > leader.
> >
> > Is there a way to "force" leader election for a shard for solrcloud? Is
> > there a way to break ties automatically (without restarting nodes) to
> make
> > a node as the leader for the shard?
> >
> >
> > Thanks
> > Nitin
>
>


Re: Apache Solr Configuration Problem (Japanese Language)

2014-03-06 Thread T. Kuro Kurosaka

Andy,
I don't have a direct answer to your question but I have a question.

On 03/05/2014 07:21 AM, Andy Alexander wrote:

fq=ss_language:ja&q=製品


I am guessing you have a field called ss_language where a language code 
of the document is stored, and you have Solr documents of different languages.


+DisjunctionMaxQuery((content:製品)~0.01)
This indicates your default query field is "content".  What does the 
analyzer for this field look like?

Does the analyzer work for any languages that you want to support?
Many analyzers have language dependency and won't work with multilingual 
fields.


--
T. "Kuro" Kurosaka • Senior Software Engineer
Healthline - The Power of Intelligent Health
www.healthline.com  |@Healthline  | @HealthlineCorp



Re: hung threads and CLOSE_WAIT sockets

2014-03-06 Thread Mark Miller
It sounds like the distributed update deadlock issue.

It’s fixed in 4.6.1 and 4.7.

- Mark

http://about.me/markrmiller

On Mar 6, 2014, at 3:10 PM, Avishai Ish-Shalom  wrote:

> Hi,
> 
> We've had a strange mishap with a solr cloud cluster (version 4.5.1) where
> we observed high search latency. The problem appears to develop over
> several hours until such point where the entire cluster stopped responding
> properly.
> 
> After investigation we found that the number of threads (both solr and
> jetty) gradually rose over several hours until it hit the maximum allowed
> at which point the cluster stopped responding properly. After restarting
> several nodes the number of threads dropped and the cluster started
> responding again.
> We've examined nodes that were not restarted and found a high number of
> CLOSE_WAIT sockets held by the solr process; these sockets were using a
> random local port and 8983 remote port - meaning they were outgoing
> connections. a thread dump did not show a large number of solr threads and
> we were unable to determine which thread(s) is holding these sockets.
> 
> has anyone else encountered such a situation?
> 
> Regards,
> Avishai



Re: Date Range Query taking more time.

2014-03-06 Thread Chris Hostetter

: That did the trick Ahmet.  The first response was around 200ms, but the
: subsequent queries were around 2-5ms.

Are you really sure you want "cache=false" on all of those filters?

While the "ClientID:4" query may by something that cahnges significantly 
enough in every query to not be useful to cache, i suspect you'd find a 
lot of value in going ahead and caching those Status:Booked and 
StartDate:[NOW/DAY TO NOW/DAY+1YEAR] clauses ... the first query to hit 
them might be "slower" but ever query after that should be fairly fast -- 
and if you really need them to *always* be fast, configure them as static 
newSeracher warming queries (or make sure you have autowarming on.
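
A hedged example of such a static warming entry in solrconfig.xml (only the
two cacheable filters are listed; everything else is left at defaults):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="fq">Status:Booked</str>
        <str name="fq">StartDate:[NOW/DAY TO NOW/DAY+1YEAR]</str>
        <str name="rows">0</str>
      </lst>
    </arr>
  </listener>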

It also looks like you forgot the "StartDate:" part of your range query in 
your last test...

: &fq={!cache=false cost=50}[NOW/DAY TO NOW/DAY+1YEAR]

And one final comment, just to make sure it doesn't slip through the 
cracks:

: > > Since your range query has NOW in it, it won't be cached meaningfully.

this is not applicable.  the use of "NOW" in a range query doesn't mean 
that it can't be cached -- the problem is anytime you use really precise 
dates (or numeric values) that *change* in every query.

if your range query uses "NOW" as a lower/upper end point, then it falls 
into that "really precise dates" situation -- but for this user, who is 
specifically rounding his dates to the nearest day, that advice isn't 
really applicable -- the date range queries can be cached & reused for an 
entire day.



-Hoss
http://www.lucidworks.com/


SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-06 Thread Martin de Vries
 

Hi, 

We have 5 Solr servers in a Cloud with about 70 cores and 12GB
indexes in total (every core has 2 shards, so it's 6 GB per server).


After upgrading to Solr 4.7 the Solr servers are crashing constantly
(each server about once per hour). We currently don't have any clue
about the reason. We have tried loads of different settings, but nothing
works. 

When a server crashes the last log item is (most times) a
"Broken pipe" error. The last queries / used cores are completely random
(as far as we can see). 

We are running with the -Xloggc switch and
during a crash it says: 

10838.015: [Full GC 3141724K->3141724K(3522560K), 1.6936710 secs]
10839.710: [Full GC 3141724K->3141724K(3522560K), 1.5682250 secs]
10841.279: [Full GC 3141728K->3141726K(3522560K), 1.5735450 secs]
10842.854: [Full GC 3141727K->3141727K(3522560K), 1.5773380 secs]
10844.433: [Full GC 3141732K->3141687K(3522560K), 1.5696950 secs]
10846.003: [Full GC 3141698K->3141687K(3522560K), 1.5766940 secs]
10847.581: [Full GC 3141695K->3141688K(3522560K), 1.5879360 secs]
10849.170: [Full GC 3141695K->3141691K(3522560K), 1.5698630 secs]
10850.741: [Full GC 3141695K->3141689K(3522560K), 1.5643990 secs]
10852.307: [Full GC 3141693K->3141650K(3522560K), 1.5759150 secs]

We tried to increase the
memory, but that didn't help. We increased the zkClientTimeout to 60
seconds, but that didn't help. 

We made a memory dump with jmap. The
IndexSchema is using 62% of the memory but we don't know if that's a
problem:
https://www.dropbox.com/s/eyom5c48vhl0q9i/Screenshot%202014-03-06%2023.32.41.png
[1] 

Tomorrow we will downgrade each server to Solr 4.6.1; we need to
reindex every core to do that, unless we have a solution.

Does anyone have a clue what the problem can be?

Thanks! 

Martin 




Links:
--
[1]
https://www.dropbox.com/s/eyom5c48vhl0q9i/Screenshot%202014-03-06%2023.32.41.png


Re: Date Range Query taking more time.

2014-03-06 Thread Ahmet Arslan
Hoss,

Thanks for the correction. I missed the /DAY part and thought it was 
StartDate:[NOW TO NOW+1YEAR].

Ahmet


On Friday, March 7, 2014 12:33 AM, Chris Hostetter  
wrote:

: That did the trick Ahmet.  The first response was around 200ms, but the
: subsequent queries were around 2-5ms.

Are you really sure you want "cache=false" on all of those filters?

While the "ClientID:4" query may by something that cahnges significantly 
enough in every query to not be useful to cache, i suspect you'd find a 
lot of value in going ahead and caching those Status:Booked and 
StartDate:[NOW/DAY TO NOW/DAY+1YEAR] clauses ... the first query to hit 
them might be "slower" but ever query after that should be fairly fast -- 
and if you really need them to *always* be fast, configure them as static 
newSeracher warming queries (or make sure you have autowarming on.

It also looks like you forgot the "StartDate:" part of your range query in 
your last test...

: &fq={!cache=false cost=50}[NOW/DAY TO NOW/DAY+1YEAR]

And one final comment, just to make sure it doesn't slip through the 
cracks:


: > > Since your range query has NOW in it, it won't be cached meaningfully.

this is not applicable.  the use of "NOW" in a range query doesn't mean 
that it can't be cached -- the problem is anytime you use really precise 
dates (or numeric values) that *change* in every query.

if your range query uses "NOW" as a lower/upper end point, then it falls 
into that "really precise dates" situation -- but for this user, who is 
specifically rounding his dates to the nearest day, that advice isn't 
really applicable -- the date range queries can be cached & reused for an 
entire day.



-Hoss
http://www.lucidworks.com/



Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-06 Thread Mark Miller


On Mar 6, 2014, at 5:37 PM, Martin de Vries  wrote:

> IndexSchema is using 62% of the memory but we don't know if that's a
> problem:

That seems odd. Can you see what objects are taking all the RAM in the 
IndexSchema?

- Mark

http://about.me/markrmiller

Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-06 Thread Shawn Heisey
On 3/6/2014 3:37 PM, Martin de Vries wrote:
> We have 5 Solr servers in a Cloud with about 70 cores and 12GB
> indexes in total (every core has 2 shards, so it's 6 GB per server).
> 
> After upgrade to Solr 4.7 the Solr servers are crashing constantly
> (each server about one time per hour). We currently don't have any clue
> about the reason. We tried loads of different settings, but nothing
> works out. 
> 
> When a server crashes the last log item is (most times) a
> "Broken pipe" error. The last queries / used cores are completely random
> (as far as we can see). 

We'd need to actually see a large chunk of the end of the actual
logfile.  It must be the file, not the logging tab in the admin UI.
Ideally it should be at the INFO logging level.

If the broken pipe error is part of an EofException, it is (in my
experience) caused by a client disconnecting before sending the full
request or disconnecting before Solr responds.  I don't know what kind
of socket timeout your clients have, but 30-60 seconds is a common
default for systems that actually set a timeout.
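
If the clients are SolrJ-based, it is worth checking what those timeouts are
actually set to.  A minimal sketch, assuming a SolrJ 4.x HttpSolrServer and
an example URL:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ClientTimeouts {
  public static void main(String[] args) {
    // Example URL only -- point this at one of your own cores.
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    server.setConnectionTimeout(5000); // ms allowed for establishing the TCP connection
    server.setSoTimeout(60000);        // socket read timeout in ms; too low a value can cause early disconnects
    server.shutdown();
  }
}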

Are there any messages in the operating system logs?

Full details about the computer, operating system, Solr startup options,
and your index may also be required to dig deeper, so any details you
can share will be useful.  The config and schema may also be useful, but
you can hold off on those until we know for sure whether they will be
needed.

Thanks,
Shawn



SolrCloud recovery after nodes are rebooted in rapid succession

2014-03-06 Thread Nazik Huq
Hello,

 

I have a question from a colleague who's managing a 3-node(VMs) SolrCloud
cluster with a separate 3-node Zookeeper ensemble. Periodically the  data
center underneath the SolrCloud decides to upgrade the SolrCloud instance
infrastructure in  a "rolling upgrade" fashion. So after the 1st instance of
the SolrCloud is shut down and while it is in the process of rebooting, the
2nd instance starts to shut down, and so on. Eventually all three Solr
instances are rebooted and up and running, but the cluster is now
inoperable, meaning clients can't query or ingest data. My colleague is
trying to ascertain if this problem is due to Solr's inability to recover
from a rapid succession of reboots of the nodes or from the data center
upgrade that is triggering a "situation" making SolrCloud inoperable.

 

My question is, can a SolrCloud cluster become inoperable after its nodes
are rebooted in rapid succession as described above? Is there an edge case
similar to this?

 

Thanks,

 

Nazik Huq

 

 



Re:Solr 4.7.0 - cursorMark question

2014-03-06 Thread Greg Pendlebury
"* New 'cursorMark' request param for efficient deep paging of sorted
  result sets. See http://s.apache.org/cursorpagination";

At the end of the linked doco there is an example that doesn't make sense
to me, because it mentions "sort=timestamp asc" and is then followed by
pseudo code that sorts by id only. I understand that cursorMark requires
that "sort clauses must include the uniqueKey field", but is it really just
'include', or is it the only field that sort can be performed on?

ie. can sort be specified as 'sort=timestamp asc, id asc'?

I am assuming that if the index is changed between requests then we can
still 'miss' or duplicate documents by not sorting on the id as the only
sort parameter, but I can live with that scenario. cursorMark is still
attractive to us since it will prevent the SolrCloud cluster from crashing
when deep pagination requests are sent to it... I'm just trying to explore
all the edge cases our business area are likely to consider.

Ta,
Greg

On 27 February 2014 02:15, Simon Willnauer  wrote:

> February 2014, Apache Solr(tm) 4.7 available
>
> The Lucene PMC is pleased to announce the release of Apache Solr 4.7
>
> Solr is the popular, blazing fast, open source NoSQL search platform
> from the Apache Lucene project. Its major features include powerful
> full-text search, hit highlighting, faceted search, dynamic
> clustering, database integration, rich document (e.g., Word, PDF)
> handling, and geospatial search.  Solr is highly scalable, providing
> fault tolerant distributed search and indexing, and powers the search
> and navigation features of many of the world's largest internet sites.
>
> Solr 4.7 is available for immediate download at:
>   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html
>
> See the CHANGES.txt file included with the release for a full list of
> details.
>
> Solr 4.7 Release Highlights:
>
> * A new 'migrate' collection API to split all documents with a route key
>   into another collection.
>
> * Added support for tri-level compositeId routing.
>
> * Admin UI - Added a new "Files" conf directory browser/file viewer.
>
> * Add a QParserPlugin for Lucene's SimpleQueryParser.
>
> * Suggest improvements: a new SuggestComponent that fully utilizes the
>   Lucene suggester module; queries can now use multiple suggesters;
>   Lucene's FreeTextSuggester and BlendedInfixSuggester are now supported.
>
> * New 'cursorMark' request param for efficient deep paging of sorted
>   result sets. See http://s.apache.org/cursorpagination
>
> * Add a Solr contrib that allows for building Solr indexes via Hadoop's
>   MapReduce.
>
> * Upgrade to Spatial4j 0.4. Various new options are now exposed
>   automatically for an RPT field type.  See Spatial4j CHANGES & javadocs.
>   https://github.com/spatial4j/spatial4j/blob/master/CHANGES.md
>
> * SSL support for SolrCloud.
>
> Solr 4.7 also includes many other new features as well as numerous
> optimizations and bugfixes.
>
> Please report any feedback to the mailing lists
> (http://lucene.apache.org/solr/discussion.html)
>
> Note: The Apache Software Foundation uses an extensive mirroring network
> for distributing releases.  It is possible that the mirror you are using
> may not have replicated the release yet.  If that is the case, please
> try another mirror.  This also goes for Maven access.
>


Re: SolrCloud recovery after nodes are rebooted in rapid succession

2014-03-06 Thread Mark Miller
Would probably need to see some logs to say much. Need to understand why they 
are inoperable.

What version is this?

- Mark

http://about.me/markrmiller

On Mar 6, 2014, at 6:15 PM, Nazik Huq  wrote:

> Hello,
> 
> 
> 
> I have a question from a colleague who's managing a 3-node(VMs) SolrCloud
> cluster with a separate 3-node Zookeeper ensemble. Periodically the  data
> center underneath the SolrCloud decides to upgrade the SolrCloud instance
> infrastructure in  a "rolling upgrade" fashion. So after the 1st instance of
> the SolrCloud is shut down and while it is in the process of rebooting, the
> 2nd instance starts to shut down, and so on. Eventually all three Solr
> instances are rebooted and up and running, but the cluster is now
> inoperable, meaning clients can't query or ingest data. My colleague is
> trying to ascertain if this problem is due to Solr's inability to recover
> from a rapid succession of reboots of the nodes or from the data center
> upgrade that is triggering a "situation" making SolrCloud inoperable.
> 
> 
> 
> My question is, can a SolrCloud cluster become inoperable after its nodes
> are rebooted in rapid succession as described above? Is there an edge case
> similar to this?
> 
> 
> 
> Thanks,
> 
> 
> 
> Nazik Huq
> 
> 
> 
> 
> 



Re: Indexing huge data

2014-03-06 Thread Kranti Parisa
That's what I do: pre-create JSONs following the schema and save them in
MongoDB; this is part of the ETL process. After that, just dump the JSONs
into Solr using batching etc. With this you can do full and incremental
indexing as well.
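
A rough sketch of the "dump with batching" step, using SolrJ 4.x (the URL,
the batch size and the shape of the ETL output are assumptions, not
something from this thread):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  // rows: field-name -> value maps produced by the ETL step (e.g. parsed from the stored JSONs)
  public static void index(List<Map<String, Object>> rows) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // example URL
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (Map<String, Object> row : rows) {
      SolrInputDocument doc = new SolrInputDocument();
      for (Map.Entry<String, Object> e : row.entrySet()) {
        doc.addField(e.getKey(), e.getValue());
      }
      batch.add(doc);
      if (batch.size() >= 1000) {   // send in chunks instead of one add per document
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);
    }
    server.commit();                // one commit at the end (or rely on autoCommit in solrconfig.xml)
    server.shutdown();
  }
}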

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu  wrote:

> Yeah. I have thought about spitting out JSON and running it against Solr using
> parallel HTTP threads separately. Thanks.
>
>
> On 3/5/14, 6:46 PM, Susheel Kumar wrote:
>
>> One more suggestion is to collect/prepare the data in CSV format (1-2
>> million sample depending on size) and then import data directly into Solr
>> using CSV handler & curl.  This will give you the pure indexing time & the
>> differences.
>>
>> Thanks,
>> Susheel
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Wednesday, March 05, 2014 8:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Indexing huge data
>>
>> Here's the easiest thing to try to figure out where to concentrate your
>> energies. Just comment out the server.add call in your SolrJ program.
>> Well, and any commits you're doing from SolrJ.
>>
>> My bet: Your program will run at about the same speed it does when you
>> actually index the docs, indicating that your problem is in the data
>> acquisition side. Of course the older I get, the more times I've been wrong
>> :).
>>
>> You can also monitor the CPU usage on the box running Solr. I often see
>> it idling along < 30% when indexing, or even < 10%, again indicating that
>> the bottleneck is on the acquisition side.
>>
>> Note I haven't mentioned any solutions, I'm a believer in identifying the
>> _problem_ before worrying about a solution.
>>
>> Best,
>> Erick
>>
>> On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky 
>> wrote:
>>
>>> Make sure you're not doing a commit on each individual document add.
>>> Commit every few minutes or every few hundred or few thousand
>>> documents is sufficient. You can set up auto commit in solrconfig.xml.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Rallavagu
>>> Sent: Wednesday, March 5, 2014 2:37 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Indexing huge data
>>>
>>>
>>> All,
>>>
>>> Wondering about best practices/common practices to index/re-index huge
>>> amount of data in Solr. The data is about 6 million entries in the db
>>> and other source (data is not located in one resource). Trying with
>>> solrj based solution to collect data from difference resources to
>>> index into Solr. It takes hours to index Solr.
>>>
>>> Thanks in advance
>>>
>>


Re:Solr 4.7.0 - cursorMark question

2014-03-06 Thread Chris Hostetter

: At the end of the linked doco there is an example that doesn't make sense
: to me, because it mentions "sort=timestamp asc" and is then followed by
: pseudo code that sorts by id only. I understand that cursorMark requires

Ok ... 2 things contributing to the confusion.

1) The para that refers to "sort=timestamp asc" should be fixed to include 
"id" as well.

2) The pseudo-code you're referring to that uses "sort => 'id asc'" isn't meant 
to give an example of specifically tailing by timestamp -- it's an 
extension of the earlier example (of fetching all docs sorted on id) to 
show "tailing" new docs with new (increasing) ids ... I'll try to fix the 
wording to better elaborate.

: that "sort clauses must include the uniqueKey field", but is it really just
: 'include', or is it the only field that sort can be performed on?
: 
: ie. can sort be specified as 'sort=timestamp asc, id asc'?

That will absolutely work ... I'll update the doc to include more examples 
with multi-clause sort criteria.

: I am assuming that if the index is changed between requests than we can
: still 'miss' or duplicate documents by not sorting on the id as the only
: sort parameter, but I can live with that scenario. cursorMark is still

If you are using a timestamp param, you should never "miss" a document 
(assuming every doc gets a timestamp) but yes: you can absolutely get the 
same doc twice if it's updated after the first time you fetch it -- that's 
one of the advantages of sorting on a timestamp field like that.
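
To make that concrete, a minimal SolrJ 4.7 sketch of a cursorMark loop that
sorts on timestamp first and id second (the URL and the 'timestamp' field
name are just examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorWalk {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // example URL
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(100);
    q.setSort("timestamp", SolrQuery.ORDER.asc);  // assumed field, as discussed in this thread
    q.addSort("id", SolrQuery.ORDER.asc);         // uniqueKey as the tie-breaker
    String cursor = CursorMarkParams.CURSOR_MARK_START;  // "*"
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = server.query(q);
      // ... process rsp.getResults() here ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) {
        break;  // getting the same mark back means the result set is exhausted
      }
      cursor = next;
    }
    server.shutdown();
  }
}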



-Hoss
http://www.lucidworks.com/


Re: Solr 4.7.0 - cursorMark question

2014-03-06 Thread Greg Pendlebury
Thank-you, that all sounds great. My assumption about documents being
missed was something like this:

A,B,C,D

where they are sorted by timestamp first and ID second. Say the first
'page' of results is 'A,B', and before the second page is requested both
documents B + C receive update events and the new order (by timestamp) is:

A,D,B,C

In that situation D would always be missed, whether the cursorMark means 'C or
greater' or 'greater than B' (I'm not sure which it is in practice), simply
because the cursorMark is the unique ID and the unique ID is not your first
sort mechanism.

However, I'm not really concerned about that anyway since it is not a use
case we consider important, and in an information science sense of things I
think it is a non-trivial problem to solve without brute force caching of
all result sets. I'm just happy that we don't have to get our users to
replace existing sort options; we just need to add a unique ID field at the
end and change the parameters we send into the cluster.

Thanks,
Greg


On 7 March 2014 11:05, Chris Hostetter  wrote:

>
> : At the end of the linked doco there is an example that doesn't make sense
> : to me, because it mentions "sort=timestamp asc" and is then followed by
> : pseudo code that sorts by id only. I understand that cursorMark requires
>
> Ok ... 2 things contributing to the confusion.
>
> 1) the para that refers to "sort=timestamp asc" should be fixed to include
> "id" as well.
>
> 2) The pseudo-code you're referring to that uses "sort => 'id asc'" isn't meant
> to give an example of specifically tailing by timestamp -- it's an
> extension of the earlier example (of fetching all docs sorted on id) to
> show "tailing" new docs with new (increasing) ids ... I'll try to fix the
> wording to better elaborate
>
> : that "sort clauses must include the uniqueKey field", but is it really
> just
> : 'include', or is it the only field that sort can be performed on?
> :
> : ie. can sort be specified as 'sort=timestamp asc, id asc'?
>
> That will absolutely work ... i'll update the doc to include more examples
> with multi-clause sort criteria.
>
> : I am assuming that if the index is changed between requests than we can
> : still 'miss' or duplicate documents by not sorting on the id as the only
> : sort parameter, but I can live with that scenario. cursorMark is still
>
> If you are using a timestamp param, you should never "miss" a document
> (assuming every doc gets a timestamp) but yes: you can absolutely get the
> same doc twice if it's updated after the first time you fetch it -- that's
> one of the advantages of sorting on a timestamp field like that.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Date Range Query taking more time.

2014-03-06 Thread Vijay Kokatnur
My initial approach was to use the filter cache for the static fields.
However, when a filter query is used, every query after the first has the
same response time as the first.  For instance, when caching is enabled in
the query under review, the response time shoots up to 4-5 secs and stays there.

We are using default filter cache settings provided with 4.5.0
distribution.

Current Filter Cache stats :

lookups:0
hits:0
hitratio:0
inserts:0
evictions:0
size:0
warmupTime:0
cumulative_lookups:17135
cumulative_hits:2465
cumulative_hitratio:0.14
cumulative_inserts:14670
cumulative_evictions:0

I did not find what the cumulative_* fields mean,
but it looks like nothing is being cached with fq, as the hit ratio is 0.

Any idea what's happening?



On Thu, Mar 6, 2014 at 2:41 PM, Ahmet Arslan  wrote:

> Hoss,
>
> Thanks for the correction. I missed the /DAY part and thought it was
>  StartDate:[NOW TO NOW+1YEAR]
>
> Ahmet
>
>
> On Friday, March 7, 2014 12:33 AM, Chris Hostetter <
> hossman_luc...@fucit.org> wrote:
>
> : That did the trick Ahmet.  The first response was around 200ms, but the
> : subsequent queries were around 2-5ms.
>
> Are you really sure you want "cache=false" on all of those filters?
>
> While the "ClientID:4" query may by something that cahnges significantly
> enough in every query to not be useful to cache, i suspect you'd find a
> lot of value in going ahead and caching those Status:Booked and
> StartDate:[NOW/DAY TO NOW/DAY+1YEAR] clauses ... the first query to hit
> them might be "slower" but ever query after that should be fairly fast --
> and if you really need them to *always* be fast, configure them as static
> newSeracher warming queries (or make sure you have autowarming on.
>
> It also look like you forgot the "StartDate:" part of your range query in
> your last test...
>
> : &fq={!cache=false cost=50}[NOW/DAY TO NOW/DAY+1YEAR]
>
> And one finally comment just to make sure it doesn't slip throug hthe
> cracks
>
>
> : > > Since your range query has NOW in it, it won't be cached
> meaningfully.
>
> this is not applicable.  the use of "NOW" in a range query doesn't mean
> that it can't be cached -- the problem is anytime you use really precise
> dates (or numeric values) that *change* in every query.
>
> if your range query uses "NOW" as a lower/upper end point, then it calls
> in that "really precise dates" situation -- but for this user, who is
> specifically rounding his dates to hte nearest day, that advice isn't
> really applicable -- the date range queries can be cached & reused for an
> entire day.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
>


RE: SolrCloud recovery after nodes are rebooted in rapid succession

2014-03-06 Thread Nazik Huq
The version is 4.6. I am going to ask for the log files and post them.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Thursday, March 06, 2014 6:33 PM
To: solr-user
Subject: Re: SolrCloud recovery after nodes are rebooted in rapid succession

Would probably need to see some logs to say much. Need to understand why
they are inoperable.

What version is this?

- Mark

http://about.me/markrmiller

On Mar 6, 2014, at 6:15 PM, Nazik Huq  wrote:

> Hello,
> 
> 
> 
> I have a question from a colleague who's managing a 3-node(VMs) 
> SolrCloud cluster with a separate 3-node Zookeeper ensemble. 
> Periodically the  data center underneath the SolrCloud decides to 
> upgrade the SolrCloud instance infrastructure in  a "rolling upgrade" 
> fashion. So after the 1st instance of the SolrCloud is shut down and 
> while it is in the process of rebooting, the 2nd  instance starts to 
> shut down  and so on. Eventually all three Solr instances are rebooted 
> and up and running, but the cluster is now inoperable, meaning 
> clients can't query or  ingest data. My colleague is trying to 
> ascertain if this problem is due to Solr's inability to recover from a 
> rapid succession of reboots of the nodes or from the data center upgrade
that is triggering a "situation" making SolrCloud inoperable.
> 
> 
> 
> My question is, can a SolrCloud cluster become inoperable after its 
> nodes are rebooted in rapid succession as described above? Is there an 
> edge case similar to this?
> 
> 
> 
> Thanks,
> 
> 
> 
> Nazik Huq
> 
> 
> 
> 
> 



Dataimport handler Date

2014-03-06 Thread Pritesh Patel
I'm using the dataimporthandler to index data from a MySQL DB. It has been
running just fine; I've been using full-imports. I'm now trying to
implement the delta-import functionality.

To implement the delta query, you need to read the last_index_time
from a properties file to know what is new to index.  So I'm using the
parameter
${dataimporter.last_index_time} within my query.

The problem is that when I use this, the date is always "Thu Jan 01 00:00:00
UTC 1970".  It never actually reads the correct date stored in the
dataimport.properties file.
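
(For reference, the dataimport.properties that DIH writes normally looks
something like the following -- the entity name and the timestamps here are
made up:)

#Fri Mar 07 01:15:00 UTC 2014
last_index_time=2014-03-07 01\:15\:00
news.last_index_time=2014-03-07 01\:15\:00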

So my delta query does not work.  Has anybody seen this issue?

It seems like it's always using the start of the epoch, i.e. Unix
timestamp 0.

--Pritesh

P.S.  If you want to see the delta query, see below.

deltaQuery="SELECT node.nid from node where node.type = 'news' and
node.status = 1 and (node.changed >
UNIX_TIMESTAMP('${dataimporter.last_index_time}'jgkg) or node.created >
UNIX_TIMESTAMP('${dataimporter.last_index_time}'))"

deltaImportQuery="SELECT node.nid, node.vid, node.type, node.language,
node.title, node.uid, node.status,
FROM_UNIXTIME(node.created,'%Y-%m-%dT%TZ') as created,
FROM_UNIXTIME(node.changed,'%Y-%m-%dT%TZ') as changed, node.comment,
node.promote, node.moderate, node.sticky, node.tnid, node.translate,
content_type_news.field_image_credit_value,
content_type_news.field_image_caption_value,
content_type_news.field_subhead_value,
content_type_news.field_author_value,
content_type_news.field_dateline_value,
content_type_news.field_article_image_fid,
content_type_news.field_article_image_list,
content_type_news.field_article_image_data,
content_type_news.field_news_blurb_value,
content_type_news.field_news_blurb_format,
content_type_news.field_news_syndicate_value,
content_type_news.field_news_video_reference_nid,
content_type_news.field_news_inline_location_value,
content_type_news.field_article_contributor_nid,
content_type_news.field_news_title_value, page_title.page_title FROM node
LEFT JOIN content_type_news ON node.nid = content_type_news.nid LEFT JOIN
page_title ON node.nid = page_title.id where node.type = 'news' and
node.status = 1 and node.nid = '${deltaimport.delta.nid}'"


Re: SolrCloud setup guidance

2014-03-06 Thread Priti Solanki
Thanks Susheel,

But this index will keep on growing, and that is my worry, so I will always
have to keep increasing the RAM.

Can you suggest how many nodes one should plan for to support this big index?

Regards,



On Fri, Mar 7, 2014 at 2:50 AM, Susheel Kumar <
susheel.ku...@thedigitalgroup.net> wrote:

> Setting up Solr cloud(horizontal scaling) is definitely a good idea for
> this big index but before going to Solr cloud, are you able to upgrade your
> single node to 128GB of memory(vertical scaling) to see the difference.
>
> Thanks,
> Susheel
>
> -Original Message-
> From: Priti Solanki [mailto:pritiatw...@gmail.com]
> Sent: Thursday, March 06, 2014 10:51 AM
> To: solr-user@lucene.apache.org
> Subject: SolrCloud setup guidance
>
> Hello Everyone,
>
> I would like to take you guidance of following
>
> I have a single core with 124 GB of index data size. Indexing and Reading
> both are very slow as I have 7 GB RAM to support this huge data.  Almost 8
> million documents.
>
> Hence, we thought of going to SolrCloud so that we can accommodate more
> upcoming data. I have data for 13 countries with their millions of products
> and we want to set up solrcloud for the same.
>
> I am in need of some initial thoughts about how to set up SolrCloud for
> such a requirement. How do we come to know how many nodes/cores I would be
> needing to support this...
>
> we are thinking to host this on Amazon...Any guidance or reading links
> ,case study will be highly appreciated.
>
> Regards,
>


What is mean by Index Searcher?

2014-03-06 Thread search engn dev
I am reading the Apache Solr Reference Guide and it has these lines:

". Solr caches are associated with a specific instance of an Index
Searcher, a specific view of an index that doesn't change during the
lifetime of that
searcher. As long as that Index Searcher is being used, any items in its
cache will be valid and available for reuse"

What is the concept of an index searcher? Basically, what is meant by "index
searcher" here? Is it a user or something else?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-is-mean-by-Index-Searcher-tp4121898.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What is mean by Index Searcher?

2014-03-06 Thread Alexandre Rafalovitch
That's an under-the-covers implementation detail. Unless you are writing
extensions, you probably don't need to worry about it.

Where it connects to userland is, for example, commits. Until you commit,
your records are not visible, even though Solr already has them. This is
because the current 'index searcher' does not see new items. When a commit
is done, the searcher is closed and reopened, and then you see those
changes.
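
A tiny SolrJ sketch of that behaviour (example URL and id; it assumes no
autoSoftCommit is configured on the core):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SearcherVisibility {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // example URL
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "searcher-demo-1");
    server.add(doc);
    // Still served by the old searcher: the new doc is not visible yet.
    long before = server.query(new SolrQuery("id:searcher-demo-1")).getResults().getNumFound();
    server.commit();   // closes the old searcher and opens a new one over the updated index
    long after = server.query(new SolrQuery("id:searcher-demo-1")).getResults().getNumFound();
    System.out.println(before + " -> " + after);   // typically "0 -> 1"
    server.shutdown();
  }
}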

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Fri, Mar 7, 2014 at 2:04 PM, search engn dev
 wrote:
> I am reading the Apache Solr Reference Guide and it has these lines:
>
> ". Solr caches are associated with a specific instance of an Index
> Searcher, a specific view of an index that doesn't change during the
> lifetime of that
> searcher. As long as that Index Searcher is being used, any items in its
> cache will be valid and available for reuse"
>
> What is the concept of an index searcher? Basically, what is meant by "index
> searcher" here? Is it a user or something else?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/What-is-mean-by-Index-Searcher-tp4121898.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Dataimport handler Date

2014-03-06 Thread Gora Mohanty
On 7 March 2014 08:50, Pritesh Patel  wrote:
> I'm using the dataimporthandler to index data from a MySQL DB. It has been
> running just fine; I've been using full-imports. I'm now trying to
> implement the delta-import functionality.
>
> To implement the delta query, you need to read the last_index_time
> from a properties file to know what is new to index.  So I'm using the
> parameter
> ${dataimporter.last_index_time} within my query.
>
> The problem is that when I use this, the date is always "Thu Jan 01 00:00:00
> UTC 1970".  It never actually reads the correct date stored in the
> dataimport.properties file.
[...]

I take it that you have verified that the dataimport.properties file exists.
What are its contents?

Please share the exact DIH configuration file that you use, obfuscating
DB password/username. Your cut-and-paste seems to have a syntax
error in the deltaQuery (notice the 'jgkg' string):
deltaQuery="SELECT node.nid from node where node.type = 'news' and
node.status = 1 and (node.changed >
UNIX_TIMESTAMP('${
dataimporter.last_index_time}'jgkg) or node.created >
UNIX_TIMESTAMP('${dataimporter.last_index_time}'))"

What response do you get from the delta-import URL?
Are there any error messages in your Solr log?

Regards,
Gora


SolrCloud with Tomcat

2014-03-06 Thread Vineet Mishra
Hi

I am installing SolrCloud with 3 external
ZooKeepers (localhost:2181, localhost:2182, localhost:2183) and 2
Tomcats (localhost:8181, localhost:8182), all available on a single
machine (just for getting started),
by following these links:

http://myjeeva.com/solrcloud-cluster-single-collection-deployment.html
http://wiki.apache.org/solr/SolrCloudTomcat

I have got the Solr UI on the machine pointing to

http://localhost:8181/solr/#/~cloud

In the Cloud Graph View it is coming with

mycollection
|
|_ shard1
|_ shard2

But both shards are empty, showing no cores or replicas.

Following the
http://myjeeva.com/solrcloud-cluster-single-collection-deployment.html blog,
I have been successful up to starting Tomcat;
after the section "Creating Collection, Shard(s), Replica(s) in
SolrCloud" I am facing the problem.

When giving the command to create a replica for the shard using

curl 'http://localhost:8181/solr/admin/cores?action=CREATE&name=shard1-replica-2&collection=mycollection&shard=shard1'

it gives the following error response (status=400, QTime=137):

Error CREATEing SolrCore 'shard1-replica-2':
192.168.2.183:8182_solr_shard1-replica-2 is removed (code=400)


Has anybody gone through this issue?

Regards