Data Import Handler Question

2014-04-24 Thread Yuval Dotan
Hi
I want to use the DIH component in order to import data from an old
PostgreSQL DB.
I want to be able to recover from errors and crashes.
If an error occurs, I should be able to restart and continue indexing from
where it stopped.
Is the DIH good enough for my requirements?
If not, is it possible to extend one of its classes in order to support the
recovery?
Thanks
Yuval


Re: Data Import Handler Question

2014-04-27 Thread Yuval Dotan
Thanks Shawn

In your opinion, which is easier: writing the importer from
scratch, or extending the DIH (for example, adding the state tracking, etc.)?


Yuval


On Thu, Apr 24, 2014 at 6:47 PM, Shawn Heisey  wrote:

> On 4/24/2014 9:24 AM, Yuval Dotan wrote:
>
>> I want to use the DIH component in order to import data from old
>> postgresql
>> DB.
>> I want to be able to recover from errors and crashes.
>> If an error occurs I should be able to restart and continue indexing from
>> where it stopped.
>> Is the DIH good enough for my requirements ?
>> If not is it possible to extend one of its classes in order to support the
>> recovery?
>>
>
> The entity in the Data Import Handler (DIH) config has an "onError"
> attribute.
>
> http://wiki.apache.org/solr/DataImportHandler#Schema_for_the_data_config
> https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler#UploadingStructuredDataStoreDatawiththeDataImportHandler-EntityProcessors
>
> But honestly, if you want a really robust Java program that indexes to
> Solr and does precisely what you want, you may be better off writing it
> yourself using SolrJ and JDBC.  DIH is powerful and efficient, but when you
> write the program yourself, you can do anything you want with your data.
>
> You also have the possibility of resuming an import after a Solr crash.
>  Because DIH is embedded in Solr and doesn't save any kind of state data
> about an import in progress, that's pretty much impossible with DIH.  With
> a SolrJ program, you'd have to handle that yourself, but it would be
> *possible*.
>
> https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
>
> Thanks,
> Shawn
>
>
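To make Shawn's suggestion concrete, here is a minimal SolrJ + JDBC sketch of a resumable importer that checkpoints the last imported primary key to a file. The table and column names, checkpoint path, Solr URL and JDBC settings are illustrative assumptions, not details from this thread.

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Sketch of a resumable PostgreSQL -> Solr importer (SolrJ 4.x style).
// Table/column names, the checkpoint file and both URLs are hypothetical.
public class ResumableImporter {
  public static void main(String[] args) throws Exception {
    Path checkpoint = Paths.get("last_id.txt");
    long lastId = Files.exists(checkpoint)
        ? Long.parseLong(new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8).trim())
        : 0L;

    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core1");
    Connection db = DriverManager.getConnection("jdbc:postgresql://dbhost/olddb", "user", "secret");
    PreparedStatement ps = db.prepareStatement(
        "SELECT id, title, body FROM docs WHERE id > ? ORDER BY id LIMIT 1000");
    try {
      while (true) {
        ps.setLong(1, lastId);
        ResultSet rs = ps.executeQuery();
        int count = 0;
        while (rs.next()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rs.getLong("id"));
          doc.addField("title", rs.getString("title"));
          doc.addField("body", rs.getString("body"));
          solr.add(doc);
          lastId = rs.getLong("id");
          count++;
        }
        rs.close();
        if (count == 0) break;                  // nothing left to import
        solr.commit();                          // make the batch durable in Solr first...
        Files.write(checkpoint,                 // ...then advance the checkpoint
            Long.toString(lastId).getBytes(StandardCharsets.UTF_8));
      }
    } finally {
      ps.close();
      db.close();
      solr.shutdown();
    }
  }
}

If a crash happens between commit() and the checkpoint write, the worst case is re-indexing one batch, which is harmless because Solr overwrites documents that share the same unique key.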


Re: distributed search is significantly slower than direct search

2013-11-17 Thread Yuval Dotan
Hi,

I isolated the case

Installed on a new machine (2 x Xeon E5410 2.33GHz)

I have an environment with 12 GB of memory.

I assigned 6 GB of memory to Solr, and I'm not running any other
memory-consuming process, so no memory issues should arise.

Removed all indexes apart from two:

emptyCore – empty – used for routing

core1 – holds the stored data – has ~750,000 docs and a size of 400 MB

Again this is a single machine that holds both indexes.

The query
http://localhost:8210/solr/emptyCore/select?rows=5000&q=*:*&shards=127.0.0.1:8210/solr/core1&wt=json
QTime takes ~3 seconds

and direct query
http://localhost:8210/solr/core1/select?rows=5000&q=*:*&wt=json QTime takes
~15 ms - roughly two orders of magnitude faster.

I ran the long query several times and got an improvement of about a sec
(33%) but that’s it.

I need to better understand why this is happening.

I tried looking at Solr code and debugging the issue but with no success.

The one thing I did notice is that the getFirstMatch method (which receives
the doc id, searches the term dictionary and returns the internal id) takes
most of the time, for some reason.

I am pretty stuck and would appreciate any ideas

My only solution for the moment is to bypass the distributed query,
implement code in my own app that directly queries the relevant cores and
handles the sorting, etc.
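For illustration, a rough SolrJ sketch of that workaround (query each core directly and merge client-side). The core URLs and the "timestamp" sort field are assumptions for illustration only, not something taken from this setup.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

// Sketch: bypass shards= by querying each core directly and merging in the client.
// Core URLs and the "timestamp" sort field are hypothetical.
public class ManualMerge {
  public static void main(String[] args) throws Exception {
    String[] cores = {
        "http://localhost:8210/solr/core1",
        "http://localhost:8210/solr/core2"
    };
    final String sortField = "timestamp";
    int rows = 5000;

    List<SolrDocument> all = new ArrayList<SolrDocument>();
    for (String url : cores) {
      HttpSolrServer server = new HttpSolrServer(url);
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(rows);
      q.setFields("id", sortField);        // fetch only what the merge needs
      all.addAll(server.query(q).getResults());
      server.shutdown();
    }

    // Client-side merge: sort descending by the sort field and keep the top N.
    Collections.sort(all, new Comparator<SolrDocument>() {
      @SuppressWarnings("unchecked")
      public int compare(SolrDocument a, SolrDocument b) {
        Comparable<Object> va = (Comparable<Object>) a.getFieldValue(sortField);
        Comparable<Object> vb = (Comparable<Object>) b.getFieldValue(sortField);
        return vb.compareTo(va);
      }
    });
    List<SolrDocument> top = all.subList(0, Math.min(rows, all.size()));
    System.out.println("merged " + top.size() + " docs");
  }
}

Whether this beats the built-in distributed search depends on the data, but it skips the second, id-based retrieval phase that seems to be the bottleneck here.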

Thanks




On Sat, Nov 16, 2013 at 2:39 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> Did you say what the memory profile of your machine is?  How much memory,
> and how large are the shards? This is just a random guess, but it might be
> that if you are memory-constrained, there is a lot of thrashing caused by
> paging (swapping?) in and out the sharded indexes while a single index can
> be scanned linearly, even if it does need to be paged in.
>
> -Mike
>
>
> On 11/14/2013 8:10 AM, Elran Dvir wrote:
>
>> Hi,
>>
>> We tried returning just the id field and got exactly the same performance.
>> Our system is distributed but all shards are in a single machine so
>> network issues are not a factor.
>> The code we found where Solr is spending its time is on the shard and not
>> on the routing core, again all shards are local.
>> We investigated the getFirstMatch() method and noticed that the
>> MultiTermEnum.reset (inside MultiTerm.iterator) and MultiTerm.seekExact
>> take 99% of the time.
>> Inside these methods, the call to BlockTreeTermsReader$
>> FieldReader$SegmentTermsEnum$Frame.loadBlock  takes most of the time.
>> Out of the 7-second run, these methods take ~5 seconds and
>> BinaryResponseWriter.write takes the rest (~2 seconds).
>>
>> We tried increasing cache sizes and got cache hits, but it only improved
>> the query time by about a second (to ~6 seconds), so no major effect.
>> We are not indexing during our tests. The performance is similar.
>> (How do we measure doc size? Is it important due to the fact that the
>> performance is the same when returning only id field?)
>>
>> We still don't completely understand why the query takes this much longer
>> although the cores are on the same machine.
>>
>> Is there a way to improve the performance (code, configuration, query)?
>>
>> -Original Message-
>> From: idokis...@gmail.com [mailto:idokis...@gmail.com] On Behalf Of
>> Manuel Le Normand
>> Sent: Thursday, November 14, 2013 1:30 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: distributed search is significantly slower than direct search
>>
>> It's surprising that such a query takes so long; I would assume that after
>> repeatedly running q=*:* you should be getting cache hits and times should
>> be faster. Check in the admin UI how your query/document caches perform.
>> Moreover, the query in itself just asks for the first 5000 docs that were
>> indexed (returning the first [docid]), so it seems all this time is wasted
>> on transfer. Out of these 7 secs, how much is spent on the above method?
>> What do you return by default? How big is every doc you display in your
>> results?
>> It might be that both collections are competing for the same resources.
>> Try elaborating on your use case.
>>
>> Anyway, it seems like you just ran a test to see what the performance hit
>> would be in a distributed environment, so I'll try to explain some things
>> we encountered in our benchmarks, with a case that is at least similar in
>> the number of docs fetched.
>>
>> We retrieve 2000 docs on every query, running over 40 shards. This means
>> every shard is actually transferring 2000 docs to our frontend on every
>> document-match request (the first phase you were referring to). Even if
>> lazily loaded, reading 2000 ids (on 40 servers) and lazy loading the fields
>> is a tough job. Waiting for the slowest shard to respond, then sorting the
>> docs and reloading (lazy or not) the top 2000 docs might take a long time.
>>
>> Our times are 4-8 secs, but the cases aren't directly comparable. We've
>> taken a few steps that improved it along the way, steps that led to others.
>> These were our starters:
>>
>> 1. Profile these queries

Re: distributed search is significantly slower than direct search

2013-11-17 Thread Yuval Dotan
Hi Tomás
This is just a test environment meant only to reproduce the issue I am
currently investigating.
The number of documents should grow substantially (billions of docs).



On Sun, Nov 17, 2013 at 7:12 PM, Tomás Fernández Löbbe <
tomasflo...@gmail.com> wrote:

> Hi Yuval, quick question. You say that your core has 750k docs and around
> 400mb? Is this some kind of test dataset and you expect it to grow
> significantly? For an index of this size, I wouldn't use distributed
> search, single shard should be fine.
>
>
> Tomás
>
>
> On Sun, Nov 17, 2013 at 6:50 AM, Yuval Dotan  wrote:
>
> > Hi,
> >
> > I isolated the case
> >
> > Installed on a new machine (2 x Xeon E5410 2.33GHz)
> >
> > I have an environment with 12Gb of memory.
> >
> > I assigned 6gb of memory to Solr and I’m not running any other memory
> > consuming process so no memory issues should arise.
> >
> > Removed all indexes apart from two:
> >
> > emptyCore – empty – used for routing
> >
> > core1 – holds the stored data – has ~750,000 docs and size of 400Mb
> >
> > Again this is a single machine that holds both indexes.
> >
> > The query
> >
> >
> > http://localhost:8210/solr/emptyCore/select?rows=5000&q=*:*&shards=127.0.0.1:8210/solr/core1&wt=json
> > QTime takes ~3 seconds
> >
> > and direct query
> > http://localhost:8210/solr/core1/select?rows=5000&q=*:*&wt=json QTime
> > takes ~15 ms - roughly two orders of magnitude faster.
> >
> > I ran the long query several times and got an improvement of about a sec
> > (33%) but that’s it.
> >
> > I need to better understand why this is happening.
> >
> > I tried looking at Solr code and debugging the issue but with no success.
> >
> > The one thing I did notice is that the getFirstMatch method which
> receives
> > the doc id, searches the term dict and returns the internal id takes most
> > of the time for some reason.
> >
> > I am pretty stuck and would appreciate any ideas
> >
> > My only solution for the moment is to bypass the distributed query,
> > implement code in my own app that directly queries the relevant cores and
> > handles the sorting etc..
> >
> > Thanks
> >
> >
> >
> >
> > On Sat, Nov 16, 2013 at 2:39 PM, Michael Sokolov <
> > msoko...@safaribooksonline.com> wrote:
> >
> > > Did you say what the memory profile of your machine is?  How much
> memory,
> > > and how large are the shards? This is just a random guess, but it might
> > be
> > > that if you are memory-constrained, there is a lot of thrashing caused
> by
> > > paging (swapping?) in and out the sharded indexes while a single index
> > can
> > > be scanned linearly, even if it does need to be paged in.
> > >
> > > -Mike
> > >
> > >
> > > On 11/14/2013 8:10 AM, Elran Dvir wrote:
> > >
> > >> Hi,
> > >>
> > >> We tried returning just the id field and got exactly the same
> > performance.
> > >> Our system is distributed but all shards are in a single machine so
> > >> network issues are not a factor.
> > >> The code we found where Solr is spending its time is on the shard and
> > not
> > >> on the routing core, again all shards are local.
> > >> We investigated the getFirstMatch() method and noticed that the
> > >> MultiTermEnum.reset (inside MultiTerm.iterator) and
> MultiTerm.seekExact
> > >> take 99% of the time.
> > >> Inside these methods, the call to BlockTreeTermsReader$
> > >> FieldReader$SegmentTermsEnum$Frame.loadBlock  takes most of the time.
> > >> Out of the 7 seconds  run these methods take ~5 and
> > >> BinaryResponseWriter.write takes the rest(~ 2 seconds).
> > >>
> > >> We tried increasing cache sizes and got hits, but it only improved the
> > >> query time by a second (~6), so no major effect.
> > >> We are not indexing during our tests. The performance is similar.
> > >> (How do we measure doc size? Is it important due to the fact that the
> > >> performance is the same when returning only id field?)
> > >>
> > >> We still don't completely understand why the query takes this much
> > longer
> > >> although the cores are on the same machine.
> > >>
> > >> Is there a way to improve the performance (code, configuration,
> query)?
> > >>
> > >> -Original

Re: distributed search is significantly slower than direct search

2013-11-18 Thread Yuval Dotan
Hi
Thanks very much for your answers :)
Manuel, if you have a patch I will be glad to test its performance
Yuval



On Mon, Nov 18, 2013 at 10:49 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Manuel, that sounds very interesting. Would you be willing to
> contribute this back to the community?
>
> On Mon, Nov 18, 2013 at 9:53 AM, Manuel Le Normand
>  wrote:
> > In order to accelerate BinaryResponseWriter.write, we extended this
> > writer class to do the docid-to-id transformation via docValues (in
> > memory), with no need to access the stored id field and no lazy loading
> > of fields, which also has a cost. That should improve the read rate, as
> > docValues are sequential, and should avoid disk IO. This docValues
> > implementation is used during both query stages (as mentioned above) if
> > you ask for ids only, or only once, during the distributed search stage,
> > if you intend to ask for stored fields other than id.
> >
> > We just started testing it for performance. I would love to hear any
> > opinions or performance results for this implementation.
> >
> > Manu
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
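Manuel's patch itself isn't shown in the thread; the one concrete prerequisite for that approach is that the unique key field carries docValues, e.g. in schema.xml (Solr 4.2+, illustrative only):

<field name="id" type="string" indexed="true" stored="true" docValues="true" />

With that in place, the docid-to-id step can read from memory-resident columnar data instead of stored fields, which is the saving Manuel describes.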


Re: Performance improvement for solr faceting on large index

2012-11-22 Thread Yuval Dotan
You could always try the fc facet method, and maybe increase the filterCache
size.
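To make that concrete (the values here are illustrative placeholders, not recommendations): the request-side change is swapping the facet method, e.g.

...&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=fc&facet.sort=index

and, if you stay with facet.method=enum (which builds one filter per term and leans on the filterCache), the cache can be grown in solrconfig.xml, e.g.

<filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="0"/>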

On Thu, Nov 22, 2012 at 2:53 PM, Pravin Agrawal <
pravin_agra...@persistent.co.in> wrote:

> Hi All,
>
> We are using solr 3.4 with following schema fields.
>
>
> ---
>
> [schema excerpt garbled in the mailing-list archive; the recoverable parts
> are: a text field type with positionIncrementGap="100" whose analyzer uses
> a ShingleFilterFactory (maxShingleSize="5", outputUnigrams="true") and a
> PatternReplaceFilterFactory (pattern="^([0-9. ])*$", replacement="",
> replace="all"), plus an indexed, multiValued field holding the
> autoSuggestContent, and several other fields]
>
> ---
>
> The index built on the above schema is distributed across two Solr shards,
> each with about 1.2 million documents and about 195 GB on disk.
>
> We want to retrieve (site, autoSuggestContent term, frequency of the term)
> information from our main Solr index described above. The site is a field
> in the document and contains the name of the site to which that document
> belongs. The terms are retrieved from the multivalued field
> autoSuggestContent, which is built from shingles of the content and title
> of the web page.
>
> As of now, we are using a facet query to retrieve (term, frequency of term)
> for each site. Below is a sample query (you may ignore the initial part of
> the query):
>
>
> http://localhost:8080/solr/select?indent=on&q=*:*&fq=site:www.abc.com&start=0&rows=0&fl=id&qt=dismax&facet=true&facet.field=autoSuggestContent&facet.mincount=25&facet.limit=-1&facet.method=enum&facet.sort=index
>
> The problem is that as the index grows, this method has started taking a
> huge amount of time. It used to take 7 minutes per site with an index of
> 0.4 million docs, but takes around 60-90 minutes with an index of 2.5
> million docs. At this speed it will take around 5-6 days to process all
> 1500 sites. We also expect the index to grow with more documents and more
> sites, so the time to get the above information will increase further.
>
> Please let us know if there is any better way to extract (site, term,
> frequency) information compared to the current method.
>
> Thanks,
> Pravin Agrawal
>
>
>
>
>


Re: Java out of memory - with fieldcache faceting

2012-04-30 Thread Yuval Dotan
Thanks for the fast answer
One more question:
Is there a way to know (some formula) how much memory I need for
these actions?

Thanks
Yuval
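There is no exact formula in the thread, but for FieldCache-based faceting on a single-valued string field the dominant heap costs are a packed ords array (one entry per document) plus the term bytes themselves. A rough back-of-envelope sketch; all numbers are hypothetical and should be replaced with your own counts:

// Very rough heap estimate for FieldCache-based faceting on a single-valued
// string field (facet.method=fc). All numbers below are hypothetical.
public class FieldCacheEstimate {
  public static void main(String[] args) {
    long maxDoc = 100000000L;      // documents in the index
    long uniqueTerms = 2000000L;   // distinct values of src_ip_str
    long avgTermBytes = 15L;       // average length of one value, in bytes

    // One packed ord per document: roughly ceil(log2(uniqueTerms)) bits each.
    long bitsPerOrd = 64 - Long.numberOfLeadingZeros(uniqueTerms - 1);
    long ordArrayBytes = maxDoc * bitsPerOrd / 8;

    // The term bytes are also held on-heap, plus a smaller offsets array.
    long termBytes = uniqueTerms * avgTermBytes;

    System.out.println("ords  ~" + (ordArrayBytes >> 20) + " MB");
    System.out.println("terms ~" + (termBytes >> 20) + " MB (plus offsets overhead)");
  }
}

The ord array usually dominates for high-cardinality fields, which is consistent with the enum method (which never uninverts the field) working here while fc blows the heap.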

On Mon, Apr 30, 2012 at 11:50, Dan Tuffery  wrote:

> You need to add more memory to the JVM that is running Solr:
>
> http://wiki.apache.org/solr/SolrPerformanceFactors#OutOfMemoryErrors
>
> Dan
>
> On Mon, Apr 30, 2012 at 9:43 AM, Yuval Dotan  wrote:
>
> > Hi Guys
> > I have a problem and i need your assistance
> > I get an exception when doing field cache faceting (the enum method works
> > perfectly):
> >
> > */solr/select?q=*:*&facet=true&facet.field=src_ip_str&facet.limit=10*
> >
> > 
> > java.lang.OutOfMemoryError: Java heap space
> > 
> > java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
> at
> >
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:449)
> > at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:277)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
> > at
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
> > at
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
> > at
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
> > at
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
> > at
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
> > at
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
> > at
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
> > at org.eclipse.jetty.server.Server.handle(Server.java:351) at
> >
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
> > at
> >
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
> > at
> >
> org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
> > at
> >
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
> > at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:634) at
> > org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230) at
> >
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
> > at
> >
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
> > at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
> > at
> >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
> > at java.lang.Thread.run(Thread.java:679) Caused by:
> > java.lang.OutOfMemoryError: Java heap space at
> > org.apache.lucene.util.packed.Direct16.<init>(Direct16.java:38) at
> > org.apache.lucene.util.packed.PackedInts.getMutable(PackedInts.java:267)
> at
> > org.apache.lucene.util.packed.GrowableWriter.set(GrowableWriter.java:81)
> at
> >
> org.apache.lucene.search.FieldCacheImpl$DocTermsIndexCache.createValue(FieldCacheImpl.java:1178)
> > at
> >
> org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:248)
> > at
> >
> org.apache.lucene.search.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:1081)
> > at
> >
> org.apache.lucene.search.FieldCacheImpl.getTermsIndex(FieldCacheImpl.java:1077)
> > at
> >
> org.apache.solr.request.SimpleFacets.getFieldCacheCounts(SimpleFacets.java:459)
> > at
> > org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:310)
> > at
> >
> org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:396)
> > at
> >
> org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:205)
> > at
> >
> org.apache.solr.handler.compo

Partition Question

2012-05-06 Thread Yuval Dotan
Hi All
We have an index of ~2,000,000,000 documents, and the query and facet times
are too slow for us.
Before using the shards solution for improving performance, we thought
about using the multicore feature (our goal is to maximize performance for
a single machine).
Most of our queries will be limited by time, hence we want to partition the
data by date/time.
We want to partition the data because the index size is too big and doesn't
fit into memory (80 GB).

1. Is multi-core the best way to implement my requirement?
2. I noticed there are some LOAD / UNLOAD actions on a core - should I use
these actions when managing my cores? If so, how can I LOAD a core that I
have unloaded?
for example:
I have 7 partitions / cores - one for each day of the week
In most cases I will search for documents only on the last day core.
Once every 1 queries I need documents from all cores.
Question: Do I need to unload all of the old cores and then load them on
demand (when I see I need data from these cores)?
3. If the answer to the last question is no, how do I ensure that the only
cores loaded into memory are the ones I want?

Thanks
Yuval
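For what it's worth, a hedged sketch of the CoreAdmin calls involved (host, port and core names are placeholders). As far as I can tell there is no separate LOAD action in this Solr generation: you UNLOAD a core and later bring it back by issuing CREATE again over the same instanceDir, which picks up the existing index:

http://localhost:8983/solr/admin/cores?action=UNLOAD&core=day3

http://localhost:8983/solr/admin/cores?action=CREATE&name=day3&instanceDir=day3&config=solrconfig.xml&schema=schema.xml

UNLOAD normally leaves the index files on disk (there is a deleteIndex=true option to remove them).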


Re: Partition Question

2012-05-08 Thread Yuval Dotan
Hi
Can someone please guide me to the right way to partition the solr index?

On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan  wrote:

> Hi All
> Jan, thanks for the reply - answers for your questions are located below
> Please update me if you have ideas that can solve my problems.
>
> First, some corrections to my previous mail:
>
> > Hi All
> > We have an index of ~2,000,000,000 Documents and the query and facet
> times
> > are too slow for us - our index in fact will be much larger
>
> > Most of our queries will be limited by time, hence we want to partition
> the
> > data by date/time - even when unlimited – which is mostly what will
> happen, we have results in the recent records and querying the whole
> dataset is redundant
>
> > We want to partition the data because the index size is too big and
> doesn't
> > fit into memory (80 Gb's) - our data actually continuously grows over
> time, it will never fit into memory, but has to be available for queries in
> case results are found in older records or a full facet is required
>
> >
> > 1. Is multi core the best way to implement my requirement?
> > 2. I noticed there are some LOAD / UNLOAD actions on a core - should i
> use
> > these action when managing my cores? if so how can i LOAD a core that i
> > have unloaded
> > for example:
> > I have 7 partitions / cores - one for each day of the week - we might
> have 2000 per day
>
> > In most cases I will search for documents only on the last day core.
> > Once every 1 queries I need documents from all cores.
> > Question: Do I need to unload all of the old cores and then load them on
> > demand (when i see i need data from these cores)?
> > 3. If the question to the last answer is no, how do i ensure that only
> > cores that are loaded into memory are the ones I want?
> >
> > Thanks
> > Yuval
> *
> *
> *Answers to Jan:*
>
> Hi,
>
> First you need to investigate WHY faceting and querying is too slow.
> What exactly do you mean by slow? Can you please tell us more about your
> setup?
>
> * How large documents and how many fields?
> small records ~200bytes, 20 fields avg most of them are not stored -
> attached schema and config file
>
> * What kind of queries? How many hits? How many facets? Have you studied
> the &debugQuery=true output?
> problem is not with queries being slow per se, it is with getting 50
> matches out of billions of matching docs
>
> * Do you use filter queries (fq) extensively?
> user generated queries, fq would not reduce the dataset for some of our
> usecases
>
> * What data do you facet on? Many unique values per field? Text or ranges?
> What facet.method?
>  problem is not just faceting, it’s with queries – let’s start there
>
> * What kind of hardware? RAM/CPU
> HP DL180G6 , 2 E5645 (12 core)
> 48 GB RAM
>  * How have you configured your JVM? How much memory? GC?
> java -Xms512M -Xmx40960M -jar start.jar
>
> As you see, you will have to provide a lot more information on your use
> case and setup in order for us to judge correct action to take. You might
> need to adjust your config, or to optimize your queries or caches, slim
> your schema, buy some more RAM, or an SSD :)
>
> Normally, going multi core on one box will not necessarily help in itself,
> as there is overhead in sharding multi cores as well. However, it COULD be
> a solution since you say that most of the time you only need to consider
> 1/7 of your data. I would perhaps consider one "hot" core for last 24h, and
> one "archive" core for older data. You could then tune these differently
> regarding caches etc.
>
> Can you get back with some more details?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
>


Re: Partition Question

2012-05-09 Thread Yuval Dotan
Thanks Lance

There is already a clear partition - as you assumed, by date.

My requirement is for the best setup for:
1. A *single machine*
2. Quickly changing index - so I need to have the option to load and unload
partitions dynamically

Do you think that the sharding model that solr offers is the most suitable
for this setup?
What about the solr multi core model?

On Wed, May 9, 2012 at 12:23 AM, Lance Norskog  wrote:

> Lucene does not support more than 2^32 unique documents, so you need to
> partition. In Solr this is done with Distributed Search:
>
> http://www.lucidimagination.com/search/link?url=http://wiki.apache.org/solr/DistributedSearch
>
> First, you have to decide a policy for which documents go to which
> 'shard'. It is common to make a hash code as the unique id, then
> distribute the documents modulo this value. This gives a roughly equal
> distribution of documents. If there is already a clear partition, like
> the date of the document (like newspaper articles) you could use that
> also.
>
> You have new documents and existing documents. For new documents you
> need code for this policy to get all new documents to the right index.
> This could be one master program that passes them out, or each indexer
> could know which documents it gets.
>
> If you want to split up your current index, that's different. I have
> done this: for each shard, make a copy of the full index,
> delete-by-query all of the documents that are NOT in that shard, and
> optimize. We had to do this in sequence so it took a few days :) You
> don't need a full optimize. Use 'maxSegments=50' or '100' to suppress
> that last final giant merge.
>
> On Tue, May 8, 2012 at 12:02 AM, Yuval Dotan  wrote:
> > Hi
> > Can someone please guide me to the right way to partition the solr index?
> >
> > On Mon, May 7, 2012 at 11:41 AM, Yuval Dotan 
> wrote:
> >
> >> Hi All
> >> Jan, thanks for the reply - answers for your questions are located below
> >> Please update me if you have ideas that can solve my problems.
> >>
> >> First, some corrections to my previous mail:
> >>
> >> > Hi All
> >> > We have an index of ~2,000,000,000 Documents and the query and facet
> >> times
> >> > are too slow for us - our index in fact will be much larger
> >>
> >> > Most of our queries will be limited by time, hence we want to
> partition
> >> the
> >> > data by date/time - even when unlimited – which is mostly what will
> >> happen, we have results in the recent records and querying the whole
> >> dataset is redundant
> >>
> >> > We want to partition the data because the index size is too big and
> >> doesn't
> >> > fit into memory (80 Gb's) - our data actually continuously grows over
> >> time, it will never fit into memory, but has to be available for
> queries in
> >> case results are found in older records or a full facet is required
> >>
> >> >
> >> > 1. Is multi core the best way to implement my requirement?
> >> > 2. I noticed there are some LOAD / UNLOAD actions on a core - should i
> >> use
> >> > these action when managing my cores? if so how can i LOAD a core that
> i
> >> > have unloaded
> >> > for example:
> >> > I have 7 partitions / cores - one for each day of the week - we might
> >> have 2000 per day
> >>
> >> > In most cases I will search for documents only on the last day core.
> >> > Once every 1 queries I need documents from all cores.
> >> > Question: Do I need to unload all of the old cores and then load them
> on
> >> > demand (when i see i need data from these cores)?
> >> > 3. If the question to the last answer is no, how do i ensure that only
> >> > cores that are loaded into memory are the ones I want?
> >> >
> >> > Thanks
> >> > Yuval
> >> *
> >> *
> >> *Answers to Jan:*
> >>
> >> Hi,
> >>
> >> First you need to investigate WHY faceting and querying is too slow.
> >> What exactly do you mean by slow? Can you please tell us more about your
> >> setup?
> >>
> >> * How large documents and how many fields?
> >> small records ~200bytes, 20 fields avg most of them are not stored -
> >> attached schema and config file
> >>
> >> * What kind of queries? How many hits? How many facets? Have you studies
> >> &debugQuery=true output?
> >> problem is not with queries b
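To make Lance's copy / delete-by-query / optimize recipe above concrete, the per-copy steps can be expressed as plain XML update messages POSTed to that copy's /update handler (the field name and date range below are placeholders, not from this thread):

<delete><query>*:* -timestamp:[2012-05-01T00:00:00Z TO 2012-05-07T23:59:59Z]</query></delete>
<commit/>
<optimize maxSegments="50"/>

The negative clause deletes everything outside that partition's date range, so only its documents remain, and maxSegments keeps the final step from collapsing into one giant merge, as Lance suggests.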

Re: Questions about query times

2012-10-10 Thread Yuval Dotan
OK so I solved the question about the query that returns no results and
still takes time - I needed to add the facet.mincount=1 parameter and this
reduced the time to 200-300 ms instead of seconds.
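For reference, the fix was a single extra parameter on the otherwise unchanged request (the full request is shown as query (2) below), along the lines of:

...&facet=true&facet.mincount=1&facet.pivot=product,Severity,trimTime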

I still couldn't figure out why a query that returns very few results (like
query number 2) still takes seconds to return even with
the facet.mincount=1 parameter.
I couldn't understand why the facet pivot takes so much time on 299 docs.

Does anyone have any idea?

Example Query:

(2)
q=*:*&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Severity:("High"
"Critical"))&fq=(trimTime:[2012-09-04T15:23:48Z TO
*])&fq=(Confidence_Level:("N/A")) OR (Confidence_Level:("Medium-High")) OR
(Confidence_Level:("High"))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime

NumFound: 299

Times(ms):
Qtime: 2,756 Query: 307 Facet: 2,449

On Thu, Sep 20, 2012 at 5:24 PM, Yuval Dotan  wrote:

> Hi,
>
> We have a system that inserts logs continuously (real-time).
> We have been using the Solr facet pivot feature for querying and have been
> experiencing slow query times and we were hoping to gain some insights with
> your help.
> schema and solrconfig are attached
>
> Here are our questions (data below):
>
>1. Why is facet time so long in (3) and (5) - in cases where there are
>0 or very few results?
>    2. We ran two queries that differ only in the time limit (for the
>    second query the time range is very small) - we got the same time for
>    both queries, although the second one returned very few results -
>    again, why is that?
>3. Is there a way to improve pivot facet time?
>
> System Data:
>
> Index size: 63 GB
> RAM: 4 GB
> CPU: 2 x Xeon E5410 2.33GHz
> Num of Documents: 109,278,476
>
>
> query examples:
>
> -
> (1)
> Query:
> q=*:*&fq=(trimTime:[2012-09-04T14:29:24Z TO
> *])&fq=(trimTime:[2012-09-04T14:29:24Z TO
> *])&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
>
> NumFound:
> 11,407,889
>
> Times (ms):
> Qtime: 3,239 Query: 353 Facet: 2,885
> -
>
> (2)
> Query:
> q=*:*&fq=(trimTime:[2012-09-04T15:23:48Z TO *])&fq=(Severity:("High"
> "Critical"))&fq=(trimTime:[2012-09-04T15:23:48Z TO
> *])&fq=(Confidence_Level:("N/A")) OR (Confidence_Level:("Medium-High")) OR
> (Confidence_Level:("High"))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
>
> NumFound: 299
>
> Times(ms):
> Qtime: 2,756 Query: 307 Facet: 2,449
>
> -
> (3)
> Query:
> q=*:*&fq=(trimTime:[2012-09-11T12:55:00Z TO *])&fq=(Severity:("High"
> "Critical"))&fq=(trimTime:[2012-09-04T15:23:48Z TO
> *])&fq=(Confidence_Level:("N/A")) OR (Confidence_Level:("Medium-High")) OR
> (Confidence_Level:("High"))&f.product.facet.sort=index&f.product.facet.limit=-1&f.Severity.facet.sort=index&f.Severity.facet.limit=-1&f.trimTime.facet.sort=index&f.trimTime.facet.limit=-1&facet=true&f.product.facet.method=enum&facet.pivot=product,Severity,trimTime
>
> NumFound: 7
>
> Times(ms):
> Qtime: 2,798 Query: 312 Facet: 2,485
>
> -
> (4)
> Query:
> q=*:*&fq=(trimTime:[2012-09-04T15:43:16Z TO
> *])&fq=(trimTime:[2012-09-04T15:43:16Z TO *])&fq=(product:("Application
> Control")) OR (product:("URL
> Filtering"))&f.appi_name.facet.sort=index&f.appi_name.facet.limit=-1&f.app_risk.facet.sort=index&f.app_risk