Re: SOLR OutOfMemoryError Java heap space

2014-03-06 Thread Angel Tchorbadjiiski

Hi Shawn,

a big thanks for the long and detailed answer. I am aware of how Linux 
uses free RAM for caching and of the problems related to the JVM and GC. It 
is nice to hear how this correlates to Solr. I'll take some time and 
think it over. The facet.method=enum and probably a combination of 
DocValues fields could be the solution needed in this case.


Thanks again to both of you and Toke for the feedback!

Cheers
Angel

On 05.03.2014 17:06, Shawn Heisey wrote:

On 3/5/2014 4:40 AM, Angel Tchorbadjiiski wrote:

Hi Shawn,

On 05.03.2014 10:05, Angel Tchorbadjiiski wrote:

Hi Shawn,


It may be your facets that are killing you here.  As Toke mentioned, you
have not indicated what your max heap is.  20 separate facet fields with
millions of documents will use a lot of fieldcache memory if you use the
standard facet.method, fc.

Try adding facet.method=enum to all your facet queries, or you can put
it in the defaults section of each request handler definition.
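
For reference, a minimal sketch of what that defaults entry could look like
in solrconfig.xml (the handler name and the other settings are placeholders,
not taken from this setup):

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="facet">true</str>
      <str name="facet.method">enum</str>
    </lst>
  </requestHandler>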

Ok, that is easy to try out.


Changing the facet.method does not really help, as the performance of the
queries is really bad. This is mostly due to the small cache values, but
even trying to tune them for the "enum" case didn't help much.

The number of documents and unique facet values seems to be too high.
Trying to cache them even with a size of 512 results in many misses and
Solr tries to repopulate the cache all the time. This makes the
performance even worse.


Good performance with Solr requires a fair amount of memory.  You have
two choices when it comes to where that memory gets used - inside Solr
in the form of caches, or free memory, available to the operating system
for caching purposes.

Solr caches are really amazing things.  Data gathered for one query can
significantly speed up another query, because part (or all) of that
query can be simply skipped, the results read right out of the cache.

There are two potential problems with relying exclusively on Solr
caches, though.  One is that they require Java heap memory, which
requires garbage collection.  A large heap causes GC issues, some of
which can be alleviated by GC tuning.  The other problem is that you
must actually do a query in order to get the data into the cache.  When
you do a commit and open a new searcher, that cache data goes away, so
you have to do the query over again.

The primary reason for slow uncached queries is disk access.  Reading
index data off the disk is a glacial process, comparatively speaking.
This is where OS disk caching becomes a benefit.  Most queries, even
complex ones, become lightning fast if all of the relevant index data is
already in RAM and no disk access is required.  When queries are fast to
begin with, you can reduce the cache sizes in Solr, reducing the heap
requirements.  With a smaller heap, more memory is available for the OS
disk cache.

The facet.method=enum parameter shifts the RAM requirement from Solr to
the OS.  It does not really reduce the amount of required system memory.
  Because disk caching is a kernel level feature and does not utilize
garbage collection, it is far more efficient than Solr ever could be at
caching *raw* data.  Solr's caches are designed for *processed* data.

What this all boils down to is that I suspect you'll simply need more
memory on the machine.  With facets on so many fields, your queries are
probably touching nearly the entire index, so you'll want to put the
entire index into RAM.

Therefore, after Solr allocates its heap and any other programs on the
system allocate their required memory, you must have enough memory left
over to fit all (or most) of your 50GB index data.  Combine this with
facet.method=enum and everything should be good.




Re: need suggestions for storing TBs of structured data in SolrCloud

2014-03-06 Thread Toke Eskildsen
On Thu, 2014-03-06 at 08:17 +0100, Chia-Chun Shih wrote:
>1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
>2. One collection serves one day. 200-day history data is required.

So once your data are indexed, they will not change? It seems to me that
1 shard/day is a fine choice. Consider optimizing down to a single
segment when a day's data has been indexed.
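
A single call against that day's collection should be enough for the
optimize; a hedged example (host, port and collection name are made up):

  curl 'http://localhost:8983/solr/collection_20140306/update?optimize=true&maxSegments=1'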

It sounds like your indexing needs CPU power, while your searches are
likely to be I/O bound. You might consider a dedicated indexing machine,
if it is acceptable that data only go live when a day's indexing has
been finished (and copied). 

>3. Take less than 10 hours to build one-day index.
>4. Allow to execute an ordinary query (may span 1~7 days) in 10 minutes

Since you want to have 200 days and each day takes about 60GB (guessing
from your test), we're looking at 12TB of index at any time.

At the State and University Library, Denmark, we are building an index
for our web archive. We estimate about 20TB of static index to begin
with. We have done some tests of up to 16*200GB clouded indexes (details
at http://sbdevel.wordpress.com/2013/12/06/danish-webscale/ ) and our
median was about 1200ms for simple queries with light faceting, when we
used a traditional spinning drives backend. That put our estimated
median search time for the full corpus at 10 seconds, which was too slow
for us.

With a response time requirement of 10 minutes, which seems extremely
generous in these sub-second times, I am optimistic that "Just make
daily blocks and put them on traditional storage" will work for you.
Subject to your specific data and queries of course.

If you want a whole other level of performance then use SSDs as your
backend. Especially for your large index scenario, where it is very
expensive to try and compensate for slow spinning drives with RAM. We
designed our search machine around commodity SSDs (Samsung 840) and it
was, relative to data size and performance, dirt cheap.

>5. concurrent user < 10

Our measurements showed that for this amount of data on spinning drives,
throughput was nearly independent of threads: 4 concurrent requests
meant 4 times as long as a single request. YMMV.

Your corpus does represent an extra challenge as it sounds like most of
the indexes will be dormant most of the time. As disk cache favours
often accessed data, I'm guessing that you will get some very ugly
response times when you process one of the rarer queries.

> I have built an experimental SolrCloud based on 3 VMs, each equipped with 8
> cores, 64GB RAM.  Each collection has 3 shards and no replication. Here are
> my findings:
> 
>1. Each collection's actual index size is between 30GB to 90GB,
>depending on the number of stored field.

I'm guessing that 30-90GB is a day's worth of data? How many documents
does a shard contain?

>2. It takes 6 to 12 hours to load raw data. I use multiple (15~30)
>threads to launch http requests. (http://wiki.apache.org/solr/UpdateCSV)

I'm guessing that profiling, tweaking and fiddling will shave the top 2
hours from those numbers.

Regards,
Toke Eskildsen, State and University Library, Denmark




Need help regarding SOLR Fulltext search

2014-03-06 Thread Raman Jhajj
Hello Everyone,

Let me first introduce myself: I am Raman, a Masters of CS student. I
am doing a project for my studies which needs the use of SOLR. For some
reasons I have to use SOLR 4.3.0 for the project.

I am facing an issue with page numbers in the search result. I came across a
workaround for that, https://issues.apache.org/jira/browse/SOLR-380, but this
one is quite old and is not working now with 4.3.0.

I need some help regarding this, if anyone can please help me out. I am new
to SOLR and don't know much.

My particular issue is, we have several manuals of research data which are
in TEI format. I have indexed them for full-text search and highlighting.
Everything is working fine so far. Now, when I show results in the webpage
and a particular result is clicked, I need to move to that particular page in
the TEI-formatted document and display it. I can display the whole document by
just giving an anchor tag for the document, but I am not sure how to move
to a specific page. I am a bit confused about how I can implement a solution for
this. If anyone has worked on something like this, can you please guide me?
I am not finding any way out.

-- 
Kind Regards,

*Ramaninder Singh Jhajj*


Re: Need help regarding SOLR Fulltext search

2014-03-06 Thread Ahmet Arslan
Hi Raman,

I did similar project, this is how :

1) Index page by page. The Solr document (unit of retrieval) will be a page. You can 
generate a uniqueKey by concatenating docId and pageNo => doc50_page0. With 
this you will have the page number information. (A small SolrJ sketch follows the links below.) 

2) Later on you can group by document_id with 
https://wiki.apache.org/solr/FieldCollapsing

OR 

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-CollapsingQueryParser
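
A minimal SolrJ sketch of option 1 (the field names, the page splitting, and
the collection URL are assumptions, not part of Ahmet's actual setup):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class TeiPageIndexer {
    public static void main(String[] args) throws Exception {
      SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
      String docId = "doc50";                                  // id of the whole TEI manual
      String[] pages = {"page 0 text ...", "page 1 text ..."}; // one entry per TEI page
      for (int pageNo = 0; pageNo < pages.length; pageNo++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", docId + "_page" + pageNo);          // uniqueKey = docId + pageNo
        doc.addField("document_id", docId);                    // used later for grouping/collapsing
        doc.addField("page_no", pageNo);
        doc.addField("content", pages[pageNo]);
        server.add(doc);
      }
      server.commit();
      server.shutdown();
    }
  }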

Ahmet



On Thursday, March 6, 2014 1:06 PM, Raman Jhajj  wrote:
Hello Everyone,

Let me first introduce myself: I am Raman, a Masters of CS student. I
am doing a project for my studies which needs the use of SOLR. For some
reasons I have to use SOLR 4.3.0 for the project.

I am facing an issue with page numbers in the search result. I came across a
workaround for that, https://issues.apache.org/jira/browse/SOLR-380, but this
one is quite old and is not working now with 4.3.0.

I need some help regarding this, if anyone can please help me out. I am new
to SOLR and don't know much.

My particular issue is, we have several manuals of research data which are
in TEI format. I have indexed them for full-text search and highlighting.
Everything is working fine so far. Now, when I show results in the webpage
and a particular result is clicked, I need to move to that particular page in
the TEI-formatted document and display it. I can display the whole document by
just giving an anchor tag for the document, but I am not sure how to move
to a specific page. I am a bit confused about how I can implement a solution for
this. If anyone has worked on something like this, can you please guide me?
I am not finding any way out.

-- 
Kind Regards,

*Ramaninder Singh Jhajj*



Re: Min Number Should Match (mm) and joins

2014-03-06 Thread mm

Any suggestions?


Quoting m...@preselect-media.com:


Hello,

I'm using eDisMax to do scoring for my search results.
I have a nested structure of documents. The main (parent) document  
with meta data and the child documents with fulltext content. So I  
have to join them.


My qf looks like this "title^40.0 subtitle^40.0 original_title^10.0  
keywords^5.0" and my query like this:


test _query_:"{!join from=doc_id to=id}{!dismax qf='content  
content_de content_en content_fr content_it content_es' v='test'}"^5


If I use the mm attribute, for example with something like "2<-25%", I  
get all results where the term "test" was found in the metadata of  
the main document. My problem now: I have documents where the  
search term is not in the metadata of the main document, but in the  
fulltext content of the child document. If I use mm, these results  
are never shown if the term is not in the main document. If I don't  
use mm at all, I get strange results with documents that don't  
contain the term at all.


Is there a solution for this problem?

Thx
- Moritz






Mixing lucene scoring and other scoring

2014-03-06 Thread Benson Margulies
Some months ago, I talked to some people at LR about this, but I can't
find my notes.

Imagine a function of some fields that produces a score between 0 and 1.

Imagine that you want to combine this score with relevance over some
more or less complex ordinary query.

What are the options, given the arbitrary nature of Lucene scores?


Re: Polygon search returning "Invalid Number" error.

2014-03-06 Thread leevduhl
My bad, I think this error was actually a result of using the Solr Admin
utility to query the index and the query I entered included the double
quotes.

However, this left me with a different error that I may post a question
about if I cannot figure it out.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Polygon-search-returning-Invalid-Number-error-tp4121189p4121677.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Mixing lucene scoring and other scoring

2014-03-06 Thread Otis Gospodnetic
Hi Benson,

http://lucene.apache.org/core/4_7_0/expressions/org/apache/lucene/expressions/Expression.html
https://issues.apache.org/jira/browse/SOLR-5707

That?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
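
Another common option (not necessarily what Benson has in mind) is to store
the precomputed 0-1 value in a field and fold it into the score with a
function, e.g. via the boost query parser; a hedged sketch, assuming a float
field named quality_score:

  q={!boost b=quality_score v=$qq}&qq=(your complex ordinary query)

or, with edismax, a multiplicative boost=quality_score parameter. Since raw
Lucene scores are unbounded, multiplying by the 0-1 value keeps the relative
ordering within a query instead of trying to add two incompatible scales.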


On Thu, Mar 6, 2014 at 8:34 AM, Benson Margulies wrote:

> Some months ago, I talked to some people at LR about this, but I can't
> find my notes.
>
> Imagine a function of some fields that produces a score between 0 and 1.
>
> Imagine that you want to combine this score with relevance over some
> more or less complex ordinary query.
>
> What are the options, given the arbitrary nature of Lucene scores?
>


Re: Replicating Between Solr Clouds

2014-03-06 Thread perdurabo
Toby Lazar wrote
> Unless Solr is your system of record, aren't you already replicating your
> source data across the WAN?  If so, could you load Solr in colo B from
> your colo B data source?  You may be duplicating some indexing work, but
> at least your colo B Solr would be more closely in sync with your colo B
> data.

Our system of record exists in a SQL DB that is indeed replicated via
always-on mirroring to the failover data center.  However, a complete forced
re-index of all of the data could take hours and our SLA requires us to be
back up with searchable indices in minutes.  Because we may have to
replicate multiple data centers' data (three plus data centers, A, B and the
failover DC) into this failover data center, we can't dedicate the failover
data center's SolrCloud to constantly re-index data from a single SQL mirror
when we could potentially need it to take over for any given one. 

One thought we had was to have a situation where DCs A and B would run a
cron job that would force a backup of the indices using the
"replication?command=backup" API command, and then we would sync those
backup snapshots to the failover DC's shut-down SolrCloud instance, in a
separate filesystem directory dedicated to DC A's or DC B's indices.  Then,
in the case of a failover, we would have to run a script that would symlink
the snapshots for the particular DC we want to fail over for to the index dir
for the failover DC's SolrCloud and then start up the nodes.  The problem
comes with how to handle different indices on different nodes in the
SolrCloud when we have 2 shards.  We would have to do a 1:1 copy of each of
the four nodes in DCs A and B to each of the other nodes in the failover DC. 
Sounds pretty ugly.

Looking at this thread, even this plan may not work:
http://lucene.472066.n3.nabble.com/solrcloud-shards-backup-restoration-td4088447.html

As far as the SolrEntityProcessor, I'm not sure how you would configure it. 
From what I gather, you have to configure a new requestHandler section in
your solrconfig.xml like this:



<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">/data/solr/mysolr/conf/data-config.xml</str>
  </lst>
</requestHandler>

And then you have to configure a "/data/solr/mysolr/conf/data-config.xml"
with the following contents:

<dataConfig>
  <document>
    <entity processor="SolrEntityProcessor"
            url="http://solrsource.example.com:8983/solr/" query="*:*"/>
  </document>
</dataConfig>


However, this doesn't seem to work for me as I'm using a SolrCloud with
zookeeper.  I created these files in my conf directory and uploaded them to
zookeeper, then reloaded the collection/cores but all I got were
initialization errors.  I don't think the docs assume you'll be doing this
under a SolrCloud scenario.

Any other insight?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4121685.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing huge data

2014-03-06 Thread Rallavagu

Erick,

That helps so I can focus on the problem areas. Thanks.

On 3/5/14, 6:03 PM, Erick Erickson wrote:

Here's the easiest thing to try to figure out where to
concentrate your energies. Just comment out the
server.add call in your SolrJ program. Well, and any
commits you're doing from SolrJ.

My bet: Your program will run at about the same speed
it does when you actually index the docs, indicating that
your problem is in the data acquisition side. Of course
the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running
Solr. I often see it idling along < 30% when indexing, or
even < 10%, again indicating that the bottleneck is on the
acquisition side.
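
A hedged sketch of that test in SolrJ (the data-fetching part is a
placeholder for whatever currently pulls the 6 million records):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import java.util.Collections;
  import java.util.List;

  public class IndexBenchmark {
    public static void main(String[] args) throws Exception {
      SolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
      for (String[] row : fetchRows()) {            // the data acquisition side
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", row[0]);
        doc.addField("title", row[1]);
        // server.add(doc);  // commented out: if the run is still slow, the
        //                   // bottleneck is data acquisition, not Solr
      }
      // server.commit();    // also skipped for the test
      server.shutdown();
    }

    // stand-in for the real code that reads from the db and the other sources
    static List<String[]> fetchRows() {
      return Collections.emptyList();
    }
  }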

Note I haven't mentioned any solutions, I'm a believer in
identifying the _problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky  wrote:

Make sure you're not doing a commit on each individual document add. Commit
every few minutes or every few hundred or few thousand documents is
sufficient. You can set up auto commit in solrconfig.xml.
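
A hedged example of such an autoCommit block in solrconfig.xml (the interval
values are arbitrary):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>300000</maxTime>            <!-- hard commit every 5 minutes -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <maxTime>60000</maxTime>             <!-- soft commit every minute for visibility -->
    </autoSoftCommit>
  </updateHandler>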

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best practices/common practices to index/re-index huge
amounts of data in Solr. The data is about 6 million entries in the db
and other sources (the data is not located in one resource). I am trying a
solrj-based solution to collect data from different resources and index
it into Solr. It takes hours to index into Solr.

Thanks in advance


Re: Indexing huge data

2014-03-06 Thread Rallavagu
Yeah. I have thought about spitting out JSON and running it against Solr 
using parallel HTTP threads separately. Thanks.


On 3/5/14, 6:46 PM, Susheel Kumar wrote:

One more suggestion is to collect/prepare the data in CSV format (a 1-2 million document sample, 
depending on size) and then import the data directly into Solr using the CSV handler & curl.  
This will give you the pure indexing time & the differences.
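
A hedged example of that kind of direct CSV load (URL, file name and
parameters are assumptions):

  curl 'http://localhost:8983/solr/collection1/update/csv?commit=true&separator=%2C&header=true' \
       --data-binary @sample.csv -H 'Content-type: text/csv; charset=utf-8'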

Thanks,
Susheel

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data

Here's the easiest thing to try to figure out where to concentrate your 
energies. Just comment out the server.add call in your SolrJ program. Well, 
and any commits you're doing from SolrJ.

My bet: Your program will run at about the same speed it does when you actually 
index the docs, indicating that your problem is in the data acquisition side. 
Of course the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running Solr. I often see it idling 
along < 30% when indexing, or even < 10%, again indicating that the bottleneck 
is on the acquisition side.

Note I haven't mentioned any solutions, I'm a believer in identifying the 
_problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky  wrote:

Make sure you're not doing a commit on each individual document add.
Commit every few minutes or every few hundred or few thousand
documents is sufficient. You can set up auto commit in solrconfig.xml.

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best practices/common practices to index/re-index huge
amounts of data in Solr. The data is about 6 million entries in the db
and other sources (the data is not located in one resource). I am trying a
solrj-based solution to collect data from different resources and
index it into Solr. It takes hours to index into Solr.

Thanks in advance


Polygon search returning "InvalidShapeException: incompatible dimension (2)... error.

2014-03-06 Thread leevduhl
Getting the following error when attempting to run a polygon query from the
Solr Admin utility: :"com.spatial4j.core.exception.InvalidShapeException:
incompatible dimension (2) and values (Intersects).  Only 0 values
specified",
"code":400

My query is as follows:
q=geoloc:Intersects(POLYGON((-83.6349 42.4718, -83.5096 42.471868, -83.5096
42.4338, -83.6349 42.4338, -83.6349 42.4718)))

The response is as follows:
{
  "responseHeader":{
"status":400,
"QTime":2,
"params":{
  "debugQuery":"true",
  "fl":"id, openhousestartdate, geoloc",
  "sort":"openhousestartdate desc",
  "indent":"true",
  "q":"geoloc:Intersects(POLYGON((83.6349 42.4718, 83.5096 42.471868,
83.5096 42.4338, 83.6349 42.4338, 83.6349 42.4718)))",
  "wt":"json"}},
  "error":{
"msg":"com.spatial4j.core.exception.InvalidShapeException: incompatible
dimension (2) and values (Intersects).  Only 0 values specified",
"code":400}}

My "geoloc" dimension/field is setup as follows in my Schema.xml:


Some sample document "geoloc" data is shown below.
"docs": [
  {
"geoloc": "-82.549200,43.447400"
  },
  {
"geoloc": "-82.671551,43.421797"
  }
]

My Solr version info is as follows:
solr-spec: 4.6.1
solr-impl: 4.6.1 1560866 - mark - 2014-01-23 20:21:50
lucene-spec: 4.6.1
lucene-impl: 4.6.1 1560866 - mark - 2014-01-23 20:11:13

Any info on a solution to this problem would be appreciated.

Thanks
Lee



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Polygon-search-returning-InvalidShapeException-incompatible-dimension-2-error-tp4121704.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: need suggestions for storing TBs of structured data in SolrCloud

2014-03-06 Thread Shawn Heisey
On 3/6/2014 12:17 AM, Chia-Chun Shih wrote:
> I am planning a system for searching TB's of structured data in SolrCloud.
> I need suggestions for handling such huge amount of data in SolrCloud.
> (e.g., number of shards per collection, number of nodes, etc.)
> 
> Here are some specs of the system:
> 
>1. Raw data is 35,000 CSV files per day. Each file is about 5 MB.
>2. One collection serves one day. 200-day history data is required.
>3. Take less than 10 hours to build one-day index.
>4. Allow to execute an ordinary query (may span 1~7 days) in 10 minutes
>5. concurrent user < 10
> 
> I have built an experimental SolrCloud based on 3 VMs, each equipped with 8
> cores, 64GB RAM.  Each collection has 3 shards and no replication. Here are
> my findings:
> 
>1. Each collection's actual index size is between 30GB to 90GB,
>depending on the number of stored field.
>2. It takes 6 to 12 hours to load raw data. I use multiple (15~30)
>threads to launch http requests. (http://wiki.apache.org/solr/UpdateCSV)

Nobody can give you any specific answers because there are simply too
many variables:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

You do have one unusually loose restriction there -- that the query must
take less than 10 minutes.  Most people tend to say that it must take
less than a second, but they'll settle for several seconds.  Almost any
reasonable way you could architect your system will probably take less
than ten minutes for a query.

With this much data and potentially a LOT of servers, you might run into
limits that require config changes to address.  Things like the thread
limits on the servlet container, connection limits on the shard handler
in Solr, etc.

These blog posts (there are two pages of them) may interest you:

http://www.hathitrust.org/blogs/large-scale-search

One thing that I can tell you is that the more RAM you can get your
hands on, the better it will perform.  Ideally you'd have as much free
memory across the whole system as the entire size of your Solr indexes.
 The problem with this idea for you is that with 200 collections
averaging 60GB, that's about twelve terabytes of memory across all your
servers -- for one single copy of the index.  You'll probably want at
least two copies, so you can survive at least one hardware failure.  If
you can't get enough RAM to cache the whole index, putting the index
data on SSD can make a MAJOR difference.

Some strong advice: do everything you can to reduce the size of your
index, which reduces the OS disk cache (RAM) requirements.  Don't store
all your fields.  Use less aggressive tokenization where possible.
Avoid termVectors and docValues unless they are actually needed.  Omit
anything you can -- term frequencies, positions, norms, etc.
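
A hedged illustration of those schema-level savings (the field names and
types are placeholders):

  <!-- searched but not stored, no norms, no term frequencies/positions -->
  <field name="body" type="text_general" indexed="true" stored="false"
         omitNorms="true" omitTermFreqAndPositions="true"/>
  <!-- stored for display only, never searched -->
  <field name="raw_record" type="string" indexed="false" stored="true"/>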

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn



Re: SOLR OutOfMemoryError Java heap space

2014-03-06 Thread Divyang Shah
hi,
the heap problem is due to memory being full.
you should remove unnecessary data and restart the server once.



On Thursday, 6 March 2014 10:39 AM, Angel Tchorbadjiiski 
 wrote:
 
Hi Shawn,

a big thanks for the long and detailed answer. I am aware of how Linux 
uses free RAM for caching and of the problems related to the JVM and GC. It 
is nice to hear how this correlates to Solr. I'll take some time and 
think it over. The facet.method=enum and probably a combination of 
DocValues fields could be the solution needed in this case.

Thanks again to both of you and Toke for the feedback!

Cheers
Angel


On 05.03.2014 17:06, Shawn Heisey wrote:
> On 3/5/2014 4:40 AM, Angel Tchorbadjiiski wrote:
>> Hi Shawn,
>>
>> On 05.03.2014 10:05, Angel Tchorbadjiiski wrote:
>>> Hi Shawn,
>>>
 It may be your facets that are killing you here.  As Toke mentioned, you
 have not indicated what your max heap is.  20 separate facet fields with
 millions of documents will use a lot of fieldcache memory if you use the
 standard facet.method, fc.

 Try adding facet.method=enum to all your facet queries, or you can put
 it in the defaults section of each request handler definition.
>>> Ok, that is easy to try out.
>>>
>> Changing the facet.method does not help really as the performance of the
>> queries is really bad. This lies mostly on the small cache values, but
>> even trying to tune them for the "enum" case didn't help much.
>>
>> The number of documents and unique facet values seems to be too high.
>> Trying to cache them even with a size of 512 results in many misses and
>> Solr tries to repopulate the cache all the time. This makes the
>> performances even worse.
>
> Good performance with Solr requires a fair amount of memory.  You have
> two choices when it comes to where that memory gets used - inside Solr
> in the form of caches, or free memory, available to the operating system
> for caching purposes.
>
> Solr caches are really amazing things.  Data gathered for one query can
> significantly speed up another query, because part (or all) of that
> query can be simply skipped, the results read right out of the cache.
>
> There are two potential problems with relying exclusively on Solr
> caches, though.  One is that they require Java heap memory, which
> requires garbage collection.  A large heap causes GC issues, some of
> which can be alleviated by GC tuning.  The other problem is that you
> must actually do a query in order to get the data into the cache.  When
> you do a commit and open a new searcher, that cache data goes away, so
> you have to do the query over again.
>
> The primary reason for slow uncached queries is disk access.  Reading
> index data off the disk is a glacial process, comparatively speaking.
> This is where OS disk caching becomes a benefit.  Most queries, even
> complex ones, become lightning fast if all of the relevant index data is
> already in RAM and no disk access is required.  When queries are fast to
> begin with, you can reduce the cache sizes in Solr, reducing the heap
> requirements.  With a smaller heap, more memory is available for the OS
> disk cache.
>
> The facet.method=enum parameter shifts the RAM requirement from Solr to
> the OS.  It does not really reduce the amount of required system memory.
>   Because disk caching is a kernel level feature and does not utilize
> garbage collection, it is far more efficient than Solr ever could be at
> caching *raw* data.  Solr's caches are designed for *processed* data.
>
> What this all boils down to is that I suspect you'll simply need more
> memory on the machine.  With facets on so many fields, your queries are
> probably touching nearly the entire index, so you'll want to put the
> entire index into RAM.
>
> Therefore, after Solr allocates its heap and any other programs on the
> system allocate their required memory, you must have enough memory left
> over to fit all (or most) of your 50GB index data.  Combine this with
> facet.method=enum and everything should be good.

Re: Replicating Between Solr Clouds

2014-03-06 Thread Shawn Heisey
On 3/6/2014 7:54 AM, perdurabo wrote:
> Toby Lazar wrote
>> Unless Solr is your system of record, aren't you already replicating your
>> source data across the WAN?  If so, could you load Solr in colo B from
>> your colo B data source?  You may be duplicating some indexing work, but
>> at least your colo B Solr would be more closely in sync with your colo B
>> data.
> 
> Our system of record exists in a SQL DB that is indeed replicated via
> always-on mirroring to the failover data center.  However, a complete forced
> re-index of all of the data could take hours and our SLA requires us to be
> back up with searchable indices in minutes.  Because we may have to
> replicate multiple data centers' data (three plus data centers, A, B and the
> failover DC) into this failover data center, we can't dedicate the failover
> data center's SolrCloud to constantly re-index data from a single SQL mirror
> when we could potentially need it to take over for any given one. 

There are a lot of issues with availability and multiple data centers
that must be addressed before SolrCloud can handle this all internally.

Until that day comes, here's what I would do:

Have a SolrCloud install at each online data center, just as you already
do.  It should have collection names that are unique to the functions of
that DC, and may include the DC name.  If you MUST have the same
collection name in all online data centers despite there being different
data, you can use collection aliasing.  The actual collection name would
be something like stuff_dca, but you'd have an alias called stuff that
can be used for both indexing and querying.
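
A hedged example of setting up such an alias with the Collections API (host
and names are made up):

  curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=stuff&collections=stuff_dca'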

You would need to index the data for all data centers to the SolrCloud
install at the failover DC.  Ideally that would be done from the
failover DC's SQL, not over the WAN ... but it really wouldn't matter.
Because each production DC collection will have a unique name, all
collections can coexist on the failover SolrCloud.  If a failover
becomes necessary, you can make or change any required collection
aliases on the fly.

Although I don't use SolrCloud, and I don't have multiple data centers,
my own index uses a similar paradigm.  I have two completely independent
copies of my index.  My indexing program knows about them both and
indexes them independently.

There is another benefit to this: I can make changes (Solr upgrades, new
config/schema, a complete rebuild, etc.) to one copy of my index without
affecting the search application at all.  By simply enabling or
disabling the ping handler in Solr, my load balancer will keep requests
going to whichever copy I choose.

Thanks,
Shawn



Race condition in Leader Election

2014-03-06 Thread KNitin
Hi

 When restarting a node in SolrCloud, I run into scenarios where both the
replicas for a shard get into "recovering" state and never come up causing
the error "No servers hosting this shard". To fix this, I either unload one
core or restart one of the nodes again so that one of them becomes the
leader.

Is there a way to "force" leader election for a shard for solrcloud? Is
there a way to break ties automatically (without restarting nodes) to make
a node the leader for the shard?


Thanks
Nitin


SolrCloud setup guidance

2014-03-06 Thread Priti Solanki
Hello Everyone,

I would like to take your guidance on the following:

I have a single core with 124 GB of index data size. Indexing and reading
are both very slow, as I have 7 GB RAM to support this huge data.  Almost 8
million documents.

Hence, we thought of going to SolrCloud so that we can accommodate more
upcoming data. I have data for 13 countries with their millions of products
and we want to set up SolrCloud for the same.

I am in need of some initial thoughts about how to set up SolrCloud for such a
requirement. How do we come to know how many nodes/cores I would need
to support this...

We are thinking of hosting this on Amazon... Any guidance, reading links,
or case studies will be highly appreciated.

Regards,


Re: Replicating Between Solr Clouds

2014-03-06 Thread perdurabo
Well, I think I finally figured out how to get SolrEntityProcessor to work,
but there are still some issues.  I had to add a library path to
solrconfig.xml, but the cores are finally coming up and I am now manually
able to run a data import that does seem to index all of the documents on
the remote SolrCloud.  I ran into the issue here where I got version
conflicts:

http://lucene.472066.n3.nabble.com/Version-conflict-during-data-import-from-another-Solr-instance-into-clean-Solr-td4046937.html

I used the suggestion of adding fl="*,old_version:_version_" to the
data-config.xml entity config line.  This seems to be working but I don't
know if this will cause a problem.  When I do a manual data import I get the
correct number of documents from the source SolrCloud (the total number of
docs added up between both shards is 6357 in this test case):

Indexing completed. Added/Updated: 6,357 documents. Deleted 0 documents.
(Duration: 22s)
Requests: 0 (0/s), Fetched: 6,357 (289/s), Skipped: 0, Processed: 6,357 

However, when I check the number of docs indexed for each shard in the core
admin UI on the destination SolrCloud, the numbers are way off and a lot
less than 6357.  There's nothing in the logs to indicate collisions or
dropped documents.  What could account for the disparity?

I would assume down the road what I need to do is configure multiple
collections/cores on the failover cluster representing each DC its
replicating from, but how would you create multiple collections when using
zookeeper?  How do you upload multiple sets of config files for each one and
keep them separate?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Replicating-Between-Solr-Clouds-tp4121196p4121737.html
Sent from the Solr - User mailing list archive at Nabble.com.


Date Range Query taking more time.

2014-03-06 Thread Vijay Kokatnur
I am working with a date range query that is not giving me fast response
times.  After modifying the date range construct based on advice from several
forums, response time is now around 200ms, down from 2-3 secs.

However, I was wondering if there is still some way to improve upon it, as
queries without the date range have around 2-10ms latency.

Query : To look up upcoming booked trips for a user whenever he logs in to
the app-

q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
AND ClientID:4 AND StartDate:[NOW/DAY TO NOW/DAY+1YEAR]

Date configuration in Schema :

 


Appreciate any inputs.

Thanks!


Re: Date Range Query taking more time.

2014-03-06 Thread Ahmet Arslan
Hi,

Since your range query has NOW in it, it won't be cached meaningfully.
http://solr.pl/en/2012/03/05/use-of-cachefalse-and-cost-parameters/

This is untested but can you try this?

&q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
&fq=Status:Booked
&fq=ClientID:4
&fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]




On Thursday, March 6, 2014 8:29 PM, Vijay Kokatnur  
wrote:
I am working with date range query that is not giving me faster response
times.  After modifying date range construct after reading several forums,
response time now is around 200ms, down from 2-3secs.

However, I was wondering if there still some way to improve upon it as
queries without date range have around 2-10ms latency,

Query : To look up upcoming booked trips for a user whenever he logs in to
the app-

q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
ANDClientID:4 AND  StartDate:[NOW/DAY TO NOW/DAY+1YEAR]

Date configuration in Schema :




Appreciate any inputs.

Thanks!



Re: Date Range Query taking more time.

2014-03-06 Thread Vijay Kokatnur
Ahmet, I have tried filter queries before to fine-tune query performance.

However, whenever we use filter queries the response time goes up and
stays there.  With the above change, the response time was consistently
around 4-5 secs.  We are using the default cache settings.

Are there any settings I missed?


On Thu, Mar 6, 2014 at 10:44 AM, Ahmet Arslan  wrote:

> Hi,
>
> Since your range query has NOW in it, it won't be cached meaningfully.
> http://solr.pl/en/2012/03/05/use-of-cachefalse-and-cost-parameters/
>
> This is untested but can you try this?
>
> &q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
> &fq=Status:Booked
> &fq=ClientID:4
> &fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
>
>
>
> On Thursday, March 6, 2014 8:29 PM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
> I am working with date range query that is not giving me faster response
> times.  After modifying date range construct after reading several forums,
> response time now is around 200ms, down from 2-3secs.
>
> However, I was wondering if there still some way to improve upon it as
> queries without date range have around 2-10ms latency,
>
> Query : To look up upcoming booked trips for a user whenever he logs in to
> the app-
>
> q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
> ANDClientID:4 AND  StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
> Date configuration in Schema :
>
> 
>  positionIncrementGap="0"/>
>
> Appreciate any inputs.
>
> Thanks!
>
>


Re: Race condition in Leader Election

2014-03-06 Thread Mark Miller
Are you using an old version?

- Mark

http://about.me/markrmiller

On Mar 6, 2014, at 11:50 AM, KNitin  wrote:

> Hi
> 
> When restarting a node in solrcloud, i run into scenarios where both the
> replicas for a shard get into "recovering" state and never come up causing
> the error "No servers hosting this shard". To fix this, I either unload one
> core or restart one of the nodes again so that one of them becomes the
> leader.
> 
> Is there a way to "force" leader election for a shard for solrcloud? Is
> there a way to break ties automatically (without restarting nodes) to make
> a node as the leader for the shard?
> 
> 
> Thanks
> Nitin



Re: Date Range Query taking more time.

2014-03-06 Thread Ahmet Arslan
Hi,

Did you try with non-cached filter queries before?
Cached filter queries are useful when they are re-used. How often do you commit?

I thought that we could do something if we disable caching of the filter queries and 
manipulate their execution order with the cost parameter.

What happens with this :
&q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
&fq={!cache=false cost=100}Status:Booked
&fq={!cache=false cost=50}ClientID:4

&fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR] 



On Thursday, March 6, 2014 9:15 PM, Vijay Kokatnur  
wrote:
Ahmet, I have tried filter queries before to fine tune query performance.

However, whenever we use filter queries the response time goes up and
remains there.  With above change, the response time was consistently
around 4-5 secs.  We are using the default cache settings.

Is there any settings I missed?



On Thu, Mar 6, 2014 at 10:44 AM, Ahmet Arslan  wrote:

> Hi,
>
> Since your range query has NOW in it, it won't be cached meaningfully.
> http://solr.pl/en/2012/03/05/use-of-cachefalse-and-cost-parameters/
>
> This is untested but can you try this?
>
> &q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
> &fq=Status:Booked
> &fq=ClientID:4
> &fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
>
>
>
> On Thursday, March 6, 2014 8:29 PM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
> I am working with date range query that is not giving me faster response
> times.  After modifying date range construct after reading several forums,
> response time now is around 200ms, down from 2-3secs.
>
> However, I was wondering if there still some way to improve upon it as
> queries without date range have around 2-10ms latency,
>
> Query : To look up upcoming booked trips for a user whenever he logs in to
> the app-
>
> q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
> ANDClientID:4 AND  StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
> Date configuration in Schema :
>
> 
>  positionIncrementGap="0"/>
>
> Appreciate any inputs.
>
> Thanks!
>
>



Re: Date Range Query taking more time.

2014-03-06 Thread Vijay Kokatnur
That did the trick Ahmet.  The first response was around 200ms, but the
subsequent queries were around 2-5ms.

I tried this

&q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
&fq={!cache=false cost=100}Status:Booked
&fq={!cache=false cost=50}ClientID:4
&fq={!cache=false cost=50}[NOW/DAY TO NOW/DAY+1YEAR]



On Thu, Mar 6, 2014 at 11:49 AM, Ahmet Arslan  wrote:

> Hi,
>
> Did you try with non-cached filter quries before?
> cached Filter queries are useful when they are re-used. How often do you
> commit?
>
> I thought that we can do something if we disable cache filter queries and
> manipulate their execution order with cost parameter.
>
> What happens with this :
> &q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
> &fq={!cache=false cost=100}Status:Booked
> &fq={!cache=false cost=50}ClientID:4
>
> &fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
>
>
>
> On Thursday, March 6, 2014 9:15 PM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
> Ahmet, I have tried filter queries before to fine tune query performance.
>
> However, whenever we use filter queries the response time goes up and
> remains there.  With above change, the response time was consistently
> around 4-5 secs.  We are using the default cache settings.
>
> Is there any settings I missed?
>
>
>
> On Thu, Mar 6, 2014 at 10:44 AM, Ahmet Arslan  wrote:
>
> > Hi,
> >
> > Since your range query has NOW in it, it won't be cached meaningfully.
> > http://solr.pl/en/2012/03/05/use-of-cachefalse-and-cost-parameters/
> >
> > This is untested but can you try this?
> >
> > &q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336
> > &fq=Status:Booked
> > &fq=ClientID:4
> > &fq={!cache=false cost=150}StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
> >
> >
> >
> >
> > On Thursday, March 6, 2014 8:29 PM, Vijay Kokatnur <
> > kokatnur.vi...@gmail.com> wrote:
> > I am working with date range query that is not giving me faster response
> > times.  After modifying date range construct after reading several
> forums,
> > response time now is around 200ms, down from 2-3secs.
> >
> > However, I was wondering if there still some way to improve upon it as
> > queries without date range have around 2-10ms latency,
> >
> > Query : To look up upcoming booked trips for a user whenever he logs in
> to
> > the app-
> >
> > q=UserID:AC10263A-E28B-99F9-0012-AAA42DDD9336 AND Status:Booked
> > ANDClientID:4 AND  StartDate:[NOW/DAY TO NOW/DAY+1YEAR]
> >
> > Date configuration in Schema :
> >
> > 
> >  > positionIncrementGap="0"/>
> >
> > Appreciate any inputs.
> >
> > Thanks!
> >
> >
>
>


hung threads and CLOSE_WAIT sockets

2014-03-06 Thread Avishai Ish-Shalom
Hi,

We've had a strange mishap with a solr cloud cluster (version 4.5.1) where
we observed high search latency. The problem appears to develop over
several hours until such point where the entire cluster stopped responding
properly.

After investigation we found that the number of threads (both solr and
jetty) gradually rose over several hours until it hit the maximum allowed
at which point the cluster stopped responding properly. After restarting
several nodes the number of threads dropped and the cluster started
responding again.
We've examined nodes that were not restarted and found a high number of
CLOSE_WAIT sockets held by the solr process; these sockets were using a
random local port and 8983 remote port - meaning they were outgoing
connections. A thread dump did not show a large number of solr threads and
we were unable to determine which thread(s) are holding these sockets.

has anyone else encountered such a situation?

Regards,
Avishai


Re: Polygon search returning "InvalidShapeException: incompatible dimension (2)... error.

2014-03-06 Thread leevduhl
Ok, I think the issue here is that I need to install the JTS library.  I will
have that done and try again.

Lee
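
For reference, a hedged sketch of a JTS-backed spatial type for a Solr 4.6
schema.xml (the names and tuning values are assumptions; the JTS jar has to
be on Solr's classpath):

  <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
             spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
             geo="true" distErrPct="0.025" maxDistErr="0.000009" units="degrees"/>
  <field name="geoloc" type="location_rpt" indexed="true" stored="true"/>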



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Polygon-search-returning-InvalidShapeException-incompatible-dimension-2-error-tp4121704p4121796.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Filter Cache Size

2014-03-06 Thread Otis Gospodnetic
What Erick said.  That's a giant Filter Cache.  Have a look at these Solr
metrics and note the Filter Cache in the middle:
http://www.flickr.com/photos/otis/8409088080/

Note how small the cache is and how high the hit rate is.  Those are stats
for http://search-lucene.com/ and http://search-hadoop.com/ where you can
see facets on the right that end up being used as filter queries.  Most
Solr apps I've seen had small Filter Caches.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 3:34 PM, Erick Erickson wrote:

> This, BTW, is an ENORMOUS number of cached queries.
>
> Here's a rough guide:
> Each entry will be (length of query) + maxDoc/8 bytes long.
>
> Think of the filterCache as a map where the key is the query
> and the value is a bitmap large enough to hold maxDoc bits.
>
> BTW, I'd kick this back to the default (512?) and periodically check
> it with the admin>>plugins/stats page to see what kind of hit ratio
> I have and adjust from there.
>
> Best,
> Erick
>
> On Mon, Mar 3, 2014 at 11:00 AM, Benjamin Wiens
>  wrote:
> > How can we calculate how much heap memory the filter cache will consume?
> We
> > understand that in order to determine a good size we also need to
> evaluate
> > how many filterqueries would be used over a certain time period.
> >
> >
> >
> > Here's our setting:
> >
> >
> >
> > <filterCache class="solr.FastLRUCache"
> >              size="300000"
> >              initialSize="300000"
> >              autowarmCount="50000"/>
> >
> >
> >
> > According to the post below, 53 GB of RAM would be needed just by the
> > filter cache alone with 1.4 million Docs. Not sure if this true and how
> > this would work.
> >
> >
> >
> > Reference:
> >
> http://stackoverflow.com/questions/2004/solr-filter-cache-fastlrucache-takes-too-much-memory-and-results-in-out-of-mem
> >
> >
> >
> > We filled the filterquery cache with Solr Meter and had a JVM Heap Size
> of
> > far less than 53 GB.
> >
> >
> >
> > Can anyone chime in and enlighten us?
> >
> >
> >
> > Thank you!
> >
> >
> > Ben Wiens & Benjamin Mosior
>
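
As a rough sanity check of Erick's formula quoted above: assuming maxDoc is
about 1.4 million, each filterCache entry needs roughly 1,400,000 / 8 =
175,000 bytes (about 171 KB) for its bitmap, so a cache sized at a few
hundred thousand entries lands in the tens of gigabytes (for example,
300,000 entries x ~171 KB is roughly 50 GB), which is consistent with the
~53 GB figure from the Stack Overflow post.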


RE: SolrCloud setup guidance

2014-03-06 Thread Susheel Kumar
Setting up SolrCloud (horizontal scaling) is definitely a good idea for this 
big index, but before going to SolrCloud, are you able to upgrade your single 
node to 128GB of memory (vertical scaling) to see the difference?

Thanks,
Susheel

-Original Message-
From: Priti Solanki [mailto:pritiatw...@gmail.com] 
Sent: Thursday, March 06, 2014 10:51 AM
To: solr-user@lucene.apache.org
Subject: SolrCloud setup guidance

Hello Everyone,

I would like to take your guidance on the following:

I have a single core with 124 GB of index data size. Indexing and reading are both 
very slow, as I have 7 GB RAM to support this huge data.  Almost 8 million 
documents.

Hence, we thought of going to SolrCloud so that we can accommodate more 
upcoming data. I have data for 13 countries with their millions of products and 
we want to set up SolrCloud for the same.

I am in need of some initial thoughts about how to set up SolrCloud for such a 
requirement. How do we come to know how many nodes/cores I would need to 
support this...

We are thinking of hosting this on Amazon... Any guidance, reading links, or case 
studies will be highly appreciated.

Regards,


Re: Race condition in Leader Election

2014-03-06 Thread KNitin
I am using 4.3.1.


On Thu, Mar 6, 2014 at 11:48 AM, Mark Miller  wrote:

> Are you using an old version?
>
> - Mark
>
> http://about.me/markrmiller
>
> On Mar 6, 2014, at 11:50 AM, KNitin  wrote:
>
> > Hi
> >
> > When restarting a node in solrcloud, i run into scenarios where both the
> > replicas for a shard get into "recovering" state and never come up
> causing
> > the error "No servers hosting this shard". To fix this, I either unload
> one
> > core or restart one of the nodes again so that one of them becomes the
> > leader.
> >
> > Is there a way to "force" leader election for a shard for solrcloud? Is
> > there a way to break ties automatically (without restarting nodes) to
> make
> > a node as the leader for the shard?
> >
> >
> > Thanks
> > Nitin
>
>


Re: Apache Solr Configuration Problem (Japanese Language)

2014-03-06 Thread T. Kuro Kurosaka

Andy,
I don't have a direct answer to your question but I have a question.

On 03/05/2014 07:21 AM, Andy Alexander wrote:

fq=ss_language:ja&q=製品


I am guessing you have a field called ss_language where a language code 
of the document is stored, and you have Solr documents of different languages.


+DisjunctionMaxQuery((content:製品)~0.01)
This indicates your default query field is "content".  What does the 
analyzer for this field look like?

Does the analyzer work for any languages that you want to support?
Many analyzers have language dependency and won't work with multilingual 
fields.


--
T. "Kuro" Kurosaka • Senior Software Engineer
Healthline - The Power of Intelligent Health
www.healthline.com  |@Healthline  | @HealthlineCorp



Re: hung threads and CLOSE_WAIT sockets

2014-03-06 Thread Mark Miller
It sounds like the distributed update deadlock issue.

It’s fixed in 4.6.1 and 4.7.

- Mark

http://about.me/markrmiller

On Mar 6, 2014, at 3:10 PM, Avishai Ish-Shalom  wrote:

> Hi,
> 
> We've had a strange mishap with a solr cloud cluster (version 4.5.1) where
> we observed high search latency. The problem appears to develop over
> several hours until such point where the entire cluster stopped responding
> properly.
> 
> After investigation we found that the number of threads (both solr and
> jetty) gradually rose over several hours until it hit the maximum allowed
> at which point the cluster stopped responding properly. After restarting
> several nodes the number of threads dropped and the cluster started
> responding again.
> We've examined nodes that were not restarted and found a high number of
> CLOSE_WAIT sockets held by the solr process; these sockets were using a
> random local port and 8983 remote port - meaning they were outgoing
> connections. a thread dump did not show a large number of solr threads and
> we were unable to determine which thread(s) is holding these sockets.
> 
> has anyone else encountered such a situation?
> 
> Regards,
> Avishai



Re: Date Range Query taking more time.

2014-03-06 Thread Chris Hostetter

: That did the trick Ahmet.  The first response was around 200ms, but the
: subsequent queries were around 2-5ms.

Are you really sure you want "cache=false" on all of those filters?

While the "ClientID:4" query may by something that cahnges significantly 
enough in every query to not be useful to cache, i suspect you'd find a 
lot of value in going ahead and caching those Status:Booked and 
StartDate:[NOW/DAY TO NOW/DAY+1YEAR] clauses ... the first query to hit 
them might be "slower" but ever query after that should be fairly fast -- 
and if you really need them to *always* be fast, configure them as static 
newSeracher warming queries (or make sure you have autowarming on.
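
A hedged example of such a static warming entry in solrconfig.xml (only the
two cacheable filters are listed; everything else is left at defaults):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="fq">Status:Booked</str>
        <str name="fq">StartDate:[NOW/DAY TO NOW/DAY+1YEAR]</str>
        <str name="rows">0</str>
      </lst>
    </arr>
  </listener>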

It also looks like you forgot the "StartDate:" part of your range query in 
your last test...

: &fq={!cache=false cost=50}[NOW/DAY TO NOW/DAY+1YEAR]

And one final comment, just to make sure it doesn't slip through the 
cracks:

: > > Since your range query has NOW in it, it won't be cached meaningfully.

this is not applicable.  the use of "NOW" in a range query doesn't mean 
that it can't be cached -- the problem is anytime you use really precise 
dates (or numeric values) that *change* in every query.

if your range query uses "NOW" as a lower/upper end point, then it falls 
into that "really precise dates" situation -- but for this user, who is 
specifically rounding his dates to the nearest day, that advice isn't 
really applicable -- the date range queries can be cached & reused for an 
entire day.



-Hoss
http://www.lucidworks.com/


SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-06 Thread Martin de Vries
 

Hi, 

We have 5 Solr servers in a Cloud with about 70 cores and 12GB
indexes in total (every core has 2 shards, so it's 6 GB per server).


After upgrading to Solr 4.7 the Solr servers are crashing constantly
(each server about once per hour). We currently don't have any clue
about the reason. We have tried loads of different settings, but nothing
works. 

When a server crashes the last log item is (most times) a
"Broken pipe" error. The last queries / used cores are completely random
(as far as we can see). 

We are running with the -Xloggc switch and
during a crash it says: 

10838.015: [Full GC 3141724K->3141724K(3522560K), 1.6936710 secs]
10839.710: [Full GC 3141724K->3141724K(3522560K), 1.5682250 secs]
10841.279: [Full GC 3141728K->3141726K(3522560K), 1.5735450 secs]
10842.854: [Full GC 3141727K->3141727K(3522560K), 1.5773380 secs]
10844.433: [Full GC 3141732K->3141687K(3522560K), 1.5696950 secs]
10846.003: [Full GC 3141698K->3141687K(3522560K), 1.5766940 secs]
10847.581: [Full GC 3141695K->3141688K(3522560K), 1.5879360 secs]
10849.170: [Full GC 3141695K->3141691K(3522560K), 1.5698630 secs]
10850.741: [Full GC 3141695K->3141689K(3522560K), 1.5643990 secs]
10852.307: [Full GC 3141693K->3141650K(3522560K), 1.5759150 secs]

We tried to increase the
memory, but that didn't help. We increased the zkClientTimeout to 60
seconds, but that didn't help. 

We made a memory dump with jmap. The
IndexSchema is using 62% of the memory but we don't know if that's a
problem:
https://www.dropbox.com/s/eyom5c48vhl0q9i/Screenshot%202014-03-06%2023.32.41.png
[1] 

Tomorrow we will downgrade each server to Solr 4.6.1; we need to
reindex every core to do that, unless we have a solution.

Does anyone have a clue what the problem can be?

Thanks! 

Martin 




Links:
--
[1]
https://www.dropbox.com/s/eyom5c48vhl0q9i/Screenshot%202014-03-06%2023.32.41.png


Re: Date Range Query taking more time.

2014-03-06 Thread Ahmet Arslan
Hoss,

Thanks for the correction. I missed the /DAY part and thought it was 
StartDate:[NOW TO NOW+1YEAR].

Ahmet


On Friday, March 7, 2014 12:33 AM, Chris Hostetter  
wrote:

: That did the trick Ahmet.  The first response was around 200ms, but the
: subsequent queries were around 2-5ms.

Are you really sure you want "cache=false" on all of those filters?

While the "ClientID:4" query may by something that cahnges significantly 
enough in every query to not be useful to cache, i suspect you'd find a 
lot of value in going ahead and caching those Status:Booked and 
StartDate:[NOW/DAY TO NOW/DAY+1YEAR] clauses ... the first query to hit 
them might be "slower" but ever query after that should be fairly fast -- 
and if you really need them to *always* be fast, configure them as static 
newSeracher warming queries (or make sure you have autowarming on.

It also looks like you forgot the "StartDate:" part of your range query in 
your last test...

: &fq={!cache=false cost=50}[NOW/DAY TO NOW/DAY+1YEAR]

And one final comment, just to make sure it doesn't slip through the 
cracks:


: > > Since your range query has NOW in it, it won't be cached meaningfully.

this is not applicable.  the use of "NOW" in a range query doesn't mean 
that it can't be cached -- the problem is anytime you use really precise 
dates (or numeric values) that *change* in every query.

if your range query uses "NOW" as a lower/upper end point, then it falls 
into that "really precise dates" situation -- but for this user, who is 
specifically rounding his dates to the nearest day, that advice isn't 
really applicable -- the date range queries can be cached & reused for an 
entire day.



-Hoss
http://www.lucidworks.com/



Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-06 Thread Mark Miller


On Mar 6, 2014, at 5:37 PM, Martin de Vries  wrote:

> IndexSchema is using 62% of the memory but we don't know if that's a
> problem:

That seems odd. Can you see what objects are taking all the RAM in the 
IndexSchema?

- Mark

http://about.me/markrmiller

Re: SolrCloud constantly crashes after upgrading to Solr 4.7

2014-03-06 Thread Shawn Heisey
On 3/6/2014 3:37 PM, Martin de Vries wrote:
> We have 5 Solr servers in a Cloud with about 70 cores and 12GB
> indexes in total (every core has 2 shards, so it's 6 GB per server).
> 
> After upgrade to Solr 4.7 the Solr servers are crashing constantly
> (each server about one time per hour). We currently don't have any clue
> about the reason. We tried loads of different settings, but nothing
> works out. 
> 
> When a server crashes the last log item is (most times) a
> "Broken pipe" error. The last queries / used cores are completely random
> (as far as we can see). 

We'd need to actually see a large chunk of the end of the actual
logfile.  It must be the file, not the logging tab in the admin UI.
Ideally it should be at the INFO logging level.

If the broken pipe error is part of an EofException, it is (in my
experience) caused by a client disconnecting before sending the full
request or disconnecting before Solr responds.  I don't know what kind
of socket timeout your clients have, but 30-60 seconds is a common
default for systems that actually set a timeout.
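
If the clients are SolrJ-based, it is worth checking what those timeouts are
actually set to.  A minimal sketch, assuming a SolrJ 4.x HttpSolrServer and
an example URL:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ClientTimeouts {
  public static void main(String[] args) {
    // Example URL only -- point this at one of your own cores.
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    server.setConnectionTimeout(5000); // ms allowed for establishing the TCP connection
    server.setSoTimeout(60000);        // socket read timeout in ms; too low a value can cause early disconnects
    server.shutdown();
  }
}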

Are there any messages in the operating system logs?

Full details about the computer, operating system, Solr startup options,
and your index may also be required to dig deeper, so any details you
can share will be useful.  The config and schema may also be useful, but
you can hold off on those until we know for sure whether they will be
needed.

Thanks,
Shawn



SolrCloud recovery after nodes are rebooted in rapid succession

2014-03-06 Thread Nazik Huq
Hello,

 

I have a question from a colleague who's managing a 3-node(VMs) SolrCloud
cluster with a separate 3-node Zookeeper ensemble. Periodically the  data
center underneath the SolrCloud decides to upgrade the SolrCloud instance
infrastructure in  a "rolling upgrade" fashion. So after the 1st instance of
the SolrCloud is shut down and while it is in the process of rebooting, the
2nd instance starts to shut down, and so on. Eventually all three Solr
instances are rebooted and up and running, but the cluster is now
inoperable, meaning clients can't query or ingest data. My colleague is
trying to ascertain if this problem is due to Solr's inability to recover
from a rapid succession of reboots of the nodes or from the data center
upgrade that is triggering a "situation" making SolrCloud inoperable.

 

My question is, can a SolrCloud cluster become inoperable after its nodes
are rebooted in rapid succession as described above? Is there an edge case
similar to this?

 

Thanks,

 

Nazik Huq

 

 



Re:Solr 4.7.0 - cursorMark question

2014-03-06 Thread Greg Pendlebury
"* New 'cursorMark' request param for efficient deep paging of sorted
  result sets. See http://s.apache.org/cursorpagination";

At the end of the linked doco there is an example that doesn't make sense
to me, because it mentions "sort=timestamp asc" and is then followed by
pseudo code that sorts by id only. I understand that cursorMark requires
that "sort clauses must include the uniqueKey field", but is it really just
'include', or is it the only field that sort can be performed on?

ie. can sort be specified as 'sort=timestamp asc, id asc'?

I am assuming that if the index is changed between requests then we can
still 'miss' or duplicate documents by not sorting on the id as the only
sort parameter, but I can live with that scenario. cursorMark is still
attractive to us since it will prevent the SolrCloud cluster from crashing
when deep pagination requests are sent to it... I'm just trying to explore
all the edge cases our business area are likely to consider.

Ta,
Greg

On 27 February 2014 02:15, Simon Willnauer  wrote:

> February 2014, Apache Solr(tm) 4.7 available
>
> The Lucene PMC is pleased to announce the release of Apache Solr 4.7
>
> Solr is the popular, blazing fast, open source NoSQL search platform
> from the Apache Lucene project. Its major features include powerful
> full-text search, hit highlighting, faceted search, dynamic
> clustering, database integration, rich document (e.g., Word, PDF)
> handling, and geospatial search.  Solr is highly scalable, providing
> fault tolerant distributed search and indexing, and powers the search
> and navigation features of many of the world's largest internet sites.
>
> Solr 4.7 is available for immediate download at:
>   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html
>
> See the CHANGES.txt file included with the release for a full list of
> details.
>
> Solr 4.7 Release Highlights:
>
> * A new 'migrate' collection API to split all documents with a route key
>   into another collection.
>
> * Added support for tri-level compositeId routing.
>
> * Admin UI - Added a new "Files" conf directory browser/file viewer.
>
> * Add a QParserPlugin for Lucene's SimpleQueryParser.
>
> * Suggest improvements: a new SuggestComponent that fully utilizes the
>   Lucene suggester module; queries can now use multiple suggesters;
>   Lucene's FreeTextSuggester and BlendedInfixSuggester are now supported.
>
> * New 'cursorMark' request param for efficient deep paging of sorted
>   result sets. See http://s.apache.org/cursorpagination
>
> * Add a Solr contrib that allows for building Solr indexes via Hadoop's
>   MapReduce.
>
> * Upgrade to Spatial4j 0.4. Various new options are now exposed
>   automatically for an RPT field type.  See Spatial4j CHANGES & javadocs.
>   https://github.com/spatial4j/spatial4j/blob/master/CHANGES.md
>
> * SSL support for SolrCloud.
>
> Solr 4.7 also includes many other new features as well as numerous
> optimizations and bugfixes.
>
> Please report any feedback to the mailing lists
> (http://lucene.apache.org/solr/discussion.html)
>
> Note: The Apache Software Foundation uses an extensive mirroring network
> for distributing releases.  It is possible that the mirror you are using
> may not have replicated the release yet.  If that is the case, please
> try another mirror.  This also goes for Maven access.
>


Re: SolrCloud recovery after nodes are rebooted in rapid succession

2014-03-06 Thread Mark Miller
Would probably need to see some logs to say much. Need to understand why they 
are inoperable.

What version is this?

- Mark

http://about.me/markrmiller

On Mar 6, 2014, at 6:15 PM, Nazik Huq  wrote:

> Hello,
> 
> 
> 
> I have a question from a colleague who's managing a 3-node(VMs) SolrCloud
> cluster with a separate 3-node Zookeeper ensemble. Periodically the  data
> center underneath the SolrCloud decides to upgrade the SolrCloud instance
> infrastructure in  a "rolling upgrade" fashion. So after the 1st instance of
> the SolrCloud is shut down and while it is in the process of rebooting, the
> 2nd instance starts to shut down, and so on. Eventually all three Solr
> instances are rebooted and up and running, but the cluster is now
> inoperable, meaning clients can't query or ingest data. My colleague is
> trying to ascertain if this problem is due to Solr's inability to recover
> from a rapid succession of reboots of the nodes or from the data center
> upgrade that is triggering a "situation" making SolrCloud inoperable.
> 
> 
> 
> My question is, can a SolrCloud cluster become inoperable after its nodes
> are rebooted in rapid succession as described above? Is there an edge case
> similar to this?
> 
> 
> 
> Thanks,
> 
> 
> 
> Nazik Huq
> 
> 
> 
> 
> 



Re: Indexing huge data

2014-03-06 Thread Kranti Parisa
That's what I do: pre-create JSONs following the schema and save them in
MongoDB; this is part of the ETL process. After that, just dump the JSONs
into Solr using batching etc. With this you can do full and incremental
indexing as well.
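
A rough sketch of the "dump with batching" step, using SolrJ 4.x (the URL,
the batch size and the shape of the ETL output are assumptions, not
something from this thread):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  // rows: field-name -> value maps produced by the ETL step (e.g. parsed from the stored JSONs)
  public static void index(List<Map<String, Object>> rows) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // example URL
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    for (Map<String, Object> row : rows) {
      SolrInputDocument doc = new SolrInputDocument();
      for (Map.Entry<String, Object> e : row.entrySet()) {
        doc.addField(e.getKey(), e.getValue());
      }
      batch.add(doc);
      if (batch.size() >= 1000) {   // send in chunks instead of one add per document
        server.add(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      server.add(batch);
    }
    server.commit();                // one commit at the end (or rely on autoCommit in solrconfig.xml)
    server.shutdown();
  }
}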

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu  wrote:

> Yeah. I have thought about spitting out JSON and running it against Solr using
> parallel HTTP threads separately. Thanks.
>
>
> On 3/5/14, 6:46 PM, Susheel Kumar wrote:
>
>> One more suggestion is to collect/prepare the data in CSV format (1-2
>> million sample depending on size) and then import data directly into Solr
>> using CSV handler & curl.  This will give you the pure indexing time & the
>> differences.
>>
>> Thanks,
>> Susheel
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Wednesday, March 05, 2014 8:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Indexing huge data
>>
>> Here's the easiest thing to try to figure out where to concentrate your
>> energies. Just comment out the server.add call in your SolrJ program.
>> Well, and any commits you're doing from SolrJ.
>>
>> My bet: Your program will run at about the same speed it does when you
>> actually index the docs, indicating that your problem is in the data
>> acquisition side. Of course the older I get, the more times I've been wrong
>> :).
>>
>> You can also monitor the CPU usage on the box running Solr. I often see
>> it idling along < 30% when indexing, or even < 10%, again indicating that
>> the bottleneck is on the acquisition side.
>>
>> Note I haven't mentioned any solutions, I'm a believer in identifying the
>> _problem_ before worrying about a solution.
>>
>> Best,
>> Erick
>>
>> On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky 
>> wrote:
>>
>>> Make sure you're not doing a commit on each individual document add.
>>> Commit every few minutes or every few hundred or few thousand
>>> documents is sufficient. You can set up auto commit in solrconfig.xml.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Rallavagu
>>> Sent: Wednesday, March 5, 2014 2:37 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Indexing huge data
>>>
>>>
>>> All,
>>>
>>> Wondering about best practices/common practices to index/re-index huge
>>> amount of data in Solr. The data is about 6 million entries in the db
>>> and other source (data is not located in one resource). Trying with
>>> solrj based solution to collect data from difference resources to
>>> index into Solr. It takes hours to index Solr.
>>>
>>> Thanks in advance
>>>
>>


Re:Solr 4.7.0 - cursorMark question

2014-03-06 Thread Chris Hostetter

: At the end of the linked doco there is an example that doesn't make sense
: to me, because it mentions "sort=timestamp asc" and is then followed by
: pseudo code that sorts by id only. I understand that cursorMark requires

Ok ... 2 things contributing to the confusion.

1) The para that refers to "sort=timestamp asc" should be fixed to include 
"id" as well.

2) The pseudo-code you're referring to that uses "sort => 'id asc'" isn't meant 
to give an example of specifically tailing by timestamp -- it's an 
extension of the earlier example (of fetching all docs sorted on id) to 
show "tailing" new docs with new (increasing) ids ... I'll try to fix the 
wording to better elaborate.

: that "sort clauses must include the uniqueKey field", but is it really just
: 'include', or is it the only field that sort can be performed on?
: 
: ie. can sort be specified as 'sort=timestamp asc, id asc'?

That will absolutely work ... I'll update the doc to include more examples 
with multi-clause sort criteria.

: I am assuming that if the index is changed between requests than we can
: still 'miss' or duplicate documents by not sorting on the id as the only
: sort parameter, but I can live with that scenario. cursorMark is still

If you are using a timestamp param, you should never "miss" a document 
(assuming every doc gets a timestamp) but yes: you can absolutely get the 
same doc twice if it's updated after the first time you fetch it -- that's 
one of the advantages of sorting on a timestamp field like that.
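
To make that concrete, a minimal SolrJ 4.7 sketch of a cursorMark loop that
sorts on timestamp first and id second (the URL and the 'timestamp' field
name are just examples):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorWalk {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // example URL
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(100);
    q.setSort("timestamp", SolrQuery.ORDER.asc);  // assumed field, as discussed in this thread
    q.addSort("id", SolrQuery.ORDER.asc);         // uniqueKey as the tie-breaker
    String cursor = CursorMarkParams.CURSOR_MARK_START;  // "*"
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = server.query(q);
      // ... process rsp.getResults() here ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) {
        break;  // getting the same mark back means the result set is exhausted
      }
      cursor = next;
    }
    server.shutdown();
  }
}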



-Hoss
http://www.lucidworks.com/


Re: Solr 4.7.0 - cursorMark question

2014-03-06 Thread Greg Pendlebury
Thank-you, that all sounds great. My assumption about documents being
missed was something like this:

A,B,C,D

where they are sorted by timestamp first and ID second. Say the first
'page' of results is 'A,B', and before the second page is requested both
documents B + C receive update events and the new order (by timestamp) is:

A,D,B,C

In that situation D would always be missed, whether the cursorMark means 'C or
greater' or 'greater than B' (I'm not sure which it is in practice), simply
because the cursorMark is the unique ID and the unique ID is not your first
sort mechanism.

However, I'm not really concerned about that anyway since it is not a use
case we consider important, and in an information science sense of things I
think it is a non-trivial problem to solve without brute force caching of
all result sets. I'm just happy that we don't have to get our users to
replace existing sort options; we just need to add a unique ID field at the
end and change the parameters we send into the cluster.

Thanks,
Greg


On 7 March 2014 11:05, Chris Hostetter  wrote:

>
> : At the end of the linked doco there is an example that doesn't make sense
> : to me, because it mentions "sort=timestamp asc" and is then followed by
> : pseudo code that sorts by id only. I understand that cursorMark requires
>
> Ok ... 2 things contributing to the confusion.
>
> 1) the para that refers to "sort=timestamp asc" should be fixed to include
> "id" as well.
>
> 2) The pseudo-code you're referring to that uses "sort => 'id asc'" isn't meant
> to give an example of specifically tailing by timestamp -- it's an
> extension of the earlier example (of fetching all docs sorted on id) to
> show "tailing" new docs with new (increasing) ids ... I'll try to fix the
> wording to better elaborate
>
> : that "sort clauses must include the uniqueKey field", but is it really
> just
> : 'include', or is it the only field that sort can be performed on?
> :
> : ie. can sort be specified as 'sort=timestamp asc, id asc'?
>
> That will absolutely work ... i'll update the doc to include more examples
> with multi-clause sort criteria.
>
> : I am assuming that if the index is changed between requests than we can
> : still 'miss' or duplicate documents by not sorting on the id as the only
> : sort parameter, but I can live with that scenario. cursorMark is still
>
> If you are using a timestamp param, you should never "miss" a document
> (assuming every doc gets a timestamp) but yes: you can absolutely get the
> same doc twice if it's updated after the first time you fetch it -- that's
> one of the advantages of sorting on a timestamp field like that.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Date Range Query taking more time.

2014-03-06 Thread Vijay Kokatnur
My initial approach was to use the filter cache for the static fields.
However, when a filter query is used, every query after the first has the
same response time as the first.  For instance, when caching is enabled in
the query under review, the response time shoots up to 4-5 secs and stays there.

We are using default filter cache settings provided with 4.5.0
distribution.

Current Filter Cache stats :

lookups:0
hits:0
hitratio:0
inserts:0
evictions:0
size:0
warmupTime:0
cumulative_lookups:17135
cumulative_hits:2465
cumulative_hitratio:0.14
cumulative_inserts:14670
cumulative_evictions:0

I did not find what the cumulative_* fields mean,
but it looks like nothing is being cached with fq, as the hit ratio is 0.

Any idea what's happening?



On Thu, Mar 6, 2014 at 2:41 PM, Ahmet Arslan  wrote:

> Hoss,
>
> Thanks for the correction. I missed the /DAY part and thought it was
>  StartDate:[NOW TO NOW+1YEAR]
>
> Ahmet
>
>
> On Friday, March 7, 2014 12:33 AM, Chris Hostetter <
> hossman_luc...@fucit.org> wrote:
>
> : That did the trick Ahmet.  The first response was around 200ms, but the
> : subsequent queries were around 2-5ms.
>
> Are you really sure you want "cache=false" on all of those filters?
>
> While the "ClientID:4" query may by something that cahnges significantly
> enough in every query to not be useful to cache, i suspect you'd find a
> lot of value in going ahead and caching those Status:Booked and
> StartDate:[NOW/DAY TO NOW/DAY+1YEAR] clauses ... the first query to hit
> them might be "slower" but ever query after that should be fairly fast --
> and if you really need them to *always* be fast, configure them as static
> newSeracher warming queries (or make sure you have autowarming on.
>
> It also look like you forgot the "StartDate:" part of your range query in
> your last test...
>
> : &fq={!cache=false cost=50}[NOW/DAY TO NOW/DAY+1YEAR]
>
> And one finally comment just to make sure it doesn't slip throug hthe
> cracks
>
>
> : > > Since your range query has NOW in it, it won't be cached
> meaningfully.
>
> this is not applicable.  the use of "NOW" in a range query doesn't mean
> that it can't be cached -- the problem is anytime you use really precise
> dates (or numeric values) that *change* in every query.
>
> if your range query uses "NOW" as a lower/upper end point, then it calls
> in that "really precise dates" situation -- but for this user, who is
> specifically rounding his dates to hte nearest day, that advice isn't
> really applicable -- the date range queries can be cached & reused for an
> entire day.
>
>
>
> -Hoss
> http://www.lucidworks.com/
>
>


RE: SolrCloud recovery after nodes are rebooted in rapid succession

2014-03-06 Thread Nazik Huq
The version is 4.6. I am going to ask for the log files and post them.

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Thursday, March 06, 2014 6:33 PM
To: solr-user
Subject: Re: SolrCloud recovery after nodes are rebooted in rapid succession

Would probably need to see some logs to say much. Need to understand why
they are inoperable.

What version is this?

- Mark

http://about.me/markrmiller

On Mar 6, 2014, at 6:15 PM, Nazik Huq  wrote:

> Hello,
> 
> 
> 
> I have a question from a colleague who's managing a 3-node(VMs) 
> SolrCloud cluster with a separate 3-node Zookeeper ensemble. 
> Periodically the  data center underneath the SolrCloud decides to 
> upgrade the SolrCloud instance infrastructure in  a "rolling upgrade" 
> fashion. So after the 1st instance of the SolrCloud is shut down and 
> while it is in the process of rebooting, the 2nd  instance starts to 
> shut down  and so on. Eventually all three Solr instances are rebooted 
> and up and running, but the cluster is now inoperable, meaning 
> clients can't query or  ingest data. My colleague is trying to 
> ascertain if this problem is due to Solr's inability to recover from a 
> rapid succession of reboots of the nodes or from the data center upgrade
that is triggering a "situation" making SolrCloud inoperable.
> 
> 
> 
> My question is, can a SolrCloud cluster become inoperable after its 
> nodes are rebooted in rapid succession as described above? Is there an 
> edge case similar to this?
> 
> 
> 
> Thanks,
> 
> 
> 
> Nazik Huq
> 
> 
> 
> 
> 



Dataimport handler Date

2014-03-06 Thread Pritesh Patel
I'm using the dataimporthandler to index data from a MySQL DB. It has been
running just fine; I've been using full-imports. I'm now trying to
implement the delta-import functionality.

To implement the delta query, you need to read the last_index_time
from a properties file to know what is new to index.  So I'm using the
parameter
${dataimporter.last_index_time} within my query.

The problem is that when I use this, the date is always "Thu Jan 01 00:00:00
UTC 1970".  It never actually reads the correct date stored in the
dataimport.properties file.
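
(For reference, the dataimport.properties that DIH writes normally looks
something like the following -- the entity name and the timestamps here are
made up:)

#Fri Mar 07 01:15:00 UTC 2014
last_index_time=2014-03-07 01\:15\:00
news.last_index_time=2014-03-07 01\:15\:00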

So my delta query does not work.  Has anybody seen this issue?

It seems like it's always using the start of the epoch, i.e. Unix
timestamp 0.

--Pritesh

P.S.  If you want to see the delta query, see below.

deltaQuery="SELECT node.nid from node where node.type = 'news' and
node.status = 1 and (node.changed >
UNIX_TIMESTAMP('${dataimporter.last_index_time}'jgkg) or node.created >
UNIX_TIMESTAMP('${dataimporter.last_index_time}'))"

deltaImportQuery="SELECT node.nid, node.vid, node.type, node.language,
node.title, node.uid, node.status,
FROM_UNIXTIME(node.created,'%Y-%m-%dT%TZ') as created,
FROM_UNIXTIME(node.changed,'%Y-%m-%dT%TZ') as changed, node.comment,
node.promote, node.moderate, node.sticky, node.tnid, node.translate,
content_type_news.field_image_credit_value,
content_type_news.field_image_caption_value,
content_type_news.field_subhead_value,
content_type_news.field_author_value,
content_type_news.field_dateline_value,
content_type_news.field_article_image_fid,
content_type_news.field_article_image_list,
content_type_news.field_article_image_data,
content_type_news.field_news_blurb_value,
content_type_news.field_news_blurb_format,
content_type_news.field_news_syndicate_value,
content_type_news.field_news_video_reference_nid,
content_type_news.field_news_inline_location_value,
content_type_news.field_article_contributor_nid,
content_type_news.field_news_title_value, page_title.page_title FROM node
LEFT JOIN content_type_news ON node.nid = content_type_news.nid LEFT JOIN
page_title ON node.nid = page_title.id where node.type = 'news' and
node.status = 1 and node.nid = '${deltaimport.delta.nid}'"


Re: SolrCloud setup guidance

2014-03-06 Thread Priti Solanki
Thanks Susheel,

But this index will keep on growing, and that is my worry, so I will always
have to keep increasing the RAM.

Can you suggest how many nodes one should plan for to support this big index?

Regards,



On Fri, Mar 7, 2014 at 2:50 AM, Susheel Kumar <
susheel.ku...@thedigitalgroup.net> wrote:

> Setting up Solr cloud(horizontal scaling) is definitely a good idea for
> this big index but before going to Solr cloud, are you able to upgrade your
> single node to 128GB of memory(vertical scaling) to see the difference.
>
> Thanks,
> Susheel
>
> -Original Message-
> From: Priti Solanki [mailto:pritiatw...@gmail.com]
> Sent: Thursday, March 06, 2014 10:51 AM
> To: solr-user@lucene.apache.org
> Subject: SolrCloud setup guidance
>
> Hello Everyone,
>
> I would like to take you guidance of following
>
> I have a single core with 124 GB of index data size. Indexing and Reading
> both are very slow as I have 7 GB RAM to support this huge data.  Almost 8
> million documents.
>
> Hence, we thought of going to SolrCloud so that we can accommodate more
> upcoming data. I have data for 13 countries with their millions of products
> and we want to set up solrcloud for the same.
>
> I am in need of some initial thoughts about how to set up SolrCloud for
> such a requirement. How do we come to know how many nodes/cores I would be
> needing to support this...
>
> we are thinking to host this on Amazon...Any guidance or reading links
> ,case study will be highly appreciated.
>
> Regards,
>


What is mean by Index Searcher?

2014-03-06 Thread search engn dev
I am reading the Apache Solr Reference Guide and it has these lines:

". Solr caches are associated with a specific instance of an Index
Searcher, a specific view of an index that doesn't change during the
lifetime of that
searcher. As long as that Index Searcher is being used, any items in its
cache will be valid and available for reuse"

What is the concept of an index searcher? Basically, what is meant by "index
searcher" here? Is it a user or something else?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-is-mean-by-Index-Searcher-tp4121898.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: What is mean by Index Searcher?

2014-03-06 Thread Alexandre Rafalovitch
That's an under-the-covers implementation detail. Unless you are writing
extensions, you probably don't need to worry about it.

Where it connects to userland is, for example, commits. Until you commit,
your records are not visible, even though Solr already has them. This is
because the current 'index searcher' does not see new items. When a commit
is done, the searcher is closed and reopened, and then you see those
changes.
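
A tiny SolrJ sketch of that behaviour (example URL and id; it assumes no
autoSoftCommit is configured on the core):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SearcherVisibility {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // example URL
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "searcher-demo-1");
    server.add(doc);
    // Still served by the old searcher: the new doc is not visible yet.
    long before = server.query(new SolrQuery("id:searcher-demo-1")).getResults().getNumFound();
    server.commit();   // closes the old searcher and opens a new one over the updated index
    long after = server.query(new SolrQuery("id:searcher-demo-1")).getResults().getNumFound();
    System.out.println(before + " -> " + after);   // typically "0 -> 1"
    server.shutdown();
  }
}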

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Fri, Mar 7, 2014 at 2:04 PM, search engn dev
 wrote:
> I am reading the Apache Solr Reference Guide and it has these lines:
>
> ". Solr caches are associated with a specific instance of an Index
> Searcher, a specific view of an index that doesn't change during the
> lifetime of that
> searcher. As long as that Index Searcher is being used, any items in its
> cache will be valid and available for reuse"
>
> What is the concept of an index searcher? Basically, what is meant by "index
> searcher" here? Is it a user or something else?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/What-is-mean-by-Index-Searcher-tp4121898.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Dataimport handler Date

2014-03-06 Thread Gora Mohanty
On 7 March 2014 08:50, Pritesh Patel  wrote:
> I'm using the dataimporthandler to index data from a MySQL DB. It has been
> running just fine; I've been using full-imports. I'm now trying to
> implement the delta-import functionality.
>
> To implement the delta query, you need to read the last_index_time
> from a properties file to know what is new to index.  So I'm using the
> parameter
> ${dataimporter.last_index_time} within my query.
>
> The problem is that when I use this, the date is always "Thu Jan 01 00:00:00
> UTC 1970".  It never actually reads the correct date stored in the
> dataimport.properties file.
[...]

I take it that you have verified that the dataimport.properties file exists.
What are its contents?

Please share the exact DIH configuration file that you use, obfuscating
DB password/username. Your cut-and-paste seems to have a syntax
error in the deltaQuery (notice the 'jgkg' string):
deltaQuery="SELECT node.nid from node where node.type = 'news' and
node.status = 1 and (node.changed >
UNIX_TIMESTAMP('${
dataimporter.last_index_time}'jgkg) or node.created >
UNIX_TIMESTAMP('${dataimporter.last_index_time}'))"

What response do you get from the delta-import URL?
Are there any error messages in your Solr log?

Regards,
Gora


SolrCloud with Tomcat

2014-03-06 Thread Vineet Mishra
Hi

I am installing SolrCloud with 3 external
ZooKeepers (localhost:2181, localhost:2182, localhost:2183) and 2
Tomcats (localhost:8181, localhost:8182), all available on a single
machine (just for getting started),
by following these links:

http://myjeeva.com/solrcloud-cluster-single-collection-deployment.html
http://wiki.apache.org/solr/SolrCloudTomcat

I have got the Solr UI on the machine pointing to

http://localhost:8181/solr/#/~cloud

In the Cloud Graph View it is coming with

mycollection
|
|_ shard1
|_ shard2

But both shards are empty, showing no cores or replicas.

Following the
http://myjeeva.com/solrcloud-cluster-single-collection-deployment.html blog,
I have been successful up to starting Tomcat;
after the section "Creating Collection, Shard(s), Replica(s) in
SolrCloud" I am facing the problem.

When giving the command to create a replica for the shard using

curl 'http://localhost:8181/solr/admin/cores?action=CREATE&name=shard1-replica-2&collection=mycollection&shard=shard1'

it gives the following error response (status=400, QTime=137):

Error CREATEing SolrCore 'shard1-replica-2':
192.168.2.183:8182_solr_shard1-replica-2 is removed (code=400)


Has anybody gone through this issue?

Regards