shard query with duplicated documents causes inaccurate pagination

2014-04-29 Thread Jie Sun
When we have duplicated documents (same uniqueID) among the shards, the query
results can be non-deterministic; this is a known issue.

The consequence when we display the search results on a paginated UI page is:
if the user clicks 'last page', it can display an empty page, since the total
doc count returned by the query is not accurate (it apparently includes the
duplicates).

Is there a known workaround for this problem?

We tried the following two approaches, but each of them has a problem:
1) use a query like:
curl -d "q=*:*&fl=message_id&rows=1&start=1999"
http://[hostname]:8080/mywebapp/shards/[coreid]/select?
Since I am using a very large number for 'rows', it returns the accurate doc
count, but it takes about 20 seconds to run this query for an average customer
with a little over 1 million rows returned, so the performance is not
acceptable.

2) use facet query:
curl -d
"q=*:*&fl=message_id&facet=true&facet.mincount=2&rows=0&facet.field=message_id&indent=on"
http://[hostname]:8080/[mywebapp]/shards/[coreid]/select?
our tests show this does not always return accurate doc counts.
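Assuming the facet counts themselves are reliable, approach 2 implies an arithmetic step: the unique count is numFound minus the surplus occurrences of each duplicated id. A toy sketch with hypothetical values (the class and method names are mine, not Solr API code):

```java
import java.util.Map;

public class UniqueCountSketch {
    /**
     * Derive a de-duplicated document count from a shard query's raw numFound
     * plus facet counts on the unique-key field. dupFacet maps each duplicated
     * id to its total occurrence count (>= 2); each such id contributes
     * (count - 1) surplus hits to numFound.
     */
    static long uniqueCount(long numFound, Map<String, Integer> dupFacet) {
        long surplus = 0;
        for (int count : dupFacet.values()) {
            surplus += count - 1;
        }
        return numFound - surplus;
    }

    public static void main(String[] args) {
        // hypothetical: raw numFound 1000, id "A" on 2 shards, id "B" on 3
        System.out.println(uniqueCount(1000, Map.of("A", 2, "B", 3))); // 997
    }
}
```

This only works if the facet pass returns consistent counts, which is exactly what the test above calls into question.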

Any suggestions on the best workaround to get an accurate doc count for a
sharded query with duplicates, one that is also efficient on large data sets?

thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/shard-query-with-duplicated-documents-cause-inaccuate-paginating-tp4133666.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2013-07-15 Thread Jie Sun
Yandong,
have you figured out whether using one collection per customer works for you?

We have a similar use case to yours: customer IDs are used as core names.

That was the reason our company did not upgrade to SolrCloud... I might
remember it wrong, but I vaguely recall looking into using a collection per
customer, and it seemed the number of collections in the current release is
fixed, isn't it?

thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-how-to-index-documents-into-a-specific-core-and-how-to-search-against-that-core-tp3985262p4078210.html


solr 3.5 core rename issue

2013-04-16 Thread Jie Sun
We just tried to use 
.../solr/admin/cores?action=RENAME&core=core0&other=core5

to rename a core from 'old' to 'new'.

After the request completes, solr.xml has the new core name, and the Solr
admin shows the new core name in the list. But the index dir still has the
old name as its directory name. I looked into the Solr 3.5 code, and this is
indeed what the code does.

However, if I bounce Tomcat/Solr, on startup Solr creates a new index dir
named 'new', and of course no documents are returned any longer when you
search the core.

Is this a bug, or did I miss something?
thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-3-5-core-rename-issue-tp4056425.html


Re: solr 3.5 core rename issue

2013-04-16 Thread Jie Sun
Hi Shawn,
I do have persistent="true" in my solr.xml:

<solr persistent="true">
  <cores ...>
    ...
  </cores>
</solr>

The command I ran was to rename '413' to '413a'.

When I debug through Solr's CoreAdminHandler, I notice the persistent flag
only controls whether the new data is persisted to solr.xml; as you can see,
it did change my solr.xml, so there is no problem there.

But the index dir ends up unchanged (still '413'). I guess SWAP has a similar
issue; I bet your 's0_0' directory actually holds data for core s0build, and
s0_1 holds data for s0live, after you swap them, because I don't see anywhere
in the CoreAdminHandler and CoreContainer code that actually renames the index
directory. I might be wrong, but you can test and find out.

Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-3-5-core-rename-issue-tp4056425p4056435.html


Re: solr 3.5 core rename issue

2013-04-17 Thread Jie Sun
thanks Shawn for filing the issue.

By the way, my solrconfig.xml has:

<dataDir>${MYSOLRROOT:/mysolrroot}/messages/solr/data/${solr.core.name}</dataDir>

For now I will have to shut down Solr, write a script to modify solr.xml
manually, and rename the core data directory to the new name.

By the way, when I try to remove a core using UNLOAD (I am using Solr 3.5):

.../solr/admin/cores?action=UNLOAD&core=4130&deleteIndex=true 

it removes the core from solr.xml and the index subfolder under '413' is
removed, but it leaves the data directory '413' itself, and the spellchecker1
and spellchecker2 subfolders still remain.

Do you know why?
thanks
Jie




--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-3-5-core-rename-issue-tp4056425p4056865.html


Re: solr 3.5 core rename issue

2013-04-18 Thread Jie Sun
Yeah, I realize using ${solr.core.name} for dataDir must be the cause of the
issue we see... it is fair to say that SWAP and RENAME just create an alias
that still points to the old dataDir.

If they cannot fix it, then it is not a bug :-) at least we understand
exactly what is going on there.

thanks so much for your help!
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-3-5-core-rename-issue-tp4056425p4057037.html



shard query return 500 on large data set

2013-04-18 Thread Jie Sun
Hi -

when I execute a shard query like:

[myhost]:8080/solr/mycore/select?q=type:message&rows=14&...&qt=standard&wt=standard&explainOther=&hl.fl=&shards=solrserver1:8080/solr/mycore,solrserver2:8080/solr/mycore,solrserver3:8080/solr/mycore
 

everything works fine until I query against a large data set (> 100k
documents) and the number of rows returned exceeds about 50k.

By the way, I am using the HttpClient GET method to send the Solr shard query
over.

In the above scenario, the query fails with a 500 server error as returned
status code.

I am using solr 3.5.

I encountered a 404 before: when one of the shard servers does not have the
core, the whole shard query returns 404 to me. So I would expect that if one
of the servers hits a timeout (408?), the shard query should return a timeout
status code?

I guess I am not sure what the shard query returns under various error
scenarios... I could look into the Solr code, but if you have any input, it
would be appreciated. thanks

Renee



--
View this message in context: 
http://lucene.472066.n3.nabble.com/shard-query-return-500-on-large-data-set-tp4057038.html


RE: numFound changes on changing start and rows

2013-05-08 Thread Jie Sun
any update on this?

will this be addressed/fixed? 

In our system, the UI allows the user to paginate through search results.

As my in-depth tests found: with rows=0, the result size is consistently the
total sum of documents across all shards, regardless of duplicates; with rows
larger than the expected merged document count, numFound is accurate and
consistent; but with rows smaller than the expected merged result size,
numFound is non-deterministic.

Unfortunately, in our system it is not easy to work around this problem. We
have to issue a query whenever the user clicks the Next button, and rows is
20 in our case; in most cases that is smaller than the merged result size, so
we get a different number each time.

Doing rows=0 up front won't work either, since we want an accurate number and
others may be indexing new documents at the same time. Especially when the
user hits the last page, we sometimes see numFound off by hundreds; this
won't work.

Please advise.
thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/numFound-changes-on-changing-start-and-rows-tp3999752p4061628.html


RE: numFound changes on changing start and rows

2013-05-08 Thread Jie Sun
OK, now that my head has cooled down, I remember this old-school issue... I
have been dealing with it myself.

So I do not expect this can be straightened out or fixed in any way.

Basically, when you have two sorted result sets that you need to merge and
paginate through, it is never easy (if at all possible) to figure out the
exact total count if you only retrieve a portion of the results.

For example, if one set has 40,000 rows and the other has 50,000, and you
want start=440 and rows=20 (paginating in the UI), the typical algorithm
sorts both sets, returns the near portion of each, and tosses away the
duplicates in that range (20 rows). So even if you account for the duplicates
prior to that start point, you have no way to tell how many duplicates come
after it, so you really do not know for a fact the exact/accurate numFound
unless you return the whole thing. That is why, when I give a huge rows
number, it gives me the accurate count each time. However, the Solr shard
query throws a 500 server error when the returned set is around 50k, which is
reasonable.

So finding a workaround in the application context is the only solution.
Check how Google search presents result counts; that may give some fuzzy
ideas :-)

thanks
jie 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/numFound-changes-on-changing-start-and-rows-tp3999752p4061633.html


Re: rename a core to same name of existing core

2013-05-13 Thread Jie Sun
Did anyone verify that the following is true?
> the Description on http://wiki.apache.org/solr/CoreAdmin#CREATE is:
>
> *quote*
> If a core with the same name exists, while the "new" created core is
> initalizing, the "old" one will continue to accept requests. Once it
> has finished, all new request will go to the "new" core, and the "old"
> core will be unloaded.
> */quote*

step 1 - I have a core 'abc' with 30 documents in it:
http://myhost.com:8080/solr/abc/select/?q=type%3Amessage&version=2.2&start=0&rows=10&indent=on

(the response showed 30 documents)

step 2 - then I create a new core with the same name 'abc':
http://myhost.com:8080/solr/admin/cores?action=create&name=abc&instanceDir=./

(response: status 0, QTime 303, core 'abc', saved /mxl/var/solr/solr.xml)

step 3 - I cleared my browser cache

step 4 - I ran the same query as in step 1 and got the same results (30 documents):
http://myhost.com:8080/solr/abc/select/?q=type%3Amessage&version=2.2&start=0&rows=10&indent=on

(the response again showed 30 documents)

I thought the old core should have been unloaded?
Did I misunderstand anything here?

thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/rename-a-core-to-same-name-of-existing-core-tp3090960p4063008.html


Re: rename a core to same name of existing core

2013-05-13 Thread Jie Sun
thanks for the information, you are right, I was using the same instance dir.

I agree with you; I would like to see an error if I create a core with the
name of an existing core.

Right now I have to ping first and check whether the returned code is 404.

Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/rename-a-core-to-same-name-of-existing-core-tp3090960p4063047.html


programmatically get dataDir setting from solrconfig.xml

2012-11-28 Thread Jie Sun
I am trying to get the value of 'dataDir' that was set in solrconfig.xml.

Other than querying Solr with

http://[host]:8080/solr/default/admin/file/?contentType=text/xml;charset=utf-8&file=solrconfig.xml

and parsing the dataDir element with an XML parser, then resolving all
possible environment variables and system properties (essentially the same
thing the Solr core manager does) to get the value in my Java program...

is there an admin URL or java API I can use to just get a setting defined in
solrconfig.xml?

Eventually, what I am trying to do is find the size of a core's index. I am
trying to reconstruct the path to the core and run 'du' on the file system.

So the second question is: is there a better way to do this?
thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/programmatically-get-dataDir-setting-from-solrconfig-xml-tp4023108.html


suggestion howto handle highly repetitive valued field

2012-12-11 Thread Jie Sun
Hi -
our indexed documents currently store Solr fields like 'digest' or 'type',
where most of our documents end up with the same value (such as 'sha1' for
the 'digest' field, or 'message' for the 'type' field).

On each Solr server we usually have hundreds of millions of documents indexed
with the same value in these fields (the fields are stored and indexed).

Any suggestion on the best approach? We suspect this is very inefficient in
disk space usage; or is it?

thanks!
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/suggestion-howto-handle-highly-repetitive-valued-field-tp4026104.html


Re: suggestion howto handle highly repetitive valued field

2012-12-11 Thread Jie Sun
thank you David!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/suggestion-howto-handle-highly-repetitive-valued-field-tp4026104p4026163.html


how to understand this benchmark test results (compare index size after schema change)

2012-12-12 Thread Jie Sun
I cleaned up the Solr schema by changing a small portion of the stored fields
to stored="false".

For 5000 documents (about 500M total size of original documents), I ran a
benchmark comparing the Solr index size between the schemas before and after
the cleanup.

The first run showed about a 40% reduction in index size (52M with the old
schema vs. 30M with the new schema).

However, the second time I added another 5000 documents (similar data but
different documents) to the index. This time, for the total of 10,000
documents, the index size with the old schema is 57M, but with the new
schema it grows to 54M.

How should I explain what I see? Could it be that the second group of 5000
documents has very different data sizes in the fields that were changed to
not stored? Or is it because Solr/Lucene's indexing strategy produces smaller
differences in index size as the number of documents grows?

any input will be appreciated.
thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-understand-this-benchmark-test-results-compare-index-size-after-schema-change-tp4026674.html


if I only need exact search, does frequency/score matter?

2012-12-13 Thread Jie Sun
this is related to my previous post, where I have not gotten any feedback yet...

I am going through an exercise to reduce the disk usage of the Solr index files.

The first step I took was to change some fields from stored to not stored;
this reduced the size of .fdt by 30-60%.

Very promising... however, I notice the .frq files take almost as much disk
space as the .fdt files.

It seems .frq keeps the term frequency information. 

In our application we only care about exact search (for legal purposes); we
do not care about ranking search results by relevance (score) at all.

Does this mean I can omit the frequencies? Is it feasible in Solr to turn
term frequency off?
I do need phrase search, so I will have to keep the .prx files, which are
also huge, similar to the .fdt files.

Any suggestions or inputs?
thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/if-I-only-need-exact-search-does-frequency-score-matter-tp4026893.html


Re: if I only need exact search, does frequency/score matter?

2012-12-15 Thread Jie Sun
thanks for the information...

I did come across that discussion; I guess I will try writing a customized
Similarity class and disabling tf.

I hope this is not totally odd to do... I do notice about 10GB of .frq files
in cores that have 10-30GB of .fdt files in total. I hope the benchmark
shows enough disk usage reduction to make this worthwhile.

If in the future we bring back relevance search, I believe we will have to
re-index everything...

thanks again!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/if-I-only-need-exact-search-does-frequency-score-matter-tp4026893p4027327.html


Re: how to understand this benchmark test results (compare index size after schema change)

2012-12-17 Thread Jie Sun
thanks Erik... I did run optimize on both indices to get rid of the deleted
data before comparing them (and my benchmark tests just indexed 5000 new
documents, without duplicates, into a new core... but I optimized just to
make sure).

One result is consistent: the .fdt/.fdx files are reduced by 30-60% after the
stored= changes. So that is a very promising result for my purpose.

I am trying to get rid of the .frq files (the third-largest segment files in
my production), and I have some discussion about this in another topic.
thanks!
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-understand-this-benchmark-test-results-compare-index-size-after-schema-change-tp4026674p4027544.html


Re: if I only need exact search, does frequency/score matter?

2012-12-17 Thread Jie Sun
thanks, this is very helpful



--
View this message in context: 
http://lucene.472066.n3.nabble.com/if-I-only-need-exact-search-does-frequency-score-matter-tp4026893p4027559.html


Re: if I only need exact search, does frequency/score matter?

2012-12-17 Thread Jie Sun
Hi Otis,

do you think I should customize both tf and idf to disable the term
frequency?

i.e. something like:

public float tf(float freq) {
    return freq > 0 ? 1.0f : 0.0f;
}

public float idf(int docFreq, int numDocs) {
    return docFreq > 0 ? 1.0f : 0.0f;
}

thanks!
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/if-I-only-need-exact-search-does-frequency-score-matter-tp4026893p4027578.html


Re: if I only need exact search, does frequency/score matter?

2012-12-19 Thread Jie Sun
Hi Otis,
I customized the Similarity class and added it at the end of schema.xml:

<similarity class="mypackage.NoTfSimilarity"/>

and mypackage.NoTfSimilarity.java is like:

package mypackage;

import org.apache.lucene.search.DefaultSimilarity;

public class NoTfSimilarity extends DefaultSimilarity
{
    @Override
    public float tf(float freq)
    {
        return freq > 0 ? 1.0f : 0.0f;
    }

    @Override
    public float idf(int docFreq, int numDocs)
    {
        return docFreq > 0 ? 1.0f : 0.0f;
    }
}

I deployed the class at
.../tomcat/webapps/solr/WEB-INF/classes/mypackage/NoTfSimilarity.class

and restarted Tomcat.

I ran the benchmark indexing the same set of data; compared with the results
before the change, the .frq file sizes remain the same. Also, the query
still shows scores being calculated:
... ...
<float name="score">0.8838835</float>
... ...

Any idea what I am missing here? It seems it is not using my customized
Similarity class.
thanks
jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/if-I-only-need-exact-search-does-frequency-score-matter-tp4026893p4028125.html


Re: if I only need exact search, does frequency/score matter?

2012-12-19 Thread Jie Sun
Hi Otis,
here is the debug output for the query... it seems all tf and idf calls
indeed return 1.0f as I customized... I did not override queryNorm or
weight, etc.; see below.

But the bottom line is that if my purpose is to reduce the .frq file size,
customizing the Similarity won't help. I guess the term frequency is still
stored no matter what the similarity algorithm is, correct?
thanks
Jie



<str name="rawquerystring">type:message AND subject_eng:Resources</str>
<str name="querystring">type:message AND subject_eng:Resources</str>
<str name="parsedquery">+type:message +subject_eng:resources</str>
<str name="parsedquery_toString">+type:message +subject_eng:resources</str>



0.92807764 = (MATCH) sum of:
  0.70710677 = (MATCH) weight(type:message in 596), product of:
0.70710677 = queryWeight(type:message), product of:
  1.0 = idf(docFreq=10247, maxDocs=10247)
  0.70710677 = queryNorm
1.0 = (MATCH) fieldWeight(type:message in 596), product of:
  1.0 = tf(termFreq(type:message)=1)
  1.0 = idf(docFreq=10247, maxDocs=10247)
  1.0 = fieldNorm(field=type, doc=596)
  0.22097087 = (MATCH) weight(subject_eng:resources in 596), product of:
0.70710677 = queryWeight(subject_eng:resources), product of:
  1.0 = idf(docFreq=20, maxDocs=10247)
  0.70710677 = queryNorm
0.3125 = (MATCH) fieldWeight(subject_eng:resources in 596), product of:
  1.0 = tf(termFreq(subject_eng:resources)=1)
  1.0 = idf(docFreq=20, maxDocs=10247)
  0.3125 = fieldNorm(field=subject_eng, doc=596)
<str name="QParser">LuceneQParser</str>



--
View this message in context: 
http://lucene.472066.n3.nabble.com/if-I-only-need-exact-search-does-frequency-score-matter-tp4026893p4028131.html


POST query with non-ASCII to solr using httpclient wont work

2013-01-12 Thread Jie Sun
When I use HttpClient and its PostMethod to post a query containing some
Chinese, Solr either fails to return any records, or returns everything.
... ...
method = new PostMethod(solrReq);
method.getParams().setContentCharset("UTF-8");
method.setRequestHeader("Content-Type",
"application/x-www-form-urlencoded; charset=UTF-8");
... ...

I used tcpdump and found that the query my application sent is a URL-encoded
query string (see the "q=xxx" part):

../SPOST /solr/413/select HTTP/1.1
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Accept: */*
User-Agent: Jakarta Commons-HttpClient/3.1
Host: 172.20.73.142:8080
Content-Length: 192

q=type%3Amessage+AND+customer_id%3A413+AND+subject_zhs%3A%E8%83%BD%E5%8A%9B+&hl.fl=&qt=standard&wt=standard&rows=20
17:09:55.592527 IP xxx> yyy.webcache: tcp 0
... ...

I found this urlencoding is what causing solr query failing. I found this by
copying the above urlencoded query to a file and use curl command, then I
got same error, but if I replace the above query with decoded string, then
it works with solr:

curl -v -H 'Content-type:application/x-www-form-urlencoded; charset=utf-8' 
http://localhost:8080/solr/413/select --data @/tmp/chinese_query

When /tmp/chinese_query contains the following, it works with Solr:
q=type:message+AND+customer_id:413+AND+subject_zhs:能力+&hl.fl=&qt=standard&wt=standard&rows=20

But if I switch /tmp/chinese_query to the URL-encoded string, it fails again
with the same error:
q=type%3Amessage+AND+customer_id%3A413+AND+subject_zhs%3A%E8%83%BD%E5%8A%9B+&hl.fl=&qt=standard&wt=standard&rows=20

So, my conclusions:
1) Solr (I am using 3.5) only accepts the decoded query string; it fails with
the URL-encoded query.
2) HttpClient sends a URL-encoded string no matter what (there seems to be no
way to make it send the POST body without URL-encoding it).

Am I missing something, or do you have any suggestion about what I am doing wrong?
thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/POST-query-with-non-ASCII-to-solr-using-httpclient-wont-work-tp4032957.html


Re: POST query with non-ASCII to solr using httpclient wont work

2013-01-12 Thread Jie Sun
:-) Otis, I also looked at the SolrJ source code; it seems to do exactly what
I am doing here... but I will probably do what you suggested... thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/POST-query-with-non-ASCII-to-solr-using-httpclient-wont-work-tp4032957p4032973.html


Re: POST query with non-ASCII to solr using httpclient wont work

2013-01-14 Thread Jie Sun
Unfortunately SolrJ is not an option here...
we will have to make a quick fix with a patch in production.

I am still unable to make Solr (3.5) accept a URL-encoded query. Again,
passing a non-URL-encoded query string works with non-ASCII (Chinese), but it
fails to return anything when the request is sent URL-encoded with Chinese.

any suggestion?
thanks
jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/POST-query-with-non-ASCII-to-solr-using-httpclient-wont-work-tp4032957p4033262.html


queryResultWindowSize vs rows

2012-10-05 Thread Jie Sun
What will happen if, in my query, I specify a greater number for rows than
the queryResultWindowSize in my solrconfig.xml?

For example, if queryResultWindowSize=100 but I need to process a batch query
with rows=1000 each time, varying start as I go, what will happen? If I do
not turn off the queryResultCache, I looked into the code a bit and it seems
it will compute
supersetMaxDoc = ((maxDocRequested - 1) / queryResultWindowSize + 1) * queryResultWindowSize
and cache 'supersetMaxDoc' documents?

I hope it does not; otherwise we should turn off the cache and sacrifice the
UI paging performance.

thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/queryResultWindowSize-vs-rows-tp401.html


Re: queryResultWindowSize vs rows

2012-10-07 Thread Jie Sun
any suggestions?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/queryResultWindowSize-vs-rows-tp401p4012336.html


Re: queryResultWindowSize vs rows

2012-10-07 Thread Jie Sun
Hi Erik,
no, I don't have any evidence, just a precautionary question.
So according to your explanation, this cache only keeps the document IDs; so
if the client pages to the next group of documents in the window, there will
be another query to the Solr server to retrieve those docs, correct?

OK, that is good to know, because in production we share the same Solr server
between UI search and the batch query I mentioned in the original question.

thanks!
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/queryResultWindowSize-vs-rows-tp401p4012340.html


CheckIndex question

2012-10-17 Thread Jie Sun
Hi -

With a corrupted core:

1. If I run CheckIndex with -fix, it drops the reference to the corrupted
segment, but the segment files are still there. When we have a lot of
corrupted segments, we have to pick them out and remove them manually. Is
there a way the tool could suffix or prefix them so they are easier to clean
out?

2. We know the doc count in the corrupted segment; is it easy to also output
the doc IDs of those docs?

thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/CheckIndex-question-tp4014366.html


[/solr] memory leak prevent tomcat shutdown

2012-10-19 Thread Jie Sun
Very often when we try to shut down Tomcat, we get the following error in
catalina.out indicating a Solr thread cannot be stopped; Tomcat ends up
hanging and we have to kill -9, which we think leads to some core corruptions
in our production environment. Please help...

catalina.out:

... ...

Oct 19, 2012 10:17:22 AM org.apache.catalina.loader.WebappClassLoader
clearReferencesThreads
SEVERE: The web application [/solr] appears to have started a thread named
[pool-69-thread-1] but has failed to stop it. This is very likely to create
a memory leak.

Then I used kill -3 to trigger a thread dump; here is what I got (note the
thread [pool-69-thread-1] is hanging):

2012-10-19 10:18:39
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.2-b06 mixed mode):

"DestroyJavaVM" prio=10 tid=0x55b39800 nid=0x7e82 waiting on
condition [0x]
   java.lang.Thread.State: RUNNABLE

"pool-69-thread-1" prio=10 tid=0x2aaabcb41800 nid=0x19fa waiting on
condition [0x4205e000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0006de699d80> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown
Source)
at java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.getTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

"JDWP Transport Listener: dt_socket" daemon prio=10 tid=0x578aa000
nid=0x19f9 runnable [0x]
   java.lang.Thread.State: RUNNABLE

... ...



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-memory-leak-prevent-tomcat-shutdown-tp4014788.html


Re: [/solr] memory leak prevent tomcat shutdown

2012-10-19 Thread Jie Sun
by the way, I am running tomcat 6, solr 3.5 on redhat 2.6.18-274.el5 #1 SMP
Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-memory-leak-prevent-tomcat-shutdown-tp4014788p4014792.html


Re: [/solr] memory leak prevent tomcat shutdown

2012-10-19 Thread Jie Sun
I found a Solr/Lucene bug, "TimeLimitingCollector starts thread in static {}
with no way to stop them":
https://issues.apache.org/jira/browse/LUCENE-2822

Is this the same issue? It is fixed in Lucene 3.5, but I am using Solr 3.5
with Lucene 2.9.3 (matched Lucene version).

Can anyone shed some light on whether this means I need to upgrade to Lucene 3.5?
thanks
jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-memory-leak-prevent-tomcat-shutdown-tp4014788p4014833.html


Re: [/solr] memory leak prevent tomcat shutdown

2012-10-22 Thread Jie Sun
any input on this?
thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-memory-leak-prevent-tomcat-shutdown-tp4014788p4015265.html


solr replication against active indexing on master

2012-11-01 Thread Jie Sun
I have a question about the solr replication (master/slaves).

While indexing activity is ongoing on the master, the slave sends the
filelist command to get a version (to my understanding, a snapshot at that
point in time) of all files with their sizes/timestamps, etc.

Then the slave decides which files need to be pulled and sends another
request.

If the master has ongoing indexing activity, and especially if a commit
happens between the two slave commands (filelist and pull), then the
replication will fail, correct?

How does this work correctly?
thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-replication-against-active-indexing-on-master-tp4017696.html


Re: solr replication against active indexing on master

2012-11-01 Thread Jie Sun
thanks ...

Could you please point me to a more detailed explanation online, or will I
have to read the code to find out? I would like to understand a little more
about how this is achieved. thanks!

Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-replication-against-active-indexing-on-master-tp4017696p4017707.html


Re: solr replication against active indexing on master

2012-11-01 Thread Jie Sun
thanks... I just read the related code... now I understand: it seems the
master keeps replicable snapshots (versions), so the file list should be
static. Thank you Otis!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-replication-against-active-indexing-on-master-tp4017696p4017743.html


load balance with SolrCloud

2012-11-05 Thread Jie Sun
We are using Solr 3.5 in production and deal with terabytes of customer data.

We use shards for large customers and wrote our own replica management in our
software.

Now, with the rapid growth of data, we are looking into SolrCloud for its
robust sharding and replication.

I understand from reading some documents online that there is no SPOF with
SolrCloud, so any instance in the cluster can serve queries/indexing.
However, is it true that we need to write our own load balancer in front of
SolrCloud?

For example, say we want to implement a model similar to Loggly's: each
customer starts indexing into a small shard of its own; if any customer grows
beyond the small shard's limit, we switch to indexing into another small
shard (we call it a front-end shard), and meanwhile merge the just-released
small shard into the next-level larger shard.

Since the merge can happen between two instances on different servers, we
would probably end up syncing the index files of the merging shards and then
using Solr's merge.

I am curious whether Solr provides anything to help with this kind of
strategy for dealing with unevenly growing, large customer data (a core), or
do we have to write it in our own software layer from scratch?

thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/load-balance-with-SolrCloud-tp4018367.html


Re: load balance with SolrCloud

2012-11-06 Thread Jie Sun
thanks for your feedback Erick.

I am also aware of the current limitation that the number of shards in a
collection is fixed; changing the number requires re-configuring and
re-indexing. If that limitation gets lifted in a near-future release, I would
then consider setting up a collection for each customer, which would include
a varying number of shards and their replicas (depending on the customer
size, and it should grow dynamically).

So this would lead to having multiple collections on one Solr server
instance... I assume setting up n collections on one server is not an issue,
or is it? I am skeptical; see the example from the Solr wiki below, which
seems to start a Solr instance with one specific collection and its config:
cd example
java -Dbootstrap_confdir=./solr/collection1/conf
-Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar

thanks
Jie



--
View this message in context: 
http://lucene.472066.n3.nabble.com/load-balance-with-SolrCloud-tp4018367p4018659.html