Re: How to configure Solr to use ZooKeeper ACLs in order to protect its content

2015-03-20 Thread Per Steffensen
Sorry, I did not follow this mailing-list closely enough to notice this 
question. But Dmitry mailed me privately asking for help, so here I am


Initial steps
* mkdir solr-test
* cd solr-test
* Downloaded solr-5.0.0.zip and unzipped into solr-test folder, so that 
I have solr-test/solr-5.0.0 folder

* cd solr-5.0.0
* export SOLR_HOME=$(pwd)
* Started a new/empty ZK at localhost:2181 (I am sure you can do that)

Setting the VM-params
* export 
SOLR_ZK_PROVIDERS="-DzkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider 
-DzkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider"
* export SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user 
-DzkDigestPassword=admin-password 
-DzkDigestReadonlyUsername=readonly-user 
-DzkDigestReadonlyPassword=readonly-password"


Starting Solr, just to have the jar extracted into the webapp folder, so 
that I can use the classpath you used

* cd $SOLR_HOME/server
* java -jar start.jar
* CTRL-C to stop again

Bootstrapping (essentially creating the /solr root-node in ZK)
* cd $SOLR_HOME/server
* java $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS -classpath 
"$SOLR_HOME/server/solr-webapp/webapp/WEB-INF/lib/*:$SOLR_HOME/server/lib/ext/*" 
org.apache.solr.cloud.ZkCLI -cmd bootstrap -zkhost localhost:2181/solr 
-solrhome $SOLR_HOME/server/solr


Uploading the config
* cd $SOLR_HOME/server
* java $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS -classpath 
"$SOLR_HOME/server/solr-webapp/webapp/WEB-INF/lib/*:$SOLR_HOME/server/lib/ext/*" 
org.apache.solr.cloud.ZkCLI -zkhost localhost:2181/solr -cmd upconfig 
-confdir 
$SOLR_HOME/server/solr/configsets/data_driven_schema_configs/conf 
-confname gettingstarted_shard1_replica1


Starting Solr node
* cd $SOLR_HOME/server
* java $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS 
-Dsolr.solr.home=$SOLR_HOME/server/solr 
-Dsolr.data.dir=$SOLR_HOME/server/solr/gettingstarted_shard1_replica1 
-Dsolr.log=$SOLR_HOME/server/solr/logs -DzkHost=localhost:2181/solr 
-Djetty.port=8983 -jar start.jar


PROBLEM REPRODUCED!!!

Checking out 5.0.0 source-code to see what is wrong. Finding out that 
you need to set the provider-classes in solr.xml - a Solr-node seems 
not to be able to take the provider-classes from VM-params. When I 
handed over the patch for SOLR-4580, VM-parameters were the only way to 
set providers. The other guys added support for setting it in solr.xml, 
which is a good idea. It seems that at the same time VM-params are no 
longer supported for Solr-nodes. I do not know if that was intentional?
Anyway. Added the following to the <solrcloud>-section in 
$SOLR_HOME/server/solr/solr.xml
<str name="zkCredentialsProvider">org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider</str>
<str name="zkACLProvider">org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider</str>


Trying to start again (without the SOLR_ZK_PROVIDERS VM-params - because 
they are not used anyway)
* java $SOLR_ZK_CREDS_AND_ACLS -Dsolr.solr.home=$SOLR_HOME/server/solr 
-Dsolr.data.dir=$SOLR_HOME/server/solr/gettingstarted_shard1_replica1 
-Dsolr.log=$SOLR_HOME/server/solr/logs -DzkHost=localhost:2181/solr 
-Djetty.port=8983 -jar start.jar


Voila!

Regards, Per Steffensen

On 19/03/15 15:01, Dmitry Karanfilov wrote:

Looks like it is still broken.
The fixed-name system properties zkCredentialsProvider and zkACLProvider
only affect the zkcli.sh script (org.apache.solr.cloud.ZkCLI).
So using the commands below, I'm able to *bootstrap* and *upconfig* to the
Zookeeper with appropriate credentials and ACLs:

export
SOLR_ZK_PROVIDERS="-DzkCredentialsProvider=org.apache.solr.common.cloud.VMParamsSingleSetCredentialsDigestZkCredentialsProvider
-DzkACLProvider=org.apache.solr.common.cloud.VMParamsAllAndReadonlyDigestZkACLProvider"
export SOLR_ZK_CREDS_AND_ACLS="-DzkDigestUsername=admin-user
-DzkDigestPassword=admin-password -DzkDigestReadonlyUsername=readonly-user
-DzkDigestReadonlyPassword=readonly-password"

java $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS -classpath
"server/solr-webapp/webapp/WEB-INF/lib/*:server/lib/ext/*"
org.apache.solr.cloud.ZkCLI -cmd bootstrap -zkhost 10.0.1.112:2181/solr
-solrhome /opt/solr/example/cloud/node1/solr/
java $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS -classpath
"server/solr-webapp/webapp/WEB-INF/lib/*:server/lib/ext/*"
org.apache.solr.cloud.ZkCLI -zkhost 10.0.1.112:2181/solr -cmd upconfig
-confdir /opt/solr/server/solr/configsets/data_driven_schema_configs/conf
-confname gettingstarted_shard1_replica1


But when I start a Solr it is not able to connect to the Zookeeper:

java $SOLR_ZK_PROVIDERS $SOLR_ZK_CREDS_AND_ACLS
-Dsolr.solr.home=/opt/solr/example/cloud/node1/solr
-Dsolr.data.dir=/opt/solr/example/cloud/node1/solr/gettingstarted_shard1_replica1
-Dsolr.log=/opt/solr/example/cloud/node1/logs -DzkHost=10.0.1.112:2181/solr
-Djetty.port=898

Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Per Steffensen
In one of our production environments we use 32GB, 4-core, 3T RAID0 
spinning disk Dell servers (do not remember the exact model). We have 
about 25 collections with 2 replica (shard-instances) per collection on 
each machine - 25 machines. Total of 25 coll * 2 replica/coll/machine * 
25 machines = 1250 replica. Each replica contains about 800 million 
pretty small documents - that's about 1,000 billion (a trillion) 
documents all in all. We index about 1.5 billion new documents every day 
(mainly into one of the collections = 50 replica across 25 machines) and 
keep a history of 2 years on the data. Shifting the "index into" 
collection every month. We can fairly easily keep up with the indexing 
load. We have almost none of the data on the heap, but of course a small 
fraction of the data in the files will at any time be in the 
OS file-cache.
Compared to our indexing frequency we do not do a lot of searches. We 
have about 10 users searching the system from time to time - anything 
from major extracts to small quick searches. Depending on the nature of 
the search we have response-times between 1 sec and 5 min. But of course 
that is very dependent on "clever" choices for each field wrt index, 
store, doc-value etc.
BUT we are not using out-of-the-box Apache Solr. We have made quite a lot of 
performance tweaks ourselves.
Please note that, even though you disable all Solr caches, each replica 
will use heap-memory linearly dependent on the number of documents (and 
their size) in that replica. But not much, so you can get pretty far 
with relatively little RAM.
Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it 
did not get worse in newer releases.


Just to give you some idea of what can at least be achieved - in the 
high-end of #replica and #docs, I guess


Regards, Per Steffensen

On 24/03/15 14:02, Ian Rose wrote:

Hi all -

I'm sure this topic has been covered before but I was unable to find any
clear references online or in the mailing list.

Are there any rules of thumb for how many cores (aka shards, since I am
using SolrCloud) is "too many" for one machine?  I realize there is no one
answer (depends on size of the machine, etc.) so I'm just looking for a
rough idea.  Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core)
server without any problems.
* I have never heard of anyone successfully running X cores/shards on a
single machine, even if you throw a lot of hardware at it.

Thanks!
- Ian





Re: rough maximum cores (shards) per machine?

2015-03-25 Thread Per Steffensen

On 25/03/15 15:03, Ian Rose wrote:

Per - Wow, 1 trillion documents stored is pretty impressive.  One
clarification: when you say that you have 2 replica per collection on each
machine, what exactly does that mean?  Do you mean that each collection is
sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards
per machine)?

Yes

   Or are some of these slave replicas (e.g. 25x sharding with
1 replica per shard)?
No replication. It does not work very well, at least in 4.4.0. Besides 
that I am not a big fan of two (or more) machines having to do all the 
indexing work and making sure to keep synchronized. Use a distributed 
file-system supporting multiple copies of every piece of data (like 
HDFS) for HA on data-level. Have only one Solr-node handle the indexing 
into a particular shard - if this Solr-node breaks down let another 
Solr-node take over the indexing "leadership" on this shard. Besides the 
indexing Solr-node, several other Solr-nodes can serve data from this 
shard - just watching the data-folder (and commits) done by the 
indexing-leader of this particular shard - that will give you HA on 
service-level. That is probably how we are going to do HA - pretty soon. 
But that is another story


Thanks!

No problem



Re: Solr replicas going in recovering state during heavy indexing

2015-03-27 Thread Per Steffensen
I think it is very likely that it is due to Solr-nodes losing 
ZK-connections (after timeout). We have experienced that a lot. One 
thing you want to do is to make sure your ZK-servers do not run on 
the same machines as your Solr-nodes - that helped us a lot.


On 24/03/15 13:57, Gopal Jee wrote:

Hi
We have a large solrcloud cluster. We have observed that during heavy
indexing, large number of replicas go to recovering or down state.
What could be the possible reason and/or fix for the issue.

Gopal





Re: SOLR 5.0.0 and Tomcat version ?

2015-03-27 Thread Per Steffensen

On 23/03/15 20:05, Erick Erickson wrote:

you don't run a SQL engine from a servlet
container, why should you run Solr that way?

https://twitter.com/steff1193/status/580491034175660032
https://issues.apache.org/jira/browse/SOLR-7236?focusedCommentId=14383624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14383624
etc

Not that I want to start the discussion again. The war seems to be lost.


Re: Securing solr index

2015-04-14 Thread Per Steffensen

Hi

I might misunderstand you, but if you are talking about securing the 
actual files/folders of the index, I do not think this is a Solr/Lucene 
concern. Use standard mechanisms of your OS. E.g. on linux/unix use 
chown, chgrp, chmod, sudo, apparmor etc - e.g. allowing only root to 
write the folders/files and letting the user running Solr/Lucene sudo to 
root in this area. Even admins should not (normally) operate as root 
- that way they cannot write the files either. No one knows the 
root-password - except maybe for the super-super-admin, or you split the 
root-password in two and two admins know a part each, so that they both 
have to agree in order to operate as root. Be creative yourself.


Regards, Per Steffensen

On 13/04/15 12:13, Suresh Vanasekaran wrote:

Hi,

We are having the solr index maintained in a central server and multiple users 
might be able to access the index data.

May I know what are best practice for securing the solr index folder where 
ideally only application user should be able to access. Even an admin user 
should not be able to copy the data and use it in another schema.

Thanks








Re: Securing solr index

2015-04-15 Thread Per Steffensen
That said, it might be nice to have a wiki-page (or something) explaining 
how it can be done, including maybe concrete cases about exactly how it 
has been done on different installations around the world using Solr


On 14/04/15 14:03, Per Steffensen wrote:

Hi

I might misunderstand you, but if you are talking about securing the 
actual files/folders of the index, I do not think this is a 
Solr/Lucene concern. Use standard mechanisms of your OS. E.g. on 
linux/unix use chown, chgrp, chmod, sudo, apparmor etc - e.g. allowing 
only root to write the folders/files and sudo the user running 
Solr/Lucene to operate as root in this area. Even admins should not 
(normally) operate as root - that way they cannot write the files 
either. No one knows the root-password - except maybe for the 
super-super-admin, or you split the root-password in two and two 
admins know a part each, so that they have to both agree in order to 
operate as root. Be creative yourself.


Regards, Per Steffensen

On 13/04/15 12:13, Suresh Vanasekaran wrote:

Hi,

We are having the solr index maintained in a central server and 
multiple users might be able to access the index data.


May I know what are best practice for securing the solr index folder 
where ideally only application user should be able to access. Even an 
admin user should not be able to copy the data and use it in another 
schema.


Thanks











Commit (hard) at shutdown?

2016-05-18 Thread Per Steffensen

Hi

Solr 5.1.
Someone in production in my organization claims that even though Solrs 
are shut down gracefully, there can be huge tlogs to replay when 
starting Solrs again. We are doing heavy indexing right up until Solrs 
are shut down, and we have <autoCommit> set to 1 min. Can anyone confirm 
(or the opposite) that Solrs, upon graceful shutdown, OUGHT TO do a 
(hard) commit, leaving tlogs empty (= nothing to replay when starting 
again)?


Regards, Per Steffensen


Re: Commit (hard) at shutdown?

2016-05-23 Thread Per Steffensen
Sorry, I did not see the responses here because I found out myself. It 
definitely seems like a hard commit is performed when shutting down 
gracefully. The info I got from production was wrong.
It is not necessarily obvious that you will lose data on "kill -9". The 
tlog ought to save you, but it is probably not 100% bulletproof.

We are not using the bin/solr script (yet)

On 21/05/16 04:02, Shawn Heisey wrote:

On 5/20/2016 2:51 PM, Jon Drews wrote:

I would be interested in an answer to this question.

 From my research it looks like it will do a hard commit if cleanly shut
down. However if you "kill -9" it you'll lose data (obviously). Perhaps
production isn't cleanly shutting down solr?
https://dzone.com/articles/understanding-solr-soft

I do not know whether a graceful shutdown does a hard commit or not.

I do know that all versions of Solr that utilize the bin/solr script are
configured by default to forcibly kill Solr only five seconds after the
graceful shutdown is requested.  Five seconds is usually not enough time
for production installs, so it needs to be increased.  The only way to
do this currently is to edit the bin/solr script directly.

Thanks,
Shawn






Export big extract from Solr to [My]SQL

2014-05-02 Thread Per Steffensen

Hi

I want to make extracts from my Solr to MySQL. Any tools around that can 
help me perform such a task? I find a lot about data-import from SQL 
when googling, but nothing about export/extract. It is not all of the 
data in Solr I need to extract. It is only documents that fulfill a 
normal Solr query, but the number of documents fulfilling it will 
(potentially) be huge.


Regards, Per Steffensen


How does query on AND work

2014-05-19 Thread Per Steffensen

Hi

Let's say I have a Solr collection (running across several servers) 
containing 5 billion documents. Among others, each document has a value for 
field "no_dlng_doc_ind_sto" (a long) and field 
"timestamp_dlng_doc_ind_sto" (also a long). Both "no_dlng_doc_ind_sto" 
and "timestamp_dlng_doc_ind_sto" are doc-value, indexed and stored. Like 
this in schema.xml (similarly for both fields)
<field name="no_dlng_doc_ind_sto" type="long" indexed="true" stored="true" required="true" docValues="true"/>
<fieldType name="long" class="solr.TrieLongField" ... positionIncrementGap="0" docValuesFormat="Disk"/>


I make queries like this: no_dlng_doc_ind_sto:(<numbers>) AND 
timestamp_dlng_doc_ind_sto:([<start> TO <end>])
* The "no_dlng_doc_ind_sto:(<numbers>)"-part of a typical query will hit 
between 500 and 1000 documents out of the total 5 billion
* The "timestamp_dlng_doc_ind_sto:([<start> TO <end>])"-part 
of a typical query will hit between 3-4 billion documents out of the 
total 5 billion


The question is: how does Solr/Lucene deal with such requests?
I am thinking that using the indices on both "no_dlng_doc_ind_sto" and 
"timestamp_dlng_doc_ind_sto" to get two sets of doc-ids and then make an 
intersection of those might not be the most efficient. You are making an 
intersection of two doc-id-sets of size 500-1000 and 3-4 billion. It 
might be faster to just use the index for "no_dlng_doc_ind_sto" to get 
the doc-ids for the 500-1000 documents, then for each of those fetch 
their "timestamp_dlng_doc_ind_sto"-value (using doc-value) to filter out 
the ones among the 500-1000 that does not match the timestamp-part of 
the query.
But what does Solr/Lucene actually do? Is it Solr- or Lucene-code that 
make the decision on what to do? Can you somehow "hint" the 
search-engine that you want one or the other method used?


Solr 4.4 (and corresponding Lucene), BTW, if that makes a difference

Regards, Per Steffensen


Re: How does query on AND work

2014-05-23 Thread Per Steffensen
I can answer some of this myself now that I have dived into it to 
understand what Solr/Lucene does and to see if it can be done better
* In current Solr/Lucene (or at least in 4.4) indices on both 
"no_dlng_doc_ind_sto" and "timestamp_dlng_doc_ind_sto" are used and the 
doc-id-sets found are intersected to get the final set of doc-ids
* It IS more efficient to just use the index for the 
"no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that 
part and then fetch timestamp-doc-values for those doc-ids to filter out 
the docs that do not match the "timestamp_dlng_doc_ind_sto"-part of 
the query. I have made changes to our version of Solr (and Lucene) to do 
that and response-times go from about 10 secs to about 1 sec (of course 
dependent on what's in the file-cache etc.) - in cases where 
"no_dlng_doc_ind_sto" hit about 500-1000 docs and 
"timestamp_dlng_doc_ind_sto" hit about 3-4 billion.


Regards, Per Steffensen

On 19/05/14 13:33, Per Steffensen wrote:

Hi

Lets say I have a Solr collection (running across several servers) 
containing 5 billion documents. A.o. each document have a value for 
field "no_dlng_doc_ind_sto" (a long) and field 
"timestamp_dlng_doc_ind_sto" (also a long). Both "no_dlng_doc_ind_sto" 
and "timestamp_dlng_doc_ind_sto" are doc-value, indexed and stored. 
Like this in schema.xml
<field name="no_dlng_doc_ind_sto" type="long" indexed="true" stored="true" required="true" docValues="true"/>
<fieldType name="long" class="solr.TrieLongField" ... positionIncrementGap="0" docValuesFormat="Disk"/>


I make queries like this: no_dlng_doc_ind_sto:(<numbers>) AND 
timestamp_dlng_doc_ind_sto:([<start> TO <end>])
* The "no_dlng_doc_ind_sto:(<numbers>)"-part of a typical query will hit 
between 500 and 1000 documents out of the total 5 billion
* The "timestamp_dlng_doc_ind_sto:([<start> TO <end>])"-part 
of a typical query will hit between 3-4 billion documents out of the 
total 5 billion


Question is how Solr/Lucene deals with such requests?
I am thinking that using the indices on both "no_dlng_doc_ind_sto" and 
"timestamp_dlng_doc_ind_sto" to get two sets of doc-ids and then make 
an intersection of those might not be the most efficient. You are 
making an intersection of two doc-id-sets of size 500-1000 and 3-4 
billion. It might be faster to just use the index for 
"no_dlng_doc_ind_sto" to get the doc-ids for the 500-1000 documents, 
then for each of those fetch their "timestamp_dlng_doc_ind_sto"-value 
(using doc-value) to filter out the ones among the 500-1000 that does 
not match the timestamp-part of the query.
But what does Solr/Lucene actually do? Is it Solr- or Lucene-code that 
make the decision on what to do? Can you somehow "hint" the 
search-engine that you want one or the other method used?


Solr 4.4 (and corresponding Lucene), BTW, if that makes a difference

Regards, Per Steffensen





Re: How does query on AND work

2014-05-26 Thread Per Steffensen
Do not know if this is a special-case. I guess an AND-query where one 
side hits 500-1000 and the other side hits billions is a special-case. 
But this way of carrying out the query might also be an optimization in 
less uneven cases.
It does not require that the "lots of hits"-part of the query is a 
range-query, and it does not necessarily require that the field used in 
this part is DocValue (you can go fetch the values from "slow" store). 
But I guess it has to be a very uneven case if this approach should be 
faster on a non-DocValue field.


I think this can be generalized. I think of it as something similar to 
being able to "hint" relational databases not to use a specific index. 
I do not know that much about Solr/Lucene query-syntax, but I believe 
"filter-queries" (fq) are kinda queries that will be AND'ed onto the 
real query (q), and in order not to have to change the query-syntax too 
much (adding hints or something), I guess a first step for a feature 
doing what I am doing here could be to introduce something similar to 
"filter-queries" - queries that will be carried out on the result of (q 
+ fqs) but looking at the values of the documents in that result instead 
of intersecting with doc-sets found from the index. Let's call it 
"post-query-value-filter"s (yes, we can definitely come up with a 
better/shorter name)


1) q=no_dlng_doc_ind_sto:(<numbers>) AND timestamp_dlng_doc_ind_sto:([<start> TO <end>])
2) q=no_dlng_doc_ind_sto:(<numbers>), fq=timestamp_dlng_doc_ind_sto:([<start> TO <end>])
3) q=no_dlng_doc_ind_sto:(<numbers>), post-query-value-filter=timestamp_dlng_doc_ind_sto:([<start> TO <end>])


1) and 2) both use the indices on both no_dlng_doc_ind_sto and 
timestamp_dlng_doc_ind_sto. 3) uses only the index on no_dlng_doc_ind_sto 
and does the time-interval filter part by fetching values (using 
DocValue if possible) for timestamp_dlng_doc_ind_sto for each of the 
docs found through the no_dlng_doc_ind_sto-index to see if this doc 
should really be included.


There are some things that I did not initially tell about actually 
wanting to do a facet search etc. Well, here is the full story: 
http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html


Regards, Per Steffensen

On 23/05/14 17:37, Toke Eskildsen wrote:

Per Steffensen [st...@designware.dk] wrote:

* It IS more efficient to just use the index for the
"no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
part and then fetch timestamp-doc-values for those doc-ids to filter out
the docs that does not match the "timestamp_dlng_doc_ind_sto"-part of
the query.

Thank you for the follow up. It sounds rather special-case though, with 
requirement of DocValues for the range-field. Do you think this can be 
generalized?

- Toke Eskildsen





Re: How does query on AND work

2014-05-27 Thread Per Steffensen
Thanks for responding, Yonik. I tried out your suggestion, and it seems 
to work as it is supposed to, and it performs at least as well as the 
"hacky implementation I did myself". Wish you had responded earlier. Or 
maybe not, then I wouldn't have dived into it myself making an 
implementation that does (almost) exactly what seems to be done when 
using your approach, and then I wouldn't have learned so much. But the 
great thing is that now I do not have to go suggest (or implement 
myself) this idea as a new Solr/Lucene feature - it is already there!


See 
http://solrlucene.blogspot.dk/2014/05/performance-of-and-queries-with-uneven.html. 
Hope you do not mind that I reference you and the link you pointed out.


Thanks a lot!

Regards, Per Steffensen

On 23/05/14 18:13, Yonik Seeley wrote:

On Fri, May 23, 2014 at 11:37 AM, Toke Eskildsen  
wrote:

Per Steffensen [st...@designware.dk] wrote:

* It IS more efficient to just use the index for the
"no_dlng_doc_ind_sto"-part of the request to get doc-ids that match that
part and then fetch timestamp-doc-values for those doc-ids to filter out
the docs that does not match the "timestamp_dlng_doc_ind_sto"-part of
the query.

Thank you for the follow up. It sounds rather special-case though, with 
requirement of DocValues for the range-field. Do you think this can be 
generalized?

Maybe it already is?
http://heliosearch.org/advanced-filter-caching-in-solr/

Something like this:
  &fq={!frange cache=false cost=150 v=timestampField l=beginTime u=endTime}


-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache
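
For completeness, a small SolrJ sketch of applying Yonik's suggestion (the field
value and the bounds are just placeholders from this thread, and "client" is
assumed to be a SolrServer instance such as a CloudSolrServer; cache=false
together with a cost of 100 or more is what makes the frange act as a
post-filter, evaluated only against documents matching the main query):

// Main query = the "few hits" part; the timestamp condition becomes a non-cached post-filter
SolrQuery query = new SolrQuery("no_dlng_doc_ind_sto:(12345678)");
query.addFilterQuery("{!frange cache=false cost=150 l=1400000000 u=1400086400 v=timestamp_dlng_doc_ind_sto}");
QueryResponse results = client.query(query);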





Re: How does query on AND work

2014-05-27 Thread Per Steffensen
Well, the only "search" I did was to ask this question on this 
mailing-list :-)


On 26/05/14 17:05, Alexandre Rafalovitch wrote:

Did not follow the whole story but " post-query-value-filter" does exist in
Solr. Have you tried searching for pretty much that expression. and maybe
something about cost-based filter.

Regards,
 Alex




How to migrate content of a collection to a new collection

2014-07-22 Thread Per Steffensen

Hi

We have numerous collections each with numerous shards spread across 
numerous machines. We just discovered that all documents have a field 
with a wrong value and besides that we would like to add a new field to 
all documents
* The field with the wrong value is a long, DocValued, Indexed and 
Stored. Some (about half) of the documents need to have a constant added 
to their current value
* The field we want to add will be an int, DocValued, Indexed and 
Stored. Needs to be added to all documents, but will have different 
values among the documents


How to achieve our goal in the easiest possible way?

We thought about spooling/streaming from the existing collection into a 
"twin"-collection, then delete the existing collection and finally 
rename the "twin"-collection to have the same name as the original 
collection. Basically indexing all documents again. If that is the 
easiest way, how do we query in a way so that we get all documents 
streamed? We cannot just do a *:* query that returns everything into 
memory and then index from there, because we have billions of documents 
(not enough memory). Please note that we are on 4.4, which does not 
contain the new CURSOR-feature. Please also note that speed is an 
important factor for us.


Guess this could also be achieved by doing 1-1 migration on shard-level 
instead of collection-level, keeping everything in the new collections 
on the same machine as where they lived in the old collections. That 
could probably complete faster than the 1-1 on collection-level 
approach. But this 1-1 on shard-level approach is not very good for us, 
because the long field we need to change is also part of the id 
(controlling the routing to a particular shard) and therefore actually 
we also need to change the id on all documents. So if we do the 1-1 on 
shard-level approach, we will end up having documents in shards that 
they actually do not belong to (they would not have been routed there by the 
routing system in Solr). We might be able to live with this disadvantage 
if 1-1 on shard-level can be easily achieved much faster than the 1-1 on 
collection-level.


Any input is very much appreciated! Thanks

Regards, Per Steffensen


Re: How to migrate content of a collection to a new collection

2014-07-24 Thread Per Steffensen

On 23/07/14 17:13, Erick Erickson wrote:

Per:

Given that you said that the field redefinition also includes routing
info
Exactly. It would probably be much faster to make sure that the new 
collection has the same number of shards on each Solr-machine and that 
the routing-ranges are identical and then do a local 1-1 copy on 
shard-level. But it just will not end up correctly wrt routing, because 
we also need to change our ids while copying from old to new collections.

  I don't see
any other way than re-indexing each collection. That said, could you use the
collection aliasing and do one collection at a time?
We will definitely do one collection at a time. Whether we will use 
aliasing or do something else to achieve create-new-twin-collection -> 
copy-from-old-collection-to-new-twin-collection -> 
delete-old-collection-and-let-new-twin-collection-take-its-place I do 
not know yet. But that is details, we will definitely be able to manage.


Best,
Erick




Re: How to migrate content of a collection to a new collection

2014-07-24 Thread Per Steffensen

Thanks for replying

I tried this "poor man's" cursor approach out ad-hoc, but I get OOM. 
Pretty sure this is because you need all uniqueKey-values in FieldCache 
in order to be able to sort on it. We do not have memory for that - and 
never will. Our uniqueKey field is not DocValue.

Just out of curiosity
* Will I have the same OOM problem using the CURSOR-feature in later Solrs?
* Will the "poor man's" cursor approach still be efficient if my 
uniqueKey were DocValued, knowing that all values for uniqueKey (the 
DocValue file) cannot fit in memory (OS file cache)?


Regards, Per Steffensen

On 23/07/14 23:57, Chris Hostetter wrote:

: billions of documents (not enough memory). Please note that we are on 4.4,
: which does not contain the new CURSOR-feature. Please also note that speed is
: an important factor for us.

for situations where you know you will be processing every doc and order
doesn't matter you can use a "poor man's" cursor by filtering on successive
ranges of your uniqueKey field as described in the "Is There A
Workaround?" section of this blog post...

http://searchhub.org/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

* sort on uniqueKey
* leave start=0 on every request
* add an fq to each request based on the last uniqueKey value from
   the previous request.


-Hoss
http://www.lucidworks.com/
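
A minimal SolrJ sketch of that loop, for the record (assuming a string uniqueKey 
called "id" and a SolrServer instance called "solr"; the page size is 
illustrative, and the id value should be escaped if it can contain 
query-syntax characters):

String lastId = null;
while (true) {
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(1000);                       // page size
    q.setStart(0);                         // always start=0
    q.setSort("id", SolrQuery.ORDER.asc);  // sort on the uniqueKey
    if (lastId != null) {
        // only ids strictly greater than the last one seen
        q.addFilterQuery("id:{" + lastId + " TO *]");
    }
    SolrDocumentList page = solr.query(q).getResults();
    if (page.isEmpty()) break;
    for (SolrDocument doc : page) {
        // process/re-index doc here
    }
    lastId = (String) page.get(page.size() - 1).getFieldValue("id");
}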





Bloom filter

2014-07-28 Thread Per Steffensen

Hi

Where can I find documentation on how to use Bloom filters in Solr 
(4.4). http://wiki.apache.org/solr/BloomIndexComponent seems to be 
outdated - there is no BloomIndexComponent included in 4.4 code.


Regards, Per Steffensen


Re: Bloom filter

2014-07-28 Thread Per Steffensen
Yes I found that one, along with SOLR-3950. Well at least it seems like 
the support is there in Lucene. I will figure out myself how to make it 
work via Solr, the way I need it to work. My use-case is not as 
specified in SOLR-1375, but the solution might be the same. Any input is 
of course still very much appreciated.


Regards, Per Steffensen

On 28/07/14 15:42, Lukas Drbal wrote:

Hi Per,

link to jira - https://issues.apache.org/jira/browse/SOLR-1375 Unresolved
;-)

L.


On Mon, Jul 28, 2014 at 1:17 PM, Per Steffensen  wrote:


Hi

Where can I find documentation on how to use Bloom filters in Solr (4.4).
http://wiki.apache.org/solr/BloomIndexComponent seems to be outdated -
there is no BloomIndexComponent included in 4.4 code.

Regards, Per Steffensen








Re: Bloom filter

2014-07-30 Thread Per Steffensen

On 30/07/14 08:55, jim ferenczi wrote:

Hi Per,
First of all the BloomFilter implementation in Lucene is not exactly a
bloom filter. It uses only one hash function and you cannot set the false
positive ratio beforehand. ElasticSearch has its own bloom filter
implementation (using "guava like" BloomFilter), you should take a look at
their implementation if you really need this feature.
Yes, I am looking into what Lucene can do and how to use it through 
Solr. If it does not fit our needs I will enhance it - potentially with 
inspiration from ES implementation. Thanks

What is your use-case ? If your index fits in RAM the bloom filter won't
help (and it may have a negative impact if you have a lot of segments). In
fact the only use case where the bloom filter can help is when your term
dictionary does not fit in RAM which is rarely the case.
We have so many documents that it will never fit in memory. We use 
optimistic locking (our own implementation) to do correct concurrent 
assembly of documents and to do duplicate control. This requires a lot of 
looking up docs by their id, and most of the time the document is not 
there, but to be sure we need to check both the transactionlog and the 
actual index (UpdateLog). We would like to use a Bloom Filter to quickly 
tell that a document with a particular id is NOT present.


Regards,
Jim

Regards, Per Steffensen


Re: Bloom filter

2014-07-30 Thread Per Steffensen

Hi

I am not sure exactly what LUCENE-5675 does, but reading the description 
it seems to me that it would help finding out that there is no document 
(having an id-field) where version-field is less than <some version>. As 
far as I can see this will not help finding out if a document with 
id=<id> exists. We want to ask "does a document with id <id> 
exist", without knowing the value of its version-field (if it actually 
exists). You do not know if it ever existed, either.


Please elaborate. Thanks!

Regarding "The only other choice today is bloom filters, which use up 
huge amounts of memory", I guess a bloom filter only takes as much space 
(disk or memory) as you want it to. The more space you allow it to use, 
the less often it will give you a false positive (saying "this doc might 
exist" in cases where the doc actually does not exist). So the space you 
need to use for the bloom filter depends on how frequently you can live 
with false positives (where you have to actually look it up in the real index).
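
To illustrate the trade-off with a general-purpose implementation like Guava's 
(mentioned elsewhere in this thread; the numbers here are made up): you choose 
the expected number of insertions and the accepted false-positive probability, 
and those two together determine how many bits the filter needs.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

// ~1 billion ids at a 1% false-positive rate costs roughly 1.2 GB of bits;
// accepting 10% false positives cuts that roughly in half.
BloomFilter<CharSequence> seenIds = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8),
        1000000000,   // expected number of ids
        0.01);        // accepted false-positive probability

seenIds.put("some-doc-id");
if (!seenIds.mightContain("another-doc-id")) {
    // definitely not there - skip the expensive lookup in tlog/index
}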


Regards, Per Steffensen

On 30/07/14 10:05, Shalin Shekhar Mangar wrote:

Hi Per,

There's LUCENE-5675 which has added a new postings format for IDs. Trying
it out in Solr is in my todo list but maybe you can get to it before me.

https://issues.apache.org/jira/browse/LUCENE-5675


On Wed, Jul 30, 2014 at 12:57 PM, Per Steffensen 
wrote:


On 30/07/14 08:55, jim ferenczi wrote:


Hi Per,
First of all the BloomFilter implementation in Lucene is not exactly a
bloom filter. It uses only one hash function and you cannot set the false
positive ratio beforehand. ElasticSearch has its own bloom filter
implementation (using "guava like" BloomFilter), you should take a look at
their implementation if you really need this feature.


Yes, I am looking into what Lucene can do and how to use it through Solr.
If it does not fit our needs I will enhance it - potentially with
inspiration from ES implementation. Thanks

  What is your use-case ? If your index fits in RAM the bloom filter won't

help (and it may have a negative impact if you have a lot of segments). In
fact the only use case where the bloom filter can help is when your term
dictionary does not fit in RAM which is rarely the case.


We have so many documents that it will never fit in memory. We use
optimistic locking (our own implementation) to do correct concurrent
assembly of documents and to do duplicate control. This require a lot of
finding docs from their id, and most of the time the document is not there,
but to be sure we need to check both transactionlog and the actual index
(UpdateLog). We would like to use Bloom Filter to quickly tell that a
document with a particular id is NOT present.


Regards,
Jim


Regards, Per Steffensen








Re: Bloom filter

2014-08-04 Thread Per Steffensen
I just finished adding support for persisted ("backed" as I call them) 
bloom-filters in Guava Bloom Filter. Implemented one kind of persisted 
bloom-filter that works on memory-mapped files.
I have changed our Solr code so that it uses such an enhanced Guava Bloom 
Filter. Making sure it is kept up to date and using it for quick "does 
definitely not exist" checks will help performance.


We do duplicate checking also, because we also might get the "same" data 
from our external provider numerous times. We do it using the unique-id 
feature in Solr where we make sure that if and only if (in practice) 
two documents are "the same" they have the same id. We encode most info 
on the document in its id - including hashes of textual fields. Works 
like a charm. It is exactly in this case we want to improve performance. 
Most of the time a document does not already exist when we do this 
duplicate check (using the unique-id feature), but it just takes a 
relatively long time to verify that, because you have to visit the index. 
We can get a quick "document with this id does not exist" using a 
bloom-filter on the id.
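
A tiny sketch of that kind of content-derived id (the field names and the 
choice of Murmur3 via Guava are illustrative only, not our actual scheme):

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

// "The same" content always maps to the same id, so Solr's uniqueKey
// handling takes care of duplicates for us.
static String contentId(String source, long timestamp, String body) {
    return source + "|" + timestamp + "|"
            + Hashing.murmur3_128().hashString(body, StandardCharsets.UTF_8);
}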


Regards, Per Steffensen

On 03/08/14 03:58, Umesh Prasad wrote:

+1 to Guava's BloomFilter implementation.

You can actually hook into UpdateProcessor chain and have the logic of
updating bloom filter / checking there.

We had a somewhat similar use case.  We were using DIH and it was possible
that same solr input document (meaning same content) will be coming lots of
times and it was leading to a lot of unnecessary updates in index. I
introduced a DuplicateDetector using update processor chain which kept a
map of Unique ID --> solr doc hash code and will drop the document if it
was a duplicate.

There is a nice video of other usage of Update chain

https://www.youtube.com/watch?v=qoq2QEPHefo




Re: can't overwrite and can't delete by id

2013-11-23 Thread Per Steffensen

I believe it would be easier to help you out if you
* Tell exactly what you do when you try to update the doc
* Show your solrconfig.xml (especially the UpdateHandler-part)
* Tell how your shards are distributed. Is it 4 shards on one Solr-node, 
or is it 4 Solr-nodes with one shard each, or...? I assume you are running 
"cloud"-mode and that the shards belong to the same collection.

Any custom routing?

Regards, Per Steffensen

On 11/22/13 8:32 PM, Mingfeng Yang wrote:

BTW:  it's a 4 shards solorcloud cluster using zookeeper 3.3.5


On Fri, Nov 22, 2013 at 11:07 AM, Mingfeng Yang wrote:


Recently, I found out that  I can't delete doc by id or overwrite a doc
  from/in my SOLR index which is based on SOLR 4.4.0 version.

Say, I have a doc  http://pastebin.com/GqPP4Uw4  (to make it easier to
view, I use pastebin here).  And I tried to add a dynamic field "rank_ti"
to it, want to make it like http://pastebin.com/dGnRRwux

Funny thing is that after I inserted the new version of doc, if I do query
"curl 'localhost:8995/solr/select?wt=json&indent=true&q=id:28583776' " ,
  the two versions above will appear randomly. And after half a minute,
  version 2 will disappear, which means the update does not get written to the
disk.

I tried to delete by id with rsolr, and the doc just can't be removed.

Insert new doc into the index is fine though.

Anyone ran into this strange behavior before?

Thanks
Ming





Upgrading from SolrCloud 4.x to 4.y - as if you had used 4.y all along

2014-01-22 Thread Per Steffensen
If you are upgrading from SolrCloud 4.x to a later version 4.y, and 
basically want your end-system to seem as if it had been running 4.y (no 
legacy mode or anything) all along, you might find some inspiration here


http://solrlucene.blogspot.dk/2014/01/upgrading-from-solrcloud-4x-to-4y-as-if.html 



Solr in non-persistent mode

2014-01-23 Thread Per Steffensen

Hi

In Solr 4.0.0 I used to be able to run with persistent=false (in 
solr.xml). I can see 
(https://cwiki.apache.org/confluence/display/solr/Format+of+solr.xml) 
that persistent is no longer supported in solr.xml. Does this mean that 
you cannot run in non-persistent mode any longer, or does it mean that I 
have to configure it somewhere else?


Thanks!

Regards, Per Steffensen


Re: Solr in non-persistent mode

2014-01-25 Thread Per Steffensen
Well, we were using it in our automatic tests to make them run faster - 
so that is at least a use-case. But after upgrading to 4.4 using the new 
solr.xml-style we are not running our test-suite with Solrs in 
non-persistent mode anymore (we can't). But actually it seems like the 
test-suite completes in almost the same time as before, so it is not 
a big issue for us.


Regards, Per Steffensen

On 1/23/14 6:09 PM, Mark Miller wrote:

Yeah, I think we removed support in the new solr.xml format. It should still 
work with the old format.

If you have a good use case for it, I don’t know that we couldn’t add it back 
with the new format.

- Mark



On Jan 23, 2014, 3:26:05 AM, Per Steffensen  wrote: Hi

In Solr 4.0.0 I used to be able to run with persistent=false (in
solr.xml). I can see
(https://cwiki.apache.org/confluence/display/solr/Format+of+solr.xml)
that persistent is no longer supported in solr.xml. Does this mean that
you cannot run in non-persistent mode any longer, or does it mean that I
have to configure it somewhere else?

Thanks!

Regards, Per Steffensen





ant eclipse hangs - branch_4x

2014-01-30 Thread Per Steffensen

Hi

Earlier I used to be able to successfully run "ant eclipse" from 
branch_4x. With the newest code (tip of branch_4x today) I can't. "ant 
eclipse" hangs forever at the point shown by the console output below. I 
noticed that this problem has been around for a while - not something 
that happened today. Any idea about what might be wrong? A solution? 
Help to debug?


Regards Per Steffensen

--- console when running "ant eclipse" -

...

resolve:
 [echo] Building solr-example-DIH...

ivy-availability-check:
 [echo] Building solr-example-DIH...

ivy-fail:

ivy-configure:
[ivy:configure] :: loading settings :: file = 
/Some/Path/ws_kepler_apache_lucene_solr_4x/branch_4x/lucene/ivy-settings.xml


resolve:

resolve:
 [echo] Building solr-core...

ivy-availability-check:
 [echo] Building solr-core...

ivy-fail:

ivy-fail:

ivy-configure:
[ivy:configure] :: loading settings :: file = 
/Some/Path/ws_kepler_apache_lucene_solr_4x/branch_4x/lucene/ivy-settings.xml


resolve:

HERE IT JUST HANGS FOREVER
-


Re: ant eclipse hangs - branch_4x

2014-01-30 Thread Per Steffensen

Hi

I used Ivy 2.2.0. Upgraded to 2.3.0. Didn't help.
No .lck files found in ~/.ivy2/cache, so nothing to delete.
Deleted the entire ~/.ivy2/cache folder. Didn't help.
Debugged a little and found that it was hanging due to org.apache.hadoop 
dependencies in solr/core/ivy.xml - if I commented out everything that 
had to do with hadoop in that ivy.xml it didn't hang in "ant resolve" 
(from solr/core).
Finally the problem was solved when I tried to add 
http://central.maven.org/maven2 to our Artifactory. I do not understand 
why that was necessary, because we already had 
http://repo1.maven.org/maven2/ in our Artifactory.


Well never mind - it works for me now.

Thanks for the help!

Regards, Per Steffensen

On 1/30/14 1:11 PM, Steve Rowe wrote:

Hi Per,

You may be seeing the stale-Ivy-lock problem (see IVY-1388). LUCENE-4636 
upgraded the bootstrapped Ivy to 2.3.0 to reduce the likelihood of this 
problem, so the first thing is to make sure you have that version in 
~/.ant/lib/ - if not, remove the Ivy jar that’s there and run ‘ant 
ivy-bootstrap’ to download and put the 2.3.0 jar in place.

You should run the following and remove any files it finds:

 find ~/.ivy2/cache -name ‘*.lck’

That should stop ‘ant resolve’ from hanging.

Steve
  
On Jan 30, 2014, at 5:06 AM, Per Steffensen  wrote:



Hi

Earlier in used to be able to successfully run "ant eclipse" from branch_4x. With the 
newest code (tip of branch_4x today) I cant. "ant eclipse" hangs forever at the point 
showed by console output below. I noticed that this problem has been around for a while - not 
something that happened today. Any idea about what might be wrong? A solution? Help to debug?

Regards Per Steffensen

--- console when running "ant eclipse" -

...

resolve:
 [echo] Building solr-example-DIH...

ivy-availability-check:
 [echo] Building solr-example-DIH...

ivy-fail:

ivy-configure:
[ivy:configure] :: loading settings :: file = 
/Some/Path/ws_kepler_apache_lucene_solr_4x/branch_4x/lucene/ivy-settings.xml

resolve:

resolve:
 [echo] Building solr-core...

ivy-availability-check:
 [echo] Building solr-core...

ivy-fail:

ivy-fail:

ivy-configure:
[ivy:configure] :: loading settings :: file = 
/Some/Path/ws_kepler_apache_lucene_solr_4x/branch_4x/lucene/ivy-settings.xml

resolve:

HERE IT JUST HANGS FOREVER
-






Re: Fault Tolerant Technique of Solr Cloud

2014-02-18 Thread Per Steffensen
If localhost:8900 is down but localhost:8983 contains a replica of the same 
shard(s) that 8900 was running, all data/documents are still available. 
You cannot query the shutdown server (port 8900), but you can query any 
of the other servers (8983, 7574 or 7500). If you make a distributed 
query to collection1 you should still be able to find all of your 
documents, even though 8900 is down.


It is cumbersome to keep a list of crashed/shutdown servers, in order to 
make sure you are always querying a server that is not down. The 
information about what servers are running (and which are not) and which 
replica they run are all in ZooKeeper. So basically, just go look in 
ZooKeeper :-) Ahh, Solr has tool to help you do that - at least if you 
are running your client in java-code. Solr implement different kinds of 
clients (called XXXSolrServer - yes, obvious name for a client). There 
are HttpSolrServer that can do queries against a particular server (wont 
help you), there are LBHttpSolrServer that can do load-balancing over 
several HttpSolrServers (ahh, still not there), and there are 
CloudSolrServer that watches ZooKeeper in order to know what is running 
and where to send requests. CloudSolrServer uses LBHttpSolrServer behind 
the scenes. If you use CloudSolrServer as a client everything should be 
smooth and transparent with respect to querying when servers are down. 
CloudSolrServer will find out where to (and not to) route your requests.


Regards, Per Steffensen

On 18/02/14 14:05, Vineet Mishra wrote:

Hi All,

I want to have clear idea about the Fault Tolerant Capability of SolrCloud

Considering I have setup the SolrCloud with a external Zookeeper, 2 shards,
each having a replica with single collection as given in the official Solr
Documentation.

https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

*Collection1*
        /      \
  *Shard 1*        *Shard 2*
  localhost:8983   localhost:7574
  localhost:8900   localhost:7500


I Indexed some document and then if I shutdown any of the replica or Leader
say for ex- *localhost:8900*, I can't query to the collection to that
particular port

http:/*/localhost:8900*/solr/collection1/select?q=*:*

Then how is it Fault Tolerant or how the query has to be made.

Regards





Re: Fault Tolerant Technique of Solr Cloud

2014-02-19 Thread Per Steffensen

On 19/02/14 07:57, Vineet Mishra wrote:

Thanks for all your response but my doubt is which *Server:Port* should the
query be made as we don't know the crashed server or which server might
crash in the future(as any server can go down).
That is what CloudSolrServer will deal with for you. It knows which 
servers are down and make sure not to send request to those servers.


The only intention for writing this doubt is to get an idea about how the
query format for distributed search might work if any of the shard or
replica goes down.


// Setting up your CloudSolrServer-client
CloudSolrServer client = new CloudSolrServer(zkHost);  // zkHost 
being the same string as you provide in -DzkHost when starting 
your servers
client.setDefaultCollection("collection1");
client.connect();

// Creating and firing queries (you can do it in different way, but at least 
this is an option)
SolrQuery query = new SolrQuery("*:*");
QueryResponse results = client.query(query);


Because you are using CloudSolrServer you do not have to worry about not 
sending the request to a crashed server.


In your example I believe the situation is as follows:
* One collection called "collection1" with two shards "shard1" and 
"shard2" each having two replica "replica1" and "replica2" (a replica is 
an "instance" of a shard, and when you have one replica you are not 
having replication).
* collection1.shard1.replica1 is running on localhost:8983 and 
collection1.shard1.replica2 is running on localhost:8900 (or maybe switched)
* collection1.shard2.replica1 is running on localhost:7574 and 
collection1.shard2.replica2 is running on localhost:7500 (or maybe switched)
If localhost:8900 is the only server that is down, all data is still 
available for search because every shard has at least one replica 
running. In that case I believe setting "shards.tolerant" will not make 
a difference. You will get your response no matter what. But if 
localhost:8983 was also down there would be no live replica of shard1. In 
that case you will get an exception from your query, indicating that the 
query cannot be carried out over the complete data-set. In that case if 
you set "shards.tolerant" that behaviour will change, and you will not 
get an exception - you will get a real response, but it will just not 
include data from shard1, because it is not available at the moment. 
That is just the way I believe "shards.tolerant" works, but you might 
want to verify that.


To set "shards.tolerant":

SolrQuery query = new SolrQuery("*:*");
query.set("shards.tolerant", true);
QueryResponse results = client.query(query);


I believe distributed search is the default, but you can explicitly require it by

query.setDistrib(true);

or

query.set("distrib", true);



Thanks




Re: Fault Tolerant Technique of Solr Cloud

2014-02-24 Thread Per Steffensen

On 24/02/14 13:04, Vineet Mishra wrote:

Can you brief as how to make a direct call to Zookeeper instead of Cloud
Collection (as currently I was querying the Cloud something like
"http://192.168.2.183:8900/solr/collection1/select?q=*:*") from UI, now
if I assume shard 8900 is down then how can I still make the call.
It is obvious that you cannot make the call to localhost:8900 - the 
server listening to that port is down. You can make the call to any of 
the other servers, though. Information about which Solr-servers are 
running is available in ZooKeeper, CloudSolrServer reads that 
information in order to know which servers to route requests to. As long 
as localhost:8900 is down it will not route requests to that server.


You do not make a "direct call to ZooKeeper". ZooKeeper is not an HTTP 
server that will receive your calls. It just has information about which 
Solr-servers are up and running. CloudSolrServer takes advantage of 
that information. You really cannot do without CloudSolrServer (or at 
least LBHttpSolrServer), unless you write a component that can do the 
same thing in some other language (if the reason you do not want to use 
CloudSolrServer, is that your client is not java). Else you need to do 
other clever stuff, like e.g. what Shalin suggests.


Regards, Per Steffensen


Re: SOLR cloud disaster recovery

2014-02-28 Thread Per Steffensen
We have created some scripts that can do this for you - basically 
reconstruct (by looking at information in ZK) solr.xml, core.properties 
etc on the new machine as they were on the machine that crashed. Our 
procedure when a machine crashes is

* Remove it from the rack, replace it by a similar machine with the same hostname/IP
* Run the scripts pointing out the IP of the machine that needs to have 
solr.xml and core.properties written
* Start Solr on this machine - it now runs the same set of replica that 
the crashed machine did. I guess they will sync automatically with their 
sister-replica, but I do not know, because we do not use replication.


I might be able to find something for you. Which version are you using - 
I have some scripts that work on 4.0 and some other scripts that work 
for 4.4 (and maybe later).


Regards, Per Steffensen

On 28/02/14 16:17, Jan Van Besien wrote:

Hi,

I am a bit confused about how solr cloud disaster recovery is supposed
to work exactly in the case of loosing a single node completely.

Say I have a solr cloud cluster with 3 nodes. My collection is created
with numShards=3&replicationFactor=3&maxShardsPerNode=3, so there is
no data loss when I loose a node.

However, how do configure a new node to take the place of the dead
node? I bring up a new node (same hostname, ip, as the dead node)
which is completely empty (empty data dir, empty solr.xml), install
solr, and connect it to zookeeper.

Is it supposed to work automatically from there? In my tests, the
server has no cores and the solr-cloud graph overview simply shows all
the shards/replicas on this node as down. Do I need to recreate the
cores first? Note that these cores were initially created indirectly
by creating the collection.

Thanks,
Jan





In-memory collections?

2013-08-06 Thread Per Steffensen

Hi

Is there a way I can configure Solrs so that they handle their shards 
completely in memory? If yes, how? No writing to disk - neither 
transactionlog nor lucene indices. Of course I accept that data is lost 
if the Solr crash or is shut down.


Regards, Per Steffensen


Re: In-memory collections?

2013-08-07 Thread Per Steffensen

On 8/7/13 9:04 AM, Shawn Heisey wrote:

On 8/7/2013 12:13 AM, Per Steffensen wrote:

Is there a way I can configure Solrs so that it handles its shared
completely in memory? If yes, how? No writing to disk - neither
transactionlog nor lucene indices. Of course I accept that data is lost
if the Solr crash or is shut down.

The lucene index part can be done using RAMDirectoryFactory.  It's
generally not a good idea, though.  If you have enough RAM for that,
then you have enough RAM to fit your entire index into the OS disk
cache.  I don't think you can really do anything about the transaction
log being on disk, but I could be incorrect about that.

Relying on the OS disk cache and the default directory implementation
will usually give you equivalent or better query performance compared to
putting your index into JVM memory.  You won't need a massive Java heap
and the garbage collection problems that it creates.  A side bonus: you
don't lose your index when Solr shuts down.

If you have extremely heavy indexing, then RAMDirectoryFactory might
work better -- assuming you've got your GC heavily tuned.  A potentially
critical problem with RAMDirectoryFactory is that merging/optimizing
will require at least twice as much RAM as your total index size.

Here's a complete discussion about this:

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

NB: That article was written for 3.x, when NRTCachingDirectoryFactory
(the default in 4.x) wasn't available.  The NRT factory *uses*
MMapDirectory.

Thanks,
Shawn



Thanks, Shawn

The thing is that this will be used for a small ever-changing 
collection. In our system we load a lot of documents into a SolrCloud 
cluster. A lot of processes across numerous machines work in parallel on 
loading those documents. Those processes needs to coordinate (hold each 
other back) from time to time and they do so by taking distributed 
locks. Until now we have used the ZooKeeper cluster at hand for taking 
those distributed locks, but the need for locks is so heavy that it 
causes congestion in ZooKeeper, and ZooKeeper really cannot scale in 
that area. We could use several ZooKeeper clusters, but we have decided 
to use a "locking" collection in Solr instead - that will scale. You can 
implement locking in Solr using versioning and optimistic locking. So 
this collection will at any time just contain the few locks (a few 
hundred at most) that are current "right now". Lots of locks will be 
taken, but each of them will only exist for a few ms before being deleted 
again. Therefore it will not take up a lot of memory, I guess?
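
A rough SolrJ sketch of how such a lock can be taken (just a sketch of the idea, 
not our actual implementation; collection and lock names are made up, and 
setting _version_ to -1 means "only create this document if no document with 
this id exists yet", so the add fails with a version conflict if someone else 
already holds the lock):

CloudSolrServer locks = new CloudSolrServer(zkHost);
locks.setDefaultCollection("locks");  // the small "locking" collection

SolrInputDocument lock = new SolrInputDocument();
lock.addField("id", "lock:resource-42");  // the resource we want to lock
lock.addField("_version_", -1L);          // optimistic locking: must not exist yet

try {
    locks.add(lock);   // throws (version conflict / HTTP 409) if the lock is already held
    // ... do the work that required the lock ...
    locks.deleteById("lock:resource-42");  // release the lock
} catch (Exception e) {
    // a version conflict here means somebody else holds the lock - back off and retry
}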


Guess we will try RAMDirectoryFactory, and I will look into how we can 
avoid Solr transactionlog being written (to disk at least).


Regards, Per Steffensen


Group/distinct

2013-08-28 Thread Per Steffensen

Hi

I have a set of collections containing documents with the fields "a", 
"b" and "timestamp".
A LOT of documents, and a lot of them have the same value for "a", and for 
each value of "a" there is only a very limited set of distinct values in 
the "b"'s. The "timestamp"-values are different for (almost) all documents.


Can I make a group/distinct query to Solr returning all distinct values 
of "a" where "timestamp" is within a certain period of time? If yes, 
how? Guess this is just using grouping or faceting, but what is the difference 
and which one is best? Do any of them require that the fields have been 
"prepared" for grouping/faceting by setting it up in the schema?


Can I make a query to Solr returning all distinct values of "a" where 
"timestamp" is within a certain period of time, and also, for each 
distinct "a", have the limited set of distinct "b"-values returned? I 
guess this will require grouping/faceting on multiple fields, but can you do 
that? Other suggestions on how to achieve this?
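
Not an answer from the thread, but a faceting-based sketch in SolrJ could look 
like the following (URL, collection name and the timestamp range format are 
assumptions, and facet.pivot support in distributed/SolrCloud setups depends on 
the Solr version):

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.PivotField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistinctValuesSketch {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("timestamp:[2013-08-01T00:00:00Z TO 2013-08-28T00:00:00Z]");
    q.setRows(0);                 // we only want the facet counts, not the documents
    q.setFacet(true);
    q.addFacetField("a");         // distinct values of "a" in the period
    q.setFacetLimit(-1);          // do not cap the number of distinct values
    q.setFacetMinCount(1);
    q.set("facet.pivot", "a,b");  // for each "a", the distinct "b" values

    QueryResponse rsp = solr.query(q);
    for (FacetField.Count c : rsp.getFacetField("a").getValues()) {
      System.out.println(c.getName() + " (" + c.getCount() + " docs)");
    }
    List<PivotField> pivots = rsp.getFacetPivot().get("a,b");
    if (pivots != null) {
      for (PivotField p : pivots) {
        System.out.print(p.getValue() + " -> b-values:");
        for (PivotField child : p.getPivot()) {
          System.out.print(" " + child.getValue());
        }
        System.out.println();
      }
    }
  }
}

Both grouping and faceting generally require the fields to be indexed; for pure 
distinct-value listing, faceting is usually the more natural fit.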


Regards, Per Steffensen


Complex group request

2013-08-30 Thread Per Steffensen

Hi

I want to do a fairly complex grouping request against Solr. Let's say 
that I have fields "field1" and "timestamp" for all my documents.


In the request I want to provide a set of time-intervals, and for each 
distinct value of "field1" I want to get a count of how many of the 
time-intervals contain at least one document where the value of 
"field1" is that distinct value. Smells like grouping but with more 
advanced counting.


Example
Documents in Solr
field1 | timestamp
a      | 1
a      | 2
b      | 1
a      | 3
c      | 5
a      | 10
b      | 12
b      | 11
a      | 13
d      | 14

Doing a query with the following time-intervals (both ends included)
time-interval#1: 1 to 2
time-interval#2: 3 to 5
time-interval#3: 6 to 12

I would like to get the following result
field1-value | count
a  | 3
b  | 2
c  | 1
Reasons
* field1-value a: Count=3, because there is a document with field1=a and 
a timestamp between 1 to 2 (actually there are 2 such documents, but we 
only count in how many time-intervals a is present and do not consider 
how many times a is present in that interval), AND because there is a 
document with field1=a and a timestamp between 3 and 5, AND because 
there is a document with field1=a and a timestamp between 6 and 12
* field1-value b: Count=2, because there is at least one document with 
field1=b in time-interval#1 AND time-interval#3 (there is no document 
with field1=b in time-interval#2)
* field1-value c: Count=1, because there is at least one document with 
field1=c in time-interval#2 (there is no document with field1=c in 
neither time-interval#1 nor time-interval#3)
* No field1-value d in the result-set, because d is not present in any 
of the time-intervals.


The query part of the request probably needs to be
* q=timestamp:([1 TO 2]) OR timestamp:([3 TO 5]) OR timestamp:([6 TO 12])
but if I just add the following to the request
* group=true
* group.field=field1
* group.limit=1 (strange that you cannot set this to 0 BTW - I am not 
interested in any of the documents)

I will get the following result
field1/group-value | count
a | 4 (because there is a total of 4 documents with field1=a in those time-intervals)
b | 3
c | 1

1) Is it possible for me to create a request that will produce the 
result I want?

2) If yes to 1), how? What will the request look like?
3) If yes to 1), will it work in a distributed SolrCloud setup?
4) If yes to 1), will it perform?
5) If no to 1), is there a fairly simple Solr-code-change I can do in 
order to make it possible? You do not have to hand me the solution, but 
a few comments on how easy/hard it would be, and ideas on how to attack 
the challenge would be nice.
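
Not from the thread, but one way to approximate this client-side - assuming the 
number of time-intervals stays small - is to send one facet request per interval 
and then count, per field1 value, in how many intervals it showed up at least 
once. A rough SolrJ sketch (host and collection names are made up):

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class IntervalPresenceCount {
  public static void main(String[] args) throws Exception {
    SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    String[] intervals = { "[1 TO 2]", "[3 TO 5]", "[6 TO 12]" };

    // field1 value -> number of intervals in which it occurs at least once
    Map<String, Integer> presence = new HashMap<String, Integer>();

    for (String interval : intervals) {
      SolrQuery q = new SolrQuery("*:*");
      q.addFilterQuery("timestamp:" + interval);
      q.setRows(0);
      q.setFacet(true);
      q.addFacetField("field1");
      q.setFacetMinCount(1);    // only values that actually occur in this interval
      q.setFacetLimit(-1);

      FacetField ff = solr.query(q).getFacetField("field1");
      for (FacetField.Count c : ff.getValues()) {
        Integer n = presence.get(c.getName());
        presence.put(c.getName(), n == null ? 1 : n + 1);
      }
    }

    for (Map.Entry<String, Integer> e : presence.entrySet()) {
      System.out.println(e.getKey() + " | " + e.getValue());
    }
  }
}

This only sketches 1) and 2); whether it performs (question 4) depends heavily 
on how many distinct field1 values each per-interval facet request has to return.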


Thanks!

Regards, Per Steffensen


No or limited use of FieldCache

2013-09-11 Thread Per Steffensen

Hi

We have a SolrCloud setup handling huge amounts of data. When we do 
group, facet or sort searches Solr will use its FieldCache, and add data 
to it for every single document we have. For us it is not realistic that 
this will ever fit in memory, and we get OOM exceptions. Is there some 
way of disabling the FieldCache (taking the performance penalty, of 
course) or making it behave in a nicer way where it only uses up to e.g. 
80% of the memory available to the JVM? Or other suggestions?


Regards, Per Steffensen


Re: No or limited use of FieldCache

2013-09-11 Thread Per Steffensen
The reason I mention sort is that in my project, half a year ago, we 
dealt with the FieldCache->OOM problem when doing sort-requests. We 
basically just reject sort-requests unless they hit fewer than X documents - 
if they do, we just fetch them without sorting and sort them 
ourselves afterwards.


Currently our problem is that we have to do a group/distinct (in 
SQL-language) query, and we have found that we can do what we want to do 
using group (http://wiki.apache.org/solr/FieldCollapsing) or facet - 
either will work for us. The problem is that they both use FieldCache, and we 
"know" that using FieldCache will lead to OOM-exceptions with the amount 
of data each of our Solr-nodes administrates. This time we really have no 
option of just "limiting" usage as we did with sort. Therefore we need a 
group/distinct-functionality that works even on huge data-amounts (and an 
algorithm using FieldCache will not)


I believe setting facet.method=enum will actually make facet not use the 
FieldCache. Is that true? Is it a bad idea?
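
For what it is worth, this is how the parameter would be passed from SolrJ - a 
sketch with made-up names. As far as I know facet.method=enum enumerates the 
terms and uses filters/the filterCache rather than un-inverting the field, so it 
tends to pay off only when the field has relatively few distinct values:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class EnumFacetSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.setFacet(true);
    q.addFacetField("a");
    q.set("facet.method", "enum");   // avoid the FieldCache-based "fc" method
    FacetField ff = solr.query(q).getFacetField("a");
    System.out.println(ff.getValueCount() + " distinct values");
  }
}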


I do not know much about DocValues, but I do not believe that you will 
avoid FieldCache by using DocValues? Please elaborate, or point to 
documentation where I will be able to read that I am wrong. Thanks!


Regards, Per Steffensen

On 9/11/13 1:38 PM, Erick Erickson wrote:

I don't know any more than Michael, but I'd _love_ some reports from the
field.

There are some restrictions on DocValues though; I believe one of them
is that they don't really work on analyzed data

FWIW,
Erick




Re: No or limited use of FieldCache

2013-09-11 Thread Per Steffensen
Thanks, guys. Now I know a little more about DocValues and realize that 
they will do the job wrt FieldCache.


Regards, Per Steffensen

On 9/12/13 3:11 AM, Otis Gospodnetic wrote:

Per,  check zee Wiki, there is a page describing docvalues. We used them
successfully in a solr for analytics scenario.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Sep 11, 2013 9:15 AM, "Michael Sokolov" 
wrote:


On 09/11/2013 08:40 AM, Per Steffensen wrote:


The reason I mention sort is that we in my project, half a year ago, have
dealt with the FieldCache->OOM-problem when doing sort-requests. We
basically just reject sort-requests unless they hit below X documents - in
case they do we just find them without sorting and sort them ourselves
afterwards.

Currently our problem is, that we have to do a group/distinct (in
SQL-language) query and we have found that we can do what we want to do
using group (http://wiki.apache.org/solr/FieldCollapsing)
or facet - either will work for us. Problem is that they both use
FieldCache and we "know" that using FieldCache will lead to OOM-execptions
with the amount of data each of our Solr-nodes administrate. This time we
have really no option of just "limit" usage as we did with sort. Therefore
we need a group/distinct-functionality that works even on huge data-amounts
(and a algorithm using FieldCache will not)

I believe setting facet.method=enum will actually make facet not use the
FieldCache. Is that true? Is it a bad idea?

I do not know much about DocValues, but I do not believe that you will
avoid FieldCache by using DocValues? Please elaborate, or point to
documentation where I will be able to read that I am wrong. Thanks!


There is Simon Willnauer's presentation
http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene

and this blog post
http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/

and this one that shows some performance comparisons:
http://searchhub.org/2013/04/02/fun-with-docvalues-in-solr-4-2/








Storing/indexing speed drops quickly

2013-09-11 Thread Per Steffensen

Hi

SolrCloud 4.0: 6 machines, quadcore, 8GB ram, 1T disk, one Solr-node on 
each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread one 
doc at a time, full speed (they always have a new doc to store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection. Things are fine wrt storing/indexing 
speed for the first two-three hours (100M docs per hour), then speed 
goes down dramatically to a level that is unacceptable for us (max 10M per 
hour). At the same time as speed goes down, we see that I/O wait 
increases dramatically. I am not 100% sure, but a quick investigation has 
shown that this is due to almost constant merging.


What to do about this problem?
I know that you can play around with mergeFactor and commit-rate, but 
earlier tests show that this does not really seem to do the job - it 
might postpone the time where the problem occurs, but basically it is 
just a matter of time before merging exhausts the system.
Is there a way to totally avoid merging, and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data-amounts become big? (guess without merging you will end 
up with lots and lots of "small" files, and I guess this is not good for 
search response-time)


Regards, Per Steffensen


Re: Storing/indexing speed drops quickly

2013-09-12 Thread Per Steffensen
Maybe the fact that we are never ever going to delete or update 
documents can be used for something. If we delete, we will delete entire 
collections.


Regards, Per Steffensen

On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi

SolrCloud 4.0: 6 machines, quadcore, 8GB ram, 1T disk, one Solr-node 
on each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread 
one doc at the time, full speed (they always have a new doc to 
store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection. Things are fine wrt 
storing/indexing speed for the first two-three hours (100M docs per 
hour), then speed goes down dramatically, to an, for us, unacceptable 
level (max 10M per hour). At the same time as speed goes down, we see 
that I/O wait increases dramatically. I am not 100% sure, but quick 
investigation has shown that this is due to almost constant merging.


What to do about this problem?
Know that you can play around with mergeFactor and commit-rate, but 
earlier tests shows that this really do not seem to do the job - it 
might postpone the time where the problem occurs, but basically it is 
just a matter of time before merging exhaust the system.
Is there a way to totally avoid merging, and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data-amounts become big? (guess without merging you will end 
up with lots and lots of "small" files, and I guess this is not good 
for search response-time)


Regards, Per Steffensen




Re: Storing/indexing speed drops quickly

2013-09-12 Thread Per Steffensen

Seems like the attachments didn't make it through to this mailing list

https://dl.dropboxusercontent.com/u/25718039/doccount.png
https://dl.dropboxusercontent.com/u/25718039/iowait.png


On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi

SolrCloud 4.0: 6 machines, quadcore, 8GB ram, 1T disk, one Solr-node 
on each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread 
one doc at the time, full speed (they always have a new doc to 
store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection. Things are fine wrt 
storing/indexing speed for the first two-three hours (100M docs per 
hour), then speed goes down dramatically, to an, for us, unacceptable 
level (max 10M per hour). At the same time as speed goes down, we see 
that I/O wait increases dramatically. I am not 100% sure, but quick 
investigation has shown that this is due to almost constant merging.


What to do about this problem?
Know that you can play around with mergeFactor and commit-rate, but 
earlier tests shows that this really do not seem to do the job - it 
might postpone the time where the problem occurs, but basically it is 
just a matter of time before merging exhaust the system.
Is there a way to totally avoid merging, and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data-amounts become big? (guess without merging you will end 
up with lots and lots of "small" files, and I guess this is not good 
for search response-time)


Regards, Per Steffensen




Re: No or limited use of FieldCache

2013-09-12 Thread Per Steffensen

Yes, thanks.

Actually, some months back I made a PoC of a FieldCache that could expand 
beyond the heap. Basically, imagine a FieldCache with room for 
"unlimited" data-arrays, which just behind the scenes goes to 
memory-mapped files when there is no more room on the heap. I never finished 
it, and it might be kinda stupid, because you actually just read the 
data from the Lucene indices and write it to memory-mapped files in order 
to use it. It is better to just use the data in the Lucene indices 
directly. But it had some nice features. That solution will also have 
the "running out of swap space" problems, though.


Regards, Per Steffensen

On 9/12/13 12:48 PM, Erick Erickson wrote:

Per:

One thing I'll be curious about. From my reading of DocValues, it uses
little or no heap. But it _will_ use memory from the OS if I followed
Simon's slides correctly. So I wonder if you'll hit swapping issues...
Which are better than OOMs, certainly...

Thanks,
Erick




Re: No or limited use of FieldCache

2013-09-12 Thread Per Steffensen

On 9/12/13 3:28 PM, Toke Eskildsen wrote:

On Thu, 2013-09-12 at 14:48 +0200, Per Steffensen wrote:

Actually some months back I made PoC of a FieldCache that could expand
beyond the heap. Basically imagine a FieldCache with room for
"unlimited" data-arrays, that just behind the scenes goes to
memory-mapped files when there is no more room on heap.

That sounds a lot like disk-based DocValues.


He he

But that solution will also have the "running out of swap space"-problems.

Not really. Memory mapping works like the disk cache: There is no
requirement that a certain amount of physical memory needs to be
available, it just takes what it can get. If there are not a lot of
physical memory, it will require a lot of storage access, but it will
not over-allocate swap space.

That was also my impression, but during the work I experienced some 
problems around swap space. I do not remember exactly what I saw, 
nor how I concluded that everything in memory-mapped files actually has 
to fit in physical mem + swap. I might very well have been wrong in that 
conclusion.

It seems that different setups vary quite a lot in this area and some
systems are prone to aggressive use of the swap file, which can severely
harm responsiveness of applications with out-swapped data.

However, this should still not result in any OOM's, as the system can
always discard some of the memory mapped data if it needs more physical
memory.

I saw no OOMs

- Toke Eskildsen, State and University Library, Denmark





Re: Storing/indexing speed drops quickly

2013-09-13 Thread Per Steffensen

On 9/12/13 4:26 PM, Shawn Heisey wrote:

On 9/12/2013 2:14 AM, Per Steffensen wrote:

Starting from an empty collection. Things are fine wrt
storing/indexing speed for the first two-three hours (100M docs per
hour), then speed goes down dramatically, to an, for us, unacceptable
level (max 10M per hour). At the same time as speed goes down, we see
that I/O wait increases dramatically. I am not 100% sure, but quick
investigation has shown that this is due to almost constant merging.

While constant merging is contributing to the slowdown, I would guess
that your index is simply too big for the amount of RAM that you have.
Let's ignore for a minute that you're distributed and just concentrate
on one machine.

After three hours of indexing, you have nearly 300 million documents.
If you have a replicationFactor of 1, that's still 50 million documents
per machine.  If your replicationFactor is 2, you've got 100 million
documents per machine.  Let's focus on the smaller number for a minute.
replicationFactor is 1, so that is about 50 million docs per machine at 
this point


50 million documents in an index, even if they are small documents, is
probably going to result in an index size of at least 20GB, and quite
possibly larger.  In order to make Solr function with that many
documents, I would guess that you have a heap that's at least 4GB in size.

Currently I have a 2.5GB heap on the 8GB machine - to leave something for 
the OS cache.


With only 8GB on the machine, this doesn't leave much RAM for the OS
disk cache.  If we assume that you have 4GB left for caching, then I
would expect to see problems about the time your per-machine indexes hit
15GB in size.  If you are making it beyond that with a total of 300
million documents, then I am impressed.

Two things are going to happen when you have enough documents:  1) You
are going to fill up your Java heap and Java will need to do frequent
collections to free up enough RAM for normal operation.  When this
problem gets bad enough, the frequent collections will be *full* GCs,
which are REALLY slow.
What is it that will fill my heap? I am trying to avoid the FieldCache. 
For now, I am actually not doing any searches - focus on indexing for 
now - and certainly not group/facet/sort searches that will use the 
FieldCache.

   2) The index will be so big that the OS disk
cache cannot effectively cache it.  I suspect that the latter is more of
the problem, but both might be happening at nearly the same time.




When dealing with an index of this size, you want as much RAM as you can
possibly afford.  I don't think I would try what you are doing without
at least 64GB per machine, and I would probably use at least an 8GB heap
on each one, quite possibly larger.  With a heap that large, extreme GC
tuning becomes a necessity.
More RAM will probably help, but only for a while. I want billions of 
documents in my collections - and also on each machine. Currently we are 
aiming at 15 billion documents per month (500 million per day) and keeping at 
least two years of data in the system. Currently we use one collection 
for each month, so when the system has been running for two years it 
will be 24 collections with 15 billion documents each. Indexing will 
only go on in the collection corresponding to the "current" month, but 
searching will (potentially) be across all 24 collections. The documents 
are very small. I know that 6 machines will not do in the long run - 
currently this is only testing - but number of machines should not be 
higher than about 20-40. In general it is a problem if Solr/Lucene does 
not perform fairly well when data does not fit in RAM - then it cannot really 
be used for "big data". I would have to buy hundreds or even thousands 
of machines with 64GB+ RAM. That is not realistic.


To cut down on the amount of merging, I go with a fairly large
mergeFactor, but mergeFactor is basically deprecated for
TieredMergePolicy, there's a new way to configure it now.  Here's the
indexConfig settings that I use on my dev server:

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicy>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxThreadCount">1</int>
    <int name="maxMergeCount">6</int>
  </mergeScheduler>
  <ramBufferSizeMB>48</ramBufferSizeMB>
  <infoStream file="INFOSTREAM.txt">false</infoStream>
</indexConfig>


Thanks,
Shawn



Thanks!


Re: Storing/indexing speed drops quickly

2013-09-23 Thread Per Steffensen
Now running the tests on a slightly reduced setup (2 machines, quadcore, 
8GB ram ...), but that doesn't matter


We see that storing/indexing speed drops when using 
IndexWriter.updateDocument in DirectUpdateHandler2.addDoc. But it does 
not drop when just using IndexWriter.addDocument (update-requests with 
overwrite=false)
Using addDocument: 
https://dl.dropboxusercontent.com/u/25718039/AddDocument_2Solr8GB_DocCount.png
Using updateDocument: 
https://dl.dropboxusercontent.com/u/25718039/UpdateDocument_2Solr8GB_DocCount.png
We are not too happy about having to use addDocument, because that 
allows for duplicates, and we would really want to avoid that (on 
Solr/Lucene level)
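
For completeness, this is roughly how an overwrite=false update can be sent from 
SolrJ (a sketch; "overwrite" is the standard update parameter, everything else is 
made up). With overwrite=false Solr skips the uniqueKey handling that normally 
precedes the add, which is what ends up as IndexWriter.addDocument instead of 
updateDocument - and which is why duplicates become possible:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class AddWithoutOverwrite {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("field1", "a");

    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    req.setParam("overwrite", "false");  // no delete-by-id for the uniqueKey first
    req.process(solr);
  }
}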


We have confirmed that doubling the amount of total RAM doubles the 
number of documents in the index at which the indexing speed starts 
dropping (when we use updateDocument)
On 
https://dl.dropboxusercontent.com/u/25718039/UpdateDocument_2Solr8GB_DocCount.png 
you can see that the speed drops at around 120M documents. Running the 
same test, but with Solr machine having 16GB RAM (instead of 8GB) the 
speed drops at around 240M documents.


Any comments on why indexing speed drops with IndexWriter.updateDocument 
but not with IndexWriter.addDocument?


Regards, Per Steffensen

On 9/12/13 10:14 AM, Per Steffensen wrote:

Seems like the attachments didnt make it through to this mailing list

https://dl.dropboxusercontent.com/u/25718039/doccount.png
https://dl.dropboxusercontent.com/u/25718039/iowait.png


On 9/12/13 8:25 AM, Per Steffensen wrote:

Hi

SolrCloud 4.0: 6 machines, quadcore, 8GB ram, 1T disk, one Solr-node 
on each, one collection across the 6 nodes, 4 shards per node
Storing/indexing from 100 threads on external machines, each thread 
one doc at the time, full speed (they always have a new doc to 
store/index)

See attached images
* iowait.png: Measured I/O wait on the Solr machines
* doccount.png: Measured number of doc in Solr collection

Starting from an empty collection. Things are fine wrt 
storing/indexing speed for the first two-three hours (100M docs per 
hour), then speed goes down dramatically, to an, for us, unacceptable 
level (max 10M per hour). At the same time as speed goes down, we see 
that I/O wait increases dramatically. I am not 100% sure, but quick 
investigation has shown that this is due to almost constant merging.


What to do about this problem?
Know that you can play around with mergeFactor and commit-rate, but 
earlier tests shows that this really do not seem to do the job - it 
might postpone the time where the problem occurs, but basically it is 
just a matter of time before merging exhaust the system.
Is there a way to totally avoid merging, and keep indexing speed at a 
high level, while still making sure that searches will perform fairly 
well when data-amounts become big? (guess without merging you will 
end up with lots and lots of "small" files, and I guess this is not 
good for search response-time)


Regards, Per Steffensen







Sort-field for ALL docs in FieldCache for sort queries -> OOM on lots of docs

2013-03-21 Thread Per Steffensen

Hi

We have a lot of docs in Solr. Each particular Solr-node handles a lot 
of docs distributed among several replicas. When you issue a sort query, 
it seems to me that the value of the sort-field of ALL docs under the 
Solr-node is added to the FieldCache. This leads to OOM-exceptions at 
some point when you have enough docs under the Solr-node - relative to 
its Xmx, of course. Are there any "tricks" to get around this issue, so 
that a sort-query will never trigger an OOM, no matter how many docs are 
handled by a particular Solr-node? Of course you need to be ready to 
accept the penalty of more disk-IO as soon as the entire thing does not 
fit in memory, but I would rather accept that than accept OOMs.


Regards, Per Steffensen


Re: Sort-field for ALL docs in FieldCache for sort queries -> OOM on lots of docs

2013-03-21 Thread Per Steffensen

On 3/21/13 9:48 AM, Toke Eskildsen wrote:

On Thu, 2013-03-21 at 09:13 +0100, Per Steffensen wrote:

We have a lot of docs in Solr. Each particular Solr-node handles a lot
of docs distributed among several replica. When you issue a sort query,
it seems to me that, the value of the sort-field of ALL docs under the
Solr-node is added to the FieldCache. [...]

I haven't used it yet, but DocValues in Solr 4.2 seems to be the answer.

- Toke Eskildsen


Thanks Toke! Can you please elaborate a little bit? How do I use it? What 
is it supposed to do for you?


Regards, Per Steffensen


Re: Sort-field for ALL docs in FieldCache for sort queries -> OOM on lots of docs

2013-03-21 Thread Per Steffensen

On 3/21/13 10:52 AM, Toke Eskildsen wrote:

On Thu, 2013-03-21 at 09:57 +0100, Per Steffensen wrote:

Thanks Toke! Can you please elaborate a little bit? How to use it? What
it is supposed to do for you?

Sorry, no, I only know about it on the abstract level. The release notes
for Solr 4.2 says

* DocValues have been integrated into Solr. DocValues can be loaded up a
lot faster than the field cache and can also use different compression
algorithms as well as in RAM or on Disk representations. Faceting,
sorting, and function queries all get to benefit. How about the OS
handling faceting and sorting caches off heap? No more tuning 60
gigabyte heaps? How about a snappy new per segment DocValues faceting
method? Improved numeric faceting? Sweet.

Spending 5 minutes searching on how to activate the new powers did not
get me much; my Google-fu is clearly not strong enough. The example
schema shows that docValues="true" is a valid attribute for "StrField,
UUIDField and all Trie*Fields", but I do not know if they are used
automatically by sort or if they should be requested explicitly.

Regards,
Toke Eskildsen



Thanks again, Toke!

Can anyone else elaborate? How to "activate" it? How to make sure, for 
sorting, that the sort-field values for all docs are not read into memory - 
leading to OOM when you have a lot of docs? Can this feature 
be activated on top of an existing 4.0 index, or do you have to re-index 
everything?


Thanks a lot for any feedback!

Regards, Per Steffensen


Re: Sort-field for ALL docs in FieldCache for sort queries -> OOM on lots of docs

2013-03-22 Thread Per Steffensen

On 3/21/13 10:50 PM, Shawn Heisey wrote:

On 3/21/2013 4:05 AM, Per Steffensen wrote:

Can anyone else elaborate? How to "activate" it? How to make sure, for
sorting, that sort-field-value for all docs are not read into memory for
sorting - leading to OOM when you have a lot of docs? Can this feature
be activated on top of an existing 4.0 index, or do you have to re-index
everything?


There is one requirement that may not be obvious - every document must 
have a value in the field, so you must either make the field 
required or give it a default value in the schema.  Solr 4.2 will 
refuse to start the core if this requirement is not met.

That is not a problem for us. The field exists on every document.
The example schema hints that the value might need to be 
single-valued.  I have not tested this.  Sorting is already 
problematic on multi-valued fields, so I assume that this won't be the 
case for you.

That is not a problem for us either. The field is single-valued.


To use docValues, add docValues="true" and then either set 
required="true" or default="" on the field definition in 
schema.xml, restart Solr or reload the core, and reindex.  Your index 
will get bigger.

So the answer to "...or do you have to re-index everything?" is yes!?


If the touted behavior of handling the sort mechanism in OS disk cache 
memory (or just reading the disk if there's not enough memory) rather 
than heap is correct, then it should solve your issues.  I hope it does!
Me too. I will find out soon - I hope! But re-indexing is kind of a 
problem for us; we will figure it out, though.
Any "guide to re-indexing all your stuff" anywhere, so I do it the easiest 
way? Guess maybe there are some nice tricks about streaming data directly 
from one Solr running the old index into a new Solr running the new 
index, and then discarding the old index afterwards?
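
There is no official guide in this thread, but a very naive streaming sketch in 
SolrJ could look like the one below (host names made up). Two caveats: only 
stored fields can be copied this way, and plain start/rows paging gets slow deep 
into a big index, so this is only meant to show the shape of it:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class ReindexSketch {
  public static void main(String[] args) throws Exception {
    HttpSolrServer oldSolr = new HttpSolrServer("http://oldhost:8983/solr/collection1");
    HttpSolrServer newSolr = new HttpSolrServer("http://newhost:8983/solr/collection1");

    int rows = 1000;
    int start = 0;
    while (true) {
      SolrQuery q = new SolrQuery("*:*");
      q.set("sort", "id asc");          // stable order while paging through the old index
      q.setStart(start);
      q.setRows(rows);
      SolrDocumentList page = oldSolr.query(q).getResults();
      if (page.isEmpty()) break;

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (SolrDocument d : page) {
        SolrInputDocument in = new SolrInputDocument();
        for (String f : d.getFieldNames()) {
          if ("_version_".equals(f)) continue;   // let the new index assign versions
          in.addField(f, d.getFieldValue(f));
        }
        batch.add(in);
      }
      newSolr.add(batch);
      start += page.size();
    }
    newSolr.commit();
  }
}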


Thanks,
Shawn



Thanks a lot, Shawn!

Regards, Per Steffensen


Re: AW: AW: java.lang.OutOfMemoryError: Map failed

2013-04-02 Thread Per Steffensen
I have seen the exact same thing on Ubuntu Server 12.04. Adding some 
swap space helped, but I do not understand why this is necessary, since the OS 
ought to just use the actual memory-mapped files if there is no room in 
(virtual) memory, swapping pages in and out on demand. Note that I saw 
this for memory-mapped files opened for read+write - not in the exact 
same context as you see it, where MMapDirectory is trying to map memory 
mapped files.


If you find a solution/explanation, please post it here. I really want 
to know more about why FileChannel.map can cause OOM. I do not think the 
OOM is a "real" OOM indicating no more space on the Java heap, but more 
an exception saying that the OS has no more memory (in some interpretation 
of that).


Regards, Per Steffensen

On 4/2/13 11:32 AM, Arkadi Colson wrote:

It is running as root:

root@solr01-dcg:~# ps aux | grep tom
root  1809 10.2 67.5 49460420 6931232 ?Sl   Mar28 706:29 
/usr/bin/java 
-Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties 
-server -Xms2048m -Xmx6144m -XX:PermSize=64m -XX:MaxPermSize=128m 
-XX:+UseG1GC -verbose:gc -Xloggc:/solr/tomcat-logs/gc.log 
-XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Duser.timezone=UTC 
-Dfile.encoding=UTF8 -Dsolr.solr.home=/opt/solr/ -Dport=8983 
-Dcollection.configName=smsc -DzkClientTimeout=2 
-DzkHost=solr01-dcg.intnet.smartbit.be:2181,solr01-gs.intnet.smartbit.be:2181,solr02-dcg.intnet.smartbit.be:2181,solr02-gs.intnet.smartbit.be:2181,solr03-dcg.intnet.smartbit.be:2181,solr03-gs.intnet.smartbit.be:2181 
-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port= 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false 
-Djava.endorsed.dirs=/usr/local/tomcat/endorsed -classpath 
/usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar 
-Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat 
-Djava.io.tmpdir=/usr/local/tomcat/temp 
org.apache.catalina.startup.Bootstrap start


Arkadi

On 04/02/2013 11:29 AM, André Widhani wrote:

The output is from the root user. Are you running Solr as root?

If not, please try again using the operating system user that runs Solr.

André

From: Arkadi Colson [ark...@smartbit.be]
Sent: Tuesday, 2 April 2013 11:26
To: solr-user@lucene.apache.org
Cc: André Widhani
Subject: Re: AW: java.lang.OutOfMemoryError: Map failed

Hmmm I checked it and it seems to be ok:

root@solr01-dcg:~# ulimit -v
unlimited

Any other tips or do you need more debug info?

BR

On 04/02/2013 11:15 AM, André Widhani wrote:

Hi Arkadi,

this error usually indicates that virtual memory is not sufficient 
(should be "unlimited").


Please see 
http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/69168


Regards,
André


From: Arkadi Colson [ark...@smartbit.be]
Sent: Tuesday, 2 April 2013 10:24
To: solr-user@lucene.apache.org
Subject: java.lang.OutOfMemoryError: Map failed

Hi

Recently Solr crashed. I've found this in the error log.
My commit settings are looking like this:

<autoCommit>
  <maxTime>1</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>2000</maxTime>
</autoSoftCommit>

The machine has 10GB of memory. Tomcat is running with -Xms2048m 
-Xmx6144m


Versions
Solr: 4.2
Tomcat: 7.0.33
Java: 1.7

Anybody any idea?

Thx!

Arkadi

SEVERE: auto commit error...:org.apache.solr.common.SolrException: 
Error

opening new searcher
   at
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1415)
   at 
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1527)

   at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:562) 

   at 
org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)

   at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
   at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) 


   at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) 


   at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 


   at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 


   at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.IOException: Map failed
   at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
   at
org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
   at
org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:228) 


   at
org.apache.lucene.store.MMapDirectory.openInp

Re: Solr Collection's Size

2013-04-10 Thread Per Steffensen
"number of documents found" can be found in a field called "numFound" in 
the response.


If you do use SolrJ you will likely have a QueryResponse qr and can just 
do a qr.setNumFound().


If you do not use SolrJ, try adding e.g. wt=json to your search query 
to get the response in JSON. Find the numFound field in the readable 
JSON response - it should be at "response.numFound". If you are in javascript 
with jQuery, something like this should work:

$.getJSON(search_url,
  function(data) {
... data.response.numFound ...
  }
)
Go figure out how to extract it in javascript without jQuery

Regards, Per Steffensen

On 4/5/13 3:20 PM, Alexandre Rafalovitch wrote:

I'd add rows=0, just to avoid the actual records serialization if size is
all that matters.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Apr 5, 2013 at 8:26 AM, Jack Krupansky wrote:


Query for "*:*" and look at the number of documents found.

-- Jack Krupansky

-Original Message- From: Ranjith Venkatesan
Sent: Friday, April 05, 2013 2:06 AM
To: solr-user@lucene.apache.org
Subject: Solr Collection's Size


Hi,

I am new to solr. I want to find size of collection dynamically via solrj.
I
tried many ways but i couldnt succeed in any of those. Pls help me with
this
issue.





Re: Solr Collection's Size

2013-04-10 Thread Per Steffensen

On 4/10/13 12:17 PM, Per Steffensen wrote:
"number of documents found" can be found in a field called "numFound" 
in the response.


If you do use SolrJ you will likely have a QueryResponse qr and can 
just do a qr.setNumFound().

qr.getResults().getNumFound() :-)
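
Putting the pieces of this thread together, a complete SolrJ sketch (server URL 
assumed) would be:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CollectionSize {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                    // rows=0: we only want the count, not the documents
    QueryResponse qr = solr.query(q);
    System.out.println("numFound: " + qr.getResults().getNumFound());
  }
}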


If you use do not use SolrJ try to add e.g. wt=json to your search 
query to get the response in JSON. Find the numFound field in the 
readable JSON response - it should be at "response.numFound". If in 
javascript with jQuery something like this should work:

$.getJSON(search_url,
  function(data) {
... data.response.numFound ...
  }
)
Go figure who to extract it in javascript without jQuery

Regards, Per Steffensen

On 4/5/13 3:20 PM, Alexandre Rafalovitch wrote:
I'd add rows=0, just to avoid the actual records serialization if 
size is

all that matters.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD 
book)



On Fri, Apr 5, 2013 at 8:26 AM, Jack Krupansky 
wrote:



Query for "*:*" and look at the number of documents found.

-- Jack Krupansky

-Original Message- From: Ranjith Venkatesan
Sent: Friday, April 05, 2013 2:06 AM
To: solr-user@lucene.apache.org
Subject: Solr Collection's Size


Hi,

I am new to solr. I want to find size of collection dynamically via 
solrj.

I
tried many ways but i couldnt succeed in any of those. Pls help me with
this
issue.








SolrCloud and replication

2011-12-05 Thread Per Steffensen

Hi

I have been working with ElasticSearch for a while now, and find it very 
cool. Unfortunately we are no longer allowed to use ElasticSearch in our 
project. Therefore we are looking for alternatives - Solr(Cloud) is an 
option.


I have been looking at SolrCloud and worked through the "examples" on 
http://wiki.apache.org/solr/SolrCloud. I realized that in "Example B: 
Simple two shard cluster with shard replicas" you really need an "open" 
:-) definition of replica to claim that the shard running at 8900 is a 
replica of the one running at 8983. If you index documents into one of 
them it is never replicated to the other one. So I guess the only 
"connection" between the shards running on 8983 and 8900 is that they 
agree that they are running the same logical shard "shard1", and that 
any of them can be queried when you want results from shard "shard1". 
But results will be different depending on which instance of the shard 
you get when you are querying, as soon as you start indexing documents into 
one of them AFTER the "cp -r example exampleB".


In order to get actual replication I guess I need to turn my eyes to 
http://wiki.apache.org/solr/SolrReplication, but reading that page leaves 
me very much in doubt about what to do and what not to do now that I am 
using SolrCloud. It is all based on replicating config-files around, but my 
impression is that SolrCloud takes another approach to configs, namely 
that they are kept in ZK. Could you please elaborate on how to use 
"real" replication as described on 
http://wiki.apache.org/solr/SolrReplication in coexistence with 
SolrCloud as described on http://wiki.apache.org/solr/SolrCloud. It 
would be nice if the Wiki pages were updated with some kind of 
explanation, but a reply to this mailing-list posting will also do.


Thanks!

Regards, Per Steffensen


Continuous update on progress of "New SolrCloud Design" work

2011-12-05 Thread Per Steffensen

Hi

My guess is that the work for achieving 
http://wiki.apache.org/solr/NewSolrCloudDesign has begun on branch 
"solrcloud". It is hard to follow what is going on and how to use what 
has been achieved - you cannot follow the examples on 
http://wiki.apache.org/solr/SolrCloud anymore (e.g. there is no 
shard="shard1" in solr/example/solr/solr.xml anymore). Will it be 
possible to maintain a how-to-use section on 
http://wiki.apache.org/solr/NewSolrCloudDesign with examples, e.g. like 
the ones on http://wiki.apache.org/solr/SolrCloud, on how to use it, that 
"at any time" reflects how to use what is on the HEAD of the "solrcloud" branch?


In my project we are about to start using something other than 
ElasticSearch, and SolrCloud is an option, but there is a lot to be done 
in Solr(Cloud) before it is even comparable with ElasticSearch wrt 
features. If we choose to go for SolrCloud we would like to participate 
in the development of the new SolrCloud, and add features corresponding 
to stuff that we used to use in ElasticSearch, but it is very hard to 
contribute to SolrCloud if it is "black box" work (that only a few persons 
know about) going on on the "solrcloud" branch, getting us from 
http://wiki.apache.org/solr/SolrCloud to 
http://wiki.apache.org/solr/NewSolrCloudDesign.


Regards, Per Steffensen


Re: SolrCloud and replication

2011-12-05 Thread Per Steffensen

Tomás Fernández Löbbe wrote:

Hi, AFAIK SolrCloud still doesn't support replication, that's why in the
example you have to copy the directory manually. Replication has to be
implemented by using the SolrReplication as you mentioned or use some kind
of distributed indexing (you'll have to do it yourself).
Well, I could do it myself, but guess it would be nice to use the 
"built-in" feature now that it is there.

 SolrReplication
stuff is there to replicate the index and CAN replicate configuration
files, this doesn't mean that you HAVE to replicate config files.

No, I figured that out.

 Do you
need to replicate configuration files too or just the index?
  
Guess that is the whole point. Guess that I do not have to replicate 
configuration files, since SolrCloud (AFAIK) does not use local 
configuration files but information in ZK. And then it gets a little hard 
to guess how to do it, since the explanation on 
http://wiki.apache.org/solr/SolrReplication talks about different 
configs for master and slave. How to achieve that when the config is 
shared (in ZK)?

Tomás

  




Replication not done "for real" on commit?

2011-12-05 Thread Per Steffensen

Hi

Reading http://wiki.apache.org/solr/SolrReplication I notice the 
"pollInterval" (guess it should have been "pullInterval") on the slaves. 
That indicates to me that indexed information is not really "pushed" from 
master to slave(s) on events defined by "replicateAfter" (e.g. commit), 
but that it only will be made available for pulling by the slaves at 
those events. So even though I run with a master with 
"replicateAfter=commit", I am not sure that I will be able to query a 
document that I have just indexed from one of the slaves immediately 
after having done the indexing on the master - I will have to wait 
"pollInterval" (+ time for replication). Can anyone confirm that this is 
a correct interpretation, or explain how to understand "pollInterval" if 
it is not?


I want to achieve this always-in-sync property between master and slaves 
(primary and replica if you like). What is the easiest way? Will I just 
have to make sure myself that indexing goes on directly on all "replica" 
of a shard, and then drop using the replication explained on 
http://wiki.apache.org/solr/SolrReplication?


Regards, Per Steffensen


Re: SolrCloud and replication

2011-12-05 Thread Per Steffensen

Thanks for answering

Mark Miller wrote:

Guess that is the whole point. Guess that I do not have to replicate
  

configuration files, since SolrCloud (AFAIK) does not use local
configuration files but information in ZK. And then it gets a little hard to
guess how to do it, since the explanation on
http://wiki.apache.org/solr/SolrReplication talks
about different configs for master and slave. How to achieve that when the
config is shared (in ZK).




You can use System properties to allow a config to be used both for a slave and
a master. You could also just not configure as a slave or master and try
doing one-off snap pulls with a cron job or something.
  

Thanks

We have not hooked up SolrCloud with replication because the new SolrCloud
work happening on the solrcloud branch will only use it for recovery.
Rather than replicate, added documents will simply be forwarded to all
replicas so you can also use Near Realtime and to provide better
consistency.
  

Sounds like the right solution to me!


  




Re: Continuous update on progress of "New SolrCloud Design" work

2011-12-06 Thread Per Steffensen

Yonik Seeley wrote:

On Mon, Dec 5, 2011 at 6:23 AM, Per Steffensen  wrote:
  

Will it be possible to maintain a how-to-use section on 
http://wiki.apache.org/solr/NewSolrCloudDesign with examples, e.g. like to ones 
on http://wiki.apache.org/solr/SolrCloud,



Yep, it was on my near-term todo list to put up a quick developers
guide on how to get started quickly, and a little of how it works
under the covers.
Nothing too polished of course since it's rapidly evolving.
  

Yes, of course. Sounds nice.
  

If we choose
to go for SolrCloud we would like to participate in the development of the
new SolrCloud



Great!

-Yonik
http://www.lucidimagination.com

  




Re: Continuous update on progress of "New SolrCloud Design" work

2011-12-06 Thread Per Steffensen

Andy wrote:

Hi,

  

add features corresponding to stuff that we used to use in ElasticSearch



Does that mean you have used ElasticSearch but decided to try SolrCloud instead?
  
Yes, or at least we are looking for alternatives right now. Considering 
Solandra, SolrCloud, Katta, Riak Search, OrientDB, Lily etc. etc. etc.

I'm also looking at a distributed solution. ElasticSearch just seems much 
further along than SolrCloud. So I'd be interested to hear about any particular 
reasons you decided to pick SolrCloud instead of ElasticSearch.
  
I agree that ES is much further along than SolrCloud (and the other 
alternatives for that matter). I would like to stay with ES for my 
project, but it's a political decision (you know product owners :-) ) not 
to stay with ES. Nothing technical. I'm afraid that I cannot say more 
about why we are not staying with ES. But basically I would also go for 
ElasticSearch if I were you - at least for now, until SolrCloud gets 
further wrt implementation. But I believe that the "intentions" on where 
to go with SolrCloud listed on the Wiki sound great, so SolrCloud might 
eventually catch up with ES.

Andy



____
 From: Per Steffensen 
To: solr-user@lucene.apache.org 
Sent: Monday, December 5, 2011 6:23 AM

Subject: Continuous update on progress of "New SolrCloud Design" work
 
Hi


My guess is that the work for acheiving http://wiki.apache.org/solr/NewSolrCloudDesign has begun on branch 
"solrcloud". It is hard to follow what is going on and how to use what has been acheived - you cannot follow 
the examples on http://wiki.apache.org/solr/SolrCloud anymore (e.g. there is no shard="shard1" in 
solr/example/solr/solr.xml anymore). Will it be possible to maintain a how-to-use section on 
http://wiki.apache.org/solr/NewSolrCloudDesign with examples, e.g. like to ones on 
http://wiki.apache.org/solr/SolrCloud, on how to use it, that "at any time" reflects how to use whats on the 
HEAD of "solrcloud" branch?

In my project we are about to start using something else that ElasticSearch, and SolrCloud is an 
option, but there is a lot to be done in Solr(Cloud) before it is even comparable with 
ElasticSearch wrt features. If we choose to go for SolrCloud we would like to participate in the 
development of the new SolrCloud, and add features corresponding to stuff that we used to use in 
ElasticSearch, but it is very hard to contribute to SolrCloud if it is "black box" (that 
only a few persons know about) work going on on branch "solrcloud" getting us from 
http://wiki.apache.org/solr/SolrCloud to http://wiki.apache.org/solr/NewSolrCloudDesign.

Regards, Per Steffensen
  




Commit and sessions

2012-01-27 Thread Per Steffensen

Hi

If I have added some documents to Solr, but not done an explicit commit yet, 
and I get a power outage, will I then lose data? Or, asked in another 
way, does data go into the persistent store before commit? How do I avoid 
the possibility of losing data?


Does Solr have some kind of session concept, so that different threads 
can add documents to the same Solr, and when one of them says "commit" 
it is only the documents added by this thread that get committed? Or is 
it always "all documents added by any thread since last commit" that 
gets committed?
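
For what it is worth, the calls in question look like this in SolrJ (a sketch 
with made-up names). As far as I know commit is an index-wide operation rather 
than a per-thread or per-session one, and durability of uncommitted adds depends 
on whether a transaction log (updateLog) is enabled in your Solr version:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddAndCommit {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    solr.add(doc);    // not yet searchable (and, without an updateLog, not yet durable)

    // commit() affects the whole index: it makes ALL uncommitted adds visible,
    // no matter which client or thread sent them - there is no per-thread session.
    solr.commit();
  }
}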


Regards, Per Steffensen


Re: Versioning

2012-12-10 Thread Per Steffensen
Depends on exactly what you mean by "versioning". But if you mean that 
every document in Solr gets a version-number which is increased every 
time the document is updated, all you need to do is to add a _version_ 
field in your schema: http://wiki.apache.org/solr/SolrCloud#Required_Config
Believe you will get optimistic locking out-of-the-box if you do this 
(you will also need the updateLog configured in solrconfig.xml). Or else 
you can take my patch for SOLR-3178 and have optimistic locking work as 
described on: 
http://wiki.apache.org/solr/Per%20Steffensen/Update%20semantics
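
A minimal SolrJ illustration of the optimistic locking described above (field 
names and URL are made up; for simplicity the current version is read with a 
normal query, which requires the document to already be committed - a real-time 
get would normally be used instead):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class OptimisticUpdate {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    // Read the current version of the document we want to update
    SolrDocument current = solr.query(new SolrQuery("id:doc-1")).getResults().get(0);
    Long version = (Long) current.getFieldValue("_version_");

    SolrInputDocument update = new SolrInputDocument();
    update.addField("id", "doc-1");
    update.addField("title", "new title");
    update.addField("_version_", version);   // only succeed if nobody updated it meanwhile

    try {
      solr.add(update);
    } catch (SolrException e) {
      if (e.code() == 409) {
        System.out.println("Version conflict - somebody else updated doc-1 first");
      } else {
        throw e;
      }
    }
  }
}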


Regards, Per Steffensen

Sushil jain wrote:

Hello Everyone,

I am a Solr beginner.

I just want to know if versioning of data is possible in Solr, if yes then
please share the procedure.

Thanks & Regards,
Sushil Jain

  




Re: Partial results returned

2012-12-11 Thread Per Steffensen

When you say "2 shards" do you mean "2 nodes each running a shard"?
Seems like you have a collection named "index" - how did you create this 
collection (containing two shards)?

How do you start your 2 nodes - exact command used?
You might want to attach the content of clusterstate.json from ZK. While 
the system is running, go to http://localhost:7500/solr/#/~cloud and see if 
there isn't a clusterstate.json to be found there somewhere - attach it.
Can't promise that I will remember to get back to you when you have 
answered the above questions, but I believe the information asked for 
above will also help others to help you. I will try to remember though.


Regards, Per Steffensen

adm1n wrote:

Hello,

I'm running solrcloud with 2 shards.

Lets assume I've 100 documents indexed in total, which are divided 55/45 by
the shards...
when I query, for example:
curl
'http://localhost:7500/solr/index/select?q=*:*l&wt=json&indent=true&rows=0'
sometimes I got "response":{"numFound":0, sometimes -
"response":{"numFound":45, "response":{"numFound":55 or
"response":{"numFound":100.

But when I run the query:
curl
'http://localhost:7500/solr/index/select?shards=localhost:7500/solr/index,localhost:7501/solr/index&q=*:*&wt=json&indent=true&rows=0'

it always returns the complete list of 100 documents.

Am I missing some configuration?

thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-results-returned-tp4026027.html
Sent from the Solr - User mailing list archive at Nabble.com.

  




Re: Partial results returned

2012-12-12 Thread Per Steffensen
In general you probably want to add a parameter "distrib=true" to your 
search requests.
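
In SolrJ terms that could look like this (just a sketch; the URL matches the 
startup commands quoted below):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class DistribQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:7500/solr/index");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);
    q.set("distrib", "true");   // ask this node to fan the query out to all shards
    System.out.println(solr.query(q).getResults().getNumFound());
  }
}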


adm1n wrote:

I have 1 collection called index.
I created it like explained here: http://wiki.apache.org/solr/SolrCloud in
Example A: Simple two shard cluster section
here are the start up commands:

1)java -Dbootstrap_confdir=./solr/index/conf -Dcollection.configName=myconf
-DzkRun -DnumShards=2 -jar start.jar -Djetty.port=7500 >
logs/solr_server_java.`date +"%Y%m%d"`.log 2>&1 &
2)java -Dbootstrap_confdir=./solr/index/conf -Dcollection.configName=myconf
-Djetty.port=7501 -DzkHost=localhost:8500 -jar start.jar >
logs/solr_server_java.`date +"%Y%m%d"`.log 2>&1 &
  
Well, no need to include bootstrap_confdir in the second startup. Adding 
this parameter will load the config pointed to into ZK under the name 
defined in collection.configName. This is a do-once task, and when 
you start jetty #2 there is no need to add it. If you were not running 
ZK as part of a Solr you shouldn't add the bootstrap_confdir either when 
restarting #1 and #2 later (because the config would be stored once and 
for all in ZK) - but since the ZK embedded in Solr is probably not using the 
same data-dir between startups, you will probably need to add 
bootstrap_confdir to the first Solr every time you start. 


How was your collection named "index" instead of "collection1"?

Did you also load documents as explained on 
http://wiki.apache.org/solr/SolrCloud, example A? Guess at least you 
needed to change "collection1" to "index" in the urls.


I believe all you did was executing a bunch of commands from the 
command-line. You might want to provide the full list of commands from 
beginning to end, so that others can repeat your situation/problem.


in my http://localhost:7500/solr/#/~cloud  There is only a chart of my
collection with the shards
  

It would probably also be nice to get this chart - could you attach it.
To get the clusterstate.json - click the "Tree" entry in the menu or 
just go directly to http://192.168.78.195:8983/solr/#/~cloud?view=tree. 
You should find the clusterstate.json there. Attach it.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-results-returned-tp4026027p4026076.html
Sent from the Solr - User mailing list archive at Nabble.com.

  




Re: SolrCloud breaks distributed query strings

2012-12-12 Thread Per Steffensen
It doesn't sound exactly like a problem we experienced some time ago, 
where long requests were mixed up during transport. Jetty was to blame. 
It might be Jetty that messes up your requests too? SOLR-4031. Are you still 
running 8.1.2?


Regards, Per Steffensen

Markus Jelsma wrote:

Hi,

We're starting to see issues on a test cluster where Solr breaks up query 
string parameters that are either defined in the request handler or are passed 
in the URL in the initial request.

In our request handler we have an SF parameter for edismax (SOLR-3925):

<str name="sf">
  title_general~2^4
  title_nl~2^4
  title_en~2^4
  title_de~2^4
</str>

Almost all queries pass without issue, but some fail because the parameter 
arrives in an incorrect format. I've logged several occurrences:

2012-12-12 12:01:12,159 ERROR [solr.core.SolrCore] - [http-8080-exec-23] - : org
.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Invalid a
rguments for sf, must be sf=FIELD~DISTANCE^BOOST, got 
title_general~2^4

title_nl~2^4
title_en~2^4
title_de~2
4

  
at org.apache.solr.handler.component.QueryComponent.prepare(QueryCompone

nt.java:154)


2012-12-12 12:00:57,164 ERROR [solr.core.SolrCore] - [http-8080-exec-1] - : org.
apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Invalid ar
guments for sf, must be sf=FIELD~DISTANCE^BOOST, got 
title_general~2^4

title_nl~2
4
title_en~2^4
title_de~2^4

  
at org.apache.solr.handler.component.QueryComponent.prepare(QueryCompone

nt.java:154)


2012-12-12 12:01:11,223 ERROR [solr.core.SolrCore] - [http-8080-exec-8] - : org.
apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: Invalid ar
guments for sf, must be sf=FIELD~DISTANCE^BOOST, got ^title_general~2^4
title_nl~2^4
title_en~2^4
title_de~2^4

  
at org.apache.solr.handler.component.QueryComponent.prepare(QueryCompone

nt.java:154)


This seems crazy! For some reason, sometimes, the parameter gets corrupted in 
some manner! We've also seen this with a function query in the edismax boost 
parameter where for some reason a comma is replaced by a newline:

2012-12-12 11:11:45,527 ERROR [solr.core.SolrCore] - [http-8080-exec-16] - : 
org.apache.solr.common.SolrException: org.apache.solr.search.SyntaxError: 
Expected ',' at position 55 in 
'if(exists(date),max(recip(ms(NOW/DAY,date),3.17e-8,143
.9),.8),.7)'
at 
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:154)
...
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.search.SyntaxError: Expected ',' at position 55 in 
'if(exists(date),max(recip(ms(NOW/DAY,date),3.17e-8,143
.9),.8),.7)'

Accompanying these errors are a number of AIOOBExceptions without stack trace 
and Spellchecker NPEs (SOLR-4049). I'm completely puzzled here because the 
queries get randomly mangled in some manner. The SF parameter seems to get 
mangled only by replacing ^ with a newline. The boost query seems to be mangled 
in the same way if it fails. Only about 6% of all queries fired to the cluster 
end in such an error.

We're also seeing strange facets returned where two constraints seem to appear 
in a single returned value for a field, completely messed up :)

2012-12-12 12:00:56,341 ERROR [handler.component.FacetComponent] - 
[http-8080-exec-11] - : Unexpected term returned for facet refining. key=host 
term='aandeanderekant.domain.ext^aanoukk.domain.ext'
request 
params=spellcheck=false&facet=true&sort=score+desc&tie=0.35&spellcheck.maxCollationTries=2&ps3=5&facet.limit=8&hl.simple.pre=%3Cem%3E&q.alt=*%3A*&distrib=true&facet.method=enum&hl=false&shards.tolerant=true&omitHeader=true&echoParams=none&fl=md_*+title_*+id+type+subcollection+host+cat+date+size+lang&ps2=10&hl.simple.post=%3C%2Fem%3E&spellcheck.count=1&qs=9&spellcheck.alternativeTermCount=1&hl.fragsize=192&mm=80%25&spellcheck.maxResultsForSuggest=12&facet.mincount=1&spellcheck.extendedResults=true&uf=-*&f.host.facet.method=fc&qf=%0Adomain_grams%5E3.7%0Adomain_idx%5E15.9%0Ahost_idx%5E2.8%0Aurl%5E3.64%0Acontent_general%5E1.6+title_general%5E6.4+h1_general%5E5.4+h2_general%5E2.3%0Acontent_nl%5E1.6+title_nl%5E6.4+h1_nl%5E5.4+h2_nl%5E2.3%0Acontent_en%5E1.6+title_en%5E6.4+h1_en%5E5.4+h2_en%5E2.3%0Acontent_de%5E1.6+title_de%5E6.4+h1_de%5E5.4+h2_d
 
e%5E2.3%0A%0A++&sf=%0Atitle_general%7E2%5E4%0Atitle_nl%7E2%5E4%0Atitle_en%7E2%5E4%0Atitle_de%7E2%5E4%0A%0A++&hl.fl=content_*&json.nl=map&am

Re: Solrj connect to already running solr server

2012-12-14 Thread Per Steffensen

Billy Newman skrev:

I have deployed the solr.war to my application server.  On deploy I
can see the solr server and my core "general" start up.

I have a timer that fires every so often to go out and 'crawl' some
services and index into Solr.  I am using Solrj in my application and
I am having trouble understanding the best way to get data in my
already running Solr Server.

I am trying to use the EmbeddedSolrServer to connect to the "general"
core that is already running:

String solrHomeProperty = System.getProperty("solr.solr.home");
File solrHome = new File(solrHomeProperty);
CoreContainer coreContainer = new CoreContainer(solrHomeProperty);
coreContainer.load(solrHomeProperty, new File(solrHome, "solr.xml"));
SolrServer solrServer = new EmbeddedSolrServer(coreContainer, "general");

The problem here is that I think this is trying to start a new Solr
server, and it collides with the Solr server that is already running,
as I get the following exception:

SEVERE [org.apache.solr.core.CoreContainer] Unable to create core: general
java.nio.channels.OverlappingFileLockException
...
  
You create a new CoreContainer and it will create and start the 
SolrCores described in your solr.xml. But they are already running. You 
don't want to create a new CoreContainer - look up the existing one 
instead, and use that for your EmbeddedSolrServer. The CoreContainer is 
available through SolrDispatchFilter.getCores(), so basically you need to 
get hold of the SolrDispatchFilter instance.


You do not say much about where/how your "other code" (the code with your 
timer job) runs. Is it in the same webapp as Solr (you might have hacked 
the web.xml of Solr), is it another webapp running side by side with the 
solr-webapp in the same webcontainer, or is it an EJB app? Depending on 
where/how your "other code" runs there are different ways to get hold of 
the SolrDispatchFilter. You can probably get it through JNDI, but in some 
containers this means that you need to do some JNDI name wiring between 
your apps. There are other "easier" ways depending on where/how your 
"other code" runs - I believe you want to google for things like 
ServletContext, getRequestDispatcher, getNamedDispatcher etc.
You might also have to consider the actual container you are using (for 
Solr it's Jetty out-of-the-box, but you might run on Tomcat or something 
else), even though I believe the specs allow you to do what you want - 
and if the spec dictates a way, all webcontainers (that want to be 
certified) have to support it.
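To make it a little more concrete: if your timer code runs in the same webapp 
as Solr, one option (only a sketch under that assumption - the filter subclass 
and the "my.solr.cores" attribute name are mine, not a standard Solr mechanism) 
is to publish the running CoreContainer from a subclass of SolrDispatchFilter 
registered in web.xml:

import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import org.apache.solr.servlet.SolrDispatchFilter;

public class CoresPublishingFilter extends SolrDispatchFilter {
  @Override
  public void init(FilterConfig config) throws ServletException {
    super.init(config); // let Solr start its cores as usual
    // expose the already-running CoreContainer to other code in the same webapp
    config.getServletContext().setAttribute("my.solr.cores", getCores());
  }
}

// elsewhere in the same webapp (e.g. in your timer job), given the ServletContext:
// CoreContainer cores = (CoreContainer) servletContext.getAttribute("my.solr.cores");
// SolrServer solr = new EmbeddedSolrServer(cores, "general");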


Hope it helps you. Else I might get time to help you in a little more 
concrete way. Have been teaching webapp and ejb-app stuff many years 
ago, so I might be able to dust off my knowledge about it in order to 
help you.



I know that I can also use HttpSolrServer, but I don't really want to
connect over HTTP when I am already in the application server.

What is the suggested way to connect to an already running Solr server
using Solrj? Am I using EmbeddedSolrServer wrong, or is it expected
behavior for it to try and start a new server?

Thanks in advance!!!

  




Re: Solrj connect to already running solr server

2012-12-14 Thread Per Steffensen

Per Steffensen skrev:

Billy Newman skrev:

I have deployed the solr.war to my application server.  On deploy I
can see the solr server and my core "general" start up.

I have a timer that fires every so ofter to go out and 'crawl' some
services and index into Solr.  I am using Solrj in my application and
I am having trouble understanding the best way to get data in my
already running Solr Server.

I am trying to use the EmbeddedSolrServer to connect to the "general"
core that is already running:

String solrHomeProperty = System.getProperty("solr.solr.home");
File solrHome = new File(solrHomeProperty);
CoreContainer coreContainer = new CoreContainer(solrHomeProperty);
coreContainer.load(solrHomeProperty, new File(solrHome, "solr.xml"));
SolrServer solrServer = new EmbeddedSolrServer(coreContainer, 
"general");


The problem here is that I think this is trying to start a new Solr
server, and it collides with the Solr server that is already running,
as I get the following exception:

SEVERE [org.apache.solr.core.CoreContainer] Unable to create core: 
general

java.nio.channles.OverlappingFileLockException
...
  
You create a new CoreContainer and it will create and start the 
SolrCores described in your solr.xml. But they are already running. 
You dont want to create a new CoreContainer but look up the existing 
one instead, and use that for your EmbeddedSolrServer. The 
CoreContainer is available through SolrDispatchFilter.getCores, so 
basically you need to get hold of the SolrDispatchFilter instance.


You do not say much about where/how your "other code" (the code with 
you timer job) runs. Is it in the same webapp as Solr (you might have 
hacked the web.xml of Solr), is it another webapp running side by side 
with the solr-webapp in the same webcontainer, is it a ejb-app or what 
is it. Depending on where/how your "other code" runs there are 
different ways to get hold of the SolrDispatchFilter. You can probably 
get it through JNDI but in some containers this means that you need to 
do some JNDI-name-wireing between your apps. There are other "easier" 
ways depending on where/how your "other code" runs - I believe you 
want to google for things like ServletContext, getRequestDispatcher, 
getNamedDispatcher etc.
You might also have to consider the actual container you are using 
(for Solr its Jetty out-of-the-box, but you might run on tomcat or 
something else), even though I believe the specs allow you to do what 
you want - and if the spec dictates a way, all webcontainers (that 
want to be certified) have to support it.


Hope it helps you. Else I might get time to help you in a little more 
concrete way. Have been teaching webapp and ejb-app stuff many years 
ago, so I might be able to dust off my knowledge about it in order to 
help you.
Chris mentions that EmbeddedSolrServer might not work even though you 
are able to get hold of the existing CoreContainer. I do not know much 
about EmbeddedSolrServer so I can't argue about that. I don't know the exact 
design goal for EmbeddedSolrServer, but you should be able to get hold 
of the CoreContainer.



I know that I can also use HttpSolrServer, but I don't really want to
connect Http when I am already in  the application server.

What is the suggested way to connect to an already running solr server
using Solrj.  Am I using EmbeddedSolrServer wrong, or is it expected
behavior for it to try and start a new server?

Thanks in advance!!!

  







Re: Solrcloud and Node.js

2012-12-15 Thread Per Steffensen
As Mark mentioned, Solr(Cloud) can be accessed through HTTP and returns 
e.g. JSON, which should be easy to handle in JavaScript. But the 
client-part (SolrJ) of Solr is not just a dumb client interface - it 
provides a lot of client-side functionality, e.g. some intelligent 
decision making based on ZK state. I would probably try to see if I 
could make SolrJ and in particular CloudSolrServer (yes, it's a client, 
even though the name does not indicate it) work. Maybe you will be 
successful using one of:
* https://github.com/nearinfinity/node-java to embed CloudSolrServer in 
node.js
* use GWT to compile CloudSolrServer to JavaScript (I would imagine it 
will be hard to make it work though)


Regards, Per Steffensen

Luis Cappa Banda skrev:

Hello!

I've always used Java as the backend language to program search modules,
and I know that CloudSolrServer implementation is the way to interact with
SolrCloud. However, I'm starting to love Node.js and I was wondering if
there exists the possibility to launch queries to a SolrCloud with the "old
fashioned" sharding syntax.

Thank you in advance!

Best regards.

  




Re: Solrcloud and Node.js

2012-12-15 Thread Per Steffensen

Luis Cappa Banda skrev:

Do you know if SolrCloud replica shards have 100% the same data as the
leader ones every time? Probably when synchronizing with leaders there
exists a delay, so executing queries to replicas won't be a good idea.
  
As long as the replica is in state "active" it will be 100% up to date 
with the leader - updates go to the leader, but it dispatches a similar 
request to the replica and does not respond (positively) to your 
update-request before it has successfully received positive answers from 
the replica (and of course also locally stored the update successfully). 
If the replica is in state "recovering" or "down" or something it is 
(potentially) not up to date with the leader.


Remember that even though updates are made on both leader and replica 
synchronously, they might not be available for (non-real-time) search on 
leader and replica at exactly the same time, if you do not also make 
sure to commit as part of your update. If you update a lot you probably do 
not want to commit every time. If you use (soft) auto-commit on the 
leader and replica it is possible that leader and replica do not 
respond equally to the same request at the same time - but the leader 
can just as well as the replica be the one that is "behind". If you use 
low values for (soft) auto-commit, in practice leader and replica will 
have the same documents available for search at any time.

Thank you very much in advance.

Best regards,


  




Re: Solrcloud and Node.js

2012-12-17 Thread Per Steffensen

Luis Cappa Banda skrev:

Thanks a lot, Per. Now I understand the whole scenario. One last question:
I've been searching trying to find some kind of request handler that
retrieves cluster status information, but no luck. I know that there exists
a JSON called clusterstate.json, but I don't know the way to get it in raw
JSON format.
If you want the clusterstate in raw JSON format, I believe there is 
currently no other way than to go fetch it yourself from ZK. Or maybe 
something in the admin-console under /zookeeper will help you.

 Do you know how to get its status? Any request handler or Solr
query? Maybe checking directly from Zookeeper?
  
Yes, if you want it in raw JSON format. If you want the "information" 
parsed as a Java object hierarchy you can access it through the 
ClusterState object. The best way to get a ClusterState (that keeps itself 
up to date with changing states) is probably to use the ZkStateReader:
   ZkStateReader zk = new ZkStateReader(zkHost, zkClientTimeout, zkConnectTimeout);
   zk.createClusterStateWatchersAndUpdate();
Then whenever you want an updated "picture" of the cluster state:
   zk.getClusterState();
You can also use a CloudSolrServer, which carries a ZkStateReader, if you 
are already using that one. But I guess not, since it didn't sound like 
you would try the node-java bridge to be able to use SolrJ stuff in node.js.
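Put together as a small self-contained sketch (assuming Solr/SolrJ 4.x jars on 
the classpath; the ZK address, timeouts and collection name are placeholders):

import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.ZkStateReader;

public class ClusterStateDemo {
  public static void main(String[] args) throws Exception {
    // ZK connect string, session timeout (ms), connect timeout (ms)
    ZkStateReader zk = new ZkStateReader("localhost:2181", 30000, 15000);
    try {
      zk.createClusterStateWatchersAndUpdate(); // fetch the initial state and keep watching ZK
      ClusterState state = zk.getClusterState();
      System.out.println("Live nodes: " + state.getLiveNodes());
      System.out.println("Slices of collection1: " + state.getSlices("collection1"));
    } finally {
      zk.close();
    }
  }
}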

Best regards,

- Luis Cappa.




Re: Solr Cloud 4.0 Production Ready?

2012-12-21 Thread Per Steffensen
I would start using Solr Cloud now (keep experimenting), so that you 
will have a chance to work on setting it up, creating and managing 
collections, using it from your clients etc. Solr Cloud is very different 
from simple standalone 3.5 Solr, so you have a lot to learn about how it 
works before you are ready to use it in production. You can start that 
now, but maybe you don't want to go into production with Solr Cloud 
before 4.1. It depends on your "requirements on quality" whether 4.0 is ready 
for production, but I believe the same will be true for 4.1. There will 
be a lot of fixes in 4.1, but I'm sure there will still be lots of 
unfixed "issues". The best way (as always) is to test whether your system 
works as it is supposed to when it is based on Solr 4.0.
I wouldn't start using Solr Cloud replication if you are running under 
high load - it has (or had a month or two ago) a lot of issues, and IMHO 
it is not ready for production usage - unless a lot happened within the 
last months, which I can't imagine. We have just released the first 
production version of our product based on 4.0 plus a bunch of our own 
fixes and improvements. This version of our product does not support 
replication, simply because we have seen too many issues with Solr Cloud 
replication - especially under high load. Replication is a goal for version 
1.x of our product and we are about to start up a phase where we will 
have a lot of focus on stabilizing Solr Cloud replication - hopefully we 
will succeed in collaboration with the rest of the Solr community, and 
hopefully Solr Cloud replication will be production ready within the 
next half year.


Regards, Per Steffensen

On 12/18/12 3:28 PM, Otis Gospodnetic wrote:

Hi,

If you are not in a rush, I'd wait for Solr 4.1.  Not that Solr 4.0 is not
usable, but Solr 4.1 will have a ton of fixes.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm/index.html
Search Analytics - http://sematext.com/search-analytics/index.html




On Tue, Dec 18, 2012 at 2:46 AM, Cool Techi  wrote:


Hi,

We have been using solr 3.5 in our production for sometime now and facing
the problems faced by a large solr index. We wanted to migrate to Solr
Cloud and have started some experimentation. But in the mean time also
following the user forum and seem to be noticing a lot of bugs which were
raised post the release and will be fixed in 4.1.

Should we wait for 4.1 release for production or we can go ahead with the
current release?

Regards,
Ayush







Re: Will SolrCloud always slice by ID hash?

2012-12-21 Thread Per Steffensen
Custom routing is a nice improvement for 4.1, but if I understand you 
correctly it is probably not what you want to use.


If I understand you correctly you want to make a collection with a 
number of slices - one slice for each day (or other period) - and then 
make a kind of "sliding window" where you create a new slice under this 
collection every day and delete the slice corresponding to "the oldest 
day". It is hard to create and delete slices under a particular 
collection. It is much easier to delete an entire collection. Therefore 
I suggest you make a collection for each day (or other period) and 
delete the collection corresponding to "the oldest day". We do that in our 
system based on 4.0. We are doing one collection per month though. There 
is a limit to how much you can put into a single slice/shard before it 
becomes slower to index/search - that is part of the reason for 
sharding. With a collection-per-day solution you also get the 
opportunity to put as many documents into a collection/day as you want - 
it is just a matter of slicing into enough slices/shards and throwing 
enough hardware at it. If you don't have a lot of data for each day, 
you can just have one or two slices/shards per day-collection.


We are running our Solr cluster across 10 machines (4 CPU cores / 4GB RAM 
each) and we are able to index over 1 billion documents (per month) into a 
collection with 40 shards (= 40 slices, because we are not using 
replication) - 4 shards on each Solr node in the cluster. We still do 
not know how the system will behave when we have, and cross-search, many 
collections (up to 24, since we are supposed to keep data for 2 years 
before we can throw it away) with 1+ billion documents each.


Regards, Per Steffensen

On 12/18/12 8:20 PM, Scott Stults wrote:

I'm going to be building a Solr cluster and I want to have a rolling set of
slices so that I can keep a fixed number of days in my collection. If I
send an update to a particular slice leader, will it always hash the unique
key and (probably) forward the doc to another leader?


Thank you,
Scott





Re: SolrCloud: only partial results returned

2012-12-21 Thread Per Steffensen
Are you using (soft) auto-commit or do you perform a manual commit after 
the documents have been indexed? You can index documents, but they won't 
be searchable before a (soft) commit has been performed. Even if you are 
running with (soft) auto-commit there is no guarantee that the 
documents are searchable before the configured auto-commit time period has 
passed since you indexed your last document.
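For illustration, an explicit commit from SolrJ looks roughly like this (a 
sketch; the URL and field values are placeholders):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitDemo {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    solr.add(doc);
    solr.commit(); // makes the document visible to searches right away
    solr.shutdown();
  }
}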


Regards, Per Steffensen

On 12/20/12 6:37 PM, Lili wrote:

Mark, yes, they have unique ids. Most of the time, after the 2nd JSON HTTP
post, the query will return complete results.

I believe the data was indexed already with the 1st post, since if I shut down
Solr after the 1st post and restart it again, the query will return the complete
result set.

Thanks,

Lili



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-only-partial-results-returned-tp4028200p4028367.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Dynamic collections in SolrCloud for log indexing

2012-12-24 Thread Per Steffensen
I believe it is a misunderstandig to use custom routing (or sharding as 
Erick calls it) for this kind of stuff. Custom routing is nice if you 
want to control which slice/shard under a collection a specific document 
goes to - mainly to be able to control that two (or more) documents are 
indexed on the same slice/shard, but also just to be able to control on 
which slice/shard a specific document is indexed. Knowing/controlling 
this kind of stuff can be used for a lot of nice purposes. But you dont 
want to move slices/shards around among collection or delete/add slices 
from/to a collection - unless its for elasticity reasons.


I think you should fill a collection every week/month and just keep 
those collections as is. Instead of ending up with a big "historic" 
collection containing many slices/shards/cores (one for each historic 
week/month), you will end up with many historic collections (one for 
each historic week/month). Searching historic data you will have to 
cross-search those historic collections, but that is no problem at all. 
If Solr Cloud is made at it is supposed to be made (and I believe it is) 
it shouldnt require more resouces or be harder in any way to 
cross-search X slices across many collections, than it is to 
cross-search X slices under the same collection.


Besides that see my answer for topic "Will SolrCloud always slice by ID 
hash?" a few days back.


Regards, Per Steffensen

On 12/24/12 1:07 AM, Erick Erickson wrote:

I think this is one of the primary use-cases for custom sharding. Solr 4.0
doesn't really lend itself to this scenario, but I _believe_ that the patch
for custom sharding has been committed...

That said, I'm not quite sure how you drop off the old shard if you don't
need to keep old data. I'd guess it's possible, but haven't implemented
anything like that myself.

FWIW,
Erick


On Fri, Dec 21, 2012 at 12:17 PM, Upayavira  wrote:


I'm working on a system for indexing logs. We're probably looking at
filling one core every month.

We'll maintain a short term index containing the last 7 days - that one
is easy to handle.

For the longer term stuff, we'd like to maintain a collection that will
query across all the historic data, but that means every month we need
to add another core to an existing collection, which as I understand it
in 4.0 is not possible.

How do people handle this sort of situation where you have rolling new
content arriving? I'm sure I've heard people using SolrCloud for this
sort of thing.

Given it is logs, distributed IDF has no real bearing.

Upayavira





Re: Solr 4.0 NRT Search

2013-01-02 Thread Per Steffensen

On 1/1/13 2:07 PM, hupadhyay wrote:

I was reading a solr wiki located at
http://wiki.apache.org/solr/NearRealtimeSearch

It says all commitWithin are now soft commits.

Can anyone explain what this means?
Soft commit means that the documents indexed before the soft commit will 
become searchable, but not necessarily persisted and flushed to disk (so 
you might lose data that has only been soft-committed (not 
hard-committed) in case of a crash).
Hard commit means that the documents indexed before the hard commit will 
become searchable and persisted and flushed to disk.

Does it mean commitWithin will not cause a hard commit?

Yes


Moreover that wiki itself is insufficient, as the feature is NRT.

Can anyone list the config steps to enable NRT in Solr 4.0?
In your solrconfig.xml (in the "updateHandler" section) make sure that you 
have "autoSoftCommit" and/or "autoCommit" (hard commit) not commented 
out, and that you have considered the values of maxDocs/maxTime. 
http://wiki.apache.org/solr/SolrConfigXml?#Update_Handler_Section
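For example, a minimal updateHandler setup could look like this (the time 
values are only illustrative - tune them for your indexing load and 
visibility requirements):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>           <!-- hard commit (persist to disk) at most every 60s -->
    <openSearcher>false</openSearcher> <!-- do not open a new searcher on hard commit -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>            <!-- soft commit (search visibility) at most every 1s -->
  </autoSoftCommit>
</updateHandler>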


Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-4-0-NRT-Search-tp4029928.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Max number of core in Solr multi-core

2013-01-02 Thread Per Steffensen
Furthermore, if you plan to index "a lot" of data per application, and 
you are using Solr 4.0.0+ (including Solr Cloud), you probably want to 
consider creating a collection per application instead of a core per 
application.


On 1/2/13 2:38 PM, Erick Erickson wrote:

This is a common approach to this problem, having separate
cores keeps the apps from influencing each other when it comes
to term frequencies & etc. It also keeps the chances of returning
the wrong data to a minimum.

As to how many cores can fit, "it depends" (tm). There's lots of
work going on right now, see: http://wiki.apache.org/solr/LotsOfCores.

But having all those cores does allow you to expand your system
pretty easily if you do run over the limit your hardware can handle, just
move the entire core to a new machine. Only testing will tell
you where that limit is.

Best
Erick


On Wed, Jan 2, 2013 at 7:18 AM, Parvin Gasimzade 
wrote:
Hi all,

We have a system that enables users to create applications and store data
on their application. We want to separate the index of each application. We
create a core for each application and search on the given application when
user make query. Since there isn't any relation between the applications,
this solution could perform better than the storing all index together.

I have two questions related to this.
1. Is this a good solution? If not could you please suggest any better
solution?
2. Is there a limit on the number of core that I can create on Solr? There
will be thousands maybe more application on the system.

P.S. This question is also asked on Stack Overflow:
http://stackoverflow.com/questions/14121624/max-number-of-core-in-solr-multi-core

Thanks,
Parvin





Re: Solr Collection API doesn't seem to be working

2013-01-03 Thread Per Steffensen
There are defaults for both replicationFactor and maxShardsPerNode, so 
neither of them HAS to be provided - the default is 1 in both cases.


  int repFactor = msgStrToInt(message, REPLICATION_FACTOR, 1);
  int maxShardsPerNode = msgStrToInt(message, MAX_SHARDS_PER_NODE, 1);

Remember that replicationFactor decides how many "instances" of your 
shard you will get, so a value of 1 does not provide you any replication.


On 1/3/13 3:46 AM, Yonik Seeley wrote:

On Wed, Jan 2, 2013 at 9:21 PM, davers  wrote:

So by providing the correct replicationFactor parameter for the number of
servers has fixed my issue.

So can you not provide a higher replicationFactor than you have live_nodes?
Yes, but you will end up with multiple replicas of the same shard running 
on the same node. It is kinda pointless (IMHO), but might be ok if you 
later want to "move" one of the replicas to a "new" node. I would prefer 
to not allow creating more replicas than you have nodes (because it is 
kinda pointless) and then just copy data when you, in the future, create 
a new replica of a shard on a new node. There is not a big difference 
between "moving a replica from one node to another" and "establishing a 
new copy of the shard on a new node from an old node". But I guess this 
has already been discussed, and the current implementation is that it is 
allowed to create more replicas than you have nodes.

What if you want to add more replicants to the collection in the future?

I advocated that replicationFactor / maxShardsPerNode only be a
target, not a requirement in
https://issues.apache.org/jira/browse/SOLR-4114
and I hope that's in what will be 4.1, but I haven't verified.

-Yonik
http://lucidworks.com





Re: Solr Collection API doesn't seem to be working

2013-01-03 Thread Per Steffensen

On 1/3/13 3:05 AM, davers wrote:

This is what I get from the leader overseer log:

2013-01-02 18:04:24,663 - INFO  [ProcessThread:-1:PrepRequestProcessor@419]
- Got user-level KeeperException when processing sessionid:0x23bfe1d4c280001
type:create cxid:0x58 zxid:0xfffe txntype:unknown reqpath:n/a
Error Path:/overseer Error:KeeperErrorCode = NodeExists for /overseer
I believe this is just a dumb error logged because we try to create the /overseer 
z-node without first checking whether it is already there. I believe it doesn't 
stop the job from being carried out. It is also only logged at INFO level.


Re: Solr Collection API doesn't seem to be working

2013-01-03 Thread Per Steffensen

On 1/3/13 2:50 AM, Mark Miller wrote:

Unfortunately, for 4.0, the collections API was pretty bare bones. You don't 
actually get back responses currently - you just pass off the create command to 
zk for the Overseer to pick up and execute.

So you actually have to check the logs of the Overseer to see what the problem 
may be. I'm working on making sure we address this for 4.1.

If you look at the admin UI, in the zk tree, you should be able to see what 
node is the overseer (look for its election node). The logs for that node 
should indicate the problem.

FYI, if I remember right, replication factor is not currently optional.

Actually I believe it is.


In the future, I'd like it so you can say like replicationFactor=max_int, and 
the overseer will periodically try to match that given the nodes it sees - but 
we don't have that yet.

Uh, but why?!

It would be nice if you could say replicationFactor=X where X is higher 
than your current number of nodes, and the overseer then periodically tries 
to see if it can honor your original request for replicationFactor X (it 
will be able to when you eventually have X nodes in your cluster).


But specifying a MAX_INT value is IMHO a bad idea. It requires double the 
resource usage to maintain double the number of replicas, so you don't want 
more replicas than necessary relative to your risk/HA-profile. I couldn't 
imagine a setup where you want a replica of each shard on all nodes no 
matter how many nodes you add to your cluster. Of course you can always 
give a replicationFactor of 10 (or something high) and then if you know 
(currently believe) that you will never add more than 10 nodes to your 
cluster, then basically you will achieve what you wanted to do with 
MAX_INT. But if things evolve and you end up having 20 or 100 nodes in 
your cluster you probably do not want more than 10 replicas anyway.


When you add new nodes, to add them to a current collection you will either 
have to use CoreAdmin API or pre configure the cores in solr.xml. All you need 
is to specify a matching collection name for the new core.

- Mark




Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen

Hi

Here is my version - I do not believe the explanations so far have been very clear.

We have the following concepts (here I will try to explain what each 
concept covers without naming it - it's hard):
1) Machines (virtual or physical) running Solr server JVMs (one machine 
can run several Solr server JVMs if you like)

2) Solr server JVMs
3) Logical "stores" where you can add/update/delete data-instances 
(closest to "logical" tables in an RDBMS)
4) Logical "slices" of a store (closest to non-overlapping "logical" 
sets of rows for the "logical" table in an RDBMS)
5) Physical instances of "slices" (a physical (disk/memory) instance of 
a "logical" slice). This is where data actually goes on disk - the 
logical "stores" and "slices" above are just non-physical concepts


Terminology
1) Believe we have no name for this (except of course "machine" :-) ), 
even though Jack claims that this is called a "node". Maybe sometimes it 
is called a "node", but I believe "node" is more often used to refer to 
a "Solr server JVM".

2) "Node"
3) "Collection"
4) "Shard". Used to be called "Slice" but I believe now it is officially 
called "Shard". I agree with that change, because I believe most of the 
industry also uses the term "Shard" for this logical/non-physical 
concept - this just needs to be reflected across documentation and code.
5) "Replica". Used to be called "Shard" but I believe now it is 
officially called "Replica". I certainly do not agree with the name 
"Replica", because it suggests that it is a copy of an "original", but 
it isn't. I would prefer "Shard-instance" here, to avoid the confusion. I 
understand that you can argue (if you argue long enough) that "Replica" 
is a fine name, but you really need the explanation to understand why 
"Replica" can be defended as the name for this. It is not immediately 
obvious what this is as long as it is called "Replica". A "Replica" is 
basically a Solr Cloud managed Core, and behind every Replica/Core lives 
a physical Lucene index (so a Replica (=Core) contains/maintains a Lucene 
index behind the scenes). The term "Replica" also needs to be reflected 
across documentation and code.


Regards, Per Steffensen

On 1/3/13 10:42 AM, Alexandre Rafalovitch wrote:

Hello,

I am trying to understand the core Solr terminology. I am looking for
correct rather than loose meaning as I am trying to teach an example that
starts from easy scenario and may scale to multi-core, multi-machine
situation.

Here are the terms that seem to be all overlapping and/or crossing over in
my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through documentation, but either there is a terminology
drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a
clarification.

Regards,
Alex.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)





Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen
For the same reasons that "Replica" shouldn't be called "Replica" (it 
requires too long an explanation to agree that it is an ok name), 
"replicationFactor" shouldn't be called "replicationFactor" as long as 
it refers to the TOTAL number of cores you get for your "Shard". 
"replicationFactor" would be an ok name if replicationFactor=0 meant one 
core, replicationFactor=1 meant two cores etc., but as long as 
replicationFactor=1 means one core and replicationFactor=2 means two cores, 
it is bad naming (you will not get any replication with 
replicationFactor=1 - WTF!?!?). If we want to insist that you specify 
the total number of cores, at least use "replicaPerShard" instead of 
"replicationFactor", or even better rename "Replica" to "Shard-instance" 
and use "instancesPerShard" instead of "replicationFactor".


Regards, Per Steffensen

On 1/3/13 3:52 PM, Per Steffensen wrote:

Hi

Here is my version - do not believe the explanations have been very clear

We have the following concepts (here I will try to explain what each 
the concept cover without naming it - its hard)
1) Machines (virtual or physical) running Solr server JVMs (one 
machine can run several Solr server JVMs if you like)

2) Solr server JVMs
3) Logical "stores" where you can add/update/delete data-instances 
(closest to "logical" tables in RDBMS)
4) Logical "slices" of a store (closest to non-overlapping "logical" 
sets of rows for the "logical" table in a RDBMS)
5) Physical instances of "slices" (a physical (disk/memory) instance 
of the a "logical" slice). This is where data actually goes on disk - 
the logical "stores" and "slices" above are just non-physical concepts


Terminology
1) Believe we have no name for this (except of course machine :-) ), 
even though Jack claims that this is called a "node". Maybe sometimes 
it is called a "node", but I believe "node" is more often used to 
refer to a "Solr server JVM".

2) "Node"
3) "Collection"
4) "Shard". Used to be called "Slice" but I believe now it is 
officially called "Shard". I agree with that change, because I believe 
most of the industry also uses the term "Shard" for this 
logical/non-physical concept  - just needs to be reflected it across 
documentation and code
5) "Replica". Used to be called "Shard" but I believe now it is 
officially called "Replica". I certainly do not agree with the name 
"Replica", because it suggests that it is a copy of an "original", but 
it isnt. I would prefer "Shard-instance" here, to avoid the confusion. 
I understand that you can argue (if you argue long enough) that 
"Replica" is a fine name, but you really need the explanation to 
understand why "Replica" can be defended as the name for this. Is is 
not immediately obvious what this is as long as it is called 
"Replica". A "Replica" is basically a Solr Cloud managed Core and 
behind every Replica/Core lives a physical Lucene index. So 
Replica=Core) contains/maintains Lucene index behind the scenes. The 
term "Replica" also needs to be reflected across documentation and code.


Regards, Per Steffensen




Re: Solr Collection API doesn't seem to be working

2013-01-03 Thread Per Steffensen

Ok, sorry. Easy to misunderstand, though.

On 1/3/13 3:58 PM, Mark Miller wrote:

MAX_INT is just a place holder for a high value given the context of this guy 
wanting to add replicas for as many machines as he adds down the line. You are 
taking it too literally.

- Mark




Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen

On 1/3/13 4:33 PM, Mark Miller wrote:

This has pretty much become the standard across other distributed systems and 
in the literat…err…books.
Hmmm, I'm not sure you are right about that. Maybe more than one 
distributed system calls them "Replica", but there are also a lot that 
don't. But if you are right, that's at least a good, valid argument to do 
it this way, even though I generally prefer "good logical naming" over 
"following bad naming from the industry" :-) Just because there is a lot 
of crap out there doesn't mean that we also want to make crap. Maybe 
good logical naming could even be a small entry in the "Why Solr is 
better than its competitors" list :-)


Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen

On 1/3/13 4:55 PM, Mark Miller wrote:
Trying to forge our own path here seems more confusing than helpful 
IMO. We have enough issues with terminology right now - where we can 
go with the industry standard, I think we should. - Mark 

Fair enough.

I don't think our biggest problem is whether we decide to call it 
Replica/replicationFactor or ShardInstance/InstancesPerShard. Our 
biggest problem is that we really haven't decided once and for all and 
made sure to reflect the decision consistently across code and 
documentation. As long as we haven't, I believe it is still ok to change 
our minds.




Re: Solr Collection API doesn't seem to be working

2013-01-03 Thread Per Steffensen

On 1/3/13 5:26 PM, Yonik Seeley wrote:
I agree - it's pointless to have two replicas of the same shard on a 
single node. But I'm talking about having replicationFactor as a 
target, so when you start up *new* nodes they will become a replica 
for any shard where the number of replicas is currently less than the 
replicationFactor. Ideally, one would be able to create a new 
collection with no nodes initially assigned to it if desired. 
Yes, THAT is a cool thing, as I also mentioned earlier. We should 
implement that!
But as it is today, if e.g. you have 2 nodes and ask for 
replicationFactor 3 you will get 2 replicas on one node and 1 replica on 
the other. Here I would rather only create 1 replica on each node, so 
that you only start up with 2 replicas in total (even though you asked 
for 3). Then "remember" (in ZK) that you actually asked for 3 and let 
the overseer create the 3rd replica on a new node as soon as such an 
additional node is available.


Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Per Steffensen

On 1/3/13 5:58 PM, Walter Underwood wrote:

A "factor" is multiplied, so multiplying the leader by a replicationFactor of 1 
means you have exactly one copy of that shard.

I think that recycling the term "replication" within Solr was confusing, but it 
is a bit late to change that.

wunder
Yes, the term "factor" is not misleading, but the term "replication" is. 
If we keep calling shard-instances for "Replica" I guess "replicaFactor" 
will be ok - at least much better than "replicationFactor". But it would 
still be better with e.g. "ShardInstance" and "InstancesPerShard"




Re: SolrCloud and Join Queries

2013-01-04 Thread Per Steffensen

On 1/4/13 9:21 AM, Hassan wrote:

Hi,

I am considering SolrCloud for our applications but I have run into 
the limitation of not being able to use Join Queries in distributed 
searches.

Our requirements are the following:
- SolrCloud will serve many applications where each application 
"index" is separate from other application. Each application really is 
customer deployment and we need to isolate customers data from each other
-Join queries are required. Queries will only look at one customer at 
a time.
- Since data volume for each customer is small in Solr/Lucene 
standards, (1-2 Million document is small, right?

Yes
), we are really interested in the replication aspect of SolrCloud 
more than distributed search.


I am considering the following SolrCloud design with questions:
- Start SolrCloud with 1 shard only. This should allow join queries to 
work correctly since all documents will be available in the same shard 
(index). is this a correct assumption?

- Each customer will have its own collection in the SolrCloud.
You can't have only one shard and several collections. A collection 
consists of a number of shards, but a shard "belongs" to a collection, 
so two different collections do not use the same shard. Shard is "below" 
collection in the concept hierarchy, so to speak.

Do collections provide me with data isolation between customers?

Yes?
It depends on what you mean by "isolation". Since different collections 
enforce different shards, and each shard basically has its own Lucene 
index (a set of Lucene indices if you use replication), and distinct 
Lucene indices typically persist in different disk folders, you will get 
"isolation" of data in the sense that data for different customers will be 
stored in different disk folders.
- Adding more nodes as replicas of the single shard to achieve 
replication and fault tolerance.


Thank you,
Hs
Not sure I understand completely what you want to achieve, but you might 
want to have a collection per customer. One shard per collection = one 
shard per customer = (as long as we do not consider replication) one 
Lucene index per customer = one data disk folder per customer. You 
should be able to do join queries inside the specific customer's shard.


Regards, Per Steffensen



Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Per Steffensen

It was a very good explanation, Jack!

I believe I have heard most of it before, so it is really not new to 
me. I DO understand that the names "replica" and "replication-factor" CAN 
be justified, but it requires a long and thorough explanation. And that's 
the point. A good name for a concept means that:
* The name is among the first that pops up in your mind when you think 
about the concept, or at least you can give a very short explanation of why 
you chose this name for that concept
* When a (fairly educated) newcomer hears the name for the first time, 
his first thoughts about the concept it covers are as close as possible 
to the actual concept


Good metrics for whether or not we have good names must therefore be:
1) The frequency of questions about the concepts behind the names
2) The frequency of wrong usage of names (cases where people actually 
didn't understand the concept behind the name, didn't ask (1. above) and 
just used the name for what they thought it meant)

3) The length of the explanation of why you chose this name for that concept

Ad 1)
I counted several questions just this week. Especially I noted "Replica 
(Replica of _what_?)" in the original post of this thread. Whether we 
want it or not, newcomers will keep "not getting" the concept of replica 
or getting it wrong. Why? Because it is a bad name.

Ad 2)
I also counted several cases where names were used completely wrongly 
this week.

Ad 3)
Take a look at the length of Jack's great post below, and take a look at 
the length of this mail-thread.


I believe we will do better on those metrics if we use 
node/collection/shard/shard-instance/index instead of 
node/collection/shard/replica/(core/)index, and use instances-per-shard 
instead of replication-factor. And say that a "core" is the same as a 
"shard-instance", but typically used in a non/pre-Cloud context. That an 
index is a physical Lucene thing - and nothing but that. That 
collections and shards are logical concepts. That a shard-instance is a 
physical instance of a shard, implemented using a Lucene index persisting 
its data on physical disk.


My only interest here is to try to pull the project in a good direction. 
You just get my opinion. Keep it simple and no bullshit.


This entire discussion is great I think, but it probably belongs on the 
dev-list (or maybe on a JIRA).
I believe Alexandre Rafalovitch got his answer already :-) To the extent a 
clear answer exists at the moment.


Regards, Per Steffensen

On 1/4/13 2:54 PM, Jack Krupansky wrote:

Replication makes perfect sense even if our explanations so far do not.

A shard is an abstraction of a subset of the data for a collection.

A replica is an instance of the data of the shard and instances of 
Solr servers that have indicated a readiness to service queries and 
updates for the data. Alternatively, a replica is a node which has 
indicated a readiness to receive and serve the data of a shard, but 
may not have any data at the moment.


Lets describe it operationally for SolrCloud: If data comes in to any 
replica of a shard it will automatically and quickly be "replicated" 
to all other replicas of the shard. If a new replica of a shard comes 
up it will be streamed all of the data from the another replica of the 
shard. If an existing replica of a shard restarts or reconnects to the 
cluster, it will be streamed updates of any new data since it was last 
updated from another replica of the shard.


Replication is simply the process of assuring that all replicas are 
kept up to date. That's the same abstract meaning as for Master/Slave 
even though the operational details are somewhat different. The goal 
remains the same.


Replication factor is the number of instances of the data of the shard 
and instances of Solr servers that can service queries and updates for 
the data. Alternatively, the replication factor is the number of nodes 
of the SolrCloud cluster  which have indicated a readiness to receive 
and serve the data of a shard, but may not have any data at the moment.


A node is an instance of Solr running in a Java JVM that has indicated 
to the Zookeeper ensemble of a SolrCloud cluster that it is ready to 
be a replica for a shard of a collection. [The latter part of that is 
a bit too fuzzy - I'm not sure what the node tells Zookeeper and who 
does shard assignment. I mean, does a node explicitly say what shard 
it wants to be, or is that assigned by Zookeeper, or is that a node's 
choice/option? But none of that changes the fact that a node 
"registers" with Zookeeper and then somehow becomes a replica for a 
shard.]


A node (instance of a Solr server) can be a replica of shards from 
multiple collections (potentially multiple shards per collection). A 
node is not a replica per se, but a container that can serve multiple 
collections. A node can serve as multiple replicas, each of a 
different collection.



Re: SolrCloud and Join Queries

2013-01-05 Thread Per Steffensen
Did you remember to add the replicationFactor parameter when you created your 
"customer1" and "customer2" collections/shards?
http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API 
(note that the maxShardsPerNode and createNodeSet params are not available 
in 4.0.0, but will be in 4.1)
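For example, creating a one-shard collection with two replicas looks like this 
(host and collection name are placeholders):

http://localhost:8983/solr/admin/collections?action=CREATE&name=customer1&numShards=1&replicationFactor=2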


Regards, Per Steffensen

On 1/5/13 11:55 AM, Hassan wrote:

Thanks Per and Otis,

It is much clearer now but I have a question about adding new solr 
nodes and collections.
I have a dedicated zookeeper instance. Lets say I have uploaded my 
configuration to zookeeper using "zkcli" and named it, say, 
"configuration1".
Now I want to create a new solrcloud from scratch with two solr nodes. 
I need to create a new collection (with one shard) called "customer1" 
using the configuration name "configuration1". I have tried different 
ways using Collections API, zkcli linkconfig/downconfig but I cannot 
get it to work. Collection is only available on one node. The example 
"collection1" works as expected where one node has the Leader shard 
and the other node has the replica. See the cloud graph 
http://imageshack.us/f/706/selection008p.png/


What is the correct way to dynamically add collections to already 
existing nodes and new nodes?


Thanks you,
Hs




Re: SolrCloud and Join Queries

2013-01-06 Thread Per Steffensen
And you will have "loadbalancing", since a "random" one of the replicas behind 
the shard will be chosen to handle the query.


On 1/6/13 3:10 AM, Otis Gospodnetic wrote:

Hi Hassan,

Correct. If you have a single shard, then the query will execute the query
on only one node and that is it.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Sat, Jan 5, 2013 at 9:06 AM, Hassan  wrote:


Missed the replicationFactor parameter. Works great now.
http://imm.io/RM66
Thanks a lot for you help,

One last question. in terms of scalability, having this design of one
collection per customer, with one shard and many replicas, A query will be
handled by one shard (or replica) on one node only and scalability here is
really about load balancing queries between the replicas only. i.e no
distributed search. is this correct?

Hassan


On 05/01/13 15:47, Per Steffensen wrote:


Do you remember to add replicationFactor parameter when you create your
"customer1" and "customer2" collections/shards?
http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API 
(note that maxShardsPerNode and createNodeSet params are not available in
4.0.0, but will be in 4.1)

Regards, Per Steffensen

On 1/5/13 11:55 AM, Hassan wrote:


Thanks Per and Otis,

It is much clearer now but I have a question about adding new solr nodes
and collections.
I have a dedicated zookeeper instance. Lets say I have uploaded my
configuration to zookeeper using "zkcli" and named it, say,
"configuration1".
Now I want to create a new solrcloud from scratch with two solr nodes. I
need to create a new collection (with one shard) called "customer1" using
the configuration name "configuration1". I have tried different ways using
Collections API, zkcli linkconfig/downconfig but I cannot get it to work.
Collection is only available on one node. The example "collection1" works
as expected where one node has the Leader shard and the other node has the
replica. See the cloud graph http://imageshack.us/f/706/**
selection008p.png/ <http://imageshack.us/f/706/selection008p.png/>

What is the correct way to dynamically add collections to already
existing nodes and new nodes?

Thanks you,
Hs









Re: How to size a SOLR Cloud

2013-01-07 Thread Per Steffensen

Hi

I have some experience with practical limits. We have several setups we 
have tried to run with high load for a long time:

1)
* 20 shards in one collection spread over 5 nodes (4 shards for the 
collection per node), no redundancy (only one replica per shard)

* Indexing 35-50 million documents per day and searching a little along the way
* We do not have detailed measurements on searching, but my impression 
is that search response times are fairly ok (below 5 secs for 
non-complicated searches) - at least for the first 15 days, up to about 500 
million documents
* We have very detailed measurements on indexing times though. They are 
good for the first 15-17 days, up to 500-600 million documents. Then we see a 
temporary slow-down in indexing times. This is because major merges 
happen at the same time across all shards. The indexing times speed up 
when this is over, though. After about 20 days everything stops running 
- things just get too slow and eventually nothing happens.

2)
* Same as 1), except 40 shards in one collection spread over 10 nodes, 
no redundancy
* Slowdown points seem to scale linearly - slow-down around 1 billion 
docs and complete stop at 1.3-1.4 billion docs


Therefore it seems a little strange to me that you have problems with 25 
million docs in two shards.
One major difference is the redundancy, though. We are running only one 
replica per shard. We started out trying to run with redundancy (2 
replicas per shard) but that involved a lot of problems. Things never 
successfully recovered when recovery situations occurred, and we saw roughly 
4x indexing times compared to non-redundancy (even though a max of 
2x should be expected).


Regards, Per Steffensen


On 1/7/13 6:14 PM, f.fourna...@gibmedia.fr wrote:

Hello,
I'm new to SOLR and I have a collection with 25 million records.
I want to run this collection on SOLR Cloud (Solr 4.0) under Amazon EC2
instances.
Currently I've configured 2 shards and 2 replicas per shard with Medium
instances (4GB, 1 CPU core) and response times are very long.
How should I size the cloud (sharding, replicas, memory, CPU, ...) to have
acceptable response times in my situation? More memory? More CPU? More
shards? Do rules for sizing a Solr cloud exist?
Is it possible to have more than 2 replicas of one shard? Is it relevant?
Best regards
FF





Re: Solr 4 exceptions on trying to create a collection

2013-01-08 Thread Per Steffensen

JIRA about the fix for 4.1: https://issues.apache.org/jira/browse/SOLR-4140

On 1/8/13 4:01 PM, Jay Parashar wrote:

Thanks Mark...I will use it with 4.1. For now, I used httpclient to call the
Collections api directly (do a Get on
http://127.0.0.1:8983/solr/admin/collections?action=CREATE etc). This is
working.





Re: CoreAdmin STATUS performance

2013-01-10 Thread Per Steffensen
If you are using ZK-coordinated Solr (SolrCloud - you need 4.0+) you 
can maintain an in-memory, always-up-to-date data structure containing the 
information - the ClusterState. You can get it through CloudSolrServer or 
ZkStateReader: you connect to ZK once and it will automatically 
update the in-memory ClusterState with changes.


Regards, Per Steffensen

On 1/9/13 4:38 PM, Shahar Davidson wrote:

Hi All,

I have a client app that uses SolrJ and which requires to collect the names 
(and just the names) of all loaded cores.
I have about 380 Solr Cores on a single Solr server (net indices size is about 
220GB).

Running the STATUS action takes about 800ms - that seems a bit too long, given 
my requirements.

So here are my questions:
1) Is there any way to get _only_ the core Name of all cores?
2) Why does the STATUS request take such a long time and is there a way to 
improve its performance?

Thanks,

Shahar.





Re: CoreAdmin STATUS performance

2013-01-10 Thread Per Steffensen

On 1/10/13 10:09 AM, Shahar Davidson wrote:

search request, the system must be aware of all available cores in order to 
execute distributed search on _all_ relevant cores

For this purpose I would definitely recommend that you go "SolrCloud".

Furthermore, we do something "extra":
We have several collections, each containing data from a specific period 
in time - the timestamp of ingoing data decides which collection it is 
indexed into. One important search criterion for our clients is search 
on a timestamp interval. Therefore most searches can be restricted to only 
consider a subset of all our collections. Instead of having the logic 
that calculates the subset of collections to search (given the timestamp 
search interval) in the clients, we just let clients do "dumb" searches by 
giving the timestamp interval. The subset of collections to search is 
calculated on the server side from the timestamp interval in the 
search query. We handle this in a Solr SearchComponent which we place 
"early" in the chain of SearchComponents. Maybe you can get some 
inspiration from this approach, if it is also relevant for you.
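A minimal sketch of such a component (assuming Solr 4.x; the ts.from/ts.to 
parameter names and the shardsFor(...) helper are made up for illustration - 
the real helper would consult the ClusterState, e.g. via a ZkStateReader, to 
map the interval to shard addresses):

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Registered via first-components so it runs early in the SearchComponent chain,
// this component narrows the set of shards a distributed search fans out to.
public class TimestampRoutingComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) {
    SolrParams params = rb.req.getParams();
    String from = params.get("ts.from"); // hypothetical client-supplied interval
    String to = params.get("ts.to");
    if (from != null && to != null) {
      rb.shards = shardsFor(from, to); // only these shards will be queried
    }
  }

  @Override
  public void process(ResponseBuilder rb) {
    // nothing to do in the per-shard phase
  }

  private String[] shardsFor(String from, String to) {
    // hypothetical helper: compute the month-collections covering [from, to]
    // and map them to their shard addresses using the cluster state
    return new String[0];
  }

  @Override
  public String getDescription() {
    return "Restricts distributed searches to the collections covering a timestamp interval";
  }

  @Override
  public String getSource() {
    return null;
  }
}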


Regards, Per Steffensen


Re: CoreAdmin STATUS performance

2013-01-10 Thread Per Steffensen
The collections are created dynamically. Not on update though. We use 
one collection per month and we have a timer job running (every hour or 
so), which checks whether all collections that need to exist actually do 
exist - if not, it creates the collection(s). The rule is that the 
collection for "next month" has to exist as soon as we enter the "current 
month", so the first time the timer job runs on e.g. 1 July it will create 
the August collection. We never get data with a timestamp in the future. 
Therefore, if the timer job just gets to run once within every month we 
will always have the needed collections ready.


We create collections using the new Collections API in Solr. We used to 
manage creation of every single Shard/Replica/Core of the collections 
through the Core Admin API in Solr, but since a Collections API was 
introduced we decided that we had better use that. In 4.0 it did not have 
the features we needed, which triggered SOLR-4114, SOLR-4120 and 
SOLR-4140, which will be available in 4.1. With those features we are now 
using the Collections API.


BTW, our timer job also handles deletion of "old" collections. In our 
system you can configure how many historic month-collections you will 
keep before it is ok to delete them. Let's say that this is configured to 
3: as soon as it becomes 1 July, the timer job will delete the 
March collection (the historic collections to keep will just have become 
the April, May and June collections). This way we will always have at least 
3 months of historic data, and late in a month close to 4 months of 
history. It does not matter that we have a little too much history, as long 
as we do not go below the lower limit on the length of historic data. We 
also use the new Collections API for deletion.
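For example, deleting an old month-collection is a single Collections API call 
(host and collection name are placeholders):

http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection_2012_03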


Regards, Per Steffensen

On 1/10/13 3:04 PM, Shahar Davidson wrote:

Hi Per,

Thanks for your reply!

That's a very interesting approach.

In your system, how are the collections created? In other words, are the 
collections created dynamically upon an update (for example, per new day)?
If they are created dynamically, who handles their creation (client/server)  
and how is it done?

I'd love to hear more about it!

Appreciate your help,

Shahar.

-Original Message-
From: Per Steffensen [mailto:st...@designware.dk]
Sent: Thursday, January 10, 2013 1:23 PM
To: solr-user@lucene.apache.org
Subject: Re: CoreAdmin STATUS performance

On 1/10/13 10:09 AM, Shahar Davidson wrote:

search request, the system must be aware of all available cores in
order to execute distributed search on_all_  relevant cores

For this purpose I would definitely recommend that you go "SolrCloud".

Further more we do something "ekstra":
We have several collections each containing data from a specific period in time 
- timestamp of ingoing data decides which collection it is indexed into. One 
important search-criteria for our clients are search on timestamp-interval. 
Therefore most searches can be restricted to only consider a subset of all our 
collections. Instead of having the logic calculating the subset of collections 
to search (given the timestamp
search-interval) in clients, we just let clients do "dumb" searches by giving the 
timestamp-interval. The subset of collections to search are calculated on server-side from the 
timestamp-interval in the search-query. We handle this in a Solr SearchComponent which we place 
"early" in the chain of SearchComponents. Maybe you can get some inspiration by this 
approach, if it is also relevant for you.

Regards, Per Steffensen






Forwarding authentication credentials in internal node-to-node requests

2013-01-11 Thread Per Steffensen

Hi

I read http://wiki.apache.org/solr/SolrSecurity and know a lot about 
webcontainer authentication and authorization. I'm sure I will be able to 
set it up so that each solr-node will require HTTP authentication for 
(selected) incoming requests.


But solr-nodes also make requests among each other, and I'm in doubt whether 
credentials are forwarded from the "original request" to the internal 
sub-requests.
E.g. let's say that each solr-node is set up to require authentication 
for search requests. An "outside" user makes a distributed request 
including correct username/password. Since it is a distributed search, 
the node which handles the original request from the user will have to 
make sub-requests to other solr-nodes, but they also require correct 
credentials in order to accept these sub-requests. Are the credentials 
from the original request duplicated to the sub-requests, or what options 
do I have?
The same thing goes for e.g. update requests if they are sent to a node 
which does not run (all) the replicas of the shard to which the documents 
to be added/updated/deleted belong. The node needs to make sub-requests 
to other nodes, and that will require forwarding the credentials.


Does this just work out of the box, or ... ?

Regards, Per Steffensen


Re: Forwarding authentication credentials in internal node-to-node requests

2013-01-11 Thread Per Steffensen
Hmmm, it will not work for me. I want the "original" credentials 
forwarded in the sub-requests. The credentials are mapped to permissions 
(authorization), and basically I don't want a user to be able to have 
something done in the sub-requests (performed automatically by the 
contacted solr-node) that he is not authorized to do. Forwarding of 
credentials is a must. So what you are saying is that I should expect to 
have to make some modifications to Solr in order to achieve what I want?


Regards, Per Steffensen

On 1/11/13 2:11 PM, Markus Jelsma wrote:

Hi,

If your credentials are fixed I would configure username:password in your 
request handler's shardHandlerFactory configuration section and then modify 
HttpShardHandlerFactory.init() to create an HttpClient with an AuthScope 
configured with those settings.

I don't think you can obtain the original credentials very easily from inside 
HttpShardHandlerFactory.

Cheers
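A minimal sketch of the fixed-credentials HttpClient setup Markus describes, 
assuming Apache HttpClient 4.x (the client library bundled with Solr 4.x) and 
made-up username/password; wiring it into HttpShardHandlerFactory.init() is the 
modification that would still have to be done by hand:

import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.DefaultHttpClient;

// Sketch only: an HttpClient configured with fixed credentials that will be used
// to answer HTTP auth challenges from other nodes. Username/password are made up.
public class FixedCredentialsClientSketch {
    public static DefaultHttpClient create() {
        DefaultHttpClient client = new DefaultHttpClient();
        client.getCredentialsProvider().setCredentials(
                new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT),
                new UsernamePasswordCredentials("internal-user", "internal-password"));
        return client;
    }
}

Note that this makes every node authenticate with the same internal account; it 
does not forward the original caller's credentials, which is exactly the 
limitation being discussed.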
  





Re: Forwarding authentication credentials in internal node-to-node requests

2013-01-12 Thread Per Steffensen
I will figure it out. The essence of the question was whether it is there 
out of the box. Thanks!


Regards, Per Steffensen

On 1/11/13 5:38 PM, Markus Jelsma wrote:

Hmm, you need to set up the HttpClient in HttpShardHandlerFactory, but you 
cannot access the HttpServletRequest from there; it is only available in 
SolrDispatchFilter AFAIK. And even then, the HttpServletRequest can only return 
the remote user name, not the password he, she or it provided. I don't know how 
to obtain the password.
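To illustrate what is and is not available at that point, here is a hedged 
sketch of a servlet filter sitting in front of the dispatch filter. The class 
name and the idea of stashing the raw Authorization header in a ThreadLocal for 
reuse by sub-requests are assumptions, not existing Solr code:

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

// Sketch only: shows what the container exposes about the caller.
public class CredentialsCaptureFilter implements Filter {

    // A sub-request made later on the same thread could read this value.
    public static final ThreadLocal<String> AUTHORIZATION = new ThreadLocal<String>();

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest http = (HttpServletRequest) req;
        String user = http.getRemoteUser();               // user name only, no password
        String header = http.getHeader("Authorization");  // raw header as sent by the client
        AUTHORIZATION.set(header);
        try {
            chain.doFilter(req, res);
        } finally {
            AUTHORIZATION.remove();
        }
    }

    @Override
    public void init(FilterConfig config) {}

    @Override
    public void destroy() {}
}

For HTTP Basic auth the raw Authorization header does contain the credentials 
the client sent, so a modified shard handler could in principle replay it on 
sub-requests, but as discussed in this thread nothing like that exists out of 
the box.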
  







Re: Way to lock solr for incoming writes

2013-01-16 Thread Per Steffensen

Well, you can stop the Solrs :-)
If you are making a backup by copying the actual files stored by Solr, you 
probably want to stop them anyway to make sure everything is consistent 
and written to disk. If you don't stop the Solrs, at least make sure that 
you do a "commit" (not a soft commit) after all incoming writes have been stopped.
If you cannot afford to stop the Solrs, then of course you will need to 
do something smarter. Maybe it is possible to just close the HTTP 
endpoint in your webcontainer (Jetty or Tomcat or whatever) for a short 
while, or close the port at the OS level, or ...
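For the hard commit, a minimal SolrJ sketch (the base URL / core name is a 
made-up example) could be:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

// Sketch: force a hard commit after incoming writes have been stopped, so
// everything is flushed to disk before the file-level backup is taken.
public class HardCommitBeforeBackup {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        solr.commit(); // commit() with no arguments is a hard commit
        solr.shutdown();
    }
}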


Regards, Per Steffensen

On 1/16/13 4:02 PM, mizayah wrote:

Is there a way to lock Solr for writes?
I don't want to use Solr's integrated backup because I'm using a Ceph cluster.

What I need is to have consistent data for a few seconds to make a backup.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Way-to-lock-solr-for-incoming-writes-tp4033873.html
Sent from the Solr - User mailing list archive at Nabble.com.




