Re: Migration from Solr 1.2 to Solr 1.4

2011-02-17 Thread Stijn Vanhoorelbeke
Hi,

I recently ran across the same issues;
I'm updating my Solr 1.4 to the latest nightly build ( to have the ZooKeeper
functionality ).

I've copied the solr_home dir - but with no success.
( The config files were not accepted on the new build - due to a version
mismatch. )
Then I copied only the index data & used a fresh copy of conf/solrconfig.xml
( which I adapted to match my old solrconfig.xml settings ) - I could still
use the old schema.

That way, I ended up with a working new version of Solr.
But the copied index gave some trouble: 'Format version is not supported'.

I guess I have to rebuild the index.

For your case:
Maybe you could replicate the index by using the replication handler built
into solrconfig.xml.
You could set your old system up as master and your new one as slave & hope
the slave gets updated.
( Note: I guess this will not work - the replication handler in
solrconfig.xml was only introduced in Solr 1.4 and is not present in Solr
1.3. )
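
For reference, the solrconfig.xml setup for that handler looks roughly like
this ( a sketch - the master URL and poll interval are placeholders ):

On the master:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

On the slave:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master_host:8983/solr/replication</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>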

2011/2/17 Chris Hostetter 
>
> : > if you don't have any custom components, you can probably just use
> : > your entire solr home dir as is -- just change the solr.war.  (you
> : > can't just copy the data dir though, you need to use the same configs)
> : >
> : > test it out, and note the "Upgrading" notes in the CHANGES.txt for the
> : > 1.3, 1.4, and 1.4.1 releases for "gotchas" that you might want to
> : > watch out for.
>
> : Thank you for your reply, I've tried to copy the data and configuration
> : directory without success :
> : SEVERE: Could not start SOLR. Check solr/home property
> : java.lang.RuntimeException: org.apache.lucene.index.CorruptIndexException:
> : Unknown format version: -10
>
> Hmmm... ok, i'm not sure why that would happen.  According to the
> CHANGES.txt,  Solr 1.2 used Lucene 2.1 and Solr 1.4.1 used 2.9.3 -- so
> Solr 1.4 should have been able to read an index created by Solr 1.2.
>
> You *could* try upgrading first from 1.2 to 1.3, run an optimize command,
> and then try upgrading from 1.3 to 1.4 -- but i can't make any assertions
> that that will work better, since going straight from 1.2 to 1.4 should
> have worked the same way.
>
> When in doubt: reindex.
>
>
> -Hoss


Re: Separating Index Reader and Writer

2011-02-17 Thread Em

Push again.

Regards


Em wrote:
> 
> Just wanted to push that topic.
> 
> Regards
> 
> 
> Em wrote:
>> 
>> Hi Peter,
>> 
>> I must jump into this discussion: From a logical point of view, what you
>> are saying only makes sense if both instances do not run on the same
>> machine, or at least not on the same drive.
>> 
>> When both run on the same machine and the same drive, the overall memory
>> used should be equal, plus I do not understand why this setup should
>> affect cache warming etc., since the process of rewarming should be the
>> same.
>> 
>> Well, my knowledge of the internals is not very deep. But from a purely
>> logical point of view - to me - the same thing happens as if I did it all
>> in a single Solr instance. So what is the difference, what am I overlooking?
>> 
>> Another thing: While W is committing and writing to the index, is there
>> any inconsistency in R, or is there none because W is writing a new
>> segment, so nothing looks different to R until the commit has
>> finished?
>> Are there problems while optimizing an index?
>> 
>> How do you inform R about the finished commit?
>> 
>> Thank you for your explanation, it's a really interesting topic!
>> 
>> Regards,
>> Em
>> 
>> Peter Sturge-2 wrote:
>>> 
>>> Hi,
>>> 
>>> We use this scenario in production where we have one write-only Solr
>>> instance and 1 read-only, pointing to the same data.
>>> We do this so we can optimize caching/etc. for each instance for
>>> write/read. The main performance gain is in cache warming and
>>> associated parameters.
>>> For your Index W, it's worth turning off cache warming altogether, so
>>> commits aren't slowed down by warming.
>>> 
>>> Peter
>>> 
>>> 
>>> On Sun, Feb 6, 2011 at 3:25 PM, Isan Fulia 
>>> wrote:
 Hi all,
 I have set up two indexes, one for reading (R) and the other for
 writing (W). Index R refers to the same data dir as W (defined in
 solrconfig via <dataDir>).
 To make sure the R index sees the documents indexed by W, I am firing
 an empty commit on R.
 With this, I am getting a performance improvement compared to using
 the same index for reading and writing.
 Can anyone help me understand why this performance improvement takes
 place even though both indexes are pointing to the same data
 directory?

 --
 Thanks & Regards,
 Isan Fulia.

>>> 
>>> 
>> 
>> 
> 
> 

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Separating-Index-Reader-and-Writer-tp2437666p2516736.html
Sent from the Solr - User mailing list archive at Nabble.com.


My Plan to Scale Solr

2011-02-17 Thread Bing Li
Dear all,

I started to learn how to use Solr three months ago. My experiences are
still limited.

Now I crawl Web pages with my crawler and send the data to a single Solr
server. It runs fine.

Since the potential user base is large, I have decided to scale Solr. After
configuring replication, a single index can be replicated to multiple
servers.

I think sharding is also required. I intend to split the index
according to the data categories and priorities. After that, I will use the
above replication techniques and get high performance. The remaining work is
not so difficult.

I noticed some new terms, such as SolrCloud, Katta and ZooKeeper. According
to my current understanding, it seems that I can ignore them. Am I right?
What benefits can I get from using them?

Thanks so much!
LB


Re: Replication and newSearcher registered > poll interval

2011-02-17 Thread dan sutton
Hi,

Keeping the thread alive - any thoughts on only doing replication when
there is no warming currently going on?

Cheers,
Dan

On Thu, Feb 10, 2011 at 11:09 AM, dan sutton  wrote:
> Hi,
>
> If the replication window is too small to allow a new searcher to warm
> and close the current searcher before the new one needs to be in
> place, then the slave continuously has a high load, and potentially
> an OOM error. We've noticed this in our environment, where we have
> several facets on large multivalued fields.
>
> I was wondering what the list thought about modifying the replication
> process to skip polls (though warning in the logs) when there is a
> searcher in the process of warming? Otherwise, as in our case, it brings
> the slave to its knees; the workaround was to extend the poll interval,
> though that is not ideal.
>
> Cheers,
> Dan
>


fine tuning the solr search

2011-02-17 Thread Churchill Nanje Mambe
Hi
 I would love to know how to do this with Solr:
 say a user inputs "Account manager files",
 I wish that Solr would prioritise the documents it finds as follows:
 1) documents containing "account manager files" get a greater score
 2) then documents with "account manager" come next
 3) then documents with "account" can come,
 before the other words are used to find documents in the search

right now I think it works differently, as it finds documents with
"accounts" and puts them in something like first position, or documents
with the word "manager" in second position, or so

thanks
-- 
Mambe Churchill Nanje
237 33011349,
AfroVisioN Founder, President,CEO
http://www.afrovisiongroup.com | http://mambenanje.blogspot.com
skypeID: mambenanje
www.twitter.com/mambenanje


Re: Replication and newSearcher registered > poll interval

2011-02-17 Thread Markus Jelsma
If you set maxWarmingSearchers to 1 then you cannot issue an overlapping 
commit. Slaves won't poll for a new index version while replication is in 
progress.

It works well in my environment, where there is a high update/commit frequency 
of about a thousand documents per minute. The system even behaves well at a 
thousand updates per second and a commit per minute, with a poll interval of 
2 seconds.
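
( In solrconfig.xml that is - a minimal sketch:

  <!-- allow only one searcher to be warming at any time -->
  <maxWarmingSearchers>1</maxWarmingSearchers>
)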

On Thursday 17 February 2011 11:54:32 dan sutton wrote:
> Hi,
> 
> Keeping the thread alive - any thoughts on only doing replication when
> there is no warming currently going on?
> 
> Cheers,
> Dan
> 
> On Thu, Feb 10, 2011 at 11:09 AM, dan sutton  wrote:
> > Hi,
> > 
> > If the replication window is too small to allow a new searcher to warm
> > and close the current searcher before the new one needs to be in
> > place, then the slave continuously has a high load, and potentially
> > an OOM error. We've noticed this in our environment, where we have
> > several facets on large multivalued fields.
> > 
> > I was wondering what the list thought about modifying the replication
> > process to skip polls (though warning in the logs) when there is a
> > searcher in the process of warming? Otherwise, as in our case, it brings
> > the slave to its knees; the workaround was to extend the poll interval,
> > though that is not ideal.
> > 
> > Cheers,
> > Dan

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: My Plan to Scale Solr

2011-02-17 Thread Markus Jelsma
Hi Bing Li,

On Thursday 17 February 2011 10:32:11 Bing Li wrote:
> Dear all,
> 
> I started to learn how to use Solr three months ago. My experiences are
> still limited.
> 
> Now I crawl Web pages with my crawler and send the data to a single Solr
> server. It runs fine.
> 
> Since the potential users are large, I decide to scale Solr. After
> configuring replication, a single index can be replicated to multiple
> servers.
> 
> For shards, I think it is also required. I attempt to split the index
> according to the data categories and priorities. After that, I will use the
> above replication techniques and get high performance. The following work
> is not so difficult.

It's better to use a consistent hashing algorithm to decide which server takes 
which documents if you want good relevancy. A modulo over the number of 
servers will give you the shard for each document. If you have integers as 
unique keys then a simple modulo will suffice.
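
A minimal sketch in Java ( the key type and method name are mine ):

  // route a document to a shard based on its unique key
  int shardFor(String uniqueKey, int numShards) {
      int h = uniqueKey.hashCode() % numShards;
      return h < 0 ? h + numShards : h; // keep the remainder non-negative
  }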

> 
> I noticed some new terms, such as SolrCloud, Katta and ZooKeeper.
> According to my current understandings, it seems that I can ignore them.
> Am I right? What benefits can I get if using them?

SolrCloud comes with ZooKeeper. It's designed to provide a fail-over cluster 
and other useful features. I haven't tried Katta.

> 
> Thanks so much!
> LB

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: My Plan to Scale Solr

2011-02-17 Thread Stijn Vanhoorelbeke
Hi,

I'm currently looking at SolrCloud. I've managed to set up a scalable
cluster with ZooKeeper.
( see the examples in http://wiki.apache.org/solr/SolrCloud for a quick
understanding )
This way, all the different shards / replicas are stored in a centralised
configuration.

Moreover, ZooKeeper provides out-of-the-box load balancing.
So, let's say you have 2 different shards and each is replicated 2 times.
Your ZooKeeper config will look like this:

\config
 ...
   /live_nodes (v=6 children=4)
  lP_Port:7500_solr (ephemeral v=0)
  lP_Port:7574_solr (ephemeral v=0)
  lP_Port:8900_solr (ephemeral v=0)
  lP_Port:8983_solr (ephemeral v=0)
 /collections (v=20 children=1)
  collection1 (v=0 children=1) "configName=myconf"
   shards (v=0 children=2)
shard1 (v=0 children=3)
 lP_Port:8983_solr_ (v=4)
"node_name=lP_Port:8983_solr url=http://lP_Port:8983/solr/";
 lP_Port:7574_solr_ (v=1)
"node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";
 lP_Port:8900_solr_ (v=1)
"node_name=lP_Port:8900_solr url=http://lP_Port:8900/solr/";
shard2 (v=0 children=2)
 lP_Port:7500_solr_ (v=0)
"node_name=lP_Port:7500_solr url=http://lP_Port:7500/solr/";
 lP_Port:7574_solr_ (v=1)
"node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";

--> This setup can be realised with 1 ZooKeeper module - the other Solr
machines just need to know the IP_Port where the ZooKeeper is active &
that's it.
--> So no configuration / installation is needed to quickly realise a
scalable / load-balanced cluster.

Disclaimer:
ZooKeeper is a relatively new feature - I'm not sure if it will work out in
a real production environment, which has a tight SLA pending.
But - definitely keep your eyes on this stuff - it will mature quickly!

Stijn Vanhoorelbeke


Re: Validate Query Syntax of Solr Request Before Sending

2011-02-17 Thread Markus Jelsma
Uh, how about the LuceneQParser? It does some checks and can return 
appropriate error messages.
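
A client-side sketch using Lucene's own query parser ( assumes Lucene 2.9 on
the classpath; the field name and analyzer are assumptions ):

  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.util.Version;

  // returns null when the query parses, or a message to show the user
  String validate(String userInput) {
      try {
          new QueryParser(Version.LUCENE_29, "text", new WhitespaceAnalyzer())
              .parse(userInput);
          return null;
      } catch (ParseException e) {
          return "The query is not valid: " + e.getMessage();
      }
  }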

On Thursday 17 February 2011 06:44:16 csj wrote:
> Hi,
> 
> I wonder if it is possible to let the user build up a Solr query and have
> it validated by some Java API before sending it to Solr.
> 
> Is there a parser that could help with that? I would like to help the user
> build a valid query as she types, by showing messages like "The query is
> not valid" or perhaps even something more advanced: "The parentheses are
> not balanced".
> 
> Maybe one day it would also be possible to analyse the semantics of the
> query, like: "This query has a built-in inconsistency because the two
> dates you have specified require documents to be before AND after these
> dates". But this is far in the future...
> 
> Regards,
> 
> Christian Sonne Jensen

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


is solr dynamic calculation??

2011-02-17 Thread satya swaroop
Hi All,
 I have a question: does Solr produce its results by calculating the
score dynamically, or is the score pre-calculated and simply served?

for example:
if a query is made with q=solr on my index... I get a result of 25
documents... what is it calculating?? I am very keen to know how it
calculates the score and orders the results.


Regards,
satya


Re: fine tuning the solr search

2011-02-17 Thread Markus Jelsma
Have a read:
http://lucene.apache.org/java/2_9_1/scoring.html
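
One way to get the phrase-first ordering you describe is the dismax query
parser with a phrase boost - a sketch, with the field name assumed:

  q=Account manager files&defType=dismax&qf=content&pf=content^10&mm=1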

On Thursday 17 February 2011 12:50:08 Churchill Nanje Mambe wrote:
> Hi
>  I would love to know how to do this with solr
>  say a user inputs "Account manager files",
>  I wish that solr puts priority on the documents it finds as follows
>  1) documents containing account manager files gets a greater score
>  2) then documents with account manager come next
>  3) then documents with account can come
>  before the other words are used to get documents in the search
> 
> right now I think it works different as it finds documents with
> accounts and puts them like in first position or documents with word
> manager in second position or so
> 
> thanks

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: is solr dynamic calculation??

2011-02-17 Thread Markus Jelsma
Both, you should also read about scoring.
http://lucene.apache.org/java/2_9_1/scoring.html
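
In short, the practical scoring function documented there is ( sketch ):

  score(q,d) = coord(q,d) * queryNorm(q)
               * SUM over terms t in q of
                   ( tf(t in d) * idf(t)^2 * t.getBoost() * norm(t,d) )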

On Thursday 17 February 2011 13:39:05 satya swaroop wrote:
> Hi All,
>  I have a question: does Solr produce its results by calculating the
> score dynamically, or is the score pre-calculated and simply served?
> 
> for example:
> if a query is made with q=solr on my index... I get a result of 25
> documents... what is it calculating?? I am very keen to know how it
> calculates the score and orders the results.
> 
> 
> Regards,
> satya

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: How to use XML parser in DIH for a database?

2011-02-17 Thread Estrada Groups
If you're using the DIH with SQL Server you can configure it however you
want. Here is a snippet of my code ( the XML below was stripped by the list
archive ). Note the driver, which you need to grab from sourcenet.
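
A rough sketch of the shape of it, with assumed table and field names:

  <dataConfig>
    <dataSource type="JdbcDataSource"
                driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                url="jdbc:sqlserver://localhost;databaseName=mydb"
                user="user" password="password"/>
    <document>
      <entity name="item" query="SELECT id, name, description FROM items">
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="description" name="description"/>
      </entity>
    </document>
  </dataConfig>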

> <dataConfig>
>   <dataSource driver="oracle.jdbc.driver.OracleDriver"
>               url="jdbc:oracle:thin:@localhost:1521:xe"
>               user="user"
>               password="password"
>               name="ds"/>
>   <document>
>     <entity name="clobxml" dataSource="ds" query="..."
>             transformer="ClobTransformer">
>       <field column="SUPPLIER_APPROVALS" clob="true"/>
>       <entity name="..." processor="XPathEntityProcessor" forEach="/suppliers/supplier"
>               dataField="clobxml.SUPPLIER_APPROVALS" onError="continue">
>         <field column="..." xpath="..." />
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
> 
> 
> -
> Thanx:
> Grijesh
> http://lucidimagination.com
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-use-XML-parser-in-DIH-for-a-database-tp2508015p2515910.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to use XML parser in DIH for a database?

2011-02-17 Thread Estrada Groups
I was also gonna say: why even worry about using XPath when you can write a SQL 
query to get your data out? That's what I did, and it seems much simpler and 
cuts out a step.

Adam

Sent from my iPhone

On Feb 16, 2011, at 10:21 PM, Bill Bell  wrote:

> Does anyone have an example of using this with SQL Server varchar or XML
> field?
> 
> ??
> 
> 
> <entity name="y" query="...">
>   <entity processor="XPathEntityProcessor"
>           forEach="/the/record/xpath" url="${y.xml_name}">
>     <field column="..." xpath="..." />
>   </entity>
> </entity>
> 
> 
> 
> 
> On 2/16/11 2:17 AM, "Stefan Matheis"  wrote:
> 
>> What about using
>> http://wiki.apache.org/solr/DataImportHandler#XPathEntityProcessor ?
>> 
>> On Wed, Feb 16, 2011 at 10:08 AM, Bill Bell  wrote:
>>> I am using DIH.
>>> 
>>> I am trying to take a column in a SQL Server database that returns an
>>> XML
>>> string and use Xpath to get data out of it.
>>> 
>>> I noticed that Xpath works with external files, how do I get it to work
>>> with
>>> a database?
>>> 
>>> I need something like "//insur[5][@name='Blue Cross']"
>>> 
>>> Thanks.
>>> 
>>> 
>>> 
> 
> 


Building queries for SolR

2011-02-17 Thread Vincent Cautaerts
Hi,

I'm porting/upgrading a project from Lucene to Solr.

In Lucene, I was using the user-provided Lucene query string, and I
completed it to implement access restrictions, based on fields saved in the
index:

Query userQ=...      // query from user
Query restrictQ=...  // implement restrictions
BooleanQuery finalQ=new BooleanQuery();
finalQ.add(userQ,BooleanClause.Occur.MUST);
finalQ.add(restrictQ,BooleanClause.Occur.MUST);

Searching on 'finalQ' gives only the results which can be shown to the user.


Since Solr doesn't let me build queries programmatically, how can I do
something equivalent?

Do we always have to build strings to make a query in Solr? Is there really
no equivalent to the Lucene API for building a query, using BooleanQuery,
TermQuery,... ?

Thanks for your help!


Re: Building queries for SolR

2011-02-17 Thread Erik Hatcher
Vincent,

Look at Solr's fq (filter query) capability.  You'll likely want to put your 
restricting query in an fq parameter from your search client.

If your restricting query is a simple TermQuery, have a look at the various 
built-in query parsers in Solr.  On trunk you can do this: &fq={!term 
f=restriction_field}value, or in other versions look at the raw or field query 
parsers.

Or you can use Lucene query syntax in an fq parameter and make more 
sophisticated expressions (at the risk of query parsing exceptions, of course).
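
With SolrJ it would look roughly like this ( a sketch; "server" is an
existing SolrServer instance, and the field name is an assumption ):

  SolrQuery q = new SolrQuery(userInput);           // query from user
  q.addFilterQuery("restriction_field:value");      // access restriction
  QueryResponse rsp = server.query(q);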

Erik



On Feb 17, 2011, at 09:09 , Vincent Cautaerts wrote:

> Hi,
> 
> I'm porting/upgrading a project from Lucene to Solr.
> 
> In Lucene, I was using the user-provided Lucene query string, and I did
> complete it to implement access restriction, based on fields saved in the
> index:
> 
> Query userQ=... // query from user
> Query restrictQ=.. // implement restrictions
> BooleanQuery finalQ=new BooleanQuery();
> finalQ.add(userQ,BooleanClause.Occur.MUST);
> finalQ.add(restrictQ,BooleanClause.Occur.MUST);
> 
> Searching on 'finalQ' gives only the results which can be shown to the user.
> 
> 
> Since Solr doesn't have programmatically build queries, how can I do
> something equivalent?
> 
> Do we always have to build strings to make a query in SolR? Is there really
> no equivalent to the Lucene API to build a query, using BooleanQuery,
> TermQuery,... ?
> 
> Thanks for your help!



Re: last item in results page is always the same

2011-02-17 Thread Paul
Thanks, going to update now. This is a system that is currently
deployed. Should I just update to 1.4.1, or should I go straight to
3.0? Does 1.4 => 3.0 require reindexing?

On Wed, Feb 16, 2011 at 5:37 PM, Yonik Seeley
 wrote:
> On Wed, Feb 16, 2011 at 5:08 PM, Paul  wrote:
>> Is this a known solr bug or is there something subtle going on?
>
> Yes, I think it's the following bug, fixed in 1.4.1:
>
> * SOLR-1777: fieldTypes with sortMissingLast=true or sortMissingFirst=true can
>  result in incorrectly sorted results.
>
> -Yonik
> http://lucidimagination.com
>


Re: last item in results page is always the same

2011-02-17 Thread Markus Jelsma
It's fixed in 1.4.1.
https://issues.apache.org/jira/browse/SOLR-1777

On Thursday 17 February 2011 16:04:18 Paul wrote:
> Thanks, going to update now. This is a system that is currently
> deployed. Should I just update to 1.4.1, or should I go straight to
> 3.0? Does 1.4 => 3.0 require reindexing?
> 
> On Wed, Feb 16, 2011 at 5:37 PM, Yonik Seeley
> 
>  wrote:
> > On Wed, Feb 16, 2011 at 5:08 PM, Paul  wrote:
> >> Is this a known solr bug or is there something subtle going on?
> > 
> > Yes, I think it's the following bug, fixed in 1.4.1:
> > 
> > * SOLR-1777: fieldTypes with sortMissingLast=true or
> > sortMissingFirst=true can result in incorrectly sorted results.
> > 
> > -Yonik
> > http://lucidimagination.com

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: last item in results page is always the same

2011-02-17 Thread Erik Hatcher
Paul - go with 1.4.1 in this case.  

Keep tabs on the upcoming 3.1 release (of both Lucene and Solr) and consider 
that in a month or so.

Erik

On Feb 17, 2011, at 10:04 , Paul wrote:

> Thanks, going to update now. This is a system that is currently
> deployed. Should I just update to 1.4.1, or should I go straight to
> 3.0? Does 1.4 => 3.0 require reindexing?
> 
> On Wed, Feb 16, 2011 at 5:37 PM, Yonik Seeley
>  wrote:
>> On Wed, Feb 16, 2011 at 5:08 PM, Paul  wrote:
>>> Is this a known solr bug or is there something subtle going on?
>> 
>> Yes, I think it's the following bug, fixed in 1.4.1:
>> 
>> * SOLR-1777: fieldTypes with sortMissingLast=true or sortMissingFirst=true 
>> can
>>  result in incorrectly sorted results.
>> 
>> -Yonik
>> http://lucidimagination.com
>> 



Re: last item in results page is always the same

2011-02-17 Thread Yonik Seeley
On Thu, Feb 17, 2011 at 10:04 AM, Paul  wrote:
> Thanks, going to update now. This is a system that is currently
> deployed. Should I just update to 1.4.1, or should I go straight to
> 3.0? Does 1.4 => 3.0 require reindexing?

There is no 3.0 - that release happened before the Lucene/Solr merge,
hence there is no corresponding Solr version.
We're working on the 3.1 release now (hopefully out within a month).

-Yonik
http://lucidimagination.com


RE: Solr multi cores or not

2011-02-17 Thread Thumuluri, Sai
We have 3 applications and they need to have different relevancy models,
synonyms, stop words etc. 

App A - content size - 20 GB - MySQL and Drupal based app
App B - # of documents ~ 400K; index size ~ 25 GB - primarily a portal
with links to different applications, data sources include crawl pages
and db sources
App C - PeopleSoft based application - underlying Oracle DB ~ content
size ~ 10 GB 

App A - approx 60k hits/week
App B - approx 1 million hits/week
App C - approx 250k hits/wk

Frequency of updates
App A - near real time indexing - every 20 minutes
App B - every 2 hours
App C - daily

All applications need personalization based on application-specific
business rules. Yes, we must enforce security, and the clients are under
our control.

The reason our server (a virtual machine) was configured that way is that
when we first installed, we were told to throw a lot of memory at Solr.
App A runs on our production server and hardly does anything to it -
our CPUs are at less than 4% and our memory is hardly troubled.

Our business need now is that all three apps want to use Solr for
their search needs, with the ability to share indexes. I need to not
only separate the indexes, but also selectively query across the
applications.
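
( For reference, this is the kind of multi-core solr.xml I have in mind -
core names are placeholders - with cross-application queries done via the
shards parameter:

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="appA" instanceDir="appA"/>
      <core name="appB" instanceDir="appB"/>
      <core name="appC" instanceDir="appC"/>
    </cores>
  </solr>

  http://localhost:8983/solr/appA/select?q=...&shards=localhost:8983/solr/appA,localhost:8983/solr/appB
)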
 
-Original Message-
From: Markus Jelsma [mailto:markus.jel...@openindex.io] 
Sent: Wednesday, February 16, 2011 6:25 PM
To: solr-user@lucene.apache.org
Cc: Thumuluri, Sai
Subject: Re: Solr multi cores or not

Hi,

That depends (as usual) on your scenario. Let me ask some questions:

1. what is the sum of documents for your applications?
2. what is the expected load in queries/minute
3. what is the update frequency in documents/minute and how many
documents per 
commit?
4. how many different applications do you have?
5. are the query demands for the business the same (or very similar) for
all 
applications?
6. can you easily upgrade hardware or demand more machines?
7. must you enforce security between applications and are the clients
not 
under your control?

I'm puzzled though, you have so much memory but so little CPU. What
about the 
disks? Size? Spinning or SSD?

Cheers,

> Hi,
> 
> I have a need to index multiple applications using Solr, I also have
the
> need to share indexes or run a search query across these application
> indexes. Is solr multi-core - the way to go?  My server config is
> 2virtual CPUs @ 1.8 GHz and has about 32GB of memory. What is the
> recommendation?
> 
> Thanks,
> Sai Thumuluri


Re: My Plan to Scale Solr

2011-02-17 Thread Dennis Gearon
What's an 'LSA'?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.





From: Stijn Vanhoorelbeke 
To: solr-user@lucene.apache.org; bing...@asu.edu
Sent: Thu, February 17, 2011 4:28:13 AM
Subject: Re: My Plan to Scale Solr

Hi,

I'm currently looking at SolrCloud. I've managed to set up a scalable
cluster with ZooKeeper.
( see the examples in http://wiki.apache.org/solr/SolrCloud for a quick
understanding )
This way, all different shards / replicas are stored in a centralised
configuration.

Moreover the ZooKeeper contains out-of-the-box loadbalancing.
So, lets say - you have 2 different shards and each is replicated 2 times.
Your zookeeper config will look like this:

\config
...
   /live_nodes (v=6 children=4)
  lP_Port:7500_solr (ephemeral v=0)
  lP_Port:7574_solr (ephemeral v=0)
  lP_Port:8900_solr (ephemeral v=0)
  lP_Port:8983_solr (ephemeral v=0)
 /collections (v=20 children=1)
  collection1 (v=0 children=1) "configName=myconf"
   shards (v=0 children=2)
shard1 (v=0 children=3)
 lP_Port:8983_solr_ (v=4)
"node_name=lP_Port:8983_solr url=http://lP_Port:8983/solr/";
 lP_Port:7574_solr_ (v=1)
"node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";
 lP_Port:8900_solr_ (v=1)
"node_name=lP_Port:8900_solr url=http://lP_Port:8900/solr/";
shard2 (v=0 children=2)
 lP_Port:7500_solr_ (v=0)
"node_name=lP_Port:7500_solr url=http://lP_Port:7500/solr/";
 lP_Port:7574_solr_ (v=1)
"node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";

--> This setup can be realised, by 1 ZooKeeper module - the other solr
machines need just to know the IP_Port were the zookeeper is active & that's
it.
--> So no configuration / installing is needed to realise quick a scalable /
load balanced cluster.

Disclaimer:
ZooKeeper is a relative new feature - I'm not sure if it will work out in a
real production environment, which has a tight SLA pending.
But - definitely keep your eyes on this stuff - this will mature quickly!

Stijn Vanhoorelbeke


Re: My Plan to Scale Solr

2011-02-17 Thread Walter Underwood
http://lmgtfy.com/?q=SLA

wunder

On Feb 17, 2011, at 11:04 AM, Dennis Gearon wrote:

> What's an 'LSA'
> 
> Dennis Gearon
> 
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
> better 
> idea to learn from others’ mistakes, so you do not have to make them 
> yourself. 
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> 
> EARTH has a Right To Life,
> otherwise we all die.
> 
> 
> 
> 
> 
> From: Stijn Vanhoorelbeke 
> To: solr-user@lucene.apache.org; bing...@asu.edu
> Sent: Thu, February 17, 2011 4:28:13 AM
> Subject: Re: My Plan to Scale Solr
> 
> Hi,
> 
> I'm currently looking at SolrCloud. I've managed to set up a scalable
> cluster with ZooKeeper.
> ( see the examples in http://wiki.apache.org/solr/SolrCloud for a quick
> understanding )
> This way, all different shards / replicas are stored in a centralised
> configuration.
> 
> Moreover the ZooKeeper contains out-of-the-box loadbalancing.
> So, lets say - you have 2 different shards and each is replicated 2 times.
> Your zookeeper config will look like this:
> 
> \config
> ...
>   /live_nodes (v=6 children=4)
>  lP_Port:7500_solr (ephemeral v=0)
>  lP_Port:7574_solr (ephemeral v=0)
>  lP_Port:8900_solr (ephemeral v=0)
>  lP_Port:8983_solr (ephemeral v=0)
> /collections (v=20 children=1)
>  collection1 (v=0 children=1) "configName=myconf"
>   shards (v=0 children=2)
>shard1 (v=0 children=3)
> lP_Port:8983_solr_ (v=4)
> "node_name=lP_Port:8983_solr url=http://lP_Port:8983/solr/";
> lP_Port:7574_solr_ (v=1)
> "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";
> lP_Port:8900_solr_ (v=1)
> "node_name=lP_Port:8900_solr url=http://lP_Port:8900/solr/";
>shard2 (v=0 children=2)
> lP_Port:7500_solr_ (v=0)
> "node_name=lP_Port:7500_solr url=http://lP_Port:7500/solr/";
> lP_Port:7574_solr_ (v=1)
> "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";
> 
> --> This setup can be realised, by 1 ZooKeeper module - the other solr
> machines need just to know the IP_Port were the zookeeper is active & that's
> it.
> --> So no configuration / installing is needed to realise quick a scalable /
> load balanced cluster.
> 
> Disclaimer:
> ZooKeeper is a relative new feature - I'm not sure if it will work out in a
> real production environment, which has a tight SLA pending.
> But - definitely keep your eyes on this stuff - this will mature quickly!
> 
> Stijn Vanhoorelbeke

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto





GET or POST for large queries?

2011-02-17 Thread mrw

We are running into some issues with large queries.  Initially, they were
ostensibly header buffer overruns, because increasing Jetty's
headerBufferSize value to 65536 resolved them. This seems like a kludge, but
it does solve the problem for 95% of our users.

However, we do have queries that are physically larger than that and for
which increasing the headerBufferSize to 65536 does not work.  This is due
to security requirements:  Security descriptors are baked into the index,
and then potentially thousands of them (depending on the user context) are
passed in with each query.  These excessive queries are only a problem with
approximately 5% of users who are highly entitled, but the number of
security descriptors is likely to increase and we won't have a
workaround for this security policy any time soon.

After a lot of Googling, it seems to me that it's common to increase the
headerBufferSize, but I don't see any other strategies.  Is it
possible/feasible to switch to use POST for querying?

Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/GET-or-POST-for-large-queries-tp2521700p2521700.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Question About Highlighting

2011-02-17 Thread Ahmet Arslan
> I had a requirement to implement phrase proximity like ["a
> b c" w/5 "d e f"] for 
> this i have implemented a custom query parser plug in which
> I make use of nested 
> span queries to fulfill this requirement. Now it looks that
> documents are 
> filtered correctly, but there is an issue in highlighting
> that also highlights 
> the terms that are alone(not in phrase), can some body
> suggest me a fix to this 
> issue 
> 

Appending &hl.usePhraseHighlighter=true should work.


  


Re: GET or POST for large queries?

2011-02-17 Thread Erik Hatcher
Yes, you may use POST to make search requests to Solr.
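
With SolrJ ( a 1.4-era sketch; the URL is a placeholder ):

  QueryResponse search(String veryLongQuery) throws Exception {
      CommonsHttpSolrServer server =
          new CommonsHttpSolrServer("http://localhost:8983/solr");
      // send the request body via POST instead of the query string
      return server.query(new SolrQuery(veryLongQuery), SolrRequest.METHOD.POST);
  }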

Erik

On Feb 17, 2011, at 14:27 , mrw wrote:

> 
> We are running into some issues with large queries.  Initially, they were
> ostensibly header buffer overruns, because increasing Jetty's
> headerBufferSize value to 65536 resolved them. This seems like a kludge, but
> it does solve the problem for 95% of our users.
> 
> However, we do have queries that are physically larger than that and for
> which increasing the headerBufferSize to 65536 does not work.  This is due
> to security requirements:  Security descriptors are baked into the index,
> and then potentially thousands of them (depending on the user context) are
> passed in with each query.  These excessive queries are only a problem with
> approximately 5% of users who are highly entitled, but the number of
> security descriptors is likely to increase and we won't have a
> workaround for this security policy any time soon.
> 
> After a lot of Googling, it seems to me that it's common to increase the
> headerBufferSize, but I don't see any other strategies.  Is it
> possible/feasible to switch to use POST for querying?
> 
> Thanks!
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/GET-or-POST-for-large-queries-tp2521700p2521700.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: GET or POST for large queries?

2011-02-17 Thread mrw

Yeah, I tried switching to POST.

It seems to be handling the size, but apparently Solr has a limit on the
number of boolean clauses -- I'm now getting "too many boolean clauses"
errors emanating from

org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108).
 
:)


Thanks for responding.



Erik Hatcher-4 wrote:
> 
> Yes, you may use POST to make search requests to Solr.
> 
>   Erik
> 
> 

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/GET-or-POST-for-large-queries-tp2521700p2522293.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: GET or POST for large queries?

2011-02-17 Thread Jonathan Rochkind
Yes, I think it's 1024 by default.  I think you can raise it in your 
config. But your performance may suffer.


Best would be to try and find a better way to do what you want without 
using thousands of clauses. This might require some custom Java plugins 
to Solr though.
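
( The knob is maxBooleanClauses in solrconfig.xml - the value below is just
an example:

  <maxBooleanClauses>4096</maxBooleanClauses>
)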


On 2/17/2011 3:52 PM, mrw wrote:

Yeah, I tried switching to POST.

It seems to be handling the size, but apparently Solr has a limit on the
number of boolean clauses -- I'm now getting "too many boolean clauses"
errors emanating from

org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108).
:)


Thanks for responding.



Erik Hatcher-4 wrote:

Yes, you may use POST to make search requests to Solr.

Erik




Re: GET or POST for large queries?

2011-02-17 Thread Dennis Gearon
Probably you could do it, and solving a problem in business supersedes 
'rightness' concerns, much to the dismay of geeks and 'those who like rightness 
and say the word "Neemph!" '. 


The un-rightness here is that:
POST, PUT, and DELETE are assumed to make changes to the URL's backend.
GET is assumed NOT to make changes.

So if your POST does not make a change . . . it breaks convention. But if it 
solves the problem . . . :-)

Another way would be to GET with a 'query file' location, and then have the 
server fetch that query and execute it.

Boy!!! I'd love to see one of your queries!!! You must have a few ANDs/ORs in 
them :-)

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.





From: mrw 
To: solr-user@lucene.apache.org
Sent: Thu, February 17, 2011 11:27:06 AM
Subject: GET or POST for large queries?


We are running into some issues with large queries.  Initially, they were
ostensibly header buffer overruns, because increasing Jetty's
headerBufferSize value to 65536 resolved them. This seems like a kludge, but
it does solve the problem for 95% of our users.

However, we do have queries that are physically larger than that and for
which increasing the headerBufferSize to 65536 does not work.  This is due
to security requirements:  Security descriptors are baked into the index,
and then potentially thousands of them (depending on the user context) are
passed in with each query.  These excessive queries are only a problem with
approximately 5% of users who are highly entitled, but the number of
security descriptors is likely to increase and we won't have a
workaround for this security policy any time soon.

After a lot of Googling, it seems to me that it's common to increase the
headerBufferSize, but I don't see any other strategies.  Is it
possible/feasible to switch to use POST for querying?

Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/GET-or-POST-for-large-queries-tp2521700p2521700.html

Sent from the Solr - User mailing list archive at Nabble.com.


solr.KeepWordsFilterFactory confusion

2011-02-17 Thread Robert Haschart
I have a Solr index where certain facet fields should only contain one 
or more items from a limited list of values.  To enforce this 
restriction at index time I have been looking at using a 
KeepWordFilterFactory.  It seems it ought to work as I have it 
implemented, and it actually seems to work when tested through the admin 
analysis page, but when I index a document with that filter in place, 
values that ought to be filtered out aren't.  (I am running the Solr 1.4 
release)


I've added a new field type in schema.xml:

  sortMissingLast="true" omitNorms="true">

 
  
words="format_facet.txt" ignoreCase="false" />

 
   

placed a file format_facet.txt in the conf directory containing:

Book
Online
Microform
Journal/Magazine
Musical Score
Musical Recording
Thesis/Dissertation
Video
Streaming Video
Software/Multimedia
Photographs
Cassette

referenced this new field type with a field declaration in schema.xml

   stored="true" multiValued="true" />


also have this dynamic field, but this seems irrelevant:

   stored="true" multiValued="true" omitNorms="true" />



restarted the jetty server running the solr server.

and submitted a solr add document containing

format_facet=format_facet(1.0)={[Video, Streaming Video, Online, 
Gooberhead, Book of the Month]}


Of these values only Video, Streaming Video and Online ought to end up 
in the index; however, all five values end up as format_facet values for 
the Solr item in question:

<arr name="format_facet">
   <str>Video</str>
   <str>Streaming Video</str>
   <str>Online</str>
   <str>Gooberhead</str>
   <str>Book of the Month</str>
</arr>


I must be missing something fairly basic, since this doesn't seem 
especially complicated.


Thanks in advance for any assistance,

-Bob Haschart


Re: solr.KeepWordsFilterFactory confusion

2011-02-17 Thread Ahmet Arslan
> I've added a new field type in schema.xml:
> 
>   <fieldType name="..." class="solr.StrField" sortMissingLast="true"
>              omitNorms="true">
>     <analyzer>
>       <tokenizer class="solr.KeywordTokenizerFactory"/>
>       <filter class="solr.KeepWordFilterFactory" words="format_facet.txt"
>               ignoreCase="false" />
>     </analyzer>
>   </fieldType>
> 

class="solr.StrField" should be class="solr.TextField"





Re: SolrCloud - Example C not working

2011-02-17 Thread Yonik Seeley
FYI, this should be fixed in the (very) latest trunk.

-Yonik
http://lucidimagination.com


Date Math

2011-02-17 Thread Andreas Kemkes
The SolrQuerySyntax Wiki page refers to DateMathParser for examples.

When I tried "-1DAY", I got:

org.apache.lucene.queryParser.ParseException: Cannot parse 'last_modified:-DAY':
Encountered "-" at line 1, column 14.
Was expecting one of: "(" ... "*" ... <QUOTED> ... <TERM> ... <PREFIXTERM> ...
<WILDTERM> ... "[" ... "{" ... <NUMBER> ...

Are they not supported as a short-cut for "NOW-1DAY"?  I'm using Solr 1.4.
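
( For reference, the anchored form does parse:

  last_modified:[NOW-1DAY TO NOW]
)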


  

Index Design Question

2011-02-17 Thread Andreas Kemkes
We are indexing documents with several associated fields for search and
display, some of which may change with a much higher frequency than the
document content. As per my understanding, we have to resubmit the entire
set of fields with every update.

If the reindexing of the documents becomes a performance bottleneck, what
design alternatives are there within Solr?

Thanks in advance for your contributions.


  

Re: TermVector query using Solr Tutorial

2011-02-17 Thread Chris Hostetter

: I am searching the keyword 25, in the field
: 
: 30" TFT active matrix LCD, 2560 x 1600, .25mm
: dot pitch, 700:1 contrast
: 
: I want to know the character position of matched keyword in the
: corresponding field.
: 
: usb or cabl is not what I want.

your search is getting a match on the features field, but the termvectors 
being returned are from the "includes" field, which you can see based on 
the output you mentioned in your previous message...

> [term vector output for doc 3007WFP - the XML tags were stripped by the
> list archive; the term vectors shown all come from the "includes" field]

...by the looks of things, the "includes" field is the only field with 
termVectors enabled in your schema.xml (which is consistent with the 
trunk, 3x, and Solr 1.4 example schemas). 

if you want termVectors for the "features" field, you need to specify 
termVectors="true" on the "features" field.




-Hoss


Re: Index Design Question

2011-02-17 Thread kenf_nc

Some options to reduce performance implications are:
   replication... index your documents in one Solr instance, and query in a
different one. That way the users of the query side will not be as adversely
impacted by frequent changes, and you have better control over when change
occurs.

  separate search from display...one mistake I see a lot is people putting
everything into Solr. Solr is optimized for search, therefore it sometimes
makes sense to only put into a solr index those fields you are searching
against. In some architectures this leaves a large amount of data that can
be stored somewhere else, an RDBMS, a file system, a third party host,
whatever. You search on Solr, and use some identifier to get the rest of the
data from somewhere else. That way, only changes to searchable fields need
to be indexed, the rest just need to be stored. It could minimize the impact
on your Solr documents.

   Multi-threading...usually any performance bottleneck is on the sending
side, not the Solr side. Solr handles multiple data inputs gracefully. Be
very aware of how many commits you are doing, and what kind of warming
queries you have in place. Those are the biggest performance issues from
what I've seen. Having 2 Solr instances, one optimized for Indexing (the
master) and one optimized for querying (the slave) with replication, would
help minimize the problem.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-Design-Question-tp2523811p2524424.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is solr dynamic calculation??

2011-02-17 Thread satya swaroop
Hi Markus,
As far as I have gone through the scoring of Solr, the scoring is
done during searching, using the boost values which were given during
indexing.
I have a question now: if I search for the keyword java, then
1) if the term "java" in the index matches 50,000 documents, does Solr
calculate the score value for each and every document, filter them, and
then sort and serve the results??? If it does this dynamic calculation
for each and every document then it takes a long time, so how does Solr
reduce it??
 Am I right??? Or if anything is wrong please tell me???

Regards,
satya


Re: My Plan to Scale Solr

2011-02-17 Thread Lance Norskog
Or even better, search with 'LSA'.

On Thu, Feb 17, 2011 at 9:22 AM, Walter Underwood  wrote:
> http://lmgtfy.com/?q=SLA
>
> wunder
>
> On Feb 17, 2011, at 11:04 AM, Dennis Gearon wrote:
>
>> What's an 'LSA'
>>
>> Dennis Gearon
>>
>>
>> Signature Warning
>> 
>> It is always a good idea to learn from your own mistakes. It is usually a 
>> better
>> idea to learn from others’ mistakes, so you do not have to make them 
>> yourself.
>> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>>
>>
>> EARTH has a Right To Life,
>> otherwise we all die.
>>
>>
>>
>>
>> 
>> From: Stijn Vanhoorelbeke 
>> To: solr-user@lucene.apache.org; bing...@asu.edu
>> Sent: Thu, February 17, 2011 4:28:13 AM
>> Subject: Re: My Plan to Scale Solr
>>
>> Hi,
>>
>> I'm currently looking at SolrCloud. I've managed to set up a scalable
>> cluster with ZooKeeper.
>> ( see the examples in http://wiki.apache.org/solr/SolrCloud for a quick
>> understanding )
>> This way, all different shards / replicas are stored in a centralised
>> configuration.
>>
>> Moreover the ZooKeeper contains out-of-the-box loadbalancing.
>> So, lets say - you have 2 different shards and each is replicated 2 times.
>> Your zookeeper config will look like this:
>>
>> \config
>> ...
>>   /live_nodes (v=6 children=4)
>>          lP_Port:7500_solr (ephemeral v=0)
>>          lP_Port:7574_solr (ephemeral v=0)
>>          lP_Port:8900_solr (ephemeral v=0)
>>          lP_Port:8983_solr (ephemeral v=0)
>>     /collections (v=20 children=1)
>>          collection1 (v=0 children=1) "configName=myconf"
>>               shards (v=0 children=2)
>>                    shard1 (v=0 children=3)
>>                         lP_Port:8983_solr_ (v=4)
>> "node_name=lP_Port:8983_solr url=http://lP_Port:8983/solr/";
>>                         lP_Port:7574_solr_ (v=1)
>> "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";
>>                         lP_Port:8900_solr_ (v=1)
>> "node_name=lP_Port:8900_solr url=http://lP_Port:8900/solr/";
>>                    shard2 (v=0 children=2)
>>                         lP_Port:7500_solr_ (v=0)
>> "node_name=lP_Port:7500_solr url=http://lP_Port:7500/solr/";
>>                         lP_Port:7574_solr_ (v=1)
>> "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/";
>>
>> --> This setup can be realised, by 1 ZooKeeper module - the other solr
>> machines need just to know the IP_Port were the zookeeper is active & that's
>> it.
>> --> So no configuration / installing is needed to realise quick a scalable /
>> load balanced cluster.
>>
>> Disclaimer:
>> ZooKeeper is a relative new feature - I'm not sure if it will work out in a
>> real production environment, which has a tight SLA pending.
>> But - definitely keep your eyes on this stuff - this will mature quickly!
>>
>> Stijn Vanhoorelbeke
>
> --
> Walter Underwood
> Venture ASM, Troop 14, Palo Alto
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: My Plan to Scale Solr

2011-02-17 Thread Grijesh

It's just a joke?

-
Thanx:
Grijesh
http://lucidimagination.com
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/My-Plan-to-Scale-Solr-tp2516904p2524700.html
Sent from the Solr - User mailing list archive at Nabble.com.