unloading a solr core doesn't free any memory

2014-11-17 Thread Ofer Fort
Getting a lot of those today.
Is it all from the same site we saw last week?

OFER FORT
Head of R&D

437 Fifth Avenue 9th floor, New York, NY 10016
cell: ISR +972-54-5678339  US +1 212 738 9594 ext 34
skype: oferfort
tracx
social intelligence
www.tracx.com



public constructor for KStemmer

2011-09-22 Thread Ofer Fort
Hey all,
I was very happy to see that the KStemmer implementation was added to Lucene.
I was wondering why the constructor is not public.
I have a case where I want to create an analyzer that uses the stemmer
itself, and in order to construct a new instance, my code has to be in the
same package and be loaded by the same classloader.
I know I can just change the source file, or add my jar to the solr.war, but
I was wondering if there is a reason why this constructor was not made
public in the class.
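
For the record, a minimal sketch of the same-package workaround described
above, assuming Lucene 3.x, where KStemmer and its no-arg constructor are
package-private in org.apache.lucene.analysis.en (the accessor class name
is made up):

    // Must be compiled into the same package, and loaded by the same
    // classloader, as KStemmer to reach its package-private constructor.
    package org.apache.lucene.analysis.en;

    public final class KStemmerAccessor {
      private KStemmerAccessor() {}

      // Hands out new KStemmer instances to code outside this package.
      public static KStemmer newKStemmer() {
        return new KStemmer();
      }
    }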

thanks
ofer


Re: core sleep/wake

2012-05-01 Thread Ofer Fort
My random searches can be a bit slow on startup, so I would still like to
get that lazy load but have more cores available.
I'm actually trying the LotsOfCores way of handling things now.
It took some work to get the patch suitable for 3.5, but it seems to be
doing what I need.
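
For anyone searching later: a sketch of the solr.xml attributes the
LotsOfCores work uses in its Solr 4.x form (the 3.5 patch may spell these
differently; the core name and cache size here are illustrative):

    <solr persistent="true">
      <cores adminPath="/admin/cores" transientCacheSize="8">
        <!-- loaded lazily on first request, evicted when the transient
             cache is full -->
        <core name="rarely_used" instanceDir="rarely_used"
              transient="true" loadOnStartup="false"/>
      </cores>
    </solr>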


On Tue, May 1, 2012 at 2:31 PM, Erick Erickson wrote:

> Well, that'll be kinda self-defeating. The whole point of auto-warming
> is to fill up the caches, consuming memory. Without that, searches
> will be slow. So the idea of using minimal resources is really
> antithetical to having these in-memory structures filled up.
>
> You can try configuring minimal caches, etc. Or just give it
> lots of memory and count on your OS to swap the pages out
> if the particular core doesn't get used.
>
> Best
> Erick
>
> On Mon, Apr 30, 2012 at 5:18 PM, oferiko  wrote:
> > I have a multicore Solr with a lot of cores that contain a lot of data
> > (~50M documents) but are rarely used.
> > Can I load a core from configuration but keep it in sleep mode, where
> > it has all the configuration available yet hardly consumes resources,
> > and, based on a query or an update, it will "come to life"?
> > Thanks
> >
> >
> >
>
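
A sketch of the "minimal caches" route Erick mentions, for a rarely-used
core's solrconfig.xml (the sizes are illustrative):

    <query>
      <!-- tiny caches, no autowarming: the core stays cheap while idle -->
      <filterCache class="solr.FastLRUCache" size="16" initialSize="0" autowarmCount="0"/>
      <queryResultCache class="solr.LRUCache" size="16" initialSize="0" autowarmCount="0"/>
      <documentCache class="solr.LRUCache" size="16" initialSize="0" autowarmCount="0"/>
    </query>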


Re: Master/Slave High CPU Usage

2010-11-17 Thread Ofer Fort
Hi, I'm working with Erez.
We experienced this again, and this time the slave index folder didn't
contain the index.XXX folder, only one index folder.
If we shut down the slave, the CPU on the master was normal; as soon as we
started the slave again, the CPU went up to 100% again.
Thanks for any help,
ofer

On Wed, Nov 17, 2010 at 11:15 AM, Erez Zarum  wrote:

> Hi all,
> We've been seeing this for the second time already.
> I have a Solr (1.4.1) master and a slave. Both are located on the same
> machine (16GB RAM, 4GB allocated to the slave and 3GB to the master).
> All our updates go to the master, and all the queries go to the slave.
> Once in a while the slave gets an OutOfMemoryError. This is not the big
> problem (I have about 100M documents).
> The problem is that from that moment the CPU of the slave AND the master is
> almost 100%.
> If I shut down the slave, the CPU of the master drops.
> If I start the slave again, the CPU is 100% again.
> I have the replication set on commit and startup.
> I see that the data folder contains three index folders: index,
> index.XXXYYY and index.XXXYYY.ZZZ.
>
> The only way I was able to get past it (it has worked twice already) is to
> shut down the two servers, copy the whole index from the master to the
> slave, and start them again.
> From that moment on, they continue to work and replicate with very
> reasonable CPU usage.
>
> Our guess is that it failed to replicate due to the OOM and it has since
> been trying to do a full replication again and again?
> But why is the CPU of the master so high?
>
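
For reference, a sketch of the Solr 1.4 master-side replication config the
message describes ("set on commit and startup"); the confFiles list is
illustrative:

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>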


Re: Master/Slave High CPU Usage

2010-11-17 Thread Ofer Fort
anybody?



Re: Master/Slave High CPU Usage

2010-11-19 Thread Ofer Fort
That sounds like a great option, and it will also free some storage space on
that server (right now each index is about 130GB).
Other than the lock policy (we use single), anything else to look out for?
Thanks


On Nov 19, 2010, at 05:30, Lance Norskog wrote:

If they are on the same server, you do not need to replicate.

If you only do queries, the query server can use the same index
directory as the master. Works quite well. Both have to have the same
LockPolicy in solrconfig.xml. For security reasons, I would run the
query server as a different user who has read-only access to the
index; that way it cannot touch the index.
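
A sketch of the lock-policy agreement Lance mentions: both instances'
solrconfig.xml must name the same lockType (this thread uses "single"):

    <mainIndex>
      <lockType>single</lockType>
    </mainIndex>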



Re: Master/Slave High CPU Usage

2010-11-20 Thread Ofer Fort
Another question on that configuration: when the "master" commits, how does
the "slave" know that the index has changed? Does it check the index and
find out that it has a newer version?
Thanks again for the help,
Ofer





Re: Master/Slave High CPU Usage

2010-11-20 Thread Ofer Fort
thanks Erick,
but my question was regard the configuration Lance suggested, a
configuration where i have two servers, set set logical master and slave,
not as a true replication. Since both are running on the same machine, just
have one only doing updates, and the other only queries, but both are using
the same index files.

Ofer


On Sat, Nov 20, 2010 at 8:52 PM, Erick Erickson wrote:

> The slave polls. See: http://wiki.apache.org/solr/SolrReplication
>
> Best
> Erick
>


Re: Master/Slave High CPU Usage

2010-11-20 Thread Ofer Fort
OK,
so to make sure I understand: even though the "slave" doesn't do any
indexing, I will call commit and it will do nothing to the index itself, but
will reload it?
thanks

On Sun, Nov 21, 2010 at 8:26 AM, Lance Norskog  wrote:

> Ah! If the program doing the indexing has manual commits, the program
> could send a commit to the slave. If the indexer uses automatic
> commits, there is a trick: you can add a program as a postCommit event
> in solrconfig.xml. This can just be a shell script or a curl command
> that sends a commit to the slave Solr.
>
> Be sure to make all of the wait options false to this command; you
> don't want the master to block while the slave loads up the new index.
> Or, to control the maximum load on your server, you might actually
> want to make the master wait while the slave loads up.
>
> Lance
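
A sketch of the postCommit hook Lance describes, for the master's
solrconfig.xml; the script path and the slave's port are made up, and the
script just sends a non-blocking commit to the slave:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- runs after every commit on the master -->
      <listener event="postCommit" class="solr.RunExecutableListener">
        <str name="exe">/opt/solr/bin/commit-slave.sh</str>
        <str name="dir">.</str>
        <bool name="wait">false</bool>
      </listener>
    </updateHandler>

    #!/bin/sh
    # commit-slave.sh: tell the read-only slave to reopen the shared index,
    # without blocking on flush or on the new searcher warming up.
    curl -s 'http://localhost:8984/solr/update?commit=true&waitFlush=false&waitSearcher=false'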


Re: Master/Slave High CPU Usage

2010-11-21 Thread Ofer Fort
OK, I'll try that and update the group.
thanks

On Sun, Nov 21, 2010 at 12:17 PM, Lance Norskog  wrote:

> Yes, the Solr commit operation always reloads the index. And it
> always throws away the Solr caches: queryResult, document, filter
> query.
>
> If you do this, please post your results.
>

Re: Master/Slave High CPU Usage

2010-11-21 Thread Ofer Fort
Do I really need a commit? Or can I use the readercycle script
(http://wiki.apache.org/solr/SolrOperationsTools)?
Since I don't need to commit anything, just reopen the reader.
thanks


Re: Master/Slave High CPU Usage

2010-11-23 Thread Ofer Fort
OK, we ran some tests, and issuing the commit on the "slave" as a postCommit
event of the "master" reloaded the index and allowed us to achieve a
master/slave configuration without replication.
This is useful only if your master and slave are on the same machine; it
helps reduce the resources needed, as you don't have two indexes and you
don't need to copy the data from one to the other.

Thanks Lance for the proposal.



Re: Does Solr supports indexing & search for Hebrew.

2011-01-18 Thread Ofer Fort
Take a look at http://github.com/synhershko/HebMorph, with more info at
http://www.code972.com/blog/hebmorph/


On Tue, Jan 18, 2011 at 11:04 AM, prasad deshpande <
prasad.deshpand...@gmail.com> wrote:

> Hello,
>
> With reference to the links below, I haven't found Hebrew support in Solr.
>
> http://wiki.apache.org/solr/LanguageAnalysis
>
> http://lucene.apache.org/java/3_0_3/api/all/index.html
>
> If I want to index and search Hebrew files/data then how would I achieve
> this?
>
> Thanks,
> Prasad
>


Efficient boolean query

2011-03-02 Thread Ofer Fort
Hey all,
I have an index with a lot of documents with the term X and no documents
with the term Y.
If I query for X it takes a few seconds and returns the results.
If I query for Y it takes a millisecond and returns an empty set.
If I query for Y AND X it takes a few seconds and returns an empty set.

I'm guessing that it evaluates both X and Y and only then tries to intersect
them?

Am I wrong? Is there another way to run this query more efficiently?

thanks for any input


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
Thanks,
I tried it in the past but found that my hit ratio was pretty low, so it
doesn't help most of my queries.

ofer

On Wed, Mar 2, 2011 at 7:16 PM, Geert-Jan Brits  wrote:

> If you often query X as part of several other queries (e.g: X  | X AND Y |
>  X AND Z)
> you might consider putting X in a filter query (
> http://wiki.apache.org/solr/CommonQueryParameters#fq)
>
> leading to:
> q=*:*&fq=X
> q=Y&fq=X
> q=Z&fq=X
>
> Filter queries are cached separately, which means that after the first query
> involving X, X should be returned quickly.
> So your FIRST query will probably still be in the 'few seconds' range, but
> all following queries involving X will return much quicker.
>
> hth,
> Geert-Jan
>


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
You are correct that my query is a range one; I probably should have
mentioned it in the first post.
This is the debug data (only the values survived; the XML tags were
stripped by the archive):

  status: 0
  QTime: 4173
  params: debugQuery=on, indent=on, start=0, rows=10, version=2.2,
          q=timestamp:[2011-02-01T00:00:00Z TO NOW] AND oferiko
  rawquerystring/querystring: timestamp:[2011-02-01T00:00:00Z TO NOW] AND oferiko
  parsedquery: +timestamp:[129651840 TO 1299069584823] +contents:oferiko
  QParser: LuceneQParser
  timing: 4171.0 total; prepare: 0.0; process: 4171.0, all of it in the
          query component


On Wed, Mar 2, 2011 at 7:48 PM, Yonik Seeley wrote:

> On Wed, Mar 2, 2011 at 12:11 PM, Ofer Fort  wrote:
> > Hey all,
> > I have an index with a lot of documents with the term X and no documents
> > with the term Y.
> > If i query for X it take a few seconds and returns the results.
> > If I query for Y it takes a millisecond and returns an empty set.
> > If i query for Y AND X it takes a few seconds and returns an empty set.
>
> This depends on the specifics of what X is.   Some query types must
> generate all hits first internally - an example is a multi-term query
> (like numeric range query, etc) that matches many terms.
>
> Can you show the generated query (i.e. add debugQuery=true to the request)?
>
> -Yonik
> http://lucidimagination.com
>


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
timestamp is of type tdate (the fieldType definition was stripped from the
archive).



Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
Thanks,
but each query tries to see if there is something new since the last result
that was found, so rounding will return the same documents over and
over again until we reach the next rounded point.

Could I use the document id somehow? Or something else that's bigger than
my last search?

And even if it was a simple term query, on the Lucene side of things, why
would it try to fetch ALL the terms if one of the required ones resulted in
an empty set?

thanks for your help, specifically on this matter and in general, to the
search community :-)

On Wed, Mar 2, 2011 at 8:35 PM, Yonik Seeley wrote:

> One way to speed things up would be to reduce the resolution on
> timestamps that you index.
> Another way would be to decrease the precisionStep on the tdate field
> type (bigger index, but faster range queries)
> Yet another way is to use "fq" filters that can be reused many times.
>
> One way to increase fq reuse is to round.
> This rounds up to the nearest hour... assumes 2011-02-01T00:00:00Z is
> the same across many queries.
> fq=timestamp:[2011-02-01T00:00:00Z TO NOW/HOUR+1HOUR]
>
> Another way is to split the filter into two parts - a large part that
> doesn't change much + a small part that does.
> Again this assumes that the first endpoint is reused across many queries.
> fq=timestamp:[2011-02-01T00:00:00Z TO
> NOW/HOUR+1HOUR]&fq=timestamp:[NOW/HOUR TO NOW]
>
> If the first endpoint is *not* reused across many queries, then you
> can still use the same strategy as above by adding another small "fq"
> for the lower endpoint.
>
> -Yonik
> http://lucidimagination.com
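
A sketch of the precisionStep suggestion: the stock Solr 1.4 tdate type
ships with precisionStep="6"; a smaller value (4 here, purely illustrative)
indexes more trie levels per date, growing the index but letting range
queries visit far fewer terms:

    <fieldType name="tdate" class="solr.TrieDateField"
               omitNorms="true" precisionStep="4" positionIncrementGap="0"/>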


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
I'm guessing what I was describing is short-circuit evaluation, and I see
that Lucene doesn't have it:
http://lucene.472066.n3.nabble.com/Short-circuit-in-query-td738551.html

Still, I would love to hear any suggestions for my type of query.

ofer



Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
I didn't see this behavior running Solr 1.4.1; was that implemented
after this release?

On Wednesday, March 2, 2011, Yonik Seeley  wrote:
> On Wed, Mar 2, 2011 at 1:58 PM, Ofer Fort  wrote:
>> And even if it was a simple term query, on the Lucene side of things, why
>> would it try to fetch ALL the terms if one of the required ones resulted
>> in an empty set?
>
> In general, all items are fetched for a big multi-term query because
> it's very difficult to answer the question "what's the first document
> after x that matches any of the terms" without doing so.
>
> More specifically, Lucene does do some short-circuiting for
> non-matches (at least in trunk... not sure about other versions).
> If you reorder your query to
> oferiko AND timestamp:[2011-02-01T00:00:00Z TO NOW]
>
> Then when there is no match on oferiko, BooleanScorer will not ask for
> the scorer for the second clause.
>
> -Yonik
> http://lucidimagination.com
>


Re: Efficient boolean query

2011-03-02 Thread Ofer Fort
That's great, just what I needed; I was debugging and was expecting to
see something like this.
I'll look through the SVN history to see in which version it was added.
Thanks

On Wednesday, March 2, 2011, Yonik Seeley  wrote:
> On Wed, Mar 2, 2011 at 2:43 PM, Ofer Fort  wrote:
>> I didn't see this behavior, running solr 1.4.1, was that implemented
>> after this release?
>
> I think so.
> It's implemented now in BooleanWeight.scorer()
>
>       for (Weight w  : weights) {
>         BooleanClause c =  cIter.next();
>         Scorer subScorer = w.scorer(context, ScorerContext.def());
>         if (subScorer == null) {
>           if (c.isRequired()) {
>             return null;
>           }
>
> And TermWeight returns null from scorer() if there are no matches for
> the segment.
>
> -Yonik
> http://lucidimagination.com
>


mixing version of solr

2011-03-03 Thread Ofer Fort
Hey all,
I have a master and a slave using the same index folder; the master only
writes, and the slave only reads.
Is it possible to use different versions of Solr for those two servers?
Let's say I want to gain from the improved search speed of Solr 4.0, but
since it's my production system, I am not willing to index using it, since
it's not a stable release.
Since the slave only reads, if it crashes I'll just restart it.

Can I index using Solr 1.4.1 and read the same index with Solr 4.0?

thanks


Re: mixing version of solr

2011-03-03 Thread Ofer Fort
We've been running like this for almost six months now and it's working OK.
We have a post-commit event on the "master" that executes a commit call on
the "slave"; this forces the slave to reload the index.

We started with a "standard" master/slave replication, but a few times the
slave got an OOM and it caused 100% CPU on the master itself. Restarting
both didn't help, and we had to shut down both, copy the files from the
master to the slave, and continue the replication.
Since we couldn't resolve this issue we moved to this configuration.

Is anybody here working with Solr 4.0 in production? Feels risky...

On Thu, Mar 3, 2011 at 9:31 PM, Jonathan Rochkind  wrote:

> In general, no. I think there are index format changes between 1.4.1 and
> 4.0.
>
> If the two versions of Solr have the exact same index formats, it would
> theoretically work, but you'd need to figure that out and be sure of it, any
> two arbitrary versions of Solr/lucene may or may not have the exact same
> index formats. _Maybe_ 4.0 can read a 1.4.1 index.  In some cases I think
> it's supposed to be able to. But it all starts getting confusing and with
> edge cases where things don't quite work, I personally wouldn't try it.
>
> But personally, I don't like the idea of having two running instances of
> Solr using the exact same on-disk index anyway.  I know people do it, you
> aren't alone, but it makes me nervous, seems like asking for trouble. When
> the indexing instances writes new indexes, when and how is the read-only
> Solr going to figure that out and load new searchers for it?  It just gets
> confusing and complicated.


Re: KStemmer for Solr 3.x +

2011-04-20 Thread Ofer Fort
Seems like it isn't. In my installation (1.4.1) I used
LucidKStemFilterFactory, and when switching the solr.war file to the 3.1
one I get:
14:42:31.664 ERROR [pool-1-thread-1]: java.lang.AbstractMethodError:
org.apache.lucene.analysis.TokenStream.incrementToken()Z
at
org.apache.lucene.analysis.CachingTokenFilter.fillCache(CachingTokenFilter.java:78)
at
org.apache.lucene.analysis.CachingTokenFilter.incrementToken(CachingTokenFilter.java:50)
at
org.apache.lucene.queryParser.QueryParser.getFieldQuery(QueryParser.java:606)
at
org.apache.solr.search.SolrQueryParser.getFieldQuery(SolrQueryParser.java:151)
at
org.apache.lucene.queryParser.QueryParser.Term(QueryParser.java:1421)
at
org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1309)
at
org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1237)
at
org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
at
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
at
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
at org.apache.solr.search.QParser.getQuery(QParser.java:142)
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:84)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
at
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:52)
at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1169)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.lang.Thread.run(Unknown Source)

when the config is an analyzer chain that includes LucidKStemFilterFactory
(the XML itself was stripped by the archive).

anybody familiar with this issue?

On Sat, Apr 9, 2011 at 7:00 AM, David Smiley (@MITRE.org)  wrote:

> I see no reason why it would not be compatible.


Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Hi,
I am looking for the best way to find the terms with the highest frequency
for a given subset of documents. (terms in the text field)
My first thought was to do a count facet search , where the query defines
the subset of documents and the facet.field is the text field, this gives me
the result but it is very very slow.
These are my params (the names were stripped by the archive; the mapping
below is my best reconstruction from the surviving values):
facet=true
rows=0
facet.mincount=3
indent=on
facet.limit=500
facet.method=enum
wt=xml
start=0
version=2.2
facet.sort=count
q=in_subset:1
facet.field=text


The index contains 7M documents; the subset is about 200K. A simple query
for the subset takes around 100ms, but the facet search takes 40s.

Am I doing something wrong?

If facet search is not the correct approach, I thought about using something
like org.apache.lucene.misc.HighFreqTerms, but I'm not sure how to do this
in Solr. Should I implement a request handler that executes this kind of
code?
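
A rough sketch of the HighFreqTerms-style idea against the Lucene 3.x API
(the SubsetTopTerms class and everything in it is made up for illustration).
It walks every term of the field and counts how many subset documents
contain it, which is essentially the linear scan facet.method=enum performs
internally:

    import java.io.IOException;
    import java.util.Comparator;
    import java.util.PriorityQueue;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.util.OpenBitSet;

    public class SubsetTopTerms {
      public static class TermCount {
        public final String term;
        public final int count;
        TermCount(String term, int count) { this.term = term; this.count = count; }
      }

      // Top-n terms of `field` ranked by document frequency within `subset`.
      public static PriorityQueue<TermCount> topTerms(
          IndexReader reader, OpenBitSet subset, String field, int n)
          throws IOException {
        // Min-heap on count: the smallest of the current top-n sits on top
        // and is evicted whenever a more frequent term shows up.
        PriorityQueue<TermCount> top = new PriorityQueue<TermCount>(n,
            new Comparator<TermCount>() {
              public int compare(TermCount a, TermCount b) { return a.count - b.count; }
            });
        TermEnum terms = reader.terms(new Term(field, ""));
        TermDocs docs = reader.termDocs();
        try {
          do {
            Term t = terms.term();
            if (t == null || !t.field().equals(field)) break; // walked past the field
            docs.seek(terms);
            int count = 0;
            while (docs.next()) {
              if (subset.get(docs.doc())) count++; // document is in the subset
            }
            if (count > 0) {
              top.add(new TermCount(t.text(), count));
              if (top.size() > n) top.poll();
            }
          } while (terms.next());
        } finally {
          terms.close();
          docs.close();
        }
        return top;
      }
    }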

thanks for any help


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Thanks, but that's what I started with, and it took an even longer time and
threw this:
Approaching too many values for UnInvertedField faceting on field 'text' :
bucket size=15560140
Approaching too many values for UnInvertedField faceting on field 'text' :
bucket size=15619075
Exception during facet counts: org.apache.solr.common.SolrException: Too many
values for UnInvertedField faceting on field text


On Thu, Apr 21, 2011 at 2:11 AM, Jonathan Rochkind  wrote:

> I think faceting is probably the best way to do that, indeed. It might be
> slow, but it's kind of set up for exactly that case, I can't imagine any
> other technique being faster -- there's stuff that has to be done to look up
> the info you want.
>
> BUT, I see your problem:  don't use facet.method=enum. Use facet.method=fc.
>  Works a LOT better for very high arity fields (lots and lots of unique
> values) like you have. I bet you'll see significant speed-up if you use
> facet.method=fc instead, hopefully fast enough to be workable.
>
> With facet.method=enum, I would have indeed predicted it would be horribly
> slow, before solr 1.4 when facet.method=fc became available, it was nearly
> impossible to facet on very high arity fields, facet.method=fc is the magic.
> I think facet.method=fc is even the default in Solr 1.4+, if you hadn't
> explicitly set it to enum instead!
>
> Jonathan
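
For reference, the same request with Jonathan's change applied (host and
core path assumed):

    curl 'http://localhost:8983/solr/select?q=in_subset:1&rows=0&facet=true&facet.field=text&facet.limit=500&facet.sort=count&facet.method=fc'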
> 
> From: Ofer Fort [ofer...@gmail.com]
> Sent: Wednesday, April 20, 2011 6:49 PM
> To: solr-user@lucene.apache.org
> Subject: Highest frequency terms for a subset of documents
> Hi,
> I am looking for the best way to find the terms with the highest frequency
> for a given subset of documents. (terms in the text field)
> My first thought was to do a count facet search , where the query defines
> the subset of documents and the facet.field is the text field, this gives
> me
> the result but it is very very slow.
> These are my params:
> true
> 0
> 3
> on
> 500
> enum
> xml
> 0
> 2.2
> count
>   in_subset:1
> text
> 
>
> The index contains 7M documents, the subset is about 200K. A simple query
> for the subset takes around 100ms, but the facet search takes 40s.
>
> Am i doing something wrong?
>
> If facet search is not the correct approach, i thought about using
> something
> like org.apache.lucene.misc.HighFreqTerms, but i'm not sure how to do this
> in solr. Should i implememt a request handler that executes this kind of
> code?
>
> thanks for any help
>


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Seems like the facet search is not all that suited for a full-text field (
http://search.lucidimagination.com/search/document/178f1a82ff19070c/solr_severe_error_when_doing_a_faceted_search#16562790cda76197
).

Maybe I should go in another direction. I think the HighFreqTerms approach
is the way to go, just not sure how to start.



Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
Thanks,
but I've disabled the cache already, since my concern is speed and I'm
willing to pay the price (memory), and my subsets are not fixed.
Does the facet search do any extra work that I don't need, that I might be
able to disable (either by a flag or by a code change)?
Somehow I feel, or rather hope, that counting the terms of 200K documents
and finding the top 500 should take less than 30 seconds.


On Thu, Apr 21, 2011 at 2:41 AM, Yonik Seeley wrote:

> On Wed, Apr 20, 2011 at 7:34 PM, Chris Hostetter
>  wrote:
> >
> > : thanks, but that's what i started with, but it took an even longer
> > : time and threw this:
> > : Approaching too many values for UnInvertedField faceting on field
> > : 'text' : bucket size=15560140
> > : Approaching too many values for UnInvertedField faceting on field
> > : 'text' : bucket size=15619075
> > : Exception during facet counts:org.apache.solr.common.SolrException:
> > : Too many values for UnInvertedField faceting on field text
> >
> > right ... facet.method=fc is a good default, but cases like full text
> > faceting can cause it to seriously blow up the memory ... i didn't even
> > realize it was possible to get it to fail this way, i would have just
> > expected an OutOfMemoryError.
> >
> > facet.method=enum is probably your best bet in this situation precisely
> > because it does a linear scan over the terms ... it's slower because it's
> > safer.
> >
> > the one speed-up you might be able to get is to ensure you don't use the
> > filterCache -- that way you don't waste time constantly
> > caching/overwriting DocSets
>
> Right - or only using filterCache for high df terms via
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.enum.cache.minDf
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
BTW,
I'm using Solr 1.4.1; do 3.1 or 4.0 contain any performance improvements
that will make a difference as far as facet search is concerned?
thanks again
Ofer



Re: Highest frequency terms for a subset of documents

2011-04-20 Thread Ofer Fort
My documents are user entries, so I'm guessing they vary a lot.
Tomorrow I'll try 3.1 and also 4.0, and see if they bring an improvement.
Thanks guys!

On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley wrote:

> On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort  wrote:
> > Thanks
> > but i've disabled the cache already, since my concern is speed and i'm
> > willing to pay the price (memory)
>
> Then you should not disable the cache.
>
> >, and my subsets are not fixed.
> > Does the facet search do any extra work that I don't need and that I
> > might be able to disable (either by a flag or by a code change)?
> > Somehow I feel, or rather hope, that counting the terms of 200K documents
> > and finding the top 500 should take less than 30 seconds.
>
> Using facet.enum.cache.minDf should be a little faster than just
> disabling the cache - it's a different code path.
> Using the cache selectively will speed things up, so try setting that
> minDf to 1000 or so for example.
>
> How many unique terms do you have in the index?
> Is this Solr 3.1?  There were some optimizations for the case where there
> are many terms to iterate over.
> You could also try trunk, which has even more optimizations, or the
> bulkpostings branch if you really want to experiment.
>
> -Yonik
>
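For reference, this is roughly how I'd send the request with that minDf
tweak through SolrJ; the URL, the query, and the 1000 cutoff are just
placeholders for my setup:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetSubsetQuery {
  public static void main(String[] args) throws Exception {
    SolrQuery q = new SolrQuery("in_subset:1");
    q.setRows(0);                 // only the facet counts are needed
    q.setFacet(true);
    q.addFacetField("text");
    q.setFacetLimit(500);
    q.set("facet.sort", "count");
    q.set("facet.method", "enum");
    // use the filterCache only for terms with docFreq >= 1000
    q.set("facet.enum.cache.minDf", "1000");

    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    QueryResponse rsp = server.query(q);
    System.out.println("QTime=" + rsp.getQTime());
    System.out.println(rsp.getFacetField("text").getValues());
  }
}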


Re: Highest frequency terms for a subset of documents

2011-04-21 Thread Ofer Fort
OK, so I copied my index and ran Solr 3.1 against it.
QTime dropped from about 40s to 17s! This is good news, but still longer
than I hoped for.
I tried to do the same test with 4.0, but I'm getting an
IndexFormatTooOldException, since my index was created using 1.4.1. Is my
only option to test this to reindex using 3.1 or 4.0?

Another strange behavior is that the QTime seems pretty stable no matter
how many objects match my query: 200K and 20K both take about 17s.
I would have guessed that since the time is spent going over all the terms
of all the subset documents, more documents would mean more time.

Thanks for any insights

ofer



On Thu, Apr 21, 2011 at 3:07 AM, Ofer Fort  wrote:

> my documents are user entries, so i'm guessing they vary a lot.
> Tomorrow i'll try 3.1 and also 4.0, and see if they have an improvement.
> thanks guys!
>
>
> On Thu, Apr 21, 2011 at 3:02 AM, Yonik Seeley 
> wrote:
>
>> On Wed, Apr 20, 2011 at 7:45 PM, Ofer Fort  wrote:
>> > Thanks
>> > but i've disabled the cache already, since my concern is speed and i'm
>> > willing to pay the price (memory)
>>
>> Then you should not disable the cache.
>>
>> >, and my subset are not fixed.
>> > Does the facet search do any extra work that i don't need, that i might
>> be
>> > able to disable (either by a flag or by a code change),
>> > Somehow i feel, or rather hope, that counting the terms of 200K
>> documents
>> > and finding the top 500 should take less than 30 seconds.
>>
>> Using facet.enum.cache.minDf should be a little faster than just
>> disabling the cache - it's a different code path.
>> Using the cache selectively will speed things up, so try setting that
>> minDf to 1000 or so for example.
>>
>> How many unique terms do you have in the index?
>> Is this Solr 3.1 - there were some optimizations when there were many
>> terms to iterate over?
>> You could also try trunk, which has even more optimizations, or the
>> bulkpostings branch if you really want to experiment.
>>
>> -Yonik
>>
>
>


Re: Highest frequency terms for a subset of documents

2011-04-21 Thread Ofer Fort
I'm not sure I fully understand.
If "facet.method=enum steps over all terms in the index for that field",
then what does setting q=field:subset do? If I set q=*:*, then how
do I get the frequency only on my subset?
Ofer

On Thu, Apr 21, 2011 at 4:40 PM, Yonik Seeley wrote:

> On Thu, Apr 21, 2011 at 9:24 AM, Ofer Fort  wrote:
> > Another strange behavior is that the QTime seems pretty stable no matter
> > how many objects match my query: 200K and 20K both take about 17s.
> > I would have guessed that since the time is spent going over all the
> > terms of all the subset documents, more documents would mean more time.
>
> facet.method=enum steps over all terms in the index for that field...
> that takes time regardless of how many documents are in the base set.
>
> There are also short-circuit methods that avoid looking at the docs
> for a term if its docfreq is low enough that it couldn't possibly
> make it into the priority queue.  Because of this, it can actually be
> faster to facet on a larger base set (try *:* as the base query).
>
> Actually, it might be interesting to see the query time if you set
> facet.mincount equal to the number of docs in the base set - that will
> test pretty much just the time to enumerate over the terms without
> doing any set intersections at all.  Be careful not to set mincount
> greater than the number of docs in the base set though - solr will
> short-circuit that too and skip enumeration altogether.
>
> The work on the bulkpostings branch should definitely speed up your
> case even more - but I have no idea when it will "land" on trunk.
>
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>
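If I understand the mincount trick correctly, the enumeration-only timing
test would look something like this (assuming the base set really has 200K
docs; as Yonik warns, the value must not exceed that):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MincountTimingTest {
  public static void main(String[] args) throws Exception {
    // With facet.mincount equal to the base-set size, terms with a lower
    // docFreq are skipped without any set intersection, so QTime should
    // measure mostly the term enumeration itself.
    SolrQuery q = new SolrQuery("in_subset:1");
    q.setRows(0);
    q.setFacet(true);
    q.addFacetField("text");
    q.set("facet.method", "enum");
    q.set("facet.mincount", "200000");  // == docs in the base set, not greater
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    QueryResponse rsp = server.query(q);
    System.out.println("QTime=" + rsp.getQTime());
  }
}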


Re: Highest frequency terms for a subset of documents

2011-04-21 Thread Ofer Fort
I see, thanks.
So if I wanted to implement something that fits my needs, would going
through the subset of documents and counting all the terms in each one be
faster? And easier to implement?

On Thu, Apr 21, 2011 at 5:36 PM, Yonik Seeley wrote:

> On Thu, Apr 21, 2011 at 9:44 AM, Ofer Fort  wrote:
> > I'm not sure I fully understand.
> > If "facet.method=enum steps over all terms in the index for that field",
> > then what does setting q=field:subset do? If I set q=*:*, then how
> > do I get the frequency only on my subset?
>
> It's an implementation detail.  Faceting *does* just give you counts that
> just match q=field:subset.  How it does it is a different matter (i.e. for
> facet.method=enum, it must step over all terms in the field), so it's
> closer to O(nterms in field) rather than O(ndocs in base set).
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


Re: Highest frequency terms for a subset of documents

2011-04-21 Thread Ofer Fort
So if I want to use facet.method=fc, is there a way to speed it up and
remove the bucket size limitation?

On Thu, Apr 21, 2011 at 5:58 PM, Yonik Seeley wrote:

> On Thu, Apr 21, 2011 at 10:41 AM, Ofer Fort  wrote:
> > I see, thanks.
> > So if I wanted to implement something that fits my needs, would going
> > through the subset of documents and counting all the terms in each one be
> > faster? And easier to implement?
>
> That's not just your needs, that's everyone's needs (it's the
> definition of field faceting).
> There's no way to do what you're asking with a term enumerator (i.e.
> facet.method=enum).
>
> Going through documents and counting all the terms in each is what
> facet.method=fc does.
> But it's also not great when the number of unique terms per document is
> high.
> If you can think of a better way, go for it!
>
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


Re: Highest frequency terms for a subset of documents

2011-04-21 Thread Ofer Fort
Well, it was worth a try ;-)
When using facet.method=fc, will reducing the subset size
reduce the time and memory? Meaning, is it O(ndocs of the set)?
Thanks
On Thursday, April 21, 2011, Yonik Seeley  wrote:
> On Thu, Apr 21, 2011 at 11:15 AM, Ofer Fort  wrote:
>> So if I want to use facet.method=fc, is there a way to speed it up and
>> remove the bucket size limitation?
>
> Not really - else we would have done it already ;-)
> We don't really have great methods for faceting on full-text fields
> (as opposed to shorter meta-data fields) today.
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


Index upgrade from 1.4.1 to 3.1 and 4.0

2011-04-21 Thread Ofer Fort
Hi all,
While doing some tests, I realized that an index that was created with
Solr 1.4.1 is readable by Solr 3.1, but not readable by Solr 4.0.
If I plan to migrate my index to 4.0, and I prefer not to reindex it
all, what would be my best course of action?
Will it be possible to continue writing to the index with 3.1? Will
that make it readable from 4.0, or only the newly created segments?
If I optimize it using 3.1, will that make it readable also from 4.0?
Thanks
Ofer
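In case it helps, the optimize step I have in mind is just the plain Lucene
3.1 call below; the index path is a placeholder, Solr should be stopped
while it runs, and I would back the index up first, since 1.4.1 will no
longer be able to read it afterwards (whether 4.0 then reads it is exactly
what I want to verify):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class OptimizeWith31 {
  public static void main(String[] args) throws Exception {
    FSDirectory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
    IndexWriterConfig conf = new IndexWriterConfig(
        Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
    IndexWriter writer = new IndexWriter(dir, conf);
    writer.optimize();  // merge everything into a single 3.1-format segment
    writer.close();
    dir.close();
  }
}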


Re: Highest frequency terms for a subset of documents

2011-04-21 Thread Ofer Fort
So I'm guessing my best approach now would be to test trunk, and hope
that, as 3.1 cut the query time in half, trunk will do the same.
Thanks for the info
Ofer

On Friday, April 22, 2011, Yonik Seeley  wrote:
> On Thu, Apr 21, 2011 at 6:25 PM, Ofer Fort  wrote:
>> Well, it was worth a try ;-)
>> When using facet.method=fc, will reducing the subset size
>> reduce the time and memory? Meaning, is it O(ndocs of the set)?
>
> facet.method=fc builds a multi-valued fieldcache-like structure
> (UnInvertedField) the first time; that structure is used for counting
> facets for all subsequent requests.  So the faceting time (after the
> first time) is O(ndocs of the set), but the UnInvertedField singleton
> uses a large amount of memory unrelated to any particular base docset.
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


Re: Highest frequency terms for a subset of documents

2011-04-21 Thread Ofer Fort
OK, I'll give it a try, as this is a server I am willing to risk.
How is the compatibility between the SolrJ of bulkpostings, trunk, 3.1, and 1.4.1?

On Friday, April 22, 2011, Yonik Seeley  wrote:
> On Thu, Apr 21, 2011 at 6:34 PM, Ofer Fort  wrote:
>> So I'm guessing my best approach now would be to test trunk, and hope
>> that, as 3.1 cut the query time in half, trunk will do the same.
>
> Trunk prob won't be much better... but the bulkpostings branch
> possibly could be.
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>


Re: Highest frequency terms for a subset of documents

2011-04-21 Thread Ofer Fort
OK, thanks

On Friday, April 22, 2011, Yonik Seeley  wrote:
> On Thu, Apr 21, 2011 at 6:50 PM, Ofer Fort  wrote:
>> OK, I'll give it a try, as this is a server I am willing to risk.
>> How is the compatibility between the SolrJ of bulkpostings, trunk, 3.1, and 1.4.1?
>
> bulkpostings, trunk, and 3.1 should all be relatively SolrJ-compatible.
> But the SolrJ javabin format (used by default for queries) changed for
> strings between 1.4.1 and 3.1 (SOLR-2034).
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>
>
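If the javabin change does bite, one workaround I'm assuming should work
(not verified) is to force SolrJ onto the XML format when the 1.4.1 client
talks to a newer server:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;

public class XmlFormatClient {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Parse responses as XML instead of the default javabin format,
    // sidestepping the SOLR-2034 string-encoding change.
    server.setParser(new XMLResponseParser());
  }
}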


Re: Index upgrade from 1.4.1 to 3.1 and 4.0

2011-04-22 Thread Ofer Fort
Nobody?
Am I the only one in need of upgrading an index that was created with 1.4.1?

Thanks for any info
Ofer



Re: Index upgrade from 1.4.1 to 3.1 and 4.0

2011-04-22 Thread Ofer Fort
Thanks Otis, but this is not my case. Most of my fields are not stored,
but I do have the original data in case I need to reindex.
My question is: do I need to?
If my 1.4.1 index can be read by 3.1, I assume 3.1 can continue to write to it?
In that case, I continue assuming that 4.0 will know how to read only
the new segments, and that if I optimize it, I will have only one new
segment, created by 3.1, and thus readable by 4.0.
It makes sense to me; the only question is whether my guesses are right :-)
Thanks.

On Friday, April 22, 2011, Otis Gospodnetic  wrote:
> Hi Ofer,
>
> We recently helped a customer go through just such an upgrade (or maybe
> even from 1.3.*).  We used a tool that read data from one index and indexed
> it to the new index without having to reindex the data from the original
> sources.  All fields in the source index were obviously stored. :)
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>


Re: Index upgrade from 1.4.1 to 3.1 and 4.0

2011-04-22 Thread Ofer Fort
Thanks, I'll do the procedure on my test env and update the community.
If anybody has already gone through the process, I would love to hear about it.

On Friday, April 22, 2011, Otis Gospodnetic  wrote:
> Regardless of what anyone here says, you need to try it.
> 3.1 should be able to read 1.4.1, yes.
> Once the format is switched to 3.1, you can't go back and read it with 1.4.1.
> This is why you want to upgrade your Slaves first, then your Master (if you
> have them -- I remember we spoke a while back and that wasn't the case back
> then).
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>


SOLR-1155 on 3.1

2011-05-30 Thread Ofer Fort
Hey all,
In the last comment on SOLR-1155 (
https://issues.apache.org/jira/browse/SOLR-1155?focusedCommentId=13019955&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13019955
), Jayson Minard wrote, "I'll look at updating this for 3.1".
Was it integrated into 3.1? If not, is there a patch one can use?
Thanks