Re: How to make UnInvertedField faster?

2011-10-21 Thread Jason Rutherglen
Sweet + Very cool!

On Fri, Oct 21, 2011 at 7:50 AM, Simon Willnauer <
simon.willna...@googlemail.com> wrote:

> In trunk we have a feature called IndexDocValues which basically
> creates the uninverted structure at index time. You can then simply
> suck that into memory or even access it on disk directly
> (RandomAccess). Even if I can't help you right now this is certainly
> going to help you here. There is no need to uninvert at all anymore in
> lucene 4.0
>
> simon
>
> On Wed, Oct 19, 2011 at 8:05 PM, Michael Ryan  wrote:
> > I was wondering if anyone has any ideas for making
> UnInvertedField.uninvert()
> > faster, or other alternatives for generating facets quickly.
> >
> > The vast majority of the CPU time for our Solr instances is spent
> generating
> > UnInvertedFields after each commit. Here's an example of one of our
> slower fields:
> >
> > [2011-10-19 17:46:01,055] INFO 125974 [pool-1-thread-1] - (SolrCore:440) -
> > UnInverted multi-valued field
> {field=authorCS,memSize=38063628,tindexSize=422652,
> >
> time=15610,phase1=15584,nTerms=1558514,bigTerms=0,termInstances=4510674,uses=0}
> >
> > That is from an index with approximately 8 million documents. After each
> commit,
> > it takes on average about 90 seconds to uninvert all the fields that we
> facet on.
> >
> > Any ideas at all would be greatly appreciated.
> >
> > -Michael
> >
>
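
For reference, what Simon describes boils down to writing the per-document values
as a doc-values column at index time, so there is nothing left to uninvert when a
searcher opens. Below is a minimal sketch using the doc-values API as it settled in
later Lucene 4.x releases (the trunk-era IndexDocValues classes were renamed before
release); the "authorCS" field name comes from Michael's log line, everything else
is illustrative:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.util.BytesRef;

public class AuthorDocValuesSketch {
    // Build a document whose multi-valued "authorCS" field is also written as a
    // doc-values (column-stride) structure, so faceting can read it directly
    // instead of uninverting the terms after every commit.
    static Document buildDoc(String id, String... authors) {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        for (String author : authors) {
            // inverted copy for normal term queries
            doc.add(new StringField("authorCS", author, Field.Store.NO));
            // uninverted copy for faceting/sorting, built at index time
            doc.add(new SortedSetDocValuesField("authorCS", new BytesRef(author)));
        }
        return doc;
    }
}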


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
> We should maybe try to fix this in 3.x too?

+1 I suggested it should be backported a while back.  Or that Lucene
4.x should be released.  I'm not sure what is holding up Lucene 4.x at
this point, bulk postings is only needed for PFOR.

On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer
 wrote:
> On Fri, Oct 28, 2011 at 9:17 PM, Simon Willnauer
>  wrote:
>> Hey Roman,
>>
>> On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
>>  wrote:
>>> Hi everyone,
>>>
>>> I'm looking for some help with Solr indexing issues on a large scale.
>>>
>>> We are indexing few terabytes/month on a sizeable Solr cluster (8
>>> masters / serving writes, 16 slaves / serving reads). After certain
>>> amount of tuning we got to the point where a single Solr instance can
>>> handle index size of 100GB without much issues, but after that we are
>>> starting to observe noticeable delays on index flush and they are
>>> getting larger. See the attached picture for details, it's done for a
>>> single JVM on a single machine.
>>>
>>> We are posting data in 8 threads using javabin format and doing commit
>>> every 5K documents, merge factor 20, and ram buffer size about 384MB.
>>> From the picture it can be seen that a single-threaded index flushing
>>> code kicks in on every commit and blocks all other indexing threads.
>>> The hardware is decent (12 physical / 24 virtual cores per machine)
>>> and it is mostly idle when the index is flushing. Very little CPU
>>> utilization and disk I/O (<5%), with the exception of a single CPU
>>> core which actually does index flush (95% CPU, 5% I/O wait).
>>>
>>> My questions are:
>>>
>>> 1) will Solr changes from real-time branch help to resolve these
>>> issues? I was reading
>>> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
>>> and it looks like we have exactly the same problem
>>
>> did you also read http://bit.ly/ujLw6v - here I try to explain the
>> major difference between Lucene 3.x and 4.0 and why 3.x has these long
>> idle times. In Lucene 3.x a full flush / commit is a single threaded
>> process, as you observed there is only one thread making progress. In
>> Lucene 4 there is still a single thread executing the commit but other
>> threads are not blocked anymore. Depending on how fast the thread can
>> flush, other threads might help flush segments for that commit
>> concurrently or simply index into new document writers. So basically
>> 4.0 won't have this problem anymore. The realtime branch you talk
>> about is already merged into 4.0 trunk.
>>
>>>
>>> 2) what would be the best way to port these (and only these) changes
>>> to 3.4.0? I tried to dig into the branching and revisions, but got
>>> lost quickly. Tried something like "svn diff
>>> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not
>>> sure if it's even possible to merge these into 3.4.0
>>
>> Possible yes! Worth the trouble, I would say no!
>> DocumentsWriterPerThread (DWPT) is a very big change and I don't think
>> we should backport this into our stable branch. This feature
>> is very stable in 4.0, though.
>>>
>>> 3) what would you recommend for production 24/7 use? 3.4.0?
>>
>> I think 3.4 is a safe bet! I personally tend to use trunk in
>> production too; the only problem is that it is basically a moving
>> target and introduces extra overhead on your side to watch changes and
>> index format modifications, which could basically prevent you from
>> doing simple upgrades.
>>
>>>
>>> 4) is there a workaround that can be used? also, I listed the stack trace 
>>> below
>>>
>>> Thank you!
>>> Roman
>>>
>>> P.S. This single "index flushing" thread spends 99% of all the time in
>>> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then
>>> the merge seems to go quickly. I looked it up and it looks like the
>>> intent here is deleting old commit points (we are keeping only 1
>>> non-optimized commit point per config). Not sure why it is taking that
>>> long.
>>
>> in 3.x there is no way to apply deletes without doing a flush (afaik).
>> In 3.x a flush means single threaded again - similar to commit just
>> without syncing files to disk and writing a new segments file. In 4.0
>> you have way more control over this via
>> IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied
>> without blocking other threads. In trunk we hijack indexing threads to
>> do all that work concurrently so you get better cpu utilization and
>> due to concurrent flushing better and usually continuous IO
>> utilization.
>>
>> hope that helps.
>>
>> simon
>>>
>>> pool-2-thread-1 [RUNNABLE] CPU time: 3:31
>>> java.nio.Bits.copyToByteArray(long, Object, long, long)
>>> java.nio.DirectByteBuffer.get(byte[], int, int)
>>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
>>> int)
>>> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
>>> org.apache.lucene.index.SegmentTermEnum.next()
>>> org.apache.lucene.index.TermInfosReader.(Directo
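
For context, the IndexWriterConfig settings discussed above (the 384MB RAM buffer
Roman mentions and the buffered-delete-terms knob Simon points at) look roughly
like this with the Lucene 4.x API. This is a sketch, not a recommendation -- the
version constant, analyzer, and thresholds are placeholders:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class WriterTuningSketch {
    public static IndexWriter openWriter(File path) throws Exception {
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
        cfg.setRAMBufferSizeMB(384);          // the 384MB buffer described in the thread
        cfg.setMaxBufferedDeleteTerms(1000);  // cap buffered delete terms before they are applied
        Directory dir = FSDirectory.open(path);
        return new IndexWriter(dir, cfg);
    }
}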

Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
> Otherwise we have "flexible indexing" where "flexible" means "slower
> if you do anything but the default".

The other encodings should exist as modules since they are pluggable.
4.0 can ship with the existing codec.  4.1 with additional codecs and
the bulk postings at a later time.

Otherwise it will be 6 months before 4.0 ships, that's too long.

Also it is an amusing contradiction that your argument flies in the
face of Lucid shipping 4.x today without said functionality.

On Fri, Oct 28, 2011 at 5:09 PM, Robert Muir  wrote:
> On Fri, Oct 28, 2011 at 5:03 PM, Jason Rutherglen
>  wrote:
>
>> +1 I suggested it should be backported a while back.  Or that Lucene
>> 4.x should be released.  I'm not sure what is holding up Lucene 4.x at
>> this point, bulk postings is only needed for PFOR.
>
> This is not true: most modern index compression schemes, not just
> PFOR-delta, read more than one integer at a time.
>
> That's why it's important not only to abstract away the encoding of the
> index, but also to ensure that the enumeration APIs aren't biased
> towards one-at-a-time vInt.
>
> Otherwise we have "flexible indexing" where "flexible" means "slower
> if you do anything but the default".
>
> --
> lucidimagination.com
>


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
> abstract away the encoding of the index

Robert, this is what you wrote.  "Abstract away the encoding of the
index" means pluggable, otherwise it's not abstract and / or it's a
flawed design.  Sounds like it's the latter.


Re: overwrite=false support with SolrJ client

2011-11-04 Thread Jason Rutherglen
It should be supported in SolrJ, I'm surprised it's been lopped out.
Bulk indexing is extremely common.

On Fri, Nov 4, 2011 at 1:16 PM, Ken Krugler  wrote:
> Hi list,
>
> I'm working on improving the performance of the Solr scheme for Cascading.
>
> This supports generating a Solr index as the output of a Hadoop job. We use 
> SolrJ to write the index locally (via EmbeddedSolrServer).
>
> There are mentions of using overwrite=false with the CSV request handler, as 
> a way of improving performance.
>
> I see that https://issues.apache.org/jira/browse/SOLR-653 removed this 
> support from SolrJ, because it was deemed too dangerous for mere mortals.
>
> My question is whether anyone knows just how much performance boost this 
> really provides.
>
> For Hadoop-based workflows, it's straightforward to ensure that the unique 
> key field is really unique, thus if the performance gain is significant, I 
> might look into figuring out some way (with a trigger lock) of re-enabling 
> this support in SolrJ.
>
> Thanks,
>
> -- Ken
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>
>
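
For anyone who wants to benchmark this: overwrite=false can still be passed as a
plain request parameter from SolrJ even though the dedicated setter was removed in
SOLR-653. Whether the update handler honors the parameter depends on the Solr
version, so treat the following as a sketch to measure with, not a guarantee:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class BulkAddSketch {
    // Send a batch of documents with overwrite=false so Solr can skip the
    // unique-key duplicate handling; only safe when the keys are known unique.
    public static void addBatch(SolrServer server, Iterable<SolrInputDocument> docs)
            throws Exception {
        UpdateRequest req = new UpdateRequest();
        req.setParam("overwrite", "false");
        for (SolrInputDocument doc : docs) {
            req.add(doc);
        }
        req.process(server);
    }
}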


Re: Core overhead

2011-12-16 Thread Jason Rutherglen
Wow the shameless plugging of product (footer) has hit a new low Otis.

On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
 wrote:
> Hi Yury,
>
> Not sure if this was already covered in this thread, but with N smaller cores 
> on a single N-CPU-core box you could run N queries in parallel over smaller 
> indices, which may be faster than a single query going against a single big 
> index, depending on how many concurrent query requests the box is handling 
> (i.e. how busy or idle the CPU cores are).
>
> Otis
> 
>
> Performance Monitoring SaaS for Solr - 
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
>
>>
>> From: Yury Kats 
>>To: solr-user@lucene.apache.org
>>Sent: Thursday, December 15, 2011 12:58 PM
>>Subject: Core overhead
>>
>>Does anybody have an idea, or better yet, measured data,
>>to see what the overhead of a core is, both in memory and speed?
>>
>>For example, what would be the difference between having 1 core
>>with 100M documents versus having 10 cores with 10M documents?
>>
>>
>>


Re: Core overhead

2011-12-16 Thread Jason Rutherglen
Ted,

"...- FREE!" is stupid idiot spam.  It's annoying and not suitable.

On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning  wrote:
> I thought it was slightly clumsy, but it was informative.  It seemed like a
> fine thing to say.  Effectively it was "I/we have developed a tool that
> will help you solve your problem".  That is responsive to the OP and it is
> clear that it is a commercial deal.
>
> On Fri, Dec 16, 2011 at 10:02 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> Wow the shameless plugging of product (footer) has hit a new low Otis.
>>
>> On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
>>  wrote:
>> > Hi Yury,
>> >
>> > Not sure if this was already covered in this thread, but with N smaller
>> cores on a single N-CPU-core box you could run N queries in parallel over
>> smaller indices, which may be faster than a single query going against a
>> single big index, depending on how many concurrent query requests the box
>> is handling (i.e. how busy or idle the CPU cores are).
>> >
>> > Otis
>> > 
>> >
>> > Performance Monitoring SaaS for Solr -
>> http://sematext.com/spm/solr-performance-monitoring/index.html
>> >
>> >
>> >
>> >>
>> >> From: Yury Kats 
>> >>To: solr-user@lucene.apache.org
>> >>Sent: Thursday, December 15, 2011 12:58 PM
>> >>Subject: Core overhead
>> >>
>> >>Does anybody have an idea, or better yet, measured data,
>> >>to see what the overhead of a core is, both in memory and speed?
>> >>
>> >>For example, what would be the difference between having 1 core
>> >>with 100M documents versus having 10 cores with 10M documents?
>> >>
>> >>
>> >>
>>


Re: Core overhead

2011-12-16 Thread Jason Rutherglen
Ted,

The list would be unreadable if everyone spammed at the bottom of their
emails like Otis does.  It's just bad form.

Jason

On Fri, Dec 16, 2011 at 12:00 PM, Ted Dunning  wrote:
> Sounds like we disagree.
>
> On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> Ted,
>>
>> "...- FREE!" is stupid idiot spam.  It's annoying and not suitable.
>>
>> On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning 
>> wrote:
>> > I thought it was slightly clumsy, but it was informative.  It seemed
>> like a
>> > fine thing to say.  Effectively it was "I/we have developed a tool that
>> > will help you solve your problem".  That is responsive to the OP and it
>> is
>> > clear that it is a commercial deal.
>> >
>> > On Fri, Dec 16, 2011 at 10:02 AM, Jason Rutherglen <
>> > jason.rutherg...@gmail.com> wrote:
>> >
>> >> Wow the shameless plugging of product (footer) has hit a new low Otis.
>> >>
>> >> On Fri, Dec 16, 2011 at 7:32 AM, Otis Gospodnetic
>> >>  wrote:
>> >> > Hi Yury,
>> >> >
>> >> > Not sure if this was already covered in this thread, but with N
>> smaller
>> >> cores on a single N-CPU-core box you could run N queries in parallel
>> over
>> >> smaller indices, which may be faster than a single query going against a
>> >> single big index, depending on how many concurrent query requests the
>> box
>> >> is handling (i.e. how busy or idle the CPU cores are).
>> >> >
>> >> > Otis
>> >> > 
>> >> >
>> >> > Performance Monitoring SaaS for Solr -
>> >> http://sematext.com/spm/solr-performance-monitoring/index.html
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> From: Yury Kats 
>> >> >>To: solr-user@lucene.apache.org
>> >> >>Sent: Thursday, December 15, 2011 12:58 PM
>> >> >>Subject: Core overhead
>> >> >>
>> >> >>Does anybody have an idea, or better yet, measured data,
>> >> >>to see what the overhead of a core is, both in memory and speed?
>> >> >>
>> >> >>For example, what would be the difference between having 1 core
>> >> >>with 100M documents versus having 10 cores with 10M documents?
>> >> >>
>> >> >>
>> >> >>
>> >>
>>


Re: soft commit

2012-01-02 Thread Jason Rutherglen
> It still normally makes sense to have the caches enabled (esp filter and 
> document caches).

In the NRT case that statement is completely incorrect

On Mon, Jan 2, 2012 at 5:37 PM, Yonik Seeley  wrote:
> On Mon, Jan 2, 2012 at 1:28 PM, Mark Miller  wrote:
>> Right - in most NRT cases (very frequent soft commits), the cache should
>> probably be disabled.
>
> Did you mean autowarm should be disabled (as it already is in the
> example config)?
> It still normally makes sense to have the caches enabled (esp filter
> and document caches).
>
> -Yonik
> http://www.lucidimagination.com


Re: soft commit

2012-01-03 Thread Jason Rutherglen
*Laugh*

I stand by what Mark said:

"Right - in most NRT cases (very frequent soft commits), the cache should
probably be disabled."

On Mon, Jan 2, 2012 at 7:45 PM, Yonik Seeley  wrote:
> On Mon, Jan 2, 2012 at 9:58 PM, Jason Rutherglen
>  wrote:
>>> It still normally makes sense to have the caches enabled (esp filter and 
>>> document caches).
>>
>> In the NRT case that statement is completely incorrect
>
> *shrug*
>
> To each their own.  I stand my my statement.
>
> -Yonik
> http://www.lucidimagination.com


Re: soft commit

2012-01-03 Thread Jason Rutherglen
> multi-select faceting

Yikes.  I'd love to see a test showing that un-inverted field cache
(which is for ALL segments as a single unit) can be used efficiently
with NRT / soft commit.

On Tue, Jan 3, 2012 at 1:50 PM, Yonik Seeley  wrote:
> On Tue, Jan 3, 2012 at 4:36 PM, Erik Hatcher  wrote:
>> As I understand it, the document and filter caches add value *intra* request 
>> such that it keeps additional work (like fetching stored fields from disk 
>> more than once) from occurring.
>
> Yep.  Highlighting, multi-select faceting, and distributed search are
> just some of the scenarios where the caches are utilized in the scope
> of a single request.
> Please folks, don't disable your caches!
>
> -Yonik
> http://www.lucidimagination.com


Re: soft commit

2012-01-03 Thread Jason Rutherglen
The main point is, Solr, unlike for example ElasticSearch and other
Lucene-based systems, does NOT cache filters or facets per-segment.

This is a fundamental design flaw.

On Tue, Jan 3, 2012 at 1:50 PM, Yonik Seeley  wrote:
> On Tue, Jan 3, 2012 at 4:36 PM, Erik Hatcher  wrote:
>> As I understand it, the document and filter caches add value *intra* request 
>> such that it keeps additional work (like fetching stored fields from disk 
>> more than once) from occurring.
>
> Yep.  Highlighting, multi-select faceting, and distributed search are
> just some of the scenarios where the caches are utilized in the scope
> of a single request.
> Please folks, don't disable your caches!
>
> -Yonik
> http://www.lucidimagination.com


Re: soft commit

2012-01-03 Thread Jason Rutherglen
Address the points I brought up or don't reply with funny name calling.

Below are two key points, reiterated and re-articulated in an easy-to-answer way:

* Multi-select faceting is per-segment (true or false)

* Filters are cached per-segment (true or false)

On Tue, Jan 3, 2012 at 2:16 PM, Yonik Seeley  wrote:
> On Tue, Jan 3, 2012 at 5:03 PM, Jason Rutherglen
>  wrote:
>> Yikes.  I'd love to see a test showing that un-inverted field cache
>> (which is for ALL segments as a single unit) can be used efficiently
>> with NRT / soft commit.
>
> Please stop being a troll.
> Solr has multiple faceting methods - only one uses the un-inverted field cache.
>
> Oh, and for the record, Solr does have a faceting method in trunk that
> caches per-segment.
> There are always tradeoffs though - string faceting per-segment will
> always be slower than string faceting over the complete index (due to
> the cost of merging per-segment counts).
>
> Anyway, disabling any of those caches won't make anything any
> faster... the data structures will still be built, they just won't be
> reused.
> Seems like you realized your original statement was erroneous and have
> just reverted to troll state, trying to find something to pick at.
>
> -Yonik
> http://www.lucidimagination.com


Re: How to accelerate your Solr-Lucene application by 4x

2012-01-18 Thread Jason Rutherglen
Steven,

If you are going to admonish people for advertising, it should be
equally dished out or not at all.

On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe  wrote:
> Hi Peter,
>
> Commercial solicitations are taboo here, except in the context of a request 
> for help that is directly relevant to a product or service.
>
> Please don’t do this again.
>
> Steve Rowe
>
> From: Peter Velikin [mailto:pe...@velobit.com]
> Sent: Wednesday, January 18, 2012 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: How to accelerate your Solr-Lucene application by 4x
>
> Hello Solr users,
>
> Did you know that you can boost the performance of your Solr application 
> using your existing servers? All you need is commodity SSD and plug-and-play 
> software like VeloBit.
>
> At ZoomInfo, a leading business information provider, VeloBit increased the 
> performance of the Solr-Lucene-powered application by 4x.
>
> I would love to tell you more about VeloBit and find out if we can deliver 
> same business benefits at your company. Click 
> here for a 15-minute 
> briefing on the VeloBit technology.
>
> Here is more information on how VeloBit helped ZoomInfo:
>
>  *   Increased Solr-Lucene performance by 4x using existing servers and 
> commodity SSD
>  *   Installed VeloBit plug-and-play SSD caching software in 5-minutes 
> transparent to running applications and storage infrastructure
>  *   Reduced by 75% the hardware and monthly operating costs required to 
> support service level agreements
>
> Technical Details:
>
>  *   Environment: Solr‐Lucene indexed directory search service fronted by 
> J2EE web application technology
>  *   Index size: 600 GB
>  *   Number of items indexed: 50 million
>  *   Primary storage: 6 x SAS HDD
>  *   SSD Cache: VeloBit software + OCZ Vertex 3
>
> Click here to read more 
> about the ZoomInfo Solr-Lucene case 
> study.
>
> You can also sign up for our Early Access Program and try VeloBit
> HyperCache for free.
>
> Also, feel free to write to me directly at 
> pe...@velobit.com.
>
> Best regards,
>
> Peter Velikin
> VP Online Marketing, VeloBit, Inc.
> pe...@velobit.com
> tel. 978-263-4800
> mob. 617-306-7165
> [Description: VeloBit with tagline]
> VeloBit provides plug & play SSD caching software that dramatically 
> accelerates applications at a remarkably low cost. The software installs 
> seamlessly in less than 10 minutes and automatically tunes for fastest 
> application speed. Visit www.velobit.com for details.


Re: How to accelerate your Solr-Lucene application by 4x

2012-01-18 Thread Jason Rutherglen
Steven,

Fun-NY...

17 hits for this spam:

http://search-lucene.com/?q=%22Performance+Monitoring+SaaS+for+Solr%22

Though this was already partially discussed with Chris @ fucit.org,
which, according to him, should have already been moved to Lucene
General.

On Wed, Jan 18, 2012 at 11:04 PM, Steven A Rowe  wrote:
> Why Jason, I declare, whatever do you mean?
>
>
>> -Original Message-
>> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
>> Sent: Wednesday, January 18, 2012 8:29 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to accelerate your Solr-Lucene application by 4x
>>
>> Steven,
>>
>> If you are going to admonish people for advertising, it should be
>> equally dished out or not at all.
>>
>> On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe  wrote:
>> > Hi Peter,
>> >
>> > Commercial solicitations are taboo here, except in the context of a
>> request for help that is directly relevant to a product or service.
>> >
>> > Please don’t do this again.
>> >
>> > Steve Rowe
>> >
>> > From: Peter Velikin [mailto:pe...@velobit.com]
>> > Sent: Wednesday, January 18, 2012 6:33 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: How to accelerate your Solr-Lucene application by 4x
>> >
>> > Hello Solr users,
>> >
>> > Did you know that you can boost the performance of your Solr application
>> using your existing servers? All you need is commodity SSD and plug-and-
>> play software like VeloBit.
>> >
>> > At ZoomInfo, a leading business information provider, VeloBit increased
>> the performance of the Solr-Lucene-powered application by 4x.
>> >
>> > I would love to tell you more about VeloBit and find out if we can
>> deliver same business benefits at your company. Click
>> here<http://www.velobit.com/15-minute-brief> for a 15-minute
>> briefing<http://www.velobit.com/15-minute-brief> on the VeloBit
>> technology.
>> >
>> > Here is more information on how VeloBit helped ZoomInfo:
>> >
>> >  *   Increased Solr-Lucene performance by 4x using existing servers and
>> commodity SSD
>> >  *   Installed VeloBit plug-and-play SSD caching software in 5-minutes
>> transparent to running applications and storage infrastructure
>> >  *   Reduced by 75% the hardware and monthly operating costs required to
>> support service level agreements
>> >
>> > Technical Details:
>> >
>> >  *   Environment: Solr‐Lucene indexed directory search service fronted
>> by J2EE web application technology
>> >  *   Index size: 600 GB
>> >  *   Number of items indexed: 50 million
>> >  *   Primary storage: 6 x SAS HDD
>> >  *   SSD Cache: VeloBit software + OCZ Vertex 3
>> >
>> > Click here<http://www.velobit.com/use-cases/enterprise-search/> to read
>> more about the ZoomInfo Solr-Lucene case study<http://www.velobit.com/use-
>> cases/enterprise-search/>.
>> >
>> > You can also sign up<http://www.velobit.com/early-access-program-
>> accelerate-application> for our Early Access
>> Program<http://www.velobit.com/early-access-program-accelerate-
>> application> and try VeloBit HyperCache for free.
>> >
>> > Also, feel free to write to me directly at
>> pe...@velobit.com<mailto:pe...@velobit.com>.
>> >
>> > Best regards,
>> >
>> > Peter Velikin
>> > VP Online Marketing, VeloBit, Inc.
>> > pe...@velobit.com<mailto:pe...@velobit.com>
>> > tel. 978-263-4800
>> > mob. 617-306-7165
>> > [Description: VeloBit with tagline]
>> > VeloBit provides plug & play SSD caching software that dramatically
>> accelerates applications at a remarkably low cost. The software installs
>> seamlessly in less than 10 minutes and automatically tunes for fastest
>> application speed. Visit www.velobit.com<http://www.velobit.com> for
>> details.


Re: How to accelerate your Solr-Lucene application by 4x

2012-01-19 Thread Jason Rutherglen
> 2. they always *follow* on-topic discussion

Not in the example given.

> 3. the line is blurry, e.g. nobody will object to including one's employer in 
> a tagline.

Product placement is not blurry.  The incentive is to then answer
someone else's user email, in order to post yet another spam'd product
placement ad, which is exactly what is happening.

Comparatively, Peter has only posted 1 time, whereas Otis' spam is
recurring; there is a large difference.

The Lucene user mailing list is being used as a free advertising forum
for a specific product (eg, it's not a company, employer, or event).
If you are fine with that, then your statements are contradictory.

On Thu, Jan 19, 2012 at 12:31 PM, Steven A Rowe  wrote:
> Jason,
>
> If I understand you correctly, you're referring to a thread 
> <http://search-lucene.com/m/iMCFOqzcmS1/%22Performance+Monitoring+SaaS+for+Solr%22/v=threaded>
>  in which you objected to a commercial tagline.
>
> At the time that thread was active, I didn't agree with you, though I didn't 
> engage in the conversation.  Here's why I didn't (and still don't) agree:
>
> 1. taglines are short;
> 2. they always *follow* on-topic discussion, and web page ads, Google search 
> results and Android apps, following in the long tradition of print periodical 
> practice, have taught me to accept -- and mostly ignore -- commercials on the 
> margins of digital content; and
> 3. the line is blurry, e.g. nobody will object to including one's employer in 
> a tagline.
>
> And to follow up on your point about fairness in application of Lucene's 
> (no-it's-not-written-down-and-no-we're-never-going-to-write-it-down-either) 
> commercial message policy, I'll echo Ted's observation that some commercial 
> messages (depending on content, tone and context) are acceptable.
>
> Steve
>
>> -Original Message-
>> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
>> Sent: Wednesday, January 18, 2012 11:33 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to accelerate your Solr-Lucene application by 4x
>>
>> Steven,
>>
>> Fun-NY...
>>
>> 17 hits for this spam:
>>
>> http://search-lucene.com/?q=%22Performance+Monitoring+SaaS+for+Solr%22
>>
>> Though this was already partially discussed with Chris @ fucu.org
>> which according to him, should have already been moved to Lucene
>> General.
>>
>> On Wed, Jan 18, 2012 at 11:04 PM, Steven A Rowe  wrote:
>> > Why Jason, I declare, whatever do you mean?
>> >
>> >
>> >> -Original Message-
>> >> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
>> >> Sent: Wednesday, January 18, 2012 8:29 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: How to accelerate your Solr-Lucene application by 4x
>> >>
>> >> Steven,
>> >>
>> >> If you are going to admonish people for advertising, it should be
>> >> equally dished out or not at all.
>> >>
>> >> On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe  wrote:
>> >> > Hi Peter,
>> >> >
>> >> > Commercial solicitations are taboo here, except in the context of a
>> >> request for help that is directly relevant to a product or service.
>> >> >
>> >> > Please don’t do this again.
>> >> >
>> >> > Steve Rowe
>> >> >
>> >> > From: Peter Velikin [mailto:pe...@velobit.com]
>> >> > Sent: Wednesday, January 18, 2012 6:33 PM
>> >> > To: solr-user@lucene.apache.org
>> >> > Subject: How to accelerate your Solr-Lucene application by 4x
>> >> >
>> >> > Hello Solr users,
>> >> >
>> >> > Did you know that you can boost the performance of your Solr
>> application
>> >> using your existing servers? All you need is commodity SSD and plug-
>> and-
>> >> play software like VeloBit.
>> >> >
>> >> > At ZoomInfo, a leading business information provider, VeloBit
>> increased
>> >> the performance of the Solr-Lucene-powered application by 4x.
>> >> >
>> >> > I would love to tell you more about VeloBit and find out if we can
>> >> deliver same business benefits at your company. Click
>> >> here<http://www.velobit.com/15-minute-brief> for a 15-minute
>> >> briefing<http://www.velobit.com/15-minute-brief> on the VeloBit
>> >> technology.

Re: anyone use hadoop+solr?

2010-06-22 Thread Jason Rutherglen
We (Attensity Group) have been using SOLR-1301 for 6+ months now
because we have a ready Hadoop cluster and need to be able to re/index
up to 3 billion docs.  I read the various emails and wasn't sure what
you're asking.

Cheers...

On Tue, Jun 22, 2010 at 8:27 AM, Neeb  wrote:
>
> Hey James,
>
> Just wondering if you ever had a chance to try out hadoop with solr? Would
> appreciate any information/directions you could give.
>
> I am particularly interested in indexing using a mapreduce job.
>
> Cheers,
> -Ali
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/anyone-use-hadoop-solr-tp485333p914450.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr with hadoop

2010-07-06 Thread Jason Rutherglen
> If you do distributed indexing correctly, what about updating the documents
> and what about replicating them correctly?

Yes, you can do that and it'll work great.

On Mon, Jul 5, 2010 at 7:42 AM, MitchK  wrote:
>
> I need to revive this discussion...
>
> If you do distributed indexing correctly, what about updating the documents
> and what about replicating them correctly?
>
> Does this work? Or wasn't this an issue?
>
> Kind regards
> - Mitch
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/solr-with-hadoop-tp482688p944413.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Total number of terms in an index?

2010-07-26 Thread Jason Rutherglen
What's the fastest way to obtain the total number of docs from the
index?  (The Luke request handler takes a long time to load so I'm
looking for something else).


Re: Total number of terms in an index?

2010-07-26 Thread Jason Rutherglen
Sorry, like the subject, I mean the total number of terms.

On Mon, Jul 26, 2010 at 4:03 PM, Jason Rutherglen
 wrote:
> What's the fastest way to obtain the total number of docs from the
> index?  (The Luke request handler takes a long time to load so I'm
> looking for something else).
>


Re: Total number of terms in an index?

2010-07-28 Thread Jason Rutherglen
Tom,

The total number of terms... Ah well, not a big deal, however yes the
flex branch does expose this so we can show this in Solr at some
point, hopefully outside of Solr's Luke impl.

On Tue, Jul 27, 2010 at 9:27 AM, Burton-West, Tom  wrote:
> Hi Jason,
>
> Are you looking for the total number of unique terms or total number of term 
> occurrences?
>
> Checkindex reports both, but does a bunch of other work so is probably not 
> the fastest.
>
> If you are looking for total number of term occurrences, you might look at 
> contrib/org/apache/lucene/misc/HighFreqTerms.java.
>
> If you are just looking for the total number of unique terms, I wonder if 
> there is some low level API that would allow you to just access the in-memory 
> representation of the tii file and then multiply the number of terms in it by 
> your indexDivisor (default 128). I haven't dug in to the code so I don't 
> actually know how the tii file gets loaded into a data structure in memory.  
> If there is api access, it seems like this might be the quickest way to get 
> the number of unique terms.  (Of course you would have to do this for each 
> segment).
>
> Tom
> -Original Message-
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Monday, July 26, 2010 8:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Total number of terms in an index?
>
>
> : Sorry, like the subject, I mean the total number of terms.
>
> it's not stored anywhere, so the only way to fetch it is to actually
> iterate all of the terms and count them (that's why LukeRequestHandler is
> so slow to compute this particular value)
>
> If i remember right, someone mentioned at one point that flex would let
> you store data about stuff like this in your index as part of the segment
> writing, but frankly i'm still not sure how that will help -- because
> unless your index is fully optimized, you still have to iterate the terms
> in each segment to 'de-dup' them.
>
>
> -Hoss
>
>
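
For anyone finding this thread later: the flex APIs that shipped in Lucene 4.x
expose a per-segment unique term count directly (Terms.size()), so a rough number
can be had without a full Luke-style walk. Hoss's caveat still holds, though -- the
per-segment counts can't simply be added, because the same term may occur in
several segments. A sketch against the 4.x API (class and argument handling are
illustrative):

import java.io.File;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.FSDirectory;

public class TermCountSketch {
    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
        String field = args[1];
        long total = 0;
        for (AtomicReaderContext leaf : reader.leaves()) {
            Terms terms = leaf.reader().terms(field);
            if (terms != null) {
                long size = terms.size();   // unique terms in this segment, or -1 if unknown
                total += Math.max(size, 0);
            }
        }
        System.out.println("Sum of per-segment unique term counts for '" + field + "': " + total);
        reader.close();
    }
}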


Re: Auto Suggest

2010-09-02 Thread Jason Rutherglen
I'm having a different issue with the EdgeNGram technique described
here: 
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

That is one word queries q=app on the query_text field, work fine
however "q=app mou" do not.  Why would this be or is there a
configuration that could be missing?

On Wed, Sep 1, 2010 at 3:53 PM, Eric Grobler  wrote:
> Thanks for your feedback Robert,
>
> I will try that and see how Solr performs on my data - I think I will create
> a field that contains only important key/product terms from the text.
>
> Regards
> Johan
>
> On Wed, Sep 1, 2010 at 9:12 PM, Robert Petersen  wrote:
>
>> We don't have that many, just a hundred thousand, and solr response
>> times (since the index's docs are small and not complex) are logged as
>> typically 1 ms if not 0 ms.  It's funny but sometimes it is so fast no
>> milliseconds have elapsed.  Incredible if you ask me...  :)
>>
>> Once you get SOLR to consider the whole phrase as just one big term, the
>> wildcard is very fast.
>>
>> -Original Message-
>> From: Eric Grobler [mailto:impalah...@googlemail.com]
>> Sent: Wednesday, September 01, 2010 12:35 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Auto Suggest
>>
>> Hi Robert,
>>
>> Interesting approach, how many documents do you have in Solr?
>> I have about 2 million and I just wonder if it might be a bit slow.
>>
>> Regards
>> Johan
>>
>> On Wed, Sep 1, 2010 at 7:38 PM, Robert Petersen 
>> wrote:
>>
>> > I do this by replacing the spaces with a '%' in a separate search
>> field
>> > which is not parsed nor tokenized and then you can wildcard across the
>> > whole phrase like you want and the spaces don't mess you up.  Just
>> store
>> > the original phrase with spaces in a separate field for returning to
>> the
>> > front end for display.
>> >
>> > -Original Message-
>> > From: Jazz Globe [mailto:jazzgl...@hotmail.com]
>> > Sent: Wednesday, September 01, 2010 7:33 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Auto Suggest
>> >
>> >
>> > Hallo
>> >
>> > How would one implement a multiple term auto-suggest feature in Solr
>> > that is filter sensitive?
>> > For example, a user enters :
>> > "mp3"
>> >  and solr might suggest:
>> >  ->   "mp3 player"
>> >  ->   "mp3 nano"
>> >  ->   "mp3 sony"
>> > and then the user starts the second word :
>> > "mp3 n"
>> > and that narrows it down to:
>> >  -> "mp3 nano"
>> >
>> > I had a quick look at the Terms Component.
>> > I suppose it just returns term totals for the entire index and cannot
>> be
>> > used with a filter or query?
>> >
>> > Thanks
>> > Johan
>> >
>> >
>> >
>>
>


Re: Auto Suggest

2010-09-03 Thread Jason Rutherglen
Analysis returns "app mou".

On Thu, Sep 2, 2010 at 6:12 PM, Lance Norskog  wrote:
> What does analysis.jsp show?
>
> On Thu, Sep 2, 2010 at 5:53 AM, Jason Rutherglen
>  wrote:
>> I'm having a different issue with the EdgeNGram technique described
>> here: 
>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>
>> That is one word queries q=app on the query_text field, work fine
>> however "q=app mou" do not.  Why would this be or is there a
>> configuration that could be missing?
>>
>> On Wed, Sep 1, 2010 at 3:53 PM, Eric Grobler  
>> wrote:
>>> Thanks for your feedback Robert,
>>>
>>> I will try that and see how Solr performs on my data - I think I will create
>>> a field that contains only important key/product terms from the text.
>>>
>>> Regards
>>> Johan
>>>
>>> On Wed, Sep 1, 2010 at 9:12 PM, Robert Petersen  wrote:
>>>
>>>> We don't have that many, just a hundred thousand, and solr response
>>>> times (since the index's docs are small and not complex) are logged as
>>>> typically 1 ms if not 0 ms.  It's funny but sometimes it is so fast no
>>>> milliseconds have elapsed.  Incredible if you ask me...  :)
>>>>
>>>> Once you get SOLR to consider the whole phrase as just one big term, the
>>>> wildcard is very fast.
>>>>
>>>> -Original Message-
>>>> From: Eric Grobler [mailto:impalah...@googlemail.com]
>>>> Sent: Wednesday, September 01, 2010 12:35 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Auto Suggest
>>>>
>>>> Hi Robert,
>>>>
>>>> Interesting approach, how many documents do you have in Solr?
>>>> I have about 2 million and I just wonder if it might be a bit slow.
>>>>
>>>> Regards
>>>> Johan
>>>>
>>>> On Wed, Sep 1, 2010 at 7:38 PM, Robert Petersen 
>>>> wrote:
>>>>
>>>> > I do this by replacing the spaces with a '%' in a separate search
>>>> field
>>>> > which is not parsed nor tokenized and then you can wildcard across the
>>>> > whole phrase like you want and the spaces don't mess you up.  Just
>>>> store
>>>> > the original phrase with spaces in a separate field for returning to
>>>> the
>>>> > front end for display.
>>>> >
>>>> > -Original Message-
>>>> > From: Jazz Globe [mailto:jazzgl...@hotmail.com]
>>>> > Sent: Wednesday, September 01, 2010 7:33 AM
>>>> > To: solr-user@lucene.apache.org
>>>> > Subject: Auto Suggest
>>>> >
>>>> >
>>>> > Hallo
>>>> >
>>>> > How would one implement a multiple term auto-suggest feature in Solr
>>>> > that is filter sensitive?
>>>> > For example, a user enters :
>>>> > "mp3"
>>>> >  and solr might suggest:
>>>> >  ->   "mp3 player"
>>>> >  ->   "mp3 nano"
>>>> >  ->   "mp3 sony"
>>>> > and then the user starts the second word :
>>>> > "mp3 n"
>>>> > and that narrows it down to:
>>>> >  -> "mp3 nano"
>>>> >
>>>> > I had a quick look at the Terms Component.
>>>> > I suppose it just returns term totals for the entire index and cannot
>>>> be
>>>> > used with a filter or query?
>>>> >
>>>> > Thanks
>>>> > Johan
>>>> >
>>>> >
>>>> >
>>>>
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Auto Suggest

2010-09-03 Thread Jason Rutherglen
To clarify, the query analyzer returns that.  Variations such as
"apple mou" also do not return anything.  Maybe Jay can comment and
then we can amend the article?

On Fri, Sep 3, 2010 at 6:12 AM, Jason Rutherglen
 wrote:
> Analysis returns "app mou".
>
> On Thu, Sep 2, 2010 at 6:12 PM, Lance Norskog  wrote:
>> What does analysis.jsp show?
>>
>> On Thu, Sep 2, 2010 at 5:53 AM, Jason Rutherglen
>>  wrote:
>>> I'm having a different issue with the EdgeNGram technique described
>>> here: 
>>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>>
>>> That is one word queries q=app on the query_text field, work fine
>>> however "q=app mou" do not.  Why would this be or is there a
>>> configuration that could be missing?
>>>
>>> On Wed, Sep 1, 2010 at 3:53 PM, Eric Grobler  
>>> wrote:
>>>> Thanks for your feedback Robert,
>>>>
>>>> I will try that and see how Solr performs on my data - I think I will 
>>>> create
>>>> a field that contains only important key/product terms from the text.
>>>>
>>>> Regards
>>>> Johan
>>>>
>>>> On Wed, Sep 1, 2010 at 9:12 PM, Robert Petersen  wrote:
>>>>
>>>>> We don't have that many, just a hundred thousand, and solr response
>>>>> times (since the index's docs are small and not complex) are logged as
>>>>> typically 1 ms if not 0 ms.  It's funny but sometimes it is so fast no
>>>>> milliseconds have elapsed.  Incredible if you ask me...  :)
>>>>>
>>>>> Once you get SOLR to consider the whole phrase as just one big term, the
>>>>> wildcard is very fast.
>>>>>
>>>>> -Original Message-
>>>>> From: Eric Grobler [mailto:impalah...@googlemail.com]
>>>>> Sent: Wednesday, September 01, 2010 12:35 PM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Auto Suggest
>>>>>
>>>>> Hi Robert,
>>>>>
>>>>> Interesting approach, how many documents do you have in Solr?
>>>>> I have about 2 million and I just wonder if it might be a bit slow.
>>>>>
>>>>> Regards
>>>>> Johan
>>>>>
>>>>> On Wed, Sep 1, 2010 at 7:38 PM, Robert Petersen 
>>>>> wrote:
>>>>>
>>>>> > I do this by replacing the spaces with a '%' in a separate search
>>>>> field
>>>>> > which is not parsed nor tokenized and then you can wildcard across the
>>>>> > whole phrase like you want and the spaces don't mess you up.  Just
>>>>> store
>>>>> > the original phrase with spaces in a separate field for returning to
>>>>> the
>>>>> > front end for display.
>>>>> >
>>>>> > -Original Message-
>>>>> > From: Jazz Globe [mailto:jazzgl...@hotmail.com]
>>>>> > Sent: Wednesday, September 01, 2010 7:33 AM
>>>>> > To: solr-user@lucene.apache.org
>>>>> > Subject: Auto Suggest
>>>>> >
>>>>> >
>>>>> > Hallo
>>>>> >
>>>>> > How would one implement a multiple term auto-suggest feature in Solr
>>>>> > that is filter sensitive?
>>>>> > For example, a user enters :
>>>>> > "mp3"
>>>>> >  and solr might suggest:
>>>>> >  ->   "mp3 player"
>>>>> >  ->   "mp3 nano"
>>>>> >  ->   "mp3 sony"
>>>>> > and then the user starts the second word :
>>>>> > "mp3 n"
>>>>> > and that narrows it down to:
>>>>> >  -> "mp3 nano"
>>>>> >
>>>>> > I had a quick look at the Terms Component.
>>>>> > I suppose it just returns term totals for the entire index and cannot
>>>>> be
>>>>> > used with a filter or query?
>>>>> >
>>>>> > Thanks
>>>>> > Johan
>>>>> >
>>>>> >
>>>>> >
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>


Re: Auto Suggest

2010-09-04 Thread Jason Rutherglen
Dan,

Thanks... I wasn't clear in the original email what the issue is.
It's the fact that when multiple terms are in the query, no results
are returned.

Thanks

On Fri, Sep 3, 2010 at 8:33 AM, dan sutton  wrote:
> I set this up a few years ago with something like the following:
>
> <fieldType class="solr.TextField">
>     <analyzer type="index">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9])" replacement="" replace="all" />
>         <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="1" />
>     </analyzer>
>     <analyzer type="query">
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9])" replacement="" replace="all" />
>     </analyzer>
> </fieldType>
>
> The PatternReplaceFilterFactory (replacement="" replace="all") is the bit missing i think here
>
> This way the search is agnostic to case and any non-alphanum chars; this was
> to facilitate a location autocomplete for searching.
>
> So it was a basic search, returning the top N results along with additional
> info to show in the autocomplete to our mod_perl servers. Results were
> cached in the mod_perl servers.
>
> Regards,
> Dan
>
> On Thu, Sep 2, 2010 at 1:53 PM, Jason Rutherglen > wrote:
>
>> I'm having a different issue with the EdgeNGram technique described
>> here:
>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>
>> That is one word queries q=app on the query_text field, work fine
>> however "q=app mou" do not.  Why would this be or is there a
>> configuration that could be missing?
>>
>> On Wed, Sep 1, 2010 at 3:53 PM, Eric Grobler 
>> wrote:
>> > Thanks for your feedback Robert,
>> >
>> > I will try that and see how Solr performs on my data - I think I will
>> create
>> > a field that contains only important key/product terms from the text.
>> >
>> > Regards
>> > Johan
>> >
>> > On Wed, Sep 1, 2010 at 9:12 PM, Robert Petersen 
>> wrote:
>> >
>> >> We don't have that many, just a hundred thousand, and solr response
>> >> times (since the index's docs are small and not complex) are logged as
>> >> typically 1 ms if not 0 ms.  It's funny but sometimes it is so fast no
>> >> milliseconds have elapsed.  Incredible if you ask me...  :)
>> >>
>> >> Once you get SOLR to consider the whole phrase as just one big term, the
>> >> wildcard is very fast.
>> >>
>> >> -Original Message-
>> >> From: Eric Grobler [mailto:impalah...@googlemail.com]
>> >> Sent: Wednesday, September 01, 2010 12:35 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: Auto Suggest
>> >>
>> >> Hi Robert,
>> >>
>> >> Interesting approach, how many documents do you have in Solr?
>> >> I have about 2 million and I just wonder if it might be a bit slow.
>> >>
>> >> Regards
>> >> Johan
>> >>
>> >> On Wed, Sep 1, 2010 at 7:38 PM, Robert Petersen 
>> >> wrote:
>> >>
>> >> > I do this by replacing the spaces with a '%' in a separate search
>> >> field
>> >> > which is not parsed nor tokenized and then you can wildcard across the
>> >> > whole phrase like you want and the spaces don't mess you up.  Just
>> >> store
>> >> > the original phrase with spaces in a separate field for returning to
>> >> the
>> >> > front end for display.
>> >> >
>> >> > -Original Message-
>> >> > From: Jazz Globe [mailto:jazzgl...@hotmail.com]
>> >> > Sent: Wednesday, September 01, 2010 7:33 AM
>> >> > To: solr-user@lucene.apache.org
>> >> > Subject: Auto Suggest
>> >> >
>> >> >
>> >> > Hallo
>> >> >
>> >> > How would one implement a multiple term auto-suggest feature in Solr
>> >> > that is filter sensitive?
>> >> > For example, a user enters :
>> >> > "mp3"
>> >> >  and solr might suggest:
>> >> >  ->   "mp3 player"
>> >> >  ->   "mp3 nano"
>> >> >  ->   "mp3 sony"
>> >> > and then the user starts the second word :
>> >> > "mp3 n"
>> >> > and that narrows it down to:
>> >> >  -> "mp3 nano"
>> >> >
>> >> > I had a quick look at the Terms Component.
>> >> > I suppose it just returns term totals for the entire index and cannot
>> >> be
>> >> > used with a filter or query?
>> >> >
>> >> > Thanks
>> >> > Johan
>> >> >
>> >> >
>> >> >
>> >>
>> >
>>
>


Re: Auto Suggest

2010-09-04 Thread Jason Rutherglen
Luke,

Thanks.  What happens if there are 3 terms?  It seems like the entire
query can go into facet.prefix?
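
For reference, Luke's query expressed as a SolrJ sketch: everything before the last
space stays in q and only the trailing fragment goes into facet.prefix, so a
three-term input like "apple mouse pa" becomes q="apple mouse" with facet.prefix=pa.
The field name and qt come from his example; the splitting logic is illustrative:

import org.apache.solr.client.solrj.SolrQuery;

public class SuggestQuerySketch {
    // Build an auto-suggest query from partial user input, e.g. "apple mou".
    public static SolrQuery build(String input) {
        int lastSpace = input.lastIndexOf(' ');
        String completed = lastSpace < 0 ? ""    : input.substring(0, lastSpace);
        String partial   = lastSpace < 0 ? input : input.substring(lastSpace + 1);

        SolrQuery q = new SolrQuery(completed.isEmpty() ? "*:*" : completed);
        q.set("qt", "basic");            // request handler from Luke's example
        q.setRows(0);
        q.setFacet(true);
        q.addFacetField("term_suggest");
        q.setFacetMinCount(1);
        q.setFacetLimit(10);
        if (!partial.isEmpty()) {
            q.setFacetPrefix("term_suggest", partial);
        }
        return q;
    }
}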

On Fri, Sep 3, 2010 at 8:05 AM, Luke Tebbs  wrote:
> What about if you do something like this? -
>
> facet=true&facet.mincount=1&q=apple&facet.limit=10&facet.prefix=mou&facet.field=term_suggest&qt=basic&wt=javabin&rows=0&version=1
>
>
> Jason Rutherglen wrote:
>>
>> To clarify, the query analyzer returns that.  Variations such as
>> "apple mou" also do not return anything.  Maybe Jay can comment and
>> then we can amend the article?
>>
>> On Fri, Sep 3, 2010 at 6:12 AM, Jason Rutherglen
>>  wrote:
>>
>>>
>>> Analysis returns "app mou".
>>>
>>> On Thu, Sep 2, 2010 at 6:12 PM, Lance Norskog  wrote:
>>>
>>>>
>>>> What does analysis.jsp show?
>>>>
>>>> On Thu, Sep 2, 2010 at 5:53 AM, Jason Rutherglen
>>>>  wrote:
>>>>
>>>>>
>>>>> I'm having a different issue with the EdgeNGram technique described
>>>>> here:
>>>>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>>>>
>>>>> That is one word queries q=app on the query_text field, work fine
>>>>> however "q=app mou" do not.  Why would this be or is there a
>>>>> configuration that could be missing?
>>>>>
>>>>> On Wed, Sep 1, 2010 at 3:53 PM, Eric Grobler
>>>>>  wrote:
>>>>>
>>>>>>
>>>>>> Thanks for your feedback Robert,
>>>>>>
>>>>>> I will try that and see how Solr performs on my data - I think I will
>>>>>> create
>>>>>> a field that contains only important key/product terms from the text.
>>>>>>
>>>>>> Regards
>>>>>> Johan
>>>>>>
>>>>>> On Wed, Sep 1, 2010 at 9:12 PM, Robert Petersen 
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> We don't have that many, just a hundred thousand, and solr response
>>>>>>> times (since the index's docs are small and not complex) are logged
>>>>>>> as
>>>>>>> typically 1 ms if not 0 ms.  It's funny but sometimes it is so fast
>>>>>>> no
>>>>>>> milliseconds have elapsed.  Incredible if you ask me...  :)
>>>>>>>
>>>>>>> Once you get SOLR to consider the whole phrase as just one big term,
>>>>>>> the
>>>>>>> wildcard is very fast.
>>>>>>>
>>>>>>> -Original Message-
>>>>>>> From: Eric Grobler [mailto:impalah...@googlemail.com]
>>>>>>> Sent: Wednesday, September 01, 2010 12:35 PM
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: Re: Auto Suggest
>>>>>>>
>>>>>>> Hi Robert,
>>>>>>>
>>>>>>> Interesting approach, how many documents do you have in Solr?
>>>>>>> I have about 2 million and I just wonder if it might be a bit slow.
>>>>>>>
>>>>>>> Regards
>>>>>>> Johan
>>>>>>>
>>>>>>> On Wed, Sep 1, 2010 at 7:38 PM, Robert Petersen 
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> I do this by replacing the spaces with a '%' in a separate search
>>>>>>>>
>>>>>>>
>>>>>>> field
>>>>>>>
>>>>>>>>
>>>>>>>> which is not parsed nor tokenized and then you can wildcard across
>>>>>>>> the
>>>>>>>> whole phrase like you want and the spaces don't mess you up.  Just
>>>>>>>>
>>>>>>>
>>>>>>> store
>>>>>>>
>>>>>>>>
>>>>>>>> the original phrase with spaces in a separate field for returning to
>>>>>>>>
>>>>>>>
>>>>>>> the
>>>>>>>
>>>>>>>>
>>>>>>>> front end for display.
>>>>>>>>
>>>>>>>> -Original Message-
>>>>>>>> From: Jazz Globe [mailto:jazzgl...@hotmail.com]
>>>>>>>> Sent: Wednesday, September 01, 2010 7:33 AM
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Auto Suggest
>>>>>>>>
>>>>>>>>
>>>>>>>> Hallo
>>>>>>>>
>>>>>>>> How would one implement a multiple term auto-suggest feature in Solr
>>>>>>>> that is filter sensitive?
>>>>>>>> For example, a user enters :
>>>>>>>> "mp3"
>>>>>>>>  and solr might suggest:
>>>>>>>>  ->   "mp3 player"
>>>>>>>>  ->   "mp3 nano"
>>>>>>>>  ->   "mp3 sony"
>>>>>>>> and then the user starts the second word :
>>>>>>>> "mp3 n"
>>>>>>>> and that narrows it down to:
>>>>>>>>  -> "mp3 nano"
>>>>>>>>
>>>>>>>> I had a quick look at the Terms Component.
>>>>>>>> I suppose it just returns term totals for the entire index and
>>>>>>>> cannot
>>>>>>>>
>>>>>>>
>>>>>>> be
>>>>>>>
>>>>>>>>
>>>>>>>> used with a filter or query?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Johan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>
>>>> --
>>>> Lance Norskog
>>>> goks...@gmail.com
>>>>
>>>>
>
>


Re: Solr + Katta ... benefits?

2010-09-04 Thread Jason Rutherglen
Katta can be used for managing shards that are built and live in HDFS.

On Fri, Sep 3, 2010 at 10:29 AM, thiseye  wrote:
>
> I'm investigating using Lucene for a project to index a massive HBase
> database. I was looking at using Katta to distribute the index because
> people have said that becomes a limitation with simply using Lucene as the
> index grows. Then I came across Solr which seems like it would also help
> this project. But I saw a brief reference that Solr does distributed
> indexing using shards ... so my question is why are there so many references
> to people using Solr + Katta if Solr natively does sharding? Does Katta do
> it better? Are there limitations with Solr's methodology? This is unclear to
> me based on my research. What am I missing?
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Katta-benefits-tp1413640p1413640.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Jason Rutherglen
Peter,

Are you using per-segment faceting, eg, SOLR-1617?  That could help
your situation.

On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge  wrote:
> Hi,
>
> Below are some notes regarding Solr cache tuning that should prove
> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>
> Environment:
> Solr 1.4.1 or branch_3x trunk.
> Note the 4.x trunk has lots of neat new features, so the notes here
> are likely less relevant to the 4.x environment.
>
> Overview:
> Our Solr environment makes extensive use of faceting, we perform
> commits every 30secs, and the indexes tend be on the large-ish side
> (>20million docs).
> Note: For our data, when we commit, we are always adding new data,
> never changing existing data.
> This type of environment can be tricky to tune, as Solr is more geared
> toward fast reads than frequent writes.
>
> Symptoms:
> If anyone has used faceting in searches where you are also performing
> frequent commits, you've likely encountered the dreaded OutOfMemory or
> GC Overhead Exceeded errors.
> In high commit rate environments, this is almost always due to
> multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
> finish autowarming their caches before the next commit()
> comes along and invalidates them.
> Once this starts happening on a regular basis, it is likely your
> Solr's JVM will run out of memory eventually, as the number of
> searchers (and their cache arrays) will keep growing until the JVM
> dies of thirst.
> To check if your Solr environment is suffering from this, turn on INFO
> level logging, and look for: 'PERFORMANCE WARNING: Overlapping
> onDeckSearchers=x'.
>
> In tests, we've only ever seen this problem when using faceting, and
> facet.method=fc.
>
> Some solutions to this are:
>    Reduce the commit rate to allow searchers to fully warm before the
> next commit
>    Reduce or eliminate the autowarming in caches
>    Both of the above
>
> The trouble is, if you're doing NRT commits, you likely have a good
> reason for it, and reducing/eliminating autowarming will very
> significantly impact search performance in high commit rate
> environments.
>
> Solution:
> Here are some setup steps we've used that allow lots of faceting (we
> typically search with at least 20-35 different facet fields, and date
> faceting/sorting) on large indexes, and still keep decent search
> performance:
>
> 1. Firstly, you should consider using the enum method for facet
> searches (facet.method=enum) unless you've got A LOT of memory on your
> machine. In our tests, this method uses a lot less memory and
> autowarms more quickly than fc. (Note, I've not tried the new
> segment-based 'fcs' option, as I can't find support for it in
> branch_3x - looks nice for 4.x though)
> Admittedly, for our data, enum is not quite as fast for searching as
> fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
> tradeoff.
> If you do have access to LOTS of memory, AND you can guarantee that
> the index won't grow beyond the memory capacity (i.e. you have some
> sort of deletion policy in place), fc can be a lot faster than enum
> when searching with lots of facets across many terms.
>
> 2. Secondly, we've found that LRUCache is faster at autowarming than
> FastLRUCache - in our tests, about 20% faster. Maybe this is just our
> environment - your mileage may vary.
>
> So, our filterCache section in solrconfig.xml looks like this:
>     <filterCache class="solr.LRUCache"
>       size="3600"
>       initialSize="1400"
>       autowarmCount="3600"/>
>
> For a 28GB index, running in a quad-core x64 VMWare instance, 30
> warmed facet fields, Solr is running at ~4GB. Stats filterCache size
> shows usually in the region of ~2400.
>
> 3. It's also a good idea to have some sort of
> firstSearcher/newSearcher event listener queries to allow new data to
> populate the caches.
> Of course, what you put in these is dependent on the facets you need/use.
> We've found a good combination is a firstSearcher with as many facets
> in the search as your environment can handle, then a subset of the
> most common facets for the newSearcher.
>
> 4. We also set:
>   true
> just in case.
>
> 5. Another key area for search performance with high commits is to use
> 2 Solr instances - one for the high commit rate indexing, and one for
> searching.
> The read-only searching instance can be a remote replica, or a local
> read-only instance that reads the same core as the indexing instance
> (for the latter, you'll need something that periodically refreshes -
> i.e. runs commit()).
> This way, you can tune the indexing instance for writing performance
> and the searching instance as above for max read performance.
>
> Using the setup above, we get fantastic searching speed for small
> facet sets (well under 1sec), and really good searching for large
> facet sets (a couple of secs depending on index size, number of
> facets, unique terms etc. etc.),
> even when searching against largeish indexes (>20

Re: Tuning Solr caches with high commit rates (NRT)

2010-09-12 Thread Jason Rutherglen
Yeah there's no patch... I think Yonik can write it. :-)  Yah... The
Lucene version shouldn't matter.  The distributed faceting
theoretically can easily be applied to multiple segments; however, the
way it's written is a challenge for me to untangle and apply
successfully in a working patch.  Also I don't have this as an itch to
scratch at the moment.

On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge  wrote:
> Hi Jason,
>
> I've tried some limited testing with the 4.x trunk using fcs, and I
> must say, I really like the idea of per-segment faceting.
> I was hoping to see it in 3.x, but I don't see this option in the
> branch_3x trunk. Is your SOLR-1606 patch referred to in SOLR-1617 the
> one to use with 3.1?
> There seems to be a number of Solr issues tied to this - one of them
> being Lucene-1785. Can the per-segment faceting patch work with Lucene
> 2.9/branch_3x?
>
> Thanks,
> Peter
>
>
>
> On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
>  wrote:
>> Peter,
>>
>> Are you using per-segment faceting, eg, SOLR-1617?  That could help
>> your situation.
>>
>> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge  
>> wrote:
>>> Hi,
>>>
>>> Below are some notes regarding Solr cache tuning that should prove
>>> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>>>
>>> Environment:
>>> Solr 1.4.1 or branch_3x trunk.
>>> Note the 4.x trunk has lots of neat new features, so the notes here
>>> are likely less relevant to the 4.x environment.
>>>
>>> Overview:
>>> Our Solr environment makes extensive use of faceting, we perform
>>> commits every 30secs, and the indexes tend be on the large-ish side
>>> (>20million docs).
>>> Note: For our data, when we commit, we are always adding new data,
>>> never changing existing data.
>>> This type of environment can be tricky to tune, as Solr is more geared
>>> toward fast reads than frequent writes.
>>>
>>> Symptoms:
>>> If anyone has used faceting in searches where you are also performing
>>> frequent commits, you've likely encountered the dreaded OutOfMemory or
>>> GC Overhead Exceeded errors.
>>> In high commit rate environments, this is almost always due to
>>> multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
>>> finish autowarming their caches before the next commit()
>>> comes along and invalidates them.
>>> Once this starts happening on a regular basis, it is likely your
>>> Solr's JVM will run out of memory eventually, as the number of
>>> searchers (and their cache arrays) will keep growing until the JVM
>>> dies of thirst.
>>> To check if your Solr environment is suffering from this, turn on INFO
>>> level logging, and look for: 'PERFORMANCE WARNING: Overlapping
>>> onDeckSearchers=x'.
>>>
>>> In tests, we've only ever seen this problem when using faceting, and
>>> facet.method=fc.
>>>
>>> Some solutions to this are:
>>>    Reduce the commit rate to allow searchers to fully warm before the
>>> next commit
>>>    Reduce or eliminate the autowarming in caches
>>>    Both of the above
>>>
>>> The trouble is, if you're doing NRT commits, you likely have a good
>>> reason for it, and reducing/eliminating autowarming will very
>>> significantly impact search performance in high commit rate
>>> environments.
>>>
>>> Solution:
>>> Here are some setup steps we've used that allow lots of faceting (we
>>> typically search with at least 20-35 different facet fields, and date
>>> faceting/sorting) on large indexes, and still keep decent search
>>> performance:
>>>
>>> 1. Firstly, you should consider using the enum method for facet
>>> searches (facet.method=enum) unless you've got A LOT of memory on your
>>> machine. In our tests, this method uses a lot less memory and
>>> autowarms more quickly than fc. (Note, I've not tried the new
>>> segment-based 'fcs' option, as I can't find support for it in
>>> branch_3x - looks nice for 4.x though)
>>> Admittedly, for our data, enum is not quite as fast for searching as
>>> fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
>>> tradeoff.
>>> If you do have access to LOTS of memory, AND you can guarantee that
>>> the index won't grow beyond the memory capacity (i.e. you have some
>

Shingle filter factory and the min shingles

2010-09-14 Thread Jason Rutherglen

I'm using for a field, indexing, then looking at the terms component.
I'm seeing shingles that consist of only 2 terms, whereas I'm
expecting all the shingles to be at least 4 terms... What's up?  Thanks.
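
A rough sketch of the kind of analyzer chain being described (the field type
name and tokenizer are assumptions; only the stop filter, maxShingleSize="4"
and outputUnigrams="false" survive in the quoted config below, and
minShingleSize="4" is what the discussion implies):

    <fieldType name="shingle4" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.ShingleFilterFactory" minShingleSize="4" maxShingleSize="4"
                outputUnigrams="false"/>
      </analyzer>
    </fieldType>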


Re: Shingle filter factory and the min shingles

2010-09-14 Thread Jason Rutherglen
To answer my own question, and this sucks :)  the minShingleSize isn't
set in at least 1.4.2.  I'm guessing it's supported in a later version though?

On Tue, Sep 14, 2010 at 5:49 PM, Jason Rutherglen
 wrote:
>  positionIncrementGap="100">
> 
> 
> 
>  words="stopwords.txt"/>
>  maxShingleSize="4" outputUnigrams="false"/>
> 
> 
> 
>
>
> I'm using for a field, indexing, then looking at the terms component.
> I'm seeing shingles that consist of only 2 terms, whereas I'm
> expecting all the terms to be at least 4 terms... What's up?  Thanks.
>


Re: Shingle filter factory and the min shingles

2010-09-14 Thread Jason Rutherglen
And here's the issue... https://issues.apache.org/jira/browse/SOLR-1740

On Tue, Sep 14, 2010 at 6:08 PM, Jason Rutherglen
 wrote:
> To answer my own question, and this sucks :)  the minShingleSize isn't
> set in at least 1.4.2.  I'm guessing a later version though?
>
> On Tue, Sep 14, 2010 at 5:49 PM, Jason Rutherglen
>  wrote:
>> > positionIncrementGap="100">
>> 
>> 
>> 
>> > words="stopwords.txt"/>
>> > maxShingleSize="4" outputUnigrams="false"/>
>> 
>> 
>> 
>>
>>
>> I'm using for a field, indexing, then looking at the terms component.
>> I'm seeing shingles that consist of only 2 terms, whereas I'm
>> expecting all the terms to be at least 4 terms... What's up?  Thanks.
>>
>


Re: Can I tell Solr to merge segments more slowly on an I/O starved system?

2010-09-19 Thread Jason Rutherglen
Ron,

IO throttling was discussed a while back however I don't think it was
implemented.  For systems that search on indexes where indexing is
happening on the same server, reducing IO contention would be useful.
Here is a somewhat similar issue for merging segments:
https://issues.apache.org/jira/browse/LUCENE-2164 however search +
indexing IO throttling would need to occur at the Directory level
whereas this patch has implemented its solution around thread
priorities.
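
For illustration only, a minimal Lucene 3.x-style sketch of the thread-priority
approach (it softens CPU contention from merges rather than throttling IO at
the Directory level; the path and analyzer are placeholders):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class LowPriorityMergeExample {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        cms.setMaxThreadCount(1);                          // one concurrent merge at a time
        cms.setMergeThreadPriority(Thread.MIN_PRIORITY);   // let query threads win the CPU
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        cfg.setMergeScheduler(cms);
        IndexWriter writer = new IndexWriter(dir, cfg);
        // ... add/update documents ...
        writer.close();
      }
    }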

Jason

On Sun, Sep 19, 2010 at 12:04 AM, Ron Mayer  wrote:
> My system which has documents being added pretty much
> continually seems pretty well behaved except, it seems,
> when large segments get merged.     During that time
> the system starts really dragging, and queries that took
> only a couple seconds are taking dozens.
>
> Some other I/O bound servers seem to have features
> that let you throttle how much I/O they take for administrative
> background tasks -- for example PostgreSQL's "vacuum_cost_delay"
> and related parameters[1], which are described as
>
>  "The intent of this feature is to allow administrators to
>   reduce the I/O impact of these commands on concurrent
>   database activity. There are many situations in which it is
>   not very important that maintenance commands like VACUUM
>   and ANALYZE finish quickly; however, it is usually very
>   important that these commands do not significantly
>   interfere with the ability of the system to perform other
>   database operations. Cost-based vacuum delay provides
>   a way for administrators to achieve this."
>
> Are there any similar features for Solr, where it can sacrifice the
> speed of doing a commit in favor of leaving more I/O bandwidth
> for users performing searches?
>
> If not, where in the code might I look to add such a feature?
>
>     Ron
>
> [1] http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html
>
>
>
>


Re: Can I tell Solr to merge segments more slowly on an I/O starved system?

2010-09-19 Thread Jason Rutherglen
Here's the remainder of the discussion, albeit, brief:
http://www.lucidimagination.com/search/document/d6fa7b3241ed11b8/throttling_merges#9df776e79da71044

On Sun, Sep 19, 2010 at 12:04 AM, Ron Mayer  wrote:
> My system which has documents being added pretty much
> continually seems pretty well behaved except, it seems,
> when large segments get merged.     During that time
> the system starts really dragging, and queries that took
> only a couple seconds are taking dozens.
>
> Some other I/O bound servers seem to have features
> that let you throttle how much I/O they take for administrative
> background tasks -- for example PostgreSQL's "vacuum_cost_delay"
> and related parameters[1], which are described as
>
>  "The intent of this feature is to allow administrators to
>   reduce the I/O impact of these commands on concurrent
>   database activity. There are many situations in which it is
>   not very important that maintenance commands like VACUUM
>   and ANALYZE finish quickly; however, it is usually very
>   important that these commands do not significantly
>   interfere with the ability of the system to perform other
>   database operations. Cost-based vacuum delay provides
>   a way for administrators to achieve this."
>
> Are there any similar features for Solr, where it can sacrifice the
> speed of doing a commit in favor of leaving more I/O bandwidth
> for users performing searches?
>
> If not, where in the code might I look to add such a feature?
>
>     Ron
>
> [1] http://www.postgresql.org/docs/8.4/static/runtime-config-resource.html
>
>
>
>


Re: Autocomplete: match words anywhere in the token

2010-09-22 Thread Jason Rutherglen
This may be what you're looking for.
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
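
Very roughly, the approach in that post is to run the suggestion text through
an edge n-gram analyzer at index time so every leading prefix becomes a term; a
sketch (names and gram sizes are placeholders, not the blog's exact schema):

    <fieldType name="autocomplete" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>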

On Wed, Sep 22, 2010 at 4:41 AM, Arunkumar Ayyavu
 wrote:
> It's been over a week since I started learning Solr. Now, I'm using the
> electronics store example to explore the autocomplete feature in Solr.
>
> When I send the query terms.fl=name&terms.prefix=canon to terms request
> handler, I get the following response
> 
>  
>   2
>  
> 
>
> But I expect the following results in the response.
> canon pixma mp500 all-in-one photo printer
> canon powershot sd500
>
> So, I changed the schema for textgen fieldType to use
> KeywordTokenizerFactory and also removed WordDelimiterFilterFactory. That
> gives me the expected result.
>
> Now, I also want the Solr to return "canon pixma mp500 all-in-one photo
> printer"  when I send the query terms.fl=name&terms.prefix=pixma. Could you
> gurus help me get the expected result?
>
> BTW, I couldn't quite understand the behavior of terms.lower and terms.upper
> (I tried these with the electronics store example). Could you also help me
> understand these 2 query fields?
> Thanks.
>
> --
> Arun
>


Re: Autosuggest with inner phrases

2010-10-02 Thread Jason Rutherglen
This's what yer lookin' for:

http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

On Sat, Oct 2, 2010 at 3:14 AM, sivaprasad  wrote:
>
> Hi ,
> I implemented the auto suggest using the terms component. But the suggestions
> only match from the start of the word. But I want inner phrases also. For
> example, if I type "bass" Auto-Complete should offer suggestions that
> include "bass fishing"  or "bass guitar", and even "sea bass" (note how
> "bass" is not necessarily the first word).
>
> How can i achieve this using solr's terms component.
>
> Regards,
> Siva
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Autosuggest-with-inner-phrases-tp1619326p1619326.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Deletes writing bytes len 0, corrupting the index

2010-10-13 Thread Jason Rutherglen
We have unit tests for running out of disk space?  However we have
Tomcat logs that fill up quickly and starve Solr 1.4.1 of space.  The
main segments are probably not corrupted, however routinely now, there
are deletes files of length 0.

0 2010-10-12 18:35 _cc_8.del

Which is fundamental index corruption, though less extreme.  Are we
testing for this?


Re: Deletes writing bytes len 0, corrupting the index

2010-10-13 Thread Jason Rutherglen
There's a corrupt index exception thrown when opening the searcher.
The rest of the files of the segment are OK.  Meaning the problem has
occurred in writing the bit vector well after the segment has been
written.  I'm guessing we're simply not verifying that the BV has been
written fully/properly, and that we're going on to commit the segment
infos anyways.
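
One way to confirm exactly which files are damaged is Lucene's CheckIndex tool;
a minimal sketch, with the index path as a placeholder:

    import java.io.File;
    import org.apache.lucene.index.CheckIndex;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class CheckIndexExample {
      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/solr/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        checker.setInfoStream(System.out);          // print per-segment details
        CheckIndex.Status status = checker.checkIndex();
        System.out.println("index clean? " + status.clean);
      }
    }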

On Wed, Oct 13, 2010 at 10:15 AM, Michael McCandless
 wrote:
> I'm not certain whether we test this particular case, but we do have
> several disk full tests.
>
> But: are you seeing a corrupt index?  Ie, exception on open or on
> searching or on CheckIndex?
>
> Or: do you see a disk-full exception when writing the del file, during
> indexing, that does not in fact corrupt the index (this is of course
> what I hope you are seeing ;) ).
>
> Mike
>
> On Wed, Oct 13, 2010 at 11:37 AM, Jason Rutherglen
>  wrote:
>> We have unit tests for running out of disk space?  However we have
>> Tomcat logs that fill up quickly and starve Solr 1.4.1 of space.  The
>> main segments are probably not corrupted, however routinely now, there
>> are deletes files of length 0.
>>
>> 0 2010-10-12 18:35 _cc_8.del
>>
>> Which is fundamental index corruption, though less extreme.  Are we
>> testing for this?
>>
>


Re: Deletes writing bytes len 0, corrupting the index

2010-10-13 Thread Jason Rutherglen
Thanks Robert, that Jira issue aptly describes what I'm seeing, I think.

On Wed, Oct 13, 2010 at 10:22 AM, Robert Muir  wrote:
> if you are going to fill up your disk space all the time with solr
> 1.4.1, I suggest replacing the lucene jars with lucene jars from
> 2.9-branch (http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/).
>
> then you get the fix for https://issues.apache.org/jira/browse/LUCENE-2593 
> too.
>
> On Wed, Oct 13, 2010 at 11:37 AM, Jason Rutherglen
>  wrote:
>> We have unit tests for running out of disk space?  However we have
>> Tomcat logs that fill up quickly and starve Solr 1.4.1 of space.  The
>> main segments are probably not corrupted, however routinely now, there
>> are deletes files of length 0.
>>
>> 0 2010-10-12 18:35 _cc_8.del
>>
>> Which is fundamental index corruption, though less extreme.  Are we
>> testing for this?
>>
>


Re: Deletes writing bytes len 0, corrupting the index

2010-11-04 Thread Jason Rutherglen
I'm still seeing this error after downloading the latest 2.9 branch
version, compiling, copying to Solr 1.4 and deploying.  Basically as
mentioned, the .del files are of zero length... Hmm...

On Wed, Oct 13, 2010 at 1:33 PM, Jason Rutherglen
 wrote:
> Thanks Robert, that Jira issue aptly describes what I'm seeing, I think.
>
> On Wed, Oct 13, 2010 at 10:22 AM, Robert Muir  wrote:
>> if you are going to fill up your disk space all the time with solr
>> 1.4.1, I suggest replacing the lucene jars with lucene jars from
>> 2.9-branch 
>> (http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/).
>>
>> then you get the fix for https://issues.apache.org/jira/browse/LUCENE-2593 
>> too.
>>
>> On Wed, Oct 13, 2010 at 11:37 AM, Jason Rutherglen
>>  wrote:
>>> We have unit tests for running out of disk space?  However we have
>>> Tomcat logs that fill up quickly and starve Solr 1.4.1 of space.  The
>>> main segments are probably not corrupted, however routinely now, there
>>> are deletes files of length 0.
>>>
>>> 0 2010-10-12 18:35 _cc_8.del
>>>
>>> Which is fundamental index corruption, though less extreme.  Are we
>>> testing for this?
>>>
>>
>


Re: Deletes writing bytes len 0, corrupting the index

2010-11-05 Thread Jason Rutherglen
>  can you enable IndexWriter's infoStream

I'd like to, however the problem is only happening in production, and
the indexing volume is in the millions per hour.  The log would be
clogged up; as it is, I have logging in Tomcat turned off because it is
filling up the SSD drive (yes I know, we should have an HD drive as
well, I didn't configure the server, and we're getting new ones,
thanks for wondering).

Can you point me at the unit test that simulates this issue?  Today I
saw a different problem in that the doc store got corrupted, given
we're streaming it to disk, how are we capturing disk full for that
case?  Meaning how can we be sure where the doc store stopped writing
at?  I haven't had time to explore what's up with this however I will
shortly, ie, examine the unit tests and code.  Perhaps though this is
simply hardware related?

On Fri, Nov 5, 2010 at 1:58 AM, Michael McCandless
 wrote:
> Hmmm... Jason can you enable IndexWriter's infoStream and get the
> corruption to happen again and post that (along with "ls -l" output)?
>
> Mike
>
> On Thu, Nov 4, 2010 at 5:11 PM, Jason Rutherglen
>  wrote:
>> I'm still seeing this error after downloading the latest 2.9 branch
>> version, compiling, copying to Solr 1.4 and deploying.  Basically as
>> mentioned, the .del files are of zero length... Hmm...
>>
>> On Wed, Oct 13, 2010 at 1:33 PM, Jason Rutherglen
>>  wrote:
>>> Thanks Robert, that Jira issue aptly describes what I'm seeing, I think.
>>>
>>> On Wed, Oct 13, 2010 at 10:22 AM, Robert Muir  wrote:
>>>> if you are going to fill up your disk space all the time with solr
>>>> 1.4.1, I suggest replacing the lucene jars with lucene jars from
>>>> 2.9-branch 
>>>> (http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/).
>>>>
>>>> then you get the fix for https://issues.apache.org/jira/browse/LUCENE-2593 
>>>> too.
>>>>
>>>> On Wed, Oct 13, 2010 at 11:37 AM, Jason Rutherglen
>>>>  wrote:
>>>>> We have unit tests for running out of disk space?  However we have
>>>>> Tomcat logs that fill up quickly and starve Solr 1.4.1 of space.  The
>>>>> main segments are probably not corrupted, however routinely now, there
>>>>> are deletes files of length 0.
>>>>>
>>>>> 0 2010-10-12 18:35 _cc_8.del
>>>>>
>>>>> Which is fundamental index corruption, though less extreme.  Are we
>>>>> testing for this?
>>>>>
>>>>
>>>
>>
>


Re: Deletes writing bytes len 0, corrupting the index

2010-11-14 Thread Jason Rutherglen
che.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at

HTTP Status 500 - null java.lang.NullPointerException at
org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:761) at
org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:619) at
org.apache.solr.schema.TextField.write(TextField.java:45) at
org.apache.solr.schema.SchemaField.write(SchemaField.java:108) at
org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:311) at
org.apache.solr.request.XMLWriter$3.writeDocs(XMLWriter.java:483) at
org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:420)
at org.apache.solr.request.XMLWriter.writeDocList(XMLWriter.java:457)
at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:520) at
org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130) at
org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
at 
org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
at

Nov 6, 2010 8:31:49 AM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: read past EOF
at 
org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
at 
org.apache.lucene.index.SegmentReader$Norm.bytes(SegmentReader.java:455)
at 
org.apache.lucene.index.SegmentReader.getNorms(SegmentReader.java:1068)
at org.apache.lucene.index.SegmentReader.norms(SegmentReader.java:1074)
at 
org.apache.solr.search.SolrIndexReader.norms(SolrIndexReader.java:282)
at 
org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:72)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:246)
at org.apache.lucene.search.Searcher.search(Searcher.java:171)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:988)
at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:884)
at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:341)
at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:182)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)

On Fri, Nov 5, 2010 at 2:59 PM, Michael McCandless
 wrote:
> See TestIndexWriterOnDiskFull (on trunk).  Look for the test w/
> LUCENE-2743 in the comment... but the other tests there also test
> other cases that may hit disk full.
>
> Can you post the exceptions you hit?  (Are these logged?).
>
> Yes this could be a hardware issue...
>
> Millions of docs indexed per hour sounds like fun!
>
> Mike
>
> On Fri, Nov 5, 2010 at 5:33 PM, Jason Rutherglen
>  wrote:
>>>  can you enable IndexWriter's infoStream
>>
>> I'd like to however the problem is only happening in production, and
>> the indexing volume is in the millions per hour.  The log would be
>> clogged up, as it is I have logging in Tomcat turned off because it is
>> filling up the SSD drive (yes I know, we should have an HD drive as
>> well, I didn't configure the server, and we're getting new ones,
>> thanks for wondering).
>>
>> Can you point me at the unit test that simulates this issue?  Today I
>> saw a different problem in that the doc store got corrupted, given
>> we're streaming it to disk, how are we capturing disk full for that
>> case?  Meaning how can we be sure where the doc store stopped writing
>> at?  I haven't had time to explore what's up with this however I will
>> shortly, ie, examine the unit tests and code.  Perhaps though this is
>> simply hardware related?
>>
>> On Fri, Nov 5, 2010 at 1:58 AM, Michael McCandless
>>  wrote:
>>> Hmmm... Jason can you enable IndexWriter's infoStream and get the
>>> corruption to happen again and post that (along with "ls -l" output)?
>>>
>>> Mike
>>>
>>> On Thu, Nov 4, 2010 at 5:11 PM, Jason Rutherglen
>>>  wrote:
>>>> I'm still seeing this error after downloading the latest 2.9 branch
>>>> version, compiling, copying to Solr 1.4 and deploying.  Basically as
>>>> mentioned, the .del files are of zero length... Hmm...
>>>>
>>>> On Wed, Oct 13, 2010 at 1:33 PM, Jason Rutherglen
>>>>  wrote:
>>>>> Thanks Robert, that Jira issue aptly describes what I'm seeing, I think.
>>>>>
>>>>> On Wed, Oct 13, 2010 at 10:22 AM, Robert Muir  wrote:
>>>>>> if you are going to fill up your disk space all the time with solr
>>>>>> 1.4.1, I suggest replacing the lucene jars with lucene jars from
>>>>>> 2.9-branch 
>>>>>> (http://svn.apache.org/repos/asf/lucene/java/branches/lucene_2_9/).
>>>>>>
>>>>>> then you get the fix for 
>>>>>> https://issues.apache.org/jira/browse/LUCENE-2593 too.
>>>>>>
>>>>>> On Wed, Oct 13, 2010 at 11:37 AM, Jason Rutherglen
>>>>>>  wrote:
>>>>>>> We have unit tests for running out of disk space?  However we have
>>>>>>> Tomcat logs that fill up quickly and starve Solr 1.4.1 of space.  The
>>>>>>> main segments are probably not corrupted, however routinely now, there
>>>>>>> are deletes files of length 0.
>>>>>>>
>>>>>>> 0 2010-10-12 18:35 _cc_8.del
>>>>>>>
>>>>>>> Which is fundamental index corruption, though less extreme.  Are we
>>>>>>> testing for this?
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Rollback can't be done after committing?

2010-11-14 Thread Jason Rutherglen
The timed deletion policy is a bit too abstract, as is keeping a
numbered limit of commit points.  How would one know what they're
rolling back to when only a numeric limit is defined?

I think committing to a name and being able to roll back to it in Solr
is a good feature to add.
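
A rough Lucene 3.x-level sketch of what Mike describes below: a deletion policy
that keeps every commit, listing the commits, and opening a writer on an older
commit to roll back (paths, the analyzer, and the choice of commit are
placeholders):

    import java.io.File;
    import java.util.Collection;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexDeletionPolicy;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class RollbackToCommitExample {
      // keeps every commit point around; nothing is ever deleted
      static class KeepAllDeletionPolicy implements IndexDeletionPolicy {
        public void onInit(List<? extends IndexCommit> commits) {}
        public void onCommit(List<? extends IndexCommit> commits) {}
      }

      public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));

        // normal indexing, preserving all commits
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        cfg.setIndexDeletionPolicy(new KeepAllDeletionPolicy());
        IndexWriter writer = new IndexWriter(dir, cfg);
        // ... add documents and call writer.commit() at each point you may want to return to ...
        writer.close();

        // later: pick an older commit and open a writer on it
        Collection<IndexCommit> commits = IndexReader.listCommits(dir);
        IndexCommit older = commits.iterator().next();   // placeholder choice of commit
        IndexWriterConfig rollbackCfg =
            new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        rollbackCfg.setIndexDeletionPolicy(new KeepAllDeletionPolicy());
        rollbackCfg.setIndexCommit(older);               // open on the old commit
        IndexWriter rolledBack = new IndexWriter(dir, rollbackCfg);
        rolledBack.close();
      }
    }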

On Fri, Nov 12, 2010 at 2:47 AM, Michael McCandless
 wrote:
> In fact Lucene can rollback to a previous commit.
>
> You just need to use a deletion policy that preserves past commits
> (the default policy only keeps the most recent commit).
>
> Once you have multiple commits in the index you can do fun things like
> open an IndexReader on an old commit, rollback (open an IndexWriter on
> an old commit, deleting the "future" commits).  You can even open an
> IndexWriter on an old commit yet still preserve the newer commits, to
> "revert" changes to the index yet preserve the history.
>
> You can use IndexReader.listCommits to get all commits currently in the index.
>
> But I'm not sure if these capabilities are exposed yet through Solr.
>
> Mike
>
> On Thu, Nov 11, 2010 at 10:25 PM, Pradeep Singh  wrote:
>> In some cases you can rollback to a named checkpoint. I am not too sure but
>> I think I read in the lucene documentation that it supported named
>> checkpointing.
>>
>> On Thu, Nov 11, 2010 at 7:12 PM, gengshaoguang 
>> wrote:
>>
>>> Hi, Kouta:
>>> No data store supports rollback AFTER commit; rollback works only
>>> BEFORE.
>>>
>>> On Friday, November 12, 2010 12:34:18 am Kouta Osabe wrote:
>>> > Hi, all
>>> >
>>> > I have a question about Solr and SolrJ's rollback.
>>> >
>>> > I try to rollback like below
>>> >
>>> > try{
>>> > server.addBean(dto);
>>> > server.commit;
>>> > }catch(Exception e){
>>> >  if (server != null) { server.rollback();}
>>> > }
>>> >
>>> > I expect that if any Exception is thrown, the "rollback" process is run, so
>>> > the data would not be updated.
>>> >
>>> > but once committed, rollback does not work correctly.
>>> >
>>> > does rollback only work correctly when the "commit" has not been run?
>>> >
>>> > Solr and SolrJ's rollback system is not the same as any RDB's rollback?
>>>
>>>
>>
>


Re: Frequent garbage collections after a day of operation

2012-02-16 Thread Jason Rutherglen
> One thing that could fit the pattern you describe would be Solr caches
> filling up and getting you too close to your JVM or memory limit

This [uncommitted] issue would solve that problem by allowing the GC
to collect caches that become too large, though in practice, the cache
setting would need to be fairly large for an OOM to occur from them:
https://issues.apache.org/jira/browse/SOLR-1513
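
For reference, the cache settings in question live in solrconfig.xml; a minimal
sketch (the sizes are arbitrary placeholders; oversizing these, or a large
documentCache holding big stored documents, is what tends to pin heap):

    <filterCache class="solr.FastLRUCache"
                 size="16384"
                 initialSize="4096"
                 autowarmCount="4096"/>
    <documentCache class="solr.LRUCache"
                   size="512"
                   initialSize="512"
                   autowarmCount="0"/>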

On Thu, Feb 16, 2012 at 7:14 PM, Bryan Loofbourrow
 wrote:
> A couple of thoughts:
>
> We wound up doing a bunch of tuning on the Java garbage collection.
> However, the pattern we were seeing was periodic very extreme slowdowns,
> because we were then using the default garbage collector, which blocks
> when it has to do a major collection. This doesn't sound like your
> problem, but it's something to be aware of.
>
> One thing that could fit the pattern you describe would be Solr caches
> filling up and getting you too close to your JVM or memory limit. For
> example, if you have large documents, and have defined a large document
> cache, that might do it.
>
> I found it useful to point jconsole (free with the JDK) at my JVM, and
> watch the pattern of memory usage. If the troughs at the bottom of the GC
> cycles keep rising, you know you've got something that is continuing to
> grab more memory and not let go of it. Now that our JVM is running
> smoothly, we just see a sawtooth pattern, with the troughs approximately
> level. When the system is under load, the frequency of the wave rises. Try
> it and see what sort of pattern you're getting.
>
> -- Bryan
>
>> -Original Message-
>> From: Matthias Käppler [mailto:matth...@qype.com]
>> Sent: Thursday, February 16, 2012 7:23 AM
>> To: solr-user@lucene.apache.org
>> Subject: Frequent garbage collections after a day of operation
>>
>> Hey everyone,
>>
>> we're running into some operational problems with our SOLR production
>> setup here and were wondering if anyone else is affected or has even
>> solved these problems before. We're running a vanilla SOLR 3.4.0 in
>> several Tomcat 6 instances, so nothing out of the ordinary, but after
>> a day or so of operation we see increased response times from SOLR, up
>> to 3 times increases on average. During this time we see increased CPU
>> load due to heavy garbage collection in the JVM, which bogs down the
>> the whole system, so throughput decreases, naturally. When restarting
>> the slaves, everything goes back to normal, but that's more like a
>> brute force solution.
>>
>> The thing is, we don't know what's causing this and we don't have that
>> much experience with Java stacks since we're for most parts a Rails
>> company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
>> seeing this, or can you think of a reason for this? Most of our
>> queries to SOLR involve the DismaxHandler and the spatial search query
>> components. We don't use any custom request handlers so far.
>>
>> Thanks in advance,
>> -Matthias
>>
>> --
>> Matthias Käppler
>> Lead Developer API & Mobile
>>
>> Qype GmbH
>> Großer Burstah 50-52
>> 20457 Hamburg
>> Telephone: +49 (0)40 - 219 019 2 - 160
>> Skype: m_kaeppler
>> Email: matth...@qype.com
>>
>> Managing Director: Ian Brotherston
>> Amtsgericht Hamburg
>> HRB 95913
>>
>> This e-mail and its attachments may contain confidential and/or
>> privileged information. If you are not the intended recipient (or have
>> received this e-mail in error) please notify the sender immediately
>> and destroy this e-mail and its attachments. Any unauthorized copying,
>> disclosure or distribution of this e-mail and  its attachments is
>> strictly forbidden. This notice also applies to future messages.


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-15 Thread Jason Rutherglen
This was done in SOLR-1301 going on several years ago now.

On Sat, Apr 14, 2012 at 4:11 PM, Lance Norskog  wrote:
> It sounds like you really want the final map/reduce phase to put Solr
> index files into HDFS. Solr has a feature to do this called 'Embedded
> Solr'. This packages Solr as a library instead of an HTTP servlet. The
> Solr committers mostly hate it and want it to go away, but it is
> useful for exactly this problem.
>
> There is some integration work here, both to bolt ES to the Hadoop
> output libraries and also some trickery to write out the HDFS files.
> HDFS only appends and most of the codecs (Lucene segment formats) like
> to seek a lot. Then at the end it needs a way to tell SolrCloud about
> the files.
>
> If someone wants a great Summer Of Code project, Hadoop->Lucene
> indexes->SolrCloud would be a lot of fun and make you widely loved by
> people with money. I'm not kidding. Do a good job of this and write
> clean code, and you'll get offers for very cool jobs.
>
> On Sat, Apr 14, 2012 at 2:27 PM, Otis Gospodnetic
>  wrote:
>> Hello,
>>
>> Unfortunately I don't know when exactly SolrCloud release will be ready, but 
>> we've used trunk versions in the past and didn't have major issues.
>>
>> Otis
>> 
>> Performance Monitoring SaaS for Solr - 
>> http://sematext.com/spm/solr-performance-monitoring/index.html
>>
>>
>>
>>>
>>> From: Ali S Kureishy 
>>>To: Otis Gospodnetic 
>>>Cc: "solr-user@lucene.apache.org" 
>>>Sent: Friday, April 13, 2012 7:16 PM
>>>Subject: Re: Options for automagically Scaling Solr (without needing 
>>>distributed index/replication) in a Hadoop environment
>>>
>>>
>>>Thanks Otis.
>>>
>>>
>>>I really appreciate the details offered here. This was very helpful 
>>>information.
>>>
>>>
>>>I'm going to go through Solandra and Elastic Search and see if those make 
>>>sense. I was also given a suggestion to use SolrCloud on FuseDFS (that's two 
>>>recommendations for SolrCloud so far), so I will give that a shot when it is 
>>>available. However, do you know when SolrCloud IS expected to be available?
>>>
>>>
>>>Thanks again!
>>>
>>>
>>>Warm regards,
>>>Safdar
>>>
>>>
>>>
>>>
>>>
>>>On Fri, Apr 13, 2012 at 5:23 AM, Otis Gospodnetic 
>>> wrote:
>>>
>>>Hello Ali,


> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure

> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
> crawled + indexed every *4 weeks, *with a search latency of less than 0.5
> seconds.


That's fine.  Whether it's doable with any tech will depend on how much 
hardware you give it, among other things.


> Needless to mention, the search index needs to scale to 5Billion pages. It
> is also possible that I might need to store multiple indexes -- one for
> crawled content, and one for ancillary data that is also very large. Each
> of these indices would likely require a logically distributed and
> replicated index.


Yup, OK.


> However, I would like for such a system to be homogenous with the Hadoop
> infrastructure that is already installed on the cluster (for the crawl). 
> In
> other words, I would much prefer if the replication and distribution of 
> the
> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
> using another scalability framework (such as SolrCloud). In addition, it
> would be ideal if this environment was flexible enough to be dynamically
> scaled based on the size requirements of the index and the search traffic
> at the time (i.e. if it is deployed on an Amazon cluster, it should be 
> easy
> enough to automatically provision additional processing power into the
> cluster without requiring server re-starts).


There is no such thing just yet.
There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
automatically index HBase content, but that was either not completed or not 
committed into HBase.


> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these 
> is
> mature enough and would be the right architectural choice to go along with
> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling 
> aspects
> above.


Here is a summary on all of them:
* Search on HBase - I assume you are referring to the same thing I 
mentioned above.  Not ready.
* Solandra - uses Cassandra+Solr, plus DataStax now has a different 
(commercial) offering that combines search and Cassandra.  Looks good.
* Lily - data stored in HBase cluster gets indexed to a separate Solr 
instance(s)  on the side.  Not really integrated the way you want it to be.
* ElasticSearch -

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-16 Thread Jason Rutherglen
One of big weaknesses of Solr Cloud (and ES?) is the lack of the
ability to redistribute shards across servers.  Meaning, as a single
shard grows too large, splitting the shard while still taking live updates.

How do you plan on elastically adding more servers without this feature?

Cassandra and HBase handle elasticity in their own ways.  Cassandra
has successfully implemented the Dynamo model and HBase uses the
traditional BigTable 'split'.  Both systems are complex though are at
a singular level of maturity.

Also Cassandra [successfully] implements multiple data center support,
is that available in SC or ES?

On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
 wrote:
> Hello Ali,
>
>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
>
>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5
>> seconds.
>
>
> That's fine.  Whether it's doable with any tech will depend on how much 
> hardware you give it, among other things.
>
>> Needless to mention, the search index needs to scale to 5Billion pages. It
>> is also possible that I might need to store multiple indexes -- one for
>> crawled content, and one for ancillary data that is also very large. Each
>> of these indices would likely require a logically distributed and
>> replicated index.
>
>
> Yup, OK.
>
>> However, I would like for such a system to be homogenous with the Hadoop
>> infrastructure that is already installed on the cluster (for the crawl). In
>> other words, I would much prefer if the replication and distribution of the
>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>> using another scalability framework (such as SolrCloud). In addition, it
>> would be ideal if this environment was flexible enough to be dynamically
>> scaled based on the size requirements of the index and the search traffic
>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>> enough to automatically provision additional processing power into the
>> cluster without requiring server re-starts).
>
>
> There is no such thing just yet.
> There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
> automatically index HBase content, but that was either not completed or not 
> committed into HBase.
>
>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>> mature enough and would be the right architectural choice to go along with
>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>> above.
>
>
> Here is a summary on all of them:
> * Search on HBase - I assume you are referring to the same thing I mentioned 
> above.  Not ready.
> * Solandra - uses Cassandra+Solr, plus DataStax now has a different 
> (commercial) offering that combines search and Cassandra.  Looks good.
> * Lily - data stored in HBase cluster gets indexed to a separate Solr 
> instance(s)  on the side.  Not really integrated the way you want it to be.
> * ElasticSearch - solid at this point, the most dynamic solution today, can 
> scale well (we are working on a many-B documents index and hundreds of
> nodes with ElasticSearch right now), etc.  But again, not integrated with 
> Hadoop the way you want it.
> * IndexTank - has some technical weaknesses, not integrated with Hadoop, not 
> sure about its future considering LinkedIn uses Zoie and Sensei already.
> * And there is SolrCloud, which is coming soon and will be solid, but is 
> again not integrated.
>
> If I were you and I had to pick today - I'd pick ElasticSearch if I were 
> completely open.  If I had Solr bias I'd give SolrCloud a try first.
>
>> Lastly, how much hardware (assuming a medium sized EC2 instance) would you
>> estimate my needing with this setup, for regular web-data (HTML text) at
>> this scale?
>
> I don't know off the topic of my head, but I'm guessing several hundred for 
> serving search requests.
>
> HTH,
>
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
>
> Scalable Performance Monitoring - http://sematext.com/spm/index.html
>
>
>> Any architectural guidance would be greatly appreciated. The more details
>> provided, the wider my grin :).
>>
>> Many many thanks in advance.
>>
>> Thanks,
>> Safdar
>>


Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-17 Thread Jason Rutherglen
> redistributing shards from oversubscribed nodes to other nodes

Redistributing shards on a live system is not possible, however, because
the updates in flight will likely be lost.  Also, it is not simple
technology to build from the ground up.

As it is today, one would need to schedule downtime; for multi-terabyte
live realtime systems, that is not acceptable and will cause the system
to miss its SLAs.

Solr Cloud seems limited to a simple hashing algorithm for sending
updates to the appropriate shard.  This is precisely what Dynamo (and
Cassandra) solves, eg, elastically and dynamically rearranging the
hash 'ring' both logically and physically.

In addition, there is the potential for data loss which Cassandra has
the technology for.

On Tue, Apr 17, 2012 at 1:33 PM, Otis Gospodnetic
 wrote:
> I think Jason is right - there is no index splitting in ES and SolrCloud, so 
> one has to think ahead, "overshard", and then count on redistributing shards 
> from oversubscribed nodes to other nodes.  No resharding on demand and no 
> index/shard splitting yet.
>
> Otis
> 
> Performance Monitoring SaaS for Solr - 
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
>
>>
>> From: Jason Rutherglen 
>>To: solr-user@lucene.apache.org
>>Sent: Monday, April 16, 2012 8:42 PM
>>Subject: Re: Options for automagically Scaling Solr (without needing 
>>distributed index/replication) in a Hadoop environment
>>
>>One of big weaknesses of Solr Cloud (and ES?) is the lack of the
>>ability to redistribute shards across servers.  Meaning, as a single
>>shard grows too large, splitting the shard, while live updates.
>>
>>How do you plan on elastically adding more servers without this feature?
>>
>>Cassandra and HBase handle elasticity in their own ways.  Cassandra
>>has successfully implemented the Dynamo model and HBase uses the
>>traditional BigTable 'split'.  Both systems are complex though are at
>>a singular level of maturity.
>>
>>Also Cassandra [successfully] implements multiple data center support,
>>is that available in SC or ES?
>>
>>On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
>> wrote:
>>> Hello Ali,
>>>
>>>> I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
>>>
>>>> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
>>>> crawled + indexed every *4 weeks, *with a search latency of less than 0.5
>>>> seconds.
>>>
>>>
>>> That's fine.  Whether it's doable with any tech will depend on how much 
>>> hardware you give it, among other things.
>>>
>>>> Needless to mention, the search index needs to scale to 5Billion pages. It
>>>> is also possible that I might need to store multiple indexes -- one for
>>>> crawled content, and one for ancillary data that is also very large. Each
>>>> of these indices would likely require a logically distributed and
>>>> replicated index.
>>>
>>>
>>> Yup, OK.
>>>
>>>> However, I would like for such a system to be homogenous with the Hadoop
>>>> infrastructure that is already installed on the cluster (for the crawl). In
>>>> other words, I would much prefer if the replication and distribution of the
>>>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>>>> using another scalability framework (such as SolrCloud). In addition, it
>>>> would be ideal if this environment was flexible enough to be dynamically
>>>> scaled based on the size requirements of the index and the search traffic
>>>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>>>> enough to automatically provision additional processing power into the
>>>> cluster without requiring server re-starts).
>>>
>>>
>>> There is no such thing just yet.
>>> There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt to 
>>> automatically index HBase content, but that was either not completed or not 
>>> committed into HBase.
>>>
>>>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>>>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>>>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>>>> mature enough and would be the right architectural choice to go along with
>>>> a Nutch crawler setup, and to also satisfy the dynamic/a

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-18 Thread Jason Rutherglen
I'm curious how on the fly updates are handled as a new shard is added
to an alias.  Eg, how does the system know to which shard to send an
update?

On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček  wrote:
> Hi,
>
> speaking about ES I think it would be fair to mention that one has to
> specify number of shards upfront when the index is created - that is
> correct, however, it is possible to give index one or more aliases which
> basically means that you can add new indices on the fly and give them same
> alias which is then used to search against. Given that you can add/remove
> indices, nodes and aliases on the fly I think there is a way how to handle
> growing data set with ease. If anyone is interested such scenario has been
> discussed in detail in ES mail list.
>
> Regards,
> Lukas
>
> On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> One of big weaknesses of Solr Cloud (and ES?) is the lack of the
>> ability to redistribute shards across servers.  Meaning, as a single
>> shard grows too large, splitting the shard, while live updates.
>>
>> How do you plan on elastically adding more servers without this feature?
>>
>> Cassandra and HBase handle elasticity in their own ways.  Cassandra
>> has successfully implemented the Dynamo model and HBase uses the
>> traditional BigTable 'split'.  Both systems are complex though are at
>> a singular level of maturity.
>>
>> Also Cassandra [successfully] implements multiple data center support,
>> is that available in SC or ES?
>>
>> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
>>  wrote:
>> > Hello Ali,
>> >
>> >> I'm trying to setup a large scale *Crawl + Index + Search
>> *infrastructure
>> >
>> >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web
>> pages*,
>> >> crawled + indexed every *4 weeks, *with a search latency of less than
>> 0.5
>> >> seconds.
>> >
>> >
>> > That's fine.  Whether it's doable with any tech will depend on how much
>> hardware you give it, among other things.
>> >
>> >> Needless to mention, the search index needs to scale to 5Billion pages.
>> It
>> >> is also possible that I might need to store multiple indexes -- one for
>> >> crawled content, and one for ancillary data that is also very large.
>> Each
>> >> of these indices would likely require a logically distributed and
>> >> replicated index.
>> >
>> >
>> > Yup, OK.
>> >
>> >> However, I would like for such a system to be homogenous with the Hadoop
>> >> infrastructure that is already installed on the cluster (for the
>> crawl). In
>> >> other words, I would much prefer if the replication and distribution of
>> the
>> >> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead
>> of
>> >> using another scalability framework (such as SolrCloud). In addition, it
>> >> would be ideal if this environment was flexible enough to be dynamically
>> >> scaled based on the size requirements of the index and the search
>> traffic
>> >> at the time (i.e. if it is deployed on an Amazon cluster, it should be
>> easy
>> >> enough to automatically provision additional processing power into the
>> >> cluster without requiring server re-starts).
>> >
>> >
>> > There is no such thing just yet.
>> > There is no Search+Hadoop/HDFS in a box just yet.  There was an attempt
>> to automatically index HBase content, but that was either not completed or
>> not committed into HBase.
>> >
>> >> However, I'm not sure which Solr-based tool in the Hadoop ecosystem
>> would
>> >> be ideal for this scenario. I've heard mention of Solr-on-HBase,
>> Solandra,
>> >> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of
>> these is
>> >> mature enough and would be the right architectural choice to go along
>> with
>> >> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling
>> aspects
>> >> above.
>> >
>> >
>> > Here is a summary on all of them:
>> > * Search on HBase - I assume you are referring to the same thing I
>> mentioned above.  Not ready.
>> > * Solandra - uses Cassandra+Solr, plus DataStax now has a different
>> (commercial) offering that combines search and Cassandra.  Looks good.
>> > * Lily - data

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-18 Thread Jason Rutherglen
The main point being made is that established NoSQL solutions (eg,
Cassandra, HBase, et al) have solved the update problem (among many
other scalability issues) for several years.

If an update is being performed and it is not known where the record
exists, the update capability of the system is inefficient.  In
addition, in a production system, the mere possibility of losing data,
or inaccurate updates is usually a red flag.

On Wed, Apr 18, 2012 at 6:40 AM, Lukáš Vlček  wrote:
> AFAIK it can not. You can only add new shards by creating a new index and
> you will then need to index new data into that new index. Index aliases are
> useful mainly for searching part. So it means that you need to plan for
> this when you implement your indexing logic. On the other hand the query
> logic does not need to change as you only add new indices and give them all
> the same alias.
>
> I am not an expert on this but I think that index splitting and re-sharding
> can be expensive for [near] real-time search system and the point is that
> you can probably use different techniques to support your large scale
> needs. Index aliasing and routing in elasticsearch can help a lot in
> supporting various large scale data scenarios, check the following thread
> in ES ML for some examples:
> https://groups.google.com/forum/#!msg/elasticsearch/49q-_AgQCp8/MRol0t9asEcJ
>
> Just to sum it up, the fact that elasticsearch does have fixed number of
> shards per index and does not support resharding and index splitting does
> not mean you can not scale your data easily.
>
> (I was not following this whole thread in every detail. So may be you may
> have specific needs that can be solved only by splitting or resharding, in
> such case I would recommend you to ask on ES ML with further questions, I
> do not want to run into system X vs system Y flame here...)
>
> Regards,
> Lukas
>
> On Wed, Apr 18, 2012 at 2:22 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> I'm curious how on the fly updates are handled as a new shard is added
>> to an alias.  Eg, how does the system know to which shard to send an
>> update?
>>
>> On Tue, Apr 17, 2012 at 4:00 PM, Lukáš Vlček 
>> wrote:
>> > Hi,
>> >
>> > speaking about ES I think it would be fair to mention that one has to
>> > specify number of shards upfront when the index is created - that is
>> > correct, however, it is possible to give index one or more aliases which
>> > basically means that you can add new indices on the fly and give them
>> same
>> > alias which is then used to search against. Given that you can add/remove
>> > indices, nodes and aliases on the fly I think there is a way how to
>> handle
>> > growing data set with ease. If anyone is interested such scenario has
>> been
>> > discussed in detail in ES mail list.
>> >
>> > Regards,
>> > Lukas
>> >
>> > On Tue, Apr 17, 2012 at 2:42 AM, Jason Rutherglen <
>> > jason.rutherg...@gmail.com> wrote:
>> >
>> >> One of big weaknesses of Solr Cloud (and ES?) is the lack of the
>> >> ability to redistribute shards across servers.  Meaning, as a single
>> >> shard grows too large, splitting the shard, while live updates.
>> >>
>> >> How do you plan on elastically adding more servers without this feature?
>> >>
>> >> Cassandra and HBase handle elasticity in their own ways.  Cassandra
>> >> has successfully implemented the Dynamo model and HBase uses the
>> >> traditional BigTable 'split'.  Both systems are complex though are at
>> >> a singular level of maturity.
>> >>
>> >> Also Cassandra [successfully] implements multiple data center support,
>> >> is that available in SC or ES?
>> >>
>> >> On Thu, Apr 12, 2012 at 7:23 PM, Otis Gospodnetic
>> >>  wrote:
>> >> > Hello Ali,
>> >> >
>> >> >> I'm trying to setup a large scale *Crawl + Index + Search
>> >> *infrastructure
>> >> >
>> >> >> using Nutch and Solr/Lucene. The targeted scale is *5 Billion web
>> >> pages*,
>> >> >> crawled + indexed every *4 weeks, *with a search latency of less than
>> >> 0.5
>> >> >> seconds.
>> >> >
>> >> >
>> >> > That's fine.  Whether it's doable with any tech will depend on how
>> much
>> >> hardware you give it, among other things.
>> >> >
>> >> >> Needless to mention, the search ind

Re: Benchmark Solr vs Elastic Search vs Sensei

2012-04-27 Thread Jason Rutherglen
I think DataStax Enterprise is faster than Solr Cloud with transaction
logging turned on.  Cassandra has its own fast(er) transaction
logging mechanism.  Of course it's best to use two HDs when testing,
eg, one for the data, the other for the transaction log.

On Fri, Apr 27, 2012 at 12:58 PM, Jeff Schmidt  wrote:
> This is a pretty awesome combination, actually.  I'm getting started using it 
> myself, and I'd be very interested in what kind of benchmark results you get 
> vs. Solr and your other candidates. DataStax Enterprise 2.0 was released in 
> March and is based on Solr 4.0 and Cassandra 1.0.7 or 1.0.8, I'm looking for 
> the Cassandra 1.1 based release.
>
> Note: I am not affiliated with DataStax in any way, other than being a
> satisfied customer for the past few months.   I am just trying to selfishly 
> fuel your interest so you'll consider benchmarking it.
>
> My project is already using Cassandra, and we had to manage Solr separately. 
> Having the Solr indexes, and core configuration (solrconfig.xml, schema.xml, 
> synonyms.txt etc) in Cassandra, being distributed and replicated among the 
> various nodes, and eventually for us, multiple data centers is fantastic.
>
> Jeff
>
> On Apr 27, 2012, at 1:46 PM, Walter Underwood wrote:
>
>> On Apr 27, 2012, at 12:39 PM, Radim Kolar wrote:
>>
>>> Dne 27.4.2012 19:59, Jeremy Taylor napsal(a):
 DataStax offers a Solr integration that isn't master/slave and is
 NearRealTimes.
>>> its rebranded solandra?
>>
>> No, it is a rewrite.
>>
>> http://www.datastax.com/dev/blog/cassandra-with-solr-integration-details
>>
>> wunder
>> --
>> Walter Underwood
>> wun...@wunderwood.org
>>
>>
>>
>
>
>
> --
> Jeff Schmidt
> 535 Consulting
> j...@535consulting.com
> http://www.535consulting.com
> (650) 423-1068
>
>
>
>
>
>
>
>
>


Re: Solr Merge during off peak times

2012-05-02 Thread Jason Rutherglen
> BTW, in 4.0, there's DocumentWriterPerThread that
> merges in the background

It flushes without pausing, but does not perform merges.  Maybe you're
thinking of ConcurrentMergeScheduler?

On Wed, May 2, 2012 at 7:26 AM, Erick Erickson  wrote:
> Optimizing is much less important query-speed wise
> than historically, essentially it's not recommended much
> any more.
>
> A significant effect of optimize _used_ to be purging
> obsolete data (i.e. that from deleted docs) from the
> index, but that is now done on merge.
>
> There's no harm in optimizing on off-peak hours, and
> combined with an appropriate merge policy that may make
> indexing a little better (I'm thinking of not doing
> as many massive merges here).
>
> BTW, in 4.0, there's DocumentWriterPerThread that
> merges in the background and pretty much removes
> even this as a motivation for optimizing.
>
> All that said, optimizing isn't _bad_, it's just often
> unnecessary.
>
> Best
> Erick
>
> On Wed, May 2, 2012 at 9:29 AM, Prakashganesh, Prabhu
>  wrote:
>> Actually we are not thinking of a M/S setup
>> We are planning to have x number of shards on N number of servers, each of 
>> the shard handling both indexing and searching
>> The expected query volume is not that high, so don't think we would need to 
>> replicate to slaves. We think each shard will be able to handle its share of 
>> the indexing and searching. If we need to scale query capacity in future, 
>> yeah probably need to do it by replicating each shard to its slaves
>>
>> I agree autoCommit settings would be good to set up appropriately
>>
>> Another question I had is pros/cons of optimising the index. We would be 
>> purging old content every week and am thinking whether to run an index 
>> optimise in the weekend after purging old data. Because we are going to be 
>> continuously indexing data which would be mix of adds, updates, deletes, not 
>> sure if the benefit of optimising would last long enough to be worth doing 
>> it. Maybe setting a low mergeFactor would be good enough. Optimising makes 
>> sense if the index is more static, perhaps? Thoughts?
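
For reference, a sketch of where mergeFactor is set in solrconfig.xml (the
values are placeholders for the comparison being described):

    <indexDefaults>
      <mergeFactor>5</mergeFactor>
      <ramBufferSizeMB>128</ramBufferSizeMB>
    </indexDefaults>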
>>
>> Thanks
>> Prabhu
>>
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: 02 May 2012 13:15
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr Merge during off peak times
>>
>> But again, with a master/slave setup merging should
>> be relatively benign. And at 200M docs, having a M/S
>> setup is probably indicated.
>>
>> Here's a good writeup of mergepolicy
>> http://juanggrande.wordpress.com/2011/02/07/merge-policy-internals/
>>
>> If you're indexing and searching on a single machine, merging
>> is much less important than how often you commit. If a M/S
>> situation, then you're polling interval on the slave is important.
>>
>> I'd look at commit frequency long before I worried about merging,
>> that's usually where people shoot themselves in the foot - by
>> committing too often.
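
For reference, a minimal sketch of where commit frequency is controlled in
solrconfig.xml (the thresholds are placeholders, not recommendations):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>50000</maxDocs>
        <maxTime>300000</maxTime>  <!-- milliseconds, i.e. at most one commit every 5 minutes -->
      </autoCommit>
    </updateHandler>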
>>
>> Overall, your mergeFactor is probably less important than other
>> parts of how you perform indexing/searching, but it does have
>> some effect for sure...
>>
>> Best
>> Erick
>>
>> On Wed, May 2, 2012 at 7:54 AM, Prakashganesh, Prabhu
>>  wrote:
>>> We have a fairly large scale system - about 200 million docs and fairly 
>>> high indexing activity - about 300k docs per day with peak ingestion rates 
>>> of about 20 docs per sec. I want to work out what a good mergeFactor 
>>> setting would be by testing with different mergeFactor settings. I think 
>>> the default of 10 might be high, I want to try with 5 and compare. Unless I 
>>> know when a merge starts and finishes, it would be quite difficult to work 
>>> out the impact of changing mergeFactor. I want to be able to measure how 
>>> long merges take, run queries during the merge activity and see what the 
>>> response times are etc..
>>>
>>> Thanks
>>> Prabhu
>>>
>>> -Original Message-
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: 02 May 2012 12:40
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Solr Merge during off peak times
>>>
>>> Why do you care? Merging is generally a background process, or are
>>> you doing heavy indexing? In a master/slave setup,
>>> it's usually not really relevant except that (with 3.x), massive merges
>>> may temporarily stop indexing. Is that the problem?
>>>
>>> Look at the merge policys, there are configurations that make
>>> this less painful.
>>>
>>> In trunk, DocumentsWriterPerThread makes merges happen in the
>>> background, which helps the long-pause-while-indexing problem.
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, May 2, 2012 at 7:22 AM, Prakashganesh, Prabhu
>>>  wrote:
 Ok, thanks Otis
 Another question on merging
 What is the best way to monitor merging?
 Is there something in the log file that I can look for?
 It seems like I have to monitor the system resources - read/write IOPS 
 etc.. and work out when a merge happened
 It would be g
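
One way to see merge activity directly in the logs is Lucene's infoStream, which Solr can switch on from solrconfig.xml.  A sketch, assuming the 3.x-style <indexDefaults> section and a path the JVM can write to:

  <indexDefaults>
    <!-- logs low-level IndexWriter activity, including merge start/finish, to a file -->
    <infoStream file="INFOSTREAM.txt">true</infoStream>
  </indexDefaults>

Each merge shows up with the segments involved and timestamps, which makes it possible to line merges up against query response times.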

Re: Search timeout for Solrcloud

2012-06-05 Thread Jason Rutherglen
There isn't a solution for killing long running queries that works.
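
For reference, the timeAllowed parameter only bounds the document-collection phase and flags the response as partial; it does not abort the request or free the thread, which is the behaviour described below.  A sketch, assuming the default /select handler:

  # hypothetical deep-paging query with a 5 second collection budget
  curl 'http://localhost:8983/solr/select?q=*:*&start=100000&rows=10&timeAllowed=5000'
  # responses that hit the limit carry partialResults=true in the response header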

On Tue, Jun 5, 2012 at 1:34 AM, arin_g  wrote:
> Hi,
> We use solrcloud in production, and we are facing some issues with queries
> that take very long specially deep paging queries, these queries keep our
> servers very busy. i am looking for a way to stop (kill) queries taking
> longer than a specific amount of time (say 5 seconds), i checked timeAllowed
> but it doesn't work (again query  runs completely). Also i noticed that
> there are connTimeout and socketTimeout for distributed searches, but i am
> not sure if they kill the thread (i want to save resources by killing the
> query, not just returning a timeout). Also, if i could get partial results
> that would be ideal. Any suggestions?
>
> Thanks,
>  arin
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Search-timeout-for-Solrcloud-tp3987716.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

2011-01-09 Thread Jason Rutherglen
> The older MergePolicies followed a strategy which is quite disruptive in an 
> NRT environment.

Can you elaborate as to why (maybe we need to place this in a wiki)?
If large merges are running in their own thread, they should not
disrupt queries, eg, there won't be CPU contention.  The IO contention
can be disruptive, depending on the size and type of hardware, however
in the ideal case of the index 'fitting' into RAM/IO cache, then a
large merge should not affect queries (or indexing).

I think what's useful that is being developed for not disrupting NRT
with merges is DirectIOLinuxDirectory:
https://issues.apache.org/jira/browse/LUCENE-2500  It's also useful
for the non-NRT use case because anytime IO cache pages are evicted,
queries will slow down (unless the index is too large to fit in RAM
anyways).

On Sat, Jan 8, 2011 at 7:55 PM, Lance Norskog  wrote:
> There are always slowdowns when merging new segments during indexing.
> A MergePolicy decides when to merge segments.  The older MergePolicies
> followed a strategy which is quite disruptive in an NRT environment.
>
> There is a new feature in 3.x & the trunk called
> 'BalancedSegmentMergePolicy'. This new MergePolicy is designed for the
> near-real-time use case. It was contributed by LinkedIn. You may find
> it works well enough for your case.
>
> Lance
>
> On Thu, Jan 6, 2011 at 10:21 AM, Stephen Boesch  wrote:
>> Thanks Yonik,
>>  Using a stable release of Solr what would you suggest to do - given
>> MultiSearch's demise and the other work is still ongoing?
>>
>> 2011/1/6 Yonik Seeley 
>>
>>> On Thu, Jan 6, 2011 at 12:37 PM, Stephen Boesch  wrote:
>>> > Solr/lucene newbie here ..
>>> >
>>> > We would like searches against a solr/lucene index to immediately be able
>>> to
>>> > view data that was added.  I stress "small" amount of new data given that
>>> > any significant amount would require excessive  latency.
>>>
>>> There has been significant ongoing work in lucene-core for NRT (near real
>>> time).
>>> We need to overhaul Solr's DirectUpdateHandler2 to take advantage of
>>> all this work.
>>> Mark Miller took a first crack at it (sharing a single IndexWriter,
>>> letting lucene handle the concurrency issues, etc)
>>> but if there's a JIRA issue, I'm having trouble finding it.
>>>
>>> > Looking around, i'm wondering if the direction would be a MultiSearcher
>>> > living on top of our standard directory-based IndexReader as well as a
>>> > custom Searchable that handles the newest documents - and then combines
>>> the
>>> > two results?
>>>
>>> If you look at trunk, MultiSearcher has already gone away.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Including Small Amounts of New Data in Searches (MultiSearcher ?)

2011-01-10 Thread Jason Rutherglen
> most of the Solr sites I know of
> have much larger indexes than ram and expect everything to work
> smoothly

Hmm... In that case, throttling the merges would probably help most,
though, yes, that's not available today.  In lieu of that, I'd run
large merges during off-peak hours, or better yet, use Solr's
replication, eg, merge on the master where queries aren't hitting
anything.  Perhaps that'd throw off the NRT interval though.
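
A minimal sketch of pushing the heavy work to off-peak hours on the master, assuming the stock update handler and that cron is available:

  # crontab entry: optimize at 3am, when query traffic is low
  0 3 * * * curl http://localhost:8983/solr/update -s -H "Content-type:text/xml; charset=utf-8" -d "<optimize/>"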

On Sun, Jan 9, 2011 at 8:55 PM, Lance Norskog  wrote:
> Ok. I was talking about what tools are available now- much better
> things are in the NRT work. I don't know how merges work now, in re
> multitasking and thread contention. Most of the Solr sites I know of
> have much larger indexes than ram and expect everything to work
> smoothly.
>
> Lance
>
> On Sun, Jan 9, 2011 at 9:18 AM, Jason Rutherglen
>  wrote:
>>> The older MergePolicies followed a strategy which is quite disruptive in an 
>>> NRT environment.
>>
>> Can you elaborate as to why (maybe we need to place this in a wiki)?
>> If large merges are running in their own thread, they should not
>> disrupt queries, eg, there won't be CPU contention.  The IO contention
>> can be disruptive, depending on the size and type of hardware, however
>> in the ideal case of the index 'fitting' into RAM/IO cache, then a
>> large merge should not affect queries (or indexing).
>>
>> I think what's useful that is being developed for not disrupting NRT
>> with merges is DirectIOLinuxDirectory:
>> https://issues.apache.org/jira/browse/LUCENE-2500  It's also useful
>> for the non-NRT use case because anytime IO cache pages are evicted,
>> queries will slow down (unless the index is too large to fit in RAM
>> anyways).
>>
>> On Sat, Jan 8, 2011 at 7:55 PM, Lance Norskog  wrote:
>>> There are always slowdowns when merging new segments during indexing.
>>> A MergePolicy decides when to merge segments.  The older MergePolicies
>>> followed a strategy which is quite disruptive in an NRT environment.
>>>
>>> There is a new feature in 3.x & the trunk called
>>> 'BalancedSegmentMergePolicy'. This new MergePolicy is designed for the
>>> near-real-time use case. It was contributed by LinkedIn. You may find
>>> it works well enough for your case.
>>>
>>> Lance
>>>
>>> On Thu, Jan 6, 2011 at 10:21 AM, Stephen Boesch  wrote:
>>>> Thanks Yonik,
>>>>  Using a stable release of Solr what would you suggest to do - given
>>>> MultiSearch's demise and the other work is still ongoing?
>>>>
>>>> 2011/1/6 Yonik Seeley 
>>>>
>>>>> On Thu, Jan 6, 2011 at 12:37 PM, Stephen Boesch  wrote:
>>>>> > Solr/lucene newbie here ..
>>>>> >
>>>>> > We would like searches against a solr/lucene index to immediately be 
>>>>> > able
>>>>> to
>>>>> > view data that was added.  I stress "small" amount of new data given 
>>>>> > that
>>>>> > any significant amount would require excessive  latency.
>>>>>
>>>>> There has been significant ongoing work in lucene-core for NRT (near real
>>>>> time).
>>>>> We need to overhaul Solr's DirectUpdateHandler2 to take advantage of
>>>>> all this work.
>>>>> Mark Miller took a first crack at it (sharing a single IndexWriter,
>>>>> letting lucene handle the concurrency issues, etc)
>>>>> but if there's a JIRA issue, I'm having trouble finding it.
>>>>>
>>>>> > Looking around, i'm wondering if the direction would be a MultiSearcher
>>>>> > living on top of our standard directory-based IndexReader as well as a
>>>>> > custom Searchable that handles the newest documents - and then combines
>>>>> the
>>>>> > two results?
>>>>>
>>>>> If you look at trunk, MultiSearcher has already gone away.
>>>>>
>>>>> -Yonik
>>>>> http://www.lucidimagination.com
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: What can cause segment corruption?

2011-01-11 Thread Jason Rutherglen
Stéphane,

I've only seen production index corruption when during merge the
process ran out of disk space, or there is an underlying hardware
related issue.
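
If an index does turn out to be corrupted, Lucene's CheckIndex tool can report the damage and, as a last resort, drop the broken segment (its documents are lost).  A sketch, assuming the lucene-core jar matching the index version, with the paths as placeholders:

  # dry run: report problems only
  java -ea -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/index
  # after taking a backup, remove references to unreadable segments
  java -ea -cp lucene-core-2.9.3.jar org.apache.lucene.index.CheckIndex /path/to/index -fix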

On Tue, Jan 11, 2011 at 5:06 AM, Stéphane Delprat
 wrote:
> Hi,
>
>
> I'm using Solr 1.4.1 (Lucene 2.9.3)
>
> And some segments get corrupted:
>
>  4 of 11: name=_p40 docCount=470035
>    compound=false
>    hasProx=true
>    numFiles=9
>    size (MB)=1,946.747
>    diagnostics = {optimize=true, mergeFactor=6, os.version=2.6.26-2-amd64,
> os=Linux, mergeDocStores=true, lucene.version=2.9.3 951790 - 2010-06-06
> 01:30:55, source=merge, os.arch=amd64, java.version=1.6.0_20,
> java.vendor=Sun Microsystems Inc.}
>    has deletions [delFileName=_p40_bj.del]
>    test: open reader.OK [9299 deleted docs]
>    test: fields..OK [51 fields]
>    test: field norms.OK [51 fields]
>    test: terms, freq, prox...ERROR [term source:margolisphil docFreq=1 !=
> num docs seen 0 + num docs deleted 0]
> java.lang.RuntimeException: term source:margolisphil docFreq=1 != num docs
> seen 0 + num docs deleted 0
>        at
> org.apache.lucene.index.CheckIndex.testTermIndex(CheckIndex.java:675)
>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:530)
>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>    test: stored fields...OK [15454281 total field count; avg 33.543
> fields per doc]
>    test: term vectorsOK [0 total vector count; avg 0 term/freq
> vector fields per doc]
> FAILED
>    WARNING: fixIndex() would remove reference to this segment; full
> exception:
> java.lang.RuntimeException: Term Index test failed
>        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:543)
>        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:903)
>
>
> What might cause this corruption?
>
>
> I detailed my configuration here:
>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201101.mbox/%3c4d2ae506.7070...@blogspirit.com%3e
>
> Thanks,
>


Re: NRT

2011-01-17 Thread Jason Rutherglen
> How is NRT doing, being used in production?

It works and there are not any lingering bugs as it's been available
for quite a while.

> Which Solr is it in?

Per-segment field cache is used transparently by Solr;
IndexWriter.getReader is what's not used yet.  I'm not sure where
per-segment faceting is at.

> And is there built in Spatial in that version?

Spatial is independent of NRT?
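
For context, the Lucene-level call in question looks roughly like this (a sketch against the Lucene 2.9/3.x API, not something Solr exposes):

  // assumes dir, analyzer and doc are already set up
  IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
  writer.addDocument(doc);
  IndexReader nrtReader = writer.getReader();   // sees the add without a commit
  // ... run searches against nrtReader ...
  IndexReader newer = nrtReader.reopen();       // cheap refresh to pick up later adds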

On Mon, Jan 17, 2011 at 4:56 PM, Dennis Gearon  wrote:
> How is NRT doing, being used in production?
>
> Which Solr is it in?
>
> And is there built in Spatial in that version?
>
> How is Solr 4.x doing?
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
> better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>


Re: salvaging uncommitted data

2011-01-18 Thread Jason Rutherglen
> btw where will i find the writes that have not been committed? are they all
> in memory or are they in some temp files somewhere?

The writes'll be gone if they haven't been committed yet and the
process fails.

> org.apache.lucene.store.LockObtainFailedException: Lock obtain timed

If it's removed, then on restart of the process this should go
away.  However you may see a corrupted index exception.

On Tue, Jan 18, 2011 at 11:31 AM, Udi Nir  wrote:
> the ebs volume is operational and i cannot see any error in dmesg etc.
> the only errors in catalina.out are the lock related ones (even though i
> removed the lock file) and when i do a commit everything looks fine in the
> log.
> i am using the following for the commit:
> curl http://localhost:8983/solr/update -s -H "Content-type:text/xml;
> charset=utf-8" -d "<commit/>"
>
>
> btw where will i find the writes that have not been committed? are they all
> in memory or are they in some temp files somewhere?
>
> udi
>
>
> On Tue, Jan 18, 2011 at 11:24 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
>> Udi,
>>
>> It's hard for me to tell from here, but it looks like your writes are
>> really not
>> going in at all, in which case there may be nothing (much) to salvage.
>>
>> The EBS volume is mounted?  And fast (try listing a bigger dir or doing
>> something that involves some non-trivial disk IO)?
>> No errors anywhere in the log on commit?
>> How exactly are you invoking the commit?  There is a wait option there...
>>
>> Otis
>> 
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>
>>
>> - Original Message 
>> > From: Udi Nir 
>> > To: solr-user@lucene.apache.org
>> > Sent: Tue, January 18, 2011 2:04:56 PM
>> > Subject: Re: salvaging uncommitted data
>> >
>> > i have not stopped writing so i am getting this error all the time.
>> > the  commit actually seems to go through with no errors but it does not
>> seem
>> > to  write anything to the index files (i can see this because they are
>> old
>> > and i  cannot see new stuff in search results).
>> >
>> > my index folder is on an amazon  ebs volume which is a block device and
>> looks
>> > like a local  disk.
>> >
>> > thanks!
>> >
>> > udi
>> >
>> >
>> > On Tue, Jan 18, 2011 at 10:49 AM,  Otis Gospodnetic <
>> > otis_gospodne...@yahoo.com>  wrote:
>> >
>> > > Udi,
>> > >
>> > > Hm, don't know off the top of my head,  but sounds like "an interesting
>> > > problem".
>> > > Are you getting this  error while still writing to the index or did you
>> stop
>> > > all
>> > >  writing?
>> > > Do you get this error when you issue a commit or?
>> > > Is  the index on the local disk or?
>> > >
>> > > Otis
>> > > 
>> > >  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> > > Lucene ecosystem  search :: http://search-lucene.com/
>> > >
>> > >
>> > >
>> > > - Original  Message 
>> > > > From: Udi Nir 
>> > > > To: solr-user@lucene.apache.org
>> > >  > Sent: Tue, January 18, 2011 12:29:47 PM
>> > > > Subject: salvaging  uncommitted data
>> > > >
>> > > > Hi,
>> > > > I have a solr server  that is failing to acquire a lock with the
>> > >  exception
>> > > >  below. I think that the server has a lot of uncommitted data (I am
>> not
>> > > sure
>> > > > how to verify this) and if so I would like to  salvage it.
>> > > > Any  suggestions how to proceed?
>> > >  >
>> > > > (btw i tried removing the lock file but it  did not  help)
>> > > >
>> > > > Thanks,
>> > > > Udi
>> > > >
>> > >  >
>> > > > Jan 18, 2011 5:17:06 PM   org.apache.solr.common.SolrException log
>> > > > SEVERE:   org.apache.lucene.store.LockObtainFailedException: Lock
>> obtain
>> > >  timed
>> > > > out
>> > > > :  NativeFSLock@
>> > > >  /vol-unifi-solr/data/index/lucene-043c34f1f06a280de60b3d4e8e05601
>> > > >  6-write.lock
>> > > >          at org.apache.lucene.store.Lock.obtain(Lock.java:85)
>> > > >          at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1545)
>> > > >          at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1402)
>> > > >          at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:190)
>> > > >
>> > >
>> >
>>
>


Re: salvaging uncommitted data

2011-01-18 Thread Jason Rutherglen
> if i restart it, will i lose any data that is in memory? if so, is there a
> way around it?

Usually I've restarted the process, and on restart Solr using the
true in solrconfig.xml will
automatically remove the lock file (actually I think it may be removed
automatically when the process dies).

You'll lose the data.

> is there a way to know if there is any data waiting to be written? (if not,
> i will just restart...)

There is via the API, offhand via the Solr dashboard, I don't know.
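
One quick check for pending, uncommitted work is the update handler section of the stats page (stat names vary a little by version, so treat this as a sketch):

  # docsPending under the UPDATEHANDLER section counts adds that have not been committed yet
  curl -s 'http://localhost:8983/solr/admin/stats.jsp' | grep -A1 docsPending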

On Tue, Jan 18, 2011 at 12:35 PM, Udi Nir  wrote:
> i have not restarted the process yet.
> if i restart it, will i lose any data that is in memory? if so, is there a
> way around it?
> is there a way to know if there is any data waiting to be written? (if not,
> i will just restart...)
>
> thanks.
>
> On Tue, Jan 18, 2011 at 12:23 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> > btw where will i find the writes that have not been committed? are they
>> all
>> > in memory or are they in some temp files somewhere?
>>
>> The writes'll be gone if they haven't been committed yet and the
>> process fails.
>>
>> > org.apache.lucene.store.LockObtainFailedException: Lock obtain timed
>>
>> If it's removed then you on restart of the process, this should go
>> away.  However you may see a corrupted index exception.
>>
>> On Tue, Jan 18, 2011 at 11:31 AM, Udi Nir  wrote:
>> > the ebs volume is operational and i cannot see any error in dmesg etc.
>> > the only errors in catalina.out are the lock related ones (even though i
>> > removed the lock file) and when i do a commit everything looks fine in
>> the
>> > log.
>> > i am using the following for the commit:
>> > curl http://localhost:8983/solr/update -s -H "Content-type:text/xml;
>> > charset=utf-8" -d "<commit/>"
>> >
>> >
>> > btw where will i find the writes that have not been committed? are they
>> all
>> > in memory or are they in some temp files somewhere?
>> >
>> > udi
>> >
>> >
>> > On Tue, Jan 18, 2011 at 11:24 AM, Otis Gospodnetic <
>> > otis_gospodne...@yahoo.com> wrote:
>> >
>> >> Udi,
>> >>
>> >> It's hard for me to tell from here, but it looks like your writes are
>> >> really not
>> >> going in at all, in which case there may be nothing (much) to salvage.
>> >>
>> >> The EBS volume is mounted?  And fast (try listing a bigger dir or doing
>> >> something that involves some non-trivial disk IO)?
>> >> No errors anywhere in the log on commit?
>> >> How exactly are you invoking the commit?  There is a wait option
>> there...
>> >>
>> >> Otis
>> >> 
>> >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> >> Lucene ecosystem search :: http://search-lucene.com/
>> >>
>> >>
>> >>
>> >> - Original Message 
>> >> > From: Udi Nir 
>> >> > To: solr-user@lucene.apache.org
>> >> > Sent: Tue, January 18, 2011 2:04:56 PM
>> >> > Subject: Re: salvaging uncommitted data
>> >> >
>> >> > i have not stopped writing so i am getting this error all the time.
>> >> > the  commit actually seems to go through with no errors but it does
>> not
>> >> seem
>> >> > to  write anything to the index files (i can see this because they are
>> >> old
>> >> > and i  cannot see new stuff in search results).
>> >> >
>> >> > my index folder is on an amazon  ebs volume which is a block device
>> and
>> >> looks
>> >> > like a local  disk.
>> >> >
>> >> > thanks!
>> >> >
>> >> > udi
>> >> >
>> >> >
>> >> > On Tue, Jan 18, 2011 at 10:49 AM,  Otis Gospodnetic <
>> >> > otis_gospodne...@yahoo.com>  wrote:
>> >> >
>> >> > > Udi,
>> >> > >
>> >> > > Hm, don't know off the top of my head,  but sounds like "an
>> interesting
>> >> > > problem".
>> >> > > Are you getting this  error while still writing to the index or did
>> you
>> >> stop
>> >> > > all
>> >> > >  writing?
>> >> > > Do you get this error when you issue a commit or?
> >>

Re: Search for social networking sites

2011-01-21 Thread Jason Rutherglen
Out of curiousity, how would Lucandra help in the NRT use case?

On Thu, Jan 20, 2011 at 11:42 PM, Espen Amble Kolstad  wrote:
> I haven't tried myself, but you could look at solandra :
> https://github.com/tjake/Lucandra
>
> - Espen
>
> On Thu, Jan 20, 2011 at 6:30 PM, stockii  wrote:
>>
>> http://wiki.apache.org/solr/NearRealtimeSearchTuning
>>
>> http://lucene.472066.n3.nabble.com/Tuning-Solr-caches-with-high-commit-rates-NRT-td1461275.html
>>
>>
>> http://lucene.472066.n3.nabble.com/NRT-td2276967.html#a2278477
>>
>>
>> -
>> --- System
>> 
>>
>> One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
>> 1 Core with 31 Million Documents other Cores < 100.000
>>
>> - Solr1 for Search-Requests - commit every Minute  - 4GB Xmx
>> - Solr2 for Update-Request  - delta every 2 Minutes - 4GB Xmx
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Search-for-social-networking-sites-tp2295261p2295283.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>


Passing parameters to DataImportHandler

2011-02-15 Thread Jason Rutherglen
It'd be nice to be able to pass HTTP parameters into DataImportHandler
that'd be passed into the SQL as parameters, is this possible?
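
For reference, DIH exposes request parameters to the config as ${dataimporter.request.paramName}, so something along these lines should work (the parameter, table and column names here are hypothetical):

  <!-- data-config.xml sketch: 'cat' arrives on the import request -->
  <entity name="item"
          query="SELECT id, name FROM item WHERE category = '${dataimporter.request.cat}'"/>

  # invocation, passing the parameter on the URL:
  curl 'http://localhost:8983/solr/dataimport?command=full-import&cat=books'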


Re: Solr Hanging all of sudden with update/csv

2011-03-08 Thread Jason Rutherglen
> The index size itself is about 270GB (we are hoping to support up to
> 500GB-1TB), and have supplied the system with ~3TB of disk space.

That's simply massive for a single node.  When the system tries to
merge the segments the queries are probably not working?  And the
merges will take quite a while.  How long is OK for a single query to
return in?

On Tue, Mar 8, 2011 at 2:17 PM, danomano  wrote:
> Hi folks, I've been using solr for about 3 months.
>
> Our Solr install is a single node, and we have been injecting logging data
> into the solr server every couple of minutes, which each updating taking few
> minutes.
>
> Everything working fine until this morning, at which point it appeared that
> all updates were hung.
>
> Retarting the solr server did not help, as all updaters immediately 'hung'
> again.
>
> Poking around in the threads, and strace, I do in fact see stuff happening.
>
> The index size itself is about 270GB (we are hoping to support up to
> 500GB-1TB), and have supplied the system with ~3TB of disk space.
>
> Any Tips on what could be happening?
> notes: we have never run an optimize yet.
>          we have never deleted from system yet.
>
>
> The merge Thread appears to be the one..'never returnning'
> "Lucene Merge Thread #0" - Thread t@41
>   java.lang.Thread.State: RUNNABLE
>        at sun.nio.ch.FileDispatcher.pread0(Native Method)
>        at sun.nio.ch.FileDispatcher.pread(FileDispatcher.java:31)
>        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:234)
>        at sun.nio.ch.IOUtil.read(IOUtil.java:210)
>        at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:622)
>        at
> org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:161)
>        at
> org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:139)
>        at
> org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:94)
>        at org.apache.lucene.store.DataOutput.copyBytes(DataOutput.java:176)
>        at
> org.apache.lucene.index.FieldsWriter.addRawDocuments(FieldsWriter.java:209)
>        at
> org.apache.lucene.index.SegmentMerger.copyFieldsNoDeletions(SegmentMerger.java:424)
>        at
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:332)
>        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:153)
>        at 
> org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4053)
>        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3645)
>        at
> org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:339)
>        at
> org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:407)
>
>
> Some ptrace output:
> 23178 pread(172,
> "\270\316\276\2\245\371\274\2\271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2"...,
> 4096, 98004192) = 4096 <0.09>
> 23178 pread(172,
> "\245\371\274\2\271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2"...,
> 4096, 98004196) = 4096 <0.09>
> 23178 pread(172,
> "\271\316\276\2\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2"...,
> 4096, 98004200) = 4096 <0.08>
> 23178 pread(172,
> "\272\316\276\2\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2"...,
> 4096, 98004204) = 4096 <0.08>
> 23178 pread(172,
> "\273\316\276\2\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2"...,
> 4096, 98004208) = 4096 <0.08>
> 23178 pread(172,
> "\274\316\276\2\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2"...,
> 4096, 98004212) = 4096 <0.09>
> 23178 pread(172,
> "\275\316\276\2\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2"...,
> 4096, 98004216) = 4096 <0.08>
> 23178 pread(172,
> "\276\316\276\2\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2"...,
> 4096, 98004220) = 4096 <0.09>
> 23178 pread(172,
> "\277\316\276\2\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2\304\316\276\2"...,
> 4096, 98004224) = 4096 <0.13>
> 22688 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed
> out) <0.051276>
> 23178 pread(172,
> "\300\316\276\2\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2\304\316\276\2\305\316\276\2"...,
> 4096, 98004228) = 4096 <0.10>
> 22688 futex(0x464a9f28, FUTEX_WAKE_PRIVATE, 1
> 23178 pread(172,
> "\301\316\276\2\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2\304\316\276\2\305\316\276\2\306\316\276\2"...,
> 4096, 98004232) = 4096 <0.10>
> 22688 <... futex resumed> )             = 0 <0.51>
> 23178 pread(172,
> "\302\316\276\2\367\343\274\2\246\371\274\2\303\316\276\2\304\316\276\2\305\3

Re: NRT in Solr

2011-03-09 Thread Jason Rutherglen
Jae,

NRT hasn't been implemented in Solr as of yet, I think partially
because major features such as replication, caching, and uninverted
faceting suddenly are no longer viable, eg, it's another round of
testing etc.  It's doable, however I think the best approach is a
separate request call path, to avoid altering the current [working]
API.

On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo  wrote:
> Hi,
> Is NRT in Solr 4.0 from trunk? I have checkouted from Trunk, but could not
> find the configuration for NRT.
>
> Regards
>
> Jae
>


Re: NRT and warmupTime of filterCache

2011-03-09 Thread Jason Rutherglen
I think it's best to turn the warmupCount to zero because usually
there isn't time in between the creation of a new searcher to run the
warmup queries, eg, it'll negatively impact the desired goal of low
latency new index readers?
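
Concretely, that means the cache autowarm counts in solrconfig.xml go to zero — a sketch of the filterCache entry (sizes are placeholders):

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>  <!-- no warmup queries when a new searcher opens -->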

On Wed, Mar 9, 2011 at 3:41 AM, stockii  wrote:
> I tried to create an NRT like in the wiki but i got some problems with
> autowarming and ondeckSearchers.
>
> ervery minute i start a delta of one core and the other core start every
> minute a commit of the index to search for it.
>
>
> wiki says ... => 1 Searcher and fitlerCache warmupCount=3600. with this
> config i got exception that no searcher is available ... so i cannot use
> this config ...
> my config is, 4 Searchers and warmupCount=3000... with this settings i got
> "Performance Warning", but it works. BUT when the complete 30 seconds (or
> more) needed to warming the searcher, i cannot ping my server in this time
> and i got errors ...
> make it sense to decrese my warmupCount to 0 ???
>
> how serchers do i need for 7 Cores ?
>
> -
> --- System 
> 
>
> One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
> 1 Core with 31 Million Documents other Cores < 100.000
>
> - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
> - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/NRT-and-warmupTime-of-filterCache-tp2654886p2654886.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: True master-master fail-over without data gaps

2011-03-09 Thread Jason Rutherglen
If you're using the delta import handler the problem would seem to go
away because you can have two separate masters running at all times,
and if one fails, you can then point the slaves to the secondary
master, that is guaranteed to be in sync because it's been importing
from the same database?

On Tue, Mar 8, 2011 at 8:45 PM, Otis Gospodnetic
 wrote:
> Hello,
>
> What are some common or good ways to handle indexing (master) fail-over?
> Imagine you have a continuous stream of incoming documents that you have to
> index without losing any of them (or with losing as few of them as possible).
> How do you set up you masters?
> In other words, you can't just have 2 masters where the secondary is the
> Repeater (or Slave) of the primary master and replicates the index 
> periodically:
> you need to have 2 masters that are in sync at all times!
> How do you achieve that?
>
> * Do you just put N masters behind a LB VIP, configure them both to point to 
> the
> index on some shared storage (e.g. SAN), and count on the LB to fail-over to 
> the
> secondary master when the primary becomes unreachable?
> If so, how do you deal with index locks?  You use the Native lock and count on
> it disappearing when the primary master goes down?  That means you count on 
> the
> whole JVM process dying, which may not be the case...
>
> * Or do you use tools like DRBD, Corosync, Pacemaker, etc. to keep 2 masters
> with 2 separate indices in sync, while making sure you write to only 1 of them
> via LB VIP or otherwise?
>
> * Or ...
>
>
> This thread is on a similar topic, but is inconclusive:
>  http://search-lucene.com/m/aOsyN15f1qd1
>
> Here is another similar thread, but this one doesn't cover how 2 masters are
> kept in sync at all times:
>  http://search-lucene.com/m/aOsyN15f1qd1
>
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>


Re: True master-master fail-over without data gaps

2011-03-09 Thread Jason Rutherglen
> Oh, there is no DB involved.  Think of a document stream continuously coming 
> in,
> a component listening to that stream, grabbing docs, and pushing it to
> master(s).

I don't think Solr is designed for this use case, eg, I wouldn't
expect deterministic results with the current architecture as it's
something that's inherently a key component of [No]SQL databases.

On Wed, Mar 9, 2011 at 8:49 AM, Otis Gospodnetic
 wrote:
> Hi,
>
> - Original Message 
>
>> If you're using the delta import handler the problem would seem to go
>> away  because you can have two separate masters running at all times,
>> and if one  fails, you can then point the slaves to the secondary
>> master, that is  guaranteed to be in sync because it's been importing
>> from the same  database?
>
> Oh, there is no DB involved.  Think of a document stream continuously coming 
> in,
> a component listening to that stream, grabbing docs, and pushing it to
> master(s).
>
> Otis
>
>
>
>> On Tue, Mar 8, 2011 at 8:45 PM, Otis Gospodnetic
>>   wrote:
>> > Hello,
>> >
>> > What are some common or good ways to  handle indexing (master) fail-over?
>> > Imagine you have a continuous stream  of incoming documents that you have 
>> > to
>> > index without losing any of them  (or with losing as few of them as
>>possible).
>> > How do you set up you  masters?
>> > In other words, you can't just have 2 masters where the  secondary is the
>> > Repeater (or Slave) of the primary master and  replicates the index
>>periodically:
>> > you need to have 2 masters that are  in sync at all times!
>> > How do you achieve that?
>> >
>> > * Do you  just put N masters behind a LB VIP, configure them both to point 
>> > to
>>the
>> >  index on some shared storage (e.g. SAN), and count on the LB to fail-over 
>> > to
>>the
>> > secondary master when the primary becomes unreachable?
>> > If  so, how do you deal with index locks?  You use the Native lock and 
>> > count
>>on
>> > it disappearing when the primary master goes down?  That means you  count 
>> > on
>>the
>> > whole JVM process dying, which may not be the  case...
>> >
>> > * Or do you use tools like DRBD, Corosync, Pacemaker,  etc. to keep 2
> masters
>> > with 2 separate indices in sync, while making  sure you write to only 1 of
>>them
>> > via LB VIP or  otherwise?
>> >
>> > * Or ...
>> >
>> >
>> > This thread is on a  similar topic, but is inconclusive:
>> >  http://search-lucene.com/m/aOsyN15f1qd1
>> >
>> > Here is another  similar thread, but this one doesn't cover how 2 masters
> are
>> > kept in  sync at all times:
>> >  http://search-lucene.com/m/aOsyN15f1qd1
>> >
>> >  Thanks,
>> > Otis
>> > 
>> > Sematext :: http://sematext.com/ :: Solr -  Lucene - Nutch
>> > Lucene ecosystem search :: http://search-lucene.com/
>> >
>> >
>>
>


Re: True master-master fail-over without data gaps

2011-03-09 Thread Jason Rutherglen
This is why there's block cipher cryptography.

On Wed, Mar 9, 2011 at 9:11 AM, Otis Gospodnetic
 wrote:
> On disk, yes, but only indexed, and thus far enough from the original content 
> to
> make storing terms in Lucene's inverted index acceptable.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>


Re: Solr Hanging all of sudden with update/csv

2011-03-09 Thread Jason Rutherglen
You will need to cap the maximum segment size using
LogByteSizeMergePolicy.setMaxMergeMB.  As then you will only have
segments that are of an optimal size, and Lucene will not try to
create gigantic segments.  I think though on the query side you will
run out of heap space due to the terms index size.  What version are
you using?
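
A sketch of capping segment size from solrconfig.xml (the nested-argument form shown here is the Solr 3.x style; on older versions the policy may need to be configured differently, so verify against your release):

  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
    <!-- segments larger than this stop being merge candidates, so no gigantic merges -->
    <double name="maxMergeMB">4096</double>
  </mergePolicy>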

On Wed, Mar 9, 2011 at 10:17 AM, danomano  wrote:
> After About 4-5 hours the merge completed (ran out of heap)..as you
> suggested..it was having memory issues..
>
> Read queries during the merge were working just fine (they were taking
> longer then normal ~30-60seconds).
>
> I think I need to do more reading on understanding the merge/optimization
> processes.
>
> I am beginning to think what I need to do is have lots of segments? (i.e.
> frequent merges..of smaller sized segments, wouldn't that speed up the
> merging process when it actually runs?).
>
> A couple things I'm trying to wrap my ahead around:
>
> Increasing the segments will improve indexing speed on the whole.
> The question I have is: when it needs to actually perform a merge: will
> having more segments be better  (i.e. make the merge process faster)? or
> longer? ..having a 4 hour merge aka (indexing request) is not really
> acceptable (unless I can control when that merge happens).
>
> We are using our Solr server differently then most: Frequent Inserts (in
> batches), with few Reads.
>
> I would say having a 'long' query time is acceptable (say ~60 seconds).
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-Hanging-all-of-sudden-with-update-csv-tp2652903p2656457.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: True master-master fail-over without data gaps (choosing CA in CAP)

2011-03-09 Thread Jason Rutherglen
Doesn't Solandra partition by term instead of document?

On Wed, Mar 9, 2011 at 2:13 PM, Smiley, David W.  wrote:
> I was just about to jump in this conversation to mention Solandra and go fig, 
> Solandra's committer comes in. :-)   It was nice to meet you at Strata, Jake.
>
> I haven't dug into the code yet but Solandra strikes me as a killer way to 
> scale Solr. I'm looking forward to playing with it; particularly looking at 
> disk requirements and performance measurements.
>
> ~ David Smiley
>
> On Mar 9, 2011, at 3:14 PM, Jake Luciani wrote:
>
>> Hi Otis,
>>
>> Have you considered using Solandra with Quorum writes
>> to achieve master/master with CA semantics?
>>
>> -Jake
>>
>>
>> On Wed, Mar 9, 2011 at 2:48 PM, Otis Gospodnetic >> wrote:
>>
>>> Hi,
>>>
>>>  Original Message 
>>>
 From: Robert Petersen 

 Can't you skip the SAN and keep the indexes locally?  Then you  would
 have two redundant copies of the index and no lock issues.
>>>
>>> I could, but then I'd have the issue of keeping them in sync, which seems
>>> more
>>> fragile.  I think SAN makes things simpler overall.
>>>
 Also, Can't master02 just be a slave to master01 (in the master farm  and
 separate from the slave farm) until such time as master01 fails?   Then
>>>
>>> No, because it wouldn't be in sync.  It would always be N minutes behind,
>>> and
>>> when the primary master fails, the secondary would not have all the docs -
>>> data
>>> loss.
>>>
 master02 would start receiving the new documents with an  indexes
 complete up to the last replication at least and the other slaves  would
 be directed by LB to poll master02 also...
>>>
>>> Yeah, "complete up to the last replication" is the problem.  It's a data
>>> gap
>>> that now needs to be filled somehow.
>>>
>>> Otis
>>> 
>>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>>> Lucene ecosystem search :: http://search-lucene.com/
>>>
>>>
 -Original  Message-
 From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
 Sent: Wednesday, March 09, 2011 9:47 AM
 To: solr-user@lucene.apache.org
 Subject:  Re: True master-master fail-over without data gaps (choosing CA
 in  CAP)

 Hi,


 - Original Message 
> From: Walter  Underwood 

> On  Mar 9, 2011, at 9:02 AM, Otis Gospodnetic wrote:
>
>> You mean  it's  not possible to have 2 masters that are in nearly
 real-time
> sync?
>> How  about with DRBD?  I know people use  DRBD to keep 2 Hadoop NNs
 (their
> edit
>
>> logs) in  sync to avoid the current NN SPOF, for example, so I'm
 thinking
> this
>
>> could be doable with Solr masters, too, no?
>
> If you add fault-tolerant, you run into the CAP  Theorem.  Consistency,

> availability, partition: choose two. You cannot have  it  all.

 Right, so I'll take Consistency and Availability, and I'll  put my 2
 masters in
 the same rack (which has redundant switches, power  supply, etc.) and
 thus
 minimize/avoid partitioning.
 Assuming the above  actually works, I think my Q remains:

 How do you set up 2 Solr masters so  they are in near real-time sync?
 DRBD?

 But here is maybe a simpler  scenario that more people may be
 considering:

 Imagine 2 masters on 2  different servers in 1 rack, pointing to the same
 index
 on the shared  storage (SAN) that also happens to live in the same rack.
 2 Solr masters are  behind 1 LB VIP that indexer talks to.
 The VIP is configured so that all  requests always get routed to the
 primary
 master (because only 1 master  can be modifying an index at a time),
 except when
 this primary is down,  in which case the requests are sent to the
 secondary
 master.

 So in  this case my Q is around automation of this, around Lucene index
 locks,
 around the need for manual intervention, and such.
 Concretely, if you  have these 2 master instances, the primary master has
 the
 Lucene index  lock in the index dir.  When the secondary master needs to
 take
 over  (i.e., when it starts receiving documents via LB), it needs to be
 able to
 write to that same index.  But what if that lock is still around?   One
 could use
 the Native lock to make the lock disappear if the primary  master's JVM
 exited
 unexpectedly, and in that case everything *should*  work and be
 completely
 transparent, right?  That is, the secondary  will start getting new docs,
 it will
 use its IndexWriter to write to that  same shared index, which won't be
 locked
 for writes because the lock is  gone, and everyone will be happy.  Did I
 miss
 something important  here?

 Assuming the above is correct, what if the lock is *not* gone  because
 the
 primary master's JVM is actually not dead, althoug

Re: NRT in Solr

2011-03-10 Thread Jason Rutherglen
Bill,

I think all of the improvements can be made, however they are fairly
large structural changes that would require perhaps several patches.
The other issue is we'll likely land RT this year (or next) and then
the cached values need to be appended to as the documents are added,
that and they'll be across several DWPTs (see LUCENE-2324).  So one
could easily do work for per-segment caching, and then need to go back
and do per-segment, append caches.  I'm not sure caching is needed at
all, especially with the recent speed improvements, except for facets
which resemble field caches, and probably should be subsumed there.

Jason

On Wed, Mar 9, 2011 at 8:27 PM, Bill Bell  wrote:
> So it looks like can handle adding new documents, and expiring old
> documents. Updating a document is not part of the game.
> This would work well for message boards or tweet type solutions.
>
> Solr can do this as well directly. Why wouldn't you just improve the
> document and facet caching so that when you append there is not a huge hit
> to Solr? Also we could add an expiration to documents as well.
>
> The big issue for me is that when I update Solr I need to replicate that
> change quickly to all slaves. If we changed replication to stream to the
> slaves in Near Real Time and not have to create a whole new index version,
> warming, etc, that would be awesome. That combined with better caching
> smarts and we have a near perfect solution.
>
> Thanks.
>
> On 3/9/11 3:29 PM, "Smiley, David W."  wrote:
>
>>Zoie adds NRT to Solr:
>>http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin
>>
>>I haven't tried it yet but looks cool.
>>
>>~ David Smiley
>>Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
>>
>>On Mar 9, 2011, at 9:01 AM, Jason Rutherglen wrote:
>>
>>> Jae,
>>>
>>> NRT hasn't been implemented in Solr as of yet, I think partially
>>> because major features such as replication, caching, and uninverted
>>> faceting suddenly are no longer viable, eg, it's another round of
>>> testing etc.  It's doable, however I think the best approach is a
>>> separate request call path, to avoid altering the current [working]
>>> API.
>>>
>>> On Tue, Mar 8, 2011 at 1:27 PM, Jae Joo  wrote:
>>>> Hi,
>>>> Is NRT in Solr 4.0 from trunk? I have checkouted from Trunk, but could
>>>>not
>>>> find the configuration for NRT.
>>>>
>>>> Regards
>>>>
>>>> Jae
>>>>
>>
>>
>>
>>
>>
>
>
>


Re: NRT and warmupTime of filterCache

2011-03-10 Thread Jason Rutherglen
> - yes, i think so, thats the reason because i dont understand the
> wiki-article ...

Maybe the article is out of date?  I think it's grossly inefficient to
warm the searchers at all in the NRT case.  Queries are being
performed across *all* segments, even though there should only be 1
that's new that may require warming.  However given the new segment's
so small, there should be no reason to warm it at all?

On Thu, Mar 10, 2011 at 12:14 AM, stockii  wrote:
>>> it'll negatively impact the desired goal of low latency new index readers?
> - yes, i think so, thats the reason because i dont understand the
> wiki-article ...
>
> i set the warmupCount to 500 and i got no error messages, that solr isnt
> available ...
> but solr-stats.jsp show me a warmuptime of "warmupTime : 12174 " why ?
>
> is the warmuptime in solrconfig.xml the maximum time in ms, for autowarming
> ? or what does it really means ?
>
> -
> --- System 
> 
>
> One Server, 12 GB RAM, 2 Solr Instances, 7 Cores,
> 1 Core with 31 Million Documents other Cores < 100.000
>
> - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
> - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/NRT-and-warmupTime-of-filterCache-tp2654886p2659560.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


If statements in DataImportHandler?

2011-03-10 Thread Jason Rutherglen
Is it possible to conditionally load sub-entities in
DataImportHandler, based on the gathered value of parent entities?


Re: If statements in DataImportHandler?

2011-03-10 Thread Jason Rutherglen
Right that's not within the XML however, and it's unclear how to
access the upper level entities that have already been instantiated,
eg, beyond the given 'transform' row.
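
For what it's worth, a parent entity's columns are addressable from a sub-entity as ${parentEntityName.COLUMN}, so one workaround is to fold the condition into the child query itself.  A sketch with hypothetical entity and column names:

  <entity name="parent" query="SELECT id, type FROM parent">
    <!-- child rows are only fetched when the parent's type matches -->
    <entity name="child"
            query="SELECT tag FROM child
                   WHERE parent_id = '${parent.id}' AND '${parent.type}' = 'BOOK'"/>
  </entity>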

On Thu, Mar 10, 2011 at 8:02 PM, Gora Mohanty  wrote:
> On Fri, Mar 11, 2011 at 4:48 AM, Jason Rutherglen
>  wrote:
>> Is it possible to conditionally load sub-entities in
>> DataImportHandler, based on the gathered value of parent entities?
>
> Probably the easies way to do that is with a transformer.
> Please see the DIH Wiki page for details:
> http://wiki.apache.org/solr/DataImportHandler#Transformer
>
> Regards,
> Gora
>


Re: solr on the cloud

2011-03-25 Thread Jason Rutherglen
Dmitry,

If you're planning on using HBase you can take a look at
https://issues.apache.org/jira/browse/HBASE-3529  I think we may even
have a reasonable solution for reading the index [randomly] out of
HDFS.  Benchmarking'll be implemented next.  It's not production
ready, suggestions are welcome.

Jason

On Fri, Mar 25, 2011 at 2:03 PM, Dmitry Kan  wrote:
> Hi Otis,
>
> Thanks for elaborating on this and the link (funny!).
>
> I have quite a big dataset growing all the time. The problems that I start
> facing are pretty much predictable:
> 1. Scalability: this includes indexing time (now some days!; better hours or
> even minutes, if that's possible) along with handling the rapid growth
> 2. Robustness: the entire system (distributed or single server or anything
> else) should be fault-tolerant, e.g. if one shard goes down, other catches
> up (master-slave scheme)
> 3. Some apps that we run on SOLR are pretty computationally demanding, like
> faceting over one+bi+trigrams of hundreds of millions of documents (index
> size of half a TB) ---> single server with a shard of data does not seem to
> be enough for realtime search.
>
> This is just for a bit of a background. I agree with you on that hadoop and
> cloud probably best suit massive batch processes rather than realtime
> search. I'm sure, if anyone out there made SOLR shine through the cloud for
> realtime search over large datasets.
>
> By "SOLR on the cloud (e.g. HDFS + MR +  cloud of
> commodity machines)" I mean what you've done for your customers using EC2.
> Any chance, the guidlines/articles for/on setting indices on HDFS are
> available in some open / paid area?
>
> To sum this up, I didn't mean to create a buzz on the cloud solutions in
> this thread, just was wondering what is practically available / going on in
> SOLR development in this regard.
>
> Thanks,
>
> Dmitry
>
>
> On Fri, Mar 25, 2011 at 10:28 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
>> Hi Dan,
>>
>> This feels a bit like a buzzword soup with mushrooms. :)
>>
>> MR jobs, at least the ones in Hadoopland, are very batch oriented, so that
>> wouldn't be very suitable for most search applications.  There are some
>> technologies like Riak that combine MR and search.  Let me use this funny
>> little
>> link: http://lmgtfy.com/?q=riak%20mapreduce%20search
>>
>>
>> Sure, you can put indices on HDFS (but don't expect searches to be fast).
>>  Sure
>> you can create indices using MapReduce, we've done that successfully for
>> customers bringing long indexing jobs from many hours to minutes by using,
>> yes,
>> a cluster of machines (actually EC2 instances).
>> But when you say "more into SOLR on the cloud (e.g. HDFS + MR +  cloud of
>> commodity machines)", I can't actually picture what precisely you mean...
>>
>>
>> Otis
>> ---
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>
>>
>> - Original Message 
>> > From: Dmitry Kan 
>> > To: solr-user@lucene.apache.org
>> > Cc: Upayavira 
>> > Sent: Fri, March 25, 2011 8:26:33 AM
>> > Subject: Re: solr on the cloud
>> >
>> > Hi, Upayavira
>> >
>> > Probably I'm confusing the terms here. When I say  "distributed faceting"
>> I'm
>> > more into SOLR on the cloud (e.g. HDFS + MR +  cloud of commodity
>> machines)
>> > rather than into traditional multicore/sharded  SOLR on a single or
>> multiple
>> > servers with non-distributed file systems (is  that what you mean when
>> you
>> > refer to "distribution of facet requests across  hosts"?)
>> >
>> > On Fri, Mar 25, 2011 at 1:57 PM, Upayavira   wrote:
>> >
>> > >
>> > >
>> > > On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan"  
>> > >  wrote:
>> > > > Hi Yonik,
>> > > >
>> > > > Oh, this is great. Is  distributed faceting available in the trunk?
>> What
>> > > > is
>> > > >  the basic server setup needed for trying this out, is it cloud with
>> HDFS
>> > >  > and
>> > > > SOLR with zookepers?
>> > > > Any chance to see the  related documentation? :)
>> > >
>> > > Distributed faceting has been  available for a long time, and is
>> > > available in the 1.4.1  release.
>> > >
>> > > The distribution of facet requests across hosts happens  in the
>> > > background. There's no real difference (in query syntax) between  a
>> > > standard facet query and a distributed one.
>> > >
>> > > i.e. you  don't need SolrCloud nor Zookeeper for it. (they may provide
>> > > other  benefits, but you don't need them for distributed faceting).
>> > >
>> > >  Upayavira
>> > >
>> > > > On Fri, Mar 25, 2011 at 1:35 PM, Yonik  Seeley
>> > > > wrote:
>> > >  >
>> > > > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan 
>> > >  wrote:
>> > > > > > Basically, of high interest is checking out the  Map-Reduce for
>> > > > > distributed
>> > > > > > faceting, is  it even possible with the trunk?
>> > > > >
>> > > > > Solr  already has distributed faceting, and it's much more
>> performant
>> > > >  > than a map-reduce implementation 

Re: Updates during Optimize

2011-04-12 Thread Jason Rutherglen
You can index and optimize at the same time.  The current limitation
or pause is when the ram buffer is flushing to disk, however that's
changing with the DocumentsWriterPerThread implementation, eg,
LUCENE-2324.

On Tue, Apr 12, 2011 at 8:34 AM, Shawn Heisey  wrote:
> On 4/12/2011 6:21 AM, stockii wrote:
>>
>> Hello.
>>
>> When i start an optimize (which takes more than 4 hours) no updates from
>> DIH are possible.
>> i thought solr copies the whole index and then starts an optimize from the
>> copy, rather than locking the index and optimizing it in place ... =(
>>
>> any way to do both in the same time ?
>
> You can't index and optimize at the same time, and I'm pretty sure that
> there isn't any way to make it possible that wouldn't involve a major
> rewrite of Lucene, and possibly Solr.  The devs would have to say
> differently if my understanding is wrong.
>
> The optimize takes place at the Lucene level.  I can't give you much
> in-depth information, but I can give you some high level stuff.  What it's
> doing is equivalent to a merge, down to one segment.  This is not the same
> as a straight file copy.  It must read the entire Lucene data structure and
> build a new one from scratch.  The process removes deleted documents and
> will also upgrade the version number of the index if it was written with an
> older version of Lucene.  It's very likely that the reading side of the
> process is nearly as comprehensive as the CheckIndex program, but it also
> has to write out a new index segment.
>
> The net result -- the process gives your CPU and especially your I/O
> subsystem a workout, simultaneously.  If you were to make your I/O subsystem
> faster, you would probably see a major improvement in your optimize times.
>
> On my installation, it takes about 11 minutes to optimize one my 16GB
> shards, each with 9 million docs.  These live in virtual machines that are
> stored on a six-drive RAID10 array using 7200RPM SATA disks.  One of my
> pie-in-the-sky upgrade dreams is to replace that with a four-drive RAID10
> array using SSD, the other two drives would be regular SATA -- a mirrored OS
> partition.
>
> Thanks,
> Shawn
>
>


Re: Search across related/correlated multivalue fields in Solr

2011-04-27 Thread Jason Rutherglen
Renaud,

Can you provide a brief synopsis of how your system works?

Jason

On Wed, Apr 27, 2011 at 11:17 AM, Renaud Delbru  wrote:
> Hi,
>
> you might want to look at the SIREn plugin [1,2], which allows you to index
> and query 1:N relationships such as yours, in a tabular data format [3].
>
> [1] http://siren.sindice.com/
> [2] https://github.com/rdelbru/SIREn
> [3]
> https://dev.deri.ie/confluence/display/SIREn/Indexing+and+Searching+Tabular+Data
>
> Kind Regards,
> --
> Renaud Delbru
>
> On 27/04/11 18:30, ronotica wrote:
>>
>> The nature of my project is such that search is needed and specifically
>> search across related entities. We want to perform several queries
>> involving
>> a correlation between two or more properties of a given entity in a
>> collection.
>>
>> To put things in context, here is a snippet of the domain:
>>
>> Student { firstname, lastname }
>> Education { degreeCode, degreeYear, institution }
>>
>> The database tables look like so:
>>
>> STUDENT
>> --
>> STUDENT_ID     FNAME      LNAME
>> 100                 John          Doe
>> 200                 Rasheed     Jones
>> 300                 Mary          Hampton
>>
>> EDUCATION
>> -
>> EDUCATION_ID   DEGREE_CODE   DEGREE_YR   INSTITUTION   STUDENT_ID
>> 1              MD            2008        OHIO_ST       100
>> 2              PHD           2010        YALE          100
>> 3              MS            2007        OHIO_ST       200
>> 4              MD            2010        YALE          300
>>
>> A student can have many educations. Currently, our documents look like
>> this
>> in solr:
>>
>> DOC_ID   STUDENT_ID   FNAME     LNAME     DEGREE_CODE   DEGREE_YR   INSTITUTION
>> 100      100          John      Doe       MD PHD        2008 2010   OHIO_ST YALE
>> 101      200          Rasheed   Jones     MS            2007        OHIO_ST
>> 102      300          Mary      Hampton   MD            2010        YALE
>>
>> Searching for all students who graduated from OHIO_ST in 2010 currently
>> gives a hit (John Doe) when it shouldn't.
>>
>> What is the best way to have overcome this issue in Solr? This is only
>> happening when I am searching across correlated fields, mainly because the
>> data has been denormalized and Lucene has no notion of relationships
>> between
>> the various fields.
>>
>> One way that as come to mind is to have separate documents for "education"
>> and perform multiple searches to get at an answer. Besides this, is there
>> any other way? Does Solr provide any elegant solution for this?
>>
>> Any help will be greatly appreciated.
>>
>> Thanks.
>>
>> PS: We have about 15 of these kind of relationships all relating to the
>> student and will like to perform search on each of them.
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Search-across-related-correlated-multivalue-fields-in-Solr-tp2871176p2871176.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Can the Suggester be updated incrementally?

2011-04-28 Thread Jason Rutherglen
It's answered on the wiki site:

"TSTLookup - ternary tree based representation, capable of immediate
data structure updates"

Although the EdgeNGram technique is probably more widely adopted, eg,
it's closer to what Google has implemented.

http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
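
For the edge-ngram route, a sketch of a schema.xml field type along the lines of that post (parameters are illustrative):

  <fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- index prefixes of each title: "so", "sol", "solr", ... -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

New or updated titles then become searchable as soon as they are committed, which sidesteps the rebuild question for the tree-based lookups.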

On Thu, Apr 28, 2011 at 9:37 PM, Andy  wrote:
> I'm interested in using Suggester (http://wiki.apache.org/solr/Suggester) for 
> auto-complete on the field "Document Title".
>
> Does Suggester (either FST, TST or Jaspell) support incremental updates? Say 
> I want to add a new document title to the Suggester, or to change the weight 
> of an existing document title, would I need to rebuild the entire tree for 
> every update?
>
> Also, can the Suggester be sharded? If the size of the tree gets bigger than 
> the RAM size, is it possible to shard the Suggester across multiple machines?
>
> Thanks
> Andy
>


Re: Can the Suggester be updated incrementally?

2011-04-29 Thread Jason Rutherglen
Good question, and you could be correct about that.  It's possible that
part hasn't been built yet; if it hasn't, you could create a patch.

On Thu, Apr 28, 2011 at 10:13 PM, Andy  wrote:
>
> --- On Fri, 4/29/11, Jason Rutherglen  wrote:
>
>> It's answered on the wiki site:
>>
>> "TSTLookup - ternary tree based representation, capable of
>> immediate
>> data structure updates"
>>
>
> But how to update it?
>
> The wiki talks about getting data sources from a file or from the main index. 
> In either case it sounds like the entire data structure will be rebuilt, no?
>


Re: Solr vs ElasticSearch

2011-05-31 Thread Jason Rutherglen
Mark,

Nice email address.  I personally have no idea, maybe ask Shay Banon
to post an answer?  I think it's possible to make Solr more elastic,
eg, it's currently difficult to make it move cores between servers
without a lot of manual labor.

Jason

On Tue, May 31, 2011 at 7:33 PM, Mark  wrote:
> I've been hearing more and more about ElasticSearch. Can anyone give me a
> rough overview on how these two technologies differ. What are the
> strengths/weaknesses of each. Why would one choose one of the other?
>
> Thanks
>


Re: Solr vs ElasticSearch

2011-05-31 Thread Jason Rutherglen
Thanks Shashi, this is oddly coincidental with another issue being put
into Solr (SOLR-2193) to help solve some of the NRT issues; the timing
is impeccable.

At base, however, Solr uses Lucene, as does ES.  I think the main
advantage of ES is the auto-sharding etc.  I think it uses a gossip
protocol to accomplish this, however... Hmm...

On Tue, May 31, 2011 at 10:01 PM, Shashi Kant  wrote:
> Here is a very interesting comparison
>
> http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/
>
>
>> -Original Message-
>> From: Mark
>> Sent: May-31-11 10:33 PM
>> To: solr-user@lucene.apache.org
>> Subject: Solr vs ElasticSearch
>>
>> I've been hearing more and more about ElasticSearch. Can anyone give me a
>> rough overview on how these two technologies differ. What are the
>> strengths/weaknesses of each. Why would one choose one of the other?
>>
>> Thanks
>>
>>
>


Re: Solr vs ElasticSearch

2011-06-01 Thread Jason Rutherglen
> I'm likely to try playing with moving cores between hosts soon. In
> theory it shouldn't be hard. We'll see what the practice is like!

Right, in theory it's quite simple, in practice I've setup a master,
then a slave, then had to add replication to both, then call create
core, then replicate, then unload core on the master.  It's
nightmarish to setup.  The problem is, it freezes each core into a
respective role, so if I wanted to then 'move' the slave, I can't
because it's still setup as a slave.

On Wed, Jun 1, 2011 at 4:14 AM, Upayavira  wrote:
>
>
> On Tue, 31 May 2011 19:38 -0700, "Jason Rutherglen"
>  wrote:
>> Mark,
>>
>> Nice email address.  I personally have no idea, maybe ask Shay Banon
>> to post an answer?  I think it's possible to make Solr more elastic,
>> eg, it's currently difficult to make it move cores between servers
>> without a lot of manual labor.
>
> I'm likely to try playing with moving cores between hosts soon. In
> theory it shouldn't be hard. We'll see what the practice is like!
>
> Upayavira
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>


Re: Solr vs ElasticSearch

2011-06-01 Thread Jason Rutherglen
> And some way to delete the core when it has been transferred.

Right, I manually added that to CoreAdminHandler.  I opened an issue
to try to solve this problem: SOLR-2569

On Wed, Jun 1, 2011 at 8:26 AM, Upayavira  wrote:
>
>
> On Wed, 01 Jun 2011 07:52 -0700, "Jason Rutherglen"
>  wrote:
>> > I'm likely to try playing with moving cores between hosts soon. In
>> > theory it shouldn't be hard. We'll see what the practice is like!
>>
>> Right, in theory it's quite simple, in practice I've setup a master,
>> then a slave, then had to add replication to both, then call create
>> core, then replicate, then unload core on the master.  It's
>> nightmarish to setup.  The problem is, it freezes each core into a
>> respective role, so if I wanted to then 'move' the slave, I can't
>> because it's still setup as a slave.
>
> Yep, I'm expecting it to require some changes to both the
> CoreAdminHandler and the ReplicationHandler.
>
> Probably the ReplicationHandler would need a 'one-off' replication
> command. And some way to delete the core when it has been transferred.
>
> Upayavira
>
>> On Wed, Jun 1, 2011 at 4:14 AM, Upayavira  wrote:
>> >
>> >
>> > On Tue, 31 May 2011 19:38 -0700, "Jason Rutherglen"
>> >  wrote:
>> >> Mark,
>> >>
>> >> Nice email address.  I personally have no idea, maybe ask Shay Banon
>> >> to post an answer?  I think it's possible to make Solr more elastic,
>> >> eg, it's currently difficult to make it move cores between servers
>> >> without a lot of manual labor.
>> >
>> > I'm likely to try playing with moving cores between hosts soon. In
>> > theory it shouldn't be hard. We'll see what the practice is like!
>> >
>> > Upayavira
>> > ---
>> > Enterprise Search Consultant at Sourcesense UK,
>> > Making Sense of Open Source
>> >
>> >
>>
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>


Re: Solr vs ElasticSearch

2011-06-01 Thread Jason Rutherglen
Jonathan,

This is all true, however it ends up being hacky (this is from
experience) and the core on the source needs to be deleted.  Feel free
to post to the issue.

Jason

On Wed, Jun 1, 2011 at 8:44 AM, Jonathan Rochkind  wrote:
> On 6/1/2011 10:52 AM, Jason Rutherglen wrote:
>>
>> nightmarish to setup. The problem is, it freezes each core into a
>> respective role, so if I wanted to then 'move' the slave, I can't
>> because it's still setup as a slave.
>
> Don't know if this helps or not, but you CAN set up a core as both a master
> and a slave. Normally this is to make it a "repeater", still always taking
> from the same upstream and sending downstream. But there might be a way to
> hack it for your needs without actually changing Java code, a core _can_ be
> both a master and slave simultaneously, and there might be a way to change
> it's masterURL (where it pulls from when acting as a slave) without
> restarting the core too.  You can supply a 'custom' (not configured)
> masterURL in a manual 'pull' command (over HTTP), but of course usually
> slaves poll rather than be directed by manual 'pull' commands.
>
>


Re: Nrt and caching

2012-07-07 Thread Jason Rutherglen
Hi Amit,

If the caches were per-segment, then NRT would be optimal in Solr.

Currently the caches are stored per-multiple-segments, meaning after each
'soft' commit, the cache(s) will be purged.

On Fri, Jul 6, 2012 at 9:45 PM, Amit Nithian  wrote:

> Sorry I'm a bit new to the nrt stuff in solr but I'm trying to understand
> the implications of frequent commits and cache rebuilding and auto warming.
> What are the best practices surrounding nrt searching and caches and query
> performance.
>
> Thanks!
> Amit
>


Re: Nrt and caching

2012-07-07 Thread Jason Rutherglen
The field caches are per-segment, which are used for sorting and basic
[slower] facets.  The result set, document, filter, and multi-value facet
caches are [in Solr] per-multi-segment.

Of these, the document, filter, and multi-value facet caches could be
converted to be [performant] per-segment, as with some other Apache
licensed Lucene based search engines.

On Sat, Jul 7, 2012 at 10:42 AM, Yonik Seeley wrote:

> On Sat, Jul 7, 2012 at 9:59 AM, Jason Rutherglen
>  wrote:
> > Currently the caches are stored per-multiple-segments, meaning after each
> > 'soft' commit, the cache(s) will be purged.
>
> Depends which caches.  Some caches are per-segment, and some caches
> are top level.
> It's also a trade-off... for some things, per-segment data structures
> would indeed turn around quicker on a reopen, but every query would be
> slower for it.
>
> -Yonik
> http://lucidimagination.com
>


Re: Nrt and caching

2012-07-07 Thread Jason Rutherglen
Andy,

You'd need to hack on the Solr code, specifically the SimpleFacets class.
Solr uses UnInvertedField to build an in memory doc -> terms mapping, which
would need to be cached per-segment.  Then you'd need to aggregate the
resultant per-segment counts.
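
As a rough sketch (plain Java, not actual Solr code) of what aggregating the
per-segment counts means: each segment contributes its own term -> count map,
and the maps are summed into one global facet count per term.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PerSegmentFacetMerge {
    // Sum the per-segment facet counts into a single index-wide count per term.
    public static Map<String, Integer> merge(List<Map<String, Integer>> perSegmentCounts) {
        Map<String, Integer> global = new HashMap<String, Integer>();
        for (Map<String, Integer> segmentCounts : perSegmentCounts) {
            for (Map.Entry<String, Integer> e : segmentCounts.entrySet()) {
                int count = e.getValue();
                Integer existing = global.get(e.getKey());
                global.put(e.getKey(), existing == null ? count : existing + count);
            }
        }
        return global;
    }
}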

There is another open source library that has taken the same basic faceting
approach (it is per-segment), and could be colloquially faster, however it
is built for Lucene 3.x at the moment.

On Sat, Jul 7, 2012 at 12:21 PM, Andy  wrote:

> So If I want to use multi-value facet with NRT I'd need to convert the
> cache to per-segment? How do I do that?
>
> Thanks.
>
>
> ________
>  From: Jason Rutherglen 
> To: solr-user@lucene.apache.org
> Sent: Saturday, July 7, 2012 11:32 AM
> Subject: Re: Nrt and caching
>
> The field caches are per-segment, which are used for sorting and basic
> [slower] facets.  The result set, document, filter, and multi-value facet
> caches are [in Solr] per-multi-segment.
>
> Of these, the document, filter, and multi-value facet caches could be
> converted to be [performant] per-segment, as with some other Apache
> licensed Lucene based search engines.
>
> On Sat, Jul 7, 2012 at 10:42 AM, Yonik Seeley  >wrote:
>
> > On Sat, Jul 7, 2012 at 9:59 AM, Jason Rutherglen
> >  wrote:
> > > Currently the caches are stored per-multiple-segments, meaning after
> each
> > > 'soft' commit, the cache(s) will be purged.
> >
> > Depends which caches.  Some caches are per-segment, and some caches
> > are top level.
> > It's also a trade-off... for some things, per-segment data structures
> > would indeed turn around quicker on a reopen, but every query would be
> > slower for it.
> >
> > -Yonik
> > http://lucidimagination.com
> >
>


Re: Grouping and Averages

2012-07-07 Thread Jason Rutherglen
Average should be doable in Solr, maybe not today, not sure.  Median is the
challenge :)  Try Hive.

On Sat, Jul 7, 2012 at 3:34 PM, Walter Underwood wrote:

> It sounds like you need a database for analytics, not a search engine.
>
> Solr cannot do aggregates like that. It can select and group, but to
> calculate averages you'll need to fetch all the results over the network
> and calculate them yourself.
>
> wunder
>
> On Jul 7, 2012, at 9:05 AM, Jeremy Branham wrote:
>
> > I’m sorry – I sent this email before I was confirmed in the group, so I
> don’t know if anyone sent a reply =\
> >
> > __
> >
> > Hello -
> > I’m not sure If this is an appropriate use for Solr, but I want to stay
> away from a typical DB store for high availability reasons.
> >
> > I am storing documents that may have a common value for a field we’ll
> call “category”.
> > In another field there will be an integer field we’ll call “rating”.
> >
> > I would like to group the documents on the “category” field and display
> the average “rating” per group.
> >
> > The stats component lets me get the avg rating, but when I collapse the
> results into groups it gives me the average for the entire collection,
> rather than for the specific group.
> >
> > Am I going about this wrong?
> > Is it possible to get the desired outcome with a  single query?
> >
> > I’d appreciate any insight!
> > Thank you,
> >
> >
> >
> > Jeremy Branham
> > Software Engineer
> > http://LinkedIn.com/in/JeremyBranham
> > http://jeremybranham.wordpress.com/
> > http://Zeroth.biz
>
>
>
>


Re: Grouping and Averages

2012-07-07 Thread Jason Rutherglen
I don't think aggregations in the Solr group by are completed yet.  There's
a Lucene or Solr issue implementing group by count that could be adapted to
implement average for example.
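
In the meantime, a client-side workaround is to fetch the matching documents
and accumulate the per-group average yourself.  A rough sketch (the "category"
and "rating" field names follow the example in this thread; this is not a Solr
API):

import java.util.HashMap;
import java.util.Map;

public class GroupAverage {
    // For each category keep {sum of ratings, number of ratings}.
    private final Map<String, long[]> sums = new HashMap<String, long[]>();

    public void add(String category, int rating) {
        long[] acc = sums.get(category);
        if (acc == null) {
            acc = new long[2];
            sums.put(category, acc);
        }
        acc[0] += rating;
        acc[1]++;
    }

    public double average(String category) {
        long[] acc = sums.get(category);
        return (acc == null || acc[1] == 0) ? 0.0 : (double) acc[0] / acc[1];
    }
}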

On Sat, Jul 7, 2012 at 4:37 PM, Jeremy Branham wrote:

> Thanks for the replies.
> I may be able to simplify my requirements.
>
> In my application, the number of documents per group indicate popularity.
> If I could sort the groups descending by the document count, then using
> the stats component + filter I could query each group to get avg value for
> a field.
>
> Though I dont see how to sort the groups by document count.
> I thought maybe a pseudo field with a functional query would return a
> document element but my tests failed.
>
> Its a bit of a challenge to switch my thought process from SQL to Solr.
>
>
> Jeremy Branham
> Software Engineer
> http://LinkedIn.com/in/JeremyBranham
> http://jeremybranham.wordpress.com/
> http://Zeroth.biz
>
> -Original Message- From: Jason Rutherglen
> Sent: Saturday, July 07, 2012 2:45 PM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Grouping and Averages
>
> Average should be doable in Solr, maybe not today, not sure.  Median is the
> challenge :)  Try Hive.
>
> On Sat, Jul 7, 2012 at 3:34 PM, Walter Underwood  >wrote:
>
>  It sounds like you need a database for analytics, not a search engine.
>>
>> Solr cannot do aggregates like that. It can select and group, but to
>> calculate averages you'll need to fetch all the results over the network
>> and calculate them yourself.
>>
>> wunder
>>
>> On Jul 7, 2012, at 9:05 AM, Jeremy Branham wrote:
>>
>> > I’m sorry – I sent this email before I was confirmed in the group, so I
>> don’t know if anyone sent a reply =\
>> >
>> > __
>> >
>> > Hello -
>> > I’m not sure If this is an appropriate use for Solr, but I want to stay
>> away from a typical DB store for high availability reasons.
>> >
>> > I am storing documents that may have a common value for a field we’ll
>> call “category”.
>> > In another field there will be an integer field we’ll call “rating”.
>> >
>> > I would like to group the documents on the “category” field and display
>> the average “rating” per group.
>> >
>> > The stats component lets me get the avg rating, but when I collapse the
>> results into groups it gives me the average for the entire collection,
>> rather than for the specific group.
>> >
>> > Am I going about this wrong?
>> > Is it possible to get the desired outcome with a  single query?
>> >
>> > I’d appreciate any insight!
>> > Thank you,
>> >
>> >
>> >
>> > Jeremy Branham
>> > Software Engineer
>> > http://LinkedIn.com/in/JeremyBranham
>> > http://jeremybranham.wordpress.com/
>> > http://Zeroth.biz
>>
>>
>>
>>
>>
>


Re: Nrt and caching

2012-07-07 Thread Jason Rutherglen
Multi-value faceting is fast for queries, however because it's cached
per-multi-segment, each soft commit will flush the cache, and it will be
reloaded on the first query.  As the index grows it becomes expensive to
build, as well as being RAM consuming.

I am not aware of any Jira issues open with activity regarding adding this
feature to Solr.

On Sat, Jul 7, 2012 at 8:32 PM, Andy  wrote:

> Jason,
>
> If I just use stock Solr 4.0 without modifying the source code, does that
> mean multi-value faceting will be very slow when I'm constantly
> inserting/updating documents?
>
> Which open source library are you referring to? Will Solr adopt this
> per-segment approach any time soon?
>
> Thanks
>
>
> ____
>  From: Jason Rutherglen 
> To: solr-user@lucene.apache.org
> Sent: Saturday, July 7, 2012 2:05 PM
> Subject: Re: Nrt and caching
>
> Andy,
>
> You'd need to hack on the Solr code, specifically the SimpleFacets class.
> Solr uses UnInvertedField to build an in memory doc -> terms mapping, which
> would need to be cached per-segment.  Then you'd need to aggregate the
> resultant per-segment counts.
>
> There is another open source library that has taken the same basic faceting
> approach (it is per-segment), and could be colloquially faster, however it
> is built for Lucene 3.x at the moment.
>
> On Sat, Jul 7, 2012 at 12:21 PM, Andy  wrote:
>
> > So If I want to use multi-value facet with NRT I'd need to convert the
> > cache to per-segment? How do I do that?
> >
> > Thanks.
> >
> >
> > 
> >  From: Jason Rutherglen 
> > To: solr-user@lucene.apache.org
> > Sent: Saturday, July 7, 2012 11:32 AM
> > Subject: Re: Nrt and caching
> >
> > The field caches are per-segment, which are used for sorting and basic
> > [slower] facets.  The result set, document, filter, and multi-value facet
> > caches are [in Solr] per-multi-segment.
> >
> > Of these, the document, filter, and multi-value facet caches could be
> > converted to be [performant] per-segment, as with some other Apache
> > licensed Lucene based search engines.
> >
> > On Sat, Jul 7, 2012 at 10:42 AM, Yonik Seeley <
> yo...@lucidimagination.com
> > >wrote:
> >
> > > On Sat, Jul 7, 2012 at 9:59 AM, Jason Rutherglen
> > >  wrote:
> > > > Currently the caches are stored per-multiple-segments, meaning after
> > each
> > > > 'soft' commit, the cache(s) will be purged.
> > >
> > > Depends which caches.  Some caches are per-segment, and some caches
> > > are top level.
> > > It's also a trade-off... for some things, per-segment data structures
> > > would indeed turn around quicker on a reopen, but every query would be
> > > slower for it.
> > >
> > > -Yonik
> > > http://lucidimagination.com
> > >
> >
>


Re: Count disctint groups in grouping distributed

2012-09-12 Thread Jason Rutherglen
Distinct in a distributed environment would require de-duplication
en masse; use Hive or MapReduce instead.

On Wed, Sep 12, 2012 at 11:53 AM, yriveiro  wrote:
> Hi,
>
> Exists the possibility of do a distinct group count in a grouping done using
> a sharding schema?
>
> This issue https://issues.apache.org/jira/browse/SOLR-3436 make a fixe in
> the way to sum all groups returned in a distributed grouping operation, but
> not always we want the sum, in some cases is interesting have the distinct
> groups between shards.
>
>
>
> -
> Best regards
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Count-disctint-groups-in-grouping-distributed-tp4007257.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: search suggest

2009-07-29 Thread Jason Rutherglen
Autosuggest is something that would be very useful to build into
Solr as many search projects require it.

I'd recommend indexing relevant terms/phrases into a Ternary
Search Tree which is compact and performant. Using a wildcard
query will likely not be as fast as a Ternary Tree, and I'm not
sure how phrases would be handled?

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysi

It would be good to separate out the TernaryTree from
analysis/compound and into Lucene core, or into its own contrib.

Also see http://issues.apache.org/jira/browse/LUCENE-625 which
improves relevancy using click through rates.

I'll open an issue in Solr to get this one going.
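
To give a feel for the data structure, here's a minimal standalone Ternary
Search Tree sketch in Java (illustration only; the class and method names are
made up and this is not the Lucene TernaryTree API):

import java.util.ArrayList;
import java.util.List;

public class TernarySearchTree {
    private static class Node {
        char ch;
        boolean isWordEnd;
        Node left, mid, right;
        Node(char ch) { this.ch = ch; }
    }

    private Node root;

    // Insert a suggestion phrase into the tree.
    public void insert(String word) {
        if (word == null || word.length() == 0) return;
        root = insert(root, word, 0);
    }

    private Node insert(Node node, String word, int pos) {
        char c = word.charAt(pos);
        if (node == null) node = new Node(c);
        if (c < node.ch) {
            node.left = insert(node.left, word, pos);
        } else if (c > node.ch) {
            node.right = insert(node.right, word, pos);
        } else if (pos < word.length() - 1) {
            node.mid = insert(node.mid, word, pos + 1);
        } else {
            node.isWordEnd = true;
        }
        return node;
    }

    // Collect up to 'limit' completions of the given prefix.
    public List<String> suggest(String prefix, int limit) {
        List<String> results = new ArrayList<String>();
        if (prefix == null || prefix.length() == 0 || root == null) return results;
        Node node = find(root, prefix, 0);
        if (node == null) return results;
        if (node.isWordEnd) results.add(prefix);
        collect(node.mid, new StringBuilder(prefix), results, limit);
        return results;
    }

    private Node find(Node node, String prefix, int pos) {
        if (node == null) return null;
        char c = prefix.charAt(pos);
        if (c < node.ch) return find(node.left, prefix, pos);
        if (c > node.ch) return find(node.right, prefix, pos);
        if (pos == prefix.length() - 1) return node;
        return find(node.mid, prefix, pos + 1);
    }

    private void collect(Node node, StringBuilder prefix, List<String> results, int limit) {
        if (node == null || results.size() >= limit) return;
        collect(node.left, prefix, results, limit);
        prefix.append(node.ch);
        if (node.isWordEnd && results.size() < limit) results.add(prefix.toString());
        collect(node.mid, prefix, results, limit);
        prefix.deleteCharAt(prefix.length() - 1);
        collect(node.right, prefix, results, limit);
    }
}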

On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen wrote:
> To do a proper search suggest feature you have to index all the queries
> your system gets and search it with wildcards for matches on what the
> user has typed so far for each user keystroke in the search box...
> Usually with some timer logic to wait for a small hesitation in their
> typing.
>
>
>
> -Original Message-
> From: Jack Bates [mailto:ms...@freezone.co.uk]
> Sent: Tuesday, July 28, 2009 10:54 AM
> To: solr-user@lucene.apache.org
> Subject: search suggest
>
> how can i use solr to make search suggestions? i'm thinking google-style
> suggestions, which suggests more refined queries - vs. freebase-style
> suggestions, which suggests top hits.
>
> i've been looking at the query params,
> http://wiki.apache.org/solr/StandardRequestHandler
>
> - and searching for "solr suggest" - but haven't figured out how to get
> search suggestions from solr
>


Re: search suggest

2009-07-29 Thread Jason Rutherglen
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/compound/hyphenation/TernaryTree.html

On Wed, Jul 29, 2009 at 12:08 PM, Jason
Rutherglen wrote:
> Autosuggest is something that would be very useful to build into
> Solr as many search projects require it.
>
> I'd recommend indexing relevant terms/phrases into a Ternary
> Search Tree which is compact and performant. Using a wildcard
> query will likely not be as fast as a Ternary Tree, and I'm not
> sure how phrases would be handled?
>
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysi
>
> It would be good to separate out the TernaryTree from
> analysis/compound and into Lucene core, or into it's own contrib.
>
> Also see http://issues.apache.org/jira/browse/LUCENE-625 which
> improves relevancy using click through rates.
>
> I'll open an issue in Solr to get this one going.
>
> On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen wrote:
>> To do a proper search suggest feature you have to index all the queries
>> your system gets and search it with wildcards for matches on what the
>> user has typed so far for each user keystroke in the search box...
>> Usually with some timer logic to wait for a small hesitation in their
>> typing.
>>
>>
>>
>> -Original Message-
>> From: Jack Bates [mailto:ms...@freezone.co.uk]
>> Sent: Tuesday, July 28, 2009 10:54 AM
>> To: solr-user@lucene.apache.org
>> Subject: search suggest
>>
>> how can i use solr to make search suggestions? i'm thinking google-style
>> suggestions, which suggests more refined queries - vs. freebase-style
>> suggestions, which suggests top hits.
>>
>> i've been looking at the query params,
>> http://wiki.apache.org/solr/StandardRequestHandler
>>
>> - and searching for "solr suggest" - but haven't figured out how to get
>> search suggestions from solr
>>
>


Re: search suggest

2009-07-29 Thread Jason Rutherglen
Here's a good article on Ternary Trees: http://www.ddj.com/windows/184410528

I looked at the one in Lucene; I don't understand why the find method
only returns a char/int.

On Wed, Jul 29, 2009 at 2:33 PM, Robert Petersen wrote:
> Simple minded autosuggest can just not tokenize the phrases at all and
> so the wildcards just complete whatever the user has typed so far
> including spaces.  Upon encountering a space though, autosuggest should
> wait to make more suggestions until the user has typed at least a couple
> of letters of the next word.  That is the way I did it last time using a
> different search engine.  It'd sure be kewl if this became a core
> feature of solr!
>
> I like the idea of the tree approach, sounds much faster.  The root is
> the least letters to start suggestions and the leaves are the full
> phrases?
>
> -Original Message-
> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
> Sent: Wednesday, July 29, 2009 12:09 PM
> To: solr-user@lucene.apache.org
> Subject: Re: search suggest
>
> Autosuggest is something that would be very useful to build into
> Solr as many search projects require it.
>
> I'd recommend indexing relevant terms/phrases into a Ternary
> Search Tree which is compact and performant. Using a wildcard
> query will likely not be as fast as a Ternary Tree, and I'm not
> sure how phrases would be handled?
>
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysi
>
> It would be good to separate out the TernaryTree from
> analysis/compound and into Lucene core, or into it's own contrib.
>
> Also see http://issues.apache.org/jira/browse/LUCENE-625 which
> improves relevancy using click through rates.
>
> I'll open an issue in Solr to get this one going.
>
> On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen
> wrote:
>> To do a proper search suggest feature you have to index all the
> queries
>> your system gets and search it with wildcards for matches on what the
>> user has typed so far for each user keystroke in the search box...
>> Usually with some timer logic to wait for a small hesitation in their
>> typing.
>>
>>
>>
>> -Original Message-
>> From: Jack Bates [mailto:ms...@freezone.co.uk]
>> Sent: Tuesday, July 28, 2009 10:54 AM
>> To: solr-user@lucene.apache.org
>> Subject: search suggest
>>
>> how can i use solr to make search suggestions? i'm thinking
> google-style
>> suggestions, which suggests more refined queries - vs. freebase-style
>> suggestions, which suggests top hits.
>>
>> i've been looking at the query params,
>> http://wiki.apache.org/solr/StandardRequestHandler
>>
>> - and searching for "solr suggest" - but haven't figured out how to
> get
>> search suggestions from solr
>>
>


Re: search suggest

2009-07-29 Thread Jason Rutherglen
I created an issue and have added some notes
https://issues.apache.org/jira/browse/SOLR-1316

On Wed, Jul 29, 2009 at 3:15 PM, Jason
Rutherglen wrote:
> Here's a good article on Ternary Trees: http://www.ddj.com/windows/184410528
>
> I looked at the one in Lucene, I don't understand why the find method
> only returns a char/int?
>
> On Wed, Jul 29, 2009 at 2:33 PM, Robert Petersen wrote:
>> Simple minded autosuggest can just not tokenize the phrases at all and
>> so the wildcards just complete whatever the user has typed so far
>> including spaces.  Upon encountering a space though, autosuggest should
>> wait to make more suggestions until the user has typed at least a couple
>> of letters of the next word.  That is the way I did it last time using a
>> different search engine.  It'd sure be kewl if this became a core
>> feature of solr!
>>
>> I like the idea of the tree approach, sounds much faster.  The root is
>> the least letters to start suggestions and the leaves are the full
>> phrases?
>>
>> -Original Message-
>> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
>> Sent: Wednesday, July 29, 2009 12:09 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: search suggest
>>
>> Autosuggest is something that would be very useful to build into
>> Solr as many search projects require it.
>>
>> I'd recommend indexing relevant terms/phrases into a Ternary
>> Search Tree which is compact and performant. Using a wildcard
>> query will likely not be as fast as a Ternary Tree, and I'm not
>> sure how phrases would be handled?
>>
>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysi
>>
>> It would be good to separate out the TernaryTree from
>> analysis/compound and into Lucene core, or into it's own contrib.
>>
>> Also see http://issues.apache.org/jira/browse/LUCENE-625 which
>> improves relevancy using click through rates.
>>
>> I'll open an issue in Solr to get this one going.
>>
>> On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen
>> wrote:
>>> To do a proper search suggest feature you have to index all the
>> queries
>>> your system gets and search it with wildcards for matches on what the
>>> user has typed so far for each user keystroke in the search box...
>>> Usually with some timer logic to wait for a small hesitation in their
>>> typing.
>>>
>>>
>>>
>>> -Original Message-
>>> From: Jack Bates [mailto:ms...@freezone.co.uk]
>>> Sent: Tuesday, July 28, 2009 10:54 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: search suggest
>>>
>>> how can i use solr to make search suggestions? i'm thinking
>> google-style
>>> suggestions, which suggests more refined queries - vs. freebase-style
>>> suggestions, which suggests top hits.
>>>
>>> i've been looking at the query params,
>>> http://wiki.apache.org/solr/StandardRequestHandler
>>>
>>> - and searching for "solr suggest" - but haven't figured out how to
>> get
>>> search suggestions from solr
>>>
>>
>


Re: NativeFSLockFactory, ConcurrentMergeScheduler: why locks?

2009-08-11 Thread Jason Rutherglen
Fuad,

The lock indicates to external processes that the index is in use; it
does not cause ConcurrentMergeScheduler to block.

ConcurrentMergeScheduler does merge in its own thread; however, if the
merges are large they can spike IO and CPU and make the machine
somewhat unresponsive.

What is the size of your index (in docs and GB)? How many
deletes are you performing? There are a few possible solutions
to these problems if you're able to separate the updating from
the searching onto different servers.

-J

On Tue, Aug 11, 2009 at 10:08 AM, Fuad Efendi wrote:
> 1.       I always have files lucene--write.lock and
> lucene--n-write.lock which I believe shouldn't be used with
> NativeFSLockFactory
>
> 2.       I use mergeFactor=100 and ramBufferSizeMB=256, with an index size of a few GB. I
> tried mergeFactor=10 and mergeFactor=1000.
>
>
>
>
>
> It seems ConcurrentMergeScheduler locks everything instead of using separate
> thread on background...
>
>
>
>
>
> So my configured system spends half an hour UPDATING a million documents
> (probably already existing in the index), then it stops and waits a few
> hours for an index merge, which is extremely slow (a lot of deletes?)
>
>
>
> With mergeFactor=1000 I had extremely performant index updates (50,000,000 the
> first day), and then I waited more than 2 days for the merge to complete (and
> was forced to kill the process).
>
>
>
> Why it locks everything?
>
>
>
> Thanks,
>
> Fuad
>
>
>
>


Re: NativeFSLockFactory, ConcurrentMergeScheduler: why locks?

2009-08-11 Thread Jason Rutherglen
> 1 minute of document updates (about 100,000 documents) and then SOLR stops

100,000 docs in a minute is a lot. Lucene is probably
automatically flushing to disk and merging which is tying up the
IO subsystem. You may want to set the ConcurrentMergeScheduler
to 1 thread (which in Solr cannot be done and requires a custom
class, currently). This will minimize the number of threads
trying to merge at once, and may allow the merges to occur more
quickly (as the sequential read/writes will have longer to
perform, otherwise they could be interrupted by other merges,
causing excessive HD head movement).
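
For what it's worth, the kind of custom class I'm referring to would be about
this small (a sketch only, assuming ConcurrentMergeScheduler's
setMaxThreadCount setter; how it gets wired into Solr's IndexWriter
configuration depends on the version you're running):

import org.apache.lucene.index.ConcurrentMergeScheduler;

// Limit background merges to a single thread so concurrent merges don't
// compete for the same disk heads.
public class SingleThreadMergeScheduler extends ConcurrentMergeScheduler {
    public SingleThreadMergeScheduler() {
        setMaxThreadCount(1);
    }
}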

I'd look at using SSDs, however I am aware that business folks
typically are not fond of them!

> instead of implementing specific document handler

Implementing a custom handler is probably unnecessary?

> I am suspecting "delete" is main bottleneck for Lucene

How many deletes are you performing in a minute? Or is it
100,000? (Meaning the update above is an update call to Solr,
not an add). 100,000 deletes is a lot as well.

Based on what you've said Fuad, I'd add documents, queue up
deletes to a separate file (i.e. not in Solr/Lucene), then later
on, send the deletes to Solr just prior to committing. This will allow
Lucene to focus on indexing only, create new segments etc, then
apply deletes later only when the segments are somewhat stable
(i.e. not being merged at a rapid pace).
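
Roughly, in SolrJ terms, that flow would look something like this sketch (the
URL, field names, and expiry check are placeholders, not your actual setup):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedDeletes {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<String> pendingDeletes = new ArrayList<String>();

        // Indexing loop: keep adding documents, but only queue the deletes.
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("timestamp", System.currentTimeMillis());
            server.add(doc);

            if (isExpired(i)) {                 // hypothetical expiry check
                pendingDeletes.add("doc-" + i);
            }
        }

        // Apply the queued deletes only once, just before the commit.
        if (!pendingDeletes.isEmpty()) {
            server.deleteById(pendingDeletes);
        }
        server.commit();
    }

    private static boolean isExpired(int i) {
        return i % 1000 == 0; // placeholder logic
    }
}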

Feel free to post some more numbers.

On Tue, Aug 11, 2009 at 2:07 PM, Fuad Efendi wrote:
> Hi Jason,
>
> I am using Master/Slave (two servers);
> I monitored for a few hours today - 1 minute of document updates (about 100,000
> documents) and then SOLR stops for at least 5 minutes to do background jobs
> like RAM flush, segment merge...
>
> Documents are small; about 10Gb of total index size for 50,000,000
> documents.
>
> I am suspecting "delete" is main bottleneck for Lucene since it marks
> documents for deletion and then it needs to optimize inverted indexes (in
> fact, to optimize)...
>
>
> I run "update" queries to update documents, I have timestamp field and in
> many cases I need to update timestamp only of existing document (specific
> process periodically deletes expired documents, once a week) - but I am
> still using out-of-the-box /update instead of implementing specific document
> handler.
>
> I can run it in a batch - for instance, collecting of millions of documents
> somewhere and removing duplicates before sending to SOLR - but I prefer to
> update document several times during a day - it's faster (although I
> encountered a problem...)
>
>
> Thanks,
> Fuad
>
>
>
> -Original Message-
> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
> Sent: August-11-09 4:45 PM
> To: solr-user@lucene.apache.org
> Subject: Re: NativeFSLockFactory, ConcurrentMergeScheduler: why locks?
>
> Fuad,
>
> The lock indicates to external processes the index is in use, meaning
> it's not cause ConcurrentMergeScheduler to block.
>
> ConcurrentMergeScheduler does merge in it's own thread, however
> if the merges are large then they can spike IO, CPU, and cause
> the machine to be somewhat unresponsive.
>
> What is the size of your index (in docs and GB)? How many
> deletes are you performing? There are a few possible solutions
> to these problems if you're able to separate athe updating from
> the searching onto different servers.
>
> -J
>
> On Tue, Aug 11, 2009 at 10:08 AM, Fuad Efendi wrote:
>> 1.       I always have files lucene--write.lock and
>> lucene--n-write.lock which I believe shouldn't be used with
>> NativeFSLockFactory
>>
>> 2.       I use mergeFactor=100 and ramBufferSizeMB=256, few GB indes size.
> I
>> tried mergeFactor=10 and mergeFactor=1000.
>>
>>
>>
>>
>>
>> It seems ConcurrentMergeScheduler locks everything instead of using
> separate
>> thread on background...
>>
>>
>>
>>
>>
>> So that my configured system spents half an hour to UPDATE (probably
>> existing in the index) million of documents, then it stops and waits few
>> hours for index merge which is extremely slow (a lot of deletes?)
>>
>>
>>
>> With mergeFactor=1000 I had extremely performant index updates (50,000,000
> a
>> first day), and then I was waiting more than 2 days when merge complete
> (and
>> was forced to kill process).
>>
>>
>>
>> Why it locks everything?
>>
>>
>>
>> Thanks,
>>
>> Fuad
>>
>>
>>
>>
>
>
>


Query with no cache without editing solrconfig?

2009-08-12 Thread Jason Rutherglen
Is there a way to do this via a URL?


Re: Solr support for Lucene Near realtime search

2009-08-12 Thread Jason Rutherglen
Hi Alan,

Solr 1.4 does not contain near-realtime search capabilities, and it
can be detrimental to call commit too often, as indexing and search
performance can degrade sharply. That being said, most of the NRT
functionality is not too difficult to add, except for per-segment
caching (SOLR-1308).

What use case are you trying to solve?

-J

On Wed, Aug 12, 2009 at 2:13 PM, fansnap_alan wrote:
>
> Hi,
>
> Does the latest version of Solr 1.4 dev (including DIH) take advantage of
> Lucene's Near Realtime Search features?  I've read several past postings
> about providing near-real time search using a small and large index and was
> wondering if that will still be necessary when Solr 1.4 releases.
>
> Thanks
>
> Alan
> a...@fansnap.com
> --
> View this message in context: 
> http://www.nabble.com/Solr-support-for-Lucene-Near-realtime-search-tp24943422p24943422.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: facet performance tips

2009-08-12 Thread Jason Rutherglen
For your fields with many terms you may want to try Bobo
http://code.google.com/p/bobo-browse/ which could work well with your
case.

On Wed, Aug 12, 2009 at 12:02 PM, Fuad Efendi wrote:
> I am currently faceting on tokenized multi-valued field at
> http://www.tokenizer.org (25 mlns simple docs)
>
> It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and
> non-synchronized cache (similar to LingPipe's FastCache, SOLR-665, SOLR-667)
>
> Average "faceting" on query results: 0.2 - 0.3 seconds; without those
> patches - 20-50 seconds.
>
> I am going to upgrade to SOLR-1.4 from trunk (with SOLR-475 & SOLR-667) and
> to compare results...
>
>
>
>
> P.S.
> Avoid faceting on a field with heavy distribution of terms (such as few
> millions of terms in my case); It won't work in SOLR 1.3.
>
> TIP: use non-tokenized single-valued field for faceting, such as
> non-tokenized "country" field.
>
>
>
> P.P.S.
> Would be nice to load/stress
> http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html against
> putting CPU in a spin loop ConcurrentHashMap.
>
>
>
> -Original Message-
> From: Erik Hatcher [mailto:ehatc...@apache.org]
> Sent: August-12-09 2:12 PM
> To: solr-user@lucene.apache.org
> Subject: Re: facet performance tips
>
> Yes, increasing the filterCache size will help with Solr 1.3
> performance.
>
> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
> performance.
>
>        Erik
>
> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>
>> Hi everyone,
>>
>>  I'm using some faceting on a solr index containing ~ 160K documents.
>> I perform facets on multivalued string fields. The number of possible
>> different values is quite large.
>>
>> Enabling facets degrades the performance by a factor 3.
>>
>> Because I'm using solr 1.3, I guess the facetting makes use of the
>> filter cache to work. My filterCache is set
>> to a size of 2048. I also noticed in my solr stats a very small ratio
>> of cache hit (~ 0.01%).
>>
>> Can it be the reason why the faceting is slow? Does it make sense to
>> increase the filterCache size so it matches more or less the number
>> of different possible values for the faceted fields? Would that not
>> make the memory usage explode?
>>
>> Thanks for your help !
>>
>> --
>> Jerome Eteve.
>>
>> Chat with me live at http://www.eteve.net
>>
>> jer...@eteve.net
>
>
>
>


Distributed query returns time consumed by each Solr shard?

2009-08-12 Thread Jason Rutherglen
Is there a way to do this currently? If a shard takes an
inordinate amount of time compared to the other shards, it's useful
to see the various qtimes per shard, with the aggregated results.


Re: facet performance tips

2009-08-13 Thread Jason Rutherglen
Yeah, we need a performance comparison; I haven't had time to put
one together. If/when I do, I'll compare Bobo's performance against
Solr's bitset-intersection-based facets and compare memory
consumption.

For near realtime Solr needs to cache and merge bitsets at the
SegmentReader level, and Bobo needs to be upgraded to work with
Lucene 2.9's searching at the segment level (currently it uses a
MultiSearcher).
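
To spell out the bitset part: the idea is to cache one filter bitset per
segment and stitch them together using each segment's doc-id offset, so a
reopen only has to rebuild the bitsets for new segments.  A plain-Java sketch
of the stitching (java.util.BitSet here, not Lucene's classes):

import java.util.BitSet;
import java.util.List;

public class PerSegmentFilterMerge {
    public static class SegmentBits {
        final BitSet bits;  // cached filter result for one segment
        final int docBase;  // offset of this segment's doc ids in the whole index
        final int maxDoc;   // number of docs in this segment
        public SegmentBits(BitSet bits, int docBase, int maxDoc) {
            this.bits = bits; this.docBase = docBase; this.maxDoc = maxDoc;
        }
    }

    // Combine cached per-segment bitsets into one index-wide bitset.
    public static BitSet mergeToTopLevel(List<SegmentBits> segments, int totalDocs) {
        BitSet topLevel = new BitSet(totalDocs);
        for (SegmentBits seg : segments) {
            for (int doc = seg.bits.nextSetBit(0); doc >= 0 && doc < seg.maxDoc;
                 doc = seg.bits.nextSetBit(doc + 1)) {
                topLevel.set(seg.docBase + doc);
            }
        }
        return topLevel;
    }
}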

Distributed search on either should be fairly straightforward?

On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote:
> It seems BOBO-Browse is alternate faceting engine; would be interesting to
> compare performance with SOLR... Distributed?
>
>
> -Original Message-
> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
> Sent: August-12-09 6:12 PM
> To: solr-user@lucene.apache.org
> Subject: Re: facet performance tips
>
> For your fields with many terms you may want to try Bobo
> http://code.google.com/p/bobo-browse/ which could work well with your
> case.
>
>
>
>
>

