Re: [ANN] Heliosearch 0.06 released, native code faceting

2014-06-19 Thread Andy
Congrats! Any idea when native faceting & off-heap fieldcache will be available 
for multivalued fields? Most of my fields are multivalued, so that's the big one 
for me.

Andy


On Thursday, June 19, 2014 3:46 PM, Yonik Seeley  wrote:
 


FYI, for those who want to try out the new native code faceting, this
is the first release containing it (for single-valued string fields
only, as of yet).

http://heliosearch.org/download/

Heliosearch v0.06

Features:
o  Heliosearch v0.06 is based on (and contains all features of)
Lucene/Solr 4.9.0
o  Native code faceting for single valued string fields.
    - Written in C++, statically compiled with gcc for Windows, Mac OS-X, Linux
    - static compilation avoids JVM hotspot warmup period,
mis-compilation bugs, and variations between runs
    - Improves performance by over 2x
o  Top level Off-heap fieldcache for single valued string fields in nCache.
    - Improves sorting and faceting speed
    - Reduces garbage collection overhead
    - Eliminates FieldCache “insanity” that exists in Apache Solr from
faceting and sorting on the same field
o  Full request parameter substitution / macro expansion, including
default value support (see the example below).
o  frange query now only returns documents with a value.
     For example, in Apache Solr, {!frange l=-1 u=1 v=myfield} will
also return documents without a value since the numeric default value
of 0 lies within the range requested.
o  New JSON features via Noggit upgrade, allowing optional comments
(C/C++ and shell style), unquoted keys, and relaxed escaping that
allows one to backslash escape any character.
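
For example, macro expansion lets one request parameter reference another,
with an optional default value (a sketch of the syntax; exact semantics per
the Heliosearch docs):

http://localhost:8983/solr/select?q={!frange l=${low:0} u=${high:10}}popularity&high=100

Here ${high:10} expands to 100 because a high parameter was supplied, while
${low:0} falls back to its default of 0.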


-Yonik
http://heliosearch.org - native code faceting, facet functions,
sub-facets, off-heap data

Re: Facets with 5000 facet fields

2013-03-21 Thread Andy
What do I need to do to use this new per segment faceting method?



 From: Mark Miller 
To: solr-user@lucene.apache.org 
Sent: Wednesday, March 20, 2013 1:09 PM
Subject: Re: Facets with 5000 facet fields
 

On Mar 20, 2013, at 11:29 AM, Chris Hostetter  wrote:

> Not true ... per segment FieldCache support is available in solr 
> faceting, you just have to specify facet.method=fcs (FieldCache per 
> Segment)

Also, if you use docvalues in 4.2, Robert tells me it uses a new per-segment 
faceting method that may have better NRT characteristics than fcs. I have 
not played with it yet but hope to soon.

- Mark

Re: Facets with 5000 facet fields

2013-03-21 Thread Andy
But if I just add facet.method=fcs, wouldn't I just get fcs? Mark said this new 
method based on docvalues is better than fcs, so wouldn't I need to do 
something other than specifying fcs to enable this new method?




 From: Upayavira 
To: solr-user@lucene.apache.org 
Sent: Thursday, March 21, 2013 9:04 AM
Subject: Re: Facets with 5000 facet fields
 
as was said below, add facet.method=fcs to your query URL.
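
For example (a sketch; the field name is made up):

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&facet.method=fcs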

Upayavira

On Thu, Mar 21, 2013, at 09:41 AM, Andy wrote:
> What do I need to do to use this new per segment faceting method?
> 
> 
> 
>  From: Mark Miller 
> To: solr-user@lucene.apache.org 
> Sent: Wednesday, March 20, 2013 1:09 PM
> Subject: Re: Facets with 5000 facet fields
>  
> 
> On Mar 20, 2013, at 11:29 AM, Chris Hostetter 
> wrote:
> 
> > Not true ... per segment FieldCache support is available in solr 
> > faceting, you just have to specify facet.method=fcs (FieldCache per 
> > Segment)
> 
> Also, if you use docvalues in 4.2, Robert tells me it uses a new per-segment
> faceting method that may have better NRT characteristics than
> fcs. I have not played with it yet but hope to soon.
> 
> - Mark

Re: [blogpost] Memory is overrated, use SSDs

2013-06-06 Thread Andy
This is very interesting. Thanks for sharing the benchmark.

One question I have is did you precondition the SSD ( 
http://www.sandforce.com/userfiles/file/downloads/FMS2009_F2A_Smith.pdf )? SSD 
performance tends to take a very deep dive once all blocks are written at least 
once and the garbage collector kicks in. 



 From: Toke Eskildsen 
To: "solr-user@lucene.apache.org"  
Sent: Thursday, June 6, 2013 7:11 PM
Subject: [blogpost] Memory is overrated, use SSDs
 

Inspired by multiple Solr mailing list entries during the last month or two, I 
did some search performance testing on our 11M documents / 49GB index using 
logged queries on Solr 4 with MMapDirectory. It turns out that our setup with 
Solid State Drives and 8GB of RAM (which leaves 5GB for disk cache) performs 
nearly as well as having the whole index in disk cache; the SSD solution 
delivering ~425 q/s for non-faceted searches and the memory solution delivering 
~475 q/s (roughly estimated from the graphs, sorry). Going full memory cache 
certainly is faster if we ignore warmup, but those last queries/second are 
quite expensive.

http://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/

Regards,
Toke Eskildsen, State and University Library, Denmark

Nested documents

2011-09-10 Thread Andy
Hi,

Does Solr support nested documents? If not is there any plan to add such a 
feature?

Thanks.

Re: Continuous update on progress of "New SolrCloud Design" work

2011-12-05 Thread Andy
Hi,

> add features corresponding to stuff that we used to use in ElasticSearch

Does that mean you have used ElasticSearch but decided to try SolrCloud instead?

I'm also looking at a distributed solution. ElasticSearch just seems much 
further along than SolrCloud. So I'd be interested to hear about any particular 
reasons you decided to pick SolrCloud instead of ElasticSearch.

Andy




 From: Per Steffensen 
To: solr-user@lucene.apache.org 
Sent: Monday, December 5, 2011 6:23 AM
Subject: Continuous update on progress of "New SolrCloud Design" work
 
Hi

My guess is that the work for achieving 
http://wiki.apache.org/solr/NewSolrCloudDesign has begun on branch "solrcloud". 
It is hard to follow what is going on and how to use what has been achieved - 
you cannot follow the examples on http://wiki.apache.org/solr/SolrCloud anymore 
(e.g. there is no shard="shard1" in solr/example/solr/solr.xml anymore). Would it 
be possible to maintain a how-to-use section on 
http://wiki.apache.org/solr/NewSolrCloudDesign, with examples like the ones on 
http://wiki.apache.org/solr/SolrCloud, that "at any time" reflects how to use 
what is on the HEAD of the "solrcloud" branch?

In my project we are about to start using something other than ElasticSearch, 
and SolrCloud is an option, but there is a lot to be done in Solr(Cloud) before 
it is even comparable with ElasticSearch wrt features. If we choose to go for 
SolrCloud we would like to participate in the development of the new SolrCloud, 
and add features corresponding to stuff that we used to use in ElasticSearch, 
but it is very hard to contribute to SolrCloud if the work going on on branch 
"solrcloud" (work that only a few persons know about, getting us from 
http://wiki.apache.org/solr/SolrCloud to 
http://wiki.apache.org/solr/NewSolrCloudDesign) is a "black box".

Regards, Per Steffensen

Non-prefix, hierarchical autocomplete? Would SOLR-1316 work? Solritas?

2010-06-19 Thread Andy
Hi,

I've seen some posts on using SOLR-1316 or Solritas for autocomplete. Wondered 
what is the best solution for my use case:

1) I would like to have an "hierarchical" autocomplete. For example, I have a 
"Country" dropdown list and a "City" textbox. A user would select a country 
from the dropdown list, and then type out the City in the textbox. Based on 
which country he selected, I want to limit the autocomplete suggestions to 
cities that are relevant for the selected country.

This hierarchy could be multi-level. For example, there may be a "Neighborhood" 
textbox. The autocomplete suggestions for "Neighborhood" would be limited to 
neighborhoods that are relevant for the city entered by the user in the "City" 
textbox.

2) I want to have autocomplete suggestions that include non-prefix matches. 
For example, if the user types "auto", the autocomplete suggestions should 
include terms such as "automata" and "build automation".

3) I'm doing autocomplete for tags. I would like to allow multi-word tags and 
use comma (",") as a separator for tags. So when the user hits the space bar, he 
is still typing out the same tag, but when he hits the comma key, he's starting 
a new tag.

Would SOLR-1316 or Solritas work for the above requirements? If they do how do 
I set it up? I can't really find much documentation on SOLR-1316 or Solritas in 
this area.

Thanks.


  


Re: Non-prefix, hierarchical autocomplete? Would SOLR-1316 work? Solritas?

2010-06-19 Thread Andy
Forgot to add, I would like to order the autocomplete suggestions for 
tags/cities based on how many times they are present in the documents.

--- On Sat, 6/19/10, Andy  wrote:

> From: Andy 
> Subject: Non-prefix, hierarchical autocomplete? Would SOLR-1316 work? 
> Solritas?
> To: solr-user@lucene.apache.org
> Date: Saturday, June 19, 2010, 3:28 AM
> Hi,
> 
> I've seen some posts on using SOLR-1316 or Solritas for
> autocomplete. Wondered what is the best solution for my use
> case:
> 
> 1) I would like to have an "hierarchical" autocomplete. For
> example, I have a "Country" dropdown list and a "City"
> textbox. A user would select a country from the dropdown
> list, and then type out the City in the textbox. Based on
> which country he selected, I want to limit the autocomplete
> suggestions to cities that are relevant for the selected
> country.
> 
> This hierarchy could be multi-level. For example, there may
> be a "Neighborhood" textbox. The autocomplete suggestions
> for "Neighborhood" would be limited to neighborhoods that
> are relevant for the city entered by the user in the "City"
> textbox.
> 
> 2) I want to have autocomplete suggestions that include
> non-prefix matches. For example, if the user types "auto",
> the autocomplete suggestions should include terms such as
> "automata" and "build automation".
> 
> 3) I'm doing autocomplete for tags. I would like to allow
> multi-word tags and use comma (",") as a separator for tags.
> So when the user hits the space bar, he is still typing out
> the same tag, but when he hits the comma key, he's starting
> a new tag.
> 
> Would SOLR-1316 or Solritas work for the above
> requirements? If they do how do I set it up? I can't really
> find much documentation on SOLR-1316 or Solritas in this
> area.
> 
> Thanks.
> 
> 
>       
> 





Re: Chinese chars are not indexed ?

2010-06-28 Thread Andy
What if Chinese is mixed with English?

I have text that is entered by users and it could be a mix of Chinese, English, 
etc.

What's the best way to handle that?

Thanks.

--- On Mon, 6/28/10, Ahmet Arslan  wrote:

> From: Ahmet Arslan 
> Subject: Re: Chinese chars are not indexed ?
> To: solr-user@lucene.apache.org
> Date: Monday, June 28, 2010, 3:44 AM
> > oh yes, *...* works. thanks.
> > 
> > I saw tokenizer is defined in schema.xml. There are a
> few
> > places that define the tokenizer. Wondering if it is
> enough
> > to define one for:
> 
> It is better to define a brand new field type specific to
> Chinese. 
> 
> http://wiki.apache.org/solr/LanguageAnalysis?highlight=%28CJKtokenizer%29#Chinese.2C_Japanese.2C_Korean
> 
> Something like:
> 
> at index time:
> 
> 
> 
> at query time:
> 
> 
> 
> 
> 
> 
>       
> 





solr single threaded?

2010-08-08 Thread Andy
I read that Lucene search is single threaded. Does that mean Solr search is 
also single threaded?

What does it mean - that there are no concurrent searches & all searches are 
serialized? Can Solr take advantage of multiple CPUs?

Thanks.


  


Re: solr single threaded?

2010-08-09 Thread Andy
Otis,

Thanks. In that case what does it mean that "Lucene search is single threaded"? 
How is that different from the Solr behavior?

Andy

--- On Mon, 8/9/10, Otis Gospodnetic  wrote:

> From: Otis Gospodnetic 
> Subject: Re: solr single threaded?
> To: solr-user@lucene.apache.org
> Date: Monday, August 9, 2010, 1:28 AM
> Andy,
> 
> Short answer: No, Solr will use multiple CPU cores and/or
> multiple CPUs if they 
> are present.
> 
> A single non-distributed search request runs in a single
> thread, but Solr (and 
> the servlet container you put it in) can handle a number of
> such threads in 
> parallel.
> 
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> - Original Message 
> > From: Andy 
> > To: solr-user@lucene.apache.org
> > Sent: Mon, August 9, 2010 1:19:58 AM
> > Subject: solr single threaded?
> > 
> > I read that Lucene search is single threaded. Does
> that mean Solr search is 
> >also  single threaded?
> > 
> > What does it mean - that there are no concurrent 
> searches & all searches are 
> >serialized? Can Solr take advantages of multiple 
> CPUs?
> > 
> > Thanks.
> > 
> > 
> >       
> > 
> 





Possible to have more than 1 uniqueKey fields in a document?

2010-08-21 Thread Andy
Is it possible to define more than 1 uniqueKey fields per document in 
schema.xml?


  


Re: Possible to have more than 1 uniqueKey fields in a document?

2010-08-21 Thread Andy
I'm still a bit confused. Can I define 2 "uniqueKey" fields in schema.xml?

I want to use 2 outside apps. One defines a uniqueKey that is a mix of letters 
and numbers. Another app requires a uniqueKey of the type long.

Obviously the 2 requirements aren't compatible. I'm trying to see if it's 
possible to define 2 uniqueKeys so each app could have its own one.

--- On Sat, 8/21/10, Lance Norskog  wrote:

> From: Lance Norskog 
> Subject: Re: Possible to have more than 1 uniqueKey fields in a document?
> To: solr-user@lucene.apache.org
> Date: Saturday, August 21, 2010, 5:23 PM
> There can be as many as you want. But
> you can only specify one as "the
> uniqueKey". That is used for Distributed Search and
> deduplication.
> 
> Indexing might work better if you concatenate the different
> unique
> values into one field.
> 
> On Sat, Aug 21, 2010 at 3:27 AM, Andy 
> wrote:
> > Is it possible to define more than 1 uniqueKey fields
> per document in schema.xml?
> >
> >
> >
> >
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 
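
For reference, a schema sketch of what Lance describes (field names are made
up): one designated uniqueKey plus an ordinary second identifier field:

<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="numeric_id" type="long" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>

Only the field named in <uniqueKey> gets overwrite/deduplication semantics;
the second identifier is just a normal indexed field each app can query on.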


  


Removing expired documents from Solr index

2010-08-24 Thread Andy
My documents have an "expiration_datetime" field that holds the expiration 
datetime of the document.

I use a filter query to exclude expired documents from my query results.

Is it a good idea to periodically go through the index and remove expired 
documents from it? If so what is the best way to do that? Any example code 
would be appreciated.
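
For reference, one common pattern is a periodic delete-by-query posted to the
/update handler, followed by a commit (a sketch, assuming expiration_datetime
is a standard Solr date field):

<delete><query>expiration_datetime:[* TO NOW]</query></delete>
<commit/>

That removes every document whose expiration time has already passed, so the
filter query only has to exclude documents that expired since the last
cleanup run.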


  


Can I use an ExternalFileField as an input to a boost query?

2010-08-27 Thread Andy
I have a field "popularity" that is changing frequently. So I'd like to put it 
in an ExternalFileField.

If I do that, can I still use "popularity" in a boosted query such as:

{!boost b=log(popularity)}foo

Thanks.


  


ExternalFileField best practices

2010-08-28 Thread Andy
I'm interested in using ExternalFileField to store a field "popularity" that is 
being updated frequently.

However ExternalFileField seems to be a pretty obscure feature. Have a few 
questions:

1) Can anyone share your experience using it? 

2) What is the most efficient way to update the external file?
For example, the file could look like:

1=12    // the document with uniqueKey 1 has a popularity of 12
2=4
3=45
5=78

Now the popularity of document 1 is updated to 13:
 
- What is the best way to update the file to reflect the change? Isn't this an 
O(n) operation?
- How to deal with concurrent updates to the file by multiple threads?

Would this method of using an external file scale?

Thanks.


  


Re: ExternalFileField best practices

2010-08-28 Thread Andy
Lance,

Thanks for the response.

Can I use an ExternalFileField as an input to a boost query?

For example, if I put the field "popularity" in an ExternalFileField, can I 
still use "popularity" in a boosted query such as:

{!boost b=log(popularity)}foo

The doc says ExternalFileField can only be used in FunctionQuery. Does that 
include a boost query like {!boost b=log(popularity)}?


--- On Sat, 8/28/10, Lance Norskog  wrote:

> From: Lance Norskog 
> Subject: Re: ExternalFileField best practices
> To: solr-user@lucene.apache.org
> Date: Saturday, August 28, 2010, 5:16 PM
> The file is completely reloaded when
> you commit or optimize. There is
> no incremental update available. And, yes, this could be a
> scaling
> problem.
> 
> How you update it is completely external to Solr.
> 
> On Sat, Aug 28, 2010 at 2:50 AM, Andy 
> wrote:
> > I'm interested in using ExternalFileField to store a
> field "popularity" that is being updated frequently.
> >
> > However ExternalFileField seems to be a pretty obscure
> feature. Have a few questions:
> >
> > 1) Can anyone share your experience using it?
> >
> > 2) What is the most efficient way to update the
> external file?
> > For example, the file could look like:
> >
> > 1=12      // the document with uniqueKey 1 has a
> popularity of 12//
> > 2=4
> > 3=45
> > 5=78
> >
> > Now the popularity of document 1 is updated to 13:
> >
> > - What is the best way to update the file to reflect
> the change? Isn't this an O(n) operation?
> > - How to deal with concurrent updates to the file by
> multiple threads?
> >
> > Would this method of using an external file scale?
> >
> > Thanks.
> >
> >
> >
> >
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 





Re: ExternalFileField best practices

2010-08-28 Thread Andy
But isn't it the case that bf adds the boost value while {!boost} multiplies the 
boost value? In my case I think a multiplication is more appropriate.

So there's no way to use ExternalFileField in {!boost}?
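
For reference, the declaration for such a field looks roughly like the stock
solrconfig example (a sketch; the attribute values here are assumptions for
this use case):

<fieldType name="file" keyField="id" defVal="0" stored="false"
           indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
<field name="popularity" type="file"/>

The values themselves live in a file named external_popularity in the index
data directory, in the key=value format shown earlier in the thread.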

--- On Sat, 8/28/10, Lance Norskog  wrote:

> From: Lance Norskog 
> Subject: Re: ExternalFileField best practices
> To: solr-user@lucene.apache.org
> Date: Saturday, August 28, 2010, 11:55 PM
> You want the boost function bf=
> parameter.
> 
> On Sat, Aug 28, 2010 at 5:32 PM, Andy 
> wrote:
> > Lance,
> >
> > Thanks for the response.
> >
> > Can I use an ExternalFileField as an input to a boost
> query?
> >
> > For example, if I put the field "popularity" in an
> ExternalFileField, can I still use "popularity" in a boosted
> query such as:
> >
> > {!boost b=log(popularity)}foo
> >
> > The doc says ExternalFileField can only be used in
> FunctionQuery. Does that include a boost query like {!boost
> b=log(popularity)}?
> >
> >
> > --- On Sat, 8/28/10, Lance Norskog 
> wrote:
> >
> >> From: Lance Norskog 
> >> Subject: Re: ExternalFileField best practices
> >> To: solr-user@lucene.apache.org
> >> Date: Saturday, August 28, 2010, 5:16 PM
> >> The file is completely reloaded when
> >> you commit or optimize. There is
> >> no incremental update available. And, yes, this
> could be a
> >> scaling
> >> problem.
> >>
> >> How you update it is completely external to Solr.
> >>
> >> On Sat, Aug 28, 2010 at 2:50 AM, Andy 
> >> wrote:
> >> > I'm interested in using ExternalFileField to
> store a
> >> field "popularity" that is being updated
> frequently.
> >> >
> >> > However ExternalFileField seems to be a
> pretty obscure
> >> feature. Have a few questions:
> >> >
> >> > 1) Can anyone share your experience using
> it?
> >> >
> >> > 2) What is the most efficient way to update
> the
> >> external file?
> >> > For example, the file could look like:
> >> >
> >> > 1=12      // the document with uniqueKey 1
> has a
> >> popularity of 12//
> >> > 2=4
> >> > 3=45
> >> > 5=78
> >> >
> >> > Now the popularity of document 1 is updated
> to 13:
> >> >
> >> > - What is the best way to update the file to
> reflect
> >> the change? Isn't this an O(n) operation?
> >> > - How to deal with concurrent updates to the
> file by
> >> multiple threads?
> >> >
> >> > Would this method of using an external file
> scale?
> >> >
> >> > Thanks.
> >> >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goks...@gmail.com
> >>
> >
> >
> >
> >
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 





Re: Tuning Solr caches with high commit rates (NRT)

2010-09-17 Thread Andy
Does Solr use Lucene NRT?

--- On Fri, 9/17/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Tuning Solr caches with high commit rates (NRT)
> To: solr-user@lucene.apache.org
> Date: Friday, September 17, 2010, 1:05 PM
> Near Real Time...
> 
> Erick
> 
> On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon wrote:
> 
> > BTW, what is NRT?
> >
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >  otherwise we all die.
> >
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> >
> >
> > --- On Fri, 9/17/10, Peter Sturge 
> wrote:
> >
> > > From: Peter Sturge 
> > > Subject: Re: Tuning Solr caches with high commit
> rates (NRT)
> > > To: solr-user@lucene.apache.org
> > > Date: Friday, September 17, 2010, 2:18 AM
> > > Hi,
> > >
> > > It's great to see such a fantastic response to
> this thread
> > > - NRT is
> > > alive and well!
> > >
> > > I'm hoping to collate this information and add it
> to the
> > > wiki when I
> > > get a few free cycles (thanks Erik for the heads
> up).
> > >
> > > In the meantime, I thought I'd add a few tidbits
> of
> > > additional
> > > information that might prove useful:
> > >
> > > 1. The first one to note is that the
> techniques/setup
> > > described in
> > > this thread don't fix the underlying potential
> for
> > > OutOfMemory errors
> > > - there can always be an index large enough to
> ask of its
> > > JVM more
> > > memory than is available for cache.
> > > These techniques, however, mitigate the risk, and
> provide
> > > an efficient
> > > balance between memory use and search
> performance.
> > > There are some interesting discussions going on
> for both
> > > Lucene and
> > > Solr regarding the '2 pounds of baloney into a 1
> pound bag'
> > > issue of
> > > unbounded caches, with a number of interesting
> strategies.
> > > One strategy that I like, but haven't found in
> discussion
> > > lists is
> > > auto-limiting cache size/warming based on
> available
> > > resources (similar
> > > to the way file system caches use free memory).
> This would
> > > allow
> > > caches to adjust to their memory environment as
> indexes
> > > grow.
> > >
> > > 2. A note regarding lockType in solrconfig.xml
> for dual
> > > Solr
> > > instances: It's best not to use 'none' as a value
> for
> > > lockType - this
> > > sets the lockType to null, and as the source
> comments note,
> > > this is a
> > > recipe for disaster, so, use 'simple' instead.
> > >
> > > 3. Chris mentioned setting maxWarmingSearchers to
> 1 as a
> > > way of
> > > minimizing the number of onDeckSearchers. This is
> a prudent
> > > move --
> > > thanks Chris for bringing this up!
> > >
> > > All the best,
> > > Peter
> > >
> > >
> > >
> > >
> > > On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich
> 
> > > wrote:
> > > > Peter Sturge,
> > > >
> > > > this was a nice hint, thanks again! If you
> are here in
> > > Germany anytime I
> > > > can invite you to a beer or an apfelschorle
> ! :-)
> > > > I only needed to change the lockType to none
> in the
> > > solrconfig.xml,
> > > > disable the replication and set the data dir
> to the
> > > master data dir!
> > > >
> > > > Regards,
> > > > Peter Karich.
> > > >
> > > >> Hi Peter,
> > > >>
> > > >> this scenario would be really great for
> us - I
> > > didn't know that this is
> > > >> possible and works, so: thanks!
> > > >> At the moment we are doing similar with
> > > replicating to the readonly
> > > >> instance but
> > > >> the replication is somewhat lengthy and
> > > resource-intensive at this
> > > >> datavolume ;-)
> > > >>
> > > >> Regards,
> > > >> Peter.
> > > >>
> > > >>
> > > >>> 1. You can run multiple Solr
> instances in
> > > separate JVMs, with both
> > > >>> having their solr.xml configured to
> use the
> > > same index folder.
> > > >>> You need to be careful that one and
> only one
> > > of these instances will
> > > >>> ever update the index at a time. The
> best way
> > > to ensure this is to use
> > > >>> one for writing only,
> > > >>> and the other is read-only and never
> writes to
> > > the index. This
> > > >>> read-only instance is the one to use
> for
> > > tuning for high search
> > > >>> performance. Even though the RO
> instance
> > > doesn't write to the index,
> > > >>> it still needs periodic (albeit
> empty) commits
> > > to kick off
> > > >>> autowarming/cache refresh.
> > > >>>
> > > >>> Depending on your needs, you might
> not need to
> > > have 2 separate
> > > >>> instances. We need it because the
> 'write'
> > > instance is also doing a lot
> > > >>> of metadata pre-write operations in
> the same
> > > jvm as Solr, and so has
> > > >>> its own memory requirements.
> > > >>>
> > > >>> 2. We use sharding all the time, and
> it works
> > > just fine with this
> > > >>> scenario, as the RO instance is
> simply another
> > > shard in the pack.
> > > >>>
> > > >>>
> > > >>> On Sun, Sep 12, 2010 at 8:46 PM,
> Peter Karich
> > > 
> > > wrote:
> > > >>>
> > > >>

is indexing single-threaded?

2010-09-22 Thread Andy
Does Solr index data in a single thread or can data be indexed concurrently in 
multiple threads?

Thanks
Andy


  


Different analyzers for dfferent documents in different languages?

2010-09-22 Thread Andy
I have documents that are in different languages. There's a field in the 
documents specifying what language it's in.

Is it possible to index the documents such that based on what language a 
document is in, a different analyzer will be used on that document?

What is the "normal" way to handle documents in different languages?

Thanks
Andy


  


Re: is indexing single-threaded?

2010-09-22 Thread Andy

--- On Wed, 9/22/10, Andy  wrote:

> Does Solr index data in a single
> thread or can data be indexed concurrently in multiple
> threads?
> 

Can anyone help?


  


bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Andy
Hi,

I was going thru this LucidImagination presentation on analysis:

http://www.slideshare.net/LucidImagination/analyze-this-tips-and-tricks-on-getting-the-lucene-solr-analyzer-to-index-and-search-your-content-right

1) on p.31-33, it talks about forming bi-grams for the 32 most common terms 
during indexing. Is there an analyzer that does that?

2) on p. 34, it mentions that the default Solr configuration would turn "L'art" 
into the phrase query "L art" but it is much more efficient to turn it into a 
single token 'L art'. Which analyzer would do that?

Thanks.
Andy


  


possible to have uniqueKey to be type long?

2010-09-24 Thread Andy
I have a uniqueKey "id". I want to have id of the type long. So I changed my 
schema.xml to have:

<field name="id" type="long" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>

When I tried to index data, I got the error:

Severe errors in solr configuration.
Check your log files for more detailed information on what may be wrong.
If you want solr to continue after configuration errors, change: 
 <abortOnConfigurationError>false</abortOnConfigurationError>
in null
-
org.apache.solr.common.SolrException: QueryElevationComponent requires the 
schema to have a uniqueKeyField implemented using StrField


I read in some emails that although QueryElevationComponent complains about 
uniqueKey not being of the type string, it's OK to have uniqueKey to be int 
(and presumably long).

So I follow the error message and set
<abortOnConfigurationError>false</abortOnConfigurationError>
in my solrconfig.xml

But I'm still getting the same error when I tried to index the data. Solr still 
aborted even though I set abortOnConfigurationError to false as shown above.

Why is Solr still aborting? Is there any way to have uniqueKey of type long?

Thanks.


  


RE: bi-grams for common terms - any analyzers do that?

2010-09-24 Thread Andy

--- On Thu, 9/23/10, Burton-West, Tom  wrote:

> It also splits on whitespace which causes all CJK queries
> to be treated as phrase queries regardless of the CJK
> tokenizer you use. 

But I thought specialized analyzers like CJKAnalyzer are designed for those 
languages, which don't use whitespace to separate words. 

Isn't it up to the tokenizer, not the QueryParser, to decide how to split the 
query into tokens?

I'm really confused.

If Solr's QueryParser will only split on whitespace no matter what then what is 
the point of using CJKAnalyzer?

It sounds like Solr would be pretty useless for languages like CJK. Is there 
any work around for this? Any CJK sites using Solr?


  


questions about autocommit & committing documents

2010-09-25 Thread Andy
In the example solrconfig.xml that comes with Solr, the autocommit section:

<autoCommit> 
  <maxDocs>10000</maxDocs>
  <maxTime>1000</maxTime> 
</autoCommit>

has been commented out.

- With <autoCommit> commented out, does it mean that every new document indexed 
to Solr is being auto-committed individually? Or that they are not being 
auto-committed at all?

- If I enable <autoCommit> and set <maxDocs> to 10000, does it mean that my new 
documents won't be available for searching until 10,000 new documents have been 
added?

- When I add a new document to Solr, do I need to call commit explicitly? If 
so, how do I do that? 
I look at the Solr tutorial ( http://lucene.apache.org/solr/tutorial.html), the 
command used to index documents (java -jar post.jar solr.xml monitor.xml) 
doesn't include any explicit call to commit the documents. So I'm not sure if 
it's necessary.

Thanks






  


Re: questions about autocommit & committing documents

2010-09-26 Thread Andy
Thanks Mitch.

How do I do an explicit commit?

Andy

--- On Sun, 9/26/10, MitchK  wrote:

> From: MitchK 
> Subject: Re: questions about autocommit & committing documents
> To: solr-user@lucene.apache.org
> Date: Sunday, September 26, 2010, 4:13 AM
> 
> Hi Andy,
> 
> 
> Andy-152 wrote:
> > 
> > <autoCommit> 
> >   <maxDocs>10000</maxDocs>
> >   <maxTime>1000</maxTime> 
> > </autoCommit>
> > 
> > has been commented out.
> > 
> > - With <autoCommit> commented out, does it mean that every new document
> > indexed to Solr is being auto-committed individually? Or that they are not
> > being auto-committed at all?
> > 
> I am not sure whether there is a default value, but if
> not, commenting it out
> would mean that you have to send a commit explicitly. 
> 
> 
> 
> > - If I enable <autoCommit> and set <maxDocs> to 10000, does it mean that
> > my new documents won't be available for searching until 10,000 new
> > documents have been added?
> > 
> Yes, that's correct. However, you can do a commit
> explicitly, if you want to
> do so. 
> 
> 
> 
> > - When I add a new document to Solr, do I need to call
> commit explicitly?
> > If so, how do I do that? 
> > I look at the Solr tutorial (
> > http://lucene.apache.org/solr/tutorial.html), the
> command used to index
> > documents (java -jar post.jar solr.xml monitor.xml)
> doesn't include any
> > explicit call to commit the documents. So I'm not sure
> if it's necessary.
> > 
> > Thanks
> > 
> Committing is necessary, since an added document is not
> visible at
> query-time until there has been a commit to it. 
> 
> Kind regards,
> Mitch
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/questions-about-autocommit-committing-documents-tp1582487p1582676.html
> Sent from the Solr - User mailing list archive at
> Nabble.com.
> 
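
For reference, an explicit commit is just another update message posted to
the update handler (a sketch, assuming the tutorial's default URL):

curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'

Client libraries expose the same operation, e.g. SolrJ's server.commit().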





Multi-lingual auto-complete?

2010-09-27 Thread Andy
I want to provide auto-complete to users when they're inputting tags. The 
auto-complete tag suggestions would be based on tags that are already in the 
system.

Multiple tags are separated by commas. A single tag could contain multiple 
words such as "Apple computer".

One issue is that a tag could be in multiple languages, including both 
languages (e.g. English, French) that use whitespace as word separator and 
languages that don't (e.g. CJK)

An example of such a multi-lingual tag is "Apple 电脑".

If a user types "apple", I'd like the autocomplete suggestions to include both 
"Apple computer" (ie. matches are case insensitive) and "green apple" (ie. 
matches aren't restricted to prefixes). And a user typing "电脑" should match 
"Apple 电脑".

Is it possible to do that? I read the article:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

In that article KeywordTokenizerFactory is used. If I changed it to CJKTokenizer 
would that work? 

With an input of "Apple 电脑", what would CJKTokenizer produce?

-is it "Apple", "电", "脑" ?
or
- is it "A", "p", "p", "l", "e", "电", "脑" ?

Any help would be greatly appreciated.

Andy





What's the difference between TokenizerFactory, Tokenizer, & Analyzer?

2010-09-28 Thread Andy
Could someone help me to understand the differences between TokenizerFactory, 
Tokenizer, & Analyzer?

Specifically, I'm interested in implementing auto-complete for tags that could 
contain both English & Chinese. I read this article 
(http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/).
 In the article KeywordTokenizerFactory is used as tokenizer. I thought I'd try 
replacing that with CJKTokenizer. 2 questions:

1) KeywordTokenizerFactory seems to be a "tokenizer factory" while CJKTokenizer 
seems to be just a tokenizer. Are they the same type of thing at all? 
Could I just replace 

<tokenizer class="solr.KeywordTokenizerFactory"/>

with

<tokenizer class="solr.CJKTokenizer"/>

??

2) I'm also interested in trying out SmartChineseAnalyzer 
(http://lucene.apache.org/java/2_9_0/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/SmartChineseAnalyzer.html)
However SmartChineseAnalyzer doesn't offer a separate tokenizer. It's just an 
analyzer and that's it. How do I use it in Solr?
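
For reference, schema.xml can also reference a whole Analyzer class directly
instead of declaring a tokenizer/filter chain, which is how an analyzer
without a separate tokenizer can be used (a sketch, assuming the smartcn
contrib jar is on Solr's classpath):

<fieldType name="text_zh" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
</fieldType>

The trade-off is that a class-level analyzer cannot be combined with
additional filter factories in the same chain.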

Thanks.
Andy


  


How to set up multiple indexes?

2010-09-29 Thread Andy
I installed Solr according to the tutorial. My schema.xml & solrconfig.xml are 
in 
~/apache-solr-1.4.1/example/solr/conf

Everything so far is just like that in the tutorial. But I want to set up a 2nd 
index (separate from the "main" index) just for the purpose of auto-complete.

I understand that I need to set up multicore for this. But I'm not sure how to 
do that. I read the doc (http://wiki.apache.org/solr/CoreAdmin) but am still 
pretty confused.

- where do I put the 2nd index?
- do I need separate schema.xml & solrconfig.xml for the 2nd index? Where do I 
put them?
- how do I tell solr which index do I want a document to go to?
- how do I tell solr which index do I want to query against?
- any step-by-step instruction on setting up multicore?
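
For reference, a minimal multicore layout might look like this (a sketch;
the core names are made up). Each core gets its own conf/schema.xml and
conf/solrconfig.xml under its instanceDir, and documents and queries are
addressed per core, e.g. /solr/autocomplete/update and
/solr/autocomplete/select:

<!-- solr.xml in the solr home directory -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="main" instanceDir="main"/>
    <core name="autocomplete" instanceDir="autocomplete"/>
  </cores>
</solr>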

Thanks.
Andy



  


Any way to "append" new text to an existing indexed field?

2010-10-01 Thread Andy
I'm building a Q&A application. There's a "Question" database table and an 
"Answer" table.

For each question, I'm putting the question itself plus all the answers into a 
single field "text" to be indexed and searched.

Say I have a question that has 10 existing answers that are already indexed. If 
a new answer is submitted for that question, is there any way I could just 
"append" the new answer to the "text" field? Or is the only way to implement 
this is to retrieve the original question and the 10 existing answers from the 
database, combine them with the newly submitted 11th answer, and re-index 
everything from scratch?

The latter option just seems inefficient. Is there a better design that could 
be used for this use case?

Andy


  


Re: Any way to "append" new text to an existing indexed field?

2010-10-01 Thread Andy
Well I want to just display the "title" of the question in my search results 
and users can then just click on it to see the detals of the question and all 
the answers.

For example, say a question has the title "What is the meaning of life?" and 
then one of the answers to that question is "solr." If someone searches for 
"solr", I want to display the question title "What is the meaning of life?" in 
the search results. If the user clicks on the question title and drills down, 
he can then see that one of the answers is "solr".

I'm not sure it makes sense to index the question and each answer separately 
because I don't want to get duplicate questions in the search results. In the 
above example, let's say there's another answer "solr is the meaning". If each 
answer is indexed separately, I'd get two "What is the meaning of life?" in my 
search results when someone searches for "solr".

--- On Fri, 10/1/10, Allistair Crossley  wrote:

> From: Allistair Crossley 
> Subject: Re: Any way to "append" new text to an existing indexed field?
> To: solr-user@lucene.apache.org
> Date: Friday, October 1, 2010, 7:46 AM
> i would say question and answer are 2
> different entities. if you are using the data import
> handler, i would personally create them as separate entities
> with their own queries to the database using the deltaQuery
> method to pick up only new rows. i guess it depends if you
> need question + answers to actually come back out to be used
> for display (i.e. you stored their data), or whether it's
> good enough to match on question/answer separately and then
> just link to a question ID in your UI to drill-down from the
> database.
> 
> disclaimer: i am a solr novice - just started, so i'd see
> what others think too ;)
> 
> On Oct 1, 2010, at 7:38 AM, Andy wrote:
> 
> > I'm building a Q&A application. There's a
> "Question" database table and an "Answer" table.
> > 
> > For each question, I'm putting the question itself
> plus all the answers into a single field "text" to be
> indexed and searched.
> > 
> > Say I have a question that has 10 existing answers
> that are already indexed. If a new answer is submitted for
> that question, is there any way I could just "append" the
> new answer to the "text" field? Or is the only way to
> implement this is to retrieve the original question and the
> 10 existing answers from the database, combine them with the
> newly submitted 11th answer, and re-index everything from
> scratch?
> > 
> > The latter option just seems inefficient. Is there a
> better design that could be used for this use case?
> > 
> > Andy
> > 
> > 
> > 
> 
> 


  


NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

2010-10-02 Thread Andy
I'm working on a user-generated tagging feature. Some of the tags could be 
multi-lingual, mixing languages like English, Chinese, Japanese.

I'd like to add auto-complete to help users to enter the tags. And I'd want to 
match in the middle of the tags as well.

For example, if a user types "guit" I want to suggest:
"guitar"
"electric guitar"
"电动guitar"
"guitar英雄"

And if a user types "吉他" I want to suggest:
"吉他Hero"
"electric吉他"
"古典吉他"


I'm thinking about using:

<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Would the above setup do what I want to do?

Also how would I deal with hyphens? For example I want an input of either 
"wi-f" or "wif" to match the tag "wi-fi". 

Would adding WordDelimiterFilterFactory to both "index" and "query" accomplish 
that?


Thanks.





Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

2010-10-02 Thread Andy


--- On Sat, 10/2/10, Ahmet Arslan  wrote:

> From: Ahmet Arslan 

> > For example, if a user types
> "guit" I want to suggest:
> > "guitar"
> > "electric guitar"
> > "电动guitar"
> > "guitar英雄"
> > 
> > And if a user types "吉他" I want to suggest:
> > "吉他Hero"
> > "electric吉他"
> > "古典吉他"
> > 
> > 
> > I'm thinking about using:
> > 
> > <fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> > 
> > Would the above setup do what I want to do?
> 
> fieldType autocomplete will bring you only startsWith tags
> since it uses KeywordTokenizerFactory. You need
> WhitespaceTokenizer for your use case. 
> 
> Or you can use two different fields and types (using
> keywordtokenizer and whitespacetokenizer). So that
> beginsWith matches comes first.
> 

I don't understand. Many tags like "electric吉他" or "古典吉他" have no whitespace at 
all, so how does WhitespaceTokenizer help?





Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

2010-10-03 Thread Andy


--- On Sat, 10/2/10, Ahmet Arslan  wrote:

> > I don't understand. Many tags like "electric吉他"
> or
> > "古典吉他" have no whitespace at all, so how does
> > WhitespaceTokenizer help?
> 
> It makes sense for tags having more than one words. i.e.
> "electric guitar"
> 
> If you tokenize this using whitespacetokenizer, you obtain
> two tokens.
> If you use keywordtokenizer, you obtain only one token,
> always.
> 
> In other words, if you want query qui to return "electric
> guitar" you need whitespacetokenizer.


But I thought NGramFilterFactory would generate substrings that start in the 
"middle", hence ensuring autocomplete matching in the middle.

So in the case of "electric guitar", keywordtokenizer would create one token - 
"electric guitar"

NGramFilterFactory would then take that one token ("electric guitar") and 
generate N-grams out of it. One of the ngrams would be "guit" because "guit" is 
a substring of "electric guitar".

Or did I misunderstand how NGramFilterFactory works?








Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

2010-10-03 Thread Andy
Ah Thanks for clearing that up.

Does anyone know how to deal with these 2 issues when using NGramFilterFactory 
for autocomplete?

1) hyphens - if user types "ema" or "e-ma" I want to suggest "email"

2) accents - if user types "herme"  want to suggest "Hermès"

Thanks.

--- On Sun, 10/3/10, Ahmet Arslan  wrote:

> From: Ahmet Arslan 
> Subject: Re: NGramFilterFactory for auto-complete that matches the middle of 
> multi-lingual tags?
> To: solr-user@lucene.apache.org
> Date: Sunday, October 3, 2010, 6:26 AM
> > But I thought NGramFilterFactory
> would generate substrings
> > that start in the "middle", hence ensuring
> autocomplete
> > matching in the middle.
> > 
> > So in the case of "electric guitar", keywordtokenizer
> would
> > create one token - "electric guitar"
> > 
> > NGramFilterFactory would then take that one toke
> ("electric
> > guitar") and generate N-grams out of it. One of the
> ngrams
> > would be "guit" because "guit" is a substring of
> "electric
> > guitar".
> > 
> 
> Oops. You are correct, I am sorry. I mixed it up with
> *Edge*NGramFilterFactory.
> 
> 
>       
> 





Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

2010-10-04 Thread Andy
> > 1) hyphens - if user types "ema" or "e-ma" I want to
> > suggest "email"
> > 
> > 2) accents - if user types "herme"  want to suggest
> > "Hermès"
> 
> Accents can be removed by using MappingCharFilterFactory
> before the tokenizer (both index and query time):
> 
> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
> 
> I am not sure if this is the most elegant solution but you can
> replace - with "" using MappingCharFilterFactory too. It
> satisfies what you describe in 1.
> 
> But generally NGramFilterFactory produces a lot of tokens.
> I mean the query "er" can return "hermes". Maybe
> EdgeNGramFilterFactory is more suitable for the
> auto-complete task. At least it guarantees that some word
> starts with that character sequence.

Thanks.

I agree with the issues with NGramFilterFactory you pointed out and I really 
want to avoid using it. But the problem is that I have Chinese tags like "电吉他" 
and multi-lingual tags like "electric吉他".

For tags like that WhitespaceTokenizerFactory wouldn't work. And if I use 
ChineseFilterFactory would it recognize that the "electric" in "electric吉他" 
isn't Chinese and shouldn't be split into individual characters?

Any ideas here are greatly appreciated.

In a related matter, I checked out 
http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-tree.html 
and saw that there are:

EdgeNGramFilterFactory & EdgeNGramTokenizerFactory
NGramFilterFactory & NGramTokenizerFactory

What are the differences between *FilterFactory and *TokenizerFactory? In my 
case which one should I be using?

Thanks.





Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

2010-10-04 Thread Andy
 
> I got your point. You want to retrieve "electric吉他"
> with the query 吉他. That's why you don't want EdgeNGram.
> If this is the only reason for NGram, I think you can
> transform "electric吉他" into two tokens "electric"
> "吉他" in TokenFilter(s) and apply EdgeNGram approach.
> 

What TokenFilters would split "electric吉他" into "electric" & "吉他"?





Differences between FilterFactory and TokenizerFactory?

2010-10-04 Thread Andy
There are EdgeNGramFilterFactory & EdgeNGramTokenizerFactory.

Likewise there are StandardFilterFactory & StandardTokenizerFactory.

LowerCaseFilterFactory & LowerCaseTokenizerFactory.

Seems like they always come in pairs. 

What are the differences between FilterFactory and TokenizerFactory? When 
should I use one as opposed to the other?
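
For reference, the distinction shows up directly in how an analyzer chain is
declared in schema.xml: exactly one tokenizer starts the chain and turns the
raw text into a token stream, and zero or more filters then transform those
tokens in order (a sketch):

<fieldType name="text_example" class="solr.TextField">
  <analyzer>
    <!-- the tokenizer produces the initial token stream; exactly one per analyzer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- filters consume and transform the stream; any number, applied in order -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
</fieldType>

Where both variants of the same name exist (e.g. EdgeNGram), the Tokenizer
version operates on the raw input, while the Filter version operates on
tokens that some other tokenizer has already produced.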

Thanks



  


"OR" facet queries?

2010-10-09 Thread Andy
I want to enable users to select multiple facet values for a specific facet 
fields. For example, if "color" is a facet field, I'd like to let users to 
select "red" OR "blue".

Please note, I've set

<solrQueryParser defaultOperator="AND"/>

because I want "q=hello+world" to mean "hello" and "world" are AND'ed together.

1) What is the syntax of doing that? Can I implement that by putting "OR" 
within the fq clause?
E.g.
&facet=on&facet.field=color&facet.field=size
&fq=color:(red OR blue)
&fq=size:(M OR L)

2) Is there a performance penalty associated with using "OR" on the facet 
values like that? If so how much of a penalty?

Thanks





  

Which is faster -- delete or update?

2010-11-01 Thread Andy
My documents have a "down_vote" field. Every time a user votes down a document, 
I increment the "down_vote" field in my database and also re-index the document 
to Solr to reflect the new down_vote value.

During searches, I want to restrict the results to only documents with, say, 
fewer than 3 down votes. There are 2 ways to implement that:

1) When a user down-votes a document, check to see if the total down votes have 
reached 3. If they have, delete the document from the Solr index.

2) When a user down-votes a document, update the document in the Solr index to 
reflect the new down_vote value even if the total down votes might have been more 
than 3. During query, add an "fq" to restrict results to documents with fewer 
than 3 down votes (see the sketch below).
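
For option 2, the filter could be a simple range query (a sketch, assuming an
integer field):

fq=down_vote:[0 TO 2]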
Which approach is better? Is it faster to delete a document from the index or to 
update the document to reflect the new down_vote value?

Thanks.
Andy


  

Updating Solr index - DIH delta vs. task queues

2010-11-04 Thread Andy
Hi,

I have data stored in a database that is being updated constantly. I need to 
find a way to update the Solr index as data in the database is being updated.

There seem to be 2 main schools of thought on this:

1) DIH delta - query the database for all records that have a timestamp later 
than the last_index_time, and import those records for indexing to Solr (see 
the sketch below).

2) Task queue - every time a record is updated in the database, throw a task 
onto a queue to index that record to Solr.
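
A sketch of option 1 in DIH terms (the table and column names are made up):

<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE id = '${dataimporter.delta.id}'"/>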
Just want to know the pros and cons of each approach, and what your 
experience is. For someone starting new, what would be your recommendation?

Thanks
Andy


  

Re: EdgeNGram relevancy

2010-11-11 Thread Andy
Could anyone help me understand why "Clyde Phillips" appears in the 
results for "Bill Cl"?

"Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so 
why is it even in the results?

Thanks.

--- On Thu, 11/11/10, Ahmet Arslan  wrote:

> You can add an additional field, with
> using KeywordTokenizerFactory instead of
> WhitespaceTokenizerFactory. And query both these fields with
> an OR operator. 
> 
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> 
> You can even apply boost so that begins with matches comes
> first.
> 
> --- On Thu, 11/11/10, Robert Gründler 
> wrote:
> 
> > From: Robert Gründler 
> > Subject: EdgeNGram relevancy
> > To: solr-user@lucene.apache.org
> > Date: Thursday, November 11, 2010, 5:51 PM
> > Hi,
> > 
> > consider the following fieldtype (used for
> > autocompletion):
> > 
> > <fieldtype name="edgytext" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
> >     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
> >     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
> >   </analyzer>
> > </fieldtype>
> > 
> > 
> > This works fine as long as the query string is a
> single
> > word. For multiple words, the ranking is weird
> though.
> > 
> > Example:
> > 
> > Query String: "Bill Cl"
> > 
> > Result (in that order):
> > 
> > - Clyde Phillips
> > - Clay Rogers
> > - Roger Cloud
> > - Bill Clinton
> > 
> > "Bill Clinton" should have the highest rank in that
> > case.  
> > 
> > Has anyone an idea how to to configure this fieldtype
> to
> > make matches in both tokens rank higher than those who
> match
> > in either token?
> > 
> > 
> > thanks!
> > 
> > 
> > -robert
> > 
> > 
> > 
> > 
> 
> 
> 
> 





Re: EdgeNGram relevancy

2010-11-11 Thread Andy
Ah I see. Thanks for the explanation.

Could you set the defaultOperator to "AND"? That way both "Bill" and "Cl" must 
be a match and that would exclude "Clyde Phillips".


--- On Thu, 11/11/10, Robert Gründler  wrote:

> From: Robert Gründler 
> Subject: Re: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 3:51 PM
> according to the fieldtype i posted
> previously, i think it's because of:
> 
> 1. WhiteSpaceTokenizer splits the String "Clyde Phillips"
> into 2 tokens: "Clyde" and "Phillips"
> 2. EdgeNGramFilter gets the 2 tokens, and creates an
> EdgeNGram for each token: "C" "Cl" "Cly"
> ...   AND  "P" "Ph" "Phi" ...
> 
> The Query String "Bill Cl" gets split up in 2 Tokens "Bill"
> and "Cl" by the WhitespaceTokenizer.
> 
> This creates a match for the 2nd token "Cl" of the query,
> and one of the "sub"tokens the EdgeNGramFilter created:
> "Cl".
> 
> 
> -robert
> 
> 
> 
> 
> On Nov 11, 2010, at 21:34 , Andy wrote:
> 
> > Could anyone help me understand why "Clyde
> > Phillips" appears in the results for "Bill Cl"?
> > 
> > "Clyde Phillips" doesn't produce any EdgeNGram that
> would match "Bill Cl", so why is it even in the results?
> > 
> > Thanks.
> > 
> > --- On Thu, 11/11/10, Ahmet Arslan 
> wrote:
> > 
> >> You can add an additional field, with
> >> using KeywordTokenizerFactory instead of
> >> WhitespaceTokenizerFactory. And query both these
> fields with
> >> an OR operator. 
> >> 
> >> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> >> 
> >> You can even apply boost so that begins with
> matches comes
> >> first.
> >> 
> >> --- On Thu, 11/11/10, Robert Gründler 
> >> wrote:
> >> 
> >>> From: Robert Gründler 
> >>> Subject: EdgeNGram relevancy
> >>> To: solr-user@lucene.apache.org
> >>> Date: Thursday, November 11, 2010, 5:51 PM
> >>> Hi,
> >>> 
> >>> consider the following fieldtype (used for
> >>> autocompletion):
> >>> 
> >>> <fieldtype name="edgytext" class="solr.TextField" positionIncrementGap="100">
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
> >>>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
> >>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
> >>>   </analyzer>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
> >>>     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
> >>>   </analyzer>
> >>> </fieldtype>
> >>> 
> >>> 
> >>> This works fine as long as the query string is
> a
> >> single
> >>> word. For multiple words, the ranking is
> weird
> >> though.
> >>> 
> >>> Example:
> >>> 
> >>> Query String: "Bill Cl"
> >>> 
> >>> Result (in that order):
> >>> 
> >>> - Clyde Phillips
> >>> - Clay Rogers
> >>> - Roger Cloud
> >>> - Bill Clinton
> >>> 
> >>> "Bill Clinton" should have the highest rank in
> that
> >>> case.  
> >>> 
> >>> Has anyone an idea how to to configure this
> fieldtype
> >> to
> >>> make matches in both tokens rank higher than
> those who
> >> match
> >>> in either token?
> >>> 
> >>> 
> >>> thanks!
> >>> 
> >>> 
> >>> -robert
> >>> 
> >>> 
> >>> 
> >>> 
> >> 
> >> 
> >> 
> >> 
> > 
> > 
> > 
> 
> 





DIH for multilingual index & multiValued field?

2010-11-13 Thread Andy
I have a MySQL table:

CREATE TABLE documents (
id INT NOT NULL AUTO_INCREMENT,
language_code CHAR(2),
tags CHAR(30),
text TEXT,
PRIMARY KEY (id)
);

I have 2 questions about Solr DIH:

1) The "language_code" field indicates what language the "text" field is in. 
Depending on the language, I want to index "text" to a different Solr field.

# pseudo code

if language_code == "en":
    index "text" to Solr field "text_en"
elif language_code == "fr":
    index "text" to Solr field "text_fr"
elif language_code == "zh":
    index "text" to Solr field "text_zh"
...

Can DIH handle a usecase like this? How do I configure it to do so?

2) The "tags" field needs to be indexed into a Solr multiValued field. Multiple 
values are stored in a string, separated by a comma. For example, if `tags` 
contains the string "blue, green, yellow" then I want to index the 3 values 
"blue", "green", "yellow" into a Solr multiValued field.

How do I do that with DIH?

Thanks.
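
For what it's worth, a sketch of one way to do both with DIH: a
ScriptTransformer routes "text" into a per-language field, and
RegexTransformer's splitBy splits the comma-separated tags (the connection
details and field names here are assumptions):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/db" user="user" password="pass"/>
  <script><![CDATA[
    function routeByLanguage(row) {
      // copy "text" into text_en / text_fr / text_zh based on language_code
      var lang = row.get('language_code');
      row.put('text_' + lang, row.get('text'));
      row.remove('text');
      return row;
    }
  ]]></script>
  <document>
    <entity name="documents"
            transformer="script:routeByLanguage,RegexTransformer"
            query="SELECT id, language_code, tags, text FROM documents">
      <!-- splitBy turns "blue, green, yellow" into multiple values
           for a multiValued "tags" field -->
      <field column="tags" splitBy=",\s*"/>
    </entity>
  </document>
</dataConfig>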


  


custom solr sort

2013-01-05 Thread andy
 void copy(int slot, int doc) throws IOException {
values[slot] = getRelation(doc);

}

@Override
public void setBottom(int slot) {
bottom = values[slot];
}

@Override
public FieldComparator setNextReader(
AtomicReaderContext ctx) throws IOException {
uidDoc = FieldCache.DEFAULT.getInts(ctx.reader(), "userID",
true);
return this;
}

@Override
public Float value(int slot) {
return new Float(values[slot]);
}

private float getRelation(int doc) throws IOException {
if (dg3.get(uidDoc[doc])) {
return 3.0f;
} else if (dg2.get(uidDoc[doc])) {
return 4.0f;
} else if (dg1.get(uidDoc[doc])) {
return 5.0f;
} else {
return 1.0f;
}
}

@Override
public int compareDocToValue(int arg0, Object arg1)
throws IOException {
// TODO Auto-generated method stub
return 0;
    }
}

}
}


and the solrconfig.xml configuration registers the component and the "mysearch"
handler, roughly:

<searchComponent name="mySortComponent" class="..."/>

<requestHandler name="mysearch" class="solr.SearchHandler">
  <arr name="last-components">
    <str>mySortComponent</str>
  </arr>
</requestHandler>


Thanks
Andy




--
View this message in context: 
http://lucene.472066.n3.nabble.com/custom-solr-sort-tp4031014.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: custom solr sort

2013-01-07 Thread andy
Hi Upayavira,

The custom sort field is not stored in the index. I want to achieve a
requirement that different search users will get different search results
when they search the same keyword through my search engine; the search users
have a relationship with each result document in Solr, but the relationship
is provided by another team's REST service.
So the search sequence is as follows:
1. I add the search user's id to the solr query (i.e.
query.setParam("uid", vo.getUserId());)
   and specify my own request handler "*mysearch*": query.setParam("qt",
"mysearch");
2. MySortComponent sets the custom sort as the first sort.
3. MyComparatorSource gets the uid and sends a request to a rest service to
get the relationship according to the uid.
4. Sort the result.
4.sort the result

Do you have any suggestions?



Upayavira wrote
> Can you explain why you want to implement a different sort first? There
> may be other ways of achieving the same thing.
> 
> Upayavira
> 
> On Sun, Jan 6, 2013, at 01:32 AM, andy wrote:
>> Hi,
>> 
>> Maybe this is an old thread, or maybe it's different from previous ones.
>> 
>> I want to customize the Solr sort and pass a Solr param from the client
>> to the Solr server, so I implemented a SearchComponent named
>> MySortComponent, and also implemented FieldComparatorSource and
>> FieldComparator. When I use the "mysearch" request handler (see the
>> following code), I found that the custom sort only takes effect on the
>> current page when I get multi-page results, but the sort is as expected
>> when I set rows to contain all the results. Does
>> anybody know how to solve this, or the reason?
>> 
>> code snippet:
>> 
>> public class MySortComponent extends SearchComponent implements
>> SolrCoreAware {
>> 
>> // declared as a field so the inner MyComparatorSource class can use it
>> private RestTemplate restTemplate = new RestTemplate();
>> 
>> public void inform(SolrCore arg0) {
>> }
>> 
>> @Override
>> public void prepare(ResponseBuilder rb) throws IOException {
>> SolrParams params = rb.req.getParams();
>> String uid = params.get("uid");
>> 
>> MyComparatorSource comparator = new MyComparatorSource(uid);
>> SortSpec sortSpec = rb.getSortSpec();
>> if (sortSpec.getSort() == null) {
>> sortSpec.setSort(new Sort(new SortField[] {
>> new SortField("relation",
>> comparator),SortField.FIELD_SCORE }));
>>   
>> } else {
>>   
>> SortField[] current = sortSpec.getSort().getSort();
>> ArrayList<SortField> sorts = new ArrayList<SortField>(
>> current.length + 1);
>> sorts.add(new SortField("relation", comparator));
>> for (SortField sf : current) {
>> sorts.add(sf);
>> }
>> sortSpec.setSort(new Sort(sorts.toArray(new
>> SortField[sorts.size()])));
>>   
>> }
>> 
>> }
>> 
>> @Override
>> public void process(ResponseBuilder rb) throws IOException {
>> 
>> }
>> 
>> //
>> -
>> // SolrInfoMBean
>> //
>> -
>> 
>> @Override
>> public String getDescription() {
>> return "Custom Sorting";
>> }
>> 
>> @Override
>> public String getSource() {
>> return "";
>> }
>> 
>> @Override
>> public URL[] getDocs() {
>> try {
>> return new URL[] { new URL(
>> "http://wiki.apache.org/solr/QueryComponent";) };
>> } catch (MalformedURLException e) {
>> throw new RuntimeException(e);
>> }
>> }
>> 
>> public class MyComparatorSource extends FieldComparatorSource {
>> private BitSet dg1;
>> private BitSet dg2;
>> private BitSet dg3;
>> 
>> public MyComparatorSource(String uid) throws IOException {
>> 
>> SearchResponse responseBody = restTemplate.postForObject(
>> "http://search.test.com/userid/search/"; + uid, null,
>> SearchResponse.class);
>> 
>> String d1 = responseBody.getOneDe();
>> String d2 = responseBody.getTwo

Re: custom solr sort

2013-01-07 Thread andy
Thank you guys, I found the reason now: there is something wrong with the
compareBottom method in my source; it's not consistent with the compare method.
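
For reference, Lucene's FieldComparator contract requires compareBottom()
to order a document against the current bottom exactly as compare() orders
two slots. A minimal sketch of a consistent pair, reusing the values[],
bottom and getRelation() members from the code posted earlier:

@Override
public int compare(int slot1, int slot2) {
    // compare the relation scores stored by copy()
    return Float.compare(values[slot1], values[slot2]);
}

@Override
public int compareBottom(int doc) throws IOException {
    // must order doc against bottom the same way compare() would
    return Float.compare(bottom, getRelation(doc));
}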






How to use SolrCloud in multi-threaded indexing

2013-01-31 Thread andy
Hi, 

I am going to upgrade to Solr 4.1 from version 3.6, and I want to set up
two shards.
I use ConcurrentUpdateSolrServer to index the documents in Solr 3.6.
I saw the CloudSolrServer API in 4.1, BUT:
1. CloudSolrServer uses LBHttpSolrServer to issue requests, yet
"LBHttpSolrServer should NOT be used for indexing", as documented in the API:
http://lucene.apache.org/solr/4_1_0/solr-solrj/index.html
2. It seems CloudSolrServer does not support multi-threaded indexing.

So, how do I do multi-threaded indexing in Solr 4.1?

Thanks
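
For what it's worth, CloudSolrServer instances are thread-safe, so one
common pattern is to share a single instance across an indexing thread
pool. A minimal sketch only; the ZooKeeper address, collection name and
field names are placeholders:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        // one shared, ZooKeeper-aware client for the whole pool
        final CloudSolrServer server =
                new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 100000; i++) {
            final int id = i;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", Integer.toString(id));
                        doc.addField("text", "document body " + id);
                        server.add(doc);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit(); // one commit at the end, not per document
        server.shutdown();
    }
}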





Re: How to use SolrCloud in multi-threaded indexing

2013-02-04 Thread andy

Thanks man







Re: Facets with 5000 facet fields

2013-03-19 Thread Andy
Hoss,

What about the case where there's only a small number of fields (a dozen or 
two) but each field has hundreds of thousands or millions of values? Would Solr 
be able to handle that?




 From: Chris Hostetter 
To: solr-user@lucene.apache.org 
Sent: Tuesday, March 19, 2013 6:09 PM
Subject: Re: Facets with 5000 facet fields
 

: In order to support faceting, Solr maintains a cache of the faceted
: field. You need one cache for each field you are faceting on, meaning
: your memory requirements will be substantial, unless, I guess, your

1) you can consider trading RAM for time by using "facet.method=enum" (and 
disabling your filterCache) ... it will prevent the need for the 
FieldCaches but will probably be slower, as it will compute the docset per 
value per field instead of generating the FieldCaches once and re-using 
them.

2) the entire question seems suspicious...

: > We have configured solr for 5000 facet fields as part of request
: > handler.We
: > have 10811177 docs in the index.

...I have lots of experience dealing with indexes that had thousands of 
fields that were faceted on, but I've never seen any realistic use case for 
faceting on more than a few hundred fields per search.  Can you please 
elaborate on your goals and use cases so we can offer better advice...

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


-Hoss

Re: Facets with 5000 facet fields

2013-03-20 Thread Andy
That's impressive performance.

Are you doing NRT updates? I seem to recall that facet cache is not per segment 
so every time the index is updated the facet cache will need to be re-computed. 
And that's going to kill performance. Have you run into that problem?



 From: Toke Eskildsen 
To: "solr-user@lucene.apache.org" ; Andy 
 
Sent: Wednesday, March 20, 2013 4:06 AM
Subject: Re: Facets with 5000 facet fields
 
On Wed, 2013-03-20 at 07:19 +0100, Andy wrote:
> What about the case where there's only a small number of fields (a
> dozen or two) but each field has hundreds of thousands or millions of
> values? Would Solr be able to handle that?

We do that on a daily basis at State and University Library, Denmark:
One of our facet fields has 10766502 unique terms, another has 6636746.
This is for 11M documents and it has query response times clustering at
~150ms, ~750ms and ~1500ms (I'll have to look into why it clusters like
that).

This is with standard Solr faceting on a quad core Xeon L5420 server
with SSD. It has 16GB of RAM and runs two search instances, each with
~11M documents, one with a 52GB index, one with 71GB.

- Toke Eskildsen

how to custom the groupValue for the solr group function

2012-04-27 Thread andy
I want to set the group field to "title", which has indexed values like
these:

吸尘器(Panasonic) MC-CA391G
吸尘器(Panasonic) MC-CA491R
吸尘器(Panasonic) MC-CA402G

and so on.
I search like this:
q=title:吸尘器&group=true&group.field=title

Analyzing the search results, I see a group value of "ca" with a
corresponding numFound of 6:

吸尘器(Panasonic) MC-CA391G
吸尘器(Panasonic) MC-CA491R
吸尘器(Panasonic) MC-CA402G
and three other items are in the same group.
I do not want these items to be grouped together, because they have
different product numbers.

Now I have a question: HOW do I CUSTOMIZE THE GROUP VALUE to make the
grouping more exact?
Please do me a favor!
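
The usual fix here, sketched under the assumption that "title" is an
analyzed text field: result grouping groups by the indexed term, so group
on an untokenized string copy of the field instead.

<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="title_exact" type="string" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

Then query with group.field=title_exact, and each distinct full title
becomes its own group.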




Re: Benchmark Solr vs Elastic Search vs Sensei

2012-04-27 Thread Andy
So the Cassandra integration brings distributed index and replication to Solr? 
Is that different from what Solr Cloud does?



 From: Jeff Schmidt 
To: solr-user@lucene.apache.org 
Sent: Friday, April 27, 2012 3:58 PM
Subject: Re: Benchmark Solr vs Elastic Search vs Sensei
 
This is a pretty awesome combination, actually. I'm getting started using it 
myself, and I'd be very interested in what kind of benchmark results you get 
vs. Solr and your other candidates. DataStax Enterprise 2.0 was released in 
March and is based on Solr 4.0 and Cassandra 1.0.7 or 1.0.8; I'm looking for 
the Cassandra 1.1-based release.

Note: I am not affiliated with DataStax in any way, other than being a 
satisfied customer for the past few months. I am just trying to selfishly 
fuel your interest so you'll consider benchmarking it.

My project is already using Cassandra, and we had to manage Solr separately. 
Having the Solr indexes, and core configuration (solrconfig.xml, schema.xml, 
synonyms.txt etc) in Cassandra, being distributed and replicated among the 
various nodes, and eventually for us, multiple data centers is fantastic.

Jeff

On Apr 27, 2012, at 1:46 PM, Walter Underwood wrote:

> On Apr 27, 2012, at 12:39 PM, Radim Kolar wrote:
> 
>> Dne 27.4.2012 19:59, Jeremy Taylor napsal(a):
>>> DataStax offers a Solr integration that isn't master/slave and is
>>> NearRealTimes.
>> its rebranded solandra?
> 
> No, it is a rewrite.
> 
> http://www.datastax.com/dev/blog/cassandra-with-solr-integration-details
> 
> wunder
> --
> Walter Underwood
> wun...@wunderwood.org
> 
> 
> 



--
Jeff Schmidt
535 Consulting
j...@535consulting.com
http://www.535consulting.com
(650) 423-1068

Re: Benchmark Solr vs Elastic Search vs Sensei

2012-04-27 Thread Andy
What is the performance of Elasticsearch and SenseiDB in your benchmark?



 From: Volodymyr Zhabiuk 
To: solr-user@lucene.apache.org 
Sent: Thursday, April 26, 2012 9:50 PM
Subject: Benchmark Solr vs Elastic Search vs Sensei
 
Hi Solr users

I've implemented a project to compare the performance of
Solr, Elastic Search and SenseiDB:
https://github.com/vzhabiuk/search-perf
Solr version 3.5.0 was used. I used the default configuration, just
enabled JSON updates, and used the following schema:
https://github.com/vzhabiuk/search-perf/blob/master/configs/solr/schema.xml.
2.5 mln documents were put into the index, after which I launched the
indexing process to add another 500k docs, issuing commits after each
500-doc batch. At the same time I launched a concurrent client that sent
the following type of queries:
((tags:moon-roof%20or%20tags:electric%20or%20tags:highend%20or%20tags:hybrid)%20AND%20(!tags:family%20AND%20!tags:chick%20magnet%20AND%20!tags:soccer%20mom))%20
OR%20((color:red%20or%20color:green%20or%20color:white%20or%20color:yellow)%20AND%20(!color:gold%20AND%20!color:silver%20AND%20!color:black))%20
OR%20mileage:[15001%20TO%2017500]%20OR%20mileage:[17501%20TO%20*]%20
OR%20city:u.s.a.*
&facet=true&facet.field=tags&facet.field=color
The query is a high-level "OR" query consisting of 2 terms, 2 ranges and
1 prefix; it is designed to hit ~60-70% of all the docs.
Here is the performance result:

#Threads   min        median     mean       75%        qps
1          208.95ms   332.66ms   350.48ms   422.92ms   2.8
2          188.68ms   338.09ms   339.22ms   402.15ms   5.9
3          151.06ms   326.64ms   336.20ms   418.61ms   8.8
4          125.13ms   332.90ms   332.18ms   396.14ms   12.0

If there is no indexing process running in the background, the result for
2.6 mln docs is as follows:

#Threads   min        median     mean       75%        qps
1          106.70ms   199.66ms   199.40ms   234.89ms   5.1
2          128.61ms   199.12ms   201.81ms   229.89ms   9.9
3          110.99ms   197.43ms   203.13ms   232.25ms   14.7
4          90.24ms    201.46ms   200.46ms   227.75ms   19.9
5          106.14ms   208.75ms   207.69ms   242.88ms   24.0
6          103.75ms   208.91ms   211.23ms   238.60ms   28.3
7          113.54ms   207.07ms   209.69ms   239.99ms   33.3
8          117.32ms   216.38ms   224.74ms   258.74ms   35.5
I've got three questions so far:
1. With background indexing the latency is almost 2 times higher; is
there any way to overcome this?
2. How can we tune Solr to get better results?
3. What, in your opinion, is the preferred type of query to use for the
benchmark?

With many thanks,
Volodymyr


BTW here is the spec of my machine
RedHat 6.1 64bit
Intel XEON e5620 @2.40 GHz, 8 cores
63 GB RAM

Re: how to custom the groupValue for the solr group function

2012-04-27 Thread andy
Hi Martijn,

Thank you for your reply.
Yes, I have analyzed the title field, so I got the unexpected result. Maybe
I have not understood the group function very well.
Thank you very much, Martijn; I will try that according to your advice.

Thanks,
Andy



facet range query question

2012-05-09 Thread andy
I am on an e-commerce project right now, and I have a requirement like this:

I have a lot of commodities in my Solr indexes, and each commodity has a
price field. Now I want to do a facet range query.
According to the Solr wiki, a facet range query needs to specify
*facet.range.gap*, or a spec such as *facet.range.spec=*,10,50,100,250,**.
Because commodities in different categories have hugely different prices
(a laptop may be $500 or more, but a pen maybe $10), it's hard to specify
the facet range gap.
What I would prefer is to just specify a *count* of buckets, and have the
whole search result divided into that many parts according to price.
For instance:
I search for the keyword *phone* and specify a *count* of 5, and the facet
ranges come back automatically like this:
10
100
100
100
10

And I search for the keyword *laptop* and specify a *count* of 5, and the
facet ranges come back automatically like this:
10
200
200
100
10

Does anyone know of something like this, or of other functions that can
implement my requirement?
Please do me a favor.
Thank You

Andy




 


Re: facet range query question

2012-05-14 Thread andy
THANKS for your reply



complex boolean filtering in fq queries

2010-12-07 Thread Andy
I have a facet query that requires some complex boolean filtering. Something 
like:

fq=location:national OR (fq=location:CA AND fq=city:"San Francisco")

1) How do I turn the above filters into a REST query string?
2) Do I need the double quotes around "San Francisco"?
3) Will complex boolean filters like this substantially slow down query 
performance?

Thanks


  


Re: complex boolean filtering in fq queries

2010-12-07 Thread Andy
Forgot to add, my defaultOperator is "AND".

--- On Wed, 12/8/10, Andy  wrote:

> From: Andy 
> Subject: complex boolean filtering in fq queries
> To: solr-user@lucene.apache.org
> Date: Wednesday, December 8, 2010, 1:21 AM
> I have a facet query that requires
> some complex boolean filtering. Something like:
> 
> fq=location:national OR (fq=location:CA AND fq=city:"San
> Francisco")
> 
> 1) How do I turn the above filters into a REST query
> string?
> 2) Do I need the double quotes around "San Francisco"?
> 3) Will complex boolean filters like this substantially
> slow down query performance?
> 
> Thanks
> 
> 
>       
> 





Re: complex boolean filtering in fq queries

2010-12-07 Thread Andy

--- On Wed, 12/8/10, Tom Hill  wrote:
> 
> fq=location:national OR (location:CA AND city:"San
> Francisco")

> Do you mean URL encoding it? You can just type your query
> into the
> search box in the admin UI, and copy from the resulting
> URL.

Thanks Tom.

I wasn't referring to URL encoding. I was just unsure about the syntax of the 
fq filter. I didn't know how to express "AND", "OR" in queries or whether I 
could use parentheses in fq.

Let me make sure I got this right. So I could just do something like:

q=foo
&facet=on
&facet.field=location
&facet.field=city
&fq=location:national OR (location:CA AND city:"San Francisco")

And it would do what I want? Thanks.
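
For reference, URL-encoded, the full request would look roughly like this
(the host, port and core path are placeholders):

http://localhost:8983/solr/select?q=foo&facet=on&facet.field=location&facet.field=city&fq=location:national%20OR%20(location:CA%20AND%20city:%22San%20Francisco%22)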




  


How to handle multivalued hierarchical facets?

2010-12-08 Thread Andy
I have facets that are hierarchical. For example, Location can be represented 
as this hierarchy:

Country > State > City

If each document can only have a single value for each of these facets, then I 
can just use separate fields for each facet.

But if multiple values are allowed, then that approach would not work. For 
example if a document has 2 Location values:

US>CA>San Francisco
US>MA>Boston

If I just put the values "CA" & "MA" in the field "State", and "San Francisco" 
& "Boston" in "City", faceting would not work. Someone could select "CA" and 
the value "Boston" would be displayed for the field "City".

How do I handle this use case?

Thanks 
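
One standard workaround, sketched here with an assumed multiValued string
field named location_path, is to index the full path for every level of
every value, so child values stay tied to their ancestors:

US>CA>San Francisco  ->  0/US, 1/US/CA, 2/US/CA/San Francisco
US>MA>Boston         ->  0/US, 1/US/MA, 2/US/MA/Boston

Filtering with fq=location_path:"1/US/CA" and faceting with
facet.prefix=2/US/CA/ then only ever shows cities that actually occur
under CA.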


  


Open source Solr UI with multiple select faceting?

2010-12-09 Thread Andy
Hi,

Any open source Solr UIs that support selecting multiple facet values ("OR" 
faceting)? For example, allowing a user to select "red" or "blue" for the 
facet field "Color". 

I'd prefer libraries in javascript or Python. I know about ajax-solr but it 
doesn't seem to support multiple selects.

Thanks.


  


Re: [ANN] General Availability of LucidWorks Enterprise

2010-12-15 Thread Andy
Congrats!

A couple questions:

1) Which version of Solr is this based on?
2) How is LWE different from standard Solr? How should one choose between the 
two?

Thanks.

--- On Wed, 12/15/10, Grant Ingersoll  wrote:

> From: Grant Ingersoll 
> Subject: [ANN] General Availability of LucidWorks Enterprise
> To: solr-user@lucene.apache.org, java-u...@lucene.apache.org
> Date: Wednesday, December 15, 2010, 4:39 PM
> Lucid Imagination is pleased to
> announce the general availability of our Apache Solr/Lucene
> powered LucidWorks Enterprise (LWE).  LWE is designed
> to make it easier for people to get up to speed on search by
> providing easier management, integration with libraries
> commonly used in building search applications (such as
> crawling) as well as value add components developed by Lucid
> Imagination all packaged on top of Apache Solr while still
> giving access to Solr.
> 
> You can get more info in the press release: 
> http://www.lucidimagination.com/About/Company-News/Lucid-Imagination-Announces-General-Availability-and-Free-Download-LucidWorks-Ent
> 
> Other Details:
> Download LucidWorks Enterprise software:
> www.lucidimagination.com/lwe/download
> View free documentation: http://lucidworks.lucidimagination.com
> View a demonstration of LucidWorks Enterprise: 
> http://www.lucidimagination.com/lwe/demos
> 
> Access LucidWorks Enterprise whitepapers and tutorials:
> www.lucidimagination.com/lwe/whitepapers
> Read further commentary on the Lucid Imagination blog
> 
> Cheers,
> Grant
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 





DIH for sharded database?

2010-12-18 Thread Andy
I have a table that is broken up into many virtual shards. So basically I have 
N identical tables:

Document1
Document2
.
.
Document36

Currently these tables all live in the same database, but in the future they 
may be moved to different servers to scale out if the need arises.

Is there any way to configure a DIH for these tables so that it will 
automatically loop through the 36 identical tables and pull data out for 
indexing?

Something like (pseudo code):

for (i = 1; i <= 36; i++) {
   ## retrieve data from the table Document{$i} & index the data
}

What's the best way to handle a situation like this?

Thanks


  


Re: DIH for sharded database?

2010-12-18 Thread Andy

--- On Sat, 12/18/10, Lance Norskog  wrote:

> You can have a file with 1,2,3 on
> separate lines. There is a
> line-by-line file reader that can pull these as separate
> drivers.
> Inside that entity the JDBC url has to be altered with the
> incoming
> numbers. I don't know if this will work.

I'm not sure I understand.

How will altering the JDBC url change the name of the table it is importing 
data from?

Wouldn't I need to change the  actual SQL query itself?

"select * from Document1"
"select * from Document2"
...
"select * from Document36"


  


DIH for taxonomy faceting in Lucid webcast

2010-12-19 Thread Andy
Hi,

I watched the Lucid webcast:
http://www.lucidimagination.com/solutions/webcasts/faceting

It talks about encoding hierarchical categories to facilitate faceting. So a 
category "path" of "NonFic>Science" would be encoded as the multivalues 
"0/NonFic" & "1/NonFic/Science".

1) My categories are stored in database as coded numbers instead of fully 
spelled out names. For example I would have a category of "2/7" and a lookup 
dictionary to convert "2/7" into "NonFic/Science". How do I do such lookup in 
DIH?

2) Once I have the fully spelled out category path such as "NonFic/Science", 
how do I turn that into "0/NonFic" & "1/NonFic/Science" using the DIH?

3) Some of my categories are multi-word values containing whitespace, such as 
"Computer Science" and "Functional Programming", so I'd have facet values such 
as "2/NonFic/Computer Science/Functional Programming".  How do I handle 
whitespace in this case? Would filtering by fq still work?

Thanks
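
For question 2, one possible approach, sketched only, is DIH's
ScriptTransformer; the column name "category" and the output field
"category_path" are assumptions:

<script><![CDATA[
  function encodePath(row) {
    var path = row.get('category');          // e.g. "NonFic/Science"
    if (path != null) {
      var parts = path.split('/');
      var vals = new java.util.ArrayList();
      var prefix = '';
      for (var i = 0; i < parts.length; i++) {
        prefix = (i == 0) ? parts[i] : prefix + '/' + parts[i];
        vals.add(i + '/' + prefix);          // "0/NonFic", "1/NonFic/Science"
      }
      row.put('category_path', vals);        // becomes a multiValued field
    }
    return row;
  }
]]></script>

with transformer="script:encodePath" on the entity.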


  


Re: DIH for sharded database?

2010-12-19 Thread Andy
This is helpful. Thank you.

--- On Sun, 12/19/10, Dennis Gearon  wrote:

> From: Dennis Gearon 
> Subject: Re: DIH for sharded database?
> To: solr-user@lucene.apache.org
> Date: Sunday, December 19, 2010, 11:56 AM
> Some talk on giant databases in
> postgres:
>   
> http://wiki.postgresql.org/images/3/38/PGDay2009-EN-Datawarehousing_with_PostgreSQL.pdf
> 
> wikipedia
>   http://en.wikipedia.org/wiki/Partition_%28database%29
>   (says to use a UNION)
> postgres description on how to do it:
>   http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html
> 
>  Dennis Gearon
> 
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes.
> It is usually a better 
> idea to learn from others’ mistakes, so you do not have
> to make them yourself. 
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> 
> EARTH has a Right To Life,
> otherwise we all die.
> 
> 
> 
> - Original Message 
> From: Andy 
> To: solr-user@lucene.apache.org
> Sent: Sat, December 18, 2010 6:20:54 PM
> Subject: DIH for sharded database?
> 
> I have a table that is broken up into many virtual shards.
> So basically I have N 
> identical tables:
> 
> Document1
> Document2
> .
> .
> Document36
> 
> Currently these tables all live in the same database, but
> in the future they may 
> be moved to different servers to scale out if the needs
> arise.
> 
> Is there any way to configure a DIH for these tables so
> that it will 
> automatically loop through the 36 identical tables and pull
> data out for 
> indexing?
> 
> Something like (pseudo code):
> 
> for (i = 1; i <= 36; i++) {
>    ## retrieve data from the table
> Document{$i} & index the data
> }
> 
> What's the best way to handle a situation like this?
> 
> Thanks
> 





Re: DIH for sharded database?

2010-12-19 Thread Andy

--- On Mon, 12/20/10, Lance Norskog  wrote:

> You said: Currently these tables all
> live in the same database, but in
> the future they may be moved to different servers to scale
> out if the
> needs arise.
> 
> That's why I concentrated on the JDBC url problem.
> 
> But you can use a file as a list of tables. Read each line,
> and a
> sub-entity can substitute the line value into the SQL
> statement.
> 

Can you give me an example of how to do this, or point me to documentation 
that illustrates it? I think I sort of understand what you're saying 
conceptually, but I need to be sure about the specifics.

Thanks.
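
A sketch of what Lance describes, using DIH's LineEntityProcessor; the
JDBC details and the file path are placeholders, and tables.txt would
contain one table suffix per line (1 through 36):

<dataConfig>
  <dataSource name="db" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <dataSource name="files" type="FileDataSource"/>
  <document>
    <entity name="tables" dataSource="files" processor="LineEntityProcessor"
            url="/path/to/tables.txt" rootEntity="false">
      <!-- ${tables.rawLine} is the current line, i.e. the table suffix -->
      <entity name="doc" dataSource="db"
              query="select * from Document${tables.rawLine}"/>
    </entity>
  </document>
</dataConfig>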


  


White space in facet values

2010-12-21 Thread Andy
How do I handle facet values that contain whitespace? Say I have a field 
"Product" that I want to facet on. A value for "Product" could be "Electric 
Guitar". How should I handle the white space in "Electric Guitar" during 
indexing? What about when I apply the constraint fq=Product:Electric Guitar?
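
The usual recipe, as a sketch: index the facet field as an untokenized
string type (faceting works on indexed terms, so with a string type each
full value is a single term), and quote the value, or use the raw query
parser, when filtering:

fq=Product:"Electric Guitar"
fq={!raw f=Product}Electric Guitar

The second form skips query-time analysis entirely and matches the value
byte-for-byte against the indexed term.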


  


Duplicate values in multiValued field

2010-12-21 Thread Andy
If I put duplicate values into a multiValued field, would that cause any 
issues? 

For example I have a multiValued field "Color". Some of my documents have 
duplicate values for that field, such as: Green, Red, Blue, Green, Green. 

Would the above (having 3 duplicate Greens) be the same as having the 
de-duplicated values: Green, Red, Blue?

Or do I need to clean my data and remove duplicate values before indexing?

Thanks.


  


Any way to "tie" corresponding values together in different multiValued fields?

2010-12-22 Thread Andy
I have products, each has a specific Product ID.

For certain products such as "Shirts", there are also extra fields such as 
"Size" and "Color".

Right now I define both "Size" and "Color" as multiValued fields. And when I 
have a Shirt of Size M and Color white, I just put "M" in "Size" and "white" in 
"Color". Now if I have another shirt with the same Product ID but Size L and 
Color blue, I add "L" to "Size" and "blue" to "Color".

This causes a problem during faceting. If a user filters on "M" for "Size" and 
"blue" for "Color", he'd get a match. But in reality there isn't a shirt with 
Size M and Color blue.

Is there any way to encode the data to "tie" Size M to Color white, and to tie 
Size L to Color blue so that the filtering would come out right? How should I 
handle this use case?

Thanks.
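
One common workaround, a sketch only (the field name and separator are
assumptions): index each valid combination as a single value in a combined
multiValued string field, e.g. a size_color field holding "M|white" and
"L|blue", and facet and filter on that field:

fq=size_color:"M|white"

Impossible combinations such as M/blue then simply never match.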


  


Re: DIH for taxonomy faceting in Lucid webcast

2010-12-22 Thread Andy

--- On Wed, 12/22/10, Chris Hostetter  wrote:

> : 2) Once I have the fully spelled out category path such
> as 
> : "NonFic/Science", how do I turn that into "0/NonFic"
> & 
> : "1/NonFic/Science" using the DIH?
> 
> I don't have any specific suggestions for you -- i've never
> tried it in 
> DIH myself.  the ScriptTransformer might be able to
> help you out, but i'm 
> not sure.

Thanks Chris.

What did you use to generate those encodings if not DIH?






Sorting within grouped results?

2011-01-05 Thread Andy
I want to group my results by a field named "group_id".

According to http://wiki.apache.org/solr/FieldCollapsing , for each unique 
value of group_id a docList with the top scoring document is returned.

But in my case I want to sort the results within each "group_id" by an int 
field "popularity" instead. So within each "group_id" I just want the document 
with the highest "popularity". 

Is it possible to do that?

Thanks.
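
The grouping implementation eventually exposed parameters for exactly
this; a sketch, assuming a Solr version whose result grouping supports
them:

q=foo&group=true&group.field=group_id&group.sort=popularity desc&group.limit=1

group.sort orders the documents within each group, and group.limit keeps
only the top document per group.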


  


Will Result Grouping return documents that don't contain the specified "group.field"?

2011-01-06 Thread Andy
I want to group my results by a field named "group_id".

However, some of my documents don't contain the field "group_id". But I still 
want these documents to be returned as part of the results as long as they 
match the main query "q". 

Do I need to do anything to tell Solr that I want those documents?

Thanks.


  


RE: Will Result Grouping return documents that don't contain the specified "group.field"?

2011-01-06 Thread Andy
So by default Solr will not return documents that don't contain the specified 
group.field?


--- On Thu, 1/6/11, Bob Sandiford  wrote:

> From: Bob Sandiford 
> Subject: RE: Will Result Grouping return documents that don't contain the 
> specified "group.field"?
> To: "solr-user@lucene.apache.org" 
> Date: Thursday, January 6, 2011, 5:19 PM
> What if you put in a default value
> for the group_id field in the solr schema - would that work
> for you?  e.g. something like 'unknown'  Then
> you'll get all those with no original group_id value still
> grouped together, and you can figure out at display time
> what you want to do with them.
> 
> Bob Sandiford | Lead Software Engineer | SirsiDynix
> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> www.sirsidynix.com 
> 
> 
> > -Original Message-
> > From: Andy [mailto:angelf...@yahoo.com]
> > Sent: Thursday, January 06, 2011 3:06 PM
> > To: solr-user@lucene.apache.org
> > Subject: Will Result Grouping return documents that
> don't contain the
> > specified "group.field"?
> > 
> > I want to group my results by a field named
> "group_id".
> > 
> > However, some of my documents don't contain the field
> "group_id". But I
> > still want these documents to be returned as part of
> the results as
> > long as they match the main query "q".
> > 
> > Do I need to do anything to tell Solr that I want
> those documents?
> > 
> > Thanks.
> > 
> > 
> > 
> 
> 
> 





Re: Will Result Grouping return documents that don't contain the specified "group.field"?

2011-01-06 Thread Andy
Is there anyway to configure Solr such that:

1) Documents that contain the specified "group.field" will be returned and 
grouped by "group.field",
AND
2) Documents that don't contain the specified "group.field" will just be 
returned as "normal" search results without any grouping

Any way to achieve that?

Thanks.

--- On Thu, 1/6/11, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Will Result Grouping return documents that don't contain the 
> specified "group.field"?
> To: solr-user@lucene.apache.org
> Date: Thursday, January 6, 2011, 9:19 PM
> Correct. Given the fact that Solr
> only requires fields in documents where
> required="true", how could it? The behavior of "just put
> everything in a
> bucket that doesn't have field X" would produce some
> "interesting"
> results
> 
> Best
> Erick
> 
> On Thu, Jan 6, 2011 at 5:55 PM, Andy 
> wrote:
> 
> > So by default Solr will not return documents that
> don't contain the
> > specified group.field?
> >
> >
> > --- On Thu, 1/6/11, Bob Sandiford 
> wrote:
> >
> > > From: Bob Sandiford 
> > > Subject: RE: Will Result Grouping return
> documents that don't contain the
> > specified "group.field"?
> > > To: "solr-user@lucene.apache.org"
> 
> > > Date: Thursday, January 6, 2011, 5:19 PM
> > > What if you put in a default value
> > > for the group_id field in the solr schema - would
> that work
> > > for you?  e.g. something like
> 'unknown'  Then
> > > you'll get all those with no original group_id
> value still
> > > grouped together, and you can figure out at
> display time
> > > what you want to do with them.
> > >
> > > Bob Sandiford | Lead Software Engineer |
> SirsiDynix
> > > P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
> > > www.sirsidynix.com
> > >
> > >
> > > > -Original Message-
> > > > From: Andy [mailto:angelf...@yahoo.com]
> > > > Sent: Thursday, January 06, 2011 3:06 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Will Result Grouping return
> documents that
> > > don't contain the
> > > > specified "group.field"?
> > > >
> > > > I want to group my results by a field named
> > > "group_id".
> > > >
> > > > However, some of my documents don't contain
> the field
> > > "group_id". But I
> > > > still want these documents to be returned as
> part of
> > > the results as
> > > > long as they match the main query "q".
> > > >
> > > > Do I need to do anything to tell Solr that I
> want
> > > those documents?
> > > >
> > > > Thanks.
> > > >
> > > >
> > > >
> > >
> > >
> > >
> >
> >
> >
> >
> 





Does field collapsing (with facet) reduce performance?

2011-01-17 Thread Andy
Just wanted to know how efficient field collapsing is. And if there is a 
performance penalty, how big is it likely to be?

I'm interested in using field collapsing with faceting.

Thanks.


  


Re: Does field collapsing (with facet) reduce performance?

2011-01-17 Thread Andy
I understand that the specific figures differ for everybody.

I just wanted to see if anyone who has used this feature could share their 
experience. A ballpark figure -- e.g. 50% slowdown or 10 times slower -- would 
be helpful.


--- On Mon, 1/17/11, Markus Jelsma  wrote:

> From: Markus Jelsma 
> Subject: Re: Does field collapsing (with facet) reduce performance?
> To: solr-user@lucene.apache.org
> Cc: "Andy" 
> Date: Monday, January 17, 2011, 7:27 PM
> There is always CPU and RAM involved
> for every nice component you use. Just 
> how much the penalty is depends completely on your
> hardware, index and type of 
> query. Under heavy load it numbers will change.
> 
> Since we don't know your situation and it's hard to predict
> without 
> benchmarks, you should really do the tests yourself.
> 
> > Just wanted to know how efficient field collapsing is.
> And if there is a
> > performance penalty, how big is it likely to be?
> > 
> > I'm interested in using field collapsing with
> faceting.
> > 
> > Thanks.
> 


  


Does Distributed Search support {!boost }?

2011-02-08 Thread Andy
Is it possible to do a query like {!boost b=log(popularity)}foo over sharded 
indexes?

I looked at the wiki on distributed search 
(http://wiki.apache.org/solr/DistributedSearch) and it has a list of 
"components" that are supported in distributed search. Just wondering what 
component does {!boost } belong to?

Thanks.


 

No need to miss a message. Get email on-the-go 
with Yahoo! Mail for Mobile. Get started.
http://mobile.yahoo.com/mail 


Re: Difference between Solr and Lucidworks distribution

2011-02-12 Thread Andy
Now I'm confused.

In http://www.lucidimagination.com/lwe/subscriptions-and-pricing, the price of 
LucidWorks Enterprise Software is stated as "FREE". I thought the price for 
"Production" was for the support service, not for the software.

But you seem to be saying that 'LucidWorks Enterprise' is a separate software 
that isn't free. Did I misunderstand?

--- On Sat, 2/12/11, Lance Norskog  wrote:

> From: Lance Norskog 
> Subject: Re: Difference between Solr and Lucidworks distribution
> To: solr-user@lucene.apache.org, markus.jel...@openindex.io
> Date: Saturday, February 12, 2011, 8:10 PM
> There are two distributions.
> 
> The company is Lucid Imagination. 'Lucidworks for Solr' is
> the
> certified distribution of Solr 1.4.1, with several
> enhancements.
> 
> Markus refers to 'LucidWorks Enterprise', which is LWE.
> This is a
> separate app with tools and a REST API for managing a Solr
> instance.
> 
> Lance Norskog
> 
> On Fri, Feb 11, 2011 at 8:36 AM, Markus Jelsma
> 
> wrote:
> > It is not free for production environments.
> > http://www.lucidimagination.com/lwe/subscriptions-and-pricing
> >
> > On Friday 11 February 2011 17:31:22 Greg Georges
> wrote:
> >> Hello all,
> >>
> >> I just started watching the webinars from
> Lucidworks, and they mention
> >> their distribution which has an installer, etc..
> Is there any other
> >> differences? Is it a good idea to use this free
> distribution?
> >>
> >> Greg
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
> >
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 


  


Any plan to make Field Collapsing available for distributed search?

2011-02-21 Thread Andy
Hello,

I'm looking into Field Collapsing. According to the documentation one 
limitation is that "distributed search support for result grouping has not yet 
been implemented."

Just wondered if there's any plan to add distributed search support to field 
collapsing. Or is there any technical obstacle that makes such a feature 
unlikely?

Thanks

Andy


  


How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread Andy
I have documents that contain both simplified and traditional Chinese 
characters. Is there any way to search across them? For example, if someone 
searches for 类 (simplified Chinese), I'd like to be able to recognize that the 
equivalent character is 類 in traditional Chinese and search for 类 or 類 in the 
documents. 

Is that something that Solr, or any related software, can do? Is there a 
standard approach in dealing with this problem?

Thanks.
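
One concrete option, sketched here: newer Solr releases ship an
ICUTransformFilterFactory (in the analysis-extras contrib) that can apply
ICU's Traditional-Simplified transliteration at both index and query time,
normalizing the two scripts to one canonical form. The fieldType name is
an assumption:

<fieldType name="text_zh" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>

As the reply below notes, the mapping is not one-to-one, so this is a
brute-force normalization rather than a perfect conversion.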





Re: How to handle searches across traditional and simplified Chinese?

2011-03-07 Thread Andy
Thanks. Please tell me more about the tables/software that does the conversion. 
Really appreciate your help.


--- On Mon, 3/7/11, François Schiettecatte  wrote:

> From: François Schiettecatte 
> Subject: Re: How to handle searches across traditional and simplified Chinese?
> To: solr-user@lucene.apache.org
> Date: Monday, March 7, 2011, 5:24 PM
> I did a little research into this for
> a client a while back. The character mapping is not one-to-one,
> which complicates things (TC and SC have evolved
> independently) and if you want to do a perfect job you will
> need a dictionary. However there are tables out there (I can
> dig one up for you) that allow conversion from one to the
> other. So you would pick either TC or SC as your canonical
> Chinese, and just convert all the documents and searches to
> it.
> 
> I will stress that this is very much a brute force
> approach, the mapping is not perfect and the two character
> sets have evolved (much like UK and US English, I was
> brought up in the UK and live in the US).
> 
> Hope this helps.
> 
> Cheers
> 
> François
> 
> On Mar 7, 2011, at 5:02 PM, Andy wrote:
> 
> > I have documents that contain both simplified and
> traditional Chinese characters. Is there any way to search
> across them? For example, if someone searches for 类
> (simplified Chinese), I'd like to be able to recognize that
> the equivalent character is 類 in traditional Chinese and
> search for 类 or 類 in the documents. 
> > 
> > Is that something that Solr, or any related software,
> can do? Is there a standard approach in dealing with this
> problem?
> > 
> > Thanks.
> > 
> > 
> > 
> 
> 





Re: Different options for autocomplete/autosuggestion

2011-03-14 Thread Andy
Can you provide more details? Or a link?

--- On Mon, 3/14/11, Bill Bell  wrote:

> See how Lucid Enterprise does it... A
> bit differently.
> 
> On 3/14/11 12:14 AM, "Kai Schlamp" 
> wrote:
> 
> >Hi.
> >
> >There seems to be several options for implementing an
> >autocomplete/autosuggestions feature with Solr. I am
> trying to
> >summarize those possibilities together with their
> advantages and
> >disadvantages. It would be really nice to read some of
> your opinions.
> >
> >* Using N-Gram filter + text field query
> >+ available in stable 1.4.x
> >+ results can be boosted
> >+ sorted by best matches
> >- may return duplicate results
> >
> >* Facets
> >+ available in stable 1.4.x
> >+ no duplicate entries
> >- sorted by count
> >- may need an extra N-Gram field for infix queries
> >
> >* Terms
> >+ available in stable 1.4.x
> >+ infix query by using regex in 3.x
> >- only prefix query in 1.4.x
> >- regexp may be slow (just a guess)
> >
> >* Suggestions
> >? Did not try that yet. Does it allow infix queries?
> >
> >* Field Collapsing
> >+ no duplications
> >- only available in 4.x branch
> >? Does it work together with highlighting? That would
> be a big plus.
> >
> >What are your experiences regarding
> autocomplete/autosuggestion with
> >Solr? Any additions, suggestions or corrections? What
> do you prefer?
> >
> >Kai
> 
> 
> 


  


Tokenizing Chinese & multi-language search

2011-03-15 Thread Andy
Hi,

I remember reading in this list a while ago that Solr will only tokenize on 
whitespace even when using CJKAnalyzer. That would make Solr unusable on 
Chinese or any other languages that don't use whitespace as separator.

1) I remember reading about a workaround. Unfortunately I can't find the post 
that mentioned it. Could someone give me pointers on how to address this issue?

2) Let's say I have fixed this issue and have properly analyzed and indexed the 
Chinese documents. My documents are in multiple languages. I plan to use 
separate fields for documents in different languages: text_en, text_zh, 
text_ja, text_fr, etc. Each field will be associated with the appropriate 
analyzer. 
My problem now is how to deal with the query string. I don't know what language 
the query is in, so I won't be able to select the appropriate analyzer for the 
query string. If I just use the standard analyzer on the query string, any 
query that's in Chinese won't be tokenized correctly. So would the whole system 
still work in this case?

This must be a pretty common use case, handling multi-language search. What is 
the recommended way of dealing with this problem?

Thanks.
Andy


  


Re: Tokenizing Chinese & multi-language search

2011-03-15 Thread Andy
Hi Otis,

It doesn't look like the last 2 options would work for me. So I guess my best 
bet is to ask the user to specify the language when they type in the query.

Once I get that information from the user, how do I dynamically pick an 
analyzer for the query string?

Thanks

Andy
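
In Solr the analyzer is bound to the field, so "picking an analyzer"
mostly amounts to directing the query at the right per-language field; a
sketch, assuming the text_en/text_zh/... fields described earlier:

q=text_zh:吸尘器              (user selected Chinese)
q=text_en:(vacuum cleaner)   (user selected English)

Each field's query-time analyzer then tokenizes the string appropriately.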

--- On Tue, 3/15/11, Otis Gospodnetic  wrote:

> From: Otis Gospodnetic 
> Subject: Re: Tokenizing Chinese & multi-language search
> To: solr-user@lucene.apache.org
> Date: Tuesday, March 15, 2011, 11:51 PM
> Hi Andy,
> 
> Is the "I don't know what language the query is in"
> something you could change 
> by...
> - asking the user
> - deriving from HTTP request headers
> - identifying the query language (if queries are long
> enough and "texty")
> - ...
> 
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> - Original Message 
> > From: Andy 
> > To: solr-user@lucene.apache.org
> > Sent: Tue, March 15, 2011 9:07:36 PM
> > Subject: Tokenizing Chinese & multi-language
> search
> > 
> > Hi,
> > 
> > I remember reading in this list a while ago that Solr
> will only  tokenize on 
> >whitespace even when using CJKAnalyzer. That would make
> Solr  unusable on 
> >Chinese or any other languages that don't use
> whitespace as  separator.
> > 
> > 1) I remember reading about a workaround.
> Unfortunately I  can't find the post 
> >that mentioned it. Could someone give me pointers on
> how to  address this issue?
> > 
> > 2) Let's say I have fixed this issue and have 
> properly analyzed and indexed 
> >the Chinese documents. My documents are in 
> multiple languages. I plan to use 
> >separate fields for documents in different 
> languages: text_en, text_zh, 
> >text_ja, text_fr, etc. Each field will be 
> associated with the appropriate 
> >analyzer. 
> >
> > My problem now is how to deal with  the query
> string. I don't know what 
> >language the query is in, so I won't be able  to
> select the appropriate analyzer 
> >for the query string. If I just use the  standard
> analyzer on the query string, 
> >any query that's in Chinese won't be  tokenized
> correctly. So would the whole 
> >system still work in this  case?
> > 
> > This must be a pretty common use case, handling
> multi-language  search. What is 
> >the recommended way of dealing with this 
> problem?
> > 
> > Thanks.
> > Andy
> > 
> > 
> >       
> > 
> 





What request handlers to use for query strings in Chinese or Japanese?

2011-03-16 Thread Andy
Hi,

For my Solr server, some of the query strings will be in Asian languages such 
as Chinese or Japanese. 

For such query strings, would the Standard or Dismax request handler work? My 
understanding is that both the Standard and the Dismax handler tokenize the 
query string by whitespace. And that wouldn't work for Chinese or Japanese, 
right? 

In that case, what request handler should I use? And if I need to set up custom 
request handlers for those languages, how do I do it?

Thanks.

Andy


  


Re: copyField at search time / multi-language support

2011-03-28 Thread Andy
Tom,

Could you share the method you use to perform language detection? Any open 
source tools that do that?

Thanks.

--- On Mon, 3/28/11, Tom Mortimer  wrote:

> From: Tom Mortimer 
> Subject: copyField at search time / multi-language support
> To: solr-user@lucene.apache.org
> Date: Monday, March 28, 2011, 4:45 AM
> Hi,
> 
> Here's my problem: I'm indexing a corpus with text in a
> variety of
> languages. I'm planning to detect these at index time and
> send the
> text to one of a suitably-configured field (e.g.
> "mytext_de" for
> German, "mytext_cjk" for Chinese/Japanese/Korean etc.)
> 
> At search time I want to search all of these fields.
> However, there
> will be at least 12 of them, which could lead to a very
> long query
> string. (Also I need to use the standard query parser
> rather than
> dismax, for full query syntax.)
> 
> Therefore I was wondering if there was a way to copy fields
> at search
> time, so I can have my mytext query in a single field and
> have it
> copied to mytext_de, mytext_cjk etc. Something like:
> 
>   <copyField source="mytext" dest="mytext_de" />
>   <copyField source="mytext" dest="mytext_cjk" />
>   ...
> 
> If this is not currently possible, could someone give me
> some pointers
> for hacking Solr to support it? Should I subclass
> solr.SearchHandler?
> I know nothing about Solr internals at the moment...
> 
> thanks,
> Tom
> 


   


Re: copyField at search time / multi-language support

2011-03-28 Thread Andy
Thanks Markus.

Do you know if this patch is good enough for production use? Thanks.

Andy

--- On Tue, 3/29/11, Markus Jelsma  wrote:

> From: Markus Jelsma 
> Subject: Re: copyField at search time / multi-language support
> To: solr-user@lucene.apache.org
> Cc: "Andy" 
> Date: Tuesday, March 29, 2011, 1:29 AM
> https://issues.apache.org/jira/browse/SOLR-1979
> 
> > Tom,
> > 
> > Could you share the method you use to perform language
> detection? Any open
> > source tools that do that?
> > 
> > Thanks.
> > 
> > --- On Mon, 3/28/11, Tom Mortimer 
> wrote:
> > > From: Tom Mortimer 
> > > Subject: copyField at search time /
> multi-language support
> > > To: solr-user@lucene.apache.org
> > > Date: Monday, March 28, 2011, 4:45 AM
> > > Hi,
> > > 
> > > Here's my problem: I'm indexing a corpus with
> text in a
> > > variety of
> > > languages. I'm planning to detect these at index
> time and
> > > send the
> > > text to one of a suitably-configured field (e.g.
> > > "mytext_de" for
> > > German, "mytext_cjk" for Chinese/Japanese/Korean
> etc.)
> > > 
> > > At search time I want to search all of these
> fields.
> > > However, there
> > > will be at least 12 of them, which could lead to
> a very
> > > long query
> > > string. (Also I need to use the standard query
> parser
> > > rather than
> > > dismax, for full query syntax.)
> > > 
> > > Therefore I was wondering if there was a way to
> copy fields
> > > at search
> > > time, so I can have my mytext query in a single
> field and
> > > have it
> > > copied to mytext_de, mytext_cjk etc. Something
> like:
> > > 
> > >   <copyField source="mytext" dest="mytext_de" />
> > >   <copyField source="mytext" dest="mytext_cjk" />
> > >   ...
> > > 
> > > If this is not currently possible, could someone
> give me
> > > some pointers
> > > for hacking Solr to support it? Should I
> subclass
> > > solr.SearchHandler?
> > > I know nothing about Solr internals at the
> moment...
> > > 
> > > thanks,
> > > Tom
> 





Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Andy
I can't view the document either -- it showed up empty.

Has anyone succeeded in viewing it?

Andy

--- On Fri, 4/8/11, Albert Vila  wrote:

> From: Albert Vila 
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert 
> Question)?
> To: solr-user@lucene.apache.org
> Date: Friday, April 8, 2011, 3:43 AM
> Ephraim, I still can't view the
> document.
> 
> Don't know if I'm doing something wrong, but I downloaded
> it and It
> appears to be empty.
> 
> Albert
> 
> On 7 April 2011 09:32, Ephraim Ofir 
> wrote:
> > You can't view it online, but you should be able to
> download it from:
> > https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI
> > 2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP
> >
> > Enjoy,
> > Ephraim Ofir
> >
> >
> > -Original Message-
> > From: Jens Mueller [mailto:supidupi...@googlemail.com]
> > Sent: Thursday, April 07, 2011 8:30 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Very very large scale Solr Deployment =
> how to do (Expert
> > Question)?
> >
> > Hello Ephraim, hello Lance, hello Walter,
> >
> > thanks for your replies:
> >
> > Ephraim, thanks very much for the further detailed
> explanation. I will
> > try
> > to setup a demo system in the next few days and use
> your advice.
> > LoadBalancers are an important aspect of your design.
> Can you recommend
> > one
> > LB specifically? (I would be using haproxy.1wt.eu).
> I think the Idea
> > with
> > uploading your document is very good. However
> Google-Docs seemed not to be
> > working (at least for me with the docx format?), but
> maybe you can
> > simply
> > output the document as PDF and then I think Google
> Docs is working, so
> > all
> > the others can also have a look at your concept. The
> best approach would
> > be
> > if you could upload your advice directly somewhere to
> the solr wiki as
> > it is
> > really helpful. I found some other documents meanwhile,
> but yours is much
> > clearer and more complete, with the LBs and the
> Aggregators (
> > http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)
> >
> > Lance, thanks I will have a look at what linkedin is
> doing.
> >
> > Walter, thanks for the advice: Well you are right,
> mentioning google. My
> > question was also to understand how such large systems
> like
> > google/facebook
> > are actually working. So my numbers are just
> theoretical and made up. My
> > system will be smaller,  but I would be very happy to
> understand how
> > such
> > large systems are built, and I think the approach
> Ephraim showed should be
> > working quite well at large scale. If you know of good
> documents (besides
> > the
> > bigtable research paper that I already know) that
> technically describes
> > how
> > google is working in detail that would be of great
> interest. You seem to
> > be
> > working for a company that handles large datasets.
> Does google use this
> > approach, sharding the index into N writers, and the
> produced index is
> > then
> > replicated to N "read only searchers"?
> >
> > thank you all.
> > best regards
> > jens
> >
> >
> >
> > 2011/4/7 Walter Underwood 
> >
> >> The bigger answer is that you cannot get to this
> size by just
> > configuring
> >> Solr. You may have to invent a lot of stuff. Like
> all of Google.
> >>
> >> Where did you get these numbers? The proposed
> query rate is twice as
> > big as
> >> Google (Feb 2010 estimate, 34K qps).
> >>
> >> I work at MarkLogic, and we scale to 100's of
> terabytes, with fast
> > update
> >> and query rates. If you want a real system that
> handles that, you
> > might want
> >> to look at our product.
> >>
> >> wunder
> >>
> >> On Apr 6, 2011, at 8:06 PM, Lance Norskog wrote:
> >>
> >> > I would not use replication. LinkedIn
> consumer search is a flat
> > system
> >> > where one process indexes new entries and
> does queries
> > simultaneously.
> >> > It's a custom Lucene app called Zoie. Their
> stuff is on Github..
> >> >
> >> > I would get documents to indexers via a
> multicast IP-based queueing
> >> > system. This scales very well and there's a
> lot of hardware support.
> >> >
> &

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Andy
Could anyone please post a version of the document in pdf or openoffice format? 
I'm on Linux so there's no way for me to use MS Word.

Thanks.


--- On Fri, 4/8/11, Albert Vila  wrote:

> From: Albert Vila 
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert 
> Question)?
> To: solr-user@lucene.apache.org
> Date: Friday, April 8, 2011, 9:25 AM
> Yes, It won't work if you are using
> OpenOffice. However it works fine
> with Microsoft Word.
> 
> Hope it helps.
> 
> Albert
> 
> On 8 April 2011 14:55, Andy 
> wrote:
> > I can't view the document either -- it showed up
> empty.
> >
> > Has anyone succeeded in viewing it?
> >
> > Andy
> >
> > --- On Fri, 4/8/11, Albert Vila 
> wrote:
> >
> >> From: Albert Vila 
> >> Subject: Re: Very very large scale Solr Deployment
> = how to do (Expert Question)?
> >> To: solr-user@lucene.apache.org
> >> Date: Friday, April 8, 2011, 3:43 AM
> >> Ephraim, I still can't view the
> >> document.
> >>
> >> Don't know if I'm doing something wrong, but I
> downloaded
> >> it and It
> >> appears to be empty.
> >>
> >> Albert
> >>
> >> On 7 April 2011 09:32, Ephraim Ofir 
> >> wrote:
> >> > You can't view it online, but you should be
> able to
> >> download it from:
> >> > https://docs.google.com/leaf?id=0BwOEbnJ7oeOrNmU5ZThjODUtYzM5MS00YjRlLWI
> >> >
> 2OTktZTEzNDk1YmVmOWU4&hl=en&authkey=COGel4gP
> >> >
> >> > Enjoy,
> >> > Ephraim Ofir
> >> >
> >> >
> >> > -Original Message-
> >> > From: Jens Mueller [mailto:supidupi...@googlemail.com]
> >> > Sent: Thursday, April 07, 2011 8:30 AM
> >> > To: solr-user@lucene.apache.org
> >> > Subject: Re: Very very large scale Solr
> Deployment =
> >> how to do (Expert
> >> > Question)?
> >> >
> >> > Hello Ephraim, hello Lance, hello Walter,
> >> >
> >> > thanks for your replies:
> >> >
> >> > Ephraim, thanks very much for the further
> detailed
> >> explanation. I will
> >> > try
> >> > to setup a demo system in the next few days
> and use
> >> your advice.
> >> > LoadBalancers are an important aspect of your
> design.
> >> Can you recommend
> >> > one
> >> > LB specificallly? (I would be using
> haproxy.1wt.eu) .
> >> I think the Idea
> >> > with
> >> > uploading your document is very good.
> However
> >> Google-Docs seemed not be
> >> > be
> >> > working (at least for me with the docx
> format?), but
> >> maybe you can
> >> > simply
> >> > output the document as PDF and then I think
> Google
> >> Docs is working, so
> >> > all
> >> > the others can also have a look at your
> concept. The
> >> best approach would
> >> > be
> >> > if you could upload your advice directly
> somewhere to
> >> the solr wiki as
> >> > it is
> >> > really helpful.I found some other documents
> meanwhile,
> >> but yours is much
> >> > clearer and more complete, with the LBs and
> the
> >> Aggregators (
> >> > http://lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf)
> >> >
> >> > Lance, thanks I will have a look at what
> linkedin is
> >> doing.
> >> >
> >> > Walter, thanks for the advice: Well you are
> right,
> >> mentioning google. My
> >> > question was also to understand how such
> large systems
> >> like
> >> > google/facebook
> >> > are actually working. So my numbers are just
> >> theoretical and made up. My
> >> > system will be smaller,  but I would be very
> happy to
> >> understand how
> >> > such
> >> > large systems are build and I think the
> approach
> >> Ephraim showd should be
> >> > working quite well at large scale. If you
> know a good
> >> documents (besides
> >> > the
> >> > bigtable research paper that I already know)
> that
> >> technically describes
> >> > how
> >> > google is working in detail that would be of
> great
> >> interest. You seem to
> >> > be
> >> > working for a company that handles large
&

Re: Very very large scale Solr Deployment = how to do (Expert Question)?

2011-04-08 Thread Andy
Perfect. Thank you very much.

Andy

--- On Fri, 4/8/11, Pascal Coupet  wrote:

> From: Pascal Coupet 
> Subject: Re: Very very large scale Solr Deployment = how to do (Expert 
> Question)?
> To: solr-user@lucene.apache.org
> Date: Friday, April 8, 2011, 10:20 AM
> I dit put a pdf version here:
> https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B02DHBZQYYT_MmRkZTY0YjQtODJmZS00Mzg0LWJiNTEtOWJjNzViNmNjZjdh&hl=en&authkey=CL2Fq_QG
> 
> Zoom it to get a better view.
> 
> Pascal
> 
> 2011/4/8 Andy 
> 
> > [...]

Re: Lucid Works

2011-04-08 Thread Andy

--- On Fri, 4/8/11, Andrzej Bialecki  wrote:


> :) If you don't need the new functionality in 4.x, you don't
> need the performance improvements, 

What performance improvements does 4.x have over 3.1?

> reindexing cycles are long (indexes tend to stay around)
> then 3.1 is a safer bet. If you need a dozen or so new
> exciting features (e.g. results grouping) or top
> performance, or if you need LucidWorks with Click and other
> goodies, then use 4.x and be prepared for an occasional full
> reindex.

So using 4.x would require an occasional full reindex but using 3.1 would
not? Could you explain? I thought 4.x comes with NRT indexing, so why
would a full reindex be necessary?

Thanks.

Andy



Can the Suggester be updated incrementally?

2011-04-28 Thread Andy
I'm interested in using Suggester (http://wiki.apache.org/solr/Suggester) for 
auto-complete on the field "Document Title".

Does Suggester (either FST, TST or Jaspell) support incremental updates? Say I 
want to add a new document title to the Suggester, or to change the weight of 
an existing document title, would I need to rebuild the entire tree for every 
update?

Also, can the Suggester be sharded? If the size of the tree gets bigger than 
the RAM size, is it possible to shard the Suggester across multiple machines?
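
For reference, the kind of setup I mean is roughly the wiki's example
configuration in solrconfig.xml (a sketch; the field name "title" stands
in for my Document Title field):

    <searchComponent name="suggest" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">suggest</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
        <!-- pull suggestions from a field of the main index -->
        <str name="field">title</str>
        <!-- rebuild the in-memory structure on every commit -->
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>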

Thanks
Andy


Re: Can the Suggester be updated incrementally?

2011-04-28 Thread Andy

--- On Fri, 4/29/11, Jason Rutherglen  wrote:

> It's answered on the wiki site:
> 
> "TSTLookup - ternary tree based representation, capable of
> immediate
> data structure updates"
> 

But how do you update it?

The wiki talks about getting data sources from a file or from the main
index. In either case it sounds like the entire data structure will be
rebuilt, no?
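
For example, as far as I can tell the only documented way to refresh it
is a full build command like the following (assuming the component is
exposed through a /suggest request handler with the dictionary named
"suggest"):

    http://localhost:8983/solr/suggest?q=doc&spellcheck.build=true

which walks the whole dictionary source again instead of patching the
tree in place.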


Has NRT been abandoned?

2011-05-01 Thread Andy
Hi,

I read on this mailing list previously that NRT was implemented in 4.0; it
just wasn't ready for production yet. Then I looked at the wiki
(http://wiki.apache.org/solr/NearRealtimeSearch). It lists 2 JIRA issues
related to NRT: SOLR-1308 and SOLR-1278. Both issues recently had their
resolutions set to "Won't Fix".

Does that mean NRT is no longer going to happen? What's the state of NRT in 
Solr? 

Thanks

Andy



Re: Has NRT been abandoned?

2011-05-01 Thread Andy
Nagendra,

This looks interesting. Does Solr-RA support:

1) facet
2) Boost query such as {!boost b=log(popularity)}foo
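
i.e., would a request like the following work against Solr-RA? (Standard
Solr syntax; the field names are made up.)

    http://localhost:8983/solr/select?q={!boost b=log(popularity)}foo&facet=true&facet.field=category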

Thanks
Andy

--- On Sun, 5/1/11, Nagendra Nagarajayya  wrote:

> From: Nagendra Nagarajayya 
> Subject: Re: Has NRT been abandoned?
> To: solr-user@lucene.apache.org
> Date: Sunday, May 1, 2011, 12:01 PM
> Hi Andy:
> 
> I have a solution for NRT with Solr 1.4.1. The solution
> uses the RankingAlgorithm as the search library. The NRT
> functionality allows you to add documents without the
> IndexSearchers being closed or caches being cleared. A
> commit is not needed with the document update. Searches can
> run concurrently with document updates. No changes are
> needed except for enabling the NRT through solrconfig.xml.
> The performance is about 262 TPS (document adds) on a dual-core Intel
> system with a 2GB heap, with searches running in parallel. The
> performance at the moment is limited by how fast
> IndexWriter.getReader() performs.
> 
> I have a white paper that describes NRT in details, allows
> you to download the tweets, schema and solrconfig.xml files.
> You can access the white paper from here:
> 
> http://solr-ra.tgels.com/papers/solr-ra_real_time_search.pdf
> 
> You can download Solr with RankingAlgorithm (Solr-RA) from
> here:
> 
> http://solr-ra.tgels.com
> 
> I still have not yet integrated the NRT with Solr 3.1 (the
> new release) and plan to do so very soon.
> 
> Please let me know if you need any more info.
> 
> Regards,
> 
> - Nagendra Nagarajayya
> http://solr-ra.tgels.com
> 
> On 5/1/2011 8:28 AM, Andy wrote:
> > Hi,
> > 
> > I read on this mailing list previously that NRT was
> implemented in 4.0, it just  wasn't ready for
> production yet. Then I looked at the wiki 
> (http://wiki.apache.org/solr/NearRealtimeSearch). It
> listed 2 jira issues related to NRT: SOLR-1308 and
> SOLR-1278. Both issues have their resolutions set to "Won't
> Fix" recently.
> > 
> > Does that mean NRT is no longer going to happen?
> What's the state of NRT in Solr?
> > 
> > Thanks
> > 
> > Andy
> > 
> > 
> > 
> 
>
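
For anyone curious about the mechanism mentioned above, here is a minimal
sketch of NRT through the plain Lucene 3.x API (this shows the underlying
IndexWriter.getReader() call, not Solr-RA's actual code; the index path
and field names are made up):

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class NrtSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(new File("/tmp/nrt-index"));
            IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_31,
                    new StandardAnalyzer(Version.LUCENE_31)));

            Document doc = new Document();
            doc.add(new Field("title", "hello nrt",
                Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);          // note: no commit

            // The NRT reader sees the uncommitted add without the
            // index being closed or caches being cleared.
            IndexReader reader = writer.getReader();
            IndexSearcher searcher = new IndexSearcher(reader);
            // ... search here; call writer.getReader() again after
            // further updates to see them ...
            searcher.close();
            reader.close();
            writer.close();
        }
    }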


Re: Has NRT been abandoned?

2011-05-01 Thread Andy

--- On Sun, 5/1/11, Robert Muir  wrote:

 
> Hi, I don't think it means that. Keep an eye on
> https://issues.apache.org/jira/browse/SOLR-2193; you can set yourself
> as a Watcher to receive updates.

Ah I see.

Thank you.

