Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.

I've found some references to using copyFields or NGrams but I can't quite
grasp what the whole solution would look like.

-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
wrote:

> You can use only one tokenizer per analyzer. You'd better use separate
> fields + fieldTypes for different languages.
>
> > I am looking for a clear example of using more than one tokenizer for a
> > single source field. My application has a single "body" field which until
> > recently was all Latin characters, but we're now encountering both English
> > and Japanese words in a single message. Obviously, we need to be using CJK
> > in addition to WhitespaceTokenizerFactory.
> >
> > I've found some references to using copyFields or NGrams but I can't quite
> > grasp what the whole solution would look like.
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
StandardTokenizer doesn't handle some of the tokens we need, like
@twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
Korean. Am I wrong about that?

On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir  wrote:

> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder  wrote:
> > The problem is that the field is not guaranteed to contain just a single
> > language. I'm looking for some way to pass it first through CJK, then
> > Whitespace.
> >
> > If I'm totally off-target here, is there a recommended way of dealing
> > with mixed-language fields?
> >
> >
>
> maybe you should consider a tokenizer like StandardTokenizer, that
> works reasonably well for most languages.
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
+1

That's exactly what we need, too.

On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey  wrote:

> On 11/29/2010 3:15 PM, Jacob Elder wrote:
>
>> I am looking for a clear example of using more than one tokenizer for a
>> single source field. My application has a single "body" field which until
>> recently was all Latin characters, but we're now encountering both English
>> and Japanese words in a single message. Obviously, we need to be using CJK
>> in addition to WhitespaceTokenizerFactory.
>>
>
> What I'd like to see is a CJK filter that runs after tokenization
> (whitespace in my case) and doesn't do anything but handle the CJK
> characters.  If there are no CJK characters in the token, it should do
> nothing at all.  The CJK tokenizer does a whole host of other things that I
> want to handle myself.
>
> Shawn
>
>


-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
Right. CJK text doesn't tend to have much whitespace to begin with. In the
past, we were using a patched version of StandardTokenizer which treated
@twitteruser and #hashtag better, but this became a release engineering
nightmare, so we switched to WhitespaceTokenizer.

Perhaps I could rephrase the question as follows:

Is there a literal configuration example of what this wiki article suggests:

http://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields

Further, could I then use copyFields to get those back into a single field?
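
Something like this is what I'm picturing from that wiki section; the type
and field names (text_ws, text_cjk, body_cjk) are just illustrative:

<fieldType name="text_ws" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="body" type="text_ws" indexed="true" stored="true"/>
<field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>

<copyField source="body" dest="body_cjk"/>

From what I can tell, copyField copies the raw source text before analysis,
so each destination field gets its own tokenization; the query side would
then search both fields (e.g. dismax with qf="body body_cjk") rather than
merging the analyzed tokens back into one field.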

On Mon, Nov 29, 2010 at 5:39 PM, Robert Muir  wrote:

> On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder  wrote:
> > StandardTokenizer doesn't handle some of the tokens we need, like
> > @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese
> > or Korean. Am I wrong about that?
>
> it uses the unigram method for CJK ideographs... the CJKTokenizer just
> uses the bigram method; it's just an alternative method.
>
> the whitespace tokenizer doesn't work at all for CJK, though, so give up on that!
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Dynamically change master

2010-11-30 Thread Jacob Elder
Your best bet might be to look into Lucandra:

https://github.com/tjake/Lucandra

On Tue, Nov 30, 2010 at 10:41 AM, Tommaso Teofili  wrote:

> Hi all,
>
> in a replication environment, if the host where the master is running goes
> down for some reason, is there a way to communicate to the slaves to point
> to a different (backup) master without manually changing configuration (and
> restarting the slaves or their cores)?
>
> Basically I'd like to be able to change the replication master dynamically
> inside the slaves.
>
> Do you have any idea of how this could be achieved?
>
> Thanks in advance for any help.
> Regards,
> Tommaso
>



-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir  wrote:

> (Jonathan, I apologize for emailing you twice; I meant to hit reply-all)
>
> On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind 
> wrote:
> >
> > Wait, StandardTokenizer already handles CJK and will put each CJK char
> > into its own token?  Really? I had no idea!  Is that documented
> > anywhere, or do you just have to look at the source to see it?
> >
>
> Yes, you are right, the documentation should have been more explicit: in
> previous releases it doesn't say anything about how it tokenizes CJK. But
> it does tokenize CJK this way, tagging those tokens with the "CJ" token
> type.
>
> I think the documentation issue is "fixed" in branch_3x and trunk:
>
>  * As of Lucene version 3.1, this class implements the Word Break rules
>  * from the Unicode Text Segmentation algorithm, as specified in
>  * Unicode Standard Annex #29 (http://unicode.org/reports/tr29/).
> (from
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
> )
>
> So you can read the UAX#29 report and then you know how it tokenizes text.
> You can also just use this demo app to see how the new one works:
> http://unicode.org/cldr/utility/breaks.jsp (choose "Word")
>

What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
current stable StandardTokenizer handle CJK?

-- 
Jacob Elder
@jelder
(646) 535-3379


Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir  wrote:

> On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder  wrote:
> > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> > past, we were using a patched version of StandardTokenizer which treated
> > @twitteruser and #hashtag better, but this became a release engineering
> > nightmare so we switched to Whitespace.
>
> in this case, have you considered using a CharFilter (e.g.
> MappingCharFilter) before the tokenizer?
>
> This way you could map your special things such as @ and # to some
> other string that the tokenizer doesn't split on,
> e.g. # => "HASH_".
>
> then your #foobar goes to HASH_foobar.
> If you want searches of "#foobar" to only match "#foobar" and not also
> "foobar" itself, and vice versa, you are done.
> Maybe you want searches of #foobar to only match #foobar, but searches
> of "foobar" to match both "#foobar" and "foobar".
> In this case, you would probably use a WordDelimiterFilter with
> preserveOriginal at index time only, followed by a StopFilter
> containing HASH, so you index HASH_foobar and foobar.
>
> Anyway, I think you have a lot of flexibility to reuse
> StandardTokenizer and customize things like this without maintaining
> your own tokenizer; this is the purpose of CharFilters.
>

That worked brilliantly. Thank you very much, Robert.
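
For anyone finding this thread later, here's roughly what we ended up with.
The fieldType name and the mapping file are our own (mapping-specialchars.txt
does not ship with Solr):

<fieldType name="text_social" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specialchars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

where mapping-specialchars.txt contains:

"#" => "HASH_"
"@" => "AT_"

With that in place, #foobar is indexed as HASH_foobar and @jelder as
AT_jelder, so the tokenizer never sees the # or @ characters.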

-- 
Jacob Elder
@jelder
(646) 535-3379


commitWithin question

2009-10-21 Thread Jacob Elder
Our application involves lots of live index updates with mixed priority. A
few updates are very important and need to be in the index promptly, while
we also have a great deal of updates which can be dealt with lazily.

The documentation for commitWithin leaves some room for interpretation.
Does setting commitWithin=1000 mean that only this update will be committed
within 1s, or that all pending documents will be committed within 1s?
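
To make the question concrete, here is the kind of update I mean, using the
commitWithin attribute of the XML add command (the document is just a
placeholder):

<add commitWithin="1000">
  <doc>
    <field name="id">urgent-doc-1</field>
  </doc>
</add>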


-- 
Jacob Elder


Definitive version of acts_as_solr

2009-12-11 Thread Jacob Elder
What versions of acts_as_solr are you all using?

There appear to be about a dozen forks on GitHub, including my own.
http://acts-as-solr.rubyforge.org/ has a notice that the official site is
now http://acts_as_solr.railsfreaks.com/, but *don't click that link*
because it's just a mess of pop-up ads now. It would be great to get some
consolidation and agreement from the community.

-- 
Jacob Elder


Re: shards parameter

2009-12-17 Thread Jacob Elder
If the goal is to save time when using the admin interface, you can just add
this to conf/admin-extra.html:

<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript">
google.load("prototype", "1.6");
</script>
<script type="text/javascript">
Event.observe(
    window,
    'load',
    function() {
        var elements = document.getElementsByName('queryForm');
        elements[0].insert("<input name=\"shards\" value=\"shard01,shard02\">");
    });
</script>


You will get an editable field with sensible defaults under the query box.

On Thu, Dec 17, 2009 at 4:09 PM, Yonik Seeley wrote:

> You're setting up an infinite loop by adding a shards parameter on the
> default search handler.
> Create a new search handler and put your default under that.
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Thu, Dec 17, 2009 at 7:47 AM, pcurila  wrote:
> >
> > I tried it out. But there is another issue I cannot cope with.
> > I have two shards:
> > localhost:8983/solr
> > localhost:8984/solr
> >
> > If I write this into the defaults section
> > localhost:8983/solr,localhost:8984/solr
> > and then I issue a query on localhost:8983, Solr does not respond.
> >
> > If I write this
> > localhost:8984/solr
> > it works, but there is just half of the index.
> >
> >
> >
> >
> > Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
> >>
> >> yes.
> >> put it under the "defaults" section in your standard requesthandler.
> >>
> >> On Thu, Dec 17, 2009 at 5:22 PM, pcurila  wrote:
> >>>
> >>> Hello, is there any way to configure the shards parameter in
> >>> solrconfig.xml? So I do not need to provide it in the URL. Thanks, Peter
> >>> --
> >>> View this message in context:
> >>> http://old.nabble.com/shards-parameter-tp26826908p26826908.html
> >>> Sent from the Solr - User mailing list archive at Nabble.com.
> >>>
> >>>
> >>
> >>
> >>
> >> --
> >> -
> >> Noble Paul | Systems Architect| AOL | http://aol.com
> >>
> >>
> >
> > --
> > View this message in context:
> http://old.nabble.com/shards-parameter-tp26826908p26827527.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >
>
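
For the archives, Yonik's suggestion in solrconfig.xml would look something
like this (the handler name "distrib" is arbitrary):

<requestHandler name="distrib" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">localhost:8983/solr,localhost:8984/solr</str>
  </lst>
</requestHandler>

Distributed queries then go to /select?qt=distrib, while the default
handler stays shard-free, so the sub-requests each shard makes don't fan
out again and loop.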



-- 
Jacob Elder


Getting details from <delete>

2009-06-11 Thread Jacob Elder
Hello,

Is there any way to get the number of deleted records from a delete request?

I'm sending:

<delete><query>type_i:(2 OR 3) AND creation_time_rl:[0 TO
124426080]</query></delete>

And getting:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int></lst>
</response>


This is Solr 1.3.

-- 
Jacob Elder


Re: If you could have one feature in Solr...

2010-03-25 Thread Jacob Elder
   1. Real-time or near-real-time updates.
   2. First-class spatial search.

On Wed, Feb 24, 2010 at 9:42 AM, Grant Ingersoll wrote:

> What would it be?
>



-- 
Jacob Elder