Good example of multiple tokenizers for a single field
I am looking for a clear example of using more than one tokenizer for a single source field. My application has a single "body" field which until recently was all Latin characters, but we're now encountering both English and Japanese words in a single message. Obviously, we need to be using CJK in addition to WhitespaceTokenizerFactory.

I've found some references to using copyFields or NGrams, but I can't quite grasp what the whole solution would look like.

--
Jacob Elder
@jelder
(646) 535-3379
Re: Good example of multiple tokenizers for a single field
The problem is that the field is not guaranteed to contain just a single language. I'm looking for some way to pass it first through CJK, then Whitespace.

If I'm totally off-target here, is there a recommended way of dealing with mixed-language fields?

On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma wrote:
> You can use only one tokenizer per analyzer. You'd better use separate
> fields + fieldTypes for different languages.
>
> > I am looking for a clear example of using more than one tokenizer for a
> > single source field. My application has a single "body" field which
> > until recently was all Latin characters, but we're now encountering both
> > English and Japanese words in a single message. Obviously, we need to be
> > using CJK in addition to WhitespaceTokenizerFactory.
> >
> > I've found some references to using copyFields or NGrams, but I can't
> > quite grasp what the whole solution would look like.

--
Jacob Elder
@jelder
(646) 535-3379
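A minimal sketch of what Markus is suggesting, as it might look in schema.xml — the type names and exact filter chains here are illustrative, not from his mail:

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- whitespace tokenization for the Latin-script content -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- bigram tokenization for Chinese/Japanese/Korean text -->
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>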
Re: Good example of multiple tokenizers for a single field
StandardTokenizer doesn't handle some of the tokens we need, like @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or Korean. Am I wrong about that?

On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote:
> > The problem is that the field is not guaranteed to contain just a single
> > language. I'm looking for some way to pass it first through CJK, then
> > Whitespace.
> >
> > If I'm totally off-target here, is there a recommended way of dealing
> > with mixed-language fields?
>
> maybe you should consider a tokenizer like StandardTokenizer, that
> works reasonably well for most languages.

--
Jacob Elder
@jelder
(646) 535-3379
Re: Good example of multiple tokenizers for a single field
+1 That's exactly what we need, too.

On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey wrote:
> On 11/29/2010 3:15 PM, Jacob Elder wrote:
> > I am looking for a clear example of using more than one tokenizer for a
> > single source field. My application has a single "body" field which
> > until recently was all Latin characters, but we're now encountering both
> > English and Japanese words in a single message. Obviously, we need to be
> > using CJK in addition to WhitespaceTokenizerFactory.
>
> What I'd like to see is a CJK filter that runs after tokenization
> (whitespace in my case) and doesn't do anything but handle the CJK
> characters. If there are no CJK characters in the token, it should do
> nothing at all. The CJK tokenizer does a whole host of other things that I
> want to handle myself.
>
> Shawn

--
Jacob Elder
@jelder
(646) 535-3379
Re: Good example of multiple tokenizers for a single field
Right. CJK doesn't tend to have a lot of whitespace to begin with.

In the past, we were using a patched version of StandardTokenizer which treated @twitteruser and #hashtag better, but this became a release engineering nightmare, so we switched to Whitespace.

Perhaps I could rephrase the question as follows: is there a literal configuration example of what this wiki article suggests?

http://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields

Further, could I then use copyFields to get those back into a single field?

On Mon, Nov 29, 2010 at 5:39 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder wrote:
> > StandardTokenizer doesn't handle some of the tokens we need, like
> > @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese
> > or Korean. Am I wrong about that?
>
> it uses the unigram method for CJK ideographs... the CJKTokenizer just
> uses the bigram method, it's just an alternative method.
>
> the whitespace doesn't work at all though, so give up on that!

--
Jacob Elder
@jelder
(646) 535-3379
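Assuming fieldTypes along the lines of the text_en and text_cjk sketch earlier in the thread, the wiki's "same data in multiple fields" approach boils down to something like this in schema.xml (field names are made up for illustration):

  <field name="body"     type="text_en"  indexed="true" stored="true"/>
  <field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>

  <!-- copy the raw body text into the CJK-analyzed field at index time -->
  <copyField source="body" dest="body_cjk"/>

Note that copyField copies the raw field value before analysis, so it can feed the same text into several differently-analyzed fields, but it cannot merge already-analyzed tokens back into a single field. Searches would instead target both fields, e.g. qf="body body_cjk" with the dismax handler.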
Re: Dynamically change master
Your best bet might be to look into Lucandra: https://github.com/tjake/Lucandra

On Tue, Nov 30, 2010 at 10:41 AM, Tommaso Teofili wrote:
> Hi all,
>
> in a replication environment, if the host where the master is running goes
> down for some reason, is there a way to communicate to the slaves to point
> to a different (backup) master without manually changing configuration (and
> restarting the slaves or their cores)?
>
> Basically I'd like to be able to change the replication master dynamically
> inside the slaves.
>
> Do you have any idea of how this could be achieved?
>
> Thanks in advance for any help.
> Regards,
> Tommaso

--
Jacob Elder
@jelder
(646) 535-3379
Re: Good example of multiple tokenizers for a single field
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir wrote:
> (Jonathan, I apologize for emailing you twice, i meant to hit reply-all)
>
> On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind wrote:
> >
> > Wait, StandardTokenizer already handles CJK and will put each CJK char
> > into its own token? Really? I had no idea! Is that documented anywhere,
> > or do you just have to look at the source to see it?
>
> Yes, you are right, the documentation should have been more explicit:
> in previous releases it doesn't say anything about how it tokenizes
> CJK in the documentation. But it does do them this way, and tags
> them with the "CJ" token type.
>
> I think the documentation issue is "fixed" in branch_3x and trunk:
>
> * As of Lucene version 3.1, this class implements the Word Break rules
> * from the Unicode Text Segmentation algorithm, as specified in
> * Unicode Standard Annex #29 (http://unicode.org/reports/tr29/).
>
> (from http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)
>
> So you can read the UAX#29 report and then you know how it tokenizes text.
> You can also just use this demo app to see how the new one works:
> http://unicode.org/cldr/utility/breaks.jsp (choose "Word")

What does this mean for those of us on Solr 1.4 and Lucene 2.9.3? Does the current stable StandardTokenizer handle CJK?

--
Jacob Elder
@jelder
(646) 535-3379
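One quick way to check the behavior on any given Solr version is to define a bare field type around StandardTokenizerFactory and paste some mixed English/Japanese text into the admin analysis page; the type name below is just for illustration:

  <fieldType name="text_std_test" class="solr.TextField">
    <analyzer>
      <!-- no filters: shows exactly what the tokenizer emits -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
  </fieldType>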
Re: Good example of multiple tokenizers for a single field
On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir wrote:
> On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote:
> > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> > past, we were using a patched version of StandardTokenizer which treated
> > @twitteruser and #hashtag better, but this became a release engineering
> > nightmare so we switched to Whitespace.
>
> in this case, have you considered using a CharFilter (e.g.
> MappingCharFilter) before the tokenizer?
>
> This way you could map your special things such as @ and # to some
> other string that the tokenizer doesn't split on,
> e.g. # => "HASH_".
>
> then your #foobar goes to HASH_foobar.
> If you want searches of "#foobar" to only match "#foobar" and not also
> "foobar" itself, and vice versa, you are done.
> Maybe you want searches of #foobar to only match #foobar, but searches
> of "foobar" to match both "#foobar" and "foobar".
> In this case, you would probably use a WordDelimiterFilter with
> preserveOriginal at index time only, followed by a StopFilter
> containing HASH, so you index HASH_foobar and foobar.
>
> anyway i think you have a lot of flexibility to reuse
> StandardTokenizer but customize things like this without maintaining
> your own tokenizer; this is the purpose of CharFilters.

That worked brilliantly. Thank you very much, Robert.

--
Jacob Elder
@jelder
(646) 535-3379
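For anyone finding this thread later, a sketch of the charFilter half of Robert's suggestion — the mapping file name and the HASH_/AT_ prefixes are just examples. In schema.xml:

  <fieldType name="text_tweet" class="solr.TextField">
    <analyzer>
      <!-- rewrite # and @ before tokenization so they survive the tokenizer -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

And in conf/mapping-specials.txt:

  "#" => "HASH_"
  "@" => "AT_"

The WordDelimiterFilter + StopFilter variant Robert describes would then be layered into the index-time analyzer only.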
commitWithin question
Our application involves lots of live index updates with mixed priority. A few updates are very important and need to be in the index promptly, while we also have a great deal of updates which can be dealt with lazily.

The documentation for commitWithin leaves some room for interpretation. Does setting commitWithin=1000 mean that only this update will be committed within 1s, or that all pending documents will be committed within 1s?

--
Jacob Elder
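For concreteness, this is the kind of request I mean — commitWithin set as an attribute on the add command (the document itself is made up):

  <add commitWithin="1000">
    <doc>
      <field name="id">12345</field>
      <field name="body">high-priority update</field>
    </doc>
  </add>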
Definitive version of acts_as_solr
What versions of acts_as_solr are you all using? There appears to be about a dozen forks on GitHub, including my own.

http://acts-as-solr.rubyforge.org/ has a notice that the official site is now http://acts_as_solr.railsfreaks.com/, but *don't click that link* because it's just a mess of pop-up ads now.

It would be great to get some consolidation and agreement from the community.

--
Jacob Elder
Re: shards parameter
If the goal is to save time when using the admin interface, you can just add this to conf/admin-extra.html:

  <script type="text/javascript" src="http://www.google.com/jsapi"></script>
  <script type="text/javascript">
    google.load("prototype", "1.6");
    Event.observe(window, 'load', function() {
      var elements = document.getElementsByName('queryForm');
      elements[0].insert("<input name=\"shards\" value=\"shard01,shard02\">");
    });
  </script>

You will get an editable field with sensible defaults under the query box.

On Thu, Dec 17, 2009 at 4:09 PM, Yonik Seeley wrote:
> You're setting up an infinite loop by adding a shards parameter on the
> default search handler.
> Create a new search handler and put your defaults under that.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Thu, Dec 17, 2009 at 7:47 AM, pcurila wrote:
> >
> > I tried it out. But there is another issue I cannot cope with.
> > I have two shards:
> > localhost:8983/solr
> > localhost:8984/solr
> >
> > If I put this into the defaults section:
> > localhost:8983/solr,localhost:8984/solr
> > and then issue a query on localhost:8983, Solr does not respond.
> >
> > If I put this:
> > localhost:8984/solr
> > it works, but there is just half of the index.
> >
> > Noble Paul നോബിള്‍ नोब्ळ्-2 wrote:
> > >
> > > yes.
> > > put it under the "defaults" section in your standard requesthandler.
> > >
> > > On Thu, Dec 17, 2009 at 5:22 PM, pcurila wrote:
> > > >
> > > > Hello, is there any way to configure the shards parameter in
> > > > solrconfig.xml, so I do not need to provide it in the URL?
> > > > Thanks, Peter
> > >
> > > --
> > > Noble Paul | Systems Architect | AOL | http://aol.com

--
Jacob Elder
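Yonik's fix would look roughly like this in solrconfig.xml (the handler name and shard hosts are placeholders): leave the default handler alone and put the shards list in a separate handler, so the distributed sub-requests hit the default handler without a shards default and cannot loop.

  <requestHandler name="/distrib" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="shards">localhost:8983/solr,localhost:8984/solr</str>
    </lst>
  </requestHandler>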
Getting details from <delete>
Hello,

Is there any way to get the number of deleted records from a delete request? I'm sending:

  <delete><query>type_i:(2 OR 3) AND creation_time_rl:[0 TO 124426080]</query></delete>

And getting back only the status and QTime:

  <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">2</int></lst>
  </response>

This is Solr 1.3.

--
Jacob Elder
Re: If you could have one feature in Solr...
1. Real-time or near-real-time updates.
2. First-class spatial search.

On Wed, Feb 24, 2010 at 9:42 AM, Grant Ingersoll wrote:
> What would it be?

--
Jacob Elder