Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?

2015-09-23 Thread Rich Cariens
For what it's worth, we've had good luck using the ICUTokenizer and associated filters. A native Chinese speaker here at the office gave us an enthusiastic thumbs up on our Chinese search results. Your mileage may vary of course. On Wed, Sep 23, 2015 at 11:04 AM, Erick Erickson wrote: > In a wor

Re: Implementing custom analyzer for multi-language stemming

2014-08-06 Thread Rich Cariens
the language detection tool do its best and not sweat it. On Wed, Aug 6, 2014 at 12:11 AM, TK wrote: > > On 8/5/14, 8:36 AM, Rich Cariens wrote: > >> Of course this is extremely primitive and basic, but I think it would be >> possible to write a CharFilter or Tok

Re: Implementing custom analyzer for multi-language stemming

2014-08-05 Thread Rich Cariens
I've started a GitHub project to try out some cross-lingual analysis ideas ( https://github.com/whateverdood/cross-lingual-search). I haven't played over there for about 3 months, but plan on restarting work there shortly. In a nutshell, the interesting component ("SimplePolyGlotStemmingTokenFilter

Re: MMapDirectory failed to map a 23G compound index segment

2011-09-30 Thread Rich Cariens
ept for OS system wide / per-process limits imposed) > you should be able to mmap up to the full 64 bit address space. > > Your virtual memory is unlimited (from "ulimit" output), so that's good. > > Mike McCandless > > http://blog.mikemccandless.com > > On

Re: MMapDirectory failed to map a 23G compound index segment

2011-09-12 Thread Rich Cariens
11 at 8:58 PM, Lance Norskog wrote: > > > Do you need to use the compound format? > > > > On Thu, Sep 8, 2011 at 3:57 PM, Rich Cariens >wrote: > > > >> I should add some more context: > >> > >> 1. the problem index included several cfs s

Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
open files ulimit. Do the MultiMMapIndexInput ByteBuffer arrays each consume a file handle/descriptor? On Thu, Sep 8, 2011 at 5:19 PM, Rich Cariens wrote: > FWiW I optimized the index down to a single segment and now I have no > trouble opening an MMapDirectory on that index, even though

Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
FWiW I optimized the index down to a single segment and now I have no trouble opening an MMapDirectory on that index, even though the 23G cfx segment file remains. On Thu, Sep 8, 2011 at 4:27 PM, Rich Cariens wrote: > Thanks for the response. "free -g" reports: > >

Re: MMapDirectory failed to map a 23G compound index segment

2011-09-08 Thread Rich Cariens
iettecatte > My memory of this is a little rusty but isn't mmap also limited by mem + > swap on the box? What does 'free -g' report? > > François > > On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote: > > > Ahoy ahoy! > > > > I've run

MMapDirectory failed to map a 23G compound index segment

2011-09-07 Thread Rich Cariens
Ahoy ahoy! I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound index segment file. The stack trace looks pretty much like every other trace I've found when searching for OOM & "map failed"[1]. My configuration follows: Solr 1.4.1/Lucene 2.9.3 (plus SOLR-1969

Re: SSD experience

2011-08-22 Thread Rich Cariens
ut a 40% boost in performance on our tests with no changes > > except the disk. > > > > On Mon, Aug 22, 2011 at 10:54 AM, Rich Cariens >wrote: > > > >> Ahoy ahoy! > >> > >> Does anyone have any experiences or stories they can share with

SSD experience

2011-08-22 Thread Rich Cariens
Ahoy ahoy! Does anyone have any experiences or stories they can share with the list about how SSDs impacted search performance for better or worse? I found a Lucene SSD performance benchmark doc

Re: how to enable MMapDirectory in solr 1.4?

2011-08-08 Thread Rich Cariens
We patched our 1.4.1 build with SOLR-1969 (making MMapDirectory configurable) and realized a 64% search performance boost on our Linux hosts. On Mon, Aug 8, 2011 at 10:05 AM, Dyer, James wrote: > If you want to try MMapDirectory with Solr 1.4, then

Re: document storage

2011-05-13 Thread Rich Cariens
We've decided to store the original document in both Solr and external repositories. This is to support the following: 1. highlighting - We need to mark-up the entire document with hit-terms. However if this was the only reason to store the text I'd seriously consider calling out to the e

Re: Guidance for event-driven indexing

2011-02-15 Thread Rich Cariens
; and if you want to apply an UpdateChain, that would look like this: > > > > myPipeline > > > > See http://wiki.apache.org/solr/SolrRequestHandler for details > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com &g

Re: Guidance for event-driven indexing

2011-02-15 Thread Rich Cariens
o choose where :) > > A JMSUpdateHandler sounds heavy weight, but does not need to be, and might > be the logically best place for such a feature imo. > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > > On 14. feb. 2011, at 17.42, Rich Carien

Re: Guidance for event-driven indexing

2011-02-14 Thread Rich Cariens
itect > Cominvent AS - www.cominvent.com > > On 14. feb. 2011, at 16.53, Rich Cariens wrote: > > > Hello, > > > > I've built a system that receives JMS events containing links to docs > that I > > must download and index. Right now the JMS receiving, downloa

Guidance for event-driven indexing

2011-02-14 Thread Rich Cariens
Hello, I've built a system that receives JMS events containing links to docs that I must download and index. Right now the JMS receiving, downloading, and transformation into SolrInputDoc's happens in a separate JVM that then uses Solrj javabin HTTP POSTs to distribute these docs across many index

Re: Full text hit term highlighting

2010-12-05 Thread Rich Cariens
This works, as long as you don't need query > highlighting." Have you found a way around that, or have you decided not to > use highlighting after all? Or am I missing something? > ____ > From: Rich Cariens [richcari...@gmail.com]

Re: Full text hit term highlighting

2010-12-05 Thread Rich Cariens
e next of > any of the terms. > > On Sat, Dec 4, 2010 at 4:10 PM, Rich Cariens > wrote: > > Anyone ever use Solr to present a view of a document with hit-terms > > highlighted within? Kind of like Google's cached <http://bit.ly/hgudWq > >copies? > > > > > > -- > Lance Norskog > goks...@gmail.com >

Full text hit term highlighting

2010-12-04 Thread Rich Cariens
Anyone ever use Solr to present a view of a document with hit-terms highlighted within? Kind of like Google's cached copies?

Re: Optimize Index

2010-11-04 Thread Rich Cariens
For what it's worth, the Solr class instructor at the Lucene Revolution conference recommended *against* optimizing, and instead suggested just letting the merge factor do its job. On Thu, Nov 4, 2010 at 2:55 PM, Shawn Heisey wrote: > On 11/4/2010 7:22 AM, stockiii wrote: > >> how can i start an

Re: StreamingUpdateSolrServer hangs

2010-04-16 Thread Rich Cariens
I experienced the hang described with the Solr 1.4.0 build. Yonik - I also thought the streaming updater was blocking on commits but updates never resumed. To be honest I was in a bit of a rush to meet a deadline so after spending a day or so tinkering I bailed out and just wrote a component by h

Re: Index "transaction log" or equivalent?

2010-04-08 Thread Rich Cariens
Thanks Mark. That's sort of what I was thinking of doing. On Thu, Apr 8, 2010 at 10:33 AM, Mark Miller wrote: > On 04/08/2010 09:23 AM, Rich Cariens wrote: > >> Are there any best practices or built-in support for keeping track of >> what's >> been inde

Index "transaction log" or equivalent?

2010-04-08 Thread Rich Cariens
Are there any best practices or built-in support for keeping track of what's been indexed in a Solr application so as to support a full rebuild? I'm not indexing from a single source, but from many, sometimes arbitrary, sources including: 1. A document repository that fires events (containing

Re: an OR filter query

2010-04-04 Thread Rich Cariens
Why not just make your "mature:false" filter query a default value instead of always appended? I.e.: -snip- mature:false -snip- That way if someone wants mature items in their results the search client explicitly sets "fq=mature:*" or whatever. Would that work? On Sun, Apr 4, 2010 at
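
To make the default-versus-appended distinction concrete, here is a minimal sketch in Python. It is only a model of the behaviour described in the message, not Solr code: the function name effective_fq and its parameters are hypothetical. It assumes a default value is used only when the request carries no fq of its own, while an appended value is always added on top of whatever the client sends.

```python
def effective_fq(request_fq, defaults=(), appends=()):
    """Hypothetical model of how filter queries could be combined.

    defaults: used only when the request supplies no fq of its own.
    appends:  always added on top of whatever the request supplies.
    """
    base = list(request_fq) if request_fq else list(defaults)
    return base + list(appends)


# Client sends no fq: the default filter applies.
as_default = effective_fq([], defaults=["mature:false"])

# Client explicitly asks for mature items: the default is replaced.
overridden = effective_fq(["mature:*"], defaults=["mature:false"])

# Same request against an appended filter: both filters apply,
# so the client cannot opt back in to mature items.
as_append = effective_fq(["mature:*"], appends=["mature:false"])

print(as_default, overridden, as_append)
```

Under these assumptions only the default configuration lets a client opt in to mature results, which is the point of the suggestion.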

Re: Experience with indexing billions of documents?

2010-04-02 Thread Rich Cariens
A colleague of mine is using native Lucene + some home-grown patches/optimizations to index over 13B small documents in a 32-shard environment, which is around 406M docs per shard. If there's a 2B doc id limitation in Lucene then I assume he's patched it himself. On Fri, Apr 2, 2010 at 1:17 PM,
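
A quick check of the arithmetic in the message, as a sketch in Python. It assumes the documents are spread evenly and that each shard is its own Lucene index; the constant names are made up for illustration. Lucene document ids are Java ints, so the ceiling of roughly 2.1 billion applies per index rather than to the whole collection.

```python
TOTAL_DOCS = 13_000_000_000   # "over 13B small documents"
SHARDS = 32                   # "32-shard environment"
JAVA_INT_MAX = 2**31 - 1      # upper bound on a Java int doc id

# Even split across shards (an assumption; real shards are rarely exact).
docs_per_shard = TOTAL_DOCS // SHARDS

# Does one shard stay under the per-index doc id ceiling?
fits_in_one_index = docs_per_shard < JAVA_INT_MAX

print(docs_per_shard, fits_in_one_index)
```

If those assumptions hold, each shard sits well below the per-index limit, so sharding alone may be enough and no patch to the id space would be needed.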