Tokenizing and searching named character entity references
Greetings: I am working with many different data sources. Some sources employ "entity references"; others do not. My goal is to make searching across sources as consistent as possible.

Example text:

Source1: weakening Hδ absorption
Source1: zero-field gap ω
Source2: weakening H delta absorption
Source2: zero-field gap omega

Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory for Source1, the named character entity reference is replaced with the character itself. This works great. But I want the search tokens to be identical for each source: I need to capture δ as a token that matches Source2's "delta". Is this possible with the Solr-supplied tokenizers? I experimented with different combinations and orders and was not successful. Is this possible using synonyms? I also experimented with this route, but again was not successful. Do I need to create a custom tokenizer?

Thanks
Frances
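In case it helps clarify what I am after, here is a plain-Java sketch of the normalization I want to happen somewhere in the analysis chain. GreekNormalizer is a hypothetical helper of my own, not a Solr class; it just maps Greek characters to their spelled-out names before tokenizing:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical pre-analysis helper, not a Solr component: maps Greek
    // letters to their spelled-out names so both sources tokenize alike.
    public class GreekNormalizer {
        private static final Map<Character, String> GREEK_NAMES =
            new HashMap<Character, String>();
        static {
            GREEK_NAMES.put(Character.valueOf('\u03B4'), "delta"); // δ
            GREEK_NAMES.put(Character.valueOf('\u03C9'), "omega"); // ω
            // extend with the rest of the Greek alphabet as needed
        }

        public static String normalize(String text) {
            StringBuilder sb = new StringBuilder(text.length());
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                String name = GREEK_NAMES.get(Character.valueOf(c));
                if (name == null) {
                    sb.append(c);
                } else {
                    // surround with spaces so "Hδ" becomes "H delta"
                    sb.append(' ').append(name).append(' ');
                }
            }
            return sb.toString().trim();
        }

        public static void main(String[] args) {
            // prints: weakening H delta  absorption
            // (a whitespace tokenizer collapses the doubled space)
            System.out.println(normalize("weakening H\u03B4 absorption"));
        }
    }

Applied at both index and query time, something like this would make Source1 and Source2 produce identical tokens.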
Letter-number transitions - can this be turned off
Is there a flag to disable the letter-number transition in solr.WordDelimiterFilterFactory? We are indexing category codes and thesaurus codes for which this letter-number transition makes no sense. It is bloating the index (which is already large). An illustration of the splitting behavior follows below.

Thanks
F Knudson
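To make the bloat concrete, this is roughly what splitting on letter-number transitions does to a code. A plain-Java stand-in for illustration only - the regex mimics the transition rule, it is not WordDelimiterFilterFactory itself, and "G06F17" is a made-up code:

    import java.util.Arrays;

    public class TransitionSplitDemo {
        public static void main(String[] args) {
            String code = "G06F17"; // made-up category code
            // split wherever a letter is followed by a digit or vice versa,
            // roughly what WDF's letter-number transition does
            String[] parts = code.split(
                "(?<=\\p{Alpha})(?=\\p{Digit})|(?<=\\p{Digit})(?=\\p{Alpha})");
            System.out.println(Arrays.toString(parts)); // [G, 06, F, 17]
        }
    }

Four tokens (plus the catenated original, depending on settings) where one would do, multiplied across millions of codes.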
Re: Letter-number transitions - can this be turned off
Thanks for your helpful suggestions. I have considered other analyzers, but WDF has great strengths. I will experiment with maintaining transitions and then consider modifying the code.

F. Knudson

Mike Klaas wrote:
>
> On 30-Sep-07, at 12:47 PM, F Knudson wrote:
>
>> Is there a flag to disable the letter-number transition in the
>> solr.WordDelimiterFilterFactory? We are indexing category codes,
>> thesaurus codes for which this letter-number transition makes no
>> sense. It is bloating the index (which is already large).
>
> Have you considered using a different analyzer?
>
> If you want to continue using WDF, you could make a quick change
> around line 320:
>
>   if (splitOnCaseChange == 0 &&
>       (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
>     // ALPHA->ALPHA: always ignore if case isn't considered.
>   } else if ((lastType & UPPER)!=0 && (type & LOWER)!=0) {
>     // UPPER->LOWER: Don't split
>   } else {
>     ...
>
> by adding a clause that catches ALPHA -> NUMERIC (and vice versa) and
> ignores it.
>
> Another approach that I am using locally is to maintain the
> transitions, but force tokens to be a minimum size (so r2d2 doesn't
> tokenize to four tokens but arrrdeee does).
>
> There is a patch here: http://issues.apache.org/jira/browse/SOLR-293
>
> If you vote for it, I promise to get it in for 1.3
>
> -Mike
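For concreteness, the extra clause Mike describes might look something like the fragment below. This is a sketch only, written in the same lastType/type bitmask style as the snippet above; the actual constant for the digit type (DIGIT here) and the surrounding code may differ in your copy of WordDelimiterFilter:

    } else if (((lastType & ALPHA) != 0 && (type & DIGIT) != 0) ||
               ((lastType & DIGIT) != 0 && (type & ALPHA) != 0)) {
      // ALPHA->DIGIT or DIGIT->ALPHA: treat as no transition, so a
      // category code like "G06F17" survives as a single token
    }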
Optimization taking days/weeks
Optimization time on our Solr index has turned into days/weeks. We are using Solr 1.2. We use one box to build/optimize indexes; this index is copied to another box for searching purposes. We welcome suggestions/comments, etc. We are a bit stumped on this. Details are below.

Box details
Proc: 8 Dual Core 2.6GHz
Mem: 32 GB
OS: Red Hat Enterprise Linux 4
Kernel: 2.6.9-55.0.12.ELlargesmp

These are details from the index currently in use. Search response time is very acceptable (searchers are very happy).
Optimization time - 10433 (12/11/07)
index size - 229486464
# of records - 84960570

index directory
flasher# ls -l
total 229486464
-rw-r--r--  1 flknud  staff  22197926593 Dec 12 08:07 _2bl6.fdt
-rw-r--r--  1 flknud  staff    679684560 Dec 12 08:20 _2bl6.fdx
-rw-r--r--  1 flknud  staff          208 Dec 12 08:23 _2bl6.fnm
-rw-r--r--  1 flknud  staff  40176405625 Dec 12 09:28 _2bl6.frq
-rw-r--r--  1 flknud  staff    594723994 Dec 12 09:41 _2bl6.nrm
-rw-r--r--  1 flknud  staff  47616340310 Dec 12 12:07 _2bl6.prx
-rw-r--r--  1 flknud  staff     76708079 Dec 12 12:25 _2bl6.tii
-rw-r--r--  1 flknud  staff   6154384415 Dec 12 12:42 _2bl6.tis
-rw-r--r--  1 flknud  staff           20 Dec 12 12:48 segments.gen
-rw-r--r--  1 flknud  staff           44 Dec 12 12:48 segments_2c64

current directory listing
indexed new records - Jan 22 and Jan 27
# of records - 85032470
optimization time - 558188

There were no out-of-memory errors. There was 800961792 KB of space left in the directory. The files were not collapsed as expected; there are still files dated Jan 22 and Jan 27.

A new optimization was started Feb. 11 and continues. Below is a snapshot of the index directory.

We have at least another million records to add, plus weekly updates of approximately 103K records. We are using the direct indexing method.
java settings used - java -Xmx1024M -Xms1024M

The files continue to grow, so work is progressing.
snapshot 2/21/08
-bash-3.00$ ls -ltr
total 205396680
-rw-r--r--  1 flknud  users          208 Jan 10 07:15 _2bm7.fnm
-rw-r--r--  1 flknud  users  22202159522 Jan 10 08:09 _2bm7.fdt
-rw-r--r--  1 flknud  users    679819760 Jan 10 08:09 _2bm7.fdx
-rw-r--r--  1 flknud  users  40184944027 Jan 16 18:16 _2bm7.frq
-rw-r--r--  1 flknud  users  47626230575 Jan 16 18:16 _2bm7.prx
-rw-r--r--  1 flknud  users   6155230704 Jan 16 18:16 _2bm7.tis
-rw-r--r--  1 flknud  users     76704158 Jan 16 18:16 _2bm7.tii
-rw-r--r--  1 flknud  users    594842294 Jan 16 18:18 _2bm7.nrm
-rw-r--r--  1 flknud  users          208 Jan 22 08:57 _2bpa.fnm
-rw-r--r--  1 flknud  users     10806426 Jan 22 08:57 _2bpa.fdt
-rw-r--r--  1 flknud  users       371200 Jan 22 08:57 _2bpa.fdx
-rw-r--r--  1 flknud  users     21114330 Jan 22 08:57 _2bpa.frq
-rw-r--r--  1 flknud  users     25683573 Jan 22 08:57 _2bpa.prx
-rw-r--r--  1 flknud  users      9225592 Jan 22 08:57 _2bpa.tis
-rw-r--r--  1 flknud  users       118660 Jan 22 08:57 _2bpa.tii
-rw-r--r--  1 flknud  users       324804 Jan 22 08:57 _2bpa.nrm
-rw-r--r--  1 flknud  users          198 Jan 22 09:00 _2bpl.fnm
-rw-r--r--  1 flknud  users      1335931 Jan 22 09:00 _2bpl.fdt
-rw-r--r--  1 flknud  users        36800 Jan 22 09:00 _2bpl.fdx
-rw-r--r--  1 flknud  users      2646708 Jan 22 09:00 _2bpl.frq
-rw-r--r--  1 flknud  users      3781824 Jan 22 09:00 _2bpl.prx
-rw-r--r--  1 flknud  users      1429176 Jan 22 09:00 _2bpl.tis
-rw-r--r--  1 flknud  users        18582 Jan 22 09:00 _2bpl.tii
-rw-r--r--  1 flknud  users        32204 Jan 22 09:00 _2bpl.nrm
-rw-r--r--  1 flknud  users          198 Jan 22 09:01 _2bpm.fnm
-rw-r--r--  1 flknud  users       121716 Jan 22 09:01 _2bpm.fdt
-rw-r--r--  1 flknud  users         3200 Jan 22 09:01 _2bpm.fdx
-rw-r--r--  1 flknud  users       205961 Jan 22 09:01 _2bpm.frq
-rw-r--r--  1 flknud  users       302114 Jan 22 09:01 _2bpm.prx
-rw-r--r--  1 flknud  users       233641 Jan 22 09:01 _2bpm.tis
-rw-r--r--  1 flknud  users         3036 Jan 22 09:01 _2bpm.tii
-rw-r--r--  1 flknud  users         2804 Jan 22 09:01 _2bpm.nrm
-rw-r--r--  1 flknud  users          198 Jan 27 14:00 _2bpn.fnm
-rw-r--r--  1 flknud  users       227962 Jan 27 14:00 _2bpn.fdt
-rw-r--r--  1 flknud  users         7200 Jan 27 14:00 _2bpn.fdx
-rw-r--r--  1 flknud  users       437798 Jan 27 14:00 _2bpn.frq
-rw-r--r--  1 flknud  users       593858 Jan 27 14:00 _2bpn.prx
-rw-r--r--  1 flknud  users       516031 Jan 27 14:00 _2bpn.tis
-rw-r--r--  1 flknud  users         6814 Jan 27 14:00 _2bpn.tii
-rw-r--r--  1 flknud  users         6304 Jan 27 14:00 _2bpn.nrm
-rw-r--r--  1 flknud  users          198 Jan 27 14:01 _2bpo.fnm
-rw-r--r--  1 flknud  users       231456 Jan 27 14:01 _2bpo.fdt
-rw-r--r--  1 flknud  users         7200 Jan 27 14:01 _2bpo.fdx
-rw-r--r--  1 flknud  users       448401 Jan 27 14:01 _2bpo.frq
-rw-r--r--  1 flknud  users       616557 Jan 27 14:01 _2bpo.prx
-rw-r--r--  1 flknud  users       587697 Jan 27 14:01 _2bpo.tis
-rw-r--r--  1 flknud  users         7801 Jan 27 14:01 _2bpo.tii
-rw-r--r--  1 flknud  users         6304 Jan 27 14:01 _2bpo.nrm
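For anyone wanting to reproduce the step in question: an optimize can be triggered by posting an <optimize/> message to Solr's update handler. A minimal self-contained sketch (the host, port, and path are placeholders; adjust to your own instance):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class OptimizeIndex {
        public static void main(String[] args) throws Exception {
            // placeholder URL: point this at your own Solr instance
            URL url = new URL("http://localhost:8983/solr/update");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");

            // send the optimize command as an XML update message
            OutputStream out = conn.getOutputStream();
            out.write("<optimize/>".getBytes("UTF-8"));
            out.close();

            // drain the response so the request completes
            InputStream in = conn.getInputStream();
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) { /* discard */ }
            in.close();
            System.out.println("optimize returned HTTP " + conn.getResponseCode());
        }
    }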
RE: Optimization taking days/weeks
We will review the Java settings. The current settings are a bit low - but the indexer typically does not reach even 50% of the allocated 1024MB max heap.

Yes, the index is large - only 3 fields are stored - and I have set the positionIncrementGap to 50 (down from 100) in an attempt to reduce index size.

Would you suggest building one index used only for searching and a second index used only for display? Does that fit within your definition of "partition"?

Thanks
Frances

Alex Benjamen wrote:
>
> This sounds too familiar...
>
>> java settings used - java -Xmx1024M -Xms1024M
>
> Sounds like your settings are pretty low... if you're using a 64-bit JVM,
> you should be able to set these much higher, maybe give it like 8gb.
>
> Another thing, you may want to look at reducing the index size... is there
> any way you could partition the index? Also, only index fields which you
> need, and do not store the values in the index. I originally had an index
> which was 50Gb in size, and after removing fields I do not need, I'm down
> to 8Gb and not storing any values in the index.
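For anyone following along, the "index but don't store" advice corresponds, at the Lucene level underneath Solr, to something like the sketch below (written against the Lucene 2.x Field API; in Solr itself the same effect comes from the indexed/stored attributes on the field definitions in schema.xml). The field names are illustrative only:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FieldStorageDemo {
        public static Document makeDoc(String body, String id) {
            Document doc = new Document();
            // searchable but not retrievable: tokenized, no stored value,
            // so it adds nothing to the .fdt stored-fields file
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
            // a small identifier we actually need back at display time
            doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
            return doc;
        }
    }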
Re: Optimization taking days/weeks
We are a bit concerned regarding the index size. At least no response (so far) has indicated that the size is unmanageable.

We killed the process - we will move to Java 6 - and will use vmstat to monitor the new optimization process.

At what index size would you begin to worry? Or is it a combination of index size, optimization time, and response time? We are data rich here!

Thanks
Frances

Otis Gospodnetic wrote:
>
> That's a tiny little index there ;) Circa 100GB?
>
> What do you see if you run vmstat 2 while the optimization is happening?
> Non-idle CPU? A pile of IO? Is there a reason for such a small heap on a
> machine with 32GB of RAM?
>
> Otis
>
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message
>> From: F Knudson <[EMAIL PROTECTED]>
>> To: solr-user@lucene.apache.org
>> Sent: Thursday, February 28, 2008 9:54:50 AM
>> Subject: Optimization taking days/weeks
>>
>> [original post and directory listing quoted in full - snipped]
Re: Optimization taking days/weeks
Yes indeed - it was spending all of its time in garbage collection. We will be moving to Java 6. Thanks for your suggestion.

Frances

Yonik Seeley wrote:
>
> Have you checked if this is due to running out of heap memory?
> When that happens, the garbage collector can start taking a lot of CPU.
> If you are using a Java 6 JVM, it should have management enabled by
> default and you should be able to connect to it via jconsole and
> check.
>
> -Yonik
>
> On Thu, Feb 28, 2008 at 9:54 AM, F Knudson <[EMAIL PROTECTED]> wrote:
>>
>> [original post and directory listing quoted in full - snipped]
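For others hitting the same wall: besides jconsole, you can confirm GC thrash from inside the JVM with the standard java.lang.management API (Java 5+). A small self-contained sketch, not Solr code:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.List;

    public class GcReport {
        public static void main(String[] args) {
            // print cumulative collection counts and time for each collector;
            // sample this periodically - if timeMs climbs as fast as wall-clock
            // time, the JVM is spending most of its time collecting
            List<GarbageCollectorMXBean> gcs =
                ManagementFactory.getGarbageCollectorMXBeans();
            for (GarbageCollectorMXBean gc : gcs) {
                System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
            }
        }
    }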