Tokenizing and searching named character entity references

2008-07-24 Thread F Knudson

Greetings:

I am working with many different data sources - some sources employ "entity
references"; others do not.  My goal is to make searching across the
sources as consistent as possible.

Example text - 

Source1:   weakening Hδ absorption
Source1:   zero-field gap ω

Source2:  weakening H delta absorption
Source2:  zero-field gap omega

Using the tokenizer solr.HTMLStripWhitespaceTokenizerFactory for Source1,
the entity reference is replaced with the named character it denotes.

This works great.

But I want the search tokens to be identical for each source.  I need to
capture δ as a token.

[The fieldtype/analyzer configuration from schema.xml that accompanied this
message was stripped by the list archive.]
Is this possible with the Solr-supplied tokenizers?  I experimented with
different combinations and orderings but was not successful.

Is this possible using synonyms?  I also experimented with this route but
again was not successful.

Do I need to create a custom tokenizer?
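[Editor's note: the goal described above - making "Hδ" and "H delta" produce the same search tokens - can be sketched in plain Java. This is an illustration of what a custom TokenFilter or synonym mapping would accomplish, not Solr API code; the class name and mapping table are the editor's, and only two Greek letters are mapped here.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: replace Greek characters (as produced from entity
// references by the HTML-stripping tokenizer) with their spelled-out
// names, so both sources yield the same tokens. Not Solr code.
public class GreekNormalizer {
    private static final Map<Character, String> NAMES = new HashMap<>();
    static {
        NAMES.put('δ', "delta");
        NAMES.put('ω', "omega");
        // ...extend with the rest of the Greek alphabet as needed
    }

    public static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            String name = NAMES.get(c);
            if (name != null) {
                // surround with spaces so the name becomes its own token
                out.append(' ').append(name).append(' ');
            } else {
                out.append(c);
            }
        }
        // trim and collapse any doubled spaces introduced above
        return out.toString().trim().replaceAll("\\s+", " ");
    }
}
```

With this, both "weakening Hδ absorption" and "weakening H delta absorption" normalize to the same token stream.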

Thanks
Frances
-- 
View this message in context: 
http://www.nabble.com/Tokenizing-and-searching-named-character-entity-references-tp18632403p18632403.html
Sent from the Solr - User mailing list archive at Nabble.com.



Letter-number transitions - can this be turned off

2007-09-30 Thread F Knudson

Is there a flag to disable the letter-number transition in
solr.WordDelimiterFilterFactory?  We are indexing category codes and
thesaurus codes for which this letter-number transition makes no sense.  It
is bloating the index (which is already large).
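[Editor's note: for readers unfamiliar with the behavior being discussed, here is a rough illustration in plain Java of what a letter-number split does to an opaque code. This is not the actual WordDelimiterFilterFactory source, and the code value "B23K" is a made-up example.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: split a token at every letter<->digit transition,
// the way the filter's letter-number transition does. For an opaque
// category code, each fragment becomes an extra indexed term.
public class LetterNumberSplit {
    private static final Pattern RUN = Pattern.compile("\\p{Alpha}+|\\p{Digit}+");

    public static List<String> split(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RUN.matcher(code);
        while (m.find()) {
            tokens.add(m.group());   // each run of letters or digits
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("B23K"));   // [B, 23, K]
    }
}
```

Each code contributes several fragment terms to the index, which is the bloat described above.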

Thanks
F Knudson
-- 
View this message in context: 
http://www.nabble.com/Letter-number-transitions---can-this-be-turned-off-tf4544769.html#a12969359
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Letter-number transitions - can this be turned off

2007-10-02 Thread F Knudson

Thanks for your helpful suggestions.

I have considered other analyzers but WDF has great strengths.  I will
experiment with maintaining transitions and then consider modifying the
code.

F. Knudson


Mike Klaas wrote:
> 
> On 30-Sep-07, at 12:47 PM, F Knudson wrote:
> 
>>
>> Is there a flag to disable the letter-number transition in the
>> solr.WordDelimiterFilterFactory?  We are indexing category codes,  
>> thesaurus
>> codes for which this letter number transition makes no sense.  It is
>> bloating the indexing (which is already large).
> 
> Have you considered using a different analyzer?
> 
> If you want to continue using WDF, you could make a quick change
> around line 320:
> 
>  if (splitOnCaseChange == 0 &&
>  (lastType & ALPHA) != 0 && (type & ALPHA) != 0) {
>// ALPHA->ALPHA: always ignore if case isn't considered.
> 
>  } else if ((lastType & UPPER)!=0 && (type & LOWER)!=0) {
>// UPPER->LOWER: Don't split
>  } else {
> 
>   ...
> 
> by adding a clause that catches ALPHA -> NUMERIC (and vice versa) and  
> ignores it.
> 
> Another approach that I am using locally is to maintain the  
> transitions, but force tokens to be a minimum size (so r2d2 doesn't  
> tokenize to four tokens but arrrdeee does).
> 
> There is a patch here: http://issues.apache.org/jira/browse/SOLR-293
> 
> If you vote for it, I promise to get it in for 1.3 
> 
> -Mike
> 
> 
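[Editor's note: Mike's minimum-size approach can be sketched as follows. These are assumed semantics for illustration - the actual SOLr-293 patch may differ - and the class name is the editor's.]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the minimum-subword-size idea: split on letter/number
// transitions, but if any resulting subword is shorter than minSize,
// keep the original token whole instead. So "r2d2" survives intact
// while "abc123" still splits.
public class MinSizeSplit {
    private static final Pattern RUN = Pattern.compile("\\p{Alpha}+|\\p{Digit}+");

    public static List<String> split(String token, int minSize) {
        List<String> parts = new ArrayList<>();
        Matcher m = RUN.matcher(token);
        while (m.find()) {
            parts.add(m.group());
        }
        for (String p : parts) {
            if (p.length() < minSize) {
                // a too-small subword: fall back to the whole token
                List<String> whole = new ArrayList<>();
                whole.add(token);
                return whole;
            }
        }
        return parts;
    }
}
```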

-- 
View this message in context: 
http://www.nabble.com/Letter-number-transitions---can-this-be-turned-off-tf4544769.html#a13003019
Sent from the Solr - User mailing list archive at Nabble.com.



Optimization taking days/weeks

2008-02-28 Thread F Knudson

Optimization time on our Solr index has turned into days/weeks.
We are using Solr 1.2.
We use one box to build/optimize indexes.  This index is copied to another
box for searching purposes.
We welcome suggestions/comments, etc.  We are a bit stumped on this.
Details are below.

Box details
Proc: 8 Dual Core 2.6GHz
Mem: 32 GB
OS: Red Hat Linux Enterprise 4
Kernel: 2.6.9-55.0.12.ELlargesmp

These are details from the index currently in use.  Search response time is
very acceptable (searchers are very happy).
Optimization time - 10433  (12/11/07)
index size - 229486464
# of records - 84960570 
index directory
flasher# ls -l
total 229486464
-rw-r--r--   1 flknud   staff  22197926593 Dec 12 08:07 _2bl6.fdt
-rw-r--r--   1 flknud   staff    679684560 Dec 12 08:20 _2bl6.fdx
-rw-r--r--   1 flknud   staff          208 Dec 12 08:23 _2bl6.fnm
-rw-r--r--   1 flknud   staff  40176405625 Dec 12 09:28 _2bl6.frq
-rw-r--r--   1 flknud   staff    594723994 Dec 12 09:41 _2bl6.nrm
-rw-r--r--   1 flknud   staff  47616340310 Dec 12 12:07 _2bl6.prx
-rw-r--r--   1 flknud   staff     76708079 Dec 12 12:25 _2bl6.tii
-rw-r--r--   1 flknud   staff   6154384415 Dec 12 12:42 _2bl6.tis
-rw-r--r--   1 flknud   staff           20 Dec 12 12:48 segments.gen
-rw-r--r--   1 flknud   staff           44 Dec 12 12:48 segments_2c64
--

current directory listing
indexed new records - Jan 22 and Jan 27
# of records - 85032470
optimization time - 558188

There were no out-of-memory errors.  There was 800961792 KB left in the
directory.  The files were not collapsed as expected.  There are still
files dated Jan 22 and Jan 27.

A new optimization was started Feb. 11 and continues.
This is a snapshot of the index directory.

We have at least another million records to add, plus weekly updates of
approximately 103K records.
We are using the direct indexing method.
Java settings used: java -Xmx1024M -Xms1024M

The files continue to grow so work is progressing.
snapshot 2/21/08
-bash-3.00$ ls -ltr
total 205396680
-rw-r--r--  1 flknud users         208 Jan 10 07:15 _2bm7.fnm
-rw-r--r--  1 flknud users 22202159522 Jan 10 08:09 _2bm7.fdt
-rw-r--r--  1 flknud users   679819760 Jan 10 08:09 _2bm7.fdx
-rw-r--r--  1 flknud users 40184944027 Jan 16 18:16 _2bm7.frq
-rw-r--r--  1 flknud users 47626230575 Jan 16 18:16 _2bm7.prx
-rw-r--r--  1 flknud users  6155230704 Jan 16 18:16 _2bm7.tis
-rw-r--r--  1 flknud users    76704158 Jan 16 18:16 _2bm7.tii
-rw-r--r--  1 flknud users   594842294 Jan 16 18:18 _2bm7.nrm
-rw-r--r--  1 flknud users         208 Jan 22 08:57 _2bpa.fnm
-rw-r--r--  1 flknud users    10806426 Jan 22 08:57 _2bpa.fdt
-rw-r--r--  1 flknud users      371200 Jan 22 08:57 _2bpa.fdx
-rw-r--r--  1 flknud users    21114330 Jan 22 08:57 _2bpa.frq
-rw-r--r--  1 flknud users    25683573 Jan 22 08:57 _2bpa.prx
-rw-r--r--  1 flknud users     9225592 Jan 22 08:57 _2bpa.tis
-rw-r--r--  1 flknud users      118660 Jan 22 08:57 _2bpa.tii
-rw-r--r--  1 flknud users      324804 Jan 22 08:57 _2bpa.nrm
-rw-r--r--  1 flknud users         198 Jan 22 09:00 _2bpl.fnm
-rw-r--r--  1 flknud users     1335931 Jan 22 09:00 _2bpl.fdt
-rw-r--r--  1 flknud users       36800 Jan 22 09:00 _2bpl.fdx
-rw-r--r--  1 flknud users     2646708 Jan 22 09:00 _2bpl.frq
-rw-r--r--  1 flknud users     3781824 Jan 22 09:00 _2bpl.prx
-rw-r--r--  1 flknud users     1429176 Jan 22 09:00 _2bpl.tis
-rw-r--r--  1 flknud users       18582 Jan 22 09:00 _2bpl.tii
-rw-r--r--  1 flknud users       32204 Jan 22 09:00 _2bpl.nrm
-rw-r--r--  1 flknud users         198 Jan 22 09:01 _2bpm.fnm
-rw-r--r--  1 flknud users      121716 Jan 22 09:01 _2bpm.fdt
-rw-r--r--  1 flknud users        3200 Jan 22 09:01 _2bpm.fdx
-rw-r--r--  1 flknud users      205961 Jan 22 09:01 _2bpm.frq
-rw-r--r--  1 flknud users      302114 Jan 22 09:01 _2bpm.prx
-rw-r--r--  1 flknud users      233641 Jan 22 09:01 _2bpm.tis
-rw-r--r--  1 flknud users        3036 Jan 22 09:01 _2bpm.tii
-rw-r--r--  1 flknud users        2804 Jan 22 09:01 _2bpm.nrm
-rw-r--r--  1 flknud users         198 Jan 27 14:00 _2bpn.fnm
-rw-r--r--  1 flknud users      227962 Jan 27 14:00 _2bpn.fdt
-rw-r--r--  1 flknud users        7200 Jan 27 14:00 _2bpn.fdx
-rw-r--r--  1 flknud users      437798 Jan 27 14:00 _2bpn.frq
-rw-r--r--  1 flknud users      593858 Jan 27 14:00 _2bpn.prx
-rw-r--r--  1 flknud users      516031 Jan 27 14:00 _2bpn.tis
-rw-r--r--  1 flknud users        6814 Jan 27 14:00 _2bpn.tii
-rw-r--r--  1 flknud users        6304 Jan 27 14:00 _2bpn.nrm
-rw-r--r--  1 flknud users         198 Jan 27 14:01 _2bpo.fnm
-rw-r--r--  1 flknud users      231456 Jan 27 14:01 _2bpo.fdt
-rw-r--r--  1 flknud users        7200 Jan 27 14:01 _2bpo.fdx
-rw-r--r--  1 flknud users      448401 Jan 27 14:01 _2bpo.frq
-rw-r--r--  1 flknud users      616557 Jan 27 14:01 _2bpo.prx
-rw-r--r--  1 flknud users      587697 Jan 27 14:01 _2bpo.tis
-rw-r--r--  1 flknud users        7801 Jan 27 14:01 _2bpo.tii
-rw-r--r--  1 flknud users        6304 Jan 27 14:01 _2bpo.nrm

RE: Optimization taking days/weeks

2008-02-29 Thread F Knudson

We will review the Java settings.  The current settings are a bit low - but
the indexer typically does not reach even 50% of the allocated 1024 MB max
heap.
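[Editor's note: the heap-usage observation above can be confirmed from inside the JVM rather than by external monitoring; a minimal sketch, with the class name the editor's own and not part of Solr.]

```java
// Snapshot of JVM heap usage via the Runtime API. Values vary run to run,
// so this only gives a rough confirmation of how much of -Xmx is in use.
public class HeapReport {
    public static String report() {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb  = rt.maxMemory() / (1024 * 1024);
        return "heap used " + usedMb + " MB of " + maxMb + " MB max";
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```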

Yes, the index is large - only 3 fields are stored - and I have set the
positionIncrementGap to 50 (down from 100) in an attempt to reduce index
size.  Would you suggest building one index used only for searching and a
second index used only for display?  Does that fit within your definition of
"partition"?

Thanks
Frances


Alex Benjamen wrote:
> 
> This sounds too familiar... 
> 
>>java settings used - java -Xmx1024M -Xms1024M  
> Sounds like your settings are pretty low... if you're using a 64-bit JVM,
> you should be able to set these much higher - maybe give it something
> like 8 GB.
> 
> Another thing: you may want to look at reducing the index size... is
> there any way you could partition the index?  Also, only index the fields
> you need, and do not store the values in the index.  I originally had an
> index which was 50 GB in size, and after removing fields I do not need,
> I'm down to 8 GB and not storing any values in the index.
>  
>  
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Optimization-taking-days-weeks-tp15738090p15762156.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Optimization taking days/weeks

2008-02-29 Thread F Knudson

We are a bit concerned regarding the index size.  At least no response (so
far) has indicated that the size is unmanageable.  We killed the process -
will move to Java 6 - and will use vmstat to monitor the new optimization
process.
At what index size would you begin to worry?  Or is it a combination of
index size, optimization time, and response time?
We are data rich here!
Thanks
Frances


Otis Gospodnetic wrote:
> 
> That's a tiny little index there ;)  Circa 100GB?
>  
> What do you see if you run vmstat 2 while the optimization is happening?
> Non-idle CPU?  A pile of IO?  Is there a reason for such a small heap on a
> machine with 32GB of RAM?
> 
> Otis
> 
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> ----- Original Message 
>> From: F Knudson <[EMAIL PROTECTED]>
>> To: solr-user@lucene.apache.org
>> Sent: Thursday, February 28, 2008 9:54:50 AM
>> Subject: Optimization taking days/weeks
>> 
>> 
>> [original message quoted in full; trimmed here - see the first post in
>> this thread]

Re: Optimization taking days/weeks

2008-02-29 Thread F Knudson

Yes indeed - it was spending all of its time in garbage collection.  We will
be moving to Java6.
Thanks for your suggestion.

Frances
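[Editor's note: Yonik's suggestion below to check via jconsole can also be done programmatically with java.lang.management (available since Java 5). A minimal sketch - the class name is the editor's, not Solr code.]

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Report cumulative garbage-collection time for this JVM. If this
// number dominates wall-clock time during an optimize, the heap is
// too small for the job.
public class GcReport {
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();   // -1 if unavailable
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("time spent in GC so far: " + totalGcMillis() + " ms");
    }
}
```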


Yonik Seeley wrote:
> 
> Have you checked if this is due to running out of heap memory?
> When that happens, the garbage collector can start taking a lot of CPU.
> If you are using a Java6 JVM, it should have management enabled by
> default and you should be able to connect to it via jconsole and
> check.
> 
> -Yonik
> 
> On Thu, Feb 28, 2008 at 9:54 AM, F Knudson <[EMAIL PROTECTED]> wrote:
>>
>>  [original message quoted in full; trimmed here - see the first post in
>>  this thread]