Oops, I am sorry - I thought that solr/lib referred to solrhome/lib.

I just tested this, and it seems that you have successfully increased the max
token length. You can verify this on the analysis.jsp page.

Despite analysis.jsp's output, it seems that some other mechanism is
preventing this huge token from being indexed. The response of
http://localhost:8983/solr/terms?terms.fl=body
does not contain that huge token.

If you are interested only in prefix queries, as a workaround you can use
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
at index time. Then the query (without the star)
solr/select?q=body:big will return that document.
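A minimal sketch of such a fieldType (the field type name and gram sizes here are illustrative, not from your schema). The EdgeNGram filter is applied only in the index-time analyzer, so queries are matched against the stored prefixes without being expanded themselves:

```xml
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <!-- index time: expand each token into its leading prefixes, e.g.
       "big" -> "b", "bi", "big" (up to maxGramSize characters) -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <!-- query time: no gram expansion, so q=body:big matches the stored prefix -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that this inflates the index, since every token is stored once per prefix length up to maxGramSize.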

By the way, for this particular task you don't need to edit the lucene/solr
distribution at all. You can do it with the standard pre-compiled solr.war by
using the class below and putting its jar into the SolrHome/lib directory.

package foo.solr.analysis;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

import java.io.Reader;

/**
 * A StandardTokenizer factory that lifts the default 255-character
 * token length limit by raising it to Integer.MAX_VALUE.
 */
public class CustomStandardTokenizerFactory extends BaseTokenizerFactory {
  public StandardTokenizer create(Reader input) {
    final StandardTokenizer tokenizer = new StandardTokenizer(input);
    tokenizer.setMaxTokenLength(Integer.MAX_VALUE);
    return tokenizer;
  }
}
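To build and deploy it, something like this should work (the paths and jar names below are assumptions for a stock Solr 1.4.1 example setup - adjust them to your layout):

```shell
# compile against the lucene-core and solr jars that ship with the distribution
javac -cp "apache-solr-1.4.1/lib/lucene-core-2.9.3.jar:apache-solr-1.4.1/dist/apache-solr-core-1.4.1.jar" \
      foo/solr/analysis/CustomStandardTokenizerFactory.java

# package the class and drop the jar into SolrHome/lib, then restart Solr
jar cf custom-tokenizer.jar foo/solr/analysis/CustomStandardTokenizerFactory.class
mkdir -p solr/example/solr/lib
cp custom-tokenizer.jar solr/example/solr/lib/
```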

<fieldType name="text_block" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="foo.solr.analysis.CustomStandardTokenizerFactory" />
  </analyzer>
</fieldType>

--- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:

> From: Sergey Bartunov <sbos....@gmail.com>
> Subject: Re: How to index long words with StandardTokenizerFactory?
> To: solr-user@lucene.apache.org
> Date: Saturday, October 23, 2010, 6:01 PM
> This is exactly what I did. Look:
>
> >> >> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by my
> >> >> lucene-core-2.9.3-dev.jar that I'd just compiled
> >> >> 4) then I do "ant compile" and "ant dist" in solr folder
> >> >> 5) after that I recompile solr/example/webapps/solr.war
>
> On 23 October 2010 18:53, Ahmet Arslan <iori...@yahoo.com> wrote:
> > I think you should replace your new lucene-core-2.9.3-dev.jar in
> > \apache-solr-1.4.1\lib and then create a new solr.war under
> > \apache-solr-1.4.1\dist. And copy this new solr.war to
> > solr/example/webapps/solr.war
> >
> > --- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:
> >
> >> From: Sergey Bartunov <sbos....@gmail.com>
> >> Subject: Re: How to index long words with StandardTokenizerFactory?
> >> To: solr-user@lucene.apache.org
> >> Date: Saturday, October 23, 2010, 5:45 PM
> >> Yes. I did. Won't help.
> >>
> >> On 23 October 2010 17:45, Ahmet Arslan <iori...@yahoo.com> wrote:
> >> > Did you delete the folder Jetty_0_0_0_0_8983_solr.war_** under
> >> > apache-solr-1.4.1\example\work?
> >> >
> >> > --- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:
> >> >
> >> >> From: Sergey Bartunov <sbos....@gmail.com>
> >> >> Subject: Re: How to index long words with StandardTokenizerFactory?
> >> >> To: solr-user@lucene.apache.org
> >> >> Date: Saturday, October 23, 2010, 3:56 PM
> >> >> Here are all the files: http://rghost.net/3016862
> >> >>
> >> >> 1) StandardAnalyzer.java, StandardTokenizer.java - patched files
> >> >> from lucene-2.9.3
> >> >> 2) I patch these files and build lucene by typing "ant"
> >> >> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by my
> >> >> lucene-core-2.9.3-dev.jar that I'd just compiled
> >> >> 4) then I do "ant compile" and "ant dist" in solr folder
> >> >> 5) after that I recompile solr/example/webapps/solr.war with my
> >> >> new solr and lucene-core jars
> >> >> 6) I put my schema.xml in solr/example/solr/conf/
> >> >> 7) then I do "java -jar start.jar" in solr/example
> >> >> 8) index big_post.xml
> >> >> 9) trying to find this document by "curl
> >> >> http://localhost:8983/solr/select?q=body:big*"
> >> >> (big_post.xml contains a long word bigaaaaa...aaaa)
> >> >> 10) solr returns nothing
> >> >>
> >> >> On 23 October 2010 02:43, Steven A Rowe <sar...@syr.edu> wrote:
> >> >> > Hi Sergey,
> >> >> >
> >> >> > What does your ~34kb field value look like?  Does
> >> >> > StandardTokenizer think it's just one token?
> >> >> >
> >> >> > What doesn't work?  What happens?
> >> >> >
> >> >> > Steve
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: Sergey Bartunov [mailto:sbos....@gmail.com]
> >> >> >> Sent: Friday, October 22, 2010 3:18 PM
> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> Subject: Re: How to index long words with
> >> >> >> StandardTokenizerFactory?
> >> >> >>
> >> >> >> I'm using Solr 1.4.1. Now I've succeeded with replacing the
> >> >> >> lucene-core jar, but maxTokenLength seems to be used in a very
> >> >> >> strange way. Currently for me it's set to 1024*1024, but I
> >> >> >> couldn't index a field with a size of just ~34kb. I understand
> >> >> >> that it's a little weird to index such big data, but I just
> >> >> >> want to know why it doesn't work.
> >> >> >>
> >> >> >> On 22 October 2010 20:36, Steven A Rowe <sar...@syr.edu> wrote:
> >> >> >> > Hi Sergey,
> >> >> >> >
> >> >> >> > I've opened an issue to add a maxTokenLength param to the
> >> >> >> > StandardTokenizerFactory configuration:
> >> >> >> >
> >> >> >> >        https://issues.apache.org/jira/browse/SOLR-2188
> >> >> >> >
> >> >> >> > I'll work on it this weekend.
> >> >> >> >
> >> >> >> > Are you using Solr 1.4.1?  I ask because of your mention of
> >> >> >> > Lucene 2.9.3.  I'm not sure there will ever be a Solr 1.4.2
> >> >> >> > release.  I plan on targeting Solr 3.1 and 4.0 for the
> >> >> >> > SOLR-2188 fix.
> >> >> >> >
> >> >> >> > I'm not sure why you didn't get the results you wanted with
> >> >> >> > your Lucene hack - is it possible you have other Lucene jars
> >> >> >> > in your Solr classpath?
> >> >> >> >
> >> >> >> > Steve
> >> >> >> >
> >> >> >> >> -----Original Message-----
> >> >> >> >> From: Sergey Bartunov [mailto:sbos....@gmail.com]
> >> >> >> >> Sent: Friday, October 22, 2010 12:08 PM
> >> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> >> Subject: How to index long words with
> >> >> >> >> StandardTokenizerFactory?
> >> >> >> >>
> >> >> >> >> I'm trying to force solr to index words whose length is more
> >> >> >> >> than 255 symbols (this constant is DEFAULT_MAX_TOKEN_LENGTH
> >> >> >> >> in lucene StandardAnalyzer.java) using
> >> >> >> >> StandardTokenizerFactory as a 'filter' tag in the schema
> >> >> >> >> configuration XML. Specifying the maxTokenLength attribute
> >> >> >> >> won't work.
> >> >> >> >>
> >> >> >> >> I'd tried a dirty hack: I downloaded the lucene-core-2.9.3
> >> >> >> >> src, changed DEFAULT_MAX_TOKEN_LENGTH to 1000000, built it
> >> >> >> >> to a jar and replaced the original lucene-core jar in
> >> >> >> >> solr/lib. But it seems that had no effect.