I did it just as you recommended. Solr indexes files of around 15kb, but no
more. The effect was the same with the patched constants.
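For reference, a minimal standalone sketch of the tokenizer behaviour under
discussion, assuming lucene-core 2.9.x on the classpath (the demo class name
is made up for illustration): StandardTokenizer silently *skips* tokens longer
than maxTokenLength (default 255) rather than truncating them, which is why an
over-long word never reaches the index at all.

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MaxTokenLengthDemo {
    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder("big");
        for (int i = 0; i < 500; i++) sb.append('a'); // one 503-char "word"
        String text = "small " + sb + " words";

        for (int max : new int[] { 255, 1000000 }) {
            StandardTokenizer ts = new StandardTokenizer(new StringReader(text));
            ts.setMaxTokenLength(max);
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            while (ts.incrementToken()) {
                // with max=255 only "small" and "words" appear;
                // with max=1000000 the 503-char token is emitted as well
                System.out.println("max=" + max + ": "
                        + term.term().length() + " chars");
            }
            ts.close();
        }
    }
}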
On 24 October 2010 01:29, Ahmet Arslan <iori...@yahoo.com> wrote:
> Oops, I am sorry. I thought that solr/lib refers to solrhome/lib.
>
> I just tested this, and it seems that you have successfully increased the
> max token length. You can verify this via the analysis.jsp page.
>
> Despite what analysis.jsp shows, though, it seems that some other mechanism
> is preventing this huge token from being indexed. The response of
> http://localhost:8983/solr/terms?terms.fl=body does not contain that huge
> token.
>
> If you are interested only in prefix queries, as a workaround you can use
> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
> at index time. Then the query (without the star) solr/select?q=body:big
> will return that document.
>
> By the way, for this particular task you don't need to edit the lucene/solr
> distro at all. You can use the class below with the standard pre-compiled
> solr.war by putting its jar into the SolrHome/lib directory.
>
> package foo.solr.analysis;
>
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.solr.analysis.BaseTokenizerFactory;
> import java.io.Reader;
>
> public class CustomStandardTokenizerFactory extends BaseTokenizerFactory {
>     public StandardTokenizer create(Reader input) {
>         final StandardTokenizer tokenizer = new StandardTokenizer(input);
>         // lift the 255-character default limit
>         tokenizer.setMaxTokenLength(Integer.MAX_VALUE);
>         return tokenizer;
>     }
> }
>
> <fieldType name="text_block" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="foo.solr.analysis.CustomStandardTokenizerFactory" />
>   </analyzer>
> </fieldType>
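A Lucene-level sketch of the EdgeNGramFilterFactory workaround suggested
above, assuming the contrib analyzers jar (lucene-analyzers-2.9.x) is also on
the classpath; the demo class name is made up. Note that the tokenizer must
still emit the long token first (e.g. via the factory above) before the
filter can cut its front n-grams of 1..25 characters, which is what lets the
non-wildcard query body:big match a document containing "bigaaa...".

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class EdgeNGramDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer source =
                new StandardTokenizer(new StringReader("bigword"));
        // front-edge n-grams, mirroring minGramSize=1 / maxGramSize=25
        TokenStream ts = new EdgeNGramTokenFilter(
                source, EdgeNGramTokenFilter.Side.FRONT, 1, 25);
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term()); // b, bi, big, bigw, ...
        }
        ts.close();
    }
}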
> --- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:
>
>> From: Sergey Bartunov <sbos....@gmail.com>
>> Subject: Re: How to index long words with StandardTokenizerFactory?
>> To: solr-user@lucene.apache.org
>> Date: Saturday, October 23, 2010, 6:01 PM
>>
>> This is exactly what I did. Look:
>>
>>> 3) I replace lucene-core-2.9.3.jar in solr/lib/ with my
>>>    lucene-core-2.9.3-dev.jar that I'd just compiled
>>> 4) then I do "ant compile" and "ant dist" in the solr folder
>>> 5) after that I rebuild solr/example/webapps/solr.war
>>
>> On 23 October 2010 18:53, Ahmet Arslan <iori...@yahoo.com> wrote:
>>> I think you should put your new lucene-core-2.9.3-dev.jar in
>>> \apache-solr-1.4.1\lib and then create a new solr.war under
>>> \apache-solr-1.4.1\dist. Then copy this new solr.war to
>>> solr/example/webapps/solr.war.
>>>
>>> --- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:
>>>
>>>> From: Sergey Bartunov <sbos....@gmail.com>
>>>> Subject: Re: How to index long words with StandardTokenizerFactory?
>>>> To: solr-user@lucene.apache.org
>>>> Date: Saturday, October 23, 2010, 5:45 PM
>>>>
>>>> Yes, I did. It won't help.
>>>>
>>>> On 23 October 2010 17:45, Ahmet Arslan <iori...@yahoo.com> wrote:
>>>>> Did you delete the folder Jetty_0_0_0_0_8983_solr.war_** under
>>>>> apache-solr-1.4.1\example\work?
>>>>>
>>>>> --- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:
>>>>>
>>>>>> From: Sergey Bartunov <sbos....@gmail.com>
>>>>>> Subject: Re: How to index long words with StandardTokenizerFactory?
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Date: Saturday, October 23, 2010, 3:56 PM
>>>>>>
>>>>>> Here are all the files: http://rghost.net/3016862
>>>>>>
>>>>>> 1) StandardAnalyzer.java, StandardTokenizer.java - patched files
>>>>>>    from lucene-2.9.3
>>>>>> 2) I patch these files and build lucene by typing "ant"
>>>>>> 3) I replace lucene-core-2.9.3.jar in solr/lib/ with my
>>>>>>    lucene-core-2.9.3-dev.jar that I'd just compiled
>>>>>> 4) then I do "ant compile" and "ant dist" in the solr folder
>>>>>> 5) after that I rebuild solr/example/webapps/solr.war with my new
>>>>>>    solr and lucene-core jars
>>>>>> 6) I put my schema.xml in solr/example/solr/conf/
>>>>>> 7) then I do "java -jar start.jar" in solr/example
>>>>>> 8) I index big_post.xml
>>>>>> 9) I try to find this document with "curl
>>>>>>    http://localhost:8983/solr/select?q=body:big*" (big_post.xml
>>>>>>    contains a long word bigaaaaa...aaaa)
>>>>>> 10) solr returns nothing
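One way to chase down why a patched jar "had no effect" (and Steve's
classpath question below) is to print, under the same classpath as the Solr
webapp, which lucene-core jar the JVM actually loaded and its compiled-in
default. A small diagnostic sketch, assuming Lucene 2.9.x class and field
names (the demo class name is made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class WhichLucene {
    public static void main(String[] args) {
        // the jar this class was actually loaded from
        System.out.println(StandardAnalyzer.class
                .getProtectionDomain().getCodeSource().getLocation());
        // 255 means the stock jar; a patched jar would report its new value
        System.out.println("DEFAULT_MAX_TOKEN_LENGTH = "
                + StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
    }
}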
>>>>>> On 23 October 2010 02:43, Steven A Rowe <sar...@syr.edu> wrote:
>>>>>>> Hi Sergey,
>>>>>>>
>>>>>>> What does your ~34kb field value look like? Does StandardTokenizer
>>>>>>> think it's just one token?
>>>>>>>
>>>>>>> What doesn't work? What happens?
>>>>>>>
>>>>>>> Steve
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Sergey Bartunov [mailto:sbos....@gmail.com]
>>>>>>>> Sent: Friday, October 22, 2010 3:18 PM
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Re: How to index long words with StandardTokenizerFactory?
>>>>>>>>
>>>>>>>> I'm using Solr 1.4.1. I've now succeeded in replacing the
>>>>>>>> lucene-core jar, but the max token length seems to be applied in a
>>>>>>>> very strange way. Currently it's set to 1024*1024 for me, but I
>>>>>>>> couldn't index a field of just ~34kb. I understand that it's a
>>>>>>>> little weird to index such big data, but I just want to understand
>>>>>>>> why it doesn't work.
>>>>>>>>
>>>>>>>> On 22 October 2010 20:36, Steven A Rowe <sar...@syr.edu> wrote:
>>>>>>>>> Hi Sergey,
>>>>>>>>>
>>>>>>>>> I've opened an issue to add a maxTokenLength param to the
>>>>>>>>> StandardTokenizerFactory configuration:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-2188
>>>>>>>>>
>>>>>>>>> I'll work on it this weekend.
>>>>>>>>>
>>>>>>>>> Are you using Solr 1.4.1? I ask because of your mention of Lucene
>>>>>>>>> 2.9.3. I'm not sure there will ever be a Solr 1.4.2 release. I
>>>>>>>>> plan on targeting Solr 3.1 and 4.0 for the SOLR-2188 fix.
>>>>>>>>>
>>>>>>>>> I'm not sure why you didn't get the results you wanted with your
>>>>>>>>> Lucene hack - is it possible you have other Lucene jars in your
>>>>>>>>> Solr classpath?
>>>>>>>>>
>>>>>>>>> Steve
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Sergey Bartunov [mailto:sbos....@gmail.com]
>>>>>>>>>> Sent: Friday, October 22, 2010 12:08 PM
>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>> Subject: How to index long words with StandardTokenizerFactory?
>>>>>>>>>>
>>>>>>>>>> I'm trying to force Solr to index words whose length exceeds 255
>>>>>>>>>> characters (this constant is DEFAULT_MAX_TOKEN_LENGTH in Lucene's
>>>>>>>>>> StandardAnalyzer.java) using StandardTokenizerFactory in the
>>>>>>>>>> schema configuration XML. Specifying the maxTokenLength attribute
>>>>>>>>>> doesn't work.
>>>>>>>>>>
>>>>>>>>>> I tried a dirty hack: I downloaded the lucene-core-2.9.3 source,
>>>>>>>>>> changed DEFAULT_MAX_TOKEN_LENGTH to 1000000, built it into a jar,
>>>>>>>>>> and replaced the original lucene-core jar in solr/lib. But it
>>>>>>>>>> seems to have had no effect.
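As a closing illustration, a smoke test of the CustomStandardTokenizerFactory
posted earlier in the thread: a sketch assuming the foo.solr.analysis class
plus the solr-core and lucene-core 2.9.x jars are on the classpath (the test
class name is made up). It feeds the tokenizer a 300+ character word and
confirms the token is emitted instead of being skipped.

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import foo.solr.analysis.CustomStandardTokenizerFactory;

public class FactorySmokeTest {
    public static void main(String[] args) throws Exception {
        StringBuilder word = new StringBuilder("big");
        for (int i = 0; i < 300; i++) word.append('a'); // 303-char word

        StandardTokenizer ts = new CustomStandardTokenizerFactory()
                .create(new StringReader(word.toString()));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            // expect one 303-char token; the stock factory would emit none
            System.out.println(term.term().length() + " chars: "
                    + term.term().substring(0, 10) + "...");
        }
        ts.close();
    }
}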