Here are all the files: http://rghost.net/3016862
1) StandardAnalyzer.java, StandardTokenizer.java - patched files from lucene-2.9.3
2) I patch these files and build Lucene by typing "ant"
3) I replace lucene-core-2.9.3.jar in solr/lib/ with the lucene-core-2.9.3-dev.jar I just compiled
4) then I run "ant compile" and "ant dist" in the solr folder
5) after that I rebuild solr/example/webapps/solr.war with my new Solr and lucene-core jars
6) I put my schema.xml in solr/example/solr/conf/
7) then I run "java -jar start.jar" in solr/example
8) I index big_post.xml
9) I try to find the document with "curl http://localhost:8983/solr/select?q=body:big*" (big_post.xml contains a long word bigaaaaa...aaaa)
10) Solr returns nothing

On 23 October 2010 02:43, Steven A Rowe <sar...@syr.edu> wrote:
> Hi Sergey,
>
> What does your ~34kb field value look like? Does StandardTokenizer think
> it's just one token?
>
> What doesn't work? What happens?
>
> Steve
>
>> -----Original Message-----
>> From: Sergey Bartunov [mailto:sbos....@gmail.com]
>> Sent: Friday, October 22, 2010 3:18 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to index long words with StandardTokenizerFactory?
>>
>> I'm using Solr 1.4.1. I've now succeeded in replacing the lucene-core
>> jar, but maxTokenLength seems to be applied in a very strange way. For
>> me it's currently set to 1024*1024, but I couldn't index a field of
>> just ~34kb. I understand that it's a little weird to index such big
>> data, but I just want to know why it doesn't work.
>>
>> On 22 October 2010 20:36, Steven A Rowe <sar...@syr.edu> wrote:
>> > Hi Sergey,
>> >
>> > I've opened an issue to add a maxTokenLength param to the
>> > StandardTokenizerFactory configuration:
>> >
>> > https://issues.apache.org/jira/browse/SOLR-2188
>> >
>> > I'll work on it this weekend.
>> >
>> > Are you using Solr 1.4.1? I ask because of your mention of Lucene
>> > 2.9.3. I'm not sure there will ever be a Solr 1.4.2 release. I plan
>> > on targeting Solr 3.1 and 4.0 for the SOLR-2188 fix.
>> >
>> > I'm not sure why you didn't get the results you wanted with your
>> > Lucene hack - is it possible you have other Lucene jars in your Solr
>> > classpath?
>> >
>> > Steve
>> >
>> >> -----Original Message-----
>> >> From: Sergey Bartunov [mailto:sbos....@gmail.com]
>> >> Sent: Friday, October 22, 2010 12:08 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: How to index long words with StandardTokenizerFactory?
>> >>
>> >> I'm trying to force Solr to index words longer than 255 characters
>> >> (this constant is DEFAULT_MAX_TOKEN_LENGTH in Lucene's
>> >> StandardAnalyzer.java) using StandardTokenizerFactory as the
>> >> tokenizer in the schema configuration XML. Specifying the
>> >> maxTokenLength attribute doesn't work.
>> >>
>> >> I tried a dirty hack: I downloaded the lucene-core-2.9.3 source,
>> >> changed DEFAULT_MAX_TOKEN_LENGTH to 1000000, built it into a jar,
>> >> and replaced the original lucene-core jar in solr/lib. But it seems
>> >> to have had no effect.
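For anyone following along: once SOLR-2188 lands, the intent is to expose maxTokenLength directly in schema.xml, so no Lucene patching should be necessary. A hypothetical fieldType entry might look like this (the maxTokenLength attribute is the proposed SOLR-2188 addition and does not exist in Solr 1.4.1; the fieldType name is made up):

```xml
<fieldType name="text_long_tokens" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- maxTokenLength is the parameter proposed in SOLR-2188; the
         default of 255 comes from StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH -->
    <tokenizer class="solr.StandardTokenizerFactory" maxTokenLength="1000000"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that, if I recall correctly, the 2.9.x StandardTokenizer silently discards tokens longer than maxTokenLength rather than truncating them, which would explain why "q=body:big*" matches nothing when the long word exceeds the limit.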