I did it just as you recommended. Solr indexes files of around 15kb, but no
more. The effect was the same with the patched constants.
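For reference, a minimal standalone sketch of the tokenizer behaviour under
discussion, assuming lucene-core 2.9.x on the classpath (the demo class name
is made up for illustration): StandardTokenizer silently *skips* tokens longer
than maxTokenLength (default 255) rather than truncating them, which is why an
over-long word never reaches the index at all.

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class MaxTokenLengthDemo {
    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder("big");
        for (int i = 0; i < 500; i++) sb.append('a'); // one 503-char "word"
        String text = "small " + sb + " words";

        for (int max : new int[] { 255, 1000000 }) {
            StandardTokenizer ts = new StandardTokenizer(new StringReader(text));
            ts.setMaxTokenLength(max);
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            while (ts.incrementToken()) {
                // with max=255 only "small" and "words" appear;
                // with max=1000000 the 503-char token is emitted as well
                System.out.println("max=" + max + ": "
                        + term.term().length() + " chars");
            }
            ts.close();
        }
    }
}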
On 24 October 2010 01:29, Ahmet Arslan <iori...@yahoo.com> wrote:
> Oops, I am sorry. I thought that solr/lib refers to solrhome/lib.
>
> I just tested this, and it seems that you have successfully increased the
> max token length. You can verify this via the analysis.jsp page.
>
> Despite what analysis.jsp shows, though, it seems that some other mechanism
> is preventing this huge token from being indexed. The response of
> http://localhost:8983/solr/terms?terms.fl=body does not contain that huge
> token.
>
> If you are interested only in prefix queries, as a workaround you can use
> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
> at index time. Then the query (without the star) solr/select?q=body:big
> will return that document.
>
> By the way, for this particular task you don't need to edit the lucene/solr
> distro at all. You can use the class below with the standard pre-compiled
> solr.war by putting its jar into the SolrHome/lib directory.
>
> package foo.solr.analysis;
>
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.solr.analysis.BaseTokenizerFactory;
> import java.io.Reader;
>
> public class CustomStandardTokenizerFactory extends BaseTokenizerFactory {
>     public StandardTokenizer create(Reader input) {
>         final StandardTokenizer tokenizer = new StandardTokenizer(input);
>         // lift the 255-character default limit
>         tokenizer.setMaxTokenLength(Integer.MAX_VALUE);
>         return tokenizer;
>     }
> }
>
> <fieldType name="text_block" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="foo.solr.analysis.CustomStandardTokenizerFactory" />
>   </analyzer>
> </fieldType>
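A Lucene-level sketch of the EdgeNGramFilterFactory workaround suggested
above, assuming the contrib analyzers jar (lucene-analyzers-2.9.x) is also on
the classpath; the demo class name is made up. Note that the tokenizer must
still emit the long token first (e.g. via the factory above) before the
filter can cut its front n-grams of 1..25 characters, which is what lets the
non-wildcard query body:big match a document containing "bigaaa...".

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class EdgeNGramDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer source =
                new StandardTokenizer(new StringReader("bigword"));
        // front-edge n-grams, mirroring minGramSize=1 / maxGramSize=25
        TokenStream ts = new EdgeNGramTokenFilter(
                source, EdgeNGramTokenFilter.Side.FRONT, 1, 25);
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term()); // b, bi, big, bigw, ...
        }
        ts.close();
    }
}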
> --- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:
>
>> From: Sergey Bartunov <sbos....@gmail.com>
>> Subject: Re: How to index long words with StandardTokenizerFactory?
>> To: solr-user@lucene.apache.org
>> Date: Saturday, October 23, 2010, 6:01 PM
>>
>> This is exactly what I did. Look:
>>
>>> 3) I replace lucene-core-2.9.3.jar in solr/lib/ with my
>>>    lucene-core-2.9.3-dev.jar that I'd just compiled
>>> 4) then I do "ant compile" and "ant dist" in the solr folder
>>> 5) after that I rebuild solr/example/webapps/solr.war
>>
>> On 23 October 2010 18:53, Ahmet Arslan <iori...@yahoo.com> wrote:
>>> I think you should put your new lucene-core-2.9.3-dev.jar in
>>> \apache-solr-1.4.1\lib and then create a new solr.war under
>>> \apache-solr-1.4.1\dist. Then copy this new solr.war to
>>> solr/example/webapps/solr.war.
>>>
>>> --- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:
>>>
>>>> From: Sergey Bartunov <sbos....@gmail.com>
>>>> Subject: Re: How to index long words with StandardTokenizerFactory?
>>>> To: solr-user@lucene.apache.org
>>>> Date: Saturday, October 23, 2010, 5:45 PM
>>>>
>>>> Yes, I did. It won't help.
>>>>
>>>> On 23 October 2010 17:45, Ahmet Arslan <iori...@yahoo.com> wrote:
>>>>> Did you delete the folder Jetty_0_0_0_0_8983_solr.war_** under
>>>>> apache-solr-1.4.1\example\work?
>>>>>
>>>>> --- On Sat, 10/23/10, Sergey Bartunov <sbos....@gmail.com> wrote:
>>>>>
>>>>>> From: Sergey Bartunov <sbos....@gmail.com>
>>>>>> Subject: Re: How to index long words with StandardTokenizerFactory?
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Date: Saturday, October 23, 2010, 3:56 PM
>>>>>>
>>>>>> Here are all the files: http://rghost.net/3016862
>>>>>>
>>>>>> 1) StandardAnalyzer.java, StandardTokenizer.java - patched files
>>>>>>    from lucene-2.9.3
>>>>>> 2) I patch these files and build lucene by typing "ant"
>>>>>> 3) I replace lucene-core-2.9.3.jar in solr/lib/ with my
>>>>>>    lucene-core-2.9.3-dev.jar that I'd just compiled
>>>>>> 4) then I do "ant compile" and "ant dist" in the solr folder
>>>>>> 5) after that I rebuild solr/example/webapps/solr.war with my new
>>>>>>    solr and lucene-core jars
>>>>>> 6) I put my schema.xml in solr/example/solr/conf/
>>>>>> 7) then I do "java -jar start.jar" in solr/example
>>>>>> 8) I index big_post.xml
>>>>>> 9) I try to find this document with "curl
>>>>>>    http://localhost:8983/solr/select?q=body:big*" (big_post.xml
>>>>>>    contains a long word bigaaaaa...aaaa)
>>>>>> 10) solr returns nothing
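One way to chase down why a patched jar "had no effect" (and Steve's
classpath question below) is to print, under the same classpath as the Solr
webapp, which lucene-core jar the JVM actually loaded and its compiled-in
default. A small diagnostic sketch, assuming Lucene 2.9.x class and field
names (the demo class name is made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class WhichLucene {
    public static void main(String[] args) {
        // the jar this class was actually loaded from
        System.out.println(StandardAnalyzer.class
                .getProtectionDomain().getCodeSource().getLocation());
        // 255 means the stock jar; a patched jar would report its new value
        System.out.println("DEFAULT_MAX_TOKEN_LENGTH = "
                + StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
    }
}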
>>>>>> On 23 October 2010 02:43, Steven A Rowe <sar...@syr.edu> wrote:
>>>>>>> Hi Sergey,
>>>>>>>
>>>>>>> What does your ~34kb field value look like? Does StandardTokenizer
>>>>>>> think it's just one token?
>>>>>>>
>>>>>>> What doesn't work? What happens?
>>>>>>>
>>>>>>> Steve
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Sergey Bartunov [mailto:sbos....@gmail.com]
>>>>>>>> Sent: Friday, October 22, 2010 3:18 PM
>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>> Subject: Re: How to index long words with StandardTokenizerFactory?
>>>>>>>>
>>>>>>>> I'm using Solr 1.4.1. I've now succeeded in replacing the
>>>>>>>> lucene-core jar, but the max token length seems to be applied in a
>>>>>>>> very strange way. Currently it's set to 1024*1024 for me, but I
>>>>>>>> couldn't index a field of just ~34kb. I understand that it's a
>>>>>>>> little weird to index such big data, but I just want to understand
>>>>>>>> why it doesn't work.
>>>>>>>>
>>>>>>>> On 22 October 2010 20:36, Steven A Rowe <sar...@syr.edu> wrote:
>>>>>>>>> Hi Sergey,
>>>>>>>>>
>>>>>>>>> I've opened an issue to add a maxTokenLength param to the
>>>>>>>>> StandardTokenizerFactory configuration:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/SOLR-2188
>>>>>>>>>
>>>>>>>>> I'll work on it this weekend.
>>>>>>>>>
>>>>>>>>> Are you using Solr 1.4.1? I ask because of your mention of Lucene
>>>>>>>>> 2.9.3. I'm not sure there will ever be a Solr 1.4.2 release. I
>>>>>>>>> plan on targeting Solr 3.1 and 4.0 for the SOLR-2188 fix.
>>>>>>>>>
>>>>>>>>> I'm not sure why you didn't get the results you wanted with your
>>>>>>>>> Lucene hack - is it possible you have other Lucene jars in your
>>>>>>>>> Solr classpath?
>>>>>>>>>
>>>>>>>>> Steve
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Sergey Bartunov [mailto:sbos....@gmail.com]
>>>>>>>>>> Sent: Friday, October 22, 2010 12:08 PM
>>>>>>>>>> To: solr-user@lucene.apache.org
>>>>>>>>>> Subject: How to index long words with StandardTokenizerFactory?
>>>>>>>>>>
>>>>>>>>>> I'm trying to force Solr to index words whose length exceeds 255
>>>>>>>>>> characters (this constant is DEFAULT_MAX_TOKEN_LENGTH in Lucene's
>>>>>>>>>> StandardAnalyzer.java) using StandardTokenizerFactory in the
>>>>>>>>>> schema configuration XML. Specifying the maxTokenLength attribute
>>>>>>>>>> doesn't work.
>>>>>>>>>>
>>>>>>>>>> I tried a dirty hack: I downloaded the lucene-core-2.9.3 source,
>>>>>>>>>> changed DEFAULT_MAX_TOKEN_LENGTH to 1000000, built it into a jar,
>>>>>>>>>> and replaced the original lucene-core jar in solr/lib. But it
>>>>>>>>>> seems to have had no effect.
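As a closing illustration, a smoke test of the CustomStandardTokenizerFactory
posted earlier in the thread: a sketch assuming the foo.solr.analysis class
plus the solr-core and lucene-core 2.9.x jars are on the classpath (the test
class name is made up). It feeds the tokenizer a 300+ character word and
confirms the token is emitted instead of being skipped.

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import foo.solr.analysis.CustomStandardTokenizerFactory;

public class FactorySmokeTest {
    public static void main(String[] args) throws Exception {
        StringBuilder word = new StringBuilder("big");
        for (int i = 0; i < 300; i++) word.append('a'); // 303-char word

        StandardTokenizer ts = new CustomStandardTokenizerFactory()
                .create(new StringReader(word.toString()));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            // expect one 303-char token; the stock factory would emit none
            System.out.println(term.term().length() + " chars: "
                    + term.term().substring(0, 10) + "...");
        }
        ts.close();
    }
}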