Re: Tokenize integers?
Ok, thanks. However, I am still a bit confused. Since I know that these are only integers, can't I somehow make Solr use solr.IntField or solr.SortableIntField, but still tokenize like this? I tried the configuration below with TextField changed to IntField and indexed the document again, but then the search didn't work...

This is what I use now (after your suggestion):

This works great when searching. But when I get the document back, I see that the stored value is still the comma-separated values, i.e.:

... 3,5 ...

I would have liked it like this instead:

... 3 5 ...

Is this possible with Solr by some configuration? Am I really the only one who would like this behavior?

/Jimi

Quoting Otis Gospodnetic <[EMAIL PROTECTED]>:

I think you are after http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Saturday, May 3, 2008 11:57:37 PM
Subject: Tokenize integers?

Hi,

What is the recommended way to configure a fieldtype for a field that looks like this in the source system?

categoryIds=1,325,488

The order of these ids is not important. I want to be able to fetch all the ids separately, i.e. I want them to be stored as multivalue, I guess... And I also want to be able to search on the individual ids, or combinations (for example, search for all articles with category id 1 and 488).

I know I can index this as multiple categoryId fields (and have them as int or sint type), but that means I need to write preprocessing on the "client" side. I would prefer a server-side fix, so that the client can send the XML like this:

... 1,325,488 ...

And then the server (i.e. Solr) will transform this into a multivalue int/sint field, using tokenizing or whatever it is called (or is tokenizing not performed on the stored value?).

What are your suggestions? Maybe this is already documented in the wiki or someplace else? I have searched for this, but not found anything that helps.

Regards
/Jimi
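For readers of this thread: analysis in Solr changes only the indexed terms, while the stored value is returned exactly as it was sent. That is why the comma-separated string comes back unchanged. The client-side preprocessing mentioned in the original question could look roughly like the sketch below. It is only an illustration, not a recommended implementation. It assumes a unique key field named "id" (hypothetical) and a multiValued int/sint field named categoryId in the schema. The helper name build_add_xml is made up for this example.

```python
from xml.etree import ElementTree as ET


def build_add_xml(doc_id, category_ids_csv):
    """Build a Solr <add> message with one categoryId field per id.

    Assumptions (not from the thread): the unique key field is called "id",
    and categoryId is declared multiValued with an int/sint type.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = str(doc_id)

    for raw in category_ids_csv.split(","):
        raw = raw.strip()
        if not raw:
            continue  # tolerate trailing commas and empty entries
        field = ET.SubElement(doc, "field", name="categoryId")
        # int() raises ValueError on non-numeric input, so bad ids fail early
        field.text = str(int(raw))

    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add_xml("article-1", "1,325,488"))
```

The resulting XML contains three separate categoryId fields, so each id is stored and returned as its own value.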
IOException: Mark invalid while analyzing HTML
Hi,

I'm seeing a problem mentioned in SOLR-42, "Highlighting problems with HTMLStripWhitespaceTokenizerFactory":

https://issues.apache.org/jira/browse/SOLR-42

I'm indexing HTML documents, and am getting reams of "Mark invalid" IOExceptions:

SEVERE: java.io.IOException: Mark invalid
    at java.io.BufferedReader.reset(Unknown Source)
    at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
    at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
    at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
    at java.io.Reader.read(Unknown Source)
    at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
    at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
    at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
    at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
    at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
    at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
    at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
    at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
    at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
    at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)

This is using a ~1 week old version of Solr 1.3 from SVN. One workaround mentioned in that Jira issue was to move HTML stripping outside of Solr; can anyone suggest a better approach than that?

Thanks
James
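For readers who end up using the workaround of stripping HTML before the text reaches Solr, here is a rough sketch of what that could look like on the client side. It uses only Python's standard html.parser module and is not taken from the Jira issue. The names TextExtractor and strip_html are made up for this example, and real-world HTML may need a more forgiving parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping script/style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space, so a tag in the middle of a word
        # will split that word; whitespace runs are then collapsed.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the plain text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b></p><script>var x = 1;</script>"))
```

The plain text can then be sent to a field whose analyzer does no HTML stripping, which avoids the HTMLStripReader code path shown in the stack trace. The trade-off is that highlighting offsets will refer to the stripped text rather than the original markup.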
definition of field types?
I must be overlooking ... where can I find definitions of the built-in types such as textTight, text_ws, etc?
Re: definition of field types?
A good place to look is the Wiki. Look for the "Analyzer" substring on the main Solr wiki page.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
> From: JLIST <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, May 4, 2008 8:17:50 PM
> Subject: definition of field types?
>
> I must be overlooking ... where can I find definitions of
> the built-in types such as textTight, text_ws, etc?