Re: Tokenize integers?

2008-05-04 Thread solr
Ok, thanks. However, I am still a bit confused. Since I know that these
are only integers, can't I somehow make Solr use solr.IntField or
solr.SortableIntField, but still tokenize like this? I tried the
configuration below with TextField changed to IntField and indexed the
document again, but then the search didn't work...


This is what I use now (after your suggestion):


  
  
  
  
  
  
  
  


This works great when searching. But when I get the document back, I
see that the stored value is still the comma-separated string, i.e.:


...
3,5
...

I would have liked it like this instead:

...
3
5
...

Is this possible in Solr through some configuration? Am I really the only
one who would like this behavior?


/Jimi
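
Archive note: in Solr, analysis (tokenizing and filtering) shapes only the indexed terms, while the stored value is returned exactly as it was submitted. That is why the document above still comes back with `3,5`. The toy model below is plain Python, not Solr code: `analyze` and `add_field` are hypothetical names, and splitting on commas is an assumed stand-in for whatever tokenizer the field type actually uses.

```python
def analyze(raw):
    """Stand-in for an analyzer chain: split on commas, trim whitespace."""
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def add_field(raw):
    """Model one field of one document: searchable terms vs. stored value."""
    return {
        "indexed_terms": analyze(raw),  # what queries are matched against
        "stored_value": raw,            # what is returned with the document
    }


field = add_field("3,5")
print(field["indexed_terms"])  # ['3', '5']
print(field["stored_value"])   # 3,5
```

Under this model, getting separate stored values requires sending separate values, which is what the multi-valued approach from the original question does.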

Quoting Otis Gospodnetic <[EMAIL PROTECTED]>:

I think you are after   
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-1c9b83870ca7890cd73b193cefed83c283339089


Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----

From: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Saturday, May 3, 2008 11:57:37 PM
Subject: Tokenize integers?

Hi,

What is the recommended way to configure a fieldtype for a field that
looks like this in the source system?

categoryIds=1,325,488

The order of these ids is not important. I want to be able to fetch
all the ids separately, i.e. I want them to be stored as multivalued, I
guess... And I also want to be able to search on the individual ids,
or combinations (for example, search for all articles with category ids
1 and 488).

I know I can index this as multiple categoryId fields (and have them
as int or sint type), but that means I need to write preprocessing on
the "client" side. I would prefer a server side fix, so that the
client can send the xml like this:

...
1,325,488
...

And then the server (i.e. Solr) will transform this into a multivalued
int/sint field, using tokenizing or whatever it is called (or is
tokenizing not performed on the stored value?).

What are your suggestions? Maybe this is already documented in the
wiki or someplace else? I have searched for this, but not found
anything that helps.

Regards
/Jimi
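
Archive note: the client-side preprocessing mentioned in this message can be quite small. The sketch below splits the comma-separated string and emits one `categoryId` field per id in a Solr XML `<add>` message. The `build_add_xml` helper, the `id` field, and the `article-1` value are hypothetical, and the sketch assumes `categoryId` is declared as a multi-valued int/sint field in the schema.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, category_ids):
    """Build a Solr <add> message with one categoryId field per id."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "id" is an assumed unique-key field name, not taken from the thread.
    ET.SubElement(doc, "field", name="id").text = doc_id
    for part in category_ids.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray or trailing commas
        # int() raises ValueError for anything that is not a whole number.
        ET.SubElement(doc, "field", name="categoryId").text = str(int(part))
    return ET.tostring(add, encoding="unicode")


print(build_add_xml("article-1", "1,325,488"))
```

Because each id is sent as its own field value, each one is both stored and indexed separately, so no tokenizing is needed on the server.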



IOException: Mark invalid while analyzing HTML

2008-05-04 Thread James Brady

Hi,
I'm seeing the problem mentioned in SOLR-42, "Highlighting problems with
HTMLStripWhitespaceTokenizerFactory":

https://issues.apache.org/jira/browse/SOLR-42

I'm indexing HTML documents, and am getting reams of "Mark invalid"  
IOExceptions:

SEVERE: java.io.IOException: Mark invalid
	at java.io.BufferedReader.reset(Unknown Source)
	at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:728)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:742)
	at java.io.Reader.read(Unknown Source)
	at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56)
	at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118)
	at org.apache.solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249)
	at org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33)
	at org.apache.solr.analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92)
	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45)
	at org.apache.solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94)
	at org.apache.solr.analysis.RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java:33)
	at org.apache.solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82)
	at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1518)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1407)
	at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:1116)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440)
	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2422)
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1445)



This is using a ~1 week old version of Solr 1.3 from SVN.

One workaround mentioned in that Jira issue was to move HTML stripping  
outside of Solr; can anyone suggest a better approach than that?


Thanks
James
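
Archive note: for readers who do take the workaround of stripping HTML outside of Solr, the sketch below shows one way to do it with Python's standard-library `html.parser` before the text is sent for indexing. `TextExtractor` and `strip_html` are hypothetical names. This is a simplified stand-in, not a port of `HTMLStripReader`: it does not preserve character offsets, so it will not help with highlighting against the original markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces, then collapse whitespace runs.
        # Note: this also separates text split by inline tags.
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the visible text of an HTML fragment as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(strip_html("<p>Hello <b>Solr</b> &amp; Lucene</p><script>var x = 1;</script>"))
# prints: Hello Solr & Lucene
```

The stripped text can then be indexed with an ordinary text field type, so the HTML-stripping tokenizer is no longer in the analysis chain.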



definition of field types?

2008-05-04 Thread JLIST
I must be overlooking ... where can I find definitions of
the built-in types such as textTight, text_ws, etc?



Re: definition of field types?

2008-05-04 Thread Otis Gospodnetic
A good place to look is the wiki. Search for the substring "Analyzer" on the
main Solr wiki page.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: JLIST <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Sunday, May 4, 2008 8:17:50 PM
> Subject: definition of field types?
> 
> I must be overlooking ... where can I find definitions of
> the built-in types such as textTight, text_ws, etc?
> 
>
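
Archive note: in the example configuration shipped with Solr, types such as `text_ws` are declared as `fieldType` elements in `schema.xml`, so the definitions can also be read straight from that file. The sketch below lists each declared type with its class and tokenizer. `describe_field_types` is a hypothetical helper, and `SAMPLE_SCHEMA` is an abbreviated, illustrative excerpt rather than a complete schema.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative excerpt; a real schema.xml contains far more.
SAMPLE_SCHEMA = """
<schema name="example" version="1.1">
  <types>
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="sint" class="solr.SortableIntField" omitNorms="true"/>
  </types>
</schema>
"""


def describe_field_types(schema_xml):
    """Map each field type name to (class, tokenizer class or None)."""
    root = ET.fromstring(schema_xml.strip())
    result = {}
    for elem in root.iter():
        # Some schemas spell the element "fieldtype"; accept both.
        if elem.tag.lower() != "fieldtype":
            continue
        # Only the first analyzer's tokenizer is reported in this sketch.
        tokenizer = elem.find("./analyzer/tokenizer")
        tokenizer_class = tokenizer.get("class") if tokenizer is not None else None
        result[elem.get("name")] = (elem.get("class"), tokenizer_class)
    return result


for name, (cls, tok) in describe_field_types(SAMPLE_SCHEMA).items():
    print(name, cls, tok)
```

To inspect a real installation, the same function can be given the contents of that installation's own `schema.xml`.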