capturing field length into a stored document field
For various statistics I collect from an index it's important for me to know the length (measured in tokens) of a document field. I can get that information to some degree from the "norms" for the field, but a) the resolution isn't that great, and b) more importantly, if boosts are used it's almost impossible to recover lengths from them.

Here are two ideas I was thinking about that maybe someone can comment on:

1) Use copyField to copy the field in question, fieldA, to an additional field, fieldALength, which has an extra filter that just counts the tokens and only outputs a single token representing the length of the field. This has the disadvantage of retokenizing basically the whole document (because the field in question is essentially the body). Plus I would think littering the term space with these tokens might be bad for performance; I'm not sure.

2) Add a filter to the field in question which again counts the tokens. This filter allows the regular tokens to be indexed as usual but somehow manages to get the token count into a stored field of the document. This has the advantage of not having to retokenize the field, and instead of littering the token space, the count becomes per-document data. Can this be done? Maybe using a ThreadLocal to temporarily store the count?

Thanks.

-- View this message in context: http://www.nabble.com/capturing-field-length-into-a-stored-document-field-tp25297690p25297690.html Sent from the Solr - User mailing list archive at Nabble.com.
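The counting filter in idea (2) can be sketched without Lucene: a real implementation would subclass Lucene's TokenFilter and override incrementToken(), but a plain Iterator shows the same logic. Everything here (the class name, the "__len_" marker token) is a hypothetical illustration, not the actual Solr API; the sketch passes regular tokens through while counting them and appends one synthetic length token at the end.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Simplified stand-in for a token-counting TokenFilter: pass every token
// through unchanged while counting, then emit one extra synthetic token
// carrying the field length once the underlying stream is exhausted.
class LengthCountingFilter implements Iterator<String> {
    private final Iterator<String> input;
    private int count = 0;
    private boolean lengthEmitted = false;

    LengthCountingFilter(Iterator<String> input) {
        this.input = input;
    }

    @Override
    public boolean hasNext() {
        return input.hasNext() || !lengthEmitted;
    }

    @Override
    public String next() {
        if (input.hasNext()) {
            count++;
            return input.next();      // regular tokens are indexed as usual
        }
        if (!lengthEmitted) {
            lengthEmitted = true;
            return "__len_" + count;  // hypothetical length-marker token
        }
        throw new NoSuchElementException();
    }
}
```

For the stored-field variant, the only change would be stashing `count` somewhere the document-building code can read it back (the ThreadLocal idea above) instead of emitting the marker token.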
Re: capturing field length into a stored document field
Here's a hybrid solution. Add a filter to the field in question that counts all the tokens and at the end outputs a token of the form __numtokens.__. This eliminates the need to retokenize the field. Also, bucket the counts, either by some factor of ten or by powers of two, so that there aren't so many distinct token values produced. This has a space advantage over storing the count in a field, especially since the information isn't needed at query time anyway.

mike.schultz wrote:
> For various statistics I collect from an index it's important for me to
> know the length (measured in tokens) of a document field. [...]
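The bucketing step above is just rounding the count down to a power of two before building the token, so only a handful of distinct length tokens ever enter the index. A minimal sketch, assuming a hypothetical "__numtokens_" naming scheme for the marker token:

```java
// Round a token count down to the nearest power of two and build the
// corresponding length token, so e.g. counts 1024..2047 all share one term.
class LengthBucket {
    static String bucketToken(int tokenCount) {
        // highestOneBit(n) is the largest power of two <= n (for n >= 1);
        // clamp at 1 so empty fields still get a well-defined bucket.
        int bucket = Integer.highestOneBit(Math.max(tokenCount, 1));
        return "__numtokens_" + bucket;
    }
}
```

With base-2 buckets, a body field of any realistic size produces only a few dozen distinct terms, which is what gives this the space advantage over a stored field.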
specifying search depth?
I want to use the standard QueryComponent to run a query, then sort a *limited number of the results* by some function query. So if my query returns 10,000 results, I'd like to evaluate the function over only the top, say, 100 of them and sort those to produce the final results. Is this possible?

Thanks,
Mike
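What's described here is what later Solr releases call re-ranking: the ReRank query parser rescores only the top N documents of the main query with a second query. A request sketch under that assumption (the collection name, main query, and function are placeholders; parameter names follow the Solr Reference Guide):

```
http://localhost:8983/solr/mycollection/select
  ?q=some+query
  &rq={!rerank reRankQuery=$rqq reRankDocs=100}
  &rqq={!func}my_function_query
```

Here reRankDocs=100 limits the function evaluation to the top 100 hits of the original ranking, exactly the "search depth" asked about.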
Making Analyzer Phrase aware?
I was looking at the SOLR-908 port of Nutch's CommonGramsFilter as an approach for making phrase searches sensitive to stop words within a query, so that a search on "car on street" wouldn't match the text "car in street". From what I can tell, the query-time version of the filter will *always* create stop-word grams, not just in a phrase context. I want non-phrase searches to ignore stop words as usual. Can someone tell me how to make an analyzer (or token filter) "phrase aware", so that I only create grams when I know I'm inside a phrase?

Thanks,
Mike

-- View this message in context: http://www.nabble.com/Making-Analyzer-Phrase-aware--tp24306862p24306862.html Sent from the Solr - User mailing list archive at Nabble.com.
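For context, the gram-forming behavior in question can be sketched in plain Java: any token adjacent to a common ("stop") word gets joined to it with an underscore, which is what makes the stop word positionally significant inside a phrase. This mimics the SOLR-908 query-side behavior in spirit only; it is not the real CommonGramsFilter, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified sketch of CommonGrams-style query output: pairs touching a
// common word become grams ("car_on", "on_street"); other tokens pass
// through unchanged. Real filters do this over a Lucene TokenStream.
class CommonGramsSketch {
    static List<String> grams(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String a = tokens.get(i), b = tokens.get(i + 1);
            if (common.contains(a) || common.contains(b)) {
                out.add(a + "_" + b);   // gram keeps the stop word
            } else {
                out.add(a);             // plain token passes through
            }
        }
        if (!tokens.isEmpty()) {
            String last = tokens.get(tokens.size() - 1);
            // emit the final token unless it was already folded into a gram
            if (out.isEmpty() || !out.get(out.size() - 1).endsWith("_" + last)) {
                out.add(last);
            }
        }
        return out;
    }
}
```

The complaint in the message is visible here: the sketch (like the query filter) grams unconditionally, with no way to behave differently depending on whether the tokens came from inside quoted phrase syntax.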