relative token count in a query result
Hello,

earlier I was trying to retrieve the total token count per index:
http://lucene.472066.n3.nabble.com/how-to-retrieve-total-token-count-per-collection-index-td4000161.html

Now I would like to get token (word) counts within a document set (the result of a query), both for the matching word and as the sum of all tokens in the matching documents. The ultimate goal is to compute relative frequencies of terms on a per-token basis instead of a per-article basis.

So if I search for the word "Haus" within a subcollection (defined by a separate query) and the word appears 2 times in matching doc A and 5 times in doc B, I need a hit count of 7, not 2.

In addition, if the subcollection contains the documents
A with 300 tokens (i.e. running words, not distinct terms)
B with 100 tokens
C with 50 tokens
I also need this second sum, i.e. 450.

I plan to get the second number by preprocessing the documents: counting the tokens and storing the number in a separate field, then applying the StatsComponent, which will deliver the sum for a given query/subcollection.

For the first number I could use the termfreq() function, but that only gives me the term frequency per document. So before I iterate over the whole result to sum it up, I wonder whether the StatsComponent could also perform the counting over a dynamically computed value (the result of the function). I tried this:

/solr/select/?fq=docsrc:falter&q={!func}tf(inhalt,'haus')&stats=true&stats.field=score&rows=10&indent=true&fl=score&debugQuery=true

but got the error:

Field type text_de{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100}} is not currently supported

Or is there any other way? If I understand correctly, none of tf(), idf(), sttf() would be of any help here either.

Thanks in advance,
best,
matej
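To make the two sums concrete, here is a minimal Python sketch of the arithmetic described above. It is only an illustration, not a Solr feature: the per-document numbers are the ones from the example, the hit count of 0 for doc C is an assumption (the message gives no count for it), and the field name token_count is a hypothetical name for the precomputed token-count field. The request parameters at the end only show how the StatsComponent could be asked for the sum of such a field.

```python
from urllib.parse import urlencode

# Per-document numbers from the example: (occurrences of "Haus", running tokens).
# The 0 for doc C is an assumption; the message states no hit count for it.
docs = {"A": (2, 300), "B": (5, 100), "C": (0, 50)}


def relative_frequency(per_doc):
    """Return (total hits, total tokens, hits / tokens) for a document set."""
    hits = sum(tf for tf, _ in per_doc.values())
    tokens = sum(n for _, n in per_doc.values())
    return hits, tokens, hits / tokens


hits, tokens, rel = relative_frequency(docs)
print(hits, tokens, round(rel, 4))  # 7 450 0.0156

# Parameters for letting the StatsComponent sum a precomputed numeric field.
# "token_count" is a hypothetical field name, not one from the original schema.
params = urlencode({
    "q": "*:*",
    "fq": "docsrc:falter",
    "stats": "true",
    "stats.field": "token_count",
    "rows": 0,
})
print("/solr/select/?" + params)
```

The hit count (7) still has to come from somewhere, e.g. by iterating over the per-document termfreq() values as mentioned above; the sketch only shows what is done with the numbers once they are available.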
how to retrieve total token count per collection/index
Hello,

I wonder how to figure out the total token count in a collection (per index), i.e. the size of a corpus/collection measured in tokens.

The statistics in /admin show the number of distinct terms, and the frequency list per index reveals the number of documents containing a given term. So even if I summed all those frequencies, I wouldn't get the result I need.

Thank you.
best,
Matej
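The difference between the three numbers in question (distinct terms, document frequency, total token count) can be illustrated with a small Python sketch. The two toy documents are invented for illustration and nothing here calls Solr or Lucene; it only shows why summing document frequencies does not yield the corpus size in tokens.

```python
from collections import Counter

# Two invented, already tokenized documents.
docs = {
    "A": "das haus am see das haus".split(),
    "B": "ein haus".split(),
}

# Term frequency per document: how often each term occurs in that document.
term_freqs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

# Number of distinct terms in the whole collection.
distinct_terms = set().union(*docs.values())

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tf in term_freqs.values() for term in tf)

# Total token count: all running words, repetitions included.
total_tokens = sum(len(tokens) for tokens in docs.values())

print(len(distinct_terms))     # 5 distinct terms
print(doc_freq["haus"])        # "haus" occurs in 2 documents
print(sum(doc_freq.values()))  # 6, the sum of all document frequencies
print(total_tokens)            # 8 running tokens, the number asked for
```

Summing the document frequencies gives 6 here, while the collection actually contains 8 tokens, because repeated occurrences within one document are counted only once.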
Re: how to retrieve total token count per collection/index
On 09.08.2012 18:02, Robert Muir wrote:
> On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk wrote:
>> Hello, I wonder how to figure out the total token count in a collection (per index), i.e. the size of a corpus/collection measured in tokens.
>
> You want to use this statistic, which tells you the number of tokens for an indexed field:
> http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29

Thank you. Is there any 3.6 equivalent for this, before I install and run 4.0? I can't seem to find a corresponding class (org.apache.lucene.index.Terms) in 3.6.
Re: how to retrieve total token count per collection/index
On 09.08.2012 18:02, Robert Muir wrote:
> You want to use this statistic, which tells you the number of tokens for an indexed field:
> http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29

Just to say: thank you, this seems to work well!

matej
configure saxon in 4.x
Hi,

I am unable to configure Saxon as the XSLT transformer of choice in Solr 4.x (ALPHA and BETA). On startup, I keep getting the error:

null:javax.xml.transform.TransformerFactoryConfigurationError: Provider net.sf.saxon.TransformerFactoryImpl not found

even though the log just before says it is adding the .jars to the class loader.

I start Solr (in the example dir) with:

java -jar start.jar -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl

I have put the Saxon jars on the CLASSPATH. I also tried putting various versions of the Saxon jars into ./example/lib or ./example/solr/lib and editing ./example/solr/collection1/solrconfig.xml and:

What am I doing wrong? Any hint appreciated.

best,
Matej
Re: configure saxon in 4.x
There are certainly multiple ways; I have now found at least one: in the 4.x package, the configuration of Jetty's start.jar expects the extension libraries in lib/ext/* (in 3.6 everything under lib/** was considered).

I hope this saves other lost souls (should there be any) hours of desperation trying to get their configuration right.

best,
matej
mapping values in fields
Hi,

I am trying to map the values of one field to other values in another field. For example:

original_field: orig_value1
mapped_field: mapped_value1

with the help of an explicitly defined (N:1) mapping:

orig_value1 => mapped_value1
orig_value2 => mapped_value1
orig_value3 => mapped_value2

I have tried to use SynonymFilterFactory for the mapped_field:

<filter class="solr.SynonymFilterFactory" synonyms="region-map.txt" ignoreCase="true" expand="true"/>

combined with:

Now, a search for mapped_field:mapped_value1 yields results; however, mapped_value1 does not appear in the result at all. Instead, orig_value1 appears in the mapped_field as well.

How can I make the mapped value appear in the result too?

thank you,
matej
Re: mapping values in fields
The query is:

mapped_field:mapped_value1

and it seems to return the documents correctly. The mapped_field has the attribute stored=true and also appears in the result (even without requesting it explicitly with fl), just with orig_value1 instead of mapped_value1.

matej

On 02.10.2012 15:46, Erick Erickson wrote:
> What's the query you send? I'm guessing a bit here since you haven't included it, but try ensuring two things:
> 1> your mapped_field has 'stored="true"'
> 2> you specify (either in your request handler or on the URL) fl=mapped_value
>
> Best
> Erick
Re: mapping values in fields
Thank you Erick, that would explain the behaviour.

But it leaves me with the question: when are all the analyzers/filters applied, if not during indexing? I mean, if I define a lowercase filter on a field and it is not applied to the input stream before storing (so that the lowercased text is what ends up in the index), at what point is it applied then? The only other point I can think of is on the fly, during the search, but that sounds terribly inefficient (doing the transformations every time?).

I would be grateful if anybody could shed some light on this for me.

best,
matej

On 02.10.2012 16:58, Erick Erickson wrote:
> Ah, I get it (finally). OK, there's no good way to do what you want that I know of. The problem is that the stored="true" takes effect long before any transformations are applied, and is always the raw input. You effectively want to chain the fields together, i.e. apply the analysis chain _then_ have the copyfield take effect, which is not supported.
>
> I don't know how to accomplish this off the top of my head OOB. I'd guess your client would have to manage the substitutions and then just index separate fields...
>
> Best
> Erick
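The suggestion in this thread, that the client manages the substitutions and indexes separate fields, could look roughly like the following Python sketch. It is an illustration under assumptions, not a Solr API: the mapping table repeats the N:1 example from the original question, the helper name add_mapped_field is hypothetical, and the resulting dictionaries would still have to be sent to Solr by whatever client is in use. Because the mapped value is written into the document before indexing, it is also what comes back as the stored value.

```python
# N:1 mapping from the example in the thread.
REGION_MAP = {
    "orig_value1": "mapped_value1",
    "orig_value2": "mapped_value1",
    "orig_value3": "mapped_value2",
}


def add_mapped_field(doc, mapping=REGION_MAP):
    """Return a copy of doc with mapped_field derived from original_field.

    Hypothetical helper: documents whose original value has no mapping
    are returned without a mapped_field.
    """
    out = dict(doc)
    original = doc.get("original_field")
    if original in mapping:
        out["mapped_field"] = mapping[original]
    return out


docs = [
    {"id": "1", "original_field": "orig_value1"},
    {"id": "2", "original_field": "orig_value3"},
    {"id": "3", "original_field": "unmapped_value"},
]

# These prepared documents are what the client would then index.
prepared = [add_mapped_field(d) for d in docs]
for d in prepared:
    print(d)
```

With this approach no synonym filter is needed on mapped_field for the mapping itself; the field can use a plain string type, and a search for mapped_field:mapped_value1 returns documents that also display mapped_value1.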