relative token count in a query result

2012-11-20 Thread tech.vronk

Hello,

Earlier, I was trying to retrieve the total token count per index:
http://lucene.472066.n3.nabble.com/how-to-retrieve-total-token-count-per-collection-index-td4000161.html

Now I would like to get a token (word) count within the document set
resulting from a query,

both for the matching word and as the sum of all tokens in the matching documents.

The ultimate goal is to compute relative frequencies of
terms on a per-token basis instead of a per-article basis.


So if I search for the word "Haus" within a subcollection (defined by a
separate query) and the word appears 2 times in matching doc A and
5 times in matching doc B, I need 7 as the hit count, not 2.


In addition, if the subcollection contains the documents
A with 300 tokens (i.e. running words, not different terms)
B with 100 tokens
C with 50 tokens

I also need this second sum, i.e. 450.
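
As a worked example of the two sums and the resulting relative frequency, here is a minimal Python sketch. The per-document numbers come from the example above; the assumption that doc C contains "Haus" zero times, and the names `docs` and `relative_frequency`, are only for illustration.

```python
# Per-document counts from the example above.
# Assumption: doc C is in the subcollection but contains "haus" 0 times.
docs = {
    "A": {"haus": 2, "tokens": 300},
    "B": {"haus": 5, "tokens": 100},
    "C": {"haus": 0, "tokens": 50},
}


def relative_frequency(docs, term):
    """Return (hits, total_tokens, hits / total_tokens) for a document set."""
    hits = sum(d.get(term, 0) for d in docs.values())
    total = sum(d["tokens"] for d in docs.values())
    return hits, total, (hits / total if total else 0.0)


hits, total, rel = relative_frequency(docs, "haus")
print(hits, total, round(rel, 4))  # 7 450 0.0156
```

The token-based hit count is 7 (not the 2 matching documents), and the relative frequency is 7 / 450.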

I plan to get the second number by first
preprocessing each document to count its tokens,
storing that number in a separate field,
and then applying the StatsComponent,
which will deliver the sum for a given query/subcollection.
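
A rough sketch of that plan in Python, under several assumptions: the field name `token_count` is hypothetical and would have to be a numeric field in the schema, the whitespace split is only a stand-in for whatever tokenization the `text_de` analyzer really performs, and the filter `docsrc:falter` and the field `inhalt` are taken from the query further down.

```python
from urllib.parse import urlencode


def with_token_count(doc, text_field="inhalt", count_field="token_count"):
    """Return a copy of doc with a whitespace-based token count added.

    Assumption: a plain split() approximates the index-time tokenizer.
    """
    enriched = dict(doc)
    enriched[count_field] = len(doc.get(text_field, "").split())
    return enriched


def stats_query(filter_query, count_field="token_count"):
    """Build a select URL that asks the StatsComponent for the field's stats."""
    params = {
        "q": "*:*",
        "fq": filter_query,
        "rows": 0,                 # only the stats are needed, not the docs
        "stats": "true",
        "stats.field": count_field,
    }
    return "/solr/select/?" + urlencode(params)


doc = with_token_count({"id": "A", "inhalt": "Das Haus ist alt"})
print(doc["token_count"])
print(stats_query("docsrc:falter"))
```

The `sum` value in the stats section of the response would then be the total token count of the subcollection.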

For the first number, I could use the termfreq() function,
but that only gives me the term frequency per document.

So before I iterate over the whole result to sum it up,
I wonder whether the StatsComponent could also perform the summation
over a dynamically computed value (the result of the function).

I tried this:
/solr/select/?fq=docsrc:falter&q={!func}tf(inhalt,'haus')&stats=true&stats.field=score&rows=10&indent=true&fl=score&debugQuery=true

but got the error:
Field type text_de{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100}} is not currently supported


Or is there any other way?

If I understand correctly, none of tf(), idf(), or sttf() would be of
any help here either.
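
The fallback mentioned above (iterating over the whole result and summing the per-document term frequency on the client) could look roughly like this in Python. It assumes the Solr version in use can return a function value per document, for example via something like `fl=id,tf:termfreq(inhalt,'haus')`, that `rows` is large enough to cover all matches, and that the response is requested as JSON. The `sample` response is hand-written for illustration, not real output.

```python
def sum_pseudo_field(response, key):
    """Sum a per-document numeric value over all docs in a Solr JSON response."""
    docs = response.get("response", {}).get("docs", [])
    return sum(doc.get(key, 0) for doc in docs)


# Hand-written stand-in for a parsed JSON response (assumption, not real data).
sample = {
    "response": {
        "numFound": 2,
        "docs": [
            {"id": "A", "tf": 2},
            {"id": "B", "tf": 5},
        ],
    }
}

print(sum_pseudo_field(sample, "tf"))  # 7
```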


Thanks in advance

best,
matej




how to retrieve total token count per collection/index

2012-08-09 Thread tech.vronk

Hello,

I wonder how to figure out the total token count in a collection (per 
index), i.e. the size of a corpus/collection measured in tokens.


The statistics in /admin tell the number of distinct terms,
and the frequency list per index reveals the number of documents with a
given term. So even if I summed all the frequencies, I wouldn't get
the result I need.
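
To make the difference between the three numbers concrete, here is a small Python sketch on a toy corpus. The two documents are made up for illustration and a whitespace split stands in for real analysis.

```python
from collections import Counter

# Made-up toy corpus (assumption, for illustration only).
corpus = {
    "doc1": "das haus ist das haus",
    "doc2": "das auto",
}

term_freq = Counter()  # total occurrences per term
doc_freq = Counter()   # number of documents containing the term

for text in corpus.values():
    tokens = text.split()
    term_freq.update(tokens)
    doc_freq.update(set(tokens))

distinct_terms = len(term_freq)          # what the admin statistics report
sum_doc_freq = sum(doc_freq.values())    # sum of the per-term document counts
total_tokens = sum(term_freq.values())   # corpus size measured in tokens

print(distinct_terms, sum_doc_freq, total_tokens)  # 4 5 7
```

Only the last number counts every running word, which is the corpus size asked about here.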


Thank you.

best,
Matej


Re: how to retrieve total token count per collection/index

2012-08-09 Thread tech.vronk

On 09.08.2012 18:02, Robert Muir wrote:

On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk  wrote:

Hello,

I wonder how to figure out the total token count in a collection (per
index), i.e. the size of a corpus/collection measured in tokens.


You want to use this statistic, which tells you number of tokens for
an indexed field:
http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29


thank you.

Is there any 3.6 equivalent for this, before I install and run 4.0?
I can't seem to find a corresponding class (org.apache.lucene.index.Terms)
in 3.6.



Re: how to retrieve total token count per collection/index

2012-08-20 Thread tech.vronk

On 09.08.2012 18:02, Robert Muir wrote:

On Thu, Aug 9, 2012 at 10:20 AM, tech.vronk  wrote:

Hello,

I wonder how to figure out the total token count in a collection (per
index), i.e. the size of a corpus/collection measured in tokens.


You want to use this statistic, which tells you number of tokens for
an indexed field:
http://lucene.apache.org/core/4_0_0-ALPHA/core/org/apache/lucene/index/Terms.html#getSumTotalTermFreq%28%29



just to say:
thank you, this seems to work well!

matej


configure saxon in 4.x

2012-09-28 Thread tech.vronk

hi,

I am unable to configure Saxon as the XSLT transformer of choice
in Solr 4.x (ALPHA and BETA).

On startup, I keep getting the error:
null:javax.xml.transform.TransformerFactoryConfigurationError: Provider net.sf.saxon.TransformerFactoryImpl not found
even though the log before that says it is adding the .jars to the class loader.

I start Solr (in the example dir) with:
java -jar start.jar -Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl


I have set the saxon-jars in the CLASSPATH.

I also tried to put various versions of the saxon-jars into
./example/lib or
./example/solr/lib

and to edit ./example/solr/collection1/solrconfig.xml

  
and:
  


What am I doing wrong?


Any hint appreciated

best,
Matej


Re: configure saxon in 4.x

2012-09-29 Thread tech.vronk


There are certainly multiple ways,
but I have now found at least one:

in the 4.x package, the configuration of Jetty's start.jar
expects the extension libraries in
lib/ext/*

(in 3.6 everything under lib/** was considered)
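
A small Python helper along these lines could put the jars in place. This is only a sketch: the jar names and the example directory passed in are placeholders, and nothing here is specific to Saxon beyond copying files into `lib/ext`.

```python
import shutil
from pathlib import Path


def install_ext_jars(jar_paths, example_dir):
    """Copy jars into <example_dir>/lib/ext and return the new paths."""
    ext_dir = Path(example_dir) / "lib" / "ext"
    ext_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(jar, ext_dir)) for jar in jar_paths]


# Usage (placeholder file name, adjust to the jars actually in use):
# install_ext_jars(["/path/to/saxon.jar"], "./example")
```

After that, Solr is started from the example dir with the same -Djavax.xml.transform.TransformerFactory option as before.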

Hope this saves hours of desperation
for other lost souls (should there be any)
unable to get their configuration right.

best,
matej








mapping values in fields

2012-10-02 Thread tech.vronk

Hi,

I am trying to map values from one field to other values in another field.
For example:
original_field: orig_value1
mapped_field: mapped_value1

with the help of an explicitly defined (N:1) mapping:
orig_value1 => mapped_value1
orig_value2 => mapped_value1
orig_value3 => mapped_value2

I have tried to use SynonymFilterFactory
for the mapped_field:


  
  
  <filter class="solr.SynonymFilterFactory" synonyms="region-map.txt" ignoreCase="true" expand="true"/>

  

combined with:



Now, a search for
 mapped_field:mapped_value1
yields results;
however, mapped_value1 does not appear in the result at all.
Instead, orig_value1 appears in the mapped_field as well.

How can I get the mapped value to appear in the result as well?


thank you,

matej


Re: mapping values in fields

2012-10-02 Thread tech.vronk


the query is:
  mapped_field:mapped_value1

and seems to correctly return the documents.

the mapped_field has the attribute stored=true
and also appears in the result (even without requesting it explicitly
with fl),

just with orig_value1 instead of mapped_value1

matej

On 02.10.2012 15:46, Erick Erickson wrote:

What's the query you send? I'm guessing a bit here since you
haven't included it, but try ensuring two things:

1> your mapped_field has 'stored="true"'
2> you specify (either in your request handler or on the URL) fl=mapped_value

Best
Erick







Re: mapping values in fields

2012-10-03 Thread tech.vronk


Thank you Erick, that would explain the behaviour.

But it leaves me with the question:
when are the analyzers/filters applied,
if not during indexing?

I mean, if I define a lowercase filter on a field
and it is not applied to the input stream before storing,
so that lowercased text ends up stored in the index,
at what point is it applied then?
The only other point I can think of is on the fly, during the search,
but that sounds terribly inefficient
(doing the transformations every time?).

I would be grateful if anybody could shed some light on this for me.

best,
matej



On 02.10.2012 16:58, Erick Erickson wrote:

Ah, I get it (finally). OK, there's no good way to do what
you want that I know of. The problem is that the
stored="true" takes effect long before any transformations
are applied, and is always the raw input. You effectively
want to chain the fields together, i.e. apply the analysis
chain _then_ have the copyfield take effect which is not
supported.
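
The distinction can be illustrated with a toy model in Python. This is not how Lucene is implemented, only a sketch of the behaviour described above: the analysis chain runs at index time (and on the query), but it only feeds the searchable terms, while the stored value is a verbatim copy of the input. `ToyField`, `SYNONYMS` and `analyzer` are made-up names.

```python
class ToyField:
    """Toy field keeping a raw stored copy next to the analyzed index terms."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.stored = {}    # doc id -> raw input, returned in results
        self.postings = {}  # analyzed term -> set of doc ids, used for search

    def add(self, doc_id, raw_value):
        self.stored[doc_id] = raw_value            # no analysis applied here
        for term in self.analyzer(raw_value):      # analysis at index time
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, query):
        hits = set()
        for term in self.analyzer(query):          # same analysis on the query
            hits |= self.postings.get(term, set())
        return [(doc_id, self.stored[doc_id]) for doc_id in sorted(hits)]


SYNONYMS = {"orig_value1": "mapped_value1"}


def analyzer(text):
    """Lowercase, split on whitespace, then apply the synonym mapping."""
    return [SYNONYMS.get(tok, tok) for tok in text.lower().split()]


field = ToyField(analyzer)
field.add("doc1", "orig_value1")
print(field.search("mapped_value1"))  # [('doc1', 'orig_value1')]
```

The search for mapped_value1 finds the document, yet the value returned is still the raw orig_value1, which matches the behaviour reported in this thread.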

I don't know how to accomplish this off the top of my head
OOB, I'd guess your client would have to manage the
substitutions and then just index separate fields...
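
A client-side substitution of that kind might be sketched as follows in Python. The mapping lines and field names are the ones from the original question; the case-insensitive lookup mirrors ignoreCase="true" and is an assumption about the desired behaviour, as are the function names.

```python
MAPPING_TEXT = """
orig_value1 => mapped_value1
orig_value2 => mapped_value1
orig_value3 => mapped_value2
"""


def parse_mapping(text):
    """Parse 'source => target' lines into a dict keyed by lowercased source."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = (part.strip() for part in line.split("=>", 1))
        mapping[source.lower()] = target
    return mapping


def add_mapped_field(doc, mapping, src="original_field", dst="mapped_field"):
    """Return a copy of doc with dst set to the mapped value of src, if any."""
    out = dict(doc)
    value = doc.get(src)
    if value is not None and value.lower() in mapping:
        out[dst] = mapping[value.lower()]
    return out


mapping = parse_mapping(MAPPING_TEXT)
print(add_mapped_field({"original_field": "orig_value2"}, mapping))
# {'original_field': 'orig_value2', 'mapped_field': 'mapped_value1'}
```

The documents would then be sent to Solr with mapped_field already filled in, so the stored value is the mapped one.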

Best
Erick
