Hi,
Just a quick note to mention that I finally figured (most of) this out.
The short version is that if a field type has an explicit "index" analyzer
(as in type="index") but no corresponding "query" analyzer, then Solr
appears to use the index analyzer for both indexing and querying.
I guess this makes sense but it's a bit confusing so if I get a few
minutes I will update the wiki to make the distinction explicit.
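To make the distinction concrete, here is a rough Python sketch of what seems to happen when one pattern tokenizer handles both the indexed value and the query text. The `tokenize` helper is hypothetical and only approximates `solr.PatternTokenizerFactory` with `group="1"` using Python's `re` module. The pattern is the namespace one from the original message below, with the Java-style class `[a-zA-Z[0-9]]` rewritten as `[a-zA-Z0-9]`.

```python
import re

# Namespace pattern from the original message, adapted for Python's re module
# (Java's nested class [a-zA-Z[0-9]] is written here as [a-zA-Z0-9]).
NAMESPACE_PATTERN = re.compile(r"([a-zA-Z0-9](?:\w+)?):.+")


def tokenize(pattern, text, group=1):
    """Hypothetical stand-in for a pattern tokenizer with a group attribute:
    emit the chosen capture group of every match, skipping empty ones."""
    return [m.group(group) for m in pattern.finditer(text) if m.group(group)]


# Index time: the full machine tag goes through the tokenizer.
index_tokens = tokenize(NAMESPACE_PATTERN, "foo:bar=example")

# Query time with the same analyzer: only the bare namespace is passed in,
# so the pattern (which requires a colon) finds nothing to emit.
query_tokens = tokenize(NAMESPACE_PATTERN, "foo")

print(index_tokens)  # ['foo']
print(query_tokens)  # []
```

Under that model, a query for the bare namespace can only match if the query side uses an analyzer that leaves `foo` intact.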
The longer version is over here, for anyone interested:
http://github.com/straup/solr-machinetags
Beyond that, I have a couple more questions:
# All the questions assume the following schema.xml:
# http://github.com/straup/solr-machinetags/blob/master/conf/schema.xml
Because all the values for a given namespace/predicate field get indexed
in the same multiValued bucket, the faceting doesn't necessarily behave
the way you'd expect. For example, if you index the following...
solr.add([{ 'id' : int(time.time()), 'body' : 'float thing',
            'tag' : 'w,t,f', 'machinetag' : 'dc:number=12345' }])
solr.add([{ 'id' : int(time.time()), 'body' : 'decimal thing',
            'tag' : 'a,b,c', 'machinetag' : 'dc:number=123.23' }])
solr.add([{ 'id' : int(time.time()), 'body' : 'negative thing',
            'tag' : 'a,b,c',
            'machinetag' : ['dc:number=-45.23', 'asc:test=rara'] }])
...and then facet on the predicates for ?q=ns:dc (basically to ask: show
me all the predicates for the "dc:" namespace) you end up with...
"facet_fields":{
"ns":[
"asc",1,
"dc",1]},
...which seems right from a Solr perspective but isn't really a correct
representation of the machine tags.
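A small Python sketch of why "asc" shows up in a facet for ?q=ns:dc. It assumes facet counts are computed per matching document over every value in the multiValued field, which is how the output above reads. The `facet_counts` helper and the `docs` list are illustrative only, and the absolute counts depend on how many of the documents actually end up in the index.

```python
from collections import Counter

# Namespaces per document, as they would land in one multiValued field.
# The third document carries tags from two namespaces.
docs = [
    {"body": "float thing", "ns": ["dc"]},
    {"body": "decimal thing", "ns": ["dc"]},
    {"body": "negative thing", "ns": ["dc", "asc"]},
]


def facet_counts(documents, field, query_value):
    """Count, per field value, the matching documents that contain it."""
    counts = Counter()
    for doc in documents:
        if query_value in doc[field]:
            # Every value in the matching document is counted, not just
            # the one that satisfied the query.
            counts.update(set(doc[field]))
    return dict(counts)


counts = facet_counts(docs, "ns", "dc")
print(counts)
```

The "asc" entry comes entirely from the third document: it matches on "dc", and once it matches, all of its namespaces are counted.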
Can anyone offer any ideas on a better/different way to model this data?
Also, has anyone figured out how to match on double quotes inside a
regular expression defined in an XML attribute?
As in:
<tokenizer class="solr.PatternTokenizerFactory"
pattern="^(?:(?:[a-zA-Z]|\d)(?:\w+)?)\:(?:(?:[a-zA-Z]|\d)(?:\w+)?)=(.+)"
group="1" />
Where that pattern should really end:
=\"?(.+)\"?$
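One possibility, not tested against Solr here, is the standard XML entity `&quot;`, which an XML parser decodes to a literal double quote before the attribute value is used. The sketch below checks that with Python's `xml.etree.ElementTree` and `re` rather than with Solr itself. The `SNIPPET` element is hypothetical, and the value group is changed to a lazy `(.+?)` as an assumption, since a greedy `(.+)` would also capture the trailing quote.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical tokenizer element: the double quotes inside the attribute
# value are written as &quot;, and the value group is made lazy (.+?) so a
# trailing quote is not swallowed by it.
SNIPPET = (
    '<tokenizer class="solr.PatternTokenizerFactory" '
    r'pattern="^(?:(?:[a-zA-Z]|\d)(?:\w+)?)\:(?:(?:[a-zA-Z]|\d)(?:\w+)?)'
    r'=&quot;?(.+?)&quot;?$" group="1" />'
)

# The XML parser decodes &quot; back into a literal double quote.
pattern_text = ET.fromstring(SNIPPET).get("pattern")
print(pattern_text)

value_pattern = re.compile(pattern_text)
quoted = value_pattern.match('dc:title="hello world"').group(1)
unquoted = value_pattern.match("dc:number=12345").group(1)
print(quoted)    # hello world
print(unquoted)  # 12345
```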
Thanks,
-------- Original Message --------
Subject: machine tags, copy fields and pattern tokenizers
Date: Mon, 25 Jan 2010 16:20:58 -0800
From: straup <str...@gmail.com>
Reply-To: str...@gmail.com
To: solr-user@lucene.apache.org
Hi,
I am trying to work out how to store, query and facet machine tags [1]
in Solr using a combination of copy fields and pattern tokenizer factories.
I am still relatively new to Solr so despite feeling like I've gone over
the docs, and friends, it's entirely possible I've missed something
glaringly obvious.
The short version is: Faceting works. Yay! You can facet on the
individual parts of a machine tag (namespace, predicate, value) and it
does what you'd expect. For example:
?q=*:*&facet=true&facet.field=mt_namespace&rows=0
numFound:115
foo:65
dc:48
lastfm:2
The longer version is: Even though faceting seems to work I can't query
(as in ?q=) on the individual fields.
For example, if a single "machinetag" (foo:bar=example) field is copied
to "mt_namespace", "mt_predicate" and "mt_value" fields I still can't
query for "?q=mt_namespace:foo".
It appears as though the entire machine tag is being copied to
mt_namespace even though my reading of the docs is that if a "group"
attribute is present in a solr.PatternTokenizerFactory analyzer then only
the matching capture group will be stored.
Is that incorrect?
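As a sanity check on the patterns themselves, here is a minimal Python sketch of what each one captures from a full machine tag. It uses Python's `re` module as a stand-in for the tokenizer, so it says nothing about how Solr applies the analyzer. The `extract` helper is hypothetical, and the Java-style class `[a-zA-Z[0-9]]` from the fieldType definitions in [2] is rewritten as `[a-zA-Z0-9]`.

```python
import re

# The three patterns from the fieldType definitions, adapted for Python's re
# module: Java's nested class [a-zA-Z[0-9]] is written as [a-zA-Z0-9].
PATTERNS = {
    "mt_namespace": r"([a-zA-Z0-9](?:\w+)?):.+",
    "mt_predicate": r"[a-zA-Z0-9](?:\w+)?:([a-zA-Z0-9](?:\w+)?)=.+",
    "mt_value": r"[a-zA-Z0-9](?:\w+)?:[a-zA-Z0-9](?:\w+)?=(.+)",
}


def extract(field, machine_tag, group=1):
    """Return the capture group for each match, roughly as group="1" would."""
    return [m.group(group) for m in re.finditer(PATTERNS[field], machine_tag)]


parts = {field: extract(field, "foo:bar=example") for field in PATTERNS}
print(parts)
# {'mt_namespace': ['foo'], 'mt_predicate': ['bar'], 'mt_value': ['example']}
```

So the patterns do isolate the intended part when they are handed a complete machine tag.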
I've included the field/fieldType definitions I'm using below. [2] Any
help/suggestions would be appreciated.
Cheers,
[1] http://www.flickr.com/groups/api/discuss/72157594497877875/
[2]
<field name="machine_tags" type="machinetag" indexed="true"
stored="true" required="false" multiValued="true"/>
<field name="mt_namespace" type="mt_namespace" indexed="true"
stored="true" required="false" multiValued="true" />
<field name="mt_predicate" type="mt_predicate" indexed="true"
stored="true" required="false" multiValued="true" />
<field name="mt_value" type="mt_value" indexed="true" stored="true"
required="false" multiValued="true" />
<copyField source="machine_tags" dest="mt_namespace" />
<copyField source="machine_tags" dest="mt_predicate" />
<copyField source="machine_tags" dest="mt_value" />
<fieldType name="machinetag" class="solr.TextField" />
<fieldType name="mt_namespace" class="solr.TextField">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory"
pattern="([a-zA-Z[0-9]](?:\w+)?):.+" group="1" />
</analyzer>
</fieldType>
<fieldType name="mt_predicate" class="solr.TextField">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory"
pattern="[a-zA-Z[0-9]](?:\w+)?:([a-zA-Z[0-9]](?:\w+)?)=.+" group="1" />
</analyzer>
</fieldType>
<fieldType name="mt_value" class="solr.TextField">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory"
pattern="[a-zA-Z[0-9]](?:\w+)?:[a-zA-Z[0-9]](?:\w+)?=(.+)" group="1" />
</analyzer>
</fieldType>