Fwd: machine tags, copy fields and pattern tokenizers

straup Mon, 01 Feb 2010 08:26:52 -0800

Hi,

Just a quick note to mention that I finally figured (most of) this out.

The short version is that if there's an explicit "index" analyzer (as in type="index") but not a corresponding "query" analyzer then Solr appears to use the first for all cases.

I guess this makes sense but it's a bit confusing so if I get a few minutes I will update the wiki to make the distinction explicit.


The longer version is over here, for anyone interested:

        http://github.com/straup/solr-machinetags

The long version is me asking a couple more questions:

# All the questions assume the following schema.xml:
# http://github.com/straup/solr-machinetags/blob/master/conf/schema.xml

Because all the values for a given namespace/predicate field get indexed in the same multiValued bucket, the faceting doesn't behave the way you'd necessarily expect. For example, if you index the following...

solr.add([{ 'id' : int(time.time()), 'body' : 'float thing', 'tag' : 'w,t,f', 'machinetag' : 'dc:number=12345' }])

solr.add([{ 'id' : int(time.time()), 'body' : 'decimal thing', 'tag' : 'a,b,c', 'machinetag' : 'dc:number=123.23' }])

solr.add([{ 'id' : int(time.time()), 'body' : 'negative thing', 'tag' : 'a,b,c', 'machinetag' : ['dc:number=-45.23', 'asc:test=rara'] }])

...and then facet on the predicates for ?q=ns:dc (basically to ask: show me all the predicates for the "dc:" namespace) you end up with...


  "facet_fields":{
    "ns":[
      "asc",1,
      "dc",1]},

...which seems right from a Solr perspective but isn't really a correct representation of the machine tags.
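
To make the mismatch concrete, here is a toy model in plain Python of how counting behaves once a document's tags are flattened into shared multi-valued fields. It is only a sketch under stated assumptions: the function names are hypothetical, it is not how Solr computes facets internally, and it does not try to reproduce the exact response shown above.

```python
from collections import Counter

# The three example documents from above, reduced to their machine tags.
docs = [
    {"body": "float thing", "machinetag": ["dc:number=12345"]},
    {"body": "decimal thing", "machinetag": ["dc:number=123.23"]},
    {"body": "negative thing", "machinetag": ["dc:number=-45.23", "asc:test=rara"]},
]

def split_tag(tag):
    """Split 'namespace:predicate=value' into its three parts."""
    ns, rest = tag.split(":", 1)
    pred, value = rest.split("=", 1)
    return ns, pred, value

def facet_predicates_per_document(docs, namespace):
    """Document-level counting: any document with a tag in `namespace`
    contributes ALL of its predicates, whatever namespace they belong to."""
    counts = Counter()
    for doc in docs:
        parts = [split_tag(t) for t in doc["machinetag"]]
        if any(ns == namespace for ns, _, _ in parts):
            counts.update({pred for _, pred, _ in parts})
    return counts

def facet_predicates_per_tag(docs, namespace):
    """Tag-level counting: only predicates that actually sit in `namespace`."""
    counts = Counter()
    for doc in docs:
        for tag in doc["machinetag"]:
            ns, pred, _ = split_tag(tag)
            if ns == namespace:
                counts[pred] += 1
    return counts

print(facet_predicates_per_document(docs, "dc"))  # Counter({'number': 3, 'test': 1})
print(facet_predicates_per_tag(docs, "dc"))       # Counter({'number': 3})
```

The "test" predicate leaks into the first result because it lives on a document that also carries a "dc:" tag; the link between a namespace and its own predicate is lost once both are separate multi-valued fields.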


Can anyone offer any ideas on a better/different way to model this data?

Also, has anyone figured out how to match on double quotes inside a regular expression defined in an XML attribute?


As in:

<tokenizer class="solr.PatternTokenizerFactory" pattern="^(?:(?:[a-zA-Z]|\d)(?:\w+)?)\:(?:(?:[a-zA-Z]|\d)(?:\w+)?)=(.+)" group="1" />

Where that pattern should really end:

        =\"?(.+)\"?$
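
One possible approach, sketched here rather than tested against Solr: in XML, a literal double quote inside a double-quoted attribute can be written as the predefined entity &quot;, and the parser hands the unescaped character to the application. The snippet below uses Python's ElementTree to show the escaping round trip and Python's re module as a stand-in for Java's regex engine. Note one assumption that differs from the ending above: the value group is made lazy, (.+?), so that a trailing quote is not swallowed by the capture.

```python
import re
import xml.etree.ElementTree as ET

# Same pattern as above, ending in ="?(.+?)"?$ (lazy group is an assumption).
PATTERN = r'^(?:(?:[a-zA-Z]|\d)(?:\w+)?)\:(?:(?:[a-zA-Z]|\d)(?:\w+)?)="?(.+?)"?$'

# Serialize a tokenizer element; ElementTree escapes the quotes as &quot;.
tokenizer = ET.Element("tokenizer", {
    "class": "solr.PatternTokenizerFactory",
    "pattern": PATTERN,
    "group": "1",
})
xml_text = ET.tostring(tokenizer, encoding="unicode")
print(xml_text)

# Parsing the XML back yields the original pattern with literal quotes.
assert ET.fromstring(xml_text).get("pattern") == PATTERN

def value_of(machinetag):
    """Return the value part of a machine tag, minus optional wrapping quotes."""
    match = re.match(PATTERN, machinetag)
    return match.group(1) if match else None

print(value_of('dc:title="hello world"'))  # hello world
print(value_of("dc:number=-45.23"))        # -45.23
```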

Thanks,

-------- Original Message --------
Subject: machine tags, copy fields and pattern tokenizers
Date: Mon, 25 Jan 2010 16:20:58 -0800
From: straup <[email protected]>
Reply-To: [email protected]
To: [email protected]

Hi,

I am trying to work out how to store, query and facet machine tags [1]
in Solr using a combination of copy fields and pattern tokenizer factories.

I am still relatively new to Solr so despite feeling like I've gone over
the docs, and friends, it's entirely possible I've missed something
glaringly obvious.

The short version is: Faceting works. Yay! You can facet on the
individual parts of a machine tag (namespace, predicate, value) and it
does what you'd expect. For example:

?q=*:*&facet=true&facet.field=mt_namespace&rows=0

numFound:115
foo:65
dc:48
lastfm:2
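
A minimal sketch of building that same facet request from a script, for anyone who wants to reproduce it. The base URL and the function name are assumptions for illustration (they are not taken from the setup described here), and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust to wherever your Solr instance lives.
SOLR_SELECT = "http://localhost:8983/solr/select"

def facet_query_url(field, query="*:*", rows=0):
    """Build a URL that facets on `field` over the documents matching `query`."""
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.field", field),
        ("rows", str(rows)),  # rows=0: only the counts are wanted, not the docs
    ]
    return SOLR_SELECT + "?" + urlencode(params)

url = facet_query_url("mt_namespace")
print(url)
```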

The longer version is: Even though faceting seems to work I can't query
(as in ?q=) on the individual fields.

For example, if a single "machinetag" (foo:bar=example) field is copied
to "mt_namespace", "mt_predicate" and "mt_value" fields I still can't
query for "?q=mt_namespace:foo".

It appears as though the entire machine tag is being copied to
mt_namespace even though my reading of the docs is that if a "group"
attribute is present in a solr.PatternTokenizerFactory analyzer then only
the matching capture group will be stored.

Is that incorrect?

I've included the field/fieldType definitions I'm using below. [2] Any
help/suggestions would be appreciated.

Cheers,

[1] http://www.flickr.com/groups/api/discuss/72157594497877875/

[2]

<field name="machine_tags" type="machinetag" indexed="true"
stored="true" required="false" multiValued="true"/>

<field name="mt_namespace" type="mt_namespace" indexed="true"
stored="true" required="false" multiValued="true" />

<field name="mt_predicate" type="mt_predicate" indexed="true"
stored="true" required="false" multiValued="true" />

<field name="mt_value" type="mt_value" indexed="true" stored="true"
required="false" multiValued="true" />

<copyField source="machine_tags" dest="mt_namespace" />
<copyField source="machine_tags" dest="mt_predicate" />
<copyField source="machine_tags" dest="mt_value" />

<fieldType name="machinetag" class="solr.TextField" />

<fieldType name="mt_namespace" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory"
pattern="([a-zA-Z[0-9]](?:\w+)?):.+" group="1" />
   </analyzer>
</fieldType>

<fieldType name="mt_predicate" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory"
pattern="[a-zA-Z[0-9]](?:\w+)?:([a-zA-Z[0-9]](?:\w+)?)=.+" group="1" />
  </analyzer>
</fieldType>

<fieldType name="mt_value" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory"
pattern="[a-zA-Z[0-9]](?:\w+)?:[a-zA-Z[0-9]](?:\w+)?=(.+)" group="1" />
  </analyzer>
</fieldType>
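
The capture groups those three patterns are meant to extract can be checked outside Solr. The sketch below is a rough model only: it uses Python's re module instead of Java's regex engine, rewrites the Java-style nested class [a-zA-Z[0-9]] as [a-zA-Z0-9], and assumes a pattern tokenizer with group="1" emits group 1 of each successive match. The tokens helper is hypothetical.

```python
import re

# Python-compatible versions of the three patterns above.
PATTERNS = {
    "mt_namespace": r"([a-zA-Z0-9](?:\w+)?):.+",
    "mt_predicate": r"[a-zA-Z0-9](?:\w+)?:([a-zA-Z0-9](?:\w+)?)=.+",
    "mt_value": r"[a-zA-Z0-9](?:\w+)?:[a-zA-Z0-9](?:\w+)?=(.+)",
}

def tokens(field, text, group=1):
    """Return the chosen capture group for every match of the field's pattern."""
    return [m.group(group) for m in re.finditer(PATTERNS[field], text)]

for field in PATTERNS:
    print(field, tokens(field, "foo:bar=example"))
# mt_namespace ['foo']
# mt_predicate ['bar']
# mt_value ['example']

# A bare term such as "foo" does not match the namespace pattern at all,
# so running the same analysis on a query term would yield no tokens.
print(tokens("mt_namespace", "foo"))  # []
```

That last case may be worth keeping in mind when the same analyzer is applied to both indexed values and query terms.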
