One downside of doing joins is that it makes it pretty hard to
distribute/federate in the future because a document doesn't stand
alone.

A flat structure for tagging could add a taguser and a tag field to
the actual document each time a user tags it.

- all collected objects
facet.query=tag:*
- all objects collected by erikhatcher
facet.query=taguser:erikhatcher
- all collected objects with tag "foo"
facet.query=tag:foo
- facet by tag
facet.field=tag
- filter query results by a constraint tag=foo
fq=tag:foo
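
As a rough sketch, the parameters listed above could be assembled on the client side like this. The helper name flat_tag_params and the main query "hamlet" are made up for illustration; only the taguser/tag fields and the facet.query, facet.field and fq parameters come from the list above.

```python
from urllib.parse import urlencode

def flat_tag_params(q, user=None, tag=None, constrain_tag=None):
    """Hypothetical helper: request parameters for the flat tagging scheme."""
    params = [
        ("q", q),
        ("facet", "true"),
        ("facet.field", "tag"),     # facet by tag
        ("facet.query", "tag:*"),   # all collected objects
    ]
    if user:
        # all objects collected by one user
        params.append(("facet.query", f"taguser:{user}"))
    if tag:
        # all collected objects carrying one tag
        params.append(("facet.query", f"tag:{tag}"))
    if constrain_tag:
        # restrict the result set itself, not just the counts
        params.append(("fq", f"tag:{constrain_tag}"))
    return params

params = flat_tag_params("hamlet", user="erikhatcher", tag="foo", constrain_tag="foo")
query_string = urlencode(params)  # repeated keys such as facet.query are kept
print(query_string)
```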

You wouldn't be able to query for:
- total number of tags
- items with the largest number of tags
- a tag by a specific user... that would require something like a
phrase match across fields.
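
To illustrate the last point with a toy example (the document and the matches() helper are invented, not Solr code): once two users have tagged the same document, the pairing between user and tag is gone, so ANDing the two fields can match a combination nobody actually entered.

```python
# A flat document after two users tagged it (hypothetical data):
# erikhatcher applied "foo" and yonik applied "bar".
doc = {
    "id": "doc1",
    "taguser": ["erikhatcher", "yonik"],
    "tag": ["foo", "bar"],
}

def matches(doc, field, value):
    """Toy term match against a multi-valued field."""
    return value in doc[field]

# The equivalent of taguser:erikhatcher AND tag:bar matches,
# even though erikhatcher never used the tag "bar".
false_positive = matches(doc, "taguser", "erikhatcher") and matches(doc, "tag", "bar")
print(false_positive)
```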

Downsides of a flat structure:
- you need to reindex the whole document, or have updateable documents
- even with updateable documents, it could be costly to update
 (if people's tagging rate is fairly low, this may not matter much)

--- Separate tag or collectible objects ---
- all collected objects
The count of all tagged objects?  How would you do this?
- all objects collected by erikhatcher
facet.query=C_user:erikhatcher
- all collected objects with tag "foo"
facet.query=C_tag:foo

- facet by tag
facet.field=C_tag   (this would give counts of *tags*, not documents)

- filter query results by a constraint tag=foo
Not currently doable; we would need to build up a filter somehow...
indirectFilter=id:((C_tag:foo).C_uid)

If an indirect approach has enough advantages, we could perhaps come
up with a way to express it.
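
The indirectFilter syntax above is made up, not an existing parameter. A minimal in-memory sketch of what it would have to do, with invented data and a hypothetical indirect_filter() function, assuming the C docs carry C_uid, C_user and C_tag fields:

```python
# Hypothetical in-memory stand-in for the index:
# "A" docs are archived objects, "C" docs are collectibles.
a_docs = [{"id": "uri:1"}, {"id": "uri:2"}, {"id": "uri:3"}]
c_docs = [
    {"C_uid": "uri:1", "C_user": "erikhatcher", "C_tag": ["foo"]},
    {"C_uid": "uri:3", "C_user": "yonik", "C_tag": ["foo", "bar"]},
    {"C_uid": "uri:2", "C_user": "yonik", "C_tag": ["bar"]},
]

def indirect_filter(a_docs, c_docs, tag):
    # Step 1: run the inner query (C_tag:<tag>) and collect the C_uid values.
    uids = {c["C_uid"] for c in c_docs if tag in c["C_tag"]}
    # Step 2: use those values as a filter on the id field of the A docs.
    return [a for a in a_docs if a["id"] in uids]

filtered_ids = [a["id"] for a in indirect_filter(a_docs, c_docs, "foo")]
print(filtered_ids)
```

The two steps are the cost of the indirect approach: the filter on the A docs cannot be built until the inner query on the C docs has been evaluated.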

My custom facet cache differs from the built-in facets
in that it builds a cross-reference cache from the "C" types to the
"A" types (a JOIN, heh).

What does the cross-reference cache look like when it's built?  A simple int[]?
To do this more efficiently, it seems like one would want separate indices
for the A and C docs to keep maxDoc() down.
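
For concreteness, here is my guess at what an int[]-style cross-reference might look like. This is a toy with invented doc numbers, not the actual Collex cache. The array is sized by max_doc, which is why sharing one index between A and C docs makes it larger than it needs to be.

```python
from array import array

# Hypothetical doc numbers: docs 0-2 are "A" docs, docs 3-6 are "C" docs.
a_doc_by_uri = {"uri:1": 0, "uri:2": 1, "uri:3": 2}
c_docs = {
    3: {"C_uid": "uri:1", "C_user": "erikhatcher"},
    4: {"C_uid": "uri:3", "C_user": "erikhatcher"},
    5: {"C_uid": "uri:3", "C_user": "yonik"},
    6: {"C_uid": "uri:9", "C_user": "yonik"},  # points at an object not in the index
}
max_doc = 7

# xref[c_doc] = doc number of the collected A doc, or -1 if there is none.
# Slots for A docs stay at -1 and are wasted when both types share one index.
xref = array("i", [-1] * max_doc)
for c_doc, fields in c_docs.items():
    xref[c_doc] = a_doc_by_uri.get(fields["C_uid"], -1)

# Facet by user: count the distinct A docs each user has collected.
collected = {}
for c_doc, fields in c_docs.items():
    if xref[c_doc] >= 0:
        collected.setdefault(fields["C_user"], set()).add(xref[c_doc])
user_counts = {user: len(docs) for user, docs in collected.items()}
print(user_counts)
```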

What's the id for the C docs?  The user concatenated with the id of the
collected doc, so that all tags/comments for a particular user on a
particular doc go in the same C doc?

-Yonik


On 2/2/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
Before Solr had facets, I built my own implementation in a much
cruder and less performant way into Collex as custom request handlers.

Now the performance issue of warming up the cache needs to be
addressed.  I'm going to upgrade Solr and adjust the application to
work with the built-in faceting and see how far I get with that.  The
dilemma is that I've got a couple of custom things that don't map to
the built-in faceting and I'm looking for advice on how to proceed.

The index has a "type" field: "A" for archived objects and "C" for
collectibles.  All the original objects are indexed in batch fashion
as type "A".  Users collect objects and tag/annotate them.  When a
user collects an object, a document of type "C" is indexed with the
original object's unique identifier (a URI), the username, tags, and
annotation.  My custom facet cache differs from the built-in facets
in that it builds a cross-reference cache from the "C" types to the
"A" types (a JOIN, heh).

We can do queries that return facet counts such as:

   - all collected objects
   - all objects collected by erikhatcher
   - all collected objects with tag "foo"

One of the facet counts returned is user, so you can easily see how
many objects each user has collected.

For the basic faceting we do on object metadata, this will fit well
with what Solr has built-in, but I'm not quite sure how to build in
the cross-reference and leverage faster warming, so I'm asking here
to see what thoughts folks have on how to proceed.
