On Feb 2, 2007, at 4:29 PM, Yonik Seeley wrote:
One downside of doing joins is that it makes it pretty hard to
distribute/federate in the future because a document doesn't stand
alone.
The connection between objects is key in our library domain though.

A flat structure for tagging could be to add a
taguser and tag field to the actual document each time a user tagged a document.
I've been contemplating how that would look and work.  But the  
downsides you mention are sorta show-stoppers for our needs:
- filter query results by a constraint tag=foo
fq=tag:foo

You wouldn't be able to query for:
- total number of tags
Couldn't that be the term frequency information?

- items with the largest number of tags
tag frequency is very important, but having a tag field would give us  
frequency per tag term.  So I don't see this as a problem.
- a tag by a specific user... that would require something like a
phrase match across fields.
This is necessary too.  The Collex sidebar allows you to see all  
objects tagged as "foo" by a specific user.
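A rough sketch of why the flat layout cannot answer "tagged foo by this user" without something like a phrase match across fields. The document contents and the positional pairing of the two fields are assumptions made up for illustration; plain arrays stand in for multi-valued index fields.

```ruby
# Hypothetical flat document: every tagging appends one value to each field.
# Here erikhatcher tagged it "foo" and yonik tagged it "bar".
FLAT_DOC = {
  'id'      => 'doc1',
  'taguser' => ['erikhatcher', 'yonik'],
  'tag'     => ['foo', 'bar']
}.freeze

# Models a boolean query such as taguser:USER AND tag:TAG, where each
# clause is evaluated against its own field independently.
def flat_match?(doc, user, tag)
  doc['taguser'].include?(user) && doc['tag'].include?(tag)
end

# Models a match that respects the user/tag pairing. It assumes the two
# fields stay positionally aligned, which a real index may not guarantee.
def paired_match?(doc, user, tag)
  doc['taguser'].zip(doc['tag']).include?([user, tag])
end

puts flat_match?(FLAT_DOC, 'erikhatcher', 'bar')   # matches, although erikhatcher never used "bar"
puts paired_match?(FLAT_DOC, 'erikhatcher', 'bar') # does not match
```

The independent per-field check reports a hit for a user/tag combination that never happened, which is the false positive the sidebar query has to avoid.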
Downsides of a flat structure:
- you need to reindex the whole document, or have updateable documents
- even with updateable documents, it could be costly to update
 (if people's tagging rate is fairly low, this may not matter much)
I figured this use case would lend itself well to updateable docs,  
though I've not yet visualized how this would work entirely.
--- Separate tag or collectible objects ---
   - all collected objects
The count of all tagged objects?  How would you do this?
   - all objects collected by erikhatcher
facet.query=C_user:erikhatcher
   - all collected objects with tag "foo"
facet.query=C_tag:foo
The "all objects tagged 'foo' by erikhatcher" query is the holy grail, eh?

- facet by tag
facet.field=C_tag   (this would give counts of *tags* not documents)
These are important numbers too.  But object count per tag is the ideal.
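To make the difference between the two numbers concrete, here is a small sketch that counts taggings per tag next to distinct objects per tag. The C docs below are invented, and C_uid is assumed to hold the id of the collected object; hashes stand in for index documents.

```ruby
require 'set'

# Hypothetical C docs: one per (object, user) pair.
SAMPLE_C_DOCS = [
  { 'C_uid' => 'a1', 'C_user' => 'erikhatcher', 'C_tag' => ['foo'] },
  { 'C_uid' => 'a1', 'C_user' => 'yonik',       'C_tag' => ['foo', 'bar'] },
  { 'C_uid' => 'a2', 'C_user' => 'erikhatcher', 'C_tag' => ['foo'] }
].freeze

# Roughly what faceting on C_tag would report: the number of C docs per tag.
def tagging_counts(c_docs)
  counts = Hash.new(0)
  c_docs.each { |doc| doc['C_tag'].each { |tag| counts[tag] += 1 } }
  counts
end

# The ideal number: distinct collected objects per tag.
def object_counts(c_docs)
  objects = Hash.new { |hash, tag| hash[tag] = Set.new }
  c_docs.each { |doc| doc['C_tag'].each { |tag| objects[tag] << doc['C_uid'] } }
  objects.map { |tag, ids| [tag, ids.size] }.to_h
end

p tagging_counts(SAMPLE_C_DOCS)
p object_counts(SAMPLE_C_DOCS)
```

Object a1 is tagged "foo" by two users, so it contributes two taggings but only one object to the "foo" count.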

- filter query results by a constraint tag=foo
Not currently doable, would need to build up a filter somehow...
indirectFilter=id:((C_tag:foo).C_uid)

If an indirect approach has enough advantages, we could perhaps come
up with a way to express it.
I like it!
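One way to read the proposed indirectFilter syntax is as a two-step lookup: run the inner query over the C docs, collect their C_uid values, then keep the docs whose id is in that set. The sketch below models only that reading with in-memory hashes; the sample docs are invented and nothing here is an existing Solr feature.

```ruby
require 'set'

# Hypothetical mini index: "A" docs are library objects, "C" docs are
# collectibles pointing at an A doc through C_uid.
INDEX_DOCS = [
  { 'id' => 'a1', 'type' => 'A' },
  { 'id' => 'a2', 'type' => 'A' },
  { 'id' => 'a3', 'type' => 'A' },
  { 'id' => 'a1/erikhatcher', 'type' => 'C', 'C_uid' => 'a1', 'C_tag' => ['foo'] },
  { 'id' => 'a2/erikhatcher', 'type' => 'C', 'C_uid' => 'a2', 'C_tag' => ['bar'] },
  { 'id' => 'a3/yonik',       'type' => 'C', 'C_uid' => 'a3', 'C_tag' => ['foo', 'bar'] }
].freeze

# Models id:((C_tag:TAG).C_uid)
def indirect_filter(docs, tag)
  # Step 1: inner query C_tag:TAG, projected onto the C_uid field.
  uids = docs.select { |doc| doc.fetch('C_tag', []).include?(tag) }
             .map { |doc| doc['C_uid'] }
             .to_set
  # Step 2: outer filter, keeping docs whose id is among those values.
  docs.select { |doc| uids.include?(doc['id']) }
end

p indirect_filter(INDEX_DOCS, 'foo').map { |doc| doc['id'] }
```

The result contains only A docs, because no C doc id appears as a C_uid value in this sample.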

My custom facet cache differs from the built-in facets
in that it builds a cross-reference cache from the "C" types to the
"A" types (a JOIN, heh).
What does the cross-reference cache look like when it's built?  A  
simple int[]?
To do this more efficiently, it seems like one would want separate indices
for the A and C docs to keep maxDoc() down.

    // Top-level cache: facet name -> (key -> DocSet);
    // for "usertag" it is username -> (tag -> DocSet).
    cache = new HashMap<String, Map>();
    Map<String, Map<String, DocSet>> userTagMap = new HashMap<String, Map<String, DocSet>>();
    Map<String, DocSet> tagMap = new HashMap<String, DocSet>();
    Map<String, DocSet> userMap = new HashMap<String, DocSet>();
    Map<String, DocSet> collectedMap = new HashMap<String, DocSet>();
    // A single DocSet holding every collected object.
    DocSet collectedSet = new BitDocSet();
    collectedMap.put("collected", collectedSet);
    cache.put("tag", tagMap);
    cache.put("usertag", userTagMap);
    cache.put("username", userMap);
    cache.put("collected", collectedMap);

so basically (in Ruby code) I have the following to get a DocSet:

        cache['tag'][tag]

or
        cache['usertag'][username][tag]
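For readers who want to try those lookups, here is a self-contained model of the same nested structure. Ruby Set objects stand in for DocSets, the doc numbers are invented, and build_cache is a hypothetical helper, not the actual cache-building code.

```ruby
require 'set'

# Builds the nested lookup from [username, tag, doc_number] triples.
def build_cache(taggings)
  cache = {
    'tag'       => Hash.new { |hash, tag| hash[tag] = Set.new },
    'username'  => Hash.new { |hash, user| hash[user] = Set.new },
    'usertag'   => Hash.new do |hash, user|
      hash[user] = Hash.new { |inner, tag| inner[tag] = Set.new }
    end,
    'collected' => { 'collected' => Set.new }
  }
  taggings.each do |user, tag, doc|
    cache['tag'][tag] << doc
    cache['username'][user] << doc
    cache['usertag'][user][tag] << doc
    cache['collected']['collected'] << doc
  end
  cache
end

SAMPLE_CACHE = build_cache([
  ['erikhatcher', 'foo', 1],
  ['erikhatcher', 'bar', 2],
  ['yonik',       'foo', 3]
])

p SAMPLE_CACHE['tag']['foo']                    # all objects tagged "foo"
p SAMPLE_CACHE['usertag']['erikhatcher']['foo'] # objects tagged "foo" by erikhatcher
```

With sets of object doc numbers as values, the size of each set is the object count per tag, and the two-level 'usertag' map answers the per-user query directly.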

Interestingly, I do build a separate RAMDirectory index for another purpose under Collex: agent name lookup, where agents are associated with one or more roles.
What's the id for the C docs?  user catenated with id of the collected
doc, so all tags/comments for a particular user on a particular doc go
in the same C doc?
Yes, a collectable object has a URI in this form:
"#{object_id}/#{username}"
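A small sketch of building and splitting such a URI. Both helper names are hypothetical, the example object URI is made up, and splitting on the last slash assumes usernames never contain a "/".

```ruby
# Builds the C doc URI from the collected object's id and the username.
def collectable_uri(object_id, username)
  "#{object_id}/#{username}"
end

# Splits a C doc URI back into [object_id, username] at the last slash.
def split_collectable_uri(uri)
  idx = uri.rindex('/')
  raise ArgumentError, "not a collectable URI: #{uri}" if idx.nil?
  [uri[0...idx], uri[(idx + 1)..-1]]
end

uri = collectable_uri('http://example.org/obj/42', 'erikhatcher')
puts uri
p split_collectable_uri(uri)
```

Because the username is the last path segment, all tags and comments by one user on one object map to a single C doc id.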
Thanks for the feedback thus far.  I'm optimistic we'll find a good  
solution to this.  Worst case, I continue to use my hack for mapping  
associations, but tune the cache generation a bit.
        Erik


