On Feb 2, 2007, at 4:29 PM, Yonik Seeley wrote:
One downside of doing joins is that it makes it pretty hard to
distribute/federate in the future because a document doesn't stand
alone.
The connection between objects is key in our library domain though.
A flat structure for tagging could be to add a
taguser and tag field to the actual document each time a user
tagged a document.
I've been contemplating how that would look and work. But the
downsides you mention are sorta show-stoppers for our needs:
- filter query results by a constraint tag=foo
fq=tag:foo
You wouldn't be able to query for:
- total number of tags
Couldn't that be the term frequency information?
- items with the largest number of tags
tag frequency is very important, but having a tag field would give us
frequency per tag term. So I don't see this as a problem.
- a tag by a specific user... that would require something like a
phrase match across fields.
This is necessary too. The Collex sidebar allows you to see all
objects tagged as "foo" by a specific user.
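To make the "phrase match across fields" problem concrete, here is a minimal sketch, assuming the flat structure stores taguser and tag as two parallel multi-valued fields. The class name, the method, and the "otheruser" account are hypothetical; plain Java lists stand in for the indexed fields.

```java
import java.util.Arrays;
import java.util.List;

class FlatTagPairing {
    // Mimics two independent term clauses (taguser:X AND tag:Y) against
    // parallel multi-valued fields; nothing ties a user to a particular tag.
    static boolean matchesIndependently(List<String> tagUsers, List<String> tags,
                                        String user, String tag) {
        return tagUsers.contains(user) && tags.contains(tag);
    }

    public static void main(String[] args) {
        // Hypothetical document: erikhatcher tagged it "bar", otheruser tagged it "foo".
        List<String> tagUsers = Arrays.asList("erikhatcher", "otheruser");
        List<String> tags = Arrays.asList("bar", "foo");
        // Matches although erikhatcher never applied "foo": a false positive.
        System.out.println(matchesIndependently(tagUsers, tags, "erikhatcher", "foo")); // prints true
    }
}
```

So a flat layout would need some way to keep each (user, tag) pair together before the sidebar query could be answered correctly.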
Downsides of a flat structure:
- you need to reindex the whole document, or have updateable documents
- even with updateable documents, it could be costly to update
(if people's tagging rate is fairly low, this may not matter much)
I figured this use case would lend itself well to updateable docs,
though I've not yet visualized how this would work entirely.
--- Separate tag or collectible objects ---
- all collected objects
The count of all tagged objects? How would you do this?
- all objects collected by erikhatcher
facet.query=C_user:erikhatcher
- all collected objects with tag "foo"
facet.query=C_tag:foo
The "all objects tagged 'foo' by erikhatcher" query is the holy grail, eh?
- facet by tag
facet.field=C_tag (this would give counts of *tags* not documents)
These are important numbers too. But object count per tag is the ideal.
- filter query results by a constraint tag=foo
Not currently doable, would need to build up a filter somehow...
indirectFilter=id:((C_tag:foo).C_uid)
If an indirect approach has enough advantages, we could perhaps come
up with a way to express it.
I like it!
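As I read the indirectFilter idea, it is a two-step lookup: find the C docs matching C_tag:foo, collect their C_uid values, then keep only the A docs whose id is in that set. A rough sketch of that reading, assuming C_uid holds the id of the collected A doc. Everything here (CDoc, uidsForTag, applyFilter) is hypothetical and runs over in-memory lists, not over Solr or Lucene structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class IndirectFilterSketch {
    /** Hypothetical in-memory stand-in for a "C" document. */
    static final class CDoc {
        final String uid;        // assumed: id of the collected "A" document (C_uid)
        final List<String> tags; // C_tag values on this C doc
        CDoc(String uid, String... tags) {
            this.uid = uid;
            this.tags = Arrays.asList(tags);
        }
    }

    // Steps 1 and 2: select C docs carrying the tag, project out their C_uid values.
    static Set<String> uidsForTag(List<CDoc> cDocs, String tag) {
        Set<String> uids = new TreeSet<>();
        for (CDoc c : cDocs) {
            if (c.tags.contains(tag)) {
                uids.add(c.uid);
            }
        }
        return uids;
    }

    // Step 3: keep only the result ids present in the allowed set, preserving order.
    static List<String> applyFilter(List<String> resultIds, Set<String> allowed) {
        List<String> kept = new ArrayList<>();
        for (String id : resultIds) {
            if (allowed.contains(id)) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```

A real implementation would presumably produce a cached filter over internal doc ids instead of string ids, but the shape of the join is the same.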
My custom facet cache differs from the built-in facets
in that it builds a cross-reference cache from the "C" types to the
"A" types (a JOIN, heh).
What does the cross-reference cache look like when it's built? A
simple int[]?
To do this more efficiently, it seems like one would want separate indices
for the A and C docs to keep maxDoc() down.
cache = new HashMap<String, Map>();
Map<String,Map<String,DocSet>> userTagMap =
    new HashMap<String,Map<String,DocSet>>();
Map<String,DocSet> tagMap = new HashMap<String, DocSet>();
Map<String,DocSet> userMap = new HashMap<String, DocSet>();
Map<String, DocSet> collectedMap = new HashMap<String, DocSet>();
DocSet collectedSet = new BitDocSet();
collectedMap.put("collected", collectedSet);
cache.put("tag", tagMap);
cache.put("usertag", userTagMap);
cache.put("username", userMap);
cache.put("collected", collectedMap);
So basically (in Ruby code) I have the following to get a DocSet:
cache['tag'][tag]
or
cache['usertag'][username][tag]
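For comparison, the same two lookups can be sketched in Java. This is only an illustration: Set<Integer> stands in for DocSet so that it runs without Solr on the classpath, and the class and method names (TagCacheSketch, add, byTag, byUserAndTag) are made up for this example. Unknown users or tags yield an empty set instead of null.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TagCacheSketch {
    // tag -> doc ids, and user -> tag -> doc ids (Set<Integer> stands in for DocSet)
    final Map<String, Set<Integer>> tagMap = new HashMap<>();
    final Map<String, Map<String, Set<Integer>>> userTagMap = new HashMap<>();

    // Record that `user` applied `tag` to the document with internal id `docId`.
    void add(String user, String tag, int docId) {
        tagMap.computeIfAbsent(tag, k -> new TreeSet<>()).add(docId);
        userTagMap.computeIfAbsent(user, k -> new HashMap<>())
                  .computeIfAbsent(tag, k -> new TreeSet<>())
                  .add(docId);
    }

    // Equivalent of cache['tag'][tag]
    Set<Integer> byTag(String tag) {
        return tagMap.getOrDefault(tag, Collections.<Integer>emptySet());
    }

    // Equivalent of cache['usertag'][username][tag]
    Set<Integer> byUserAndTag(String user, String tag) {
        Map<String, Set<Integer>> tags =
            userTagMap.getOrDefault(user, Collections.<String, Set<Integer>>emptyMap());
        return tags.getOrDefault(tag, Collections.<Integer>emptySet());
    }
}
```

The nested map keeps each (user, tag) pair together, which is exactly what a flat pair of multi-valued fields loses.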
Interestingly, I do build a separate RAMDirectory index for another
purpose under Collex: agent name lookup, where agents are associated
with one or more roles.
What's the id for the C docs? User concatenated with the id of the collected
doc, so all tags/comments for a particular user on a particular doc go
in the same C doc?
Yes, a collectable object has a URI in this form:
"#{object_id}/#{username}"
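A small sketch of composing and splitting such a URI, assuming usernames never contain a slash while object ids may (for example, when the object id is itself a URI). The CollectibleId class and its methods are hypothetical, not code from Collex.

```java
class CollectibleId {
    // Build "<object_id>/<username>".
    static String build(String objectId, String username) {
        return objectId + "/" + username;
    }

    // Split on the LAST slash, on the assumption that only the object id
    // part can contain slashes. Returns { objectId, username }.
    static String[] parse(String uri) {
        int i = uri.lastIndexOf('/');
        if (i < 0) {
            throw new IllegalArgumentException("not a collectible URI: " + uri);
        }
        return new String[] { uri.substring(0, i), uri.substring(i + 1) };
    }
}
```

If usernames could contain slashes, the separator would have to be escaped or replaced with a character that neither part can contain.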
Thanks for the feedback thus far. I'm optimistic we'll find a good
solution to this. Worst case, I continue to use my hack for mapping
associations, but tune the cache generation a bit.
Erik