Hello Solr users,

How would you design a filtered join scenario?

Say I have a bunch of movies (excuse any inaccuracies, this is an
imagined scenario):

curl -XPOST -H 'Content-Type: application/json' \
  'localhost:8983/solr/test/update?commitWithin=1000' --data-binary '
[{
"id": "1",
"title": "Rambo 1",
"release_date": "1978-01-01"
},
{
"id": "2",
"title": "Rambo 5",
"release_date": "1998-01-01"
},
{
"id": "3",
"title": "300 Spartaaaaaans",
"release_date": "2005-01-01"
}]'

And a bunch of users of certain families who watched those movies:

curl -XPOST -H 'Content-Type: application/json' \
  'localhost:8983/solr/test/update?commitWithin=1000' --data-binary '
[{
"id": "user_1",
"name": "Jane",
"family": "Smith",
"born": "1990-01-01",
"watched_movies": ["1", "3"]
},
{
"id": "user_2",
"name": "Joe",
"family": "Smith",
"born": "1970-01-01",
"watched_movies": ["2"]
},
{
"id": "user_3",
"name": "Radu",
"family": "Gheorghe",
"born": "1985-01-01",
"watched_movies": ["1", "2", "3"]
}]'

Movies and users don't have to be in the same collection. The important
question is how to get:
- movies watched by users of the Smith family
- but only movies released after the watching user was born
- including the matching users in the results
- I'd like to be able to facet on movie metadata, but I don't need to
facet on user metadata, just to be able to retrieve those fields

For the sample data, such a query should bring back Rambo 5 (with Joe)
and 300 (with Jane). I wouldn't get Rambo 1, because although Jane
watched it, the movie was released before she was born.
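
To pin down the semantics I'm after, here is a minimal in-memory sketch
in Python. It is only a reference for the expected result, not a Solr
solution. The function name filtered_join is made up for illustration,
and I'm assuming every user document carries a "name" field.

```python
from datetime import date

movies = [
    {"id": "1", "title": "Rambo 1", "release_date": date(1978, 1, 1)},
    {"id": "2", "title": "Rambo 5", "release_date": date(1998, 1, 1)},
    {"id": "3", "title": "300 Spartaaaaaans", "release_date": date(2005, 1, 1)},
]
users = [
    {"id": "user_1", "name": "Jane", "family": "Smith",
     "born": date(1990, 1, 1), "watched_movies": ["1", "3"]},
    {"id": "user_2", "name": "Joe", "family": "Smith",
     "born": date(1970, 1, 1), "watched_movies": ["2"]},
    {"id": "user_3", "name": "Radu", "family": "Gheorghe",
     "born": date(1985, 1, 1), "watched_movies": ["1", "2", "3"]},
]


def filtered_join(movies, users, family):
    """Return (movie, matching_users) pairs: a movie qualifies when at
    least one user of the given family watched it and was born before
    its release date."""
    result = []
    for movie in movies:
        matching = [
            u for u in users
            if u["family"] == family
            and movie["id"] in u["watched_movies"]
            and movie["release_date"] > u["born"]
        ]
        if matching:
            result.append((movie, matching))
    return result


for movie, matching in filtered_join(movies, users, "Smith"):
    print(movie["title"], [u["name"] for u in matching])
# prints:
# Rambo 5 ['Joe']
# 300 Spartaaaaaans ['Jane']
```

The tricky part is the third condition: it compares a field of the
movie with a field of the user, so it can't be expressed as a filter on
either side alone.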

Here are some options that I have in mind:
1) using the join query parser (or the newer XCJF) to do the join
itself. Then have some sort of plugin pull the "born" value of each
corresponding user (via some subquery) and filter movies afterwards.
Normalized, but likely painfully slow.

2) a similar approach to 1), but in a streaming expression. Again,
normalized, but slow (we're talking billions of movies, millions of
users). And limited support for facets.

3) have some sort of denormalization. For example, pre-compute
matching users for every movie, then just use join/XCJF to do the
actual join. This makes indexing/updates expensive and potentially
complicated

4) normalization with nested documents. This is best for searches, but
pretty much a no-go for indexing/updates. In this imaginary use-case,
there are binge-watchers who might watch a billion movies in a week,
making us reindex everything
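
To make option 3 a bit more concrete, here is a rough Python sketch of
the pre-computation step, under my own assumptions: the helper name
eligible_users_by_movie and the movie field "eligible_users" are
hypothetical, and the join string at the end is only what I imagine the
query-time part would look like with the join query parser.

```python
from datetime import date


def eligible_users_by_movie(movies, users):
    """Map each movie id to the ids of users who watched it and were
    born before its release date."""
    release = {m["id"]: m["release_date"] for m in movies}
    eligible = {m["id"]: [] for m in movies}
    for user in users:
        for movie_id in user["watched_movies"]:
            if movie_id in release and release[movie_id] > user["born"]:
                eligible[movie_id].append(user["id"])
    return eligible


movies = [
    {"id": "1", "release_date": date(1978, 1, 1)},
    {"id": "2", "release_date": date(1998, 1, 1)},
    {"id": "3", "release_date": date(2005, 1, 1)},
]
users = [
    {"id": "user_1", "born": date(1990, 1, 1), "watched_movies": ["1", "3"]},
    {"id": "user_2", "born": date(1970, 1, 1), "watched_movies": ["2"]},
    {"id": "user_3", "born": date(1985, 1, 1), "watched_movies": ["1", "2", "3"]},
]

eligible = eligible_users_by_movie(movies, users)

# Hypothetical extra field to index on each movie document.
movie_docs = [
    {"id": m["id"], "eligible_users": eligible[m["id"]]} for m in movies
]

# Assumed query-time part: select users by family, then join from their
# ids to the pre-computed movie field.
join_query = "{!join from=id to=eligible_users}family:Smith"

print(eligible)
```

With this, the date comparison moves to index time, which is exactly
why every new "watched" event means updating the movie document too.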

Do you see better ways?

Thanks in advance and best regards,
Radu
