Hello all, I am facing a problem with how to structure our Solr cluster and indexes.
Current problem: we developed a DMP (Data Management Platform) for online advertising purposes, and we currently use Solr to index all the data we collect and to provide an "ultra-fast user segmentation" tool for our advertisers. For all the data we collect ourselves, we use a single Solr index. For example, we collect and index all the web pages a user has visited, the searches he made, his user profile, etc.

We are now trying to let our advertisers upload their entire CRM customer base, so that they can match their CRM data with the data we collect during audience segmentation/creation. The problem is that each advertiser has 10-20 million or more users in their CRM, with dozens or hundreds of fields (columns) that are frequently updated, and we don't know whether we should put all this data into the same Solr index we already use for our own data (adding new fields to the user document in Solr), or build one Solr index per advertiser holding that advertiser's user documents.

The main reason for this big doubt is that we have to support JOIN-like searches between our data and the advertiser's CRM data when building a segment (querying the data), because advertisers want to segment using their data and ours together. We thought it would be much better (and easier) to have one separate index per advertiser, but the big problem is that, as far as we know, it's not possible to JOIN two (or more) different Solr indexes. I mean, it's possible, but too expensive for large amounts of data (our case).

One solution that also doesn't work for us is to have a different Solr index per advertiser that also includes our own data, i.e. merging the data we collect with the advertiser's data into a new index. The reason it doesn't work is that our data is added and updated much more frequently than the advertisers', so we would have to update every advertiser's Solr index each time we collect new data, even when that new data is not related to the advertiser.

Here is a more detailed view of our problem. Let's suppose the following scenario: 1 user = 1 cookie ID.

This is some of the data we collect through JavaScript tags on websites and associate with a user:
- Website Visited
- Page Content (Text)
- Searches (taken from the HTTP referer)

Other data that we collect from users who install our Facebook apps, and that we also associate with a user:
- Gender
- Age
- Facebook Likes
- Friends List

Now, let's suppose that company X is one of our customers and that they want to import their CRM database containing the following fields:
- Company X Customer ID
- Last Purchase Date
- Average Purchase Value
- City (or cities) of the store(s) where the customer purchased

After importing company X's database, we identify the user through a JavaScript tag on company X's website, which passes us the company X Customer ID, so that we can record that company X Customer ID "12345" is actually our user/cookie ID "ABCDE".
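For what it's worth, the association step itself looks straightforward with Solr atomic updates. Below is a rough sketch in Python against Solr's JSON update API; the core name "users" and the field "companyx_customer_id" are made-up examples, and this assumes atomic updates are available (Solr 4+, with the relevant fields stored):

    import json
    import requests

    # Hypothetical core and field names, for illustration only.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/users/update?commit=true"

    def link_crm_id(cookie_id, crm_customer_id):
        # Atomic update: "set" overwrites only this one field on the
        # existing user document, so we don't re-send the whole document.
        doc = [{"id": cookie_id, "companyx_customer_id": {"set": crm_customer_id}}]
        requests.post(SOLR_UPDATE_URL, data=json.dumps(doc),
                      headers={"Content-Type": "application/json"})

    # e.g. company X customer "12345" turned out to be our cookie "ABCDE":
    link_crm_id("ABCDE", "12345")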
Now comes the difficult part. Later, company X will be able to perform the following types of queries/searches:

- Select (count) the number of company X Customer IDs where Last Purchase Date = "Yesterday" AND Searched on the web for = "Cakes";
- Select (count) the number of company X Customer IDs where Average Purchase Value < 300 AND Gender = "Male" AND Facebook Likes = "Coca-Cola" but NOT Facebook Likes = "Pepsi";
- Select (count) the number of company X Customer IDs where Last Purchase Date = "Last Week" AND Average Purchase Value < 300 AND Gender = "Male" AND Facebook Likes = "Coca-Cola" but NOT Facebook Likes = "Pepsi".

As you can see, company X can search within their CRM fields, but can also search using (merging) our own data. As an initial premise, importing/updating company X's database will not be real-time; it will happen maybe once or twice per day.

So, do you think we could have these kinds of queries/searches pre-processed in Hadoop and later indexed by Solr? How would that work if we can't predict the searches company X will make?

Regarding the other possibility (running searches/queries against different Solr indexes and then joining the results in memory to get the deduplicated count), the one problem I see is that if company X runs a broader search, the result sets would be too big to compare, which would take a lot of time (I mean seconds, not milliseconds as we need).

Any hint on how to solve this problem/situation?

Best regards,
Marcelo Valle.
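P.S. To make the single-index option concrete: if the advertiser's CRM fields lived on the same user document as our own data, the second example query above would be an ordinary filtered count, roughly along these lines (Python against Solr's select handler; all field names invented for illustration):

    import requests

    params = {
        "q": "*:*",
        "fq": [
            "companyx_customer_id:[* TO *]",           # only users matched to company X's CRM
            "companyx_avg_purchase_value:{* TO 300}",  # Average Purchase Value < 300 (exclusive)
            "gender:male",
            "facebook_likes:\"Coca-Cola\"",
            "-facebook_likes:\"Pepsi\"",               # but NOT Pepsi
        ],
        "rows": 0,   # we only need the count, not the documents
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/users/select", params=params).json()
    print(resp["response"]["numFound"])   # segment size

Each fq clause should land in Solr's filter cache, so repeated segmentations built from the same blocks would stay fast, whereas the in-memory cross-index join costs time proportional to the size of the result sets being intersected.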