Hello all, I am facing a problem with how to structure our Solr cluster and indexes.
Current problem: we developed a DMP (Data Management Platform) for online advertising purposes, and we currently use Solr to index all the data we collect and to provide an "ultra-fast user segmentation" tool for our advertisers. For all the data we collect ourselves, we use a single Solr index. For example, we collect and index all the web pages a user has visited, the searches he made, his user profile, etc.

We are now trying to let our advertisers upload their entire CRM customer base, so that they can match their CRM data with the data we collect during audience segmentation/creation. The problem is that each advertiser has 10-20 million or more users in their CRM, with dozens or hundreds of fields (columns) that are frequently updated, and we don't know whether we should put all this data into the same Solr index we already use for our own data (adding new fields to the user document in Solr), or build one Solr index per advertiser holding that advertiser's user documents.

The main reason for this big doubt is that we have to support JOIN-like searches between our data and the advertiser's CRM data when building a segment (querying the data), because advertisers want to segment using their data and ours together. We thought it would be much better (and easier) to have one separate index per advertiser, but the big problem is that, as far as we know, it's not possible to JOIN two (or more) different Solr indexes. I mean, it's possible, but too expensive for large amounts of data (our case).

One solution that also doesn't work for us is to have a different Solr index per advertiser that also includes our own data, i.e. merging the data we collect with the advertiser's data into a new index. The reason it doesn't work is that our data is added and updated much more frequently than the advertisers', so we would have to update every advertiser's Solr index each time we collect new data, even when that new data is not related to the advertiser.

Here is a more detailed view of our problem. Let's suppose the following scenario: 1 user = 1 cookie ID.

This is some of the data we collect through JavaScript tags on websites and associate with a user:
- Website Visited
- Page Content (Text)
- Searches (taken from the HTTP referer)

Other data that we collect from users who install our Facebook apps, and that we also associate with a user:
- Gender
- Age
- Facebook Likes
- Friends List

Now, let's suppose that company X is one of our customers and that they want to import their CRM database containing the following fields:
- Company X Customer ID
- Last Purchase Date
- Average Purchase Value
- City (or cities) of the store(s) where the customer purchased

After importing company X's database, we identify the user through a JavaScript tag on company X's website, which passes us the company X Customer ID, so that we can record that company X Customer ID "12345" is actually our user/cookie ID "ABCDE".
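For what it's worth, the association step itself looks straightforward with Solr atomic updates. Below is a rough sketch in Python against Solr's JSON update API; the core name "users" and the field "companyx_customer_id" are made-up examples, and this assumes atomic updates are available (Solr 4+, with the relevant fields stored):

    import json
    import requests

    # Hypothetical core and field names, for illustration only.
    SOLR_UPDATE_URL = "http://localhost:8983/solr/users/update?commit=true"

    def link_crm_id(cookie_id, crm_customer_id):
        # Atomic update: "set" overwrites only this one field on the
        # existing user document, so we don't re-send the whole document.
        doc = [{"id": cookie_id, "companyx_customer_id": {"set": crm_customer_id}}]
        requests.post(SOLR_UPDATE_URL, data=json.dumps(doc),
                      headers={"Content-Type": "application/json"})

    # e.g. company X customer "12345" turned out to be our cookie "ABCDE":
    link_crm_id("ABCDE", "12345")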
Now comes the difficult part. Later, company X will be able to perform the following types of queries/searches:

- Select (count) the number of company X Customer IDs where Last Purchase Date = "Yesterday" AND Searched on the web for = "Cakes";
- Select (count) the number of company X Customer IDs where Average Purchase Value < 300 AND Gender = "Male" AND Facebook Likes = "Coca-Cola" but NOT Facebook Likes = "Pepsi";
- Select (count) the number of company X Customer IDs where Last Purchase Date = "Last Week" AND Average Purchase Value < 300 AND Gender = "Male" AND Facebook Likes = "Coca-Cola" but NOT Facebook Likes = "Pepsi".

As you can see, company X can search within their CRM fields, but can also search using (merging) our own data. As an initial premise, importing/updating company X's database will not be real-time; it will happen maybe once or twice per day.

So, do you think we could have these kinds of queries/searches pre-processed in Hadoop and later indexed by Solr? How would that work if we can't predict the searches company X will make?

Regarding the other possibility (running searches/queries against different Solr indexes and then joining the results in memory to get the deduplicated count), the one problem I see is that if company X runs a broader search, the result sets would be too big to compare, which would take a lot of time (I mean seconds, not milliseconds as we need).

Any hint on how to solve this problem/situation?

Best regards,
Marcelo Valle.
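P.S. To make the single-index option concrete: if the advertiser's CRM fields lived on the same user document as our own data, the second example query above would be an ordinary filtered count, roughly along these lines (Python against Solr's select handler; all field names invented for illustration):

    import requests

    params = {
        "q": "*:*",
        "fq": [
            "companyx_customer_id:[* TO *]",           # only users matched to company X's CRM
            "companyx_avg_purchase_value:{* TO 300}",  # Average Purchase Value < 300 (exclusive)
            "gender:male",
            "facebook_likes:\"Coca-Cola\"",
            "-facebook_likes:\"Pepsi\"",               # but NOT Pepsi
        ],
        "rows": 0,   # we only need the count, not the documents
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/users/select", params=params).json()
    print(resp["response"]["numFound"])   # segment size

Each fq clause should land in Solr's filter cache, so repeated segmentations built from the same blocks would stay fast, whereas the in-memory cross-index join costs time proportional to the size of the result sets being intersected.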