Document Security Model Question
I had earlier posted a similar discussion on LinkedIn, and David Smiley rightly advised me that solr-user is a better place for technical discussions.

Our hosted product supports searching over educational resources. Our customers can choose to make specific resources unavailable to their users, and availability also depends on licensing. Our current solution uses the database's full-text search support and handles availability as part of the SQL. My task is to move the search from the database's full-text search into Solr. I searched through the archives and found some posts that were somewhat related, and I am thinking along the following lines:

a) Use the authorization model. I can add fields like allow and/or deny to the index, each containing a list of customers. At query time I can add a constraint based on the customer id (a rough sketch of the query side is at the end of this message). I am concerned about performance if these fields have a lot of values, and it also requires constant reindexing whenever a value in one of these fields changes.

b) Use query-time join. Keep the resource-to-customer availability in separate inner documents. We are planning to deploy on SolrCloud, and I have read about challenges with query-time join on SolrCloud, so this may not work for us.

c) Other ideas?

Excerpts from David Smiley's response:

"You're right that there may be some re-indexing as security rules change. If many Lucene/Solr documents share identical access control with other documents, then it may make more sense to externally determine which unique set of access-control sets the user has access to, then finally search by id -- which will hopefully not be a huge number. I've seen this done both externally and with a Solr core to join on."
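For option a), the query side would look something along these lines (the allow_customers/deny_customers field names are just made up for illustration; I have not settled on a schema):

    import org.apache.solr.client.solrj.SolrQuery;

    public class CustomerFilterExample {
        // Build the search for one customer's user. allow_customers / deny_customers
        // are hypothetical multi-valued string fields holding customer ids.
        public static SolrQuery buildQuery(String userQuery, String customerId) {
            SolrQuery q = new SolrQuery(userQuery);
            // only resources explicitly allowed for this customer...
            q.addFilterQuery("allow_customers:" + customerId);
            // ...and not explicitly denied for it
            q.addFilterQuery("-deny_customers:" + customerId);
            // for option b) this would instead be something like a join filter,
            // e.g. fq={!join from=resource_id to=id}customer_id:<customerId>
            return q;
        }
    }

This keeps the user's query untouched and pushes the restriction into filter queries, but it is exactly those filters that worry me when the allow/deny lists get large.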
Indexing different customer customized field values
In our application, we index educational resources and allow searching for them. We allow our customers to change some of the non-textual metadata associated with a resource (like booklevel, interestlevel, etc.) to serve their users better. So in theory each resource could have a different set of metadata values for each customer, though in reality maybe 10-25% of our customers customize a small portion of the resources. Our current solution uses SQL Server to manage the customizations (the database is sharded for other reasons as well) and also uses SQL Server's full-text index for search. We are replacing this with Solr.

There are a few approaches we have thought about, but none of them seem ideal:

a) Duplicate the entries in Solr. Each resource would be replicated for each customer, so there would be one index entry per customer. The number of index entries is a big concern, even though the text field values are the same. (We have about 300K resources and about 50K customers, and both will grow.)

b) Use a dedicated Solr core for each customer. This would not use resources efficiently, and we would be duplicating the textual content, which does not change from customer to customer.

c) Use a global index that holds the resources with default values, plus a separate index per customer containing only the resources that customer has customized. This requires managing a lot of small cores/indexes, and it would also require merging results from multiple cores, so I don't think this will work.

d) Use Solr for the text search and do post-processing externally to filter on the metadata. As you can imagine, this has all the challenges associated with post-processing (pagination support, etc.).

e) Use Solr's advanced/post filtering support (a rough sketch of the query side is after this list). Even if we can figure out a reasonable way to cache the lookup of metadata values for each customer, I am not sure this would be efficient.

Are there any other recommendations on solutions?
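For option e), as far as I understand it, the hook on the query side is a filter query marked cache=false with a cost of 100 or more, which Solr then runs as a post filter if the parser produces a query implementing PostFilter. Something like this (the customavail parser name is hypothetical; it would be a custom QParserPlugin we would have to write):

    import org.apache.solr.client.solrj.SolrQuery;

    public class PostFilterQueryExample {
        public static SolrQuery buildQuery(String userQuery, String customerId) {
            SolrQuery q = new SolrQuery(userQuery);
            // cache=false plus cost >= 100 asks Solr to apply this filter after the
            // main query, provided the parsed query implements PostFilter.
            // "customavail" is a made-up parser that would look up this customer's
            // metadata customizations and accept/reject each candidate document.
            q.addFilterQuery("{!customavail cache=false cost=100 customer=" + customerId + "}");
            return q;
        }
    }

The open question is whether the per-customer metadata lookup inside that filter can be made cheap enough.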
Re: Indexing different customer customized field values
Thanks Otis. We also thought about having multiple fields, but thought that having too many fields would be an issue. I see threads saying that too many fields is a problem for sorting (we don't expect to sort on these), but I will look through the archives.
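Just to make sure I understand the multiple-fields suggestion, indexing a customization would look roughly like this (the field naming scheme is made up, and I am piggybacking on a typical *_i dynamicField pattern only for illustration):

    import org.apache.solr.common.SolrInputDocument;

    public class CustomizedMetadataExample {
        public static SolrInputDocument buildDoc() {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "resource-123");
            doc.addField("title", "Introduction to Fractions");
            // default metadata values shared by all customers
            doc.addField("booklevel_default_i", 3);
            doc.addField("interestlevel_default_i", 2);
            // per-customer override: one extra dynamic field per customizing customer
            doc.addField("booklevel_cust42_i", 4);
            return doc;
        }
    }

With about 50K customers, even 10-25% of them customizing would still mean a very large number of distinct field names, which is the part I want to check in the archives.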
Re: Document Security Model Question
Thanks Rajinimaski for the response. I agree that if the changes are frequent, the first option would not work efficiently. The other challenge is that in our case, because of how our customer databases are deployed, it is easy and efficient to get a list of changes for each resource since the last checkpoint, but not to get a snapshot of allowed/disallowed across all customers for each resource.

In your PostFilter implementation, do you cache the ACLs in memory, have them updated periodically outside of Solr, and have the post filter just read from the cache -- or something along those lines? (A rough sketch of what I am imagining is below.)
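To make the question concrete, something like the following is what I am imagining on our side (class and method names are made up; refresh() would apply the per-resource change list we can get from our customer databases since the last checkpoint):

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class AclCache {
        // resourceId -> set of customer ids allowed to see that resource
        private final Map<String, Set<String>> allowed = new ConcurrentHashMap<String, Set<String>>();
        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        public void start() {
            // refresh outside of Solr's request path, on a fixed schedule
            scheduler.scheduleAtFixedRate(new Runnable() {
                public void run() {
                    refresh();
                }
            }, 0, 5, TimeUnit.MINUTES);
        }

        private void refresh() {
            // hypothetical: pull the ACL changes since the last checkpoint from the
            // customer databases and update the map, instead of rebuilding a full snapshot
        }

        public boolean isAllowed(String resourceId, String customerId) {
            Set<String> customers = allowed.get(resourceId);
            return customers != null && customers.contains(customerId);
        }
    }

The PostFilter's DelegatingCollector would then call isAllowed() for each candidate document. Is that roughly how yours works?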
Use BM25Similarity for title field and default for others
We want the field length to have less influence on the score for the title field (we don't want to disable length normalization completely), so that we get the following behavior:

- Docs with more hits in the title rank higher.
- Docs with shorter titles rank higher if the hits are equal.

DefaultSimilarity wasn't always giving us this (shorter titles were preferred over longer titles with more hits). Note: we use edismax and search across title and other fields (like body).

To solve this we use BM25Similarity with a small value of b for the title field. We ended up using SchemaSimilarityFactory as the global similarity in order to use BM25Similarity for the title field. This gave us the results we were looking for with respect to the title field.

We also have keyword, tag, and other metadata fields, and we want them to be treated mostly as filters and not influence the score at all. Because we use SchemaSimilarityFactory, even though the non-title fields get DefaultSimilarity, it is not the same as using DefaultSimilarityFactory, so we have situations where the metadata fields dominate the score (because PerFieldSimilarityWrapper uses a queryNorm of 1.0).

We are thinking we have the following options to fix this:

a) Use BM25Similarity for all fields and adjust the k1 and b values as appropriate.

b) Send the metadata field clauses as part of fq instead of q (but we might have a lot of dynamically generated clauses, and I am not sure fq is well suited for these since we don't want them cached, as they can vary from request to request).

c) Associate a boost of zero with the metadata fields in the query.

d) Extend SchemaSimilarityFactory and write custom code (at this point I am not sure what the custom class should do).

Are these correct? Do we have any other options? Any advice on which option is better would be appreciated. (A rough sketch of what b) and c) might look like on the query side is below.)
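For reference, on the query side options b) and c) would look something like this (field names and boosts are only illustrative):

    import org.apache.solr.client.solrj.SolrQuery;

    public class MetadataScoringExample {
        // option b): keep the text clauses in q, move the metadata clauses into
        // non-cached filter queries so they cannot contribute to the score at all
        public static SolrQuery optionB(String text) {
            SolrQuery q = new SolrQuery(text);
            q.set("defType", "edismax");
            q.set("qf", "title^2.0 body");
            q.addFilterQuery("{!cache=false}interestlevel:(3 OR 4)");
            q.addFilterQuery("{!cache=false}booklevel:[2 TO 5]");
            return q;
        }

        // option c): keep the metadata fields in the query but give them a zero boost
        public static SolrQuery optionC(String text) {
            SolrQuery q = new SolrQuery(text);
            q.set("defType", "edismax");
            q.set("qf", "title^2.0 body keyword^0 tag^0");
            return q;
        }
    }

Option b) keeps the metadata out of scoring entirely, while option c) keeps the clauses in q but zeroes out their contribution; we are not sure which, if either, is the cleaner way to go.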