Document Security Model Question

2013-11-14 Thread kchellappa
I had earlier posted a similar discussion on LinkedIn, and David Smiley
rightly advised me that solr-user is a better place for technical
discussions.

--

Our hosted product supports searching over educational resources. Our
customers can choose to make specific resources unavailable to their users,
and availability also depends on licensing. Our current solution uses the
database's full-text search support and handles availability as part of the SQL.

My task is to move the search from the database's full-text search into Solr.
I searched through past posts, found some that were related, and am thinking
along the following lines:

  a)  Use the authorization model.  I can add fields like allow and/or deny
to the index that contain the list of customers.  At query time, I can add
a constraint based on the customer ID (see the sketch after this list).  I am
concerned about performance if there are a lot of values in these fields, and
it also requires constant reindexing whenever a value in these fields changes.
 b) Use query-time join.
    Keep the resource-to-customer availability mapping in separate documents.
    We are planning to deploy on SolrCloud, and I have read about challenges
with query-time join in SolrCloud, so this may not work for us.

c) Other ideas?
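
Roughly what I have in mind for (a), using SolrJ; the allow/deny field names
and the customer ID handling are just placeholders I am assuming here:

    import org.apache.solr.client.solrj.SolrQuery;

    public class AvailabilityFilterExample {
      // Builds the per-customer availability constraint for option (a).
      // "allow" and "deny" are assumed multi-valued fields on each resource
      // listing the customer IDs the resource is enabled/disabled for.
      public static SolrQuery build(String userQuery, String customerId) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery("allow:" + customerId);   // licensed/enabled for this customer
        q.addFilterQuery("-deny:" + customerId);   // and not switched off by them
        return q;
      }
    }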
 
Excerpts from David Smiley's response:

You're right that there may be some re-indexing as security rules change. If
many Lucene/Solr documents share identical access control with other
documents, then it may make more sense to externally determine which unique
set of access-control sets the user has access to, then finally search by id
-- which will hopefully not be a huge number. I've seen this done both
externally and with a Solr core to join on.
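
If I follow the "Solr core to join on" variant, a rough sketch would be a
cross-core join filter like the one below (the "acl" core, its fields, and
the customer ID are my assumptions; cross-core join also had limitations
with SolrCloud at the time, which is part of my concern in option b):

    import org.apache.solr.client.solrj.SolrQuery;

    public class AclJoinExample {
      // Assumes a separate "acl" core holding one small document per
      // (customer_id, resource_id) grant, re-indexed as rules change,
      // so the main resources core never needs to be touched.
      public static SolrQuery build(String userQuery, String customerId) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery("{!join fromIndex=acl from=resource_id to=id}customer_id:" + customerId);
        return q;
      }
    }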








Indexing different customer customized field values

2013-11-19 Thread kchellappa
In our application, we index educational resources and allow searching for
them.
We allow our customers to change some of the non-textual metadata associated
with a resource (like booklevel, interestlevel, etc.) to serve their users
better.
So each resource could, in theory, have a different set of metadata values
for each customer; in reality, maybe 10-25% of our customers customize a
small portion of the resources.

Our current solution uses SQL Server to manage the customizations (the
database is sharded for other reasons as well) and also uses SQL Server's
Full Text index for search.
We are replacing this with Solr.

There are a few approaches we have thought about, but none of them seems ideal:

a) Duplicate the entries in Solr.  Each resource would be replicated for
each customer, so there would be one index entry per customer (see the sketch
after this list).
The number of index entries is a big concern even though the text field
values are the same.
(We have about 300K resources and about 50K customers, and both will grow.)

b) Use a dedicated Solr core for each customer.  This wouldn't use
resources efficiently, and we would be duplicating textual components
which don't change from customer to customer.

c) Use a global index that has the resources with default values, and then
use a separate index for each customer that contains the resources that are
customized.
This requires managing a lot of small cores/indexes, and it would require
merging results from multiple cores, so we don't think this will work.

d) Use Solr to do the text search and do post-processing to filter on
metadata externally -- as you can imagine, this has all the
challenges associated with post-processing (pagination support, etc.).

e) Use Solr's advanced/post-filtering support --- even if we can figure out a
reasonable way to cache the lookup of metadata values for each customer,
we are not sure this would be efficient.
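
For reference, option (a) would amount to indexing roughly one document per
(resource, customer) pair, along these lines (the field names are my
assumptions):

    import org.apache.solr.common.SolrInputDocument;

    public class PerCustomerDocExample {
      // Option (a): duplicate each resource per customer; the Solr id
      // combines both keys so customized metadata can differ per customer.
      public static SolrInputDocument build(String resourceId, String customerId,
                                            String title, int bookLevel) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", resourceId + "_" + customerId);
        doc.addField("resource_id", resourceId);
        doc.addField("customer_id", customerId);
        doc.addField("title", title);          // textual content, duplicated per customer
        doc.addField("booklevel", bookLevel);  // metadata value, possibly customized
        return doc;
      }
    }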

Any other recommendations on solutions?





Re: Indexing different customer customized field values

2013-11-20 Thread kchellappa
Thanks, Otis.

We also thought about having multiple fields, but were concerned that having
too many fields would be an issue.  I see threads saying that too many fields
are a problem for sorting (we don't expect to sort on these); I will keep
looking through the archives.
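
To make sure I understood the multiple-fields idea: one document per
resource, with customized metadata in a per-customer dynamic field and a
query-time fallback to the default value, roughly like this (the field
naming and the frange/def fallback are my assumptions):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.common.SolrInputDocument;

    public class PerCustomerFieldExample {
      // One document per resource; only customizing customers get an extra field.
      public static SolrInputDocument index(String resourceId, String title,
                                            int defaultBookLevel) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", resourceId);
        doc.addField("title", title);
        doc.addField("booklevel_default", defaultBookLevel);
        // per customizing customer: doc.addField("booklevel_c" + customerId, customizedValue);
        return doc;
      }

      // Filter on the customer's value if present, otherwise on the default.
      public static SolrQuery query(String userQuery, String customerId, int min, int max) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery("{!frange l=" + min + " u=" + max + "}"
            + "def(booklevel_c" + customerId + ",booklevel_default)");
        return q;
      }
    }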







Re: Document Security Model Question

2013-11-22 Thread kchellappa
Thanks, Rajinimaski, for the response.

I agree that if the changes are frequent, the first option wouldn't work
efficiently.  The other challenge in our case is that for each
resource it is easy/efficient to get a list of changes since the last
checkpoint (because of our model of deploying customer databases), but not
to get a snapshot of what is allowed/disallowed across all customers for each
resource.


In your PostFilter implementation, do you cache the ACLs in memory, have them
updated periodically outside Solr, and have the post filter just use the
cache -- or something along those lines?
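
For concreteness, this is roughly the shape I am imagining, written against
Solr 4.x (the "resource_id" docValues field and the idea of passing in a set
from an externally refreshed ACL cache are my assumptions; the docValues
access would look different in later Lucene versions):

    import java.io.IOException;
    import java.util.Set;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.SortedDocValues;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.util.BytesRef;
    import org.apache.solr.search.DelegatingCollector;
    import org.apache.solr.search.ExtendedQueryBase;
    import org.apache.solr.search.PostFilter;

    public class AclPostFilter extends ExtendedQueryBase implements PostFilter {

      // Allowed resource ids for one customer, taken from an in-memory cache
      // that is refreshed outside Solr as ACLs change.
      private final Set<String> allowedResourceIds;

      public AclPostFilter(Set<String> allowedResourceIds) {
        this.allowedResourceIds = allowedResourceIds;
        setCache(false);  // post filters are not cached
        setCost(100);     // cost >= 100 runs this after the main query and other filters
      }

      @Override
      public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
          private SortedDocValues resourceIds;
          private final BytesRef scratch = new BytesRef();

          @Override
          public void setNextReader(AtomicReaderContext context) throws IOException {
            super.setNextReader(context);
            // "resource_id" is an assumed docValues field on each document
            resourceIds = context.reader().getSortedDocValues("resource_id");
          }

          @Override
          public void collect(int doc) throws IOException {
            resourceIds.get(doc, scratch);
            if (allowedResourceIds.contains(scratch.utf8ToString())) {
              super.collect(doc);  // pass the doc down the collector chain
            }
          }
        };
      }
    }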







Use BM25Similarity for title field and default for others

2013-04-06 Thread kchellappa
We want field length to have less influence on the score for the title field
(we don't want to disable it completely), so that we get the following
behavior:

Docs with more hits in the title rank higher.
Docs with shorter titles rank higher if the hit counts are equal.

The DefaultSimilarity wasn't always giving us this (shorter titles were
preferred over longer titles with more hits).

Note -- we use edismax and search across the title and other fields (like body).

In order to solve this, we use BM25Similarity with a small value of b for the
title field.  We ended up using the SchemaSimilarityFactory as the global
similarity in order to use BM25Similarity for the title field.  This
gave us the results we were looking for with respect to the title field.
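
For reference, the per-field switch we are relying on is essentially what
Lucene's PerFieldSimilarityWrapper does (SchemaSimilarityFactory builds the
per-field mapping from the schema); a minimal sketch in Lucene 4.x terms,
with example k1/b values:

    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.search.similarities.DefaultSimilarity;
    import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
    import org.apache.lucene.search.similarities.Similarity;

    public class TitleAwareSimilarity extends PerFieldSimilarityWrapper {
      // A small b keeps some length normalization on the title without letting it dominate.
      private final Similarity titleSimilarity = new BM25Similarity(1.2f, 0.2f);
      private final Similarity defaultSimilarity = new DefaultSimilarity();

      @Override
      public Similarity get(String field) {
        return "title".equals(field) ? titleSimilarity : defaultSimilarity;
      }
      // Note: the wrapper's own queryNorm() returns 1.0f regardless of the
      // per-field similarity chosen here.
    }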


We also have keyword, tag, and other metadata fields, and we want them to be
treated mostly as filters and not influence the score at all.  Because of
the use of the SchemaSimilarityFactory, even though we get
DefaultSimilarity for non-title fields, it does not behave the same as
DefaultSimilarityFactory, so we have situations where the metadata fields
dominate the score (because PerFieldSimilarityWrapper uses a queryNorm of 1.0).

We are thinking that we have the following options to fix this issue:

a)   Use BM25Similarity for all fields and adjust the k1 and b values as
appropriate.
b)   Send the metadata field clauses as part of fq instead of q (but we
might have a lot of dynamically generated clauses, and we are not sure fq is
well suited for these since we don't want them cached, as they can vary
from request to request -- see the sketch after this list).
c)   Associate a boost of zero with the metadata fields in the query.
d)   Extend the SchemaSimilarityFactory and write custom code (at this
point, I am not sure what the custom class should do).
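
For (b), the non-cached filter mentioned in that option could look roughly
like this (the metadata clause itself is just a placeholder):

    import org.apache.solr.client.solrj.SolrQuery;

    public class NonCachedFilterExample {
      // Keeps a highly variable metadata clause out of the filterCache.
      public static SolrQuery build(String userQuery, String metadataClause) {
        SolrQuery q = new SolrQuery(userQuery);
        // cache=false: evaluate the filter but do not store it in the filterCache;
        // adding cost=100 or more would additionally run it as a post filter.
        q.addFilterQuery("{!cache=false}" + metadataClause);
        return q;
      }
    }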


Are these correct?  Do we have any other options?  Any advice on which is the
better option?
I appreciate any input on this.




