Re: List all Collections together with number of records

Erick Erickson Sun, 07 Jun 2015 09:08:15 -0700

bq: we still need that information to be stored in a separate collection
for security reasons.


Not necessarily. I've seen lots of installations where "auth tokens" are
embedded in the document (say groups that can see this doc). Then
the front-end simply attaches &fq=auth_field:(groups each user belongs to)
to every query to restrict access.
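
As a rough sketch of that front-end step (Python, stdlib only): the helper
names build_auth_fq and build_query_string are made up for illustration,
and auth_field is just whatever field you index the group tokens into.
Nothing here is a Solr-provided API; it only assembles the query string.

```python
from urllib.parse import urlencode


def build_auth_fq(groups, field="auth_field"):
    """Build an fq clause matching docs tagged with any of the user's groups.

    `field` is an assumed field name; use whatever your schema defines.
    """
    if not groups:
        # Refuse to build a filter that would silently grant open access.
        raise ValueError("user has no groups; refusing to build a filter")
    quoted = []
    for group in groups:
        # Escape backslashes and double quotes, then quote as a phrase.
        escaped = group.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"' + escaped + '"')
    return field + ":(" + " OR ".join(quoted) + ")"


def build_query_string(user_query, groups):
    """Attach the access filter to every query the front end sends."""
    params = [("q", user_query), ("fq", build_auth_fq(groups)), ("wt", "json")]
    return urlencode(params)
```

The important design point is that the fq is added by the front end on
every request and never taken from user input.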

That said, some organizations aren't comfortable with this and demand
separate collections, in which case you're stuck.

You've defined an architecture though, and one consequence of it is that
if you have many collections, you'll have to fire off many queries
(perhaps in parallel, but still). There's no magic to get around that:
in what you've described, one query has to be fired at each collection.
Whether Solr does that for you or you spawn a bunch of threads on the
client, the same work has to happen somewhere.
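
For what it's worth, the client-side fan-out can be sketched like this
(Python rather than SolrJ, stdlib only). The base URL, the function names
and the pool size are all assumptions for illustration; the only Solr
pieces relied on are the per-collection /select handler with rows=0 and
the numFound value in the JSON response.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_BASE = "http://localhost:8983/solr"  # assumed location of the cluster


def fetch_num_found(collection, query, base_url=SOLR_BASE):
    """Ask one collection how many docs match; rows=0 skips fetching docs."""
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    url = f"{base_url}/{collection}/select?{params}"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["response"]["numFound"]


def count_per_collection(collections, query, fetch=fetch_num_found, max_workers=8):
    """Fire one count query per collection in parallel; return {name: count}."""
    if not collections:
        return {}
    workers = min(max_workers, len(collections))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Submit everything first so the requests overlap, then collect.
        futures = {name: pool.submit(fetch, name, query) for name in collections}
        return {name: future.result() for name, future in futures.items()}
```

Calling count_per_collection(names, "*:*") gives total counts and passing
the user's query gives the matching counts. It is still one request per
collection; the threads only overlap the waiting.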

You also have to figure out how to present the results to the user.
If it's a simple count you're OK, but scores will _not_ be comparable
across the various collections, so the presentation will be challenging.

Best,
Erick

On Sun, Jun 7, 2015 at 6:29 AM, Zheng Lin Edwin Yeo
<[email protected]> wrote:
> The reason we want to have different collections is that each of the
> collections has different fields, and that some collections will contain
> information that is more sensitive than others.
>
> As such, we may need to restrict access to certain collections for some
> users. Although the restriction will be done on the front-end client side,
> we still need that information to be stored in a separate collection
> for security reasons.
>
> Regards,
> Edwin
>
>
> On 7 June 2015 at 12:23, Erick Erickson <[email protected]> wrote:
>
>> bq: Yup this information will need to be collected each time the user
>> searches for a query, as we want to show the number of records that match
>> the search query in each of the collections.
>>
>> You're looking at something akin to "federated search". About all you can
>> do is send out parallel queries to each collection.
>>
>> This is an "interesting" requirement, and I really question whether it's
>> a wise thing to insist on. I'd really think about going back to the
>> design. For instance, could you consolidate all these collections into a
>> single one, with perhaps a collection_id? Then the problem is relatively
>> simple: use field collapsing (aka "grouping").
>>
>> Best,
>> Erick
>>
>> On Sat, Jun 6, 2015 at 6:40 PM, Zheng Lin Edwin Yeo
>> <[email protected]> wrote:
>> > Yup this information will need to be collected each time the user
>> > searches for a query, as we want to show the number of records that
>> > match the search query in each of the collections.
>> >
>> > Currently I only have 6 collections, but it could increase to hundreds of
>> > collections in the future. So I'm worried that it could slow down the
>> > system a lot if we have to pass hundreds of queries for each search
>> > request.
>> >
>> > Regards,
>> > Edwin
>> >
>> >
>> > On 5 June 2015 at 21:00, Upayavira <[email protected]> wrote:
>> >
>> >> I'm not so sure this is as bad as it sounds. When your collection is
>> >> sharded, no single node knows about the documents in other shards/nodes,
>> >> so to find the total number, a query will need to go to every node.
>> >>
>> >> Trying to work out something to do a single request to every node,
>> >> combine their collection statistics and aggregate them into a single
>> >> result sounds very complicated, and likely overkill.
>> >>
>> >> Are you needing to collect this information often? Do you have a lot of
>> >> collections?
>> >>
>> >> Upayavira
>> >>
>> >>
>> >> On Fri, Jun 5, 2015, at 06:29 AM, Zheng Lin Edwin Yeo wrote:
>> >> > I'm trying to write a SolrJ program in Java to read and consolidate
>> >> > all the information into a JSON file. The client will just need to
>> >> > call this SolrJ program and read this JSON file to get the details.
>> >> > But the problem is that we are still querying Solr once for each
>> >> > collection, just that this time it is done in the SolrJ program in a
>> >> > for-loop, while previously it was done on the client side. I'm not
>> >> > sure whether this will lead to a performance improvement.
>> >> >
>> >> > Regarding your suggestion of spawning a bunch of threads, is that the
>> >> > same thing as what I did?
>> >> >
>> >> > Regards,
>> >> > Edwin
>> >> >
>> >> >
>> >> > On 5 June 2015 at 12:03, Erick Erickson <[email protected]> wrote:
>> >> >
>> >> > > Have you considered spawning a bunch of threads, one per collection
>> >> > > and having them all run in parallel?
>> >> > >
>> >> > > Best,
>> >> > > Erick
>> >> > >
>> >> > > On Thu, Jun 4, 2015 at 4:52 PM, Zheng Lin Edwin Yeo
>> >> > > <[email protected]> wrote:
>> >> > > > The reason we wanted to do a single call is to improve
>> >> > > > performance, as our application needs to list the total number of
>> >> > > > records in each of the collections, and the number of records
>> >> > > > that match the query in each of the collections.
>> >> > > >
>> >> > > > Currently we are querying each collection one by one to retrieve
>> >> > > > the numFound value and display them, but this can slow down the
>> >> > > > system significantly when the number of collections grows. So we
>> >> > > > are thinking of ways to improve the speed in this area.
>> >> > > >
>> >> > > > Are there any other methods you can suggest to overcome this
>> >> > > > speed problem?
>> >> > > >
>> >> > > > Regards,
>> >> > > > Edwin
>> >> > > > On 5 Jun 2015 00:16, "Erick Erickson" <[email protected]> wrote:
>> >> > > >
>> >> > > >> Not in a single call that I know of. These are really orthogonal
>> >> > > >> concepts. Getting the cluster status merely involves reading the
>> >> > > >> Zookeeper clusterstate, whereas getting the total number of docs
>> >> > > >> for each would involve querying each collection, i.e. going to
>> >> > > >> the Solr nodes themselves. I'd guess it's unlikely to be combined.
>> >> > > >>
>> >> > > >> Best,
>> >> > > >> Erick
>> >> > > >>
>> >> > > >> On Thu, Jun 4, 2015 at 7:47 AM, Zheng Lin Edwin Yeo
>> >> > > >> <[email protected]> wrote:
>> >> > > >> > Hi,
>> >> > > >> >
>> >> > > >> > I'd like to check: are we able to use the Collections API or any
>> >> > > >> > other method to list all the collections in the cluster together
>> >> > > >> > with the number of records in each of the collections in one
>> >> > > >> > output?
>> >> > > >> >
>> >> > > >> > Currently, I only know of the List Collections call,
>> >> > > >> > /admin/collections?action=LIST. However, this only lists the
>> >> > > >> > names of the collections that are in the cluster, but not the
>> >> > > >> > number of records.
>> >> > > >> >
>> >> > > >> > Is there a way to show the number of records in each of the
>> >> > > >> > collections as well?
>> >> > > >> >
>> >> > > >> > Regards,
>> >> > > >> > Edwin
>> >> > > >>
>> >> > >
>> >>
>>
