Join Parent and Child Documents

2015-08-01 Thread Vineeth Dasaraju
Hi,

I have indexed a nested JSON object into Solr as a parent document with
child documents. Whenever I query for a term in the child documents, I am
returned only the child documents. Is it possible to get the parent
document along with the child documents as part of the results? I have
been trying to run stats on the parent documents but have not been able to
do so, because Solr returns the child documents, which always have a "null"
value for the field whose stats I am looking for. Is it possible to perform
stats on the parent documents when I get them from a query for the child
documents?

Regards,
Vineeth


Re: Join Parent and Child Documents

2015-08-01 Thread Mikhail Khludnev
On Sat, Aug 1, 2015 at 10:51 AM, Vineeth Dasaraju wrote:

> Hi,
>
> I had indexed a nested json object into solr as a parent document with
> child documents. Whenever I query for a term in the child document, I am
> returned only the child documents. Is it possible to get the parent
> document along with the child documents as a part of the results?

There is a special query parser for this:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
Did you use it?
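
For example, a rough SolrJ sketch (untested; "content_type" and
"comment_text" are made-up field names, adjust them to your schema) that
searches the children but returns the matching parents, with their children
attached via the [child] doc transformer:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BlockJoinExample {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    // {!parent which=...} matches the children but returns the enclosing parents
    SolrQuery q = new SolrQuery("{!parent which=\"content_type:parent\"}comment_text:solr");
    // [child] re-attaches each parent's children to the response (Solr 4.9+)
    q.setFields("*", "[child parentFilter=content_type:parent]");
    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getResults());
    client.close();
  }
}

With the parents back in the result set, a stats.field on a parent field
should see real values instead of the nulls from the child documents.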


> I have
> been trying to run stats on the parent document but have not been able to
> do so because solr returns the child document which always has a "null"
> value for the field whose stats I am looking for. Is it possible to perform
> stats on the parent documents I get them from a query for the child
> documents?
>
> Regards,
> Vineeth
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Do not match on high frequency terms

2015-08-01 Thread Mikhail Khludnev
It seems like you need to develop a custom query or query parser. Regarding
SolrJ: you can try to call the TermsComponent
(http://wiki.apache.org/solr/TermsComponent,
https://cwiki.apache.org/confluence/display/solr/The+Terms+Component). I'm
not sure exactly how to call the TermsComponent from SolrJ; I just found
https://lucene.apache.org/solr/5_2_1/solr-solrj/org/apache/solr/client/solrj/response/TermsResponse.html
for reading its response.
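
A rough sketch of how that could look (untested; assumes a /terms request
handler is defined in solrconfig.xml and a field named "body"):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.TermsResponse;

public class TermFrequencyCheck {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    SolrQuery q = new SolrQuery();
    q.setRequestHandler("/terms");   // route the request to the terms handler
    q.setTerms(true);
    q.addTermsField("body");         // made-up field name
    q.setTermsPrefix("blah");        // crude way to look up one query term's frequency
    TermsResponse terms = client.query(q).getTermsResponse();
    for (TermsResponse.Term t : terms.getTerms("body")) {
      // getFrequency() is the document frequency; drop query terms above your threshold
      System.out.println(t.getTerm() + " -> " + t.getFrequency());
    }
    client.close();
  }
}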

On Fri, Jul 31, 2015 at 11:31 PM, Swedish, Steve wrote:

> Hello,
>
> I'm hoping someone might be able to help me out with this as I do not have
> very much solr experience. Basically, I am wondering if it is possible to
> not match on terms that have a document frequency above a certain
> threshold. For my situation, a stop word list will be unrealistic to
> maintain, so I was wondering if there may be an alternative solution using
> term document frequency to identify common terms.
>
> What would actually be ideal is if I could somehow use the
> CommonTermsQuery. The problem I ran across when looking at this option was
> that the CommonTermsQuery seems to only work for queries on one field at a
> time (unless I'm mistaken). However, I have a query of the structure
> q=(field1:(blah) AND (field2:(blah) OR field3:(blah))) OR field1:(blah) OR
> (field2:(blah) AND field3:(blah)). If there are any ideas on how to use the
> CommonTermsQuery with this query structure, that would be great.
>
> If it's possible to extract the document frequency for terms in my query
> before the query is run, allowing me to remove the high frequency terms
> from the query first, that could also be a valid solution. I'm using solrj
> as well, so a solution that works with solrj would be appreciated.
>
> Thanks,
> Steve
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Personalized Search Results or Matching Documents to Users

2015-08-01 Thread Erick Erickson
How soon? It's pretty much done AFAIK, but the folks trying to work on
it have had their priorities re-arranged.

So I really don't have a date.

Erick

On Fri, Jul 31, 2015 at 4:59 PM, Upayavira  wrote:
> How soon? And will you be able to use them for querying, or just
> faceting/sorting/displaying?
>
> Thx!
>
> Upayavira
>
> On Fri, Jul 31, 2015, at 09:27 PM, Erick Erickson wrote:
>> And coming soon will be docvalues field updates that don't require
>> reindexing the whole doc.
>>
>> Best,
>> Erick
>> On Jul 31, 2015 6:51 AM, "Upayavira"  wrote:
>>
>> > On Thu, Jul 30, 2015, at 07:29 PM, Shawn Heisey wrote:
>> > > On 7/30/2015 10:46 AM, Robert Farrior wrote:
>> > > > We have a requirement to be able to have a master product catalog and
>> > to
>> > > > create a sub-catalog of products per user. This means I may have 10,000
>> > > > users who each create their own list of documents. This is a simple
>> > mapping
>> > > > of user to documents. The full data about the documents would be in
>> > the main
>> > > > catalog.
>> > > >
>> > > > What approaches would allow Solr to only return the results that are
>> > in the
>> > > > user's list?  It seems like I would need a couple of steps in the
>> > process.
>> > > > In other words, the main catalog has 3 documents: A, B and C. I have 2
>> > > > users. User 1 has access to documents A and C but not B. User 2 has
>> > access
>> > > > to documents C and B but not A.
>> > > >
>> > > > When a user searches, I want to only return documents that the user has
>> > > > access to.
>> > >
>> > > A common approach for Solr would be to have a multivalued "user" field
>> > > on each document, which has individual values for each user that can
>> > > access the document.  When you index the document, you included values
>> > > in this field listing all the users that can access that document.
>> > >
>> > > Then you simply filter by user:
>> > >
>> > > fq=user:joe
>> > >
>> > > This is EXTREMELY efficient at query time, especially when the number of
>> > > users is much smaller than the number of documents.  It may complicate
>> > > indexing somewhat, but indexing is an extremely custom operation that
>> > > users have to write themselves, so it probably won't be horrible.
>> >
>> > Things to consider:
>> >
>> >  * How often are documents assigned to new users?
>> >  * How many documents does a user typically have?
>> >  * Do you have a 'trigger' in your app that tells you a user has been
>> >  assigned
>> >a new doc?
>> >
>> > You can use a pseudo join to implement this sort of thing - have a
>> > different core that contains the 'permissions', either a document that
>> > says "this document ID is accessible via these users" or "this user is
>> > allowed to see these document IDs". You are keeping your fast moving
>> > (authorization) data separate from your slow moving (the docs
>> > themselves) data.
>> >
>> > You can then say "find me all documents that are accessible via user X"
>> >
>> > Upayavira
>> >


Fast autocomplete for large dataset

2015-08-01 Thread Olivier Austina
Hi,

I am looking for a fast and easy-to-maintain way to do autocomplete for a
large dataset in Solr. I have heard about the Ternary Search Tree (TST),
but I would like to know if there is something I have missed, such as a
best practice or a newer Solr feature. Any suggestion is welcome. Thank you.

Regards
Olivier


Re: Fast autocomplete for large dataset

2015-08-01 Thread Erick Erickson
Well, defining what you mean by "autocomplete" would be a start. If it's just
that a user types some letters and you suggest the next N terms in the list,
TermsComponent will fix you right up.

If it's more complicated, the AutoSuggest functionality might help.

If it's correcting spelling, there's the spellchecker.

Best,
Erick

On Sat, Aug 1, 2015 at 10:00 AM, Olivier Austina wrote:
> Hi,
>
> I am looking for a fast and easy to maintain way to do autocomplete for
> large dataset in solr. I heard about Ternary Search Tree (TST)
> .
> But I would like to know if there is something I missed such as best
> practice, Solr new feature. Any suggestion is welcome. Thank you.
>
> Regards
> Olivier


Re: Fast autocomplete for large dataset

2015-08-01 Thread Olivier Austina
Thank you Erick for your reply.
If I understand correctly, these approaches use the index to hold the
terms. As the index grows bigger, that can become a performance issue.
Is that right? Could you please look at this article to see what I mean?
Thank you.

Regards
Olivier


2015-08-01 17:42 GMT+02:00 Erick Erickson :

> Well, defining what you mean by "autocomplete" would be a start. If it's
> just
> a user types some letters and you suggest the next N terms in the list,
> TermsComponent will fix you right up.
>
> If it's more complicated, the AutoSuggest functionality might help.
>
> If it's correcting spelling, there's the spellchecker.
>
> Best,
> Erick
>
> On Sat, Aug 1, 2015 at 10:00 AM, Olivier Austina
>  wrote:
> > Hi,
> >
> > I am looking for a fast and easy to maintain way to do autocomplete for
> > large dataset in solr. I heard about Ternary Search Tree (TST)
> > .
> > But I would like to know if there is something I missed such as best
> > practice, Solr new feature. Any suggestion is welcome. Thank you.
> >
> > Regards
> > Olivier
>


Re: Fast autocomplete for large dataset

2015-08-01 Thread Erick Erickson
Not really. There's no need to use ngrams as the article suggests if the
terms component does what you need. Which is why I asked you about what
autocomplete means in your context. Which you have not clarified. Have you
even looked at terms component?  Especially the terms.prefix option?

Terms component has its limitations, but performance isn't one of them.
The suggesters mentioned in the article have other limitations. It's really
useless to discuss those limitations, though, until the problem you're
trying to solve is clearly stated.
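
For instance, with a made-up field name "title_ac", a request like

/terms?terms=true&terms.fl=title_ac&terms.prefix=sol&terms.limit=10

returns the first ten indexed terms starting with whatever the user has
typed, along with their document counts. That is the basic building block
for simple prefix autocomplete, and it reads straight off the terms
dictionary, so there is nothing extra to build or maintain.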
On Aug 1, 2015 1:01 PM, "Olivier Austina"  wrote:

> Thank you Eric for your reply.
> If I understand it seems that these approaches are using index to hold
> terms. As the index grows bigger, it can be a performance issues.
> Is it right? Please can you check this article
>  to see
> what I mean?   Thank you.
>
> Regards
> Olivier
>
>
> 2015-08-01 17:42 GMT+02:00 Erick Erickson :
>
> > Well, defining what you mean by "autocomplete" would be a start. If it's
> > just
> > a user types some letters and you suggest the next N terms in the list,
> > TermsComponent will fix you right up.
> >
> > If it's more complicated, the AutoSuggest functionality might help.
> >
> > If it's correcting spelling, there's the spellchecker.
> >
> > Best,
> > Erick
> >
> > On Sat, Aug 1, 2015 at 10:00 AM, Olivier Austina
> >  wrote:
> > > Hi,
> > >
> > > I am looking for a fast and easy to maintain way to do autocomplete for
> > > large dataset in solr. I heard about Ternary Search Tree (TST)
> > > .
> > > But I would like to know if there is something I missed such as best
> > > practice, Solr new feature. Any suggestion is welcome. Thank you.
> > >
> > > Regards
> > > Olivier
> >
>


Re: Fast autocomplete for large dataset

2015-08-01 Thread Olivier Austina
Thank you Erick,

I would like to implement autocomplete for a large dataset. The
autocomplete should show the phrase or question the user wants as the
user types. The requirement is that the autocomplete should be fast (not
slowed down by the volume of data as the dataset becomes bigger) and easy
to maintain. The autocomplete can have its own Solr server. It is an
autocomplete like any other, except that it must be fast and easy to
maintain.

What are the limitations of the suggesters mentioned in the article? Thank you.

Regards
Olivier


2015-08-01 19:41 GMT+02:00 Erick Erickson :

> Not really. There's no need to use ngrams as the article suggests if the
> terms component does what you need. Which is why I asked you about what
> autocomplete means in your context. Which you have not clarified. Have you
> even looked at terms component?  Especially the terms.prefix option?
>
> Terms component has it's limitations, but performance isn't one of them.
> The suggesters mentioned in the article have other limitations. It's really
> useless to discuss those limitations, though, until the problem you're
> trying to solve is clearly stated.
> On Aug 1, 2015 1:01 PM, "Olivier Austina" 
> wrote:
>
> > Thank you Eric for your reply.
> > If I understand it seems that these approaches are using index to hold
> > terms. As the index grows bigger, it can be a performance issues.
> > Is it right? Please can you check this article
> >  to see
> > what I mean?   Thank you.
> >
> > Regards
> > Olivier
> >
> >
> > 2015-08-01 17:42 GMT+02:00 Erick Erickson :
> >
> > > Well, defining what you mean by "autocomplete" would be a start. If
> it's
> > > just
> > > a user types some letters and you suggest the next N terms in the list,
> > > TermsComponent will fix you right up.
> > >
> > > If it's more complicated, the AutoSuggest functionality might help.
> > >
> > > If it's correcting spelling, there's the spellchecker.
> > >
> > > Best,
> > > Erick
> > >
> > > On Sat, Aug 1, 2015 at 10:00 AM, Olivier Austina
> > >  wrote:
> > > > Hi,
> > > >
> > > > I am looking for a fast and easy to maintain way to do autocomplete
> for
> > > > large dataset in solr. I heard about Ternary Search Tree (TST)
> > > > .
> > > > But I would like to know if there is something I missed such as best
> > > > practice, Solr new feature. Any suggestion is welcome. Thank you.
> > > >
> > > > Regards
> > > > Olivier
> > >
> >
>


Re: Personalized Search Results or Matching Documents to Users

2015-08-01 Thread Upayavira
ticket?

On Sat, Aug 1, 2015, at 02:02 PM, Erick Erickson wrote:
> How soon? It's pretty much done AFAIK, but the folks trying to work on
> it have had their priorities re-arranged.
> 
> So I really don't have a date.
> 
> Erick
> 
> On Fri, Jul 31, 2015 at 4:59 PM, Upayavira  wrote:
> > How soon? And will you be able to use them for querying, or just
> > faceting/sorting/displaying?
> >
> > Thx!
> >
> > Upayavira
> >
> > On Fri, Jul 31, 2015, at 09:27 PM, Erick Erickson wrote:
> >> And coming soon will be docvalues field updates that don't require
> >> reindexing the whole doc.
> >>
> >> Best,
> >> Erick
> >> On Jul 31, 2015 6:51 AM, "Upayavira"  wrote:
> >>
> >> > On Thu, Jul 30, 2015, at 07:29 PM, Shawn Heisey wrote:
> >> > > On 7/30/2015 10:46 AM, Robert Farrior wrote:
> >> > > > We have a requirement to be able to have a master product catalog and
> >> > to
> >> > > > create a sub-catalog of products per user. This means I may have 
> >> > > > 10,000
> >> > > > users who each create their own list of documents. This is a simple
> >> > mapping
> >> > > > of user to documents. The full data about the documents would be in
> >> > the main
> >> > > > catalog.
> >> > > >
> >> > > > What approaches would allow Solr to only return the results that are
> >> > in the
> >> > > > user's list?  It seems like I would need a couple of steps in the
> >> > process.
> >> > > > In other words, the main catalog has 3 documents: A, B and C. I have 
> >> > > > 2
> >> > > > users. User 1 has access to documents A and C but not B. User 2 has
> >> > access
> >> > > > to documents C and B but not A.
> >> > > >
> >> > > > When a user searches, I want to only return documents that the user 
> >> > > > has
> >> > > > access to.
> >> > >
> >> > > A common approach for Solr would be to have a multivalued "user" field
> >> > > on each document, which has individual values for each user that can
> >> > > access the document.  When you index the document, you included values
> >> > > in this field listing all the users that can access that document.
> >> > >
> >> > > Then you simply filter by user:
> >> > >
> >> > > fq=user:joe
> >> > >
> >> > > This is EXTREMELY efficient at query time, especially when the number 
> >> > > of
> >> > > users is much smaller than the number of documents.  It may complicate
> >> > > indexing somewhat, but indexing is an extremely custom operation that
> >> > > users have to write themselves, so it probably won't be horrible.
> >> >
> >> > Things to consider:
> >> >
> >> >  * How often are documents assigned to new users?
> >> >  * How many documents does a user typically have?
> >> >  * Do you have a 'trigger' in your app that tells you a user has been
> >> >  assigned
> >> >a new doc?
> >> >
> >> > You can use a pseudo join to implement this sort of thing - have a
> >> > different core that contains the 'permissions', either a document that
> >> > says "this document ID is accessible via these users" or "this user is
> >> > allowed to see these document IDs". You are keeping your fast moving
> >> > (authorization) data separate from your slow moving (the docs
> >> > themselves) data.
> >> >
> >> > You can then say "find me all documents that are accessible via user X"
> >> >
> >> > Upayavira
> >> >


Re: Personalized Search Results or Matching Documents to Users

2015-08-01 Thread Mikhail Khludnev
On Sat, Aug 1, 2015 at 9:45 PM, Upayavira  wrote:

> ticket?
>
https://issues.apache.org/jira/browse/SOLR-5944


>
> On Sat, Aug 1, 2015, at 02:02 PM, Erick Erickson wrote:
> > How soon? It's pretty much done AFAIK, but the folks trying to work on
> > it have had their priorities re-arranged.
> >
> > So I really don't have a date.
> >
> > Erick
> >
> > On Fri, Jul 31, 2015 at 4:59 PM, Upayavira  wrote:
> > > How soon? And will you be able to use them for querying, or just
> > > faceting/sorting/displaying?
> > >
> > > Thx!
> > >
> > > Upayavira
> > >
> > > On Fri, Jul 31, 2015, at 09:27 PM, Erick Erickson wrote:
> > >> And coming soon will be docvalues field updates that don't require
> > >> reindexing the whole doc.
> > >>
> > >> Best,
> > >> Erick
> > >> On Jul 31, 2015 6:51 AM, "Upayavira"  wrote:
> > >>
> > >> > On Thu, Jul 30, 2015, at 07:29 PM, Shawn Heisey wrote:
> > >> > > On 7/30/2015 10:46 AM, Robert Farrior wrote:
> > >> > > > We have a requirement to be able to have a master product
> catalog and
> > >> > to
> > >> > > > create a sub-catalog of products per user. This means I may
> have 10,000
> > >> > > > users who each create their own list of documents. This is a
> simple
> > >> > mapping
> > >> > > > of user to documents. The full data about the documents would
> be in
> > >> > the main
> > >> > > > catalog.
> > >> > > >
> > >> > > > What approaches would allow Solr to only return the results
> that are
> > >> > in the
> > >> > > > user's list?  It seems like I would need a couple of steps in
> the
> > >> > process.
> > >> > > > In other words, the main catalog has 3 documents: A, B and C. I
> have 2
> > >> > > > users. User 1 has access to documents A and C but not B. User 2
> has
> > >> > access
> > >> > > > to documents C and B but not A.
> > >> > > >
> > >> > > > When a user searches, I want to only return documents that the
> user has
> > >> > > > access to.
> > >> > >
> > >> > > A common approach for Solr would be to have a multivalued "user"
> field
> > >> > > on each document, which has individual values for each user that
> can
> > >> > > access the document.  When you index the document, you included
> values
> > >> > > in this field listing all the users that can access that document.
> > >> > >
> > >> > > Then you simply filter by user:
> > >> > >
> > >> > > fq=user:joe
> > >> > >
> > >> > > This is EXTREMELY efficient at query time, especially when the
> number of
> > >> > > users is much smaller than the number of documents.  It may
> complicate
> > >> > > indexing somewhat, but indexing is an extremely custom operation
> that
> > >> > > users have to write themselves, so it probably won't be horrible.
> > >> >
> > >> > Things to consider:
> > >> >
> > >> >  * How often are documents assigned to new users?
> > >> >  * How many documents does a user typically have?
> > >> >  * Do you have a 'trigger' in your app that tells you a user has
> been
> > >> >  assigned
> > >> >a new doc?
> > >> >
> > >> > You can use a pseudo join to implement this sort of thing - have a
> > >> > different core that contains the 'permissions', either a document
> that
> > >> > says "this document ID is accessible via these users" or "this user
> is
> > >> > allowed to see these document IDs". You are keeping your fast moving
> > >> > (authorization) data separate from your slow moving (the docs
> > >> > themselves) data.
> > >> >
> > >> > You can then say "find me all documents that are accessible via
> user X"
> > >> >
> > >> > Upayavira
> > >> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Avoid re indexing

2015-08-01 Thread naga sharathrayapati
I got an exception on one of the documents after indexing 6 million
documents out of 10 million. Is there any way I can avoid re-indexing the
6 million documents?

I also see that a few documents are deleted (based on the count) while
indexing. Is there a way to identify which documents those are?

Can I add a shard to a collection without re-indexing?


Re: Avoid re indexing

2015-08-01 Thread Upayavira


On Sat, Aug 1, 2015, at 10:30 PM, naga sharathrayapati wrote:
> I have an exception with one of the document after indexing 6 mil
> documents
> out of 10 mil, is there any way i can avoid re indexing the 6 mil
> documents?

How are you indexing your documents? Are you using the DIH? Personally,
I'd recommend you write your own app to push your content to Solr, then
you will be able to control exceptions more precisely and have the
behaviour you expect.
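
For instance, a minimal SolrJ loop (just a sketch; the field names and
sample data are made up) that logs and skips a failing document instead of
aborting the whole run:

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PushToSolr {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
    String[] ids = {"1", "2", "3"};              // stand-in for your real source data
    for (String id : ids) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", id);
      doc.addField("title_s", "example " + id);  // made-up field
      try {
        client.add(doc);                         // one bad document no longer stops the job
      } catch (Exception e) {                    // e.g. a 400 from a malformed document
        System.err.println("Skipping doc " + id + ": " + e.getMessage());
      }
    }
    client.commit();
    client.close();
  }
}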

> I also see that there are few documents that are deleted (based on the
> count) while indexing, is there a way to identify what are those
> documents?

If you see deleted documents but are not actually deleting any, this
will be because you have updated documents with an existing ID. An
update is actually a delete followed by an insert.

> can i add shard to a collection without re indexing?

You cannot just add a new shard to an existing collection (at least not one
that is using the compositeId router, the default). If a shard is too
large, you will need to split an existing shard, which you can do with
the collections API.

It is much better, though, to start with the right number of shards if at
all possible.

Upayavira


Re: Avoid re indexing

2015-08-01 Thread naga sharathrayapati
I am using SolrJ to index documents.

I agree with you regarding the index update, but I should not see any
deleted documents since this is a fresh index. Can we actually identify
which documents those deleted ones are?

If there is no option for adding shards to an existing collection, I do
not like the idea of re-indexing the whole data set (it takes hours). We
went with a good number of shards, but there has been a rapid increase in
data size over the past few days. Do you think it is worth logging a
ticket?

On Sat, Aug 1, 2015 at 5:04 PM, Upayavira  wrote:

>
>
> On Sat, Aug 1, 2015, at 10:30 PM, naga sharathrayapati wrote:
> > I have an exception with one of the document after indexing 6 mil
> > documents
> > out of 10 mil, is there any way i can avoid re indexing the 6 mil
> > documents?
>
> How are you indexing your documents? Are you using the DIH? Personally,
> I'd recommend you write your own app to push your content to Solr, then
> you will be able to control exceptions more precisely and have the
> behaviour you expect.
>
> > I also see that there are few documents that are deleted (based on the
> > count) while indexing, is there a way to identify what are those
> > documents?
>
> If you see deleted documents but are not actually deleting any, this
> will be because you have updated documents with an existing ID. An
> update is actually a delete followed by an insert.
>
> > can i add shard to a collection without re indexing?
>
> You cannot just add a new shard to an existing collection (at least, one
> that is using the compositeId router (the default). If a shard is too
> large, you will need to split an existing shard, which you can do with
> the collections API.
>
> It is much better though, to start with the right number of shards if at
> all possible.
>
> Upayavira
>


Re: Avoid re indexing

2015-08-01 Thread Upayavira


On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
> I am using solrj to index documents
> 
> i agree with you regarding the index update but i should not see any
> deleted documents as it is a fresh index. Can we actually identify what
> are
> those deleted documents?

If you post doc 1234, then you post doc 1234 a second time, you will see
a deletion in your index. If you don't want deletions to show in your
index, be sure NEVER to update a document, only add new ones with
absolutely distinct document IDs.

You cannot see (via Solr) which docs are deleted. You could, I suppose,
introspect the Lucene index, but that would most definitely be an expert
task.

> if there is no option of adding shards to existing collection i do not
> like
> the idea of re indexing the whole data (worth hours) and we have gone
> with
> good number of shards but there is a rapid increase of size in data over
> the past few days, do you think is it worth logging a ticket?

You can split a shard. See the collections API:

https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
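
For example (collection and shard names are placeholders), splitting
shard1 of "mycollection" is one call to that API:

http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1

Both halves are built alongside the original shard, so expect extra disk
usage on that node while the split runs.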

What would you want to log a ticket for? I'm not sure that there's
anything that would require that.

Upayavira


Re: Avoid re indexing

2015-08-01 Thread Nagasharath
If my current shard is holding 3 million documents, will each new subshard
after splitting also be able to hold 3 million documents? If that is the
case, then after splitting a shard in two, the two subshards together
should hold 6 million documents. Am I right?

> On 01-Aug-2015, at 5:43 pm, Upayavira  wrote:
> 
> 
> 
>> On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
>> I am using solrj to index documents
>> 
>> i agree with you regarding the index update but i should not see any
>> deleted documents as it is a fresh index. Can we actually identify what
>> are
>> those deleted documents?
> 
> If you post doc 1234, then you post doc 1234 a second time, you will see
> a deletion in your index. If you don't want deletions to show in your
> index, be sure NEVER to update a document, only add new ones with
> absolutely distinct document IDs.
> 
> You cannot see (via Solr) which docs are deleted. You could, I suppose,
> introspect the Lucene index, but that would most definitely be an expert
> task.
> 
>> if there is no option of adding shards to existing collection i do not
>> like
>> the idea of re indexing the whole data (worth hours) and we have gone
>> with
>> good number of shards but there is a rapid increase of size in data over
>> the past few days, do you think is it worth logging a ticket?
> 
> You can split a shard. See the collections API:
> 
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
> 
> What would you want to log a ticket for? I'm not sure that there's
> anything that would require that.
> 
> Upayavira


Re: Avoid re indexing

2015-08-01 Thread Upayavira
Erm, that doesn't seem to make sense. Seems like you are talking about
*merging* shards.

Say you had two shards, 3m docs each:

shard1: 3m docs
shard2: 3m docs

If you split shard1, you would have:

shard1_0: 1.5m docs
shard1_1: 1.5m docs
shard2: 3m docs

You could, of course, then split shard2. You could also split shard1
into three parts instead, if you preferred:

shard1_0: 1m docs
shard1_1: 1m docs
shard1_2: 1m docs
shard2: 3m docs

Upayavira

On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
> If my current shard is holding 3 million documents will the new subshard
> after splitting also be able to hold 3 million documents?
> If that is the case After shard splitting the sub shards should hold 6
> million documents if a shard is split in to two. Am I right?
> 
> > On 01-Aug-2015, at 5:43 pm, Upayavira  wrote:
> > 
> > 
> > 
> >> On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
> >> I am using solrj to index documents
> >> 
> >> i agree with you regarding the index update but i should not see any
> >> deleted documents as it is a fresh index. Can we actually identify what
> >> are
> >> those deleted documents?
> > 
> > If you post doc 1234, then you post doc 1234 a second time, you will see
> > a deletion in your index. If you don't want deletions to show in your
> > index, be sure NEVER to update a document, only add new ones with
> > absolutely distinct document IDs.
> > 
> > You cannot see (via Solr) which docs are deleted. You could, I suppose,
> > introspect the Lucene index, but that would most definitely be an expert
> > task.
> > 
> >> if there is no option of adding shards to existing collection i do not
> >> like
> >> the idea of re indexing the whole data (worth hours) and we have gone
> >> with
> >> good number of shards but there is a rapid increase of size in data over
> >> the past few days, do you think is it worth logging a ticket?
> > 
> > You can split a shard. See the collections API:
> > 
> > https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
> > 
> > What would you want to log a ticket for? I'm not sure that there's
> > anything that would require that.
> > 
> > Upayavira


Re: Avoid re indexing

2015-08-01 Thread Nagasharath
Yes, shard splitting will only help with managing large clusters and
improving query performance. In my case the index has grown to fill the
existing shards (there is no capacity left) across the collection, so
adding a new shard would help, and for that I have to re-index.


> On 01-Aug-2015, at 6:34 pm, Upayavira  wrote:
> 
> Erm, that doesn't seem to make sense. Seems like you are talking about
> *merging* shards.
> 
> Say you had two shards, 3m docs each:
> 
> shard1: 3m docs
> shard2: 3m docs
> 
> If you split shard1, you would have:
> 
> shard1_0: 1.5m docs
> shard1_1: 1.5m docs
> shard2: 3m docs
> 
> You could, of course, then split shard2. You could also split shard1
> into three parts instead, if you preferred:
> 
> shard1_0: 1m docs
> shard1_1: 1m docs
> shard1_2: 1m docs
> shard2: 3m docs
> 
> Upayavira
> 
>> On Sun, Aug 2, 2015, at 12:25 AM, Nagasharath wrote:
>> If my current shard is holding 3 million documents will the new subshard
>> after splitting also be able to hold 3 million documents?
>> If that is the case After shard splitting the sub shards should hold 6
>> million documents if a shard is split in to two. Am I right?
>> 
>>> On 01-Aug-2015, at 5:43 pm, Upayavira  wrote:
>>> 
>>> 
>>> 
 On Sat, Aug 1, 2015, at 11:29 PM, naga sharathrayapati wrote:
 I am using solrj to index documents
 
 i agree with you regarding the index update but i should not see any
 deleted documents as it is a fresh index. Can we actually identify what
 are
 those deleted documents?
>>> 
>>> If you post doc 1234, then you post doc 1234 a second time, you will see
>>> a deletion in your index. If you don't want deletions to show in your
>>> index, be sure NEVER to update a document, only add new ones with
>>> absolutely distinct document IDs.
>>> 
>>> You cannot see (via Solr) which docs are deleted. You could, I suppose,
>>> introspect the Lucene index, but that would most definitely be an expert
>>> task.
>>> 
 if there is no option of adding shards to existing collection i do not
 like
 the idea of re indexing the whole data (worth hours) and we have gone
 with
 good number of shards but there is a rapid increase of size in data over
 the past few days, do you think is it worth logging a ticket?
>>> 
>>> You can split a shard. See the collections API:
>>> 
>>> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3
>>> 
>>> What would you want to log a ticket for? I'm not sure that there's
>>> anything that would require that.
>>> 
>>> Upayavira


solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Jay Potharaju
Hi

I currently have a single collection with 40 million documents and an index
size of 25 GB. The collection gets updated every n minutes, and as a result
the number of deleted documents is constantly growing. The data in the
collection is an amalgamation of more than 1000 customers' records. The
number of documents per customer is around 100,000 on average.

That being said, I'm trying to get a handle on the growing number of
deleted documents. Because of the growing index size, both disk space and
memory are being used up, and I would like to reduce them to a manageable
size.

I have been thinking of splitting the data into multiple cores, one for
each customer. This would let me manage each smaller collection easily and
create/update it quickly. My concern is that the number of collections
might become an issue. Any suggestions on how to address this problem?
What are my other alternatives to moving to multiple collections?

Solr: 4.9
Index size: 25 GB
Max doc: 40 million
Doc count: 29 million

Replication: 4

4 servers in SolrCloud.

Thanks
Jay


Re: solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Erick Erickson
40 million docs isn't really very many by modern standards,
although if they're huge documents then that might be an issue.

So is this a single shard or multiple shards? If you're really facing
performance issues, simply making a new collection with more
than one shard (independent of how many replicas each has) is
probably simplest.

The number of deleted documents really shouldn't be a problem.
Typically the deleted documents are purged during segment
merging that happens automatically as you add documents. I often
see 10-15% of the corpus consist of deleted documents.

You can force the purge by doing a force merge (aka optimization), but that
is usually not recommended unless you have a strange situation where you
have lots and lots of deleted docs, as measured on the Admin UI page by
the "deleted docs" entry relative to the maxDoc number.

So show us what you're seeing that's concerning. Typically, especially
on an index that's continually getting updates it's adequate to just
let the background segment merging take care of things.

Best,
Erick

On Sat, Aug 1, 2015 at 8:49 PM, Jay Potharaju  wrote:
> Hi
>
> I currently have a single collection with 40 million documents and index
> size of 25 GB. The collections gets updated every n minutes and as a result
> the number of deleted documents is constantly growing. The data in the
> collection is an amalgamation of more than 1000+ customer records. The
> number of documents per each customer is around 100,000 records on average.
>
> Now that being said, I 'm trying to get an handle on the growing deleted
> document size. Because of the growing index size both the disk space and
> memory is being used up. And would like to reduce it to a manageable size.
>
> I have been thinking of splitting the data into multiple core, 1 for each
> customer. This would allow me manage the smaller collection easily and can
> create/update the collection also fast. My concern is that number of
> collections might become an issue. Any suggestions on how to address this
> problem. What are my other alternatives to moving to a multicore
> collections.?
>
> Solr: 4.9
> Index size:25 GB
> Max doc: 40 million
> Doc count:29 million
>
> Replication:4
>
> 4 servers in solrcloud.
>
> Thanks
> Jay


Re: Fast autocomplete for large dataset

2015-08-01 Thread Erick Erickson
Here's some background:

http://lucidworks.com/blog/solr-suggester/

Basically, the limitation is that to build the suggester all docs in
the index need to be read to pull out the stored field and build
either the FST or the sidecar Lucene index, which can be a _very_
costly operation (as in minutes/hours for a large dataset).

bq: The requirement is that the autocomplete should be fast (not
slowdown by the volume of data as dataset become bigger)

Well, in some alternate universe this may be possible. But the larger
the corpus, the slower the processing will be; there's just no way
around that. Whether it's fast enough for your application is a better
question ;).

Best,
Erick


On Sat, Aug 1, 2015 at 2:05 PM, Olivier Austina wrote:
> Thank you Eric,
>
> I would like to implement an autocomplete for large dataset.  The
> autocomplete should show the phrase or the question the user want as the
> user types. The requirement is that the autocomplete should be fast (not
> slowdown by the volume of data as dataset become bigger), and easy to
> maintain. The autocomplete can have its own Solr server.  It is an
> autocomplete like others but it should be only fast and easy to maintain.
>
> What is the limitations of suggesters mentioned in the article? Thank you.
>
> Regards
> Olivier
>
>
> 2015-08-01 19:41 GMT+02:00 Erick Erickson :
>
>> Not really. There's no need to use ngrams as the article suggests if the
>> terms component does what you need. Which is why I asked you about what
>> autocomplete means in your context. Which you have not clarified. Have you
>> even looked at terms component?  Especially the terms.prefix option?
>>
>> Terms component has it's limitations, but performance isn't one of them.
>> The suggesters mentioned in the article have other limitations. It's really
>> useless to discuss those limitations, though, until the problem you're
>> trying to solve is clearly stated.
>> On Aug 1, 2015 1:01 PM, "Olivier Austina" 
>> wrote:
>>
>> > Thank you Eric for your reply.
>> > If I understand it seems that these approaches are using index to hold
>> > terms. As the index grows bigger, it can be a performance issues.
>> > Is it right? Please can you check this article
>> >  to see
>> > what I mean?   Thank you.
>> >
>> > Regards
>> > Olivier
>> >
>> >
>> > 2015-08-01 17:42 GMT+02:00 Erick Erickson :
>> >
>> > > Well, defining what you mean by "autocomplete" would be a start. If
>> it's
>> > > just
>> > > a user types some letters and you suggest the next N terms in the list,
>> > > TermsComponent will fix you right up.
>> > >
>> > > If it's more complicated, the AutoSuggest functionality might help.
>> > >
>> > > If it's correcting spelling, there's the spellchecker.
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Sat, Aug 1, 2015 at 10:00 AM, Olivier Austina
>> > >  wrote:
>> > > > Hi,
>> > > >
>> > > > I am looking for a fast and easy to maintain way to do autocomplete
>> for
>> > > > large dataset in solr. I heard about Ternary Search Tree (TST)
>> > > > .
>> > > > But I would like to know if there is something I missed such as best
>> > > > practice, Solr new feature. Any suggestion is welcome. Thank you.
>> > > >
>> > > > Regards
>> > > > Olivier
>> > >
>> >
>>


Re: solr multicore vs sharding vs 1 big collection

2015-08-01 Thread Shawn Heisey
On 8/1/2015 6:49 PM, Jay Potharaju wrote:
> I currently have a single collection with 40 million documents and index
> size of 25 GB. The collections gets updated every n minutes and as a result
> the number of deleted documents is constantly growing. The data in the
> collection is an amalgamation of more than 1000+ customer records. The
> number of documents per each customer is around 100,000 records on average.
> 
> Now that being said, I 'm trying to get an handle on the growing deleted
> document size. Because of the growing index size both the disk space and
> memory is being used up. And would like to reduce it to a manageable size.
> 
> I have been thinking of splitting the data into multiple core, 1 for each
> customer. This would allow me manage the smaller collection easily and can
> create/update the collection also fast. My concern is that number of
> collections might become an issue. Any suggestions on how to address this
> problem. What are my other alternatives to moving to a multicore
> collections.?
> 
> Solr: 4.9
> Index size:25 GB
> Max doc: 40 million
> Doc count:29 million
> 
> Replication:4
> 
> 4 servers in solrcloud.

Creating 1000+ collections in SolrCloud is definitely problematic.  If
you need to choose between a lot of shards and a lot of collections, I
would definitely go with a lot of shards.  I would also want a lot of
servers for an index with that many pieces.

https://issues.apache.org/jira/browse/SOLR-7191

I don't think it would matter how many collections or shards you have
when it comes to how many deleted documents are in your index.  If you
want to clean up a large number of deletes in an index, the best option
is an optimize.  An optimize requires a large amount of disk I/O, so it
can be extremely disruptive if the query volume is high.  It should be
done when the query volume is at its lowest.  For the index you
describe, a nightly or weekly optimize seems like a good option.
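
If you script that, it is a single (but heavy) call from SolrJ. A sketch
with made-up ZooKeeper and collection names:

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class NightlyOptimize {
  public static void main(String[] args) throws Exception {
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
    client.setDefaultCollection("mycollection");
    client.optimize();   // force-merges segments and purges deleted docs; run it off-peak
    client.close();
  }
}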

Aside from having a lot of deleted documents in your index, what kind of
problems are you trying to solve?

Thanks,
Shawn



Re: Fast autocomplete for large dataset

2015-08-01 Thread Olivier Austina
Thank you Erick for your replies and the link.

Regards
Olivier


2015-08-02 3:47 GMT+02:00 Erick Erickson :

> Here's some background:
>
> http://lucidworks.com/blog/solr-suggester/
>
> Basically, the limitation is that to build the suggester all docs in
> the index need to be read to pull out the stored field and build
> either the FST or the sidecar Lucene index, which can be a _very_
> costly operation (as in minutes/hours for a large dataset).
>
> bq: The requirement is that the autocomplete should be fast (not
> slowdown by the volume of data as dataset become bigger)
>
> Well, in some alternate universe this may be possible. But the larger
> the corpus the slower the processing will be, there's just no way
> around that. Whether it's fast enough for your application is a better
> question ;).
>
> Best,
> Erick
>
>
> On Sat, Aug 1, 2015 at 2:05 PM, Olivier Austina
>  wrote:
> > Thank you Eric,
> >
> > I would like to implement an autocomplete for large dataset.  The
> > autocomplete should show the phrase or the question the user want as the
> > user types. The requirement is that the autocomplete should be fast (not
> > slowdown by the volume of data as dataset become bigger), and easy to
> > maintain. The autocomplete can have its own Solr server.  It is an
> > autocomplete like others but it should be only fast and easy to maintain.
> >
> > What is the limitations of suggesters mentioned in the article? Thank
> you.
> >
> > Regards
> > Olivier
> >
> >
> > 2015-08-01 19:41 GMT+02:00 Erick Erickson :
> >
> >> Not really. There's no need to use ngrams as the article suggests if the
> >> terms component does what you need. Which is why I asked you about what
> >> autocomplete means in your context. Which you have not clarified. Have
> you
> >> even looked at terms component?  Especially the terms.prefix option?
> >>
> >> Terms component has it's limitations, but performance isn't one of them.
> >> The suggesters mentioned in the article have other limitations. It's
> really
> >> useless to discuss those limitations, though, until the problem you're
> >> trying to solve is clearly stated.
> >> On Aug 1, 2015 1:01 PM, "Olivier Austina" 
> >> wrote:
> >>
> >> > Thank you Eric for your reply.
> >> > If I understand it seems that these approaches are using index to hold
> >> > terms. As the index grows bigger, it can be a performance issues.
> >> > Is it right? Please can you check this article
> >> >  to
> see
> >> > what I mean?   Thank you.
> >> >
> >> > Regards
> >> > Olivier
> >> >
> >> >
> >> > 2015-08-01 17:42 GMT+02:00 Erick Erickson :
> >> >
> >> > > Well, defining what you mean by "autocomplete" would be a start. If
> >> it's
> >> > > just
> >> > > a user types some letters and you suggest the next N terms in the
> list,
> >> > > TermsComponent will fix you right up.
> >> > >
> >> > > If it's more complicated, the AutoSuggest functionality might help.
> >> > >
> >> > > If it's correcting spelling, there's the spellchecker.
> >> > >
> >> > > Best,
> >> > > Erick
> >> > >
> >> > > On Sat, Aug 1, 2015 at 10:00 AM, Olivier Austina
> >> > >  wrote:
> >> > > > Hi,
> >> > > >
> >> > > > I am looking for a fast and easy to maintain way to do
> autocomplete
> >> for
> >> > > > large dataset in solr. I heard about Ternary Search Tree (TST)
> >> > > > .
> >> > > > But I would like to know if there is something I missed such as
> best
> >> > > > practice, Solr new feature. Any suggestion is welcome. Thank you.
> >> > > >
> >> > > > Regards
> >> > > > Olivier
> >> > >
> >> >
> >>
>