Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
Is there an easy way for a client to tell Solr to close or release the
IndexSearcher and/or IndexWriter for a core?

I have a use case where we're creating a lot of cores with not that many
documents per zone (a few hundred to maybe tens of thousands).  Writes come
in batches, and reads also tend to be bursty, if less so than the writes.

And we're having problems with RAM usage on the server.  Poking around a
heap dump, the problem seems to be that every IndexSearcher and IndexWriter
being opened takes up a large amount of memory.

I've looked at the unload call, and while the documentation is unclear, it
seems like it deletes the data on disk as well.  I don't want to delete the
data on disk; I just want to unload the searcher and writer and free up the
memory.

So I'm wondering: is there a call I can make, when I know or suspect that
the core isn't going to be used in the near future, to release these
objects and return the memory?  Or a configuration option I can set to do
so after, say, 5 seconds of idleness?  It's OK for there to be a
performance hit the first time I reopen the core.

Thanks,

Brian


Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
Some further information:

The main things using memory that I see in my heap dump are:

1. Arrays of org.apache.lucene.util.fst.FST$Arc instances, which mainly
seem to hold nulls.  The ones I've investigated are held by
org.apache.lucene.util.fst.FST objects.  With 38 cores open I have over
121,000 of these arrays, taking up over 126MB of space.

2. Byte arrays, of which I have 384,000, taking up 106MB of space.

When I trace the chain of references up, I've always ended up at an
IndexSearcher or IndexWriter instance, leading me to assume the problem was
that I was simply opening too many cores, but I could be mistaken.

This was on a freshly started system without many cores having been touched
yet, so the memory usage, while larger than I expected, isn't critical yet.
It does become critical as the number of cores increases.

Thanks,

Brian



On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt  wrote:

>
> Is there an easy way for a client to tell Solr to close or release the
> IndexSearcher and/or IndexWriter for a core?
>
> I have a use case where we're creating a lot of cores with not that many
> documents per zone (a few hundred to maybe tens of thousands).  Writes come
> in batches, and reads also tend to be bursty, if less so than the writes.
>
> And we're having problems with RAM usage on the server.  Poking around a
> heap dump, the problem seems to be that every IndexSearcher and IndexWriter
> being opened takes up a large amount of memory.
>
> I've looked at the unload call, and while the documentation is unclear, it
> seems like it deletes the data on disk as well.  I don't want to delete the
> data on disk; I just want to unload the searcher and writer and free up the
> memory.
>
> So I'm wondering: is there a call I can make, when I know or suspect that
> the core isn't going to be used in the near future, to release these
> objects and return the memory?  Or a configuration option I can set to do
> so after, say, 5 seconds of idleness?  It's OK for there to be a
> performance hit the first time I reopen the core.
>
> Thanks,
>
> Brian
>
>


Re: Closing the IndexSearcher/IndexWriter for a core

2015-08-03 Thread Brian Hurt
So unloading a core doesn't delete the data?  That is good to know.
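
For later readers, here is a minimal SolrJ sketch of an UNLOAD that
releases the core without touching its index.  This is a sketch, not a
definitive recipe: the class names are from the SolrJ 5.x line, and the
URL and core name are hypothetical.

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class UnloadCoreExample {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");
            try {
                // UNLOAD closes the core's searcher and writer; the delete*
                // flags are opt-in, so the index stays on disk unless asked.
                CoreAdminRequest.Unload unload =
                    new CoreAdminRequest.Unload(false);  // deleteIndex = false
                unload.setCoreName("my_core");           // hypothetical core name
                unload.setDeleteDataDir(false);          // keep the data directory
                unload.setDeleteInstanceDir(false);      // keep the configs
                unload.process(client);
            } finally {
                client.close();
            }
        }
    }

The LotsOfCores setup Erick points to below automates the same thing:
per the wiki page, cores marked transient and not loaded on startup are
closed automatically once more than transientCacheSize cores are open.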

On Mon, Aug 3, 2015 at 6:22 PM, Erick Erickson 
wrote:

> This doesn't work in SolrCloud, but it really sounds like you want the
> "lots of cores" feature, which is designed to keep the most recent N
> cores loaded and auto-unload older ones; see:
> http://wiki.apache.org/solr/LotsOfCores
>
> Best,
> Erick
>
> On Mon, Aug 3, 2015 at 4:57 PM, Brian Hurt  wrote:
> > Is there an easy way for a client to tell Solr to close or release the
> > IndexSearcher and/or IndexWriter for a core?
> >
> > I have a use case where we're creating a lot of cores with not that
> > many documents per zone (a few hundred to maybe tens of thousands).
> > Writes come in batches, and reads also tend to be bursty, if less so
> > than the writes.
> >
> > And we're having problems with RAM usage on the server.  Poking around
> > a heap dump, the problem seems to be that every IndexSearcher and
> > IndexWriter being opened takes up a large amount of memory.
> >
> > I've looked at the unload call, and while the documentation is unclear,
> > it seems like it deletes the data on disk as well.  I don't want to
> > delete the data on disk; I just want to unload the searcher and writer
> > and free up the memory.
> >
> > So I'm wondering: is there a call I can make, when I know or suspect
> > that the core isn't going to be used in the near future, to release
> > these objects and return the memory?  Or a configuration option I can
> > set to do so after, say, 5 seconds of idleness?  It's OK for there to
> > be a performance hit the first time I reopen the core.
> >
> > Thanks,
> >
> > Brian
>


Getting a large number of documents by id

2013-07-18 Thread Brian Hurt
I have a situation, which is common in our current use case, where I need
to get a large number (many hundreds) of documents by id.  What I'm doing
currently is creating a large query of the form "id:12345 OR id:23456 OR
..." and sending it off.  Unfortunately, this query is taking a long time,
especially the first time it's executed.  I'm seeing times of 4+ seconds
for this query to return 847 documents.

So, my question is: what should I be looking at to improve the performance
here?

Brian


Re: Getting a large number of documents by id

2013-07-18 Thread Brian Hurt
Thanks everyone for the response.

On Thu, Jul 18, 2013 at 11:22 AM, Alexandre Rafalovitch
wrote:

> You could start by doing id:(12345 23456) to reduce the query length and
> possibly speed up parsing.
>

I didn't know about this syntax; it looks useful.


> You could also move the query from 'q' parameter to 'fq' parameter, since
> you probably don't care about ranking ('fq' does not rank).
>

Yes, I don't care about rank, so this helps.


> If these are unique every time, you could probably look at not caching
> (can't remember exact syntax).
>

> That's all I can think of at the moment without digging deep into why you
> need to do this at all.
>
>
Short version of a long story: I'm implementing a graph database on top of
Solr, which is not what Solr is designed for, I know.  This is a case
where I'm following a set of edges from a given node to its 847 children,
and I need to get the children.  And yes, I've looked at Neo4j; it doesn't
help.
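
Putting the suggestions in this thread together, here is a minimal SolrJ
sketch of the batched fetch.  It is a sketch against the 4.x-era API; the
field name "id", the URL handling, and the {!cache=false} local param are
assumptions worth double-checking for your setup.

    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.util.ClientUtils;

    public class BulkGetById {
        // Fetch many documents by id in one request.  The filter query does
        // the matching (no scoring), and {!cache=false} keeps one-off id
        // sets out of the filter cache.
        public static QueryResponse fetchByIds(HttpSolrServer server, List<String> ids)
                throws Exception {
            StringBuilder fq = new StringBuilder("{!cache=false}id:(");
            for (String id : ids) {
                // id:(a b c) relies on the default OR operator
                fq.append(ClientUtils.escapeQueryChars(id)).append(' ');
            }
            fq.append(')');
            SolrQuery query = new SolrQuery("*:*");  // match all; the fq narrows it
            query.addFilterQuery(fq.toString());
            query.setRows(ids.size());               // return every match
            return server.query(query);
        }
    }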



> Regards,
>Alex.
>
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>
>
> On Thu, Jul 18, 2013 at 10:46 AM, Brian Hurt  wrote:
>
> > I have a situation, which is common in our current use case, where I
> > need to get a large number (many hundreds) of documents by id.  What
> > I'm doing currently is creating a large query of the form "id:12345 OR
> > id:23456 OR ..." and sending it off.  Unfortunately, this query is
> > taking a long time, especially the first time it's executed.  I'm
> > seeing times of 4+ seconds for this query to return 847 documents.
> >
> > So, my question is: what should I be looking at to improve the
> > performance here?
> >
> > Brian
> >
>


having create copy the directory on non-cloud solr

2013-08-02 Thread Brian Hurt
I seem to recall somewhere in the documentation that the create function on
non-cloud Solr doesn't copy the config files in; you have to copy them in
by hand.  Is this correct?  If so, can anyone point me to where in the docs
it says this, and whether there are any plans to change it?  Thanks.

Brian
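
For reference, the behavior as I understand it on the 4.x line is that
CREATE expects the instance directory, with its conf/ files, to already be
in place; it does not copy them for you.  A minimal SolrJ sketch of that
workflow, with hypothetical paths and core name:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class CreateCoreExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
            // Step 1 happens outside Solr: copy solrconfig.xml and schema.xml
            // into /path/to/my_core/conf by hand (or by script).
            // Step 2: only then will CREATE succeed.
            CoreAdminRequest.createCore("my_core", "/path/to/my_core", server);
            server.shutdown();
        }
    }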


Noob question: why doesn't this query work?

2013-04-24 Thread Brian Hurt
So, I'm executing the following query:
id:"6178dB=@Fm" AND i_0:"613OFS" AND (i_3:"6111" OR i_3:"1yyy\~") AND (NOT
id:"6178ZwWj5m" OR numfields:[* TO "6114"] OR d_4:"false" OR NOT
i_4:"6142E=m")

It's machine generated, which explains the redundancies.  The problem is
that the query returns no results, but there is a document that should
match: it has an id of "6178dB=@Fm", an i_0 field of "613OFS", an i_3 field
of "6111", a numfields of "611A", a d_4 of true (but this shouldn't
matter), and an i_4 of "6142F1S".

The problem seems to be with the negations.  I did try replacing the NOTs
with -'s, so that, for example, NOT id:"6178ZwWj5m" became
-id:"6178ZwWj5m", but this didn't seem to work.

Help?  What's wrong with the query?  Thanks.

Brian


Re: Noob question: why doesn't this query work?

2013-04-24 Thread Brian Hurt
Thanks for your response.  You've given me some solid leads.


On Wed, Apr 24, 2013 at 11:25 AM, Shawn Heisey  wrote:

> On 4/24/2013 8:59 AM, Brian Hurt wrote:
> > So, I'm executing the following query:
> > id:"6178dB=@Fm" AND i_0:"613OFS" AND (i_3:"6111" OR i_3:"1yyy\~") AND
> (NOT
> > id:"6178ZwWj5m" OR numfields:[* TO "6114"] OR d_4:"false" OR NOT
> > i_4:"6142E=m")
> >
> > It's machine generated, which explains the redundancies.  The problem is
> > that the query returns no results- but there is a document that should
> > match- it has an id of "6178dB=@Fm", an i_0 field of "613OFS", an i_3
> field
> > of "6111", a numfields of "611A", a d_4 of true (but this shouldn't
> > matter), and an i_4 of "6142F1S".
> >
> > The problem seems to be with the negations.  I did try to replace the
> NOT's
> > with -'s, so, for example, NOT id:"6178ZwWj5m" would become
> > -id:"6178ZwWj5m", and this didn't seem to work.
> >
> > Help?  What's wrong with the query?  Thanks.
>
> It looks like you might have meant to negate all of the query clauses
> inside the last set of parentheses.  That's not what your actual query
> says. If you change your negation so that the NOT is outside the
> parentheses, so that it reads "AND NOT (... OR ...)", that should fix
> that part of it.
>
>
No, I meant the NOT to bind only to the next id clause.  So the query I wanted was:

id:"6178dB=@Fm" AND i_0:"613OFS" AND (i_3:"6111" OR i_3:"1yyy\~") AND ((NOT
id:"6178ZwWj5m") OR numfields:[* TO "6114"] OR d_4:"false" OR (NOT
i_4:"6142E=m"))



> If the boolean layout you have is really what you want, then you need to
> change the negation queries to (*:* -query) instead, because pure
> negative queries are not supported.  That syntax says "all documents
> except those that match the query."  For simple negation queries, Solr
> can figure out that it needs to add the *:* internally, but this query
> is more complex.
>
>
This could be the problem.  The query is machine generated, so I don't
care how ugly it is.  Does this apply even to inner queries?  I.e., should
that last clause be (*:* -i_4:"6142E=m") instead of (NOT i_4:"6142E=m")?
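
For the record, under that rule the whole machine-generated query would
come out looking something like this (a sketch, assuming each inner
negation gets its own *:* guard):

id:"6178dB=@Fm" AND i_0:"613OFS" AND (i_3:"6111" OR i_3:"1yyy\~") AND ((*:*
-id:"6178ZwWj5m") OR numfields:[* TO "6114"] OR d_4:"false" OR (*:*
-i_4:"6142E=m"))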


> A few other possible problems:
>
> A backslash is a special character used to escape other special
> characters, so you *might* need two of them - one to say 'the next
> character is literal' and one to actually be the backslash.  If you
> follow the advice in the next paragraph, I can guarantee this will be
> the case.  For that reason, you might want to keep the quotes on fields
> that might contain characters that have special meaning to the Solr
> query parser.
>
>
I always wash all strings through ClientUtils.escapeQueryChars, so this
isn't a problem.  That string should just be "1yyy~"; the ~ was getting
escaped.


> Don't use quotes unless you really are after phrase queries or you can't
> escape special characters.  You might actually need phrase queries for
> some of this, but I would try simple one-field queries without the
> quotes to see whether you need them.  I have no idea what happens if you
> include quotes inside a range query (the "6114"), but it might not do
> what you expect.  I would definitely remove the quotes from that part of
> the query.
>
>
This is another solid possibility, although it might raise some
difficulties for me: I need to be able to support literal string
comparisons, and I'm not sure how well this would handle queries like
s_7 <= "some string with spaces".  But some experimentation here is
definitely in order.


> Thanks,
> Shawn
>
>


Help getting a document by unique ID

2013-03-18 Thread Brian Hurt
So here's the problem I'm trying to solve: in my use case, all my
documents have a unique id associated with them (a string), and I very
often need to get them by id.  Currently I'm doing a search on id, and
this takes long enough that it's killing my performance.  Now, it looks
like there is a GET call in the REST interface which does exactly what
I need, but I'm using the SolrJ interface.

So my two questions are:

1. Is GET the right function I should be using?  Or should I be using
some other function, or storing copies of the documents somewhere
else entirely for fast id-based retrieval?

2. How do I call GET with SolrJ?  I've googled for how to do this and
haven't come up with anything.

Thanks.

Brian
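
For completeness on question 2: before SolrJ grew a dedicated method, the
usual trick was to route an ordinary query to the /get handler.  A minimal
sketch against the 4.x-era API, assuming the default real-time get handler
is configured at /get:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class RealTimeGetExample {
        public static SolrDocument getById(HttpSolrServer server, String id)
                throws Exception {
            SolrQuery query = new SolrQuery();
            query.setRequestHandler("/get");  // leading slash routes to that path
            query.set("id", id);
            QueryResponse rsp = server.query(query);
            // The /get handler returns the document under the "doc" key,
            // or null if no such id exists.
            return (SolrDocument) rsp.getResponse().get("doc");
        }
    }

As the reply below notes, though, this only buys immediacy for recently
added, uncommitted documents; lookups of committed data won't get faster.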


Re: Help getting a document by unique ID

2013-03-19 Thread Brian Hurt
On Mon, Mar 18, 2013 at 7:08 PM, Jack Krupansky  wrote:
> Hmmm... if querying by your unique key field is killing your performance,
> maybe you have some larger problem to address.

This is almost certainly true.  I'm well outside the use cases
targeted by Solr/Lucene, and it's a testament to the quality of the
product that it works at all.  Among other things, I'm implementing a
graph database on top of Solr (it being easier to build a graph
database on top of Solr than it is to implement Solr on top of a graph
database).

Which is the problem: you might think that 60ms unique-key accesses
(what I'm seeing) are more than good enough, and for most use cases
you'd be right.  But it's not unusual for a single web-page hit to
generate many dozens, if not low hundreds, of get-document-by-id calls.
At that point, 60ms hits pile up fast.

The current plan is to just cache the documents as files in the local
file system (or possibly other systems) and have the get-document calls
go there instead, while complicated searches still go to Solr.
Fortunately, this isn't complicated.

> How bad is it? Are you using the
> string field type? How long are your ids?

My ids start at 100 million and go up like a kite from there, hence the
string representation.

>
> The only thing the real-time GET API gives you is more immediate access to
> recently added, uncommitted data. Accessing older, committed data will be no
> faster. But if accessing that recent data is what you are after, real-time
> GET may do the trick.

OK, so this is good to know.  This answers question #1: GET isn't the
function I should be calling.  Thanks.

Brian