disappearing index

2008-12-03 Thread Justin
I built two indexes using a multicore configuration,
one containing 52,000+ documents and the other over 10 million.  The entire
indexing process showed no errors.

The server crashed overnight, well after the indexing had completed, and
now no documents are reported for either index.

This is despite the fact that the cores both have huge /data folders (one is
1.5GB, the other is 8.5GB).

Any ideas?


small facets not working

2009-05-19 Thread Justin
I have a Solr index which contains research data from the Human Genome
project.

Each document contains about 60 facets, including one general composite
field that contains all the facet data.  The general facet is anywhere from
100KB to 7MB.

One facet is called Gene.Symbol and, you guessed it, it contains only the
gene symbol. There is only one Symbol per gene (for smarty pantses out
there, the aliases are contained in another facet).

When I do a search for anything in the big general facet, I find what I'm
looking for.  But if I do a search in the Gene.Symbol facet, it does not
find anything.

I realize it's probably finding the string repeated elsewhere in the
document, but how do I get it to find it in the Gene.Symbol facet?

So a search for

http://localhost:8983/solr/core0/select?indent=on&version=2.2&q=Gene.Symbol:abc

returns nothing, but a search for

http://localhost:8983/solr/core0/select?indent=on&version=2.2&q=abc

returns
ABCC2
ABCC8
ABCD1
ABCG1
ABCA1
...
CABC1
...
ABCD3
ABCC5
ABCC9
ABCG2
ABCB11
ABCC3
ABCF1
ABCC1
ABCF2
ABCB9



Schema.xml: (the field and copyField definitions were stripped from the
archived message; only the field name "BFDText" survives)
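For reference, here is a hypothetical sketch of the kind of declarations involved. It is
not the poster's schema (that excerpt was lost above); it only illustrates how the field
type controls what Gene.Symbol:abc can match.

<!-- Option 1: the usual solr.StrField-backed "string" type.  The whole value is
     indexed as one case-sensitive token, so Gene.Symbol:abc matches nothing,
     while a wildcard such as Gene.Symbol:ABC* would match ABCC2. -->
<field name="Gene.Symbol" type="string" indexed="true" stored="true"/>

<!-- Option 2: an analyzed type that tokenizes and lowercases, so
     Gene.Symbol:abcc2 matches regardless of case (prefix matches like "abc"
     would still need a wildcard or an edge-ngram filter). -->
<fieldType name="symbol" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="Gene.Symbol" type="symbol" indexed="true" stored="true"/>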




out of memory every time

2008-03-03 Thread Justin
I'm indexing a large number of documents.

As a server I'm using the /solr/example/start.jar

No matter how much memory I allocate, it fails around 7,200 documents.

I am committing every 100 docs, and optimizing every 300.

All of my XMLs contain one doc each, and can range in size from 2k to 700k.

When I restart the start.jar, it again reports out of memory.


a sample document looks like this:


 
  1851
  TRAJ20
  12049
  ENSG0211869
  28735
  HUgn28735
  TRA_
  TRAJ20
  9953837
  ENSG0211869
  T cell receptor alpha
joining 20
  14q11.2
  14q11
  14q11.2
  AE000662.1
  M94081.1
  CH471078.2
  NC_14.7
  NT_026437.11
  NG_001332.2
  8188290
  The human T-cell receptor
TCRAC/TCRDC (C alpha/C delta) region: organization,sequence, and evolution
of 97.6 kb of DNA.
  Koop B.F.
  Rowen L.
  Hood L.
  Wang K.
  Kuo C.L.
  Seto D.
  Lenstra J.A.
  Howard S.
  Shan W.
  Deshpande P.
  31311_at
  




the schema is (in summary): (field definitions stripped from the archived
message; only the values "PK" and "text" survive)

and my conf is (element names stripped; only the values survive):
false, 100, 900, 2147483647, 1


Quoted searches

2008-03-19 Thread Justin
When I issue a search in quotes, like "tay sachs",
Lucene is returning results as if it were written: tay OR sachs.


Any reason why?  Any way to stop it?
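
One way to see exactly how the quoted query is being parsed is to add debugQuery to
the request (core name and handler path assumed here):

http://localhost:8983/solr/select?q=%22tay+sachs%22&debugQuery=true

The parsedquery entry in the debug section shows whether the quotes produced a single
PhraseQuery or were split into separate OR'd terms.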


solrconfig.xml location for cloud setup not working (works on single node)

2016-05-20 Thread Justin Edmands
I have configured a single node with a proper data import handler. I need to
move it to a clustered setup. It seems like I cannot, for the life of me, get the
data import handler to work with the cloud setup.

ls 
/opt/solr/solr-6.0.0/example/cloud/node1/solr/activityDigest_shard1_replica1/conf/
 

currency.xml db-data-config.xml lang protwords.txt _rest_managed.json 
schema.xml solrconfig.xml stopwords.txt synonyms.txt 

inside the solrconfig.xml, I have created a request handler to point to my
config file:

(the handler definition was stripped from the archived message; only the
reference to db-data-config.xml survives)
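A typical DataImportHandler declaration of this shape looks roughly like the
following (a sketch, not the poster's exact handler definition):

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>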
 
 


in the db-data-config.xml I have a working config (works in a single node setup 
that is) 
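
The working config itself was not preserved in the archive; a minimal
db-data-config.xml of this kind generally looks like the sketch below (the JDBC
driver, URL, table, and field names are placeholders, not the poster's values):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://dbhost/activity" user="solr" password="..."/>
  <document>
    <entity name="activity" query="SELECT id, title FROM activity">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>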

 
 
 


Getting a list of matching terms and offsets

2016-06-04 Thread Justin Lee
Is anyone aware of a way of getting a list of each matching token and their
offsets after executing a search?  The reason I want to do this is because
I have the physical coordinates of each token in the original document
stored out of band, and I want to be able to highlight in the original
document.  I would really like to have Solr return the list of matching
tokens because then things like stemming and phrase matching will work as
expected. I'm thinking of something like the highlighter component, except
instead of returning html, it would return just the matching tokens and
their offsets.

I have googled high and low and can't seem to find an exact answer to this
question, so I have spent the last few days examining the internals of the
various highlighting classes in Solr and Lucene.  I think the bulk of the
action is in WeightedSpanTermExtractor and its interaction with
getBestTextFragments in the Highlighter class.  But before I spend any more
time on this I thought I'd ask (1) whether anyone knows of an easier way of
doing this, and (2) whether I'm at least barking up the right tree.

Thanks much,
Justin


Re: Getting a list of matching terms and offsets

2016-06-05 Thread Justin Lee
Thanks for the responses Alex and Ahmet.

The TermVector component was the first thing I looked at, but what it gives
you is offset information for every token in the document.  I'm trying to
get a list of tokens that actually match the search query, and unless I'm
missing something, the TermVector component doesn't give you that
information.
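
For reference, tv.offsets and tv.positions only return data for fields indexed with
term vectors enabled, e.g. a schema.xml declaration along these lines (the field and
type names here are assumptions):

<field name="text" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>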

The TermSpans class does contain the right information, but again the hard
part is: how do I reliably get a list of TokenSpans for the tokens that
actually match the search query?  That's why I ended up in the highlighter
source code, because the highlighter has to do just this in order to create
snippets with accurate highlighting.

Justin

On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan 
wrote:

> Hi,
>
> May be org.apache.lucene.search.spans.TermSpans ?
>
>
>
> On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch 
> wrote:
> It sounds like TermVector component's output:
> https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
>
> Perhaps with additional flags enabled (e.g. tv.offsets and/or
> tv.positions).
>
> Regards,
>Alex.
> 
> Newsletter and resources for Solr beginners and intermediates:
> http://www.solr-start.com/
>
>
>
> On 5 June 2016 at 07:39, Justin Lee  wrote:
> > Is anyone aware of a way of getting a list of each matching token and
> their
> > offsets after executing a search?  The reason I want to do this is
> because
> > I have the physical coordinates of each token in the original document
> > stored out of band, and I want to be able to highlight in the original
> > document.  I would really like to have Solr return the list of matching
> > tokens because then things like stemming and phrase matching will work as
> > expected. I'm thinking of something like the highlighter component,
> except
> > instead of returning html, it would return just the matching tokens and
> > their offsets.
> >
> > I have googled high and low and can't seem to find an exact answer to
> this
> > question, so I have spent the last few days examining the internals of
> the
> > various highlighting classes in Solr and Lucene.  I think the bulk of the
> > action is in WeightedSpanTermExtractor and its interaction with
> > getBestTextFragments in the Highlighter class.  But before I spend
> anymore
> > time on this I thought I'd ask (1) whether anyone knows of an easier way
> of
> > doing this, and (2) whether I'm at least barking up the right tree.
> >
> > Thanks much,
> > Justin
>


Re: Getting a list of matching terms and offsets

2016-06-05 Thread Justin Lee
Thanks, yea, I looked at debug query too.  Unfortunately the output of
debug query doesn't quite do it.  For example, if you use a wildcard query,
it will simply explain the score associated with that wildcard query, not
the actual matching token.  In order words, if you search for "hour*" and
the actual matching token.  In other words, if you search for "hour*" and
Instead, it just reports the score associated with "hour*".

The closest example I've ever found is this:

https://lucidworks.com/blog/2013/05/09/update-accessing-words-around-a-positional-match-in-lucene-4/

But this kind of approach won't let me use the full power of the Solr
ecosystem.  I'd basically be back to dealing with Lucene directly, which I
think is a step backwards.  I think the right approach is to write my own
SearchComponent, using the highlighter as a starting point.  But I wanted
to make sure there wasn't a simpler way.

On Sun, Jun 5, 2016 at 11:30 AM Ahmet Arslan 
wrote:

> Well debug query has the list of token that caused match.
> If i am not mistaken i read an example about span query and spans thing.
> It was listing the positions of the matches.
> Cannot find the example at the moment..
>
> Ahmet
>
>
>
> On Sunday, June 5, 2016 9:10 PM, Justin Lee 
> wrote:
> Thanks for the responses Alex and Ahmet.
>
> The TermVector component was the first thing I looked at, but what it gives
> you is offset information for every token in the document.  I'm trying to
> get a list of tokens that actually match the search query, and unless I'm
> missing something, the TermVector component doesn't give you that
> information.
>
> The TermSpans class does contain the right information, but again the hard
> part is: how do I reliably get a list of TokenSpans for the tokens that
> actually match the search query?  That's why I ended up in the highlighter
> source code, because the highlighter has to do just this in order to create
> snippets with accurate highlighting.
>
> Justin
>
>
> On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan 
> wrote:
>
> > Hi,
> >
> > May be org.apache.lucene.search.spans.TermSpans ?
> >
> >
> >
> > On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> > It sounds like TermVector component's output:
> >
> https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
> >
> > Perhaps with additional flags enabled (e.g. tv.offsets and/or
> > tv.positions).
> >
> > Regards,
> >Alex.
> > 
> > Newsletter and resources for Solr beginners and intermediates:
> > http://www.solr-start.com/
> >
> >
> >
> > On 5 June 2016 at 07:39, Justin Lee  wrote:
> > > Is anyone aware of a way of getting a list of each matching token and
> > their
> > > offsets after executing a search?  The reason I want to do this is
> > because
> > > I have the physical coordinates of each token in the original document
> > > stored out of band, and I want to be able to highlight in the original
> > > document.  I would really like to have Solr return the list of matching
> > > tokens because then things like stemming and phrase matching will work
> as
> > > expected. I'm thinking of something like the highlighter component,
> > except
> > > instead of returning html, it would return just the matching tokens and
> > > their offsets.
> > >
> > > I have googled high and low and can't seem to find an exact answer to
> > this
> > > question, so I have spent the last few days examining the internals of
> > the
> > > various highlighting classes in Solr and Lucene.  I think the bulk of
> the
> > > action is in WeightedSpanTermExtractor and its interaction with
> > > getBestTextFragments in the Highlighter class.  But before I spend
> > anymore
> > > time on this I thought I'd ask (1) whether anyone knows of an easier
> way
> > of
> > > doing this, and (2) whether I'm at least barking up the right tree.
> > >
> > > Thanks much,
> > > Justin
> >
>


Re: Getting a list of matching terms and offsets

2016-06-06 Thread Justin Lee
Thank you very much!  That JIRA entry led me to
https://issues.apache.org/jira/browse/SOLR-4722, which still works against
Solr 6 with a couple of modifications and should serve as the basis for
what I want to do.  You saved me a bunch of work, so thanks very much.
 (Also, it is always nice to know that people with more experience than me
took the same approach.)

On Sun, Jun 5, 2016 at 1:09 PM Ahmet Arslan 
wrote:

> Hi Lee,
>
> May be you can find useful starting point on
> https://issues.apache.org/jira/browse/SOLR-1397
>
> Please consider to contribute when you gather something working.
>
> Ahmet
>
>
>
>
> On Sunday, June 5, 2016 10:37 PM, Justin Lee 
> wrote:
> Thanks, yea, I looked at debug query too.  Unfortunately the output of
> debug query doesn't quite do it.  For example, if you use a wildcard query,
> it will simply explain the score associated with that wildcard query, not
> the actual matching token.  In other words, if you search for "hour*" and
> the actual matching text is "hours", debug query doesn't tell you that.
> Instead, it just reports the score associated with "hour*".
>
> The closest example I've ever found is this:
>
>
> https://lucidworks.com/blog/2013/05/09/update-accessing-words-around-a-positional-match-in-lucene-4/
>
> But this kind of approach won't let me use the full power of the Solr
> ecosystem.  I'd basically be back to dealing with Lucene directly, which I
> think is a step backwards.  I think the right approach is to write my own
> SearchComponent, using the highlighter as a starting point.  But I wanted
> to make sure there wasn't a simpler way.
>
>
> On Sun, Jun 5, 2016 at 11:30 AM Ahmet Arslan 
> wrote:
>
> > Well debug query has the list of token that caused match.
> > If i am not mistaken i read an example about span query and spans thing.
> > It was listing the positions of the matches.
> > Cannot find the example at the moment..
> >
> > Ahmet
> >
> >
> >
> > On Sunday, June 5, 2016 9:10 PM, Justin Lee 
> > wrote:
> > Thanks for the responses Alex and Ahmet.
> >
> > The TermVector component was the first thing I looked at, but what it
> gives
> > you is offset information for every token in the document.  I'm trying to
> > get a list of tokens that actually match the search query, and unless I'm
> > missing something, the TermVector component doesn't give you that
> > information.
> >
> > The TermSpans class does contain the right information, but again the
> hard
> > part is: how do I reliably get a list of TokenSpans for the tokens that
> > actually match the search query?  That's why I ended up in the
> highlighter
> > source code, because the highlighter has to do just this in order to
> create
> > snippets with accurate highlighting.
> >
> > Justin
> >
> >
> > On Sun, Jun 5, 2016 at 9:09 AM Ahmet Arslan 
> > wrote:
> >
> > > Hi,
> > >
> > > May be org.apache.lucene.search.spans.TermSpans ?
> > >
> > >
> > >
> > > On Sunday, June 5, 2016 7:59 AM, Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > > wrote:
> > > It sounds like TermVector component's output:
> > >
> >
> https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
> > >
> > > Perhaps with additional flags enabled (e.g. tv.offsets and/or
> > > tv.positions).
> > >
> > > Regards,
> > >Alex.
> > > 
> > > Newsletter and resources for Solr beginners and intermediates:
> > > http://www.solr-start.com/
> > >
> > >
> > >
> > > On 5 June 2016 at 07:39, Justin Lee  wrote:
> > > > Is anyone aware of a way of getting a list of each matching token and
> > > their
> > > > offsets after executing a search?  The reason I want to do this is
> > > because
> > > > I have the physical coordinates of each token in the original
> document
> > > > stored out of band, and I want to be able to highlight in the
> original
> > > > document.  I would really like to have Solr return the list of
> matching
> > > > tokens because then things like stemming and phrase matching will
> work
> > as
> > > > expected. I'm thinking of something like the highlighter component,
> > > except
> > > > instead of returning html, it would return just the matching tokens
> and
> > > > their offsets.
> > > >
> > > > I have googled high and low and can't seem to find an exact answer to
> > > this
> > > > question, so I have spent the last few days examining the internals
> of
> > > the
> > > > various highlighting classes in Solr and Lucene.  I think the bulk of
> > the
> > > > action is in WeightedSpanTermExtractor and its interaction with
> > > > getBestTextFragments in the Highlighter class.  But before I spend
> > > anymore
> > > > time on this I thought I'd ask (1) whether anyone knows of an easier
> > way
> > > of
> > > > doing this, and (2) whether I'm at least barking up the right tree.
> > > >
> > > > Thanks much,
> > > > Justin
> > >
> >
>


Bypassing ExtractingRequestHandler

2016-06-09 Thread Justin Lee
Has anybody had any experience bypassing ExtractingRequestHandler and
simply managing Tika manually?  I want to make a small modification to Tika
to get and save additional data from my PDFs, but I have been
procrastinating in no small part due to the unpleasant prospect of setting
up a development environment where I could compile and debug modifications
that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
occurs to me that it would be much easier if the two were separate, so I
could have direct control over Tika and just submit the text to Solr after
extraction.  Am I going to regret this approach?  I'm not sure what
ExtractingRequestHandler really does for me that Tika doesn't already do.
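
For context, the piece being bypassed is the stock /update/extract endpoint, which a
default solrconfig.xml declares roughly like this (the exact field mappings vary per
install; fmap.content into "text" is an assumption):

<requestHandler name="/update/extract" startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- map Tika's extracted body text into a schema field -->
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
  </lst>
</requestHandler>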

Also, I was reading this
<http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
stackoverflow entry and someone offhandedly mentioned that
ExtractingRequestHandler might be separated in the future anyway. Is there
a public roadmap for the project, or does one have to keep up with the
developer's mailing list and hunt through JIRA entries to keep up with the
pulse of the project?

Thanks,
Justin


Re: Bypassing ExtractingRequestHandler

2016-06-13 Thread Justin Lee
Thanks everyone for the help and advice.  The SolrJ example makes sense to
me.  The import of SOLR-8166 was kind of mind-boggling to me, but maybe
I'll revisit after some time.

Tim: for context, I'm ultimately trying to create an external highlighter.
See https://issues.apache.org/jira/browse/SOLR-1397.  I want to store the
bounding box (in PDF units) for each token in the extracted text stream.
Then when I get results from Solr using the above patch, I'll convert the
UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate
in the UI.  I like this approach because I get highlighting that accurately
reflects the search, even when the search is complex (e.g. wildcards or
proximity searches).

I think it would take quite a bit of thinking to get something general
enough to add into Tika.  For example, what units?  Take a look at the
discussion of what units to report offsets in here:
https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert
Muir -- although whatever issues there are here they are the same as the
offsets reported in the Term Vector Component, it would seem to me).  As
another example, I'm just not sure what format is general enough to make
sense for everybody.  I think I'll just create a mapping from UTF-16
offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store
that in a NoSQL store.  Then, when I get Solr results, I'll look at the
matching offsets, the JSON blob, and the original document and be on my
merry way.  I'm happy to open a JIRA entry in Tika if you think this is a
coherent request.

The other approach, I suppose, is to try to pass the information along
during indexing and store as a token payload.  But it seems like the
indexing interface is really text oriented.  I have also thought about
using DelimitedPayloadTokenFilter, which will increase the index size I
imagine (how much, though?) and require more customization of Solr
internals.  I don't know which is the better approach.
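
For reference, the payload route mentioned above is usually wired up in schema.xml with
an analyzer chain like the sketch below; the delimiter, encoder, and type name are
assumptions, and writing coordinate data into the payloads (and reading it back at query
time) would still require custom indexing code:

<fieldType name="text_payloads" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- each incoming token looks like "token|payload"; the filter strips the
         suffix and stores it as the token's payload -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            delimiter="|" encoder="identity"/>
  </analyzer>
</fieldType>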

On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. 
wrote:

>
>
>
> >Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff
> should be straightforward:
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> +1
>
> > We tend to prefer running Tika externally as it's entirely possible
> > that Tika will crash or hang with certain files - and that will bring
> > down Solr if you're running Tika within it.
>
> +1
>
> >> I want to make a small modification
> >> to Tika to get and save additional data from my PDFs
> What info do you need, and if it is common enough, could you ask over on
> Tika's JIRA and we'll try to add it directly?
>
>
>
>


Re: Send kill -9 to a node and can not delete down replicas with onlyIfDown.

2016-07-19 Thread Justin Lee
Pardon me for hijacking the thread, but I'm curious about something you
said, Erick.  I always thought that the point (in part) of going through
the pain of using zookeeper and creating replicas was so that the system
could seamlessly recover from catastrophic failures.  Wouldn't an OOM
condition have a similar effect (or maybe java is better at cleanup on that
kind of error)?  The reason I ask is that I'm trying to set up a solr
system that is highly available and I'm a little bit surprised that a kill
-9 on one process on one machine could put the entire system in a bad
state.  Is it common to have to address problems like this with manual
intervention in production systems?  Ideally, I'd hope to be able to set up
a system where a single node dying a horrible death would never require
intervention.

On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson 
wrote:

> First of all, killing with -9 is A Very Bad Idea. You can
> leave write lock files laying around. You can leave
> the state in an "interesting" place. You haven't given
> Solr a chance to tell Zookeeper that it's going away.
> (which would set the state to "down"). In short
> when you do this you have to deal with the consequences
> yourself, one of which is this mismatch between
> cluster state and live_nodes.
>
> Now, that rant done the bin/solr script tries to stop Solr
> gracefully but issues a kill if solr doesn't stop nicely. Personally
> I think that timeout should be longer, but that's another story.
>
> The onlyIfDown='true' option is there specifically as a
> safety valve. It was provided for those who want to guard against
> typos and the like, so just don't specify it and you should be fine.
>
> Best,
> Erick
>
> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang  wrote:
> > Hi all,
> >
> > Here's the situation.
> > I'm using solr5.3 in cloud mode.
> >
> > I have 4 nodes.
> >
> > After use "kill -9 pid-solr-node" to kill 2 nodes.
> > These replicas in the two nodes still are "ACTIVE" in zookeeper's
> > state.json.
> >
> > The problem is, when I try to delete these down replicas with
> > parameter onlyIfDown='true'.
> > It says,
> > "Delete replica failed: Attempted to remove replica :
> > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> > 'active'."
> >
> > From this link:
> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE
> >
> > It says:
> > *NOTE*: when the node the replica is hosted on crashes, the replica's
> state
> > may remain ACTIVE in ZK. To determine if the replica is truly active, you
> > must also verify that its node
> > (http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.html#getNodeName--)
> > is under /live_nodes in ZK (or use ClusterState.liveNodesContain(String),
> > http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/ClusterState.html#liveNodesContain-java.lang.String-).
> >
> > So, is this a bug?
> >
> > Regards,
> > Jerome
>


Re: Send kill -9 to a node and can not delete down replicas with onlyIfDown.

2016-07-19 Thread Justin Lee
Thanks for taking the time for the detailed response. I completely get what
you are saying. Makes sense.
On Tue, Jul 19, 2016 at 10:56 AM Erick Erickson 
wrote:

> Justin:
>
> Well, "kill -9" just makes it harder. The original question
> was whether a replica being "active" was a bug, and it's
> not when you kill -9; the Solr node has no chance to
> tell Zookeeper it's going away. ZK does modify
> the live_nodes by itself, thus there are checks as
> necessary when a replica's state is referenced
> whether the node is also in live_nodes. And an
> overwhelming amount of the time this is OK, Solr
> recovers just fine.
>
> As far as the write locks are concerned, those are
> a Lucene level issue so if you kill Solr at just the
> wrong time it's possible that that'll be left over. The
> write locks are held for as short a period as possible
> by Lucene, but occasionally they can linger if you kill
> -9.
>
> When a replica comes up, if there is a write lock already, it
> doesn't just take over; it fails to load instead.
>
> A kill -9 won't bring the cluster down by itself except
> if there are several coincidences. Just don't make
> it a habit. For instance, consider if you kill -9 on
> two Solrs that happen to contain all of the replicas
> for a shard1 for collection1. And you _happen_ to
> kill them both at just the wrong time and they both
> leave Lucene write locks for those replicas. Now
> no replica will come up for shard1 and the collection
> is unusable.
>
> So the shorter form is that using "kill -9" is a poor practice
> that exposes you to some risk. The hard-core Solr
> guys work extremely had to compensate for this kind
> of thing, but kill -9 is a harsh, last-resort option and
> shouldn't be part of your regular process. And you should
> expect some "interesting" states when you do. And
> you should use the bin/solr script to stop Solr
> gracefully.
>
> Best,
> Erick
>
>
> On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee 
> wrote:
> > Pardon me for hijacking the thread, but I'm curious about something you
> > said, Erick.  I always thought that the point (in part) of going through
> > the pain of using zookeeper and creating replicas was so that the system
> > could seamlessly recover from catastrophic failures.  Wouldn't an OOM
> > condition have a similar effect (or maybe java is better at cleanup on
> that
> > kind of error)?  The reason I ask is that I'm trying to set up a solr
> > system that is highly available and I'm a little bit surprised that a
> kill
> > -9 on one process on one machine could put the entire system in a bad
> > state.  Is it common to have to address problems like this with manual
> > intervention in production systems?  Ideally, I'd hope to be able to set
> up
> > a system where a single node dying a horrible death would never require
> > intervention.
> >
> > On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson 
> > wrote:
> >
> >> First of all, killing with -9 is A Very Bad Idea. You can
> >> leave write lock files laying around. You can leave
> >> the state in an "interesting" place. You haven't given
> >> Solr a chance to tell Zookeeper that it's going away.
> >> (which would set the state to "down"). In short
> >> when you do this you have to deal with the consequences
> >> yourself, one of which is this mismatch between
> >> cluster state and live_nodes.
> >>
> >> Now, that rant done the bin/solr script tries to stop Solr
> >> gracefully but issues a kill if solr doesn't stop nicely. Personally
> >> I think that timeout should be longer, but that's another story.
> >>
> >> The onlyIfDown='true' option is there specifically as a
> >> safety valve. It was provided for those who want to guard against
> >> typos and the like, so just don't specify it and you should be fine.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Jul 18, 2016 at 11:51 PM, Jerome Yang 
> wrote:
> >> > Hi all,
> >> >
> >> > Here's the situation.
> >> > I'm using solr5.3 in cloud mode.
> >> >
> >> > I have 4 nodes.
> >> >
> >> > After use "kill -9 pid-solr-node" to kill 2 nodes.
> >> > These replicas in the two nodes still are "ACTIVE" in zookeeper's
> >> > state.json.
> >> >
> >> > The problem is, when I try to delete these down replicas with

Contributors Group

2017-09-25 Thread Justin Baynton
Hello There. Can you please add the following user to the contributors
group:

JustinBaynton

Thank you!

Justin


Documents Added Not Available After Commit (Both Soft and Hard)

2014-06-06 Thread Justin Sweeney
>> _2gqy(4.5):C697/215 _2gr2(4.5):C878/352 _2gr7(4.5):C28135/11775
>> _2gr9(4.5):C3276/1341 _2grb(4.5):C5/1 _2grc(4.5):C3247/1219 _2grd(4.5):C6/1
>> _2grf(4.5):C5/2 _2grg(4.5):C23659/10967 _2grh(4.5):C1 _2grj(4.5):C1
>> _2grk(4.5):C5160/1482 _2grm(4.5):C1210/351 _2grn(4.5):C3957/1372
>> _2gro(4.5):C7734/2207 _2grp(4.5):C220/36)}
>
> INFO  - 2014-06-05 21:14:26.949;
>> org.apache.solr.update.DirectUpdateHandler2; start
>> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
>
> INFO  - 2014-06-05 21:14:36.727; org.apache.solr.core.SolrDeletionPolicy;
>> SolrDeletionPolicy.onCommit: commits: num=2
>
>
>> commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/data/solr-data/index
>> lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@26041cb3;
>> maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_acl,generation=13413}
>
>
>> commit{dir=NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/data/solr-data/index
>> lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@26041cb3;
>> maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_acm,generation=13414}
>
> INFO  - 2014-06-05 21:14:36.728; org.apache.solr.core.SolrDeletionPolicy;
>> newest commit generation = 13414
>
> INFO  - 2014-06-05 21:14:36.749; org.apache.solr.search.SolrIndexSearcher;
>> Opening Searcher@5bf20a8a main
>
> INFO  - 2014-06-05 21:14:36.750;
>> org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
>
> INFO  - 2014-06-05 21:14:36.759; org.apache.solr.core.QuerySenderListener;
>> QuerySenderListener sending requests to Searcher@5bf20a8a
>> main{StandardDirectoryReader(segments_acm:1367002775958
>> _2f28(4.5):C13583563/4088615 _2gl6(4.5):C2754573/202192
>> _2g21(4.5):C1046256/298243 _2ge2(4.5):C835858/208834
>> _2gqd(4.5):C383500/35732 _2gmu(4.5):C125197/33714 _2grl(4.5):C46906/3282
>> _2gpj(4.5):C66480/17459 _2gra(4.5):C364/40 _2gr1(4.5):C36064/3442
>> _2gqg(4.5):C42504/22410 _2gqm(4.5):C26821/13787 _2gqu(4.5):C24172/10804
>> _2gqy(4.5):C697/231 _2gr2(4.5):C878/382 _2gr7(4.5):C28135/12761
>> _2gr9(4.5):C3276/1478 _2grb(4.5):C5/1 _2grc(4.5):C3247/1323 _2grd(4.5):C6/1
>> _2grf(4.5):C5/2 _2grg(4.5):C23659/11895 _2grh(4.5):C1 _2grj(4.5):C1
>> _2grk(4.5):C5160/1982 _2grm(4.5):C1210/531 _2grn(4.5):C3957/1790
>> _2gro(4.5):C7734/3504 _2grp(4.5):C220/106 _2grq(4.5):C72751/30166
>> _2grr(4.5):C1)}
>
> INFO  - 2014-06-05 21:14:36.759; org.apache.solr.core.SolrCore;
>> [zoomCollection] webapp=null path=null
>> params={event=newSearcher&q=d_name:ibm&distrib=false} hits=38 status=0
>> QTime=0
>
> INFO  - 2014-06-05 21:14:36.760; org.apache.solr.core.QuerySenderListener;
>> QuerySenderListener done.
>
> INFO  - 2014-06-05 21:14:36.760; org.apache.solr.core.SolrCore;
>> [zoomCollection] Registered new searcher Searcher@5bf20a8a
>> main{StandardDirectoryReader(segments_acm:1367002775958
>> _2f28(4.5):C13583563/4088615 _2gl6(4.5):C2754573/202192
>> _2g21(4.5):C1046256/298243 _2ge2(4.5):C835858/208834
>> _2gqd(4.5):C383500/35732 _2gmu(4.5):C125197/33714 _2grl(4.5):C46906/3282
>> _2gpj(4.5):C66480/17459 _2gra(4.5):C364/40 _2gr1(4.5):C36064/3442
>> _2gqg(4.5):C42504/22410 _2gqm(4.5):C26821/13787 _2gqu(4.5):C24172/10804
>> _2gqy(4.5):C697/231 _2gr2(4.5):C878/382 _2gr7(4.5):C28135/12761
>> _2gr9(4.5):C3276/1478 _2grb(4.5):C5/1 _2grc(4.5):C3247/1323 _2grd(4.5):C6/1
>> _2grf(4.5):C5/2 _2grg(4.5):C23659/11895 _2grh(4.5):C1 _2grj(4.5):C1
>> _2grk(4.5):C5160/1982 _2grm(4.5):C1210/531 _2grn(4.5):C3957/1790
>> _2gro(4.5):C7734/3504 _2grp(4.5):C220/106 _2grq(4.5):C72751/30166
>> _2grr(4.5):C1)}
>
>
I've also shared via Google Drive a more complete log for a period of time
where this is occurring, as well as our solrconfig.xml in case that is
useful.

Any ideas on why the Solr commit is not finding any changes despite the
clear logging of the adds? For some reason, after hours of this it will
find changes and commit everything, including the documents that were
skipped previously.
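
For reference, the commit setup described in this thread (hard autoCommit every 30
minutes, with autoSoftCommit every 5 minutes later added) is normally expressed in
solrconfig.xml roughly as follows; this is a generic sketch, not the poster's actual
config, which was shared separately:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>1800000</maxTime>      <!-- hard commit every 30 minutes -->
    <openSearcher>true</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>       <!-- soft commit every 5 minutes -->
  </autoSoftCommit>
</updateHandler>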

Thanks for any assistance!

Justin Sweeney

 solr_commit_issue.log
<https://docs.google.com/file/d/0B7jKxYrZOSvac21nV0JuRWF0SW8/edit?usp=drive_web>
 solrconfig.xml
<https://docs.google.com/file/d/0B7jKxYrZOSvaRUY2QzhUN2tQYmM/edit?usp=drive_web>


Re: Documents Added Not Available After Commit (Both Soft and Hard)

2014-06-11 Thread Justin Sweeney
Thanks for the input!

Erick - To clarify, we see the No Uncommitted Changes message repeatedly
for a number of commits (not a consistent number each time this happens)
and then eventually we see a commit that successfully finds changes, at
which point the documents are available.

Shalin - That bug looks like it could be related to our case. Did you
notice any impact of the bug in situations where there were not just
pending deletes by term? In our case, we are adding documents; we do have
some deletes, but the bulk are adds. We can see the logging of the adds in
the solr log prior to seeing the No Uncommitted Changes message.

Either way, it may be useful for us to upgrade and see if it fixes the
issue. I'll let you know if that works out once we get a chance to do that.

Thanks,
Justin


On Mon, Jun 9, 2014 at 3:02 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> I think this may be the same bug as LUCENE-5289 which was fixed in 4.5.1.
> Can you upgrade to 4.5.1 and see if that solves the problem?
>
>
>
>
> On Fri, Jun 6, 2014 at 7:17 PM, Justin Sweeney  >
> wrote:
>
> > Hi,
> >
> > An application I am working on indexes documents to a Solr index. This
> Solr
> > index is setup as a single node, without any replication. This index is
> > running Solr 4.5.0.
> >
> > We have noticed an issue lately that is causing some problems for our
> > application. The problem is that we add/update a number of documents in
> the
> > Solr index and we have the index setup to autoCommit (hard) once every 30
> > minutes. In the Solr logs, I am able to see the add command to Solr and I
> > can also see Solr start the hard commit. When this hard commit occurs, we
> > see the following message:
> > INFO  - 2014-06-04 20:13:55.135;
> > org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes.
> > Skipping IW.commit.
> >
> > This only happens sometimes, but Solr will go hours (we have seen 6-12
> > hours of this behavior) before it does a hard commit where it find
> changes.
> > After the hard commit where the changes are found, we are then able to
> > search for and find the documents that were added hours ago, but up until
> > that point the documents are not searchable.
> >
> > We tried enabling autoSoftCommit every 5 minutes in the hope that this
> > would help, but we are seeing the same behavior.
> >
> > Here is a sampling of the logs showing this occurring (I've trimmed it
> down
> > to just show what is happening):
> >
> > INFO  - 2014-06-05 20:00:41.300;
> > >> org.apache.solr.update.processor.LogUpdateProcessor; [zoomCollection]
> > >> webapp=/solr path=/update params={wt=javabin&version=2}
> > {add=[359453225]} 0
> > >> 0
> > >
> > > INFO  - 2014-06-05 20:00:41.376;
> > >> org.apache.solr.update.processor.LogUpdateProcessor; [zoomCollection]
> > >> webapp=/solr path=/update params={wt=javabin&version=2}
> > {add=[347170717]} 0
> > >> 1
> > >
> > > INFO  - 2014-06-05 20:00:51.527;
> > >> org.apache.solr.update.DirectUpdateHandler2; start
> > >>
> >
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=true,prepareCommit=false}
> > >
> > > INFO  - 2014-06-05 20:00:51.533;
> > org.apache.solr.search.SolrIndexSearcher;
> > >> Opening Searcher@257c43d main
> > >
> > > INFO  - 2014-06-05 20:00:51.533;
> > >> org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> > >
> > > INFO  - 2014-06-05 20:00:51.545;
> > org.apache.solr.core.QuerySenderListener;
> > >> QuerySenderListener sending requests to Searcher@257c43d
> > >> main{StandardDirectoryReader(segments_acl:1367002775953
> > >> _2f28(4.5):C13583563/4081507 _2gl6(4.5):C2754573/193533
> > >> _2g21(4.5):C1046256/296354 _2ge2(4.5):C835858/206139
> > >> _2gqd(4.5):C383500/31051 _2gmu(4.5):C125197/32491
> _2grl(4.5):C46906/1255
> > >> _2gpj(4.5):C66480/16562 _2gra(4.5):C364/22 _2gr1(4.5):C36064/2556
> > >> _2gqg(4.5):C42504/21515 _2gqm(4.5):C26821/12659
> _2gqu(4.5):C24172/10240
> > >> _2gqy(4.5):C697/215 _2gr2(4.5):C878/352 _2gr7(4.5):C28135/11775
> > >> _2gr9(4.5):C3276/1341 _2grb(4.5):C5/1 _2grc(4.5):C3247/1219
> > _2grd(4.5):C6/1
> > >> _2grf(4.5):C5/2 _2grg(4.5):C23659/10967 _2grh(4.5):C1 _2grj(4.5):C1
> > >> _2grk(4.5):C5160/1482 _2grm(4.5):C1210/351 _2grn(4.5):C3957/1372
> > >> _2gro(4.5):C7734/2207 _2grp(4.5):C220/36)}
> > >
> > > INFO  - 2014-

Large-scale Solr publish - hanging at blockUntilFinished indefinitely - stuck on SocketInputStream.socketRead0

2013-05-22 Thread Justin Babuscio
*Problem:*

We periodically rebuild our Solr index from scratch.  We have built a
custom publisher that horizontally scales to increase write throughput.  On
a given rebuild, we will have ~60 JVMs running with 5 threads that are
actively publishing to all Solr masters.

For each thread, we instantiate one StreamingUpdateSolrServer(
QueueSize:100, QueueThreadSize: 2 ) for each master = 20 servers/thread.

At the end of a publish cycle (we publish in smaller chunks = 5MM records),
we execute server.blockUntilFinished() on each of the 20 servers on each
thread ( 100 total ).  Before we applied a recent change, this would always
execute to completion.  There were a few hang-ups on publishes but we
consistently re-published our entire corpus in 6-7 hours.

The *problem* is that the blockUntilFinished hangs indefinitely.  From the
java thread dumps, it appears that the loop in StreamingUpdateSolrServer
thinks a runner thread is still active so it blocks (as expected).  The
other note about the java thread dump is that the active runner thread is
exactly this:


*Hung Runner Thread:*
"pool-1-thread-8" prio=3 tid=0x0001084c nid=0xfe runnable
[0x5c7fe000]
java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
 - locked <0xfffe81dbcbe0> (a java.io.BufferedInputStream)
at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)
 at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)
at
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)
 at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)
at
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)
 at
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)
at
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)
 at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)
at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
 at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
 at
org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner.run(StreamingUpdateSolrServer.java:154)


Although the runner thread is reading the socket, there is absolutely no
activity on the Solr clients.  Other than the blockUntilFinished thread,
the client is basically sleeping.

*Recent Change:*

We increased the "maxFieldLength" from 1(default) to 2147483647
(Integer.MAX_VALUE).
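
For reference, in a Solr 3.x solrconfig.xml this setting usually lives in the index
section, along these lines (whether it sits under indexDefaults or mainIndex depends
on the config layout):

<indexDefaults>
  <!-- Integer.MAX_VALUE effectively disables truncation of long fields -->
  <maxFieldLength>2147483647</maxFieldLength>
</indexDefaults>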

Given this change is server side, I don't know how this would impact adding
a new document.  I see how it would increase commit times and index size,
but don't see the relationship to hanging client adds.


*Ingest Workflow:*

1) Pull artifacts from relational database (PDF/TXT/Java bean)
2) Extract all searchable text fields -- this is where we use Tika,
independent of Solr
3) Using the SolrJ client, we publish an object that is serialized to XML and
written to the master
4) execute "blockUntilFinished" for all 20 servers on each thread.

5) Autocommit set on servers at 30 minutes or 50k documents.  During
republish, 50k threshold is met first.

*Environment:*

Solr v3.5.0
20 masters
2 slaves/master = 40 slaves


*Corpus:*

We have ~100MM records, ranging in size from 50MB PDFs to 1KB TXT files.
 Our schema has an unusually large number of fields, 200.  Our index size
averages about 30GB/shard, totaling 600GB.


*Related Bugs:*

My symptoms are most related to this bug but we are not executing any
deletes so I have low confidence that it is 100% related
https://issues.apache.org/jira/browse/SOLR-1990


Although we have similar stack traces, we are only ADDING docs.


Thanks ahead for any input/help!

-- 
Justin Babuscio


Re: Large-scale Solr publish - hanging at blockUntilFinished indefinitely - stuck on SocketInputStream.socketRead0

2013-05-22 Thread Justin Babuscio
Shawn,

Thank you!

Just some quick responses:

On your overflow theory, why would this impact the client?  Is it possible
that a write attempt to Solr would block indefinitely while the Solr server
is running wild or in a bad state due to the overflow?


We attempt to set the BinaryRequestWriter but per this bug:
https://issues.apache.org/jira/browse/SOLR-1565, v3.5 uses the default XML
writer.


On upgrading to 3.6.2 or 4.x, we have an organizational challenge that
requires approval of the software/upgrade.  I am promoting/supporting this
idea but cannot execute in the short-term.

For the mass publish, we originally used the CommonsHttpSolrServer (what we
use in live production updates) but we found the trade-off with performance
was quite large.  I really like your idea about KISS on threading.  Since
I'm already introducing complexity with all the multi-threading, why stress
the older 3.x software?  We may need to trade off time for this.



My first tactics will be to adjust the maxFieldLength and toggle the
configuration to use CommonsHttpSolrServer.  I will follow-up with any
discoveries.

Thanks again,
Justin





On Wed, May 22, 2013 at 11:46 AM, Shawn Heisey  wrote:

> On 5/22/2013 9:08 AM, Justin Babuscio wrote:
>
>> We periodically rebuild our Solr index from scratch.  We have built a
>> custom publisher that horizontally scales to increase write throughput.
>>  On
>> a given rebuild, we will have ~60 JVMs running with 5 threads that are
>> actively publishing to all Solr masters.
>>
>> For each thread, we instantiate one StreamingUpdateSolrServer(
>> QueueSize:100, QueueThreadSize: 2 ) for each master = 20 servers/thread.
>>
>
> Looking over all your details, you might want to try first reducing the
> maxFieldLength to slightly below Integer.MAX_VALUE.  Try setting it to 2
> billion, or even something more modest, in the millions.  It's
> theoretically possible that the other value might be leading to an overflow
> somewhere.  I've been looking for evidence of this, nothing's turned up yet.
>
> There MIGHT be bugs in the Apache Commons libraries that SolrJ uses. The
> next thing I would try is upgrading those component jars in your
> application's classpath - httpclient, commons-io, commons-codec, etc.
>
> Upgrading to a newer SolrJ version is also a good idea.  Your notes imply
> that you are using the default XML request writer in SolrJ.  If that's
> true, you should be able to use a 4.3 SolrJ even with an older Solr
> version, which would give you a server object that's based on
> HttpComponents 4.x, where your current objects are based on HttpClient 3.x.
>  You would need to make adjustments in your source code.  If you're not
> using the default XML request writer, you can get a similar change by using
> SolrJ 3.6.2.
>
> IMHO you should switch to HttpSolrServer (CommonsHttpSolrServer in SolrJ
> 3.5 and earlier).  StreamingUpdateSolrServer (and its replacement in 3.6
> and later, named ConcurrentUpdateSolrServer) has one glaring problem - it
> never informs the calling application about any errors that it encounters
> during indexing.  It lies to you, and tells you that everything has
> succeeded even when it doesn't.
>
> The one advantage that SUSS/CUSS has over its Http sibling is that it is
> multi-threaded, so it can send updates concurrently.  You seem to know
> enough about how it works, so I'll just say that you don't need additional
> complexity that is not under your control and refuses to throw exceptions
> when an error occurs.  You already have a large-scale concurrent and
> multi-threaded indexing setup, so SolrJ's additional thread handling
> doesn't really buy you much.
>
> Thanks,
> Shawn
>
>


-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com


Re: Geo spatial search with multi-valued locations (SOLR-2155 / lucene-spatial-playground)

2011-09-03 Thread Justin Caratzas
Mike,

I've applied the patch as of a June-dated trunk.  There were some
trivial conflicts, but it was mostly easy to apply.  It has been in production
for a couple of months with no major hiccups so far :).

Justin

"Smiley, David W."  writes:

> Hi Mike.
>
> I have hopes that LSP will be ready in time for Solr 4. It's usable now with 
> the understanding that it's still fairly early and so there are bound to be 
> bugs. I've been focusing a lot on testing lately.  You could try applying 
> SOLR-2155 but I think there was some Lucene/Solr code re-organization 
> regarding the ValueSource API. It shouldn't be hard to update.  I don't think 
> JTeam's plugin handles multi-value but I could be wrong (Chris Male will be 
> sure to jump in and correct me if so).  QBase/Metacarta has a Solr plugin 
> I've used indirectly through a packaged deal with their products 
> http://www.metacarta.com/products-overview.htm  I have no idea if you can get 
> it stand-alone. As of a few months ago, it was based on a version of Solr 
> trunk from March 2010 and they have yet to update it.
>
> ~ David Smiley
>
> On Aug 29, 2011, at 2:27 PM, Mike Austin wrote:
>
>> Besides the full integration into solr for this, would you recommend any
>> third party solr plugins such as
>> "http://www.jteam.nl/products/spatialsolrplugin.html", or others?
>> 
>> I can understand that spacial features can get complex and there could be
>> many use cases, but this seems like a "basic" feature that you would use
>> with a standard set of spacial features like what is in solr4 now.
>> 
>> Thanks,
>> Mike
>> 
>> On Mon, Aug 29, 2011 at 12:38 PM, Darren Govoni  wrote:
>> 
>>> It doesn't.
>>> 
>>> 
>>> On 08/29/2011 01:37 PM, Mike Austin wrote:
>>> 
>>>> I've been trying to follow the progress of this and I'm not sure what the
>>>> current status is.  Can someone update me on what is currently in Solr4
>>>> and
>>>> does it support multi-valued location in a single document?  I saw that
>>>> SOLR-2155 was not included and is now lucene-spatial-playground.
>>>> 
>>>> Thanks,
>>>> Mike
>>>> 
>>>> 
>>> 


Re: Easy way to tell if there are pending documents

2011-11-16 Thread Justin Caratzas

You can enable the stats handler
(https://issues.apache.org/jira/browse/SOLR-1750) and inspect the
JSON programmatically.
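
For example, a request along these lines (host and handler path assumed; availability
varies by Solr version):

http://localhost:8983/solr/admin/mbeans?stats=true&wt=json

returns the same statistics as the admin stats page, including the update handler's
docsPending count, in a form that is straightforward to parse from code.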

-- Justin

"Latter, Antoine"  writes:

> Thank you, that does help - but I am more looking for a way to get at this 
> programmatically.
>
> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
> Sent: Tuesday, November 15, 2011 11:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Easy way to tell if there are pending documents
>
> Antoine,
>
> On Solr Admin Stats page search for "docsPending".  I think this is what you 
> are looking for.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem 
> search :: http://search-lucene.com/
>
>
>>
>>From: "Latter, Antoine" 
>>To: "'solr-user@lucene.apache.org'" 
>>Sent: Monday, November 14, 2011 11:39 AM
>>Subject: Easy way to tell if there are pending documents
>>
>>Hi Solr,
>>
>>Does anyone know of an easy way to tell if there are pending documents 
>>waiting for commit?
>>
>>Our application performs operations that are never safe to perform
>> while commits are pending. We make this work by making sure that all
>> indexing operations end in a commit, and stop the unsafe operations
>> from running while a commit is running.
>>
>>This works great most of the time, except when we have enough disk
>> space to add documents to the pending area, but not enough disk
>> space to do a commit - then the indexing operations only error out
>> after they've done all of their adds.
>>
>>It would be nice if the unsafe operation could somehow detect that there are 
>>pending documents and abort.
>>
>>In the interim I'll have the unsafe operation perform a commit when it 
>>starts, but I've been weeding out useless commits from my app recently and I 
>>don't like them creeping back in.
>>
>>Thanks,
>>Antoine
>>
>>
>>



Re: inconsistent JVM crash with version 4.0-SNAPSHOT

2011-11-25 Thread Justin Caratzas
Lasse Aagren  writes:

> Hi,
>
> We are running Solr-Lucene 4.0-SNAPSHOT (1199777M - hudson - 2011-11-09 
> 14:58:50) on severel servers running:
>
> 64bit Debian Squeeze (6.0.3)
> OpenJDK6 (b18-1.8.9-0.1~squeeze1)
> Tomcat 6.028 (6.0.28-9+squeeze1)
>
> Some of the servers have 48G RAM and in that case java have 16G (-Xmx16g) and 
> some of the servers have 96G RAM and in that case java have 48G (-Xmx48G).
>
> We are seeing some inconsistent crashes of tomcat's JVM under different 
> Solr/Lucene operations/circumstances. Sadly we can't replicate it. 
>
> It doesn't happen often, but often enough that we can't rely on it in 
> production.
>
> When it happens, something like the following appears in the logs:
>
> ==
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f6c318d0902, pid=16516, tid=139772378892032
> #
> # JRE version: 6.0_18-b18
> # Java VM: OpenJDK 64-Bit Server VM (14.0-b16 mixed mode linux-amd64 )
> # Derivative: IcedTea6 1.8.9
> # Distribution: Debian GNU/Linux 6.0.2 (squeeze), package 
> 6b18-1.8.9-0.1~squeeze1
> # Problematic frame:
> # j  
> org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(Lorg/apache/lucene/index/IndexReader$AtomicReaderContext;Lorg/apache/lucene/util/Bits;)Lorg/apache/lucene/search/DocIdSet;+193
> #
> # An error report file with more information is saved as:
> # /tmp/hs_err_pid16516.log
> #
> # If you would like to submit a bug report, please include
> # instructions how to reproduce the bug and visit:
> #   http://icedtea.classpath.org/bugzilla
> #
> ==
>
> Every time it happens the problematic frame is:
>
> Problematic frame:
> # j  
> org.apache.lucene.search.MultiTermQueryWrapperFilter.getDocIdSet(Lorg/apache/lucene/index/IndexReader$AtomicReaderContext;Lorg/apache/lucene/util/Bits;
> )Lorg/apache/lucene/search/DocIdSet;+193
>
> And /tmp/hs_err_pid16516.log is attached to this mail.
>
> Has anyone seen this before? 
>
> Please don't hesitate to ask for further specification about our setup.
>
> Best regards,

I seem to remember a recent Java release fixed seemingly random
SIGSEGVs causing Solr/Lucene to crash non-deterministically.

http://lucene.apache.org/solr/#26+October+2011+-+Java+7u1+fixes+index+corruption+and+crash+bugs+in+Apache+Lucene+Core+and+Apache+Solr

Hopefully this will provide you with some answers. If not, please let
the list know.

justin



Re: cache monitoring tools?

2011-12-11 Thread Justin Caratzas
At my work, we use Munin and Nagios for monitoring and alerts.  Munin is
great because writing a plugin for it is so simple, and with Solr's
statistics handler, we can track almost any Solr stat we want.  It also
comes with included plugins for load, file system stats, processes,
etc.

http://munin-monitoring.org/

Justin

Paul Libbrecht  writes:

> Allow me to chime in and ask a generic question about monitoring tools
> for people close to developers: are any of the tools mentioned in this
> thread actually able to show graphs of loads, e.g. cache counts or CPU
> load, in parallel to a console log or to an http request log??
>
> I am working on such a tool currently but I have a bad feeling of reinventing 
> the wheel.
>
> thanks in advance
>
> Paul
>
>
>
> Le 8 déc. 2011 à 08:53, Dmitry Kan a écrit :
>
>> Otis, Tomás: thanks for the great links!
>> 
>> 2011/12/7 Tomás Fernández Löbbe 
>> 
>>> Hi Dimitry, I pointed to the wiki page to enable JMX, then you can use any
>>> tool that visualizes JMX stuff like Zabbix. See
>>> 
>>> http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
>>> 
>>> On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan  wrote:
>>> 
>>>> The culprit seems to be the merger (frontend) SOLR. Talking to one shard
>>>> directly takes substantially less time (1-2 sec).
>>>> 
>>>> On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan  wrote:
>>>> 
>>>>> Tomás: thanks. The page you gave didn't mention cache specifically, is
>>>>> there more documentation on this specifically? I have used solrmeter
>>>> tool,
>>>>> it draws the cache diagrams, is there a similar tool, but which would
>>> use
>>>>> jmx directly and present the cache usage in runtime?
>>>>> 
>>>>> pravesh:
>>>>> I have increased the size of filterCache, but the search hasn't become
>>>> any
>>>>> faster, taking almost 9 sec on avg :(
>>>>> 
>>>>> name: search
>>>>> class: org.apache.solr.handler.component.SearchHandler
>>>>> version: $Revision: 1052938 $
>>>>> description: Search using components:
>>>>> 
>>>> 
>>> org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
>>>>> 
>>>>> stats: handlerStart : 1323255147351
>>>>> requests : 100
>>>>> errors : 3
>>>>> timeouts : 0
>>>>> totalTime : 885438
>>>>> avgTimePerRequest : 8854.38
>>>>> avgRequestsPerSecond : 0.008789442
>>>>> 
>>>>> the stats (copying fieldValueCache as well here, to show term
>>>> statistics):
>>>>> 
>>>>> name: fieldValueCache
>>>>> class: org.apache.solr.search.FastLRUCache
>>>>> version: 1.0
>>>>> description: Concurrent LRU Cache(maxSize=1, initialSize=10,
>>>>> minSize=9000, acceptableSize=9500, cleanupThread=false)
>>>>> stats: lookups : 79
>>>>> hits : 77
>>>>> hitratio : 0.97
>>>>> inserts : 1
>>>>> evictions : 0
>>>>> size : 1
>>>>> warmupTime : 0
>>>>> cumulative_lookups : 79
>>>>> cumulative_hits : 77
>>>>> cumulative_hitratio : 0.97
>>>>> cumulative_inserts : 1
>>>>> cumulative_evictions : 0
>>>>> item_shingleContent_trigram :
>>>>> 
>>>> 
>>> {field=shingleContent_trigram,memSize=326924381,tindexSize=4765394,time=215426,phase1=213868,nTerms=14827061,bigTerms=35,termInstances=114359167,uses=78}
>>>>> name: filterCache
>>>>> class: org.apache.solr.search.FastLRUCache
>>>>> version: 1.0
>>>>> description: Concurrent LRU Cache(maxSize=153600, initialSize=4096,
>>>>> minSize=138240, acceptableSize=145920, cleanupThread=false)
>>>>> stats: lookups : 1082854
>>>>> hits : 940370
>>>>> hitratio : 0.86
>>>>> inserts : 142486
>>>>> evictions : 0
>>>>> size : 142486
>>>>> warmupTime : 0
>>>>> cumulative_lookups : 1082854
>>>>> cumulative_hits 

Re: cache monitoring tools?

2011-12-12 Thread Justin Caratzas
Dmitry,

The only added stress that Munin puts on each box is one request per stat
every 5 minutes to our admin stats handler.  Given that we get 25
requests per second, this doesn't make much of a difference.  We don't
have a sharded index (yet) as our index is only 2-3 GB, but we do have
slave servers with replicated indexes that handle the queries, while our
master handles updates/commits.

Justin

Dmitry Kan  writes:

> Justin, in terms of the overhead, have you noticed if Munin puts much of it
> when used in production? In terms of the solr farm: how big is a shard's
> index (given you have sharded architecture).
>
> Dmitry
>
> On Sun, Dec 11, 2011 at 6:39 PM, Justin Caratzas
> wrote:
>
>> At my work, we use Munin and Nagio for monitoring and alerts.  Munin is
>> great because writing a plugin for it so simple, and with Solr's
>> statistics handler, we can track almost any solr stat we want.  It also
>> comes with included plugins for load, file system stats, processes,
>> etc.
>>
>> http://munin-monitoring.org/
>>
>> Justin
>>
>> Paul Libbrecht  writes:
>>
>> > Allow me to chime in and ask a generic question about monitoring tools
>> > for people close to developers: are any of the tools mentioned in this
>> > thread actually able to show graphs of loads, e.g. cache counts or CPU
>> > load, in parallel to a console log or to an http request log??
>> >
>> > I am working on such a tool currently but I have a bad feeling of
>> reinventing the wheel.
>> >
>> > thanks in advance
>> >
>> > Paul
>> >
>> >
>> >
>> > Le 8 déc. 2011 à 08:53, Dmitry Kan a écrit :
>> >
>> >> Otis, Tomás: thanks for the great links!
>> >>
>> >> 2011/12/7 Tomás Fernández Löbbe 
>> >>
>> >>> Hi Dimitry, I pointed to the wiki page to enable JMX, then you can use
>> any
>> >>> tool that visualizes JMX stuff like Zabbix. See
>> >>>
>> >>>
>> http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
>> >>>
>> >>> On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan 
>> wrote:
>> >>>
>> >>>> The culprit seems to be the merger (frontend) SOLR. Talking to one
>> shard
>> >>>> directly takes substantially less time (1-2 sec).
>> >>>>
>> >>>> On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan 
>> wrote:
>> >>>>
>> >>>>> Tomás: thanks. The page you gave didn't mention cache specifically,
>> is
>> >>>>> there more documentation on this specifically? I have used solrmeter
>> >>>> tool,
>> >>>>> it draws the cache diagrams, is there a similar tool, but which would
>> >>> use
>> >>>>> jmx directly and present the cache usage in runtime?
>> >>>>>
>> >>>>> pravesh:
>> >>>>> I have increased the size of filterCache, but the search hasn't
>> become
>> >>>> any
>> >>>>> faster, taking almost 9 sec on avg :(
>> >>>>>
>> >>>>> name: search
>> >>>>> class: org.apache.solr.handler.component.SearchHandler
>> >>>>> version: $Revision: 1052938 $
>> >>>>> description: Search using components:
>> >>>>>
>> >>>>
>> >>>
>> org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.StatsComponent,org.apache.solr.handler.component.DebugComponent,
>> >>>>>
>> >>>>> stats: handlerStart : 1323255147351
>> >>>>> requests : 100
>> >>>>> errors : 3
>> >>>>> timeouts : 0
>> >>>>> totalTime : 885438
>> >>>>> avgTimePerRequest : 8854.38
>> >>>>> avgRequestsPerSecond : 0.008789442
>> >>>>>
>> >>>>> the stats (copying fieldValueCache as well here, to show term
>> >>>> statistics):
>> >>>>>
>> >>>>> name: fieldValueCache
>> >>>>> class: org.apache.solr.search.FastLRUCache
>> >>>>> version: 1.0
>> >>>>> descr

Re: cache monitoring tools?

2011-12-15 Thread Justin Caratzas
Dmitry,

That's beyond the scope of this thread, but Munin runs "plugins", which are
essentially scripts that output graph configuration and values when polled
by the Munin server.  It uses a plain-text protocol, so the scripts can be
written in any language.  Munin then feeds this info into RRDtool, which
renders the graphs.  There are some examples[1] of Solr plugins that people
have used to scrape the stats.jsp page.
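
To illustrate the protocol, here is a minimal plugin sketch (Java here only
to show that any language works; the stats URL, the filterCache hit-ratio
stat, and the crude regex parsing are assumptions for the sketch, not taken
from the plugins at [1]):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Munin plugin sketch: prints graph configuration when called with "config",
// otherwise prints a single "field.value" line scraped from Solr's stats page.
public class SolrHitratioMuninPlugin {

    // Assumed location of the Solr statistics page being scraped.
    private static final String STATS_URL =
            "http://localhost:8983/solr/admin/stats.jsp";

    public static void main(String[] args) throws Exception {
        if (args.length > 0 && "config".equals(args[0])) {
            // The Munin server calls the plugin with "config" to learn how to draw the graph.
            System.out.println("graph_title Solr filterCache hit ratio");
            System.out.println("graph_vlabel ratio");
            System.out.println("graph_category solr");
            System.out.println("hitratio.label hitratio");
            return;
        }

        // Normal poll (the once-per-5-minutes request mentioned earlier in the
        // thread): fetch the stats page and emit one value line.
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(STATS_URL).openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }

        // Naive scrape: take the first number that follows "hitratio" in the page;
        // "U" is Munin's marker for an unknown value.
        Matcher m = Pattern.compile("hitratio[^0-9]*([0-9.]+)").matcher(body);
        System.out.println("hitratio.value " + (m.find() ? m.group(1) : "U"));
    }
}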

Justin

1. http://exchange.munin-monitoring.org/plugins/search?keyword=solr

Dmitry Kan  writes:

> Thanks, Justin. With zabbix I can gather jmx exposed stats from SOLR, how
> about munin, what protocol / way it uses to accumulate stats? It wasn't
> obvious from their online documentation...
>
> On Mon, Dec 12, 2011 at 4:56 PM, Justin Caratzas
> wrote:
>
>> Dmitry,
>>
>> The only added stress that munin puts on each box is the 1 request per
>> stat per 5 minutes to our admin stats handler.  Given that we get 25
>> requests per second, this doesn't make much of a difference.  We don't
>> have a sharded index (yet) as our index is only 2-3 GB, but we do have
>> slave servers with replicated
>> indexes that handle the queries, while our master handles
>> updates/commits.
>>
>> Justin
>>
>> Dmitry Kan  writes:
>>
>> > Justin, in terms of the overhead, have you noticed if Munin puts much of
>> it
>> > when used in production? In terms of the solr farm: how big is a shard's
>> > index (given you have sharded architecture).
>> >
>> > Dmitry
>> >
>> > On Sun, Dec 11, 2011 at 6:39 PM, Justin Caratzas
>> > wrote:
>> >
>> >> At my work, we use Munin and Nagios for monitoring and alerts.  Munin is
>> >> great because writing a plugin for it is so simple, and with Solr's
>> >> statistics handler, we can track almost any solr stat we want.  It also
>> >> comes with included plugins for load, file system stats, processes,
>> >> etc.
>> >>
>> >> http://munin-monitoring.org/
>> >>
>> >> Justin
>> >>
>> >> Paul Libbrecht  writes:
>> >>
>> >> > Allow me to chime in and ask a generic question about monitoring tools
>> >> > for people close to developers: are any of the tools mentioned in this
>> >> > thread actually able to show graphs of loads, e.g. cache counts or CPU
>> >> > load, in parallel to a console log or to an http request log??
>> >> >
>> >> > I am working on such a tool currently but I have a bad feeling of
>> >> reinventing the wheel.
>> >> >
>> >> > thanks in advance
>> >> >
>> >> > Paul
>> >> >
>> >> >
>> >> >
>> >> > Le 8 déc. 2011 à 08:53, Dmitry Kan a écrit :
>> >> >
>> >> >> Otis, Tomás: thanks for the great links!
>> >> >>
>> >> >> 2011/12/7 Tomás Fernández Löbbe 
>> >> >>
>> >> >>> Hi Dimitry, I pointed to the wiki page to enable JMX, then you can
>> use
>> >> any
>> >> >>> tool that visualizes JMX stuff like Zabbix. See
>> >> >>>
>> >> >>>
>> >>
>> http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
>> >> >>>
>> >> >>> On Wed, Dec 7, 2011 at 11:49 AM, Dmitry Kan 
>> >> wrote:
>> >> >>>
>> >> >>>> The culprit seems to be the merger (frontend) SOLR. Talking to one
>> >> shard
>> >> >>>> directly takes substantially less time (1-2 sec).
>> >> >>>>
>> >> >>>> On Wed, Dec 7, 2011 at 4:10 PM, Dmitry Kan 
>> >> wrote:
>> >> >>>>
>> >> >>>>> Tomás: thanks. The page you gave didn't mention cache
>> specifically,
>> >> is
>> >> >>>>> there more documentation on this specifically? I have used
>> solrmeter
>> >> >>>> tool,
>> >> >>>>> it draws the cache diagrams, is there a similar tool, but which
>> would
>> >> >>> use
>> >> >>>>> jmx directly and present the cache usage in runtime?
>> >> >>>>>
>> >> >>>>> pravesh:
>> >> >>>>> I have increased the size of f

setting up clustering

2010-07-14 Thread Justin Lolofie
I'm trying to enable clustering in solr 1.4. I'm following these instructions:

http://wiki.apache.org/solr/ClusteringComponent

However, `ant get-libraries` fails for me. Before it tries to download
the 4 jar files, it tries to compile lucene? Is this necessary?

Has anyone gotten clustering working properly?

My next attempt was to just copy contrib/clustering/lib/*.jar and
contrib/clustering/lib/downloads/*.jar to WEB-INF/lib and enable
clustering in solrconfig.xml, but this doesn't work either and I can't
tell from the error log whether it just couldn't find the jar files or
if there is some other problem:

SEVERE: org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.handler.clustering.ClusteringComponent'


boosting particular field values

2010-07-21 Thread Justin Lolofie
I'm using dismax request handler, solr 1.4.

I would like to boost the weight of certain fields according to their
values... this appears to work:

bq=category:electronics^5.5

However, I think this boosting only affects sorting the results that
have already matched? So if I only get 10 rows back, I might not get
any records back that are category electronics. If I get 100 rows, I
can see that bq is working. However, I only want to get 10 rows.

How does one affect the kinds of results that are matched to begin
with? bq is the wrong thing to use, right?

Thanks for any help,
Justin


Re: boosting particular field values

2010-07-21 Thread Justin Lolofie
I might have misunderstood, but I think I can't do string literals in
function queries, right?

myfield:"something"^3.0

I tried it anyway using solr 1.4; it doesn't seem to work.

On Wed, Jul 21, 2010 at 1:48 PM, Markus Jelsma  wrote:
> function queries match all documents
>
>
> http://wiki.apache.org/solr/FunctionQuery#Using_FunctionQuery
>
>
> -Original message-
> From: Justin Lolofie 
> Sent: Wed 21-07-2010 20:24
> To: solr-user@lucene.apache.org;
> Subject: boosting particular field values
>
> I'm using dismax request handler, solr 1.4.
>
> I would like to boost the weight of certain fields according to their
> values... this appears to work:
>
> bq=category:electronics^5.5
>
> However, I think this boosting only affects sorting the results that
> have already matched? So if I only get 10 rows back, I might not get
> any records back that are category electronics. If I get 100 rows, I
> can see that bq is working. However, I only want to get 10 rows.
>
> How does one affect the kinds of results that are matched to begin
> with? bq is the wrong thing to use, right?
>
> Thanks for any help,
> Justin
>


Re: Dismax query response field number

2010-07-22 Thread Justin Lolofie
scrapy what version of solr are you using?

I'd like to do "fq=city:Paris" but it doesn't seem to work for me (solr
1.4), and the docs seem to suggest it's a feature that is coming but not
there yet? Or maybe I misunderstood?


On Thu, Jul 22, 2010 at 6:00 AM,   wrote:
>
>  Thanks,
>
> That was the problem!
>
>
>
>
> select?q=moto&qt=dismax& fq =city:Paris
>
>
>
>
>
>
>
>
>
>
>
> -Original Message-
> From: Chantal Ackermann 
> To: solr-user@lucene.apache.org 
> Sent: Thu, Jul 22, 2010 12:47 pm
> Subject: Re: Dismax query response field number
>
>
> is this a typo in your query or in your e-mail?
>
> you have the "q" parameter twice.
> use "fq" for query inputs that mention a field explicitly when using
> dismax.
>
> So it should be:
> select?q=moto&qt=dismax& fq =city:Paris
>
> (the whitespace is only for visualization)
>
>
> chantal
>
>
> On Thu, 2010-07-22 at 11:03 +0200, scr...@asia.com wrote:
>> Yes i've data... maybe my query is wrong?
>>
>> select?q=moto&qt=dismax&q=city:Paris
>>
>> Field city is not showing?
>>
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Grijesh.singh 
>> To: solr-user@lucene.apache.org
>> Sent: Thu, Jul 22, 2010 10:07 am
>> Subject: Re: Dismax query response field number
>>
>>
>>
>> Do you have data in that field? Solr returns only fields which have data.
>
>
>
>
>
>


analysis tool vs. reality

2010-08-03 Thread Justin Lolofie
Hello,

I have found the analysis tool in the admin page to be very useful in
understanding my schema. I've made changes to my schema so that a
particular case I'm looking at matches properly. I restarted solr,
deleted the document from the index, and added it again. But still,
when I do a query, the document does not get returned in the results.

Does anyone have any tips for debugging this sort of issue? What is
different between what I see in analysis tool and new documents added
to the index?

Thanks,
Justin


analysis tool vs. reality

2010-08-03 Thread Justin Lolofie
Hi Erik, thank you for replying. So, turning on debugQuery shows
information about how the query is processed- is there a way to see
how things are stored internally in the index?

My query is "ABC12". There is a document who's "title" field is
"ABC12". However, I can only get it to match if I search for "ABC" or
"12". This was also true in the analysis tool up until recently.
However, I changed schema.xml and turned on catenate-all in
WordDelimiterFilterFactory for the title fieldtype. Now, in the analysis
tool "ABC12" matches "ABC12". However, when doing an actual query, it
does not match.

Thank you for any help,
Justin


-- Forwarded message --
From: Erik Hatcher 
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 16:50:06 -0400
Subject: Re: analysis tool vs. reality
The analysis tool is merely that, but during querying there is also a
query parser involved.  Adding debugQuery=true to your request will
give you the parsed query in the response, offering insight into what
might be going on.   It could be lots of things, from not querying the
fields you think you are, to a misunderstanding about some text not
being analyzed (like wildcard clauses).
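
As a small illustration (SolrJ 1.4-era classes; the Solr URL here is just a
placeholder), this runs the query with debugQuery=true and prints the parsed
query, which shows which fields and analyzed terms the query parser actually
used:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch: ask Solr for debug output and print the parsed query.
public class DebugQueryCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("ABC12");
        query.set("debugQuery", "true");
        QueryResponse response = server.query(query);
        System.out.println("parsed query: " + response.getDebugMap().get("parsedquery"));
    }
}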

   Erik

On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:

Hello,

I have found the analysis tool in the admin page to be very useful in
understanding my schema. I've made changes to my schema so that a
particular case I'm looking at matches properly. I restarted solr,
deleted the document from the index, and added it again. But still,
when I do a query, the document does not get returned in the results.

Does anyone have any tips for debugging this sort of issue? What is
different between what I see in analysis tool and new documents added
to the index?

Thanks,
Justin


analysis tool vs. reality

2010-08-04 Thread Justin Lolofie
Erik: Yes, I did re-index if that means adding the document again.
Here are the exact steps I took:

1. analysis.jsp "ABC12" does NOT match title "ABC12" (however, ABC or 12 does)
2. changed schema.xml WordDelimiterFilterFactory catenate-all
3. restarted tomcat
4. deleted the document with title "ABC12"
5. added the document with title "ABC12"
6. query "ABC12" does NOT result in the document with title "ABC12"
7. analysis.jsp "ABC12" DOES match that document now

Is there any way to see, given an ID, how something is indexed internally?

Lance: I understand the index/query sections of analysis.jsp. However,
it operates on text that you enter into the form, not on actual index
data. Since all my documents have a unique ID, I'd like to supply an
ID and a query, and get back the same index/query sections- using
whats actually in the index.


-- Forwarded message --
From: Erik Hatcher 
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 22:43:17 -0400
Subject: Re: analysis tool vs. reality
Did you reindex after changing the schema?


On Aug 3, 2010, at 7:35 PM, Justin Lolofie wrote:

Hi Erik, thank you for replying. So, turning on debugQuery shows
information about how the query is processed- is there a way to see
how things are stored internally in the index?

My query is "ABC12". However, I can only get it to match if I search for "ABC" or
"ABC12". However, I can only get it to match if I search for "ABC" or
"12". This was also true in the analysis tool up until recently.
However, I changed schema.xml and turned on catenate-all in
WordDelimiterFilterFactory for the title fieldtype. Now, in the analysis
    tool "ABC12" matches "ABC12". However, when doing an actual query, it
does not match.

Thank you for any help,
Justin


-- Forwarded message --
From: Erik Hatcher 
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 16:50:06 -0400
Subject: Re: analysis tool vs. reality
The analysis tool is merely that, but during querying there is also a
query parser involved.  Adding debugQuery=true to your request will
give you the parsed query in the response, offering insight into what
might be going on.   It could be lots of things, from not querying the
fields you think you are, to a misunderstanding about some text not
being analyzed (like wildcard clauses).

 Erik

On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:

  Hello,

  I have found the analysis tool in the admin page to be very useful in
  understanding my schema. I've made changes to my schema so that a
  particular case I'm looking at matches properly. I restarted solr,
  deleted the document from the index, and added it again. But still,
  when I do a query, the document does not get returned in the results.

  Does anyone have any tips for debugging this sort of issue? What is
  different between what I see in analysis tool and new documents added
  to the index?

  Thanks,
  Justin


analysis tool vs. reality

2010-08-04 Thread Justin Lolofie
Wow, I got to work this morning and my query results now include the
'ABC12' document. I'm not sure what that means. Either I made a
mistake in the process I described in the last email (I dont think
this is the case) or there is some kind of caching of query results
going on that doesnt get flushed on a restart of tomcat.




Erik: Yes, I did re-index if that means adding the document again.
Here are the exact steps I took:

1. analysis.jsp "ABC12" does NOT match title "ABC12" (however, ABC or 12 does)
2. changed schema.xml WordDelimiterFilterFactory catenate-all
3. restarted tomcat
4. deleted the document with title "ABC12"
5. added the document with title "ABC12"
6. query "ABC12" does NOT result in the document with title "ABC12"
7. analysis.jsp "ABC12" DOES match that document now

Is there any way to see, given an ID, how something is indexed internally?

Lance: I understand the index/query sections of analysis.jsp. However,
it operates on text that you enter into the form, not on actual index
data. Since all my documents have a unique ID, I'd like to supply an
ID and a query, and get back the same index/query sections- using
whats actually in the index.


-- Forwarded message --
From: Erik Hatcher 
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 22:43:17 -0400
Subject: Re: analysis tool vs. reality
Did you reindex after changing the schema?


On Aug 3, 2010, at 7:35 PM, Justin Lolofie wrote:

Hi Erik, thank you for replying. So, turning on debugQuery shows
information about how the query is processed- is there a way to see
how things are stored internally in the index?

My query is "ABC12". There is a document whose "title" field is
"ABC12". However, I can only get it to match if I search for "ABC" or
"12". This was also true in the analysis tool up until recently.
However, I changed schema.xml and turned on catenate-all in
WordDelimiterFilterFactory for the title fieldtype. Now, in the analysis
tool "ABC12" matches "ABC12". However, when doing an actual query, it
does not match.

Thank you for any help,
Justin


-- Forwarded message --
From: Erik Hatcher 
To: solr-user@lucene.apache.org
Date: Tue, 3 Aug 2010 16:50:06 -0400
Subject: Re: analysis tool vs. reality
The analysis tool is merely that, but during querying there is also a
query parser involved.  Adding debugQuery=true to your request will
give you the parsed query in the response, offering insight into what
might be going on.   It could be lots of things, from not querying the
fields you think you are, to a misunderstanding about some text not
being analyzed (like wildcard clauses).

 Erik

On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:

  Hello,

  I have found the analysis tool in the admin page to be very useful in
  understanding my schema. I've made changes to my schema so that a
  particular case I'm looking at matches properly. I restarted solr,
  deleted the document from the index, and added it again. But still,
  when I do a query, the document does not get returned in the results.

  Does anyone have any tips for debugging this sort of issue? What is
  different between what I see in analysis tool and new documents added
  to the index?

  Thanks,
  Justin


Re: DIH silently ignoring a record

2013-03-19 Thread Justin L.
Shalin,

Thanks for your questions- the mystery is solved this morning. My "unique"
key was only unique within an entity, not across entities. There was only
one instance of overlap- the no-longer mysterious record and its
doppelganger.

All the other symptoms were side effects from how I was troubleshooting.
For example, if I did a full import, the doppelganger record (which I didn't
know about) would be imported- but my test query was only looking for the
one that didn't make it in. However, if I imported only that entity, it
would, as expected, update the index record and things would appear fine to
me.

So, no bug. Just plain old bad/narrow troubleshooting combined with
coincidence (only record not getting imported is first row, etc).

-justin


On Mon, Mar 18, 2013 at 7:34 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> That does sound perplexing.
>
> Justin, can you tell us which field in the query is your record id? What is
> the record id's type in database and in solr schema? What is your unique
> key and its type in solr schema?
>
>
> On Tue, Mar 19, 2013 at 5:19 AM, Justin L.  wrote:
>
> > Every time I do an import, DataImportHandler is not importing 1 row from
> my
> > database.
> >
> > I have 3 entities each defined with a single query. I have confirmed, by
> > looking at totals from solr as well as comparing a "*:*" query to direct
> db
> > queries-- exactly 1 row is missing every time. And it's the same row- the
> > first row of one of my entities when sorted by primary key. The other two
> > entities are fully imported without trouble.
> >
> > There are no errors in the log- even when DIH logging is turned up to
> FINE.
> > When I alter the query to retrieve only the mysterious record, it shows
> up
> > as "Fetched: 1 Skipped: 0 Processed: 1". But when I do a query for *:* it
> > returns 0 documents.
> >
> > Ready for a twist? The DIH query for this entity does not have an ORDER
> BY
> > clause- when I add one to sort by primary key DESC it imports all of the
> > rows for that entity, including the mysterious record.
> >
> > Ready to have your mind blown? I am using the alternative method for
> doing
> > delta imports (see query below). When I make clean=false, and update the
> > timestamp on the mysterious record- yup- it gets imported properly.
> >
> >
> >
> > Because I have the ORDER BY DESC hack, I can get by and live to fight
> > another day. But I thought someone might like to know this because I
> think
> > I am hitting a bug in DIH- specifically, something after the querying but
> > before the posting to solr. If someone familiar with DIH innards wants to
> > suggest where I should look or how to step through it, I'd be willing to
> > take a look.
> >
> > xoxo,
> > Justin
> >
> >
> > * Fun facts:
> > Solr 4.0
> > Oracle 11g
> > The mysterious record's id is "01"
> > I use field elements to rename the columns rather than in-the-sql aliases
> > because of a problem I had with them earlier. But I will try changing
> that.
> >
> >
> > * Alternative delta import method:
> >
> > http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
> >
> >
> > * DIH query that should import mysterious record:
> >
> > select organization_name, organization_id, address
> > from organization o
> > join rolodex r on r.rolodex_id = o.contact_address_id
> > and r.sponsor_address_flag = 'N'
> > and r.actv_ind = 'Y'
> > where '${dataimporter.request.clean}' = 'true'
> > or to_char(o.update_timestamp,'-MM-DD HH24:MI:SS') >
> > '${dataimporter.organization.last_index_time
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

2012-06-19 Thread Justin Babuscio
Solr v3.5.0
8 Master Shards
2 Slaves Per Master

Confirming that there are no active records being written, the "numFound"
value is decreasing as we page through the results.

For example,
Page1 - numFound = 3683
Page2 - numFound = 3683
Page3 - numFound = 3683
Page4 - numFound = 2866
Page5 - numFound = 2419
Page5 - numFound = 1898
Page6 - numFound = 1898
...
PageN - numFound = 1898



It looks like it eventually settles on the real count.  Is this a
limitation when using a distributed cluster, or is numFound always
intended to give an approximation, similar to how Google responds with total
hits?


I also increased start to higher than 1898 (made it 3000) and it returned 0
results with numFound = 1898.


Thanks ahead,

-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com


Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

2012-06-19 Thread Justin Babuscio
1) We have 1 core and we use the default search handler.

2) For the shards, we use the URL
parameters, shards=s1/solr,s2/solr,s3/solr,...,s8/solr
where s# point to a baremetal load balancer that routes the requests to
one of the two slave shards.

   There is definitely the chance that on each page, the load balancer is
mixing the shards used for searching.  Is there a possibility that
Master1's two slaves have two different counts?  This could explain
it.

3) For the URL to execute the aggregating search, we have a virtual IP that
round robins to all 16 slaves in the cluster.

On a given search, any one of the 16 slaves may field the aggregation
request, then the shard single-node queries are fired off to the load
balancer, which may then mix up the nodes.


Other than the load balancing, we have no other configuration or search
differences besides the start parameter in the URL to move to the next page.


On Tue, Jun 19, 2012 at 4:20 PM, Yury Kats  wrote:

> On 6/19/2012 4:06 PM, Justin Babuscio wrote:
> > Solr v3.5.0
> > 8 Master Shards
> > 2 Slaves Per Master
> >
> > Confirming that there are no active records being written, the "numFound"
> > value is decreasing as we page through the results.
> >
> > For example,
> > Page1 - numFound = 3683
> > Page2 - numFound = 3683
> > Page3 - numFound = 3683
> > Page4 - numFound = 2866
> > Page5 - numFound = 2419
> > Page5 - numFound = 1898
> > Page6 - numFound = 1898
> > ...
> > PageN - numFound = 1898
> >
> >
> >
> > It looks like it eventually settles on the real count.  Is this a
> > limitation when using a distributed cluster, or is numFound always
> > intended to give an approximation, similar to how Google responds with
> total
> > hits?
>
> numFound should return the real count for any given query.
> How are you specifying which shards/cores to use for each query?
> Does this change between queries?
>
>


-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com


Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

2012-06-19 Thread Justin Babuscio
As I understand your problem, it sounds like you were using your master as
part of your search cluster so the two distributed queries were returning
conflicting numbers.

In my scenario, our eight Masters are used for /updates & /deletes only.
 There are no queries issued to these nodes.  When the distributed query is
executed, it could be possible for the two slaves to be out of sync (i.e.
one replicated faster than the other).

What may rule this out is that there is no activity on my
servers right now, and the search counts have stabilized.

It's consistently returning fewer results as I execute the query with a
"start" URL param > 400.



On Tue, Jun 19, 2012 at 5:05 PM, Shawn Heisey  wrote:

> On 6/19/2012 2:32 PM, Justin Babuscio wrote:
>
>> 2) For the shards, we use the URL
>> parameters, shards=s1/solr,s2/solr,s3/**solr,...,s8/solr
>> where s# point to a baremetal load balancer that routes the requests
>> to
>> one of the two slave shards.
>>
>
> This most likely has nothing to do with your question about changing
> numFound, just a side issue that I wanted to comment on.  I was at one time
> using a similar method where I had each shard as an entry in the load
> balancer.  This led to an unusual occasional problem.
>
> As you may know, a distributed query results in two queries being sent to
> each shard -- the first one finds the documents on each shard, then once
> Solr has gathered those results, it makes another request that retrieves
> the documents.
>
> Imagine that you have just updated your master server, and you make a
> query that will include one or more of the new documents in the results.
>  If you make that query just after the master server gets updated, but
> before the slave has had a chance to copy and commit the changes, you can
> run into this:  The first (search) query goes to the master server and will
> see the new document.  The second (retrieval) query will then go to the
> slave, requesting a document that does not yet exist there.  This *will*
> happen eventually.  I would run into it at least once a day on a monitoring
> system that checked the age of the newest document.
>
> Here's one way to deal with that: I have a dedicated core on each server
> that has the shards parameter included in the request handler.  This core
> does not have an index of its own, it exists only to act as a search
> broker, pointing at all the cores with the data.  The name of this core is
> ncmain, and its standard request handler contains the following:
>
> idxa2.example.com:8981/solr/inclive,idxa1.example.com:8981/solr/s0live,
> idxa1.example.com:8981/solr/s1live,idxa1.example.com:8981/solr/s2live,
> idxa2.example.com:8981/solr/s3live,idxa2.example.com:8981/solr/s4live,
> idxa2.example.com:8981/solr/s5live
> 
>
> On the servers for chain A (idxa1, idxa2), the shards parameter references
> only chain A server cores.  On the servers for chain B (idxb1, idxb2), the
> shards parameter references only chain B server cores.
>
> The load balancer only talks to these broker cores, not the cores with the
> actual indexes.  Neither the client nor the load balancer needs to use (or
> even know about) the shards parameter.  That is handled entirely within the
> Solr configuration.
>
> Thanks,
> Shawn
>
>


-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com


Re: Solr v3.5.0 - numFound changes when paging through results on 8-shard cluster

2012-06-19 Thread Justin Babuscio
I believe that is the issue.

We recently lost a physical server, and a misinformed (due to a weekend fire)
sys admin moved one of the master shards.  This caused the automated
deployment scripts to change the order of publishing.  When a rebuild
followed the next day, we essentially wrote the same record to
multiple servers, causing this false positive count.

Thank you for all the feedback in resolving this.  We are going to delete
our entire index, rebuild from scratch (achievable for our user base), and
it should clear up any discrepancies.

Justin

On Tue, Jun 19, 2012 at 5:40 PM, Chris Hostetter
wrote:

> : Confirming that there are no active records being written, the "numFound"
> : value is decreasing as we page through the results.
>
> 1) check that the "clones" of each shard are in fact identical (just look
> at the index files on each machine and make sure they are the same).
>
> 2) distributed searching relies heavily on using a uniqeuKey, and can
> behave oddly if documents with identical keys exist in multiple shards.
>
>
> http://wiki.apache.org/solr/DistributedSearch?#Distributed_Searching_Limitations
>
> If I remember correctly, what you are describing sounds like one of the
> things that can happen if you violate the uniqueKey rule across different
> shards when indexing.
>
> I *think* what you are seeing is that in the distributed request for
> page#1 the coordinator sums up the numFound from all shards, and merges
> results 1-$rows according to the sort, likewise for pages 2 & 3.  When you
> get to page #4, it suddenly sees that doc#9876543 is included in the
> responses from 3 diff shards, and it subtracts 2 from the numFound, and so
> on as you page farther through the results.  The more documents with
> duplicate uniqueKeys it finds in the results as it pages through, the lower
> the cumulative numFound gets.
>
> : For example,
> : Page1 - numFound = 3683
> : Page2 - numFound = 3683
> : Page3 - numFound = 3683
> : Page4 - numFound = 2866
> : Page5 - numFound = 2419
> : Page5 - numFound = 1898
> : Page6 - numFound = 1898
> : ...
> : PageN - numFound = 1898
>
> -Hoss
>



-- 
Justin Babuscio
571-210-0035
http://linchpinsoftware.com


Highlighting error InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11

2012-08-03 Thread Justin Engelman
ava:216)

  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)

  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)

  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)

  at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)

  at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)

  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)

  at org.mortbay.jetty.Server.handle(Server.java:326)

  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)

  at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)

  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)

  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)

  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)

  at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)

  at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException:
Token oedipus exceeds length of provided text sized 11

  at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)

  at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)

  ... 24 more



I am not even sure where the “oedipus” token  is coming from.  It doesn’t
show up in the analysis.  Help please?

Thank you,

Justin


Solr replication hangs on multiple slave nodes

2012-10-04 Thread Justin Babuscio
After a large index rebuild (16 masters with ~15GB each), some slaves fail
to completely replicate.

We are running Solr v3.5 with 16 masters and 2 slaves each for a total of
48 servers.

4 of the 32 slaves sit in a stalled replication state with similar messages:

Files Downloaded:  254/260
Downloaded: 12.09 GB / 12.09 GB [ 100% ]
Downloading File: _t6.fdt, Downloaded: 3.1 MB / 3.1 MB [ 100 % ]
Time Elapsed: 3215s, Estimated Time Remaining: 0s, Speed: 24.5 MB/s


As you'll notice, all download sizes appear to be complete but the files
downloaded are not.  This also prevents the servers from polling for a new
update from the masters.  When searching, we are occasionally seeing 500
responses from the slaves that fail to replicate.  The errors are:

ArrayIndexOutOfBounds - this occurs when writing the HTTP Response (our
container is WebSphere)
NullPointerExceptions - org.apache.lucene.queryParser.QueryParser.parse
(QueryParser.java:203)

We have tried to stop the slave, delete the /data directory, and restart.
 This started downloading the index but stalled as expected.

Thanks,
Justin


encountered the "Cannot allocate memory" when calling snapshooter program after optimize command

2009-01-07 Thread Justin Yao

Hi,

I configured solr to listen on the postOptimize event and call the
snapshooter program after an optimize command. It works well when the
Java heap size is set to less than 4G. But if I increase the Java heap
size to 5G, the snapshooter program can't be successfully called after
the optimize command; the error message is:


SEVERE: java.io.IOException: Cannot run program 
"/home/solr_1.3/solr/bin/snapshooter" (in directory 
"/home/solr_1.3/solr/bin"): java.io.IOException: error=12, Cannot 
allocate memory

at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
at java.lang.Runtime.exec(Runtime.java:593)

Here is my server platform:

OS: CentOS 5.2 x86_64
Memory: 8G
Solr: 1.3


Any suggestion is appreciated.

Thanks,
Justin


SpellCheckerRequestHandler and onlyMorePopular

2007-10-24 Thread Justin Knoll
I'm running the example Solr install with a custom schema.xml and  
solrconfig.xml. I'm seeing some unexpected results for searches using  
the SpellCheckerRequestHandler using the onlyMorePopular option.  
Namely, searching for certain terms with onlyMorePopular set to true  
returns a suggestion which, when searched for in turn, returns
a suggestion back to the original term.


For example, a query such as:
http://localhost:8983/solr/select/?q=Eft&qt=spellchecker&onlyMorePopular=true


returns:


status=0, QTime=2, suggestion: oft

And a query for
http://localhost:8983/solr/select/?q=oft&qt=spellchecker&onlyMorePopular=true


status=0, QTime=2, suggestion: Eft

It seems that "onlyMorePopular" should be an asymmetric relation. I  
thought perhaps it might actually be implemented as a >= instead of a  
strict >, making it antisymmetric and perhaps explaining this result  
as a popularity tie. However, taking a clean copy of the example  
install, adding entries into the spellchecker.xml file, then  
inserting and rebuilding the index results in onlyMorePopular cross- 
recommendations as above even when I've created a clear popularity  
inequality between similar terms (e.g. adding two docs with  
word="blackkerry" makes it more popular than the existing  
"blackberry" doc, but each is suggested for the other).


I checked the defaults list for the spellchecker requestHandler in my  
solrconfig.xml, and it didn't specify a value for onlyMorePopular. I  
added a default value of true and restarted Solr, but that has no  
effect. I've also tried using Luke to inspect the spell index, but  
I'm not sure exactly what to look for. I'd be more than happy to  
provide any details which might assist others in lending their  
expertise. Any insights would be very much appreciated.


Thanks,
Justin Knoll

Distribution without SSH?

2007-11-29 Thread Justin Knoll

Hello,
I recently set up Solr with distribution on a couple of servers. I  
just learned that our network policies do not permit us to use SSH  
with passphraseless keys, and the snappuller script uses SSH to  
examine the master Solr instance's state before it pulls the newest  
index via rsync.


We plan to attempt to rewrite the snappuller (and possibly other  
distribution scripts, as required) to eliminate this dependency on  
SSH. I thought I'd ask the list in case anyone has experience with this
same situation or any insights into the reasoning behind requiring  
SSH access to the master instance.


Thanks,
Justin Knoll


RE: Commit preformance problem

2008-03-02 Thread justin alexander

a script for posting large sets (23GB here)
 
http://www.nabble.com/file/p15786630/post3.sh post3.sh 
-- 
View this message in context: 
http://www.nabble.com/Commit-preformance-problem-tp15434972p15786630.html
Sent from the Solr - User mailing list archive at Nabble.com.



Solr 8.2 Cloud Replication Locked

2020-04-23 Thread Justin Sweeney
Hi all,

We are running Solr 8.2 Cloud in a cluster where we have a single TLOG
replica per shard and multiple PULL replicas for each shard. We have
noticed an issue recently where some of the PULL replicas stop replicating
from the masters. They will have a replication which outputs:

o.a.s.h.IndexFetcher Number of files in latest index in master:

Then nothing else for IndexFetcher after that. I went onto a few instances
and took a thread dump, and we see the following, where it seems to be blocked
acquiring the index writer lock. I don’t see anything else in the thread dump
indicating deadlock. Any ideas here?

"indexFetcher-19-thread-1" #468 prio=5 os_prio=0 cpu=285847.01ms
> elapsed=62993.13s tid=0x7fa8fc004800 nid=0x254 waiting on condition
> [0x7ef584ede000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at jdk.internal.misc.Unsafe.park(java.base@11.0.6/Native Method)
> - parking to wait for <0x0003aa5e4ad8> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.6
> /LockSupport.java:234)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireNanos(java.base@11.0.6
> /AbstractQueuedSynchronizer.java:980)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireNanos(java.base@11.0.6
> /AbstractQueuedSynchronizer.java:1288)
> at
> java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.tryLock(java.base@11.0.6
> /ReentrantReadWriteLock.java:1131)
> at
> org.apache.solr.update.DefaultSolrCoreState.lock(DefaultSolrCoreState.java:179)
> at
> org.apache.solr.update.DefaultSolrCoreState.closeIndexWriter(DefaultSolrCoreState.java:240)
> at
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:569)
> at
> org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:351)
> at
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:424)
> at
> org.apache.solr.handler.ReplicationHandler.lambda$setupPolling$13(ReplicationHandler.java:1193)
> at
> org.apache.solr.handler.ReplicationHandler$$Lambda$668/0x000800d0f440.run(Unknown
> Source)
> at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.6
> /Executors.java:515)
> at java.util.concurrent.FutureTask.runAndReset(java.base@11.0.6
> /FutureTask.java:305)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.6
> /ScheduledThreadPoolExecutor.java:305)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.6
> /ThreadPoolExecutor.java:1128)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.6
> /ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(java.base@11.0.6/Thread.java:834)


Performance of /export requests

2019-05-10 Thread Justin Sweeney
Hi,

We are currently working on a project where we are making heavy use of
/export in Solr in order to stream data back. We have an index with about
16 fields that are all docvalues fields and any number of them may be
requested to be streamed in results. Our index has ~450 million documents
spread across 10 shards.

We are creating a CloudSolrStream and when we call CloudSolrStream.open()
we see that call being slower than we had hoped. For some queries, that
call can take 800 ms. What we found interesting was that doing the same
request repeatedly resulted in the same time of 800 ms, which seems to
indicate that /export does not take advantage of caching or there is
something else at play.
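
For reference, here is a stripped-down sketch of how we open and read the
stream (the ZooKeeper ensemble, collection name and field names are
placeholders, and the exact CloudSolrStream constructor and StreamContext
wiring differ a little between SolrJ releases):

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.StreamContext;

// Opens an /export stream, times the open() call, and reads at most 10,000
// tuples before cutting the stream off.
public class ExportTiming {
    public static void main(String[] args) throws Exception {
        Map<String, String> params = new HashMap<String, String>();
        params.put("q", "*:*");
        params.put("fl", "id,timestamp_l");       // docValues fields only
        params.put("sort", "timestamp_l desc");
        params.put("qt", "/export");

        CloudSolrStream stream = new CloudSolrStream(
                "zk1:2181,zk2:2181,zk3:2181/solr", "collection1", params);
        StreamContext context = new StreamContext();
        SolrClientCache cache = new SolrClientCache();
        context.setSolrClientCache(cache);
        stream.setStreamContext(context);

        long start = System.currentTimeMillis();
        try {
            stream.open();   // the call we see taking ~800 ms
            System.out.println("open() took "
                    + (System.currentTimeMillis() - start) + " ms");

            int count = 0;
            Tuple tuple = stream.read();
            while (!tuple.EOF && count < 10000) {
                count++;
                tuple = stream.read();
            }
            System.out.println("read " + count + " tuples");
        } finally {
            stream.close();
            cache.close();
        }
    }
}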

I’m starting to dig through the code to better understand, but I wanted to
reach out to see what sort of expectations we should have here and if there
is anything we can do to increase performance of these requests.

We are currently using Solr 5, but we’ve also tried with Solr 7 and seen
similar results. If I can provide any additional information, please let me
know.

Thank you!
Justin


Re: Performance of /export requests

2019-05-12 Thread Justin Sweeney
Thanks for the quick response. We are generally seeing exports from Solr 5
and 7 to be roughly the same, but I’ll check out Solr 8.

Joel - We are generally sorting on a tlong field and criteria can vary from
searching everything (*:*) to searching on a combination of a few tint and
string types.

All of our 16 fields are docvalues. Is there any performance degradation as
the number of docvalues fields increases or should that not have an impact?
Also, is the 30k sliding window configurable? In many cases we are
streaming back a few thousand, maybe up to 10k and then cutting off the
stream. If we could configure the size of that window, could that speed
things up some?

Thanks again for the info.

On Sat, May 11, 2019 at 2:38 PM Joel Bernstein  wrote:

> Can you share the sort criteria and search query? The main strategy for
> improving performance of the export handler is adding more shards. This is
> different than with typical distributed search, where deep paging issues
> get worse as you add more shards. With the export handler if you double the
> shards you double the pushing power. There are no deep paging drawbacks to
> adding more shards.
>
> On Sat, May 11, 2019 at 2:17 PM Toke Eskildsen  wrote:
>
> > Justin Sweeney  wrote:
> >
> > [Index: 10 shards, 450M docs]
> >
> > > We are creating a CloudSolrStream and when we call
> CloudSolrStream.open()
> > > we see that call being slower than we had hoped. For some queries, that
> > > call can take 800 ms. [...]
> >
> > As far as I can see in the code, CloudSolrStream.open() opens streams
> > against the relevant shards and checks if there is a result. The last
> step
> > is important as that means the first batch of tuples must be calculated
> in
> > the shards. Streaming works internally by having a sliding window of 30K
> > tuples through the result set in each shard, so open() results in (up to)
> > 30K tuples being calculated. On the other hand, getting the first 30K
> > tuples should be very fast after open().
> >
> > > We are currently using Solr 5, but we’ve also tried with Solr 7 and
> seen
> > > similar results.
> >
> > Solr 7 has a performance regression for export (or rather a regression
> for
> > DocValues that is very visible when using export. See
> > https://issues.apache.org/jira/browse/SOLR-13013), so I would expect it
> > to be slower than Solr 5. You could try with Solr 8 where this regression
> > should be mitigated somewhat.
> >
> > - Toke Eskildsen
> >
>


Solr Slack Workspace

2021-01-15 Thread Justin Sweeney
Hi all,

I did some googling and didn't find anything, but is there a Slack
workspace for Solr? I think this could be useful to expand interaction
within the community of Solr users and connect people solving similar
problems.

I'd be happy to get this setup if it does not exist already.

Justin


Re: Solr Slack Workspace

2021-01-27 Thread Justin Sweeney
Thanks, I joined the Relevance Slack:
https://opensourceconnections.com/slack. I definitely think a dedicated
Solr workspace would also be good, allowing for channels to get involved
with development as well as user-based questions.

It does seem like Slack has made it increasingly difficult to create open
workspaces without forcing someone to approve members or restricting signups
to specific email domains. Has anyone tried to do that recently? I tried for an hour or so
last weekend and it seemed to not be very straightforward anymore.

On Tue, Jan 26, 2021 at 12:57 PM Houston Putman 
wrote:

> There is https://solr-dev.slack.com
>
> It's not really used, but it's there and we can open it up for people to
> join and start using.
>
> On Tue, Jan 26, 2021 at 5:38 AM Ishan Chattopadhyaya <
> ichattopadhy...@gmail.com> wrote:
>
> > Thanks ufuk. I'll take a look.
> >
> > On Tue, 26 Jan, 2021, 4:05 pm ufuk yılmaz, 
> > wrote:
> >
> > > It’s asking for a searchscale.com email address?
> > >
> > > Sent from Mail for Windows 10
> > >
> > > From: Ishan Chattopadhyaya
> > > Sent: 26 January 2021 13:33
> > > To: solr-user
> > > Subject: Re: Solr Slack Workspace
> > >
> > > There is a Slack backed by official IRC support. Please see
> > > https://lucene.472066.n3.nabble.com/Solr-Users-Slack-td4466856.html
> for
> > > details on how to join it.
> > >
> > > On Tue, 19 Jan, 2021, 2:54 pm Charlie Hull, <
> > > ch...@opensourceconnections.com>
> > > wrote:
> > >
> > > > Relevance Slack is open to anyone working on search & relevance -
> #solr
> > > is
> > > > only one of the channels, there's lots more! Hope to see you there.
> > > >
> > > > Cheers
> > > >
> > > > Charlie
> > > > https://opensourceconnections.com/slack
> > > >
> > > >
> > > > On 16/01/2021 02:18, matthew sporleder wrote:
> > > > > IRC has kind of died off,
> > > > > https://lucene.apache.org/solr/community.html has a slack
> mentioned,
> > > > > I'm on https://opensourceconnections.com/slack after taking their
> > solr
> > > > > training class and assume it's mostly open to solr community.
> > > > >
> > > > > On Fri, Jan 15, 2021 at 8:10 PM Justin Sweeney
> > > > >  wrote:
> > > > >> Hi all,
> > > > >>
> > > > >> I did some googling and didn't find anything, but is there a Slack
> > > > >> workspace for Solr? I think this could be useful to expand
> > interaction
> > > > >> within the community of Solr users and connect people solving
> > similar
> > > > >> problems.
> > > > >>
> > > > >> I'd be happy to get this setup if it does not exist already.
> > > > >>
> > > > >> Justin
> > > >
> > > >
> > > > --
> > > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > > > 
> > > > Founding member of The Search Network <https://thesearchnetwork.com/
> >
> > > > and co-author of Searching the Enterprise
> > > > <https://opensourceconnections.com/about-us/books-resources/>
> > > > tel/fax: +44 (0)8700 118334
> > > > mobile: +44 (0)7767 825828
> > > >
> > >
> > >
> >
>


Re: Solr Slack Workspace

2021-02-05 Thread Justin Sweeney
Worked for me and a few others, thanks for doing that!

On Tue, Feb 2, 2021 at 5:04 AM Ishan Chattopadhyaya <
ichattopadhy...@gmail.com> wrote:

> Hi all,
> I've created an invite link for the Slack workspace:
> https://s.apache.org/solr-slack.
> Please test it out. I'll send a broader notification once this is tested
> out to be working well.
> Thanks and regards,
> Ishan
>
> On Thu, Jan 28, 2021 at 12:26 AM Justin Sweeney <
> justin.sweene...@gmail.com>
> wrote:
>
> > Thanks, I joined the Relevance Slack:
> > https://opensourceconnections.com/slack. I definitely think a dedicated
> > Solr workspace would also be good, allowing for channels to get involved
> > with development as well as user-based questions.
> >
> > It does seem like Slack has made it increasingly difficult to create open
> > workspaces without forcing someone to approve members or restricting signups
> > to specific email domains. Has anyone tried to do that recently? I tried for an hour or so
> > last weekend and it seemed to not be very straightforward anymore.
> >
> > On Tue, Jan 26, 2021 at 12:57 PM Houston Putman  >
> > wrote:
> >
> > > There is https://solr-dev.slack.com
> > >
> > > It's not really used, but it's there and we can open it up for people
> to
> > > join and start using.
> > >
> > > On Tue, Jan 26, 2021 at 5:38 AM Ishan Chattopadhyaya <
> > > ichattopadhy...@gmail.com> wrote:
> > >
> > > > Thanks ufuk. I'll take a look.
> > > >
> > > > On Tue, 26 Jan, 2021, 4:05 pm ufuk yılmaz,
>  > >
> > > > wrote:
> > > >
> > > > > It’s asking for a searchscale.com email address?
> > > > >
> > > > > Sent from Mail for Windows 10
> > > > >
> > > > > From: Ishan Chattopadhyaya
> > > > > Sent: 26 January 2021 13:33
> > > > > To: solr-user
> > > > > Subject: Re: Solr Slack Workspace
> > > > >
> > > > > There is a Slack backed by official IRC support. Please see
> > > > >
> https://lucene.472066.n3.nabble.com/Solr-Users-Slack-td4466856.html
> > > for
> > > > > details on how to join it.
> > > > >
> > > > > On Tue, 19 Jan, 2021, 2:54 pm Charlie Hull, <
> > > > > ch...@opensourceconnections.com>
> > > > > wrote:
> > > > >
> > > > > > Relevance Slack is open to anyone working on search & relevance -
> > > #solr
> > > > > is
> > > > > > only one of the channels, there's lots more! Hope to see you
> there.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > > Charlie
> > > > > > https://opensourceconnections.com/slack
> > > > > >
> > > > > >
> > > > > > On 16/01/2021 02:18, matthew sporleder wrote:
> > > > > > > IRC has kind of died off,
> > > > > > > https://lucene.apache.org/solr/community.html has a slack
> > > mentioned,
> > > > > > > I'm on https://opensourceconnections.com/slack after taking
> > their
> > > > solr
> > > > > > > training class and assume it's mostly open to solr community.
> > > > > > >
> > > > > > > On Fri, Jan 15, 2021 at 8:10 PM Justin Sweeney
> > > > > > >  wrote:
> > > > > > >> Hi all,
> > > > > > >>
> > > > > > >> I did some googling and didn't find anything, but is there a
> > Slack
> > > > > > >> workspace for Solr? I think this could be useful to expand
> > > > interaction
> > > > > > >> within the community of Solr users and connect people solving
> > > > similar
> > > > > > >> problems.
> > > > > > >>
> > > > > > >> I'd be happy to get this setup if it does not exist already.
> > > > > > >>
> > > > > > >> Justin
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Charlie Hull - Managing Consultant at OpenSource Connections
> > Limited
> > > > > > 
> > > > > > Founding member of The Search Network <
> > https://thesearchnetwork.com/
> > > >
> > > > > > and co-author of Searching the Enterprise
> > > > > > <https://opensourceconnections.com/about-us/books-resources/>
> > > > > > tel/fax: +44 (0)8700 118334
> > > > > > mobile: +44 (0)7767 825828
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>