Re: pdfs

2014-05-25 Thread Siegfried Goeschl
Hi Brian,

can you send me the email? I would like to play around :-)

Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce
the issue …

Thanks in advance

Siegfried Goeschl


On 25 May 2014, at 04:18, Brian McDowell  wrote:

> Our feeding (indexing) tool halts because Solr becomes unresponsive after
> getting some really bad pdfs. There are levels of pdf "badness." Some just
> will not parse and that's fine, but others are more problematic in that our
> Operations team has to restart Solr because it just hangs and accepts no
> more documents. I actually have identified a pdf that will bring down Solr
> every time. Does anyone think that doing pre-validation using the pdfbox
> jar will work? Or, will trying to validate just hang as well? Any help is
> appreciated.
> 
> 
> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky 
> wrote:
> 
>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Siegfried Goeschl
>> Sent: Thursday, May 22, 2014 4:35 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: pdfs
>> 
>> 
>> Hi folks,
>> 
>> for a small customer project I'm running SOLR with embedded Tika.
>> 
>> * memory consumption is an issue but can be handled
>> * there is an issue with PDFBox hitting an infinite loop which causes
>> excessive CPU usage - it requires a SOLR restart, but happens only once
>> within 400,000 documents (PDF, Word, etc.) and seems a little bit
>> erratic, since I was never able to track the problem back to a particular
>> PDF document
>> 
>> Having said that, we wire SOLR up with Nagios to get an alarm when CPU
>> consumption goes through the roof.
>> 
>> If you're doing really serious stuff I would recommend:
>> * moving the document extraction stuff out of SOLR
>> * providing monitoring and recovery for stuck document extractions
>> ** killing worker threads
>> ** using external processes and killing them when they spin out of control
>> 
>> Cheers,
>> 
>> Siegfried Goeschl
>> 
>> On 22.05.14 06:46, Jack Krupansky wrote:
>> 
>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>> has improved over the years, but still not likely to be perfect.
>>> 
>>> That said, I'm not aware of any specific PDF extraction issue that would
>>> bring down Solr - as opposed to causing a 500 status with an exception
>>> in PDF extraction - the one exception being memory usage. Some PDF
>>> documents, especially those which are graphics-intensive, can require a lot
>>> of memory. The rest of Solr could be adversely affected if all available
>>> JVM heap is consumed. The solution is to give the JVM more heap space.
>>> 
>>> So, what is your specific symptom?
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Brian McDowell
>>> Sent: Thursday, May 22, 2014 12:24 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: pdfs
>>> 
>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>> Solr completely so that it actually needs to be manually restarted. We are
>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>> problem because the release notes associated with the new Tika version and
>>> also the new PDFBox indicate fixes for pdf issues. It didn't work, and now
>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>> would be greatly appreciated. Thank you!
>>> 
>> 
>> 
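
A practical takeaway from this thread: validate suspect PDFs outside Solr, with a hard timeout, before feeding them. A minimal sketch in Java, assuming PDFBox 1.8.x on the classpath (class and method names here are illustrative, not from the thread):

    import java.io.File;
    import java.util.concurrent.*;

    import org.apache.pdfbox.pdmodel.PDDocument;

    public class PdfPreCheck {
        private static final ExecutorService pool = Executors.newFixedThreadPool(4);

        /** Returns true if the PDF parses within the timeout, false otherwise. */
        public static boolean looksParseable(final File pdf, long timeoutSeconds) {
            Future<Boolean> result = pool.submit(new Callable<Boolean>() {
                public Boolean call() throws Exception {
                    PDDocument doc = PDDocument.load(pdf);
                    try {
                        // touch the document minimally to force parsing
                        return doc.getNumberOfPages() >= 0;
                    } finally {
                        doc.close();
                    }
                }
            });
            try {
                return result.get(timeoutSeconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                result.cancel(true); // best effort - see the caveat below
                return false;
            } catch (Exception e) {
                return false; // unparseable: skip this document
            }
        }
    }

Caveat, per Siegfried's advice above: a thread stuck in a tight parsing loop may ignore interruption, so running the check in a separate process that can be killed outright is the more robust variant.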



Re: pdfs

2014-05-25 Thread Siegfried Goeschl
Sorry, typo - can you send me the PDF by email directly? :-)

Siegfried Goeschl

On 25 May 2014, at 10:06, Siegfried Goeschl  wrote:

> Hi Brian,
> 
> can you send me the email? I would like to play around :-)
> 
> Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce
> the issue …
> 
> Thanks in advance
> 
> Siegfried Goeschl
> 
> 
> On 25 May 2014, at 04:18, Brian McDowell  wrote:
> 
>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>> will not parse and that's fine, but others are more problematic in that our
>> Operations team has to restart Solr because it just hangs and accepts no
>> more documents. I actually have identified a pdf that will bring down Solr
>> every time. Does anyone think that doing pre-validation using the pdfbox
>> jar will work? Or, will trying to validate just hang as well? Any help is
>> appreciated.
>> 
>> 
>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky 
>> wrote:
>> 
>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Siegfried Goeschl
>>> Sent: Thursday, May 22, 2014 4:35 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: pdfs
>>> 
>>> 
>>> Hi folks,
>>> 
>>> for a small customer project I'm running SOLR with embedded Tika.
>>> 
>>> * memory consumption is an issue but can be handled
>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>> excessive CPU usage - it requires a SOLR restart, but happens only once
>>> within 400,000 documents (PDF, Word, etc.) and seems a little bit
>>> erratic, since I was never able to track the problem back to a particular
>>> PDF document
>>> 
>>> Having said that, we wire SOLR up with Nagios to get an alarm when CPU
>>> consumption goes through the roof.
>>> 
>>> If you're doing really serious stuff I would recommend:
>>> * moving the document extraction stuff out of SOLR
>>> * providing monitoring and recovery for stuck document extractions
>>> ** killing worker threads
>>> ** using external processes and killing them when they spin out of control
>>> 
>>> Cheers,
>>> 
>>> Siegfried Goeschl
>>> 
>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>> 
 Yeah, PDF extraction has always been at least somewhat problematic. It
 has improved over the years, but still not likely to be perfect.
 
 That said, I'm not aware of any specific PDF extraction issue that would
 bring down Solr - as opposed to causing a 500 status with an exception
 in PDF extraction - the one exception being memory usage. Some PDF
 documents, especially those which are graphics-intensive, can require a lot
 of memory. The rest of Solr could be adversely affected if all available
 JVM heap is consumed. The solution is to give the JVM more heap space.
 
 So, what is your specific symptom?
 
 -- Jack Krupansky
 
 -Original Message- From: Brian McDowell
 Sent: Thursday, May 22, 2014 12:24 AM
 To: solr-user@lucene.apache.org
 Subject: pdfs
 
 Has anyone had issues with indexing pdf files? Some pdfs are bringing down
 Solr completely so that it actually needs to be manually restarted. We are
 using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
 problem because the release notes associated with the new Tika version and
 also the new PDFBox indicate fixes for pdf issues. It didn't work, and now
 this issue is causing us to reevaluate using Solr. Any help on this matter
 would be greatly appreciated. Thank you!
 
>>> 
>>> 
> 



RE: multiple queries in single request

2014-05-25 Thread Pavel Belenkovich
Thanx Jack!

Could someone please explain what "batching" means in this case?
(Assuming I have just 1-2 documents per requested id)

regards,
Pavel.


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Thursday, May 22, 2014 15:51
To: solr-user@lucene.apache.org
Subject: Re: multiple queries in single request

No, I was rejecting BOTH methods 1 and 2. I was suggesting a different method. 
I'll leave it to somebody else to describe the method so that it is easier to 
understand.

-- Jack Krupansky

-Original Message-
From: Pavel Belenkovich
Sent: Thursday, May 22, 2014 4:00 AM
To: solr-user@lucene.apache.org
Subject: RE: multiple queries in single request

Hi Jack!

Thanx for the response!

So you say that using method 2 below (single request with ORs and sorting 
results in client) is better than method 1 (separate requests)?

regards,
Pavel.


-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday, May 22, 2014 01:26
To: solr-user@lucene.apache.org
Subject: Re: multiple queries in single request

Nothing special for this use case.

This seems to be a use case that I would call "bulk data retrieval - based on 
ID".

I would suggest "batching" your requests - limit each request query to, say,
50 or 100 IDs.

-- Jack Krupansky

-Original Message-
From: Pavel Belenkovich
Sent: Wednesday, May 21, 2014 1:07 PM
To: solr-user@lucene.apache.org
Subject: multiple queries in single request

Hi,

I have a list of 1000 values for some field which is a sort of id (essentially
unique between documents) (let's say firstname_lastname).
I need to get the document for each id (to know which document is for which id,
not just a list of responses).

Is there some support for multiple queries in a single Solr request?
I saw old posts requesting that, but I don't know if it's been implemented yet.

There are 2 methods I can think of to achieve the result:
1 - trivial - make a separate request per value. I think it's very inefficient.
2 - Perform a single request with OR on all values.
Then loop over the responses and match them to the requested values.
This would also require making the field stored.

Can you propose a better option?

thanx,
Pavel 
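
To spell out the "batching" Jack suggests: split the 1000 IDs into chunks of 50-100, send one OR query per chunk, and match the results back to the requested IDs on the client. A minimal SolrJ sketch, assuming SolrJ 4.x, a stored field named "id", and an already-constructed server - all names illustrative:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocument;

    public class BatchedIdLookup {
        public static Map<String, List<SolrDocument>> lookup(HttpSolrServer solr,
                List<String> ids) throws Exception {
            Map<String, List<SolrDocument>> byId = new HashMap<String, List<SolrDocument>>();
            int batchSize = 100; // 50-100 IDs per request, as suggested
            for (int i = 0; i < ids.size(); i += batchSize) {
                List<String> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
                // build id:(v1 OR v2 OR ...) with each value escaped
                StringBuilder q = new StringBuilder("id:(");
                for (int j = 0; j < batch.size(); j++) {
                    if (j > 0) q.append(" OR ");
                    q.append(ClientUtils.escapeQueryChars(batch.get(j)));
                }
                q.append(")");
                SolrQuery query = new SolrQuery(q.toString());
                query.setRows(batch.size() * 2); // room for the 1-2 docs per id
                for (SolrDocument doc : solr.query(query).getResults()) {
                    // the id field must be stored so results can be matched back
                    String id = (String) doc.getFieldValue("id");
                    List<SolrDocument> docs = byId.get(id);
                    if (docs == null) {
                        docs = new ArrayList<SolrDocument>();
                        byId.put(id, docs);
                    }
                    docs.add(doc);
                }
            }
            return byId;
        }
    }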



Re: Distributed Search in Solr with different queries per shard

2014-05-25 Thread Ramkumar R. Aiyengar
I agree with Eric that this is premature unless you can show that it makes
a difference.

Firstly, why are you splitting the data into multiple time tiers (one
recent, and one all) and then waiting to merge results from all of them?
Time tiering is useful when you can do the search separately on both and
then pick the one which comes back with full results first (usually will be
the recent one but it might not have as many results as you want).

The way you are trying to aggregate the data is sharding, where one of the
cores doesn't have the data the other one has. So you could just 'optimize'
by not having the data present in the historical collection. We have
support for custom sharding keys now in Solr, haven't used it personally
but that might be worth a shot.
On 21 May 2014 14:57, "Avner Levy"  wrote:

> I have 2 cores.
> One with active data and one with historical data (for documents which
> were removed from the active one).
> I want to run Distributed Search on both and get the unified result (as
> supported by Solr Distributed Search, I'm not using Solr Cloud).
> My problem is that the query for each core is different.
> Is there a way to specify a different query per core and still let Solr
> unify the query results?
> For example:
> Active data core query: select all green docs
> History core query: select all green docs with year=2012
> Is there a way to extend the distributed search handler to support such a
> scenario?
> Thanks in advance,
>   Avner
> * One option is to send a unified query to both, but then each core
> will work harder for no reason.
>
>
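
For reference, the plain (non-Cloud) distributed request being discussed looks something like this (host, core, and field names are illustrative):

    http://localhost:8983/solr/active/select?q=color_s:green
        &shards=localhost:8983/solr/active,localhost:8983/solr/history

Solr forwards the same q and fq parameters verbatim to every listed shard, which is why a different query per shard isn't supported out of the box.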


RE: Query translation of User Fields

2014-05-25 Thread Liram Vardi
Hi Jack,

Thank you for your answer.
I submitted the following Jira issue: 
https://issues.apache.org/jira/browse/SOLR-6113 

Thanks,
Liram

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com] 
Sent: Thursday, May 22, 2014 5:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Query translation of User Fields

Hmmm... that doesn't sound like what I would have expected - I would have 
thought that Solr would throw an exception on the "user" field, rather than 
simply treat it as a text keyword. File a Jira. Either it's a bug or the doc is 
not complete.

-- Jack Krupansky

-Original Message-
From: Liram Vardi
Sent: Thursday, May 22, 2014 10:50 AM
To: solr-user@lucene.apache.org
Subject: Query translation of User Fields

Hi all,

I have a question regarding the functionality of Edismax parser and its "User 
Field" feature.
We are running Solr 4.4 on our server.

For the query:
"q= id:b* user:"Anna Collins"&defType=edismax&uf=* -user&rows=0"
The parsed query (taken from query debug info) is:
+((id:b* (text:user) (text:"anna collins"))~1)

I expect that because "user" was filtered out in "uf" (User Fields), the parsed
query should not contain the "user" search part.
In other words, the parsed query should look simply like this: +id:b*
What is the right behavior?

Thanks,
Liram 




Re: Query translation of User Fields

2014-05-25 Thread Yonik Seeley
On Thu, May 22, 2014 at 10:56 AM, Jack Krupansky
 wrote:
> Hmmm... that doesn't sound like what I would have expected - I would have
> thought that Solr would throw an exception on the "user" field, rather than
> simply treat it as a text keyword.

No, I believe that's working as designed.  edismax should never throw
exceptions due to the structure of the user query.
Just because something looks like a field query (has a : in it)
doesn't mean it was intended to be.

Examples:
Terminator 2: Judgment Day
Mission: Impossible

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Reply: Internals about "Too many values for UnInvertedField faceting on field xxx"

2014-05-25 Thread Yonik Seeley
On Sat, May 24, 2014 at 9:50 PM, 张月祥  wrote:
> Thanks for your reply. I'll try it.
>
> We're still interested in the actual limit behind "Too many values for
> UnInvertedField faceting on field xxx".
>
> Could anybody tell us some internals about "Too many values for
> UnInvertedField faceting on field xxx" ?

There are only 256 byte arrays to hold all of the ord data, and the
pointers into those arrays are only 24 bits long.  That gets you back
to 32 bits, or 4GB of ord data max.  It's practically less since you
only have to overflow one array before the exception is thrown.
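
Spelled out, the arithmetic behind that ceiling:

    256 byte arrays = 2^8   (8 bits select which array)
    24-bit pointers = 2^24 bytes (16MB) addressable within each array
    2^8 * 2^24      = 2^32 bytes = 4GB of ord data in total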

This faceting method is best for high numbers of unique values, but a
relatively low number of unique values per document.
I've been considering making an off-heap version for Heliosearch, and
maybe bump the limits a little at the same time...

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache


Re: Query translation of User Fields

2014-05-25 Thread Jack Krupansky

I stand corrected! I used to know that.

But I do think the doc for edismax should be clearer on this point - what
happens if an invalid field name is referenced, or more specifically, what
happens if the user references a legitimate field name that merely happens
to be disallowed using uf.


-- Jack Krupansky

-Original Message- From: Yonik Seeley
Sent: Sunday, May 25, 2014 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Query translation of User Fields

On Thu, May 22, 2014 at 10:56 AM, Jack Krupansky
 wrote:

Hmmm... that doesn't sound like what I would have expected - I would have
thought that Solr would throw an exception on the "user" field, rather than
simply treat it as a text keyword.


No, I believe that's working as designed.  edismax should never throw
exceptions due to the structure of the user query.
Just because something looks like a field query (has a : in it)
doesn't mean it was intended to be.

Examples:
Terminator 2: Judgment Day
Mission: Impossible

-Yonik
http://heliosearch.org - facet functions, subfacets, off-heap filters&fieldcache



Re: SolrCloud Nodes autoSoftCommit and (temporary) missing documents

2014-05-25 Thread Steve McKay
Solr can add the filter for you:

  <lst name="appends">
    <str name="fq">timestamp:[* TO NOW-30SECOND]</str>
  </lst>

Increasing soft commit frequency isn't a bad idea, though. I'd probably do 
both. :)

On May 23, 2014, at 6:51 PM, Michael Tracey  wrote:

> Hey all,
> 
> I've got a number of nodes (Solr 4.4 Cloud) that I'm balancing with HaProxy 
> for queries.  I'm indexing pretty much constantly, and have autoCommit and 
> autoSoftCommit on for Near Realtime Searching.  All works nicely, except that 
> occasionally the auto-commit cycles are far enough off that one node will 
> return a document that another node doesn't.  I don't want to have to add 
> something like this: timestamp:[* TO NOW-30MINUTE] to every query to make 
> sure that all the nodes have the record.  Ideas? autoSoftCommit more often?
> 
> <autoCommit>
>   <maxDocs>10</maxDocs>
>   <maxTime>720</maxTime>
>   <openSearcher>false</openSearcher>
> </autoCommit>
> 
> <autoSoftCommit>
>   <maxDocs>3</maxDocs>
>   <maxTime>5000</maxTime>
> </autoSoftCommit>
> 
> Thanks,
> 
> M.
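
For context, the "appends" section Steve shows lives inside a request handler definition in solrconfig.xml - a sketch with an assumed handler name:

    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="appends">
        <str name="fq">timestamp:[* TO NOW-30SECOND]</str>
      </lst>
    </requestHandler>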



Re: SolrCloud Nodes autoSoftCommit and (temporary) missing documents

2014-05-25 Thread Siegfried Goeschl
Hi folks,

I think that the timestamp should be rounded down to a minute (or whatever) to 
avoid trashing the filter query cache

Cheers,

Siegfried Goeschl

On 25 May 2014, at 18:19, Steve McKay  wrote:

> Solr can add the filter for you:
> 
>   <lst name="appends">
>     <str name="fq">timestamp:[* TO NOW-30SECOND]</str>
>   </lst>
> 
> Increasing soft commit frequency isn't a bad idea, though. I'd probably do 
> both. :)
> 
> On May 23, 2014, at 6:51 PM, Michael Tracey  wrote:
> 
>> Hey all,
>> 
>> I've got a number of nodes (Solr 4.4 Cloud) that I'm balancing with HaProxy 
>> for queries.  I'm indexing pretty much constantly, and have autoCommit and 
>> autoSoftCommit on for Near Realtime Searching.  All works nicely, except 
>> that occasionally the auto-commit cycles are far enough off that one node 
>> will return a document that another node doesn't.  I don't want to have to 
>> add something like this: timestamp:[* TO NOW-30MINUTE] to every query to 
>> make sure that all the nodes have the record.  Ideas? autoSoftCommit more 
>> often?
>> 
>> <autoCommit>
>>   <maxDocs>10</maxDocs>
>>   <maxTime>720</maxTime>
>>   <openSearcher>false</openSearcher>
>> </autoCommit>
>> 
>> <autoSoftCommit>
>>   <maxDocs>3</maxDocs>
>>   <maxTime>5000</maxTime>
>> </autoSoftCommit>
>> 
>> Thanks,
>> 
>> M.
> 
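
The rounding Siegfried suggests can be written directly in Solr date math - a sketch; NOW/MINUTE rounds down to the minute, so the generated filter string (and with it the filterCache entry) stays stable for a full minute while keeping roughly the 30-second lag:

    <lst name="appends">
      <str name="fq">timestamp:[* TO NOW/MINUTE-30SECONDS]</str>
    </lst>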



Re: “ClientAbortException: java.io.IOException” in solr query error

2014-05-25 Thread Tirthankar
But this exception could be thrown by SolrJ, which is a client to the Solr
server. Isn't that possible?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/ClientAbortException-java-io-IOException-in-solr-query-error-tp4082321p4138093.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr - Cores not initialised

2014-05-25 Thread Manikandan Saravanan
Hi,

I’m running Solr 4.6.0 on an Ubuntu box. I recently made the following changes:

1. I edited Schema.xml to index my data by a column called timestamp.
2. I then ran the reload procedure as mentioned here
https://wiki.apache.org/solr/CoreAdmin#RELOAD (example call below)
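
A reload via the CoreAdmin API looks something like this (host, port, and core name assumed for illustration):

    http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0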

After that, when I restarted Solr, I got a big red alert saying the following:
core0: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: No 
such core: core0
core1: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: No 
such core: core1
If I try to visit http://:8983/solr/core0/admin or http://:8983/solr/core1/admin, I get this

This XML file does not appear to have any style information associated with it.
The document tree is shown below.



SolrCore 'core0' is not available due to init failure: No such core: core0

org.apache.solr.common.SolrException: SolrCore 'core0' is not available due to
init failure: No such core: core0
  at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:818)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:289)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
  at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
  at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
  at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
  at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
  at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
  at org.eclipse.jetty.server.Server.handle(Server.java:368)
  at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
  at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
  at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
  at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
  at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
  at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
  at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
  at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
  at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
  at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.solr.common.SolrException: No such core: core0
  at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:675)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction(CoreAdminHandler.java:717)
  at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:178)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
  at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:662)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
  ... 26 more

500



-- 
Manikandan Saravanan
Architect - Technology
TheSocialPeople