Re: Getting page number of result with tika

2013-04-13 Thread Erick Erickson
You can't assume that Fix Version/s 4.3 means anybody is actively working on it,
and the age of the patches suggests nobody is. The Fix Version/s gets updated
when releases are made; otherwise you'd have open JIRAs for, say, Solr 1.4.1.

Near as I can tell, that JIRA is dead; don't look for it unless
someone picks it up again.
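
For anyone who needs page numbers today, a common workaround is to index one
Solr document per page, extracting page text yourself with PDFBox instead of
sending the whole file through Tika's ExtractingRequestHandler. A minimal
sketch (not the SOLR-380 patch; the core URL and field names are assumptions):

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class PerPageIndexer {
      public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        PDDocument pdf = PDDocument.load(new File("manual.pdf"));
        try {
          PDFTextStripper stripper = new PDFTextStripper();
          for (int page = 1; page <= pdf.getNumberOfPages(); page++) {
            // Extract just this page's text.
            stripper.setStartPage(page);
            stripper.setEndPage(page);
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "manual.pdf-" + page); // unique per page
            doc.addField("file_s", "manual.pdf");     // lets you group hits by file
            doc.addField("page_i", page);             // the page number you want back
            doc.addField("text", stripper.getText(pdf));
            solr.add(doc);
          }
          solr.commit();
        } finally {
          pdf.close();
        }
      }
    }

Searching the text field then returns page-level hits, and page_i on each hit
is the page where the match occurred.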

Best
Erick

On Thu, Apr 11, 2013 at 11:55 AM, Gian Maria Ricci
 wrote:
> As far as I know, SOLR-380
> deals with the problem of knowing the page number with Tika indexing. The issue
> contains a patch, but it is really old, and I'm curious what the status of
> this issue is (since I see Fix Version/s 4.3, it seems that it will be
> implemented in the next version).
>
>
>
> Does anyone have a good workaround/patch/solution to search Tika-indexed
> documents and get the list of pages where a match was found?
>
>
>
> Thanks in advance.
>
>
>
> Gian Maria.
>
>
>


Re: Approximately needed RAM for 5000 query/second at a Solr machine?

2013-04-13 Thread Erick Erickson
bq: disk space is three times

True, I keep forgetting about compound since I never use it...
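
For anyone wondering where that setting lives: compound files are toggled in
solrconfig.xml under indexConfig. A sketch (false is the usual default; check
your own config):

    <indexConfig>
      <!-- When true, merged segments are repacked into a single .cfs file;
           that extra repacking step is what pushes worst-case disk use
           toward 3x during an optimize, versus roughly 2x otherwise. -->
      <useCompoundFile>false</useCompoundFile>
    </indexConfig>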

On Wed, Apr 10, 2013 at 11:05 AM, Walter Underwood
 wrote:
> Correct, except the worst case maximum for disk space is three times. --wunder
>
> On Apr 10, 2013, at 6:04 AM, Erick Erickson wrote:
>
>> You're mixing up disk and RAM requirements when you talk
>> about having twice the disk size. Solr does _NOT_ require
>> twice the index size of RAM to optimize, it requires twice
>> the size on _DISK_.
>>
>> In terms of RAM requirements, you need to create an index,
>> run realistic queries at the installation and measure.
>>
>> Best
>> Erick
>>
>> On Tue, Apr 9, 2013 at 10:32 PM, bigjust  wrote:
>>>
>>>
>>>
> On 4/9/2013 7:03 PM, Furkan KAMACI wrote:
>> These are really good metrics for me:
>> You say that RAM size should be at least index size, and it is
>> better to have a RAM size twice the index size (because of worst
>> case scenario).
>> On the other hand, let's assume that I have a RAM size that is
>> bigger than twice the index size at the machine. Can Solr use that extra
>> RAM, or is it an approximate maximum limit (to have twice the index
>> size at the machine)?
> What we have been discussing is the OS cache, which is memory that
> is not used by programs.  The OS uses that memory to make everything
> run faster.  The OS will instantly give that memory up if a program
> requests it.
> Solr is a java program, and java uses memory a little differently,
> so Solr most likely will NOT use more memory when it is available.
> In a "normal" directly executable program, memory can be allocated
> at any time, and given back to the system at any time.
> With Java, you tell it the maximum amount of memory the program is
> ever allowed to use.  Because of how memory is used inside Java,
> most long-running Java programs (like Solr) will allocate up to the
> configured maximum even if they don't really need that much memory.
> Most Java virtual machines will never give the memory back to the
> system even if it is not required.
> Thanks, Shawn
>
>
>>> Furkan KAMACI  writes:
>>>
 I am sorry but you said:

 *you need enough free RAM for the OS to cache the maximum amount of
 disk space all your indexes will ever use*

 I have made an assumption about the indexes on my machine. Let's assume
 the total is 5 GB. So it is better to have at least 5 GB of RAM? OK, Solr
 will use RAM up to however much I define for it as a Java process. When we
 think about the indexes on storage and the OS caching them in RAM, is that
 what you are talking about: having more than 5 GB - or - 10 GB of RAM for
 my machine?

 2013/4/10 Shawn Heisey 

>>>
>>> 10 GB.  Because when Solr shuffles the data around, it could use up to
>>> twice the size of the index in order to optimize the index on disk.
>>>
>>> -- Justin
>
> --
> Walter Underwood
> wun...@wunderwood.org
>
>
>


Re: Use of SolrJettyTestBase

2013-04-13 Thread Erick Erickson
I don't see anything obvious. Can you set a breakpoint in any other
test and hit it? It's always worked for me if I set a breakpoint and
execute in debug mode...

Not much help,
Erick
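
(For reference, the assertion requirement mentioned below just means the JVM
running the tests needs the -ea flag. A hypothetical command line, classpath
elided, would be:

    java -ea -cp <test classpath> org.junit.runner.JUnitCore com.odoko.ArgPostActionTest

In an IDE, the equivalent is adding -ea to the VM arguments of the test run
configuration. Without it, the test framework fails before any @BeforeClass
or @Test method is reached, which can also look like breakpoints never being
hit.)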

On Thu, Apr 11, 2013 at 5:01 PM, Upayavira  wrote:
> On Tue, Apr 2, 2013, at 12:21 AM, Chris Hostetter wrote:
>> : I've subclassed SolrJettyTestBase, and added a test method (annotated
>> : with @test). However, my test method is never called. I see the
>>
>> You got an immediate failure from the test setup, because you don't have
>> assertions enabled in your JVM (the Lucene & Solr test frameworks both
>> require assertions enabled to run tests because so many important things
>> can't be sanity checked w/o them)...
>>
>> : Test class requires enabled assertions, enable globally (-ea) or for
>> : Solr/Lucene subpackages only: com.odoko.ArgPostActionTest
>>
>> FYI: in addition to that text being written to System.err, it would have
>> immediately been thrown as an Exception as well (see
>> TestRuleAssertionsRequired.java).
>
> So, I've finally found time to get past the enable assertions thingie.
> I've got that sorted. But my test still doesn't stop at breakpoints.
>
> I've got this:
>
> public class ArgPostActionTest extends SolrJettyTestBase {
>
>   @BeforeClass
>   public static void beforeTest() throws Exception {
> createJetty(ExternalPaths.EXAMPLE_HOME, null, null);
>   }
>
>   @Test
>   public void testArgPostAction() throws SolrServerException {
>   blah.blah.blah
>   assertEquals(response.getResults().getNumFound(), 1);
>   }
> }
>
> Neither of these methods gets called when I execute the test. Any ideas
> what's up?
>
> Upayavira


Re: Not able to replicate the solr 3.5 indexes to solr 4.2 indexes

2013-04-13 Thread Erick Erickson
Please make a JIRA and attach as a patch if there aren't any JIRAs
for this yet.

Best
Erick

On Fri, Apr 12, 2013 at 1:58 AM, Montu v Boda
 wrote:
> hi
>
> thanks for your reply.
>
> Is anyone going to fix this issue in a new Solr version? There are many
> people facing the same problem while upgrading the Solr index from 3.5.0 to
> 4.2.
>
> Thanks & Regards
> Montu v Boda
>
>
>


Re: solr 3.4: memory leak?

2013-04-13 Thread Dmitry Kan
Hi André,

Thanks a lot for your response and the relevant information.

Indeed, we have noticed similar behavior when hot-reloading a web app
with Solr after changing some of the classes. The only bad consequence,
which luckily does not happen too often, is that the web app becomes
stale. So we prefer to (re)deploy via a Tomcat restart.

Thanks,

Dmitry

On Thu, Apr 11, 2013 at 6:01 PM, Andre Bois-Crettez
wrote:

> On 04/11/2013 08:49 AM, Dmitry Kan wrote:
>
>> SEVERE: The web application [/solr] appears to have started a thread named
>> [MultiThreadedHttpConnectionManager cleanup] but has failed to stop
>> it.
>> This is very likely to create a memory leak.
>> Apr 11, 2013 6:38:14 AM org.apache.catalina.loader.WebappClassLoader
>> clearThreadLocalMap
>>
>
> To my understanding, this kind of leak is only a problem if the Java code
> is *reloaded* while the Tomcat JVM is not stopped.
> For example, when reloadable="true" in the Context of the web application
> and you change files in WEB-INF or the .war: what would happen is that each
> existing ThreadLocal would continue to live (potentially holding
> references to other stuff and preventing GC) while new ThreadLocals are
> created.
>
> http://wiki.apache.org/tomcat/MemoryLeakProtection
>
> If you stop tomcat entirely each time, you should be safe.
>
>
> --
> André Bois-Crettez
>
> Search technology, Kelkoo
> http://www.kelkoo.com/
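
For anyone who does hot-reload the webapp in place, a common mitigation (a
sketch of the general pattern, not something from this thread) is a
ServletContextListener, registered in web.xml, that shuts the offending
connection manager down when the context is destroyed:

    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;

    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;

    public class HttpClientShutdownListener implements ServletContextListener {
      public void contextInitialized(ServletContextEvent sce) {
        // Nothing to do at startup.
      }

      public void contextDestroyed(ServletContextEvent sce) {
        // Stops the "MultiThreadedHttpConnectionManager cleanup" thread
        // named in the warning above, so a reload does not leak it.
        MultiThreadedHttpConnectionManager.shutdownAll();
      }
    }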


Re: Not able to replicate the solr 3.5 indexes to solr 4.2 indexes

2013-04-13 Thread Umesh Prasad
Hi Erick,
I have already created a JIRA and also attached a patch, but no unit
tests; my local build is failing (building from the Solr 4.2.1 source jar).
Please see
https://issues.apache.org/jira/browse/SOLR-4703
--
Umesh


On Sat, Apr 13, 2013 at 7:24 PM, Erick Erickson wrote:

> Please make a JIRA and attach as a patch if there aren't any JIRAs
> for this yet.
>
> Best
> Erick
>
> On Fri, Apr 12, 2013 at 1:58 AM, Montu v Boda
>  wrote:
> > hi
> >
> > thanks for your reply.
> >
> > Is anyone going to fix this issue in a new Solr version? There are many
> > people facing the same problem while upgrading the Solr index from 3.5.0
> > to 4.2.
> >
> > Thanks & Regards
> > Montu v Boda
> >
> >
> >
>



-- 
---
Thanks & Regards
Umesh Prasad


CloudSolrServer vs ConcurrentUpdateSolrServer for indexing

2013-04-13 Thread J Mohamed Zahoor
Hi

This question has come up many times in the list with lots of variations (which 
confuses me a lot).

I am using Solr 4.1: one collection, 6 shards, 6 machines.
I am using CloudSolrServer inside each mapper to index my documents. While it
is working fine, I am trying to improve the indexing performance.


My questions are:

1) is CloudSolrServer multiThreaded?

2) Will using ConcurrentUpdateSolrServer increase indexing performance?

./Zahoor
 

Re: CloudSolrServer vs ConcurrentUpdateSolrServer for indexing

2013-04-13 Thread Mark Miller

On Apr 13, 2013, at 11:07 AM, J Mohamed Zahoor  wrote:

> Hi
> 
> This question has come up many times in the list with lots of variations 
> (which confuses me a lot).
> 
> I am using Solr 4.1: one collection, 6 shards, 6 machines.
> I am using CloudSolrServer inside each mapper to index my documents. While
> it is working fine, I am trying to improve the indexing performance.
> 
> 
> My questions are:
> 
> 1) is CloudSolrServer multiThreaded?

No. The proper fast way to use it is to start many threads that all add docs to
the same CloudSolrServer instance. In other words, currently, you must do the
multithreading yourself. CloudSolrServer is "thread safe".
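
A minimal sketch of that pattern (the zkHost string, thread count, and field
names are placeholders, and real code needs proper error handling):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
      public static void main(String[] args) throws Exception {
        // One shared instance; CloudSolrServer is thread safe.
        final CloudSolrServer server =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int t = 0; t < 8; t++) {
          final int threadId = t;
          pool.submit(new Runnable() {
            public void run() {
              try {
                for (int i = 0; i < 10000; i++) {
                  SolrInputDocument doc = new SolrInputDocument();
                  doc.addField("id", threadId + "-" + i);
                  doc.addField("text", "document body " + i);
                  server.add(doc); // all threads share the same instance
                }
              } catch (Exception e) {
                e.printStackTrace();
              }
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
        server.shutdown();
      }
    }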

> 
> 2) Will using ConcurrentUpdateSolrServer increase indexing performance?

Yes, but at the cost of having to specify a server to talk to - if it goes
down, so does your indexing. It's also not great at reporting errors.
Finally, using multiple threads and CloudSolrServer, you can approach the
performance of ConcurrentUpdateSolrServer.

- Mark

> 
> ./Zahoor



Re: Easier way to do this?

2013-04-13 Thread William Bell
OK, is d in degrees or miles?


On Fri, Apr 12, 2013 at 10:20 PM, David Smiley (@MITRE.org) <
dsmi...@mitre.org> wrote:

> Bill,
>
> I responded to the issue you created about this:
> https://issues.apache.org/jira/browse/SOLR-4704
>
> In summary, use {!geofilt}.
>
> ~ David
>
>
> Billnbell wrote
> > I would love for the Solr 4 spatial support to accept pt so that I can run
> > # of results around a central point easily, like in 3.6. How can I pass
> > parameters to a Circle()? I would love to send pt to this query, since the
> > pt is the same across multiple areas.
> >
> > For example:
> >
> >
> http://localhost:8983/solr/core/select?rows=0&q=*:*&facet=true
>   &facet.query={!key=.5}store_geohash:%22Intersects(Circle(26.012156,-80.311943%20d=.0072369))%22
>   &facet.query={!key=1}store_geohash:%22Intersects(Circle(26.012156,-80.311943%20d=.01447))%22
>   &facet.query={!key=5}store_geohash:%22Intersects(Circle(26.012156,-80.311943%20d=.0723))%22
>   &facet.query={!key=10}store_geohash:%22Intersects(Circle(26.012156,-80.311943%20d=.1447))%22
>   &facet.query={!key=25}store_geohash:%22Intersects(Circle(26.012156,-80.311943%20d=.361846))%22
>   &facet.query={!key=50}store_geohash:%22Intersects(Circle(26.012156,-80.311943%20d=.72369))%22
>   &facet.query={!key=100}store_geohash:%22Intersects(Circle(26.012156,-80.311943%20d=1.447))%22
>
>
>
>
>
> -
>  Author:
> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Easier way to do this?

2013-04-13 Thread David Smiley (@MITRE.org)
Good question.  With geofilt it's kilometers.
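
So Bill's mile-based rings need converting. A sketch reusing his field and
point, where d is in kilometers (the d values are my mile-to-km conversions
at roughly 1.609 km per mile):

    &facet=true
    &facet.query={!geofilt key=1 sfield=store_geohash pt=26.012156,-80.311943 d=1.609}
    &facet.query={!geofilt key=5 sfield=store_geohash pt=26.012156,-80.311943 d=8.047}
    &facet.query={!geofilt key=10 sfield=store_geohash pt=26.012156,-80.311943 d=16.093}
    &facet.query={!geofilt key=25 sfield=store_geohash pt=26.012156,-80.311943 d=40.234}

and so on for the remaining radii. The pt parameter only has to be written
once if you pass it as a top-level request parameter instead of a local one.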



-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book


Re: Basic auth on SolrCloud /admin/* calls

2013-04-13 Thread Tim Vaillancourt

This JIRA covers a lot of what you're asking:

https://issues.apache.org/jira/browse/SOLR-4470

I am also trying to get this sort of solution in place, but it seems to
be dying off a bit. Hopefully we can get some interest in this again;
this question comes up every few weeks, it seems.


I can confirm the latest patch from this JIRA works as expected,
although my primary concern is that the credentials appear on the JVM
command line, and I'd like to move them to a file.


Cheers,

Tim

On 11/04/13 10:41 AM, Michael Della Bitta wrote:

It's fairly easy to lock down Solr behind basic auth using just the
servlet container it's running in, but the problem becomes letting
services that *should* be able to access Solr in. I've rolled with
basic auth in some setups, but certain deployments such as Solr Cloud
or sharded setups don't play well with auth because there's no good
way to configure them to use it.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Apr 11, 2013 at 1:19 PM, Raymond Wiker  wrote:

On Apr 11, 2013, at 17:12 , adfel70  wrote:

Hi
I need to implement security in solr as follows:
1. prevent unauthorized users from accessing the Solr admin pages.
2. prevent unauthorized users from performing solr operations - both /admin
and /update.


Is the conclusion of this thread that this is not possible at the moment?


The "obvious" solution (to me, at least) would be to (1) restrict access to solr to 
localhost, and (2) use a reverse proxy (e.g, apache) on the same node to provide 
authenticated&  restricted access to solr. I think I've seen recipes for (1), somewhere, 
and I've used (2) fairly extensively for similar purposes.
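
A sketch of that two-step setup (port, paths, and realm are assumptions, not
a tested recipe): first bind the example Jetty to localhost only, e.g.

    java -Djetty.host=127.0.0.1 -jar start.jar

(assuming your jetty.xml reads the jetty.host system property, as the stock
example does), then front it with Apache httpd using mod_proxy and basic auth:

    <Location /solr>
      ProxyPass http://127.0.0.1:8983/solr
      ProxyPassReverse http://127.0.0.1:8983/solr
      AuthType Basic
      AuthName "Solr"
      AuthUserFile /etc/apache2/solr.htpasswd
      Require valid-user
    </Location>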


Re: Approximately needed RAM for 5000 query/second at a Solr machine?

2013-04-13 Thread Furkan KAMACI
Hi Jack;

Since I am new to Solr, can you explain these two things that you said:

1) When most people say "index size" they are referring to all fields
collectively, not individual fields. (What do you mean by "segments are on
a per-field basis", and by all fields vs. individual fields?)
2) More cores might make the worst-case scenario worse, since it will
maximize the amount of data processed at a given moment.


2013/4/13 Erick Erickson 

> bq: disk space is three times
>
> True, I keep forgetting about compound since I never use it...
>
> On Wed, Apr 10, 2013 at 11:05 AM, Walter Underwood
>  wrote:
> > Correct, except the worst case maximum for disk space is three times.
> --wunder
> >
> > On Apr 10, 2013, at 6:04 AM, Erick Erickson wrote:
> >
> >> You're mixing up disk and RAM requirements when you talk
> >> about having twice the disk size. Solr does _NOT_ require
> >> twice the index size of RAM to optimize, it requires twice
> >> the size on _DISK_.
> >>
> >> In terms of RAM requirements, you need to create an index,
> >> run realistic queries at the installation and measure.
> >>
> >> Best
> >> Erick
> >>
> >> On Tue, Apr 9, 2013 at 10:32 PM, bigjust  wrote:
> >>>
> >>>
> >>>
> > On 4/9/2013 7:03 PM, Furkan KAMACI wrote:
> >> These are really good metrics for me:
> >> You say that RAM size should be at least index size, and it is
> >> better to have a RAM size twice the index size (because of worst
> >> case scenario).
> >> On the other hand, let's assume that I have a RAM size that is
> >> bigger than twice the index size at the machine. Can Solr use that extra
> >> RAM, or is it an approximate maximum limit (to have twice the index
> >> size at the machine)?
> > What we have been discussing is the OS cache, which is memory that
> > is not used by programs.  The OS uses that memory to make everything
> > run faster.  The OS will instantly give that memory up if a program
> > requests it.
> > Solr is a java program, and java uses memory a little differently,
> > so Solr most likely will NOT use more memory when it is available.
> > In a "normal" directly executable program, memory can be allocated
> > at any time, and given back to the system at any time.
> > With Java, you tell it the maximum amount of memory the program is
> > ever allowed to use.  Because of how memory is used inside Java,
> > most long-running Java programs (like Solr) will allocate up to the
> > configured maximum even if they don't really need that much memory.
> > Most Java virtual machines will never give the memory back to the
> > system even if it is not required.
> > Thanks, Shawn
> >
> >
> >>> Furkan KAMACI  writes:
> >>>
>  I am sorry but you said:
> 
>  *you need enough free RAM for the OS to cache the maximum amount of
>  disk space all your indexes will ever use*
> 
>  I have made an assumption about the indexes on my machine. Let's assume
>  the total is 5 GB. So it is better to have at least 5 GB of RAM? OK, Solr
>  will use RAM up to however much I define for it as a Java process. When we
>  think about the indexes on storage and the OS caching them in RAM, is that
>  what you are talking about: having more than 5 GB - or - 10 GB of RAM for
>  my machine?
> 
>  2013/4/10 Shawn Heisey 
> 
> >>>
> >>> 10 GB.  Because when Solr shuffles the data around, it could use up to
> >>> twice the size of the index in order to optimize the index on disk.
> >>>
> >>> -- Justin
> >
> > --
> > Walter Underwood
> > wun...@wunderwood.org
> >
> >
> >
>


Re: Which tokenizer or analizer should use and field type

2013-04-13 Thread anurag.jain
I tried both ways:
(project AND assistant) OR manager

"project assistant"~5 OR manager

Both are working properly,
but I have a problem.

If I give the query "projec assistant", then it is not able to find anything.

And what is the meaning of ~5?

If I write *projec assistant* then it is able to find results, but it gives
"project" or "assistant".

My objective is to search like the MySQL LIKE operator: %search word%.

How do I write a query which behaves exactly like the MySQL LIKE operator?

Thanks

Need help as soon as possible.








Re: Which tokenizer or analizer should use and field type

2013-04-13 Thread anurag.jain
Hi, if you can help me with this, it will solve my problem.

keyword:(*assistant AND coach*) gives me 1 result.

keyword:(*iit AND kanpur*) gives me 2 results.

But the query:

keyword:(*assistant AND coach* OR (*iit AND kanpur*)) gives me only 1
result.

I also tried keyword:(*assistant AND coach* OR (*:* *iit AND kanpur*)),
which gives me only 1 result. I don't know why.

What should the query look like? Please help me find a solution.

Thanks in advance.







Is any way to return the number of indexed tokens in a field?

2013-04-13 Thread Alexandre Rafalovitch
Hello,

We seem to have all sorts of functions around tokenized field content, but
I am looking for a simple count/length that can be returned as a
pseudo-field. Does anyone know of one out of the box?

The specific situation is that I am indexing a field for specific regular
expressions that become tokens (in a copyField). Not every document has the
same number of those.

I now want to find the documents that have the maximum number of tokens in
that field (for testing and review). But I can't figure out how. Any help
would be appreciated.

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Solr Thrift APIs

2013-04-13 Thread Kiran J
Hi,

Is it possible to access Solr through Thrift APIs?

Thanks