bug in search with sloppy queries

2015-06-14 Thread Dmitry Kan
Hi guys,

We observe a strange bug in Solr 4.10.2, whereby a sloppy query hits
words it should not:

the "e commerce"the "e commerce"SpanNearQuery(spanNear([Contents:the,
spanNear([Contents:eä, Contents:commerceä], 0, true)], 300,
false))spanNear([Contents:the,
spanNear([Contents:eä, Contents:commerceä], 0, true)], 300, false)


This query produces words as hits, like:

From E-Tail

In the inner spanNear query we expect that e and commerce will occur within
0 slop in that order.
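
In Lucene terms, the parsed query above corresponds to roughly the following
construction (a minimal sketch: the field name and slops are taken from the
debug output, and the trailing "ä" markers on the terms are left out):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class SloppySpanExample {
    public static void main(String[] args) {
        // Inner clause: "e" immediately followed by "commerce" (slop 0, in order).
        SpanQuery inner = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("Contents", "e")),
                new SpanTermQuery(new Term("Contents", "commerce"))
        }, 0, true);

        // Outer clause: "the" within 300 positions of the inner span, in any order.
        SpanQuery query = new SpanNearQuery(new SpanQuery[] {
                new SpanTermQuery(new Term("Contents", "the")),
                inner
        }, 300, false);

        // Prints: spanNear([Contents:the, spanNear([Contents:e, Contents:commerce], 0, true)], 300, false)
        System.out.println(query);
    }
}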

Can somebody shed light into what is going on?

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Phrase query gets converted to SpanNear with slop 1 instead of 0

2015-06-14 Thread ariya bala
Hi,

I have encountered this peculiar case with Solr 4.10.2 where the parsed query
doesn't seem logical.

PHRASE23("reduce workforce") ==>
SpanNearQuery(spanNear([spanNear([Contents:reduceä,
Contents:workforceä], 1, true)], 23, true))

The question is why the phrase ("quoted string") gets converted to a
SpanNear with slop 1 rather than 0.

-- 
*Ariya *


Re: What's wrong

2015-06-14 Thread Test Test
Re,
Thanks for your reply.
I mocked my parser like this:

@Override
public Query parse() {
    SpanQuery[] clauses = new SpanQuery[2];
    clauses[0] = new SpanTermQuery(new Term("details", "london"));
    clauses[1] = new SpanTermQuery(new Term("details", "city"));
    return new SpanNearQuery(clauses, 1, true);
}

Thus I have a query like this: spanNear([details:london, details:city], 1, true).
If I run, for example, spanNear([details:london], 1, true) or
spanNear([details:city], 1, true), I get my document. I have already added the
parameter q.op="OR"; it doesn't work.



On Saturday, June 13, 2015 at 5:21 PM, Jack Krupansky wrote:

What does your exact query parameter look like? The parentheses in
your message make it unclear.

You have a comma in your query as if you expect this has some functional
purpose. Technically, it should get analyzed away, but why did you include
it?

Do any queries find that document, or do all other queries find it and only
this one fails to find it?

Are you sure that you committed the document?

Does a query by id find the document?

Does your <field> definition for details have indexed="true"?


-- Jack Krupansky

On Sat, Jun 13, 2015 at 5:54 AM, Test Test  wrote:

> Hi,
> I have a Solr document composed like this, with 2 fields: id = 1, details =
> "London is the capital and most-populous city of United Kingdom."
> When I request Solr with this parameter (details:london, details:city), I
> don't get the document. The "details" field is of type "text_general":
>
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> What's wrong?


  

Limitation on Collections Number

2015-06-14 Thread Arnon Yogev
We're running some tests on Solr and would like to have a deeper 
understanding of its limitations.

Specifically, we have tens of millions of documents (say 50M) and are 
comparing several "#collections X #docs_per_collection" configurations.
For example, we could have a single collection with 50M docs or 5000 
collections with 10K docs each.
When trying to create the 5000 collections, we start getting frequent 
errors after 1000-1500 collections have been created. Feels like some 
limit has been reached.
These tests are done on a single node + an additional node for replica.

Can someone elaborate on what could limit Solr to a high number of 
collections (if at all)?
i.e. if we wanted to have 5K or 10K (or 100K) collections, is there 
anything in Solr that can prevent it? Where would it break?

Thanks,
Arnon

Integrating Solr 5.2.0 with nutch 1.10

2015-06-14 Thread kunal chakma
Hi,
 I am very new to the Nutch and Solr platforms. I have been trying a
lot to integrate Solr 5.2.0 with Nutch 1.10 but have not been able to do so. I have
followed all the steps mentioned on the Nutch 1.x tutorial page, but when I
execute the following command,

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
crawl/linkdb/ crawl/segments/20150613164847/ -filter -normalize

I get the following errors
Indexer: starting at 2015-06-14 19:05:28
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication


Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)

Please help me in resolving the issue.

 *With regards,*

*KUNAL CHAKMA*
Computer Science & Engineering Department
National Institute of Technology Agartala
Jirania-799055,
Agartala,Tripura
India




Re: Limitation on Collections Number

2015-06-14 Thread Jack Krupansky
As a general rule, there are only two ways that Solr scales to large
numbers: large number of documents and moderate number of nodes (shards and
replicas). All other parameters should be kept relatively small, like
dozens or low hundreds. Even shards and replicas should probably be kept down
to that same guidance of dozens or low hundreds.

Tens of millions of documents should be no problem. I recommend 100 million
as the rough limit of documents per node. Of course it all depends on your
particular data model and data and hardware and network, so that number
could be smaller or larger.

The main guidance has always been to simply do a proof of concept
implementation to test for your particular data model and data values.

-- Jack Krupansky

On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev  wrote:

> We're running some tests on Solr and would like to have a deeper
> understanding of its limitations.
>
> Specifically, We have tens of millions of documents (say 50M) and are
> comparing several "#collections X #docs_per_collection" configurations.
> For example, we could have a single collection with 50M docs or 5000
> collections with 10K docs each.
> When trying to create the 5000 collections, we start getting frequent
> errors after 1000-1500 collections have been created. Feels like some
> limit has been reached.
> These tests are done on a single node + an additional node for replica.
>
> Can someone elaborate on what could limit Solr to a high number of
> collections (if at all)?
> i.e. if we wanted to have 5K or 10K (or 100K) collections, is there
> anything in Solr that can prevent it? Where would it break?
>
> Thanks,
> Arnon


Re: What's wrong

2015-06-14 Thread Jack Krupansky
Why don't you take a step back and tell us what you are really trying to do.

Try using a normal Solr query parser first, to verify that the data is
analyzed as expected.

Did you try using the surround query parser? It supports span queries.

Your span query appears to require that the two terms appear in order and
with no more than one other term between them, but your data has more than
one term between them, so of course that will not match.

You can simulate a span query using sloppy phrases in the normal Solr query
parsers: "london city"~10. The only catch is that does not require that the
terms be in that order.
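
From SolrJ the same workaround would look roughly like this (a sketch only; it
assumes an HttpSolrServer -- HttpSolrClient in newer SolrJ -- pointed at your
core, and the details field from your schema):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SloppyPhraseExample {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Sloppy phrase: both terms must occur within 10 positions of each other, any order.
        SolrQuery q = new SolrQuery("details:\"london city\"~10");
        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}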

-- Jack Krupansky

On Sun, Jun 14, 2015 at 6:45 AM, Test Test  wrote:

> Re,
> Thanks for your reply.
> I mock my parser like this :
> @Overridepublic Query parse() {  SpanQuery[] clauses = new
> SpanQuery[2];   clauses[0] = new SpanTermQuery(new Term("details",
> "london"));   clauses[1] = new SpanTermQuery(new Term("details",
> "city"));  return new SpanNearQuery(clauses, 1, true); }
> Thus i have a query like this spanNear([details:london, details:city], 1,
> true)
> If i do for example spanNear([details:london], 1, true)
> or spanNear([details:city], 1, true) i get my document.I have already add
> the parameter q.op = "OR", it doesn't work.
>
>
>
> On Saturday, June 13, 2015 at 5:21 PM, Jack Krupansky <
> jack.krupan...@gmail.com> wrote:
>
>
>  What does your exact query parameter look like? The parentheses in
> your message make it unclear.
>
> You have a comma in your query as if you expect this has some functional
> purpose. Technically, it should get analyzed away, but why did you include
> it?
>
> Do any queries find that document, or do all other queries find it and only
> this one fails to find it?
>
> Are you sure that you committed the document?
>
> Does a query by id find the document?
>
> Does your <field> definition for details have indexed="true"?
>
>
> -- Jack Krupansky
>
> On Sat, Jun 13, 2015 at 5:54 AM, Test Test  wrote:
>
> > Hi,
> > I have solr document, composed like this, with 2 fields : id = 1details =
> > "London is the capital and most-populous city of United Kingdom."
> > When i request solr with this parameter (details:london, details:city), i
> > don't get the document.The "details" field is a type "text_general"
> > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> > What's wrong?
>
>
>
>


Re: Limitation on Collections Number

2015-06-14 Thread Shai Erera
Thanks Jack for your response. But I think Arnon's question was different.

If you need to index 10,000 different collections of documents in Solr (say
a collection denotes someone's Dropbox files), then you have two options:
index all collections in one Solr collection, and add a field like
collectionID to each document and query, or index each user's private
collection in a different Solr collection.

The pros of the latter are that you don't need to add a collectionID filter
to each query. Also, from a security/privacy standpoint (and search quality),
a user can only ever search what he has access to -- e.g. he cannot get a
spelling correction for words he never saw in his documents, nor document
suggestions (even though the 'context' feature in some of the Lucene suggesters
allows one to do that too). From a quality standpoint you don't mix different
term statistics, etc.
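
To make that concrete, the single-collection alternative means every query
carries an owner filter, along these lines (just a sketch; collectionID is the
example field from above and user42 is a made-up key):

import org.apache.solr.client.solrj.SolrQuery;

public class OwnerFilterExample {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("quarterly report");
        // Single-collection approach: restrict every query to the requesting user's documents.
        q.addFilterQuery("collectionID:user42");
        System.out.println(q);
    }
}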

So from a single node's point of view, you can either index 100M documents
in one index (Collection, shard, replica -- whatever -- a single Solr
core), or in 10,000 such cores. From node capacity perspectives the two are
the same -- same amount of documents will be indexed overall, same query
workload etc.

So the question is purely about Solr and its collections management -- is
there anything in that process that can prevent one from managing thousands
of collections on a single node, or within a single SolrCloud instance? If
so, what is it -- are these the ZK watchers? Is there a thread per
collection at work? Others?

Shai

On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky 
wrote:

> As a general rule, there are only two ways that Solr scales to large
> numbers: large number of documents and moderate number of nodes (shards and
> replicas). All other parameters should be kept relatively small, like
> dozens or low hundreds. Even shards and replicas should probably kept down
> to that same guidance of dozens or low hundreds.
>
> Tens of millions of documents should be no problem. I recommend 100 million
> as the rough limit of documents per node. Of course it all depends on your
> particular data model and data and hardware and network, so that number
> could be smaller or larger.
>
> The main guidance has always been to simply do a proof of concept
> implementation to test for your particular data model and data values.
>
> -- Jack Krupansky
>
> On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev  wrote:
>
> > We're running some tests on Solr and would like to have a deeper
> > understanding of its limitations.
> >
> > Specifically, We have tens of millions of documents (say 50M) and are
> > comparing several "#collections X #docs_per_collection" configurations.
> > For example, we could have a single collection with 50M docs or 5000
> > collections with 10K docs each.
> > When trying to create the 5000 collections, we start getting frequent
> > errors after 1000-1500 collections have been created. Feels like some
> > limit has been reached.
> > These tests are done on a single node + an additional node for replica.
> >
> > Can someone elaborate on what could limit Solr to a high number of
> > collections (if at all)?
> > i.e. if we wanted to have 5K or 10K (or 100K) collections, is there
> > anything in Solr that can prevent it? Where would it break?
> >
> > Thanks,
> > Arnon
>


Re: bug in search with sloppy queries

2015-06-14 Thread Erick Erickson
My guess is that you have WordDelimiterFilterFactory in your
analysis chain with parameters that break up E-Tail into both "e" and "tail" _and_
put them in the same position. This assumes that the result fragment
you pasted is incomplete and "commerce" is in it:

From E-Tail commerce

or some such. Try the admin/analysis screen with the "verbose" box checked
and you'll see the position of each token after analysis to see if my guess
is accurate.
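
If you'd rather check positions from code than from the admin screen, a
minimal Lucene sketch like the one below prints each token with its position.
The StandardAnalyzer here is only a stand-in for whatever analysis chain your
Contents field actually uses:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class PrintTokenPositions {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(); // stand-in for the field's real chain
        try (TokenStream ts = analyzer.tokenStream("Contents", new StringReader("From E-Tail commerce"))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);
            ts.reset();
            int pos = -1;
            while (ts.incrementToken()) {
                // An increment of 0 means "same position as the previous token".
                pos += posIncr.getPositionIncrement();
                System.out.println(pos + "\t" + term.toString());
            }
            ts.end();
        }
    }
}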

Best,
Erick

On Sun, Jun 14, 2015 at 4:34 AM, Dmitry Kan  wrote:
> Hi guys,
>
> We observe some strange bug in solr 4.10.2, where by a sloppy query hits
> words it should not:
>
> the "e commerce" name="querystring">the "e commerce" name="parsedquery">SpanNearQuery(spanNear([Contents:the,
> spanNear([Contents:eä, Contents:commerceä], 0, true)], 300,
> false))spanNear([Contents:the,
> spanNear([Contents:eä, Contents:commerceä], 0, true)], 300, false)
>
>
> This query produces words as hits, like:
>
> From E-Tail
>
> In the inner spanNear query we expect that e and commerce will occur within
> 0 slop in that order.
>
> Can somebody shed light into what is going on?
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info


Re: Integrating Solr 5.2.0 with nutch 1.10

2015-06-14 Thread Erick Erickson
No clue, you'd probably have better luck on the Nutch user's list
unless there are _Solr_ errors. Does your Solr log show any errors?


Best,
Erick

On Sun, Jun 14, 2015 at 6:49 AM, kunal chakma  wrote:
> Hi,
>  I am very new to the nutch and solr plateform. I have been trying a
> lot to integrate Solr 5.2.0 with nutch 1.10 but not able to do so. I have
> followed all the steps mentioned at nutch 1.x tutorial page but when I
> execute the following command ,
>
> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
> crawl/linkdb/ crawl/segments/20150613164847/ -filter -normalize
>
> I get the following errors
> Indexer: starting at 2015-06-14 19:05:28
> Indexer: deleting gone documents: false
> Indexer: URL filtering: false
> Indexer: URL normalizing: false
> Active IndexWriters :
> SOLRIndexWriter
> solr.server.url : URL of the SOLR instance (mandatory)
> solr.commit.size : buffer size when sending to SOLR (default 1000)
> solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
> solr.auth : use authentication (default false)
> solr.auth.username : username for authentication
> solr.auth.password : password for authentication
>
>
> Indexer: java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:113)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:177)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:187)
>
> Please help me in resolving the issue.
>
>  *With regards,*
>
> *KUNAL CHAKMA*
> Computer Science & Engineering Department
> National Institute of Technology Agartala
> Jirania-799055,
> Agartala,Tripura
> India
>


Re: Limitation on Collections Number

2015-06-14 Thread Erick Erickson
To my knowledge there's nothing built in to Solr to limit the number
of collections. There's nothing explicitly in place to handle
many hundreds of collections either so you're really in uncharted,
certainly untested waters. Anecdotally we've heard of the problem
you're describing.

You say you start seeing errors. What are they? OOMs? deadlocks?

If you are _not_ in SolrCloud, then there's the "Lots of cores" solution,
see: http://wiki.apache.org/solr/LotsOfCores. Pay attention to the
warning at the top: NOT FOR SOLRCLOUD!

Also note that the "lots of cores" option really is built for the pattern
where a particular core is searched sporadically. Indexing dropbox
files is a good example. A user may sign on and search her documents
just a few times a day, for a few minutes at a time. Because cores
are loaded/unloaded on demand, supporting
many hundreds of simultaneous users would cause a lot of core
loading/unloading and impact performance.

Best,
Erick

On Sun, Jun 14, 2015 at 8:00 AM, Shai Erera  wrote:
> Thanks Jack for your response. But I think Arnon's question was different.
>
> If you need to index 10,000 different collection of documents in Solr (say
> a collection denotes someone's Dropbox files), then you have two options:
> index all collections in one Solr collection, and add a field like
> collectionID to each document and query, or index each user's private
> collection in a different Solr collection.
>
> The pros of the latter is that you don't need to add a collectionID filter
> to each query. Also from a security/privacy standpoint (and search quality)
> - a user can only ever search what he has access to -- e.g. it cannot get a
> spelling correction for words he never saw in his documents, nor document
> suggestions (even though the 'context' in some of Lucene suggesters allow
> one to do that too). From a quality standpoint you don't mix different term
> statistics etc.
>
> So from a single node's point of view, you can either index 100M documents
> in one index (Collection, shard, replica -- whatever -- a single Solr
> core), or in 10,000 such cores. From node capacity perspectives the two are
> the same -- same amount of documents will be indexed overall, same query
> workload etc.
>
> So the question is purely about Solr and its collections management -- is
> there anything in that process that can prevent one from managing thousands
> of collections on a single node, or within a single SolrCloud instance? If
> so, what is it -- are these the ZK watchers? Is there a thread per
> collection at work? Others?
>
> Shai
>
> On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky 
> wrote:
>
>> As a general rule, there are only two ways that Solr scales to large
>> numbers: large number of documents and moderate number of nodes (shards and
>> replicas). All other parameters should be kept relatively small, like
>> dozens or low hundreds. Even shards and replicas should probably kept down
>> to that same guidance of dozens or low hundreds.
>>
>> Tens of millions of documents should be no problem. I recommend 100 million
>> as the rough limit of documents per node. Of course it all depends on your
>> particular data model and data and hardware and network, so that number
>> could be smaller or larger.
>>
>> The main guidance has always been to simply do a proof of concept
>> implementation to test for your particular data model and data values.
>>
>> -- Jack Krupansky
>>
>> On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev  wrote:
>>
>> > We're running some tests on Solr and would like to have a deeper
>> > understanding of its limitations.
>> >
>> > Specifically, We have tens of millions of documents (say 50M) and are
>> > comparing several "#collections X #docs_per_collection" configurations.
>> > For example, we could have a single collection with 50M docs or 5000
>> > collections with 10K docs each.
>> > When trying to create the 5000 collections, we start getting frequent
>> > errors after 1000-1500 collections have been created. Feels like some
>> > limit has been reached.
>> > These tests are done on a single node + an additional node for replica.
>> >
>> > Can someone elaborate on what could limit Solr to a high number of
>> > collections (if at all)?
>> > i.e. if we wanted to have 5K or 10K (or 100K) collections, is there
>> > anything in Solr that can prevent it? Where would it break?
>> >
>> > Thanks,
>> > Arnon
>>


Re: Limitation on Collections Number

2015-06-14 Thread Jack Krupansky
My answer remains the same - a large number of collections (cores) in a
single Solr instance is not one of the ways in which Solr is designed to
scale. To repeat, there are only two ways to scale Solr, number of
documents and number of nodes.



-- Jack Krupansky

On Sun, Jun 14, 2015 at 11:00 AM, Shai Erera  wrote:

> Thanks Jack for your response. But I think Arnon's question was different.
>
> If you need to index 10,000 different collection of documents in Solr (say
> a collection denotes someone's Dropbox files), then you have two options:
> index all collections in one Solr collection, and add a field like
> collectionID to each document and query, or index each user's private
> collection in a different Solr collection.
>
> The pros of the latter is that you don't need to add a collectionID filter
> to each query. Also from a security/privacy standpoint (and search quality)
> - a user can only ever search what he has access to -- e.g. it cannot get a
> spelling correction for words he never saw in his documents, nor document
> suggestions (even though the 'context' in some of Lucene suggesters allow
> one to do that too). From a quality standpoint you don't mix different term
> statistics etc.
>
> So from a single node's point of view, you can either index 100M documents
> in one index (Collection, shard, replica -- whatever -- a single Solr
> core), or in 10,000 such cores. From node capacity perspectives the two are
> the same -- same amount of documents will be indexed overall, same query
> workload etc.
>
> So the question is purely about Solr and its collections management -- is
> there anything in that process that can prevent one from managing thousands
> of collections on a single node, or within a single SolrCloud instance? If
> so, what is it -- are these the ZK watchers? Is there a thread per
> collection at work? Others?
>
> Shai
>
> On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky 
> wrote:
>
> > As a general rule, there are only two ways that Solr scales to large
> > numbers: large number of documents and moderate number of nodes (shards
> and
> > replicas). All other parameters should be kept relatively small, like
> > dozens or low hundreds. Even shards and replicas should probably kept
> down
> > to that same guidance of dozens or low hundreds.
> >
> > Tens of millions of documents should be no problem. I recommend 100
> million
> > as the rough limit of documents per node. Of course it all depends on
> your
> > particular data model and data and hardware and network, so that number
> > could be smaller or larger.
> >
> > The main guidance has always been to simply do a proof of concept
> > implementation to test for your particular data model and data values.
> >
> > -- Jack Krupansky
> >
> > On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev  wrote:
> >
> > > We're running some tests on Solr and would like to have a deeper
> > > understanding of its limitations.
> > >
> > > Specifically, We have tens of millions of documents (say 50M) and are
> > > comparing several "#collections X #docs_per_collection" configurations.
> > > For example, we could have a single collection with 50M docs or 5000
> > > collections with 10K docs each.
> > > When trying to create the 5000 collections, we start getting frequent
> > > errors after 1000-1500 collections have been created. Feels like some
> > > limit has been reached.
> > > These tests are done on a single node + an additional node for replica.
> > >
> > > Can someone elaborate on what could limit Solr to a high number of
> > > collections (if at all)?
> > > i.e. if we wanted to have 5K or 10K (or 100K) collections, is there
> > > anything in Solr that can prevent it? Where would it break?
> > >
> > > Thanks,
> > > Arnon
> >
>


Re: Limitation on Collections Number

2015-06-14 Thread Shalin Shekhar Mangar
Yes, there are some known problems while scaling to large number of
collections, say 1000 or above. See
https://issues.apache.org/jira/browse/SOLR-7191

On Sun, Jun 14, 2015 at 8:30 PM, Shai Erera  wrote:

> Thanks Jack for your response. But I think Arnon's question was different.
>
> If you need to index 10,000 different collection of documents in Solr (say
> a collection denotes someone's Dropbox files), then you have two options:
> index all collections in one Solr collection, and add a field like
> collectionID to each document and query, or index each user's private
> collection in a different Solr collection.
>
> The pros of the latter is that you don't need to add a collectionID filter
> to each query. Also from a security/privacy standpoint (and search quality)
> - a user can only ever search what he has access to -- e.g. it cannot get a
> spelling correction for words he never saw in his documents, nor document
> suggestions (even though the 'context' in some of Lucene suggesters allow
> one to do that too). From a quality standpoint you don't mix different term
> statistics etc.
>
> So from a single node's point of view, you can either index 100M documents
> in one index (Collection, shard, replica -- whatever -- a single Solr
> core), or in 10,000 such cores. From node capacity perspectives the two are
> the same -- same amount of documents will be indexed overall, same query
> workload etc.
>
> So the question is purely about Solr and its collections management -- is
> there anything in that process that can prevent one from managing thousands
> of collections on a single node, or within a single SolrCloud instance? If
> so, what is it -- are these the ZK watchers? Is there a thread per
> collection at work? Others?
>
> Shai
>
> On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky 
> wrote:
>
> > As a general rule, there are only two ways that Solr scales to large
> > numbers: large number of documents and moderate number of nodes (shards
> and
> > replicas). All other parameters should be kept relatively small, like
> > dozens or low hundreds. Even shards and replicas should probably kept
> down
> > to that same guidance of dozens or low hundreds.
> >
> > Tens of millions of documents should be no problem. I recommend 100
> million
> > as the rough limit of documents per node. Of course it all depends on
> your
> > particular data model and data and hardware and network, so that number
> > could be smaller or larger.
> >
> > The main guidance has always been to simply do a proof of concept
> > implementation to test for your particular data model and data values.
> >
> > -- Jack Krupansky
> >
> > On Sun, Jun 14, 2015 at 7:31 AM, Arnon Yogev  wrote:
> >
> > > We're running some tests on Solr and would like to have a deeper
> > > understanding of its limitations.
> > >
> > > Specifically, We have tens of millions of documents (say 50M) and are
> > > comparing several "#collections X #docs_per_collection" configurations.
> > > For example, we could have a single collection with 50M docs or 5000
> > > collections with 10K docs each.
> > > When trying to create the 5000 collections, we start getting frequent
> > > errors after 1000-1500 collections have been created. Feels like some
> > > limit has been reached.
> > > These tests are done on a single node + an additional node for replica.
> > >
> > > Can someone elaborate on what could limit Solr to a high number of
> > > collections (if at all)?
> > > i.e. if we wanted to have 5K or 10K (or 100K) collections, is there
> > > anything in Solr that can prevent it? Where would it break?
> > >
> > > Thanks,
> > > Arnon
> >
>



-- 
Regards,
Shalin Shekhar Mangar.


Re: Limitation on Collections Number

2015-06-14 Thread Shai Erera
>
> My answer remains the same - a large number of collections (cores) in a
> single Solr instance is not one of the ways in which Solr is designed to
> scale. To repeat, there are only two ways to scale Solr, number of
> documents and number of nodes.
>

Jack, I understand that, but I still feel you're missing the point. We
didn't ask about scaling Solr at all - it's a question about indexing
strategy when you need to index multiple disparate collections of documents
-- one collection w/ a collectionID field, or a Solr collection per set of
documents.

If you are _not_ in SolrCloud, then there's the "Lots of cores" solution,
> see: http://wiki.apache.org/solr/LotsOfCores. Pay attention to the
> warning at the top: NOT FOR SOLRCLOUD!
>

Thanks Erick. We did read this a while ago. We are in SolrCloud mode because
we want to keep a replica per collection and SolrCloud makes it easy for
us. However, we aren't in a real/common SolrCloud mode, where we just need
to index 1B documents and sharding + replication comes to our aid.

If we were not in a SolrCloud mode, I imagine we'd need to manage the
replicas ourselves and also index a document to both replicas manually?
That is, there is no way in _non_ SolrCloud mode to tell two cores that
they are replicas of one another - correct?

A user may sign on and search her documents
> just a few times a day, for a few minutes at a time.
>

This is almost true -- you may visit your Dropbox once an hour (or it may
be open in the background on your computer), but the server still receives
documents (e.g. shares) frequently from other users, and needs to index them into
your collection. Not saying this isn't a good fit, just mentioning that
it's not only the user who can update his/her collection, and therefore
one's collection may be constantly active. Eventually this needs to be
benchmarked.

Our benchmarks show that on 1000 such collections, we achieve significantly
better response times from the multi-collection setup (one Solr collection
per user) vs the single-collection setup (one Solr collection for *all*
users, with a collectionID field added to all documents). Our next step is
to try perhaps a hybrid mode where we store groups of users in the same
Solr collection, but not all of them in the same Solr collection. So maybe
if Solr works well w/ 1000 collections, we will index 10 users in one such
collection ... we'll give it a try.

I think SOLR-7191 may solve the general use case though I haven't yet read
through it thoroughly.

Shai

On Sun, Jun 14, 2015 at 6:50 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> Yes, there are some known problems while scaling to large number of
> collections, say 1000 or above. See
> https://issues.apache.org/jira/browse/SOLR-7191
>
> On Sun, Jun 14, 2015 at 8:30 PM, Shai Erera  wrote:
>
> > Thanks Jack for your response. But I think Arnon's question was
> different.
> >
> > If you need to index 10,000 different collection of documents in Solr
> (say
> > a collection denotes someone's Dropbox files), then you have two options:
> > index all collections in one Solr collection, and add a field like
> > collectionID to each document and query, or index each user's private
> > collection in a different Solr collection.
> >
> > The pros of the latter is that you don't need to add a collectionID
> filter
> > to each query. Also from a security/privacy standpoint (and search
> quality)
> > - a user can only ever search what he has access to -- e.g. it cannot
> get a
> > spelling correction for words he never saw in his documents, nor document
> > suggestions (even though the 'context' in some of Lucene suggesters allow
> > one to do that too). From a quality standpoint you don't mix different
> term
> > statistics etc.
> >
> > So from a single node's point of view, you can either index 100M
> documents
> > in one index (Collection, shard, replica -- whatever -- a single Solr
> > core), or in 10,000 such cores. From node capacity perspectives the two
> are
> > the same -- same amount of documents will be indexed overall, same query
> > workload etc.
> >
> > So the question is purely about Solr and its collections management -- is
> > there anything in that process that can prevent one from managing
> thousands
> > of collections on a single node, or within a single SolrCloud instance?
> If
> > so, what is it -- are these the ZK watchers? Is there a thread per
> > collection at work? Others?
> >
> > Shai
> >
> > On Sun, Jun 14, 2015 at 5:21 PM, Jack Krupansky <
> jack.krupan...@gmail.com>
> > wrote:
> >
> > > As a general rule, there are only two ways that Solr scales to large
> > > numbers: large number of documents and moderate number of nodes (shards
> > and
> > > replicas). All other parameters should be kept relatively small, like
> > > dozens or low hundreds. Even shards and replicas should probably kept
> > down
> > > to that same guidance of dozens or low hundreds.
> > >
> > > Tens of millions of documents should be no problem.

Re: file index format

2015-06-14 Thread Frank Ralf
Hi,

I face the same problem when trying to index DITA XML files. These are XML
files but have the file extension .dita which Solr ignores.

According to java -jar post.jar -h only the following file extensions are
supported: 

  -Dfiletypes=<type>[,<type>,...]
  (default=xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log)

As a workaround I can change the file extension to .xml but would prefer not
to be forced to do so. As Solr also checks the MIME type the list of allowed
file extensions shouldn't be that rigid.

http://stackoverflow.com/questions/30763161/solr-post-files-with-no-extention/30769088
suggests a simple bash script with a for loop that submits each file
individually and works regardless of the file extension as a workaround.
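
Going through SolrJ instead of post.jar is another way around it, since you
can declare the MIME type yourself and the .dita extension no longer matters
(a sketch; the URL, file name and literal.id value are placeholders):

import java.io.File;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class PostDitaFile {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        // Declare the content type explicitly so the file extension is irrelevant.
        req.addFile(new File("topic1.dita"), "application/xml");
        req.setParam("literal.id", "topic1");
        req.setParam("commit", "true");

        solr.request(req);
    }
}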

Kind regards,
Frank





--
View this message in context: 
http://lucene.472066.n3.nabble.com/file-index-format-tp4199892p4211693.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solrj Tika/Cell not using defaultField

2015-06-14 Thread Charlie Hubbard
I'm having trouble getting Solr to pay attention to the defaultField value
when I send a document to Solr Cell or Tika.  Here is my post I'm sending
using Solrj

POST
/solr/collection1/update/extract?extractOnly=true&defaultField=text&wt=javabin&version=2
HTTP/1.1

When I get the response back the NamedList contains the content it
extracted but it's under the name null and null_metadata respectively.
I've seen it return the defaultField I give it before, but for some reason
now it's not returning it.  I've even tried to configure the
ExtractRequestHandler like so:



text




true
text
links
ignored_




But even that doesn't get picked up.  Here is the SOLR code I use to set
the parameters:

public SolrRequest toSolrExtractRequest() throws IOException {
    ContentStreamUpdateRequest req =
            new ContentStreamUpdateRequest("/update/extract");
    req.addFile(getLocation(), null);

    req.setParam(EXTRACT_ONLY, "true");
    req.setParam(DEFAULT_FIELD, "text");

    return req;
}

So why is this not working?

Charlie


Re: file index format

2015-06-14 Thread Frank Ralf
Looks like this has been solved recently in the current dev branch:

"SimplePostTool (and thus bin/post) cannot index files with unknown
extensions"
https://issues.apache.org/jira/browse/SOLR-7546



--
View this message in context: 
http://lucene.472066.n3.nabble.com/file-index-format-tp4199892p4211699.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Limitation on Collections Number

2015-06-14 Thread Erick Erickson
re: hybrid approach.

Hmmm, _assuming_ that no single user has a really huge number of
documents you might be able to use a single collection (or much
smaller group of collections), by using custom routing. That allows
you to send all the docs for a particular user to a particular shard.
There are some obvious issues here with the long-tail users, most of
your users have +/- X docs on average, and three of them have 100,000X
docs. There are probably some not-so-obvious gotchas, too.

True, for user X you'd send sub-requests to all shards, but all but
one of them wouldn't find anything so would _probably_ be close to a
no-op. Conceptually, each shard then becomes N of your current
collections. Maybe there's a sweet spot performance-wise here where
you're hosting some number of users per shard (or aggregate N docs per
shard or...).

Of course there's more maintenance here, particularly you have to
manage the size of shards yourself since the possibility of them
getting lopsided is higher etc.
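
For what it's worth, the indexing side of that custom routing can be sketched
like this, assuming the default compositeId router where everything before the
'!' in the id picks the shard (user42 and the field names are just examples):

import org.apache.solr.common.SolrInputDocument;

public class CompositeIdExample {
    public static void main(String[] args) {
        SolrInputDocument doc = new SolrInputDocument();
        // compositeId routing: the "user42" prefix is hashed to choose the shard,
        // so all of this user's documents land on the same shard.
        doc.addField("id", "user42!doc1001");
        doc.addField("collectionID", "user42");
        doc.addField("details", "quarterly report ...");
        System.out.println(doc);
    }
}

At query time the same prefix can be passed via the _route_ parameter so that
only that user's shard is consulted.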

FWIW,
Erick

On Sun, Jun 14, 2015 at 9:48 AM, Shai Erera  wrote:
>>
>> My answer remains the same - a large number of collections (cores) in a
>> single Solr instance is not one of the ways in which Solr is designed to
>> scale. To repeat, there are only two ways to scale Solr, number of
>> documents and number of nodes.
>>
>
> Jack, I understand that, but I still feel you're missing the point. We
> didn't ask about scaling Solr at all - it's a question about indexing
> strategy when you need to index multiple disparate collections of documents
> -- one collection w/ a collectionID field, or a Solr collection per set of
> documents.
>
> If you are _not_ in SolrCloud, then there's the "Lots of cores" solution,
>> see: http://wiki.apache.org/solr/LotsOfCores. Pay attention to the
>> warning at the top: NOT FOR SOLRCLOUD!
>>
>
> Thanks Erick. We did read this a while ago. We are in SolrCloud mode cause
> we want to keep a replica per collection and SolrCloud makes it easy for
> us. However, we aren't in a real/common SolrCloud mode, where we just need
> to index 1B documents and sharding + replication comes to our aid.
>
> If we were not in a SolrCloud mode, I imagine we'd need to manage the
> replicas ourselves and also index a document to both replicas manually?
> That is, there is no way in _non_ SolrCloud mode to tell two cores that
> they are replicas of one another - correct?
>
> A user may sign on and search her documents
>> just a few times a day, for a few minutes at a time.
>>
>
> This is almost true -- you may visit your Dropbox once an hour (or it may
> be open in the background on your computer), but the server still receives
> documents (e.g. shares) frequently by other users, and need to index it for
> your collection. Not saying this isn't a good fit, just mentioning that
> it's not only the user who can update his/her collection, and therefore
> one's collection may be constantly active. Eventually this needs to be
> benchmarked.
>
> Our benchmarks show that on 1000 such collections, we achieve significant
> better response times from the multi-collection setup (one Solr collection
> per user) vs the single-collection setup (one Solr collection for *all*
> users, with a collectionID field added to all documents). Our next step is
> to try perhaps a hybrid mode where we store groups of users in the same
> Solr collection, but not all of them in the same Solr collection. So maybe
> if Solr works well w/ 1000 collections, we will index 10 users in one such
> collection ... we'll give it a try.
>
> I think SOLR-7191 may solve the general use case though I haven't yet read
> through it thoroughly.
>
> Shai
>
> On Sun, Jun 14, 2015 at 6:50 PM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
>> Yes, there are some known problems while scaling to large number of
>> collections, say 1000 or above. See
>> https://issues.apache.org/jira/browse/SOLR-7191
>>
>> On Sun, Jun 14, 2015 at 8:30 PM, Shai Erera  wrote:
>>
>> > Thanks Jack for your response. But I think Arnon's question was
>> different.
>> >
>> > If you need to index 10,000 different collection of documents in Solr
>> (say
>> > a collection denotes someone's Dropbox files), then you have two options:
>> > index all collections in one Solr collection, and add a field like
>> > collectionID to each document and query, or index each user's private
>> > collection in a different Solr collection.
>> >
>> > The pros of the latter is that you don't need to add a collectionID
>> filter
>> > to each query. Also from a security/privacy standpoint (and search
>> quality)
>> > - a user can only ever search what he has access to -- e.g. it cannot
>> get a
>> > spelling correction for words he never saw in his documents, nor document
>> > suggestions (even though the 'context' in some of Lucene suggesters allow
>> > one to do that too). From a quality standpoint you don't mix different
>> term
>> > statistics etc.
>> >
>> > So from a single node's point of v

Re: file index format

2015-06-14 Thread Frank Ralf
This issue has also already been discussed in the Tika issue queue:

"Add method get file extension from MimeTypes"
https://issues.apache.org/jira/browse/TIKA-538

And
http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
does support DITA XML file types.

I will investigate further and report back.

Frank



--
View this message in context: 
http://lucene.472066.n3.nabble.com/file-index-format-tp4199892p4211738.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Division with Stats Component when Grouping in Solr

2015-06-14 Thread kingofhypocrites
I think I have this just about working with the analytics component. It seems to
fill in all the gaps that the stats component and the JSON facet API don't
cover.

It solved the following problems for me:
- I am able to perform math on stats to form other stats. Then I can sort
on those as needed.
- When I perform math on stats it uses the summed totals per group rather
than doing it per row.
- I am able to do offsets and number of rows to handle paging.

I am confused why this module isn't built into Solr. This functionality is so
vital for any ad hoc querying on time series data. Pretty much any scenario
like the SQL query I provided would need all of these things.

Only thing I couldn't figure out is how to get the list of total buckets...
or in other words the distinct count of keywords. If anyone is able to help
with this, I could really use it in order to provide a total record count to
the user (e.g. Showing records 1-10 of 2939). 

Here is what I have in case this helps someone:
olap=true&o.r1.ff=keyword_s&o.r1.s.visits=sum(visits_i)&o.r1.s.bounces=sum(bounces_i)&o.r1.s.bounce_rate=div(sum(bounces_i),sum(visits_i))&o.r1.ff.keyword_s.sortstatistic=bounce_rate&o.r1.ff.keyword_s.sortdirection=desc&o.r1.ff.keyword_s.offset=0&o.r1.ff.keyword_s.limit=10

Also if anyone has access to the original documentation from bloomberg
mentioned in the stats component PDF, I'd love to have it :)
https://issues.apache.org/jira/secure/attachment/12606793/Search%20Analytics%20Component.pdf

All the links for detailed documentation are now broken.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Division-with-Stats-Component-when-Grouping-in-Solr-tp4211402p4211751.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Division with Stats Component when Grouping in Solr

2015-06-14 Thread Erick Erickson
Why it isn't in core Solr... Because it doesn't (and probably can't)
support distributed mode.
The Streaming aggregation stuff, and the (in trunk Real Soon Now)
Parallel SQL support
are where the effort is going to support this kind of stuff.

https://issues.apache.org/jira/browse/SOLR-7560

https://issues.apache.org/jira/browse/SOLR-7082

Best,
Erick

On Sun, Jun 14, 2015 at 2:25 PM, kingofhypocrites
 wrote:
> I think I have this about working with the analytics component. It seems to
> fill in all the gaps that the stats component and the json facet don't
> support.
>
> It solved the following problems for me:
> - I am able to perform math on stats to form other stats.. Then i can sort
> on those as needed.
> - When I perform math on stats it uses the summed totals per group rather
> than doing it per row
> - I am able to to do offsets and number of rows to handle paging
>
> I am confused why this module isn't built into Sor. This functionality is so
> vital for any adhoc querying on time series data. Pretty much any scenario
> like the SQL query I provided would need all of these things.
>
> Only thing I couldn't figure out is how to get the list of total buckets...
> or in other words the distinct count of keywords. If anyone is able to help
> with this, I could really use it in order to provide a total record count to
> the user (e.g. Showing records 1-10 of 2939).
>
> Here is what I have in case this helps someone:
> olap=true&o.r1.ff=keyword_s&o.r1.s.visits=sum(visits_i)&o.r1.s.bounces=sum(bounces_i)&o.r1.s.bounce_rate=div(sum(bounces_i),sum(visits_i))&o.r1.ff.keyword_s.sortstatistic=bounce_rate&o.r1.ff.keyword_s.sortdirection=desc&o.r1.ff.keyword_s.offset=0&o.r1.ff.keyword_s.limit=10
>
> Also if anyone has access to the original documentation from bloomberg
> mentioned in the stats component PDF, I'd love to have it :)
> https://issues.apache.org/jira/secure/attachment/12606793/Search%20Analytics%20Component.pdf
>
> All the links for detailed documentation are now broken.
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Division-with-Stats-Component-when-Grouping-in-Solr-tp4211402p4211751.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Please help test the new Angular JS Admin UI

2015-06-14 Thread Erick Erickson
And anyone who, you know, really likes working with UI code, please
help make it better!

As of Solr 5.2, there is a new version of the Admin UI available, and
several improvements are already in 5.2.1 (release imminent). The old
admin UI is still the default, the new one is available at

/admin/index.html

Currently, you will see very little difference at first glance; the
goal for this release was to have as much of the current functionality
as possible ported to establish the framework. Upayavira has done
almost all of the work getting this in place, thanks for taking that
initiative Upayavira!

Anyway, the plan is several fold:
> Get as much testing on this as possible over the 5.2 time frame.
> Make the new Angular JS-based code the default in 5.3
> Make improvements/bug fixes to the admin UI on the new code line, 
> particularly SolrCloud functionality.
> Deprecate the current code and remove it eventually.

The new code should be quite a bit easier to work on for programmer
types, and there are Big Plans Afoot for making the admin UI more
SolrCloud-friendly. Now that the framework is in place, it should be
easier for anyone who wants to volunteer to contribute, please do!

So please give it a whirl. I'm sure there will be things that crop up,
and any help addressing them will be appreciated. There's already an
umbrella JIRA for this work, see:
https://issues.apache.org/jira/browse/SOLR-7666. Please link any new
issues to this JIRA so we can keep track of it all as well as
coordinate efforts. If all goes well, this JIRA can be used to see
what's already been reported too.

Note that things may be moving pretty quickly, so trunk and 5x will
always be the most current. That said looking at 5.2.1 will be much
appreciated.

Erick


Re: Issues with using Paoding to index Chinese characters

2015-06-14 Thread Zheng Lin Edwin Yeo
But I think Solr 3.6 is too far back to fall back to as I'm already using
Solr 5.1.

Regards,
Edwin

On 14 June 2015 at 14:49, Upayavira  wrote:

> When in 2012? I'd give it a go with Solr 3.6 if you don't want to modify
> the library.
>
> Upayavira
>
> On Sun, Jun 14, 2015, at 04:14 AM, Zheng Lin Edwin Yeo wrote:
> > I'm still trying to find out which version it is compatible for, but the
> > document which I've followed is written in 2012.
> >
> > http://java.dzone.com/articles/indexing-chinese-solr
> >
> > Regards,
> > Edwin
> >
> >
> > On 12 June 2015 at 20:15, Upayavira  wrote:
> >
> > > Not knowing anything about paoding, it seems that this library isn't
> > > compatible with the current version of Solr/Lucene. Have a look at the
> > > version that it was compiled for. Having looked at the date of the
> > > latest download (2008) Lucene has changed a LOT since then, so some
> > > conversion work will definitely be needed to make it work.
> > >
> > > Upayavira
> > >
> > > On Fri, Jun 12, 2015, at 08:28 AM, Zheng Lin Edwin Yeo wrote:
> > > > I'm trying to use Paoding to index Chinese characters in Solr.
> > > >
> > > > I'm using Solr 5.1, have downloaded the dictionary to shard1\dic and
> > > > shard2\dic, and have configured the following in schema,xml
> > > >
> > > > 
> > > > 
> > > > 
> > > >
> > > > I've also included -DPAODING_DIC_HOME=/dic during my startup of Solr
> > > >
> > > > However, when I tried to start Solr, I get the following error:
> > > >
> > > > java.lang.VerifyError: class
> > > > net.paoding.analysis.analyzer.PaodingAnalyzerBean overrides final
> > > > method
> > > >
> > >
> tokenStream.(Ljava/lang/String;Ljava/io/Reader;)Lorg/apache/lucene/analysis/TokenStream;
> > > >   at java.lang.ClassLoader.defineClass1(Native Method)
> > > >   at java.lang.ClassLoader.defineClass(Unknown Source)
> > > >   at java.security.SecureClassLoader.defineClass(Unknown Source)
> > > >   at java.net.URLClassLoader.defineClass(Unknown Source)
> > > >   at java.net.URLClassLoader.access$100(Unknown Source)
> > > >   at java.net.URLClassLoader$1.run(Unknown Source)
> > > >   at java.net.URLClassLoader$1.run(Unknown Source)
> > > >   at java.security.AccessController.doPrivileged(Native Method)
> > > >   at java.net.URLClassLoader.findClass(Unknown Source)
> > > >   at
> > >
> org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
> > > >   at
> > >
> org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:383)
> > > >   at java.lang.ClassLoader.defineClass1(Native Method)
> > > >   at java.lang.ClassLoader.defineClass(Unknown Source)
> > > >   at java.security.SecureClassLoader.defineClass(Unknown Source)
> > > >   at java.net.URLClassLoader.defineClass(Unknown Source)
> > > >   at java.net.URLClassLoader.access$100(Unknown Source)
> > > >   at java.net.URLClassLoader$1.run(Unknown Source)
> > > >   at java.net.URLClassLoader$1.run(Unknown Source)
> > > >   at java.security.AccessController.doPrivileged(Native Method)
> > > >   at java.net.URLClassLoader.findClass(Unknown Source)
> > > >   at
> > >
> org.eclipse.jetty.webapp.WebAppClassLoader.loadClass(WebAppClassLoader.java:421)
> > > >   at java.lang.ClassLoader.loadClass(Unknown Source)
> > > >   at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
> > > >   at java.lang.ClassLoader.loadClass(Unknown Source)
> > > >   at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
> > > >   at java.lang.ClassLoader.loadClass(Unknown Source)
> > > >   at java.lang.Class.forName0(Native Method)
> > > >   at java.lang.Class.forName(Unknown Source)
> > > >   at
> > >
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:476)
> > > >   at
> > >
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:423)
> > > >   at
> > >
> org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:262)
> > > >   at
> > >
> org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:94)
> > > >   at
> > >
> org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:42)
> > > >   at
> > >
> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
> > > >   at
> > > org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:489)
> > > >   at
> org.apache.solr.schema.IndexSchema.(IndexSchema.java:175)
> > > >   at
> > >
> org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
> > > >   at
> > >
> org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
> > > >   at
> > >
> org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:102)
> > > >   at
> > >
> org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:74)
> > > >   at
> > > org.apache.s

invalid index version and generation

2015-06-14 Thread Summer Shire
Hi all,

Every time I optimize my index with maxSegment=2, after some time the
replication fails to get the file list for a given generation. It looks like the
index version and generation count get messed up.
(If maxSegment=1 this never happens. I am able to reproduce this reliably
by optimizing with maxSegment=2. I am using Solr 4.7.2.)
e.g.: SnapPuller - No files to download for index generation: 67

Here is what I see when I curl the commands on my terminal.
I also tried to modify SnapPuller's fetchFileList(long gen) method to get the
indexVersion and pass that as a param, but that did not help either.
Also, my slave's generation/version is always smaller than the master's.

What could be going on? any idea ?

Thanks,
Summer

$ curl "http://localhost:6600/solr/main0/replication?command=indexversion";


<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst>
  <long name="indexversion">1434341174641</long>
  <long name="generation">67</long>
</response>


$ curl 
"http://localhost:6600/solr/main0/replication?command=filelist&generation=67&indexversion=1434341174641";


<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst>
  <str name="status">invalid index generation</str>
</response>
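
The same checks can also be scripted from SolrJ rather than curl (a sketch
against 4.7-era SolrJ; the core name and port are taken from the commands
above):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:6600/solr/main0");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "indexversion");

        QueryRequest req = new QueryRequest(params);
        req.setPath("/replication"); // hit the ReplicationHandler instead of /select

        NamedList<Object> rsp = solr.request(req);
        System.out.println("indexversion: " + rsp.get("indexversion"));
        System.out.println("generation:   " + rsp.get("generation"));
    }
}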


RE: Solr Exact match boost Reduce the results

2015-06-14 Thread JACK
Hi chillra,
I have changed the index and query field configuration to

 
  

But my problem is still not solved.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4211788.html
Sent from the Solr - User mailing list archive at Nabble.com.