Re: Determining the Number of Solr Shards
On Thu, 2015-01-08 at 22:55 +0100, Nishanth S wrote:
> Thanks guys for your inputs I would be looking at around 100 Tb of total
> index size with 5100 million documents [...]

That is a large corpus when coupled with your high indexing & QPS requirements. Are the queries complex too? Will you be doing non-trivial faceting?

Your requirements are so high that any guesswork at this point is likely to be wrong by an order of magnitude. What is very certain is that you will need serious hardware. Your starting point should not be to try and estimate the number of shards. Start by building a test setup.

- Toke Eskildsen
Tokenizer or Filter ?
Hello, I have a question: should I use a tokenizer or a filter?

I need to separate two channels. I wrote about this here earlier, but realized it probably isn't possible with Solr's basic tools, so I'm trying to write my own tool for this task.

I have this input: HelloHelloHow are you ?Fine and you're?
d1 - direction1
d2 - direction2

I want to output only d1, and then search for some words within that result. For example, the output should be:

Output: [Hello, How are you?]

I wrote my idea in Java, but I don't know where to incorporate it - into a Filter or a Tokenizer - and I would like some advice on how to start. I probably have to extend some Lucene class and plug a slightly modified version of this in there, don't I?

Here is my code:

package test1;

import java.util.Arrays;

public class Test1 {

    public static void main(String[] args) {
        String dialogue = "HelloHelloHow are you ?Fine and you're? ";

        // split the dialogue into direction segments
        String[] input = dialogue.split("(?<=)\\d*(?=)");

        // count the segments for direction d1
        int countD1 = 0;
        for (String input1 : input) {
            if (input1.startsWith("")) {
                countD1++;
            }
        }

        // collect the d1 segments
        String[] d1 = new String[countD1];
        int array = 0;
        for (String input1 : input) {
            if (input1.startsWith("")) {
                d1[array] = input1;
                array++;
            }
        }

        String d1Out = Arrays.toString(d1);
        System.out.println(d1Out); // Return d1Out
    }
}

Thanks for your advice.

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Can't you use solr.PatternTokenizerFactory for this task? On Friday, January 9, 2015 1:48 PM, tomas.kalas wrote: Hello, i have a question what i have to use tokenizer or filter ? I need separate 2 chanels. I wrote this here earlier, but realize it with solr basic tools it is not probably possible. And i',m trying to write own tool for this task. I have this input HelloHelloHow are you ?Fine and you're? d1 - direction1 d2 - direction2 and i want to output only d1 and between this result search some words, for example output should be: Output: [Hello,How are you?] I wrote my idea in java, but i dont know where to incorporate it. If to Filter or Tokenizer and some advices how to start? I probably must extends some lucene library and include it easily modificated there isn't it ? Here is my code: package test1; import java.util.Arrays; public class Test1 { public static void main(String[] args) { String dialogue = "HelloHelloHow are you ?Fine and you're? "; String[] input = dialogue.split("(?<=)\\d*(?=)"); int countD1 = 0; for (String input1 : input) { if (input1.startsWith("")) { countD1++; } } String [] d1 = new String[countD1]; int array = 0; for (String input1 : input) { if (input1.startsWith("")) { d1[array] = input1; array++; } } String d1Out = Arrays.toString(d1); System.out.println(d1Out); //Return s1Out } } Thanks for you advices. -- View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346.html Sent from the Solr - User mailing list archive at Nabble.com.
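For reference, a minimal sketch of what a PatternTokenizerFactory-based field type could look like. The regex shown is only a placeholder (the real channel markers were lost in the archived message), and the field type name is an assumption, so this is illustrative rather than a tested configuration:

<fieldType name="text_d1" class="solr.TextField">
  <analyzer>
    <!-- placeholder pattern: capture group 1 would hold the d1 channel text -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="YOUR-CHANNEL-REGEX" group="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>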
Re: How to return child documents with parent
Thanks. That solved my problem.

Y

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-return-child-documents-with-parent-tp4178081p4178378.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
I used the same regex and unfortunately it doesn't work. Or should I somehow change the regex? Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346p4178389.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Tokenizer or Filter ?
Consider an update processor - it can take any input, break it up any way you want, and then output multiple field values. You can even use the stateless script update processor to write the logic in JavaScript.

-- Jack Krupansky

On Fri, Jan 9, 2015 at 6:47 AM, tomas.kalas wrote:
> Hello, i have a question what i have to use tokenizer or filter ?
> I need separate 2 chanels. I wrote this here earlier, but realize it with
> solr basic tools it is not probably possible. And i'm trying to write own
> tool for this task.
> I have this input HelloHelloHow are you ?Fine and you're?
> d1 - direction1
> d2 - direction2
> and i want to output only d1 and between this result search some words, for
> example output should be:
> Output: [Hello,How are you?]
>
> I wrote my idea in java, but i dont know where to incorporate it. If to
> Filter or Tokenizer and some advices how to start? I probably must extends
> some lucene library and include it easily modificated there isn't it ?
>
> Here is my code:
>
> package test1;
> import java.util.Arrays;
>
> public class Test1 {
>
>     public static void main(String[] args) {
>         String dialogue = "HelloHelloHow are you ?Fine and you're? ";
>         String[] input = dialogue.split("(?<=)\\d*(?=)");
>         int countD1 = 0;
>         for (String input1 : input) {
>             if (input1.startsWith("")) {
>                 countD1++;
>             }
>         }
>         String [] d1 = new String[countD1];
>         int array = 0;
>         for (String input1 : input) {
>             if (input1.startsWith("")) {
>                 d1[array] = input1;
>                 array++;
>             }
>         }
>         String d1Out = Arrays.toString(d1);
>         System.out.println(d1Out);
>         //Return s1Out
>     }
> }
>
> Thanks for you advices.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tokenizer-or-Filter-tp4178346.html
> Sent from the Solr - User mailing list archive at Nabble.com.
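A minimal sketch of the stateless script update processor approach Jack describes, as it might appear in solrconfig.xml (the chain name and the script file name are assumptions for illustration, not taken from the thread):

<updateRequestProcessorChain name="split-channels">
  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <!-- channel-split.js would hold the JavaScript that splits the incoming
         text into the d1/d2 field values before indexing -->
    <str name="script">channel-split.js</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain would then be referenced from the update handler, or selected per request with the update.chain parameter.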
Request two databases at the same time ?
Dear All,

I use Apache Solr 3.6 on Ubuntu (newbie user).

I have a big database named BigDB1 with 90M documents; each document contains several fields (docid, title, author, date, etc.).

Today I received, from another source, abstracts for some of these documents (this source also has the same docid field). I don't want to modify BigDB1 to update documents with the abstracts, because BigDB1 is itself updated twice a week.

Do you think it's possible to create a new database named AbsDB1 and query both databases at the same time? For example, if I query:

title:airplane AND abstract:plastic

I would like to obtain documents from both BigDB1 and AbsDB1.

Many thanks for your help, information and anything else that can help me.

Regards,
Bruno

---
This email contains no viruses or malware because avast! Antivirus protection is active.
http://www.avast.com
Best way to implement Spotlight of certain results
I have a requirement to spotlight certain results if the query text exactly matches the title or a see reference (indexed by me as alttitle_t). What that means is that these matching results are shown above the top-10/20 list with different CSS and fields. It's like "I'm Feeling Lucky" on Google :)

I have considered three ways of implementing this:

1. Assume that edismax qf/pf will boost these results to be first when there is an exact match on these important fields. The downside is that my relevancy is constrained and I must keep title and alttitle_t as the top search fields in my configuration (see the snippet below); I may have to overweight them to achieve the "always first" criterion. Another, less major, downside is that I must always return the spotlight summary field (for display) and the image to display on each search. These could be fetched from a database by id, but it is convenient to get them from Solr.

2. Issue two searches for every user search, and use a second set of parameters (change the search type and fields to search only by exact-matching a specific string field, spottitle_s). The search for the spotlight can then have its own configuration. The downside here is that I am using Django and pysolr for the front-end, and pysolr is both synchronous and tied to the requestHandler named "select" by convention. Of course, running in parallel is not a fix-all - running a search takes some time, even if run in parallel.

3. Automate the population of elevate.xml so that all 959 of these queries are in it. This is probably best, but forces me to restart/reload when there are changes to this component. The elevation can be done through a query.

What I'd love to do is to configure the "select" requestHandler to run both searches and return me both sets of results. Is there any way to do that - apply the same q= parameter to two configured ways of running a search? Something like sub-queries?

I suspect that approach 1 will get me through my demo and a brief evaluation period, but that either approach 2 or 3 will be the winner.

Here's a snippet from my current qf/pf configuration:

title^100
alttitle_t^100
...
text

title^1000
alttitle_t^1000
...
text^10

Thanks,

Dan Davis
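For option 3 above, a minimal sketch of what an auto-generated elevate.xml (used by the QueryElevationComponent) could look like. The query texts and document ids here are purely hypothetical placeholders:

<elevate>
  <query text="heart attack">
    <!-- id of the document whose title/see reference exactly matches the query -->
    <doc id="SPOT-0001"/>
  </query>
  <query text="influenza">
    <doc id="SPOT-0002"/>
  </query>
</elevate>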
filter on solr pivot data
Hello,

I need to know how I can filter on Solr pivot data.

For example, we have dealers, each of which might have many cars on its lot, and each car has photos. I need to find the dealers whose cars have no photos.

So I have:

dealer1 -> has 20 cars -> all of them have photos
dealer2 -> has 20 cars -> some of them have photos
dealer3 -> has 20 cars -> none of them have photos

In the results I want to see only the dealers which have no photos, i.e. dealer3. I managed to do a pivot and get a breakdown by vin and "photo exists"; now I want to apply a filter and get only those dealers for which all vins have photo exists as 0.

Here is the pivot output:

lst name="facet_pivot">
vin
1N4AA5AP0EC908535
1
mappings_|photo_exist|
1
1
vin
1N4AA5AP1EC470625
1
mappings_|photo_exist|
1
1

Is this possible?

--
View this message in context: http://lucene.472066.n3.nabble.com/filter-on-solr-pivot-data-tp4178451.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: GC tuning question - can improving GC pauses cause indexing to slow down?
For throughput with G1, get rid of the pause time goal (-XX:MaxGCPauseMillis), so the GC can pause as long as it wants. Beyond that, use a non-concurrent collector and make sure that everything is OK with pauses that last a few seconds.

This is a pretty detailed paper about balancing throughput and pause:

https://engineering.linkedin.com/garbage-collection/garbage-collection-optimization-high-throughput-and-low-latency-java-applications

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Jan 8, 2015, at 11:38 PM, Shawn Heisey wrote:

> On 1/8/2015 11:05 PM, Boogie Shafer wrote:
>> In the abstract, it sounds like you are seeing the difference between tuning
>> for latency vs tuning for throughput
>>
>> My hunch would be you are seeing more (albeit individually quicker) GC
>> events with your new settings during the rebuild
>>
>> I imagine that in most cases a solr rebuild is relatively rare compared to
>> the amount of times where a lower latency request is desired. If the rebuild
>> times are problematic for you, use tunings specific to that workload during
>> the times you need it and then switch back to your low latency settings
>> after. If you are doing that you can probably run with a bigger heap
>> temporarily during the rebuild as you aren't likely to be fielding queries
>> and don't benefit from having a larger OS cache available
>
> Full rebuilds are indeed relatively rare. Avoiding long pauses and
> keeping query latency low are usually a lot more important than how
> quickly the index rebuilds. Quick rebuilds are nice, but not strictly
> necessary.
>
> We do incremental updates that start at the top of every minute, unless
> an update is already running. Exactly how long those updates take is of
> little importance, unless that time is easier to measure in minutes
> rather than seconds.
>
> If I ever find myself in a situation where completing a rebuild as fast
> as possible becomes extremely important, does anyone have suggestions
> for GC tuning options that will optimize for throughput?
>
> Thanks,
> Shawn
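As a point of reference for Shawn's question, a throughput-oriented starting point (an assumption based on standard HotSpot options, not something tested in this thread) would be the parallel collector with no pause goal, e.g. -XX:+UseParallelGC -XX:+UseParallelOldGC, optionally with -XX:ParallelGCThreads sized to the machine's cores, in place of the CMS/G1 latency-oriented flags.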
Re: Request two databases at the same time ?
bq: I don't want to modify my BigDB1 to update documents with abstract because BigDB1 is always updated twice by week.

Why not? Solr/Lucene handle updating docs: if a doc in the index has the same uniqueKey, the old doc is deleted and the new one takes its place. So why not just put the new abstracts into BigDB1? If you re-index the docs later (your twice/week comment), then they'll be overwritten. This will be much simpler than trying to maintain two.

But if you cannot update BigDB1, just fire off two queries and combine them. Or specify the shards parameter on the URL, pointing to both collections. Do note, though, that the relevance calculations may not be absolutely comparable, so mixing the results may show some surprises...

Best,
Erick

On Fri, Jan 9, 2015 at 9:12 AM, Bruno Mannina wrote:
> Dear All,
>
> I use Apache-SOLR3.6, on Ubuntu (newbie user).
>
> I have a big database named BigDB1 with 90M documents,
> each document contains several fields (docid, title, author, date, etc...)
>
> I received today from another source, abstract of some documents (there are
> also the same docid field in this source).
> I don't want to modify my BigDB1 to update documents with abstract because
> BigDB1 is always updated twice by week.
>
> Do you think it's possible to create a new database named AbsDB1 and request
> the both database at the same time ?
> if I do for example:
> title:airplane AND abstract:plastic
>
> I would like to obtain documents from BigDB1 and AbsDB1.
>
> Many thanks for your help, information and others things that can help me.
>
> Regards,
> Bruno
>
> ---
> This email contains no viruses or malware because avast! Antivirus protection is active.
> http://www.avast.com
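For reference, a query using the shards parameter against both indexes might look roughly like this (host, port and core names are assumptions for illustration, and on Solr 3.6 both cores need compatible schemas for distributed search to work):

http://localhost:8983/solr/BigDB1/select?q=title:airplane+AND+abstract:plastic&shards=localhost:8983/solr/BigDB1,localhost:8983/solr/AbsDB1

Note that this searches each core independently and merges the result lists; it does not join or merge fields of the same docid across the two cores.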
Re: Best way to implement Spotlight of certain results
Hmm, I wonder if the RerankingQueryParser might help here? See: https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking Best, Erick On Fri, Jan 9, 2015 at 10:35 AM, Dan Davis wrote: > I have a requirement to spotlight certain results if the query text exactly > matches the title or see reference (indexed by me as alttitle_t). > What that means is that these matching results are shown above the > top-10/20 list with different CSS and fields. Its like feeling lucky on > google :) > > I have considered three ways of implementing this: > >1. Assume that edismax qf/pf will boost these results to be first when >there is an exact match on these important fields. The downside then is >that my relevancy is constrained and I must maintain my configuration with >title and alttitle_t as top search fields (see XML snippet below).I may >have to overweight them to achieve the "always first" criteria. Another >less major downside is that I must always return the spotlight summary >field (for display) and the image to display on each search. These could >be got from a database by the id, however, it is convenient to get them >from Solr. >2. Issue two searches for every user search, and use a second set of >parameters (change the search type and fields to search only by exact >matching a specific string field spottitle_s). The search for the >spotlight can then have its own configuration. The downside here is that >I am using Django and pysolr for the front-end, and pysolr is both >synchronous and tied to the requestHandler named "select". Convention. >Of course, running in parallel is not a fix-all - running a search takes >some time, even if run in parallel. >3. Automate the population of elevate.xml so that all these 959 queries >are here. This is probably best, but forces me to restart/reload when >there are changes to this components. The elevation can be done through a >query. > > What I'd love to do is to configure the "select" requestHandler to run both > searches and return me both sets of results. Is there anyway to do that - > apply the same q= parameter to two configured way to run a search? > Something like sub queries? > > I suspect that approach 1 will get me through my demo and a brief > evaluation period, but that either approach 2 or 3 will be the winner. > > Here's a snippet from my current qf/pf configuration: > > title^100 > alttitle_t^100 > ... > text > > > title^1000 > alttitle_t^1000 > ... > text^10 > > > Thanks, > > Dan Davis
Re: filter on solr pivot data
Why not just add an fq clause like &fq=-mappings_iphoto_exist:[* TO *]? Note the "-" sign.

On Fri, Jan 9, 2015 at 11:14 AM, Darniz wrote:
> Hello
>
> i need to know how can i filter on solr pivot data.
>
> For exampel we have a dealer which might have many cars in his lot and car
> has photos, i need to find out a dealer which has cars which has no photos
>
> so i have
>
> dealer1 -> has 20 cars -> all of them has photos
> dealer2 -> has 20 cars -> some of them have photos
> dealer3 -> has 20 cars -> none of them have photos
>
> in the results i want to see only dealers which has no photos ie dealer3, i
> managed to do pivot and get a breakdown by vin and photo exists now i want
> to apply filter and get only those dealer who has all vin which have photo
> exists as 0
>
> lst name="facet_pivot">
> vin
> 1N4AA5AP0EC908535
> 1
> mappings_|photo_exist|
> 1
> 1
> vin
> 1N4AA5AP1EC470625
> 1
> mappings_|photo_exist|
> 1
> 1
>
> is it possible
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/filter-on-solr-pivot-data-tp4178451.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Best way to implement Spotlight of certain results
Maybe I understand you badly but I thing that you could use grouping to achieve such effect. If you could prepare two group queries one with exact match and other, let's say, default than you will be able to extract matches from grouping results. i.e (using default solr example collection) http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.query=manu%3A%22Ap+Computer+Inc.%22&group.query=name:Apple%2060%20GB%20iPod%20with%20Video%20Playback%20Black&group.limit=10 this query will return two groups one with exact match second with the rest standard results. Regars, Michal 2015-01-09 20:44 GMT+01:00 Erick Erickson : > Hmm, I wonder if the RerankingQueryParser might help here? > See: https://cwiki.apache.org/confluence/display/solr/Query+Re-Ranking > > Best, > Erick > > On Fri, Jan 9, 2015 at 10:35 AM, Dan Davis wrote: > > I have a requirement to spotlight certain results if the query text > exactly > > matches the title or see reference (indexed by me as alttitle_t). > > What that means is that these matching results are shown above the > > top-10/20 list with different CSS and fields. Its like feeling lucky on > > google :) > > > > I have considered three ways of implementing this: > > > >1. Assume that edismax qf/pf will boost these results to be first when > >there is an exact match on these important fields. The downside > then is > >that my relevancy is constrained and I must maintain my configuration > with > >title and alttitle_t as top search fields (see XML snippet below). > I may > >have to overweight them to achieve the "always first" criteria. > Another > >less major downside is that I must always return the spotlight summary > >field (for display) and the image to display on each search. These > could > >be got from a database by the id, however, it is convenient to get > them > >from Solr. > >2. Issue two searches for every user search, and use a second set of > >parameters (change the search type and fields to search only by exact > >matching a specific string field spottitle_s). The search for the > >spotlight can then have its own configuration. The downside here is > that > >I am using Django and pysolr for the front-end, and pysolr is both > >synchronous and tied to the requestHandler named "select". > Convention. > >Of course, running in parallel is not a fix-all - running a search > takes > >some time, even if run in parallel. > >3. Automate the population of elevate.xml so that all these 959 > queries > >are here. This is probably best, but forces me to restart/reload > when > >there are changes to this components. The elevation can be done > through a > >query. > > > > What I'd love to do is to configure the "select" requestHandler to run > both > > searches and return me both sets of results. Is there anyway to do > that - > > apply the same q= parameter to two configured way to run a search? > > Something like sub queries? > > > > I suspect that approach 1 will get me through my demo and a brief > > evaluation period, but that either approach 2 or 3 will be the winner. > > > > Here's a snippet from my current qf/pf configuration: > > > > title^100 > > alttitle_t^100 > > ... > > text > > > > > > title^1000 > > alttitle_t^1000 > > ... > > text^10 > > > > > > Thanks, > > > > Dan Davis > -- Michał Bieńkowski
RE: can't make sense of spellchecker results when using techproducts example
Chris,

- DirectSpellChecker has a setting for "minPrefix" which the techproducts example sets to 1 (also the default). So it will never try to correct the first character. I think this is both a performance optimization and is based on the assumption that we rarely misspell the first character. This is why it will not correct "hell" to "dell". I think it will allow you to set this to 0, if you want your sample query to work.

- The "maxCollationTries" feature re-writes "q" / "spellcheck.q", and then, using all the other parameters, queries internally to see if there are any hits. This doesn't play very well when "q.op=OR" / "mm=1". So when you see a collation like "here ultrasharp" / "heat ..." etc, you see it is indeed getting some hits. So it considers it a valid query re-write, despite the absurdity. We could improve this example config by adding "spellcheck.collateParam.q.op=AND" to the defaults. (When using dismax, you would add "spellcheck.collateParam.mm=100%".) Also, while the "collateParam" functionality is in the old Solr wiki, it doesn't seem to be in the reference manual, so we probably should add it, as this would be pretty important for a lot of users.

- Unless using the legacy IndexBasedSpellChecker / FileBasedSpellchecker, you need not use "spellcheck.build". It's a no-op for both Direct and WordBreak, as these do not use sidecar indexes.

So without changing the config, these queries illustrate the spellchecker pretty well, including the word-break functionality:

http://localhost:8983/solr/techproducts/spell?spellcheck.q=dzll+ultra%20sharp&df=text&spellcheck=true&spellcheck.collateParam.q.op=AND
http://localhost:8983/solr/techproducts/spell?spellcheck.q=dellultrasharp&df=text&spellcheck=true&spellcheck.collateParam.q.op=AND

Spellcheck has a lot of gotchas, and I wish we could dream up a way to make it easy for people. I remember it being a struggle for me when I was a new user, and I know we get lots of questions on the user-list about it.

My apologies to you for not answering this sooner.

James Dyer
Ingram Content Group

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Wednesday, December 17, 2014 6:49 PM
To: solr-user@lucene.apache.org
Subject: can't make sense of spellchecker results when using techproducts example

Ok, so i've been working on updating the ref guide to account for the new way to run the "examples" in 5.0. The spell checking page...

https://cwiki.apache.org/confluence/display/solr/Spell+Checking

...has some examples that loosely correlate to the "techproducts" example, but even if you ignore the specifics of those examples, i need help understanding the basic behavior of the spellchecker as configured in the techproducts example.

Assuming you run this...

bin/solr -e techproducts

with that example running & those docs indexed, this URL gives me results i can't explain...

http://localhost:8983/solr/techproducts/spell?spellcheck.q=hell+ultrashar&df=text&spellcheck=true&spellcheck.build=true

(see below)

1) "dell" is not listed as a possible suggestion for "hell" (even if the dictionary thinks "hold" is a better suggestion, why isn't "dell" even included in the list of possibilities?)

2) in the "collation" section, i can't make any sense of what these results mean -- how is "hello ultrasharp" a suggested collationQuery when *none* of the example docs contain both "hello" and "ultrasharp"?

http://localhost:8983/solr/techproducts/select?df=text&q=%2Bhello+%2Bultrasharp

So WTF is up with these spell check results?
[The spellcheck response XML was mangled in the archive. Recoverable content: suggestions for "hell" were hello, here, heat, hold, html and héllo; the suggestion for "ultrashar" was ultrasharp; correctlySpelled was false; the collations were "hello ultrasharp" (2 hits), "here ultrasharp" (3 hits), "heat ultrasharp" (2 hits), "hold ultrasharp" (2 hits) and "html ultrasharp" (2 hits).]

-Hoss
http://www.lucidworks.com/
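Following up on James' suggestion above, a minimal sketch of how the /spell handler defaults could be extended with the collation q.op override (a sketch only - the other defaults shown are illustrative, not a copy of the shipped techproducts config):

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">text</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.dictionary">default</str>
    <!-- make collation test queries use AND so nonsense collations stop getting hits -->
    <str name="spellcheck.collateParam.q.op">AND</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>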
Re: How does text-rev work?
Anybody? Otherwise, I guess it is a JIRA to delete the unused field?

Regards,
   Alex.

Sign up for my Solr resources newsletter at http://www.solr-start.com/

On 28 December 2014 at 13:16, Alexandre Rafalovitch wrote:
> I am looking at the collection1/techproducts schema and I can't figure
> out how the reversed wildcard example is supposed to work.
>
> We define text_general_rev type and text_rev field, but we don't seem
> to be populating it at any point. And running the example does not
> seem to show any tokens in the field even when the non-inverted text
> field does have some.
>
> Apparently, there is some magic in the QueryParser to do something
> about this at query time, but I see no explanation of what is supposed
> to happen at the index/schema time.
>
> Anybody has the skinny on this one?
>
> Regards,
>    Alex.
>
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
Re: Request two databases at the same time ?
Dear Erick, thank you for your answer. My answers are below.

On 09/01/2015 20:43, Erick Erickson wrote:
> bq: I don't want to modify my BigDB1 to update documents with abstract
> because BigDB1 is always updated twice by week.
>
> Why not? Solr/Lucene handle updating docs, if a doc in the index has the
> same uniqueKey, the old doc is deleted and the new one takes its place. So
> why not just put the new abstracts into BigDB1? If you re-index the docs
> later (your twice/week comment), then they'll be overwritten. This will be
> much simpler than trying to maintain two.

I understand this process; I use it for other collections, and twice a week for BigDB1. But, for example, Doc1 is updated with an abstract on Monday. On Tuesday I must update it with new data, and the abstract will then be lost. I can't check/get the abstract before re-inserting it in the new doc, because I receive several thousand docs every week (new and amended); I think that would take a long time.

> But if you cannot update BigDB1 just fire off two queries and combine them.
> Or specify the shards parameter on the URL pointing to both collections.
> Do note, though, that the relevance calculations may not be absolutely
> comparable, so mixing the results may show some surprises...

Shards... I will take a look at this, I don't know this param. Concerning relevance, I don't really use it, so it won't be a problem I think.

Sincerely,

> Best,
> Erick
>
> On Fri, Jan 9, 2015 at 9:12 AM, Bruno Mannina wrote:
>> Dear All,
>>
>> I use Apache-SOLR3.6, on Ubuntu (newbie user).
>>
>> I have a big database named BigDB1 with 90M documents,
>> each document contains several fields (docid, title, author, date, etc...)
>>
>> I received today from another source, abstract of some documents (there are
>> also the same docid field in this source).
>> I don't want to modify my BigDB1 to update documents with abstract because
>> BigDB1 is always updated twice by week.
>>
>> Do you think it's possible to create a new database named AbsDB1 and request
>> the both database at the same time ?
>> if I do for example:
>> title:airplane AND abstract:plastic
>>
>> I would like to obtain documents from BigDB1 and AbsDB1.
>>
>> Many thanks for your help, information and others things that can help me.
>>
>> Regards,
>> Bruno
>>
>> ---
>> This email contains no viruses or malware because avast! Antivirus protection is active.
>> http://www.avast.com

---
This email contains no viruses or malware because avast! Antivirus protection is active.
http://www.avast.com
Re: How does text-rev work?
Or a Jira to document it.

The basic idea is that if a normal leading wildcard is too slow, the user can index a copy of their text fields using the text_rev type, which indexes terms with their characters reversed and with a special marker. Then the query parser detects a leading wildcard and that the field type uses the reversed wildcard filter, and it generates a wildcard query using the reversed query token and wildcard pattern, so that the leading wildcard becomes a trailing wildcard or prefix query.

-- Jack Krupansky

On Fri, Jan 9, 2015 at 3:15 PM, Alexandre Rafalovitch wrote:
> Anybody? Otherwise, I guess it is a JIRA to delete the unused field?
>
> Regards,
>    Alex.
>
> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>
> On 28 December 2014 at 13:16, Alexandre Rafalovitch wrote:
>> I am looking at the collection1/techproducts schema and I can't figure
>> out how the reversed wildcard example is supposed to work.
>>
>> We define text_general_rev type and text_rev field, but we don't seem
>> to be populating it at any point. And running the example does not
>> seem to show any tokens in the field even when the non-inverted text
>> field does have some.
>>
>> Apparently, there is some magic in the QueryParser to do something
>> about this at query time, but I see no explanation of what is supposed
>> to happen at the index/schema time.
>>
>> Anybody has the skinny on this one?
>>
>> Regards,
>>    Alex.
>>
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
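A minimal sketch of how the pieces Jack describes would be wired together in a schema (the analyzer details and the copyField line are illustrative assumptions, not a quote of the shipped example schema):

<fieldType name="text_general_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes each term both normally and reversed with a marker character -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>

<!-- without a copyField like this, nothing ever populates text_rev -->
<copyField source="text" dest="text_rev"/>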
Unexplained leader initiated recovery after updates
Hi all,

I am experiencing a problem where Solr nodes go into recovery following an update cycle.

Examination of the logs indicates that the recovery is initiated by the shard leader while processing regular update events, because the replica is unreachable. For example, the following is recorded in the leader's log file:

...
2014-12-15 05:14:03.285 [qtp2092193830-400307] INFO org.apache.solr.cloud.ZkController Put replica core=listings coreNodeName=solr12:8983_solr_listings on solr12:8983_solr into leader-initiated recovery.
2014-12-15 05:14:03.285 [qtp2092193830-400307] WARN org.apache.solr.cloud.ZkController Leader is publishing core=listings coreNodeName=solr12:8983_solr_listings state=down on behalf of un-reachable replica http://solr12:8983/solr/listings/; forcePublishState? false
2014-12-15 05:14:03.287 [zkCallback-2-thread-20] INFO org.apache.solr.cloud.DistributedQueue LatchChildWatcher fired on path: /overseer/queue state: SyncConnected type NodeChildrenChanged
...

However, when I check, I cannot detect any connectivity problems between the leader and the replica.

About 40% of the time, the nodes recover without any intervention in 4 or 5 minutes. The remaining 60% of the time, the recovering node reports a java.lang.OutOfMemoryError and Solr needs to be restarted.

For background, here are some details about our configuration:

* Solr 4.10.2 (problem also observed with Solr 4.6.1)
* 12 shards with 2 nodes per shard
* a single updater running in a separate subnet is posting updates using the SolrJ CloudSolrServer client; updates are triggered hourly
* system is under continuous query load
* autoCommit is set to 821 seconds
* autoSoftCommit is set to 303 seconds

I cannot correlate these recovery events to an increase in update or query load. The query traffic does not appear to be affected by any transient connectivity issues. The only clear pattern is that these recovery events happen after an updater run, while the cluster is busy processing the updates.

Can you suggest where to look to figure out why these recovery events are occurring?

Thanks,

Lindsay Martin
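For context, the commit settings Lindsay lists would typically look something like this in solrconfig.xml (the openSearcher=false setting is an assumption shown for illustration; the thread does not say how it is actually configured):

<autoCommit>
  <!-- 821 seconds -->
  <maxTime>821000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <!-- 303 seconds -->
  <maxTime>303000</maxTime>
</autoSoftCommit>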
Re: Unexplained leader initiated recovery after updates
On 1/9/2015 4:54 PM, Lindsay Martin wrote:
> I am experiencing a problem where Solr nodes go into recovery following an
> update cycle.
> For background, here are some details about our configuration:
> * Solr 4.10.2 (problem also observed with Solr 4.6.1)
> * 12 shards with 2 nodes per shard
> * a single updater running in a separate subnet is posting updates using the
> SolrJ CloudSolrServer client. Updates are triggered hourly.
> * system is under continuous query load
> * autoCommit is set to 821 seconds
> * autoSoftCommit is set to 303 seconds

I would suspect some kind of performance problem that likely results in the zkClientTimeout expiring. I have a standard set of questions for performance problems.

Questions about zookeeper:

How many ZK nodes? Is zookeeper on separate hardware? If it's on the same hardware as Solr, is its database on the same disk spindles as the Solr index, or separate spindles? Is zookeeper standalone or embedded in Solr? If it's standalone, do you happen to know the java max heap for the zookeeper processes?

Questions about Solr and the hardware:

How many total Solr servers? How much RAM is installed on each one? What is the max size of the Java heap? Are you running more than one Solr (JVM/container) instance per machine? If you add up all the "index" directories on a server, how much disk space does it take? Is the amount of disk space used similar on all of the servers?

Thanks,
Shawn
Re: Garbage Collection tuning - G1 is now a good option
On 1/1/2015 12:10 PM, Shawn Heisey wrote:
> I've been working with Oracle employees to find better GC tuning
> options. The results are good enough to share with the community:
>
> https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning
>
> With the latest Java 7 or Java 8 version, and a couple of tuning
> options, G1GC has grown up enough to be a viable choice. Two of the
> settings on that list were critical for making the performance
> acceptable with my testing: ParallelRefProcEnabled and G1HeapRegionSize.
>
> I've included some notes on the wiki about how you can size the G1 heap
> regions appropriately for your own index.

A note was just recently added to the Lucene wiki, saying that the G1 collector should never be used with Lucene, because there are bugs that might cause index corruption. Solr uses Lucene, so this would apply.

https://wiki.apache.org/lucene-java/JavaBugs#Oracle_Java_.2F_Sun_Java_.2F_OpenJDK_Bugs

I have never had any problems with it, but perhaps I spoke too soon when recommending G1. Does anyone else have anything to add?

Thanks,
Shawn
Re: Garbage Collection tuning - G1 is now a good option
It looks like 32 bit is affected. > On 2013-08-14 08:27, Dawid Weiss wrote: >> >> Hi everyone, >> >> I am a committer to the Lucene/Solr project. We've recently hit what >> we believe is a JIT/GC bug -- it manifests itself only when G1GC is >> used, on a 32-bit VM: On Fri, Jan 9, 2015 at 7:10 PM, Shawn Heisey wrote: > On 1/1/2015 12:10 PM, Shawn Heisey wrote: > > I've been working with Oracle employees to find better GC tuning > > options. The results are good enough to share with the community: > > > > https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning > > > > With the latest Java 7 or Java 8 version, and a couple of tuning > > options, G1GC has grown up enough to be a viable choice. Two of the > > settings on that list were critical for making the performance > > acceptable with my testing: ParallelRefProcEnabled and G1HeapRegionSize. > > > > I've included some notes on the wiki about how you can size the G1 heap > > regions appropriately for your own index. > > A note was just recently added to the Lucene wiki, saying that the G1 > collector should never be used with Lucene, because there are bugs that > might cause index corruption. Solr uses Lucene, so this would apply. > > > https://wiki.apache.org/lucene-java/JavaBugs#Oracle_Java_.2F_Sun_Java_.2F_OpenJDK_Bugs > > I have never had any problems with it, but perhaps I spoke too soon when > recommending G1. Does anyone else have anything to add? > > Thanks, > Shawn > > -- Bill Bell billnb...@gmail.com cell 720-256-8076
Re: How does text-rev work?
So, Query Parser does some sort of magic and looks for the field with the same name and _rev suffix? But what populates that field? In the example schema, it seems to be standalone and empty. Is there a copyField missing? Regards, Alex. Sign up for my Solr resources newsletter at http://www.solr-start.com/ On 9 January 2015 at 17:07, Jack Krupansky wrote: > Or a Jira to document it. > > The basic idea is that if a normal leading wildcard is too slow, the user > can index a copy of their text fields using the text_rev type, which > indexes terms with their characters reversed and with a special marker. > Then the query parser detects a leading wildcard and that the field type > uses the reversed wildcard filter, and then it generates a wildcard query > that using the reversed query token and wildcard pattern so that the > leading wildcard becomes a trailing wildcard or prefix query > > > -- Jack Krupansky > > On Fri, Jan 9, 2015 at 3:15 PM, Alexandre Rafalovitch > wrote: > >> Anybody? Otherwise, I guess it is a JIRA to delete the unused field? >> >> Regards, >>Alex. >> >> Sign up for my Solr resources newsletter at http://www.solr-start.com/ >> >> >> On 28 December 2014 at 13:16, Alexandre Rafalovitch >> wrote: >> > I am looking at the collection1/techproducts schema and I can't figure >> > out how the reversed wildcard example is supposed to work. >> > >> > We define text_general_rev type and text_rev field, but we don't seem >> > to be populating it at any point. And running the example does not >> > seem to show any tokens in the field even when the non-inverted text >> > field does have some. >> > >> > Apparently, there is some magic in the QueryParser to do something >> > about this at query time, but I see no explanation of what is supposed >> > to happen at the index/schema time. >> > >> > Anybody has the skinny on this one? >> > >> > Regards, >> >Alex. >> > >> > Sign up for my Solr resources newsletter at http://www.solr-start.com/ >>