Re: Suggester - how to return exact match?

2013-11-21 Thread Mirko
Hi,
I'd like to clarify our use case a bit more.

We want to return the exact search query as a suggestion only if it is
present in the index. So in my example we would expect to get the
suggestion "foo" for the query "foo" but no suggestion "abc" for the query
"abc" (because "abc" is not in the dictionary).

To me, this use case seems quite common. Say we have three products in our
store: "foo", "foo 1", "foo 2". If the user types "foo" in the product
search, we want to suggest all our products in the dropdown.

Is this something we can do with the Solr suggester?
Mirko


2013/11/20 Developer 

> Maybe there is a way to do this, but it doesn't make sense to return the
> same search query as a suggestion (the search query is not a suggestion,
> as it might or might not be present in the index).
>
> AFAIK you can use various lookup algorithms to get the suggestion list, and
> they look up terms based on the query value (some algorithms implement
> fuzzy logic too). So searching for Foo will return FooBar, Foo2 but not foo.
>
> You should fetch the suggestion only if numFound is greater than 0;
> otherwise you don't have any suggestion.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Suggester-how-to-return-exact-match-tp4102203p4102259.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4

2013-11-21 Thread RadhaJayalakshmi
Hi,

I am using Solr 4.4 with ZooKeeper 3.3.5. While I was checking the error
conditions of my application, I came across a strange issue. Here is what I
tried. I have three fields defined in my schema:

a) UNIQUE_KEY - of type solr.TrieLong
b) empId - of type solr.TrieLong
c) companyId - of type solr.TrieLong

How am I indexing: I am indexing using the SolrJ API, and the data for the
indexing is in a text file, delimited by the | symbol. My indexer Java
program reads the text file line by line, splits the data by the | symbol,
creates a SolrInputDocument object for every line of the file, and adds the
fields with the values it read from the file.

Now, intentionally, in the data file I put String values (instead of long
values) for the unique key, something like:

123AB|111|222

When I index this data, I get the exception below:

org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request: [URL of my application]
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Server at [URL of my application] returned non ok status:500, message:Internal Server Error
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)

But when I correct the unique key field data and instead give string data
for the other two long fields, I get a different exception:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
ERROR: [error stating the field name for which the type is mismatching]
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)

My question: during indexing, if Solr finds that the field type declared in
the schema does not match the data being given for any field, it should
raise the same type of exception. But in the case above, if it finds a
mismatch for the unique key it raises SolrServerException, while for all
other fields it raises RemoteSolrException (which is an unchecked
exception). Is this a bug in Solr, or is there a reason for throwing
different exceptions in the two cases?

Thanks
Radha




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrServerException-while-adding-an-invalid-UNIQUE-KEY-in-solr-4-4-tp4102346.html
Sent from the Solr - User mailing list archive at Nabble.com.

Best implementation for multi-price store?

2013-11-21 Thread Alejandro Marqués Rodríguez
Hi,

I've recently been asked to implement an application to search products from
several stores, each store having different prices and stock for the same
product.

So I have products that have the usual fields (name, description, brand,
etc.) and also the number of units and the price for each store. I must be
able to filter for a given store and order by stock or price for that store.
The application should also allow increasing the number of stores,
store-dependent fields, and number of products without much work.

The numbers for the application are more or less 100 stores and 7M products.

I've been thinking of some ways of defining the index structure, but I don't
know which one is better, as I think each one has its pros and cons.


   1. *Each product-store as a document:* Denormalizing the information so
   that for every product and store I have a different document. Pros: I can
   filter and order without problems, and adding a new store-dependent field
   is very easy. Cons: the index goes from 7M documents to 700M, and most of
   the info is redundant, as most of the fields are repeated among stores.
   2. *Each field-store as a field:* For example, for price I would have
   "store1_price, store2_price, ...". Pros: the index stays at 7M documents,
   and I can still filter and sort by those fields. Cons: I have to add some
   logic so that if I filter by one store I order by the associated price
   field, and the number of fields grows as the number of store-dependent
   fields times the number of stores. I don't know if having more fields
   affects performance, but adding new store-dependent fields will increase
   the number of fields even more.
   3. *Join:* When I first read about Solr joins I thought they were the way
   to go in this case, but after reading a bit more and doing some tests I'm
   not so sure about it... Maybe I've done it wrong, but I think it also
   denormalizes the info (so I would also have 700M documents), and besides I
   can't order or filter by store fields.


I must say my preferred option is number 2, so I don't duplicate
information, I keep a relatively small number of documents and I can filter
and sort by the store fields. However, my main concern here is I don't know
if having too many fields in a document will be harmful to performance.

Which one do you think is the best approach for this application? Is there
a better approach that I have missed?

Thanks in advance



-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Parse eDisMax queries for keywords

2013-11-21 Thread Mirko
Hi,
We would like to implement special handling for queries that contain
certain keywords. Our particular use case:

In the example query "Footitle season 1" we want to discover the keyword
"season", get the subsequent number, and boost (or filter for) documents
that match "1" on the field name="season".
We have two fields in our schema, "title" and "season". [The field
definitions were stripped by the list archive.]
Our idea was to use a Keyword tokenizer and a Regex on the "season" field
to extract the season number from the complete query.
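
A sketch of the kind of field we had in mind (reconstructed for
illustration only, since the archive stripped the XML; the type and
filter are assumptions based on the description above):

<fieldType name="season_extract" class="solr.TextField">
  <analyzer>
    <!-- keep the whole query string as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- reduce that token to the number following "season" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern=".*season (\d+).*" replacement="$1"/>
  </analyzer>
</fieldType>
<field name="season" type="season_extract" indexed="true" stored="true"/>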

However, we use the ExtendedDisMax query parser in our search handler. Most
of the handler XML was stripped by the archive; the surviving values are:

    <str name="defType">edismax</str>
    <str name="qf">title season</str>

The problem is that the eDisMax tokenizes the query, so that our field
"season" receives the tokens ["Foo", "season", "1"] without any order,
instead of the complete query.

How can we pass the complete query (untokenized) to the season field? We
don't understand which tokenizer is used here and why our "season" field
received tokens instead of the complete query.

Or is there another approach to solve this use case with Solr?

Thanks,
Mirko


Re: facet method=enum and uninvertedfield limitations

2013-11-21 Thread Dmitry Kan
What is the actual target speed you are pursuing? Is this for user
suggestions or something of that sort? Content-based suggestions with
faceting, especially on Solr 1.4, won't be lightning fast.

Have you looked at TermsComponent?
http://wiki.apache.org/solr/TermsComponent
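
For example, something like this (a sketch; the handler path and field
name are illustrative):

http://localhost:8983/solr/terms?terms.fl=CONTENT&terms.prefix=a&terms.limit=10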

By shingles, which in the rest of the world are more commonly called
ngrams, I meant a way of "compressing" the number of entities to iterate
through. Let's say you only store bigrams or trigrams and facet based on
those (fewer in number).

Dmitry


On Wed, Nov 20, 2013 at 6:10 PM, Lemke, Michael SZ/HZA-ZSW <
lemke...@schaeffler.com> wrote:

> On Wednesday, November 20, 2013 7:37 AM, Dmitry Kan wrote:
>
> Thanks for your reply.
>
> >
> >Since you are faceting on a text field (is this correct?) you deal with a
> >lot of unique values in it.
>
> Yes, this is a text field and we experimented with reducing the index.  As
> I said in my original question the stripped down index had 178,000 terms
> and it (fc) still didn't work.  Is number of terms the relevant quantity?
>
> >So your best bet is enum method.
>
> Hm, yes, that works but I have to wait 4 minutes for the answer (with the
> original data).  Not good.
>
> >Also if you
> >are on solr 4x try building doc values in the index: this suits faceting
> >well.
>
> We are on Solr 1.4, so, no.
>
> >
> >Otherwise start from your spec once again. Can you use shingles instead?
>
> Possibly but I don't know shingles.  Although I'd prefer to use our
> original
> index we are trying to build a specialized index just for this sort of
> query but still don't know what to look for.
>
> A query like
>
>
>  
> q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0
>
> would give me the top ten results containing 'word' and something starting
> with 'a'.  That's what I want.  An empty facet.prefix should also work.
> Eventually, the query will be more complex containing other fields and
> filter queries but the basic function should be exactly like this.  How
> can we achieve this?
>
> Thanks,
> Michael
>
>
> >On 19 Nov 2013 17:44, "Lemke, Michael SZ/HZA-ZSW" <
> lemke...@schaeffler.com>
> >wrote:
> >
> >> On Friday, November 15, 2013 11:22 AM, Lemke, Michael SZ/HZA-ZSW wrote:
> >>
> >> Judging from numerous replies this seems to be a tough question.
> >> Nevertheless, I'd really appreciate any help as we are stuck.
> >> We'd really like to know what in our index causes the facet.method=fc
> >> query to fail.
> >>
> >> Thanks,
> >> Michael
> >>
> >> >On Thu, November 14, 2013 7:26 PM, Yonik Seeley wrote:
> >> >>On Thu, Nov 14, 2013 at 12:03 PM, Lemke, Michael  SZ/HZA-ZSW
> >> >> wrote:
> >> >>> I am running into performance problems with faceted queries.
> >> >>> If I do a
> >> >>>
> >> >>>
> >>
> q=word&facet.field=CONTENT&facet=true&facet.limit=10&facet.mincount=1&facet.method=fc&facet.prefix=a&rows=0
> >> >>>
> >> >>> I am getting an exception:
> >> >>> org.apache.solr.common.SolrException: Too many values for
> >> UnInvertedField faceting on field CONTENT
> >> >>> at
> >>
> org.apache.solr.request.UnInvertedField.uninvert(UnInvertedField.java:384)
> >> >>> at
> >>
> org.apache.solr.request.UnInvertedField.<init>(UnInvertedField.java:178)
> >> >>> at
> >>
> org.apache.solr.request.UnInvertedField.getUnInvertedField(UnInvertedField.java:839)
> >> >>> ...
> >> >>>
> >> >>> I understand it's got something to do with a 24bit limit somewhere
> >> >>> in the code but I don't understand enough of it to be able to
> construct
> >> >>> a specialized index that can be queried with facet.method=enum.
> >> >>
> >> >>You shouldn't need to do anything differently to try facet.method=enum
> >> >>(just replace facet.method=fc with facet.method=enum)
> >> >
> >> >This is true and facet.method=enum does work indeed.  The problem is
> >> >runtime.  In particular queries with an empty facet.prefix= run many
> >> >seconds if not minutes.  I initially asked about this here:
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201310.mbox/%3c33ec3398272fbe47b64ee3b3e98f69a761427...@de011521.schaeffler.com%3E
> >> >
> >> >It was suggested that fc is much faster than enum and I'd like to
> >> >test that.  We are still fairly free to design the index such that
> >> >it performs well.  But to do that we need to understand what is
> >> >killing it.
> >> >
> >> >>
> >> >>You may also want to add the parameter
> >> >>facet.enum.cache.minDf=100000
> >> >>to lower memory usage by only using the filter cache for terms that
> >> >>match more than 100K docs.
> >> >
> >> >That helped a little, cut down my particular test from 10 sec to 5 sec.
> >> >But still too slow.  Mind you this is for an autosuggest feature.
> >> >
> >> >Thanks for your reply.
> >> >
> >> >Michael
> >> >
> >> >
> >>
> >>
>
>


-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: twitter.com/dmitrykan


RE: Best implementation for multi-price store?

2013-11-21 Thread Petersen, Robert
Hi,

I'd go with (2) also but using dynamic fields so you don't have to define all 
the storeX_price fields in your schema but rather just one *_price field.  Then 
when you filter on store:store1 you'd know to sort with store1_price and so 
forth for units.  That should be pretty straightforward.
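
A sketch of what that could look like (type names as in the stock example
schema; the query is illustrative):

<dynamicField name="*_price" type="tfloat" indexed="true" stored="true"/>
<dynamicField name="*_units" type="tint" indexed="true" stored="true"/>

q=*:*&fq=store:store1&sort=store1_price asc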

Hope that helps,
Robi

-Original Message-
From: Alejandro Marqués Rodríguez [mailto:amarq...@paradigmatecnologico.com] 
Sent: Thursday, November 21, 2013 1:36 AM
To: solr-user@lucene.apache.org
Subject: Best implementation for multi-price store?

Hi,

I've recently been asked to implement an application to search products from
several stores, each store having different prices and stock for the same
product.

So I have products that have the usual fields (name, description, brand,
etc.) and also the number of units and the price for each store. I must be
able to filter for a given store and order by stock or price for that store.
The application should also allow increasing the number of stores,
store-dependent fields, and number of products without much work.

The numbers for the application are more or less 100 stores and 7M products.

I've been thinking of some ways of defining the index structure, but I don't
know which one is better, as I think each one has its pros and cons.


   1. *Each product-store as a document:* Denormalizing the information so
   that for every product and store I have a different document. Pros: I can
   filter and order without problems, and adding a new store-dependent field
   is very easy. Cons: the index goes from 7M documents to 700M, and most of
   the info is redundant, as most of the fields are repeated among stores.
   2. *Each field-store as a field:* For example, for price I would have
   "store1_price, store2_price, ...". Pros: the index stays at 7M documents,
   and I can still filter and sort by those fields. Cons: I have to add some
   logic so that if I filter by one store I order by the associated price
   field, and the number of fields grows as the number of store-dependent
   fields times the number of stores. I don't know if having more fields
   affects performance, but adding new store-dependent fields will increase
   the number of fields even more.
   3. *Join:* When I first read about Solr joins I thought they were the way
   to go in this case, but after reading a bit more and doing some tests I'm
   not so sure about it... Maybe I've done it wrong, but I think it also
   denormalizes the info (so I would also have 700M documents), and besides I
   can't order or filter by store fields.


I must say my preferred option is number 2, so I don't duplicate information, I 
keep a relatively small number of documents and I can filter and sort by the 
store fields. However, my main concern here is I don't know if having too many 
fields in a document will be harmful to performance.

Which one do you think is the best approach for this application? Is there a 
better approach that I have missed?

Thanks in advance



--
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42



Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread Reyes, Mark
Hi all:

I’m currently on a Solr 4.5.0 instance and running this tutorial, 
http://lucene.apache.org/solr/4_5_0/tutorial.html

My question is specific to indexing data as proposed from this tutorial,

$ java -jar post.jar solr.xml monitor.xml

The tutorial advises to validate from your localhost,
http://localhost:8983/solr/collection1/select?q=solr&wt=xml

However, what if my Solr instance has both a collection1 and a collection2,
and I want the XML files to be posted only to collection2?

If possible, please advise.

Thanks,
Mark

IMPORTANT NOTICE: This e-mail message is intended to be received only by 
persons entitled to receive the confidential information it may contain. E-mail 
messages sent from Bridgepoint Education may contain information that is 
confidential and may be legally privileged. Please do not read, copy, forward 
or store this message unless you are an intended recipient of it. If you 
received this transmission in error, please notify the sender by reply e-mail 
and delete the message and any attachments.

Re: Facet field query on subset of documents

2013-11-21 Thread Luis Lebolo
Hi Erick,

Thanks for the reply and sorry, my fault, wasn't clear enough. I was
wondering if there was a way to remove terms that would always be zero
(because the term came from a document that didn't match the filter query).

Here's an example. I have a bunch of documents with fields 'manufacturer'
and 'location'. If I set my filter query to "manufacturer = Sony" and none
of the Sony documents have a value of 'Florida' for location, then I want
'Florida' NOT to show up in my facet field results. Instead, it shows up
with a count of zero (and it will always be zero because of my filter query).

Using mincount = 1 doesn't solve my problem because I don't want it to hide
zeroes that came from documents that actually pass my filter query.

Does that make more sense?


On Thu, Nov 21, 2013 at 4:36 PM, Erick Erickson wrote:

> That's what faceting does. The facets are only tabulated
> for documents that satisfy the query, including all of
> the filter queries and any other criteria.
>
> Otherwise, facet counts would be the same no matter
> what the query was.
>
> Or I'm completely misunderstanding your question...
>
> Best,
> Erick
>
>
> On Thu, Nov 21, 2013 at 4:22 PM, Luis Lebolo 
> wrote:
>
> > Hi All,
> >
> > Is it possible to perform a facet field query on a subset of documents
> (the
> > subset being defined via a filter query for instance)?
> >
> > I understand that facet pivoting might work, but it would require that
> the
> > subset be defined by some field hierarchy, e.g. manufacturer -> price
> (then
> > only look at the results for the manufacturer I'm interested in).
> >
> > What if I wanted to define a more complex subset (where the name starts
> > with A but ends with Z and some other field is greater than 5 and yet
> > another field is not 'x', etc.)?
> >
> > Ideally I would then define a "facet field constraining query" to include
> > only terms from documents that pass this query.
> >
> > Thanks,
> > Luis
> >
>


Periodic Slowness on Solr Cloud

2013-11-21 Thread Dave Seltzer
I'm doing some performance testing against an 8-node Solr cloud cluster,
and I'm noticing some periodic slowness.


http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png

I'm doing random test searches against an Alias Collection made up of four
smaller (monthly) collections. Like this:

MasterCollection
|- Collection201308
|- Collection201309
|- Collection201310
|- Collection201311

The last collection is constantly updated. New documents are being added at
the rate of about 3 documents per second.

I believe the slowness may be due to NRT, but I'm not sure. How should I
investigate this?

If the slowness is related to NRT, how can I alleviate the issue without
disabling NRT?

Thanks Much!

-Dave


RE: search with wildcard

2013-11-21 Thread Scott Schneider
I know it's documented that Lucene/Solr doesn't apply filters to queries with 
wildcards, but this seems to trip up a lot of users.  I can also see why 
wildcards break a number of filters, but some filters (e.g. mapping 
charsets) could mostly or entirely work.  The N-gram filter is another one that 
would be great to still run when there are wildcards.  If you indexed 4-grams and 
the query is "*testp*", you currently won't get any results; but the N-gram 
filter could have a wildcard mode that, in this case, would return just the 
first 4-gram as a token.

Is this something you've considered?  It would have to be enabled in the core 
code, but disabled by default for existing filters; then it could be enabled 
1-by-1 for existing filters.  Apologies if the dev list is a better place for 
this.

Scott


> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Thursday, November 21, 2013 8:40 AM
> To: solr-user@lucene.apache.org
> Subject: Re: search with wildcard
> 
> Hi Andreas,
> 
> If you don't want to use wildcards at query time, an alternative is to
> use NGrams at indexing time. This will produce a lot of tokens; e.g.
> the 4-grams of your example: Supertestplan => supe uper pert
> erte rtes *test* estp stpl tpla plan
> 
> 
> Is that what you want? By the way, why do you want to search inside of words?
> 
> <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="4"/>
> 
> 
> 
> 
> On Thursday, November 21, 2013 5:23 PM, Andreas Owen 
> wrote:
> 
> I suppose I have to create another field with different tokenizers and
> set the boost very low so it doesn't really mess with my ranking,
> because the word is then in 2 fields. What kind of tokenizer can do
> the job?
> 
> 
> 
> From: Andreas Owen [mailto:a...@conx.ch]
> Sent: Donnerstag, 21. November 2013 16:13
> To: solr-user@lucene.apache.org
> Subject: search with wildcard
> 
> 
> 
> I am querying "test" in solr 4.3.1 over the field below and it's not
> finding all occurrences. It seems that if it is a substring of a word like
> "Supertestplan" it isn't found unless I use a wildcard: "*test*". This
> is right because of my tokenizer, but does someone know a way around this?
> I don't want to add wildcards because that messes up queries with
> multiple words.
> 
> 
> 
> <fieldType ... positionIncrementGap="100">
>   <analyzer>
>     [tokenizer and other filters stripped by the archive]
>     <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt"
>             format="snowball" enablePositionIncrements="true"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
>   </analyzer>
> </fieldType>


Re: Facet field query on subset of documents

2013-11-21 Thread Erick Erickson
That's what faceting does. The facets are only tabulated
for documents that satisfy the query, including all of
the filter queries and any other criteria.

Otherwise, facet counts would be the same no matter
what the query was.

Or I'm completely misunderstanding your question...

Best,
Erick


On Thu, Nov 21, 2013 at 4:22 PM, Luis Lebolo  wrote:

> Hi All,
>
> Is it possible to perform a facet field query on a subset of documents (the
> subset being defined via a filter query for instance)?
>
> I understand that facet pivoting might work, but it would require that the
> subset be defined by some field hierarchy, e.g. manufacturer -> price (then
> only look at the results for the manufacturer I'm interested in).
>
> What if I wanted to define a more complex subset (where the name starts
> with A but ends with Z and some other field is greater than 5 and yet
> another field is not 'x', etc.)?
>
> Ideally I would then define a "facet field constraining query" to include
> only terms from documents that pass this query.
>
> Thanks,
> Luis
>


Re: Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread xiezhide

add -Durl=http://localhost:8983/solr/collection2/update when running post.jar
(Sent from 189 Mail)

"Reyes, Mark"  wrote:

>Hi all:
>
>I’m currently on a Solr 4.5.0 instance and running this tutorial, 
>http://lucene.apache.org/solr/4_5_0/tutorial.html
>
>My question is specific to indexing data as proposed from this tutorial,
>
>$ java -jar post.jar solr.xml monitor.xml
>
>The tutorial advises to validate from your localhost,
>http://localhost:8983/solr/collection1/select?q=solr&wt=xml
>
>However, what if my Solr instance has both a collection1 and a collection2,
>and I want the XML files to be posted only to collection2?
>
>If possible, please advise.
>
>Thanks,
>Mark
>

search with wildcard

2013-11-21 Thread Andreas Owen
I am querying "test" in solr 4.3.1 over the field below and it's not finding
all occurrences. It seems that if it is a substring of a word like
"Supertestplan" it isn't found unless I use a wildcard: "*test*". This is
right because of my tokenizer, but does someone know a way around this? I
don't want to add wildcards because that messes up queries with multiple
words.

<fieldType ... positionIncrementGap="100">
  <analyzer>
    [tokenizer and other filters stripped by the archive]
    <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt"
            format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Erick Erickson
How real time is NRT? In particular, what are your commit settings?
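
(Those settings live in solrconfig.xml; a sketch with illustrative values,
not a recommendation:)

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>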

And can you characterize "periodic slowness"? Queries that usually
take 500ms now take 10s? Or 1s? How often? How are you measuring?

Details matter, a lot...

Best,
Erick




On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer  wrote:

> I'm doing some performance testing against an 8-node Solr cloud cluster,
> and I'm noticing some periodic slowness.
>
>
> http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png
>
> I'm doing random test searches against an Alias Collection made up of four
> smaller (monthly) collections. Like this:
>
> MasterCollection
> |- Collection201308
> |- Collection201309
> |- Collection201310
> |- Collection201311
>
> The last collection is constantly updated. New documents are being added at
> the rate of about 3 documents per second.
>
> I believe the slowness may be due to NRT, but I'm not sure. How should I
> investigate this?
>
> If the slowness is related to NRT, how can I alleviate the issue without
> disabling NRT?
>
> Thanks Much!
>
> -Dave
>


Multiple similarity scores for the same text field

2013-11-21 Thread Nikos Voskarides
I have the following simplified setting:
My schema contains one text field, named "text".
When I perform a query, I need to get the scores for the same text field
but for different similarity functions (e.g. TFIDF, BM25..) and combine
them externally using different weights.
An obvious way to achieve this is to keep multiple copies of the text field
in the schema for each similarity. I am wondering though whether there is a
more space-efficient way of doing this.

Thanks,

Nikos


Re: Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread Erick Erickson
you're leaving off the - in front of the D,
-Durl.

Try java -jar post.jar -help for a list of options available
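
In full, for the collection from the original question:

$ java -Durl=http://localhost:8983/solr/collection2/update -jar post.jar solr.xml monitor.xml

Note that -Durl is a JVM system property, so it must come before -jar.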


On Thu, Nov 21, 2013 at 12:04 PM, Reyes, Mark  wrote:

> So then,
> $ java -jar post.jar Durl=http://localhost:8983/solr/collection2/update
> solr.xml monitor.xml
>
>
>
>
>
> On 11/21/13, 8:14 AM, "xiezhide"  wrote:
>
> >
> >add Durl=http://localhost:8983/solr/collection2/update when run post.jar,
> >此邮件发送自189邮箱
> >
> >"Reyes, Mark"  wrote:
> >
> >>Hi all:
> >>
> >>I’m currently on a Solr 4.5.0 instance and running this tutorial,
> >>http://lucene.apache.org/solr/4_5_0/tutorial.html
> >>
> >>My question is specific to indexing data as proposed from this tutorial,
> >>
> >>$ java -jar post.jar solr.xml monitor.xml
> >>
> >>The tutorial advises to validate from your localhost,
> >>http://localhost:8983/solr/collection1/select?q=solr&wt=xml
> >>
> >>However, what if my Solr instance has both a collection1 and a
> >>collection2, and I want the XML files to be posted only to collection2?
> >>
> >>If possible, please advise.
> >>
> >>Thanks,
> >>Mark
> >>
>


Re: Suggester - how to return exact match?

2013-11-21 Thread Developer
Might not be a perfect solution, but you can use an EdgeNGram filter, copy
all your field data to that field, and use it for suggestions.

[field type and copyField definitions stripped by the list archive]
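
A sketch of the kind of definition meant here (reconstructed for
illustration; the type name and gram sizes are assumptions):

<fieldType name="text_edge" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index every prefix of the name so "iphone" matches "iphone5c" -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>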

http://localhost:8983/solr/core1/select?q=name:iphone

The above query will return 
iphone
iphone5c
iphone4g



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suggester-how-to-return-exact-match-tp4102203p4102521.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Mark Miller
Yes, more details…

Solr version, which garbage collector, how does heap usage look, cpu, etc.

- Mark

On Nov 21, 2013, at 6:46 PM, Erick Erickson  wrote:

> How real time is NRT? In particular, what are your commit settings?
> 
> And can you characterize "periodic slowness"? Queries that usually
> take 500ms now take 10s? Or 1s? How often? How are you measuring?
> 
> Details matter, a lot...
> 
> Best,
> Erick
> 
> 
> 
> 
> On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer  wrote:
> 
>> I'm doing some performance testing against an 8-node Solr cloud cluster,
>> and I'm noticing some periodic slowness.
>> 
>> 
>> http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png
>> 
>> I'm doing random test searches against an Alias Collection made up of four
>> smaller (monthly) collections. Like this:
>> 
>> MasterCollection
>> |- Collection201308
>> |- Collection201309
>> |- Collection201310
>> |- Collection201311
>> 
>> The last collection is constantly updated. New documents are being added at
>> the rate of about 3 documents per second.
>> 
>> I believe the slowness may be due to NRT, but I'm not sure. How should I
>> investigate this?
>> 
>> If the slowness is related to NRT, how can I alleviate the issue without
>> disabling NRT?
>> 
>> Thanks Much!
>> 
>> -Dave
>> 



Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4

2013-11-21 Thread Shawn Heisey

On 11/21/2013 1:57 AM, RadhaJayalakshmi wrote:

Hi,

I am using Solr 4.4 with ZooKeeper 3.3.5. While I was checking the error
conditions of my application, I came across a strange issue. Here is what I
tried. I have three fields defined in my schema:

a) UNIQUE_KEY - of type solr.TrieLong
b) empId - of type solr.TrieLong
c) companyId - of type solr.TrieLong

How am I indexing: I am indexing using the SolrJ API, and the data for the
indexing is in a text file, delimited by the | symbol. My indexer Java
program reads the text file line by line, splits the data by the | symbol,
creates a SolrInputDocument object for every line of the file, and adds the
fields with the values it read from the file.

Now, intentionally, in the data file I put String values (instead of long
values) for the unique key, something like:

123AB|111|222

When I index this data, I get the exception below:

org.apache.solr.client.solrj.SolrServerException: No live SolrServers
available to handle this request: [URL of my application]
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:333)
at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
Server at [URL of my application] returned non ok status:500, message:Internal Server Error
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:385)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)

But when I correct the unique key field data and instead give string data
for the other two long fields, I get a different exception:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
ERROR: [error stating the field name for which the type is mismatching]
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:264)
at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:318)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:117)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)

My question: during indexing, if Solr finds that the field type declared in
the schema does not match the data being given for any field, it should
raise the same type of exception. But in the case above, if it finds a
mismatch for the unique key it raises SolrServerException, while for all
other fields it raises RemoteSolrException (which is an unchecked
exception). Is this a bug in Solr, or is there a reason for throwing
different exceptions in the two cases?

Thanks
Radha



The first exception is an error thrown directly from SolrJ.  It was 
unable to find any server to deal with the request, so it threw its own 
SolrServerException wrapping the last RemoteSolrException (HTTP error 
500) it received.


The second exception happened in a different place.  In this case, the 
request made it past the server-side uniqueKey handling code and into 
the code that handles other fields, which from what I can see here 
returns a different error message and possibly a different HTTP code.  
Because it was different, SolrJ sent the RemoteSolrException up the 
chain to your application rather than catching and wrapping it in 
SolrServerException.


I am not surprised to hear that you get a different error for invalid 
data in the uniqueKey field than you do in other fields. Because of its 
nature, it must be handled in a different code path.


Thanks,
Shawn



Re: Split shard and stream sub-shards to remote nodes?

2013-11-21 Thread Otis Gospodnetic
Hi,

On Wed, Nov 20, 2013 at 12:53 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> At the Lucene level, I think it would require a directory
> implementation which writes to a remote node directly. Otherwise, on
> the solr side, we must move the leader itself to another node which
> has enough disk space and then split the index.
>

Hmm, what about taking the source shard, splitting it, and sending the docs
that come out of each sub-shard to a remote node at the Solr level, as if
these documents were just being added (i.e. nothing at the Lucene level)?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/




>
> On Wed, Nov 20, 2013 at 8:37 PM, Otis Gospodnetic
>  wrote:
> > Do you think this is something that is actually implementable?  If so,
> > I'll open an issue.
> >
> > One use-case where this may come in handy is when the disk space is
> > tight.  If a shard is using > 50% of the disk space on some node X,
> > you can't really split that shard because the 2 new sub-shards will
> > not fit on the local disk.  Or is there some trick one could use in
> > this situation?
> >
> > Thanks,
> > Otis
> > --
> > Performance Monitoring * Log Analytics * Search Analytics
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> > On Wed, Nov 20, 2013 at 6:48 AM, Shalin Shekhar Mangar
> >  wrote:
> >> No, it is not supported yet. We can't split to a remote node directly.
> >> The best bet is trigger a new leader election by unloading the leader
> >> node once all replicas are active.
> >>
> >> On Wed, Nov 20, 2013 at 1:32 AM, Otis Gospodnetic
> >>  wrote:
> >>> Hi,
> >>>
> >>> Is it possible to perform a shard split and stream data for the
> >>> new/sub-shards to remote nodes, avoiding persistence of new/sub-shards
> >>> on the local/source node first?
> >>>
> >>> Thanks,
> >>> Otis
> >>> --
> >>> Performance Monitoring * Log Analytics * Search Analytics
> >>> Solr & Elasticsearch Support * http://sematext.com/
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Walter Underwood
And this is the exact problem. Some characters are stored as entities, some are 
not. When it is time to display, what else needs to be escaped? At a minimum, you 
would have to always store "&" as "&amp;" to avoid escaping the leading ampersand 
in the entities.

You could store every single character as a numeric entity. Or you could store 
every non-ASCII character as a numeric entity. Or every non-Latin1 character. 
Plus ampersand, of course.

In these e-mails, we are distinguishing between ™ and &#8482;. How would you do 
that? By storing "&#8482;" as "&amp;#8482;".

To avoid all this double-think, always store text as Unicode code points, 
encoded with a standard Unicode method (UTF-8, etc.).

When displaying, only make entities if the codepoints cannot be represented in 
the target character encoding. If you are sending things in US-ASCII, you will 
be sending lots of entities.

A good encoding library has callbacks for characters that cannot be 
represented. You can use these callbacks to format out-of-charset codepoints as 
entities. I've done this in product code, it really works.

Finally, if you don't believe me, believe the XML Infoset, where numeric 
entities are always interpreted as Unicode codepoints.

The other way to go insane is storing local time in the database. Always store 
UTC and convert at the edges.

wunder

On Nov 21, 2013, at 7:50 AM, "Jack Krupansky"  wrote:

> "Would you store "a" as "&#65;" ?"
> 
> No, not in any case.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Michael Sokolov
> Sent: Thursday, November 21, 2013 8:56 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
> 
> I have to agree w/Walter.  Use unicode as a storage format.  The entity
> encodings are for transfer/interchange.  Encode/decode on the way in and
> out if you have to.  Would you store "a" as "&#65;" ?  It makes it
> impossible to search for, for one thing.  What if someone wants to
> search for the TM character?
> 
> -Mike
> 
> On 11/20/13 12:07 PM, Jack Krupansky wrote:
>> AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format for 
>> storing text to be rendered. If you disagree - try explaining yourself.
>> 
>> But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
>> entities.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Walter Underwood
>> Sent: Wednesday, November 20, 2013 11:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>> 
>> Again, I'd like to know why this is wanted. It sounds like an X-Y problem. 
>> Storing Unicode characters as XML/HTML encoded character references is an 
>> extremely bad idea.
>> 
>> wunder
>> 
>> On Nov 20, 2013, at 5:01 AM, "Jack Krupansky"  
>> wrote:
>> 
>>> Any analysis filtering affects the indexed value only, but the stored value 
>>> would be unchanged from the original input value. An update processor lets 
>>> you modify the original input value that will be stored.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Uwe Reh
>>> Sent: Wednesday, November 20, 2013 5:43 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>>> 
>>> What about having a simple charfilter in the analyzer chain for
>>> indexing *and* searching? E.g.
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™" replacement="&#8482;"/>
>>> or
>>> <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>
>>> 
>>> Uwe
>>> 
>>> On 19.11.2013 23:46, Developer wrote:
>>>> I have a data coming in to SOLR as below.
>>>>
>>>> X™ - Black
>>>>
>>>> I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)
>>>> in SOLR rather than storing the original value.
>>>>
>>>> Is there a way to do this?
>>> 
>> 
>> -- 
>> Walter Underwood
>> wun...@wunderwood.org
>> 
>> 
> 

--
Walter Underwood
wun...@wunderwood.org





Facet field query on subset of documents

2013-11-21 Thread Luis Lebolo
Hi All,

Is it possible to perform a facet field query on a subset of documents (the
subset being defined via a filter query for instance)?

I understand that facet pivoting might work, but it would require that the
subset be defined by some field hierarchy, e.g. manufacturer -> price (then
only look at the results for the manufacturer I'm interested in).

What if I wanted to define a more complex subset (where the name starts
with A but ends with Z and some other field is greater than 5 and yet
another field is not 'x', etc.)?

Ideally I would then define a "facet field constraining query" to include
only terms from documents that pass this query.

Thanks,
Luis


Re: Parse eDisMax queries for keywords

2013-11-21 Thread Jack Krupansky
The query parser does its own tokenization and parsing before your analyzer 
tokenizer and filters are called, ensuring that only one whitespace-delimited 
token is analyzed at a time.


You're probably best off having an application layer preprocessor for the 
query that "enriches" the query in the manner that you're describing.


Or, simply settle for a "heuristic" approach that may give you 70% of what 
you want using only existing Solr features on the server side.


-- Jack Krupansky

-Original Message- 
From: Mirko

Sent: Thursday, November 21, 2013 5:30 AM
To: solr-user@lucene.apache.org
Subject: Parse eDisMax queries for keywords

Hi,
We would like to implement special handling for queries that contain
certain keywords. Our particular use case:

In the example query "Footitle season 1" we want to discover the keyword
"season", get the subsequent number, and boost (or filter for) documents
that match "1" on the field name="season".

We have two fields in our schema, "title" and "season". [The field
definitions were stripped by the list archive.]
Our idea was to use a Keyword tokenizer and a Regex on the "season" field
to extract the season number from the complete query.

However, we use the ExtendedDisMax query parser in our search handler. Most
of the handler XML was stripped by the archive; the surviving values are:

    <str name="defType">edismax</str>
    <str name="qf">title season</str>


The problem is that the eDisMax tokenizes the query, so that our field
"season" receives the tokens ["Foo", "season", "1"] without any order,
instead of the complete query.

How can we pass the complete query (untokenized) to the season field? We
don't understand which tokenizer is used here and why our "season" field
received tokens instead of the complete query.

Or is there another approach to solve this use case with Solr?

Thanks,
Mirko 



Re: How to retain the original format of input document in search results in SOLR - Tomcat

2013-11-21 Thread Erick Erickson
Solr (actually Lucene) stores the input _exactly_ as it is entered, and
returns it the same way.

What you're seeing is almost certainly your display mechanism interpreting
the results,
whitespace is notoriously variable in terms of how it's displayed by various
interpretations of the "standard". For instance, HTML often just eats
whitespace.




On Thu, Nov 21, 2013 at 1:33 AM, ramesh py  wrote:

> Hi All,
>
>
>
> I am new to Apache Solr. Recently I was able to configure Solr with
> Tomcat successfully, and it's working fine except for the format of the
> search results, i.e., the search results are not displayed in the same
> format as the input document.
>
>
>
> I am doing the below things
>
>
>
> 1. Indexing the xml file into Solr
>
> 2. Format of the xml as below [tags stripped by the archive]:
>
> some text
> Title1: descriptions of the title
> Title2: description of the title2
> Title3: description of title3
> some text
>
>
>
> 3. After indexing, the results are displayed in the below format.
>
>
>
> *F1 : *some text
>
> *F2*: Title1: descriptions of the title Title2 : description of the title2
> Title3 : description of title3
>
> *F3*: some text
>
>
>
> *Expected Result :*
>
>
>
> *F1 : *some text
>
> *F2*: Title1: descriptions of the title
>
>   Title2 : description of the title2
>
>   Title3 : description of title3
>
> *F3*: some text
>
>
>
>
>
> If we look at the F2 field, the format has changed, i.e., the input format
> of the F2 field is line by line for each subtitle, but in the result it is
> displayed as a single line.
>
>
>
>
>
> I would like the result displayed so that whenever any subtitle occurs in
> the xml file for any field, that subtitle is displayed on its own line in
> the results.
>
>
>
> Can anyone please help on this. Thanks in advance.
>
>
>
>
>
> Regards,
>
> Ramesh p.y
>
> --
> Ramesh P.Y
> pyrames...@gmail.com
> Mobile No:+91-9176361984
>


Re: search with wildcard

2013-11-21 Thread Ahmet Arslan
Hi Andreas,

If you don't want to use wildcards at query time, an alternative is to use 
NGrams at indexing time. This will produce a lot of tokens; e.g. the 4-grams 
of your example: Supertestplan => supe uper pert erte rtes 
*test* estp stpl tpla plan


Is that what you want? By the way, why do you want to search inside of words?

<filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="4"/>

On Thursday, November 21, 2013 5:23 PM, Andreas Owen  wrote:
 
I suppose I have to create another field with different tokenizers and set
the boost very low so it doesn't really mess with my ranking, because the
word is then in 2 fields. What kind of tokenizer can do the job?



From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Donnerstag, 21. November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard



I am querying "test" in solr 4.3.1 over the field below and it's not finding
all occurrences. It seems that if it is a substring of a word like
"Supertestplan" it isn't found unless I use a wildcard: "*test*". This is
right because of my tokenizer, but does someone know a way around this? I
don't want to add wildcards because that messes up queries with multiple
words.

<fieldType ... positionIncrementGap="100">
  <analyzer>
    [tokenizer and other filters stripped by the archive]
    <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt"
            format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Michael Sokolov
OK - probably I should have said "&#65;", or "&#97;" :)  My point was just 
that there is not really anything special about "special" characters.


On 11/21/2013 10:50 AM, Jack Krupansky wrote:

"Would you store "a" as "&#65;" ?"

No, not in any case.

-- Jack Krupansky

-Original Message- From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter.  Use unicode as a storage format.  The entity
encodings are for transfer/interchange.  Encode/decode on the way in and
out if you have to.  Would you store "a" as "&#65;" ?  It makes it
impossible to search for, for one thing.  What if someone wants to
search for the TM character?

-Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:
AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a 
format for storing text to be rendered. If you disagree - try 
explaining yourself.


But maybe TM should be encoded as "&trade;". Ditto for other named 
SGML entities.


-- Jack Krupansky

-Original Message- From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y 
problem. Storing Unicode characters as XML/HTML encoded character 
references is an extremely bad idea.


wunder

On Nov 20, 2013, at 5:01 AM, "Jack Krupansky" 
 wrote:


Any analysis filtering affects the indexed value only, but the 
stored value would be unchanged from the original input value. An 
update processor lets you modify the original input value that will 
be stored.


-- Jack Krupansky

-Original Message- From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer chain for
indexing *and* searching? E.g.
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="™" replacement="&#8482;"/>
or
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-specials.txt"/>
Uwe

On 19.11.2013 23:46, Developer wrote:

I have a data coming in to SOLR as below.

X™ - Black

I need to store the HTML Entity (decimal) equivalent value (i.e. &#8482;)

in SOLR rather than storing the original value.

Is there a way to do this?




--
Walter Underwood
wun...@wunderwood.org









RE: search with wildcard

2013-11-21 Thread Andreas Owen
I suppose I have to create another field with different tokenizers and set
the boost very low so it doesn't really mess with my ranking, because the
word is then in 2 fields. What kind of tokenizer can do the job?

 

From: Andreas Owen [mailto:a...@conx.ch] 
Sent: Donnerstag, 21. November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard

 

I am querying "test" in solr 4.3.1 over the field below and it's not finding
all occurrences. It seems that if it is a substring of a word like
"Supertestplan" it isn't found unless I use a wildcard: "*test*". This is
right because of my tokenizer, but does someone know a way around this? I
don't want to add wildcards because that messes up queries with multiple
words.

<fieldType ... positionIncrementGap="100">
  <analyzer>
    [tokenizer and other filters stripped by the archive]
    <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt"
            format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

Re: search with wildcard

2013-11-21 Thread Jack Krupansky
You might be able to make use of the dictionary compound word filter, but 
you will have to build up a dictionary of words to use:


http://lucene.apache.org/core/4_5_1/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html

My e-book has some examples and a better description.

-- Jack Krupansky

-Original Message- 
From: Ahmet Arslan

Sent: Thursday, November 21, 2013 11:40 AM
To: solr-user@lucene.apache.org
Subject: Re: search with wildcard

Hi Andreas,

If you don't want to use wildcards at query time, an alternative is to use 
NGrams at indexing time. This will produce a lot of tokens; e.g. the 4-grams 
of your example: Supertestplan => supe uper pert erte 
rtes *test* estp stpl tpla plan



Is that what you want? By the way, why do you want to search inside of words?

<filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="4"/>
On Thursday, November 21, 2013 5:23 PM, Andreas Owen  wrote:

I suppose I have to create another field with different tokenizers and set
the boost very low so it doesn't really mess with my ranking, because the
word is then in 2 fields. What kind of tokenizer can do the job?



From: Andreas Owen [mailto:a...@conx.ch]
Sent: Donnerstag, 21. November 2013 16:13
To: solr-user@lucene.apache.org
Subject: search with wildcard



I am querying "test" in solr 4.3.1 over the field below and it's not finding
all occurrences. It seems that if it is a substring of a word like
"Supertestplan" it isn't found unless I use a wildcard: "*test*". This is
right because of my tokenizer, but does someone know a way around this? I
don't want to add wildcards because that messes up queries with multiple
words.

<fieldType ... positionIncrementGap="100">
  <analyzer>
    [tokenizer and other filters stripped by the archive]
    <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt"
            format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>





Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Dave Seltzer
Lots of questions. Okay.

In digging a little deeper and looking at the config I see that
<nrtMode>true</nrtMode> is commented out.  I believe this is the default
setting. So I don't know if NRT is enabled or not. Maybe it's just a red herring.

I don't know what Garbage Collector we're using. In this test I'm running
Solr 4.5.1 using Jetty from the example directory.

The CPU on the 8 nodes all stay around 70% use during the test. The nodes
have 28GB of RAM. Java is using about 6GB and the rest is being used by OS
cache.

To perform the test we're running 200 concurrent threads in JMeter. The
threads hit HAProxy which loadbalances the requests among the nodes. Each
query is for a random word out of a list of about 10,000 words. Some of the
queries have faceting turned on.

Because we're heavily loading the system the queries are returning quite
slowly. For a simple search, the average response time was 300ms. The peak
response time was 11,000ms. The spikes in latency seem to occur about every
2.5 minutes.

I haven't spent that much time messing with SolrConfig, so most of the
settings are the out-of-the-box defaults.

Where should I start to look?

Thanks so much!

-Dave





On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller  wrote:

> Yes, more details…
>
> Solr version, which garbage collector, how does heap usage look, cpu, etc.
>
> - Mark
>
> On Nov 21, 2013, at 6:46 PM, Erick Erickson 
> wrote:
>
> > How real time is NRT? In particular, what are your commit settings?
> >
> > And can you characterize "periodic slowness"? Queries that usually
> > take 500ms now take 10s? Or 1s? How often? How are you measuring?
> >
> > Details matter, a lot...
> >
> > Best,
> > Erick
> >
> >
> >
> >
> > On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer 
> wrote:
> >
> >> I'm doing some performance testing against an 8-node Solr cloud cluster,
> >> and I'm noticing some periodic slowness.
> >>
> >>
> >> http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png
> >>
> >> I'm doing random test searches against an Alias Collection made up of
> four
> >> smaller (monthly) collections. Like this:
> >>
> >> MasterCollection
> >> |- Collection201308
> >> |- Collection201309
> >> |- Collection201310
> >> |- Collection201311
> >>
> >> The last collection is constantly updated. New documents are being
> added at
> >> the rate of about 3 documents per second.
> >>
> >> I believe the slowness may be due to NRT, but I'm not sure. How should I
> >> investigate this?
> >>
> >> If the slowness is related to NRT, how can I alleviate the issue without
> >> disabling NRT?
> >>
> >> Thanks Much!
> >>
> >> -Dave
> >>


Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Jack Krupansky

"there is not really anything special about "special" characters"

Well, the distinction was about "named entities", which are indeed special.

Besides, in general, for more sophisticated text processing, character 
"types" are a valid distinction.


But all of this brings us back to the original question: "I need to store
the HTML Entity (decimal) equivalent value (i.e. &#8482;) in SOLR rather
than storing the original value."


Maybe the original poster could clarify the nature of their need.

-- Jack Krupansky

-Original Message- 
From: Michael Sokolov

Sent: Thursday, November 21, 2013 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

OK - probably I should have said "A", or "a" :)  My point was just
that there is not really anything special about "special" characters.

On 11/21/2013 10:50 AM, Jack Krupansky wrote:

"Would you store "a" as "A" ?"

No, not in any case.

-- Jack Krupansky

-Original Message- From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter.  Use unicode as a storage format.  The entity
encodings are for transfer/interchange.  Encode/decode on the way in and
out if you have to.  Would you store "a" as "A" ?  It makes it
impossible to search for, for one thing.  What if someone wants to
search for the TM character?

-Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:
AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format 
for storing text to be rendered. If you disagree - try explaining 
yourself.


But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
entities.


-- Jack Krupansky

-Original Message- From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y, 
problem. Storing Unicode characters as XML/HTML encoded character 
references is an extremely bad idea.


wunder

On Nov 20, 2013, at 5:01 AM, "Jack Krupansky"  
wrote:


Any analysis filtering affects the indexed value only, but the stored 
value would be unchanged from the original input value. An update 
processor lets you modify the original input value that will be stored.


-- Jack Krupansky

-Original Message- From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer chain for
indexing *and* searching? E.g. [two charFilter examples stripped by the mail
archive]


Uwe
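
A sketch of what the stripped examples might have looked like (the pattern
and replacement are illustrative); note that a char filter rewrites the text
that gets tokenized and indexed, not the stored value:

  <analyzer>
    <!-- rewrite the literal TM character to its decimal entity before tokenizing -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="™" replacement="&amp;#8482;"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>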

On 19.11.2013 23:46, Developer wrote:

I have data coming in to SOLR as below.

X™ - Black

I need to store the HTML Entity (decimal) equivalent value (i.e. 
&#8482;)

in SOLR rather than storing the original value.

Is there a way to do this?




--
Walter Underwood
wun...@wunderwood.org









Re: confirm subscribe to solr-user@lucene.apache.org

2013-11-21 Thread Paule LECUYER

I confirm



How to implement a conditional copyField working for partial updates ?

2013-11-21 Thread Paule LECUYER


Hello,

I'm using Solr 4.x. In my solr schema I have the following fields defined :

  [field definitions stripped by the mail archive: a stored, multiValued
  "content" field plus several non-stored, multiValued language-specific
  fields with termVectors enabled]
  ...


To fill in the language-specific fields, I use a custom update  
processor chain, with a custom ConditionalCopyProcessor that copies the  
"content" field into the appropriate language field, depending on document  
language (as explained in  
http://wiki.apache.org/solr/UpdateRequestProcessor).
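
For context, a chain like that is wired up in solrconfig.xml roughly as
follows; the factory class name here is hypothetical, standing in for the
custom ConditionalCopyProcessor:

  <updateRequestProcessorChain name="langcopy" default="true">
    <!-- hypothetical custom factory: copies "content" into the language field -->
    <processor class="com.example.ConditionalCopyProcessorFactory">
      <str name="source">content</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>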


The problem is that this custom chain is applied to the update request  
document, so it works all right when inserting a new document or updating  
the whole document, but I lose the language-specific fields when I do a  
partial update (as those fields are not stored, and the request document  
contains only the updated fields).


I would rather avoid setting the language-specific fields to stored="true",  
as the "content" field may hold big values.


Is there a way to have solr execute my ConditionalCopyProcessor on the  
actual updated doc (the one resulting from solr retrieving all stored  
values and merging with update request values), and not on the request  
doc?


Thanks a lot for your help.

P. Lecuyer



Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Jack Krupansky
Ah... now I understand your perspective - you have taken a narrow view of 
what "text" is. A broader view is that it can contain formatting and special 
"entities" as well, or rich text in general. My "read" is that it all 
depends on the nature of the application and its requirements, not a "one 
size fits all" approach. The four main approaches are pure ASCII, 
Unicode/UTF-8, SGML entities for non-ASCII characters, and full HTML for 
formatting and rich text. And the app's needs determine which is most 
appropriate for each piece of text.


The goal of SGML and HTML is not to hard-wire the final presentation, but 
simply to preserve some level of source format and structure, and then apply 
final presentation formatting on top of that.


Some apps may opt to store the same information in multiple formats, such as 
one for raw text search, one for basic display, and one for "detail" 
display.


I'm more of a "platform" guy than an "app-specific" guy - give the app 
developer tools that they can blend to meet their own requirements (or 
interests or tastes.)


But Solr users should make no mistake, SGML entities are a perfectly valid 
intermediate format for rich text.


-- Jack Krupansky

-Original Message- 
From: Walter Underwood

Sent: Thursday, November 21, 2013 11:44 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

And this is the exact problem. Some characters are stored as entities, some 
are not. When it is time to display, what else needs escaped? At a minimum, 
you would have to always store & as & to avoid escaping the leading 
ampersand in the entities.


You could store every single character as a numeric entity. Or you could 
store every non-ASCII character as a numeric entity. Or every non-Latin1 
character. Plus ampersand, of course.


In these e-mails, we are distinguishing between ™ and &#8482;. How would you 
do that? By storing "&#8482;" as "&amp;#8482;".


To avoid all this double-think, always store text as Unicode code points, 
encoded with a standard Unicode method (UTF-8, etc.).


When displaying, only make entities if the codepoints cannot be represented 
in the target character encoding. If you are sending things in US-ASCII, you 
will be sending lots of entities.


A good encoding library has callbacks for characters that cannot be 
represented. You can use these callbacks to format out-of-charset codepoints 
as entities. I've done this in product code, it really works.
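
As an illustration of that callback pattern, a self-contained sketch (class
and method names are made up) that emits numeric character references only
for code points the target charset cannot encode:

  import java.nio.charset.Charset;
  import java.nio.charset.CharsetEncoder;

  public class EntityEscaper {
      // Escape only what the target charset cannot represent; real code
      // would also escape a literal '&' as "&amp;" to keep output unambiguous.
      static String escape(String s, String charsetName) {
          CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
          StringBuilder out = new StringBuilder();
          for (int i = 0; i < s.length(); ) {
              int cp = s.codePointAt(i);
              String ch = new String(Character.toChars(cp));
              out.append(enc.canEncode(ch) ? ch : "&#" + cp + ";");
              i += Character.charCount(cp);
          }
          return out.toString();
      }

      public static void main(String[] args) {
          System.out.println(escape("X\u2122 - Black", "US-ASCII")); // X&#8482; - Black
          System.out.println(escape("X\u2122 - Black", "UTF-8"));    // X™ - Black
      }
  }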


Finally, if you don't believe me, believe the XML Infoset, where numeric 
entities are always interpreted as Unicode codepoints.


The other way to go insane is storing local time in the database. Always 
store UTC and convert at the edges.


wunder

On Nov 21, 2013, at 7:50 AM, "Jack Krupansky"  
wrote:



"Would you store "a" as "A" ?"

No, not in any case.

-- Jack Krupansky

-Original Message- From: Michael Sokolov
Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter.  Use unicode as a storage format.  The entity
encodings are for transfer/interchange.  Encode/decode on the way in and
out if you have to.  Would you store "a" as "A" ?  It makes it
impossible to search for, for one thing.  What if someone wants to
search for the TM character?

-Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:
AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format 
for storing text to be rendered. If you disagree - try explaining 
yourself.


But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
entities.


-- Jack Krupansky

-Original Message- From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y, 
problem. Storing Unicode characters as XML/HTML encoded character 
references is an extremely bad idea.


wunder

On Nov 20, 2013, at 5:01 AM, "Jack Krupansky"  
wrote:


Any analysis filtering affects the indexed value only, but the stored 
value would be unchanged from the original input value. An update 
processor lets you modify the original input value that will be stored.


-- Jack Krupansky

-Original Message- From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer chain for
indexing *and* searching? E.g. [two charFilter examples stripped by the mail
archive]


Uwe

On 19.11.2013 23:46, Developer wrote:

I have data coming in to SOLR as below.

X™ - Black

I need to store the HTML Entity (decimal) equivalent value (i.e. 
&#8482;)

in SOLR rather than storing the original value.

Is there a way to do this?




--
Walter Underwood
wun...@wunderwood.org






--
Walter Underwood
wun...@wunderwood.org





Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Michael Sokolov
I have to agree w/Walter.  Use unicode as a storage format.  The entity 
encodings are for transfer/interchange.  Encode/decode on the way in and 
out if you have to.  Would you store "a" as "A" ?  It makes it 
impossible to search for, for one thing.  What if someone wants to 
search for the TM character?


-Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:
AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format 
for storing text to be rendered. If you disagree - try explaining 
yourself.


But maybe TM should be encoded as "&trade;". Ditto for other named 
SGML entities.


-- Jack Krupansky

-Original Message- From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y, 
problem. Storing Unicode characters as XML/HTML encoded character 
references is an extremely bad idea.


wunder

On Nov 20, 2013, at 5:01 AM, "Jack Krupansky" 
 wrote:


Any analysis filtering affects the indexed value only, but the stored 
value would be unchanged from the original input value. An update 
processor lets you modify the original input value that will be stored.


-- Jack Krupansky

-Original Message- From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer chain for
indexing *and* searching? E.g. [two charFilter examples stripped by the mail
archive]


Uwe

On 19.11.2013 23:46, Developer wrote:

I have data coming in to SOLR as below.

X™ - Black

I need to store the HTML Entity (decimal) equivalent value (i.e. 
&#8482;)

in SOLR rather than storing the original value.

Is there a way to do this?




--
Walter Underwood
wun...@wunderwood.org







Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Walter Underwood
I know all about formatted text -- I worked at MarkLogic. That is why I 
mentioned the XML Infoset.

Numeric entities are part of the final presentation, really, part of the 
encoding. They should never be stored. Always store the Unicode.

Numeric and named entities are a convenience for tools and encodings that can't 
handle Unicode. That is all they are.

wunder

On Nov 21, 2013, at 9:02 AM, "Jack Krupansky"  wrote:

> Ah... now I understand your perspective - you have taken a narrow view of 
> what "text" is. A broader view is that it can contain formatting and special 
> "entities" as well, or rich text in general. My "read" is that it all depends 
> on the nature of the application and its requirements, not a "one size fits 
> all" approach. The four main approaches being pure ASCII, Unicode/UTF-8, SGML 
> for non-ASCII characters, and full HTML for formatting and rich text. And let 
> the app needs determine which is most appropriate for each piece of text.
> 
> The goal of SGML and HTML is not to hard-wire the final presentation, but 
> simply to preserve some level of source format and structure, and then apply 
> final presentation formatting on top of that.
> 
> Some apps may opt to store the same information in multiple formats, such as 
> one for raw text search, one for basic display, and one for "detail" display.
> 
> I'm more of a "platform" guy than an "app-specific" guy - give the app 
> developer tools that they can blend to meet their own requirements (or 
> interests or tastes.)
> 
> But Solr users should make no mistake, SGML entities are a perfectly valid 
> intermediate format for rich text.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Walter Underwood
> Sent: Thursday, November 21, 2013 11:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
> 
> And this is the exact problem. Some characters are stored as entities, some 
> are not. When it is time to display, what else needs escaping? At a minimum, 
> you would have to always store & as &amp; to avoid escaping the leading 
> ampersand in the entities.
> 
> You could store every single character as a numeric entity. Or you could 
> store every non-ASCII character as a numeric entity. Or every non-Latin1 
> character. Plus ampersand, of course.
> 
> In these e-mails, we are distinguishing between ™ and &#8482;. How would you 
> do that? By storing "&#8482;" as "&amp;#8482;".
> 
> To avoid all this double-think, always store text as Unicode code points, 
> encoded with a standard Unicode method (UTF-8, etc.).
> 
> When displaying, only make entities if the codepoints cannot be represented 
> in the target character encoding. If you are sending things in US-ASCII, you 
> will be sending lots of entities.
> 
> A good encoding library has callbacks for characters that cannot be 
> represented. You can use these callbacks to format out-of-charset codepoints 
> as entities. I've done this in product code, it really works.
> 
> Finally, if you don't believe me, believe the XML Infoset, where numeric 
> entities are always interpreted as Unicode codepoints.
> 
> The other way to go insane is storing local time in the database. Always 
> store UTC and convert at the edges.
> 
> wunder
> 
> On Nov 21, 2013, at 7:50 AM, "Jack Krupansky"  wrote:
> 
>> "Would you store "a" as "A" ?"
>> 
>> No, not in any case.
>> 
>> -- Jack Krupansky
>> 
>> -Original Message- From: Michael Sokolov
>> Sent: Thursday, November 21, 2013 8:56 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>> 
>> I have to agree w/Walter.  Use unicode as a storage format.  The entity
>> encodings are for transfer/interchange.  Encode/decode on the way in and
>> out if you have to.  Would you store "a" as "A" ?  It makes it
>> impossible to search for, for one thing.  What if someone wants to
>> search for the TM character?
>> 
>> -Mike
>> 
>> On 11/20/13 12:07 PM, Jack Krupansky wrote:
>>> AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format for 
>>> storing text to be rendered. If you disagree - try explaining yourself.
>>> 
>>> But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
>>> entities.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -Original Message- From: Walter Underwood
>>> Sent: Wednesday, November 20, 2013 11:21 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: How to index X™ as &#8482; (HTML decimal entity)
>>> 
>>> Again, I'd like to know why this is wanted. It sounds like an X-Y, problem. 
>>> Storing Unicode characters as XML/HTML encoded character references is an 
>>> extremely bad idea.
>>> 
>>> wunder
>>> 
>>> On Nov 20, 2013, at 5:01 AM, "Jack Krupansky"  
>>> wrote:
>>> 
 Any analysis filtering affects the indexed value only, but the stored 
 value would be unchanged from the original input value. An update 
 processor lets you modify the original input value that will be stored.
 
 -- Jack Krupansky

Re: Indexing data to a specific collection in Solr 4.5.0

2013-11-21 Thread Reyes, Mark
So then,
$ java -Durl=http://localhost:8983/solr/collection2/update -jar post.jar
solr.xml monitor.xml





On 11/21/13, 8:14 AM, "xiezhide"  wrote:

>
>add -Durl=http://localhost:8983/solr/collection2/update when running post.jar
>This email was sent from 189 Mail
>
>"Reyes, Mark"  wrote:
>
>>Hi all:
>>
>>I’m currently on a Solr 4.5.0 instance and running this tutorial,
>>http://lucene.apache.org/solr/4_5_0/tutorial.html
>>
>>My question is specific to indexing data as proposed from this tutorial,
>>
>>$ java -jar post.jar solr.xml monitor.xml
>>
>>The tutorial advises to validate from your localhost,
>>http://localhost:8983/solr/collection1/select?q=solr&wt=xml
>>
>>However, what if my Solr core has both a collection1 and collection2,
>>yet I desire the XML files to only be posted to collection2 only?
>>
>>If possible, please advise.
>>
>>Thanks,
>>Mark
>>



Re: How to index X™ as &#8482; (HTML decimal entity)

2013-11-21 Thread Jack Krupansky

"Would you store "a" as "A" ?"

No, not in any case.

-- Jack Krupansky

-Original Message- 
From: Michael Sokolov

Sent: Thursday, November 21, 2013 8:56 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

I have to agree w/Walter.  Use unicode as a storage format.  The entity
encodings are for transfer/interchange.  Encode/decode on the way in and
out if you have to.  Would you store "a" as "A" ?  It makes it
impossible to search for, for one thing.  What if someone wants to
search for the TM character?

-Mike

On 11/20/13 12:07 PM, Jack Krupansky wrote:
AFAICT, it's not an "extremely bad idea" - using SGML/HTML as a format for 
storing text to be rendered. If you disagree - try explaining yourself.


But maybe TM should be encoded as "&trade;". Ditto for other named SGML 
entities.


-- Jack Krupansky

-Original Message- From: Walter Underwood
Sent: Wednesday, November 20, 2013 11:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

Again, I'd like to know why this is wanted. It sounds like an X-Y, 
problem. Storing Unicode characters as XML/HTML encoded character 
references is an extremely bad idea.


wunder

On Nov 20, 2013, at 5:01 AM, "Jack Krupansky"  
wrote:


Any analysis filtering affects the indexed value only, but the stored 
value would be unchanged from the original input value. An update 
processor lets you modify the original input value that will be stored.


-- Jack Krupansky

-Original Message- From: Uwe Reh
Sent: Wednesday, November 20, 2013 5:43 AM
To: solr-user@lucene.apache.org
Subject: Re: How to index X™ as &#8482; (HTML decimal entity)

What about having a simple charfilter in the analyzer chain for
indexing *and* searching? E.g. [two charFilter examples stripped by the mail
archive]


Uwe

On 19.11.2013 23:46, Developer wrote:

I have data coming in to SOLR as below.

X™ - Black

I need to store the HTML Entity (decimal) equivalent value (i.e. 
&#8482;)

in SOLR rather than storing the original value.

Is there a way to do this?




--
Walter Underwood
wun...@wunderwood.org







RE: Periodic Slowness on Solr Cloud

2013-11-21 Thread Doug Turnbull
Dave you might want to connect JVisualVm and see if there's any pattern
with latency and garbage collection. That's a frequent culprit for
periodic hits in latency.

More info here
http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html

There's a couple GC implementations in Java that can be tuned as needed

With JvisualVM You can also add the mbeans plugin to get a ton of
performance stats out of Solr that might help debug latency issues.

Doug

Sent from my Windows Phone From: Dave Seltzer
Sent: 11/21/2013 8:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Periodic Slowness on Solr Cloud
Lots of questions. Okay.

In digging a little deeper and looking at the config I see that
<nrtMode>true</nrtMode> is commented out.  I believe this is the default
setting. So I don't know if NRT is enabled or not. Maybe just a red herring.

I don't know what Garbage Collector we're using. In this test I'm running
Solr 4.5.1 using Jetty from the example directory.

The CPU on the 8 nodes all stay around 70% use during the test. The nodes
have 28GB of RAM. Java is using about 6GB and the rest is being used by OS
cache.

To perform the test we're running 200 concurrent threads in JMeter. The
threads hit HAProxy which loadbalances the requests among the nodes. Each
query is for a random word out of a list of about 10,000 words. Some of the
queries have faceting turned on.

Because we're heavily loading the system the queries are returning quite
slowly. For a simple search, the average response time was 300ms. The peak
response time was 11,000ms. The spikes in latency seem to occur about every
2.5 minutes.

I haven't spent that much time messing with SolrConfig, so most of the
settings are the out-of-the-box defaults.

Where should I start to look?

Thanks so much!

-Dave





On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller  wrote:

> Yes, more details…
>
> Solr version, which garbage collector, how does heap usage look, cpu, etc.
>
> - Mark
>
> On Nov 21, 2013, at 6:46 PM, Erick Erickson 
> wrote:
>
> > How real time is NRT? In particular, what are you commit settings?
> >
> > And can you characterize "periodic slowness"? Queries that usually
> > take 500ms not tail 10s? Or 1s? How often? How are you measuring?
> >
> > Details matter, a lot...
> >
> > Best,
> > Erick
> >
> >
> >
> >
> > On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer 
> wrote:
> >
> >> I'm doing some performance testing against an 8-node Solr cloud cluster,
> >> and I'm noticing some periodic slowness.
> >>
> >>
> >> http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png
> >>
> >> I'm doing random test searches against an Alias Collection made up of
> four
> >> smaller (monthly) collections. Like this:
> >>
> >> MasterCollection
> >> |- Collection201308
> >> |- Collection201309
> >> |- Collection201310
> >> |- Collection201311
> >>
> >> The last collection is constantly updated. New documents are being
> added at
> >> the rate of about 3 documents per second.
> >>
> >> I believe the slowness may be due to NRT, but I'm not sure. How should I
> >> investigate this?
> >>
> >> If the slowness is related to NRT, how can I alleviate the issue without
> >> disabling NRT?
> >>
> >> Thanks Much!
> >>
> >> -Dave
> >>


a function query of time, frequency and score.

2013-11-21 Thread sling
Hi, guys.

I indexed 1000 documents, which have fields like title, ptime and frequency.

The title is a text field, the ptime is a date field, and the frequency is an
int field.
The frequency field fluctuates: sometimes its value is 0, and sometimes its
value is 999.

Now, in my app, the query works well with a function query. The function
query is implemented as the score multiplied by a decreasing date-weight
array.

However, I have no idea how to add the frequency to this formula...

so could someone give me a clue?

Thanks again!

sling
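
One possible shape for this -- a sketch using the field names from the post,
not a tested formula: wrap the query in a boost that multiplies the recency
weight by a damped frequency term. log(sum(frequency,1)) flattens the 0-999
swings, and the outer sum(1,...) keeps frequency=0 from zeroing the score:

  q={!boost b=$recencyFreq v=$qq}
  qq=title:something
  recencyFreq=product(recip(ms(NOW,ptime),3.16e-11,1,1),sum(1,log(sum(frequency,1))))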



--
View this message in context: 
http://lucene.472066.n3.nabble.com/a-function-query-of-time-frequency-and-score-tp4102531.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Doug Turnbull
Additional info on GC selection
http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#available_collectors

> If response time is more important than overall throughput and garbage
collection pauses must be kept shorter than approximately one second, then
select the concurrent collector with -XX:+UseConcMarkSweepGC. If only one
or two processors are available, consider using incremental mode, described
below.

I'm not entirely certain of the implications of GC tuning for SolrCloud. I
imagine distributed searching is going to be as slow as the slowest core
being queried.

I'd also be curious as to the root-cause of any excess GC churn. It sounds
like you're doing a ton of random queries. This probably creates a lot of
evictions in your caches. There's nothing really worth caching, so the caches
fill up and empty frequently, causing a lot of heap activity. If you expect
to have high-load and a ton of turnover in queries, then tuning down cache
size might help minimize GC churn.
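
As a sketch, the knobs for that live in solrconfig.xml (the sizes are
illustrative); a small autowarmCount also keeps post-commit warming cheap:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>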

Solr Meter is another great tool for your perf testing that can help get at
some of these caching issues. It gives you some higher-level stats about
cache eviction, etc.
https://code.google.com/p/solrmeter/

-Doug



On Thu, Nov 21, 2013 at 10:24 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Dave you might want to connect JVisualVm and see if there's any pattern
> with latency and garbage collection. That's a frequent culprit for
> periodic hits in latency.
>
> More info here
>
> http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html
>
> There's a couple GC implementations in Java that can be tuned as needed
>
> With JvisualVM You can also add the mbeans plugin to get a ton of
> performance stats out of Solr that might help debug latency issues.
>
> Doug
>
> Sent from my Windows Phone From: Dave Seltzer
> Sent: 11/21/2013 8:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Periodic Slowness on Solr Cloud
> Lots of questions. Okay.
>
> In digging a little deeper and looking at the config I see that
> <nrtMode>true</nrtMode> is commented out.  I believe this is the default
> setting. So I don't know if NRT is enabled or not. Maybe just a red
> herring.
>
> I don't know what Garbage Collector we're using. In this test I'm running
> Solr 4.5.1 using Jetty from the example directory.
>
> The CPU on the 8 nodes all stay around 70% use during the test. The nodes
> have 28GB of RAM. Java is using about 6GB and the rest is being used by OS
> cache.
>
> To perform the test we're running 200 concurrent threads in JMeter. The
> threads hit HAProxy which loadbalances the requests among the nodes. Each
> query is for a random word out of a list of about 10,000 words. Some of the
> queries have faceting turned on.
>
> Because we're heavily loading the system the queries are returning quite
> slowly. For a simple search, the average response time was 300ms. The peak
> response time was 11,000ms. The spikes in latency seem to occur about every
> 2.5 minutes.
>
> I haven't spent that much time messing with SolrConfig, so most of the
> settings are the out-of-the-box defaults.
>
> Where should I start to look?
>
> Thanks so much!
>
> -Dave
>
>
>
>
>
> On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller 
> wrote:
>
> > Yes, more details…
> >
> > Solr version, which garbage collector, how does heap usage look, cpu,
> etc.
> >
> > - Mark
> >
> > On Nov 21, 2013, at 6:46 PM, Erick Erickson 
> > wrote:
> >
> > > How real time is NRT? In particular, what are you commit settings?
> > >
> > > And can you characterize "periodic slowness"? Queries that usually
> > > take 500ms not tail 10s? Or 1s? How often? How are you measuring?
> > >
> > > Details matter, a lot...
> > >
> > > Best,
> > > Erick
> > >
> > >
> > >
> > >
> > > On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer 
> > wrote:
> > >
> > >> I'm doing some performance testing against an 8-node Solr cloud
> cluster,
> > >> and I'm noticing some periodic slowness.
> > >>
> > >>
> > >> http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png
> > >>
> > >> I'm doing random test searches against an Alias Collection made up of
> > four
> > >> smaller (monthly) collections. Like this:
> > >>
> > >> MasterCollection
> > >> |- Collection201308
> > >> |- Collection201309
> > >> |- Collection201310
> > >> |- Collection201311
> > >>
> > >> The last collection is constantly updated. New documents are being
> > added at
> > >> the rate of about 3 documents per second.
> > >>
> > >> I believe the slowness may be due to NRT, but I'm not sure. How
> should I
> > >> investigate this?
> > >>
> > >> If the slowness is related to NRT, how can I alleviate the issue
> without
> > >> disabling NRT?
> > >>
> > >> Thanks Much!
> > >>
> > >> -Dave
> > >>
>



-- 
Doug Turnbull
Search & Big Data Architect
OpenSource Connections 


Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Dave Seltzer
Thanks Doug!

One thing I'm not clear on is how to tell whether this is in fact related to
Garbage Collection. If you're right, and the cluster is only as slow as its
slowest link, how do I determine that this is GC? Do I have to run the
profiler on all eight nodes?

Or is it a matter of turning on the correct logging and then watching and
waiting.
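
One low-effort way to tell, assuming the stock Jetty start.jar (the flags are
standard HotSpot options; the heap size is from the numbers above): turn on
GC logging on each node and line the log timestamps up with the latency
spikes:

  java -Xms6g -Xmx6g \
       -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -verbose:gc -Xloggc:logs/gc.log \
       -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -jar start.jar

Any stop-the-world pause of a second or more will show up in gc.log without
attaching a profiler to all eight nodes.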

Thanks!

-D


On Thu, Nov 21, 2013 at 11:20 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Additional info on GC selection
>
> http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#available_collectors
>
> > If response time is more important than overall throughput and garbage
> collection pauses must be kept shorter than approximately one second, then
> select the concurrent collector with -XX:+UseConcMarkSweepGC. If only one
> or two processors are available, consider using incremental mode, described
> below.
>
> I'm not entirely certain of the implications of GC tuning for SolrCloud. I
> imagine distributed searching is going to be as slow as the slowest core
> being queried.
>
> I'd also be curious as to the root-cause of any excess GC churn. It sounds
> like you're doing a ton of random queries. This probably creates a lot of
> evictions in your caches. There's nothing really worth caching, so the caches
> fill up and empty frequently, causing a lot of heap activity. If you expect
> to have high-load and a ton of turnover in queries, then tuning down cache
> size might help minimize GC churn.
>
> Solr Meter is another great tool for your perf testing that can help get
> at some of these caching issues. It gives you some higher-level stats about
> cache eviction, etc.
> https://code.google.com/p/solrmeter/
>
> -Doug
>
>
>
> On Thu, Nov 21, 2013 at 10:24 PM, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Dave you might want to connect JVisualVm and see if there's any pattern
>> with latency and garbage collection. That's a frequent culprit for
>> periodic hits in latency.
>>
>> More info here
>>
>> http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/jmx_connections.html
>>
>> There's a couple GC implementations in Java that can be tuned as needed
>>
>> With JvisualVM You can also add the mbeans plugin to get a ton of
>> performance stats out of Solr that might help debug latency issues.
>>
>> Doug
>>
>> Sent from my Windows Phone From: Dave Seltzer
>> Sent: 11/21/2013 8:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Periodic Slowness on Solr Cloud
>> Lots of questions. Okay.
>>
>> In digging a little deeper and looking at the config I see that
>> <nrtMode>true</nrtMode> is commented out.  I believe this is the default
>> setting. So I don't know if NRT is enabled or not. Maybe just a red
>> herring.
>>
>> I don't know what Garbage Collector we're using. In this test I'm running
>> Solr 4.5.1 using Jetty from the example directory.
>>
>> The CPU on the 8 nodes all stay around 70% use during the test. The nodes
>> have 28GB of RAM. Java is using about 6GB and the rest is being used by OS
>> cache.
>>
>> To perform the test we're running 200 concurrent threads in JMeter. The
>> threads hit HAProxy which loadbalances the requests among the nodes. Each
>> query is for a random word out of a list of about 10,000 words. Some of
>> the
>> queries have faceting turned on.
>>
>> Because we're heavily loading the system the queries are returning quite
>> slowly. For a simple search, the average response time was 300ms. The peak
>> response time was 11,000ms. The spikes in latency seem to occur about
>> every
>> 2.5 minutes.
>>
>> I haven't spent that much time messing with SolrConfig, so most of the
>> settings are the out-of-the-box defaults.
>>
>> Where should I start to look?
>>
>> Thanks so much!
>>
>> -Dave
>>
>>
>>
>>
>>
>> On Thu, Nov 21, 2013 at 6:53 PM, Mark Miller 
>> wrote:
>>
>> > Yes, more details…
>> >
>> > Solr version, which garbage collector, how does heap usage look, cpu,
>> etc.
>> >
>> > - Mark
>> >
>> > On Nov 21, 2013, at 6:46 PM, Erick Erickson 
>> > wrote:
>> >
>> > > How real time is NRT? In particular, what are you commit settings?
>> > >
>> > > And can you characterize "periodic slowness"? Queries that usually
>> > > take 500ms not tail 10s? Or 1s? How often? How are you measuring?
>> > >
>> > > Details matter, a lot...
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > >
>> > >
>> > >
>> > > On Thu, Nov 21, 2013 at 6:03 PM, Dave Seltzer 
>> > wrote:
>> > >
>> > >> I'm doing some performance testing against an 8-node Solr cloud
>> cluster,
>> > >> and I'm noticing some periodic slowness.
>> > >>
>> > >>
>> > >> http://farm4.staticflickr.com/3668/10985410633_23e26c7681_o.png
>> > >>
>> > >> I'm doing random test searches against an Alias Collection made up of
>> > four
>> > >> smaller (monthly) collections. Like this:
>> > >>
>> > >> MasterCollection
>> > >> |- Collection201308
>> > >> |- Collection201309
>> > >> |- Collection201310
>> > >> |- Collection201311
>> > >>
>> > >> The last collection 

Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4

2013-11-21 Thread RadhaJayalakshmi
Thanks Shawn for your response.
So, from your email, it seems that unique_key validation is handled
differently from other field validation.
But what I am not very clear on is what the unique_key has to do with finding
the live server.
Because if there is any mismatch in the unique_key, it throws a
SolrServerException saying "No live servers found", yet live servers
are sourced from the clusterstate in ZooKeeper, so I feel the unique key is
particular to a core/index.
So I am looking to understand the nature of this exception. Please explain how
unique_key and live servers are related.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrServerException-while-adding-an-invalid-UNIQUE-KEY-in-solr-4-4-tp4102346p4102533.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Periodic Slowness on Solr Cloud

2013-11-21 Thread Shawn Heisey
On 11/21/2013 6:41 PM, Dave Seltzer wrote:
> In digging a little deeper and looking at the config I see that
> <nrtMode>true</nrtMode> is commented out.  I believe this is the default
> setting. So I don't know if NRT is enabled or not. Maybe just a red herring.

I had never seen this setting before.  The default is true.  SolrCloud
requires that it be set to true.  Looks like it's a new parameter in
4.5, added by SOLR-4909.  From what I can tell reading the issue,
turning it off effectively disables soft commits.

https://issues.apache.org/jira/browse/SOLR-4909

You've said that you are adding about 3 documents per second, but you
haven't said anything about how often you are doing commits.  Erick's
question basically boils down to this:  How quickly after indexing do
you expect the changes to be visible on a search, and how often are you
doing commits?

Generally speaking (and ignoring the fact that nrtMode now exists), NRT
is not something you enable, it's something you try to achieve, by using
soft commits quickly and often, and by adjusting the configuration to
make the commits go faster.

If you are trying to keep the interval between indexing and document
visibility down to less than a few seconds (especially if it's less than
one second), then you are trying to achieve NRT.

There's a lot of information on the following wiki page about
performance problems.  This specific link is to the last part of that
page, which deals with slow commits:

http://wiki.apache.org/solr/SolrPerformanceProblems#Slow_commits
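
For reference, a common starting point in solrconfig.xml (the intervals are
illustrative and need tuning to how fresh your searchers must be):

  <autoCommit>
    <!-- hard commit for durability; does not open a new searcher -->
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <!-- soft commit for visibility; this is what "NRT" usually means -->
    <maxTime>5000</maxTime>
  </autoSoftCommit>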

> I don't know what Garbage Collector we're using. In this test I'm running
> Solr 4.5.1 using Jetty from the example directory.

If you aren't using any tuning parameters beyond setting the max heap,
then you are using the default parallel collector.  It's a poor choice
for Solr unless your heap is very small.  At 6GB, yours isn't very
small.  It's not particularly huge either, but not small.

> The CPU on the 8 nodes all stay around 70% use during the test. The nodes
> have 28GB of RAM. Java is using about 6GB and the rest is being used by OS
> cache.

How big is your index?  If it's larger than about 30 GB, you probably
need more memory.  If it's much larger than about 40 GB, you definitely
need more memory.

> To perform the test we're running 200 concurrent threads in JMeter. The
> threads hit HAProxy which loadbalances the requests among the nodes. Each
> query is for a random word out of a list of about 10,000 words. Some of the
> queries have faceting turned on.

That's a pretty high query load.  If you want to get anywhere near top
performance out of it, you'll want to have enough memory to fit your
entire index into RAM.  You'll also need to reduce the load introduced
by indexing.  A large part of the load from indexing comes from commits.

> Because we're heavily loading the system the queries are returning quite
> slowly. For a simple search, the average response time was 300ms. The peak
> response time was 11,000ms. The spikes in latency seem to occur about every
> 2.5 minutes.

I would bet that you're having one or both of the following issues:

1) Garbage collection issues from one or more of the following:
 a) Heap too small.
 b) Using the default GC instead of CMS with tuning.
2) General performance issues from one or more of the following:
 a) Not enough cache memory for your index size.
 b) Too-frequent commits.
 c) Commits taking a lot of time and resources due to cache warming.

With a high query and index load, any problems become magnified.

> I haven't spent that much time messing with SolrConfig, so most of the
> settings are the out-of-the-box defaults.

The defaults are very good for small to medium indexes and low to medium
query load.  If you have a big index and/or high query load, you'll
generally need to tune.

Thanks,
Shawn



Re: Best implementation for multi-price store?

2013-11-21 Thread Alejandro Marqués Rodríguez
Hi Robert,

That was the idea, dynamic fields, so, as you said, it is easier to sort
and filter. Besides, with dynamic fields it would be easier to add new
stores, as I wouldn't have to modify the schema :)

Thanks for the answer!
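
For reference, a sketch of the dynamic-field setup Robi describes (field
names and types are illustrative):

  <dynamicField name="*_price" type="float" indexed="true" stored="true"/>
  <dynamicField name="*_units" type="int"   indexed="true" stored="true"/>

A query filtered to one store and sorted by its price would then look like:

  q=*:*&fq=store:store1&sort=store1_price asc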


2013/11/21 Petersen, Robert 

> Hi,
>
> I'd go with (2) also but using dynamic fields so you don't have to define
> all the storeX_price fields in your schema but rather just one *_price
> field.  Then when you filter on store:store1 you'd know to sort with
> store1_price and so forth for units.  That should be pretty straightforward.
>
> Hope that helps,
> Robi
>
> -Original Message-
> From: Alejandro Marqués Rodríguez [mailto:
> amarq...@paradigmatecnologico.com]
> Sent: Thursday, November 21, 2013 1:36 AM
> To: solr-user@lucene.apache.org
> Subject: Best implementation for multi-price store?
>
> Hi,
>
> I've recently been asked to implement an application to search products from
> several stores, each store having different prices and stock for the same
> product.
>
> So I have products that have the usual fields (name, description, brand,
> etc) and also number of units and price for each store. I must be able to
> filter for a given store and order by stock or price for that store. The
> application should also allow increasing the number of stores, fields
> that depend on the store, and the number of products without much work.
>
> The numbers for the application are more or less 100 stores and 7M
> products.
>
> I've been thinking of some ways of defining the index structure but I
> don't know which one is better as I think each one has its pros and cons.
>
>
>1. *Each product-store as a document:* Denormalizing the information so
>for every product and store I have a different document. Pros are that I
>can filter and order without problems and that adding a new
> store-depending
>field is very easy. Cons are that the index goes from 7M documents to
> 700M
>and that most of the info is redundant as most of the fields are
> repeated
>among stores.
>2. *Each field-store as a field:* For example for price I would have
>"store1_price, store2_price, ...". Pros are that the index stays at 7M
>documents, and I can still filter and sort by those fields. Cons are
> that I
>have to add some logic so if I filter by one store I order for the
>associated price field, and that number of fields increases as number of
>store-depending fields x number of stores. I don't know if having more
>fields affects performance, but adding new store-depending fields will
>increase the number of fields even more
>    3. *Join:* When I first read about solr joins I thought it was the way to
>go in this case, but after reading a bit more and doing some tests I'm
> not
>so sure about it... Maybe I've done it wrong but I think it also
>    denormalizes the info (so I will also have 700M documents) and besides
> I
>can't order or filter by store fields.
>
>
> I must say my preferred option is number 2, so I don't duplicate
> information, I keep a relatively small number of documents and I can filter
> and sort by the store fields. However, my main concern here is I don't know
> if having too many fields in a document will be harmful to performance.
>
> Which one do you think is the best approach for this application? Is there
> a better approach that I have missed?
>
> Thanks in advance
>
>
>
> --
> Alejandro Marqués Rodríguez
>
> Paradigma Tecnológico
> http://www.paradigmatecnologico.com
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón
> Tel.: 91 352 59 42
>
>


-- 
Alejandro Marqués Rodríguez

Paradigma Tecnológico
http://www.paradigmatecnologico.com
Avenida de Europa, 26. Ática 5. 3ª Planta
28224 Pozuelo de Alarcón
Tel.: 91 352 59 42


Re: SolrServerException while adding an invalid UNIQUE_KEY in solr 4.4

2013-11-21 Thread Shawn Heisey
On 11/21/2013 9:51 PM, RadhaJayalakshmi wrote:
> Thanks Shawn for your response.
> So, from your email, it seems that unique_key validation is handled
> differently from other field validation.
> But what i am not very clear, is what the unique_key has to do with finding
> the live server?
> Becase if there is any mismatch in the unique_key, it is throwing
> SolrServerException saying "No live servers found".. Because live servers
> are being sourced by clusterstate of zookeeper. so i feel the unique key is
> particular to a core/index.
> So looking to understand the nature of this exception. Please explain me how
> unique_key and live servers are related

It's the HTTP error code, 500, which means internal server error.  SolrJ
interprets this to mean that there's something wrong with that server,
which is what the HTTP protocol specification says it must do.  That
makes it try the next server.  Because the problem is not actually a
server issue, the next server returns the same error.  This continues
until it's tried them all and gives up.

The validation for other fields returns a different error, one that
SolrJ interprets as a problem with the request, so it doesn't try other
servers.

Strictly speaking, Solr probably should not return error 500 for unique
key validation issues, which makes this a minor bug.  The actual results
are correct, because the update fails and the application is notified.
If all possible exceptions are caught, then it all works correctly.

Thanks,
Shawn
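
For anyone hitting this, a minimal sketch of the catch-everything pattern
Shawn describes (SolrJ 4.x; class and variable names are illustrative):

  import java.io.IOException;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.SolrServerException;
  import org.apache.solr.common.SolrException;
  import org.apache.solr.common.SolrInputDocument;

  public class SafeAdd {
      static void safeAdd(SolrServer server, SolrInputDocument doc) {
          try {
              server.add(doc);
          } catch (SolrServerException e) {
              // raised after SolrJ has tried every live node, e.g. for the
              // HTTP 500 produced by a malformed unique key; log or rethrow
          } catch (SolrException e) {
              // request-level errors (e.g. a 400 for a bad field value),
              // which SolrJ does not retry on other servers
          } catch (IOException e) {
              // network-level failure
          }
      }
  }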