Re: integrate solr with preprocessor tools

2015-12-16 Thread Emir Arnautovic

Hi Sara,
I would recommend looking at the code of some component that you currently
use and starting from that - you can extend that class or use it as a
template for your own.
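Purely as an illustration (not part of the original reply), here is a minimal
sketch of a custom token filter and its factory against the Lucene/Solr 5.x
analysis API - the package, class names and the normalization rule are made-up
placeholders:

// PersianNormalizeFilter.java
package com.example.analysis;

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class PersianNormalizeFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public PersianNormalizeFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Example normalization: map Arabic YEH/KAF to their Persian forms in place.
    char[] buffer = termAtt.buffer();
    for (int i = 0; i < termAtt.length(); i++) {
      if (buffer[i] == '\u064A') buffer[i] = '\u06CC';
      if (buffer[i] == '\u0643') buffer[i] = '\u06A9';
    }
    return true;
  }
}

// PersianNormalizeFilterFactory.java - the class referenced from the field type as
// <filter class="com.example.analysis.PersianNormalizeFilterFactory"/>
package com.example.analysis;

import java.util.Map;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.util.TokenFilterFactory;

public class PersianNormalizeFilterFactory extends TokenFilterFactory {
  public PersianNormalizeFilterFactory(Map<String, String> args) {
    super(args);
  }

  @Override
  public TokenStream create(TokenStream input) {
    return new PersianNormalizeFilter(input);
  }
}

The jar containing these classes then has to be put on Solr's classpath (e.g.
via a <lib> directive in solrconfig.xml) so the factory can be referenced from
the schema.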


Thanks,
Emir

On 16.12.2015 09:58, sara hajili wrote:

hi Emir, tnx for answering.
now my question is: how do I write this class?
must I use Solr interfaces?
I see in the above link that I can use a Solr analyzer, but how do I use that?
please tell me how to start writing my own analyzer step by step...
which interface can I use and change to achieve my goal?
tnx

On Wed, Dec 9, 2015 at 1:50 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi Sara,
You need to wrap your code in a tokenizer or token filter
https://wiki.apache.org/solr/SolrPlugins

If you want to improve the existing one and believe others can benefit from
the improvement, you can open a ticket and submit a patch.

Thanks,
Emir


On 09.12.2015 10:41, sara hajili wrote:


hi, I want to use Solr, and the language of the documents I store in Solr
is Persian.
Solr doesn't support Persian as well as I want, so I found preprocessor
tools like a normalizer, tokenizer, etc.
I don't want to use the Solr Persian filters like the Persian tokenizer as
they are; I mean I want to improve them.

now my question is: how can I integrate Solr with these external
preprocessor tools??

tnx



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: TPS with Solr Cloud

2015-12-21 Thread Emir Arnautovic

Hi Anshul,
TPS depends on the number of concurrent requests you can run and on request
processing time. With sharding you reduce processing time by reducing the
amount of data a single node processes, but you add the overhead of
inter-shard communication and of merging results from different shards. If
that overhead is smaller than the time you save by processing half of the
index, you will see an increase in TPS. If you are running the same query in
a loop, the first request will be processed and the others will likely be
returned from cache, so response time will not vary with index size, hence
the sharding overhead will cause TPS to go down.
If you are happy with your response time and want more TPS, go with
replication - that will increase the number of concurrent requests you can run.


Also, make sure your tests are realistic in order to avoid false estimates
and surprises when you start running real load.


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 21.12.2015 08:18, Anshul Sharma wrote:

Hi,
I am trying to evaluate Solr for one of my projects, for which I need to
check the scalability in terms of TPS (transactions per second) for my
application.
I have configured Solr on 1 AWS server as a standalone application, which is
giving me a TPS of ~8000 for my query.
In order to test the scalability, I have sharded the same data
across two AWS servers with 2.5 million records each. When I try to query
the cluster with the same query as before, it gives me a TPS of ~2500.
My understanding is the TPS should have increased in a cluster, as
these are two different machines which will perform separate I/O operations.
I have not configured any separate load balancer, as the documentation says
that by default SolrCloud will perform load balancing in a round-robin fashion.
Can you please help me in understanding the issue.



Re: new data structure for some fields

2015-12-21 Thread Emir Arnautovic
Maybe I am missing something, but if c and b are one-to-one and you are
filtering by c, how can you sort on b, since all values will be the same?


On 21.12.2015 13:10, Abhishek Mishra wrote:

Hi Binoy, it will not work, as category and integer are a one-to-one mapping,
so if category_id is multivalued the same goes for the integer too. And you
need some kind of mechanism which will identify which integer to pick for a
given category_id in the search; only then can you implement sorting
according to it.

On Mon, Dec 21, 2015 at 5:27 PM, Binoy Dalal  wrote:


Small edit:
The sort parameter in the solrconfig goes in the request handler
declaration that you're using. So if it's /select, put it in the request
handler's defaults list.

On Mon, 21 Dec 2015, 17:21 Binoy Dalal  wrote:


OK. You will only be able to sort based on the integers if the integer
field is single-valued, i.e. only one integer is associated with one
category id.

To do this you have to use the sort parameter.
You can either specify it in your solrconfig.xml like so:
integer asc
(field name followed by the order - asc/desc)

Or you can specify it along with your query by appending it to your
query like so:
/select?q=query&sort=integer%20asc

If you want to apply these sorting rules for all docs, then specify the
sorting in your solrconfig. If you only want it for a certain subset, then
apply the parameter from code at the app level
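As a sketch only (handler name and field name are placeholders for whatever
you actually use), the solrconfig.xml declaration could look like:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="sort">integer asc</str>
  </lst>
</requestHandler>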

On Mon, 21 Dec 2015, 16:49 Abhishek Mishra  wrote:


hi binoy,
thanks for the reply. What I mean by sort is to sort the result sets on the
basis of the integer values given for that category.
For any document, let's say for an id P1,
the categories associated are c1,c2,c3,c4 (using a multivalued field).
For the new implementation,
similarly a number is associated with each category, let's say
c1---b1, c2---b2, c3---b3, c4---b4.
Now when we query Solr for the ids which have c1 in their
categories (q=category_id:c1), I want the result of this query
sorted on the basis of the number (b) associated with it throughout the
result.

The number of associations is usually less than 20 (meaning an id can't be
mapped to more than 20 category_ids).


On Mon, Dec 21, 2015 at 3:59 PM, Binoy Dalal 
wrote:


When you say sort, do you mean search on the basis of category and
integers? Or score the docs based on their category and integer values?

Also, for any given document, how many categories or integers are
associated with it?

On Mon, 21 Dec 2015, 14:43 Abhishek Mishra 

wrote:

Hello all,

I am facing a requirement where an id p1 is associated
with some category_ids c1,c2,c3,c4 and with some integers b1,b2,b3,b4. We
need to sort the Solr query on the basis of b1/b2/b3/b4 depending on the
given category_id. Right now we map the category_ids into a multi-valued
attribute, [c1,c2,c3,c4], something like this, and we query against it. But
from now on we also need to find which integer b1,b2,b3... is associated
with a given category and also sort the whole query on it.

sorry for any typos..

Regards
Abhishek


--
Regards,
Binoy Dalal


--
Regards,
Binoy Dalal


--
Regards,
Binoy Dalal



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Best practices on monitoring Solr

2015-12-23 Thread Emir Arnautovic

Hi Shail,
As William mentioned, our SPM
allows you to monitor all the main Solr/JVM/host metrics and also set up
alerts on some values, or use anomaly detection to notify you when
something is about to go wrong. You can test all features for free for
30 days (no credit card required). There is an embedded chat if you have
questions.


HTH,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 23.12.2015 07:38, William Bell wrote:

Sematext.com has a service for this...

Or just curl "http://localhost:8983/solr//select?q=*:*" to see
if it returns ?
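For illustration, assuming the default ping handler is enabled and
"collection1" stands in for your core/collection name, a slightly more
targeted health check is:

curl "http://localhost:8983/solr/collection1/admin/ping?wt=json"

which returns "status":"OK" in the response while the core can serve requests.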

On Tue, Dec 22, 2015 at 12:15 PM, Tiwari, Shailendra <
shailendra.tiw...@macmillan.com> wrote:


Hi,

Last week our Solr search was unresponsive and we needed to reboot the
server, but we only found out after a customer complained about it.
What's the best way to monitor that search is working?
We can always add Gomez alerts from the UI.
What are the best practices?

Thanks

Shail








Re: Does soft commit re-opens searchers in disk?

2016-01-04 Thread Emir Arnautovic

Hi Gili,
Visibility is related to the searcher - if you reopen the searcher,
documents will be visible. If a hard commit happens without reopening the
searcher, documents will not be visible until the next soft commit happens.
You can find more details about commits on 
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/


HTH,
Emir

On 04.01.2016 11:14, Gili Nachum wrote:

Hello,

When a new document is added, it becomes visible after a soft commit,
during which it is written to a Lucene RAMDirectory (in heap). Then, after a
hard commit, the RAMDirectory is removed from memory and the docs are
written to the index on disk.
What happens if I hard commit (write to disk) with openSearcher=false?
Would I lose document visibility, since it's no longer in memory AND the
hard commit didn't open a new searcher on disk?

Does soft commit also re-open searchers over the index on disk?

Here's my commit configuration:


   60
*false*


  ${solr.autoSoftCommit.maxTime:3}
   

Thanks.



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Query behavior difference.

2016-01-06 Thread Emir Arnautovic

Hi Modassar,
It usually helps if you analyze an extreme case, e.g. fl:a*
Which terms should be a better match? Those that are shorter, or should all
be equally good?
What should be the top document? Assuming standard TF/IDF scoring is used,
that would be the one with the most terms that start with 'a', especially
those that are not frequent in the corpus. Calculating that could be
expensive and irrelevant in most cases, so a constant score makes sense.


Thanks,
Emir

On 06.01.2016 12:07, Modassar Ather wrote:

Please help me understand why queries like wildcard, prefix and a few others
are re-written into constant score queries.
Why are the scoring factors not taken into consideration in such queries?

Please correct me if I am wrong that this behavior is per query type,
irrespective of the parser used.

Thanks,
Modassar

On Wed, Jan 6, 2016 at 12:56 PM, Modassar Ather 
wrote:


Thanks for your response Ahmet.

Best,
Modassar

On Mon, Jan 4, 2016 at 5:07 PM, Ahmet Arslan 
wrote:


Hi,

I think wildcard queries like fl:networ* are re-written into a Constant
Score Query.
fl=*,score should return the same score for all documents that are retrieved.

Ahmet



On Monday, January 4, 2016 12:22 PM, Modassar Ather <
modather1...@gmail.com> wrote:
Hi,

Kindly help me understand how relevance ranking will differ in the following
searches.

query : fl:network
query : fl:networ*

What I am observing is that the results returned are different, in
that the top documents returned for q=fl:network are not present in
the top results of q=fl:networ*.
For example, for q=fl:network I am getting top documents having around 20
occurrences of network, whereas the top result of q=fl:networ* has only a
couple of occurrences of network.
I am aware of the underlying normalization process participating in the
relevance ranking of documents, but I am not able to understand such a
difference in the ranking of results for these queries.

Thanks,
Modassar





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Newbie: Searching across 2 collections ?

2016-01-06 Thread Emir Arnautovic

Hi Bruno,
Can you check the counts? Is it possible that the first page contains only
results from the collection that you sent the request to, so you assumed it
returns only results from a single collection?


Thanks,
Emir

On 06.01.2016 14:33, Susheel Kumar wrote:

Hi Bruno,

I just tested this scenario in my local Solr 5.3.1 and it returned results
from two identical collections. I doubt it is broken in 5.4; just
double-check that you are not missing anything else.

Thanks,
Susheel

http://localhost:8983/solr/c1/select?q=id_type%3Ahello&wt=json&indent=true&collection=c1,c2

responseHeader": {"status": 0,"QTime": 98,"params": {"q": "id_type:hello","
indent": "true","collection": "c1,c2","wt": "json"}},
response": {"numFound": 2,"start": 0,"maxScore": 1,"docs": [{"id": "1","
id_type": "hello","_version_": 1522623395043213300},{"id": "3","id_type": "
hello","_version_": 1522623422397415400}]}

On Wed, Jan 6, 2016 at 6:13 AM, Bruno Mannina  wrote:


yes id value is unique in C1 and unique in C2.
id in C1 is never present in C2
id in C2 is never present in C1


Le 06/01/2016 11:12, Binoy Dalal a écrit :


Are the id values for docs in both collections exactly the same?
To get proper results, the ids should be unique across both the cores.

On Wed, 6 Jan 2016, 15:11 Bruno Mannina  wrote:

Hi All,

Solr 5.4, Ubuntu

I thought it was simple to query across two collections with the same
schema, but apparently not.
I have one Solr instance running, with 300,000 records in each collection.

I tried to use this request but did not get results from both:

http://my_adress:my_port
/solr/C1/select?collection=C1,C2&q=fid:34520196&wt=json

this request returns only C1 results and if I do:

http://my_adress:my_port
/solr/C2/select?collection=C1,C2&q=fid:34520196&wt=json

it returns only C2 results.

I have 5 identical fields on both collection
id, fid, st, cc, timestamp
where id is the unique key field.

Could someone explain to me why it doesn't work?

Thanks a lot !
Bruno


--


Regards,
Binoy Dalal







--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: solr BooleanClauses issue with space

2016-01-13 Thread Emir Arnautovic

Hi Sara,
You can run your query (or a smaller one) with debugQuery=true and see how
it is rewritten.


Thanks,
Emir

On 13.01.2016 16:01, sara hajili wrote:

tnx.
and my main question is about maxBooleanClauses in the solr config.
it is 1024 by default.
and I have an edismax query with about 500 words, in this form:
q1 = str1 OR str2 OR str3 ... OR strn
it throws an exception that it can't parse the query: too many boolean clauses.
so if I change maxBooleanClauses to 1500 it works.
but something that is ambiguous for me is when I don't change
maxBooleanClauses and it remains 1024,
but I change the query to this form:
q2 = str1 str2 str3 ... strn  // I eliminated the OR and used spaces instead
I didn't get an exception !!!
why?!
what is the difference between q1 and q2??

On Wed, Jan 13, 2016 at 6:28 AM, Shawn Heisey  wrote:


On 1/13/2016 5:40 AM, sara hajili wrote:

what exactly is the difference between space and OR in a solr query?
i mean what is the difference between
q = solr OR lucene OR search
and this
q = solr lucene search?

solr's default boolean occurrence is OR, isn't it?

This depends on what the default operator is.  The default for the
default operator is OR, and that would produce exactly the same results
for both of the queries you have mentioned.  If the default operator is
AND, then those two queries would be different.

The default operator applies to the lucene and edismax parsers.  The
lucene parser is Solr's default.  In older versions, the default
operator could be set by a defaultOperator parameter.  I do not remember
whether that was in solrconfig or schema.  That parameter is deprecated
and the q.op parameter should be used now.
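For reference, hedged sketches of the two settings discussed in this thread -
the clause limit sits in the <query> section of solrconfig.xml, and the
default operator can be passed per request:

<maxBooleanClauses>1024</maxBooleanClauses>

/select?q=solr lucene search&q.op=AND

Note that raising maxBooleanClauses only lifts the parse-time limit; it does
not make a query with hundreds of OR clauses cheap to execute.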

Thanks,
Shawn




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Position increment in WordDelimiterFilter.

2016-01-14 Thread Emir Arnautovic

Hi Modassar,
Why do you think it should be at position 1? In that case searching for
"3 d" would not find anything. Is that what you expect?


Thanks,
Emir

On 14.01.2016 10:15, Modassar Ather wrote:

Hi,

I have the following definition for the WordDelimiterFilter.



The analysis of 3d shows following four tokens and their positions.

token position
3d 1
3   1
3d 1
d   2

Please help me understand why d is at position 2. Should it not also be at
position 1?
Is it a bug, and if not, is there any attribute which I can use to restrict
the position increment?

Thanks,
Modassar
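Purely for illustration (these attribute values are assumptions, not the
poster's actual settings), a typical WordDelimiterFilter setup with a
whitespace tokenizer looks like:

<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1" splitOnNumerics="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>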



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Position increment in WordDelimiterFilter.

2016-01-14 Thread Emir Arnautovic

Hi,
It seems to me that you don't want to split on numbers. Maybe there are
other cases where you need to, so it is turned on. If there are such
cases I would suggest you create tests with expectations so you can check
what works best for you. It is highly likely that you will not be able
to create a solution that suits all cases, so you will have to make some
tradeoffs.


Emir

On 14.01.2016 13:42, Modassar Ather wrote:

Thanks for your responses.

Why do you think it should be at position 1? In that case searching for "3
d" would not find anything. Is it what you expect?
During search, some of the results returned are not wanted. Following is an
example.
Search query: "3d image"
Search results with 3-d image/3 d image/1d image are also returned. This is
happening because of the position increment.
Another example is "1d obj*" returning results containing "d-object"
related results. This can bring in a completely different search item. Here
the token d matches the d of d-object, as that term is split the same way.
The position increment will also cause the "3d image" search to fail on a
document containing "3d image", as the "d" comes at position 2.

1) can you confirm if you've made a typo while typing out your results?
I have confirmed the position attribute displayed on analysis page and I
found there is no typo.
2 ) you'll get the d and 3d as 2 since they're the 2nd token once 3d is
split.
Irrespective of that, what I want to understand is why there is an increment
in position. Should not all the terms be at the same position, as they are
yielded from the same term/token?

Best,
Modassar

On Thu, Jan 14, 2016 at 3:25 PM, Binoy Dalal  wrote:


I've tried out your settings and here's what I get:
3d 1
3   1
d   2
3d 2

1) can you confirm if you've made a typo while typing out your results?
2 ) you'll get the d and 3d as 2 since they're the 2nd token once 3d is
split.
Try the same thing with d3 and you'll get 3 and d3 at position 2

On Thu, 14 Jan 2016, 15:11 Emir Arnautovic 
wrote:


Hi Modassar,
Why do you think it should be at position 1? In that case searching for
"3 d" would not find anything. Is it what you expect?

Thanks,
Emir

On 14.01.2016 10:15, Modassar Ather wrote:

Hi,

I have following definition for WordDelimiterFilter.



The analysis of 3d shows following four tokens and their positions.

token position
3d 1
3   1
3d 1
d   2

Please help me understand why d is at 2? Should not it also be at

position

1.
Is it a bug and if not is there any attribute which I can use to

restrict

the position increment?

Thanks,
Modassar


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

--

Regards,
Binoy Dalal



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Position increment in WordDelimiterFilter.

2016-01-15 Thread Emir Arnautovic

Modassar,
Are you saying that WiFi Wi-Fi and Wi Fi should not match each other? 
Why do you use WordDelimiterFilter? Can you give us few examples where 
it is useful?


Thanks,
Emir

On 15.01.2016 05:13, Modassar Ather wrote:

Thanks for your responses.

It seems to me that you don't want to split on numbers.
It is not only with numbers. Even if you try to analyze WiFi it will create
4 tokens, one of which will be at position 2. So basically the issue is with
the position increment, which causes a few of the queries to behave
unexpectedly.

Which release of Solr are you using?
I am using Lucene/Solr-5.4.0.

Best,
Modassar

On Thu, Jan 14, 2016 at 9:44 PM, Jack Krupansky 
wrote:


Which release of Solr are you using? Last year (or so) there was a Lucene
change that had the effect of keeping all terms for WDF at the same
position. There was also some discussion about whether this was a bug or a
bug fix, but I don't recall any resolution.

-- Jack Krupansky

On Thu, Jan 14, 2016 at 4:15 AM, Modassar Ather 
wrote:


Hi,

I have following definition for WordDelimiterFilter.



The analysis of 3d shows following four tokens and their positions.

token position
3d 1
3   1
3d 1
d   2

Please help me understand why d is at 2? Should not it also be at

position

1.
Is it a bug and if not is there any attribute which I can use to restrict
the position increment?

Thanks,
Modassar



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Speculation on Memory needed to efficently run a Solr Instance.

2016-01-15 Thread Emir Arnautovic

Hi,
The OS does not care much about search vs. retrieve, so the amount of RAM
needed for file caches depends on your index usage patterns. If you are
not retrieving stored fields much and most/all results are only
id+score, then it can be assumed that you can go with less RAM than the
actual index size. In that case you might question whether you need stored
fields in the index at all. Also, if your index/usage pattern is such that
only a small subset of documents is retrieved with stored fields, then it can
also be assumed it will never need to cache the entire .fdt file.
One thing that you forgot (unless your index is static) is segment
merging - in the worst case the system will have two "copies" of the index,
and having extra memory helps in such cases.
The best approach is to use some tool and monitor IO and memory metrics.
One such tool is Sematext's SPM (http://sematext.com/spm), where you can
see metrics for both the system and Solr.


Thanks,
Emir

On 15.01.2016 10:43, Gian Maria Ricci - aka Alkampfer wrote:


Hi,

When it is time to calculate how much RAM a Solr instance needs to run
with good performance, I know that it is some form of art, but I'm
looking for a general "formula" to have at least a good starting point.


Apart from the RAM devoted to the Java heap, which is strongly dependent on
how I configure caches and on the distribution of queries in my system, I'm
particularly interested in the amount of RAM to leave to the operating
system to use as file cache.


Suppose I have an index of 51 GB. Clearly, having that amount of RAM
devoted to the OS is the best approach, so all index files can be cached
in memory by the OS, and thus I can achieve maximum speed.


But if I look at the details of the index, in this particular example I
see that the biggest file has the .fdt extension; it is the stored fields
for the documents, so it affects retrieval of document data, not the
actual search process. Since this file is 24 GB in size, it is almost
half of the space of the index.


My question is: could it be safe to assume that a good starting point
for the amount of RAM to leave to the OS is the size of the index
less the size of the .fdt file, because the latter has less importance in
the search process?


Are there any particular settings at the OS level (CentOS Linux) to get
maximum benefit from the OS file cache? (The documentation at
https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-MemoryandGCSettings
does not have any information related to OS configuration.) Elasticsearch
(https://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-configuration.html)
generally has some suggestions such as using mlockall, disabling swap, etc.;
I wonder if there are similar suggestions for Solr.


Many thanks for all the great help you are giving me on this mailing
list.


--
Gian Maria Ricci
Cell: +39 320 0136949





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Position increment in WordDelimiterFilter.

2016-01-15 Thread Emir Arnautovic
Can you please send us tokens you get (and positions) when you analyze 
*WiFi device*


On 15.01.2016 13:15, Modassar Ather wrote:

Are you saying that WiFi Wi-Fi and Wi Fi should not match each other?
I am using WhiteSpaceTokenizer in my analysis chain so wi fi becomes two
different token. Please refer to my examples given in previous mail about
the issues faced.
Wi Fi are two term which will match but what happens if for a content
having *WiFi device* is searched with *"WiFi device"*. It will not match as
there is a position increment by WordDelimiterFilter for WiFi.
"WiFi device"~1 will match which is confusing that there is no gap in the
content why a slop is required.

Why do you use WordDelimiterFilter? Can you give us few examples where it
is useful?
It is useful when a word like* lucene-search documentation *is indexed with
WordDelimiterFilter and it is broken in two terms like lucene and search
then it will be helpful to get the documents containing it for queries like
lucene documentation or search documentation.

Best,
Modassar

On Fri, Jan 15, 2016 at 2:14 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Modassar,
Are you saying that WiFi Wi-Fi and Wi Fi should not match each other? Why
do you use WordDelimiterFilter? Can you give us few examples where it is
useful?

Thanks,
Emir


On 15.01.2016 05:13, Modassar Ather wrote:


Thanks for your responses.

It seems to me that you don't want to split on numbers.
It is not with number only. Even if you try to analyze WiFi it will create
4 token one of which will be at position 2. So basically the issue is with
position increment which causes few of the queries behave unexpectedly.

Which release of Solr are you using?
I am using Lucene/Solr-5.4.0.

Best,
Modassar

On Thu, Jan 14, 2016 at 9:44 PM, Jack Krupansky 
change that had the effect of keeping all terms for WDF at the same
position. There was also some discussion about whether this was either a
bug or a bug fix, but I don't recall any resolution.

-- Jack Krupansky

On Thu, Jan 14, 2016 at 4:15 AM, Modassar Ather 
wrote:

Hi,

I have following definition for WordDelimiterFilter.



The analysis of 3d shows following four tokens and their positions.

token position
3d 1
3   1
3d 1
d   2

Please help me understand why d is at 2? Should not it also be at


position


1.
Is it a bug and if not is there any attribute which I can use to
restrict
the position increment?

Thanks,
Modassar



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: ramBufferSizeMB and maxIndexingThreads

2016-01-20 Thread Emir Arnautovic
Kind of obvious/logical, but I have seen some people forget that it is per
core - if a single node hosts multiple shards, each will take 100MB.


Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 20.01.2016 07:02, Shalin Shekhar Mangar wrote:

ramBufferSizeMB is independent of the maxIndexingThreads. If you set
it to 100MB then any lucene segment (or part of a segment) exceeding
100MB will be flushed to disk.
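A sketch of where these live - both are children of <indexConfig> in
solrconfig.xml; the values are just the ones from this thread:

<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <maxIndexingThreads>8</maxIndexingThreads>
</indexConfig>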

On Wed, Jan 20, 2016 at 3:50 AM, Angel Todorov  wrote:

hi guys,

quick question - is ramBufferSizeMB the maximum value no matter how many
maxIndexingThreads I have, or is it multiplied by the number of indexing
threads? So, if I have ramBufferSizeMB set to 100 MB and 8 indexing
threads, does this mean the total RAM buffer will be 100 MB or 800 MB?

Thanks
Angel







Re: Returning all documents in a collection

2016-01-20 Thread Emir Arnautovic

Hi Salman,
You should use cursors in order to avoid "deep paging issues". Take a 
look at 
https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results.
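A minimal sketch of cursor usage (collection name and sort field are
placeholders; the sort has to end on the uniqueKey field):

/solr/collection1/select?q=*:*&rows=1000&sort=id+asc&cursorMark=*

Each response carries a nextCursorMark value; pass it back as cursorMark in
the following request and repeat until nextCursorMark stops changing.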


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 20.01.2016 12:55, Salman Ansari wrote:

Hi,

I am looking for a way to return all documents from a collection.
Currently, I am restricted to specifying the number of rows using Solr.NET
but I am looking for a better approach to actually return all documents. If
I specify a huge number such as 1M, the processing takes a long time.

Any feedback/comment will be appreciated.

Regards,
Salman





Re: solr score threashold

2016-01-20 Thread Emir Arnautovic

Hi Sara,
You can use func and frange to achieve what you need, but note that scores
are not normalized, meaning a score of 8 does not mean it is a good match -
it is just the best match. There are examples online of how to normalize
scores (e.g. http://wiki.apache.org/lucene-java/ScoresAsPercentages).
Another approach is to write a custom component that will filter out docs
below some threshold.
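A common sketch of the frange approach (the threshold 4 is only an example,
and remember the caveat above about raw scores):

q=your query here
fq={!frange l=4}query($q)

This keeps only documents whose score for the referenced query is at least 4.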


Thanks,
Emir

On 20.01.2016 13:58, sara hajili wrote:

hi all,
i want to know about the solr search relevancy scoring threshold.
can i change it?
i mean, imagine when i search i get this result:
doc1 score = 8
doc2 score = 6.4
doc3 score = 6
doc4 score = 5.5
doc5 score = 2
i want to change the solr score threshold, i.e. i set the threshold, for
example > 4,
and then i don't get doc5 as a result. can i do this? if yes, how?
and if not, how can i modify the search so i don't get docs as results that
are far from the doc with the max score?
in other words i want to remove this gap between solr results



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Couple of question about Virtualization and Load Balancer

2016-01-22 Thread Emir Arnautovic
There is another reason to avoid virtualization - fault tolerance. It is
common to use virtualization on a huge box and keep replicas on the same
box. Such a setup will survive a VM failure but not a HW failure.


Regards,
Emir

On 22.01.2016 11:05, Gian Maria Ricci - aka Alkampfer wrote:

Thanks, my actual strategy is using SolrMeter to test with the real
virtualized hardware and a real result set to get some numbers. The customer
definitely wants virtualization, and we will probably not test on a
bare-metal installation.

As I stated in a previous mail, the question arises because in some books /
blogs people suggest avoiding virtualization, and even though I know that
virtualized hardware is slower than bare metal, usually the loss of
performance is negligible, so I wonder if there are some proofs of concept
to back up these hypotheses.

As I suspected, that is probably general advice, but it is always safer to
have a proof of concept on your own hardware and data.

Thanks.


--
Gian Maria Ricci
Cell: +39 320 0136949
 



-Original Message-
From: Davis, Daniel (NIH/NLM) [C] [mailto:daniel.da...@nih.gov]
Sent: giovedì 21 gennaio 2016 16:32
To: solr-user@lucene.apache.org
Subject: RE: Couple of question about Virtualization and Load Balancer


The first one is about virtualization, I'd like to know if there are
any official test on loss of performance in virtualization
environment. I think that the loss of performance is negligible, and
quick question on test infrastructure is confirming this, but I'd like to

know if there is some official numbers on this.

I think any "official" test would run into the very reasonable problem of
which schema, indexed data, and queries to test.
This problem is well summarized by
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-w
e-dont-have-a-definitive-answer/.

However, there is a Solr performance test tool with a track record -
SolrMeter.You
can also do a lot with good old JMeter.

From: outlook_288fbf38c031d...@outlook.com
[mailto:outlook_288fbf38c031d...@outlook.com] On Behalf Of Gian Maria Ricci
- aka Alkampfer
Sent: Thursday, January 21, 2016 6:38 AM
To: solr-user@lucene.apache.org
Subject: Couple of question about Virtualization and Load Balancer

Hi,

I have a couple of quick questions about the production setup.


The second question is about the load balancer: any clue on how to
automatically change the configuration on the load balancer if one of the
nodes goes down?
I'm looking for advice on what to monitor. The simplest solution could be
issuing some test query and verifying that the node is able to answer, but
it would be nice to know if there are some standard metrics to monitor to
proactively alert (e.g. heap size almost full, so it would probably be
better to remove the node from the balancer and alert a human to have a look
at the status of the node).

Many thanks.

--
Gian Maria Ricci
Cell: +39 320 0136949



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Understanding solr commit

2016-01-25 Thread Emir Arnautovic

Hi Rahul,
If I got your mail right, there is a misconception about SolrCloud - nodes
are the infrastructure of the cloud and the collection is the "unit". So
when you commit, you are committing changes you made to the collection and
SolrCloud will handle the nodes. When you commit to the 3 nodes it is
actually 3 commits to a single collection.
It is not considered good practice to have a script that does commits. Solr
has autocommit functionality. You should also read up on soft vs. hard
commits. The following article is a good starting point:
https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/


Regards,
Emir

On 25.01.2016 12:02, Rahul Ramesh wrote:

We are facing some issues and we are finding it difficult to debug the
problem. We wanted to understand how solr commit works.
A background on our setup:
We have a 3-node Solr cluster running version 5.3.1. It is an index-heavy
use case. At peak load, we index 400-500 documents/second.
We also want these documents to be visible as quickly as possible, hence we
run an external script which commits every 3 mins.

Consider the three nodes as N1, N2, N3. Commit is a synchronous operation,
so we will not get control back until the commit operation is complete.

Consider the following scenario. Although it looks like a basic scenario in
a distributed system :-), we just wanted to eliminate this possibility.

step 1 : At time T1, commit happens to Node N1
step 2: At same time T1, we search for all the documents inserted in Node
N2.

My question is

1. Is commit an atomic operation? I mean, will commit happen on all the
nodes at the same time?
2. Can we say that the search result will always contain the documents from
before the commit / or after the commit? Or can it happen that we get new
documents from N1, N2 but old documents (i.e., before commit) from N3?

Thank you,
Rahul



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Understanding solr commit

2016-01-25 Thread Emir Arnautovic

Hi Rahul,
It is good that you commit only once, but I am not sure how external commits
can do something auto commit cannot.
Can you give us a bit more detail about your Solr heap parameters? Running
Solr on the edge of OOM always risks starting a snowball effect and
crashing the entire cluster. Also, can you give us info about the auto
commit settings (both hard and soft) you used when you experienced OOM.


Thanks,
Emir

On 25.01.2016 12:28, Rahul Ramesh wrote:

Thanks for your replies.

A bit more detail about our setup.
The index size is close to 80GB, spread across 30 collections. The main
memory available is around 32GB. We are always short of memory!
Unfortunately we cannot expand the memory as the server motherboard
doesn't support it.

We tried the solr auto commit features. However, sometimes we were getting
a Java OOM exception, and when I started digging into it, somebody
suggested that I was not committing the collections often enough. So, we
started committing the collections explicitly.

Please let me know if our approach is not correct.

*Emir*,
We are committing to the collection only once. We have Node N1, N2 and N3
and for a collection Coll1, commit will happen to N1/coll1 every 3 minutes.
we are not doing it for every node. We will remove _shard<>_replica<> and
use only the collection name to commit.

*Alessandro*,
We are using Solr Cloud with replication factor of 2 and no of shards as
either 2 or 3.

Thanks,
Rahul









On Mon, Jan 25, 2016 at 4:43 PM, Alessandro Benedetti 
wrote:
Let me answer in line :

On 25 January 2016 at 11:02, Rahul Ramesh  wrote:


We are facing some issue and we are finding it difficult to debug the
problem. We wanted to understand how solr commit works.
A background on our setup:
We have  3 Node Solr Cluster running in version 5.3.1. Its a index heavy
use case. In peak load, we index 400-500 documents/second.
We also want these documents to be visible as quickly as possible, hence

we

run an external script which commits every 3 mins.


This is weird, why not using the auto-soft commit if you want visibility
every 3 minutes ?
Is there any particular reason you trigger the commit from the client ?


Consider the three nodes as N1, N2, N3. Commit is an synchronous

operation.

So, we will not get control till the commit operation is complete.

Consider the following scenario. Although it looks like a basic scenario

in

distributed system:-) but we just wanted to eliminate this possibility.

step 1 : At time T1, commit happens to Node N1
step 2: At same time T1, we search for all the documents inserted in Node
N2.

My question is

1. Is commit an atomic operation? I mean, will commit happen on all the
nodes at the same time?


Which kind of architecture of Solr are you using ? Are you using SolrCloud
?

2. Can we say that, the search result will always contain the documents

before commit / or after commit . Or can it so happen that we get new
documents fron N1, N2 but old documents (i.e., before commit)  from N3?


With a manual cluster it could faintly happen.
In SolrCloud it should not, but I should double check the code !


Thank you,
Rahul




--
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Understanding solr commit

2016-01-25 Thread Emir Arnautovic

Hi Rahul,
It is hard to tell without seeing metrics, but an 8GB heap seems small for
such a setup - e.g. with an indexing buffer of 32MB and 30 collections, it
will eat almost 1GB of memory.
About commits, you can set auto commit to be more frequent (keep
openSearcher=false) and add soft commits every 3 min.
What you need to tune is your heap and heap-related settings - indexing
buffer, caches. Not sure what you use for monitoring Solr, but
Sematext's SPM (http://sematext.com/spm) is one such tool that can give
you info on how your Solr, JVM and host handle different loads. Such a tool
can give you enough info to tune your Solr.
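For illustration, such a setup could look roughly like this in the
updateHandler section of solrconfig.xml (interval values are examples only):

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>180000</maxTime> <!-- ~3 minutes, controls visibility -->
</autoSoftCommit>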


Regards,
Emir

On 25.01.2016 13:42, Rahul Ramesh wrote:

Can you give us bit more details about Solr heap parameters.
Each node has 32Gb of RAM and we are using 8Gb for heap.
Index size in each node is around 80Gb
#of collections 30


Also can you give us info about auto commit (both hard and soft) you used
when experienced OOM.
 15000 15000 
false 

soft commit is not enabled.

-Rahul



On Mon, Jan 25, 2016 at 6:00 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi Rahul,
It is good that you commit only once, but not sure how external commits
can do something auto commit cannot.
Can you give us bit more details about Solr heap parameters. Running Solr
on the edge of OOM is always risk of starting snowball effect and crashing
entire cluster. Also can you give us info about auto commit (both hard and
soft) you used when experienced OOM.

Thanks,
Emir

On 25.01.2016 12:28, Rahul Ramesh wrote:


Thanks for your replies.

A bit more detail about our setup.
The index size is close to 80Gb spread across 30 collections. The main
memory available is around 32Gb. We are always in short of memory!
Unfortunately we could not expand the memory as the server motherboard
doesnt support it.

We tried with solr auto commit features. However, sometimes we were
getting
Java OOM exception and when I start digging more about it, somebody
suggested that I am not committing the collections often. So, we started
committing the collections explicitly.

Please let me know if our approach is not correct.

*Emir*,
We are committing to the collection only once. We have Node N1, N2 and N3
and for a collection Coll1, commit will happen to N1/coll1 every 3
minutes.
we are not doing it for every node. We will remove _shard<>_replica<> and
use only the collection name to commit.

*Alessandro*,

We are using Solr Cloud with replication factor of 2 and no of shards as
either 2 or 3.

Thanks,
Rahul









On Mon, Jan 25, 2016 at 4:43 PM, Alessandro Benedetti <
abenede...@apache.org


wrote:
Let me answer in line :

On 25 January 2016 at 11:02, Rahul Ramesh  wrote:

We are facing some issue and we are finding it difficult to debug the

problem. We wanted to understand how solr commit works.
A background on our setup:
We have  3 Node Solr Cluster running in version 5.3.1. Its a index heavy
use case. In peak load, we index 400-500 documents/second.
We also want these documents to be visible as quickly as possible, hence


we


run an external script which commits every 3 mins.

This is weird, why not using the auto-soft commit if you want visibility

every 3 minutes ?
Is there any particular reason you trigger the commit from the client ?

Consider the three nodes as N1, N2, N3. Commit is an synchronous
operation.


So, we will not get control till the commit operation is complete.

Consider the following scenario. Although it looks like a basic scenario


in


distributed system:-) but we just wanted to eliminate this possibility.

step 1 : At time T1, commit happens to Node N1
step 2: At same time T1, we search for all the documents inserted in
Node
N2.

My question is

1. Is commit an atomic operation? I mean, will commit happen on all the
nodes at the same time?

Which kind of architecture of Solr are you using ? Are you using

SolrCloud
?

2. Can we say that, the search result will always contain the documents


before commit / or after commit . Or can it so happen that we get new
documents fron N1, N2 but old documents (i.e., before commit)  from N3?

With a manual cluster it could faintly happen.

In SolrCloud it should not, but I should double check the code !

Thank you,

Rahul



--
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: indexing rich data with solr 5.3.1 integreting in Ubuntu server

2016-01-26 Thread Emir Arnautovic

Hi,
I would first check if external libraries are present and loaded. How do 
you start Solr? Try explicitly setting solr.install.dir or set absolute 
path to libs and see in logs if they are loaded.
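For reference, a sketch of the lib directives the stock solrconfig.xml uses
for the extraction contrib (paths may differ in your installation; an
absolute dir attribute works as well):

<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-cell-\d.*\.jar" />

Solr logs each jar it adds to the classloader at startup, so missing Tika
jars should show up when comparing the Mac and Ubuntu logs.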





Thanks,
Emir

On 25.01.2016 15:16, kostali hassan wrote:



I have a problem with integrating Solr on an Ubuntu server. Before using
Solr on the Ubuntu server I tested it on my Mac and it was working perfectly
for the DIH request handler and update/extract: it indexed my PDF, Doc and
Docx documents. So after installing Solr on the Ubuntu server and using the
same configuration files and libraries, I found out that Solr doesn't index
PDF documents, with no errors and no exceptions in the Solr log. But I can
search over .doc and .docx documents.

here some parts of my solrconfig.xml contents :


   


 
   true
   ignored_
   _text_
 
   

DIH config:



tika.config.xml



tika.config.xml


 
 
 
 
 
 
  



 

 


 
 
 
 




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: To Detect Wheter Core is Available To Post

2016-01-26 Thread Emir Arnautovic

Hi Edwin,
Assuming you are using SolrCloud - why do you need specific core? Can 
you use some of status actions from collection API - there is 
CLUSTERSTATUS action?
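As a sketch (host, port and collection name are placeholders):

http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=collection1&wt=json

The response lists each shard and replica with its state (active, down,
recovering), which is usually enough to decide whether posting will succeed.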


Thanks,
Emir

On 26.01.2016 05:34, Edwin Lee wrote:

Hi All,

Our team is using Solr to process logs and we have hit a problem with Solr
posting.

We want to detect the health of each core - whether it is available to
post to. We have come up with these ways to do that:

1. Using a Luke request - the cost is a bit high due to core loading.
2. We have designed a cache, adding hooks when the core is opened or
closed to record whether the core is loaded. *Question: if a core is
loaded, is there a situation where we still cannot post data to it?*
3. We post some meaningless data with our own unique id, and delete that
data within the same commit; for example, if we use JSON to post data, it
looks like this:

{
  "add": {
    "doc": {
      "id": "%%ID%%"
    }
  },
  "delete": { "id": "%%ID%%" },
  "commit": {}
}

*But we are still not 100% sure whether it will mess up our normal data.*

What is the best way to meet this requirement? We would like to hear your
opinions.

Thank you.

Regards,
Edwin Lee
20160126



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Apache solr can be made near-real-Time???

2016-01-28 Thread Emir Arnautovic

Hi Samina,
First to thank you for teaching me what "lakh" is :)

Solr is capable of handling large amounts of data, but that requires a
large Solr cluster. What you need to determine is what "real time" means for
you - what is the max time you can tolerate for an update to become visible
- and determine an acceptable query latency. After that you need to test
with different shard sizes to achieve the target latency. Then you can
extrapolate it to your full data set and see how many shards you need.

What you can do with your data to reduce hw requirements:
* remove from index anything that is not needed
* in case you have time related data you can use time slicing
* in case of multi tenant index you can use routing

Regards,
Emir

On 28.01.2016 12:20, Samina wrote:

I want to use Solr for enterprise-level search on a large amount of data (in
TB), where lakhs of records will be updated in an hour and approx 3 lakhs
of records would be searched in one hour. These are just rough values,
though close. So how can we achieve near-real-time search in Solr? And to
what degree is real-time search possible on this large data?
Can we even achieve this by doing indexing at certain
intervals (automatic/manual)?
Please help and suggest.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Apache-solr-can-be-made-near-real-Time-tp4253808.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: implement exact match for one of the search fields only?

2016-01-28 Thread Emir Arnautovic

Hi Derek,
It is not clear what you are trying to achieve: "one of the search
fields is an exact phrase match while the rest of the search fields can
be exact or partial matches". What does "while" mean - does it have to match
in the other fields as well, or should the result just be scored better if
it does, without that being mandatory?

For exact matching you can use the string type instead of text.
For querying multiple fields you can take a look at (e)dismax query parser.
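Purely as a sketch using the field names from the original mail (whether
spp_keyword_exact behaves as an exact match depends on it being a
string/untokenized type):

q=dvd bracket
defType=edismax
qf=P_VeryShortDescription P_ShortDescription P_CatConcatKeyword spp_keyword_exact

Each clause can then match in any of the listed fields, and per-field boosts
(e.g. spp_keyword_exact^5) can be used to favour the exact-match field.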

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 28.01.2016 10:52, Derek Poh wrote:

Hi

First of all, sorry for the long post.

How do I implement or structure the query such that one of the search
fields is an exact phrase match while the rest of the search fields
can be exact or partial matches? Is this possible?


I have the following search fields
- P_VeryShortDescription
- P_ShortDescription
- P_CatConcatKeyword
- spp_keyword_exact

For the spp_keyword_exact field, I want to apply an exact match to it.

I have a document with the following information. If I search for
'dvd', this document should not match. However, if I search for 'dvd
bracket', this document should match.

Right now, when I search for 'dvd', it is not returned, which is correct.
I want it to be returned when I search for 'dvd bracket', but it is not.
I tried enclosing it in double quotes, "dvd bracket", but it is not
returned. Then again, I can't enclose the search terms in double quotes
"dvd bracket", as those documents with the words 'dvd' and 'bracket' in
the other fields will not be matched, am I right?


doc:

TV Mounts
dvd bracket

TV Mounts
Swivel TV Mounts, Suitable for 26-42 
Inches Screen

Swivel TV mounts


Here are the fields definition:

type="gs_keyword_exact" multiValued="true"/>


positionIncrementGap="100">

  



  
  



  



The other search fields are defined as

positionIncrementGap="100">

  
  
  
  words="stopwords.txt" />
  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

  
  

  
   
  words="stopwords.txt" />
  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

  
  


Derek



Re: implement exact match for one of the search fields only?

2016-01-29 Thread Emir Arnautovic

Hi Derek,
What if it does not match the other fields but just the exact match field?
From the original question I assume it should return such results. It seems
to me that you are AND-ing your fields and that is the reason why your query
is not returning anything. Can you try just the exact match field and see if
it returns a match?

Maybe it is best if you explain to us what you want with a couple of
documents/queries/results, and we can give you suggestions for the config
and query.


Regards
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 29.01.2016 02:03, Derek Poh wrote:

Hi Emir

For the other search fields, if they have matches, the document should be returned.

On 1/28/2016 8:17 PM, Emir Arnautovic wrote:

Hi Derek,
It is not clear what you are trying to achieve: "one of the search 
fields is an exact phrase match while the rest of the search fields 
can be exact or partial matches". What does "while" mean - it has to 
match in other fields as well or result should be scored better if it 
does but not mandatory to match?

For exact match you can use string type instead of text.
For querying multiple fields you can take a look at (e)dismax query 
parser.


Regards,
Emir






Re: Solr segment merging in different replica

2016-02-01 Thread Emir Arnautovic

Hi Edwin,
What is your setup - SolrCloud or master-slave? If it is SolrCloud, then
under normal index updates each core behaves as an independent index.
In theory, if all changes happened at the same time on all nodes, merges
would happen at the same time. But that is not realistic, and it is
expected that they happen at slightly different times.
If you are running master-slave, then new segments will be copied from
the master to the slave.


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 01.02.2016 11:56, Zheng Lin Edwin Yeo wrote:

Hi,

I would like to check: during segment merging, how does the replica node do
the merging?
Will it do the merging concurrently, or will the replica node delete the
old segment and replace it with the new one?

Also, is it possible to separate the network interface for inter-node
communication from the network interface for update/search requests?
If so, I could put two network cards in each machine and route the index and
search traffic over the first interface and the traffic for inter-node
communication (sending documents to replicas) over the second interface.

I'm using Solr 5.4.0

Regards,
Edwin





Re: Solr segment merging in different replica

2016-02-02 Thread Emir Arnautovic

Hi Edwin,
Do you see any signs of the network being a bottleneck that would justify
such a setup? I would suggest you monitor your cluster before deciding
whether you need separate interfaces for external and internal
communication. Sematext's SPM (http://sematext.com/spm) allows you to
monitor SolrCloud, hosts and network and identify bottlenecks in your
cluster.


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 02.02.2016 00:50, Zheng Lin Edwin Yeo wrote:

Hi Emir,

My setup is SolrCloud.

Also, would it be good to use a separate network interface to connect the
two nodes, alongside the interface that is used to connect to the network
for searching?

Regards,
Edwin


On 1 February 2016 at 19:01, Emir Arnautovic 
wrote:


Hi Edwin,
What is your setup - SolrCloud or Master-Slave? If it si SolrCloud, then
under normal index updates, each core is behaving as independent index. In
theory, if all changes happen at the same time on all nodes, merges will
happen at the same time. But that is not realistic and it is expected to
happen in slightly different time.
If you are running Master-Slave, then new segments will be copied from
master to slave.

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




On 01.02.2016 11:56, Zheng Lin Edwin Yeo wrote:


Hi,

I would like to check, during segment merging, how did the replical node
do
the merging?
Will it do the merging concurrently, or will the replica node delete the
old segment and replace the new one?

Also, is it possible to separate the network interface for inter-node
communication from the network interface for update/search requests?
If so I could put two network cards in each machine and route the index
and
search traffic over the first interface and the traffic for the inter-node
communication (sending documents to replicas) over the second interface.

I'm using Solr 5.4.0

Regards,
Edwin




Re: Solr segment merging in different replica

2016-02-03 Thread Emir Arnautovic

Hi Edwin,
Master-Slave's main (maybe only) advantage is simpler infrastructure - 
it does not use ZK. It also assumes you don't need NRT search, since 
there have to be longer periods between replicating master changes to the slaves.
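
For reference, a minimal Master-Slave sketch in solrconfig.xml (host and core names are placeholders); the pollInterval is what introduces the delay between master and slaves mentioned above:

  <!-- on the master -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
    </lst>
  </requestHandler>

  <!-- on each slave -->
  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/core1</str>
      <str name="pollInterval">00:05:00</str>
    </lst>
  </requestHandler>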


Regards,
Emir

On 03.02.2016 04:48, Zheng Lin Edwin Yeo wrote:

Hi Emir,

Thanks for your reply.

As currently both my main and replica are on the same server, and as I
am using the SolrCloud setup, both replicas are doing the merging
concurrently, which causes the memory usage of the server to be very high
and affects other functions like querying. This issue should be
eliminated when I shift my replica to another server.

I would like to check: will there be any advantage if I change to the
Master-Slave setup, as compared to the SolrCloud setup which I am currently
using?

Regards,
Edwin



On 2 February 2016 at 21:23, Emir Arnautovic 
wrote:


Hi Edwin,
Do you see any signs of network being bottleneck that would justify such
setup? I would suggest you monitor your cluster before deciding if you need
separate interfaces for external and internal communication. Sematext's SPM
(http://sematext.com/spm) allows you to monitor SolrCloud, hosts and
network and identify bottlenecks in your cluster.

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 02.02.2016 00:50, Zheng Lin Edwin Yeo wrote:


Hi Emir,

My setup is SolrCloud.

Also, will it be good to use a separate network interface to connect the
two node with the interface that is used to connect to the network for
searching?

Regards,
Edwin


On 1 February 2016 at 19:01, Emir Arnautovic <
emir.arnauto...@sematext.com>
wrote:

Hi Edwin,

What is your setup - SolrCloud or Master-Slave? If it si SolrCloud, then
under normal index updates, each core is behaving as independent index.
In
theory, if all changes happen at the same time on all nodes, merges will
happen at the same time. But that is not realistic and it is expected to
happen in slightly different time.
If you are running Master-Slave, then new segments will be copied from
master to slave.

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




On 01.02.2016 11:56, Zheng Lin Edwin Yeo wrote:

Hi,

I would like to check, during segment merging, how did the replical node
do
the merging?
Will it do the merging concurrently, or will the replica node delete the
old segment and replace the new one?

Also, is it possible to separate the network interface for inter-node
communication from the network interface for update/search requests?
If so I could put two network cards in each machine and route the index
and
search traffic over the first interface and the traffic for the
inter-node
communication (sending documents to replicas) over the second interface.

I'm using Solr 5.4.0

Regards,
Edwin





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: solr performance issue

2016-02-08 Thread Emir Arnautovic

Hi Sara,
Not sure if I am reading this right, but I read it as you have a 1000-doc 
index and issues? Can you tell us a bit more about your setup: number of 
servers, hw, index size, number of shards, queries that you run, do you 
index at the same time...


It seems to me that you are running Solr on a server with limited RAM and 
probably a small heap. Swapping will certainly slow things down, and GC is 
the most likely reason for high CPU.


You can use http://sematext.com/spm to collect Solr and host metrics and 
see where the issue is.


Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 08.02.2016 10:27, sara hajili wrote:

hi all.
i have a problem with my solr performance and hardware usage (RAM, CPU...).
i have a lot of documents indexed in solr, about 1000 docs, and every doc has
about 8 fields on average.
each field has about 60 chars.
i set my fields as stored="false" except for 1 field. // i read
that this helps performance.
i used copy fields and dynamic fields only where necessary. // i read that
this helps performance.
and now my question: when i run a lot of queries on solr i face a problem -
solr uses more cpu and ram, and once those are filled it uses a lot of swap
space and then the hard disk, but doesn't create a system file! solr fills
the hard disk until i am forced to restart the server to release it.
so why does solr behave this way? and how can i keep solr from using this
much cpu and disk space?
any config needed?!





Re: solr performance issue

2016-02-08 Thread Emir Arnautovic

Hi Sara,
That is still considered a small index. Can you give us a bit more detail 
about your setup?


Thanks,
Emir

On 08.02.2016 12:04, sara hajili wrote:

sorry, i made a mistake - i have about 1000 K docs.
i mean about 100 docs.

On Mon, Feb 8, 2016 at 1:35 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi Sara,
Not sure if I am reading this right, but I read it as you have 1000 doc
index and issues? Can you tell us bit more about your setup: number of
servers, hw, index size, number of shards, queries that you run, do you
index at the same time...

It seems to me that you are running Solr on server with limited RAM and
probably small heap. Swapping for sure will slow things down and GC is most
likely reason for high CPU.

You can use http://sematext.com/spm to collect Solr and host metrics and
see where the issue is.

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 08.02.2016 10:27, sara hajili wrote:


hi all.
i have a problem with my solr performance and usage hardware like a
ram,cup...
i have a lot of document and so indexed file about 1000 doc in solr that
every doc has about 8 field in average.
and each field has about 60 char.
i set my field as a storedfield = "false" except of  1 field. // i read
that this help performance.
i used copy field and dynamic field if it was necessary . // i read that
this help performance.
and now my question is that when i run a lot of query on solr i faced with
a problem solr use more cpu and ram and after that filled ,it use a lot
   swapped storage and then use hard,but doesn't create a system file! solr
fill hard until i forced to restart server to release hard disk.
and now my question is why solr treat in this way? and how i can avoid
solr
to use huge cpu space?
any config need?!




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Solr architecture

2016-02-08 Thread Emir Arnautovic

Hi Mark,
Can you give us a bit more details: size of docs, query types, are docs 
grouped somehow, are they time sensitive, will they be updated or is the 
index rebuilt every time, etc.


Thanks,
Emir

On 08.02.2016 16:56, Mark Robinson wrote:

Hi,
We have a requirement where we would need to index around 2 billion docs in
a day.
The queries against this indexed data set can be around 80K queries per
second during peak time, and around 12K queries per second during non-peak
hours.

Can Solr handle these huge volumes?

If so, assuming we have no budget constraints, what would be a
recommended Solr setup (number of shards, number of Solr instances etc...)?

Thanks!
Mark



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Solr architecture

2016-02-10 Thread Emir Arnautovic
... appended to existing data already captured related to that session
and indexed back into Solr. So, the longer the session, the more data is
retrieved for each subsequent query to get the current data captured for
that session.
  Also, querying can be done on timestamp etc... which is captured along
with each action.
3. Are docs grouped somehow?
  All data related to a session are retrieved from Solr, updated and
indexed back to Solr based on sessionId. No other grouping.
4. Are they time sensitive (NRT or offline process does this)?
  As mentioned above, this is in NRT. Each time a new user action in that
session happens, we need to query the existing session info already
captured related to that session, append this new data to the existing
info retrieved, and index it back to Solr.
5. Will they update or is it rebuilt every time, etc.?
  Each time a new user action occurs, the full data pertaining to that
session so far captured is retrieved from Solr, the extra latest data
pertaining to this new action is appended, and it is indexed back to Solr.
6. And the other thing you haven't told us is whether you plan on
_adding_ 2B docs a day or whether that number is the total corpus size and
you are re-indexing the 2B docs/day. IOW, if you are adding 2B docs/day, 30
days later do you have 2B docs or 60B docs in your corpus?
  We are expecting around 4 million sessions per day (per session 500
writes to Solr), which turns out to be 2B indexing operations done per day.
So after 30 days it would be 4 million * 30 docs in the index.
7. Is there any aging of docs?
  No, we always query against the whole corpus present.
8. Is any doc deleted?
  No, all data remains in the index.

Any suggestion is very welcome!

Thanks!
Mark.


On Mon, Feb 8, 2016 at 3:30 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

Oops... at 100 qps for a single node you would need 120 nodes to get to 12K
qps and 800 nodes to get 80K qps, but that is just an extremely rough
ballpark estimate, not some precise and firm number. And that's if all the
queries can be evenly distributed throughout the cluster and don't require
fanout to other shards, which effectively turns each incoming query into n
queries where n is the number of shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:


So is there any aging or TTL (in database terminology) of older docs?
And do all of your queries need to query all of the older documents all of
the time, or is there a clear hierarchy of querying for aged documents, like
past 24-hours vs. past week vs. past year vs. older than a year? Sure, you
can always use a function query to boost by the inverse of document age,
but Solr would be more efficient with filter queries or separate indexes
for different time scales.

Are documents ever updated or are they write-once?

Are documents explicitly deleted?

Technically you probably could meet those specs, but... how many
organizations have the resources and the energy to do so?

As a back of the envelope calculation, if Solr gave you 100 queries per
second per node, that would mean you would need 1,200 nodes. It would also
depend on whether those queries are very narrow so that a single node can
execute them or if they require fanout to other shards and then aggregation
of results from those other shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson <erickerick...@gmail.com> wrote:


Short form: You really have to prototype. Here's the long form:

https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

I've seen between 20M and 200M docs fit on a single piece of hardware,
so you'll absolutely have to shard.

And the other thing you haven't told us is whether you plan on
_adding_ 2B docs a day or whether that number is the total corpus size
and you are re-indexing the 2B docs/day. IOW, if you are adding 2B
docs/day, 30 days later do you have 2B docs or 60B docs in your
corpus?

Best,
Erick

On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar <susheel2...@gmail.com> wrote:

Also consider whether you are expecting indexing of 2 billion docs as NRT or
if it will be offline (during off hours etc). For more accurate sizing you
may also want to index say 10 million documents, which may give you an idea
of how big your index is, and then use that for extrapolation to come up
with memory requirements.

Thanks,
Susheel

On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:


Hi Mark,
Can you give us a bit more details: size of docs, query types, are docs
grouped somehow, are they time sensitive, will they be updated or is the
index rebuilt every time, etc.

Thanks,
Emir


On 08.02.2016 16:56, Mark Robinson wrote:


Hi,
We have a requirement where we would need to index 

Re: Solr architecture

2016-02-11 Thread Emir Arnautovic

Hi Mark,
Nothing comes for free :) With a doc per action, you will have to handle a 
large number of docs. There is a hard limit on the number of docs per shard - 
roughly 2 billion (the maximum value of a signed int) - so sharding is 
mandatory. It is most likely that you will have to have more than one 
collection. Depending on your queries, different layouts can be applied. 
What will those 320 qps be? Will you do some filtering (by user, country,...), 
will you focus on the latest data, what is your data retention strategy...


You should answer such questions and choose a setup that handles the 
important ones efficiently. With this amount of data you will most 
likely have to make some tradeoffs.


When it comes to sending docs to Solr, sending them in bulk is mandatory.
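
As an illustration, a minimal SolrJ sketch of bulk indexing (the URL, field names and the Event class are placeholders, not from this thread; error handling omitted):

  SolrClient client = new HttpSolrClient("http://localhost:8983/solr/sessions");
  List<SolrInputDocument> batch = new ArrayList<>();
  for (Event event : events) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", event.getId());
      doc.addField("sessionId", event.getSessionId());
      doc.addField("timestamp", event.getTimestamp());
      batch.add(doc);
      if (batch.size() == 1000) {   // batch size should be tuned to doc size
          client.add(batch);        // no explicit commit - rely on autoCommit settings
          batch.clear();
      }
  }
  if (!batch.isEmpty()) {
      client.add(batch);
  }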

Regards,
Emir

On 10.02.2016 22:48, Mark Robinson wrote:

Thanks everyone for your suggestions.
Based on it I am planning to have one doc per event with sessionId common.

So in this case hopefully indexing each doc as and when it comes would be
okay? Or do we still need to batch and index to Solr?

Also with 4M sessions a day with about 6000 docs (events) per session we
can expect about 24Billion docs per day!

Will Solr still hold up? If so, could someone please recommend a sizing
to cater to this level of data.
The query rate is around 320 qps.

Thanks!
Mark


On Wed, Feb 10, 2016 at 3:38 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi Mark,
Appending session actions just to be able to return more than one session
without retrieving large number of results is not good tradeoff. Like
Upayavira suggested, you should consider storing one action per doc and
aggregate on read time or push to Solr once session ends and aggregate on
some other layer.
If you are thinking handling infrastructure might be too much, you may
consider using some of logging services to hold data. One such service is
Sematext's Logsene (http://sematext.com/logsene).

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 10.02.2016 03:22, Mark Robinson wrote:


Thanks for your replies and suggestions!

Why I store all events related to a session under one doc?
Each session can have about 500 total entries (events) corresponding to
it.
So when I try to retrieve a session's info it can back with around 500
records. If it is this compounded one doc per session, I can retrieve more
sessions at a time with one doc per session.
eg under a sessionId an array of eventA activities, eventB activities
   (using json). When an eventA activity again occurs, we will read all
that
data for that session, append this extra info to evenA data and push the
whole session related data back (indexing) to Solr. Like this for many
sessions parallely.


Why NRT?
Parallely many sessions are being written (4Million sessions hence
4Million
docs per day). A person can do this querying any time.

It is just a look up?
Yes. We just need to retrieve all info for a session and pass it on to
another system. We may even do some extra querying on some data like
timestamps, pageurl etc in that info added to a session.

Thinking of having the data separate from the actual Solr Instance and
mention the loc of the dataDir in solrconfig.

If Solr is not a good option could you please suggest something which will
satisfy this use case with min response time while querying.

Thanks!
Mark

On Tue, Feb 9, 2016 at 6:02 PM, Daniel Collins 
wrote:

So as I understand your use case, its effectively logging actions within a

user session, why do you have to do the update in NRT?  Why not just log
all the user session events (with some unique key, and ensuring the
session
Id is in the document somewhere), then when you want to do the query, you
join on the session id, and that gives you all the data records for that
session. I don't really follow why it has to be 1 document (which you
continually update). If you really need that aggregation, couldn't that
happen offline?

I guess your 1 saving grace is that you query using the unique ID (in
your
scenario) so you could use the real-time get handler, since you aren't
doing a complex query (strictly its not a search, its a raw key lookup).

But I would still question your use case, if you go the Solr route for
that
kind of scale with querying and indexing that much, you're going to have
to
throw a lot of hardware at it, as Jack says probably in the order of
hundreds of machines...

On 9 February 2016 at 19:00, Upayavira  wrote:

Bear in mind that Lucene is optimised towards high read lower write.

That is, it puts in a lot of effort at write time to make reading
efficient. It sounds like you are going to be doing far more writing
than reading, and I wonder whether you are necessarily choosing the
right tool for the job.

How would you later use this data, and what advantage is there to
storing it in Solr?

Upayavira

On Tue, Feb 9, 2016, at 03:

Re: solr-4.3.1 docValues usage

2016-02-15 Thread Emir Arnautovic

Hi,
Not sure how ordering will help (maybe I am missing the question), but what 
seems to me would help your case is simple boosting. See 
https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_make_.22superman.22_in_the_title_field_score_higher_than_in_the_subject_field


Regards,
Emir

On 15.02.2016 14:16, Binoy Dalal wrote:

DocValues has nothing to do with your handler. It is a field property. To
use it simply put docValues=true in your field definitions and reindex.
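
For example, in schema.xml (using the "state" field from the question below; the other attributes are only illustrative):

  <field name="state" type="string" indexed="true" stored="false" docValues="true"/>

After changing the definition you need to reindex before the DocValues data is available.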

On Mon, 15 Feb 2016, 18:40 Neeraj Lajpal  wrote:


Hi,
I recently asked this question on stackoverflow:
I am trying to access a field in a custom request handler. I am accessing it
like this for each document:

  Document doc = reader.document(id);
  DocFields = doc.getValues("state");

There are around 600,000 documents in the solr. For a query running on all
the docs, it is taking more than 65 seconds.
I have also tried the SolrIndexSearcher.doc method, but it is also taking
around 60 seconds.
Removing the above lines of code brings down the qtime to milliseconds.
But I need to access that field for my algo.
Is there a more optimised way to do this?



In reply, I got a suggestion to use docValues. I read about it and it
seems to be useful for my case. But I am unable to find/figure out how to
use it in my custom request handler.
Please tell me if there is some function/api to access a docValues field
from a custom handler, that takes as input the doc id and the field.
Thanks,
Neeraj Lajpal


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: solr-4.3.1 docValues usage

2016-02-15 Thread Emir Arnautovic

Sorry - replied to wrong thread :(

On 15.02.2016 15:17, Emir Arnautovic wrote:

Hi,
Not  sure how ordering will help (maybe missing question) but what 
seems to me that would help your case is simple boosting. See 
https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_make_.22superman.22_in_the_title_field_score_higher_than_in_the_subject_field 



Regards,
Emir 


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: SOLR ranking

2016-02-15 Thread Emir Arnautovic

Hi,
Not sure how ordering will help (maybe I am missing the question), but what 
seems to me would help your case is simple boosting. See 
https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_make_.22superman.22_in_the_title_field_score_higher_than_in_the_subject_field



Regards,
Emir

On 15.02.2016 14:14, Binoy Dalal wrote:

Use the sort parameter with your query and pass the fields in the order in
which you want to sort them.
So if you want topic > subtopic > index > drug > content all ascending,
your sort parameter will look like
&sort=topic asc,subtopic asc,index asc,drug asc,content asc

On Mon, 15 Feb 2016, 18:17 Nitin.K  wrote:


I have five fields in SOLR
topic_title
subtopic_title
index_terms - Multivalued
drug - Multivalued
content

- Now, I want to rank the documents using all these fields; I want all the
documents that have the search term in topic_title to come first in the
order, then documents having the search term in subtopic_title, and so on.

Example: If two documents have the search term in topic_title then solr
should look at subtopic_title; similarly, if the search term is present in
both the topic_title and subtopic_title fields then it should look at
index_term, and so on, to decide the ranking order.

- I don't want to consider the no. of occurrences in multivalued fields, but
if two documents have the search term in topic_title, subtopic_title,
index_term and drug, then the documents should be ranked in the order of
the no. of occurrences inside the content field.


Kindly help in this. I will be really thankful



--
View this message in context:
http://lucene.472066.n3.nabble.com/SOLR-ranking-tp4257367.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: SOLR ranking

2016-02-15 Thread Emir Arnautovic

Hi Nitin,
You can use the pf parameter to boost results containing the exact phrase. 
You can also use pf2 and pf3 to boost results with bigrams and trigrams 
(phrase matches of 2 or 3 consecutive words when the input has more than 
2 or 3 words).
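
As a sketch, with the fields from this thread (boost values are only an example):

  q=eating disorders
  &defType=edismax
  &qf=topic_title^100 subtopic_title^40 index_term^20 drug^15 content^3
  &pf=topic_title^200 subtopic_title^80 content^6
  &pf2=topic_title^200 subtopic_title^80 content^6

With pf, documents where the whole phrase "eating disorders" appears in one of the listed fields score above documents that only contain the two words separately.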


Regards,
Emir

On 16.02.2016 06:18, Nitin.K wrote:

I am using edismax parser with the following query:

localhost:8983/solr/tgl/select?q=eating%20disorders&wt=xml&tie=1.0&rows=200&q.op=AND&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true&debugQuery=true&qf=topic_title%5E100+subtopic_title%5E40+index_term%5E20+drug%5E15+content%5E3&pf2=topTitle%5E200+subTopTitle%5E80+indTerm%5E40+drugString%5E30+content%5E6

Configuration of schema.xml










































I want that, if a user searches for a phrase, then that phrase should always
take priority in comparison to the individual words;

Example: "Eating Disorders"

First it will search for "Eating Disorders" together and then the individual
words "Eating" and "Disorders", but while searching for individual words it
should always return only those documents where both words exist, for which
I am already using q.op="AND" in my query.

Thanks,
Nitin




--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-ranking-tp4257367p4257510.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: SOLR ranking

2016-02-16 Thread Emir Arnautovic

Hi Nitin,
Not sure if you changed which fields you use for the phrase boost, but in 
the example you sent, all fields except content are "string" fields, and 
content is boosted with 6 while topic_title in qf is boosted with 100. 
Try setting the same fields you use in qf in pf2 and you should see the 
difference. After that you can play with field analysis and decide which 
fields to use just for boosting.
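
In other words, something along these lines (boost values are only illustrative), where pf/pf2 reference the same analyzed text fields as qf rather than the separate "string" copies:

  qf =topic_title^100 subtopic_title^40 index_term^20 drug^15 content^3
  pf =topic_title^200 subtopic_title^80 index_term^40 drug^30 content^6
  pf2=topic_title^200 subtopic_title^80 index_term^40 drug^30 content^6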


Regards,
Emir

On 16.02.2016 11:30, Nitin.K wrote:

Hi Emir,

I tried using the boost parameters for phrase search by removing
omitTermFreqAndPositions from the multivalued field type, but somehow while
searching phrases the documents that have the exact match are not coming up
in order. Instead, in the content field, it is considering the combined count
of both terms and deciding the order based on that.

Kindly let me know how I can first search for the phrase and then fall back
to the individual words (i.e. word-1 AND word-2).



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-ranking-tp4257367p4257556.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Which open-source crawler to use with SolrJ and Postgresql ?

2016-02-16 Thread Emir Arnautovic

Hi,
It is most common to use Nutch as the crawler, but it seems that it still 
does not have support for SolrCloud (if I am reading this ticket 
correctly: https://issues.apache.org/jira/browse/NUTCH-1662). Anyway, I 
would recommend Nutch with the standard HTTP client.


Regards,
Emir

On 16.02.2016 16:02, Victor D'agostino wrote:

Hi

I am building a Solr 5 architecture with 3 Solr nodes and 1 zookeeper.
The database backend is postgresql 9 on RHEL 6.

I am looking for a free open-source crawler which uses SolrJ.

What do you guys recommend ?

Best regards
Victor d'Agostino






--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Which open-source crawler to use with SolrJ and Postgresql ?

2016-02-16 Thread Emir Arnautovic

Markus,
The ticket I ran into is for Nutch 2 and NUTCH-2197 is for Nutch 1.

I haven't been using Nutch for a while, so I cannot recommend a version.

Thanks,
Emir

On 16.02.2016 16:37, Markus Jelsma wrote:

Nutch has Solr 5 cloud support in trunk, i committed it earlier this month.
https://issues.apache.org/jira/browse/NUTCH-2197

Markus
  
-Original message-

From:Emir Arnautovic 
Sent: Tuesday 16th February 2016 16:26
To: solr-user@lucene.apache.org
Subject: Re: Which open-source crawler to use with SolrJ and Postgresql ?

Hi,
It is most common to use Nutch as crawler, but it seems that it still
does not have support for SolrCloud (if I am reading this ticket
correctly https://issues.apache.org/jira/browse/NUTCH-1662). Anyway, I
would recommend Nutch with standard http client.

Regards,
Emir

On 16.02.2016 16:02, Victor D'agostino wrote:

Hi

I am building a Solr 5 architecture with 3 Solr nodes and 1 zookeeper.
The database backend is postgresql 9 on RHEL 6.

I am looking for a free open-source crawler which use SolrJ.

What do you guys recommend ?

Best regards
Victor d'Agostino





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: SOLR ranking

2016-02-18 Thread Emir Arnautovic

Hi Nitin,
Can you send us how your parsed query looks (from the debug output)?

Thanks,
Emir

On 17.02.2016 08:38, Nitin.K wrote:

Hi Binoy,

We are searching for both phrases and individual words,
but we want only those documents which contain the phrase to come
first in the order, and then the ones matching the individual words.

termPositions = true is also not working in my case.

I have also removed the string type from the copy fields. Kindly look into the
changed configuration below:

Hi Emir,

I have changed the configuration as per your suggestion and added pf2 / pf3.
Yes, I saw the difference, but the ranking is still not followed correctly
in the case of phrases.

Changed configuration;















Copy fields again for the reference :







Added following field type:









Removed the string type from the copy fields.

Changed Query :

http://localhost:8983/solr/tgl/select?q=rheumatoid%20arthritis&wt=xml&tie=1.0&rows=200&q.op=AND&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true&debugQuery=true&;
pf=topTitle^200 subtopTitle^80 indTerm^40 drugString^30 tglData^6&
pf2=topTitle^200 subtopTitle^80 indTerm^40 drugString^30 tglData^6&
pf3=topTitle^200 subtopTitle^80 indTerm^40 drugString^30 tglData^6&
qf=topic_title^100 subtopic_title^40 index_term^20 drug^15 content^3

After making these changes, I am able to get my search results correctly for
a single term but in case of phrase search, i am still not able to get the
results in the correct order.

Hi Modassar,

I tried using mm=100, but the order is still the same.

Hi Alessandro,

I have not yet tried the slop parameter. By default it is taken as 1.0
when I look at it in debug mode. I will definitely revert back; let me try
this option too.

All,

Please share if anyone has any other suggestion on this. I have to
implement it on an urgent basis and I think I am very close. Thanks to all
of you - I have reached this level just because of you guys.

Thanks and Regards,
Nitin



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-ranking-tp4257367p4257782.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Sort vs boost

2016-02-22 Thread Emir Arnautovic

Hi Anil,
The decision also depends on your use case - if you are sure that there will 
be no cases where document matches have different scores, or you don't 
care how well a document matches the query (e.g. all queries will be 
single-term queries), then sorting by time is the way to go. But if there is 
a chance that some doc ends up first because it is the latest even though 
it poorly matches the query, then sorting by time is not an option.
In cases like this I usually think of the extreme case - imagine a frequent 
term "X" that you forgot to put in your stopwords, and you search for "A X B". 
If you are OR-ing terms in queries, the top document will be the latest 
added document containing X, regardless of whether it has A or B. If there 
is a chance of such a scenario, you should use a boost - it may be slightly 
more expensive, but it is much safer.
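
As an illustration, assuming a date field named create_time (a placeholder name; the recip() constants below are the commonly used ones for millisecond precision and can be tuned), the two options look roughly like:

  sort by recency:
    &sort=create_time desc

  boost by recency (edismax):
    &boost=recip(ms(NOW/HOUR,create_time),3.16e-11,1,1)

With the boost variant recency only multiplies the relevancy score, so a poorly matching but recent document cannot jump over a strong match.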


Regards,
Emir

On 22.02.2016 11:39, Anil wrote:

Hi,

we would like to display recent records on top.

two ways

1. boost by create time desc
2. sort create time by desc

i tried both, seems both looks good.

which one is better in terms of performance ?

i noticed sort is better than boost in terms of performance.

Please correct me if I am wrong.

Regards,
Anil



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Query time de-boost

2016-02-24 Thread Emir Arnautovic

Hi Shamik,
Is boosting the others an acceptable option for you, e.g. 
ContentGroup:"NonDeveloper"^100?

Which query parser do you use?

Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 23.02.2016 23:42, Shamik Bandopadhyay wrote:

Hi,

   I'm looking into the possibility of de-boosting a set of documents during
query time. In my application, when I search for e.g. "preferences", I want
to de-boost content tagged with ContentGroup:"Developer" or in other words,
push those content back in the order. Here's the catch. I've the following
weights.

text^1.5 title^4 IndexTerm^2

As you can see, Title has a higher weight.

Now, a bunch of content tagged with ContentGroup:"Developer" consists of a
title like "Preferences.material" or "Preferences Property" or
"Preferences.graphics". The boost on title pushes these documents at the
top.

What I'm looking to see is if there's a way to de-boost all documents that are
tagged with ContentGroup:"Developer", irrespective of whether the term
occurrence is in text or title.

Any pointers will be appreciated.

Thanks,
Shamik



Re: What search metrics are useful?

2016-02-24 Thread Emir Arnautovic

Hi Bill,
You can take a look at Sematext's search analytics 
(https://sematext.com/search-analytics). It provides some of the metrics you 
mentioned, plus some additional ones (top queries, CTR, click stats, paging 
stats etc.). In combination with Sematext's performance metrics 
(https://sematext.com/spm) you can get a full picture of your search 
infrastructure.


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 24.02.2016 04:07, William Bell wrote:

How do others look at search metrics?

1. Search conversion? Do you look at searches and if the user does not
click on a result, and reruns the search that would be a failure?

2. How to measure auto complete success metrics?

3. Facets/filters could be considered negative, since we did not find the
results that the user wanted, and now they are filtering - how to measure?

4. One easy metric is searches with 0 results. We could auto expand the geo
distance or ask the user "did you mean" ?

5. Another easy one would be tech performance: "time it takes in seconds to
get a result".

6. How to measure fuzzy? How do you know you need more synonyms? How to
measure?

7. How many searches it takes before the user clicks on a result?

Other ideas? Is there a video or presentation on search metrics that would
be useful?





Re: Query time de-boost

2016-02-25 Thread Emir Arnautovic

Hi Shamik,
You are right that boosting with values lower than 1 is still a positive 
boost, but you can boost with a negative value and that should do the 
trick, so you can do bq=ContentGroup-local:Developer^-99 (note that it can 
result in a negative score).
If you need more than just Developer/Others you can also introduce an 
additional field that can be used for boosting. Also, you can use the 
dismax/edismax bf parameter to get more control.
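
Putting it together, a request along these lines (boost values are only illustrative) matches what is discussed in this thread:

  q=preferences
  &defType=edismax
  &qf=text^1.5 title^4 IndexTerm^2
  &bq=Source:simplecontent^10 Source:Help^20 ContentGroup-local:Developer^-99

Documents tagged with ContentGroup-local:Developer still match, but the negative boost pushes them towards the end of the result list.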


Regards,
Emir

On 24.02.2016 17:27, shamik wrote:

Hi Emir,

 I've a bunch of contentgroup values, so boosting them individually is
cumbersome. I've boosting on query fields

qf=text^6 title^15 IndexTerm^8

and

bq=Source:simplecontent^10 Source:Help^20
(-ContentGroup-local:("Developer"))^99

I was hoping *(-ContentGroup-local:("Developer"))^99* will implicitly boost
the rest, but that didn't happen.

I'm using edismax.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259551.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Query time de-boost

2016-02-26 Thread Emir Arnautovic

Hi Jack,
I just checked on 5.5 and 0.1 is a positive boost.

Regards,
Emir

On 26.02.2016 01:11, Jack Krupansky wrote:

0.1 is a fractional boost - all intra-query boosts are multiplicative, not
additive, so term^0.1 reduces the term by 90%.

-- Jack Krupansky

On Wed, Feb 24, 2016 at 11:29 AM, shamik  wrote:


Binoy, 0.1 is still a positive boost. With title getting the highest
weight,
this won't make any difference. I've tried this as well.



--
View this message in context:
http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Query time de-boost

2016-02-28 Thread Emir Arnautovic

Hi Jack,
I think we are talking about different things: I agree that boost is 
multiplicative, and boost values less than one will reduce the score, but 
if you use such a boost value in bq, it will still boost documents that 
match it. The simplest example is with ids. If you query:

  q=id:a OR id:b
both doc a and doc b will have the same score. If you boost a^2 it will be 
first; if you boost a^0.1 it will be second. But if you use dismax's 
bq=id:a^0.1, doc a will be first. In such a case you have to use a negative 
boost to make sure it is last.


Are we on the same page now?

Regards,
Emir

On 26.02.2016 16:00, Jack Krupansky wrote:

Could you share your actual numbers and test case? IOW, the document score
without ^0.01 and with ^0.01.

Again, to repeat, the specific boost factor may be positive, but the effect
of a fractional boost is to reduce, not add, to the score, so that a score
of 0.5 boosted by 0.1 would become 0.05. IOW, it de-boosts occurrences of
the term.

The point remains that you do not need a "negative boost" to de-boost a
term.


-- Jack Krupansky

On Fri, Feb 26, 2016 at 4:01 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


Hi Jack,
I just checked on 5.5 and 0.1 is positive boost.

Regards,
Emir


On 26.02.2016 01:11, Jack Krupansky wrote:


0.1 is a fractional boost - all intra-query boosts are multiplicative, not
additive, so term^0.1 reduces the term by 90%.

-- Jack Krupansky

On Wed, Feb 24, 2016 at 11:29 AM, shamik  wrote:

Binoy, 0.1 is still a positive boost. With title getting the highest

weight,
this won't make any difference. I've tried this as well.



--
View this message in context:

http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
Sent from the Solr - User mailing list archive at Nabble.com.



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Filter factory to reduce word from plural forms to singular forms correctly?

2016-02-29 Thread Emir Arnautovic

Hi Derek,
Why does aggressive stemming worry you? You might get false 
positives, but that is the desired behavior in most cases. In your case 
"iphone" documents will also be returned for an "iphon" query. Is that 
behavior not desired? You can have more than one field 
if you want to prefer matches with the exact wording, but that is 
unnecessary overhead in most cases.
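
If you do decide to keep an exact-wording field next to the stemmed one, a sketch of that setup (field and type names are made up for the example):

  <field name="product_name"       type="text_en_stemmed" indexed="true" stored="true"/>
  <field name="product_name_exact" type="text_general"    indexed="true" stored="false"/>
  <copyField source="product_name" dest="product_name_exact"/>

and query with e.g. qf=product_name_exact^4 product_name so that documents matching the exact wording score higher than purely stemmed matches.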


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 29.02.2016 10:40, Derek Poh wrote:

Hi

I am using EnglishMinimalStemFilterFactory to reduce words in plural 
forms to singular forms.
The filter factory is not reducing the plural form 'es' to the 
singular form correctly. It is reducing correctly for the plural form 's'.

"boxes" is reduced to "boxe" instead of "box"
"glasses" to "glasse" instead of "glass" etc.

I tried PorterStemFilterFactory, which is able to reduce the plural 
'es' form to the singular form correctly. However it reduced "iphones" to 
"iphon" instead.


Is there another filter factory that can reduce plural to singular 
correctly?


The field type definition of the field.
positionIncrementGap="100">















Re: Indexing books, chapters and pages

2016-03-01 Thread Emir Arnautovic

Hi,
Off the top of my head - it probably does not solve the problem completely, 
but may trigger brainstorming: index chapters and include page-break 
tokens. Use highlighting to return matches and make sure the fragment size 
is large enough to include a page-break token. In such a scenario you should 
use slop for phrase searches...
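
A rough sketch of the query side of that idea (the field name and the injected PAGEBREAK token are made up for the example):

  q=chapter_text:"some phrase"~2
  &hl=true
  &hl.fl=chapter_text
  &hl.fragsize=500
  &hl.snippets=3

The returned fragments would then be scanned for the injected PAGEBREAK tokens to map each match back to a page number.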


The more I write it, the less I like it, but I will not delete it...

Regards,
Emir

On 01.03.2016 12:56, Zaccheo Bagnati wrote:

Hi all,
I'm searching for ideas on how to define schema and how to perform queries
in this use case: we have to index books, each book is split into chapters
and chapters are split into pages (pages represent original page cutting in
printed version). We should show the result grouped by books and chapters
(for the same book) and pages (for the same chapter). As far as I know, we
have 2 options:

1. index pages as SOLR documents. In this way we could theoretically
retrieve chapters (and books?)  using grouping but
 a. we will miss matches across two contiguous pages (page cutting is
only due to typographical needs so concepts could be split... as in printed
books)
 b. I don't know if it is possible in SOLR to group results on two
different levels (books and chapters)

2. index chapters as SOLR documents. In this case we will have the right
matches but how to obtain the matching pages? (we need pages because the
client can only display pages)

we have been struggling on this problem for a lot of time and we're  not
able to find a suitable solution so I'm looking if someone has ideas or has
already solved a similar issue.
Thanks



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: FW: Difference Between Tokenizer and filter

2016-03-02 Thread Emir Arnautovic

Hi Rajesh,
The processing flow is the same for both indexing and querying. What is 
compared at the end are the resulting tokens. In general the flow is: 
text -> char filter -> filtered text -> tokenizer -> tokens -> filter1 -> 
tokens ... -> filterN -> tokens.


You can read more about analysis chain in Solr wiki: 
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters
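
For example, a typical analysis chain in schema.xml looks like this (a generic illustration, not your exact setup); the same chain is applied at index and at query time unless you declare separate <analyzer type="index"> and <analyzer type="query"> sections:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
    </analyzer>
  </fieldType>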


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 02.03.2016 10:00, G, Rajesh wrote:

Hi Team,

Can you please clarify the below. My understanding is that the tokenizer is used to say how the 
content should be indexed physically in the file system, and filters are used for query results. The 
below lines are from my setup. But I have seen examples that include filters inside  and the tokenizer in , which confused me.

 
 



 
 

 
 

My goal is to use solr to find the best match among the technology names, e.g.
Actual tech names:

1.   Microsoft Visual Studio

2.   Microsoft Internet Explorer

3.   Microsoft Visio

When a user types Microsoft Visal Studio, the user should get Microsoft Visual Studio. 
Basically, misspelled and jumbled words should match the closest tech name.









Re: understand scoring

2016-03-02 Thread Emir Arnautovic

Hi Michael,
Can you please run the query with debug enabled and share the title field configuration?

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 02.03.2016 09:14, michael solomon wrote:

Thank you, @Doug Turnbull. I tried http://splainer.io but it's not for my
query (no explain for the docs..).
Here is the picture again...
https://drive.google.com/file/d/0B-7dnH4rlntJc2ZWdmxMS3RDMGc/view?usp=sharing

On Tue, Mar 1, 2016 at 10:06 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:


Supposedly Late April, early May. But don't hold me to it until I see copy
edits :) Of course looks like now you can read at least the full ebook in
MEAP form.

-Doug

On Tue, Mar 1, 2016 at 2:57 PM, shamik  wrote:


Doug, do we've a date for the hard copy launch?



--
View this message in context:


http://lucene.472066.n3.nabble.com/understand-scoring-tp4260837p4260860.html

Sent from the Solr - User mailing list archive at Nabble.com.




--
*Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search 
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.



Re: Commit after every document - alternate approach

2016-03-02 Thread Emir Arnautovic

Hi Sangeetha,
What is sure is that this is not going to work - with 200-300K docs/hour, 
there will be >50 commits/second, meaning there is <20ms for each 
doc+commit.
What you can do is let Solr handle commits, and maybe use real-time get to 
verify a doc is in Solr, or do some periodic sanity checks.
Are you doing document updates, so that in-order updates are the reason why 
you commit each doc before moving to the next one?
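
For reference, a sketch of letting Solr handle commits in solrconfig.xml (the intervals are only an example and should be tuned):

  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60s, flushes to stable storage -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>             <!-- soft commit every 5s, makes docs visible to search -->
  </autoSoftCommit>

The indexing client then only calls add() and never issues commits itself.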


Regards,
Emir

On 02.03.2016 09:06, sangeetha.subraman...@gtnexus.com wrote:

Hi All,

I am trying to understand how we can have commits issued to solr while 
indexing documents. Around 200K to 300K documents per hour, with an avg size of 
10 KB each, will be getting into SOLR. JAVA code fetches the documents from 
MQ and streams them to SOLR. The problem is that the client code issues a 
hard commit after each document sent to SOLR for indexing, and it waits 
for the response from SOLR to get assurance that the document got indexed 
successfully. Only if it gets an OK status from SOLR is the document cleared 
out of MQ.

As far as I understand, doing a commit after each document is an expensive 
operation. But we need to make sure that all the documents which are put into 
MQ get indexed in SOLR. Is there any other way of getting this done? Please 
let me know.
If we do batch indexing, is there any chance we can identify if some 
documents were missed from indexing?

Thanks
Sangeetha



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Commit after every document - alternate approach

2016-03-04 Thread Emir Arnautovic

Hi Sangeetha,
It seems to me that you are using Solr as the primary data store? If that is 
true, you should not do that - you should have some other store that is 
transactional and can support what you are trying to do with Solr. If 
you are not using Solr as the primary store, and it is critical to have Solr 
in sync, you can run periodic checks (at about the same frequency as Solr 
commits) that ensure the latest data reached Solr.


Regards,
Emir

On 04.03.2016 05:46, sangs8788 wrote:

Hi Emir,

Right now we are having only inserts into SOLR. The main reason for having
commit after each document is to get a guarantee that the document has got
indexed in solr. Until the commit status is received back the document will
not be deleted from MQ. So that even if there is a commit failure the
document can be resent from MQ.

Thanks
Sangeetha



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Commit-after-every-document-alternate-approach-tp4260946p4261575.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Spatial Search on Postal Code

2016-03-04 Thread Emir Arnautovic

Hi Manohar,
This depends on your requirements/use case. If the postal code is interpreted 
as a point, then it is expected that the radius is significantly larger 
than the postal code's diameter. In such a case you can go with the first 
approach. In order to avoid missing results from a postal code in the case 
of a small search radius and a large postal code, you can reverse geocode 
your records and store the postal code with each document.
If you need to handle distance from the postal code precisely - i.e. distance 
from its border - you have to get the postal code polygon, expand it by the 
search distance, and use the resulting polygon to find matches.


HTH,
Emir

On 04.03.2016 13:09, Manohar Sripada wrote:

Here's my requirement -  User enters postal code and provides the radius. I
need to find the records with in the radius from the provided postal code.

There are few ways I thought through after going through the "Spatial
Search" Solr wiki

1. As Latitude and Longitude positions are required for spatial search. Get
Latitude Longitude position (may be using GeoCoding API) of a postal code
and use "LatLonType" field type and query accordingly. As the GeoCoding API
returns one point and if the postal code area is too big, then I may end up
not getting any results (apart from the records from the same postal code)
if the radius provided is small.

2. Get the latitude longitude points of the postal code which forms a
border (not sure yet on how to get) and build a polygon (using RPT). While
querying use this polygon and provide the distance. Can this be achieved?
Or Am I ruminating too much? :(

Appreciate any help on this.

Thanks



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Spatial Search on Postal Code

2016-03-04 Thread Emir Arnautovic

Hi Manohar,
I don't think there is such functionality in Solr - you need to do it on 
the client side:
1. find some postal code polygons (you can use OpenStreetMap - 
http://wiki.openstreetmap.org/wiki/Key:postal_code)

2. create a zip-to-polygon lookup
3. create code that will expand the zip code polygon by some distance (you 
can use the JTS buffer API)


At query time you get a zip code and distance:
1. find the polygon for the zip
2. expand the polygon
3. send the resulting polygon to Solr and use the Intersects function to 
filter results
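
A minimal sketch of steps 2 and 3 with JTS (pre-LocationTech package name; the degree conversion is a rough approximation, the field name is a placeholder, and exception handling is omitted):

  import com.vividsolutions.jts.geom.Geometry;
  import com.vividsolutions.jts.io.WKTReader;
  import com.vividsolutions.jts.io.WKTWriter;

  // zipWkt is the WKT polygon looked up for the requested zip code
  Geometry zipPolygon = new WKTReader().read(zipWkt);
  double distanceInDegrees = distanceKm / 111.0;          // ~111 km per degree of latitude
  Geometry expanded = zipPolygon.buffer(distanceInDegrees);
  String wkt = new WKTWriter().write(expanded);
  // then filter in Solr with: fq=geo:"Intersects(" + wkt + ")"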


Regards,
Emir

On 04.03.2016 19:49, Manohar Sripada wrote:

Thanks Emir,

Obviously the #2 approach is much better. I know it's not straightforward. But
is it really achievable in Solr? Like building a polygon for a postal code.
If so, can you throw some light on how to do it?

Thanks,
Manohar

On Friday, March 4, 2016, Emir Arnautovic 
wrote:


Hi Manohar,
This depends on your requirements/usecase. If postal code is interpreted
as point than it is expected to have radius that is significantly larger
than postal code diameter. In such case you can go with first approach. In
order to avoid missing results from postal code in case of small search
radius and large postal code, you can reverse geocode records and store
postal code with each document.
If you need to handle distance from postal code precisely - distance from
its border, you have to get postal code polygon, expand it by search
distance and use resulting polygon to find matches.

HTH,
Emir

On 04.03.2016 13:09, Manohar Sripada wrote:


Here's my requirement -  User enters postal code and provides the radius.
I
need to find the records with in the radius from the provided postal code.

There are few ways I thought through after going through the "Spatial
Search" Solr wiki

1. As Latitude and Longitude positions are required for spatial search.
Get
Latitude Longitude position (may be using GeoCoding API) of a postal code
and use "LatLonType" field type and query accordingly. As the GeoCoding
API
returns one point and if the postal code area is too big, then I may end
up
not getting any results (apart from the records from the same postal code)
if the radius provided is small.

2. Get the latitude longitude points of the postal code which forms a
border (not sure yet on how to get) and build a polygon (using RPT). While
querying use this polygon and provide the distance. Can this be achieved?
Or Am I ruminating too much? :(

Appreciate any help on this.

Thanks



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Text search NGram

2016-03-07 Thread Emir Arnautovic

Hi Rajesh,
It is most likely related to norms - you can try setting 
omitNorms="true" and reindexing the content. Anyway, it is not common to use 
just ngrams for matching content - in such a case you can expect more 
unexpected ordering/results. You should combine ngram fields with 
normally tokenized fields (e.g. boost on matching the tokenized fields to 
make sure exact matches are ordered first).


Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the below type and we have indexed the values "title": "Microsoft Visual Studio 2006" and 
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)".

When I search for title:(Microsoft Visual AND Studio AND 2005) I get Microsoft 
Visual Studio 8.0.61205.56 (2005) as the second record and Microsoft Visual 
Studio 2006 as the first record. I wanted to have Microsoft Visual Studio 
8.0.61205.56 (2005) listed first since the user searched for Microsoft 
Visual Studio 2005. Can you please help?

We are using NGram so it takes care of misspelled or jumbled words [it works as 
expected]
e.g.
searching Micrs Visual Studio will get Microsoft Visual Studio
searching Visual Microsoft Studio will get Microsoft Visual Studio

   
 
 
 
 
 
 
  
 
 
 
 
 
   







--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Text search NGram

2016-03-07 Thread Emir Arnautovic

Hi Rajesh,
The solution includes 2 fields - one "ngram" field (like your txt_token) and 
another "nonngram" field that is just tokenized (like your txt_token without 
the ngram token filter). If you have two documents:

1. ABCDEF
2. ABCD
and you are searching for ABCD: if you use only the ngram field, both are 
matches and doc 1 can be first, but if you search with ngram:ABCD OR 
nonngram:ABCD, doc 2 will have a higher score.
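
A sketch of that setup with your title field (field names and the txt_plain type are made up for the example; txt_token is your existing ngram type):

  <field name="title_ngram" type="txt_token" indexed="true" stored="false"/>
  <field name="title_plain" type="txt_plain" indexed="true" stored="false"/>
  <copyField source="title" dest="title_ngram"/>
  <copyField source="title" dest="title_plain"/>

  q=title_ngram:(Microsoft Visual Studio 2005) OR title_plain:(Microsoft Visual Studio 2005)^10

The plain-tokenized clause only matches whole tokens, so the document containing the token 2005 gets the extra boost, while the ngram clause still covers misspelled or jumbled input.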


Regards,
Emir

On 07.03.2016 15:20, G, Rajesh wrote:

Hi Emir,

Thanks for your email. Can you please help me to understand what you mean by "e.g.
boost matches on the tokenized fields to make sure exact matches are ordered first"?




-----Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and 
reindexing content. Anyway, it is not common to use just ngrams for matching content - in 
such case you can expect more unexpected ordering/results. You should combine ngrams 
fields with normally tokenized fields (e.g. boost if matching tokenized fileds to make 
sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the blow type and we have indexed the value  "title": "Microsoft Visual Studio 2006" and 
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)"

When I search for title:(Microsoft Visual AND Studio AND 2005)  I get Microsoft 
Visual Studio 8.0.61205.56 (2005) as the second record and  Microsoft Visual 
Studio 2006 as first record. I wanted to have Microsoft Visual Studio 
8.0.61205.56 (2005) listed first since the user has searched for Microsoft 
Visual Studio 2005. Can you please help?.

We are using NGram so it takes care of misspelled or jumbled words[it
works as expected] e.g.
searching Micrs Visual Studio will gets Microsoft Visual Studio
searching Visual Microsoft Studio will gets Microsoft Visual Studio


[field type definition stripped by the mailing list archive]







--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & 
Elasticsearch Support * http://sematext.com/



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Text search NGram

2016-03-07 Thread Emir Arnautovic
Not sure I understood the question. What I meant is for you to try setting
omitNorms="false" on your txt_token field type if you want to stick with the
ngram-only solution:

[field type definition stripped by the mailing list archive]

and to add a new field type and field to keep a non-ngram version of the field.
Something like:

[field type definition stripped by the mailing list archive]

and use copyField to copy to both fields and query title:test OR 
title_simple:test.
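
Since the archive stripped the XML, here is a minimal sketch of what such a pair of
field types and the copyField could look like (the tokenizers, gram sizes and field
names are assumptions, not the original configuration):

<fieldType name="txt_token" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>
  </analyzer>
</fieldType>

<fieldType name="txt_simple" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="txt_token" indexed="true" stored="true"/>
<field name="title_simple" type="txt_simple" indexed="true" stored="true"/>
<copyField source="title" dest="title_simple"/>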


Emir


On 07.03.2016 15:31, G, Rajesh wrote:

Hi Emir,

I have already applied [configuration stripped by the mailing list archive] and then
I have applied [stripped]. Is this what you wanted me to have in my config?

Thanks
Rajesh




-Original Message-
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Monday, March 7, 2016 7:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Text search NGram

Hi Emir,

Thanks for you email. Can you please help me to understand what do you mean by "e.g. 
boost if matching tokenized fileds to make sure exact matches are ordered first"




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and 
reindexing content. Anyway, it is not common to use just ngrams for matching content - in 
such case you can expect more unexpected ordering/results. You should combine ngrams 
fields with normally tokenized fields (e.g. boost if matching tokenized fileds to make 
sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the blow type and we have indexed the value  "title": "Microsoft Visual Studio 2006" and 
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)"

When I search for title:(Microsoft Visual AND Studio AND 2005)  I get Microsoft 
Visual Studio 8.0.61205.56 (2005) as the second record and  Microsoft Visual 
Studio 2006 as first record. I wanted to have Microsoft Visual Studio 
8.0.61205.56 (2005) listed first since the user has searched for Microsoft 
Visual Studio 2005. Can you please help?.

We are using NGram so it takes care of misspelled or jumbled words[it
works as expected] e.g.
searching Micrs Visual Studio will gets Microsoft Visual Studio
searching Visual Microsoft Studio will gets Microsoft Visual Studio


[field type definition stripped by the mailing list archive]





Re: ngrams with position

2016-03-08 Thread Emir Arnautovic

Hi Elisabeth,
I don't think there is such a token filter, so you would have to create
your own token filter that takes a token and emits ngram tokens of a specific
length. It should not be too hard to create such a filter - you can take a
look at how the ngram filter is coded - yours should be simpler than that.


Regards,
Emir

On 08.03.2016 08:52, elisabeth benoit wrote:

Hello,

I'm using solr 4.10.1. I'd like to index words with ngrams of fixed length
with a position at the end.

For instance, with fixed length 3, Amsterdam would be something like:


a0 (two spaces added at beginning)
am1
ams2
mst3
ste4
ter5
erd6
rda7
dam8
am9 (one more space at the end)

The number at the end being the position.

Does anyone have a clue how to achieve this?

Best regards,
Elisabeth



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: ngrams with position

2016-03-11 Thread Emir Arnautovic
2016 at 10:08, elisabeth benoit <elisaelisael...@gmail.com> wrote:

Thanks for your answer Emir,

I'll check that out.

Best regards,
Elisabeth

2016-03-08 10:24 GMT+01:00 Emir Arnautovic <emir.arnauto...@sematext.com>:

Hi Elisabeth,
I don't think there is such a token filter, so you would have to create your
own token filter that takes a token and emits ngram tokens of a specific length.
It should not be too hard to create such a filter - you can take a look at how
the ngram filter is coded - yours should be simpler than that.

Regards,
Emir

On 08.03.2016 08:52, elisabeth benoit wrote:

Hello,

I'm using solr 4.10.1. I'd like to index words with ngrams of fixed length
with a position at the end.

For instance, with fixed length 3, Amsterdam would be something like:

a0 (two spaces added at beginning)
am1
ams2
mst3
ste4
ter5
erd6
rda7
dam8
am9 (one more space at the end)

The number at the end being the position.

Does anyone have a clue how to achieve this?

Best regards,
Elisabeth


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/





--
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England







--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Text search NGram

2016-03-16 Thread Emir Arnautovic

Hi Rajesh,
Did you reindex after setting omitNorms to false? Can you send results
with debug for the first query?


What query parser are you using for these queries? You should run your
queries with debug=true and see how they are rewritten - that should
explain why some cases do not return the expected documents. If you have
trouble understanding why something is not returned, you can post the response
to this thread.
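
For example, the same search can be run with the debug parameter appended (the
request shape below is only illustrative):

/select?q=title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365)&debug=true&fl=title,title_ws,score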


Thanks,
Emir

On 16.03.2016 09:30, G, Rajesh wrote:

Hi Emir,

The solution we want to implement is to show the top 100 best-matching technology
names from the list of technology names we have. Whatever technology name the user
has typed will first reach SQL Server and an exact match will be done if
possible [name==name]; only those that do not match exactly [spelling mistakes,
jumbled words] will be searched in Solr.

With the below setup, if I query title:(Microsoft Ofice 365) I get the below
result [note: the scores are the same?]
{
 "title":"Lync - Microsoft Office 365",
 "score":7.7472024
},
{
 "title":"Microsoft Office 365",
 "score":7.7472024
},
When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365) {
 "title":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 365 
1.0",
 "title_ws":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 365 
1.0",
 "score":3.9297152
},
   {
 "title":"Microsoft Office 365",
 "title_ws":"Microsoft Office 365",
 "score":3.1437721
}

When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365) 
qf=title_ws^1
I don’t get any results

The expected result is
{
 "title":"Microsoft Office 365",
 "title_ws":"Microsoft Office 365",
},
   {
 "title":"Microsoft Office 365 1.0",
 "title_ws":"Microsoft Office 365 1.0",
},
   {
 "title":"Microsoft Office 365 14.0",
 "title_ws":"Microsoft Office 365 14.0",
},
   {
 "title":"Microsoft Office 365 14.3",
 "title_ws":"Microsoft Office 365 14.3",
},
   {
 "title":"Microsoft Office 365 14.4",
 "title_ws":"Microsoft Office 365 14.4",
},


[schema definition (ngram and whitespace-tokenized field types for the title and title_ws fields) stripped by the mailing list archive]

Thanks
Rajesh




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 8:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Not sure I understood question. What I meant is you to try setting 
omitNorms="false" to your txt_token field type if you want to stick with ngram 
only solution:


[field type definition stripped by the mailing list archive]



and to add new field type and field to keep nonngram version of field.
Something like:


[field type definition stripped by the mailing list archive]



and use copyField to copy to both fields and query title:test OR 
title_simple:test.

Emir


On 07.03.2016 15:31, G, Rajesh wrote:

Hi Emir,

I have already applied

 and then I have applied . Is this what you wanted me to 
hav

Re: Text search NGram

2016-03-16 Thread Emir Arnautovic

Hi Rajesh,
It seems that the title length is not different enough to have a different
fieldNorm - in all titles it is 0.5, so all documents for the exact-match
query result in the same score.

The query with "Ofice" results in the wrong document being first because of its
fieldNorm=1.0 - it seems to me that this document was not reindexed after
omitNorms=false.

Also noticed that the ngram field is a bit different in the schema than in the
mail - it has maxGramSize="800". That does not change the explanation, but it is
easier to understand the results when max=min.


HTH,
Emir

On 16.03.2016 10:31, G, Rajesh wrote:

Hi Emir,

Yes I have re-indexed after setting omitNorms to false. Attached is the result 
of the query in debug mode.

I am using LuceneQParser

Thanks
Rajesh




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Wednesday, March 16, 2016 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
Did you reindex afters setting omitNorms to false? Can you send results with 
debug for first query?

What query parser are you using for these queries? You should run your queries 
with debug=true and see how they are rewritten - that should explain why some 
cases do not return expected documents. If you have trouble understanding why 
it is not returned, you can post response to this thread.

Thanks,
Emir

On 16.03.2016 09:30, G, Rajesh wrote:

Hi Emir,

The solution we wanted to implement is to show top 100 best match technology 
names from the list of technology names we have. Whatever technology names user 
has typed will first reach SQL Server and exact match will be done if 
possible[name==name] , only those do not exactly match[spelling mistakes, 
jumbled words] will be searched in SOLR.

With the below setup if I query title:(Microsoft Ofice 365) I get the
below result [note:scores are same?] {
  "title":"Lync - Microsoft Office 365",
  "score":7.7472024
},
{
  "title":"Microsoft Office 365",
  "score":7.7472024
},
When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365) {
  "title":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 365 
1.0",
  "title_ws":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 
365 1.0",
  "score":3.9297152
},
{
  "title":"Microsoft Office 365",
  "title_ws":"Microsoft Office 365",
  "score":3.1437721
}

When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice
365) qf=title_ws^1 I don’t get any results

The expected result is
{
  "title":"Microsoft Office 365",
  "title_ws":"Microsoft Office 365", },
{
  "title":"Microsoft Office 365 1.0",
  "title_ws":"Microsoft Office 365 1.0", },
{
  "title":"Microsoft Office 365 14.0",
  "title_ws":"Microsoft Office 365 14.0", },
{
  "title":"Microsoft Office 365 14.3",
  "title_ws":"Microsoft Office 365 14.3", },
{
  "title":"Microsoft Office 365 14.4",
  "title_ws":"Microsoft Office 365 14.4", },


[schema definition stripped by the mailing list archive]

Thanks
Rajesh




Re: Text search NGram

2016-03-16 Thread Emir Arnautovic

Hi Rajesh,
Here is a slightly older presentation, https://vimeo.com/32701503, but it should
all still be applicable. You can google for more with "understanding
solr debug".


Regards,
Emir

On 16.03.2016 11:30, G, Rajesh wrote:

Hi Emir,
Yes, I have changed it to 800 to see if it produces a different result. Sorry I
did not inform you of that before. I have deleted all folders and files in the
data folder and I have re-indexed. Attached is the result with debug on.

Can you please let me know whether there is any utility or a blog that will
help in understanding the result of debug [parsedquery, explain...]?

Thanks
Rajesh




-Original Message-----
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Wednesday, March 16, 2016 3:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It seems that title length is not different enough to have different fieldNorm 
- in all titles it is 0.5 so all documents for exact match query result in same 
score.

Query with "Ofice" results in wrong document being first because of its
fieldNorm=1.0 - seems to me that this document was not reindexed after 
omitNorms=false.

Also noticed that ngram field is bit different in schema than in mail - has 
maxGramSize="800". Does not change explanation, but is easier to understand 
results when max=min.

HTH,
Emir

On 16.03.2016 10:31, G, Rajesh wrote:

Hi Emir,

Yes I have re-indexed after setting omitNorms to false. Attached is the result 
of the query in debug mode.

I am using LuceneQParser

Thanks
Rajesh




-----Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Wednesday, March 16, 2016 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
Did you reindex afters setting omitNorms to false? Can you send results with 
debug for first query?

What query parser are you using for these queries? You should run your queries 
with debug=true and see how they are rewritten - that should explain why some 
cases do not return expected documents. If you have trouble understanding why 
it is not returned, you can post response to this thread.

Thanks,
Emir

On 16.03.2016 09:30, G, Rajesh wrote:

Hi Emir,

The solution we wanted to implement is to show top 100 best match technology 
names from the list of technology names we have. Whatever technology names user 
has typed will first reach SQL Server and exact match will be done if 
possible[name==name] , only those do not exactly match[spelling mistakes, 
jumbled words] will be searched in SOLR.

With the below setup if I query title:(Microsoft Ofice 365) I get the
below result [note:scores are same?] {
   "title":"Lync - Microsoft Office 365",
   "score":7.7472024
},
{
   "title":"Microsoft Office 365",
   "score":7.7472024
},
When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365) {
   "title":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 365 
1.0",
   "title_ws":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 
365 1.0",
   "score":3.9297152
},
 {
   "title":"Microsoft Office 365",
   "title_ws":"Microsoft Office 365",

Re: Document Cache

2016-03-19 Thread Emir Arnautovic
Running a single query that returns all docs and all fields will actually
load only as many documents as queryResultWindowSize.
What you need to do is run multiple queries that return different
documents. In case your id is numeric, you can run something like id:[1
TO 100] and then id:[100 TO 200] etc. Make sure that it is done within
that two-minute period if there is any indexing activity.

Your index is relatively small, so a filter cache with an initial size of 1000
entries should take around 20MB (assuming a single shard).


Thanks,
Emir

On 18.03.2016 17:02, Rallavagu wrote:



On 3/18/16 8:56 AM, Emir Arnautovic wrote:

Problem starts with autowarmCount="5000" - that executes 5000 queries
when new searcher is created and as queries are executed, document cache
is filled. If you have large queryResultWindowSize and queries return
big number of documents, that will eat up memory before new search is
executed. It probably takes some time as well.

This is also combined with filter cache. How big is your index?


Index is not very large.


numDocs:
85933

maxDoc:
161115

deletedDocs:
75182

Size
1.08 GB

I have run a query to return all documents with all fields. I could 
not reproduce OOM. I understand that I need to reduce cache sizes but 
wondering what conditions could have caused OOM so I can keep a watch.


Thanks



Thanks,
Emir

On 18.03.2016 15:43, Rallavagu wrote:

Thanks for the recommendations Shawn. Those are the lines I am
thinking as well. I am reviewing application also.

Going with the note on cache invalidation for every two minutes due to
soft commit, wonder how would it go OOM in simply two minutes or is it
likely that a thread is holding the searcher due to long running query
that might be potentially causing OOM? Was trying to reproduce but
could not so far.

Here is the filter cache config

[filterCache definition stripped by the archive]
Query Results cache

[queryResultCache definition stripped by the archive]
On 3/18/16 7:31 AM, Shawn Heisey wrote:

On 3/18/2016 8:22 AM, Rallavagu wrote:
So, each soft commit would create a new searcher that would invalidate
the old cache?

Here is the configuration for Document Cache

[documentCache definition stripped by the archive]

In an earlier message, you indicated you're running into OOM.  I think
we can see why with this cache definition.

There are exactly two ways to deal with OOM.  One is to increase the
heap size.  The other is to reduce the amount of memory that the program
requires by changing something -- that might be the code, the config, or
how you're using it.

Start by reducing that cache size to 4096 or 1024.

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

If you've also got a very large filterCache, reduce that size too.  The
filterCache typically eats up a LOT of memory, because each entry in the
cache is very large.

Thanks,
Shawn





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Document Cache

2016-03-19 Thread Emir Arnautovic
The problem starts with autowarmCount="5000" - that executes 5000 queries
when a new searcher is created, and as the queries are executed, the document
cache is filled. If you have a large queryResultWindowSize and queries return a
big number of documents, that will eat up memory before the new search is
executed. It probably takes some time as well.
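
For comparison, a much tamer cache definition would look something like the
following (the sizes here are only illustrative, not a sizing recommendation):

<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>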


This is also combined with filter cache. How big is your index?

Thanks,
Emir

On 18.03.2016 15:43, Rallavagu wrote:
Thanks for the recommendations Shawn. Those are the lines I am 
thinking as well. I am reviewing application also.


Going with the note on cache invalidation every two minutes due to
soft commits, I wonder how it would go OOM in simply two minutes, or is it
likely that a thread is holding the searcher due to a long-running query
that might potentially be causing the OOM? I was trying to reproduce it but
could not so far.


Here is the filter cache config

[filterCache definition stripped by the archive; autowarmCount="1000"]


Query Results cache

[queryResultCache definition stripped by the archive; initialSize="2" autowarmCount="5000"]


On 3/18/16 7:31 AM, Shawn Heisey wrote:

On 3/18/2016 8:22 AM, Rallavagu wrote:

So, each soft commit would create a new searcher that would invalidate
the old cache?

Here is the configuration for Document Cache

[documentCache definition stripped by the archive]


In an earlier message, you indicated you're running into OOM.  I think
we can see why with this cache definition.

There are exactly two ways to deal with OOM.  One is to increase the
heap size.  The other is to reduce the amount of memory that the program
requires by changing something -- that might be the code, the config, or
how you're using it.

Start by reducing that cache size to 4096 or 1024.

https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap

If you've also got a very large filterCache, reduce that size too.  The
filterCache typically eats up a LOT of memory, because each entry in the
cache is very large.

Thanks,
Shawn



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Document Cache

2016-03-19 Thread Emir Arnautovic

Hi,
Your cache will be cleared on soft commits - every two minutes. It seems
that it is either configured to be huge, or you have big documents and are
retrieving all fields, or you don't have lazy field loading set to true.
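
If it is not already on, lazy field loading is a single solrconfig.xml query
setting (shown here only as a sketch):

<enableLazyFieldLoading>true</enableLazyFieldLoading>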


Can you please share your document cache config and heap settings?

Thanks,
Emir

On 17.03.2016 22:24, Rallavagu wrote:

comments in line...

On 3/17/16 2:16 PM, Erick Erickson wrote:

First, I want to make sure when you say "TTL", you're talking about
documents being evicted from the documentCache and not the "Time To 
Live"

option whereby documents are removed completely from the index.


Maybe TTL was not the right word to use here. I wanted to learn the
criteria for an entry to be ejected.




The time varies with the number of new documents fetched. This is an LRU
cache whose size is configured in solrconfig.xml. It's pretty much
unpredictable. If for some odd reason every request gets the same document
it'll never be aged out. If no two queries return the same document, it will be
aged out once "cache size" docs have been fetched by subsequent requests.

The entire thing is thrown out whenever a new searcher is opened (i.e.
softCommit or hardCommit with openSearcher=true)




But maybe this is an XY problem. Why do you care? Is there something 
you're
seeing that you're trying to understand or is this just a general 
interest

question?

I have following configuration,

<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:12}</maxTime>
</autoSoftCommit>

As you can see, openSearcher is set to "false". What I am seeing is
(from a heap dump taken after an OutOfMemory error) that the LRUCache backing the
"Document Cache" occupies around 85% of the available heap and that is
causing the OOM errors. So I am trying to understand the behavior to address
the OOM issues.


Thanks



Best,
Erick

On Thu, Mar 17, 2016 at 1:40 PM, Rallavagu  wrote:


Solr 5.4 embedded Jetty

Is it the right assumption that whenever a document is returned as a
response to a query it is cached in the "Document Cache"?

Essentially, if I request any entry like /select?q=id:
will it be cached in the "Document Cache"? If yes, what is the TTL?

Thanks in advance





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Use default field, if more specific field does not exist

2016-03-25 Thread Emir Arnautovic

Hi Georg,
One solution that could work on the existing schema is to use query faceting
and queries like (for USER_ID = 1, bucket 100 to 200):


price_1:[100 TO 200] OR (-price_1:[* TO *] AND price:[100 TO 200])

The same query is used for filtering. What you should test is whether the
performance is acceptable.
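
A sketch of a full request built from that expression (only the bucket bounds and
the rows value are illustrative):

q=*:*&rows=0&facet=true
  &facet.query=price_1:[100 TO 200] OR (-price_1:[* TO *] AND price:[100 TO 200])
  &facet.query=price_1:[200 TO 300] OR (-price_1:[* TO *] AND price:[200 TO 300])

The same expression can be passed as an fq parameter when filtering on a selected bucket.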


Thanks,
Emir

On 24.03.2016 22:31, Georg Sorst wrote:

Hi list,

we use Solr to search ecommerce products.

Items have a default price which can be overwritten per user. So when
searching we have to return the user price if it is set, otherwise the
default price. Same goes for building facets and when filtering by price.

What's the best way to achieve this in Solr? We know the user ID when
sending the request to Solr so we could do something like this:

* Add the default price in the field "price" to the items
* Add all the user prices in a field like "price_"

Now for displaying the correct price this is fine, just look if there is a
field "price_" for this result item, otherwise just display the
value of the "price" field.

The tricky part is faceting and filtering. Which field do we use?
"price_"? What happens for users that don't have a user price set
for an item? In this case the "price_" field does not exist so
faceting and filtering will not work.

We thought about adding a "price_" field for every item and every
user and fill in the default price for the item if the user does not have
an overwritten price for this item. This would potentially make our index
unnecessarily large. Consider 10,000 items and 10,000 users (quite
realistic), that's 100,000,000 "price_" fields, even though maybe
only a few users have overwritten prices.

What I've been (unsuccessfully) looking for is some sort of field fallback
where I can tell Solr something like "use price_ for the results,
facets and filter queries, and if that does not exist for an item, use
price instead". At first sight field aliases seemed like that but turns out
that just renames the field in the result items.

So, is there something like this or is there a better solution anyway?

Thanks,
Georg


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Use default field, if more specific field does not exist

2016-03-28 Thread Emir Arnautovic

Hi Georg,
I cannot think of a similar trick that would enable you to facet on all
values (other than applying this trick to buckets of size 1), but I would
warn you about faceting on high-cardinality fields such as price. Not
sure if you have some specific case, but calculating facets for such a
field can be pretty expensive and slow.
I haven't looked at it in detail, but maybe you could find something
useful in the new JSON Facet API.


Regards,
Emir

On 26.03.2016 12:15, Georg Sorst wrote:

Hi Emir,

that sounds like a great idea and filtering should be just fine!

In our case we need the individual price values (not the buckets), just
like facet.field=price but with respect to the user prices. Is this
possible as well?

About the performance: Are there any specific bottlenecks you would expect?

Best regards,
Georg

Emir Arnautovic  schrieb am Fr., 25. März
2016 um 11:47 Uhr:


Hi Georg,
One solution that could work on existing schema is to use query faceting
and queries like (for USER_ID = 1, bucker 100 to 200):

price_1:[100 TO 200] OR (-price_1:[* TO *] AND price:[100 TO 200])

Same query is used for filtering. What you should test is if
performances are acceptable.

Thanks,
Emir

On 24.03.2016 22:31, Georg Sorst wrote:

Hi list,

we use Solr to search ecommerce products.

Items have a default price which can be overwritten per user. So when
searching we have to return the user price if it is set, otherwise the
default price. Same goes for building facets and when filtering by price.

What's the best way to achieve this in Solr? We know the user ID when
sending the request to Solr so we could do something like this:

* Add the default price in the field "price" to the items
* Add all the user prices in a field like "price_"

Now for displaying the correct price this is fine, just look if there is

a

field "price_" for this result item, otherwise just display the
value of the "price" field.

The tricky part is faceting and filtering. Which field do we use?
"price_"? What happens for users that don't have a user price

set

for an item? In this case the "price_" field does not exist so
faceting and filtering will not work.

We thought about adding a "price_" field for every item and

every

user and fill in the default price for the item if the user does not have
an overwritten price for this item. This would potentially make our index
unnecessarily large. Consider 10,000 items and 10,000 users (quite
realistic), that's 100,000,000 "price_" fields, even though

maybe

only a few users have overwritten prices.

What I've been (unsuccessfully) looking for is some sort of field

fallback

where I can tell Solr something like "use price_ for the

results,

facets and filter queries, and if that does not exist for an item, use
price instead". At first sight field aliases seemed like that but turns

out

that just renames the field in the result items.

So, is there something like this or is there a better solution anyway?

Thanks,
Georg

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

--

*Georg M. Sorst I CTO*
FINDOLOGIC GmbH



Jakob-Haringer-Str. 5a | 5020 Salzburg I T.: +43 662 456708
E.: g.so...@findologic.com
www.findologic.com Follow us on: XING
<https://www.xing.com/profile/Georg_Sorst> facebook
<https://www.facebook.com/Findologic> Twitter
<https://twitter.com/findologic>

We will see you at the *Shopware Community Day in Ahaus on 20.05.2016!* Make an
appointment here!
We will see you at *dmexco in Cologne on 14.09 and 15.09.2016!* Make an
appointment here!



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Complex Sort

2016-03-31 Thread Emir Arnautovic

Hi,
Not sure if I fully understood your case, but here are some ideas:
- if you have a small number of ids you can have a score_%id% field that can
be used for sorting (see the sketch below)
- if the number of ids is large you can use sort-by-function to parse the score
data and find the right score
- if the number of results is small, calculate the score after returning the results
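
A sketch of the first idea, using the example data (the field names below are made
up for illustration): index each id's score as its own numeric field, e.g.
score_175=40, score_173=17, score_174=13, and then sort on the field that matches
the logged-in user's value:

sort=score_175 desc

If several such columns have to be summed, a function sort such as
sort=sum(scoreA_175,scoreB_175) desc (again with hypothetical field names) would work.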

Not sure why you have multiple score fields - can you sum them in advance
and have one score per id? Or do you have multiple ids per request?


Regards,
Emir

On 31.03.2016 10:07, ~$alpha` wrote:

I have a column in MySQL and I need to do a complex sort in Solr.

Consider the below column

|175#40|173#17|174#13|134#11|17#8|95#4|64#3|116#3|343#0|
where 175 indicates the value to be matched and 40 implies the score.

So if the logged-in user's value is 175 it will give a score of 40, and if the
value is 173 it should give a score of 17, and so on...

I have multiple such columns whose sum in the above manner is my sorting
expression.

How can I do this in Solr?




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Complex-Sort-tp4267155.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Facet by truncated date

2016-03-31 Thread Emir Arnautovic

Hi Robert,
You can use range faceting and use facet.range.gap to set how the dates
are "truncated".


Regards,
Emir

On 31.03.2016 10:52, Robert Brown wrote:

Hi,

Is it possible to facet by a date (solr.TrieDateField) but truncated 
to the day, or even the hour?


If not, are there any other options apart from storing that truncated 
data in another (string?) field?


Thanks,
Rob




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Facet by truncated date

2016-03-31 Thread Emir Arnautovic

Hi Yago,
Not sure if I misunderstood the case, but assuming you have a date field
called my_date, you can facet the last 10 days by day using range faceting:


?facet.range=my_date&facet.range.start=NOW/DAY-10DAYS&facet.range.end=NOW/DAY+1DAY&facet.range.gap=+1DAY

Regards,
Emir

On 31.03.2016 11:14, Yago Riveiro wrote:

If you want to aggregate the data by the truncated date, I think the only way to
do it is using another field with the truncated date.

   


You can use an update request processor to calculate the truncated data
(https://wiki.apache.org/solr/UpdateRequestProcessor) or add the field at
indexing time.

   


date:"2016-03-31T12:00:0Z"

truncated_date_s:'2016-03-31' or truncated_date_i:20160331 (this should be
more memory efficient)

\--

   


/Yago Riveiro

   



On Mar 31 2016, at 10:08 am, Emir Arnautovic
<emir.arnauto...@sematext.com> wrote:


Hi Robert,

You can use range faceting and set use facet.range.gap to set how dates
are "truncated".


Regards,

Emir


On 31.03.2016 10:52, Robert Brown wrote:

> Hi,
>
> Is it possible to facet by a date (solr.TrieDateField) but truncated
> to the day, or even the hour?
>
> If not, are there any other options apart from storing that truncated
> data in another (string?) field?
>
> Thanks,
> Rob
>
>


\--

Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * <http://sematext.com/>




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Complex Sort

2016-03-31 Thread Emir Arnautovic

You would have to write your custom function for that.

On 31.03.2016 11:24, ~$alpha` wrote:

I am not sure how to use "Sort By Function" for Case.

|10#40|14#19|33#17|27#6|15#6|19#5|7#2|6#1|29#1|5#1|30#1|28#1|12#0|20#0|

Can you tell how to fetch 40 when input is 10.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Complex-Sort-tp4267155p4267165.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Facet by truncated date

2016-03-31 Thread Emir Arnautovic

Hi Rob,
The range is mandatory, and you should limit it since it will create too
many buckets. I agree it would be great if it could use the min/max values
from the query as start/end, but that is not how it works at the moment.


Regards,
Emir

On 31.03.2016 11:32, Robert Brown wrote:

Hi Emir,

What if I don't want to specify a range?  Or would I have to do year 0 
to NOW?


Thanks,
Rob


On 03/31/2016 10:26 AM, Emir Arnautovic wrote:

Hi Yago,
Not sure if I misunderstood the case, but assuming you have date 
field called my_date you can facet last 10 days by day using range 
queries:


?facet.range=my_date&facet.range.start=NOW/DAY-10DAYS&facet.range.end=NOW/DAY+1DAY&facet.range.gap=+1DAY 



Regards,
Emir

On 31.03.2016 11:14, Yago Riveiro wrote:
If you want aggregate the dat by the truncated date, I think the 
only way to

do it is using other field with the truncated date.


You can use a update request processor to calculate the truncated data
(https://wiki.apache.org/solr/UpdateRequestProcessor) or add the 
field in

indexing time.


date:"2016-03-31T12:00:0Z"

truncated_date_s:'2016-03-31' or truncated_date_i:20160331 (this 
should be

more memory efficient)

\--


/Yago Riveiro



On Mar 31 2016, at 10:08 am, Emir Arnautovic
<emir.arnauto...@sematext.com> wrote:


Hi Robert,

You can use range faceting and set use facet.range.gap to set how dates
are "truncated".


Regards,

Emir


On 31.03.2016 10:52, Robert Brown wrote:

> Hi,
>
> Is it possible to facet by a date (solr.TrieDateField) but 
truncated

> to the day, or even the hour?
>
> If not, are there any other options apart from storing that 
truncated

> data in another (string?) field?
>
> Thanks,
> Rob
>
>


\--

Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * <http://sematext.com/>








--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Optimal indexing speed in Solr

2016-04-14 Thread Emir Arnautovic

Hi Edwin,
Indexing speed depends on multiple factors: HW, Solr configuration and
load, the documents, and the indexing client. More complex documents mean more
CPU time to process each document before the index structure is written to
disk. The bigger the index, the more heap is used and the more frequent the GCs.
Maybe you are just not sending enough docs to Solr to reach such throughput.
The best way to pinpoint the bottleneck is to use a monitoring tool. One
such tool is our SPM (http://sematext.com/spm) - it allows you to
monitor both Solr and OS metrics.


HTH,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 14.04.2016 05:29, Zheng Lin Edwin Yeo wrote:

Hi,

Would like to find out, what is the optimal indexing speed in Solr?

Previously, I managed to get more than 3GB/hour, but now the speed has dropped
to 0.7GB/hr. What could be the potential reason behind this?

Besides the index size getting bigger, I have only added more
collections to the core and added another field. Other than that nothing
else has been changed.

Could the source file which I'm indexing made a difference in the indexing
speed?

I'm using Solr 5.4.0 for now, but will be planning to migrate to Solr 6.0.0.

Regards,
Edwin





Re: solr sql & streaming

2016-04-28 Thread Emir Arnautovic

Hi Shani,
Are you running in SolrCloud mode? Here is blog post you can follow: 
https://sematext.com/blog/2016/04/18/solr-6-solrcloud-sql-support/


Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 28.04.2016 13:45, Chaushu, Shani wrote:

Hi,
I installed Solr 6 and tried to run /sql and /stream requests following this wiki
https://cwiki.apache.org/confluence/display/solr/Parallel+SQL+Interface
I saw in the changes list that it doesn't need request handler configuration, but
when I try to access it I get the following message:



Error 404 Not Found

HTTP ERROR 404
Problem accessing /solr/collection_test/sql. Reason:
Not Found



My request was

curl --data-urlencode 'stmt=SELECT author, count(*) FROM collection_test GROUP 
BY author ORDER BY count(*) desc' 
http://localhost:8983/solr/collection_test/sql?aggregationMode=facet






-
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.



Re: problem with index size

2015-07-22 Thread Emir Arnautovic

Hi Daniel,
Do you need all fields stored in your index? The only field that is not
stored is host.
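
For any field that only needs to be searched and never returned, stored can be
turned off in the schema; a minimal sketch (the field and type names are
illustrative):

<field name="content" type="text_general" indexed="true" stored="false"/>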


Thanks,
Emir

On 22.07.2015 12:27, Daniel Holmes wrote:

Hi All
I have problem with index size in solr 4.7.2. My OS is Ubuntu 14.10 64-bit.
my fields are :
[field definitions stripped by the mailing list archive]

In one case for instance my segments size is 8.4G while index size is
28G!!! It seems unusual...

What suggestions do you have to reduce index size?
Is there any way to check disk usage details in cores? e.g. stop words,
stored docs, etc.



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Running SolrJ from Solr's REST API

2015-07-22 Thread Emir Arnautovic

Hi Edwin,
Not sure if I understood your case, but if I got it right you are trying
to write some code that will run as part of Solr.
If that's the case, then you should take a look at how to write Solr
plugins (https://wiki.apache.org/solr/SolrPlugins). SolrJ is a client-side
library that simplifies interaction between Solr and other Java
applications - it is not the base tool for extending Solr.


Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 22.07.2015 04:15, Zheng Lin Edwin Yeo wrote:

Hi,

Would like to check: as I've created a SolrJ program and exported it as a
Runnable JAR, how do I integrate it with Solr so that I can call
this JAR directly from Solr's REST API?

Currently I can only run it on command prompt using the command java -jar
solrj.jar

I'm using Solr 5.2.1.


Regards,
Edwin





Re: problem with index size

2015-07-22 Thread Emir Arnautovic
Is this a test index? Do you rewrite documents with the same ids? Did you try
to optimize the index?


Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 22.07.2015 13:10, Daniel Holmes wrote:

Upayavira, the number of docs in that case is 140275. The Solr memory is 30GB.

Yes Emir, I need most of them to be saved.

I don't know, Alessandro - is it usual for indexing to use disk space of more than
3x the document size? Presumably it will keep growing exponentially as the crawl
continues... It seems so suboptimal, I think.


On Wed, Jul 22, 2015 at 3:16 PM, Alessandro Benedetti <
benedetti.ale...@gmail.com> wrote:


"In one case for instance my segments size is 8.4G while index size is
28G!!! It seems unusual…"

The index is a collection of index segments + a little overhead.
So, do you simply mean you have 4 segments?
Where is the problem anyway?
You are also storing content, which is usually a big part of the index.
As Upaya said, I am curious to know why you are so surprised!

Cheers

2015-07-22 11:27 GMT+01:00 Daniel Holmes :


Hi All
I have problem with index size in solr 4.7.2. My OS is Ubuntu 14.10 64-bit.

my fields are :

[field definitions stripped by the mailing list archive]
28G!!! It seems unusual...

What suggestions do you have to reduce index size?
Is there any way to check disk usage details in cores? e.g. stop words,
stored docs, etc.




--
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England



Re: `cat /dev/null > solr-8983-console.log` frees host's memory

2015-10-21 Thread Emir Arnautovic

Hi Eric,
As Shawn explained, memory was freed because it was being used to cache a portion
of the log file.

Since you are already with Sematext, I guess you are aware, but it doesn't
hurt to remind you that we also have Logsene, which you can use to manage
your logs: http://sematext.com/logsene/index.html


Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 20.10.2015 17:42, Shawn Heisey wrote:

On 10/20/2015 9:19 AM, Eric Torti wrote:

I had a 52GB solr-8983-console.log on my Solr 5.2.1 Amazon Linux
64-bit box and decided to `cat /dev/null > solr-8983-console.log` to
free space.

The weird thing is that when I checked Sematext I noticed the OS had
freed a lot of memory at the same exact instant I did that.

On that memory graph, the legend doesn't indicate which of the graph
colors represent each of the four usage types at the top -- they all
have blue checkboxes, so I can't tell for sure what changed.

If the number that dropped is "cached" (which I think is likely) then
everything is working exactly as it should.  The OS had simply cached a
large chunk of the logfile, exactly as it is designed to do, and once
the file was deleted, it stopped reserving that memory and made it
available.

https://en.wikipedia.org/wiki/Page_cache

Thanks,
Shawn



Re: result grouping on all documents

2015-10-21 Thread Emir Arnautovic

Hi Christian,
It seems to me that you can use range faceting to get counts.
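
For illustration, the same counts can be obtained with facet queries (or with
facet.range if the slots are regular) instead of grouping; a sketch reusing the
ranges from the original request:

http://localhost:8014/solr/myCore/query?q=*:*&rows=0&facet=true&facet.query=modified:[201103010%20TO%20201302010]&facet.query=modified:[201303010%20TO%20201502010]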

Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 20.10.2015 17:05, Christian Reuschling wrote:

Hi,

we try to get the number of documents for given time slots in the index 
efficiently.


For this, we query the solr index like this:

http://localhost:8014/solr/myCore/query?q=*:*&rows=1&fl=id&group=true&group.query=modified:[201103010%20TO%20201302010]&group.query=modified:[201303010%20TO%20201502010]&group.limit=1&distrib=false

for now, the modified field is a number field with trie index (tlong in 
schema.xml).

We have about 30M documents in the index.

This query works fine, but if the number of group queries gets higher (e.g. 
200), the response time
gets terribly slow.
As we need only the number of documents per group and never the score, or some 
other data of the
documents, we are wondering if there is a faster method to get this information.


Thanks

Christian



Re: Is it possible to specigfy only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Emir Arnautovic

Hi Scott,
I don't have experience with Chinese, but SynonymFilter works on tokens,
so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If
not, then you can try configuring a PatternReplaceCharFilter to replace C1
with C2 during indexing and searching and get a match.
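
A sketch of such a char filter, using the 台/臺 pair from the example below (the
rest of the analysis chain is omitted):

<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="台" replacement="臺"/>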


Thanks,
Emir

On 22.10.2015 10:53, Scott Chu wrote:

Hi solr-user,
I always use CJKTokenizer on a fair amount of Chinese news
articles. Say in Chinese character C1 has the same meaning as
character C2 (e.g. 台=臺). Is it possible that I only add this line in
synonym.txt:

C1,C2 (and in the real example: 台, 臺)
and, by applying CJKTokenizer and SynonymFilter, I only have to query
"C1Cm..." (say Cm is an arbitrary Chinese character) and Solr will
return documents that match either "C1Cm" or "C2Cm"?

Scott Chu,scott@udngroup.com 
2015/10/22



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Is it possible to specigfy only one-character term synonym for2-gram tokenizer?

2015-10-22 Thread Emir Arnautovic

Hi Scott,
Using PatternReplaceCharFilter is not the same as replacing the raw data
(replacing the raw data is not a proper solution as it does not solve the issue
when searching with the "other" character). This is part of token
standardization, no different from lowercasing - it is a standard
approach when it comes to Latin characters as well:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

A quick search for "MappingCharFilterFactory chinese" shows it is used -
you should check if it is suitable for your case.


Thanks,
Emir

On 22.10.2015 11:48, Scott Chu wrote:

Hi solr-user,
Ya, I thought about replacing C1 with C2 in the underlying raw data.
However, it's a huge data set (over 10M news articles) so I gave up on
this strategy earlier. My current temporary solution is going back to
using a 1-gram tokenizer (i.e. StandardTokenizer) so I only have to set 1
rule. But it is kinda ugly, especially when applying highlighting, e.g.
searching "C1C2" Solr returns a highlight snippet such as
"...C1C2...".

Scott Chu,scott@udngroup.com <mailto:scott@udngroup.com>
2015/10/22

- Original Message -
*From: *Emir Arnautovic <mailto:emir.arnauto...@sematext.com>
*To: *solr-user <mailto:solr-user@lucene.apache.org>
*Date: *2015-10-22, 17:08:26
*Subject: *Re: Is it possible to specigfy only one-character term
synonym for2-gram tokenizer?

Hi Scott,
I don't have experience with Chinese, but SynonymFilter works on
tokens,
so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If
not, than you can try configuring PatternReplaceCharFilter to
replace C1
to C2 during indexing and searching and get a match.

Thanks,
Emir

On 22.10.2015 10:53, Scott Chu wrote:
> Hi solr-user,
> I always uses CJKTokenizer on appropriate amount of Chinese news
> articles. Say in Chinese, character C1 has same meaning as
> character C2 (e.g 台=臺), Is it possible that I only add this
line in
> synonym.txt:
> C1,C2 (and in true exmaple: 台, 臺)
> and by applying CJKTokenizer and SynonymFilter, I only have to
query
> "C1Cm..." (say Cm is arbitrary Chinese character) and Solr will
> return documents that matche whether "C1Cm" or "C2Cm"?
> Scott Chu,scott@udngroup.com
<mailto:%20scott@udngroup.com> <mailto:scott@udngroup.com
<mailto:%20scott@udngroup.com>>
> 2015/10/22
>

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management

Solr & Elasticsearch Support * http://sematext.com/




-
No virus was found in this message.
Checked by AVG - www.avg.com
Version: 2015.0.6172 / Virus database: 4450/10867 - Release date: 10/21/15



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Is it possible to specigfy only one-character term synonymfor2-gram tokenizer?

2015-10-23 Thread Emir Arnautovic

Hi Scott,
This replacement will only be in the index terms and not in the stored field, so
you are fine - the problem you mention is related to the case when you do the
replacement in the raw text. However, this would be part of the analysis chain
(both index and query), so it has no effect on presentation (unless you are
using the index to reconstruct your text - which I assume you don't).


Thanks,
Emir

On 23.10.2015 03:26, Scott Chu wrote:

Hi Emir,
Very weirdly, I replied to your email from home many times yesterday
but the replies never showed up on the solr-user mailing list. Don't know
why. So I am replying again from the office. Hope this one shows up.
Thanks for your explanation. I'll consider PatternReplaceCharFilter as a
workaround (as I understand it, character filters process the input stream
before the tokenizer, so in some way the indexed data no longer has the
original C1 if I do the replacement). What I deal with are published news
articles and I don't know how the authors of these articles will feel
when they see C1 in their articles become C2, since some terms
containing C1 are proper nouns or terminologies. I'll talk to them to
see if this is ok. Thanks anyway.

Scott Chu,scott@udngroup.com <mailto:scott@udngroup.com>
2015/10/23

- Original Message -
*From: *Emir Arnautovic <mailto:emir.arnauto...@sematext.com>
*To: *solr-user <mailto:solr-user@lucene.apache.org>
*Date: *2015-10-22, 18:20:38
*Subject: *Re: Is it possible to specify only one-character term
synonym for 2-gram tokenizer?

Hi Scott,
Using PatternReplaceCharFilter is not same as replacing raw data
(replacing raw data is not proper solution as it does not solve issue
when searching with "other" character). This is part of token
standardization, no different than lower casing - it is standard
approach as well when it comes to Latin characters:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>

Quick search of "MappingCharFilterFactory chinese" shows it is used -
you should check if suitable for your case.

Thanks,
Emir

On 22.10.2015 11:48, Scott Chu wrote:
> Hi solr-user,
> Ya, I thought about replacing C1 with C2 in the underground raw
data.
> However, it's a huge data set (over 10M news articles) so I give up
> this strategy eariler. My current temporary solution is going
back to
> use 1-gram tokenizer ((i.e.StandardTokenizer) so I can only set 1
> rule. But it is kinda ugly, especially when applying highlight,
e.g.
> search "C1C2" Solr returns highlight snippet such as
> "...C1C2...".
> Scott Chu, scott@udngroup.com
> 2015/10/22
>
> - Original Message -
> *From: *Emir Arnautovic <emir.arnauto...@sematext.com>
> *To: *solr-user <solr-user@lucene.apache.org>
> *Date: *2015-10-22, 17:08:26
> *Subject: *Re: Is it possible to specify only one-character term
> synonym for 2-gram tokenizer?
>
> Hi Scott,
> I don't have experience with Chinese, but SynonymFilter works on
> tokens,
> so if CJKTokenizer recognizes C1 and Cm as tokens, it should
work. If
> not, than you can try configuring PatternReplaceCharFilter to
> replace C1
> to C2 during indexing and searching and get a match.
>
> Thanks,
> Emir
>
> On 22.10.2015 10:53, Scott Chu wrote:
> > Hi solr-user,
> > I always uses CJKTokenizer on appropriate amount of Chinese news
> > articles. Say in Chinese, character C1 has same meaning as
> > character C2 (e.g 台=臺), Is it possible that I only add this
> line in
> > synonym.txt:
> > C1,C2 (and in true exmaple: 台, 臺)
> > and by applying CJKTokenizer and SynonymFilter, I only have to
> query
> > "C1Cm..." (say Cm is arbitrary Chinese character) and Solr will
> > return documents that matche whether "C1Cm" or "C2Cm"?
> > Scott Chu, scott@udngroup.com
> > 2015/10/22
> >
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log
Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
>
> ---

Re: Does docValues impact termfreq ?

2015-10-26 Thread Emir Arnautovic
If I got it right, you are running a term query, using a function to get TF as
the score, then iterating over all documents in the results and summing up the
total number of occurrences of the specific term in the index? Is this the only
way you use the index, or is this side functionality?


Thanks,
Emir

On 24.10.2015 22:28, Aki Balogh wrote:

Certainly, yes. I'm just doing a word count, ie how often does a specific
term come up in the corpus?
On Oct 24, 2015 4:20 PM, "Upayavira"  wrote:


yes, but what do you want to do with the TF? What problem are you
solving with it? If you are able to share that...

On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:

Yes, sorry, I am not being clear.

We are not even doing scoring, just getting the raw TF values. We're
doing
this in solr because it can scale well.

But with large corpora, retrieving the word counts takes some time, in
part
because solr is splitting up word count by document and generating a
large
request. We then get the request and just sum it all up. I'm wondering if
there's a more direct way.
On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:


Can you explain more what you are using TF for? Because it sounds

rather

like scoring. You could disable field norms and IDF and scoring would

be

mostly TF, no?

Upayavira

On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:

Thanks, let me think about that.

We're using termfreq to get the TF score, but we don't know which

term

we'll need the TF for. So we'd have to do a corpuswide summing of
termfreq
for each potential term across all documents in the corpus. It seems

like

it'd require some development work to compute that, and our code

would be

fragile.

Let me think about that more.

It might make sense to just move to solrcloud, it's the right
architectural
decision anyway.


On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:


If you just want word length, then do work during indexing - index

a

field for the word length. Then, I believe you can do faceting -

e.g.

with the json faceting API I believe you can do a sum()

calculation on

a

field rather than the more traditional count.

Thinking aloud, there might be an easier way - index a field that

is

the

same for all documents, and facet on it. Instead of counting the

number

of documents, calculate the sum() of your word count field.

I *think* that should work.

Upayavira

On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:

Hi Jack,

I'm just using solr to get word count across a large number of

documents.

It's somewhat non-standard, because we're ignoring relevance,

but it

seems
to work well for this use case otherwise.

My understanding then is:
1) since termfreq is pre-processed and fetched, there's no good

way

to

speed it up (except by caching earlier calculations)

2) there's no way to have solr sum up all of the termfreqs

across all

documents in a search and just return one number for total

termfreqs


Are these correct?

Thanks,
Aki


On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky

wrote:


That's what a normal query does - Lucene takes all the terms

used

in

the

query and sums them up for each document in the response,

producing a

single number, the score, for each document. That's the way

Solr is

designed to be used. You still haven't elaborated why you are

trying

to use

Solr in a way other than it was intended.

-- Jack Krupansky

On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <

a...@marketmuse.com>

wrote:

Gotcha - that's disheartening.

One idea: when I run termfreq, I get all of the termfreqs for

each

document

one-by-one.

Is there a way to have solr sum it up before creating the

request,

so I

only receive one number in the response?


On Sat, Oct 24, 2015 at 11:05 AM, Upayavira 

wrote:

If you mean using the term frequency function query, then

I'm

not

sure

there's a huge amount you can do to improve performance.

The term frequency is a number that is used often, so it is

stored

in

the index pre-calculated. Perhaps, if your data is not

changing,

optimising your index would reduce it to one segment, and

thus

might

ever so slightly speed the aggregation of term frequencies,

but I

doubt

it'd make enough difference to make it worth doing.

Upayavira

On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:

Thanks, Jack. I did some more research and found similar

results.

In our application, we are making multiple (think: 50)

concurrent

requests
to calculate term frequency on a set of documents in

"real-time". The

faster that results return, the better.

Most of these requests are unique, so cache only helps

slightly.

This analysis is happening on a single solr instance.

Other than moving to solr cloud and splitting out the

processing

onto

multiple servers, do you have any suggestions for what

might

speed up

termfreq at query time?

Thanks,
Aki


On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky

wrote:


Term frequency applies only to the indexed terms of a

tokenized

field.

DocValues is really just a copy of the o

Re: Does docValues impact termfreq ?

2015-10-26 Thread Emir Arnautovic

Hi Aki,
IMO this is an underuse of Solr (not to mention SolrCloud). I would
recommend doing in-memory document parsing (if you need something from the
Lucene/Solr analysis classes, use it) and using some other cache-like
solution to store term/total frequency pairs (you can try Redis).


That way you will have updatable, fast total frequency lookups.

Thanks,
Emir

On 26.10.2015 14:43, Aki Balogh wrote:

Hi Emir,

This is correct. This is the only way we use the index.

Thanks,
Aki

On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


If I got it right, you are using term query, use function to get TF as
score, iterate all documents in results and sum up total number of
occurrences of specific term in index? Is this only way you use index or
this is side functionality?

Thanks,
Emir


On 24.10.2015 22:28, Aki Balogh wrote:


Certainly, yes. I'm just doing a word count, ie how often does a specific
term come up in the corpus?
On Oct 24, 2015 4:20 PM, "Upayavira"  wrote:

yes, but what do you want to do with the TF? What problem are you

solving with it? If you are able to share that...

On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:


Yes, sorry, I am not being clear.

We are not even doing scoring, just getting the raw TF values. We're
doing
this in solr because it can scale well.

But with large corpora, retrieving the word counts takes some time, in
part
because solr is splitting up word count by document and generating a
large
request. We then get the request and just sum it all up. I'm wondering
if
there's a more direct way.
On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:

Can you explain more what you are using TF for? Because it sounds
rather
like scoring. You could disable field norms and IDF and scoring would
be
mostly TF, no?

Upayavira

On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:


Thanks, let me think about that.

We're using termfreq to get the TF score, but we don't know which


term

we'll need the TF for. So we'd have to do a corpuswide summing of

termfreq
for each potential term across all documents in the corpus. It seems


like

it'd require some development work to compute that, and our code

would be

fragile.

Let me think about that more.

It might make sense to just move to solrcloud, it's the right
architectural
decision anyway.


On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:

If you just want word length, then do work during indexing - index
a

field for the word length. Then, I believe you can do faceting -

e.g.

with the json faceting API I believe you can do a sum()

calculation on

a

field rather than the more traditional count.

Thinking aloud, there might be an easier way - index a field that


is

the

same for all documents, and facet on it. Instead of counting the
number

of documents, calculate the sum() of your word count field.

I *think* that should work.

Upayavira

On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:


Hi Jack,

I'm just using solr to get word count across a large number of


documents.

It's somewhat non-standard, because we're ignoring relevance,

but it

seems

to work well for this use case otherwise.

My understanding then is:
1) since termfreq is pre-processed and fetched, there's no good


way

to

speed it up (except by caching earlier calculations)

2) there's no way to have solr sum up all of the termfreqs


across all

documents in a search and just return one number for total

termfreqs

Are these correct?

Thanks,
Aki


On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky

wrote:

That's what a normal query does - Lucene takes all the terms
used

in

the

query and sums them up for each document in the response,
producing a

single number, the score, for each document. That's the way

Solr is

designed to be used. You still haven't elaborated why you are

trying

to use

Solr in a way other than it was intended.

-- Jack Krupansky

On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <


a...@marketmuse.com>

wrote:

Gotcha - that's disheartening.

One idea: when I run termfreq, I get all of the termfreqs for


each

document

one-by-one.

Is there a way to have solr sum it up before creating the


request,

so I

only receive one number in the response?


On Sat, Oct 24, 2015 at 11:05 AM, Upayavira 


wrote:

If you mean using the term frequency function query, then

I'm

not

sure

there's a huge amount you can do to improve performance.

The term frequency is a number that is used often, so it is


stored

in

the index pre-calculated. Perhaps, if your data is not

changing,

optimising your index would reduce it to one segment, and

thus

might

ever so slightly speed the aggregation of term frequencies,

but I

doubt

it'd make enough difference to make it worth doing.

Upayavira

On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:


Thanks, Jack. I d

Re: solr 5.3.0 master-slave: TWO segments after optimize

2015-10-28 Thread Emir Arnautovic

Hi Andrii,
Is the observed high CPU on the master or the slave? If on a slave, is it on all
slaves? Can you take a thread dump and see what is running?


Based on the numbers this seems like a small index, and one segment is a flush
with only 160 docs. Anyway, this is small, and something is really wrong
if you notice issues at this scale.
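
For reference, replication on optimize is typically configured on the master along these lines (a sketch with assumed file names - adjust to your own solrconfig.xml):

    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <!-- slaves pull the index only after an explicit optimize on the master -->
        <str name="replicateAfter">optimize</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

One possible explanation for the extra segment is that documents committed on the master after the optimize end up in the snapshot the slave copies, which would produce a small flush segment like the one above.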


Thanks,
Emir

On 28.10.2015 17:27, Andrii Berezhynskyi wrote:

Hi all!

We have master-slave configuration. Slaves are replicated on optimize.
Sometimes (not always) it happens that slave has two segments after
replication. I always thought that optimize is all about having only one
segment in your read replica. During this time when slave has two segments
we are experiencing higher response time and higher CPU usage. Is it
somehow possible to prevent slave from having two segments in
slave-replicated-on-optimize configuration?

Segments info shows the following for these two segments:

Segment *_1kxw*:
#docs:
167 841
#dels:
161
size:
96 985 075 bytes
age:
2015-10-28T16:10:01.894Z
source:
merge

Segment *_1kxv*:
#docs:
160
#dels:
0
size:
270 075 bytes
age:
2015-10-28T16:10:07.897Z
source:
flush

Best regards,
Andrii



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Solr results relevancy / scoring

2015-11-09 Thread Emir Arnautovic
To get an answer for why 15 matches, you can use field analysis for index/query and
see that "15%" is probably tokenized as both 15 and 15%.
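
As an illustration, a field type along these lines (a hypothetical sketch, not your actual schema) would index "15%" as both "15%" and "15", which would explain why a plain "15" also matches, just with a lower score:

    <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- preserveOriginal keeps "15%" while the split also emits "15" -->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The Analysis screen in the admin UI shows exactly which tokens your own chain produces for "15%".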


Emir

On 06.11.2015 20:22, Erick Erickson wrote:

I'm not sure what the question your asking is. You say
that you have debugged the query and the score for 15 is
higher than the ones below it. What's surprising about that?

Are you saying you don't understand how the score is
calculated? Or the output when adding &debug=true
is inconsistent or what?

Best,
Erick

On Fri, Nov 6, 2015 at 11:04 AM, Brian Narsi  wrote:

I have a situation where.

User search query

q=15%

Solr results contain several documents that are

15%
15%
15%
15%
15 (why?)
15%
15%

I have debugged the query and can see that the score for 15 is higher than
the ones below it.

Why is that? Where can I read in detail about how the scoring is being done?

Thanks


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: solr search relevancy

2015-11-09 Thread Emir Arnautovic

Hi Dhanesh,
Several things you could try:
* when you are searching for "bank" you are actually searching for a
tag/category, and in your query you are boosting name by 300 while tag is only 3.
* you must not sort on the premium content weight - you can instead use boost
query clauses to prefer premium content
* use the elevation component in case you want to explicitly list some
results for some queries
* take a look at the edismax query parser instead of building your own query
- it gives you nice features you could use here: boost fields, minimum
terms match, boost queries... (a minimal sketch follows below)
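
As a starting point, an edismax handler along these lines covers most of the points above (the handler name, field boosts, mm and bq values are only illustrative assumptions based on your query, not a tested configuration):

    <requestHandler name="/select_business" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <!-- name matches count far more than tag/address matches -->
        <str name="qf">name^300 categoryPrefixed^100 tag^30 address^5</str>
        <str name="mm">75%</str>
        <!-- prefer premium listings without hard-sorting on packageWeight -->
        <str name="bq">packageWeight:30^50</str>
      </lst>
    </requestHandler>

A boost query keeps relevancy in play, so a premium "Mobile Store" tagged "Power Bank" can no longer jump above real banks the way a hard sort on packageWeight lets it.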


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 09.11.2015 11:50, Dhanesh Radhakrishnan wrote:

Hi,
Can anybody help me resolve an issue with Solr search relevancy?
The problem is that when somebody searches "Bank", it displays other
businesses related to this phrase.
For example, it shows "Blood bank" and "Power bank" as the first results.
To resolve this, we implemented proximity search at the end of the
phrase to improve relevancy and boost the field. This eliminates the
issue in search, and irrelevant results go down to the last sections.
For example: when a user searches "Bank" he gets banks in the top results, not
the "Blood banks". This was resolved with proximity in phrases and boosting.

http://stackoverflow.com/questions/12070016/solr-how-to-boost-score-for-early-matches

http://localhost:8983/solr/localbusiness/select?q=((name
:"bank")^300+OR+((categoryPrefixed:"_CATEGORY_PREFIXED_+bank"~1)^300+AND+(categoryPrefixed:"_CATEGORY_PREFIXED_+bank"~2)^200+AND+(categoryPrefixed:"_CATEGORY_PREFIXED_+bank"~3)^100)+OR+
(tag:"bank")^30+OR+(address:"bank")^5)
&start=0&rows=10&wt=json&indent=true


But the actual problem occurred when we sort the search result.

There is a specific requirement from the client that the "Premium" listings
should display as top results. For that we have a field packageWeight in the Solr
schema with values like 10, 20, 30 for free, basic and premium
respectively.




And now when we perform this sorting, we get some irrelevant results at the
top, but they are Premium listings.
How this happens is that
in the Solr schema there is a field "tag". A tag is a small summary or set of words
related to the business, used to provide a quick
overview of the business.
A search can match on "Tag", which has a comparatively very low boost.




In the Solr index there is a premium business named "Mobile Store" which is
tagged with the keyword "Power Bank".
When we search "Bank" without sorting we get relevant results first.
But when we sort the results by the field packageWeight, this doc comes first.

Is there any way to resolve this issue?
Or at least, is there any way to exclude certain fields from the sort, but
not from the search?


Regards
dhanesh s r



Re: Search query speed

2015-11-12 Thread Emir Arnautovic
What are the HW specs? Four threads is not much, but it still makes the test less
deterministic, especially when queries are not equally "heavy".


Can you also collect QTime from the Solr response and see if the differences are
caused by networking?


Emir

On 11.11.2015 20:44, John Stric wrote:

There is a .NET app that is calling solr. I am measuring time span using
.NET provided methods. It used to take about 42 msec and it started taking
66 msec from the time to compose the call and query solr, get results and
parse them back. Interestingly today it was close to 44 msec.
I am testing using 4 threads and 1000 iterations i.e. each thread makes 250
requests. I then do the average.

The query is being done using dismax parser with mm, qf, qs and bf. There
are 5 query fields. The query includes filter on 3 fields and facet on two
fields

No changes were done: no new docs, no updated or deleted docs. I did try to
restart machine and test multiple time/multiple ways and still the average
was about 66 msec.



On Wed, Nov 11, 2015 at 12:18 PM, Chris Hostetter 
wrote:


: The speed of particular query has gone from about 42 msec to 66 msec
: without any changes.

1) Define "speed" ?

how are you measuring?
where are you measuring?
are you measuring averages? over what sample size?

2) define "particular query" ?

what types of queries?
what types of params are in the request?
what do the results look like?

3) Define "any changes" ?

new docs?
deleted docs?
updated docs?
restarted server?
restarted solr?
increased query load?



-Hoss
http://www.lucidworks.com/



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Solr Cloud 5.3.0 Errors in Logs

2015-11-16 Thread Emir Arnautovic

Hi Adrian,
Can you give us a bit more detail about the warmup queries you use and the test
you are running when the error occurs?


Thanks,
Emir

On 16.11.2015 08:40, Adrian Liew wrote:

Hi there,

Would like to get some opinions on the errors encountered below. I have
currently set up a SolrCloud cluster of 3 servers (each server hosting a Solr
instance and a Zookeeper instance).

I am encountering the errors below in the logs:
Monday, November 16, 2015 3:22:54 PM ERROR null SolrCore 
org.apache.solr.common.SolrException: Error opening new searcher. exceeded 
limit of maxWarmingSearchers=6,​ try again later.
Monday, November 16, 2015 3:22:54 PM ERROR null SolrCore 
org.apache.solr.common.SolrException: Error opening new searcher. exceeded 
limit of maxWarmingSearchers=6,​ try again later.
Monday, November 16, 2015 3:22:54 PM ERROR null SolrCore 
org.apache.solr.common.SolrException: Error opening new searcher. exceeded 
limit of maxWarmingSearchers=6,​ try again later.
Monday, November 16, 2015 3:22:54 PM ERROR null SolrCore 
org.apache.solr.common.SolrException: Error opening new searcher. exceeded 
limit of maxWarmingSearchers=6,​ try again later.
Monday, November 16, 2015 3:22:54 PM ERROR null SolrCmdDistributor 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at 
http://172.18.111.112:8983/solr/sitecore_master_index_shard1_replica1: Error 
opening new searcher. exceeded limit of maxWarmingSearchers=6,​ try again later.
Monday, November 16, 2015 3:22:54 PM ERROR null SolrCmdDistributor 
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
from server at 
http://172.18.111.112:8983/solr/sitecore_master_index_shard1_replica1: Error 
opening new searcher. exceeded limit of maxWarmingSearchers=6,​ try again later.
Monday, November 16, 2015 3:22:54 PM WARN null DistributedUpdateProcessor Error 
sending update to http://172.18.111.112:8983/solr
Monday, November 16, 2015 3:22:54 PM WARN null DistributedUpdateProcessor Error 
sending update to http://172.18.111.112:8983/solr
Monday, November 16, 2015 3:22:54 PM WARN null DistributedUpdateProcessor Error 
sending update to http://172.18.111.112:8983/solr
Monday, November 16, 2015 3:22:54 PM WARN null DistributedUpdateProcessor Error 
sending update to http://172.18.111.112:8983/solr

11/16/2015, 3:17:09 PM

WARN

null

DistributedUpdateProcessor

Error sending update to http://172.18.111.112:8983/solr

11/16/2015, 3:17:09 PM

WARN

null

DistributedUpdateProcessor

Error sending update to http://172.18.111.112:8983/solr

11/16/2015, 3:22:26 PM

ERROR

null

SolrCmdDistributor

org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting 
response from server at: 
http://172.18.111.112:8983/solr/sitecore_master_index_shard1_replica1



The main errors are Timeout occurred exceptions and maxWarmingSearchers exceeded. Is
anyone able to advise, or has anyone experienced something like the above in
their SolrCloud setup?

Regards,
Adrian





--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Undo Split Shard

2015-11-17 Thread Emir Arnautovic

Hi,
You can try manually adjusting the cluster state in ZK to include the parent
shard and exclude the splits, reload the collection, and try the split again.


Btw, any errors in the logs when the split failed?

Thanks,
Emir

On 17.11.2015 07:08, kiyer_adobe wrote:

We had 32 shards of 30GB each. The query performance was awful. We decided to
split shards for all of them. Most of them went fine but for 3 shards that
got split with _1 to 16GB but _0 in low MB's. The _1 is
fine but _0 is definitely wrong. The parent shard is inactive and now
the split shards are active.
I tried deleteshard on the split shards so I could split again, but it does not allow
deleteshard on active shards. Running splitshard again on the parent shard
failed.

I am unsure of what the options are at this point and query went from bad
performance to not working at all.

Please advise.

Thanks.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Undo-Split-Shard-tp4240508.html
Sent from the Solr - User mailing list archive at Nabble.com.


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Performance testing on SOLR cloud

2015-11-18 Thread Emir Arnautovic

Hi Aswath,
It is not common to test only QPS unless the index is static most of the
time. Usually you have to test and tune the worst-case scenario - the max
expected indexing rate + queries. You can get more QPS by reducing query
latency or by increasing the number of replicas. You manage latency by
tuning Solr/JVM/queries and/or by sharding the index. You first tune the index
without replication, and when you are sure it is the best a single index can provide,
you introduce replicas to achieve the required throughput.


Hard part is tuning Solr. You can do it without specialized tools, but 
tools help a lot. One such tool is Sematext's SPM - 
https://sematext.com/spm/index.html where you can see all necessary 
Solr/JVM/OS metrics needed to tune Solr. It also provides QPS graph.


With an index your size, unless documents are really big, you can start
without sharding. After tuning, if you are not satisfied with query latency, you
can try splitting into two shards.


Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 17.11.2015 23:45, Aswath Srinivasan (TMS) wrote:

Hi fellow developers,

Please share your experience, on how you did performance testing on SOLR? What 
I'm trying to do is have SOLR cloud on 3 Linux servers with 16 GB RAM and index 
a total of 2.2 million. Yet to decide how many shards and replicas to have (Any 
hint on this is welcome too, basically 'only' performance testing, so suggest 
the number of shards and replicas if you can). Ultimately, I'm trying to find 
the QPS that this SOLR cloud set up can handle.

To summarize,

1.   Find the QPS that my solr cloud set up can support

2.   Using 5.3.1 version with external zookeeper

3.   3 Linux servers with 16 GB RAM and index a total of 2.2 million documents

4.   Yet to decide number of shards and replicas

5.   Not using any custom search application (performance testing for SOLR and 
not for Search portal)

Thank you





Re: solr indexing warning

2015-11-19 Thread Emir Arnautovic
This means that one searcher is still warming when another searcher is
created due to a commit with openSearcher=true. This can be due to
frequent commits or searcher warmup taking too long.
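
If the commits come from autoCommit/autoSoftCommit, one common way to reduce searcher churn is to let hard commits skip opening a searcher and open new searchers only on less frequent soft commits, e.g. (values are illustrative, not a recommendation for your exact setup):

    <autoCommit>
      <maxTime>60000</maxTime>
      <!-- flush to disk, but do not open a new searcher -->
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- a new searcher (and its warmup) at most once per 5 minutes -->
      <maxTime>300000</maxTime>
    </autoSoftCommit>

You can also raise maxWarmingSearchers, but that only hides the symptom while warming work piles up.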


Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 19.11.2015 12:16, Midas A wrote:

Getting the following log message on Solr:


PERFORMANCE WARNING: Overlapping onDeckSearchers=2



Re: solr indexing warning

2015-11-20 Thread Emir Arnautovic

Hi,
Since this is the master node and not expected to serve queries, you can
disable the caches completely. However, from the numbers, cache autowarm is not
the issue here, but probably the frequency of commits and/or warmup queries.
How do you do commits? Since this is master-slave, I don't see a reason to have
them too frequently. If you need NRT you should switch to SolrCloud. Do
you have warmup queries? You don't need them on the master node.
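
For example, on a query-less master the cache sections can be kept minimal and autowarming switched off entirely (a sketch - keep the slave configuration as it is):

    <!-- no query traffic on the master, so no point in large caches or autowarming -->
    <filterCache class="solr.FastLRUCache" size="64" initialSize="64" autowarmCount="0"/>
    <queryResultCache class="solr.LRUCache" size="64" initialSize="64" autowarmCount="0"/>
    <documentCache class="solr.LRUCache" size="64" initialSize="64"/>

and the newSearcher/firstSearcher listener sections with warmup queries can be removed from solrconfig.xml on the master.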


Regards,
Emir

On 20.11.2015 08:33, Midas A wrote:

thanks Shawn,

As we are using this server as a master server, there are no queries running on
it. In that case, should I remove these configurations from the config file?

Total Docs: 40 0

Stats
#

Document cache :
lookups:823
hits:4
hitratio:0.00
inserts:820
evictions:0
size:820
warmupTime:0
cumulative_lookups:24474
cumulative_hits:1746
cumulative_hitratio:0.07
cumulative_inserts:22728
cumulative_evictions:13345


fieldcache:
stats:
entries_count:2
entry#0:'SegmentCoreReader(​owner=_3bph(​4.2.1):C3918553)'=>'_version_',long,org.apache.lucene.search.FieldCache.NUMERIC_UTILS_LONG_PARSER=>org.apache.lucene.search.FieldCacheImpl$LongsFromArray#1919958905
entry#1:'SegmentCoreReader(​owner=_3bph(​4.2.1):C3918553)'=>'_version_',class
org.apache.lucene.search.FieldCacheImpl$DocsWithFieldCache,null=>org.apache.lucene.util.Bits$MatchAllBits#660036513
insanity_count:0


fieldValuecache:

lookups:0

hits:0

hitratio:0.00

inserts:0

evictions:0

size:0

warmupTime:0

cumulative_lookups:0

cumulative_hits:0

cumulative_hitratio:0.00

cumulative_inserts:0

cumulative_evictions:0


filtercache:


lookups:0

hits:0

hitratio:0.00

inserts:0

evictions:0

size:0

warmupTime:0

cumulative_lookups:0

cumulative_hits:0

cumulative_hitratio:0.00

cumulative_inserts:0

cumulative_evictions:0


QueryResultCache:

lookups:3841

hits:0

hitratio:0.00

inserts:4841

evictions:3841

size:1000

warmupTime:213

cumulative_lookups:58438

cumulative_hits:153

cumulative_hitratio:0.00

cumulative_inserts:58285

cumulative_evictions:57285



Please suggest .



On Fri, Nov 20, 2015 at 12:15 PM, Shawn Heisey  wrote:


On 11/19/2015 11:06 PM, Midas A wrote:


initialSize

="1000" autowarmCount="1000"/>

Your caches are quite large.  More importantly, your autowarmCount is
very large.  How many documents are in each of your cores?  If you check
the Plugins/Stats area in the admin UI for your core(s), how many
entries are actually in each of those three caches?  Also shown there is
the number of milliseconds that it took for each cache to warm.

The documentCache cannot be autowarmed, so that config is not doing
anything.

When a cache is autowarmed, what this does is look up the key for the
top N entries in the old cache, which contains the query used to
generate that cache entry, and executes each of those queries on the new
index to populate the new cache.

This means that up to 2000 queries are being executed every time you
commit and open a new searcher.  The actual number may be less, if the
filterCache and queryResultCache are not actually reaching 1000 entries
each.  Autowarming can take a significant amount of time when the
autowarmCount is high.  It should be lowered.

Thanks,
Shawn




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: ZooKeeper nodes die taking down Solr Cluster?

2015-12-01 Thread Emir Arnautovic

Hi Frank,
Can you please confirm that the Solr nodes are aware of the entire ZK ensemble?
Can you give more info on how it is deployed - is ZK on separate servers? What
is the load on Solr when it happens? Do you see any errors in the Solr logs?


Thanks,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



On 30.11.2015 15:42, Kelly, Frank wrote:

I am somewhat new to SolrCloud and ZooKeeper.

We are deploying ZK and SolrCloud on AWS.
We are noticing an issue where one of the three nodes in the ZooKeeper ensemble 
"drops out" of the ensemble (although the Java process continues to run fine 
and there is nothing obviously bad in the ZooKeeper log files).
And (perhaps it's a coincidence) the Solr cluster then seems to be impacted, where Solr 
nodes appear as "gone" even though their Java process is still running.

When we restart ZooKeeper nodes the Solr Cluster does not recover.

I have ZooKeeper logs and Solr logs - just wondering what else I should capture 
before posting to this mailing list

Thanks!

-Frank


Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)




Email: 
frank.ke...@here.com
Website: http://www.here.com




5 Wayside Rd, Burlington, MA 01803, USA
Here, a Nokia business








Re: ZooKeeper nodes die taking down Solr Cluster?

2015-12-01 Thread Emir Arnautovic
:8983_solr) [   ]
o.a.s.s.ZkIndexSchemaReader Retrieved schema version 89 from ZooKeeper
597644246 INFO
(zkCallback-4-thread-1073-processing-n:52.91.90.134:8983_solr) [   ]
o.a.s.s.ZkIndexSchemaReader A schema change: WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/configs/mycollection/managed-schema, has occurred - updating schema
from ZooKeeper ...
597674162 WARN
(zkCallback-4-thread-1073-processing-n:52.91.90.134:8983_solr) [   ]
o.a.s.s.ZkIndexSchemaReader ZooKeeper watch triggered, but Solr cannot
talk to ZK
597628388 ERROR
(zkCallback-4-thread-881-processing-n:52.91.90.134:8983_solr-EventThread)
[   ] o.a.z.ClientCnxn Error while calling watcher
java.lang.OutOfMemoryError: Java heap space
597593838 INFO
(zkCallback-4-thread-1103-processing-n:52.91.90.134:8983_solr) [   ]
o.a.s.s.ZkIndexSchemaReader A schema change: WatchedEvent
state:SyncConnected type:NodeDataChanged
path:/configs/mycollection/managed-schema, has occurred - updating schema
from ZooKeeper ...


So it looks like it ran out of memory . . . Strange but I thought my
collections were pretty small.
Any idea why a replace-field-type call might cause an OutOfMemoryException?

-Frank

Example curl request that causes Solr nodes to appear as "Gone"

curl -X POST -H 'Content-type:application/json' --data-binary '{
   "replace-field-type" : {
  "name":"text_ws",
  "class":"solr.TextField",
  "positionIncrementGap":"100",
  "indexAnalyzer" : {
 "tokenizer":{ "class":"solr.StandardTokenizerFactory" },
 "filters":[
 { "class":"solr.StopFilterFactory", "ignoreCase":true,
"words":"stopwords.txt" },
 { "class":"solr.LowerCaseFilterFactory" },
 { "class":"solr.EdgeNGramFilterFactory",
"minGramSize":"2", "maxGramSize":"15" }
  ]
  },
  "queryAnalyzer" : {
 "tokenizer":{ "class":"solr.StandardTokenizerFactory" },
 "filters":[
 { "class":"solr.StopFilterFactory", "ignoreCase":true,
"words":"stopwords.txt" },
 { "class":"solr.SynonymFilterFactory",
"synonyms":"synonyms.txt", "ignoreCase":true, "expand":true },
 { "class":"solr.LowerCaseFilterFactory" }
  ]
  }
}
}' http://aa.bb.cc.dd:8983/solr/my_collection_name/schema










On 12/1/15, 7:45 AM, "Emir Arnautovic" 
wrote:


Hi Frank,
Can you please confirm that Solr nodes are aware of entire ZK ensemble?
Can you give more info how it is deployed - ZK on separate servers? What
is load on Solr when it happens? Do you see any errors in Solr logs?

Thanks,
Emir



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: FastVector Highlighter

2015-12-07 Thread Emir Arnautovic

Hi Edwin,
FastVector Highlighter requires term vector, positions and frequencies, 
so if it is not enabled on fields that you want to highlight, it will 
increase index size. Since it is common to have those enabled for 
standard highlighter to speed up highlighting, those might already be 
enabled, otherwise reindexing is required as well.
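
A field meant for the FastVector Highlighter is typically declared along these lines (the field and type names here are just placeholders), which is what grows the index and requires reindexing:

    <!-- termVectors/termPositions/termOffsets are what the FastVector Highlighter needs -->
    <field name="content" type="text_general" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>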


Regards,
Emir

On 07.12.2015 12:06, Zheng Lin Edwin Yeo wrote:

Hi,

Would like to check: will using the FastVector Highlighter take up more
indexing space (index size) as compared to the Original Highlighter?

I'm using Solr 5.3.0

Regards,
Edwin



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: FastVector Highlighter

2015-12-07 Thread Emir Arnautovic

Hi Edwin,
It is expected since you are storing more info about the document.

Thanks,
Emir

p.s. one correction - I meant offsets, not frequencies.

On 07.12.2015 15:47, Zheng Lin Edwin Yeo wrote:

Hi Emir,

The term vectors, positions and frequencies weren't enabled, so I'm doing the
re-indexing. That is where I found that the index size is bigger than
it was previously when I was using the Original Highlighter.

Regards,
Edwin


On 7 December 2015 at 19:19, Emir Arnautovic 
wrote:


Hi Edwin,
FastVector Highlighter requires term vector, positions and frequencies, so
if it is not enabled on fields that you want to highlight, it will increase
index size. Since it is common to have those enabled for standard
highlighter to speed up highlighting, those might already be enabled,
otherwise reindexing is required as well.

Regards,
Emir


On 07.12.2015 12:06, Zheng Lin Edwin Yeo wrote:


Hi,

Would like to check, will using the FastVector Highlighter takes up more
indexing space (the index size) as compared to the Original Highlighter?

I'm using Solr 5.3.0

Regards,
Edwin



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/



Re: Solr 5.2.1 deadlock on commit

2015-12-08 Thread Emir Arnautovic

Hi Ali,
This thread is blocked because it cannot obtain the update lock - in this
particular case when doing a soft commit. I am guessing that the others
are blocked for the same reason. Can you tell us a bit more about your
setup, indexing load and procedure? Do you do explicit commits?
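
If the bulk updates end with explicit commits, one option is to drop them and let documents become visible via commitWithin on the add request instead (a sketch of the XML update message - the field values are placeholders):

    <!-- ask Solr to make these docs visible within 60s instead of committing explicitly -->
    <add commitWithin="60000">
      <doc>
        <field name="id">doc-1</field>
        <field name="title">example</field>
      </doc>
    </add>

That way commits (and the searchers they open) are controlled by the server rather than by each indexing client.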


Regards,
Emir

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On 08.12.2015 08:16, Ali Nazemian wrote:

Hi,
It has been a while since I started having a problem with Solr 5.2.1, and I could not
fix it yet. The only thing that is clear to me is that when I send a bulk update
to Solr the commit thread gets blocked! Here is the thread dump output:

"qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting for
monitor entry [0x7f081cf04000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608)
- waiting to lock <0x00067ba2e660> (a java.lang.Object)
at
org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635)
at
org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612)
at
org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at
org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
- None

FYI, there are lots of blocked threads in the thread dump report, and Solr becomes
really slow in this case. The temporary solution would be restarting Solr.
But I am really sick of restarting! I would really appreciate it if somebody could
help me solve this problem.

Best regards.



Re: Use multiple instances simultaneously

2015-12-08 Thread Emir Arnautovic
Can you tolerate having indices in different states, or do you plan to keep
them in sync with controlled commits? DIH-ing content from the source when a
new machine is needed will probably be slow, and I am afraid that you
will end up simulating the master-slave model (copying state from one of the
healthy nodes and DIH-ing the diff). I would recommend using SolrCloud with a
single shard and letting Solr do the hard work.


Regards,
Emir

On 04.12.2015 14:37, Gian Maria Ricci - aka Alkampfer wrote:

Many thanks for your response.

I worked with Solr until early version 4.0, then switched to ElasticSearch
for a variety of reasons. I've used replication in the past with SolR, but
with Elasticsearch basically I had no problem because it works similar to
SolrCloud by default and with almost zero configuration.

Now I have a customer that wants to use Solr, and he wants the simplest possible
setup to maintain in production. Since most of the work will be done by the Data
Import Handler, having multiple parallel and independent machines is easy to
maintain. If one machine fails, it is enough to set up another machine,
configure the core and restart DIH.

I'd like to know if other people went through this path in the past.

--
Gian Maria Ricci
Cell: +39 320 0136949
 


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: giovedì 3 dicembre 2015 10:15
To: solr-user@lucene.apache.org
Subject: Re: Use multiple istance simultaneously

On 12/3/2015 1:25 AM, Gian Maria Ricci - aka Alkampfer wrote:

In such a scenario could it be feasible to simply configure 2 or 3
identical instance of Solr and configure the application that transfer
data to solr to all the instances simultaneously (the approach will be
a DIH incremental for some core and an external application that push
data continuously for other cores)? Which could be the drawback of
using this approach?

When I first set up Solr, I used replication.  Then version 3.1.0 was
released, including a non-backward-compatible upgrade to javabin, and it was
not possible to replicate between 1.x and 3.x.

This incompatibility meant that it would not be possible to do a gradual
upgrade to 3.x, where the slaves are upgraded first and then the master.

To get around the problem, I basically did exactly what you've described.
I turned off replication and configured a second copy of my build program to
update what used to be slave servers.

Later, when I moved to a SolrJ program for index maintenance, I made one
copy of the maintenance program capable of updating multiple copies of the
index in parallel.

I have stuck with this architecture through 4.x and moving into 5.x, even
though I could go back to replication or switch to SolrCloud.
Having completely independent indexes allows a great deal of flexibility
with upgrades and testing new configurations, flexibility that isn't
available with SolrCloud or master-slave replication.

Thanks,
Shawn



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


