Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Mikhail Khludnev
Suresh, There are a few common workarounds for such a problem. But I think that submitting more than "maxIndexingThreads" is not really productive. Also, I think that the out-of-memory problem is caused not by indexing, but by opening a searcher. Do you really need to open it? I don't think it's a good id

Re: clarification regarding shard splitting and composite IDs

2015-02-04 Thread Gili Nachum
Hi, I'm also interested. When using the composite ID, the _route_ information is not kept on the document itself, so to me it looks like it's not possible, as the split API doesn't have a relevant parameter to spl

Re: clarification regarding shard splitting and composite IDs

2015-02-04 Thread Anshum Gupta
In one line, shard splitting doesn't depend on the routing mechanism but just on the hash range, so you could have documents for the same prefix split up. Here's an overview of routing in SolrCloud: * Happens based on a hash value * The hash is calculated using the multiple parts of the routi
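A rough sketch of why a split on the hash range can separate documents that share a prefix. It assumes the composite hash takes its upper 16 bits from the prefix and its lower 16 bits from the rest of the ID; `zlib.crc32` is only a stand-in for the real hash function, and the function names are hypothetical.

```python
import zlib


def toy_hash(text: str) -> int:
    # Stand-in 32-bit hash; NOT the hash SolrCloud actually uses.
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id: str) -> int:
    # Assumes IDs of the form "prefix!rest", e.g. "user1!doc42".
    prefix, _, rest = doc_id.partition("!")
    # Upper 16 bits come from the prefix, lower 16 bits from the rest.
    return (toy_hash(prefix) & 0xFFFF0000) | (toy_hash(rest) & 0x0000FFFF)


def prefix_range(prefix: str) -> tuple:
    # Every document with this prefix hashes somewhere inside this range.
    low = toy_hash(prefix) & 0xFFFF0000
    return low, low | 0x0000FFFF


low, high = prefix_range("user1")
split_point = (low + high) // 2  # a split boundary that falls inside the prefix's range
for n in range(5):
    doc_id = f"user1!doc{n}"
    h = composite_hash(doc_id)
    side = "lower sub-shard" if h <= split_point else "upper sub-shard"
    print(doc_id, hex(h), side)
```

If a split boundary lands inside the prefix's range, as `split_point` does here, documents with the same prefix can end up in different sub-shards.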

More Like This similarity tuning

2015-02-04 Thread Ali Nazemian
Hi, I am looking for a best practice on More Like This parameters. I would really appreciate it if somebody could tell me the best values for these parameters in an MLT query, or at least the proper methodology for finding the best value for each of these parameters: mlt.mintf mlt.mindf mlt.maxqt Thank y

Reg : How to properly outline an Apache Solr documentation

2015-02-04 Thread Bachan
1. What is the difference between delta import and full import in Apache Solr? 2. What are all the naming conventions to be followed while writing deltaImportQuery and deltaQuery (ID, TXT_ID etc.)? Any references or tutorial explaining in detail about differences/relations between deltaImportQuery a

Re: MoreLikeThis filter by score threshold

2015-02-04 Thread Ali Nazemian
Dear Upayvira, Thank you very much. I think probably in long-term the fluctuation of changing score becomes really small. Probably I can reduce that with considering percentage instead of filtering by raw score. (however it is still possible that percentage changes during a period of time) Anyway,

RE: MoreLikeThis filter by score threshold

2015-02-04 Thread Markus Jelsma
Hello Ali - no it is not reasonable and it is unnecessary at best. Regardless of the query, you sort by score. This means that the top queries are always the most relevant, so what exactly do you need to filter? -Original message- > From:Ali Nazemian > Sent: Tuesday 3rd February 2015 2

RE: MoreLikeThis filter by score threshold

2015-02-04 Thread Markus Jelsma
Hello Upayavira - Indeed, it works, except ... insert-counter-arguments. It doesn't work after all :) Markus -Original message- > From:Upayavira > Sent: Tuesday 3rd February 2015 21:38 > To: solr-user@lucene.apache.org > Subject: Re: MoreLikeThis filter by score threshold > > I've seen

RE: More Like This similarity tuning

2015-02-04 Thread Markus Jelsma
Well, maxqt is easy: it is just the number of terms that compose your query. MinTF is a strange parameter: rare terms have a low DF and usually not a high TF, so I would keep it at 1. MinDF is more useful; it depends entirely on the size of your corpus. If you have a lot of user-generated
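To make the three parameters concrete, here is a simplified toy model of how a More Like This query could pick its terms: drop terms below the minimum term frequency or document frequency, rank the rest by tf * idf, and keep at most maxqt of them. The function name, the default values and the exact scoring are assumptions for illustration, not Solr's implementation.

```python
import math


def select_query_terms(term_freqs, doc_freqs, num_docs, mintf=1, mindf=5, maxqt=25):
    """Toy term selection: filter by mintf/mindf, rank by tf * idf, cap at maxqt."""
    scored = []
    for term, tf in term_freqs.items():
        df = doc_freqs.get(term, 0)
        if tf < mintf or df < mindf:
            continue  # term is too rare in the source document or in the corpus
        idf = 1.0 + math.log(num_docs / (df + 1))
        scored.append((tf * idf, term))
    scored.sort(reverse=True)  # highest tf * idf first
    return [term for _, term in scored[:maxqt]]


# Hypothetical source document and corpus statistics.
term_freqs = {"solr": 3, "the": 10, "typo": 1}
doc_freqs = {"solr": 50, "the": 1000, "typo": 1}
print(select_query_terms(term_freqs, doc_freqs, num_docs=1000, maxqt=2))
```

In this model mindf removes one-off terms such as misspellings, which is why its useful value grows with the size of the corpus, while maxqt only limits how many of the surviving terms go into the query.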

Re: how to index data in solr form database automatically

2015-02-04 Thread nvmai
I use crontab * */1 * * * wget http://blog.nhipvang.com :8983/solr/imuzik/dataimport?command=full-import&clean=false -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-index-data-in-solr-form-database-automatically-tp3102893p4183792.html Sent from the

Re: DIH: entities in xml problem

2015-02-04 Thread Michael Sokolov
Yes, you could do some kind of regex preprocess; you can't feed named entities to an XML parser without providing a DTD, sadly. But that probably isn't any easier than adding the DTD -- both require updating every file. Another possibility is writing some custom code that does the modificatio

Re: Duplicate facets when the handler configuration specifies facet fields

2015-02-04 Thread Marius Dumitru Florea
I started having the same issue (facets being listed twice) after upgrading from Solr 4.8.1 to Solr 4.10.3. So it looks like a regression to me. I commented on SOLR-6780 as the Fix Version/s field is not correct. The fix was not merged correctly on the 4.10 branch (before the 4.10.3 release). I ho

Re: DIH: entities in xml problem

2015-02-04 Thread Raul
Thanks for the reply. I'm going to try a custom transform to see whether it can process the file before the XML parser does. I will also look at the regex approach to determine whether it is appropriate for this. Regards On 04/02/15 at 13:37, Michael Sokolov wrote: Yes, you could do some kind of regex preprocess;

Re: MoreLikeThis filter by score threshold

2015-02-04 Thread Ali Nazemian
Dear Markus, I want to show the documents that are similar enough. I don't want to show just one or two or ... documents. The number does not matter; similarity is what I am looking for. Regards. On Wed, Feb 4, 2015 at 2:42 PM, Markus Jelsma wrote: > Hello Upayavira - Indeed, it works, except

Re: how to index data in solr form database automatically

2015-02-04 Thread Shawn Heisey
On 2/4/2015 12:01 AM, nvmai wrote: > I use crontab > * */1 * * * wget http://blog.nhipvang.com > :8983/solr/imuzik/dataimport?command=full-import&clean=false That will attempt to start a full-import once a minute. Requests sent while an import is already underway will fail, but this basically me
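The effect of the minute field can be checked with a small sketch. It models only three cron field forms (`*`, `*/n` and a plain number), and the helper names are made up for illustration.

```python
def field_matches(spec: str, value: int) -> bool:
    """Match one cron field; supports only '*', '*/n' and a plain number."""
    if spec == "*":
        return True
    if spec.startswith("*/"):
        return value % int(spec[2:]) == 0
    return value == int(spec)


def runs_per_day(minute_spec: str, hour_spec: str) -> int:
    """Count how many (hour, minute) slots in one day match both fields."""
    return sum(
        1
        for hour in range(24)
        for minute in range(60)
        if field_matches(minute_spec, minute) and field_matches(hour_spec, hour)
    )


print(runs_per_day("*", "*/1"))  # minute field '*': fires in every minute of every hour
print(runs_per_day("0", "*/1"))  # minute field '0': fires once per hour
```

With `*` in the minute field the job is attempted 1440 times a day; pinning the minute field to a single value, such as `0`, reduces that to 24.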

Re: More Like This similarity tuning

2015-02-04 Thread Ali Nazemian
Dear Markus, Would you please explain more about maxqt parameter and the methodology of choosing best number of terms for this value? Best regards. On Wed, Feb 4, 2015 at 2:46 PM, Markus Jelsma wrote: > Well, maxqt is easy, it is just the number of terms that compose your > query. MinTF is a s

Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dan Davis
Suresh and Meena, I have solved this problem by taking a row count on a query, and adding its modulo as another field called threadid. The base query is wrapped in a query that selects a subset of the results for indexing. The modulo on the row number was intentional - you cannot rely on id
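A minimal sketch of the partitioning idea, written in plain Python rather than SQL: each row gets a `threadid` equal to its row number modulo the number of concurrent imports, and each import then selects only its own subset. The function names and the sample rows are hypothetical; in the setup described above the same logic would live in the wrapping query.

```python
def assign_thread_ids(rows, num_threads):
    """Tag each row with threadid = row number modulo num_threads."""
    return [
        dict(row, threadid=row_number % num_threads)
        for row_number, row in enumerate(rows)
    ]


def rows_for_thread(tagged_rows, threadid):
    """Select the subset of rows that one import is responsible for."""
    return [row for row in tagged_rows if row["threadid"] == threadid]


# Hypothetical base query result: ten rows with non-contiguous ids.
rows = [{"id": 1000 + 7 * n} for n in range(10)]
tagged = assign_thread_ids(rows, num_threads=3)
for threadid in range(3):
    print(threadid, [row["id"] for row in rows_for_thread(tagged, threadid)])
```

Because the modulo is taken on the row number rather than on the id, the subsets stay within one row of each other in size even when the ids are sparse or unevenly distributed.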

low qps with high load averages on solrcloud

2015-02-04 Thread Suchi Amalapurapu
Hi Noticed that a solrcloud cluster doesn't scale linearly with # of nodes unlike the unsharded solr cluster. We are seeing a 10 fold drop in QPS in multi sharded mode. While some overhead is understandable, the pattern that we see is, some nodes load averages shoot up and that in turn skews up the

Re: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dan Davis
"Data Import Handler is the only good solution that involves writing configuration, not code." - I also had a requirement not to look at product-oriented enhancements to Solr, and there are many products I didn't look at, or rejected, like django-haystack. Perl, ruby, and python have good handli

RE: Solr 4.9 Calling DIH concurrently

2015-02-04 Thread Dyer, James
Yes, that is what I mean. In my case, for each "/dataimport" in the "defaults" section, I also put something like this: 1 ...and then reference it in the data-config.xml with ${dataimporter.request.currentPartition} . This way the same data-config.xml can be used for each handler. As I said

Can solr TermVectorComponent return term frequency for the term in my query?

2015-02-04 Thread Aki Balogh
I'm using solr TermVectorComponent to get term frequencies for specific terms in a corpus. I.e. I query for "q=dog" and want to get back term frequencies for "dog" in the corpus. However, when I request term frequencies, I get back ALL term frequencies for ALL matching documents, which is generati

DocExpirationUpdateProcessorFactory not deleting records

2015-02-04 Thread S.L
I am trying to use the DocExpirationUpdateProcessorFactory in Solr 4.10.1 version. I have included the following in my solrconfig.xml id timestamp 30 ttl expire_at

Re: DocExpirationUpdateProcessorFactory not deleting records

2015-02-04 Thread Chris Hostetter
: : 30 : ttl : expire_at : ... : And I have included the following in my schema.xml : : there are a couple of problems here... : As you can see I am setting the time to live to be 60 seconds and checking : to delete every

Re: DocExpirationUpdateProcessorFactory not deleting records

2015-02-04 Thread S.L
Thanks for giving multiple options. I'll try them both out, but last time I checked, having "+60SECONDS" as the default value for ttl was giving me an invalid date format exception. I am assuming that would only be the case if I use it with the default mechanism in schema.xml, but not when we

Re: DocExpirationUpdateProcessorFactory not deleting records

2015-02-04 Thread Chris Hostetter
: : Thanks for giving multiple options , I ll try them out both ,but last time : I checked, having "+60SECONDS" as the default value for ttl was giving me : an invalid date format exception, I am assuming that would only be the that's because ttl should not be a date field -- it should be a *

Re: DocExpirationUpdateProcessorFactory not deleting records

2015-02-04 Thread S.L
Great, this is the first example I have seen so far, I wish we could include this in the Wiki. Thanks again! On Wed, Feb 4, 2015 at 2:04 PM, Chris Hostetter wrote: > : > : Thanks for giving multiple options , I ll try them out both ,but last > time > : I checked, having "+60SECONDS" as the defau

Re: low qps with high load averages on solrcloud

2015-02-04 Thread Erick Erickson
You need to tell us a lot more about how you're measuring and how you're querying in order to help us help you. On the face of it, this isn't what I expect, but there are lots of things that can skew results. FWIW, Erick On Wed, Feb 4, 2015 at 11:16 AM, Suchi Amalapurapu wrote: > Hi > Noticed t

RE: low qps with high load averages on solrcloud

2015-02-04 Thread Toke Eskildsen
Suchi Amalapurapu [su...@bloomreach.com] wrote: > Noticed that a solrcloud cluster doesn't scale linearly with # of nodes > unlike the unsharded solr cluster. We are seeing a 10 fold drop in QPS in > multi sharded mode. As I understand it, you changed from single to multi shard. Guessing wildly:

RE: low qps with high load averages on solrcloud

2015-02-04 Thread Markus Jelsma
We recently upgraded our cloud from 4.8 to 4.10.3, the only config we updated was the luceneMatchVersion. Response times were very stable prior to the upgrade, but are quite erratic since the upgrade, and rising. I still have to check all the resolved issues but something went very wrong between

Re: Can solr TermVectorComponent return term frequency for the term in my query?

2015-02-04 Thread Ahmet Arslan
Hi Aki, How about tf function query? https://cwiki.apache.org/confluence/display/solr/Function+Queries Ahmet On Wednesday, February 4, 2015 7:59 PM, Aki Balogh wrote: I'm using solr TermVectorComponent to get term frequencies for specific terms in a corpus. I.e. I query for "q=dog" and want t

Re: Core property name ignored when creating collection using API

2015-02-04 Thread Avanish Raju
Hi Shawn, Thanks for clarifying! And my apologies, it looks like my question was posted twice to the forum. I've also received replies from Erick and Chris to help clear up my confusion - on this thread: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201502.mbox/browser Glad to see su

Re: Can solr TermVectorComponent return term frequency for the term in my query?

2015-02-04 Thread Aki Balogh
Hi Ahmet, Thank you for your idea, very helpful. I can indeed get tf values through the tf and ttf function queries. Since tf uses Similarity, I'm getting back some floats (i.e. "dog occurs 1.424 times"), when I was expecting ints. Is there a way to get back ints (simple word count)? Thanks, Aki

Re: Can solr TermVectorComponent return term frequency for the term in my query?

2015-02-04 Thread Ahmet Arslan
Hi, So you want raw tf. The tf method is implemented as the square root of raw tf, so you can recover the raw value with the reverse operation: 1.424 * 1.424 ≈ 2.03, which rounds to the integer 2. Ahmet On Wednesday, February 4, 2015 11:31 PM, Aki Balogh wrote: Hi Ahmet, Thank you for your idea, very helpful. I can indeed get tf value
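The reverse operation can be sketched as follows, assuming the similarity in use computes tf as the square root of the raw count (as stated above); the function name is made up for illustration.

```python
def raw_term_count(similarity_tf: float) -> int:
    """Undo a square-root tf: square the value and round to the nearest integer."""
    return round(similarity_tf ** 2)


print(raw_term_count(1.424))  # 1.424 squared is about 2.03, which rounds to 2
```

Rounding is needed because the returned float has been truncated to a few decimal places; it is only safe while that error stays well below 0.5 after squaring.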

Exception while loading 2 Billion + Documents in Solr 4.8.0

2015-02-04 Thread Arumugam, Suresh
Hi All, We are trying to load 14+ Billion documents into Solr. But we are failing to load them into Solr. Solr version: 4.8.0 Analyzer used: ClassicTokenizer for index as well as query. Can someone help me in getting into the core of this issue? For 14+ Billion document load, we are loading 2B

Re: Can solr TermVectorComponent return term frequency for the term in my query?

2015-02-04 Thread Aki Balogh
Is there a way to set solr to only return raw tf (i.e. by maybe turning off the DefaultSimilarity), so I could use ttf() to get the sum of raw tf values? Or do I need to parse each tf value, square it and add them up in post-processing? Thx, Aki On Wed, Feb 4, 2015 at 4:39 PM, Ahmet Arslan wro

RE: Exception while loading 2 Billion + Documents in Solr 4.8.0

2015-02-04 Thread Arumugam, Suresh
I guess that somewhere Solr is relying on the Integer datatype to define the maximum number of documents. This limits the document count to the integer range, i.e. 2147483647. I am not able to find out the root cause of this issue. Can someone shed some light on this issue to identify the root cause of it? Regard

Re: Where can we set the parameters in Solr Config?

2015-02-04 Thread O. Olson
Thank you Alex and Jack for pointing out solrcore.properties and core.properties files. This is much better than specifying these on the command line. I think I need to use the solrcore.properties. I will try it in the next few days. Thanks again. Alexandre Rafalovitch wrote > core.properties? >

Re: Exception while loading 2 Billion + Documents in Solr 4.8.0

2015-02-04 Thread Jack Krupansky
What's your cluster size? The 2 billion limit is per-node. My personal recommendation is that you don't load more than 100 million documents per node. You need to do a proof of concept test to verify whether your particular data would support a higher number or not. Ultimately, it will not be a ma

Re: Exception while loading 2 Billion + Documents in Solr 4.8.0

2015-02-04 Thread Shawn Heisey
On 2/4/2015 2:54 PM, Arumugam, Suresh wrote: > > Hi All, > > > > We are trying to load 14+ Billion documents into Solr. But we are > failing to load them into Solr. > > > > Solr version: *4.8.0* > > Analyzer used: *ClassicTokenizer for index as well as query.* > > > > Can someone help me in g

Re: Exception while loading 2 Billion + Documents in Solr 4.8.0

2015-02-04 Thread Walter Underwood
You can only put 2 billion documents in one core. This error message is the clue: Too many documents, composite IndexReaders cannot exceed 2147483647 You will need to shard the collection. You might have multiple shards per node, but you will probably need 50-100 shards and lots of servers. wu
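A back-of-the-envelope sketch of the shard count, using the 2147483647 limit from the error message and taking 14 billion as a lower bound for the "14+ Billion" documents mentioned in the thread. The function name is hypothetical, and the 100 million figure is the per-node recommendation given earlier in the thread, not a hard limit.

```python
LUCENE_MAX_DOCS = 2_147_483_647  # limit quoted in the error message


def min_shards(total_docs: int, docs_per_shard: int = LUCENE_MAX_DOCS) -> int:
    """Smallest number of shards such that no shard exceeds docs_per_shard."""
    return -(-total_docs // docs_per_shard)  # integer ceiling division


total = 14_000_000_000
print(min_shards(total))               # hard floor imposed by the per-core limit
print(min_shards(total, 100_000_000))  # at 100 million documents per shard
```

The hard floor is 7 shards, but that leaves almost no headroom under the limit; at 100 million documents per shard the count rises to 140, so the practical number depends on what a proof of concept shows a single shard can handle.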

RE: Exception while loading 2 Billion + Documents in Solr 4.8.0

2015-02-04 Thread Arumugam, Suresh
Thanks Zack/Shawn for the response. We are trying to do a POC for searching our log files with a single-node Solr (396 GB RAM with 14 TB of space). Since the server is powerful, we added 2 billion records successfully & search is working fine without many issues. Due to the restriction of the Lucene

Re: Can solr TermVectorComponent return term frequency for the term in my query?

2015-02-04 Thread Aki Balogh
PS - I found that termfreq() actually returns the raw tf, i.e. an integer for each document. However, I have to get the request and add them up on my end. Unfortunately totaltermfreq() sums the similarity-modified tf values. Is there a way to just get the sum of the termfreq() values? Akos (Aki

Re: Can solr TermVectorComponent return term frequency for the term in my query?

2015-02-04 Thread Aki Balogh
Never mind -- I found I can just add another fq, so I'm not getting the 0s back, which makes it quick to add it up on my end. So the solution is: collection1/query?q=crawl_id:40&fq=text:%22matched%20text%22&fl=termfreq(text,%27matched%20text%27)&rows=100&tv=false Thanks for your help! Akos (A
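A small sketch of building the same request with the standard library, so that every parameter is separated and escaped consistently, and then summing the per-document values on the client. The helper names are hypothetical, and the response layout (a list of documents keyed by the `fl` expression) is an assumption about the response rather than something shown in the thread.

```python
from urllib.parse import urlencode


def build_termfreq_query(field, phrase, main_query, rows=100):
    """Return the encoded query string and the fl expression used as result key."""
    fl = f"termfreq({field},'{phrase}')"
    params = [
        ("q", main_query),
        ("fq", f'{field}:"{phrase}"'),  # the filter keeps zero-frequency docs out
        ("fl", fl),
        ("rows", str(rows)),
        ("tv", "false"),
    ]
    return urlencode(params), fl


def sum_termfreq(docs, fl):
    """Add up the per-document termfreq values returned under the fl key."""
    return sum(int(doc.get(fl, 0)) for doc in docs)


query_string, fl = build_termfreq_query("text", "matched text", "crawl_id:40")
print("collection1/query?" + query_string)

# Hypothetical documents as they might appear in a parsed response.
sample_docs = [{fl: 2}, {fl: 3}, {fl: 1}]
print(sum_termfreq(sample_docs, fl))
```

Building the string from a list of pairs avoids a missing separator between parameters, which is easy to introduce when the URL is assembled by hand.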

Re: Importing XML into SOLR, identifying a failed import document

2015-02-04 Thread Steve Rowe
Fixed in trunk, branch_5x, lucene_solr_5_0 and lucene_solr_4_10. > On Feb 4, 2015, at 2:56 AM, Mikhail Khludnev > wrote: > > Developers, would you mind to fix typo: applying XSL Transformeation ? > > On Tue, Feb 3, 2015 at 9:10 PM, Morris, Paul E. wrote: > >> Caused by: org.apache.solr.hand

Re: Solrcloud (to HDFS) poor indexing performance

2015-02-04 Thread Tim Smith
When I created the collection using "solrctl" I did not specify a replication factor. So after I read your email, I went looking for the current replication factor - and couldn't. Where do I find the current replication factor? I don't see it on the Solr Admin panel of a node. Nor do I seem to get

RE: Exception while loading 2 Billion + Documents in Solr 4.8.0

2015-02-04 Thread Chris Hostetter
: We are trying to do a POC for searching our log files with a single node Solr(396 GB RAM with 14 TB Space). : Since the server is powerful, added 2 Billion records successfully & search is working fine without much issues. how much CPU? Assuming it's comparable to the amount of RAM you've go

Re: clarification regarding shard splitting and composite IDs

2015-02-04 Thread Gili Nachum
Alright. So shard splitting and composite routing play nicely together. Thank you Anshum. On Wed, Feb 4, 2015 at 11:24 AM, Anshum Gupta wrote: > In one line, shard splitting doesn't depend on the routing > mechanism but just on the hash range, so you could have documents for the same > pre

Include stopwords in phrase search

2015-02-04 Thread Shamik Bandopadhyay
Hi, I'm having an issue running phrase queries with stopwords. It looks like Solr is ignoring the stopword during search. Here's my search term. "cannot open device" When I'm executing title:"cannot open device" , it's bringing back titles with "Find Open Devices". Here's my field definition for

Re: Include stopwords in phrase search

2015-02-04 Thread shamik
Well, I somehow made it work by using CommonGramsFilterFactory. Just wondering if it's the right approach? -- View this message in context: http://lucene.472066.n3.nabble.com/Include-stopwords-in-phrase-search-tp4184067p4184068.html Sent from the Solr - User mailing list archive at Nabble.c

Re: clarification regarding shard splitting and composite IDs

2015-02-04 Thread Dan Davis
Doesn't relevancy for that assume that the IDF and TF for user1 and user2 are not too different? SolrCloud still doesn't use a distributed IDF, correct? On Wed, Feb 4, 2015 at 7:05 PM, Gili Nachum wrote: > Alright. So shard splitting and composite routing plays nicely together. > Thank you An

Re: clarification regarding shard splitting and composite IDs

2015-02-04 Thread Anshum Gupta
Solr 5.0 has support for distributed IDF. Also, users having the same IDF is orthogonal to the original question. In general, the Doc Freq. is only per-shard. If for some reason, a single user has documents split across shards, the IDF used would be different for docs on different shards. On Wed,
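A toy illustration of the per-shard effect, assuming the classic idf formula 1 + ln(numDocs / (docFreq + 1)); the shard statistics are invented for the example.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-shard statistics for the same term.
shard_stats = {
    "shard1": {"doc_freq": 10, "num_docs": 1000},
    "shard2": {"doc_freq": 400, "num_docs": 1000},
}
for shard, stats in shard_stats.items():
    print(shard, round(classic_idf(stats["doc_freq"], stats["num_docs"]), 3))
```

Without distributed IDF, the same term in documents from the same user is weighted more heavily on the shard where it is rarer, so scores from the two shards are not directly comparable.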

Re: Duplicate facets when the handler configuration specifies facet fields

2015-02-04 Thread Marius Dumitru Florea
Should be fixed now, thanks to Hoss Man. I'll wait for Solr 4.10.4 Thanks, Marius On Wed, Feb 4, 2015 at 3:00 PM, Marius Dumitru Florea wrote: > I started having the same issue (facets being listed twice) after > upgrading from Solr 4.8.1 to Solr 4.10.3. > So it looks like a regression to me. I

Re: WordDelimiterFilterFactory and position increment.

2015-02-04 Thread Dmitry Kan
Hi, Could you enable it on the querying side and re-test your case? The rule of thumb I usually follow is to make the index and query side transformations as close as possible. HTH, Dmitry On Wed, Feb 4, 2015 at 6:14 AM, Modassar Ather wrote: > Hi, > > No I am not using WordDelimiterFilter on